The purpose of this class is to provide an easier method for analyzing groups of aligned sequences (either NT or AA) using numpy-array/pandas. It will treat the group of aligned sequences as rows within the dataframe. Each column in the table will correspond to a specific base or residue position in the group. Letters in each cell (row x col) are represented by their ascii-value rather than a string for faster operations (Note: another valid way is to use pandas categories to represent the table, but I think categories might need a few bugs ironed out first)
When sequences are aligned in the table, then isolating a region of interest becomes very easy. For example, you can look at all bases at position 25 and 30 by simple writing
pos_interest = seqtable[[25,30]]
Finally, when combining the dataframe functions with interactive plots such as plotly, then it hopefully provides easier visualization methods for identifying patterns contributed by specific residue/base positions within the sequence group.
SOME WARNINGS BEFORE USING:
Avoid converting each the seqtable attribute into a dataframe
i.e. dont do this:
pd.DataFrame(sq.seq_table.view_bases()))
If you do this, then all elements in the dataframe will be converted to strings and treated as objects. Objects in python are very large as compared to integers or characters so the memory usage becomes extremely large. Instead, use pandas categories if desired but I find that categories are still kind-of buggy. For this reason, most of the variables in the class treat each string using their ascii values
import sys
sys.path.insert(0, '../')
import seq_tables
from plotly.offline import iplot, init_notebook_mode
import numpy as np
import pandas as pd
from plotly import graph_objs as go
%load_ext autoreload
%autoreload 2
init_notebook_mode()
seqs = ['AAA', 'ACT', 'ACA']
quals = ['ZCA', 'DGJ', 'JJJ']
sq = seq_tables.seqtable(seqs, quals)
This class has three important attributes
1) sq.seq_df
: returns a dataframe of the original sequence and/or quality:
sq.seq_df
2) sq.seq_table
: Essential attribute that represents all letters in the sequences as a table of rows/columns of ascii values
sq.seq_table
View the table as letters rather than ascii values
sq.view_bases()
3) sq.qual_table
: this represents the quality string in a fastq file as actual quality scores (already adjusted to i.e. '!' = 0)
sq.qual_table
Because all sequences are in a numpy array/datatable, it becomes very easy to perform slicing operations and only isolate regions of interest
# Slice the data frame and only look at first and last column
sq.view_bases()[:, [0,-1]]
sq = seq_tables.read_fastq("Demo.fq")
sq.seq_df.iloc[:5]
sq.view_bases()[:5,:]
sq.seq_table.shape
We can get a distribution of quality score vs base position (Lets use by default FASTQC bins)
(dist, plotsmade1) = sq.get_quality_dist()
dist
Now lets consider all positions independently (i.e. don't bin together positions)
(dist, plotsmade) = sq.get_quality_dist(bins=range(sq.seq_table.shape[1]))
dist
The above tables are better viewed when plotted (i.e. box plots of quality score distribution vs bin)
iplot(plotsmade1)
Lets not consider sequences with low quality everywhere, so lets filter out sequences whose quality score is less than 20 in more than half of the bases
filtered = sq.quality_filter(q=20, p=50)
filtered.seq_df.iloc[:5,:]
In addition, lets convert any low quality base to ambiguous
converted = sq.convert_low_bases_to_null(q=20)
converted.seq_df.iloc[:5,:]
sq.iloc[:5]
We can also randomly subsample sequences
sq.subsample(5)
Lets say you are only interested in looking at sections/regions of a sequence
sq.slice_sequences([3,10,20], name='sliced').iloc[:5,:]
Get the hamming distances of all sequences to the first sequence
import time
t1 = time.time()
hamm = sq.hamming_distance(sq.seq_df.iloc[0]['seqs'])
t2 = time.time()
t = t2-t1
print('Time: ' + str(t))
hamm[:5]
Compare time and answer to distance module (pip install distance)
This implementation is much faster than using an apply function where we call the standard hamming distance function.
import distance
ref = sq.seq_df.iloc[0]['seqs']
t1 = time.time()
hamm = sq.seq_df['seqs'].apply(lambda x: distance.hamming(x, ref))
t2 =time.time()
t = t2 - t1
print('Time: ' + str(t))
hamm[:5]
Lets look if any specific bases positions have a high error rate
error = sq.compare_to_reference(sq.seq_df.iloc[0]['seqs']).sum()/sq.seq_table.shape[0]
iplot([go.Scatter(x = error.index, y = error, mode='markers')])
In addition to standard hamming function, we call also define whether we want to ignore bases containing a specific letter.
This time, lets ignore any bases equal to 'N' when considering error rate
error = sq.compare_to_reference(sq.seq_df.iloc[0]['seqs'], ignore_characters=['N']).sum()/sq.seq_table.shape[0]
iplot(go.Figure(data=[go.Scatter(x = error.index, y = error, mode='markers')], layout=go.Layout({'xaxis':{'title': 'base position'}, 'yaxis':{'title': 'frequency equal to reference'}})))
Now do we find any conserved bases at a specific position?
"""
Determine the distribution of letters at each postion
"""
dist = sq.get_seq_dist()
dist
Lets plot the distribution of bases using a sequence logo based on methods described here : https://github.com/ISA-tools/SequenceLogoVis
plot_data = seq_tables.draw_seqlogo_barplots(dist/dist.sum(), alphabet='nt', yaxistitle='frequency')
iplot(plot_data)