SEQ_TABLES TUTORIAL

The purpose of this class is to provide an easier method for analyzing groups of aligned sequences (either NT or AA) using numpy arrays and pandas. It treats the group of aligned sequences as rows within a dataframe, where each column corresponds to a specific base or residue position in the alignment. The letter in each cell (row x col) is represented by its ASCII value rather than a string, which allows much faster operations. (Note: another valid approach is to use pandas categories to represent the table, but categories still seem to need a few bugs ironed out first.)
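As a toy illustration of this representation (plain numpy, not the class itself), three aligned 3-mers become a 2-D array of ASCII codes that can be viewed back as characters at any time:

```python
import numpy as np

# Three aligned 3-mers stored as a 2-D array of ASCII codes (toy example)
seqs = ['AAA', 'ACT', 'ACA']
table = np.array([list(s.encode()) for s in seqs], dtype=np.uint8)
print(table)             # 'A' -> 65, 'C' -> 67, 'T' -> 84
print(table.view('S1'))  # the same bytes viewed back as single characters
```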

When sequences are aligned in the table, isolating a region of interest becomes very easy. For example, you can look at all bases at positions 25 and 30 by simply writing

pos_interest = seqtable[[25,30]]

Finally, when combining the dataframe functions with interactive plots such as plotly, then it hopefully provides easier visualization methods for identifying patterns contributed by specific residue/base positions within the sequence group.

SOME WARNINGS BEFORE USING:

  1. Avoid converting the seq_table attribute into a dataframe of letters

    i.e. don't do this:

     pd.DataFrame(sq.seq_table.view_bases())

    If you do this, then every element in the dataframe will be converted to a string and treated as a Python object. Objects in Python are very large compared to integers or single characters, so memory usage becomes extremely high. Instead, use pandas categories if desired, although categories still seem somewhat buggy. For this reason, most of the variables in the class represent each letter by its ASCII value
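A quick sketch of why the object dtype is so costly: the same table stored as one byte per base versus one Python object per base differs in memory footprint by well over an order of magnitude (sizes below are approximate and platform-dependent):

```python
import numpy as np
import pandas as pd

# Same 1000 x 100 table stored two ways:
# one byte per base vs. one Python string object per base
as_uint8 = pd.DataFrame(np.full((1000, 100), 65, dtype=np.uint8))
as_object = pd.DataFrame(np.full((1000, 100), 'A', dtype=object))

print(as_uint8.memory_usage(deep=True).sum())   # roughly 100 KB of data
print(as_object.memory_usage(deep=True).sum())  # many times larger
```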

In [2]:
import sys
sys.path.insert(0, '../')
import seq_tables
from plotly.offline import iplot, init_notebook_mode
import numpy as np
import pandas as pd
from plotly import graph_objs as go
In [3]:
%load_ext autoreload
%autoreload 2
In [4]:
init_notebook_mode()

There are multiple ways to initialize this class. Essentially it just needs a list of sequences aligned to one another. In addition, you can also pass in a list of quality scores

In [11]:
seqs = ['AAA', 'ACT', 'ACA']
quals = ['ZCA', 'DGJ', 'JJJ']
sq = seq_tables.seqtable(seqs, quals)

This class has three important attributes:

1) sq.seq_df : returns a dataframe of the original sequences and/or quality strings:

In [12]:
sq.seq_df
Out[12]:
seqs quals
0 AAA ZCA
1 ACT DGJ
2 ACA JJJ

2) sq.seq_table : the essential attribute, representing all letters in the sequences as a table of rows/columns of ASCII values

In [13]:
sq.seq_table
Out[13]:
1 2 3
0 65 65 65
1 65 67 84
2 65 67 65

View the table as letters rather than ASCII values:

In [14]:
sq.view_bases()
Out[14]:
array([[b'A', b'A', b'A'],
       [b'A', b'C', b'T'],
       [b'A', b'C', b'A']], 
      dtype='|S1')

3) sq.qual_table : represents the quality string from a FASTQ file as actual quality scores (already offset-adjusted so that '!' = 0)

In [15]:
sq.qual_table
Out[15]:
1 2 3
0 57 34 32
1 35 38 41
2 41 41 41
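As a cross-check on the table above, the conversion from a quality character to a score is just its ASCII value minus 33 (the offset that makes '!' = 0):

```python
import numpy as np

# A quality character's score is its ASCII value minus the offset 33 ('!' = 0)
qual = 'ZCA'
scores = np.frombuffer(qual.encode(), dtype=np.uint8) - 33
print(scores)  # [57 34 32] -- the first row of sq.qual_table above
```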

Because all sequences are stored in a numpy array/datatable, it becomes very easy to perform slicing operations and isolate only regions of interest

In [19]:
# Slice the data frame and only look at first and last column
sq.view_bases()[:, [0,-1]]
Out[19]:
array([[b'A', b'A'],
       [b'A', b'T'],
       [b'A', b'A']], 
      dtype='|S1')

Now let's load FASTQ data

In [5]:
sq = seq_tables.read_fastq("Demo.fq")
sq.seq_df.iloc[:5]
Out[5]:
seqs quals
M01012:51:000000000-A4H0H:1:1101:15861:1495 1:N:0:10| <{"SEQ_ID": "552c2da19eb63635e1c950f1::552c261b9eb6363a487b62c9"} GNCGTGACGTTGGACGAGTCCGGGGGCGGCCTCCAGACGCCCGGAG... B#>AABCCCCCCGGGGGGGGGGGGGGEGGGGGHHHHHGGGGGGGGG...
M01012:51:000000000-A4H0H:1:1101:16832:1547 1:N:0:10| <{"SEQ_ID": "552c2da19eb63635e1c950f2::552c261b9eb6363a487b62c9"} GGAGGAGACGATGACTTCGGTCCCGTGGCCCCATGCGTCGATACAG... A@ABB@A>AADAGGGGGGGGFGGHGH2EEBGEEHHHGGGGGFGGGH...
M01012:51:000000000-A4H0H:1:1101:14519:1548 1:N:0:10| <{"SEQ_ID": "552c2da19eb63635e1c950f3::552c261b9eb6363a487b62c9"} CGTGACGTTGGACGAGTCCGGGGGCGGCCTCCAGACGCCCGGAGGA... AAAAAA>>>>>1BE1AEBFFGG0EEGCGEFCC>FGGGG/EEEEGGG...
M01012:51:000000000-A4H0H:1:1101:14011:1558 1:N:0:10| <{"SEQ_ID": "552c2da19eb63635e1c950f4::552c261b9eb6363a487b62c9"} GCCGTGACGTTGGACGAGTCCGGGGGCGGCCTCCAGACGCCCGGAG... CCCCADBACCCCGGFGGGGGGGGGGGGGFGGGHHHHGGGGEFGGCG...
M01012:51:000000000-A4H0H:1:1101:16697:1561 1:N:0:10| <{"SEQ_ID": "552c2da19eb63635e1c950f5::552c261b9eb6363a487b62c9"} GGAGGAGACGATGACTTCGGTCCCGTGGCCCCATGCGTCGATGTCA... BBBBBBA4AADBGBFGGGGGFGGGGHGFGHHGGHHHGGGGGGHFGG...
In [23]:
sq.view_bases()[:5,:]
Out[23]:
array([[b'G', b'N', b'C', ..., b'T', b'G', b'A'],
       [b'G', b'G', b'A', ..., b'T', b'C', b'C'],
       [b'C', b'G', b'T', ..., b'A', b'A', b'C'],
       [b'G', b'C', b'C', ..., b'T', b'G', b'A'],
       [b'G', b'G', b'A', ..., b'T', b'T', b'T']], 
      dtype='|S1')
In [24]:
sq.seq_table.shape
Out[24]:
(3000, 250)

How do the quality scores in this data set look?

We can get a distribution of quality score vs base position (by default, let's use FastQC-style bins)

In [7]:
(dist, plotsmade1) = sq.get_quality_dist()
dist
Out[7]:
1 2 3 4 5 6 7 8 9 10-14 ... 205-209 210-214 215-219 220-224 225-229 230-234 235-239 240-244 245-249 250-254
count 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000 15000.000000 ... 15000.000000 15000.000000 15000.000000 15000.000000 15000.000000 15000.000000 15000.000000 15000.000000 15000.000000 3000.000000
mean 32.461333 32.855000 33.074667 33.214000 32.972000 32.285000 32.454667 32.358333 32.924000 33.957000 ... 32.416067 32.170867 31.733000 31.514533 30.088267 29.954067 30.340867 30.250333 30.004200 22.639667
std 3.007134 2.527338 2.031042 1.773486 2.330313 4.571879 2.896678 3.010695 2.215821 4.554324 ... 8.113011 8.209306 8.263074 8.380561 9.213078 9.250358 9.030052 9.235223 9.219869 9.778839
min 16.000000 2.000000 16.000000 16.000000 16.000000 16.000000 16.000000 16.000000 16.000000 15.000000 ... 12.000000 2.000000 2.000000 2.000000 12.000000 2.000000 2.000000 12.000000 12.000000 12.000000
10% 32.000000 32.000000 32.000000 32.000000 32.000000 30.000000 32.000000 32.000000 32.000000 32.000000 ... 14.000000 14.000000 13.000000 13.000000 13.000000 13.000000 14.000000 13.000000 14.000000 12.000000
25% 32.000000 32.000000 33.000000 33.000000 33.000000 33.000000 32.000000 32.000000 32.000000 33.000000 ... 32.000000 32.000000 31.000000 30.000000 26.000000 24.000000 25.000000 25.000000 24.000000 14.000000
50% 33.000000 33.000000 33.000000 33.000000 33.000000 33.000000 33.000000 33.000000 33.000000 34.000000 ... 37.000000 37.000000 37.000000 37.000000 35.000000 35.000000 36.000000 36.000000 35.000000 15.000000
75% 34.000000 34.000000 34.000000 34.000000 34.000000 34.000000 34.000000 34.000000 34.000000 38.000000 ... 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000 33.000000
90% 34.000000 34.000000 34.000000 35.000000 34.000000 35.000000 34.000000 34.000000 34.000000 38.000000 ... 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000
max 36.000000 36.000000 36.000000 36.000000 36.000000 37.000000 37.000000 37.000000 37.000000 38.000000 ... 39.000000 39.000000 39.000000 39.000000 39.000000 39.000000 39.000000 39.000000 39.000000 38.000000

10 rows × 57 columns
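A summary in the same spirit can be sketched directly from a quality table with pandas (hypothetical unbinned version, using random data, not the actual get_quality_dist code):

```python
import numpy as np
import pandas as pd

# Per-position summary of a random quality table, similar in spirit to get_quality_dist
rng = np.random.default_rng(0)
qual_table = pd.DataFrame(rng.integers(2, 40, size=(100, 5)), columns=range(1, 6))
summary = qual_table.describe(percentiles=[0.10, 0.25, 0.50, 0.75, 0.90])
print(summary)
```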

Now let's consider all positions independently (i.e. don't bin positions together)

In [27]:
(dist, plotsmade) = sq.get_quality_dist(bins=range(sq.seq_table.shape[1]))
dist
Out[27]:
1 2 3 4 5 6 7 8 9 10 ... 240 241 242 243 244 245 246 247 248 249
count 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000 ... 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000 3000.000000
mean 32.461333 32.855000 33.074667 33.214000 32.972000 32.285000 32.454667 32.358333 32.924000 32.768333 ... 31.450000 29.579667 29.431667 30.900333 29.890000 29.364000 30.539000 29.843333 29.744333 30.530333
std 3.007134 2.527338 2.031042 1.773486 2.330313 4.571879 2.896678 3.010695 2.215821 2.743934 ... 8.380388 9.606860 9.695256 8.786740 9.472089 9.636967 8.842278 9.265891 9.340666 8.939522
min 16.000000 2.000000 16.000000 16.000000 16.000000 16.000000 16.000000 16.000000 16.000000 16.000000 ... 12.000000 12.000000 12.000000 12.000000 12.000000 12.000000 12.000000 12.000000 12.000000 12.000000
10% 32.000000 32.000000 32.000000 32.000000 32.000000 30.000000 32.000000 32.000000 32.000000 32.000000 ... 14.000000 13.000000 13.000000 13.000000 13.000000 14.000000 14.000000 14.000000 14.000000 14.000000
25% 32.000000 32.000000 33.000000 33.000000 33.000000 33.000000 32.000000 32.000000 32.000000 32.000000 ... 29.000000 24.000000 24.000000 26.000000 24.000000 24.000000 25.000000 24.000000 24.000000 26.000000
50% 33.000000 33.000000 33.000000 33.000000 33.000000 33.000000 33.000000 33.000000 33.000000 33.000000 ... 37.000000 36.000000 36.000000 37.000000 36.000000 33.500000 36.000000 35.000000 34.000000 36.000000
75% 34.000000 34.000000 34.000000 34.000000 34.000000 34.000000 34.000000 34.000000 34.000000 34.000000 ... 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000
90% 34.000000 34.000000 34.000000 35.000000 34.000000 35.000000 34.000000 34.000000 34.000000 34.000000 ... 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000 37.000000
max 36.000000 36.000000 36.000000 36.000000 36.000000 37.000000 37.000000 37.000000 37.000000 37.000000 ... 39.000000 38.000000 38.000000 38.000000 38.000000 39.000000 39.000000 38.000000 39.000000 39.000000

10 rows × 249 columns

The above tables are easier to interpret when plotted (i.e. box plots of the quality score distribution vs bin)

In [8]:
iplot(plotsmade1)

Let's drop sequences that have low quality everywhere: filter out any sequence whose quality score is less than 20 in more than half of its bases

In [32]:
filtered = sq.quality_filter(q=20, p=50)
filtered.seq_df.iloc[:5,:]
Out[32]:
seqs quals
M01012:51:000000000-A4H0H:1:1101:15861:1495 1:N:0:10| <{"SEQ_ID": "552c2da19eb63635e1c950f1::552c261b9eb6363a487b62c9"} GNCGTGACGTTGGACGAGTCCGGGGGCGGCCTCCAGACGCCCGGAG... B#>AABCCCCCCGGGGGGGGGGGGGGEGGGGGHHHHHGGGGGGGGG...
M01012:51:000000000-A4H0H:1:1101:16832:1547 1:N:0:10| <{"SEQ_ID": "552c2da19eb63635e1c950f2::552c261b9eb6363a487b62c9"} GGAGGAGACGATGACTTCGGTCCCGTGGCCCCATGCGTCGATACAG... A@ABB@A>AADAGGGGGGGGFGGHGH2EEBGEEHHHGGGGGFGGGH...
M01012:51:000000000-A4H0H:1:1101:14519:1548 1:N:0:10| <{"SEQ_ID": "552c2da19eb63635e1c950f3::552c261b9eb6363a487b62c9"} CGTGACGTTGGACGAGTCCGGGGGCGGCCTCCAGACGCCCGGAGGA... AAAAAA>>>>>1BE1AEBFFGG0EEGCGEFCC>FGGGG/EEEEGGG...
M01012:51:000000000-A4H0H:1:1101:14011:1558 1:N:0:10| <{"SEQ_ID": "552c2da19eb63635e1c950f4::552c261b9eb6363a487b62c9"} GCCGTGACGTTGGACGAGTCCGGGGGCGGCCTCCAGACGCCCGGAG... CCCCADBACCCCGGFGGGGGGGGGGGGGFGGGHHHHGGGGEFGGCG...
M01012:51:000000000-A4H0H:1:1101:16697:1561 1:N:0:10| <{"SEQ_ID": "552c2da19eb63635e1c950f5::552c261b9eb6363a487b62c9"} GGAGGAGACGATGACTTCGGTCCCGTGGCCCCATGCGTCGATGTCA... BBBBBBA4AADBGBFGGGGGFGGGGHGFGHHGGHHHGGGGGGHFGG...
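The filtering rule can be sketched with plain numpy (a hypothetical reimplementation, not the actual quality_filter code):

```python
import numpy as np

q, p = 20, 50
qual_table = np.array([[30, 35, 10],
                       [15, 12, 18],
                       [25, 30, 40]])
# Keep a row only if at least p percent of its bases have quality >= q
percent_ok = (qual_table >= q).sum(axis=1) / qual_table.shape[1] * 100
keep = percent_ok >= p
print(keep)  # the all-low-quality middle row is dropped
```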

In addition, let's convert any low-quality base to an ambiguous base ('N')

In [33]:
converted = sq.convert_low_bases_to_null(q=20)
converted.seq_df.iloc[:5,:]
Out[33]:
seqs quals
M01012:51:000000000-A4H0H:1:1101:15861:1495 1:N:0:10| <{"SEQ_ID": "552c2da19eb63635e1c950f1::552c261b9eb6363a487b62c9"} b'GNCGTGACGTTGGACGAGTCCGGGGGCGGCCTCCAGACGCCCGG... B#>AABCCCCCCGGGGGGGGGGGGGGEGGGGGHHHHHGGGGGGGGG...
M01012:51:000000000-A4H0H:1:1101:16832:1547 1:N:0:10| <{"SEQ_ID": "552c2da19eb63635e1c950f2::552c261b9eb6363a487b62c9"} b'GGAGGAGACGATGACTTCGGTCCCGTNGCCCCATGCGTCGATAC... A@ABB@A>AADAGGGGGGGGFGGHGH2EEBGEEHHHGGGGGFGGGH...
M01012:51:000000000-A4H0H:1:1101:14519:1548 1:N:0:10| <{"SEQ_ID": "552c2da19eb63635e1c950f3::552c261b9eb6363a487b62c9"} b'CGTGACGTTGGNCGNGTCCGGGNGCGGCCTCCAGACGCNCGGAG... AAAAAA>>>>>1BE1AEBFFGG0EEGCGEFCC>FGGGG/EEEEGGG...
M01012:51:000000000-A4H0H:1:1101:14011:1558 1:N:0:10| <{"SEQ_ID": "552c2da19eb63635e1c950f4::552c261b9eb6363a487b62c9"} b'GCCGTGACGTTGGACGAGTCCGGGGGCGGCCTCCAGACGCCCGG... CCCCADBACCCCGGFGGGGGGGGGGGGGFGGGHHHHGGGGEFGGCG...
M01012:51:000000000-A4H0H:1:1101:16697:1561 1:N:0:10| <{"SEQ_ID": "552c2da19eb63635e1c950f5::552c261b9eb6363a487b62c9"} b'GGAGGAGNCGATGACTTCGGTCCCGTGGCCCCATGCGTCGATGT... BBBBBBA4AADBGBFGGGGGFGGGGHGFGHHGGHHHGGGGGGHFGG...
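The conversion boils down to a single masked replacement over the two tables (sketch, not the actual convert_low_bases_to_null code):

```python
import numpy as np

seq_table = np.array([list(b'ACGT')], dtype=np.uint8)
qual_table = np.array([[30, 10, 25, 5]])
# Replace any base whose quality falls below the cutoff with 'N'
converted = np.where(qual_table < 20, ord('N'), seq_table).astype(np.uint8)
print(converted.view('S1'))  # [[b'A' b'N' b'G' b'N']]
```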

Selecting Data

The pandas selection functions should also work for seqtables. We simply pass the loc, iloc, and ix calls through to the pandas dataframe within the object

In [34]:
sq.iloc[:5]
Out[34]:
                                                                                                 seqs  \
M01012:51:000000000-A4H0H:1:1101:15861:1495 1:N...  GNCGTGACGTTGGACGAGTCCGGGGGCGGCCTCCAGACGCCCGGAG...   
M01012:51:000000000-A4H0H:1:1101:16832:1547 1:N...  GGAGGAGACGATGACTTCGGTCCCGTGGCCCCATGCGTCGATACAG...   
M01012:51:000000000-A4H0H:1:1101:14519:1548 1:N...  CGTGACGTTGGACGAGTCCGGGGGCGGCCTCCAGACGCCCGGAGGA...   
M01012:51:000000000-A4H0H:1:1101:14011:1558 1:N...  GCCGTGACGTTGGACGAGTCCGGGGGCGGCCTCCAGACGCCCGGAG...   
M01012:51:000000000-A4H0H:1:1101:16697:1561 1:N...  GGAGGAGACGATGACTTCGGTCCCGTGGCCCCATGCGTCGATGTCA...   

                                                                                                quals  
M01012:51:000000000-A4H0H:1:1101:15861:1495 1:N...  B#>AABCCCCCCGGGGGGGGGGGGGGEGGGGGHHHHHGGGGGGGGG...  
M01012:51:000000000-A4H0H:1:1101:16832:1547 1:N...  A@ABB@A>AADAGGGGGGGGFGGHGH2EEBGEEHHHGGGGGFGGGH...  
M01012:51:000000000-A4H0H:1:1101:14519:1548 1:N...  AAAAAA>>>>>1BE1AEBFFGG0EEGCGEFCC>FGGGG/EEEEGGG...  
M01012:51:000000000-A4H0H:1:1101:14011:1558 1:N...  CCCCADBACCCCGGFGGGGGGGGGGGGGFGGGHHHHGGGGEFGGCG...  
M01012:51:000000000-A4H0H:1:1101:16697:1561 1:N...  BBBBBBA4AADBGBFGGGGGFGGGGHGFGHHGGHHHGGGGGGHFGG...  

We can also randomly subsample sequences

In [36]:
sq.subsample(5)
Out[36]:
                                                                                                 seqs  \
M01012:51:000000000-A4H0H:1:1101:9254:7741 1:N:...  GGAGGAGACGATGACTTCGGTCCCGTGGCCCCATGCGTCGATCCAA...   
M01012:51:000000000-A4H0H:1:1101:20634:8736 1:N...  CGAGCGTCCGGTTAAAGCCGCTGAATTGTTCGCGTTTACCTTGCGT...   
M01012:51:000000000-A4H0H:1:1101:11626:6741 1:N...  GACCTATCCTTGCGCAGCTCGAGAAGCTCTTACTTTGCGACCTTTC...   
M01012:51:000000000-A4H0H:1:1101:15663:6193 1:N...  GCCGTGACGTTGGACGAGTCCGGGGGCGGCCTCCAGACGCCCGGAG...   
M01012:51:000000000-A4H0H:1:1101:11316:6590 1:N...  GCCGTGACGTTGGACGAGTCCGGAGGCGGCCTCCAGACGCCCGGAG...   

                                                                                                quals  
M01012:51:000000000-A4H0H:1:1101:9254:7741 1:N:...  CCCCCBAAAA@AGGGGGGFGFGGHGHGGGHHGGGHHGGGGGGHGHH...  
M01012:51:000000000-A4H0H:1:1101:20634:8736 1:N...  AAAA?@A1AD1AFEAF1GGEECAADGHH2FH?EEGEGGHFFHHFEA...  
M01012:51:000000000-A4H0H:1:1101:11626:6741 1:N...  AAA1A@DF1FFFFEGECAGGGCGHGHHFHFHHHFHGDDEGEGCFBF...  
M01012:51:000000000-A4H0H:1:1101:15663:6193 1:N...  AABCCBCCCCCCGGGGGGEGGGGGGGEGGGGGHHHHHCEE@EGGGG...  
M01012:51:000000000-A4H0H:1:1101:11316:6590 1:N...  BBBBB@BBBBAAGGGGGGGGGGGGGGEGGGGGGHHFFGGEGGGGFF...  

Let's say you are only interested in looking at specific sections/regions of a sequence

In [58]:
sq.slice_sequences([3,10,20], name='sliced').iloc[:5,:]
Out[58]:
sliced
M01012:51:000000000-A4H0H:1:1101:15861:1495 1:N:0:10| <{"SEQ_ID": "552c2da19eb63635e1c950f1::552c261b9eb6363a487b62c9"} b'CTC'
M01012:51:000000000-A4H0H:1:1101:16832:1547 1:N:0:10| <{"SEQ_ID": "552c2da19eb63635e1c950f2::552c261b9eb6363a487b62c9"} b'AGG'
M01012:51:000000000-A4H0H:1:1101:14519:1548 1:N:0:10| <{"SEQ_ID": "552c2da19eb63635e1c950f3::552c261b9eb6363a487b62c9"} b'TGG'
M01012:51:000000000-A4H0H:1:1101:14011:1558 1:N:0:10| <{"SEQ_ID": "552c2da19eb63635e1c950f4::552c261b9eb6363a487b62c9"} b'CTC'
M01012:51:000000000-A4H0H:1:1101:16697:1561 1:N:0:10| <{"SEQ_ID": "552c2da19eb63635e1c950f5::552c261b9eb6363a487b62c9"} b'AGG'
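Under the hood, pulling positions out of the table can be sketched with fancy indexing on the columns and joining the selected letters back into strings (hypothetical, with 0-based positions for simplicity):

```python
import numpy as np

table = np.array([list(b'ACGTACGT'), list(b'TTTTCCCC')], dtype=np.uint8)
positions = [0, 3, 5]  # hypothetical 0-based positions of interest
# Fancy indexing on the columns, then join the selected letters back into strings
sliced = [bytes(row).decode() for row in table[:, positions]]
print(sliced)  # ['ATC', 'TTC']
```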

Analyzing the distribution of sequences and bases in a group

One good feature of treating sequences as a table is that it makes operations such as hamming distance easy and fast (fast because everything is a number, we don't have to use an apply function, and we rely only on numpy array operations)
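The vectorized idea can be sketched as a single broadcasted comparison (not the class's actual code):

```python
import numpy as np

ref = np.array(list(b'ACGT'), dtype=np.uint8)
table = np.array([list(b'ACGT'), list(b'AGGT'), list(b'TTTT')], dtype=np.uint8)
# One broadcasted comparison against the reference, then count mismatches per row
hamming = (table != ref).sum(axis=1)
print(hamming)  # [0 1 3]
```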

Get the hamming distances of all sequences to the first sequence

In [54]:
import time
t1 = time.time()
hamm = sq.hamming_distance(sq.seq_df.iloc[0]['seqs'])
t2 = time.time()
t = t2-t1
print('Time: ' + str(t))
hamm[:5]
Time: 0.006999969482421875
Out[54]:
M01012:51:000000000-A4H0H:1:1101:15861:1495 1:N:0:10| <{"SEQ_ID": "552c2da19eb63635e1c950f1::552c261b9eb6363a487b62c9"}      0
M01012:51:000000000-A4H0H:1:1101:16832:1547 1:N:0:10| <{"SEQ_ID": "552c2da19eb63635e1c950f2::552c261b9eb6363a487b62c9"}    189
M01012:51:000000000-A4H0H:1:1101:14519:1548 1:N:0:10| <{"SEQ_ID": "552c2da19eb63635e1c950f3::552c261b9eb6363a487b62c9"}    190
M01012:51:000000000-A4H0H:1:1101:14011:1558 1:N:0:10| <{"SEQ_ID": "552c2da19eb63635e1c950f4::552c261b9eb6363a487b62c9"}     43
M01012:51:000000000-A4H0H:1:1101:16697:1561 1:N:0:10| <{"SEQ_ID": "552c2da19eb63635e1c950f5::552c261b9eb6363a487b62c9"}    197
dtype: int32

Compare the time and the answer to the distance module (pip install distance)

This implementation is much faster than using an apply function where we call the standard hamming distance function.

In [55]:
import distance
ref = sq.seq_df.iloc[0]['seqs']
t1 = time.time()
hamm = sq.seq_df['seqs'].apply(lambda x: distance.hamming(x, ref))
t2 =time.time()
t = t2 - t1
print('Time: ' + str(t))
hamm[:5]
Time: 0.11701202392578125
Out[55]:
M01012:51:000000000-A4H0H:1:1101:15861:1495 1:N:0:10| <{"SEQ_ID": "552c2da19eb63635e1c950f1::552c261b9eb6363a487b62c9"}      0
M01012:51:000000000-A4H0H:1:1101:16832:1547 1:N:0:10| <{"SEQ_ID": "552c2da19eb63635e1c950f2::552c261b9eb6363a487b62c9"}    189
M01012:51:000000000-A4H0H:1:1101:14519:1548 1:N:0:10| <{"SEQ_ID": "552c2da19eb63635e1c950f3::552c261b9eb6363a487b62c9"}    190
M01012:51:000000000-A4H0H:1:1101:14011:1558 1:N:0:10| <{"SEQ_ID": "552c2da19eb63635e1c950f4::552c261b9eb6363a487b62c9"}     43
M01012:51:000000000-A4H0H:1:1101:16697:1561 1:N:0:10| <{"SEQ_ID": "552c2da19eb63635e1c950f5::552c261b9eb6363a487b62c9"}    197
Name: seqs, dtype: int64

Let's look at whether any specific base positions have a high error rate

In [10]:
error = sq.compare_to_reference(sq.seq_df.iloc[0]['seqs']).sum()/sq.seq_table.shape[0]
iplot([go.Scatter(x = error.index, y = error, mode='markers')])

In addition to the standard hamming function, we can also choose to ignore bases equal to a specific letter.

This time, let's ignore any bases equal to 'N' when computing the error rate

In [9]:
error = sq.compare_to_reference(sq.seq_df.iloc[0]['seqs'], ignore_characters=['N']).sum()/sq.seq_table.shape[0]
iplot(go.Figure(data=[go.Scatter(x = error.index, y = error, mode='markers')], layout=go.Layout({'xaxis':{'title': 'base position'}, 'yaxis':{'title': 'frequency equal to reference'}})))
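The ignore_characters option can be sketched as masking out cells holding the ignored letter before counting (hypothetical reimplementation, not the actual compare_to_reference code):

```python
import numpy as np

ref = np.array(list(b'ACGT'), dtype=np.uint8)
table = np.array([list(b'ACGT'), list(b'AGGT'), list(b'ANGT')], dtype=np.uint8)
mismatch = table != ref
mismatch &= table != ord('N')  # mask out ambiguous bases before counting
freq = mismatch.sum(axis=0) / table.shape[0]
print(freq)  # only the real 'G' mismatch in the second column counts
```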

Now, do we find any conserved bases at specific positions?

In [12]:
"""
Determine the distribution of letters at each position
"""
dist = sq.get_seq_dist()
dist
Out[12]:
1 2 3 4 5 6 7 8 9 10 ... 241 242 243 244 245 246 247 248 249 250
A 56.0 101 1313.0 57.0 107.0 1293.0 1562.0 1327.0 57.0 94.0 ... 363.0 559.0 237.0 238.0 1340.0 199.0 310.0 546.0 183.0 1251.0
C 71.0 1520 1529.0 84.0 44.0 72.0 55.0 1538.0 1289.0 59.0 ... 2020.0 540.0 700.0 2066.0 619.0 789.0 1919.0 504.0 823.0 863.0
G 2818.0 1317 68.0 2802.0 1325.0 1552.0 1319.0 71.0 1577.0 1314.0 ... 410.0 359.0 1750.0 460.0 309.0 1669.0 539.0 426.0 1655.0 655.0
N 0.0 4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
T 55.0 58 90.0 57.0 1524.0 83.0 64.0 64.0 77.0 1533.0 ... 207.0 1542.0 313.0 236.0 732.0 343.0 232.0 1524.0 339.0 231.0

5 rows × 250 columns
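Counting letters per column is itself a simple numpy operation; a sketch of the idea on the toy table from the start of the tutorial (not the actual get_seq_dist code):

```python
import numpy as np
import pandas as pd

table = np.array([list(b'AAA'), list(b'ACT'), list(b'ACA')], dtype=np.uint8)
# Count how often each letter appears in each column (1-based column labels)
dist = pd.DataFrame(
    {pos + 1: {letter: int((table[:, pos] == ord(letter)).sum()) for letter in 'ACGT'}
     for pos in range(table.shape[1])}
)
print(dist)
```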

Let's plot the distribution of bases as a sequence logo, based on methods described here: https://github.com/ISA-tools/SequenceLogoVis

In [13]:
plot_data = seq_tables.draw_seqlogo_barplots(dist/dist.sum(), alphabet='nt', yaxistitle='frequency')
iplot(plot_data)