SEQ_TABLES TUTORIAL¶

The purpose of this class is to provide an easier method for analyzing groups of aligned sequences (either NT or AA) using numpy-array/pandas. It will treat the group of aligned sequences as rows within the dataframe. Each column in the table will correspond to a specific base or residue position in the group. Letters in each cell (row x col) are represented by their ascii-value rather than a string for faster operations (Note: another valid way is to use pandas categories to represent the table, but I think categories might need a few bugs ironed out first)

When sequences are aligned in the table, then isolating a region of interest becomes very easy. For example, you can look at all bases at position 25 and 30 by simple writing

pos_interest = seqtable[[25,30]]

Finally, when combining the dataframe functions with interactive plots such as plotly, then it hopefully provides easier visualization methods for identifying patterns contributed by specific residue/base positions within the sequence group.

SOME WARNINGS BEFORE USING:

Avoid converting each the seqtable attribute into a dataframe

i.e. dont do this:
```
 pd.DataFrame(sq.seq_table.view_bases()))
```
If you do this, then all elements in the dataframe will be converted to strings and treated as objects. Objects in python are very large as compared to integers or characters so the memory usage becomes extremely large. Instead, use pandas categories if desired but I find that categories are still kind-of buggy. For this reason, most of the variables in the class treat each string using their ascii values

import sys
sys.path.insert(0, '../')
import seq_tables
from plotly.offline import iplot, init_notebook_mode
import numpy as np
import pandas as pd
from plotly import graph_objs as go

%load_ext autoreload
%autoreload 2

init_notebook_mode()

There are multiple ways to initialize this class. Essentially it just needs a list of sequences aligned to one another. In addition, you can also past in a list of quality scores¶

seqs = ['AAA', 'ACT', 'ACA']
quals = ['ZCA', 'DGJ', 'JJJ']
sq = seq_tables.seqtable(seqs, quals)

This class has three important attributes

1) sq.seq_df : returns a dataframe of the original sequence and/or quality:

sq.seq_df

2) sq.seq_table: Essential attribute that represents all letters in the sequences as a table of rows/columns of ascii values

sq.seq_table

View the table as letters rather than ascii values

sq.view_bases()

array([[b'A', b'A', b'A'],
       [b'A', b'C', b'T'],
       [b'A', b'C', b'A']], 
      dtype='|S1')

3) sq.qual_table : this represents the quality string in a fastq file as actual quality scores (already adjusted to i.e. '!' = 0)

sq.qual_table

Because all sequences are in a numpy array/datatable, it becomes very easy to perform slicing operations and only isolate regions of interest

# Slice the data frame and only look at first and last column
sq.view_bases()[:, [0,-1]]

array([[b'A', b'A'],
       [b'A', b'T'],
       [b'A', b'A']], 
      dtype='|S1')

Now lets load fastq data¶

sq = seq_tables.read_fastq("Demo.fq")
sq.seq_df.iloc[:5]

sq.view_bases()[:5,:]

array([[b'G', b'N', b'C', ..., b'T', b'G', b'A'],
       [b'G', b'G', b'A', ..., b'T', b'C', b'C'],
       [b'C', b'G', b'T', ..., b'A', b'A', b'C'],
       [b'G', b'C', b'C', ..., b'T', b'G', b'A'],
       [b'G', b'G', b'A', ..., b'T', b'T', b'T']], 
      dtype='|S1')

sq.seq_table.shape

(3000, 250)

How do the quality scores in this data set look¶

We can get a distribution of quality score vs base position (Lets use by default FASTQC bins)

(dist, plotsmade1) = sq.get_quality_dist()
dist

Now lets consider all positions independently (i.e. don't bin together positions)

(dist, plotsmade) = sq.get_quality_dist(bins=range(sq.seq_table.shape[1]))
dist

The above tables are better viewed when plotted (i.e. box plots of quality score distribution vs bin)

iplot(plotsmade1)

Lets not consider sequences with low quality everywhere, so lets filter out sequences whose quality score is less than 20 in more than half of the bases

filtered = sq.quality_filter(q=20, p=50)
filtered.seq_df.iloc[:5,:]

In addition, lets convert any low quality base to ambiguous

converted = sq.convert_low_bases_to_null(q=20)
converted.seq_df.iloc[:5,:]

Selecting Data¶

The pandas location functions should also work for seqtables. We just pass in the loc, iloc, and ix functions to the pandas dataframe within the object¶

sq.iloc[:5]

                                                                                                 seqs  \
M01012:51:000000000-A4H0H:1:1101:15861:1495 1:N...  GNCGTGACGTTGGACGAGTCCGGGGGCGGCCTCCAGACGCCCGGAG...   
M01012:51:000000000-A4H0H:1:1101:16832:1547 1:N...  GGAGGAGACGATGACTTCGGTCCCGTGGCCCCATGCGTCGATACAG...   
M01012:51:000000000-A4H0H:1:1101:14519:1548 1:N...  CGTGACGTTGGACGAGTCCGGGGGCGGCCTCCAGACGCCCGGAGGA...   
M01012:51:000000000-A4H0H:1:1101:14011:1558 1:N...  GCCGTGACGTTGGACGAGTCCGGGGGCGGCCTCCAGACGCCCGGAG...   
M01012:51:000000000-A4H0H:1:1101:16697:1561 1:N...  GGAGGAGACGATGACTTCGGTCCCGTGGCCCCATGCGTCGATGTCA...   

                                                                                                quals  
M01012:51:000000000-A4H0H:1:1101:15861:1495 1:N...  B#>AABCCCCCCGGGGGGGGGGGGGGEGGGGGHHHHHGGGGGGGGG...  
M01012:51:000000000-A4H0H:1:1101:16832:1547 1:N...  A@ABB@A>AADAGGGGGGGGFGGHGH2EEBGEEHHHGGGGGFGGGH...  
M01012:51:000000000-A4H0H:1:1101:14519:1548 1:N...  AAAAAA>>>>>1BE1AEBFFGG0EEGCGEFCC>FGGGG/EEEEGGG...  
M01012:51:000000000-A4H0H:1:1101:14011:1558 1:N...  CCCCADBACCCCGGFGGGGGGGGGGGGGFGGGHHHHGGGGEFGGCG...  
M01012:51:000000000-A4H0H:1:1101:16697:1561 1:N...  BBBBBBA4AADBGBFGGGGGFGGGGHGFGHHGGHHHGGGGGGHFGG...

We can also randomly subsample sequences

sq.subsample(5)

                                                                                                 seqs  \
M01012:51:000000000-A4H0H:1:1101:9254:7741 1:N:...  GGAGGAGACGATGACTTCGGTCCCGTGGCCCCATGCGTCGATCCAA...   
M01012:51:000000000-A4H0H:1:1101:20634:8736 1:N...  CGAGCGTCCGGTTAAAGCCGCTGAATTGTTCGCGTTTACCTTGCGT...   
M01012:51:000000000-A4H0H:1:1101:11626:6741 1:N...  GACCTATCCTTGCGCAGCTCGAGAAGCTCTTACTTTGCGACCTTTC...   
M01012:51:000000000-A4H0H:1:1101:15663:6193 1:N...  GCCGTGACGTTGGACGAGTCCGGGGGCGGCCTCCAGACGCCCGGAG...   
M01012:51:000000000-A4H0H:1:1101:11316:6590 1:N...  GCCGTGACGTTGGACGAGTCCGGAGGCGGCCTCCAGACGCCCGGAG...   

                                                                                                quals  
M01012:51:000000000-A4H0H:1:1101:9254:7741 1:N:...  CCCCCBAAAA@AGGGGGGFGFGGHGHGGGHHGGGHHGGGGGGHGHH...  
M01012:51:000000000-A4H0H:1:1101:20634:8736 1:N...  AAAA?@A1AD1AFEAF1GGEECAADGHH2FH?EEGEGGHFFHHFEA...  
M01012:51:000000000-A4H0H:1:1101:11626:6741 1:N...  AAA1A@DF1FFFFEGECAGGGCGHGHHFHFHHHFHGDDEGEGCFBF...  
M01012:51:000000000-A4H0H:1:1101:15663:6193 1:N...  AABCCBCCCCCCGGGGGGEGGGGGGGEGGGGGHHHHHCEE@EGGGG...  
M01012:51:000000000-A4H0H:1:1101:11316:6590 1:N...  BBBBB@BBBBAAGGGGGGGGGGGGGGEGGGGGGHHFFGGEGGGGFF...

Lets say you are only interested in looking at sections/regions of a sequence

sq.slice_sequences([3,10,20], name='sliced').iloc[:5,:]

Analyzing the distribution of sequences and bases in a group¶

One good feature of treating sequences as a table is that it makes operations such as hamming distance easy and fast (Fast because everything is a number, we dont have to use an apply function, and we are only relying on numpy array operations)¶

Get the hamming distances of all sequences to the first sequence

import time
t1 = time.time()
hamm = sq.hamming_distance(sq.seq_df.iloc[0]['seqs'])
t2 = time.time()
t = t2-t1
print('Time: ' + str(t))
hamm[:5]

Time: 0.006999969482421875

M01012:51:000000000-A4H0H:1:1101:15861:1495 1:N:0:10| <{"SEQ_ID": "552c2da19eb63635e1c950f1::552c261b9eb6363a487b62c9"}      0
M01012:51:000000000-A4H0H:1:1101:16832:1547 1:N:0:10| <{"SEQ_ID": "552c2da19eb63635e1c950f2::552c261b9eb6363a487b62c9"}    189
M01012:51:000000000-A4H0H:1:1101:14519:1548 1:N:0:10| <{"SEQ_ID": "552c2da19eb63635e1c950f3::552c261b9eb6363a487b62c9"}    190
M01012:51:000000000-A4H0H:1:1101:14011:1558 1:N:0:10| <{"SEQ_ID": "552c2da19eb63635e1c950f4::552c261b9eb6363a487b62c9"}     43
M01012:51:000000000-A4H0H:1:1101:16697:1561 1:N:0:10| <{"SEQ_ID": "552c2da19eb63635e1c950f5::552c261b9eb6363a487b62c9"}    197
dtype: int32

Compare time and answer to distance module (pip install distance)

This implementation is much faster than using an apply function where we call the standard hamming distance function.

import distance
ref = sq.seq_df.iloc[0]['seqs']
t1 = time.time()
hamm = sq.seq_df['seqs'].apply(lambda x: distance.hamming(x, ref))
t2 =time.time()
t = t2 - t1
print('Time: ' + str(t))
hamm[:5]

Time: 0.11701202392578125

M01012:51:000000000-A4H0H:1:1101:15861:1495 1:N:0:10| <{"SEQ_ID": "552c2da19eb63635e1c950f1::552c261b9eb6363a487b62c9"}      0
M01012:51:000000000-A4H0H:1:1101:16832:1547 1:N:0:10| <{"SEQ_ID": "552c2da19eb63635e1c950f2::552c261b9eb6363a487b62c9"}    189
M01012:51:000000000-A4H0H:1:1101:14519:1548 1:N:0:10| <{"SEQ_ID": "552c2da19eb63635e1c950f3::552c261b9eb6363a487b62c9"}    190
M01012:51:000000000-A4H0H:1:1101:14011:1558 1:N:0:10| <{"SEQ_ID": "552c2da19eb63635e1c950f4::552c261b9eb6363a487b62c9"}     43
M01012:51:000000000-A4H0H:1:1101:16697:1561 1:N:0:10| <{"SEQ_ID": "552c2da19eb63635e1c950f5::552c261b9eb6363a487b62c9"}    197
Name: seqs, dtype: int64

Lets look if any specific bases positions have a high error rate

error = sq.compare_to_reference(sq.seq_df.iloc[0]['seqs']).sum()/sq.seq_table.shape[0]
iplot([go.Scatter(x = error.index, y = error, mode='markers')])

In addition to standard hamming function, we call also define whether we want to ignore bases containing a specific letter.

This time, lets ignore any bases equal to 'N' when considering error rate

error = sq.compare_to_reference(sq.seq_df.iloc[0]['seqs'], ignore_characters=['N']).sum()/sq.seq_table.shape[0]
iplot(go.Figure(data=[go.Scatter(x = error.index, y = error, mode='markers')], layout=go.Layout({'xaxis':{'title': 'base position'}, 'yaxis':{'title': 'frequency equal to reference'}})))

Now do we find any conserved bases at a specific position?

"""
Determine the distribution of letters at each postion
"""
dist = sq.get_seq_dist()
dist

Lets plot the distribution of bases using a sequence logo based on methods described here : https://github.com/ISA-tools/SequenceLogoVis

plot_data = seq_tables.draw_seqlogo_barplots(dist/dist.sum(), alphabet='nt', yaxistitle='frequency')
iplot(plot_data)

	1	2	3	4	5	6	7	8	9	10-14	...	205-209	210-214	215-219	220-224	225-229	230-234	235-239	240-244	245-249	250-254
count	3000.000000	3000.000000	3000.000000	3000.000000	3000.000000	3000.000000	3000.000000	3000.000000	3000.000000	15000.000000	...	15000.000000	15000.000000	15000.000000	15000.000000	15000.000000	15000.000000	15000.000000	15000.000000	15000.000000	3000.000000
mean	32.461333	32.855000	33.074667	33.214000	32.972000	32.285000	32.454667	32.358333	32.924000	33.957000	...	32.416067	32.170867	31.733000	31.514533	30.088267	29.954067	30.340867	30.250333	30.004200	22.639667
std	3.007134	2.527338	2.031042	1.773486	2.330313	4.571879	2.896678	3.010695	2.215821	4.554324	...	8.113011	8.209306	8.263074	8.380561	9.213078	9.250358	9.030052	9.235223	9.219869	9.778839
min	16.000000	2.000000	16.000000	16.000000	16.000000	16.000000	16.000000	16.000000	16.000000	15.000000	...	12.000000	2.000000	2.000000	2.000000	12.000000	2.000000	2.000000	12.000000	12.000000	12.000000
10%	32.000000	32.000000	32.000000	32.000000	32.000000	30.000000	32.000000	32.000000	32.000000	32.000000	...	14.000000	14.000000	13.000000	13.000000	13.000000	13.000000	14.000000	13.000000	14.000000	12.000000
25%	32.000000	32.000000	33.000000	33.000000	33.000000	33.000000	32.000000	32.000000	32.000000	33.000000	...	32.000000	32.000000	31.000000	30.000000	26.000000	24.000000	25.000000	25.000000	24.000000	14.000000
50%	33.000000	33.000000	33.000000	33.000000	33.000000	33.000000	33.000000	33.000000	33.000000	34.000000	...	37.000000	37.000000	37.000000	37.000000	35.000000	35.000000	36.000000	36.000000	35.000000	15.000000
75%	34.000000	34.000000	34.000000	34.000000	34.000000	34.000000	34.000000	34.000000	34.000000	38.000000	...	37.000000	37.000000	37.000000	37.000000	37.000000	37.000000	37.000000	37.000000	37.000000	33.000000
90%	34.000000	34.000000	34.000000	35.000000	34.000000	35.000000	34.000000	34.000000	34.000000	38.000000	...	37.000000	37.000000	37.000000	37.000000	37.000000	37.000000	37.000000	37.000000	37.000000	37.000000
max	36.000000	36.000000	36.000000	36.000000	36.000000	37.000000	37.000000	37.000000	37.000000	38.000000	...	39.000000	39.000000	39.000000	39.000000	39.000000	39.000000	39.000000	39.000000	39.000000	38.000000

	1	2	3	4	5	6	7	8	9	10	...	240	241	242	243	244	245	246	247	248	249
count	3000.000000	3000.000000	3000.000000	3000.000000	3000.000000	3000.000000	3000.000000	3000.000000	3000.000000	3000.000000	...	3000.000000	3000.000000	3000.000000	3000.000000	3000.000000	3000.000000	3000.000000	3000.000000	3000.000000	3000.000000
mean	32.461333	32.855000	33.074667	33.214000	32.972000	32.285000	32.454667	32.358333	32.924000	32.768333	...	31.450000	29.579667	29.431667	30.900333	29.890000	29.364000	30.539000	29.843333	29.744333	30.530333
std	3.007134	2.527338	2.031042	1.773486	2.330313	4.571879	2.896678	3.010695	2.215821	2.743934	...	8.380388	9.606860	9.695256	8.786740	9.472089	9.636967	8.842278	9.265891	9.340666	8.939522
min	16.000000	2.000000	16.000000	16.000000	16.000000	16.000000	16.000000	16.000000	16.000000	16.000000	...	12.000000	12.000000	12.000000	12.000000	12.000000	12.000000	12.000000	12.000000	12.000000	12.000000
10%	32.000000	32.000000	32.000000	32.000000	32.000000	30.000000	32.000000	32.000000	32.000000	32.000000	...	14.000000	13.000000	13.000000	13.000000	13.000000	14.000000	14.000000	14.000000	14.000000	14.000000
25%	32.000000	32.000000	33.000000	33.000000	33.000000	33.000000	32.000000	32.000000	32.000000	32.000000	...	29.000000	24.000000	24.000000	26.000000	24.000000	24.000000	25.000000	24.000000	24.000000	26.000000
50%	33.000000	33.000000	33.000000	33.000000	33.000000	33.000000	33.000000	33.000000	33.000000	33.000000	...	37.000000	36.000000	36.000000	37.000000	36.000000	33.500000	36.000000	35.000000	34.000000	36.000000
75%	34.000000	34.000000	34.000000	34.000000	34.000000	34.000000	34.000000	34.000000	34.000000	34.000000	...	37.000000	37.000000	37.000000	37.000000	37.000000	37.000000	37.000000	37.000000	37.000000	37.000000
90%	34.000000	34.000000	34.000000	35.000000	34.000000	35.000000	34.000000	34.000000	34.000000	34.000000	...	37.000000	37.000000	37.000000	37.000000	37.000000	37.000000	37.000000	37.000000	37.000000	37.000000
max	36.000000	36.000000	36.000000	36.000000	36.000000	37.000000	37.000000	37.000000	37.000000	37.000000	...	39.000000	38.000000	38.000000	38.000000	38.000000	39.000000	39.000000	38.000000	39.000000	39.000000

	1	2	3	4	5	6	7	8	9	10	...	241	242	243	244	245	246	247	248	249	250
A	56.0	101	1313.0	57.0	107.0	1293.0	1562.0	1327.0	57.0	94.0	...	363.0	559.0	237.0	238.0	1340.0	199.0	310.0	546.0	183.0	1251.0
C	71.0	1520	1529.0	84.0	44.0	72.0	55.0	1538.0	1289.0	59.0	...	2020.0	540.0	700.0	2066.0	619.0	789.0	1919.0	504.0	823.0	863.0
G	2818.0	1317	68.0	2802.0	1325.0	1552.0	1319.0	71.0	1577.0	1314.0	...	410.0	359.0	1750.0	460.0	309.0	1669.0	539.0	426.0	1655.0	655.0
N	0.0	4	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
T	55.0	58	90.0	57.0	1524.0	83.0	64.0	64.0	77.0	1533.0	...	207.0	1542.0	313.0	236.0	732.0	343.0	232.0	1524.0	339.0	231.0

	seqs	quals
M01012:51:000000000-A4H0H:1:1101:15861:1495 1:N:0:10\| <{"SEQ_ID": "552c2da19eb63635e1c950f1::552c261b9eb6363a487b62c9"}	GNCGTGACGTTGGACGAGTCCGGGGGCGGCCTCCAGACGCCCGGAG...	B#>AABCCCCCCGGGGGGGGGGGGGGEGGGGGHHHHHGGGGGGGGG...
M01012:51:000000000-A4H0H:1:1101:16832:1547 1:N:0:10\| <{"SEQ_ID": "552c2da19eb63635e1c950f2::552c261b9eb6363a487b62c9"}	GGAGGAGACGATGACTTCGGTCCCGTGGCCCCATGCGTCGATACAG...	A@ABB@A>AADAGGGGGGGGFGGHGH2EEBGEEHHHGGGGGFGGGH...
M01012:51:000000000-A4H0H:1:1101:14519:1548 1:N:0:10\| <{"SEQ_ID": "552c2da19eb63635e1c950f3::552c261b9eb6363a487b62c9"}	CGTGACGTTGGACGAGTCCGGGGGCGGCCTCCAGACGCCCGGAGGA...	AAAAAA>>>>>1BE1AEBFFGG0EEGCGEFCC>FGGGG/EEEEGGG...
M01012:51:000000000-A4H0H:1:1101:14011:1558 1:N:0:10\| <{"SEQ_ID": "552c2da19eb63635e1c950f4::552c261b9eb6363a487b62c9"}	GCCGTGACGTTGGACGAGTCCGGGGGCGGCCTCCAGACGCCCGGAG...	CCCCADBACCCCGGFGGGGGGGGGGGGGFGGGHHHHGGGGEFGGCG...
M01012:51:000000000-A4H0H:1:1101:16697:1561 1:N:0:10\| <{"SEQ_ID": "552c2da19eb63635e1c950f5::552c261b9eb6363a487b62c9"}	GGAGGAGACGATGACTTCGGTCCCGTGGCCCCATGCGTCGATGTCA...	BBBBBBA4AADBGBFGGGGGFGGGGHGFGHHGGHHHGGGGGGHFGG...

	seqs	quals
M01012:51:000000000-A4H0H:1:1101:15861:1495 1:N:0:10\| <{"SEQ_ID": "552c2da19eb63635e1c950f1::552c261b9eb6363a487b62c9"}	b'GNCGTGACGTTGGACGAGTCCGGGGGCGGCCTCCAGACGCCCGG...	B#>AABCCCCCCGGGGGGGGGGGGGGEGGGGGHHHHHGGGGGGGGG...
M01012:51:000000000-A4H0H:1:1101:16832:1547 1:N:0:10\| <{"SEQ_ID": "552c2da19eb63635e1c950f2::552c261b9eb6363a487b62c9"}	b'GGAGGAGACGATGACTTCGGTCCCGTNGCCCCATGCGTCGATAC...	A@ABB@A>AADAGGGGGGGGFGGHGH2EEBGEEHHHGGGGGFGGGH...
M01012:51:000000000-A4H0H:1:1101:14519:1548 1:N:0:10\| <{"SEQ_ID": "552c2da19eb63635e1c950f3::552c261b9eb6363a487b62c9"}	b'CGTGACGTTGGNCGNGTCCGGGNGCGGCCTCCAGACGCNCGGAG...	AAAAAA>>>>>1BE1AEBFFGG0EEGCGEFCC>FGGGG/EEEEGGG...
M01012:51:000000000-A4H0H:1:1101:14011:1558 1:N:0:10\| <{"SEQ_ID": "552c2da19eb63635e1c950f4::552c261b9eb6363a487b62c9"}	b'GCCGTGACGTTGGACGAGTCCGGGGGCGGCCTCCAGACGCCCGG...	CCCCADBACCCCGGFGGGGGGGGGGGGGFGGGHHHHGGGGEFGGCG...
M01012:51:000000000-A4H0H:1:1101:16697:1561 1:N:0:10\| <{"SEQ_ID": "552c2da19eb63635e1c950f5::552c261b9eb6363a487b62c9"}	b'GGAGGAGNCGATGACTTCGGTCCCGTGGCCCCATGCGTCGATGT...	BBBBBBA4AADBGBFGGGGGFGGGGHGFGHHGGHHHGGGGGGHFGG...

	sliced
M01012:51:000000000-A4H0H:1:1101:15861:1495 1:N:0:10\| <{"SEQ_ID": "552c2da19eb63635e1c950f1::552c261b9eb6363a487b62c9"}	b'CTC'
M01012:51:000000000-A4H0H:1:1101:16832:1547 1:N:0:10\| <{"SEQ_ID": "552c2da19eb63635e1c950f2::552c261b9eb6363a487b62c9"}	b'AGG'
M01012:51:000000000-A4H0H:1:1101:14519:1548 1:N:0:10\| <{"SEQ_ID": "552c2da19eb63635e1c950f3::552c261b9eb6363a487b62c9"}	b'TGG'
M01012:51:000000000-A4H0H:1:1101:14011:1558 1:N:0:10\| <{"SEQ_ID": "552c2da19eb63635e1c950f4::552c261b9eb6363a487b62c9"}	b'CTC'
M01012:51:000000000-A4H0H:1:1101:16697:1561 1:N:0:10\| <{"SEQ_ID": "552c2da19eb63635e1c950f5::552c261b9eb6363a487b62c9"}	b'AGG'

	seqs	quals
0	AAA	ZCA
1	ACT	DGJ
2	ACA	JJJ