SEQ_TABLES TUTORIAL: Analyzing aligned BCR sequence reads

The purpose of this tutorial is to show how seqtables can be used for various investigations in BCR sequences. It can be useful for BCR analysis because usually there are speicfic regions with a BCR sequence that are of interest. Therefore, it is useful to be able to slice sequences easily and extract regions of interest. Also you might want to be able to perform hamming distance operations to show how many mutations there are to a germline gene or analyze amino acid distribution after fixing a specific position.

In this example we will work on a small dataset of sample BCR related sequences. This dataset was annotated using IMGT. IMGT was selected because of its 'IMGT GAP' field in which all sequences are aligned relative to a BCR scaffold structure of numbering. In other words, insertions are removed from the sequeqnce and deletions are filled with gaps or '.'. Because all sequences are aligned against a specific scaffold, we can analyze all sequences together in a table using seqtables class.

In [5]:
import sys
sys.path.insert(0, '../')
import seq_tables
from plotly.offline import iplot, init_notebook_mode
import numpy as np
import pandas as pd
from plotly import graph_objs as go
In [6]:
%load_ext autoreload
%autoreload 2
In [7]:
init_notebook_mode()

First lets load the summary file from IMGT

In [8]:
imgt_file = pd.read_csv('./imgt_results/1_Summary.txt', sep='\t').fillna('').set_index('Sequence ID')
imgt_file.iloc[0]
Out[8]:
Sequence number                                                                               1
Functionality                                                             unknown (see comment)
V-GENE and allele                                           Musmus IGHV5-6-4*01 F (see comment)
V-REGION score                                                                              474
V-REGION identity %                                                                        61.6
V-REGION identity nt                                                                 154/250 nt
V-REGION identity % (with ins/del events)                                                      
V-REGION identity nt (with ins/del events)                                                     
J-GENE and allele                                                                              
J-REGION score                                                                                 
J-REGION identity %                                                                            
J-REGION identity nt                                                                           
D-GENE and allele                                                                              
D-REGION reading frame                                                                         
CDR1-IMGT length                                                                              8
CDR2-IMGT length                                                                              8
CDR3-IMGT length                                                                              X
CDR-IMGT lengths                                                                          8.8.X
FR-IMGT lengths                                                                    [25.17.25.X]
AA JUNCTION                                                                                    
JUNCTION frame                                                                             null
Orientation                                                                                   +
Functionality comment                                                    No rearrangement found
V-REGION potential ins/del                    Low V-REGION identity (61,60% ) this may indic...
J-GENE and allele comment                                                                      
V-REGION insertions                                                                            
V-REGION deletions                                                                             
Sequence                                      gncgtgacgttggacgagtccgggggcggcctccagacgcccggag...
Unnamed: 29                                                                                    
Name: M01012:51:000000000-A4H0H:1:1101:15861:1495_1:N:0:, dtype: object

But the more important table is from the gapped file where all reads are aligned to one-another

In [9]:
gapped_reads = pd.read_csv('./imgt_results/2_IMGT-gapped-nt-sequences.txt', sep='\t').fillna('').set_index('Sequence ID')
gapped_reads.iloc[0]
Out[9]:
Sequence number                                                      1
Functionality                                                  unknown
V-GENE and allele                                Musmus IGHV5-6-4*01 F
J-GENE and allele                                                     
D-GENE and allele                                                     
V-D-J-REGION                                                          
V-J-REGION                                                            
V-REGION             gncgtgacgttggacgagtccgggggc...ggcctccagacgcccg...
FR1-IMGT             gncgtgacgttggacgagtccgggggc...ggcctccagacgcccg...
CDR1-IMGT                         nngttcaccttc............agcagttatccc
FR2-IMGT             atgcagtgggngcgacaggcgcccggcaaggggntcgnatgngtcg...
CDR2-IMGT                               anttgtaaggat......ggtagtagcacn
FR3-IMGT             aaatncgcggcggcggtnang...ggccgtgccaccatntcgaggg...
CDR3-IMGT                                                             
JUNCTION                                                              
J-REGION                                                              
FR4-IMGT                                                              
Unnamed: 18                                                           
Name: M01012:51:000000000-A4H0H:1:1101:15861:1495_1:N:0:, dtype: object
In [10]:
gapped_reads_aa = pd.read_csv('./imgt_results/4_IMGT-gapped-AA-sequences.txt', sep='\t').fillna('').set_index('Sequence ID')
gapped_reads_aa.iloc[0]
Out[10]:
Sequence number                                                      1
Functionality                                                  unknown
V-GENE and allele                                Musmus IGHV5-6-4*01 F
J-GENE and allele                                                     
D-GENE and allele                                                     
V-D-J-REGION                                                          
V-J-REGION                                                            
V-REGION             XVTLDESGG.GLQTPGGALSLXCKASXFTF....SSYPMQWXRQAP...
FR1-IMGT                                    XVTLDESGG.GLQTPGGALSLXCKAS
CDR1-IMGT                                                 XFTF....SSYP
FR2-IMGT                                             MQWXRQAPGKGXXXVAX
CDR2-IMGT                                                   XCKD..GSST
FR3-IMGT                                    KXAAAVX.GRATXSRXXXQXTXRLQL
CDR3-IMGT                                                             
JUNCTION                                                              
J-REGION                                                              
FR4-IMGT                                                              
Unnamed: 18                                                           
Name: M01012:51:000000000-A4H0H:1:1101:15861:1495_1:N:0:, dtype: object

Finally, lets also load the raw sequence fastq file as a seqtable. This will contain pandas dataframes of the sequences and quality scores as integers/ascii values. We can use this file to perform any quality filtering.

In [16]:
raw_file = seq_tables.read_fastq('./imgt_results/Demo.fq')
print('these are the sequences in ascii')
raw_file.seq_table.iloc[:2]
these are the sequences in ascii
Out[16]:
1 2 3 4 5 6 7 8 9 10 ... 241 242 243 244 245 246 247 248 249 250
M01012:51:000000000-A4H0H:1:1101:15861:1495 1:N:0:10| <{"SEQ_ID": "552c2da19eb63635e1c950f1::552c261b9eb6363a487b62c9"} 71 78 67 71 84 71 65 67 71 84 ... 67 84 71 67 65 71 67 84 71 65
M01012:51:000000000-A4H0H:1:1101:16832:1547 1:N:0:10| <{"SEQ_ID": "552c2da19eb63635e1c950f2::552c261b9eb6363a487b62c9"} 71 71 65 71 71 65 71 65 67 71 ... 65 65 71 67 67 65 67 84 67 67

2 rows × 250 columns

In [15]:
print('these are their qualities')
raw_file.qual_table.iloc[:2]
these are their qualities
Out[15]:
1 2 3 4 5 6 7 8 9 10 ... 241 242 243 244 245 246 247 248 249 250
M01012:51:000000000-A4H0H:1:1101:15861:1495 1:N:0:10| <{"SEQ_ID": "552c2da19eb63635e1c950f1::552c261b9eb6363a487b62c9"} 33 2 29 32 32 33 34 34 34 34 ... 37 37 37 37 37 33 37 33 33 15
M01012:51:000000000-A4H0H:1:1101:16832:1547 1:N:0:10| <{"SEQ_ID": "552c2da19eb63635e1c950f2::552c261b9eb6363a487b62c9"} 32 31 32 33 33 31 32 29 32 32 ... 37 37 33 36 37 37 37 24 33 33

2 rows × 250 columns

In [17]:
raw_file.shape()
Out[17]:
(3000, 250)

Lets look at the quality score distributions for these reads

In [18]:
iplot(raw_file.get_quality_dist()[1])

Step 1: Remove low quality reads

IMGT does not include the quality information in the results, but we can use the raw sequences to filter out low quality reads.

Note we could have done this in one step if we had quality reads associated in our gapped file

In [19]:
raw_file.quality_filter(q=30, p=80, inplace=True)
raw_file.seq_table.shape
Out[19]:
(2534, 250)

Only select IMGT results whose sequence header is in the filtered raw_file seqtable

In [20]:
gapped_reads = gapped_reads.loc[raw_file.seq_table.index.map(lambda x: x[:50].replace(' ', '_'))]
gapped_reads_aa = gapped_reads_aa.loc[raw_file.seq_table.index.map(lambda x: x[:50].replace(' ', '_'))]
imgt_file = imgt_file.loc[raw_file.seq_table.index.map(lambda x: x[:50].replace(' ', '_'))]

Analysis Using Seq Tables

Lets first start by creating the seq table using the gapped file

In [21]:
# uppercase sequences, just for looks
gapped_reads['V-REGION'] = gapped_reads['V-REGION'].str.upper()
In [22]:
# CREATE THE SEQ TABLE, SET ITS INDEX EQUAL TO THE INDEX OF THE ORIGINALL GAPPED READS df
bcr_st = seq_tables.seqtable(gapped_reads['V-REGION'], index=gapped_reads.index)
bcr_st.seq_table.iloc[:5]
Out[22]:
1 2 3 4 5 6 7 8 9 10 ... 311 312 313 314 315 316 317 318 319 320
Sequence ID
M01012:51:000000000-A4H0H:1:1101:16832:1547_1:N:0: 46 46 46 46 46 46 46 46 46 46 ... 71 67 71 67 78 78 78 78 78 78
M01012:51:000000000-A4H0H:1:1101:14011:1558_1:N:0: 71 67 67 71 84 71 65 67 71 84 ... 78 78 78 78 78 78 78 78 78 78
M01012:51:000000000-A4H0H:1:1101:16697:1561_1:N:0: 46 46 46 46 46 46 46 46 46 46 ... 71 67 71 67 78 78 78 78 78 78
M01012:51:000000000-A4H0H:1:1101:14388:1594_1:N:0: 46 46 46 46 46 46 46 46 46 46 ... 71 67 71 67 78 78 78 78 78 78
M01012:51:000000000-A4H0H:1:1101:16245:1601_1:N:0: 46 46 46 46 46 46 46 46 46 46 ... 71 78 78 78 78 78 78 78 78 78

5 rows × 320 columns

Again, the values in the seqtable correspond to: Rows - unique sequences Columns - speicfiic position within a sequence

Each cell element is the ascii representation of the base/residue

Below we can view the bases

In [28]:
bcr_st.iloc[0].view_bases(num_base_show=10)[:10]
Out[28]:
array([[b'.'],
       [b'.'],
       [b'.'],
       [b'.'],
       [b'.'],
       [b'.'],
       [b'.'],
       [b'.'],
       [b'.'],
       [b'.']], 
      dtype='|S1')

Or we could view bases side by side such that every base in a group is at the same position

In [30]:
bcr_st.view_bases(num_base_show=10).T[0, :10]
Out[30]:
array([b'.', b'G', b'.', b'.', b'.', b'G', b'G', b'G', b'N', b'.'], 
      dtype='|S1')
In [31]:
# LETS also create one for the amino acid usage
bcr_st_aa = seq_tables.seqtable(gapped_reads_aa['V-REGION'], index=gapped_reads.index, seqtype='AA')
bcr_st_aa.seq_table.iloc[:5]
Out[31]:
1 2 3 4 5 6 7 8 9 10 ... 97 98 99 100 101 102 103 104 105 106
Sequence ID
M01012:51:000000000-A4H0H:1:1101:16832:1547_1:N:0: 46 46 46 46 46 46 46 46 46 46 ... 69 68 84 71 84 89 89 67 88 88
M01012:51:000000000-A4H0H:1:1101:14011:1558_1:N:0: 65 86 84 76 68 69 83 71 71 46 ... 88 88 88 88 88 88 88 88 88 88
M01012:51:000000000-A4H0H:1:1101:16697:1561_1:N:0: 46 46 46 46 46 46 46 46 46 46 ... 69 68 84 71 84 89 70 67 88 88
M01012:51:000000000-A4H0H:1:1101:14388:1594_1:N:0: 46 46 46 46 46 46 46 46 46 46 ... 69 68 84 65 84 89 89 67 88 88
M01012:51:000000000-A4H0H:1:1101:16245:1601_1:N:0: 46 46 46 46 46 46 46 46 46 46 ... 69 68 84 65 84 89 89 88 88 88

5 rows × 106 columns

Remove reads that do not have any results

In [32]:
good_reads = imgt_file[imgt_file['Functionality'] != 'No results'].index
bcr_st_aa = bcr_st_aa.loc[good_reads]
bcr_st = bcr_st.loc[good_reads]

Store some IMGT variables that describe how to annotate each AA in the results table

One of the benefits of seqtables is that it makes it easy to select specific regions of interest because of pandas indexing. So to show this for BCR analysis, we want to know which AA residues belong to framework and CDR regions

In [33]:
# here we map the length of each of the annotated reqions within a BCR seq reported by IMGT number (framework and CDR regions)
max_len = gapped_reads_aa['V-REGION'].apply(len).max()
template = gapped_reads_aa[gapped_reads_aa['V-REGION'].apply(len) == max_len].iloc[0]
bcr_lengths = [
    ('FR1', len(template['FR1-IMGT'])),
    ('CDR1', len(template['CDR1-IMGT'])),
    ('FR2', len(template['FR2-IMGT'])),
    ('CDR2', len(template['CDR2-IMGT'])),
    ('FR3', len(template['FR3-IMGT'])),
]
current_len = sum([x[1] for x in bcr_lengths])
max_vdj_len = gapped_reads_aa['V-D-J-REGION'].apply(len).max()
bcr_lengths.append(('CDR3/FR4', max_vdj_len - current_len))
bcr_lengths
Out[33]:
[('FR1', 26),
 ('CDR1', 12),
 ('FR2', 17),
 ('CDR2', 10),
 ('FR3', 39),
 ('CDR3/FR4', 40)]

Here we explicitly define which AA residues correspond to each region

In [34]:
pos = 1
region_to_pos ={}
for x in bcr_lengths:
    region_to_pos[x[0]] = list(range(pos, pos + x[1]))
    pos += x[1]
# region_to_pos

Just out of interest, we can create a frequency plot of all the sequences in the dataset

SOME things to note first

  1. Lets ignore bases that are not an AA (those with gaps .)
  2. Lets ignore stop codons
In [35]:
# lets add in FR and CDR annotations to each seq logo position
annotation_text = []
for x in bcr_lengths:
    annotation_text.extend([x[0]]*x[1])
iplot(bcr_st_aa.seq_logo(ignore_characters=['.', 'X', '*'],  additional_text=[('REGION',annotation_text)])[0])
..\seq_tables.py:170: UserWarning:

Currently only frequency logos are used, will allow for entropy in future

Lets look at the sequence logo of CDR3 sequences

In [37]:
CDR3_list = gapped_reads_aa.loc[gapped_reads_aa['CDR3-IMGT'] != '', 'CDR3-IMGT']
cdr3_table = seq_tables.seqtable(CDR3_list, seqtype='AA')
iplot(cdr3_table.seq_logo()[0])
..\seq_tables.py:170: UserWarning:

Currently only frequency logos are used, will allow for entropy in future

Germline Mutational Information

Next we will look at mutation profiles for sequences derived from a specific VH gene germline

Lets setup some important variables first

In [38]:
# get a list of the germline sequences from IMGT (NT SEQUENCES)
germ_lines = [r for r in open('imgt_results/mouse_germlines2.fa')]
header = [a[1:].strip() for i, a in enumerate(germ_lines) if i%2 == 0]
seqs = [a.strip() for i, a in enumerate(germ_lines) if i%2 == 1]
germlines = pd.DataFrame({'header': header, 'seqs': seqs})
In [39]:
# get a list of the germline sequences IMGT (AA SEQUENCES)
germ_lines_AA = [r for r in open('imgt_results/mouse_germlines_AA.fa')]
header = [a[1:].strip() for i, a in enumerate(germ_lines_AA) if i%2 == 0]
seqs = [a.strip() for i, a in enumerate(germ_lines_AA) if i%2 == 1]
germlines_AA = pd.DataFrame({'header': header, 'seqs': seqs})
germlines_AA.seqs = germlines_AA.seqs.apply(lambda x: x + '.' * (max_len-len(x)))
In [40]:
# just look at the first gene hit
gapped_reads_aa['V-GENE'] =  gapped_reads_aa['V-GENE and allele'].apply(lambda x: x.split(',')[0])
In [42]:
gapped_reads_aa.groupby(by='V-GENE').apply(len).sort_values(ascending=False).iloc[:5]
Out[42]:
V-GENE
Musmus IGHV5-12-1*01 F    388
Musmus IGHV5-6-4*01 F     362
Musmus IGHV5-6-2*01 F     308
Musmus IGHV5-9-4*01 F     277
Musmus IGHV5-6*03 [F]     230
dtype: int64

Variables above loaded

Analyze the mutation rate of sequences derived from IGHV5-12-1

In [43]:
gene = 'IGHV5-12-1'

# Slice IGHV-12-1 sequences from the seqtable
gene_subset = bcr_st_aa.loc[gapped_reads_aa[gapped_reads_aa['V-GENE'].str.contains(gene)].index]
germ_gene = germlines_AA.loc[germlines_AA['header'].str.contains(gene.replace(' ', '_')), 'seqs'].iloc[0]

Compare all reads which map to IGHV5-12-1 to germline. Calculate which residues are not equal to the germline gene IGHV5-12-1

  • Note: We will ignore residues that are equal to '.' or X. they will not be treated as mismatches
In [45]:
compared_to_germline = gene_subset.compare_to_reference(germ_gene, ignore_characters=['.', 'X'], flip=True)
compared_to_germline.iloc[:2]
Out[45]:
1 2 3 4 5 6 7 8 9 10 ... 97 98 99 100 101 102 103 104 105 106
Sequence ID
M01012:51:000000000-A4H0H:1:1101:14388:1594_1:N:0: False False False False False False False False False False ... False False False False True False False False False False
M01012:51:000000000-A4H0H:1:1101:12743:1835_1:N:0: False False False False False False False False False False ... True False False True True False True False False False

2 rows × 106 columns

In [47]:
# total # of mutations ever
compared_to_germline.sum().sum()
Out[47]:
10134
In [46]:
# mismatches in position 3:
compared_to_germline[3].iloc[:5]
Out[46]:
Sequence ID
M01012:51:000000000-A4H0H:1:1101:14388:1594_1:N:0:    False
M01012:51:000000000-A4H0H:1:1101:12743:1835_1:N:0:    False
M01012:51:000000000-A4H0H:1:1101:13829:1951_1:N:0:    False
M01012:51:000000000-A4H0H:1:1101:16107:1956_1:N:0:     True
M01012:51:000000000-A4H0H:1:1101:13196:1975_1:N:0:     True
Name: 3, dtype: bool
In [48]:
iplot(go.Figure(
        data=[go.Scatter(y=(compared_to_germline).sum(axis=1), x=list(range(compared_to_germline.shape[0])), mode='markers')], 
        layout=go.Layout(title='Hamming distance from germline considering all positions', xaxis={'title':'Sequence'}, yaxis={'title':'Hamming distance'})
    ))

Look at the distribution of hamming distances in each framework/CDR region

In [49]:
# select mismatches in specific positions and take the sum of those regions
# loop through each of the bcr regions (FR1, FR2, FR3, CDR1, CDR2,) and slice the mismatches in each position
hamming_by_region = {
    region: compared_to_germline[region_to_pos[region]].sum(axis=1).values / (1.0 * len(region_to_pos[region]))
    for region in region_to_pos if max(region_to_pos[region]) in compared_to_germline.columns }
iplot(go.Figure(
        data=[go.Box(x=hamming_by_region[region], name=region) for region in hamming_by_region],
        layout=go.Layout(title='Normalized hamming distance distributions in each BCR region', xaxis={'title': 'Hamming distance'})
    ),
)

Analyze position mutation frequency

In [50]:
iplot(go.Figure(
        data=[go.Scatter(y=(compared_to_germline).sum(axis=0)/(1.0 * compared_to_germline.shape[0]), mode='markers')], 
        layout=go.Layout(title='Mutations as a function of position', xaxis={'title':'Position'}, yaxis={'title':'Percent mutated'})
    ))

Analysis of mutation types

In this last analysis we will look patterns in specific types of mutations

*Note this is normally more useful on DNA sequences when we want to look for biases such as transversions

This example shows how we can easily "stack" all mutations that are not equal to the germline at each position

In [55]:
refarray= pd.DataFrame(np.array([germ_gene], dtype='S').view('S1').view(np.uint8), index=list(range(1, len(germ_gene) + 1)))
compared_to_germline = gene_subset.compare_to_reference(germ_gene, ignore_characters=['.', 'X'], flip=True)
all_mutated_residues = gene_subset.seq_table[compared_to_germline].stack().astype(np.uint8).reset_index().drop( 'Sequence ID', axis=1,).rename(columns={'level_1': 'AA Pos', 0: 'BCR AA'}).set_index('AA Pos')
all_mutated_residues.iloc[:10]
Out[55]:
BCR AA
AA Pos
47 71
49 71
55 83
58 75
59 68
64 71
66 71
68 71
69 65
70 65

For each of the mutated regions, we can also associated the "germline" residue at that position. Then we can group together mutation types and look at the distribution

In [56]:
# analalyze occurrences of all mutations from germline in the dataset
all_mutated_residues = all_mutated_residues.merge(refarray, left_index=True, right_index=True, how='inner').rename(columns={0: 'Germ AA'})
mutation_profile = all_mutated_residues.groupby(by=['BCR AA', 'Germ AA']).apply(len).sort_values(ascending=False).reset_index()
mutation_profile[['BCR AA', 'Germ AA']] = mutation_profile[['BCR AA', 'Germ AA']].applymap(lambda x: chr(x))
mutation_profile.set_index(['BCR AA', 'Germ AA']).rename(columns={0:'counts'}).iloc[:10]
Out[56]:
counts
BCR AA Germ AA
N S 639
G A 592
A T 548
Q K 410
R Y 387
A F 376
S N 370
L M 360
A S 353
G R 317

The above code works ,but it can be slow. Another method of doing the above steps is provided as a wrapper function in seq tables

*Note: As always we could look at the mutation profile in specific positions

In [57]:
gene_subset.mutation_profile(reference_seq=germ_gene, positions=None, ignore_characters=['X', '.']).sort_values(ascending=False).iloc[:10]
Out[57]:
ref   mut 
b'S'  b'N'    639.0
b'A'  b'G'    592.0
b'T'  b'A'    548.0
b'K'  b'Q'    410.0
b'Y'  b'R'    387.0
b'F'  b'A'    376.0
b'N'  b'S'    370.0
b'M'  b'L'    360.0
b'S'  b'A'    353.0
b'R'  b'G'    317.0
dtype: float64