SEQ_TABLES TUTORIAL: Analyzing aligned BCR sequence reads¶

The purpose of this tutorial is to show how seqtables can be used for various investigations in BCR sequences. It can be useful for BCR analysis because usually there are speicfic regions with a BCR sequence that are of interest. Therefore, it is useful to be able to slice sequences easily and extract regions of interest. Also you might want to be able to perform hamming distance operations to show how many mutations there are to a germline gene or analyze amino acid distribution after fixing a specific position.

In this example we will work on a small dataset of sample BCR related sequences. This dataset was annotated using IMGT. IMGT was selected because of its 'IMGT GAP' field in which all sequences are aligned relative to a BCR scaffold structure of numbering. In other words, insertions are removed from the sequeqnce and deletions are filled with gaps or '.'. Because all sequences are aligned against a specific scaffold, we can analyze all sequences together in a table using seqtables class.

import sys
sys.path.insert(0, '../')
import seq_tables
from plotly.offline import iplot, init_notebook_mode
import numpy as np
import pandas as pd
from plotly import graph_objs as go

%load_ext autoreload
%autoreload 2

init_notebook_mode()

First lets load the summary file from IMGT¶

imgt_file = pd.read_csv('./imgt_results/1_Summary.txt', sep='\t').fillna('').set_index('Sequence ID')
imgt_file.iloc[0]

Sequence number                                                                               1
Functionality                                                             unknown (see comment)
V-GENE and allele                                           Musmus IGHV5-6-4*01 F (see comment)
V-REGION score                                                                              474
V-REGION identity %                                                                        61.6
V-REGION identity nt                                                                 154/250 nt
V-REGION identity % (with ins/del events)                                                      
V-REGION identity nt (with ins/del events)                                                     
J-GENE and allele                                                                              
J-REGION score                                                                                 
J-REGION identity %                                                                            
J-REGION identity nt                                                                           
D-GENE and allele                                                                              
D-REGION reading frame                                                                         
CDR1-IMGT length                                                                              8
CDR2-IMGT length                                                                              8
CDR3-IMGT length                                                                              X
CDR-IMGT lengths                                                                          8.8.X
FR-IMGT lengths                                                                    [25.17.25.X]
AA JUNCTION                                                                                    
JUNCTION frame                                                                             null
Orientation                                                                                   +
Functionality comment                                                    No rearrangement found
V-REGION potential ins/del                    Low V-REGION identity (61,60% ) this may indic...
J-GENE and allele comment                                                                      
V-REGION insertions                                                                            
V-REGION deletions                                                                             
Sequence                                      gncgtgacgttggacgagtccgggggcggcctccagacgcccggag...
Unnamed: 29                                                                                    
Name: M01012:51:000000000-A4H0H:1:1101:15861:1495_1:N:0:, dtype: object

But the more important table is from the gapped file where all reads are aligned to one-another¶

gapped_reads = pd.read_csv('./imgt_results/2_IMGT-gapped-nt-sequences.txt', sep='\t').fillna('').set_index('Sequence ID')
gapped_reads.iloc[0]

Sequence number                                                      1
Functionality                                                  unknown
V-GENE and allele                                Musmus IGHV5-6-4*01 F
J-GENE and allele                                                     
D-GENE and allele                                                     
V-D-J-REGION                                                          
V-J-REGION                                                            
V-REGION             gncgtgacgttggacgagtccgggggc...ggcctccagacgcccg...
FR1-IMGT             gncgtgacgttggacgagtccgggggc...ggcctccagacgcccg...
CDR1-IMGT                         nngttcaccttc............agcagttatccc
FR2-IMGT             atgcagtgggngcgacaggcgcccggcaaggggntcgnatgngtcg...
CDR2-IMGT                               anttgtaaggat......ggtagtagcacn
FR3-IMGT             aaatncgcggcggcggtnang...ggccgtgccaccatntcgaggg...
CDR3-IMGT                                                             
JUNCTION                                                              
J-REGION                                                              
FR4-IMGT                                                              
Unnamed: 18                                                           
Name: M01012:51:000000000-A4H0H:1:1101:15861:1495_1:N:0:, dtype: object

gapped_reads_aa = pd.read_csv('./imgt_results/4_IMGT-gapped-AA-sequences.txt', sep='\t').fillna('').set_index('Sequence ID')
gapped_reads_aa.iloc[0]

Sequence number                                                      1
Functionality                                                  unknown
V-GENE and allele                                Musmus IGHV5-6-4*01 F
J-GENE and allele                                                     
D-GENE and allele                                                     
V-D-J-REGION                                                          
V-J-REGION                                                            
V-REGION             XVTLDESGG.GLQTPGGALSLXCKASXFTF....SSYPMQWXRQAP...
FR1-IMGT                                    XVTLDESGG.GLQTPGGALSLXCKAS
CDR1-IMGT                                                 XFTF....SSYP
FR2-IMGT                                             MQWXRQAPGKGXXXVAX
CDR2-IMGT                                                   XCKD..GSST
FR3-IMGT                                    KXAAAVX.GRATXSRXXXQXTXRLQL
CDR3-IMGT                                                             
JUNCTION                                                              
J-REGION                                                              
FR4-IMGT                                                              
Unnamed: 18                                                           
Name: M01012:51:000000000-A4H0H:1:1101:15861:1495_1:N:0:, dtype: object

Finally, lets also load the raw sequence fastq file as a seqtable. This will contain pandas dataframes of the sequences and quality scores as integers/ascii values. We can use this file to perform any quality filtering.¶

raw_file = seq_tables.read_fastq('./imgt_results/Demo.fq')
print('these are the sequences in ascii')
raw_file.seq_table.iloc[:2]

these are the sequences in ascii

print('these are their qualities')
raw_file.qual_table.iloc[:2]

these are their qualities

raw_file.shape()

(3000, 250)

Lets look at the quality score distributions for these reads

iplot(raw_file.get_quality_dist()[1])

Step 1: Remove low quality reads

IMGT does not include the quality information in the results, but we can use the raw sequences to filter out low quality reads.

Note we could have done this in one step if we had quality reads associated in our gapped file

raw_file.quality_filter(q=30, p=80, inplace=True)
raw_file.seq_table.shape

(2534, 250)

Only select IMGT results whose sequence header is in the filtered raw_file seqtable

gapped_reads = gapped_reads.loc[raw_file.seq_table.index.map(lambda x: x[:50].replace(' ', '_'))]
gapped_reads_aa = gapped_reads_aa.loc[raw_file.seq_table.index.map(lambda x: x[:50].replace(' ', '_'))]
imgt_file = imgt_file.loc[raw_file.seq_table.index.map(lambda x: x[:50].replace(' ', '_'))]

Analysis Using Seq Tables¶

Lets first start by creating the seq table using the gapped file¶

# uppercase sequences, just for looks
gapped_reads['V-REGION'] = gapped_reads['V-REGION'].str.upper()

# CREATE THE SEQ TABLE, SET ITS INDEX EQUAL TO THE INDEX OF THE ORIGINALL GAPPED READS df
bcr_st = seq_tables.seqtable(gapped_reads['V-REGION'], index=gapped_reads.index)
bcr_st.seq_table.iloc[:5]

Again, the values in the seqtable correspond to: Rows - unique sequences Columns - speicfiic position within a sequence

Each cell element is the ascii representation of the base/residue

Below we can view the bases

bcr_st.iloc[0].view_bases(num_base_show=10)[:10]

array([[b'.'],
       [b'.'],
       [b'.'],
       [b'.'],
       [b'.'],
       [b'.'],
       [b'.'],
       [b'.'],
       [b'.'],
       [b'.']], 
      dtype='|S1')

Or we could view bases side by side such that every base in a group is at the same position

bcr_st.view_bases(num_base_show=10).T[0, :10]

array([b'.', b'G', b'.', b'.', b'.', b'G', b'G', b'G', b'N', b'.'], 
      dtype='|S1')

# LETS also create one for the amino acid usage
bcr_st_aa = seq_tables.seqtable(gapped_reads_aa['V-REGION'], index=gapped_reads.index, seqtype='AA')
bcr_st_aa.seq_table.iloc[:5]

Remove reads that do not have any results

good_reads = imgt_file[imgt_file['Functionality'] != 'No results'].index
bcr_st_aa = bcr_st_aa.loc[good_reads]
bcr_st = bcr_st.loc[good_reads]

Store some IMGT variables that describe how to annotate each AA in the results table¶

One of the benefits of seqtables is that it makes it easy to select specific regions of interest because of pandas indexing. So to show this for BCR analysis, we want to know which AA residues belong to framework and CDR regions

# here we map the length of each of the annotated reqions within a BCR seq reported by IMGT number (framework and CDR regions)
max_len = gapped_reads_aa['V-REGION'].apply(len).max()
template = gapped_reads_aa[gapped_reads_aa['V-REGION'].apply(len) == max_len].iloc[0]
bcr_lengths = [
    ('FR1', len(template['FR1-IMGT'])),
    ('CDR1', len(template['CDR1-IMGT'])),
    ('FR2', len(template['FR2-IMGT'])),
    ('CDR2', len(template['CDR2-IMGT'])),
    ('FR3', len(template['FR3-IMGT'])),
]
current_len = sum([x[1] for x in bcr_lengths])
max_vdj_len = gapped_reads_aa['V-D-J-REGION'].apply(len).max()
bcr_lengths.append(('CDR3/FR4', max_vdj_len - current_len))
bcr_lengths

[('FR1', 26),
 ('CDR1', 12),
 ('FR2', 17),
 ('CDR2', 10),
 ('FR3', 39),
 ('CDR3/FR4', 40)]

Here we explicitly define which AA residues correspond to each region

pos = 1
region_to_pos ={}
for x in bcr_lengths:
    region_to_pos[x[0]] = list(range(pos, pos + x[1]))
    pos += x[1]
# region_to_pos

Just out of interest, we can create a frequency plot of all the sequences in the dataset¶

SOME things to note first

Lets ignore bases that are not an AA (those with gaps .)
Lets ignore stop codons

# lets add in FR and CDR annotations to each seq logo position
annotation_text = []
for x in bcr_lengths:
    annotation_text.extend([x[0]]*x[1])
iplot(bcr_st_aa.seq_logo(ignore_characters=['.', 'X', '*'],  additional_text=[('REGION',annotation_text)])[0])

..\seq_tables.py:170: UserWarning:

Currently only frequency logos are used, will allow for entropy in future

Lets look at the sequence logo of CDR3 sequences

CDR3_list = gapped_reads_aa.loc[gapped_reads_aa['CDR3-IMGT'] != '', 'CDR3-IMGT']
cdr3_table = seq_tables.seqtable(CDR3_list, seqtype='AA')
iplot(cdr3_table.seq_logo()[0])

..\seq_tables.py:170: UserWarning:

Currently only frequency logos are used, will allow for entropy in future

Germline Mutational Information¶

Next we will look at mutation profiles for sequences derived from a specific VH gene germline

Lets setup some important variables first

# get a list of the germline sequences from IMGT (NT SEQUENCES)
germ_lines = [r for r in open('imgt_results/mouse_germlines2.fa')]
header = [a[1:].strip() for i, a in enumerate(germ_lines) if i%2 == 0]
seqs = [a.strip() for i, a in enumerate(germ_lines) if i%2 == 1]
germlines = pd.DataFrame({'header': header, 'seqs': seqs})

# get a list of the germline sequences IMGT (AA SEQUENCES)
germ_lines_AA = [r for r in open('imgt_results/mouse_germlines_AA.fa')]
header = [a[1:].strip() for i, a in enumerate(germ_lines_AA) if i%2 == 0]
seqs = [a.strip() for i, a in enumerate(germ_lines_AA) if i%2 == 1]
germlines_AA = pd.DataFrame({'header': header, 'seqs': seqs})
germlines_AA.seqs = germlines_AA.seqs.apply(lambda x: x + '.' * (max_len-len(x)))

# just look at the first gene hit
gapped_reads_aa['V-GENE'] =  gapped_reads_aa['V-GENE and allele'].apply(lambda x: x.split(',')[0])

gapped_reads_aa.groupby(by='V-GENE').apply(len).sort_values(ascending=False).iloc[:5]

V-GENE
Musmus IGHV5-12-1*01 F    388
Musmus IGHV5-6-4*01 F     362
Musmus IGHV5-6-2*01 F     308
Musmus IGHV5-9-4*01 F     277
Musmus IGHV5-6*03 [F]     230
dtype: int64

Variables above loaded

Analyze the mutation rate of sequences derived from IGHV5-12-1

gene = 'IGHV5-12-1'

# Slice IGHV-12-1 sequences from the seqtable
gene_subset = bcr_st_aa.loc[gapped_reads_aa[gapped_reads_aa['V-GENE'].str.contains(gene)].index]
germ_gene = germlines_AA.loc[germlines_AA['header'].str.contains(gene.replace(' ', '_')), 'seqs'].iloc[0]

Compare all reads which map to IGHV5-12-1 to germline. Calculate which residues are not equal to the germline gene IGHV5-12-1

Note: We will ignore residues that are equal to '.' or X. they will not be treated as mismatches

compared_to_germline = gene_subset.compare_to_reference(germ_gene, ignore_characters=['.', 'X'], flip=True)
compared_to_germline.iloc[:2]

# total # of mutations ever
compared_to_germline.sum().sum()

10134

# mismatches in position 3:
compared_to_germline[3].iloc[:5]

Sequence ID
M01012:51:000000000-A4H0H:1:1101:14388:1594_1:N:0:    False
M01012:51:000000000-A4H0H:1:1101:12743:1835_1:N:0:    False
M01012:51:000000000-A4H0H:1:1101:13829:1951_1:N:0:    False
M01012:51:000000000-A4H0H:1:1101:16107:1956_1:N:0:     True
M01012:51:000000000-A4H0H:1:1101:13196:1975_1:N:0:     True
Name: 3, dtype: bool

iplot(go.Figure(
        data=[go.Scatter(y=(compared_to_germline).sum(axis=1), x=list(range(compared_to_germline.shape[0])), mode='markers')], 
        layout=go.Layout(title='Hamming distance from germline considering all positions', xaxis={'title':'Sequence'}, yaxis={'title':'Hamming distance'})
    ))

Look at the distribution of hamming distances in each framework/CDR region

# select mismatches in specific positions and take the sum of those regions
# loop through each of the bcr regions (FR1, FR2, FR3, CDR1, CDR2,) and slice the mismatches in each position
hamming_by_region = {
    region: compared_to_germline[region_to_pos[region]].sum(axis=1).values / (1.0 * len(region_to_pos[region]))
    for region in region_to_pos if max(region_to_pos[region]) in compared_to_germline.columns }
iplot(go.Figure(
        data=[go.Box(x=hamming_by_region[region], name=region) for region in hamming_by_region],
        layout=go.Layout(title='Normalized hamming distance distributions in each BCR region', xaxis={'title': 'Hamming distance'})
    ),
)

Analyze position mutation frequency

iplot(go.Figure(
        data=[go.Scatter(y=(compared_to_germline).sum(axis=0)/(1.0 * compared_to_germline.shape[0]), mode='markers')], 
        layout=go.Layout(title='Mutations as a function of position', xaxis={'title':'Position'}, yaxis={'title':'Percent mutated'})
    ))

Analysis of mutation types¶

In this last analysis we will look patterns in specific types of mutations

*Note this is normally more useful on DNA sequences when we want to look for biases such as transversions

This example shows how we can easily "stack" all mutations that are not equal to the germline at each position

refarray= pd.DataFrame(np.array([germ_gene], dtype='S').view('S1').view(np.uint8), index=list(range(1, len(germ_gene) + 1)))
compared_to_germline = gene_subset.compare_to_reference(germ_gene, ignore_characters=['.', 'X'], flip=True)
all_mutated_residues = gene_subset.seq_table[compared_to_germline].stack().astype(np.uint8).reset_index().drop( 'Sequence ID', axis=1,).rename(columns={'level_1': 'AA Pos', 0: 'BCR AA'}).set_index('AA Pos')
all_mutated_residues.iloc[:10]

For each of the mutated regions, we can also associated the "germline" residue at that position. Then we can group together mutation types and look at the distribution

# analalyze occurrences of all mutations from germline in the dataset
all_mutated_residues = all_mutated_residues.merge(refarray, left_index=True, right_index=True, how='inner').rename(columns={0: 'Germ AA'})
mutation_profile = all_mutated_residues.groupby(by=['BCR AA', 'Germ AA']).apply(len).sort_values(ascending=False).reset_index()
mutation_profile[['BCR AA', 'Germ AA']] = mutation_profile[['BCR AA', 'Germ AA']].applymap(lambda x: chr(x))
mutation_profile.set_index(['BCR AA', 'Germ AA']).rename(columns={0:'counts'}).iloc[:10]

The above code works ,but it can be slow. Another method of doing the above steps is provided as a wrapper function in seq tables

*Note: As always we could look at the mutation profile in specific positions

gene_subset.mutation_profile(reference_seq=germ_gene, positions=None, ignore_characters=['X', '.']).sort_values(ascending=False).iloc[:10]

ref   mut 
b'S'  b'N'    639.0
b'A'  b'G'    592.0
b'T'  b'A'    548.0
b'K'  b'Q'    410.0
b'Y'  b'R'    387.0
b'F'  b'A'    376.0
b'N'  b'S'    370.0
b'M'  b'L'    360.0
b'S'  b'A'    353.0
b'R'  b'G'    317.0
dtype: float64

	1	2	3	4	5	6	7	8	9	10	...	241	242	243	244	245	246	247	248	249	250
M01012:51:000000000-A4H0H:1:1101:15861:1495 1:N:0:10\| <{"SEQ_ID": "552c2da19eb63635e1c950f1::552c261b9eb6363a487b62c9"}	71	78	67	71	84	71	65	67	71	84	...	67	84	71	67	65	71	67	84	71	65
M01012:51:000000000-A4H0H:1:1101:16832:1547 1:N:0:10\| <{"SEQ_ID": "552c2da19eb63635e1c950f2::552c261b9eb6363a487b62c9"}	71	71	65	71	71	65	71	65	67	71	...	65	65	71	67	67	65	67	84	67	67

	1	2	3	4	5	6	7	8	9	10	...	241	242	243	244	245	246	247	248	249	250
M01012:51:000000000-A4H0H:1:1101:15861:1495 1:N:0:10\| <{"SEQ_ID": "552c2da19eb63635e1c950f1::552c261b9eb6363a487b62c9"}	33	2	29	32	32	33	34	34	34	34	...	37	37	37	37	37	33	37	33	33	15
M01012:51:000000000-A4H0H:1:1101:16832:1547 1:N:0:10\| <{"SEQ_ID": "552c2da19eb63635e1c950f2::552c261b9eb6363a487b62c9"}	32	31	32	33	33	31	32	29	32	32	...	37	37	33	36	37	37	37	24	33	33

	1	2	3	4	5	6	7	8	9	10	...	311	312	313	314	315	316	317	318	319	320
Sequence ID
M01012:51:000000000-A4H0H:1:1101:16832:1547_1:N:0:	46	46	46	46	46	46	46	46	46	46	...	71	67	71	67	78	78	78	78	78	78
M01012:51:000000000-A4H0H:1:1101:14011:1558_1:N:0:	71	67	67	71	84	71	65	67	71	84	...	78	78	78	78	78	78	78	78	78	78
M01012:51:000000000-A4H0H:1:1101:16697:1561_1:N:0:	46	46	46	46	46	46	46	46	46	46	...	71	67	71	67	78	78	78	78	78	78
M01012:51:000000000-A4H0H:1:1101:14388:1594_1:N:0:	46	46	46	46	46	46	46	46	46	46	...	71	67	71	67	78	78	78	78	78	78
M01012:51:000000000-A4H0H:1:1101:16245:1601_1:N:0:	46	46	46	46	46	46	46	46	46	46	...	71	78	78	78	78	78	78	78	78	78

	1	2	3	4	5	6	7	8	9	10	...	97	98	99	100	101	102	103	104	105	106
Sequence ID
M01012:51:000000000-A4H0H:1:1101:16832:1547_1:N:0:	46	46	46	46	46	46	46	46	46	46	...	69	68	84	71	84	89	89	67	88	88
M01012:51:000000000-A4H0H:1:1101:14011:1558_1:N:0:	65	86	84	76	68	69	83	71	71	46	...	88	88	88	88	88	88	88	88	88	88
M01012:51:000000000-A4H0H:1:1101:16697:1561_1:N:0:	46	46	46	46	46	46	46	46	46	46	...	69	68	84	71	84	89	70	67	88	88
M01012:51:000000000-A4H0H:1:1101:14388:1594_1:N:0:	46	46	46	46	46	46	46	46	46	46	...	69	68	84	65	84	89	89	67	88	88
M01012:51:000000000-A4H0H:1:1101:16245:1601_1:N:0:	46	46	46	46	46	46	46	46	46	46	...	69	68	84	65	84	89	89	88	88	88

	BCR AA
AA Pos
47	71
49	71
55	83
58	75
59	68
64	71
66	71
68	71
69	65
70	65

	1	2	3	4	5	6	7	8	9	10	...	97	98	99	100	101	102	103	104	105	106
Sequence ID
M01012:51:000000000-A4H0H:1:1101:14388:1594_1:N:0:	False	False	False	False	False	False	False	False	False	False	...	False	False	False	False	True	False	False	False	False	False
M01012:51:000000000-A4H0H:1:1101:12743:1835_1:N:0:	False	False	False	False	False	False	False	False	False	False	...	True	False	False	True	True	False	True	False	False	False

		counts
BCR AA	Germ AA
N	S	639
G	A	592
A	T	548
Q	K	410
R	Y	387
A	F	376
S	N	370
L	M	360
A	S	353
G	R	317