The purpose of this tutorial is to show how seqtables can be used for various investigations in BCR sequences. It can be useful for BCR analysis because usually there are speicfic regions with a BCR sequence that are of interest. Therefore, it is useful to be able to slice sequences easily and extract regions of interest. Also you might want to be able to perform hamming distance operations to show how many mutations there are to a germline gene or analyze amino acid distribution after fixing a specific position.
In this example we will work on a small dataset of sample BCR related sequences. This dataset was annotated using IMGT. IMGT was selected because of its 'IMGT GAP' field in which all sequences are aligned relative to a BCR scaffold structure of numbering. In other words, insertions are removed from the sequeqnce and deletions are filled with gaps or '.'. Because all sequences are aligned against a specific scaffold, we can analyze all sequences together in a table using seqtables class.
import sys
sys.path.insert(0, '../')
import seq_tables
from plotly.offline import iplot, init_notebook_mode
import numpy as np
import pandas as pd
from plotly import graph_objs as go
%load_ext autoreload
%autoreload 2
init_notebook_mode()
imgt_file = pd.read_csv('./imgt_results/1_Summary.txt', sep='\t').fillna('').set_index('Sequence ID')
imgt_file.iloc[0]
gapped_reads = pd.read_csv('./imgt_results/2_IMGT-gapped-nt-sequences.txt', sep='\t').fillna('').set_index('Sequence ID')
gapped_reads.iloc[0]
gapped_reads_aa = pd.read_csv('./imgt_results/4_IMGT-gapped-AA-sequences.txt', sep='\t').fillna('').set_index('Sequence ID')
gapped_reads_aa.iloc[0]
raw_file = seq_tables.read_fastq('./imgt_results/Demo.fq')
print('these are the sequences in ascii')
raw_file.seq_table.iloc[:2]
print('these are their qualities')
raw_file.qual_table.iloc[:2]
raw_file.shape()
Lets look at the quality score distributions for these reads
iplot(raw_file.get_quality_dist()[1])
Step 1: Remove low quality reads
IMGT does not include the quality information in the results, but we can use the raw sequences to filter out low quality reads.
Note we could have done this in one step if we had quality reads associated in our gapped file
raw_file.quality_filter(q=30, p=80, inplace=True)
raw_file.seq_table.shape
Only select IMGT results whose sequence header is in the filtered raw_file seqtable
gapped_reads = gapped_reads.loc[raw_file.seq_table.index.map(lambda x: x[:50].replace(' ', '_'))]
gapped_reads_aa = gapped_reads_aa.loc[raw_file.seq_table.index.map(lambda x: x[:50].replace(' ', '_'))]
imgt_file = imgt_file.loc[raw_file.seq_table.index.map(lambda x: x[:50].replace(' ', '_'))]
# uppercase sequences, just for looks
gapped_reads['V-REGION'] = gapped_reads['V-REGION'].str.upper()
# CREATE THE SEQ TABLE, SET ITS INDEX EQUAL TO THE INDEX OF THE ORIGINALL GAPPED READS df
bcr_st = seq_tables.seqtable(gapped_reads['V-REGION'], index=gapped_reads.index)
bcr_st.seq_table.iloc[:5]
Again, the values in the seqtable correspond to: Rows - unique sequences Columns - speicfiic position within a sequence
Each cell element is the ascii representation of the base/residue
Below we can view the bases
bcr_st.iloc[0].view_bases(num_base_show=10)[:10]
Or we could view bases side by side such that every base in a group is at the same position
bcr_st.view_bases(num_base_show=10).T[0, :10]
# LETS also create one for the amino acid usage
bcr_st_aa = seq_tables.seqtable(gapped_reads_aa['V-REGION'], index=gapped_reads.index, seqtype='AA')
bcr_st_aa.seq_table.iloc[:5]
Remove reads that do not have any results
good_reads = imgt_file[imgt_file['Functionality'] != 'No results'].index
bcr_st_aa = bcr_st_aa.loc[good_reads]
bcr_st = bcr_st.loc[good_reads]
One of the benefits of seqtables is that it makes it easy to select specific regions of interest because of pandas indexing. So to show this for BCR analysis, we want to know which AA residues belong to framework and CDR regions
# here we map the length of each of the annotated reqions within a BCR seq reported by IMGT number (framework and CDR regions)
max_len = gapped_reads_aa['V-REGION'].apply(len).max()
template = gapped_reads_aa[gapped_reads_aa['V-REGION'].apply(len) == max_len].iloc[0]
bcr_lengths = [
('FR1', len(template['FR1-IMGT'])),
('CDR1', len(template['CDR1-IMGT'])),
('FR2', len(template['FR2-IMGT'])),
('CDR2', len(template['CDR2-IMGT'])),
('FR3', len(template['FR3-IMGT'])),
]
current_len = sum([x[1] for x in bcr_lengths])
max_vdj_len = gapped_reads_aa['V-D-J-REGION'].apply(len).max()
bcr_lengths.append(('CDR3/FR4', max_vdj_len - current_len))
bcr_lengths
Here we explicitly define which AA residues correspond to each region
pos = 1
region_to_pos ={}
for x in bcr_lengths:
region_to_pos[x[0]] = list(range(pos, pos + x[1]))
pos += x[1]
# region_to_pos
SOME things to note first
# lets add in FR and CDR annotations to each seq logo position
annotation_text = []
for x in bcr_lengths:
annotation_text.extend([x[0]]*x[1])
iplot(bcr_st_aa.seq_logo(ignore_characters=['.', 'X', '*'], additional_text=[('REGION',annotation_text)])[0])
Lets look at the sequence logo of CDR3 sequences
CDR3_list = gapped_reads_aa.loc[gapped_reads_aa['CDR3-IMGT'] != '', 'CDR3-IMGT']
cdr3_table = seq_tables.seqtable(CDR3_list, seqtype='AA')
iplot(cdr3_table.seq_logo()[0])
Next we will look at mutation profiles for sequences derived from a specific VH gene germline
Lets setup some important variables first
# get a list of the germline sequences from IMGT (NT SEQUENCES)
germ_lines = [r for r in open('imgt_results/mouse_germlines2.fa')]
header = [a[1:].strip() for i, a in enumerate(germ_lines) if i%2 == 0]
seqs = [a.strip() for i, a in enumerate(germ_lines) if i%2 == 1]
germlines = pd.DataFrame({'header': header, 'seqs': seqs})
# get a list of the germline sequences IMGT (AA SEQUENCES)
germ_lines_AA = [r for r in open('imgt_results/mouse_germlines_AA.fa')]
header = [a[1:].strip() for i, a in enumerate(germ_lines_AA) if i%2 == 0]
seqs = [a.strip() for i, a in enumerate(germ_lines_AA) if i%2 == 1]
germlines_AA = pd.DataFrame({'header': header, 'seqs': seqs})
germlines_AA.seqs = germlines_AA.seqs.apply(lambda x: x + '.' * (max_len-len(x)))
# just look at the first gene hit
gapped_reads_aa['V-GENE'] = gapped_reads_aa['V-GENE and allele'].apply(lambda x: x.split(',')[0])
gapped_reads_aa.groupby(by='V-GENE').apply(len).sort_values(ascending=False).iloc[:5]
Variables above loaded
Analyze the mutation rate of sequences derived from IGHV5-12-1
gene = 'IGHV5-12-1'
# Slice IGHV-12-1 sequences from the seqtable
gene_subset = bcr_st_aa.loc[gapped_reads_aa[gapped_reads_aa['V-GENE'].str.contains(gene)].index]
germ_gene = germlines_AA.loc[germlines_AA['header'].str.contains(gene.replace(' ', '_')), 'seqs'].iloc[0]
Compare all reads which map to IGHV5-12-1 to germline. Calculate which residues are not equal to the germline gene IGHV5-12-1
compared_to_germline = gene_subset.compare_to_reference(germ_gene, ignore_characters=['.', 'X'], flip=True)
compared_to_germline.iloc[:2]
# total # of mutations ever
compared_to_germline.sum().sum()
# mismatches in position 3:
compared_to_germline[3].iloc[:5]
iplot(go.Figure(
data=[go.Scatter(y=(compared_to_germline).sum(axis=1), x=list(range(compared_to_germline.shape[0])), mode='markers')],
layout=go.Layout(title='Hamming distance from germline considering all positions', xaxis={'title':'Sequence'}, yaxis={'title':'Hamming distance'})
))
Look at the distribution of hamming distances in each framework/CDR region
# select mismatches in specific positions and take the sum of those regions
# loop through each of the bcr regions (FR1, FR2, FR3, CDR1, CDR2,) and slice the mismatches in each position
hamming_by_region = {
region: compared_to_germline[region_to_pos[region]].sum(axis=1).values / (1.0 * len(region_to_pos[region]))
for region in region_to_pos if max(region_to_pos[region]) in compared_to_germline.columns }
iplot(go.Figure(
data=[go.Box(x=hamming_by_region[region], name=region) for region in hamming_by_region],
layout=go.Layout(title='Normalized hamming distance distributions in each BCR region', xaxis={'title': 'Hamming distance'})
),
)
Analyze position mutation frequency
iplot(go.Figure(
data=[go.Scatter(y=(compared_to_germline).sum(axis=0)/(1.0 * compared_to_germline.shape[0]), mode='markers')],
layout=go.Layout(title='Mutations as a function of position', xaxis={'title':'Position'}, yaxis={'title':'Percent mutated'})
))
In this last analysis we will look patterns in specific types of mutations
*Note this is normally more useful on DNA sequences when we want to look for biases such as transversions
This example shows how we can easily "stack" all mutations that are not equal to the germline at each position
refarray= pd.DataFrame(np.array([germ_gene], dtype='S').view('S1').view(np.uint8), index=list(range(1, len(germ_gene) + 1)))
compared_to_germline = gene_subset.compare_to_reference(germ_gene, ignore_characters=['.', 'X'], flip=True)
all_mutated_residues = gene_subset.seq_table[compared_to_germline].stack().astype(np.uint8).reset_index().drop( 'Sequence ID', axis=1,).rename(columns={'level_1': 'AA Pos', 0: 'BCR AA'}).set_index('AA Pos')
all_mutated_residues.iloc[:10]
For each of the mutated regions, we can also associated the "germline" residue at that position. Then we can group together mutation types and look at the distribution
# analalyze occurrences of all mutations from germline in the dataset
all_mutated_residues = all_mutated_residues.merge(refarray, left_index=True, right_index=True, how='inner').rename(columns={0: 'Germ AA'})
mutation_profile = all_mutated_residues.groupby(by=['BCR AA', 'Germ AA']).apply(len).sort_values(ascending=False).reset_index()
mutation_profile[['BCR AA', 'Germ AA']] = mutation_profile[['BCR AA', 'Germ AA']].applymap(lambda x: chr(x))
mutation_profile.set_index(['BCR AA', 'Germ AA']).rename(columns={0:'counts'}).iloc[:10]
The above code works ,but it can be slow. Another method of doing the above steps is provided as a wrapper function in seq tables
*Note: As always we could look at the mutation profile in specific positions
gene_subset.mutation_profile(reference_seq=germ_gene, positions=None, ignore_characters=['X', '.']).sort_values(ascending=False).iloc[:10]