Jumping into the source code for seqtables¶

We can use Pandas to analyze aligned sequences in a table. This can be useful for quickly generating AA or NT distribution by position and accessing specific positions of an aligned sequence

seq_tables.draw_seqlogo_barplots(seq_dist, alphabet=None, label_cutoff=0.09, use_properties=True, additional_text={}, show_consensus=True, scale_by_distance=False, title='', annotation_font_size=14, yaxistitle='', bargap=None, plotwidth=None, plotheight=500, num_y_ticks=3)[source]¶

Uses plotly to generate a sequence logo as a series of bar graphs. This method of sequence logo is taken from the following source:

Repository: https://github.com/ISA-tools/SequenceLogoVis

Paper: E. Maguire, P. Rocca-Serra, S.-A. Sansone, and M. Chen, Redesigning the sequence logo with glyph-based approaches to aid interpretation, In Proceedings of EuroVis 2014, Short Paper (2014)

Parameters:	seq_dist (Dataframe) – Rows should be unique letters, Columns should be a specific position within the sequence, Values should be the values that represent the frequency OR bits of a specific letter at a specific position alphabet (string) – AA or NT label_cutoff (int, default=0.09) – Defines a cutoff for when to stop adding ‘text’ labels to the plot (i.e. dont add labels for low freq letters) use_properties (bool, default=True) – If True and if the alphabet is AA then it will color AA by their properties additional_text (list of tuples) – For each tuple, element 0 is string/label, element 1 is string of letters at each position. i.e. additional_text = [(‘text’, ‘AACCA’), (‘MORE’, ‘ATTTT’)]. This is meant to add extra layers of information for the sequence. For example may you would like to include the WT sequence at the bottom show_consensus (bool) – If True will show the consensus sequence at the bottom of the graph scale_by_distance (bool) – If True, then each x-coordinate of the bar graph will be equal to the column position in the dataframe. For example if you are showing a motif in residues 10, 50, and 90 then the xcoordinates of the bars wil be 10, 50, 90 rather than 1,2,3 annotation_font_size (int) – size of text font yaxistitle (string) – how to label the y axis bargap (float, default=None) – bargap parameter for plots. If None, then lets plotly handle it plotwidth (float, default=None) – defines the total width of the plot and individual bars num_y_ticks (int) – How many ytick labels should appear in plot
Returns:	fig – plot object for plotting in plotly
Return type:	plotly.graph_objs.Figure

Examples

>>> from seq_tables import draw_seqlogo_barplots
>>> import pandas as pd
>>> from plotly.offline import iplot, init_notebook_mode
>>> init_notebook_mode()
>>> distribution = pd.DataFrame({1: [0.9, 0.1, 0 ,0], 2: [0.5, 0.2, 0.1, 0.2], 3: [0, 0, 0, 1], 4: [0.25, 0.25, 0.25, 0.25]}, index=['A', 'C', 'T', 'G'])
>>> plotdata = draw_seqlogo_barplots(distribution)
>>> iplot(plotdata)

seq_tables.get_quality_dist(qual_df, bins=None, percentiles=[0.1, 0.25, 0.5, 0.75, 0.9], exclude_null_quality=True, sample=None)[source]¶

Returns the distribution of quality across the given sequence, similar to FASTQC quality seq report.

Parameters:	bins (list of ints or tuples, default=None) – bins defines how to group together the columns/sequence positions when aggregating the statistics. Note bins=None If bins is none, then by default, bins are set to the same ranges defined by fastqc report percentiles (list of floats, default=[0.1, 0.25, 0.5, 0.75, 0.9]) – value passed into pandas describe() function. exclude_null_quality (boolean, default=True) – do not include quality scores of 0 in the distribution sample (int, default=None) – If defined, then we will only calculate the distribution on a random subsampled population of sequences
Returns:	data – contains the distribution information at every bin (min value, max value, desired precentages and quartiles) graphs (plotly object): contains plotly graph objects for generating plots of the data afterwards
Return type:	DataFrame

Examples

Show the median of the quality at the first ten positions in the sequence

>>> table = SeqTable(['AAAAAAAAAA', 'AAAAAAAAAC', 'CCCCCCCCCC'], qualitydata=['6AA9-C9--6C', '6AA!1C9BA6C', '6AA!!C9!-6C'])
>>> box_data, graphs = table.get_quality_dist(bins=range(10), percentiles=[0.5])

Now repeat the example from above, except group together all values from the first 5 bases and the next 5 bases i.e. All qualities between positions 0-4 will be grouped together before performing median, and all qualities between 5-9 will be grouped together). Also, return the bottom 10 and upper 90 percentiles in the statsitics

>>> box_data, graphs = table.get_quality_dist(bins=[(0,4), (5,9)], percentiles=[0.1, 0.5, 0.9])

We can also plot the results as a series of boxplots using plotly >>> from plotly.offline import init_notebook_mode, iplot, plot, iplot_mpl # assuming ipython.. >>> init_notebook_mode() >>> plotly.iplot(graphs) # using outside of ipython >>> plotly.plot(graphs)

seq_tables.read_fastq(input_file, limit=None, chunk_size=10000, use_header_as_index=True, use_pandas=True)[source]¶: Load a fastq file as class SeqTable

seq_tables.read_sam(input_file, limit=None, chunk_size=100000, cleave_softclip=False, use_header_as_index=True)[source]¶: Load a SAM file into class SeqTable

class seq_tables.seqtable(seqdata=None, qualitydata=None, start=1, index=None, seqtype='NT', phred_adjust=33, null_qual='!', **kwargs)[source]¶

Class for viewing aligned sequences within a list or dataframe. This will take a list of sequences and create views such that we can access each position within specific positions. It will also associate quality scores for each base if provided.

Parameters:

seqdata (Series, or list of strings) – List containing a set of sequences aligned to one another
qualitydata (Series or list of quality scores, default=None) – If defined, then user is passing in quality data along with the sequences)
start (int) – Explicitly define where the aligned sequences start with respect to some refernce frame (i.e. start > 2 means sequences start at position 2 not 1)
index (list of values defining the index, default=None) –
seqtype (string of 'AA' or 'NT', default='NT') – Defines the format of the data being passed into the dataframe
phred_adjust (integer, default=33) – If quality data is passed, then this will be used to adjust the quality score (i.e. Sanger vs older NGS quality scorning)

seq_df¶: Dataframe – Each row in the dataframe is a sequence. It will always contain a ‘seqs’ column representing the sequences past in. Optionally it will also contain a ‘quals’ column representing quality scores

seq_table¶: Dataframe – Dataframe representing sequences as characters in a table. Each row in the dataframe is a sequence. Each column represents the position of a base/residue within the sequence. The 4th position of sequence 2 is found as seq_table.ix[1, 4]

qual_table¶: Dataframe, optional – Dataframe representing the quality score for each character in seq_table

Examples

>>> sq = seq_tables.seqtable(['AAA', 'ACT', 'ACA'])
>>> sq.hamming_distance('AAA')
>>> sq = read_fastq('fastqfile.fq')

adjust_ref_seq(ref, table_columns, ref_start, return_as_np=True)[source]¶

Aligns a reference sequence such that its position matches positions within the seqtable of interest

Parameters:	ref (str) – Represents the reference sequence table_columns (list or series) – Defines the column positions or column names ref_start (int) – Defines where the reference starts relative to the sequence

compare_to_reference(reference_seq, positions=None, ref_start=0, flip=False, set_diff=False, ignore_characters=[], return_num_bases=False)[source]¶

Calculate which positions within a reference are not equal in all sequences in dataframe

Parameters:

reference_seq (string) – A string that you want to align sequences to
positions (list, default=None) – specific positions in both the reference_seq and sequences you want to compare
ref_start (int, default=0) – where does the reference sequence start with respect to the aligned sequences
flip (bool) – If True, then find bases that ARE MISMATCHES(NOT equal) to the reference
set_diff (bool) – If True, then we want to analyze positions that ARE NOT listed in positions parameters
ignore_characters (char or list of chars) – When performing distance/finding mismatches, always treat these characters as matches
return_num_bases (bool) –
If true, returns a second argument defining the number of relevant bases present in each row

..important:: Change in output results

Setting return_num_bases to true will change how results are returned (two elements rather than one are returned)

Returns:

Dataframe of boolean variables showing whether base is equal to reference at each position

convert_low_bases_to_null(q, replace_with='N', inplace=False)[source]¶

This will convert all letters whose corresponding quality is below a cutoff to the value replace_with

Parameters:	q (int) – quality score cutoff, convert all bases whose quality is < than q inplace (boolean) – If False, returns a copy of the object filtered by quality score replace_with (char) – a character to replace low bases with

get_consensus(positions=None, modecutoff=0.5)[source]¶

Returns the sequence consensus of the bases at the defined positions

Parameters:	positions – Slice which positions in the table should be conidered modecutoff – Only report the consensus base of letters which appear more than the provided modecutoff (in other words, the mode must be greater than this frequency)

get_quality_dist(bins=None, percentiles=[0.1, 0.25, 0.5, 0.75, 0.9], exclude_null_quality=True, sample=None)[source]¶

Returns the distribution of quality across the given sequence, similar to FASTQC quality seq report.

Parameters:	bins (list of ints or tuples, default=None) – bins defines how to group together the columns/sequence positions when aggregating the statistics. Note bins=None If bins is none, then by default, bins are set to the same ranges defined by fastqc report percentiles (list of floats, default=[0.1, 0.25, 0.5, 0.75, 0.9]) – value passed into pandas describe() function. exclude_null_quality (boolean, default=True) – do not include quality scores of 0 in the distribution sample (int, default=None) – If defined, then we will only calculate the distribution on a random subsampled population of sequences
Returns:	data – contains the distribution information at every bin (min value, max value, desired precentages and quartiles) graphs (plotly object): contains plotly graph objects for generating plots of the data afterwards
Return type:	DataFrame

Examples

Show the median of the quality at the first ten positions in the sequence

>>> table = SeqTable(['AAAAAAAAAA', 'AAAAAAAAAC', 'CCCCCCCCCC'], qualitydata=['6AA9-C9--6C', '6AA!1C9BA6C', '6AA!!C9!-6C'])
>>> box_data, graphs = table.get_quality_dist(bins=range(10), percentiles=[0.5])

Now repeat the example from above, except group together all values from the first 5 bases and the next 5 bases i.e. All qualities between positions 0-4 will be grouped together before performing median, and all qualities between 5-9 will be grouped together). Also, return the bottom 10 and upper 90 percentiles in the statsitics

>>> box_data, graphs = table.get_quality_dist(bins=[(0,4), (5,9)], percentiles=[0.1, 0.5, 0.9])

We can also plot the results as a series of boxplots using plotly >>> from plotly.offline import init_notebook_mode, iplot, plot, iplot_mpl # assuming ipython.. >>> init_notebook_mode() >>> plotly.iplot(graphs) # using outside of ipython >>> plotly.plot(graphs)

get_seq_dist(positions=None, method='counts', ignore_characters=[])[source]¶: Returns the distribution of bases or amino acids at each position.

hamming_distance(reference_seq, positions=None, ref_start=0, set_diff=False, ignore_characters=[], normalized=False)[source]¶

Determine hamming distance of all sequences in dataframe to a reference sequence.

Parameters:

reference_seq (string) – A string that you want to align sequences to
positions (list, default=None) – specific positions in both the reference_seq and sequences you want to compare
ref_start (int, default=0) – where does the reference sequence start with respect to the aligned sequences
set_diff (bool) – If True, then we want to analyze positions that ARE NOT listed in positions parameters
normalized (bool) – If True, then divides hamming distance by the number of relevant bases

mutation_TS_TV_profile(reference_seq, positions=None, ref_start=0, set_diff=False, ignore_characters=[])[source]¶

Return the ratio of transition rates (A->G, C->T) to transversion rates (A->T/C) observed between the reference sequence and sequences in table.

Parameters:	reference_seq (string) – A string that you want to align sequences to positions (list, default=None) – specific positions in both the reference_seq and sequences you want to compare ref_start (int, default=0) – where does the reference sequence start with respect to the aligned sequences set_diff (bool) – If True, then we want to analyze positions that ARE NOT listed in positions parameters
Returns:	ratio – TS Freq / TV Freq TS (float): TS Freq TV (float): TV Freq
Return type:	float

mutation_profile(reference_seq, positions=None, ref_start=0, set_diff=False, ignore_characters=[], normalized=False)[source]¶

Return the type of mutation rates observed between the reference sequence and sequences in table.

Parameters:	reference_seq (string) – A string that you want to align sequences to positions (list, default=None) – specific positions in both the reference_seq and sequences you want to compare ref_start (int, default=0) – where does the reference sequence start with respect to the aligned sequences set_diff (bool) – If True, then we want to analyze positions that ARE NOT listed in positions parameters normalized (bool) – If True, then frequency of each mutation
Returns:	profile – Returns the counts (or frequency) for each mutation observed (i.e. A->C or A->T)
Return type:	pd.Series

mutation_profile_deprecated(reference_seq, positions=None, ref_start=0, set_diff=False, ignore_characters=[], normalized=False)[source]¶

Return the type of mutation rates observed between the reference sequence and sequences in table.

Parameters:

reference_seq (string) – A string that you want to align sequences to
positions (list, default=None) – specific positions in both the reference_seq and sequences you want to compare
ref_start (int, default=0) – where does the reference sequence start with respect to the aligned sequences
set_diff (bool) – If True, then we want to analyze positions that ARE NOT listed in positions parameters
normalized (bool) – If True, then frequency of each mutation

Returns:

profile – Returns the counts (or frequency) for each mutation observed (i.e. A->C or A->T) transversions (float): Returns frequency of transversion mutation transition (float): Returns frequency of transition mutation

Note

Transversion and transitions only apply to situations when the seqtype is a NT

Return type:

pd.Series

Note

This function has been deprecated because we found a better speed-optimized method

qual_to_table(qualphred, phred_adjust=33, return_table=False)[source]¶

Given a set of quality score strings, updates the return a new dataframe such that each column represents the quality at each position as a number

Parameters:	qualphred – (Series or list of quality scores, default=None): If defined, then user is passing in quality data along with the sequences) phred_adjust (integer, default=33) – If quality data is passed, then this will be used to adjust the quality score (i.e. Sanger vs older NGS quality scorning) return_table (boolean, default=False) – If True, then the attribute self.qual_table is returned
Returns:	self.qual_table – each row corresponds to a specific sequence and each column corresponds to
Return type:	Dataframe

quality_filter(q, p, inplace=False, ignore_null_qual=True)[source]¶

Filter out sequences based on their average qualities at each base/position

Parameters:	q (int) – quality score cutoff p (int/float/percent 0-100) – the percent of bases that must have a quality >= the cutoff q inplace (boolean) – If False, returns a copy of the object filtered by quality score ignore_null_qual (boolean) – Ignore bases that are not represented. (i.e. those with quality of 0)

subsample(numseqs)[source]¶

Return a random sample of sequences as a new object

Parameters:	numseqs (int) – How many sequences to sample
Returns:	SeqTable Object

table_to_seq(new_name)[source]¶: Return the sequence list

update_seqdf()[source]¶: Make seq_df attribute in sync with seq_table and qual_table Sometimes it might be useful to make changes to the seq_table attribute. For example, may you have your own custom code where you change the values of seq_table to be ‘.’ or something random. Well you want to make sure that seq_df updates accordingly because the full length strings are the most useful in the end

Previous topic

This Page

Jumping into the source code for seqtables¶