Jumping into the source code for seqtables¶
We can use Pandas to analyze aligned sequences in a table. This can be useful for quickly generating AA or NT distribution by position and accessing specific positions of an aligned sequence
-
seq_tables.
draw_seqlogo_barplots
(seq_dist, alphabet=None, label_cutoff=0.09, use_properties=True, additional_text={}, show_consensus=True, scale_by_distance=False, title='', annotation_font_size=14, yaxistitle='', bargap=None, plotwidth=None, plotheight=500, num_y_ticks=3)[source]¶ Uses plotly to generate a sequence logo as a series of bar graphs. This method of sequence logo is taken from the following source:
Repository: https://github.com/ISA-tools/SequenceLogoVis
Paper: E. Maguire, P. Rocca-Serra, S.-A. Sansone, and M. Chen, Redesigning the sequence logo with glyph-based approaches to aid interpretation, In Proceedings of EuroVis 2014, Short Paper (2014)
Parameters: - seq_dist (Dataframe) –
Rows should be unique letters, Columns should be a specific position within the sequence,
Values should be the values that represent the frequency OR bits of a specific letter at a specific position
- alphabet (string) – AA or NT
- label_cutoff (int, default=0.09) – Defines a cutoff for when to stop adding ‘text’ labels to the plot (i.e. dont add labels for low freq letters)
- use_properties (bool, default=True) – If True and if the alphabet is AA then it will color AA by their properties
- additional_text (list of tuples) –
For each tuple, element 0 is string/label, element 1 is string of letters at each position.
i.e. additional_text = [(‘text’, ‘AACCA’), (‘MORE’, ‘ATTTT’)]. This is meant to add extra
layers of information for the sequence. For example may you would like to include the WT sequence at the bottom
- show_consensus (bool) – If True will show the consensus sequence at the bottom of the graph
- scale_by_distance (bool) –
If True, then each x-coordinate of the bar graph will be equal to the column position in the dataframe.
For example if you are showing a motif in residues 10, 50, and 90 then the xcoordinates of the bars wil be 10, 50, 90 rather than 1,2,3
- annotation_font_size (int) – size of text font
- yaxistitle (string) – how to label the y axis
- bargap (float, default=None) – bargap parameter for plots. If None, then lets plotly handle it
- plotwidth (float, default=None) – defines the total width of the plot and individual bars
- num_y_ticks (int) – How many ytick labels should appear in plot
Returns: fig – plot object for plotting in plotly
Return type: plotly.graph_objs.Figure
Examples
>>> from seq_tables import draw_seqlogo_barplots >>> import pandas as pd >>> from plotly.offline import iplot, init_notebook_mode >>> init_notebook_mode() >>> distribution = pd.DataFrame({1: [0.9, 0.1, 0 ,0], 2: [0.5, 0.2, 0.1, 0.2], 3: [0, 0, 0, 1], 4: [0.25, 0.25, 0.25, 0.25]}, index=['A', 'C', 'T', 'G']) >>> plotdata = draw_seqlogo_barplots(distribution) >>> iplot(plotdata)
- seq_dist (Dataframe) –
-
seq_tables.
get_quality_dist
(qual_df, bins=None, percentiles=[0.1, 0.25, 0.5, 0.75, 0.9], exclude_null_quality=True, sample=None)[source]¶ Returns the distribution of quality across the given sequence, similar to FASTQC quality seq report.
Parameters: - bins (list of ints or tuples, default=None) –
bins defines how to group together the columns/sequence positions when aggregating the statistics.
Note
bins=None
If bins is none, then by default, bins are set to the same ranges defined by fastqc report
- percentiles (list of floats, default=[0.1, 0.25, 0.5, 0.75, 0.9]) – value passed into pandas describe() function.
- exclude_null_quality (boolean, default=True) – do not include quality scores of 0 in the distribution
- sample (int, default=None) – If defined, then we will only calculate the distribution on a random subsampled population of sequences
Returns: data – contains the distribution information at every bin (min value, max value, desired precentages and quartiles) graphs (plotly object): contains plotly graph objects for generating plots of the data afterwards
Return type: DataFrame
Examples
Show the median of the quality at the first ten positions in the sequence
>>> table = SeqTable(['AAAAAAAAAA', 'AAAAAAAAAC', 'CCCCCCCCCC'], qualitydata=['6AA9-C9--6C', '6AA!1C9BA6C', '6AA!!C9!-6C']) >>> box_data, graphs = table.get_quality_dist(bins=range(10), percentiles=[0.5])
Now repeat the example from above, except group together all values from the first 5 bases and the next 5 bases i.e. All qualities between positions 0-4 will be grouped together before performing median, and all qualities between 5-9 will be grouped together). Also, return the bottom 10 and upper 90 percentiles in the statsitics
>>> box_data, graphs = table.get_quality_dist(bins=[(0,4), (5,9)], percentiles=[0.1, 0.5, 0.9])
We can also plot the results as a series of boxplots using plotly >>> from plotly.offline import init_notebook_mode, iplot, plot, iplot_mpl # assuming ipython.. >>> init_notebook_mode() >>> plotly.iplot(graphs) # using outside of ipython >>> plotly.plot(graphs)
- bins (list of ints or tuples, default=None) –
-
seq_tables.
read_fastq
(input_file, limit=None, chunk_size=10000, use_header_as_index=True, use_pandas=True)[source]¶ Load a fastq file as class SeqTable
-
seq_tables.
read_sam
(input_file, limit=None, chunk_size=100000, cleave_softclip=False, use_header_as_index=True)[source]¶ Load a SAM file into class SeqTable
-
class
seq_tables.
seqtable
(seqdata=None, qualitydata=None, start=1, index=None, seqtype='NT', phred_adjust=33, null_qual='!', **kwargs)[source]¶ Class for viewing aligned sequences within a list or dataframe. This will take a list of sequences and create views such that we can access each position within specific positions. It will also associate quality scores for each base if provided.
Parameters: - seqdata (Series, or list of strings) – List containing a set of sequences aligned to one another
- qualitydata (Series or list of quality scores, default=None) – If defined, then user is passing in quality data along with the sequences)
- start (int) – Explicitly define where the aligned sequences start with respect to some refernce frame (i.e. start > 2 means sequences start at position 2 not 1)
- index (list of values defining the index, default=None) –
- seqtype (string of 'AA' or 'NT', default='NT') – Defines the format of the data being passed into the dataframe
- phred_adjust (integer, default=33) – If quality data is passed, then this will be used to adjust the quality score (i.e. Sanger vs older NGS quality scorning)
-
seq_df
¶ Dataframe – Each row in the dataframe is a sequence. It will always contain a ‘seqs’ column representing the sequences past in. Optionally it will also contain a ‘quals’ column representing quality scores
-
seq_table
¶ Dataframe – Dataframe representing sequences as characters in a table. Each row in the dataframe is a sequence. Each column represents the position of a base/residue within the sequence. The 4th position of sequence 2 is found as seq_table.ix[1, 4]
-
qual_table
¶ Dataframe, optional – Dataframe representing the quality score for each character in seq_table
Examples
>>> sq = seq_tables.seqtable(['AAA', 'ACT', 'ACA']) >>> sq.hamming_distance('AAA') >>> sq = read_fastq('fastqfile.fq')
-
adjust_ref_seq
(ref, table_columns, ref_start, return_as_np=True)[source]¶ Aligns a reference sequence such that its position matches positions within the seqtable of interest
Parameters:
-
compare_to_reference
(reference_seq, positions=None, ref_start=0, flip=False, set_diff=False, ignore_characters=[], return_num_bases=False)[source]¶ Calculate which positions within a reference are not equal in all sequences in dataframe
Parameters: - reference_seq (string) – A string that you want to align sequences to
- positions (list, default=None) – specific positions in both the reference_seq and sequences you want to compare
- ref_start (int, default=0) – where does the reference sequence start with respect to the aligned sequences
- flip (bool) – If True, then find bases that ARE MISMATCHES(NOT equal) to the reference
- set_diff (bool) – If True, then we want to analyze positions that ARE NOT listed in positions parameters
- ignore_characters (char or list of chars) – When performing distance/finding mismatches, always treat these characters as matches
- return_num_bases (bool) –
If true, returns a second argument defining the number of relevant bases present in each row
..important:: Change in output results
Setting return_num_bases to true will change how results are returned (two elements rather than one are returned)
Returns: Dataframe of boolean variables showing whether base is equal to reference at each position
-
convert_low_bases_to_null
(q, replace_with='N', inplace=False)[source]¶ This will convert all letters whose corresponding quality is below a cutoff to the value replace_with
Parameters: - q (int) – quality score cutoff, convert all bases whose quality is < than q
- inplace (boolean) – If False, returns a copy of the object filtered by quality score
- replace_with (char) – a character to replace low bases with
-
get_consensus
(positions=None, modecutoff=0.5)[source]¶ Returns the sequence consensus of the bases at the defined positions
Parameters: - positions – Slice which positions in the table should be conidered
- modecutoff – Only report the consensus base of letters which appear more than the provided modecutoff (in other words, the mode must be greater than this frequency)
-
get_quality_dist
(bins=None, percentiles=[0.1, 0.25, 0.5, 0.75, 0.9], exclude_null_quality=True, sample=None)[source]¶ Returns the distribution of quality across the given sequence, similar to FASTQC quality seq report.
Parameters: - bins (list of ints or tuples, default=None) –
bins defines how to group together the columns/sequence positions when aggregating the statistics.
Note
bins=None
If bins is none, then by default, bins are set to the same ranges defined by fastqc report
- percentiles (list of floats, default=[0.1, 0.25, 0.5, 0.75, 0.9]) – value passed into pandas describe() function.
- exclude_null_quality (boolean, default=True) – do not include quality scores of 0 in the distribution
- sample (int, default=None) – If defined, then we will only calculate the distribution on a random subsampled population of sequences
Returns: data – contains the distribution information at every bin (min value, max value, desired precentages and quartiles) graphs (plotly object): contains plotly graph objects for generating plots of the data afterwards
Return type: DataFrame
Examples
Show the median of the quality at the first ten positions in the sequence
>>> table = SeqTable(['AAAAAAAAAA', 'AAAAAAAAAC', 'CCCCCCCCCC'], qualitydata=['6AA9-C9--6C', '6AA!1C9BA6C', '6AA!!C9!-6C']) >>> box_data, graphs = table.get_quality_dist(bins=range(10), percentiles=[0.5])
Now repeat the example from above, except group together all values from the first 5 bases and the next 5 bases i.e. All qualities between positions 0-4 will be grouped together before performing median, and all qualities between 5-9 will be grouped together). Also, return the bottom 10 and upper 90 percentiles in the statsitics
>>> box_data, graphs = table.get_quality_dist(bins=[(0,4), (5,9)], percentiles=[0.1, 0.5, 0.9])
We can also plot the results as a series of boxplots using plotly >>> from plotly.offline import init_notebook_mode, iplot, plot, iplot_mpl # assuming ipython.. >>> init_notebook_mode() >>> plotly.iplot(graphs) # using outside of ipython >>> plotly.plot(graphs)
- bins (list of ints or tuples, default=None) –
-
get_seq_dist
(positions=None, method='counts', ignore_characters=[])[source]¶ Returns the distribution of bases or amino acids at each position.
-
hamming_distance
(reference_seq, positions=None, ref_start=0, set_diff=False, ignore_characters=[], normalized=False)[source]¶ Determine hamming distance of all sequences in dataframe to a reference sequence.
Parameters: - reference_seq (string) – A string that you want to align sequences to
- positions (list, default=None) – specific positions in both the reference_seq and sequences you want to compare
- ref_start (int, default=0) – where does the reference sequence start with respect to the aligned sequences
- set_diff (bool) – If True, then we want to analyze positions that ARE NOT listed in positions parameters
- normalized (bool) – If True, then divides hamming distance by the number of relevant bases
-
mutation_TS_TV_profile
(reference_seq, positions=None, ref_start=0, set_diff=False, ignore_characters=[])[source]¶ Return the ratio of transition rates (A->G, C->T) to transversion rates (A->T/C) observed between the reference sequence and sequences in table.
Parameters: - reference_seq (string) – A string that you want to align sequences to
- positions (list, default=None) – specific positions in both the reference_seq and sequences you want to compare
- ref_start (int, default=0) – where does the reference sequence start with respect to the aligned sequences
- set_diff (bool) – If True, then we want to analyze positions that ARE NOT listed in positions parameters
Returns: ratio – TS Freq / TV Freq TS (float): TS Freq TV (float): TV Freq
Return type:
-
mutation_profile
(reference_seq, positions=None, ref_start=0, set_diff=False, ignore_characters=[], normalized=False)[source]¶ Return the type of mutation rates observed between the reference sequence and sequences in table.
Parameters: - reference_seq (string) – A string that you want to align sequences to
- positions (list, default=None) – specific positions in both the reference_seq and sequences you want to compare
- ref_start (int, default=0) – where does the reference sequence start with respect to the aligned sequences
- set_diff (bool) – If True, then we want to analyze positions that ARE NOT listed in positions parameters
- normalized (bool) – If True, then frequency of each mutation
Returns: profile – Returns the counts (or frequency) for each mutation observed (i.e. A->C or A->T)
Return type: pd.Series
-
mutation_profile_deprecated
(reference_seq, positions=None, ref_start=0, set_diff=False, ignore_characters=[], normalized=False)[source]¶ Return the type of mutation rates observed between the reference sequence and sequences in table.
Parameters: - reference_seq (string) – A string that you want to align sequences to
- positions (list, default=None) – specific positions in both the reference_seq and sequences you want to compare
- ref_start (int, default=0) – where does the reference sequence start with respect to the aligned sequences
- set_diff (bool) – If True, then we want to analyze positions that ARE NOT listed in positions parameters
- normalized (bool) – If True, then frequency of each mutation
Returns: profile – Returns the counts (or frequency) for each mutation observed (i.e. A->C or A->T) transversions (float): Returns frequency of transversion mutation transition (float): Returns frequency of transition mutation
Note
Transversion and transitions only apply to situations when the seqtype is a NT
Return type: pd.Series
Note
This function has been deprecated because we found a better speed-optimized method
-
qual_to_table
(qualphred, phred_adjust=33, return_table=False)[source]¶ Given a set of quality score strings, updates the return a new dataframe such that each column represents the quality at each position as a number
Parameters: - qualphred – (Series or list of quality scores, default=None): If defined, then user is passing in quality data along with the sequences)
- phred_adjust (integer, default=33) – If quality data is passed, then this will be used to adjust the quality score (i.e. Sanger vs older NGS quality scorning)
- return_table (boolean, default=False) – If True, then the attribute self.qual_table is returned
Returns: self.qual_table – each row corresponds to a specific sequence and each column corresponds to
Return type: Dataframe
-
quality_filter
(q, p, inplace=False, ignore_null_qual=True)[source]¶ Filter out sequences based on their average qualities at each base/position
Parameters: - q (int) – quality score cutoff
- p (int/float/percent 0-100) – the percent of bases that must have a quality >= the cutoff q
- inplace (boolean) – If False, returns a copy of the object filtered by quality score
- ignore_null_qual (boolean) – Ignore bases that are not represented. (i.e. those with quality of 0)
-
subsample
(numseqs)[source]¶ Return a random sample of sequences as a new object
Parameters: numseqs (int) – How many sequences to sample Returns: SeqTable Object
-
update_seqdf
()[source]¶ Make seq_df attribute in sync with seq_table and qual_table Sometimes it might be useful to make changes to the seq_table attribute. For example, may you have your own custom code where you change the values of seq_table to be ‘.’ or something random. Well you want to make sure that seq_df updates accordingly because the full length strings are the most useful in the end