Jumping into the source code for seqtables

We can use Pandas to analyze aligned sequences in a table. This can be useful for quickly generating AA or NT distribution by position and accessing specific positions of an aligned sequence

seq_tables.draw_seqlogo_barplots(seq_dist, alphabet=None, label_cutoff=0.09, use_properties=True, additional_text={}, show_consensus=True, scale_by_distance=False, title='', annotation_font_size=14, yaxistitle='', bargap=None, plotwidth=None, plotheight=500, num_y_ticks=3)[source]

Uses plotly to generate a sequence logo as a series of bar graphs. This method of sequence logo is taken from the following source:

Repository: https://github.com/ISA-tools/SequenceLogoVis

Paper: E. Maguire, P. Rocca-Serra, S.-A. Sansone, and M. Chen, Redesigning the sequence logo with glyph-based approaches to aid interpretation, In Proceedings of EuroVis 2014, Short Paper (2014)

Parameters:
  • seq_dist (Dataframe) –

    Rows should be unique letters, Columns should be a specific position within the sequence,

    Values should be the values that represent the frequency OR bits of a specific letter at a specific position

  • alphabet (string) – AA or NT
  • label_cutoff (int, default=0.09) – Defines a cutoff for when to stop adding ‘text’ labels to the plot (i.e. dont add labels for low freq letters)
  • use_properties (bool, default=True) – If True and if the alphabet is AA then it will color AA by their properties
  • additional_text (list of tuples) –

    For each tuple, element 0 is string/label, element 1 is string of letters at each position.

    i.e. additional_text = [(‘text’, ‘AACCA’), (‘MORE’, ‘ATTTT’)]. This is meant to add extra

    layers of information for the sequence. For example may you would like to include the WT sequence at the bottom

  • show_consensus (bool) – If True will show the consensus sequence at the bottom of the graph
  • scale_by_distance (bool) –

    If True, then each x-coordinate of the bar graph will be equal to the column position in the dataframe.

    For example if you are showing a motif in residues 10, 50, and 90 then the xcoordinates of the bars wil be 10, 50, 90 rather than 1,2,3

  • annotation_font_size (int) – size of text font
  • yaxistitle (string) – how to label the y axis
  • bargap (float, default=None) – bargap parameter for plots. If None, then lets plotly handle it
  • plotwidth (float, default=None) – defines the total width of the plot and individual bars
  • num_y_ticks (int) – How many ytick labels should appear in plot
Returns:

fig – plot object for plotting in plotly

Return type:

plotly.graph_objs.Figure

Examples

>>> from seq_tables import draw_seqlogo_barplots
>>> import pandas as pd
>>> from plotly.offline import iplot, init_notebook_mode
>>> init_notebook_mode()
>>> distribution = pd.DataFrame({1: [0.9, 0.1, 0 ,0], 2: [0.5, 0.2, 0.1, 0.2], 3: [0, 0, 0, 1], 4: [0.25, 0.25, 0.25, 0.25]}, index=['A', 'C', 'T', 'G'])
>>> plotdata = draw_seqlogo_barplots(distribution)
>>> iplot(plotdata)
seq_tables.get_quality_dist(qual_df, bins=None, percentiles=[0.1, 0.25, 0.5, 0.75, 0.9], exclude_null_quality=True, sample=None)[source]

Returns the distribution of quality across the given sequence, similar to FASTQC quality seq report.

Parameters:
  • bins (list of ints or tuples, default=None) –

    bins defines how to group together the columns/sequence positions when aggregating the statistics.

    Note

    bins=None

    If bins is none, then by default, bins are set to the same ranges defined by fastqc report

  • percentiles (list of floats, default=[0.1, 0.25, 0.5, 0.75, 0.9]) – value passed into pandas describe() function.
  • exclude_null_quality (boolean, default=True) – do not include quality scores of 0 in the distribution
  • sample (int, default=None) – If defined, then we will only calculate the distribution on a random subsampled population of sequences
Returns:

data – contains the distribution information at every bin (min value, max value, desired precentages and quartiles) graphs (plotly object): contains plotly graph objects for generating plots of the data afterwards

Return type:

DataFrame

Examples

Show the median of the quality at the first ten positions in the sequence

>>> table = SeqTable(['AAAAAAAAAA', 'AAAAAAAAAC', 'CCCCCCCCCC'], qualitydata=['6AA9-C9--6C', '6AA!1C9BA6C', '6AA!!C9!-6C'])
>>> box_data, graphs = table.get_quality_dist(bins=range(10), percentiles=[0.5])

Now repeat the example from above, except group together all values from the first 5 bases and the next 5 bases i.e. All qualities between positions 0-4 will be grouped together before performing median, and all qualities between 5-9 will be grouped together). Also, return the bottom 10 and upper 90 percentiles in the statsitics

>>> box_data, graphs = table.get_quality_dist(bins=[(0,4), (5,9)], percentiles=[0.1, 0.5, 0.9])

We can also plot the results as a series of boxplots using plotly >>> from plotly.offline import init_notebook_mode, iplot, plot, iplot_mpl # assuming ipython.. >>> init_notebook_mode() >>> plotly.iplot(graphs) # using outside of ipython >>> plotly.plot(graphs)

seq_tables.read_fastq(input_file, limit=None, chunk_size=10000, use_header_as_index=True, use_pandas=True)[source]

Load a fastq file as class SeqTable

seq_tables.read_sam(input_file, limit=None, chunk_size=100000, cleave_softclip=False, use_header_as_index=True)[source]

Load a SAM file into class SeqTable

class seq_tables.seqtable(seqdata=None, qualitydata=None, start=1, index=None, seqtype='NT', phred_adjust=33, null_qual='!', **kwargs)[source]

Class for viewing aligned sequences within a list or dataframe. This will take a list of sequences and create views such that we can access each position within specific positions. It will also associate quality scores for each base if provided.

Parameters:
  • seqdata (Series, or list of strings) – List containing a set of sequences aligned to one another
  • qualitydata (Series or list of quality scores, default=None) – If defined, then user is passing in quality data along with the sequences)
  • start (int) – Explicitly define where the aligned sequences start with respect to some refernce frame (i.e. start > 2 means sequences start at position 2 not 1)
  • index (list of values defining the index, default=None) –
  • seqtype (string of 'AA' or 'NT', default='NT') – Defines the format of the data being passed into the dataframe
  • phred_adjust (integer, default=33) – If quality data is passed, then this will be used to adjust the quality score (i.e. Sanger vs older NGS quality scorning)
seq_df

Dataframe – Each row in the dataframe is a sequence. It will always contain a ‘seqs’ column representing the sequences past in. Optionally it will also contain a ‘quals’ column representing quality scores

seq_table

Dataframe – Dataframe representing sequences as characters in a table. Each row in the dataframe is a sequence. Each column represents the position of a base/residue within the sequence. The 4th position of sequence 2 is found as seq_table.ix[1, 4]

qual_table

Dataframe, optional – Dataframe representing the quality score for each character in seq_table

Examples

>>> sq = seq_tables.seqtable(['AAA', 'ACT', 'ACA'])
>>> sq.hamming_distance('AAA')
>>> sq = read_fastq('fastqfile.fq')
adjust_ref_seq(ref, table_columns, ref_start, return_as_np=True)[source]

Aligns a reference sequence such that its position matches positions within the seqtable of interest

Parameters:
  • ref (str) – Represents the reference sequence
  • table_columns (list or series) – Defines the column positions or column names
  • ref_start (int) – Defines where the reference starts relative to the sequence
compare_to_reference(reference_seq, positions=None, ref_start=0, flip=False, set_diff=False, ignore_characters=[], return_num_bases=False)[source]

Calculate which positions within a reference are not equal in all sequences in dataframe

Parameters:
  • reference_seq (string) – A string that you want to align sequences to
  • positions (list, default=None) – specific positions in both the reference_seq and sequences you want to compare
  • ref_start (int, default=0) – where does the reference sequence start with respect to the aligned sequences
  • flip (bool) – If True, then find bases that ARE MISMATCHES(NOT equal) to the reference
  • set_diff (bool) – If True, then we want to analyze positions that ARE NOT listed in positions parameters
  • ignore_characters (char or list of chars) – When performing distance/finding mismatches, always treat these characters as matches
  • return_num_bases (bool) –

    If true, returns a second argument defining the number of relevant bases present in each row

    ..important:: Change in output results

    Setting return_num_bases to true will change how results are returned (two elements rather than one are returned)
Returns:

Dataframe of boolean variables showing whether base is equal to reference at each position

convert_low_bases_to_null(q, replace_with='N', inplace=False)[source]

This will convert all letters whose corresponding quality is below a cutoff to the value replace_with

Parameters:
  • q (int) – quality score cutoff, convert all bases whose quality is < than q
  • inplace (boolean) – If False, returns a copy of the object filtered by quality score
  • replace_with (char) – a character to replace low bases with
get_consensus(positions=None, modecutoff=0.5)[source]

Returns the sequence consensus of the bases at the defined positions

Parameters:
  • positions – Slice which positions in the table should be conidered
  • modecutoff – Only report the consensus base of letters which appear more than the provided modecutoff (in other words, the mode must be greater than this frequency)
get_quality_dist(bins=None, percentiles=[0.1, 0.25, 0.5, 0.75, 0.9], exclude_null_quality=True, sample=None)[source]

Returns the distribution of quality across the given sequence, similar to FASTQC quality seq report.

Parameters:
  • bins (list of ints or tuples, default=None) –

    bins defines how to group together the columns/sequence positions when aggregating the statistics.

    Note

    bins=None

    If bins is none, then by default, bins are set to the same ranges defined by fastqc report

  • percentiles (list of floats, default=[0.1, 0.25, 0.5, 0.75, 0.9]) – value passed into pandas describe() function.
  • exclude_null_quality (boolean, default=True) – do not include quality scores of 0 in the distribution
  • sample (int, default=None) – If defined, then we will only calculate the distribution on a random subsampled population of sequences
Returns:

data – contains the distribution information at every bin (min value, max value, desired precentages and quartiles) graphs (plotly object): contains plotly graph objects for generating plots of the data afterwards

Return type:

DataFrame

Examples

Show the median of the quality at the first ten positions in the sequence

>>> table = SeqTable(['AAAAAAAAAA', 'AAAAAAAAAC', 'CCCCCCCCCC'], qualitydata=['6AA9-C9--6C', '6AA!1C9BA6C', '6AA!!C9!-6C'])
>>> box_data, graphs = table.get_quality_dist(bins=range(10), percentiles=[0.5])

Now repeat the example from above, except group together all values from the first 5 bases and the next 5 bases i.e. All qualities between positions 0-4 will be grouped together before performing median, and all qualities between 5-9 will be grouped together). Also, return the bottom 10 and upper 90 percentiles in the statsitics

>>> box_data, graphs = table.get_quality_dist(bins=[(0,4), (5,9)], percentiles=[0.1, 0.5, 0.9])

We can also plot the results as a series of boxplots using plotly >>> from plotly.offline import init_notebook_mode, iplot, plot, iplot_mpl # assuming ipython.. >>> init_notebook_mode() >>> plotly.iplot(graphs) # using outside of ipython >>> plotly.plot(graphs)

get_seq_dist(positions=None, method='counts', ignore_characters=[])[source]

Returns the distribution of bases or amino acids at each position.

hamming_distance(reference_seq, positions=None, ref_start=0, set_diff=False, ignore_characters=[], normalized=False)[source]

Determine hamming distance of all sequences in dataframe to a reference sequence.

Parameters:
  • reference_seq (string) – A string that you want to align sequences to
  • positions (list, default=None) – specific positions in both the reference_seq and sequences you want to compare
  • ref_start (int, default=0) – where does the reference sequence start with respect to the aligned sequences
  • set_diff (bool) – If True, then we want to analyze positions that ARE NOT listed in positions parameters
  • normalized (bool) – If True, then divides hamming distance by the number of relevant bases
mutation_TS_TV_profile(reference_seq, positions=None, ref_start=0, set_diff=False, ignore_characters=[])[source]

Return the ratio of transition rates (A->G, C->T) to transversion rates (A->T/C) observed between the reference sequence and sequences in table.

Parameters:
  • reference_seq (string) – A string that you want to align sequences to
  • positions (list, default=None) – specific positions in both the reference_seq and sequences you want to compare
  • ref_start (int, default=0) – where does the reference sequence start with respect to the aligned sequences
  • set_diff (bool) – If True, then we want to analyze positions that ARE NOT listed in positions parameters
Returns:

ratio – TS Freq / TV Freq TS (float): TS Freq TV (float): TV Freq

Return type:

float

mutation_profile(reference_seq, positions=None, ref_start=0, set_diff=False, ignore_characters=[], normalized=False)[source]

Return the type of mutation rates observed between the reference sequence and sequences in table.

Parameters:
  • reference_seq (string) – A string that you want to align sequences to
  • positions (list, default=None) – specific positions in both the reference_seq and sequences you want to compare
  • ref_start (int, default=0) – where does the reference sequence start with respect to the aligned sequences
  • set_diff (bool) – If True, then we want to analyze positions that ARE NOT listed in positions parameters
  • normalized (bool) – If True, then frequency of each mutation
Returns:

profile – Returns the counts (or frequency) for each mutation observed (i.e. A->C or A->T)

Return type:

pd.Series

mutation_profile_deprecated(reference_seq, positions=None, ref_start=0, set_diff=False, ignore_characters=[], normalized=False)[source]

Return the type of mutation rates observed between the reference sequence and sequences in table.

Parameters:
  • reference_seq (string) – A string that you want to align sequences to
  • positions (list, default=None) – specific positions in both the reference_seq and sequences you want to compare
  • ref_start (int, default=0) – where does the reference sequence start with respect to the aligned sequences
  • set_diff (bool) – If True, then we want to analyze positions that ARE NOT listed in positions parameters
  • normalized (bool) – If True, then frequency of each mutation
Returns:

profile – Returns the counts (or frequency) for each mutation observed (i.e. A->C or A->T) transversions (float): Returns frequency of transversion mutation transition (float): Returns frequency of transition mutation

Note

Transversion and transitions only apply to situations when the seqtype is a NT

Return type:

pd.Series

Note

This function has been deprecated because we found a better speed-optimized method

qual_to_table(qualphred, phred_adjust=33, return_table=False)[source]

Given a set of quality score strings, updates the return a new dataframe such that each column represents the quality at each position as a number

Parameters:
  • qualphred – (Series or list of quality scores, default=None): If defined, then user is passing in quality data along with the sequences)
  • phred_adjust (integer, default=33) – If quality data is passed, then this will be used to adjust the quality score (i.e. Sanger vs older NGS quality scorning)
  • return_table (boolean, default=False) – If True, then the attribute self.qual_table is returned
Returns:

self.qual_table – each row corresponds to a specific sequence and each column corresponds to

Return type:

Dataframe

quality_filter(q, p, inplace=False, ignore_null_qual=True)[source]

Filter out sequences based on their average qualities at each base/position

Parameters:
  • q (int) – quality score cutoff
  • p (int/float/percent 0-100) – the percent of bases that must have a quality >= the cutoff q
  • inplace (boolean) – If False, returns a copy of the object filtered by quality score
  • ignore_null_qual (boolean) – Ignore bases that are not represented. (i.e. those with quality of 0)
subsample(numseqs)[source]

Return a random sample of sequences as a new object

Parameters:numseqs (int) – How many sequences to sample
Returns:SeqTable Object
table_to_seq(new_name)[source]

Return the sequence list

update_seqdf()[source]

Make seq_df attribute in sync with seq_table and qual_table Sometimes it might be useful to make changes to the seq_table attribute. For example, may you have your own custom code where you change the values of seq_table to be ‘.’ or something random. Well you want to make sure that seq_df updates accordingly because the full length strings are the most useful in the end