API

distance_between_occurences(seq, k_mer, overlap=True)

Takes a DNA sequence and a \(k\)-mer and calculates the return times for the \(k\)-mer.

Parameters:
  • seq (ndarray, str, or DNA) – The DNA sequence to analyze.
  • k_mer (str) – The \(k\)-mer to calculate return times for.
  • overlap (bool, optional) – Whether the \(k\)-mers should overlap. Defaults to True.
Returns:

The return times.

Return type:

ndarray

Note

The distance between occurrences is defined as the number of nucleotides between the first base of the \(k\)-mer and the first base of its next occurrence, so two directly adjacent occurrences have a distance of 0.
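
The definition in this note can be illustrated with a minimal pure-Python sketch. The helper name `return_times` is hypothetical and is not part of the library; it returns a plain list instead of an ndarray, and it assumes that non-overlapping windows start at position 0 and advance by the length of the \(k\)-mer.

```python
def return_times(seq, k_mer, overlap=True):
    """Hypothetical helper: gaps between consecutive occurrences of k_mer."""
    k = len(k_mer)
    # Assumption: non-overlapping windows start at index 0 and advance by k.
    step = 1 if overlap else k
    # Start index of every window that matches the k-mer.
    starts = [i for i in range(0, len(seq) - k + 1, step) if seq[i:i + k] == k_mer]
    # Nucleotides strictly between consecutive first bases: difference minus one.
    return [b - a - 1 for a, b in zip(starts, starts[1:])]


print(return_times("ATGATA", "A"))                         # [2, 1]
print(return_times("ATAAAATAAATA", "ATA"))                 # [4, 3]
print(return_times("ATAAAATAAATA", "ATA", overlap=False))  # [8]
```

Under these assumptions the sketch gives the same values as the documented examples for this function.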

Examples

>>> distance_between_occurences("ATGATA", "A")
array([2, 1])
>>> distance_between_occurences("ATGATA", "AT")
array([2])
>>> distance_between_occurences("ATGAAATA", "AT")
array([4])
>>> distance_between_occurences("ATAAAATAAATA", "ATA")
array([4, 3])
>>> distance_between_occurences("ATAAAATAAATA", "ATA", overlap=False)
array([8])

seq_to_array(seq, k=1, overlap=True)

Converts a DNA sequence into a Numpy vector. If \(k>1\), then it creates a vector of the \(k\)-mers.

Parameters:
  • seq (DNA or str) – The sequence to convert.
  • k (int, optional) – The \(k\) value to use. Defaults to 1.
  • overlap (bool, optional) – Whether the \(k\)-mers should overlap. Defaults to True.
Returns:

An array representing the sequence.

Return type:

ndarray
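
The effect of `k` and `overlap` can be sketched without NumPy. The function `kmer_windows` below is a hypothetical stand-in that returns a list of strings; it assumes that, without overlap, any trailing bases that do not fill a whole \(k\)-mer are dropped.

```python
def kmer_windows(seq, k=1, overlap=True):
    """Hypothetical sketch: split seq into k-mers as a list of strings."""
    # Overlapping windows slide by one base; non-overlapping windows slide by k.
    step = 1 if overlap else k
    # Assumption: an incomplete trailing window is silently dropped.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]


print(kmer_windows("ATGC"))                      # ['A', 'T', 'G', 'C']
print(kmer_windows("ATGC", k=2))                 # ['AT', 'TG', 'GC']
print(kmer_windows("ATGC", k=2, overlap=False))  # ['AT', 'GC']
```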

Examples

>>> seq_to_array("ATGC")
array(['A', 'T', 'G', 'C'], dtype='<U1')
>>> seq_to_array("ATGC", k=2)
array(['AT', 'TG', 'GC'], dtype='<U2')
>>> seq_to_array("ATGC", k=2, overlap=False)
array(['AT', 'GC'], dtype='<U2')

krtd(seq, k, overlap=True, reverse_complement=False, return_full_dict=False, metrics=None)

Calculates the \(k\)-mer return time distribution for a sequence.

Parameters:
  • seq (DNA or str) – The sequence to analyze.
  • k (int) – The \(k\) value to use.
  • overlap (bool, optional) – Whether the \(k\)-mers should overlap. Defaults to True.
  • reverse_complement (bool, optional) – Whether to calculate the distances between \(k\)-mers and their reverse complements instead of the distances between a \(k\)-mer and its next occurrence. Defaults to False.
  • return_full_dict (bool, optional) – Whether to return a full dictionary containing every \(k\)-mer and its RTD. For large values of \(k\), as the sparsity of the space increases, returning a full dictionary may be very slow. If False, returns a defaultdict, which should behave identically to a full dictionary when accessing dictionary elements. Defaults to False.
  • metrics (list, optional) – A list of functions which, if passed, will be applied to each RTD array. Defaults to None.
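
For readers unfamiliar with the term used by `reverse_complement`, the reverse complement of a \(k\)-mer is obtained by complementing each base (A↔T, C↔G) and reversing the result. A minimal sketch follows; the helper name is hypothetical, only unambiguous bases are handled, and this is not necessarily how the library computes it.

```python
# Translation table for the four unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(k_mer):
    """Hypothetical helper: complement every base, then reverse the string."""
    return k_mer.translate(_COMPLEMENT)[::-1]


print(reverse_complement("CA"))    # TG
print(reverse_complement("ATGC"))  # GCAT
print(reverse_complement("AT"))    # AT (its own reverse complement)
```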

Warning

Setting return_full_dict=True will take exponentially more time as k increases.
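
The growth behind this warning comes from the number of possible \(k\)-mers, \(4^{k}\). The sketch below only illustrates that count and the difference between a lazily filled defaultdict and a fully enumerated dictionary; `all_kmers` is a hypothetical helper, and plain lists stand in for the library's ndarrays.

```python
from collections import defaultdict
from itertools import product


def all_kmers(k, alphabet="ACGT"):
    """Hypothetical helper: every possible k-mer in alphabetical order."""
    return ["".join(p) for p in product(alphabet, repeat=k)]


# The number of keys in a full dictionary grows as 4**k.
for k in (1, 2, 5, 8):
    print(k, 4 ** k)

# A defaultdict stores only the k-mers that were actually seen,
# yet a lookup of an unseen k-mer still yields an empty value.
sparse = defaultdict(list)
sparse["AT"].append(2)
print(sparse["CC"])  # []

# A full dictionary has to enumerate all 4**k keys up front.
full = {k_mer: sparse.get(k_mer, []) for k_mer in all_kmers(2)}
print(len(full))  # 16
```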

Returns:

A dictionary of the shape {k_mer: distances} in which k_mer is a str and distances is an ndarray. If metrics is passed, the values of the dictionary will be dictionaries mapping each function's name to its value (see examples below).

Return type:

dict

Raises:

ValueError – When the sequence is degenerate.
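
The shape of the return value can be illustrated with a simplified pure-Python sketch. `krtd_sketch` is hypothetical and is not the library's implementation: it ignores `reverse_complement` and `return_full_dict`, uses lists instead of ndarrays, and uses `statistics.mean` and `statistics.pstdev` in place of `np.mean` and `np.std`. Keying the metric results by each function's `__name__` is an assumption inferred from the `'mean'` and `'std'` keys in the documented output.

```python
from collections import defaultdict
from statistics import mean, pstdev


def krtd_sketch(seq, k, overlap=True, metrics=None):
    """Hypothetical, simplified sketch of a k-mer return time distribution."""
    step = 1 if overlap else k
    # Collect the start positions of every k-mer window.
    positions = defaultdict(list)
    for i in range(0, len(seq) - k + 1, step):
        positions[seq[i:i + k]].append(i)
    # Return time: nucleotides strictly between consecutive first bases.
    rtd = {
        k_mer: [b - a - 1 for a, b in zip(pos, pos[1:])]
        for k_mer, pos in positions.items()
    }
    if metrics is None:
        return rtd
    # Assumption: results are keyed by function name; k-mers with an
    # empty distribution are skipped here to avoid errors in the metrics.
    return {
        k_mer: {f.__name__: f(d) for f in metrics}
        for k_mer, d in rtd.items()
        if d
    }


print(krtd_sketch("ATGCACAGTTCAGA", 1)["A"])  # [3, 1, 4, 1]
print(krtd_sketch("ATGCACAGTTCAGA", 1, metrics=[mean, pstdev])["G"])
```

Because the standard-library function is named `pstdev`, the second call produces the keys `'mean'` and `'pstdev'` rather than `'mean'` and `'std'`.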

Examples

>>> print(krtd("ATGCACAGTTCAGA", 1))
{'A': array([3, 1, 4, 1]),
 'C': array([1, 4]),
 'G': array([4, 4]),
 'T': array([6, 0])}
>>> print(krtd("ATGCACAGTTCAGA", 1, metrics=[np.mean, np.std]))
{'A': {'mean': 2.25, 'std': 1.299038105676658},
 'C': {'mean': 2.5, 'std': 1.5},
 'G': {'mean': 4.0, 'std': 0.0},
 'T': {'mean': 3.0, 'std': 3.0}}
>>> print(krtd("ATGCACAGTTCAGA", 2, reverse_complement=True))
{'AA': array([], dtype=int64),
 'AC': array([2]),
 'AG': array([], dtype=int64),
 'AT': array([], dtype=int64),
 'CA': array([1, 3, 8]),
 'CT': array([], dtype=int64),
 'GA': array([2]),
 'GC': array([], dtype=int64),
 'GT': array([2]),
 'TC': array([2]),
 'TG': array([1, 3, 8]),
 'TT': array([], dtype=int64)}
>>> print(krtd("ATGATTGGATATTATGAGGA", 1)) # no value for "C" is printed since it's not in the original sequence
{'A': array([2, 4, 1, 2, 2, 2]),
 'G': array([3, 0, 7, 1, 0]),
 'T': array([2, 0, 3, 1, 0, 1])}
>>> print(krtd("ATGATTGGATATTATGAGGA", 1, return_full_dict=True)) # now it is
{'A': array([2, 4, 1, 2, 2, 2]),
 'C': array([], dtype=int64),
 'G': array([3, 0, 7, 1, 0]),
 'T': array([2, 0, 3, 1, 0, 1])}

codon_rtd(seq, metrics=None)

An alias for krtd(seq, 3, overlap=False, return_full_dict=True) which calculates the return time distribution for codons.

Parameters:
Returns:

See krtd().

Return type:

dict

Raises:

ValueError – When the sequence is not able to be divided into codons.
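
As a rough illustration of the codon constraint, the sketch below splits a sequence into non-overlapping triplets. `split_codons` is a hypothetical helper, and treating "not able to be divided into codons" as "length is not a multiple of 3" is an assumption; the library's actual check may differ.

```python
def split_codons(seq):
    """Hypothetical helper: split seq into non-overlapping codons."""
    # Assumption: a sequence is divisible into codons when its length
    # is a multiple of 3.
    if len(seq) % 3 != 0:
        raise ValueError("sequence length is not a multiple of 3")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


print(split_codons("ATGAAATAA"))  # ['ATG', 'AAA', 'TAA']

try:
    split_codons("ATGA")
except ValueError as err:
    print("ValueError:", err)
```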

rtd_metric_dict_to_array(rtd_metric_dict)

A convenience function for deterministically turning RTD metric dicts (such as the output of krtd()) into arrays, which is useful for computing distances, etc.

The output array is a vector with \(4^{k}n\) elements, where \(n\) is the number of metrics that were analyzed. To understand the order of the array, first consider an RTD metric dictionary with only one metric. The zero-based index corresponds directly to the alphabetical index of the \(k\)-mer: if \(k=1\), the metric for A is in position 0, C in 1, G in 2, and T in 3. If there is more than one metric, the metrics' values for each \(k\)-mer are listed in alphabetical order before proceeding to the next \(k\)-mer. See the example for clarification.

Parameters:
  • rtd_metric_dict (dict) – A dictionary mapping \(k\)-mers to dictionaries of metrics and their float values.
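
The ordering described above can be sketched in pure Python. `metric_dict_to_list` is a hypothetical helper that returns a list rather than an ndarray. It reads "alphabetical order" as sorting by metric name, and the `fill` value used for \(k\)-mers missing from the dictionary is an assumption, since the value the library uses is not stated here.

```python
from itertools import product


def metric_dict_to_list(rtd_metric_dict, fill=0.0):
    """Hypothetical sketch: flatten an RTD metric dict in a fixed order."""
    # Infer k from the length of any key.
    k = len(next(iter(rtd_metric_dict)))
    # Assumption: metrics are ordered alphabetically by name.
    metric_names = sorted(
        {name for metrics in rtd_metric_dict.values() for name in metrics}
    )
    flat = []
    # product() over a sorted alphabet yields k-mers in alphabetical order.
    for letters in product("ACGT", repeat=k):
        values = rtd_metric_dict.get("".join(letters), {})
        # Assumption: missing k-mers or metrics are filled with `fill`.
        flat.extend(values.get(name, fill) for name in metric_names)
    return flat


d = {
    "A": {"mean": 4.5, "std": 1.5},
    "C": {"mean": 1.5, "std": 1.5},
    "G": {"mean": 2.5, "std": 0.5},
    "T": {"mean": 3.5, "std": 0.5},
}
print(metric_dict_to_list(d))  # [4.5, 1.5, 1.5, 1.5, 2.5, 0.5, 3.5, 0.5]
```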

Example

>>> d = krtd("ATGCATGCCGTA", 1, metrics=[np.mean, np.std])
>>> print(d)
{'A': {'mean': 4.5, 'std': 1.5},
 'C': {'mean': 1.5, 'std': 1.5},
 'G': {'mean': 2.5, 'std': 0.5},
 'T': {'mean': 3.5, 'std': 0.5}}
>>> rtd_metric_dict_to_array(d)
array([4.5, 1.5, 1.5, 1.5, 2.5, 0.5, 3.5, 0.5])
>>> d = krtd("ATGCATGCCGTA", 5, metrics=[np.mean, np.std])
>>> rtd_metric_dict_to_array(d).shape # should be (4**5)*2 or 2048
(2048,)