API

distance_between_occurences(seq, k_mer, overlap=True)

Takes a DNA sequence and a \(k\)-mer and calculates the return times for the \(k\)-mer.

Parameters:

seq (ndarray, str, or DNA) – The DNA sequence to analyze.
k_mer (str) – The \(k\)-mer to calculate return times for.
overlap (bool, optional) – Whether the \(k\)-mers should overlap. Defaults to True.
Returns:	The return times.
Return type:	ndarray

Note

The distance between occurrences is defined as the number of nucleotides between the first base of one occurrence of the \(k\)-mer and the first base of its next occurrence.

Examples

>>> distance_between_occurences("ATGATA", "A")
array([2, 1])
>>> distance_between_occurences("ATGATA", "AT")
array([2])
>>> distance_between_occurences("ATGAAATA", "AT")
array([4])
>>> distance_between_occurences("ATAAAATAAATA", "ATA")
array([4, 3])
>>> distance_between_occurences("ATAAAATAAATA", "ATA", overlap=False)
array([8])
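
The definition in the note above can be reproduced in a few lines. What follows is a minimal sketch, not the library's implementation: `return_times_sketch` is a hypothetical name, and the reading of `overlap=False` as "scan consecutive, non-overlapping windows starting at position 0" is an assumption inferred from the examples.

```python
import numpy as np


def return_times_sketch(seq, k_mer, overlap=True):
    """Hypothetical re-implementation of the return-time definition."""
    k = len(k_mer)
    # Assumption: without overlap, windows start at 0, k, 2k, ...
    step = 1 if overlap else k
    # Start index of every window that matches the k-mer.
    starts = np.array(
        [i for i in range(0, len(seq) - k + 1, step) if seq[i:i + k] == k_mer],
        dtype=int,
    )
    # Nucleotides lying between consecutive first bases: difference minus one.
    return np.diff(starts) - 1


print(return_times_sketch("ATGATA", "A"))                          # [2 1]
print(return_times_sketch("ATAAAATAAATA", "ATA"))                  # [4 3]
print(return_times_sketch("ATAAAATAAATA", "ATA", overlap=False))   # [8]
```

A \(k\)-mer that occurs fewer than twice yields an empty array in this sketch, since there is no "next occurrence" to measure against.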

seq_to_array(seq, k=1, overlap=True)

Converts a DNA sequence into a NumPy vector. If \(k>1\), it creates a vector of the \(k\)-mers.

Parameters:

seq (DNA or str) – The sequence to convert.
k (int, optional) – The \(k\) value to use. Defaults to 1.
overlap (bool, optional) – Whether the \(k\)-mers should overlap. Defaults to True.
Returns:	An array representing the sequence.
Return type:	ndarray

Examples

>>> seq_to_array("ATGC")
array(['A', 'T', 'G', 'C'], dtype='<U1')
>>> seq_to_array("ATGC", k=2)
array(['AT', 'TG', 'GC'], dtype='<U2')
>>> seq_to_array("ATGC", k=2, overlap=False)
array(['AT', 'GC'], dtype='<U2')
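
The windowing behind these outputs can be sketched as below. This is an illustrative approximation for plain `str` input only; `kmers_sketch` is a hypothetical name, and how the real function treats a trailing partial window when `overlap=False` is not stated in this reference, so the sketch simply drops it.

```python
import numpy as np


def kmers_sketch(seq, k=1, overlap=True):
    """Hypothetical helper: split a string into k-mers as a NumPy array."""
    # Overlapping windows advance one base at a time; otherwise k bases.
    step = 1 if overlap else k
    # Only full-length windows are kept (assumption for the non-overlap case).
    return np.array([seq[i:i + k] for i in range(0, len(seq) - k + 1, step)])


print(kmers_sketch("ATGC"))                      # ['A' 'T' 'G' 'C']
print(kmers_sketch("ATGC", k=2))                 # ['AT' 'TG' 'GC']
print(kmers_sketch("ATGC", k=2, overlap=False))  # ['AT' 'GC']
```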

krtd(seq, k, overlap=True, reverse_complement=False, return_full_dict=False, metrics=None)

Calculates the \(k\)-mer return time distribution for a sequence.

Parameters:

seq (DNA or str) – The sequence to analyze.
k (int) – The \(k\) value to use.
overlap (bool, optional) – Whether the \(k\)-mers should overlap. Defaults to True.
reverse_complement (bool, optional) – Whether to calculate the distances between \(k\)-mers and their reverse complements instead of the distances between a \(k\)-mer and its next occurrence. Defaults to False.
return_full_dict (bool, optional) – Whether to return a full dictionary containing every \(k\)-mer and its RTD. For large values of \(k\), as the sparsity of the space increases, returning a full dictionary may be very slow. If False, returns a defaultdict, which should behave identically to a full dictionary when accessing elements by key. Defaults to False.
metrics (list, optional) – A list of functions which, if passed, will be applied to each RTD array. Defaults to None.

Warning

Setting return_full_dict=True will take exponentially more time as k increases.

Returns:	A dictionary of the shape `{k_mer: distances}` in which `k_mer` is a str and `distances` is an `ndarray`. If `metrics` is passed, the values of the dictionary will be dictionaries mapping each function's name to its value (see examples below).
Return type:	dict
Raises:	`ValueError` – When the sequence is degenerate.
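
To make the cost of `return_full_dict=True` concrete: there are \(4^{k}\) possible \(k\)-mers over A, C, G, and T, so a full dictionary needs that many entries regardless of sequence length. The sketch below shows one way a sparse result could be expanded. It is an assumption about the general approach, not the library's code, and `expand_to_full_dict_sketch` is a hypothetical name.

```python
from itertools import product

import numpy as np


def expand_to_full_dict_sketch(sparse, k, alphabet="ACGT"):
    """Hypothetical helper: add an empty entry for every unseen k-mer."""
    full = {}
    # product() over a sorted alphabet yields k-mers in alphabetical order.
    for letters in product(alphabet, repeat=k):
        k_mer = "".join(letters)
        # Unseen k-mers get an empty integer array, as in the examples.
        full[k_mer] = sparse.get(k_mer, np.array([], dtype=np.int64))
    return full


# The number of entries grows as 4**k.
for k in (1, 2, 5, 8):
    print(k, len(expand_to_full_dict_sketch({}, k)))
```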

Examples

>>> print(krtd("ATGCACAGTTCAGA", 1))
{'A': array([3, 1, 4, 1]),
 'C': array([1, 4]),
 'G': array([4, 4]),
 'T': array([6, 0])}
>>> print(krtd("ATGCACAGTTCAGA", 1, metrics=[np.mean, np.std]))
{'A': {'mean': 2.25, 'std': 1.299038105676658},
 'C': {'mean': 2.5, 'std': 1.5},
 'G': {'mean': 4.0, 'std': 0.0},
 'T': {'mean': 3.0, 'std': 3.0}}
>>> print(krtd("ATGCACAGTTCAGA", 2, reverse_complement=True))
{'AA': array([], dtype=int64),
 'AC': array([2]),
 'AG': array([], dtype=int64),
 'AT': array([], dtype=int64),
 'CA': array([1, 3, 8]),
 'CT': array([], dtype=int64),
 'GA': array([2]),
 'GC': array([], dtype=int64),
 'GT': array([2]),
 'TC': array([2]),
 'TG': array([1, 3, 8]),
 'TT': array([], dtype=int64)}
>>> print(krtd("ATGATTGGATATTATGAGGA", 1)) # no value for "C" is printed since it's not in the original sequence
{'A': array([2, 4, 1, 2, 2, 2]),
 'G': array([3, 0, 7, 1, 0]),
 'T': array([2, 0, 3, 1, 0, 1])}
>>> print(krtd("ATGATTGGATATTATGAGGA", 1, return_full_dict=True)) # now it is
{'A': array([2, 4, 1, 2, 2, 2]),
 'C': array([], dtype=int64),
 'G': array([3, 0, 7, 1, 0]),
 'T': array([2, 0, 3, 1, 0, 1])}
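
For readers who want to see how such a dictionary could be assembled, here is a minimal sketch that covers only the default case (overlapping \(k\)-mers, no reverse complements, plain `str` input). `krtd_sketch` is a hypothetical name, and keying the metric results by each function's `__name__` is an assumption based on the `'mean'` and `'std'` keys in the examples above.

```python
from collections import defaultdict

import numpy as np


def krtd_sketch(seq, k, metrics=None):
    """Hypothetical sketch of a k-mer return time distribution."""
    # Record the start position of every overlapping k-mer.
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)

    # Return time = bases between consecutive first bases of the same k-mer.
    rtd = {
        k_mer: np.diff(np.array(starts, dtype=int)) - 1
        for k_mer, starts in positions.items()
    }
    if not metrics:
        return rtd

    # Assumption: each metric is stored under the function's __name__.
    return {
        k_mer: {f.__name__: float(f(distances)) for f in metrics}
        for k_mer, distances in rtd.items()
    }


print(krtd_sketch("ATGCACAGTTCAGA", 1)["A"])                    # [3 1 4 1]
print(krtd_sketch("ATGCACAGTTCAGA", 1, [np.mean, np.std])["T"])  # {'mean': 3.0, 'std': 3.0}
```

Note that a \(k\)-mer occurring only once has an empty distance array in this sketch, so metrics such as `np.mean` would return `nan` for it.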

codon_rtd(seq, metrics=None)

An alias for krtd(seq, 3, overlap=False, return_full_dict=True) which calculates the return time distribution for codons.

Parameters:

seq (DNA or str) – The sequence to analyze.
metrics (list) – See `krtd()`.
Returns:	See `krtd()`.
Return type:	dict
Raises:	`ValueError` – When the sequence is not able to be divided into codons.
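
The codon split implied by `k=3` with `overlap=False` might look like the sketch below. It assumes that "not able to be divided into codons" means the sequence length is not a multiple of 3; that interpretation, the error message, and the name `split_codons_sketch` are all hypothetical.

```python
def split_codons_sketch(seq):
    """Hypothetical helper: split a sequence into non-overlapping codons."""
    # Assumption: a valid coding sequence has a length divisible by 3.
    if len(seq) % 3 != 0:
        raise ValueError("sequence length is not a multiple of 3")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


print(split_codons_sketch("ATGAAATGA"))  # ['ATG', 'AAA', 'TGA']
```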

rtd_metric_dict_to_array(rtd_metric_dict)

A convenience function for deterministically turning RTD metric dicts (such as the output of krtd()) into arrays, which is useful for computing distances, etc.

The output array is a vector with \(4^{k}n\) elements, where \(n\) is the number of metrics that were analyzed. To understand the order of the array, first consider an RTD metric dictionary with only one metric. The zero-based index corresponds directly to the alphabetical index of the \(k\)-mer: if \(k=1\), the metric for A is in position 0, C in 1, G in 2, and T in 3. If there is more than one metric, the metrics’ values for a \(k\)-mer are listed in alphabetical order before proceeding to the next \(k\)-mer. See the example for clarification.

Parameters:	rtd_metric_dict (dict) – A dictionary mapping \(k\)-mers to dictionaries of metrics and their float values.

Example

>>> d = krtd("ATGCATGCCGTA", 1, metrics=[np.mean, np.std])
>>> print(d)
{'A': {'mean': 4.5, 'std': 1.5},
 'C': {'mean': 1.5, 'std': 1.5},
 'G': {'mean': 2.5, 'std': 0.5},
 'T': {'mean': 3.5, 'std': 0.5}}
>>> rtd_metric_dict_to_array(d)
array([4.5, 1.5, 1.5, 1.5, 2.5, 0.5, 3.5, 0.5])
>>> d = krtd("ATGCATGCCGTA", 5, metrics=[np.mean, np.std])
>>> rtd_metric_dict_to_array(d).shape # should be (4**5)*2 or 2048
(2048,)
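
The ordering described above can be sketched as follows. This is an illustration under stated assumptions rather than the library's implementation: `metric_dict_to_array_sketch` is a hypothetical name, \(k\) is inferred from the length of the first key, and filling missing \(k\)-mers with NaN is a placeholder choice, since this reference does not say what value the real function uses.

```python
from itertools import product

import numpy as np


def metric_dict_to_array_sketch(rtd_metric_dict, alphabet="ACGT"):
    """Hypothetical sketch of the flattening order for RTD metric dicts."""
    # Infer k from any key (assumes a non-empty dict with equal-length keys).
    k = len(next(iter(rtd_metric_dict)))
    # Metric names in alphabetical order, e.g. ['mean', 'std'].
    names = sorted({n for metrics in rtd_metric_dict.values() for n in metrics})

    values = []
    # Walk all 4**k k-mers alphabetically, then each metric alphabetically.
    for letters in product(alphabet, repeat=k):
        metrics = rtd_metric_dict.get("".join(letters), {})
        for name in names:
            # Assumption: NaN marks a k-mer or metric with no recorded value.
            values.append(metrics.get(name, np.nan))
    return np.array(values, dtype=float)


d = {
    "A": {"mean": 4.5, "std": 1.5},
    "C": {"mean": 1.5, "std": 1.5},
    "G": {"mean": 2.5, "std": 0.5},
    "T": {"mean": 3.5, "std": 0.5},
}
print(metric_dict_to_array_sketch(d))  # [4.5 1.5 1.5 1.5 2.5 0.5 3.5 0.5]
```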
