API
-
distance_between_occurences(seq, k_mer, overlap=True)
Takes a DNA sequence and a \(k\)-mer and calculates the return times for the \(k\)-mer.
Parameters:
Returns: The return times.
Return type:
Note
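The distance between occurrences is defined as the number of nucleotides between the first base of the \(k\)-mer and the first base of its next occurrence. That is, two consecutive occurrences starting at positions \(i\) and \(j > i\) are at distance \(j - i - 1\), so immediately adjacent occurrences of a 1-mer have distance 0.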
Examples
>>> distance_between_occurences("ATGATA", "A")
array([2, 1])
>>> distance_between_occurences("ATGATA", "AT")
array([2])
>>> distance_between_occurences("ATGAAATA", "AT")
array([4])
>>> distance_between_occurences("ATAAAATAAATA", "ATA")
array([4, 3])
>>> distance_between_occurences("ATAAAATAAATA", "ATA", overlap=False)
array([8])
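The definition in the note can be sketched in plain Python. This is an illustration only, not the library's implementation: `return_times_sketch` is a hypothetical name, it returns a list rather than an ndarray, and its treatment of `overlap=False` (only start positions 0, k, 2k, ... are considered) is an assumption inferred from the examples above.

```python
def return_times_sketch(seq, k_mer, overlap=True):
    """Hypothetical sketch of the return-time definition (not the library code)."""
    k = len(k_mer)
    # With overlap, any position may start a k-mer. Without overlap, only
    # positions 0, k, 2k, ... are checked (assumption based on the examples).
    step = 1 if overlap else k
    starts = [
        i for i in range(0, len(seq) - k + 1, step) if seq[i:i + k] == k_mer
    ]
    # Number of nucleotides strictly between consecutive start positions.
    return [j - i - 1 for i, j in zip(starts, starts[1:])]


print(return_times_sketch("ATGATA", "A"))                            # [2, 1]
print(return_times_sketch("ATAAAATAAATA", "ATA"))                    # [4, 3]
print(return_times_sketch("ATAAAATAAATA", "ATA", overlap=False))     # [8]
```

The sketch reproduces the five documented results, which suggests the distance is always measured in nucleotides, even when the \(k\)-mers do not overlap.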
-
seq_to_array(seq, k=1, overlap=True)
Converts a DNA sequence into a NumPy vector. If \(k>1\), then it creates a vector of the \(k\)-mers.
Parameters:
Returns: An array representing the sequence.
Return type:
Examples
>>> seq_to_array("ATGC")
array(['A', 'T', 'G', 'C'], dtype='<U1')
>>> seq_to_array("ATGC", k=2)
array(['AT', 'TG', 'GC'], dtype='<U2')
>>> seq_to_array("ATGC", k=2, overlap=False)
array(['AT', 'GC'], dtype='<U2')
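The splitting behaviour shown in these examples can be approximated with a short list comprehension. The function name `kmers_sketch` is hypothetical, the result is a plain list instead of a NumPy array, and dropping a trailing fragment shorter than \(k\) is an assumption rather than documented behaviour.

```python
def kmers_sketch(seq, k=1, overlap=True):
    """Hypothetical sketch: split a sequence into k-mers as a plain list."""
    # Overlapping k-mers advance one base at a time; non-overlapping
    # k-mers advance k bases at a time.
    step = 1 if overlap else k
    # Assumption: a trailing fragment shorter than k is dropped.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]


print(kmers_sketch("ATGC"))                        # ['A', 'T', 'G', 'C']
print(kmers_sketch("ATGC", k=2))                   # ['AT', 'TG', 'GC']
print(kmers_sketch("ATGC", k=2, overlap=False))    # ['AT', 'GC']
```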
-
krtd(seq, k, overlap=True, reverse_complement=False, return_full_dict=False, metrics=None)
Calculates the \(k\)-mer return time distribution for a sequence.
Parameters:
- seq (DNA or str) – The sequence to analyze.
- k (int) – The \(k\) value to use.
- overlap (bool, optional) – Whether the \(k\)-mers should overlap. Defaults to True.
- reverse_complement (bool, optional) – If True, calculates the distances between \(k\)-mers and their reverse complements; if False, calculates the distances between a \(k\)-mer and its next occurrence. Defaults to False.
- return_full_dict (bool, optional) – Whether to return a full dictionary containing every \(k\)-mer and its RTD. For large values of \(k\), as the sparsity of the space increases, returning a full dictionary may be very slow. If False, returns a defaultdict. Functionally, this should be identical to a full dictionary when accessing dictionary elements. Defaults to False.
- metrics (list, optional) – A list of functions which, if passed, will be applied to each RTD array. Defaults to None.
Warning
Setting return_full_dict=True will take exponentially more time as \(k\) increases.
Returns: A dictionary of the shape {k_mer: distances} in which k_mer is a str and distances is an ndarray. If metrics is passed, the values of the dictionary will be dictionaries mapping each function's name to its value (see examples below).
Return type: dict
Raises: ValueError – When the sequence is degenerate.
Examples
>>> print(krtd("ATGCACAGTTCAGA", 1))
{'A': array([3, 1, 4, 1]), 'C': array([1, 4]), 'G': array([4, 4]), 'T': array([6, 0])}
>>> print(krtd("ATGCACAGTTCAGA", 1, metrics=[np.mean, np.std]))
{'A': {'mean': 2.25, 'std': 1.299038105676658}, 'C': {'mean': 2.5, 'std': 1.5}, 'G': {'mean': 4.0, 'std': 0.0}, 'T': {'mean': 3.0, 'std': 3.0}}
>>> print(krtd("ATGCACAGTTCAGA", 2, reverse_complement=True))
{'AA': array([], dtype=int64), 'AC': array([2]), 'AG': array([], dtype=int64), 'AT': array([], dtype=int64), 'CA': array([1, 3, 8]), 'CT': array([], dtype=int64), 'GA': array([2]), 'GC': array([], dtype=int64), 'GT': array([2]), 'TC': array([2]), 'TG': array([1, 3, 8]), 'TT': array([], dtype=int64)}
>>> print(krtd("ATGATTGGATATTATGAGGA", 1)) # no value for "C" is printed since it's not in the original sequence
{'A': array([2, 4, 1, 2, 2, 2]), 'G': array([3, 0, 7, 1, 0]), 'T': array([2, 0, 3, 1, 0, 1])}
>>> print(krtd("ATGATTGGATATTATGAGGA", 1, return_full_dict=True)) # now it is
{'A': array([2, 4, 1, 2, 2, 2]), 'C': array([], dtype=int64), 'G': array([3, 0, 7, 1, 0]), 'T': array([2, 0, 3, 1, 0, 1])}
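The difference between the default defaultdict result and a full dictionary, and the way metrics replace each distance list, can be sketched with the standard library alone. This is a rough illustration under several assumptions: `krtd_sketch` is a hypothetical name, it handles only overlapping \(k\)-mers without reverse complements, it does not validate degenerate sequences, it uses lists instead of arrays, and naming each metric after the function's `__name__` is a guess based on the `'mean'` and `'std'` keys in the examples.

```python
from collections import defaultdict
from itertools import product
from statistics import mean, pstdev


def krtd_sketch(seq, k, return_full_dict=False, metrics=None):
    """Hypothetical sketch of a k-mer return time distribution."""
    # Record the start position of every overlapping k-mer.
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)

    # A defaultdict only stores k-mers that occur, yet reading a missing
    # k-mer still yields an empty list.
    rtd = defaultdict(list)
    for k_mer, starts in positions.items():
        rtd[k_mer] = [j - i - 1 for i, j in zip(starts, starts[1:])]

    if return_full_dict:
        # Enumerating all 4**k k-mers is what makes the full dict costly.
        every_k_mer = ["".join(p) for p in product("ACGT", repeat=k)]
        rtd = {k_mer: rtd.get(k_mer, []) for k_mer in every_k_mer}

    if metrics is not None:
        # Assumption: each metric is keyed by the function's __name__.
        return {
            k_mer: {f.__name__: f(d) for f in metrics}
            for k_mer, d in rtd.items()
        }
    return rtd


d = krtd_sketch("ATGATTGGATATTATGAGGA", 1)
print(sorted(d))   # ['A', 'G', 'T']
print(d["C"])      # []
print(len(krtd_sketch("ATGATTGGATATTATGAGGA", 3, return_full_dict=True)))  # 64
```

With `statistics.mean` and `statistics.pstdev` as metrics, the keys of the inner dictionaries in this sketch are `'mean'` and `'pstdev'`; these functions raise on empty lists, so they only suit \(k\)-mers that occur at least twice.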
-
codon_rtd(seq, metrics=None)
An alias for krtd(seq, 3, overlap=False, return_full_dict=True), which calculates the return time distribution for codons.
Parameters:
Returns: See krtd().
Return type:
Raises: ValueError – When the sequence cannot be divided into codons.
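A codon-level return time distribution can be sketched as follows. The name `codon_rtd_sketch` is hypothetical, and two points are assumptions rather than documented behaviour: that "cannot be divided into codons" means the length is not a multiple of 3, and that the input contains only A, C, G, and T.

```python
from itertools import product


def codon_rtd_sketch(seq):
    """Hypothetical sketch of a codon return time distribution."""
    # Assumption: the sequence must split evenly into 3-base codons.
    if len(seq) % 3 != 0:
        raise ValueError("sequence cannot be divided into codons")

    # Full dictionary: all 4**3 = 64 codons, present in the sequence or not.
    # Assumption: seq contains only the bases A, C, G, and T.
    rtd = {"".join(c): [] for c in product("ACGT", repeat=3)}

    last_start = {}
    for start in range(0, len(seq), 3):  # non-overlapping codons
        codon = seq[start:start + 3]
        if codon in last_start:
            # Nucleotides strictly between consecutive codon start positions.
            rtd[codon].append(start - last_start[codon] - 1)
        last_start[codon] = start
    return rtd


result = codon_rtd_sketch("ATGAAAATG")
print(len(result))      # 64
print(result["ATG"])    # [5]
```

Because there are only 64 codons, a full dictionary stays small here, unlike for large \(k\).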
-
rtd_metric_dict_to_array(rtd_metric_dict)
A convenience function for deterministically turning RTD metric dicts (such as the output of krtd()) into arrays, which is useful for computing distances, etc.
The output array is a vector with \(4^{k}n\) elements, where \(n\) is the number of metrics that were analyzed. To understand the order of the array, first consider an RTD metric dictionary with only one metric. The zero-based index corresponds directly to the alphabetical index of the \(k\)-mer. If \(k=1\), the metric for A would be in position 0, C in 1, G in 2, and T in 3. If there is more than one metric, the metrics’ values for the \(k\)-mer are listed in alphabetical order before proceeding to the next \(k\)-mer. See the example for a clarification.
Parameters: rtd_metric_dict (dict) – A dictionary mapping \(k\)-mers to dictionaries of metrics and their float values.
Example
>>> d = krtd("ATGCATGCCGTA", 1, metrics=[np.mean, np.std])
>>> print(d)
{'A': {'mean': 4.5, 'std': 1.5}, 'C': {'mean': 1.5, 'std': 1.5}, 'G': {'mean': 2.5, 'std': 0.5}, 'T': {'mean': 3.5, 'std': 0.5}}
>>> rtd_metric_dict_to_array(d)
array([4.5, 1.5, 1.5, 1.5, 2.5, 0.5, 3.5, 0.5])
>>> d = krtd("ATGCATGCCGTA", 5, metrics=[np.mean, np.std])
>>> rtd_metric_dict_to_array(d).shape # should be (4**5)*2 or 2048
(2048,)
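The ordering described above (k-mers alphabetically, then metric names alphabetically within each k-mer) can be sketched as a flattening loop. The name `metric_dict_to_list_sketch` is hypothetical, the result is a plain list, and the `fill` value used for k-mers or metrics missing from the dictionary is an assumption; the source does not say what the library stores in those positions.

```python
from itertools import product


def metric_dict_to_list_sketch(rtd_metric_dict, fill=0.0):
    """Hypothetical sketch: flatten {k_mer: {metric: value}} deterministically."""
    # Infer k from any key; assumes a non-empty dict with equal-length keys.
    k = len(next(iter(rtd_metric_dict)))
    # Metric names in alphabetical order, collected across all k-mers.
    names = sorted({n for m in rtd_metric_dict.values() for n in m})

    flat = []
    # product over a sorted alphabet yields k-mers in alphabetical order.
    for letters in product("ACGT", repeat=k):
        metrics = rtd_metric_dict.get("".join(letters), {})
        # Assumption: absent entries are filled with `fill`.
        flat.extend(metrics.get(name, fill) for name in names)
    return flat


d = {
    "A": {"mean": 4.5, "std": 1.5},
    "C": {"mean": 1.5, "std": 1.5},
    "G": {"mean": 2.5, "std": 0.5},
    "T": {"mean": 3.5, "std": 0.5},
}
print(metric_dict_to_list_sketch(d))
# [4.5, 1.5, 1.5, 1.5, 2.5, 0.5, 3.5, 0.5]
```

Fixing the order this way means two sequences analyzed with the same \(k\) and the same metrics produce vectors whose positions line up, which is what makes element-wise distance computations meaningful.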