Preecha Patumcharoenpol, Se-Ran Jun, David Ussery, and Intawat Nookaew
Background: Genomic DNA sequences are the best ‘unique identifiers’ for an organism. Biologists often want to know “where does this DNA sequence come from?” However, matching a given DNA sequence to its likely source genome can be difficult. For long DNA sequences, finding the best match in a large database can be time-consuming and computationally challenging. A simple and fast alternative is to examine the distribution of shorter substrings of DNA of a fixed length k (k-mers). The k-mer based approach has been explored as a basis for sequence analysis applications, including assembly, phylogenetic tree inference, and microbial classification. Although the k-mer based approach is not novel, the choice of k-mer length that gives the best resolution in a given application is often rather arbitrary.
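To make the idea of a k-mer distribution concrete, the following is a minimal Python sketch that counts overlapping substrings of length k in a sequence. It is an illustration only and not taken from the KSiga code base; the function names `kmer_counts` and `kmer_frequencies` and the toy sequence are assumptions made for this example.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every overlapping substring of length k in a DNA sequence.

    Hypothetical helper for illustration; not part of KSiga.
    """
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_frequencies(sequence, k):
    """Convert raw k-mer counts into relative frequencies."""
    counts = kmer_counts(sequence, k)
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


# Toy sequence (made up for the example): 8 bases give 6 overlapping 3-mers.
counts = kmer_counts("ATGATGCA", 3)
frequencies = kmer_frequencies("ATGATGCA", 3)
```

Two genomes can then be compared through their frequency profiles instead of by direct alignment. With larger k the profiles become more specific but also sparser, which is why the choice of k matters.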
Results: KSiga is a computational tool that investigates k-mer content to assess the optimal k-mer length for a virus genome dataset of interest using a three-step approach: (1) Cumulative Relative Entropy (CRE), (2) Average number of Common Features (ACF), and (3) Observed Feature Occurrence (OFC). Using the KSiga package, we demonstrate the reliability of these measurements by identifying an optimal k-mer length for a reference set of 6153 viral genomes. We identify the optimal range of k-mer lengths that can be used to group viral genomes, as visualized in a dendrogram.
Conclusion: KSiga provides a systematic way to determine the optimal k-mer length for virus genome sequence analysis. Our three-step approach to selecting an optimal k-mer length produces clusters that agree with the International Committee on Taxonomy of Viruses (ICTV) taxonomy and with the Baltimore virus classification system, suggesting that our approach could potentially be used to improve virus genome classification. KSiga is available at https://github.com/yumyai/ksiga. This work is funded in part by the Arkansas Research Alliance and the Helen Adams & Arkansas Research Alliance Professor & Chair.
Arkansas Center for Genomic Epidemiology & Medicine and The Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, Arkansas 72205