KSiga: K-mer Signal analysis tool for virus genome analysis

Preecha Patumcharoenpol, Se-Ran Jun, David Ussery, and Intawat Nookaew

Background: Genomic DNA sequences are the best ‘unique identifiers’ for an organism. Biologists often want to know, “Where does this DNA sequence come from?” However, matching a given DNA sequence to a likely genome can be difficult. For long DNA sequences, finding the best match in a large database can be time-consuming and computationally challenging. An easy and fast alternative is to look at the distribution of smaller pieces (substrings) of DNA of the same length k, known as k-mers. The k-mer based approach has been explored as a basis for sequence analysis applications, including assembly, phylogenetic tree inference, and microbial classification. Although the k-mer based approach is not novel, the choice of k-mer length that gives the best resolution in a given application is rather arbitrary.
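As an illustration of what a "distribution of substrings of length k" looks like, here is a minimal Python sketch that counts overlapping k-mers in a sequence. It is not taken from KSiga; the function names `kmer_counts` and `kmer_frequencies` and the toy sequence are hypothetical and chosen only for this example.

```python
from collections import Counter


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count every overlapping substring of length k in a DNA sequence."""
    seq = sequence.upper()
    # range() is empty when k is longer than the sequence, giving an empty Counter.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_frequencies(sequence: str, k: int) -> dict:
    """Turn raw k-mer counts into relative frequencies that sum to 1."""
    counts = kmer_counts(sequence, k)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


# Toy sequence (hypothetical, for illustration only).
counts = kmer_counts("ATGATGCA", 3)
freqs = kmer_frequencies("ATGATGCA", 3)
print(counts.most_common())
```

Comparing such frequency profiles between genomes, rather than aligning the full sequences, is what makes k-mer methods fast; the open question the abstract raises is which value of k to use.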

Results: KSiga is a computational tool that examines k-mer content to assess the optimal k-mer length for a virus genome dataset of interest using a three-step approach: (1) Cumulative Relative Entropy (CRE), (2) Average number of Common Features (ACF), and (3) Observed Feature Occurrence (OFC). Using the KSiga package, we demonstrate the reliability of these measurements by identifying an optimal k-mer length for a reference set of 6,153 viral genomes. We are able to identify the optimal range of k-mer lengths that can be used to group viral genomes, which we visualize with a dendrogram.
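The abstract does not define the three measures. As a rough illustration of the second one only, the sketch below assumes that "common features" means distinct k-mers shared by a pair of genomes, and averages that number over all pairs. This is one plausible reading of the name, not KSiga's implementation; `kmer_set`, `average_common_features`, and the toy genomes are hypothetical.

```python
from itertools import combinations


def kmer_set(sequence: str, k: int) -> set:
    """Return the set of distinct k-mers found in a sequence."""
    seq = sequence.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def average_common_features(genomes: dict, k: int) -> float:
    """Mean number of distinct k-mers shared by each pair of genomes.

    ASSUMPTION: this pairwise definition is only a guess at what ACF means.
    """
    sets = [kmer_set(seq, k) for seq in genomes.values()]
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 0.0
    return sum(len(a & b) for a, b in pairs) / len(pairs)


# Toy "genomes" (hypothetical, far shorter than any real virus genome).
toy = {"g1": "ATGATGCA", "g2": "ATGCATTT", "g3": "CCCCCCCC"}
for k in range(2, 6):
    print(k, average_common_features(toy, k))
```

Under this reading, very short k-mers are shared by almost every genome and very long ones by almost none, so the useful range of k lies somewhere in between.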

Conclusion: KSiga provides a systematic way to determine the optimal k-mer length for virus genome sequence analysis. Our three-step approach for finding an optimal k-mer length produces clusters in agreement with the International Committee on Taxonomy of Viruses (ICTV) taxonomy and with the Baltimore classification system, suggesting that our approach could potentially be used to improve virus genome classification. KSiga is available at https://github.com/yumyai/ksiga. This work is funded in part by the Arkansas Research Alliance and the Helen Adams & Arkansas Research Alliance Professor & Chair.


Arkansas Center for Genomic Epidemiology & Medicine and The Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, Arkansas 72205

Rapid Detection of Zika Genomes from Clinical Isolates

Skylar A. Connor, Intawat Nookaew, David W. Ussery, Se-Ran Jun

The Zika virus is a growing problem that recently entered the United States. Zika is a human-pathogenic flavivirus that can be transmitted to humans through multiple routes. Most often, the virus is transmitted to a human through the bite of an infected mosquito. The virus can then spread from an infected person to their sexual partners, as well as from an infected pregnant mother to her fetus.

We are developing a method that uses third-generation sequencing machines to rapidly detect Zika in clinical isolates. We will use clinical isolates, provided by the Arkansas Department of Health, from people in Arkansas who have tested positive for Zika. We will also collect clinical samples from people who do not have the Zika virus, to be used as a control group. Using new Nanopore third-generation sequencing technology, we will be able to sequence the full-length Zika genome within a single read from a single sample.

The Oxford Nanopore MinION sequencer is a new form of sequencing technology that can produce long sequences within the first fifteen minutes of a run and can generate more than 5 million reads at a time. Our preliminary results show that we generate sequences averaging fifteen thousand base pairs within the first fifteen minutes. Since the Zika genome is about ten thousand bases long, a single read can contain the full-length viral genome.
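The length argument above can be checked with simple arithmetic: a read can only contain the whole genome if it is at least as long as the genome. The sketch below computes the fraction of reads that meet this necessary (but not sufficient) condition. The function name `fraction_spanning` and the example read lengths are hypothetical; only the ten-thousand-base genome length comes from the text.

```python
def fraction_spanning(read_lengths, genome_length=10_000):
    """Fraction of reads long enough to hold a genome of the given length.

    Length alone does not prove a read is viral; it only shows that a
    full-length genome could fit inside it.
    """
    if not read_lengths:
        return 0.0
    long_enough = sum(1 for n in read_lengths if n >= genome_length)
    return long_enough / len(read_lengths)


# Made-up read lengths in bases (illustration only, not real run data).
example_lengths = [15_000, 8_000, 12_000, 3_000]
print(fraction_spanning(example_lengths))
```

In practice, a read that passes this length filter would still need to be aligned to a Zika reference to confirm that it actually covers the genome end to end.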

Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR

Two-Component Signal Transduction System in 10,000 E. coli and Salmonella Genomes

Duah Alkam, Visanu Wanchai, David Ussery

Background: Bacteria have the astounding ability to withstand, survive, and even thrive in diverse environments. Their response to different environments is mediated in part by Two-Component Signal Transduction Systems (TCST), which are composed of a transmembrane sensor histidine kinase and an intracellular response regulator. The sensor histidine kinase transmits extracellular cues downstream to the response regulator, which effects intracellular changes (usually modifications in gene expression) that enable the cell to adapt to the environment. For example, the E. coli BarA-UvrY two-component system is needed for switching between glycolytic and gluconeogenic pathways. Two-component systems are thus essential to bacterial survival.

Results: We used protein functional domains to identify two-component systems in 5,305 genomes of Escherichia coli and 5,177 genomes of Salmonella enterica. We first identified 29 known histidine kinases and 31 known response regulators in the E. coli K-12 MG1655 reference genome, and 30 histidine kinases and 37 response regulators in the S. enterica subsp. enterica serovar Typhimurium LT2 reference genome. We then used the Pfam domains of these proteins to identify matching and novel proteins in the E. coli and S. enterica genomes. Across the E. coli genomes, we found between 30 and 35 histidine kinases and between 35 and 40 response regulators per genome. Across the S. enterica genomes, we found between 30 and 37 histidine kinases and between 35 and 42 response regulators.
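A domain-based count of this kind can be sketched as follows. The abstract does not say which Pfam domains or rules were used, so the rule below is an assumption: a protein counts as a histidine kinase if it carries both the HisKA and HATPase_c domains, and as a response regulator if it carries the Response_reg domain. The input format, the function names, and the protein identifiers are hypothetical; in a real workflow the domain sets would come from a domain search against Pfam.

```python
from collections import Counter

# ASSUMPTION: these Pfam domain names define the two protein classes here.
HK_DOMAINS = {"HisKA", "HATPase_c"}
RR_DOMAIN = "Response_reg"


def classify(domains: set) -> str:
    """Label a protein from the set of Pfam domain names it carries."""
    has_hk = HK_DOMAINS <= domains   # all kinase domains present
    has_rr = RR_DOMAIN in domains
    if has_hk and has_rr:
        return "hybrid"
    if has_hk:
        return "histidine_kinase"
    if has_rr:
        return "response_regulator"
    return "other"


def count_tcs(genome: dict) -> Counter:
    """Count protein classes in one genome (protein id -> set of domain names)."""
    return Counter(classify(domains) for domains in genome.values())


# Hypothetical genome with made-up protein ids and domain hits.
toy_genome = {
    "prot1": {"HAMP", "HisKA", "HATPase_c"},
    "prot2": {"Response_reg", "GerE"},
    "prot3": {"HisKA", "HATPase_c", "Response_reg"},
    "prot4": {"ABC_tran"},
}
tcs_counts = count_tcs(toy_genome)
print(tcs_counts)
```

Because each protein is reduced to a small set of domain names, classification is a few set operations per protein, which is consistent with the abstract's point that thousands of genomes can be compared quickly once domains are assigned.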

Conclusion: Here we introduce a method to swiftly compare thousands of genomes by using protein functional domains. Using this computational approach, we extracted the total number of two-component systems across roughly 5,000 genomes each of E. coli and S. enterica within a few seconds. We show that, by using protein functional domains, it will be possible to compare the proteins of all bacteria within seconds. This work is funded in part by the Arkansas Research Alliance and the Helen Adams & Arkansas Research Alliance Professor & Chair.

Arkansas Center for Genomic Epidemiology & Medicine and The Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, Arkansas 72205

Integrated omics analyses reveal the details of metabolic adaptation of Clostridium thermocellum to lignocellulose-derived growth inhibitors released during the deconstruction of switchgrass. Biotechnol Biofuels

Poudel S, Giannone RJ, Rodriguez M Jr, Raman B, Martin MZ, Engle NL, Mielenz JR, Nookaew I, Brown SD, Tschaplinski TJ, Ussery D, Hettich RL.
2017 Jan 10;10:14. doi: 10.1186/s13068-016-0697-5. PubMed PMID: 28077967; PubMed Central PMCID: PMC5223564.