Genomic Basis for Microcephaly in Brazilian strains of Zika Virus

By | Posters | No Comments

Se-Ran Jun1, Trudy M. Wassenaar2, Visanu Wanchai1, Preecha Patumcharoenpol1, Intawat Nookaew1, David W. Ussery1

Background: The Zika virus (ZIKV), mainly transmitted by mosquitoes, is an emerging human-pathogenic flavivirus whose spread pattern and clinical characteristics resemble those of dengue virus. However, the ZIKV pandemic in South America is a serious threat to pregnant women, causing microcephaly in developing fetuses. The mechanism is unknown, although epidemiological evidence suggests that microcephaly is not associated with the African lineage of ZIKV.

Results: We examined 105 ZIKV complete genomes and complete coding sequences for genomic understanding of Zika virus epidemiology, based on phylogenetic comparative analysis, adaptive evolution analysis, recombination analysis, and protein properties, including protease cleavage sites, Pfam domains, glycosylation sites, signal peptides, trans-membrane protein domains, and phosphorylation sites. Recombination events within or between Asian and Brazil lineages were not observed, nor were changes in protease cleavage, glycosylation sites, signal peptides or trans-membrane domains between African and Brazil strains. Selection pressure was recognized at several polymorphic sites, mainly in the protein NS4B for the Brazil lineage. Importantly, positively selected mutations in NS4B resulted in an increased potential to be phosphorylated in Brazil strains.

Conclusion: ZIKV protein NS4B, together with NS4A, has recently been shown to inhibit Akt-mTOR signaling, a key pathway for brain development, in human fetal neural stem cells. We hypothesize that positive selection of novel phosphorylation sites in the NS4B protein of Brazil strains could interfere with phosphorylation of Akt and mTOR, impairing Akt-mTOR signaling and resulting in an increased risk of neuropathies.

This work is funded in part by the Arkansas Research Alliance and the Helen Adams & Arkansas Research Alliance Professor & Chair. This research is supported by the Arkansas High Performance Computing Center, which is funded through multiple National Science Foundation grants and the Arkansas Economic Development Commission.


1Arkansas Center for Genomic Epidemiology & Medicine and The Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, Arkansas, 72205
2Molecular Microbiology and Genomics Consultants, Zotzenheim, Germany

R-loop Forming Structure Prediction in Viral Genomes

Thidathip Wongsurawat1, Piroon Jenjaroenpun1, Preecha Patumcharoenpol1, David Ussery1, and Intawat Nookaew1

Background: An R-loop is a triple-stranded nucleic acid structure comprising nascent RNA hybridized with its corresponding DNA template strand, while leaving the non-template DNA single-stranded. R-loop formation has been observed in a wide range of organisms, from bacteria to mammals. Possible roles of R-loops in transcription, telomere maintenance, genome instability, and epigenetic regulation, as well as their involvement in disease, have been demonstrated. In viruses, R-loop detection is rare and the functional importance of R-loops is poorly understood. Thus, we aim to investigate the prevalence and distribution of R-loops in viral genomes.

Results: We use 6,153 viral complete genomes collected from NCBI as a reference set. R-loop prediction by QmRLFS-finder (http://rloop.bii.a-star.edu.sg/?pg=qmrlfs-finder) is performed on these genomes. A total of 1,586 out of 6,153 genomes contain at least one R-loop. The number of R-loops and the ratio of R-loop length per kb of the viral genome are presented. We find that herpesviruses are enriched with R-loops, especially human herpesvirus. In addition, the distribution of these R-loops throughout the genome is not uniform.
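The two summary measures mentioned above (number of R-loops and R-loop length per kb of genome) can be illustrated with a minimal sketch. It assumes the predictions are already available as (start, end) coordinates; the function name, the half-open coordinate convention, and the choice to merge overlapping predictions are assumptions made for illustration and are not taken from QmRLFS-finder.

```python
def rloop_stats(genome_length, intervals):
    """Return (number of predictions, R-loop bases per kb of genome).

    `intervals` is a hypothetical list of half-open (start, end) pairs.
    Overlapping predictions are merged so that no base is counted twice.
    """
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    covered = sum(end - start for start, end in merged)
    return len(intervals), 1000.0 * covered / genome_length


# Made-up example: a 10 kb genome with three predictions, two overlapping.
count, per_kb = rloop_stats(10_000, [(100, 300), (250, 500), (2000, 2100)])

# Fraction of the reference set with at least one predicted R-loop,
# using the counts reported in the abstract.
fraction_positive = 1586 / 6153
```

With the counts reported here, roughly a quarter of the reference genomes (about 25.8%) contain at least one predicted R-loop.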

Conclusion: We report here the results of a search for the existence and prevalence of R-loops in viral genomes. The pervasiveness of R-loops and their enrichment at specific genomic locations suggest that these structural entities may represent a novel class of functional elements in herpesviruses. Future analysis will focus on the R-loop-positive genes and regulatory elements of these viruses.

This work is funded in part by the Arkansas Research Alliance and the Helen Adams & Arkansas Research Alliance Professor & Chair.


1Arkansas Center for Genomic Epidemiology & Medicine and The Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, Arkansas, 72205

Genome-Based Phylogeny of Clostridioides difficile

David W. Ussery, Se-Ran Jun, Duah Alkam, Joyce J Johnsrud, Visanu Wanchai, and Trudy Wassenaar

Background: Clostridioides difficile infections are a major problem in hospitals. Some strains of C. difficile can spread as community-acquired infections, whilst other strains are unique to individuals. Genome sequences from clinical samples can be used for epidemiological monitoring of C. difficile infections.

Results: Using Average Amino acid Identity (AAI) of more than 500 Clostridioides difficile genomes, we find several distinct clusters, some of which reflect known nosocomial infections from hospitals. Further, we find additional genomes in GenBank that are likely to be in the C. difficile group, but have different names, and some of the C. difficile genomes are likely to belong to different genera.
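The abstract does not say how AAI was computed. As a rough illustration, AAI between two genomes is commonly taken as the mean percent identity over reciprocal best protein hits. The sketch below assumes the best hits have already been found by a protein search tool; the input layout and function names are hypothetical.

```python
def reciprocal_best_hits(a_to_b, b_to_a):
    """Pair proteins that are each other's best hit.

    Each argument is a hypothetical mapping:
        query protein id -> (best-hit protein id, percent identity)
    Returns a list of (protein in A, protein in B, mean identity).
    """
    pairs = []
    for a, (b, identity_ab) in a_to_b.items():
        back = b_to_a.get(b)
        if back is not None and back[0] == a:
            # Average the two directions so the measure is symmetric.
            pairs.append((a, b, (identity_ab + back[1]) / 2))
    return pairs


def average_amino_acid_identity(a_to_b, b_to_a):
    """Mean identity over reciprocal best hits, or None if there are none."""
    pairs = reciprocal_best_hits(a_to_b, b_to_a)
    if not pairs:
        return None
    return sum(identity for _, _, identity in pairs) / len(pairs)


# Made-up best hits between genome A and genome B.
a_to_b = {"a1": ("b1", 98.0), "a2": ("b2", 90.0), "a3": ("b2", 60.0)}
b_to_a = {"b1": ("a1", 98.0), "b2": ("a2", 92.0)}
aai = average_amino_acid_identity(a_to_b, b_to_a)
```

Here a3 is ignored because b2's best hit is a2, so only two pairs contribute. A matrix of such pairwise values (or 100 minus AAI as a distance) could then be passed to any standard clustering method.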

Conclusion: Our analysis of all the currently available C. difficile genomes provides a framework for placing newly sequenced clinical isolates and for quickly identifying novel strains as well as potential community-outbreak strains. This work is funded in part by the Arkansas Research Alliance and the Helen Adams & Arkansas Research Alliance Professor & Chair.

Arkansas Center for Genomic Epidemiology & Medicine and The Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, Arkansas 72205

Department of Infectious Diseases, University of Arkansas for Medical Sciences, Little Rock, Arkansas 72205

Molecular Microbiology and Genomics Consultants, Tannenstrasse 7, D-55576 Zotzenheim, Germany

dBBQs: dataBase of Bacterial Quality scores

Visanu Wanchai, Preecha Patumcharoenpol, Intawat Nookaew, and David Ussery

Background: It is well known that genome sequencing technologies are becoming significantly cheaper and faster. As a result, the exponential growth of sequencing data in public databases allows us to explore ever-larger collections of genome sequences. However, it is less well known that the majority of genome sequences available in public databases are not complete but are drafts of varying quality. We have calculated quality scores for more than 100,000 bacterial genomes from all major genome repositories and put them in a fast and easy-to-use database.

Results: Prokaryotic genomic data from all sources were collected and combined to make a non-redundant set of bacterial and archaeal genomes. The genome quality score for each was calculated from four different measurements: assembly quality, the number of rRNA genes, the number of tRNA genes, and the occurrence of conserved functional domains. The dataBase of Bacterial Quality scores (dBBQs) was designed to store and retrieve quality scores. It offers a search function built on Elasticsearch, a fast and scalable search and analytics engine for large-scale databases. In addition, the search results are shown in interactive JavaScript charts using dc.js. The analysis of quality scores across major public genome databases finds that most (perhaps 80% or more) of the genomes are of acceptable quality for many uses. However, some genome sequences are of very poor quality, in a few cases even for ‘complete’ genomes.

Conclusion: dBBQs provides genome quality scores for all available prokaryotic genome sequences with a user-friendly web interface. These scores can be used as cut-offs to obtain a high-quality set of genomes for testing bioinformatics tools or improving analyses. Moreover, all data from the four measurements that were combined to make the quality score for each genome are provided and can potentially be used for further analysis. dBBQs will be updated regularly, as the number of genomes in public databases is growing rapidly, and is free to use for non-commercial purposes. This work is funded in part by the Arkansas Research Alliance and the Helen Adams & Arkansas Research Alliance Professor & Chair.


Arkansas Center for Genomic Epidemiology & Medicine and The Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, Arkansas 72205

KSiga: K-mer Signal analysis tool for virus genome analysis

Preecha Patumcharoenpol, Se-Ran Jun, David Ussery, and Intawat Nookaew

Background: Genomic DNA sequences are the best ‘unique identifiers’ for an organism. Biologists often want to know “where does this DNA sequence come from?” However, matching a given DNA sequence with a likely genome can be difficult. For long DNA sequences, finding the best match in a large database can be time-consuming and computationally challenging. An easy and fast method for this involves looking at the distribution of smaller pieces (substrings) of DNA of the same length (“k”). The k-mer based approach has been explored as a basis of sequence analysis applications, including assembly, phylogenetic tree inference, and microbial classification. Although the k-mer based approach is not novel, the choice of k-mer length needed to obtain the best resolution in a given application is rather arbitrary.
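The idea of a k-mer distribution can be illustrated with a minimal sketch that counts every overlapping substring of length k in a sequence. This is a generic illustration and not KSiga's implementation; the function names are made up for the example.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every overlapping substring of length k in the sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_profile(sequence, k):
    """Relative frequency of each k-mer (empty if the sequence is shorter than k)."""
    counts = kmer_counts(sequence, k)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


def shared_kmers(seq_a, seq_b, k):
    """k-mers that occur in both sequences."""
    return set(kmer_counts(seq_a, k)) & set(kmer_counts(seq_b, k))


counts = kmer_counts("ACGTACGT", 3)
common = shared_kmers("ACGTACGT", "TTACGA", 3)
```

The choice of k matters: a short k makes almost every k-mer occur in every genome, while a long k makes almost every k-mer unique to one genome, so neither extreme separates groups of genomes well.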

Results: KSiga is a computational tool that investigates k-mer content to assess an optimal k-mer length for virus genome datasets of interest using a three-step approach: (1) Cumulative Relative Entropy (CRE), (2) Average number of Common Features (ACF), and (3) Observed Feature Occurrence (OFC). Using the KSiga package, we demonstrate the reliability of these measurements by identifying an optimal k-mer length for a reference set of 6,153 viral genomes. We are able to identify the optimal range of k-mer lengths that can be used to group viral genomes, visualized as a dendrogram.

Conclusion: KSiga provides a systematic way to determine the optimal k-mer length for virus genome sequence analysis. Our three-step approach for finding an optimal k-mer length produces clusters in agreement with the International Committee on Taxonomy of Viruses (ICTV) taxonomy and the Baltimore virus classification system, suggesting that our approach could potentially be used to improve virus genome classification. KSiga is available at https://github.com/yumyai/ksiga. This work is funded in part by the Arkansas Research Alliance and the Helen Adams & Arkansas Research Alliance Professor & Chair.


Arkansas Center for Genomic Epidemiology & Medicine and The Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, Arkansas 72205

Rapid Detection of Zika Genomes from Clinical Isolates

Skylar A. Connor, Intawat Nookaew, David W. Ussery, Se-Ran Jun

The Zika virus is a growing problem that recently entered the United States. Zika is a human-pathogenic flavivirus that can be transmitted to humans through multiple avenues. Most often the virus is transmitted from an infected mosquito to a human through a bite. The virus can then be spread from an infected human to their sexual partners, as well as from an infected pregnant mother to her fetus.

We are developing a method that uses third-generation sequencing machines to rapidly detect Zika in clinical isolates. We will use clinical isolates, provided by the Arkansas Department of Health, from people within the state of Arkansas who have tested positive for Zika. We will also collect clinical samples from people who do not have the Zika virus, to be used as a control group. Using new Nanopore third-generation sequencing technology, we will be able to sequence the full-length Zika genome within one read of one sample.

The Oxford Nanopore MinION sequencer is a new form of sequencing technology that can produce long sequences within the first fifteen minutes of a run and can generate more than 5 million reads at a time. Our preliminary results show that we generate sequences with an average length of fifteen thousand base pairs within the first fifteen minutes. Since the Zika genome is approximately ten thousand bases long, a single read can contain the full-length viral genome.

Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR

Two-Component Signal Transduction System in 10,000 E. coli and Salmonella Genomes

Duah Alkam, Visanu Wanchai, David Ussery

Background: Bacteria have the astounding ability to withstand, survive, and even thrive in diverse environments. Response to different environments is mediated in part by Two-Component Signal Transduction Systems (TCST), which are composed of a transmembrane sensor histidine kinase and an intracellular response regulator. The sensor histidine kinase transmits extracellular cues downstream to the response regulator, which effects intracellular changes (usually modifications in gene expression) that enable the cell to adapt to the environment. For example, the E. coli BarA-UvrY two-component system is needed for switching between glycolytic and gluconeogenic pathways. Thus, two-component systems are essential to bacterial survival.

Results: We used protein functional domains to identify two-component systems in 5,305 genomes of Escherichia coli and 5,177 genomes of Salmonella enterica. We first identified 29 known histidine kinases and 31 known response regulators in the E. coli K-12 MG1655 reference genome, and 30 histidine kinases and 37 response regulators in the S. enterica subsp. enterica serovar Typhimurium LT2 reference genome. We then used the Pfam domains of these proteins to identify matching and novel proteins in the E. coli and S. enterica genomes. We found a range of 30 to 35 histidine kinases and 35 to 40 response regulators across the E. coli genomes; in S. enterica we found a range of 30 to 37 histidine kinases and 35 to 42 response regulators.
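As a rough illustration of domain-based identification, the sketch below classifies proteins from their Pfam domain annotations. The abstract does not state which domains were used, so the choice of HisKA plus HATPase_c for histidine kinases and Response_reg for response regulators is an assumption, as are the input layout and function names. Histidine kinases from other HisKA-like families would be missed by this simplified rule.

```python
from collections import Counter

# Assumed marker domains (Pfam family names); not taken from the abstract.
HISTIDINE_KINASE_DOMAINS = {"HisKA", "HATPase_c"}
RESPONSE_REGULATOR_DOMAIN = "Response_reg"


def classify_protein(domains):
    """Classify one protein from the set of Pfam domains annotated on it."""
    domains = set(domains)
    is_kinase = HISTIDINE_KINASE_DOMAINS <= domains
    is_regulator = RESPONSE_REGULATOR_DOMAIN in domains
    if is_kinase and is_regulator:
        return "hybrid"
    if is_kinase:
        return "histidine_kinase"
    if is_regulator:
        return "response_regulator"
    return "other"


def count_two_component_proteins(genome_annotation):
    """Tally protein classes for one genome.

    `genome_annotation` is a hypothetical mapping:
        protein id -> list of Pfam domain names
    """
    return Counter(classify_protein(d) for d in genome_annotation.values())


# Made-up annotation for a tiny example genome.
example_genome = {
    "p1": ["HAMP", "HisKA", "HATPase_c"],
    "p2": ["Response_reg", "Trans_reg_C"],
    "p3": ["ABC_tran"],
    "p4": ["HisKA", "HATPase_c", "Response_reg", "Hpt"],
}
tally = count_two_component_proteins(example_genome)
```

Because classification needs only set lookups on precomputed domain annotations, the same function can be applied to thousands of genomes in a simple loop.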

Conclusion: Here we introduce a method to swiftly compare thousands of genomes by using protein functional domains. Using this computational approach, within a few seconds, we extracted the total number of two-component systems across roughly 5,000 genomes each of E. coli and S. enterica. We show that, by using protein functional domains, it will be possible to compare proteins of all bacteria within seconds. This work is funded in part by the Arkansas Research Alliance and the Helen Adams & Arkansas Research Alliance Professor & Chair.

Arkansas Center for Genomic Epidemiology & Medicine and The Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, Arkansas 72205