Computational Genomics, or Computational Genetics, refers to the use of computational
and statistical analysis for understanding the structure and the function
of genetic material in organisms. The primary focus of research in computational
genomics in the past three decades has been the understanding of genomes and their
functional elements by analyzing biological sequence data.
The high demand for low-cost sequencing has driven the development of high-throughput
sequencing technologies, known as next-generation sequencing (NGS), that parallelize
the sequencing process, producing thousands or millions of sequences concurrently.
Moore's Law is the observation that the number of transistors on integrated
circuits doubles approximately every two years; correspondingly, the cost per transistor
halves. The cost of DNA sequencing is declining much faster than this, which implies
that ever more new DNA data will be produced.
This large-scale sequence data, produced with high-throughput sequencing technologies,
needs to be processed in a time-effective and cost-effective manner.
In this dissertation, we present a high-performance meta-genome gene identification
framework. This framework includes four modules: filter, alignment, error
correction, and gene identification. The following chapters describe the proposed
design and evaluation of this pipeline.
The most computationally expensive kernel in the framework is the alignment
procedure. The filter module is therefore developed to identify unnecessary alignment
operations. Without the filter module, the alignment module requires 1.9 hours to
complete an all-to-all alignment of a test file of 512,000 sequences, with an average
sequence length of 750 base pairs, on ten NVIDIA Kepler K20 GPUs. When combined
with the filter kernel, the total time is 11.3 minutes. Note that
the ideal speedup is nearly 91.4 times when the new alignment kernel is run on ten
GPUs (10 × 9.14). We conclude that accuracy can be achieved at the expense of more
resources while operating frequency can still be maintained.
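
The figures reported above for the filter and alignment modules can be checked with a short calculation. The sketch below is illustrative only and is not part of the framework. It assumes that an all-to-all alignment compares each unordered pair of sequences once, and it reads the factor 9.14 as a per-GPU speedup. Both are interpretations rather than statements from the text, and all names are hypothetical.

```python
# Figures taken from the text; everything else is an illustrative assumption.
NUM_SEQUENCES = 512_000      # sequences in the test file
UNFILTERED_HOURS = 1.9       # alignment without the filter module
FILTERED_MINUTES = 11.3      # filter module + alignment module
NUM_GPUS = 10
PER_GPU_SPEEDUP = 9.14       # assumed meaning of the "9.14" in "10 x 9.14"


def all_to_all_pairs(n: int) -> int:
    """Unordered pairs among n sequences (assumes each pair is aligned once)."""
    return n * (n - 1) // 2


def speedup(baseline_minutes: float, improved_minutes: float) -> float:
    """Ratio of baseline run time to improved run time."""
    return baseline_minutes / improved_minutes


pairs = all_to_all_pairs(NUM_SEQUENCES)
filter_speedup = speedup(UNFILTERED_HOURS * 60, FILTERED_MINUTES)
ideal_speedup = NUM_GPUS * PER_GPU_SPEEDUP

print(f"candidate pairs without filtering: {pairs:,}")
print(f"speedup from adding the filter:    {filter_speedup:.1f}x")
print(f"ideal ten-GPU kernel speedup:      {ideal_speedup:.1f}x")
```

Under these assumptions, the unfiltered workload is on the order of 10^11 candidate pairs, and adding the filter reduces the reported run time by roughly a factor of ten.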