The Future of Hadoop in Bioinformatics

Earlier, I wrote about the use of Hadoop in the exciting, evolving field of bioinformatics. I have since had the pleasure of speaking with Dr. Ron Taylor of Pacific Northwest National Laboratory, the author of “An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics”, about what has changed in the half-year since its publication and what is to come.

As Dr. Taylor expected, Hadoop and its “ecosystem”, including MapReduce, are the dominant open source Big Data solution for next generation DNA sequencing analysis. This is currently the sub-field generating the most data and requiring the most computationally expensive analysis. For example, de novo assembly pieces together tens of millions of short reads (which may be 50 bases long on ABI SOLiD sequencers). To do so, every read needs to be compared to the others, and the work scales in proportion to n log n, where n is the number of reads. Even assuming reads that are 100 base pairs in length and a human genome of 3 billion base pairs, that is 30 million reads, and since the base-10 logarithm of 30 million is about 7.5, analyzing an entire human genome will take roughly 7.5 times longer than if it scaled linearly.

By dividing the task across a Hadoop cluster, the analysis runs faster and, unlike other high performance computing alternatives, it can run on regular commodity servers that are much cheaper than custom supercomputers. This, combined with the savings from using open source software, ease of use due to seamless scaling, and the strength of the Hadoop community, makes Hadoop and related software the parallelization solution of choice in next generation sequencing. Hadoop and Hadoop databases like HBase are also attractive because they can store and analyze any sort of complex data, allowing researchers to combine diverse forms of data such as nucleotide sequences, scholarly articles, and X-ray crystallographs showing molecular structure. Access to multiple forms of data allows for a more complete view of gene activation and disease pathways.

In many areas, however, traditional HPC is still more common and Hadoop has not yet caught on. Dr. Taylor believes that in the next year to 18 months, this will change due to the following trends:
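Before turning to those trends, the scaling estimate above can be checked with a short sketch. It assumes, as the text does, a 3 billion base genome and 100-base reads, and it assumes the 7.5 figure comes from taking the logarithm in base 10; the function names are illustrative only.

```python
import math

GENOME_BASES = 3_000_000_000  # human genome size used in the text
READ_LENGTH = 100             # read length assumed in the text


def read_count(genome_bases: int, read_length: int) -> int:
    """Number of reads needed to cover the genome once, with no overlap."""
    return genome_bases // read_length


def nlogn_penalty(n: int, base: float = 10.0) -> float:
    """Ratio of n*log(n) work to linear work, which is simply log(n)."""
    return math.log(n, base)


n = read_count(GENOME_BASES, READ_LENGTH)
print(n)                               # 30000000
print(round(nlogn_penalty(n), 1))      # 7.5
print(round(nlogn_penalty(n, 2), 1))   # 24.8
```

The base of the logarithm is a convention: with base 2 the same comparison gives a factor of about 24.8, so the 7.5 figure is best read as an order-of-magnitude illustration rather than a measured slowdown.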

The growth of Hadoop, software, and services:

The initial delay in the adoption of Hadoop for Big Data was mostly due to a lack of information and inertia within the community. Those researchers who knew about Hadoop still saw it as untested and were concerned about stability issues. As Hadoop gains exposure and grows more stable through patches and new releases, researchers will become more comfortable using it. New Hadoop-related software will also extend its applicability to other areas of bioinformatics. Dr. Taylor gave the example of Mahout, a Hadoop machine learning library that can be used for classification (the automatic labeling of data) and clustering (forming groups of similar data within a larger set), both of which are useful in bioinformatics. The Hadoop and MapReduce paradigm is also being explored for automated reasoning and rule engines, which have tremendous potential. IBM’s Watson, which competed on Jeopardy!, has already used Hadoop to pre-process large unstructured datasets for automated reasoning.
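To make the clustering idea concrete, here is a minimal sketch of k-means written in a map/reduce style. It is plain Python and does not use Mahout or Hadoop APIs; the function names, the toy points, and the starting centroids are all assumptions made for illustration. The map step assigns each point to its nearest centroid, and the reduce step averages each group into a new centroid.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_phase(points, centroids):
    """Map: emit a (cluster_index, point) pair for every point."""
    return [(nearest(p, centroids), p) for p in points]


def reduce_phase(pairs, centroids):
    """Reduce: average the points in each cluster into a new centroid."""
    groups = defaultdict(list)
    for idx, point in pairs:
        groups[idx].append(point)
    new_centroids = list(centroids)  # an empty cluster keeps its old centroid
    for idx, members in groups.items():
        new_centroids[idx] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return new_centroids


def kmeans(points, centroids, max_iterations=10):
    """Repeat map and reduce until the centroids stop moving."""
    for _ in range(max_iterations):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids


# Toy two-dimensional data, invented for this example.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
print(kmeans(points, [(0.0, 0.0), (10.0, 10.0)]))
# [(1.0, 1.5), (8.5, 8.0)]
```

On a real cluster the map step would run in parallel over blocks of the input, and only the per-cluster sums would need to travel to the reducers, which is what makes this kind of algorithm a natural fit for MapReduce.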

The community around Hadoop is also developing, increasing researcher confidence. Already, helpful users, the wealth of related software, and the growing availability of support make Hadoop the open-source solution of choice, and new related projects are on the way. Services for larger and more complex deployments are also growing, with Cloudera as the leading provider. Dr. Taylor expects that as projects upgrade their clusters and new clusters come online, more and more will be running Cloudera’s Distribution including Apache Hadoop (CDH), which is free to download, open source, and simplified.

The evolution of bioinformatics:

Currently, Hadoop is used mostly in next generation sequencing because that’s where most of the Big Data is generated. As techniques advance, however, other fields are performing complex analytics on ever-expanding data sets, requiring innovative data solutions. New work on subjects like clustering, classification, and microarrays, which represent a tremendous amount of biological information in two-dimensional arrays, is creating a need for parallelized analysis. High-throughput expression data for genes, proteins, and metabolites is also used for topological network analysis, the inference of biological networks not yet mapped out, and can benefit from Big Data analysis. As Hadoop and software in its ecosystem, like Mahout and HBase, develop these capabilities and researchers develop tests, algorithms, and applications, scientists in bioinformatics will turn more and more to Hadoop to solve new problems. Dr. Taylor predicts an explosion of papers in the next year on applying Hadoop to bioinformatics in novel ways, which will both further the spread of Hadoop and advance the field of bioinformatics.
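As a rough illustration of the network inference mentioned above, the sketch below links two genes whenever their expression profiles are strongly correlated. Thresholding Pearson correlation is only one simple approach, chosen here as an assumption and not attributed to Dr. Taylor; the gene names, expression values, and the 0.9 threshold are invented for the example.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a flat profile carries no correlation signal
    return cov / (spread_x * spread_y)


def coexpression_edges(profiles, threshold=0.9):
    """Return gene pairs whose absolute correlation meets the threshold."""
    edges = []
    for gene_a, gene_b in combinations(sorted(profiles), 2):
        if abs(pearson(profiles[gene_a], profiles[gene_b])) >= threshold:
            edges.append((gene_a, gene_b))
    return edges


# Hypothetical expression levels across four samples.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 1.0, 3.0, 2.0],
}
print(coexpression_edges(profiles))  # [('geneA', 'geneB')]
```

The number of gene pairs grows quadratically with the number of genes, and each pair can be scored independently, which is why this kind of analysis is a candidate for splitting across many map tasks.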

New projects that will require Hadoop are also under development, such as the Department of Energy’s knowledgebase. The DoE is working to build a predictive understanding of the behavior of biological systems by using microbial and plant genetic data, high-throughput analysis, modeling, and simulation, with the goal of solving energy and environmental problems. To do so, it is constructing a knowledgebase, a clustered cyberinfrastructure containing data, organizational methods, standards, analysis tools, and interfaces. The DoE knowledgebase will employ Hadoop and cloud computing to provide the bioinformatics community with a freely available computational environment. Use and discovery within this space will both continue to advance bioinformatics and encourage the use of Hadoop.

This is just the beginning

Hadoop is already key to delivering on the promise of bioinformatics, but the very near future holds even more incredible contributions. Because of its ability to store and process complex data of almost any kind, Hadoop provides a platform that makes it easier to integrate and analyze not just nucleotide sequences, but also PubMed articles, X-ray crystallography showing molecular structure, and other highly valuable laboratory data and analyses. Combining diverse scientific data on Hadoop provides a huge opportunity for new approaches to understanding molecular function in gene activation and disease pathways. We will explore those and other potential future capabilities in coming posts on this Big Data topic.




