A Million Monkeys Demonstrate the Power of Hadoop

There are many great use cases for Apache Hadoop, the open source framework for scalable, reliable, and distributed computing on commodity hardware built around Hadoop Distributed File System and MapReduce, such as delivering search engine results, sequencing genomes, and indexing entire libraries of text, but the Million Monkeys Project by Jesse Anderson may be the easiest to understand and the most fun.

The project was inspired by the Infinite Monkey Theorem which, in the simplest and most popular terms, states that a million monkeys with a million typewriters will, by randomly hitting the keys, eventually recreate the works of Shakespeare. The idea is that, though at any given instance the chance of a monkey typing a sonnet is essentially zero, with infinite instances it becomes almost certain. Anderson wanted to try this for himself but he didn’t have a million monkeys, a million typewriters, and infinite time and resources, so instead he used his home computer, Amazon’s Elastic Compute Cloud, and Hadoop to achieve the same results.

Anderson first generated a million virtual monkeys on Amazon’s EC2, which were really pseudo random number generators that would provide strings of 9 random characters. Anderson had to find a very efficient and reliable pseudo random number generator because at that scale, creating the strings was one of the most computationally expensive steps in the process, and he eventually settled on  Sean Luke’s Mersenne Twister. Next, he compared the generated string to the entirety of Shakespeare’s work and, if he found the string anywhere, he would mark it in almost real time, creating what he calls “performance art with monkeys and computers.” Comparing a 9 character string with every continuous set of 9 letters in all of William Shakespeare’s 38 works is no small task, and Anderson used a Bloom Filter to reduce CPU usage by 20-30%. A Bloom Filter works by creating a hash of the monkey’s string and comparing it to a file with all of the hashes and offsets of Shakespeare. Since hashes are shorter and simpler than the strings, this goes much faster but, because more than one string can result in a given hash, just because the hashes match doesn’t mean the strings will. If a match is found, the strings are then compared character by character.

The project took 1.5 months, generated 7.5 trillion character groups, and checked them against 5.5 trillion (5,429,503,678,976) possible combinations.  The project was concluded on October 6 when the last work, The Taming Of The Shrew, was completed. Normally, such a massive task would be out of the reach of one man without a team of computer scientists and supercomputers, but because Hadoop was able to break the overwhelming job into little segments running in parallel on servers in Amazon’s cloud, Jesse Anderson managed to do it himself on commodity hardware. Though the Million Monkeys Project was mostly for fun, it shares many similarities to other serious use cases for Hadoop. DNA sequencing, for example, involves matching short reads of a few dozen pairs to a full genome of millions or billions. of pairs Just like with the monkeys, the job gets much more manageable when broken down into smaller segments and, since commodity hardware and open source software spare research budgets, Hadoop has become a dominant tool in the sequencing community.

CTOvision Pro Special Technology Assessments

We produce special technology reviews continuously updated for CTOvision Pro members. Categories we cover include:

  • Analytical Tools - With a special focus on technologies that can make dramatic positive improvements for enterprise analysts.
  • Big Data - We cover the technologies that help organizations deal with massive quantities of data.
  • Cloud Computing - We curate information on the technologies enabling enterprise use of the cloud.
  • Communications - Advances in communications are revolutionizing how data gets moved.
  • GreenIT - A great and virtuous reason to modernize!
  • Infrastructure  - Modernizing Infrastructure can have dramatic benefits on functionality while reducing operating costs.
  • Mobile - This revolution is empowering the workforce in ways few of us ever dreamed of.
  • Security  -  There are real needs for enhancements to security systems.
  • Visualization  - Connecting computers with humans.
  • Hot Technologies - Firms we believe warrant special attention.

 

Recent Research

DoD Public And Private Cloud Mandates: And insights from a deployed communications professional on why it matters

Intel CEO Brian Krzanich and Cloudera CSO Mike Olson on Intel and Cloudera’s Technology Collaboration

Watch For More Product Feature Enhancements for Actifio Following $100M Funding Round

Navy Information Dominance Corps: IT still searching for the right governance model

DISA Provides A milCloud Overview: Looks like progress, but watch for two big risks

Innovators, Integrators and Tech Vendors: Here is what the government hopes they will buy from you in 2015

Navy continues to invest in innovation: Review their S&T efforts here

MSPA Unified Certification Standard For Cloud Service Providers: Is This A Commercial Version of FedRamp?

Watch Ben Fry And His Visualizations: Multiple use-cases come to mind, including national security efforts

Agenda And More Details for 4-5 March NIST Data Science Symposium

Actionable Insights From AFCEA Western Conference and Exposition 2014

US Navy’s Engineering Center Provides Deal Winning Info To Industry

solid