Today, Apache Hadoop is one of the most popular open source software frameworks for Big Data. There are many reasons for this, but key among them is the ability to process data in scalable ways. Hadoop can execute commands over data fast. What once was slow or even impossible can now be done fast.
The best way to get started with Hadoop is to leverage the 100% open source distribution of the Apache Hadoop capability provided by Cloudera (more here). Cloudera provides enterprise support and enterprise ready versions of Apache Hadoop. This approach offers several key benefits for Hadoop users:
- They provide services to help install, configure, optimize, and tune Hadoop for large-scale data processing and analysis.
They provide production support for Hadoop so that users do not have to rely on the OSS community for critical bug fixes or enhancements.
- They provide extensive training and certification programs for Hadoop.
- They assist and certify other vendors that are creating product offerings to work with Hadoop.
They have over 700 hardware and software partners including leading storage, business intelligence, reporting and business application technologies.
- They enhance the Hadoop solution with key features like batch processing, interactive SQL, security and interactive search as well as enterprise-grade continuous availability. This also includes a supplement to Hive (a Hadoop component) called Impala which solves many of the performance problems native to Hive.
Impala is one of the most recent pieces of technology that Cloudera has added to the Hadoop platform. Impala is an open source Massively Parallel Processing query engine for Apache Hadoop. Impala enables users to issue low-latency SQL queries to data stored in HDFS (Hadoop Distributed File System) and Apache HBase without requiring data movement or transformation. By doing this, Impala allows processes to perform faster analytics on data stored in Hadoop and removes the need to migrate data sets into specialized systems, or proprietary formats, to simply perform analytics. Compared to Hadoop’s Hive, Impala is able to take full advantage of hardware resources and typically generate less CPU load. One important thing to remember is that Impala is not a replacement of MapReduce or Hive, it simply works with these technologies to solve speed and efficiency issues.