Scalable Analytics: Is the End of Hadoop Near?

In 2015, technology analyst firm Gartner removed 'big data' from its Hype Cycle for Emerging Technologies report. The media was quick to jump onto this and report that Big Data is dead, but really, it's just become the new norm.

Incidentally, this was coupled with the lower adoption rates of legacy systems such as Hadoop in favor of competitors. These two factors combined led to the speculation that legacy solutions couldn't scale as new Business Intelligence (BI) requirements emerged.

Four years later, do Spark and Hadoop development services still have the grip on the big data industry that they once did?

How big data has changed

Few industries in the world see as much progress in such a short amount of time as the tech industry. Recent advancements in the tech space have meant a dramatic increase in the amount of data we produce, and how much of it is available for use. The internet has gotten larger and faster, landlines are being replaced with fiber optics, and mobile devices keep getting faster.

As a result, data is not only growing more voluminous, it's also getting more complex. Numerous cheap IoT devices are able to collect complex data, including analog signals like sound, temperature and air pressure. As such data sets continue to grow, traditional data processing finds itself in a position where it can't meet modern requirements.

The need for better transfer, visualization, querying, sharing, searching, storage, capturing, and, most significantly of all, analysis, has presented itself.

A new architecture is needed, along with newer technologies built on it, to gain insights from the data. At the forefront of this revolution are two technologies pitted against each other: Apache Spark and Hadoop.

The Rise and Fall of Hadoop

As far as Big Data technology goes, none has more history behind it than Hadoop. Developed largely by Yahoo engineers and later taken over by the Apache Software Foundation, Hadoop solidified itself early on as the de facto Big Data processing engine.

All the conditions were ripe for Hadoop's rise: Google engineers had published a paper on MapReduce, giving Hadoop a proven model to build on; Yahoo was still a major player in the search industry; and the internet had a sudden growth spurt.
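
For readers who have not met the MapReduce model, here is a minimal single-machine sketch of its three phases in plain Python. It is only an illustration of the idea, not Hadoop's actual API: the function names (map_phase, shuffle_phase, reduce_phase, word_count) are made up for this example, and a real cluster would run each phase in parallel across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence (hypothetical helper)."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Data is big"]))
```

The appeal of the model is that the map and reduce steps are independent per document and per key, which is what lets a framework spread them over a large cluster.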

Hadoop's fall began with the widespread adoption of cloud computing, as on-premises processing fell out of favor. Storage, maintenance and processing costs are all considerably lower in the cloud than on-premises. These economic challenges are what led to the formation and growth of Hadoop vendors such as Cloudera, Hortonworks and MapR.

In October 2018, Cloudera and Hortonworks, both struggling to turn a profit, announced a merger that placed their combined market cap at $5.2 billion.

With MapR's announcement of closure by the end of 2019, it's hard to argue for Hadoop's continued dominance of the Big Data industry. Even Google has sunset MapReduce (an integral part of Hadoop) in favor of newer home-grown technology.

However unfortunate Hadoop's position may seem, it is the same position that relatively old languages such as COBOL occupy in the tech space. A great deal of banking and ATM infrastructure still runs COBOL code, but the language is hardly being learned by people new to programming.

Hadoop is going to find its place in many industries as a legacy technology – the only thing keeping it going is pure inertia. Hadoop remains alive… for now.

The Rise of Spark

While the media was busy touting the kind of power Hadoop could offer anyone dealing with Big Data, a silent contender rose from within its ranks. Spark was initially expected to be a plug-in module for Hadoop, but has outgrown its older brother to the point that Spark without Hadoop is becoming the new reality.

Spark made its way in the market as a faster, more modern alternative to Hadoop. Sure, the two can be used together, but developers reasoned: why go through all the trouble?

Spark uses in-memory processing, which by its own project's claims makes it up to 100 times faster than Hadoop MapReduce on some workloads. Critics of in-memory processing are answered by the fact that, even when running on disk, Spark is claimed to be up to ten times faster than Hadoop.
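
To see where that difference comes from, consider a toy two-stage job in plain Python. This is a rough sketch of the idea only and does not use Spark or Hadoop: the function names and the "sensor,value" input format are assumptions made up for the illustration. One version writes the intermediate result to disk between stages, as a chain of MapReduce jobs would; the other hands it to the next stage directly in memory.

```python
import json
import os
import tempfile


def group_readings(lines):
    """Stage 1: parse 'sensor,value' lines and group values by sensor."""
    grouped = {}
    for line in lines:
        sensor, value = line.split(",")
        grouped.setdefault(sensor, []).append(float(value))
    return grouped


def average_readings(grouped):
    """Stage 2: average the grouped values for each sensor."""
    return {sensor: sum(vals) / len(vals) for sensor, vals in grouped.items()}


def run_via_disk(lines, workdir):
    """Disk-based style: stage 1 output is written out, then read back."""
    path = os.path.join(workdir, "stage1.json")
    with open(path, "w") as f:
        json.dump(group_readings(lines), f)
    with open(path) as f:
        grouped = json.load(f)
    return average_readings(grouped)


def run_in_memory(lines):
    """In-memory style: stage 1 output is passed straight to stage 2."""
    return average_readings(group_readings(lines))


if __name__ == "__main__":
    data = ["a,20.0", "b,30.0", "a,22.0"]
    with tempfile.TemporaryDirectory() as tmp:
        print(run_via_disk(data, tmp))
    print(run_in_memory(data))
```

Both versions return the same answer; the disk-based one simply pays for a write and a read between every pair of stages, and that cost adds up quickly in jobs with many stages or iterations.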

The rise of Spark was also aided by how easy it is to pick up in comparison, thanks to an incredibly simple API; it was built from the ground up with the intention of being easy to scale; and the speed it provides brings real-time processing closer to reality.

These factors combined led to a mass exodus from Hadoop to the more favorable Spark and to easy-to-adopt technologies like Kubernetes.

Growing competition from non-monolithic systems

Hadoop was built to be a large, monolithic platform upon which smaller plugins could be installed. At the time, there weren't as many problems to solve, and Hadoop solved those to the best of its ability.

As new BI needs emerged, new technology was invented to fill the void, but not all of it could integrate as easily with Hadoop. The result: developers moved to newer, shinier and faster tech stacks.

Open source projects emerged, such as Kafka and RabbitMQ for message queue processing, Elasticsearch for data indexing and search, Flink, Hive and others. With every new technology, a small piece of Hadoop's pie was eaten away. Aside from Spark, many see Kubernetes as the other technology that put the final nail in Hadoop's coffin.

Kubernetes is an open-source container orchestration system that originated at Google, drawing on the company's experience managing massive workloads internally. New Big Data frameworks could easily integrate with Kubernetes and support even more AI-oriented tech like TensorFlow and PyTorch on a single cluster.

Finally, as everything moved towards the cloud, Big Data wasn't going to be left behind. Data analytics was redesigned for the cloud with the backing of Google via Google Cloud and Amazon via AWS. The need for HDFS was further drastically reduced with the availability of Amazon S3, Google Cloud Storage and IBM Cloud Object Storage.

In the end, technology simply moved forward faster than Hadoop could hope to keep up with. Scalable analytics for Big Data has moved to the cloud, and Apache Spark and Kubernetes are leading the way.