Ten Years Ago
Spark really WAS too big to fail
Ten years ago, I published Spark Is Too Big To Fail. In 2015, leading tech journalists and analysts published lukewarm and dismissive reviews of Spark Summit East. I begged to differ. I’ve republished the text below with some minor corrections and comments; Grammarly spotted some shameful examples of passive voice in the original. There are some broken links from the original text.
As interest in Apache Spark grows, a contrarian meme is developing:
David Ramel asks: Are Spark and Hadoop friends or foes?
Jack Vaughan compares Spark to the PDP-11 and dismisses it as “just processing.”
Doug Henschen praises Spark, pans Databricks.
Nicole Laskowski complains that Spark Summit East “felt like a Databricks show.”
Andrew Oliver thinks Spark needs to grow up.
Andrew Brust worries that vendors are ahead of customers on Spark.
IBM’s James Kobielus characterizes Spark as “the shiny new thing.”
Gartner’s Nick Heudecker asserts that Spark is “not enterprise-ready.”
Spark skepticism falls into three broad categories:
Hadoop Purism: Spark deviates from the MapReduce/HDFS framework, and some people aren’t happy about that
Backseat Driving: Some analysts argue that Spark is great, but Databricks, the commercial venture behind Spark, should do X, Y, or Z
FUD: Spark’s competitors — commercial and open source — plant “issues” and “concerns” about Spark with industry analysts
Let’s examine each in turn.
“Spark Competes With Hadoop”
Spark does not compete with Hadoop; it competes with MapReduce. Hadoop is an ecosystem of projects; all commercial distributions include core components (e.g., Hive, Pig, HBase), from which customers pick and choose. The ability to mix and match components is a strength for Hadoop.
Some software, like Spark, can run co-located in a Hadoop cluster or on clustered machines outside Hadoop. This capability should not surprise anyone; clustering and distributed computing existed before Hadoop. Why does it matter if a software component can run both ways? Users and use cases will drive implementation; if Spark works better with Cassandra than with HDFS or if a Spark user does not need the other Hadoop bits, that’s fine.
While there are reports of organizations that have abandoned MapReduce, most organizations will use Spark together with MapReduce; if users are happy with existing MapReduce jobs, there is no need to rewrite them. For new applications, however, some users will choose Spark over MapReduce for various reasons: better runtime performance, more efficient programming, more built-in features, or simply because it’s the latest thing. Isn’t competition a wonderful thing?
Organizations running standalone instances of Spark likely never considered using MapReduce for the application. For these use cases, Spark competes with SAS, Skytree, H2O, GraphLab, and other machine learning software.
Me: MapReduce is dead and gone, and so is Hadoop.
Databricks Envy
Sniping at Databricks is equally unwarranted. (Note: I’m not on the payroll.) There are only so many ways to build a viable open-source business model. Offering a commercial product with additional bits is one way to do so; that is how Cloudera and MapR operate. Databricks provides a hosted service for Spark with a few extra bits; if you don’t like Databricks’ offering, you can deploy Spark on-premises yourself or get Spark as a service through Amazon Web Services, BlueData, Qubole, or elsewhere.
And if you really must have a notebook for Spark, try Zeppelin.
Me: That didn’t age well LOL.
Of course, it’s true that Hortonworks open-sources everything. HDP loses $3.76 for every dollar it sells. They hope to make it up on volume.
Databricks contributes heavily to the open-source Spark project, supporting developers whose sole job is to improve Spark. Most importantly, Databricks provides leadership and release management, which inspires confidence that Spark will not turn into a muddled mess like Mahout.
The complaint that Spark Summit East “felt like a Databricks show” is odd — one rarely hears complaints that Oracle World “feels like an Oracle show.” Thirty-nine presentations were on the agenda at Spark Summit East, and one — Ion Stoica’s keynote — highlighted Databricks Cloud. In contrast, sponsored sessions accounted for a third of the sessions at the 2015 Strata + Hadoop World in Santa Clara.
“Spark Is Not Enterprise-Ready”
Some of the criticism is silly. Andrew Oliver is shocked to discover that Databricks Cloud’s notebook, currently still in beta, isn’t as slick as Tableau. Also, a process he was watching timed out. But wait! That might be due to slow hotel Wi-Fi…
Meanwhile, SecurityTracker reports a major security flaw in IBM’s Big SQL.
Is Spark “enterprise-ready”? Customers ask the same question about Hadoop, and conservative enterprises answer “no” to both. No single threshold determines when a piece of software is “enterprise-ready.” Use cases matter; the standard for software that will run your ATMs is not the same as the standard for genomics research software.
According to Gartner’s Heudecker, “actual adopters are mid- and late-stage startups such as Spark pureplay DataBricks, ClearStory Data, and Paxata. Other companies primarily use Spark to power dashboards.” It's interesting to hear Gartner dismiss the dashboard market, but enterprises use Spark for more than dashboards. A top global bank uses Spark today for Basel reporting and stress testing; if you’re unfamiliar with stress testing, suffice it to say that a bank that gets this application wrong is in trouble.
Indeed, vendors are ahead of customers on Spark. That is hardly out of the ordinary with new technology; one could have said the same thing about Hive in 2010. Vendors are always ahead of customers; it’s their job.
Me: I’m so old I remember ClearStory Data and Paxata.
Spark is Too Big to Fail
What are the alternatives to Spark? Gartner’s Heudecker correctly notes that Spark excels at iterative processing; MapReduce must persist intermediate results to disk after each pass through the data, sandbagging performance. High-performance advanced analytics must run in memory; commercial products are available from SAS and Skytree, but for open-source distributed analytics, there are few alternatives to Spark. Flink and Tez lack Spark’s analytic libraries; Impala supports SQL but lacks capabilities for machine learning, streaming analytics, and graph analytics.
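The performance argument above is architectural, and a toy sketch makes it concrete. The following plain Python (not actual Spark or MapReduce code; the functions and file layout are illustrative assumptions) contrasts the two execution models: one loop persists its working set to disk after every pass, the way MapReduce chains jobs, while the other keeps the working set cached in memory, the way Spark holds an RDD across iterations.

```python
import json
import tempfile
from pathlib import Path

def iterate_on_disk(values, passes):
    """MapReduce-style: each pass writes its result to disk and the
    next pass reads it back, so every iteration pays two I/O trips."""
    workdir = Path(tempfile.mkdtemp())
    current = workdir / "pass_0.json"
    current.write_text(json.dumps(values))   # initial load onto "HDFS"
    disk_trips = 1
    for i in range(passes):
        data = json.loads(current.read_text())      # read previous pass
        mean = sum(data) / len(data)
        result = [x - mean for x in data]           # the per-pass "job"
        current = workdir / f"pass_{i + 1}.json"
        current.write_text(json.dumps(result))      # persist for next pass
        disk_trips += 2
    return json.loads(current.read_text()), disk_trips

def iterate_in_memory(values, passes):
    """Spark-style: the working set stays cached in memory, so the
    iterations themselves touch disk zero times."""
    data = list(values)                             # "cached" dataset
    for _ in range(passes):
        mean = sum(data) / len(data)
        data = [x - mean for x in data]
    return data, 0

nums = [1.0, 2.0, 3.0, 4.0]
on_disk, disk_io = iterate_on_disk(nums, passes=10)
in_mem, mem_io = iterate_in_memory(nums, passes=10)
assert on_disk == in_mem    # same answer either way; only the I/O differs
print(disk_io, mem_io)
```

Both loops compute identical results; the difference is that ten passes cost the disk-backed version twenty-one filesystem round-trips and the cached version none, which is exactly the gap that widens as data volumes and iteration counts grow.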
Is Spark fully buttoned down in Release 1.3? Who cares? Spark beats MapReduce for advanced analytics applications. It’s not close.
I am not suggesting that Spark is free of bugs or issues. Like every other commercial and open source software project, Spark has bugs; unlike some commercial products Gartner rates as “Leaders,” the Spark team is transparent about issues and fixes them quickly. It’s also fair to say that this time next year, Spark will have more features than it has today; the community of users and contributors will decide what to add.
Unlike some other open-source projects, Spark has strong leadership, a disciplined approach to development, and an impressive release cadence. People build software, and the people behind Spark have proven that they know what they are doing.
The list of Spark users is strong and growing. I’ve attended every Spark Summit since the first one in 2013, and there has been a noticeable growth in the number and sophistication of the applications presented. That is not hype; it is real progress by users who accomplish bigger and better things with Spark than they could have without it.
Spark already has broad commercial support, ensuring it will live up to its promise. It is available in every commercial Hadoop distribution and from DataStax, and it is endorsed by SAP and Oracle; it is inconceivable that these players will let Spark fail. Reputations are at stake; there are few other options for open-source high-performance advanced analytics inside or outside of Hadoop.

I don't think my 2015 article is aptly described in this piece. Maybe my number doesn't describe itself well enough. OK. Discussing Ken Olsen lost a lot of readers no doubt.
I think the article was supposed to convey that buyers should keep their eye on the goal and not wed themselves to Hadoop or Spark without due consideration.
The only reason I include the whole thing in this commentary is that one has to go into the Wayback Machine to find it. The above link leads to a generic Spark page.
May your garden grow, Thomas. Sorry for hoisting one of the longest comments of the 21st Century....
Apache Spark meets the PDP-11 -- in the end, it's just processing
Long ago and far away, in 1992, I sat with my then-boss, Jon Titus, discussing the IT news of the day. Titus was a great editor, and very understated; he was also a PC pioneer, having created the early Mark-8 Personal Minicomputer that now sits in the Smithsonian. The newsroom conversation was much as it might be today, when we talk about Apple's surprise purchase of FoundationDB, or the rapid ascent of the Apache Spark processing engine in the headlines of the Hadoop world.
The news of the moment that day was the ouster of Ken Olsen, Digital Equipment Corp.'s co-founder and CEO. His departure was an inflection point along a trail that saw DEC go from being a gutsy mill town startup in Massachusetts to being a serious threat to IBM's industry leadership to being a forlorn acquisition candidate.
Like those in other editorial offices, we wondered what went wrong. Ultimately, what went wrong was the company got confused about what business it was really in. Seems absurd, but it can happen. Titus had a unique perspective on Olsen's quandary as smaller computers and new kinds of software came along to unseat DEC's flagship PDP-11 and VAX computers. "DEC came to think they were selling minicomputers," he said. "But what they were selling was computing."
The simple things can sometimes be the hardest to remember. That's good to keep in mind in light of the growth of distributed data processing, and the highly touted Apache Spark framework's recent rise to prominence.
What's in a name?
There's some confusion today as Hadoop distributed processing is joined by Spark distributed processing in the Apache Software Foundation data ecosystem. Spark was the hot topic at last fall's Strata + Hadoop World conference in New York, and its echo was heard even more loudly at the recent Spark Summit East -- the first east coast edition of that event, also held in New York.
Some developers will urge their managers to jump from MapReduce-based Hadoop to Spark, and some -- with good reason -- will. Spark proponents claim that it can run batch processing jobs up to 100 times faster than MapReduce can -- it can also run stream processing and machine learning applications, which MapReduce can't do. But other managers will wonder if Spark isn't just the shiny new object on the distributed computing block -- or, worse, just fodder for a developer's resume. Or both.
Using the DEC story as a guide, if your organization has deployed a Hadoop cluster, you'd be advised to think about what you've been doing to date not as Hadoop computing, but, more simply, as computing. It isn't incorrect to think of Hadoop as a precursor to Spark, or Spark as a descendant of Hadoop -- but such generalities can only be taken so far.
Still, looking at the similarities between Apache Spark and Hadoop is a good first step. It's helpful to realize that Hadoop, in a way, greased the skids for Spark by bringing into wider currency basic notions of distributing workloads and managing compute clusters.
Hadoop computing, Spark computing
In Hadoop, open source APIs link different tools as applications demand. That has become part and parcel of Spark as well. And in fact, people looking to bring Spark into an organization often start out with systems that are an adjunct to the Hadoop Distributed File System, which can serve as an input source and as a persistent data store for Spark output.
Looking at differences is valuable, too, of course. On a basic level, at least for some types of jobs, Spark seems to provide a superior compute engine to MapReduce, the calculating engine that powered the original version of Hadoop before a new Hadoop 2 release opened up the framework to other platforms. As always, your mileage may vary. Also, Spark's reliance on in-memory architecture is a plus -- or maybe a minus, depending on your IT environment.
When Olsen's company put on its DECworld user conference in Boston back in the day, it was flush enough to bring in the Queen Elizabeth 2 ocean liner to host some of the event. Thereafter, DEC went from an industry titan to a desperate company that disappeared into Compaq Computer. That all went down in about 10 years' time.
Hadoop vendors will do well to see Apache Spark as a partner technology. In most cases, they have been doing just that thus far. For users, there must be a realization that they aren't doing Hadoop computing -- and they aren't doing Spark computing, either. They're doing distributed data processing, and the particular engine is only one step in an ongoing progression.