2011
It’s 2011. You work for a big company, and you lead a team of people who predict things for the business. Your team uses a mix of SAS, R, and Python on an aging Solaris machine, laptops, and several AWS accounts. You still use a Netezza appliance for an analytics data mart. IBM hasn’t fired your Netezza support people yet, but the IBM Client Executive shows up at every meeting.
IBM offers to bundle SPSS into your company’s Enterprise License Agreement for “free.” All you have to do is sign up for “code-based migration” from SAS. That’s a nice way to say IBM will get a bunch of people in Mumbai to rewrite all your SAS jobs.
You decline the offer. Base SAS and SAS/STAT work fine. Your new hires use R or Python, so there’s no need to expand the SAS footprint, and SAS renewal fees aren’t that expensive. Migration is expensive, and retraining is a pain.
SAS users are happy. Eventually, they will retire. Problem solved. No need to throw a rock at that beehive.
Mr. Short, your CIO, wants to invest in Hadoop. He wants a commercially supported version, so he invites Cloudera and MapR to pitch.
IBM tries to disrupt the process. The Client Exec offers InfoSphere BigInsights, IBM’s Hadoop distribution, for “free.” Shorty is ex-IBM, but even he can spot a sucker deal. The software is free, but the consulting and implementation services will cost him an arm and a leg.
Mike Olson delivers the pitch for Cloudera. His pitch: we’re the smart guys who invented Hadoop, we automated all that shit you have to do with the Apache distribution, we’re enterprise-ready, we have money in the bank, and here’s our list of reference customers. We have a conference next month. Bring your whole team; here are some tickets.
John Schroeder from MapR shows up wearing a tie. He talks about file systems: HDFS sucks, we have a proprietary file system that is 100X better. Look at this diagram. See, our file system is better. No, we can’t prove that. Keep looking at that diagram.
You attend the first Strata conference in Santa Clara. One of the keynote speakers is a “scientist” who talks about “telling stories” with data. Isn’t that precious? She must work with Marketing, they love stories. Your clients don’t want stories, they want the fucking numbers.
Another keynoter is a journalist worried about “surveillance.” Chemtrails and tinfoil hat stuff.
Some of the speakers talk about all the “insight” they get from Big Data. They don’t say how. What do they do, write Pig jobs? Apache Mahout, Hadoop’s machine learning project, is a mess. If you want insight, you extract your data from Hadoop and use it elsewhere.
You cruise the vendor booths. Cloudera, IBM, and AWS have big booths. IBM’s booth is bigger than your house. EMC flogs Greenplum, Splunk promotes its “engine for machine data”, and Microsoft pushes nothing but air. Revolution Analytics hands out stuffed animals. It’s a great way to capture a million garbage leads.
Tableau is here, too. Great tool for bar charts if you can decipher the Hive Metastore.
Shorty returns from Strata with a bad case of petabyte envy. He chooses Cloudera for an on-premises cluster. That’s funny. Cloudera, on-premises.
The KXEN people return. They have some new venture capital, and their pitch looks fresh. New branding, too: KXEN InfiniteInsight. Same engine under the hood, but the GUI looks slicker, they have a couple of new modules, and they run in Oracle as well as Teradata. The AutoML is still black-boxy and overhyped. Still no automated feature engineering.
Kaggle lands a Series A venture round. The company hosts major competitions during the year and launches a community hub for data scientists. Seems like a great resource for the unemployed.
2012
It’s 2012. You work for a big company, and you lead a team of people who predict things for the business.
Your company has a large Cloudera cluster up and running. Dump trucks arrive hourly to unload log files, documents, images, and other business effluvium. Nobody knows what to do with any of it, but it relieves Shorty’s petabyte envy.
Some of the business analysts know a little SQL, but “schema on read” is a bridge too far. IT hires a couple of Pig and Hive programmers to create “data products” that analysts can suck into Tableau. They call themselves “data scientists.” It sounds better than “Pig Programmer.”
SAS wants everyone to know: they’re not dead yet.
SAS “Thought Leader” Mark Troester posts SAS Hadoop - A peek at the technology, a blog post about SAS/ACCESS Interface to Hadoop software. Users will be able to assign Hive tables to SAS libraries, submit HiveQL commands from SAS, and run seven SAS PROCs inside Hadoop.
Good luck getting SAS users to push workloads into Hadoop. SAS added PROC SQL to Base in the 1990s. Still, most SAS people don’t know how to use SQL; those who do use it select the data they want, copy it to the server, and process it in SAS from there. You can’t blame them. There are 164 PROCs and 468 functions in SAS 9.3 Base, STAT, and GRAPH; many of the functions do things that are very difficult to execute in SQL.
Two weeks later, SAS announces SAS Visual Analytics. Visual Analytics is a Tableau knockoff that lacks key features, is hard to use, and only runs on SAS LASR Analytic Server, an expensive proprietary back end.
Other than those things, it’s competitive.
SAS says you can co-locate LASR Server in your Hadoop cluster. That should be fun. LASR Server requires a lot more memory than a typical Hadoop node server, so first you must upgrade all of your hardware. Hadoop currently lacks a workload manager, so you can watch and laugh when LASR Server jobs run into MapReduce jobs and crash the cluster.
In August, 0xdata launches H2O, an open-source distributed machine learning platform, so finally we have machine learning in Hadoop. The first release supports GLM, k-means, PCA, and Naive Bayes, and the company plans to add more machine learning tools soon. The software runs in Cloudera and other distributions; users interact with H2O through Java bindings or a REST interface.
Gigaom says that 0xdata wants to “make everyone a data scientist”, which seems like a wee bit of a stretch. Everyone who writes Java and knows what to do with a generalized linear model.
You attend Strata + Hadoop World in New York. The first keynote speaker thinks Big Data will feed the world, cure all the diseases, and deliver clean energy, an ambitious program for a file system. He doesn’t mention increasing sales, cutting costs, or improving compliance.
SAS has a speaker slot. They send a beard from Building R to cover; he speaks for seven minutes, quotes Alistair Croll, and says nothing specific about SAS software. That’s the smart play for this crowd.
At a breakout session, Mr. Beard brings up LASR Server. You ask how to avoid conflict between SAS and MapReduce workloads. Simple, says Mr. Beard. Run LASR Server during the day and MapReduce at night. You cough to suppress the giggles.
You cruise the vendor booths. The folks in the Google Cloud Platform booth bang the drum for BigQuery and Google Prediction API. Prediction API seems like a serious attempt at AutoML. It’s a black-box service for model training and inference that supports classification and regression use cases.
Unfortunately, it’s completely opaque. Google won’t say how it transforms data or what algorithms it uses. You can’t see the code, and you can’t export the model; what you build in Prediction API stays in Prediction API. The out-of-the-box model quality metrics are minimal.
That won’t fly with the Chief Risk Officer. Black boxes are fine for low-value decisions, but clients want to tear apart a high-stakes model.
Harvard Business Review publishes Data Scientist: The Sexiest Job of the 21st Century. You read the article and nod in agreement. You wear your “I love Sqoop + Oozie” T-shirt to the grocery store and the cashier gives you a “come hither” look.
I’m so appreciative that you’re telling this story. It’s great to see the through line for tech drawn coherently.
In 2011 I was a sub for a company named PurePredictive. The vision of Dr. Wellman was “data scientists in a box”. They even got a patent (2012/13), one that H2O undid with a lawsuit in 2018. PurePredictive was a shitshow, Wellman was/is brilliant and kind, and I caught the automation bug there.