A Short History of AutoML: Part Six
Big Data, Small Data
You’re reading the sixth installment in a series on the history of Automated Machine Learning, or AutoML. Here are the previous installments: Part One, Part Two, Part Three, Part Four, and Part Five.
2013
It’s 2013. You work for a big company, and you lead a team of people who predict things for the business. Owing to a reorganization, the Pig and Hive People are all on your team now, delivering “data products.” HR changed all the analyst titles to “data scientist,” because it’s cheaper than handing out more comp.
Your team still uses a mix of SAS, R, and Python on an aging Solaris machine, laptops, and AWS. Personal AWS accounts are out; IT has an enterprise discount program (EDP) with AWS, so everything has to go through the Help Desk.
The Netezza appliance still works. The Cloudera cluster, like your local landfill, keeps on growing. The rest of the organization remains a data labyrinth, with something north of fifty data sources.
February: You attend the 2013 Strata Conference in Santa Clara. There are a few good presentations. Rajat Taneja from Electronic Arts shows how gaming companies capture data and use it to improve the gaming experience. Eric Colson of Stitch Fix explains how his company uses recommendation engines to anticipate customer fashion needs.
Alexander Gray, CTO of Skytree, delivers a presentation about machine learning that he could have delivered in 1997. As a rule, when a presenter’s first point is “Understand Your Goals,” you’re getting boilerplate.
Gray does not mention deep neural nets. Documents, images, and log files account for most of the data currently in Hadoop. To make sense of that mess, you need tools for natural language processing and computer vision; deep neural nets show real promise in complex tasks like image recognition, image segmentation, and text processing.
This conference is supposed to be about the cutting edge, and this guy tells us how to build a slightly better churn model.
Meanwhile, deep in the bowels of the Santa Clara Convention Center, in a very crowded room, there’s a presentation about distributed in-memory processing with a Berkeley product called “Spark.” Seems promising.
March: Researchers at the University of British Columbia release a “research prototype” of Auto-WEKA, an AutoML tool built on top of the open-source WEKA machine learning library. Auto-WEKA uses Bayesian optimization for combined algorithm selection and hyperparameter tuning. The tool does not engineer features; users perform all data prep manually in WEKA.
Auto-WEKA supports classification problems only. It runs in memory on one machine in a JVM, so it’s inherently limited to small data problems. (Not that there’s anything wrong with that.) When Auto-WEKA finishes, you can deploy your model in WEKA.
Auto-WEKA is a great leap forward for the three people who use WEKA.
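The core of Auto-WEKA is treating “which algorithm?” and “which settings?” as one joint search problem. A toy sketch of that structure, using plain random sampling as a simplified stand-in for the SMAC Bayesian optimizer Auto-WEKA actually uses (the algorithm names, ranges, and scoring function below are all made up for illustration):

```python
import random

random.seed(0)

# Toy joint search space: each "algorithm" carries its own hyperparameters,
# mirroring Auto-WEKA's combined algorithm selection and hyperparameter
# optimization (CASH) formulation.
SEARCH_SPACE = {
    "j48": {"confidence": (0.05, 0.5)},   # decision tree
    "ibk": {"k": (1, 25)},                # k-nearest neighbors
    "smo": {"c": (0.01, 100.0)},          # support vector machine
}

def toy_cv_score(algo, params):
    """Synthetic objective standing in for k-fold cross-validated accuracy."""
    base = {"j48": 0.80, "ibk": 0.78, "smo": 0.82}[algo]
    value = next(iter(params.values()))
    lo, hi = next(iter(SEARCH_SPACE[algo].values()))
    sweet = (lo + hi) / 2  # pretend each algorithm has a "sweet spot"
    return base - 0.1 * abs(value - sweet) / (hi - lo)

def random_cash_search(n_trials=50):
    """Random search over the joint (algorithm, hyperparameter) space.
    Auto-WEKA uses SMAC, a model-based Bayesian optimizer, instead of
    random sampling, but the search space is structured the same way."""
    best = (None, None, float("-inf"))
    for _ in range(n_trials):
        algo = random.choice(list(SEARCH_SPACE))
        name, (lo, hi) = next(iter(SEARCH_SPACE[algo].items()))
        params = {name: random.uniform(lo, hi)}
        score = toy_cv_score(algo, params)
        if score > best[2]:
            best = (algo, params, score)
    return best

algo, params, score = random_cash_search()
print(algo, params, round(score, 3))
```

The point of the joint formulation is that the optimizer can shift its budget toward whichever algorithm family looks promising, instead of tuning each one in isolation.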
April: Skytree announces an $18 million Series A round, with backing from U.S. Venture Partners, UPS, Javelin Venture Partners, and Osage University Partners. The company plans to commercialize software developed at Georgia Tech under the FASTLab project.
Skytree Server is a library of distributed machine learning algorithms. It supports the usual suspects: Support Vector Machines (SVM), Nearest Neighbor, K-Means, Principal Component Analysis (PCA), Linear Regression, Two-Point Correlation, Kernel Density Estimation (KDE), Gradient Boosted Trees, and Random Forests.
Why use Skytree’s proprietary software? Skytree claims, without evidence, that it runs fast and scales to large datasets. The company claims that it can train models “faster than Mahout,” which is like saying your Honda Accord can go faster than Grandma’s Ford Pinto with Grandma behind the wheel.
There are no images or videos of a user interface on its website, so you can assume the user interface really sucks.
For reference customers, Skytree claims Adconian, Brookfield Residential Property Services, CANFAR, eHarmony, SETI Institute, and United States Golf Association. That’s the same list they claimed a year ago, so they haven’t landed any new ones.
Making conventional machine learning algorithms “scale out” is a waste of time. If you train a model with SVM on a billion records, you may get a model that is slightly better than the one you build in kernlab on your laptop with 10,000 records, but nobody will care.
June: The Apache Spark project enters the Apache Incubator.
July: SAS finally releases SAS/ACCESS for Hadoop, the product it announced fifteen months ago.
September: The Apache Spark project releases Spark 0.8.0. Among many other enhancements, this release introduces MLlib, a machine learning library for Spark. MLlib supports five algorithms: linear support vector machines (SVMs), logistic regression, regularized linear regression, k-means clustering, and alternating least squares for collaborative filtering.
On the same day, Databricks emerges from stealth and announces a $14 million raise led by Andreessen Horowitz. The company plans to commercialize Apache Spark.
December: Lots of VC money for startups with a machine learning theme. So far this year: new funding for Alpine Data Labs, Alteryx, Arria, Ayasdi, Databricks, Datameer, Fractal Analytics, Guavus, Opera Solutions, RapidMiner, Revolution Analytics, and Skytree.
There’s even more money flowing to business solutions that leverage machine learning. Palantir leads the pack, raising $304 million in two rounds. Others include Dataminr, DataSift, DemandBase, Drillinginfo, Networked Insights, Ooyala, RetailNext, Tidemark, and WorldOne.
2014
It’s 2014. You work for a big company, and you lead a team of data scientists. Your team still uses a mix of SAS, R, and Python on an aging Solaris machine, laptops, and AWS.
Some of your people play with H2O on their laptops, but IT doesn’t want it in the Cloudera cluster. The Hive People like Impala, but they still run Hive jobs overnight.
There’s a food fight in IT right now over the Netezza box. The IBM faction wants to keep it running, Team AWS wants to migrate the data to Redshift, and Team Cloudera wants to replace it with Impala. There are no issues with Netezza; it works fine. The AWS and Cloudera factions just want to flex.
You tell your team: Let. Them. Fight.
February: You attend the 2014 Strata Conference. Meh. People with petabyte envy and their enablers.
March: Tianqi Chen, a PhD student at the University of Washington, releases XGBoost, an open-source package for gradient boosted trees.
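The idea XGBoost implements, stripped to its core: fit each new tree to the residuals of the current ensemble. A toy sketch with depth-1 stumps on a single feature (XGBoost adds regularization, second-order gradients, sparsity handling, and fast tree construction on top of this skeleton; the data below is invented for illustration):

```python
# Gradient boosting for squared loss: the residuals are the negative
# gradient, and each round fits a weak learner (a stump) to them.

def fit_stump(x, residuals):
    """Find the split on x that best reduces squared error on the residuals."""
    best = None
    for split in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= split]
        right = [r for xi, r in zip(x, residuals) if xi > split]
        if not left or not right:
            continue
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, split, lmean, rmean)
    _, split, lmean, rmean = best
    return lambda xi: lmean if xi <= split else rmean

def boost(x, y, rounds=20, lr=0.3):
    """Additive model: start from the mean, add shrunken stumps on residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + lr * stump(xi) for pi, xi in zip(pred, x)]
    return lambda xi: base + lr * sum(s(xi) for s in stumps)

x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.1, 0.9, 3.0, 3.1, 2.9]  # roughly a step function
model = boost(x, y)
print(round(model(2), 2), round(model(5), 2))
```

The learning rate shrinks each stump’s contribution, trading more rounds for better generalization; that knob, and the choice of tree depth, are exactly the hyperparameters everyone would soon be tuning.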
April: 0xdata reveals a distributed deep learning capability in open-source H2O. Now we’re getting somewhere.
May/June: Just in time for Spark Summit, the Apache Spark team ships Release 1.0.0. The new release includes Spark SQL in alpha, improvements to MLlib, integration with YARN, and numerous other enhancements.
Spark’s machine learning library is still rudimentary. It’s nice to have L-BFGS as an optimization primitive, but you must build on top of that to have something useful.
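What “build on top of an optimization primitive” means in practice: the optimizer only minimizes a function, so the loss, gradient, and prediction logic are still on you. A minimal sketch, using plain gradient descent as a simpler stand-in for L-BFGS (the dataset and learning rate are invented for illustration):

```python
import math

def gradient_descent(grad, w0, lr=0.5, steps=200):
    """Generic first-order optimizer: all it knows is the gradient function."""
    w = list(w0)
    for _ in range(steps):
        g = grad(w)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

def logistic_loss_grad(X, y):
    """The piece the primitive doesn't give you: the model's gradient."""
    def grad(w):
        g = [0.0] * len(w)
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, xi))))
            for j, xj in enumerate(xi):
                g[j] += (p - yi) * xj / len(X)
        return g
    return grad

# Tiny linearly separable dataset: a bias term plus one feature.
X = [(1.0, -2.0), (1.0, -1.0), (1.0, 1.0), (1.0, 2.0)]
y = [0, 0, 1, 1]
w = gradient_descent(logistic_loss_grad(X, y), [0.0, 0.0])
predict = lambda xi: 1 if sum(wj * xj for wj, xj in zip(w, xi)) > 0 else 0
print([predict(xi) for xi in X])
```

Swap in L-BFGS for `gradient_descent` and distribute the gradient computation across a cluster and you have, roughly, what MLlib had to build for each model it shipped.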
You attend the Spark Summit. Several things impress you:
Interest in Spark is spreading like wildfire.
The Spark project adds features incredibly fast.
Tech vendors are falling all over themselves to embrace Spark.
IBM is one of the sponsors. Their speaker burns a speaking slot yapping about some garbage software from IBM Research. SAS does not sponsor the event, but you see one of their execs wandering around the exhibits. He looks dazed.
August: DataRobot, a Boston-based startup, raises $21 million in a Series A round. The company’s AutoML service, still in beta, introduces several innovations: it automates feature engineering, and it uses a bootstrap optimization approach with open-source machine learning algorithms.
This company doesn’t talk bullshit about Big Data, and it doesn’t build scale-out machine learning algorithms. Instead, the DataRobot software distributes model training experiments across many cloud instances. The advantages of this approach are speed, flexibility, and trustworthiness: they use the same software packages that data scientists use every day, and they automate the programming effort needed to run a hundred experiments.
The disadvantage of DataRobot’s approach is that training datasets must be small enough to run on AWS on-demand instances. That is sufficient for most business prediction use cases, but it saddens the Big Data priests.
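The pattern is worth spelling out: instead of scaling one algorithm across a cluster, fan many independent training experiments out across workers. A minimal sketch, where a thread pool stands in for a fleet of cloud instances and `run_experiment` is a hypothetical placeholder for “train this model with these settings and return its cross-validated score” (the algorithm names, parameters, and scores are all invented):

```python
from concurrent.futures import ThreadPoolExecutor

# Each experiment is an independent (algorithm, hyperparameters) pair,
# so they parallelize trivially -- no distributed algorithms required.
EXPERIMENTS = [
    ("logistic_regression", {"penalty": p}) for p in (0.01, 0.1, 1.0)
] + [
    ("random_forest", {"trees": n}) for n in (50, 100, 200)
]

def run_experiment(spec):
    algo, params = spec
    # In a real system this would call an open-source library (scikit-learn,
    # R packages, etc.) on a fresh instance. Here we fake a deterministic score.
    base = {"logistic_regression": 0.71, "random_forest": 0.74}[algo]
    return algo, params, base + 0.0001 * sum(params.values())

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_experiment, EXPERIMENTS))

# Build the leaderboard: best cross-validated score first.
leaderboard = sorted(results, key=lambda r: r[2], reverse=True)
for algo, params, score in leaderboard[:3]:
    print(f"{score:.4f}  {algo}  {params}")
```

Because every experiment fits on one machine, the constraint is exactly the one noted above: the training data has to fit on a single instance.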
October: You attend the Strata + Hadoop World conference in New York at the Javits Center. Eli Collins of Cloudera delivers an “ethics and Big Data” piece, in which he argues that we should “use data for good.” The next presenter on the program shows how to use facial recognition to get people to buy more candy.
December: Your new CIO asks: Can we replace SAS with R? Yes, you reply, if you have bigger balls than the average CIO.
Pressed for an explanation, you respond: anything you can do in SAS, you can do in R, and there are things you can do in R that you can’t do in SAS.
But SAS is easier to learn and use than R; even R champions admit this. You’re going to have to retrain all the SAS users, and they will scream because they are busy and don’t want to spend time learning a new language just because you want to cut software costs. They know SAS renewals amount to less than what you spend on Gartner subscriptions and the annual IT offsite.
It’s not like the call center, where you tell everyone we’re using SAP and that’s it; they have to adapt. The SAS users aren’t call center agents. They are researchers, economists, and actuaries. Nobody tells the Chief Scientist or the Chief Risk Officer what tools their teams can use.
The CIO thanks you for the input and promises to take the matter under consideration. Nothing happens, confirming that his balls are not, in fact, bigger than those of most CIOs.

