This post is Part Two of a three-part series. Part One introduces the Magic Quadrant and covers Gartner’s Critical Capabilities for Data Science and Machine Learning Platforms.
I will publish Part Three next week. Part Two is a seven-minute read if you don’t move your lips.
Let’s review the approach Gartner analysts use to produce the Magic Quadrant. Most people don’t read this report section because it’s dull, and the writing stinks. Nevertheless, we must understand the methodology to know how Gartner makes the sausage.
Here’s how Gartner describes its Magic Quadrant approach:
Gartner Magic Quadrants are based on rigorous, fact-based analysis backed by a highly structured methodology.
Is the methodology structured? Yes.
Is the analysis rigorous? Well, not exactly. It depends on how you define that word.
Is the analysis fact-based? Gartner won a federal lawsuit by arguing that the Magic Quadrant is pure opinion and not fact-based.
As they say in courtroom dramas, were you lying then, or are you lying now?
Gartner Magic Quadrants reflect some facts and a lot of opinion. The 2024 DSML MQ summarizes the opinion of one analyst with deep experience in the category and six analysts without.
It’s critical to separate fact and opinion when interpreting the MQ. That’s not easy because Gartner hides the baseball. The report reveals some information on methodology but obscures the details. Like vendors who sell proprietary software, they don’t want you to know how they build the product.
An opaque model can be useful if you validate it. Working data scientists know this. How would you validate a Magic Quadrant? What would be the appropriate measure of ground truth? For example, would you measure the ROI for customers who pick Leaders? What about market share or revenue growth for vendors that Gartner says are Leaders?
Those are rhetorical questions, of course.
There are two axes in a Magic Quadrant:
Completeness of Vision (CoV) – the horizontal axis – is a mashup of a vendor’s market understanding, product strategy, innovation, GTM strategy, and business model.
Ability to Execute (A2E) – the vertical axis – is an index of a vendor’s product capabilities, company viability, marketing execution, customer experience, and minor metrics.
At a high level, CoV represents where a vendor says it is going. Vendors who present a vision that aligns with Gartner’s view of the future get high scores on CoV.
A2E indicates a vendor’s ability to achieve its goals. Vendors with rich product features, strong customer adoption, and deep engineering teams get high scores on A2E.
Thus, in plain English:
Leaders have the “right” goals and the resources to attain them
Challengers have lots of resources but “questionable” goals
Visionaries say the right things but may struggle to realize their ambition
Niche Players are just lost
That’s Gartner’s opinion, of course.
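Gartner doesn’t publish the weights behind either axis, but the mechanics are nothing exotic: rate each criterion, take a weighted average, and draw two cutoff lines. Here is a toy sketch in Python. Every weight, rating, and cutoff below is invented for illustration; only the criterion names come from Gartner’s own descriptions of the two axes.

```python
# Toy sketch of a Magic Quadrant-style composite score.
# Gartner does not publish its weights, sub-scores, or cutoffs;
# everything numeric below is invented for illustration only.

COV_WEIGHTS = {  # Completeness of Vision criteria (hypothetical weights)
    "market_understanding": 0.30,
    "product_strategy": 0.25,
    "innovation": 0.20,
    "gtm_strategy": 0.15,
    "business_model": 0.10,
}

A2E_WEIGHTS = {  # Ability to Execute criteria (hypothetical weights)
    "product_capabilities": 0.35,
    "company_viability": 0.25,
    "marketing_execution": 0.15,
    "customer_experience": 0.15,
    "minor_metrics": 0.10,
}

def composite(ratings: dict, weights: dict) -> float:
    """Weighted average of 1-5 analyst ratings for one axis."""
    return sum(ratings[criterion] * weight for criterion, weight in weights.items())

def quadrant(cov: float, a2e: float, cutoff: float = 3.0) -> str:
    """Place a vendor in a quadrant; the cutoff here is arbitrary."""
    if cov >= cutoff and a2e >= cutoff:
        return "Leader"
    if a2e >= cutoff:
        return "Challenger"
    if cov >= cutoff:
        return "Visionary"
    return "Niche Player"

# Example: a vendor with a great story and a thin product.
cov = composite({"market_understanding": 4.5, "product_strategy": 4.0,
                 "innovation": 4.0, "gtm_strategy": 3.5, "business_model": 3.0},
                COV_WEIGHTS)
a2e = composite({"product_capabilities": 2.5, "company_viability": 2.0,
                 "marketing_execution": 3.0, "customer_experience": 3.0,
                 "minor_metrics": 2.5},
                A2E_WEIGHTS)
print(quadrant(cov, a2e))  # prints "Visionary"
```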
Let’s unpack those scores. I will focus on the top three drivers for each axis.
Two key sources of information drive the Completeness of Vision (CoV) score:
Interactive briefings in which vendors provide Gartner with updates on their strategy, market positioning, recent key developments, and product roadmap.
The RFI document, from which Gartner captures some information about strategy and innovation.
CoV is simpler to break down because it is mainly fact-free. Gartner analysts form an opinion about the vendor’s “market understanding” and strategy from the interactive briefings. It’s comparable to a college admissions committee reading student essays; applicants must hit the right talking points to succeed.
Vendors whose “market understanding” aligns with Gartner’s do better than those whose doesn’t. It’s human nature. A person who agrees with me is a genius. Vendors who learn to say what Gartner likes get good marks for Completeness of Vision. Gartner dunks on vendors who are not so aligned.
Investment in analyst relations (A/R) can improve a vendor’s CoV rating. Vendors align their MQ pitch to Gartner’s current market understanding by meeting the analysts throughout the year and listening carefully. Additionally, vendors can and do influence the analysts’ perception of the market to favor the vendor’s position.
CoV ratings tend to be volatile from year to year. The composition of the Gartner analyst team can impact the results. A vendor’s advance prep for the “interactive briefing” makes an enormous difference.
The key driver in the Ability to Execute axis is an assessment of each vendor's current product. Gartner assesses a vendor’s product features through a paper-and-pencil RFI process and a recorded demo.
Gartner analysts construct the RFI from the Critical Capabilities we covered in Part One. Vendors work hard to influence the process. In advance briefings, vendors show the analysts features they think are unique. If the analysts like the feature, they add it to the RFI. The vendor gets a point for the feature they planted.
A few years ago, the MQ RFI included a question about federated machine learning. At the time, IBM was the only vendor that offered that capability. You can bet they planted it. I’m not knocking federated machine learning. But a list of the top ten machine learning techniques would not include it.
Oh, who am I kidding? Nobody gives a shit about federated machine learning.
After screening vendors for eligibility, Gartner sends participating vendors the RFI. Every vendor says they support every feature, as they do for every other RFI, which is one reason why people should stop using RFIs.
A few years ago, I was on an MQ response team. An executive who reviewed the draft response asked why the team hadn’t checked every box “like they do at Alteryx.”
Trust me on this: Alteryx had no business checking off every box, then or now.
Gartner requires links to product documentation, which serves as a modest check on vendor behavior. Emphasis on “modest.” The scoring is – how do I say this diplomatically? – a bit soft.
For example, features need not work particularly well or at all:
Running local large scale HuggingFace models is a complex and very costly setup, and both quality and performance tend to be below proprietary LLM APIs. We strongly recommend that you make your first experiments in the domain of LLMs using Hosted LLM APIs. The vast majority of LLM API providers have strong guarantees as to not reusing your data for training their models.
Beta features are fine as long as they are public.
Extensions are ok, too. Extensions are those things people build that don’t go through testing or release engineering.
Vendors also submit a recorded product demo. That should keep things honest because nobody ever fakes a recorded product demo.
Vendors use “stacks” of products to check off boxes in the RFI.
Amazon SageMaker alone includes fifty-five separate services. Data scientists also need Amazon Bedrock, AWS Glue, Amazon EMR, Amazon S3, and other services like Amazon CloudWatch and AWS Lambda.
The SAS Viya stack includes forty-one products. Most SAS customers use Viya together with a stack running on SAS 9.
Altair competes with partially rebranded assets acquired when they bought Datawatch, Personics, Panopticon, Angoss, World Programming, RapidMiner, and Cambridge Semantics.
Gartner does not evaluate the user experience for these stacks or how well the components work together. For that, they would have to touch software. In Gartner’s world, a bag of refrigerator parts equals a refrigerator.
Gartner computes summary scores for these categories:
Data Pipelines
Feature Management
Experiment Tracking
Model Management
Deployment and Serving
Monitoring and Observability
Generative AI
Performance and Scalability
They publish the summary scores for each vendor in the Critical Capabilities report; RFI responses and detailed scores remain hidden. That makes it impossible to explain why some vendors do surprisingly well or poorly, and Gartner analysts get very testy when people question the scoring.
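Gartner keeps the questions, weights, and roll-up rules private, but the arithmetic behind a checkbox RFI is mundane: count the boxes checked in each category and publish the averages. Here is a minimal sketch of that roll-up; the questions, answers, and 0–5 rescaling are all invented for illustration.

```python
# Toy sketch of how RFI checkboxes might roll up into published category scores.
# The real questions, weights, and scoring rules are private; everything below
# is a guess for illustration only.

from statistics import mean

# Hypothetical RFI: each category maps to a list of yes/no feature questions.
rfi_questions = {
    "Experiment Tracking": ["tracks runs", "logs metrics", "compares runs"],
    "Generative AI": ["hosted LLM access", "prompt management", "RAG support"],
}

# A vendor's self-reported answers (every vendor says yes to nearly everything).
vendor_answers = {
    "tracks runs": True, "logs metrics": True, "compares runs": True,
    "hosted LLM access": True, "prompt management": True, "RAG support": False,
}

def category_scores(questions, answers, scale=5.0):
    """Fraction of boxes checked per category, rescaled to a 0-5 score."""
    return {
        category: round(scale * mean(1.0 if answers[q] else 0.0 for q in qs), 2)
        for category, qs in questions.items()
    }

print(category_scores(rfi_questions, vendor_answers))
# {'Experiment Tracking': 5.0, 'Generative AI': 3.33}
```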
Overall, Gartner’s product assessment is “fact-based” but not unbiased. Vendors cannot score points if Gartner does not ask about a capability. If Gartner frames the market in a biased manner, the facts they gather paint a biased picture. As Part One notes, Gartner's process of framing the market is entirely opaque.
It’s like a media channel that reports all the scandals for one political party but never reports scandals for other political parties. The reporting is factual but biased.
The MQ team collects vendor financial data and uses market share and growth estimates from another Gartner research unit. Gartner does not disclose how it uses this data to compute a viability score, and the company has never published evidence to show that its viability score has predictive power. That said, most vendors above the median line on Ability to Execute in this Magic Quadrant are large and stable firms.
The viability score contributes to a vendor’s overall Ability to Execute (A2E) score (the vertical axis in the Magic Quadrant). For example, DataRobot outperformed Altair, Alibaba, SAS, and IBM in Gartner’s product assessment but scored lower overall in A2E. That reflects DataRobot’s recent management turnover and other struggles that affect the viability score.
To assess customer satisfaction, Gartner uses metrics from Gartner Peer Insights, a platform that captures user feedback and reviews. Vendors learned to game this platform long ago. They flog the user base for reviews, focusing on new users who haven’t used the product long enough to discover its limits.
Dataiku excels at this canvassing and leads the ratings. DataRobot rose markedly in the rankings a few years ago when it started begging customers for reviews. Alteryx and RapidMiner used to be top floggers but cut back in the past year or so.
Is Peer Insights accurate? Sure, just like you can always trust Yelp when you go out for soup dumplings. Vendor promotion skews the sample toward happy new customers. The questionnaire stinks and the sample sizes are too small. Vendors like AWS and Databricks don’t bother to promote the survey, so they get very few responses.
Data-driven customer usage, retention, and churn measures are far better indicators of customer satisfaction and value than bad survey data.
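For contrast, here is the kind of behavioral measure I mean. The customer lists and the one-year window are made up; the point is that the arithmetic needs no questionnaire at all.

```python
# Toy retention/churn calculation from usage data; customer names and the
# one-year window are invented for illustration.

customers_active_last_year = {"acme", "globex", "initech", "umbrella", "stark"}
customers_active_this_year = {"acme", "globex", "stark", "wayne"}

retained = customers_active_last_year & customers_active_this_year
retention_rate = len(retained) / len(customers_active_last_year)
churn_rate = 1 - retention_rate

print(f"retention: {retention_rate:.0%}, churn: {churn_rate:.0%}")
# retention: 60%, churn: 40%
```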
Gartner evaluates how well vendors align with its view of the market. Does Gartner understand the market better than anyone else? This is a critical question if you want to know which vendors will help you move your organization in the right direction.
Ashish Vaswani and his team at Google introduced the transformer architecture in 2017. Data scientists quickly adopted models like BERT; DSML vendors included them in software as early as 2018. Weights & Biases opened its doors in 2018, and Anyscale in 2019. Meanwhile, leading-edge organizations used natural language processing and computer vision in production applications.
So, who did Gartner identify as DSML “Leaders” in 2019?
KNIME, RapidMiner, SAS, and TIBCO.
Gartner preaches to anyone who will listen that IT will control DSML tools and consolidate them into a single platform. That’s not a forecast or “market understanding.” It’s wishcasting.
In 1993, I worked for AT&T Universal Card. One day, the CIO invited the company’s analysts and managers to a vendor presentation.
“We’re going to consolidate all of our analytics on one platform!” he declared.
The platform: a Thinking Machines CM-5E.
When we stopped laughing, we said no.