What's the role of an ML PM?
How to add value when the product experience is driven by engineering
Lots of new subscribers have found their way here through Twitter again. Welcome! ‘What’s my A.I.?’ is a newsletter focused on helping product managers perform better on the job. With each letter, I want a) readers to walk away with tactics they can use to build and scale resilient products and teams and b) to make the product craft more accessible. If you have any feedback or requests, reach out directly.
What’s in this newsletter:
What’s the role of an ML PM?
Resources: A product executive’s blog for a view from the top, Top start-up jobs database, Counting Stuff newsletter by Randy Au
What’s the role of a PM building ML based products?
A few weeks ago, two followers on Twitter messaged me asking what the competencies of a Machine Learning PM are. One is an engineer considering making the move from data engineering to PM and the other is on a team with a newly appointed PM.
Across all of my previous PM roles at Google, Yelp, and Quantcast, I’ve partnered with engineering teams with machine learning skillsets to solve hairy problems for the business. As a result, I feel uniquely qualified to generalize competencies that are not niche to a specific company. Note that the contributions I describe below are for a PM building ML-based products rather than one deciding what to build for machine learning tools used by developers (e.g., what someone at Databricks might be doing). Let’s start by distinguishing what value the PM adds instead of taking a skills-only view: they may share some skills with SWEs/MLEs, but they are accountable for a distinctly different outcome:
As you can see, the PM cannot be divorced from the technical challenges the engineering team will face and should, at the least, inform decision-making and ideally drive it. Here are the questions I’ve asked or had to inform in my previous roles:
If you are joining a team early in its lifecycle or deciding whether an ML skillset is warranted, you need to determine whether the problem at hand would benefit from ML. Said another way, could you achieve most of the value to end-users with a lighter implementation, like using heuristics in the back-end or even letting users set rules for their experience? You may not know at the start, but you can close that knowledge gap by launching the dumber implementation first, either en masse or via a pilot. As an example, imagine you’re deciding whether to guide users automagically from an app’s landing page to one of the other sub-pages (discover near me, saved places, trending places). You could configure the landing page per the device’s / signed-in user’s previous browsing in the app, or you can ask every user the same prompt: do you want to Discover | View saved places | Trending places?
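The heuristic baseline above can be sketched in a few lines. This is a hypothetical illustration, not a real implementation; the function name, the `tab_visits` field, and the tie-breaking rule are all assumptions made for the example:

```python
# Hypothetical sketch: a heuristic baseline for routing users from the
# landing page, worth piloting before investing in an ML model.

def choose_landing_tab(user_history: dict) -> str:
    """Route to the sub-page the user has visited most often."""
    visits = user_history.get("tab_visits", {})  # e.g. {"discover": 3, "saved": 1}
    if not visits:
        # New or signed-out user: fall back to asking the user directly.
        return "prompt_user"
    # Simple heuristic: most-visited tab wins; ties break alphabetically.
    return max(sorted(visits), key=lambda tab: visits[tab])
```

A pilot could A/B test this rule against the static prompt, and the gap between the two tells you how much headroom an ML model would actually have.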
Unless you’re performing a one-off analysis, the model(s) your team maintains will need to be repeatedly re-trained and evaluated for total performance as well as bias. Therefore, unless you have a steady pipeline of labeled ground truth data, you’ll struggle to improve or even maintain the quality of your predictions/recommendations/groupings. You should have an opinion on where your team can collect ground truth: is it user activity on your company’s product’s UI surfaces, do you have to purchase it from external third parties, do you have to periodically pay humans to curate a data set for you, or something else? Further, with a) increasing regulations on data privacy in the EU and the US and b) platform restrictions to data collection without users’ explicit consent, your options to purchase data from external sources or even your own surfaces may be limited and expose your organization to legal risk.
The goal is rarely to have 100% accuracy (or whichever other metric you choose, described in more detail in subsequent sections) before launching. In fact, because ML is warranted when dealing with very large sets of data in which it is difficult to use straightforward heuristics, you should expect to see errors. However, you should be able to articulate what the cost of a wrong prediction is to a) your end-user and b) your company. The canonical illustration: predicting tumors is far more sensitive to error than recommending which furniture item to buy on an e-commerce website. To probe sensitivity, think beyond the industry you are in (healthcare vs. e-commerce) and specifically ask: a) do your customers ask questions about accuracy, and if they do, why? What are they concerned about beyond the first-level concern? b) can you benchmark competitors’ performance for the same product/category? c) what’s the performance/reliability of alternative solutions beyond competitors (e.g., if you’re automating a process that users perform manually today, how important is it to users to get every step of the process correct offline)?
In contrast to non-data product development, you’ll have to consider a blend of metrics, and the trade-offs among those metrics could change as frequently as per model release. As a starting point, you should understand the relationship between model quality and end-customer experience. For example, if you oversee a payment check-out flow for several merchant storefronts and want to detect fraud, it makes sense that you want the model to be very accurate, because if you flag innocent customers as fraudulent, the merchant will lose out on both sales and the shopper’s trust. In this case, the underlying implementation is a classifier, so you should track a variation of the confusion matrix over time and ask to disaggregate metrics per customer segment (more in the next bullet). If you’re not working with any sort of classifier, evaluating metrics like mean squared error (MSE) may be more useful. Google’s documentation for the Vertex platform has an easy-to-understand list of metrics by problem type here.
Confusion matrix (detailed here) example:
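To make the fraud example concrete, here is a minimal sketch of the metrics derived from a confusion matrix. The counts are made-up illustrative numbers, not from any real system:

```python
# Illustrative sketch: deriving classifier metrics from confusion-matrix
# counts (true/false positives and negatives) for something like a fraud model.

def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    return {
        # Of the orders we flagged, how many were truly fraud?
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Of the truly fraudulent orders, how many did we catch?
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        # How often do we flag innocent shoppers? This is the metric that
        # costs the merchant sales and trust when it creeps up.
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }

metrics = confusion_metrics(tp=80, fp=20, fn=10, tn=890)
```

Tracking these three numbers per release (and per customer segment, as the next bullet argues) is usually more actionable than a single accuracy figure.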
Disaggregation can make the difference between improving your customers’ experience with your product/brand and losing their trust. The more diverse your training data set and the behavior of your end-customers, the more important it is for you to slice your metrics based on customer type. For instance, if your product is trying to recommend nearby businesses to consumers (a real problem I worked on while at Yelp) in real-time, it would be a terrible experience for consumers if you suggested businesses a mile away from where they’re walking. Location data is notoriously finicky because of GPS sensitivity to bad weather and building surfaces and your product is not going to be able to circumvent these innate technical challenges. Therefore, disaggregating the performance of your business recommendation model based on city areas is critical so that you can block your product from running in areas where prediction tends to be bad (eg Times Square in NYC) and only focus on those where you know you tend to have high quality predictions.
An example of disaggregated metrics can look like the grid below (credit to source). In the case of the example in the previous bullet, the x and y axes would display city geo labels (neighborhoods or cities):
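A toy sketch of the disaggregation idea, using the nearby-business example: slice prediction error by city area and gate off areas where quality is poor. The data shape, the error metric (mean location error in meters), and the threshold are all assumptions for illustration:

```python
# Hedged sketch: disaggregating a location-based model's error by city area
# so the product can be blocked in areas with poor prediction quality.

from collections import defaultdict

def error_by_area(records):
    """records: iterable of (area, error_meters) prediction outcomes."""
    sums, counts = defaultdict(float), defaultdict(int)
    for area, err in records:
        sums[area] += err
        counts[area] += 1
    return {area: sums[area] / counts[area] for area in sums}

def areas_to_block(records, max_mean_error_m=200.0):
    # Gate the feature off wherever mean error exceeds the threshold.
    return {a for a, e in error_by_area(records).items() if e > max_mean_error_m}
```

The same slicing applies to any segment dimension: merchant vertical, device type, or user cohort, not just geography.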
All models need to be re-trained to remain performant if the ground truth conditions are changing. As a simple example, if you’re personalizing which tweets to show users on their Twitter homepage, you’ll need to re-train your model(s) to keep up with users’ interests and preferred formats. Perhaps in 2016 users would interact with pithy news content and headlines, whereas in 2022 users prefer longer threads on a variety of topics. If you have a dashboard of metrics to monitor performance at training and in serving, you’ll start to see model or feature drift that signals it’s time to retrain. You should have clear tactics in mind for how you’ll collect new training and validation data. Will you need to purchase it from an external source? Will you need to build a new product surface to collect user data? Will you need consent from users in light of data privacy regulations around the world? All of that should be considered, discussed with your engineering counterpart, and accounted for in your roadmap.
As your product matures, it will get increasingly difficult to squeeze out improvements. As the PM, you ought to have an opinion on whether the model(s) you’re building have a terminal end state (i.e., once you reach maturity, it’s time for your team to focus on an entirely new problem altogether) or whether there are other opportunities, like focusing on new markets (e.g., instead of building recommendations for text-based tweets, you may need to build for video content). What is the ROI and strategic rationale for pivoting to new market(s) vs. ending your investment altogether? Whose counsel within your organization will you need to make the right call?
As much as we (I’m guilty of this) may want to live in an elegant, data-driven world, the reality is that you’ll sometimes need to override or augment model output. This is especially likely early in the testing phase.
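A hypothetical sketch of what such an override layer can look like during a pilot: the override table, score semantics, and threshold are all invented for illustration:

```python
# Hypothetical sketch: wrapping model output with a manual override layer
# during early testing, so human judgment can correct the model's mistakes.

OVERRIDES = {"order_123": "approve"}  # curated by the ops team during the pilot

def final_decision(order_id: str, model_score: float, threshold: float = 0.9) -> str:
    if order_id in OVERRIDES:
        # Human judgment wins while the model is still being validated.
        return OVERRIDES[order_id]
    return "flag" if model_score >= threshold else "approve"
```

The useful side effect is that the override log itself becomes labeled ground truth for the next training cycle.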
One of the joys of building products (really: systems) that rely on ML is observing how similar the learning patterns of machines are to the human mind. There is enormous beauty in building ML-informed products and if your interests align with the opportunity to do so, you should go in with an open mind ready to address the many challenges that will come up along the way, even if you are not an ML engineer yourself.
Run the business newsletter by Ibrahim Bashir gives a product executive’s view on questions of product strategy, up-leveling craft, and common challenges (with suggestions for mitigations)
Top start-up jobs database compiled by Linda, an inspiring PM who went the solopreneur route. Previously, she was a Group PM at Faire early on before it hit decacorn status.
This post by Randy Au of the Counting Stuff newsletter explaining the “difference” between quant UX researchers and data scientists. Though most organizations cannot afford / are not set up to have both of these job families, you may still find it useful to see to what extent a UXR or data scientist identifies with the responsibility of “solving problems for the user and business”. This means the PM does not have a monopoly over the user or the business, and therefore as a PM you must be good at leading by bringing engaged cross-functional partners along for the ride.
Reminder on #22WIP, the 2022 Women in Product conference on May 10 and May 11. I’ve already purchased my ticket. Hope to see you there.
If you’ve made it all the way to the bottom, thanks for reading! Consider sharing with someone you think may find this newsletter useful.
P.S. Here’s the far pithier version of this newsletter on Twitter: