What I Learned: Successful ML Models in Production

Preamble

Nathan Hanks
Nov 22, 2023

I’ve been wanting to do a series called “what I learned”. I thought I would start with something that has been a tremendous learning experience for me and has also brought me a great amount of joy: seeing an ML model perform in production. Hopefully this will show the process I went through, and highlight that the process usually dictates the results. In other words, you usually get out of something what you put into it.

If you take anything away from this, I hope it is that data science is the foundation of successful models in production.

Top 10 Opportunities for Learning (aka hard problems to solve in the real world)

  1. Real-world data is very noisy: sensors (in this instance, cameras) go bad, and sensors stop sending signals (pictures) because, for example, their batteries die.
  2. The process of data collection is messy: sensors create many duplicate “signals” in quick succession, and we want to predict unique events. The data in a signal is also not always consistent across sensor makes.
  3. I’m using an object detection model (AWS Rekognition Custom Labels) to recognize objects. Because this is a very imbalanced dataset, it is inherently hard to get good quality images to train the object detection model. (this is just one of the models used in the overall solution)
  4. The data is extremely seasonal, the seasonality does not arrive at exactly the same time each year, and its strength varies due to many factors in nature, including weather and its extremes.
  5. You have to frame the problem carefully: from a classification standpoint, this is an extremely imbalanced problem, because the class to predict is very rare. Framed in time series prediction terms, the data distribution is “zero-inflated count data”, meaning that non-zero counts are rare events. I would assume that most rare-event problems are zero-inflated. I could also frame the problem as time series classification.
  6. Whether I chose classification or forecasting, I was going to have to be smart about which metrics I used to evaluate my model, because of the class imbalance and the zero inflation.
  7. Feature engineering matters more than “superman models”. But be careful to engineer features you will actually have at prediction time.
  8. I first framed it as a classification problem and I didn’t trust my first model. So I framed it differently and tried different models. This was invaluable.
  9. Random Forests, XGBoost, CatBoost, et al. will give you feature importances. But there is much more to explainability and prediction quality than feature importances alone.
  10. Users of my app want an easy way to get value from the predictive model. My users also have lots of biases that need to be overcome: “rules of thumb”, lore that has been passed down through generations, and so on. How you understand your model and translate that into a user interface that makes it easy is critical.

Problems 1, 2, & 3

  • The first thing I did was build the Rekognition Custom Labels model. I don’t want to spend a lot of time here, but my advice is quality over quantity. Pay attention to your metrics per class. Get more data, and get rid of bad data in your training set. Make sure your bounding boxes are good. I am continually amazed at how well it predicts on difficult, real-world pictures when I train on high-quality images.
  • At one point, I used an AWS Glue FindMatches transform, training an ML classification model to identify duplicate images. That was interesting, but in the end I created a deterministic algorithm, executed it in a PySpark job, and that was better (see the first sketch after this list). This is a key learning: if you can solve a problem deterministically, do that. ML is for when a deterministic approach isn’t possible.
  • The ingestion of a significant number of images is a scale problem made for the cloud. S3, SQS, Lambda (and Lambda Powertools), DynamoDB, and Step Functions are game changers. Use the idempotency utility from Powertools to avoid duplicate inserts (second sketch below). There are so many great patterns and practices built into these services; I can’t believe software developers don’t want to use them every day. I could write entire articles about the capabilities here — I will save that for another day.
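
To make the de-duplication point concrete, here is a minimal sketch of the kind of deterministic PySpark job I mean. The schema, S3 paths, and 5-minute burst threshold are hypothetical stand-ins, not my exact pipeline:

```python
# Deterministic de-duplication sketch: treat signals from the same sensor that
# arrive within a short window as one event, keeping only the first occurrence.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dedupe-signals").getOrCreate()

signals = spark.read.parquet("s3://my-bucket/raw-signals/")  # hypothetical path

w = Window.partitionBy("sensor_id").orderBy("captured_at")

deduped = (
    signals
    .withColumn("prev_ts", F.lag("captured_at").over(w))
    .withColumn(
        "gap_seconds",
        F.col("captured_at").cast("long") - F.col("prev_ts").cast("long"),
    )
    # Keep the first signal in each burst: no previous signal from this sensor,
    # or a gap larger than the (hypothetical) 5-minute burst threshold.
    .filter(F.col("prev_ts").isNull() | (F.col("gap_seconds") > 300))
    .drop("prev_ts", "gap_seconds")
)

deduped.write.mode("overwrite").parquet("s3://my-bucket/unique-events/")
```

And here is a sketch of the Powertools idempotency utility guarding an ingestion Lambda. The table name and event shape are assumptions for illustration:

```python
# Idempotent Lambda handler sketch: duplicate deliveries of the same event
# return the cached result instead of inserting the record twice.
from aws_lambda_powertools.utilities.idempotency import (
    DynamoDBPersistenceLayer,
    idempotent,
)

persistence_layer = DynamoDBPersistenceLayer(table_name="IdempotencyTable")

@idempotent(persistence_store=persistence_layer)
def handler(event, context):
    # ... write the record to DynamoDB, kick off Step Functions, etc. ...
    return {"statusCode": 200, "image_key": event["image_key"]}
```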

Problem 4

It turns out, I don’t really care about signals that come from some parts of the year. For many reasons inherent in the data, I really only care about certain timeframes because people only care about my model in those timeframes. In fact, using data in certain periods only exacerbates my zero-inflated/class imbalance problem. How did I find this out? Spending lots of time with the data, analyzing it, and deep thinking about what I am actually trying to predict. I also created several models that included this data to see what it did to the prediction quality. Exploratory data analysis matters, and building initial models and analyzing the results matter.

I also spent time feeding the data into Meta’s Prophet model. Why? Because it made it very easy to “see” seasonality. I never liked Prophet’s time series prediction capability, but its seasonality capability was helpful and took me down several learning paths. Prophet’s ability to quickly diagnose and visualize seasonality was worthwhile, and it helped me realize, “I don’t need all this data”.
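
For reference, here is a minimal sketch of using Prophet purely as a seasonality lens rather than as a forecaster. The file name and column setup are hypothetical:

```python
import pandas as pd
from prophet import Prophet

# Daily event counts with Prophet's expected columns: "ds" (date) and "y" (count).
df = pd.read_csv("daily_event_counts.csv", parse_dates=["ds"])  # hypothetical file

m = Prophet(yearly_seasonality=True, weekly_seasonality=True, daily_seasonality=False)
m.fit(df)

# Predict over the training range only, then plot the trend and seasonality
# components to "see" when the signal actually occurs during the year.
forecast = m.predict(df[["ds"]])
fig = m.plot_components(forecast)
fig.savefig("seasonality_components.png")
```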

Problems 5, 6, & 8

From the prior phase, I knew I had a class imbalance problem. I knew that, for classification, it meant I might need SMOTE or similar techniques, and that I was going to need to pay attention to metrics other than plain accuracy. I won’t go deep into it here, but understand balanced accuracy, precision, and recall for imbalanced problems.
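
As a hedged sketch of what that looks like in practice (the data here is synthetic, standing in for my real features):

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an imbalanced problem: roughly 1% positive class.
X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Oversample ONLY the training split; the test set keeps its real-world imbalance.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

clf = RandomForestClassifier(random_state=42).fit(X_res, y_res)
pred = clf.predict(X_test)

print("balanced accuracy:", balanced_accuracy_score(y_test, pred))
print(classification_report(y_test, pred, digits=3))  # per-class precision/recall
```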

When I chose to frame this as a time series prediction problem, I went down many paths. Someone told me “timeseries is a b$$ch” and all I can say is, yes it is. With all that said, I don’t feel like any of this was wasted. From deep learning models to XGBoost for forecasting, it was all helpful.

At first I kept wondering why so many SOTA models kept wanting to predict zeros. Then I started googling, and sure enough, I learned that this is a real challenge with “zero-inflated count problems”. In other words, your model can predict zero and be right a lot of the time. It’s the same problem as accuracy metrics in imbalanced classification. So there are all sorts of appropriate metrics and loss functions, from the Poisson distribution to Tweedie. It turns out this comes up frequently across many industries and their business problems. And it’s even more annoying when you are trying to predict a count of 1 or 2, maybe 3, instead of zero. Super hard. But I learned something that I can apply in other domains. I even toyed with anomaly detection models at one point, framing non-zero counts as anomalies.
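
For what it’s worth, the count-friendly objectives are easy to try in XGBoost. This is a sketch on synthetic zero-inflated counts, not my actual features or tuning:

```python
import numpy as np
import xgboost as xgb

# Synthetic zero-inflated counts: mostly zeros, occasionally 1-3 events.
rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 8))
y = rng.poisson(lam=0.05, size=1_000)  # roughly 95% zeros

# Tweedie (variance power between 1 and 2) handles the mass at zero plus a
# skewed positive tail; "count:poisson" is the other objective worth trying.
model = xgb.XGBRegressor(
    objective="reg:tweedie",
    tweedie_variance_power=1.3,
    n_estimators=200,
    max_depth=4,
)
model.fit(X, y)
print(model.predict(X[:5]))
```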

The point here is that whether I framed it as classification or time series prediction, I kept gaining a deeper understanding of my data, how it occurs, and why it occurs, and I kept asking myself what I am really trying to predict and how I am going to explain it.

(I ended up framing it as classification, and using SHAP interaction plots, which I will talk about below)

Problem 7

A professor once told me, “Be great at feature engineering, not at creating superman models”. This process made me truly understand that. In retrospect, I think it results in real-world predictability because it ties your model back to the real world. Tweaking all the hyperparameters can be fun, and I’m not saying you shouldn’t tune them, but every model I created got significantly better (meaning my metrics improved more) with more effort spent on feature engineering.

You also have to think critically about what data you will have at prediction time. While training and testing, it is easy to create features that leak training data into your test set, and then to realize you won’t even have access to that data at prediction time. This goes back to trying to create superman models. The sketch below shows the kind of prediction-time-safe features I mean.
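
This is a hedged sketch of building only backward-looking features, so nothing depends on data that won’t exist when the prediction is made. The file and column names are hypothetical:

```python
import pandas as pd

# Daily event counts per site; every feature below uses only past observations,
# so the same code can run at prediction time without leakage.
df = pd.read_csv("daily_event_counts.csv", parse_dates=["date"])  # hypothetical
df = df.sort_values(["site_id", "date"])

df["count_lag_1"] = df.groupby("site_id")["event_count"].shift(1)   # yesterday
df["count_lag_7"] = df.groupby("site_id")["event_count"].shift(7)   # a week ago
df["trailing_mean_7"] = (
    df.groupby("site_id")["event_count"]
      .transform(lambda s: s.shift(1).rolling(7).mean())             # past week average
)
```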

I had access to users throughout this process and that allowed me to ask questions about the problem and the data. I got to hear them talk about what they do that they think influences the data, or what they see that could be an issue, or rules of thumb they have. All these things help you build better features.

Problem 9

You can’t get away from explainability of your model. It’s how people in the real world expect you to talk, and it convinces them that you know what you are talking about. Feature importances are good, but they are not everything. For my problem, when I created my first model, I got the feature importances and said, “here is the answer”. Then I got the first question from the first tester, and I realized I didn’t have the answer. But I remembered an article I had read about SHAP. So I dug into SHAP, and that’s when explainability came alive, so to speak. I won’t try to dive into SHAP here, but for me the true predictive power of the model comes alive with the SHAP interaction diagrams. In fact, I would say those boost the value of my model’s predictions, because the interactions between features allow me to give real insights to my users.
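
To show what I mean, here is a minimal SHAP sketch on synthetic data; the feature indices and model are placeholders, not my production setup:

```python
import shap
import xgboost as xgb
from sklearn.datasets import make_classification

# Synthetic imbalanced classification data standing in for the real features.
X, y = make_classification(n_samples=2_000, n_features=10, weights=[0.95, 0.05], random_state=0)

model = xgb.XGBClassifier(n_estimators=200, max_depth=4).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Global view: which features matter and in which direction.
shap.summary_plot(shap_values, X)

# Interaction view: how feature 0's effect changes with the level of feature 3.
shap.dependence_plot(0, shap_values, X, interaction_index=3)
```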

Problem 10

I’m not going to go into all the nuances of design and UX/UI, but I continue to believe it is just as critical as the results of your model. Most apps can’t render SHAP graphs, and those graphs would probably just confuse users anyway. So you have to think very hard about how you present a prediction and explain it, so that it is actionable in a way that lets the user get value from your app.

It goes back to the Jobs To Be Done theory: people pull products into their lives to make progress on the problems they face. That’s about as simply as I can say it. You have to surface your model in a way that helps the user make progress against their problem.

Summary

In the end, I would say that creating good ML models starts with data science: spending time with the data, with the people who will use your model, and/or with the people who know the processes that create the data. Then it is about forming hypotheses and testing them by framing the problem in many ways, trying different types of models, and iterating. By iterating, I mean going back to the data science, re-framing, asking more questions, and building more models.

And in my next project, I will have an even better toolset of techniques that will help me think about, and solve the problem.

Last thought — this is what I love so much about data science and machine learning, and what this project taught me: it’s never going to be static. New problems will require completely different framings and problem statements. New models will need to be created. New metrics and loss functions will need to be applied, specific to the problem. New features will need to be created.
