Machine Learning Project Life Cycle
Today, the term Machine Learning comes up in every other discussion. In fact, in the Bay Area, it is a staple. We hear about unicorn start-ups as well as established organizations solving significant challenges using Machine Learning. Then, there are many more companies that are still figuring out what it takes, and how long it takes, to implement Machine Learning models in their organization. This article is an effort to share my insight into the process behind this cutting-edge phenomenon, a major paradigm shift from the traditional rule-based system.
Before I go into the details, let me start with how Machine Learning differs from a rule-based system. In a rule-based system, decisions are made based on a set of rules that human experts build on a set of facts, whereas in Machine Learning, decisions are made based on a function (a model) built on patterns that machines extract from data. A rule-based system is considered rigid because it cannot make a decision for a situation its rules do not cover. Machines, on the other hand, can make an estimate for a new case based on similar patterns found in the historical dataset.
Let me take a business case to explain this further. Customer Churn, for example, is a well-known business challenge companies encounter on an ongoing basis. We spend our marketing dollars to acquire customers; they come onboard and after a few months, leave for reasons unknown.
In such scenarios, a rule-based system may decide to send a promotional offer after a certain number of days or months of inactivity (based on the company's definition of churn). Still, the chances of customers returning are pretty low. They may have moved on to a different company or lost interest in the product. In a rule-based system, there is no easy way to predict whether and when a customer is going to churn, and so no easy way to intervene in time. Machine Learning, however, looks at the spending patterns, demographics, psychographics, and past actions of existing customers, and tries to find similar patterns in a new customer to predict their actions. This information can alert the business to take preemptive care and save the customer from churning. This intervention makes a significant difference in the customer experience and impacts the business metrics.
To identify opportunities for Machine Learning and implement it in an organization, we need to make vital changes to the processes built around our traditional rule-based systems:
1. Define a clear use case with a measurable outcome
2. Integrate enterprise-wide data seamlessly
3. Create a lab environment for experimentation
4. Operationalize successful pilots and monitor
Essentially, embrace the paradigm shift: the ML Mindset. Let's see how we incorporate the ML Mindset in a machine learning workflow. This workflow is typically implemented by domain experts, data engineers, data scientists, and software engineers, each contributing to various tasks. These days, however, companies are looking for individuals who know the whole workflow; they go by the title full-stack data scientist.
The following diagram shows all the tasks a full-stack data scientist performs to complete a project. While looking for inspiration to draw an appropriate flowchart of the ML workflow, I came across this AWS presentation. I made a few modifications to the diagram that I believe reflect the essence of an end-to-end machine learning project.
1. Business Problem: The broad ML technique selection/elimination process starts at the very beginning of the Data Science/Machine Learning project workflow when we define the business goal. Here, we understand the business challenges and look for projects that will have a significant impact, whether it is immediate or long term. Many a time, existing business reports will indicate the challenge, and the goal will be to improve a metric or KPI. Other times, a new business initiative will drive the project.
In our specific example of customer churn, the larger goal may be to increase revenue, and one of the strategies may be to improve customer retention. The immediate business goal for this Machine Learning project is then to predict churn with higher accuracy (say, from 10% to 40%). A business arrives at this number after diving into all the KPIs impacting the industry. That discussion is beyond the scope of this post.
2. ML Problem Framing: We then decide on the basic Machine Learning task. When working with structured/tabular data, the task at hand is primarily one of the following: supervised, unsupervised, or reinforcement learning. Many articles explain each task and its applications. Among them, I found two AI and ML flowcharts by Karen Hao from MIT Technology Review, which are all-inclusive and straightforward to understand.
At the end of this stage, we should know the broad ML technique (Supervised/Unsupervised, Regression/Classification/Forecasting) to be implemented for the project and have a good understanding of data availability, model evaluation metrics, and their target score to consider a model reliable.
Customer churn prediction is a supervised classification task where we have historical data of customers who are labeled into two classes: churned or not churned. For a supervised classification task, evaluation metrics are based on the confusion matrix.
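To make the link between the confusion matrix and the usual metrics concrete, here is a minimal, dependency-free Python sketch. The labels are made up for illustration (1 = churned, 0 = not churned), and the function names are my own, not part of any library.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary task; `positive` marks the churned class."""
    counts = Counter()
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            counts["tp" if actual == positive else "fp"] += 1
        else:
            counts["fn" if actual == positive else "tn"] += 1
    return counts["tp"], counts["fp"], counts["fn"], counts["tn"]


def classification_metrics(y_true, y_pred, positive=1):
    """Derive accuracy, precision, recall, and F1 from the four confusion-matrix cells."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(y_true) if len(y_true) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels for ten customers: 1 = churned, 0 = not churned.
y_true = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0, 1, 0]

metrics = classification_metrics(y_true, y_pred)
print(confusion_counts(y_true, y_pred))  # (3, 1, 1, 5)
print(metrics)
```

Here precision answers "of the customers we flagged, how many really churned?" and recall answers "of the customers who churned, how many did we flag?". Which one matters more depends on the cost of a wasted retention offer versus a missed churner.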
3. Data Collection & Integration: In business, collecting data is like a treasure hunt, with all the joy and agony of it. The process is complicated and painstaking, but eventually rewarding. Often enough, we find crucial data stored in a spreadsheet. Retailers with both an online and a physical presence sometimes have promotional flyers in the store that are not uploaded to the data repository. Model accuracy relies heavily on data size, and as I mentioned earlier, it is essential to integrate enterprise-wide data for Machine Learning.
For customer churn, we will need to collect data from various business domains including recency, frequency, monetization, tenure, acquisition channel, promotions, demographics, psychographics, etc.
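As a rough sketch of what integrating these domains can look like, the snippet below joins a hypothetical order log with a hypothetical customer table and derives recency, frequency, monetary value, and tenure per customer. All records, field names, and the `build_customer_features` helper are assumptions for illustration; in practice the sources would be databases, spreadsheets, or exports rather than in-memory lists.

```python
from collections import defaultdict
from datetime import date

# Hypothetical order records: (customer_id, order_date, amount).
transactions = [
    ("c1", date(2023, 1, 5), 40.0),
    ("c1", date(2023, 3, 20), 25.0),
    ("c2", date(2023, 2, 11), 90.0),
    ("c3", date(2022, 11, 2), 15.0),
    ("c3", date(2022, 12, 24), 30.0),
    ("c3", date(2023, 1, 15), 10.0),
]

# Hypothetical customer attributes kept in a separate system.
customers = {
    "c1": {"channel": "search", "signup": date(2022, 6, 1)},
    "c2": {"channel": "referral", "signup": date(2023, 2, 1)},
    "c3": {"channel": "email", "signup": date(2022, 10, 15)},
}


def build_customer_features(transactions, customers, as_of):
    """Join both sources into one feature row per customer, measured at `as_of`."""
    orders = defaultdict(list)
    for customer_id, order_date, amount in transactions:
        orders[customer_id].append((order_date, amount))

    features = {}
    for customer_id, profile in customers.items():
        history = orders.get(customer_id, [])
        last_order = max((d for d, _ in history), default=None)
        features[customer_id] = {
            "recency_days": (as_of - last_order).days if last_order else None,
            "frequency": len(history),
            "monetary": sum(amount for _, amount in history),
            "tenure_days": (as_of - profile["signup"]).days,
            "channel": profile["channel"],
        }
    return features


features = build_customer_features(transactions, customers, as_of=date(2023, 4, 1))
for customer_id, row in features.items():
    print(customer_id, row)
```

Fixing a single `as_of` date matters: every feature is measured at the same point in time, which helps keep information from after the prediction date out of the training data.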
4. Exploratory Data Analysis: This step is where knowledge of data and algorithms helps us decide on the initial set of algorithms (preferably 2–3) that we would like to implement. EDA is the process of understanding our data set through statistical summaries, distributions, and the relationships between features and targets. It helps us build intuition on the data. I want to emphasize the word intuition. While developing intuition, refrain from drawing a conclusion. It is very easy to get carried away and start making assumptions without running a data set through a model. When we perform EDA, we typically look at two variables at a time (we are performing bivariate analysis). Our world, on the other hand, is multivariate: seedling growth rate, for example, depends on sun, water, minerals, etc. Statistical models and ML algorithms implement multivariate techniques under the hood that help us draw conclusions with a certain degree of accuracy. No single factor is responsible for the change. One or two factors may be the driving factors, but there are still many others behind the change. Do keep this thought in mind during EDA. This step is essential, and the guidelines are similar for all datasets. You can create a template and reuse it across projects with minimal modifications.
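A reusable EDA template can start very small. The sketch below, using only the Python standard library and made-up rows, covers the three ingredients mentioned above: a statistical summary, a target rate per category, and a bivariate correlation. The column names and helper functions are hypothetical.

```python
import math
import statistics
from collections import defaultdict

# Made-up customer rows; column names are assumptions for illustration.
rows = [
    {"tenure_months": 2, "monthly_spend": 20.0, "channel": "search", "churned": 1},
    {"tenure_months": 3, "monthly_spend": 25.0, "channel": "search", "churned": 1},
    {"tenure_months": 12, "monthly_spend": 60.0, "channel": "referral", "churned": 0},
    {"tenure_months": 24, "monthly_spend": 80.0, "channel": "referral", "churned": 0},
    {"tenure_months": 5, "monthly_spend": 30.0, "channel": "email", "churned": 1},
    {"tenure_months": 18, "monthly_spend": 55.0, "channel": "email", "churned": 0},
    {"tenure_months": 1, "monthly_spend": 15.0, "channel": "search", "churned": 1},
    {"tenure_months": 30, "monthly_spend": 90.0, "channel": "referral", "churned": 0},
]


def numeric_summary(rows, column):
    """Statistical summary of one numeric column."""
    values = [r[column] for r in rows if r[column] is not None]
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


def target_rate_by(rows, column, target="churned"):
    """Share of positive targets within each category of `column`."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[column]].append(r[target])
    return {category: sum(v) / len(v) for category, v in groups.items()}


def pearson(xs, ys):
    """Pearson correlation between two equally long numeric sequences."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


tenure_summary = numeric_summary(rows, "tenure_months")
churn_by_channel = target_rate_by(rows, "channel")
tenure_vs_churn = pearson([r["tenure_months"] for r in rows], [r["churned"] for r in rows])

print(tenure_summary)
print(churn_by_channel)
print(round(tenure_vs_churn, 3))
```

Each of these views looks at one feature against the target at a time, so treat the output as a source of hypotheses rather than conclusions.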
5. Data Preparation: Our observations made during Exploratory Data Analysis give guidance to various data processing steps. This includes removing duplicates, fixing misspelled words, ensuring data integrity, aggregating categorical values with limited observations, dropping features with sparse data, imputing missing data for important features, handling outliers, and processing and integrating semi-structured & unstructured data.
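A few of these steps can be sketched in plain Python as follows. The rows, the thresholds, and the `prepare` helper are assumptions for illustration, and clipping at the median plus or minus a multiple of the median absolute deviation (MAD) is just one of several reasonable outlier rules.

```python
import statistics
from collections import Counter

# Made-up raw rows with typical problems baked in.
raw_rows = [
    {"customer_id": "c1", "channel": "Search", "monthly_spend": 20.0},
    {"customer_id": "c1", "channel": "Search", "monthly_spend": 20.0},     # exact duplicate
    {"customer_id": "c2", "channel": "search ", "monthly_spend": None},    # untidy label, missing value
    {"customer_id": "c3", "channel": "referral", "monthly_spend": 35.0},
    {"customer_id": "c4", "channel": "billboard", "monthly_spend": 30.0},  # rare category
    {"customer_id": "c5", "channel": "referral", "monthly_spend": 5000.0}, # outlier
    {"customer_id": "c6", "channel": "search", "monthly_spend": 25.0},
]


def prepare(rows, min_category_count=2, mad_multiplier=5.0):
    """Tidy labels, drop duplicates, fold rare categories, impute, and clip outliers."""
    # 1. Tidy categorical labels (copies the rows, so the input is left untouched).
    tidy = [{**r, "channel": r["channel"].strip().lower()} for r in rows]

    # 2. Drop exact duplicates, keeping the first occurrence.
    seen, unique = set(), []
    for r in tidy:
        key = (r["customer_id"], r["channel"], r["monthly_spend"])
        if key not in seen:
            seen.add(key)
            unique.append(r)

    # 3. Fold categories with too few observations into "other".
    counts = Counter(r["channel"] for r in unique)
    for r in unique:
        if counts[r["channel"]] < min_category_count:
            r["channel"] = "other"

    # 4. Impute missing spend with the median of the known values.
    known = [r["monthly_spend"] for r in unique if r["monthly_spend"] is not None]
    fill_value = statistics.median(known)
    for r in unique:
        if r["monthly_spend"] is None:
            r["monthly_spend"] = fill_value

    # 5. Clip outliers to median +/- mad_multiplier * MAD.
    spends = [r["monthly_spend"] for r in unique]
    center = statistics.median(spends)
    mad = statistics.median([abs(s - center) for s in spends])
    low, high = center - mad_multiplier * mad, center + mad_multiplier * mad
    for r in unique:
        r["monthly_spend"] = min(max(r["monthly_spend"], low), high)
    return unique


clean_rows = prepare(raw_rows)
for row in clean_rows:
    print(row)
```

The order of the steps matters: labels are tidied before deduplication so that rows differing only in whitespace or case can be matched, and imputation happens before clipping so that the filled-in values are bounded too.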
6. Feature Engineering: It is a well-known fact that Data Scientists spend the majority of their time exploring the data, preparing it, and engineering features before applying a model. Of the three, Feature Engineering is the most challenging and can make a big difference in model performance. A few standard techniques include transforming data using the log function or normalization, creating or extracting new features from the existing data, feature selection, and dimensionality reduction.
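Two of these techniques, a log transform followed by normalization and a derived feature, can be illustrated with a short standard-library sketch. The numbers and the `avg_order_value` feature are made up for illustration.

```python
import math
import statistics


def log_transform(values):
    """Compress a right-skewed, non-negative feature with log(1 + x)."""
    return [math.log1p(v) for v in values]


def standardize(values):
    """Rescale a feature to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Made-up data: one big spender skews the raw distribution.
monthly_spend = [20.0, 25.0, 30.0, 35.0, 400.0]
orders = [2, 5, 3, 7, 8]

# New feature derived from two existing ones.
avg_order_value = [spend / n for spend, n in zip(monthly_spend, orders)]

log_spend = log_transform(monthly_spend)
scaled_spend = standardize(log_spend)

print([round(v, 2) for v in avg_order_value])
print([round(v, 2) for v in scaled_spend])
```

In a real project, the mean and standard deviation used for scaling would be computed on the training split only and then reused for validation and test data, so that no information leaks across the splits.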
Although the limelight of the workflow is on model training and evaluation, I would like to reiterate that the previous three steps (Exploratory Data Analysis, Data Preparation, and Feature Engineering) consume about 80% of the total time and are closely tied to the success of a Machine Learning project.
7. Model Training & Parameter Tuning: Equipped with the list of 2–3 algorithms from exploratory data analysis (step 4) and the transformed data (steps 5 & 6), we are ready to train the model. For each algorithm, we select ranges of hyperparameters to train over and choose the configuration that yields the best model score. There are various strategies (grid search, random search, Bayesian optimization) available for hyperparameter tuning. We will use Hyperopt, an open-source library that searches the hyperparameter space using the Bayesian optimization technique. We then compare the model evaluation metrics (precision, recall, F1, etc.) for each of the algorithms on the training data and validation data with the best hyperparameters. Besides scoring well, a good model will perform similarly (generate similar scores on the evaluation metrics) on both the training and validation datasets.
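To show the tuning loop and the training-versus-validation comparison without any dependencies, the sketch below uses plain grid search (the simplest of the strategies named above, not Hyperopt's Bayesian optimization) over the single hyperparameter `k` of a tiny k-nearest-neighbours classifier. The synthetic data, the assumed churn rule, and all function names are illustrative assumptions.

```python
import random


def make_customers(n, seed):
    """Synthetic (tenure, spend, churned) rows with features already scaled to [0, 1]."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        tenure, spend = rng.random(), rng.random()
        # Assumed ground truth: short tenure plus low spend means churn, with noise.
        churned = 1 if tenure + spend + rng.gauss(0, 0.2) < 0.8 else 0
        rows.append((tenure, spend, churned))
    return rows


def knn_predict(train, tenure, spend, k):
    """Majority vote among the k training rows closest to the given point."""
    nearest = sorted(train, key=lambda r: (r[0] - tenure) ** 2 + (r[1] - spend) ** 2)[:k]
    return 1 if 2 * sum(r[2] for r in nearest) > k else 0


def accuracy(train, rows, k):
    """Share of rows whose predicted label matches the true label."""
    hits = sum(knn_predict(train, t, s, k) == y for t, s, y in rows)
    return hits / len(rows)


train = make_customers(120, seed=1)
validation = make_customers(60, seed=2)

grid = [1, 3, 5, 9, 15]  # odd values avoid tied votes
results = {
    k: {"train": accuracy(train, train, k), "validation": accuracy(train, validation, k)}
    for k in grid
}

# Pick the configuration with the best validation score, then inspect the gap.
best_k = max(grid, key=lambda k: results[k]["validation"])
gap = results[best_k]["train"] - results[best_k]["validation"]

for k in grid:
    print(k, results[k])
print("best k:", best_k, "train/validation gap:", round(gap, 3))
```

With `k = 1`, every training row is its own nearest neighbour, so the training score is perfect by construction; any drop on the validation set for that setting is exactly the kind of gap that signals overfitting. A library such as Hyperopt replaces the exhaustive loop over `grid` with a guided search, but the comparison of training and validation scores stays the same.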
8. Model Evaluation: We then compare the model evaluation metrics (RMSE, R squared, AUC, precision, recall, F1, etc.) for each of the implemented algorithms on the validation data and test data with the best hyperparameters. Understanding and interpreting relevant model evaluation metrics is the key to success in this step. Our expectation is that good models produce comparable results in validation and test. They won't produce identical results, but AUC/F1/precision/recall/RMSE scores on test and validation sets will be close.
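The "close but not identical" expectation can be turned into a simple check. The sketch below computes AUC from its pairwise definition (the share of positive/negative pairs that the model ranks correctly, with ties counting half) on made-up validation and test scores, then compares the two. The `tolerance` value is a hypothetical threshold that each team would choose for itself.

```python
def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def is_comparable(validation_score, test_score, tolerance=0.05):
    """True when the two scores differ by no more than the chosen tolerance."""
    return abs(validation_score - test_score) <= tolerance


# Made-up churn labels and predicted churn probabilities.
y_validation = [1, 0, 1, 0, 1, 0]
validation_scores = [0.9, 0.2, 0.7, 0.4, 0.3, 0.1]
y_test = [1, 0, 1, 0, 1, 0]
test_scores = [0.8, 0.3, 0.45, 0.5, 0.4, 0.2]

auc_validation = auc(y_validation, validation_scores)
auc_test = auc(y_test, test_scores)

print(round(auc_validation, 3), round(auc_test, 3))
print("comparable:", is_comparable(auc_validation, auc_test))
```

In this toy example the test AUC falls noticeably below the validation AUC, so the check fails. With real data, that would be a prompt to revisit how the data was split or how heavily the hyperparameters were tuned against the validation set.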
9. Model Deployment: Once we are content with the model outcomes, the next step is to run the model with test data and make its output available via API, web applications, reports, or dashboards. If the model is to work with streaming data, it is incorporated into applications through a Web API. If the result is to be delivered to business users for insight, the results are shared in dashboards or automated reports delivered via email. It is essential to confirm prediction accuracy through A/B testing thereafter. Operationalization involves up-front investment in systems that smooth the deployment, maintenance, and adoption of whichever data processes we choose to employ. It is worth the extra effort to avoid runtime failures.
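One common way to read the result of such an A/B test is a two-proportion z-test on the business outcome, for example the retention rate of customers handled by the existing rule versus those handled by the model. The sketch below is a minimal version under that assumption; the group sizes and retention counts are made up.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for the difference between two proportions (B minus A)."""
    rate_a, rate_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_error == 0:
        raise ValueError("test is undefined when every outcome is identical")
    z = (rate_b - rate_a) / std_error
    # Two-sided p-value from the standard normal CDF, written with erf.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value


# Made-up experiment: group A keeps the rule-based offer, group B uses model predictions.
retained_a, size_a = 700, 1000
retained_b, size_b = 760, 1000

z, p_value = two_proportion_z_test(retained_a, size_a, retained_b, size_b)
print(round(z, 2), round(p_value, 4))
```

A small p-value suggests the difference in retention is unlikely to be chance alone. The test assumes customers were assigned to the two groups at random and that the sample size was fixed before looking at the results.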
10. Monitoring Drift & Decay: Monitoring production models is different from monitoring other applications. A product recommendation model won't adapt to changing tastes. A loan risk model won't adapt to changing economic conditions. With fraud detection, criminals adapt as models evolve. Data science teams need to be able to detect and react quickly when models drift. As we detect drift and decay, we are back to the beginning of the cycle where we may adjust the business goal, collect more data, and repeat the cycle.
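One simple way to detect drift is to compare the distribution of a feature or of the model's scores in production against the distribution seen at training time. The sketch below uses the Population Stability Index (PSI) for that comparison; PSI is my choice for illustration, and the bin edges, sample values, and alert threshold are assumptions.

```python
import math


def population_stability_index(expected, actual, bin_edges, eps=1e-6):
    """PSI between a baseline sample and a recent sample over shared, sorted bin edges."""

    def bin_shares(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            # Index of the bin = number of edges the value has reached or passed.
            counts[sum(v >= edge for edge in bin_edges)] += 1
        # Floor empty bins at eps so the logarithm below stays defined.
        return [max(c / len(values), eps) for c in counts]

    expected_shares = bin_shares(expected)
    actual_shares = bin_shares(actual)
    return sum(
        (a - e) * math.log(a / e)
        for e, a in zip(expected_shares, actual_shares)
    )


bin_edges = [0.25, 0.5, 0.75]

# Made-up churn scores: at training time versus two later production windows.
training_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
stable_scores = [0.15, 0.22, 0.35, 0.45, 0.55, 0.65, 0.85, 0.95]
shifted_scores = [0.6, 0.7, 0.8, 0.9, 0.8, 0.9, 0.85, 0.95]

psi_stable = population_stability_index(training_scores, stable_scores, bin_edges)
psi_shifted = population_stability_index(training_scores, shifted_scores, bin_edges)

ALERT_THRESHOLD = 0.25  # commonly quoted rule of thumb; calibrate for your own data
print(round(psi_stable, 3), psi_stable > ALERT_THRESHOLD)
print(round(psi_shifted, 3), psi_shifted > ALERT_THRESHOLD)
```

A check like this can run on a schedule for each important feature and for the model output. An alert is the trigger to go back to the start of the cycle, not a diagnosis of what changed.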
11. Delivering Model Output: When a model output is not directly consumed by a web application, it is often used to deliver business insights through a dashboard or report. One of the most difficult tasks of machine learning projects is explaining a model's outcomes to an audience. Data visualization tools like Tableau or Google Data Studio are very helpful in building storylines to share insights from Data Science work.
A Machine Learning cycle tends to take between 3 and 6 months, followed by ongoing maintenance. ML is evolving, and the cycle length will perhaps continue to shrink with automation, but the steps in the workflow stay the same. I encourage you to embrace the ML Mindset. Take a look at the current projects in your team or organization and think of ways to integrate Machine Learning that will impact your business metrics significantly.