Business Use Case
Demand forecasting is a common business use case that aims to estimate future customer demand so a business can assess the impact on its supply chain. The client’s business requirement for this use case was to understand the sales pattern for one of the company’s best-selling products, especially the demand behavior during the COVID period, so the business could better manage its supply chain, inventory, and procurement. In this project, we built an ML model to predict product demand under normal conditions, treating the sales during the COVID period as an anomaly.
Dataset
The dataset used here is the historical daily sales for a specific item sold in the company’s US stores. The data columns are:
trend-index: the company’s internal demand indicator, calculated from Google Analytics.
sales: actual daily sales (in millions of units) for the item across all US stores combined. Dates range from 2015-01-01 to 2020-04-29.
One interesting aspect of this dataset is that the sales data covers both the pre-COVID and COVID periods, which makes it a good use case for exploring data drift and concept drift.
The dataset is relatively small: only ~2,000 rows and 3 columns (date, trend-index, and sales), which serve as the main features in the ML modeling.
Workflow
In this project, we followed the standard data science workflow:
Exploratory data analysis (EDA)
Data preprocessing and feature engineering
Model training, experiments, and evaluation
Model deployment
Inference
Monitoring
Note that the notebooks for each step above are provided in the A360 AI example GitHub repository.
1. Exploratory Data Analysis
Since the business values and goals are clear, the first step is to understand the dataset by generating visualizations and looking for insights. Below is a plot of the sales pattern before COVID; we assumed COVID started in March 2020 in the US. Note that the plot clearly shows the seasonal pattern in sales. Daily sales are shown in blue and the index feature in orange.
The second plot shows sales over the entire data period, including the COVID period. Note that the sales level rose significantly after March 2020: this product was in very high demand during the COVID period.
The third plot gives a closer look at the sales pattern during COVID time. The product demand was consistently increasing throughout March and April 2020.
Taking a closer look at the main feature in the data, trend-index (the client’s internal index from Google Analytics), it is evident that trend-index and sales are highly correlated. However, during the COVID period, trend-index remained at its pre-COVID level and therefore failed to capture the elevated sales after March 2020. In addition, trend-index is a lagging indicator of sales rather than a leading one, which limits its usefulness for prediction.
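A correlation and lag check like the one described can be sketched in a few lines of pandas; the tiny DataFrame below is illustrative stand-in data, not the client’s dataset.

```python
import pandas as pd

# Illustrative stand-in data with the same shape as the real columns.
df = pd.DataFrame({
    "trend_index": [10.0, 12.0, 11.0, 15.0, 14.0, 18.0, 17.0],
    "sales": [1.0, 1.1, 1.0, 1.4, 1.3, 1.7, 1.6],
})

# Contemporaneous correlation between the index and sales.
corr = df["trend_index"].corr(df["sales"])

# Cross-correlation at different lags: shifting the index and checking
# where the correlation peaks tells us whether it leads or lags sales.
for lag in range(-3, 4):
    shifted = df["trend_index"].shift(lag)
    print(lag, df["sales"].corr(shifted))
```

If the strongest correlation occurs at a negative lag, the index trails sales rather than anticipating them, which is what we observed here.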
2. Data Preprocessing
Since the dataset is a time series, we engineered date features: month, day of month, day of year, week of year, and flags for whether the day falls on a weekend, is the first day of the month, or is the last day of the month. We generated these features to capture the seasonal patterns observed during the EDA. For the categorical feature month, we applied one-hot encoding. After feature engineering, the number of features grew from 3 to 20.
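The date-feature engineering above can be sketched with pandas; the column names and sample rows here are illustrative, not the project’s actual schema.

```python
import pandas as pd

# Illustrative frame: date plus the two original features.
df = pd.DataFrame({
    "date": pd.date_range("2015-01-01", periods=10, freq="D"),
    "trend_index": range(10),
    "sales": range(10),
})

# Calendar features capturing the seasonal patterns seen in the EDA.
df["day_of_month"] = df["date"].dt.day
df["day_of_year"] = df["date"].dt.dayofyear
df["week_of_year"] = df["date"].dt.isocalendar().week.astype(int)
df["is_weekend"] = (df["date"].dt.dayofweek >= 5).astype(int)
df["is_month_start"] = df["date"].dt.is_month_start.astype(int)
df["is_month_end"] = df["date"].dt.is_month_end.astype(int)

# One-hot encode the categorical month feature.
month_dummies = pd.get_dummies(df["date"].dt.month, prefix="month")
df = pd.concat([df, month_dummies], axis=1)
```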
Highlight of A360 AI functionality
On the A360 AI platform, we provide an easy-to-use API called MDK (model development kit) to connect cloud storage with the working JupyterLab environment. In SageMaker, you would have to use the Python package boto3 and write 10-20 lines of code to download/upload your data from/to an S3 bucket. With the A360 MDK API (a360ai), you can load the CSV file from S3 and write the feature-engineered dataframe back to S3 with a single line of code, using a360ai.load_dataset and a360ai.write_dataset.
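For contrast, here is a minimal sketch of the boto3 route, with the MDK one-liners shown in comments; the exact a360ai call signatures are assumptions based on the function names mentioned above.

```python
import io

import pandas as pd

def load_csv_from_s3(bucket: str, key: str) -> pd.DataFrame:
    """The boto3 route: client setup plus manual object I/O."""
    import boto3  # imported here so the sketch stays self-contained
    s3 = boto3.client("s3")
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    return pd.read_csv(io.BytesIO(body))

# The A360 MDK equivalents collapse this to one line each
# (illustrative signatures, based on the names above):
# df = a360ai.load_dataset("daily_sales.csv")
# a360ai.write_dataset(df, "daily_sales_features.csv")
```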
3. Model Training
After feature engineering, our training data was ready for model training. As a proof of concept, we built a random forest model to predict sales demand for the next 3 months, treating this as a regression problem. The training data spans January 2015 to September 2019 and the validation data spans October to December 2019; we further split the training data into train/test sets with an 80/20 ratio. We reserved the 2020 data for studying the data drift that occurred during the COVID period.
For our random forest regressor (implemented with scikit-learn in Python), we fine-tuned one hyperparameter, n_estimators, testing 6 values to see which model produced the best test score. We used the A360 MDK to track the experiments we ran with the random forest model.
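The split and tuning loop can be sketched as follows with scikit-learn; the synthetic data and the six candidate n_estimators values are illustrative, not the project’s actual values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 20 engineered features and the sales target.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 20))
y = X[:, 0] * 3 + rng.normal(scale=0.1, size=500)

# 80/20 train/test split; shuffle=False preserves time order for a series.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

# Try six candidate values of n_estimators and record the test score.
scores = {}
for n in [10, 25, 50, 100, 200, 400]:
    model = RandomForestRegressor(n_estimators=n, random_state=0)
    model.fit(X_train, y_train)
    scores[n] = model.score(X_test, y_test)  # R^2 on held-out data

best_n = max(scores, key=scores.get)
```

In a real MDK workflow, the hyperparameter and score logged in each loop iteration would be what the experiment-tracking table displays.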
Highlight of A360 AI functionality
With the A360 MDK, you can easily track your model experiments and hyperparameter tuning. By adding a few lines of code to log your hyperparameters and metrics, MDK tracks each experiment and provides a clean table of metrics against hyperparameters, so you can quickly see which model has the best result.
With this proof-of-concept model, the best test score (R², since this is a regression problem) was around 0.88. We then deployed the model as a cloud endpoint that can serve predictions to the client’s business application.
4. Model Deployment
On the A360 AI platform, deployment is fast and easy. You only need to mark a final run with our MDK API in the modeling notebook. A360 AI’s packaging technology, called Starpack, then fetches the model artifacts and training-data baseline and packages the model as a Docker container. The container is deployed automatically into a scalable, secure Kubernetes pod as a REST API cloud endpoint; only a few clicks on the platform UI (A360 Deployment Hub) are required. Below are a few screenshots of the deployment process on the UI. During deployment, saving the endpoint API key is a required step, as the key is needed to invoke the cloud endpoint. The whole deployment process took only about 5 minutes to complete.
Highlight of A360 AI functionality
Starpack is A360 AI’s key technology for Model Deployment as Code (MDaC), building on the same ideas as Terraform’s Infrastructure as Code (IaC) approach. Starpack uses a declarative YAML specification to deploy ML models automatically via GitOps. Together with the UI console, A360 AI completely abstracts the infrastructure complexity of ML deployment away from data scientists.
5. Inference
Once our REST API is available, we can invoke it with new input data. The endpoint URL can be easily retrieved from the Deployment Hub UI. In the notebook, we simply used the Python requests library with the API key to send input data in JSON format to the endpoint and got the prediction result back. The inference process on the A360 AI platform is very straightforward.
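A minimal sketch of such an invocation is below; the endpoint URL, header name, and payload schema are illustrative assumptions, not the platform’s documented interface.

```python
import requests

# Illustrative values: substitute the URL from the Deployment Hub UI
# and the API key saved during deployment.
ENDPOINT_URL = "https://example.com/models/demand-forecast/predict"
API_KEY = "YOUR_API_KEY"

def predict(features: dict) -> dict:
    """POST the input features as JSON and return the prediction."""
    response = requests.post(
        ENDPOINT_URL,
        json=features,                   # input data in JSON format
        headers={"x-api-key": API_KEY},  # assumed auth header name
        timeout=30,
    )
    response.raise_for_status()
    return response.json()
```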
6. Monitoring Data Drift
Deploying the model so it can actually be used in the business application is a big milestone. However, the job is not done yet: data scientists will want to closely monitor model performance as new data continues to come in.
In this use case, the data covers sales in both the pre-COVID and COVID periods, but the model was built on pre-COVID data. We therefore expect the model to perform poorly during the COVID period, since the sales numbers increased 3-4x at that time. This is a good opportunity to observe data drift in A360’s monitoring dashboard.
Highlight of A360 AI functionality
A360 AI has a pre-built monitoring dashboard that helps data scientists monitor data drift. By default, the dashboard tracks a sigma metric: how far the incoming data deviates from the training-data baseline, measured in training-data standard deviations. If the sigma value exceeds 2-3, data drift is flagged, and data scientists should examine the new incoming data to see whether model retraining is required.
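A sigma-style drift check of this kind can be sketched as follows; the synthetic pre-COVID and COVID sales numbers are illustrative, and the exact formula the dashboard uses may differ.

```python
import numpy as np

# Synthetic stand-ins: pre-COVID training sales vs. elevated COVID sales.
train_sales = np.random.default_rng(0).normal(loc=1.0, scale=0.2, size=500)
covid_sales = np.random.default_rng(1).normal(loc=3.5, scale=0.5, size=60)

# Baseline statistics captured from the training data at deployment time.
baseline_mean = train_sales.mean()
baseline_std = train_sales.std()

# Sigma: distance of the incoming data's mean from the baseline mean,
# in units of the training standard deviation.
sigma = abs(covid_sales.mean() - baseline_mean) / baseline_std
drift_flagged = sigma > 3  # alert threshold used by the dashboard
```

With the 3-4x sales increase seen during COVID, a check like this pushes sigma far past the threshold, which is exactly the alert behavior described below.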
After sending COVID sales data to the REST API for predictions, we can navigate to the A360 monitoring dashboard. See the screenshot example below.
The left panel shows the accumulated sigma value for each feature. If the sigma value is above 3, it turns red, alerting data scientists to possible data drift. The right panel shows how sigma changes over time, which is useful for examining time-series drift and for distinguishing a single anomalous data point from drift that persists across multiple data points.
Conclusion
Here we showcased a business use case, product demand forecasting, and walked through the data science process you can follow on the A360 AI platform to tackle this problem. We also demonstrated how our MDK and Starpack make data scientists more efficient at building and deploying ML models, as well as at processing data from the cloud and monitoring data drift.