The main aim of analyzing the backorder data is to find out why a particular product is not in stock when a potential customer wishes to buy it. Companies therefore want to understand the causes of backorders and the measures that might minimize them and grow the business. Several causes can lead to a backorder.
- Delay in placing the order: Based on the order cycle, stock levels, and other relevant factors, orders are placed with the vendor daily, weekly, monthly, or at some other fixed interval. After an order is placed, a designated person frequently reviews it and judges whether it needs to be processed. A delay in this decision-making process results in a backorder.
- Warehouse discrepancies: A discrepancy may occur when the recorded stock, whether maintained digitally or manually, does not match the actual quantity present in the warehouse.
- Human error: Mistakes made by individual people can also cause a backorder.
- Production shortfalls: If production in the factories lags because of internal issues that no e-commerce company can control on its own, a backorder can result.
- High demand: High demand is caused by a large number of orders placed by customers, whether because customers are purchasing in an unusual pattern or because of seasonal or festival demand. In such situations, we must find out the reason behind the demand.
Identifying the products with the highest chances of shortage before a shortage occurs presents a strong opportunity to improve a company's overall performance. Machine learning is applied to the design and development of predictive models that assess all areas of management, providing essential insights for companies to understand and act on when changing their operational structure.
Use of ML to Solve this Problem
To predict whether a particular product will go on backorder, we need to observe the patterns in the features (independent variables) on which the prediction will be based. Since we have a large volume of data points, it is not feasible to review each one manually to find those patterns. Machine learning resolves this issue: it handles large volumes of data and extracts the patterns, from which we can build a rule-based system to predict the target variable.
Source of data
The data is taken from the following GitHub repository: https://github.com/rodrigosantis1/backorder_prediction, which mirrors the original data source, Kaggle's "Can You Predict Product Backorders?" competition. The data.rar file contains 2 CSV files: 1. Kaggle_Training_Dataset_v2.csv and 2. Kaggle_Test_Dataset_v2.csv.
Kaggle_Training_Dataset_v2.csv contains 1,687,861 rows (records) and 23 columns (features). Kaggle_Test_Dataset_v2.csv contains 242,075 rows (records) and 23 columns (features).
In both CSV files the columns have 2 different data types: object and float64. The float64 columns hold the numerical features (15 in total) and the object columns hold the categorical features (8 in total).
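These dtype counts can be verified directly in pandas; the sketch below uses a tiny stand-in DataFrame (hypothetical values, not the full CSV):

```python
import pandas as pd

# Miniature stand-in for Kaggle_Training_Dataset_v2.csv
df = pd.DataFrame({
    "sku": ["A1", "B2"],                # object (categorical)
    "national_inv": [10.0, 5.0],        # float64 (numerical)
    "went_on_backorder": ["No", "Yes"], # object (categorical)
})

# Group columns by dtype, mirroring the object/float64 split described above
dtype_counts = df.dtypes.value_counts()
numerical_cols = df.select_dtypes(include="float64").columns.tolist()
categorical_cols = df.select_dtypes(include="object").columns.tolist()
```

On the real files, the same two lists would come back with 15 and 8 columns respectively.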
The dataset columns contain the following data:
- SKU: Stock Keeping Unit for the product (it is actually a unique ID for every variation of the product)
- national_inv: Current inventory level for the part
- lead_time: Transit time for product (if available)
- in_transit_qty: Amount of product in transit from source
- forecast_3_month: Forecast sales for the next 3 months
- forecast_6_month: Forecast sales for the next 6 months
- forecast_9_month: Forecast sales for the next 9 months
- sales_1_month: Sales quantity for the prior 1 month time period
- sales_3_month: Sales quantity for the prior 3 month time period
- sales_6_month: Sales quantity for the prior 6 month time period
- sales_9_month: Sales quantity for the prior 9 month time period
- min_bank: Minimum recommended amount to stock
- potential_issue: Source issue for part identified
- pieces_past_due: Parts overdue from source
- perf_6_month_avg: Source performance for prior 6 month period
- perf_12_month_avg: Source performance for prior 12 month period
- local_bo_qty: Amount of stock orders overdue
- deck_risk: Part risk flag
- oe_constraint: Part risk flag
- ppap_risk: Part risk flag
- stop_auto_buy: Part risk flag
- rev_stop: Part risk flag
- went_on_backorder: The product actually went on backorder. This is the target value.
Existing approaches to the problem
From the problem described above, it is clear that this is a binary classification problem and that the data is highly imbalanced, since the items that go on backorder (positive class) are rare compared to the items that do not (negative class). The following metrics are employed to evaluate model performance.
- ROC-AUC: The Receiver Operating Characteristic (ROC) curve is an evaluation metric for binary classification problems. It plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold values. The Area Under the Curve (AUC) measures the ability of a classifier to distinguish between classes and is used as a summary of the ROC curve. The higher the AUC, the better the model is at distinguishing the positive class from the negative class. So, if model m1 has a higher AUC than model m2, then at most threshold values m1 separates the positive class from the negative class better than m2 does.
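As an illustrative sketch, the ROC curve and its AUC can be computed with scikit-learn from a classifier's predicted probabilities; the labels and scores below are toy values, not results from the dataset:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Toy ground truth (1 = went on backorder) and predicted probabilities
y_true = [0, 0, 0, 1, 0, 1, 1, 0]
y_score = [0.1, 0.3, 0.2, 0.8, 0.4, 0.9, 0.65, 0.7]

# TPR and FPR at every threshold, then the area under that curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc_value = roc_auc_score(y_true, y_score)  # 14 of the 15 pos/neg pairs are ranked correctly
```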
- Precision-Recall AUC: Precision is the fraction of the predicted positive points that are actually positive.
Precision(Pr) = True Positive(TP)/(True Positive(TP)+False Positive(FP)).
Recall, on the other hand, is the fraction of the actual positive points in the dataset that are predicted positive.
Recall(Rc) = True Positive(TP)/(True Positive(TP)+False Negative(FN)).
Both precision and recall are focused on the positive class (the minority class) and are unconcerned with the true negatives (majority class). Because the dataset is imbalanced, precision and recall make it possible to assess the performance of a classifier on the minority class. The AUC of the precision-recall curve is very important, as the business needs to select a suitable threshold based on the trade-off between precision and recall.
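Precision, recall, and the area under the precision-recall curve can likewise be sketched with scikit-learn; the labels, scores, and the 0.5 threshold below are illustrative assumptions:

```python
from sklearn.metrics import auc, precision_recall_curve, precision_score, recall_score

y_true = [0, 0, 1, 1, 0, 1, 0, 0]
y_score = [0.2, 0.4, 0.35, 0.8, 0.1, 0.7, 0.6, 0.3]

# Pr = TP / (TP + FP) and Rc = TP / (TP + FN) at a fixed 0.5 threshold
y_pred = [1 if s >= 0.5 else 0 for s in y_score]
pr = precision_score(y_true, y_pred)
rc = recall_score(y_true, y_pred)

# Area under the precision-recall curve across all thresholds
precision, recall, _ = precision_recall_curve(y_true, y_score)
pr_auc = auc(recall, precision)
```

At the 0.5 threshold this toy example has TP=2, FP=1, FN=1, so both precision and recall come out to 2/3.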
After applying different supervised machine learning algorithms to the given dataset, such as Logistic Regression, Classification Tree, Random Forest, Gradient Boosting Tree, and Blagging (balanced bagging), Blagging gives the best ROC-AUC score.
The preliminary observation on the training dataset:
- The data types in the dataset are object (string data) and float64 (numerical data).
- The dataset contains 15 numerical features that are not normalized, i.e., the data is not scaled.
- The dataset contains 7 categorical features which contain only 2 unique values viz. — Yes and No.
- In the given dataset, only the "lead_time" column has null values. Out of 1,687,861 rows, 100,894 are null, i.e., about 5.98% of the data.
- The last row in the training dataset contains null values which need to be removed.
- The total number of training data points is 1,687,861, and among them 11,293 went on backorder. Only about 0.67% of the training dataset went on backorder, so the data is highly imbalanced.
- The columns "perf_6_month_avg" and "perf_12_month_avg" have a maximum value of 1 and a minimum value of -99. It looks like the missing values were replaced with -99.
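A minimal sketch of turning the -99 sentinel back into a real missing value with pandas, so that downstream imputation can treat it as missing (the tiny frame below is a stand-in for the actual columns):

```python
import numpy as np
import pandas as pd

# Toy frame: -99 in the perf_* columns is a placeholder for a missing value
df = pd.DataFrame({
    "perf_6_month_avg": [0.9, -99.0, 0.5],
    "perf_12_month_avg": [-99.0, 0.8, 0.7],
})

# Replace the sentinel with NaN in both performance columns
perf_cols = ["perf_6_month_avg", "perf_12_month_avg"]
df[perf_cols] = df[perf_cols].replace(-99.0, np.nan)
```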
If we observe the boxplot of "lead_time" for the majority class, we see a few outliers above the top whisker that need to be handled.
From the correlation matrix above, we can observe that the following independent variables are highly correlated:
- forecast_3_month, forecast_6_month, and forecast_9_month have correlations of around 0.99, so these 3 features behave nearly identically and we can pick one of them for the final model.
- sales_1_month, sales_3_month, sales_6_month, and sales_9_month have correlations ranging from 0.82 to 0.92, so these 4 features behave very similarly and we can pick one of them for the final model.
- perf_6_month_avg and perf_12_month_avg have a correlation of around 0.97, so these 2 features behave nearly identically and we can pick one of them for the final model.
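Such highly correlated pairs can also be detected programmatically; the toy frame and the 0.95 cut-off below are illustrative choices, not the article's exact procedure:

```python
import pandas as pd

# Toy numerical frame; the two forecast columns are made nearly identical on purpose
df = pd.DataFrame({
    "forecast_3_month": [1.0, 2.0, 3.0, 4.0],
    "forecast_6_month": [1.1, 2.0, 3.1, 4.0],
    "national_inv":     [9.0, 1.0, 7.0, 2.0],
})

# Absolute pairwise correlations, then every pair above the cut-off
corr = df.corr().abs()
high_pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > 0.95
]
```

From each pair found this way, one feature can be dropped before training.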
For most of the features, the mean is higher than the 75th percentile value, which indicates that the features are right-skewed. From this observation, we can say the right-skewed data might follow a log-normal distribution.
Looking at the QQ plots of these features confirms that they are approximately log-normally distributed.
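As a rough numerical counterpart to reading QQ plots by eye, one can compare the straightness (the r value reported by scipy's probplot) of a feature before and after a log transform; the sketch below uses synthetic log-normal data rather than the actual features:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic right-skewed sample standing in for a feature like sales_3_month
x = rng.lognormal(mean=0.0, sigma=1.0, size=2000)

# probplot's r measures how straight the QQ plot is against a normal distribution
_, (_, _, r_raw) = stats.probplot(x, dist="norm")
_, (_, _, r_log) = stats.probplot(np.log(x), dist="norm")
```

For log-normal data, r_log is much closer to 1 than r_raw, which is exactly the visual impression the QQ plots give.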
First cut Approach
From the solutions and research papers studied above, we conclude that as a baseline model for this prediction we would use a KNN classification model. To further improve accuracy we would use SVR, as it produced a smaller relative mean square error along with higher forecast precision, and as an ensemble method we can use a Decision Tree classifier or Random Forest classifier. The reasoning behind ensemble models is that it is better to rely on more than one model rather than a single one. There are two main, related reasons to use an ensemble over a single model:
Performance: An ensemble can make better predictions and achieve better performance than any single contributing model.
Robustness: An ensemble reduces the spread or dispersion of the predictions and model performance.
Handling Missing Values
Iterative imputation is a strategy for imputing missing values by modeling each feature that has missing values as a function of the other features, in round-robin fashion. It finds all the rows of the data frame in which a particular feature is not missing, fits a regression model on them, and finally predicts the missing values.
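A minimal sketch of this strategy using scikit-learn's IterativeImputer; the toy matrix, with its second column standing in for lead_time, is illustrative:

```python
import numpy as np
# IterativeImputer is still experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy matrix: the second column plays the role of lead_time, with one missing entry;
# the columns are related by y = 2x, so the gap should be filled with roughly 6
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, np.nan],
    [4.0, 8.0],
])

# Each feature with gaps is regressed on the others, round-robin
imputer = IterativeImputer(random_state=0)
X_filled = imputer.fit_transform(X)
```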
Handling Categorical Variable
In the given dataset the categorical variables take 2 different values, "Yes" and "No". We replaced the "Yes" value with 1 and the "No" value with 0 with the following code snippet.
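A minimal sketch of this replacement, assuming the training data sits in a pandas DataFrame named df (the miniature frame below is illustrative); it also drops the SKU identifier:

```python
import pandas as pd

# Toy frame with the same Yes/No flag convention as the dataset
df = pd.DataFrame({
    "sku": ["A1", "B2", "C3"],
    "deck_risk": ["Yes", "No", "Yes"],
    "went_on_backorder": ["No", "No", "Yes"],
})

# Map the binary flags to 1/0 in every object column except the SKU identifier
flag_cols = [c for c in df.columns if c != "sku"]
df[flag_cols] = df[flag_cols].replace({"Yes": 1, "No": 0})

# SKU is a unique ID with no predictive value, so it is removed
df = df.drop(columns="sku")
```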
In the code above, we can see that "SKU", a categorical feature with a large number of unique values, is not important for prediction. That is why we removed that feature.
Handling data Imbalance
There are several ways to handle data imbalance.
- Under-sampling: In this approach, the number of majority-class data points is reduced to the number of minority-class data points. Random under-sampling is one such method: data points of the majority class are removed at random until the classes are balanced. The drawback is that removing data means losing information that may be useful for prediction.
- Over-sampling: In this approach, the number of minority-class data points is increased to match the majority class by replication or other means. The Synthetic Minority Oversampling Technique (SMOTE) is a popular over-sampling method: it finds the k nearest neighbors of each minority-class data point and, by selecting one of them at random, produces a new synthetic data point.
- Ensemble: A few ensemble techniques handle the imbalanced dataset by themselves. Several bagging, boosting, and adaptive boosting mechanisms address data imbalance while training the model.
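As a sketch of the first option, random under-sampling can be done with plain pandas (the toy frame below is illustrative; libraries such as imbalanced-learn provide ready-made implementations of these techniques):

```python
import pandas as pd

# Toy imbalanced frame: 6 negatives, 2 positives
df = pd.DataFrame({
    "national_inv": [5, 8, 2, 9, 1, 7, 3, 4],
    "went_on_backorder": [0, 0, 0, 0, 0, 0, 1, 1],
})

minority = df[df["went_on_backorder"] == 1]
majority = df[df["went_on_backorder"] == 0]

# Randomly keep only as many majority rows as there are minority rows
majority_down = majority.sample(n=len(minority), random_state=42)
balanced = pd.concat([majority_down, minority])
```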
Handling the Skew & Outlier
As the data contains outliers and is right-skewed (positively skewed), it is important to scale it. We used 2 different approaches to scale the data before training.
- RobustScaler: RobustScaler is used because it scales the data while avoiding the impact of outliers, using the formula value = (value - median) / (p75 - p25), where p75 and p25 are the 75th and 25th percentile values.
- Log transform & StandardScaler: Alternatively, we first apply a log transform to the skewed data and then apply StandardScaler on top of it.
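Both approaches can be sketched with scikit-learn's scalers; the single toy feature below, with one large outlier, is illustrative:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Right-skewed toy feature with one large outlier
x = np.array([[1.0], [2.0], [2.0], [3.0], [100.0]])

# Approach 1: RobustScaler computes (value - median) / (p75 - p25)
robust = RobustScaler().fit_transform(x)

# Approach 2: log transform first, then standardize to zero mean, unit variance
log_std = StandardScaler().fit_transform(np.log1p(x))
```

Note that the outlier barely affects the median and IQR used by RobustScaler, whereas it would dominate the mean and standard deviation of the raw data.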
Comparison of all models
Several classification models were applied to the 2 different types of scaled data: Decision Tree, Random Forest, Balanced Bagging, XGBoost, AdaBoost, a custom ensemble, and a Multi-Layer Perceptron. For all models, Randomized Search CV was used for hyperparameter tuning. After running all the models, we found that the highest ROC-AUC score, 0.932, was achieved by the Random Forest classifier on both types of scaled data.
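A sketch of this tuning setup for the Random Forest case; the synthetic data and the parameter grid below are assumptions, since the article does not list the exact search space:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic, imbalanced stand-in for the scaled training data
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.9, 0.1], random_state=0)

# Hypothetical search space (not the article's exact grid)
param_dist = {
    "n_estimators": [50, 100, 200],
    "max_depth": [4, 8, None],
    "min_samples_leaf": [1, 2, 4],
}

# Sample 5 random configurations, scored by ROC-AUC with 3-fold CV
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_dist,
    n_iter=5,
    scoring="roc_auc",
    cv=3,
    random_state=0,
)
search.fit(X, y)
```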
Following is the score comparison table:
Finally, after generating the final model, we created a simple web app that predicts product backorders in bulk.
Conclusion & Future Work
In conclusion, after applying all the relevant models, we find that when we consider all the independent variables and apply hyperparameter tuning, the Random Forest classifier works best, performing nearly on par with the existing solution. For future work, we could apply some form of dimensionality reduction to reduce response time, as well as several deep learning approaches, to investigate whether a neural network model can outperform the Random Forest classifier.
The full code, which predicts product backorders on the Kaggle dataset, is available in the GitHub repository subhamaybose/Product-Backorder-Prediction.