For any query, contact us at
+91-9872993883
+91-8283824812
info@ris-ai.com

☰

AI Demos Blog Thesis Services Pricing Contact Us Know More

Most Viewed Articles

Blogs >
Product Demand Forecasting in Python

Product Demand Forecasting in Python

Product Demand Forecasting in Python is the process of using predictive analysis of historical data to estimate and predict future demand for a product or service. It helps the business make better decisions that estimate the total sales and revenue for future . Also reduce risks and make efficient financial decisions.In this article , we will find out the properties of a product which impacts its sale.

Predicting Sales - ¶

First, importing libraries of Python¶

In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Importing the dataset ¶

In [2]:

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

Combine test and train into one file¶

In [3]:

train['source']='train'
test['source']='test'
data = pd.concat([train, test],ignore_index=True)
print(train.shape, test.shape, data.shape)

(8523, 13) (5681, 12) (14204, 13)

my_project_env/lib/python3.5/site-packages/ipykernel_launcher.py:3: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.

  This is separate from the ipykernel package so we can avoid doing imports until

In [4]:

data.head()

Out[4]:

	Item_Fat_Content	Item_Identifier	Item_MRP	Item_Outlet_Sales	Item_Type	Item_Visibility	Item_Weight	Outlet_Establishment_Year	Outlet_Identifier	Outlet_Location_Type	Outlet_Size	Outlet_Type	source
0	Low Fat	FDA15	249.8092	3735.1380	Dairy	0.016047	9.30	1999	OUT049	Tier 1	Medium	Supermarket Type1	train
1	Regular	DRC01	48.2692	443.4228	Soft Drinks	0.019278	5.92	2009	OUT018	Tier 3	Medium	Supermarket Type2	train
2	Low Fat	FDN15	141.6180	2097.2700	Meat	0.016760	17.50	1999	OUT049	Tier 1	Medium	Supermarket Type1	train
3	Regular	FDX07	182.0950	732.3800	Fruits and Vegetables	0.000000	19.20	1998	OUT010	Tier 3	NaN	Grocery Store	train
4	Low Fat	NCD19	53.8614	994.7052	Household	0.000000	8.93	1987	OUT013	Tier 3	High	Supermarket Type1	train

Numerical data summary: ¶

In [5]:

data.describe()

Out[5]:

	Item_MRP	Item_Outlet_Sales	Item_Visibility	Item_Weight	Outlet_Establishment_Year
count	14204.000000	8523.000000	14204.000000	11765.000000	14204.000000
mean	141.004977	2181.288914	0.065953	12.792854	1997.830681
std	62.086938	1706.499616	0.051459	4.652502	8.371664
min	31.290000	33.290000	0.000000	4.555000	1985.000000
25%	94.012000	834.247400	0.027036	8.710000	1987.000000
50%	142.247000	1794.331000	0.054021	12.600000	1999.000000
75%	185.855600	3101.296400	0.094037	16.750000	2004.000000
max	266.888400	13086.964800	0.328391	21.350000	2009.000000

Data Cleaning ¶

This step typically involves imputing missing values and treating outliers. Though outlier removal is very important in regression techniques, advanced tree based algorithms are impervious to outliers.

Check missing values: ¶

In [6]:

data.apply(lambda x: sum(x.isnull()))

Out[6]:

Item_Fat_Content                0
Item_Identifier                 0
Item_MRP                        0
Item_Outlet_Sales            5681
Item_Type                       0
Item_Visibility                 0
Item_Weight                  2439
Outlet_Establishment_Year       0
Outlet_Identifier               0
Outlet_Location_Type            0
Outlet_Size                  4016
Outlet_Type                     0
source                          0
dtype: int64

Filling missing values ¶

We found two variables with missing values – Item_Weight and Outlet_Size. Lets impute the former by the average weight of the particular item. This can be done as:

In [7]:

data.Item_Outlet_Sales = data.Item_Outlet_Sales.fillna(data.Item_Outlet_Sales.mean())

In [8]:

data.Item_Weight = data.Item_Weight.fillna(data.Item_Weight.mean())

In [9]:

data['Outlet_Size'].value_counts()

Out[9]:

Medium    4655
Small     3980
High      1553
Name: Outlet_Size, dtype: int64

In [10]:

data.Outlet_Size = data.Outlet_Size.fillna('Medium')

In [11]:

data.apply(lambda x: sum(x.isnull()))

Out[11]:

Item_Fat_Content             0
Item_Identifier              0
Item_MRP                     0
Item_Outlet_Sales            0
Item_Type                    0
Item_Visibility              0
Item_Weight                  0
Outlet_Establishment_Year    0
Outlet_Identifier            0
Outlet_Location_Type         0
Outlet_Size                  0
Outlet_Type                  0
source                       0
dtype: int64

In [12]:

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14204 entries, 0 to 14203
Data columns (total 13 columns):
Item_Fat_Content             14204 non-null object
Item_Identifier              14204 non-null object
Item_MRP                     14204 non-null float64
Item_Outlet_Sales            14204 non-null float64
Item_Type                    14204 non-null object
Item_Visibility              14204 non-null float64
Item_Weight                  14204 non-null float64
Outlet_Establishment_Year    14204 non-null int64
Outlet_Identifier            14204 non-null object
Outlet_Location_Type         14204 non-null object
Outlet_Size                  14204 non-null object
Outlet_Type                  14204 non-null object
source                       14204 non-null object
dtypes: float64(4), int64(1), object(8)
memory usage: 1.4+ MB

Item type combine: ¶

In [13]:

data['Item_Identifier'].value_counts()
data['Item_Type_Combined'] = data['Item_Identifier'].apply(lambda x: x[0:2])
data['Item_Type_Combined'] = data['Item_Type_Combined'].map({'FD':'Food',
                                                             'NC':'Non-Consumable',
                                                             'DR':'Drinks'})
data['Item_Type_Combined'].value_counts()

Out[13]:

Food              10201
Non-Consumable     2686
Drinks             1317
Name: Item_Type_Combined, dtype: int64

Numerical and One-Hot Coding of Categorical variables ¶

Since scikit-learn accepts only numerical variables, I converted all categories of nominal variables into numeric types. Also, I wanted Outlet_Identifier as a variable as well. So I created a new variable ‘Outlet’ same as Outlet_Identifier and coded that. Outlet_Identifier should remain as it is, because it will be required in the submission file.

New variable for outlet ¶

In [14]:

#Import library:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
le = LabelEncoder()

data['Outlet'] = le.fit_transform(data['Outlet_Identifier'])
var_mod = ['Item_Fat_Content','Outlet_Location_Type','Outlet_Size','Item_Type_Combined','Outlet_Type','Outlet']
le = LabelEncoder()
for i in var_mod:
    data[i] = le.fit_transform(data[i])

One-Hot-Coding refers to creating dummy variables, one for each category of a categorical variable. For example, the Item_Fat_Content has 3 categories – ‘Low Fat’, ‘Regular’ and ‘Non-Edible’. One hot coding will remove this variable and generate 3 new variables. Each will have binary numbers – 0 (if the category is not present) and 1(if category is present). This can be done using ‘get_dummies’ function of Pandas.

One Hot Coding:¶

In [15]:

data = pd.get_dummies(data, columns=['Item_Fat_Content','Outlet_Location_Type','Outlet_Size','Outlet_Type','Item_Type_Combined','Outlet'])

In [16]:

data.head()

Out[16]:

	Item_Identifier	Item_MRP	Item_Outlet_Sales	Item_Type	Item_Visibility	Item_Weight	Outlet_Establishment_Year	Outlet_Identifier	source	...	Outlet_0	Outlet_1	Outlet_3	Outlet_9
0	FDA15	249.8092	3735.1380	Dairy	0.016047	9.30	1999	OUT049	train	...	0	0	0	1
1	DRC01	48.2692	443.4228	Soft Drinks	0.019278	5.92	2009	OUT018	train	...	0	0	1	0
2	FDN15	141.6180	2097.2700	Meat	0.016760	17.50	1999	OUT049	train	...	0	0	0	1
3	FDX07	182.0950	732.3800	Fruits and Vegetables	0.000000	19.20	1998	OUT010	train	...	1	0	0	0
4	NCD19	53.8614	994.7052	Household	0.000000	8.93	1987	OUT013	train	...	0	1	0	0

5 rows × 37 columns

Datatypes of columns now ¶

In [17]:

data.dtypes

Out[17]:

Item_Identifier               object
Item_MRP                     float64
Item_Outlet_Sales            float64
Item_Type                     object
Item_Visibility              float64
Item_Weight                  float64
Outlet_Establishment_Year      int64
Outlet_Identifier             object
source                        object
Item_Fat_Content_0             uint8
Item_Fat_Content_1             uint8
Item_Fat_Content_2             uint8
Item_Fat_Content_3             uint8
Item_Fat_Content_4             uint8
Outlet_Location_Type_0         uint8
Outlet_Location_Type_1         uint8
Outlet_Location_Type_2         uint8
Outlet_Size_0                  uint8
Outlet_Size_1                  uint8
Outlet_Size_2                  uint8
Outlet_Type_0                  uint8
Outlet_Type_1                  uint8
Outlet_Type_2                  uint8
Outlet_Type_3                  uint8
Item_Type_Combined_0           uint8
Item_Type_Combined_1           uint8
Item_Type_Combined_2           uint8
Outlet_0                       uint8
Outlet_1                       uint8
Outlet_2                       uint8
Outlet_3                       uint8
Outlet_4                       uint8
Outlet_5                       uint8
Outlet_6                       uint8
Outlet_7                       uint8
Outlet_8                       uint8
Outlet_9                       uint8
dtype: object

Exporting Data ¶

Final step is to convert data back into train and test data sets. Its generally a good idea to export both of these as modified data sets so that they can be re-used for multiple sessions. This can be achieved using following code:

Drop the columns which have been converted to different types: ¶

In [18]:

import warnings
warnings.filterwarnings('ignore')

data.drop(['Item_Type','Outlet_Establishment_Year'],axis=1,inplace=True)

Divide into test and train:¶

In [19]:

train = data.loc[data['source']=="train"]
test = data.loc[data['source']=="test"]

Drop unnecessary columns: ¶

In [20]:

test.drop(['Item_Outlet_Sales','source'],axis=1,inplace=True)
train.drop(['source'],axis=1,inplace=True)

Export files as modified versions:¶

In [21]:

train.to_csv("train_modified.csv",index=False)
test.to_csv("test_modified.csv",index=False)

Model Building ¶

Lets start by making a baseline model. Baseline model is the one which requires no predictive model and its like an informed guess. For instance, in this case lets predict the sales as the overall average sales. This can be done as:

Reading modified data ¶

In [22]:

train2 = pd.read_csv("train_modified.csv")
test2 = pd.read_csv("test_modified.csv")

In [23]:

train2.head()

Out[23]:

	Item_Identifier	Item_MRP	Item_Outlet_Sales	Item_Visibility	Item_Weight	Outlet_Identifier	Item_Fat_Content_1	Item_Fat_Content_2	...	Outlet_0	Outlet_1	Outlet_3	Outlet_9
0	FDA15	249.8092	3735.1380	0.016047	9.30	OUT049	1	0	...	0	0	0	1
1	DRC01	48.2692	443.4228	0.019278	5.92	OUT018	0	1	...	0	0	1	0
2	FDN15	141.6180	2097.2700	0.016760	17.50	OUT049	1	0	...	0	0	0	1
3	FDX07	182.0950	732.3800	0.000000	19.20	OUT010	0	1	...	1	0	0	0
4	NCD19	53.8614	994.7052	0.000000	8.93	OUT013	1	0	...	0	1	0	0

5 rows × 34 columns

In [24]:

X_train = train2.drop(['Item_Outlet_Sales', 'Outlet_Identifier','Item_Identifier'], axis=1)
y_train = train2.Item_Outlet_Sales

In [25]:

X_test = test2.drop(['Outlet_Identifier','Item_Identifier'], axis=1)

In [26]:

X_train.head()

Out[26]:

	Item_MRP	Item_Visibility	Item_Weight	Item_Fat_Content_1	Item_Fat_Content_2	Outlet_Location_Type_0	...	Outlet_0	Outlet_1	Outlet_3	Outlet_9
0	249.8092	0.016047	9.30	1	0	1	...	0	0	0	1
1	48.2692	0.019278	5.92	0	1	0	...	0	0	1	0
2	141.6180	0.016760	17.50	1	0	1	...	0	0	0	1
3	182.0950	0.000000	19.20	0	1	0	...	1	0	0	0
4	53.8614	0.000000	8.93	1	0	0	...	0	1	0	0

5 rows × 31 columns

In [27]:

y_train.head()

Out[27]:

0    3735.1380
1     443.4228
2    2097.2700
3     732.3800
4     994.7052
Name: Item_Outlet_Sales, dtype: float64

Measuring Accuracy ¶

In [28]:

import warnings
warnings.filterwarnings('ignore')

from sklearn.metrics import accuracy_score, r2_score, mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn import metrics

Random Forest Model: ¶

Fitting Random Forest Regression to the dataset ¶

In [29]:

from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=100,max_depth=6, min_samples_leaf=50,n_jobs=4)
regressor.fit(X_train, y_train)

Out[29]:

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=6, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=50,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=4, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)

Predicting the test set results¶

In [30]:

y_pred = regressor.predict(X_test)
y_pred

Out[30]:

array([1644.58503874, 1365.02660944,  607.55946827, ..., 1967.37127117,
       3679.8358505 , 1318.82000327])

Product Demand Forecasting in Python ¶

Getting accuracy of prediction using random forest ¶

In [31]:

rf_accuracy = round(regressor.score(X_train,y_train),2)
rf_accuracy

Out[31]:

0.61

In [32]:

r2_score(y_train, regressor.predict(X_train))

Out[32]:

0.6127340118877953

In [33]:

import warnings
warnings.filterwarnings('ignore')
cv_score = cross_val_score(regressor, X_train, y_train, cv=5)
print(np.sqrt(np.abs(cv_score)))

[0.78025888 0.76655097 0.75912421 0.78069719 0.77997061]

In [34]:

print("RMSE : %.4g" % np.sqrt(metrics.mean_squared_error(y_train, regressor.predict(X_train))))

RMSE : 1062

Since I’ll be making many models, instead of repeating the codes again and again, I would like to define a generic function which takes the algorithm and data as input and makes the model, performs cross-validation and generates submission.

In [35]:

submission = pd.DataFrame({
'Item_Identifier':test2['Item_Identifier'],
'Outlet_Identifier':test2['Outlet_Identifier'],
'Item_Outlet_Sales': y_pred
},columns=['Item_Identifier','Outlet_Identifier','Item_Outlet_Sales'])

In [36]:

submission.to_csv('submission3.csv',index=False)

Conclusion: ¶

We started with making some hypothesis about the data without looking at it. Then we moved on to data exploration where we found out some nuances in the data which required remediation. Next, we performed data cleaning and feature engineering, where we imputed missing values and solved other irregularities, made new features and also made the data model-friendly by one-hot-coding. Finally we made random forest model and got a glimpse of how to tune them for better results and get 61% accuracy rate for sales predictions.

In [ ]:

Most Viewed Articles

Product Demand Forecasting in Python

Predicting Sales - ¶

First, importing libraries of Python¶

Importing the dataset ¶

Combine test and train into one file¶

Numerical data summary: ¶

Data Cleaning ¶

Check missing values: ¶

Filling missing values ¶

Item type combine: ¶

Numerical and One-Hot Coding of Categorical variables ¶

New variable for outlet ¶

One Hot Coding:¶

Datatypes of columns now ¶

Exporting Data ¶

Drop the columns which have been converted to different types: ¶

Divide into test and train:¶

Drop unnecessary columns: ¶

Export files as modified versions:¶

Model Building ¶

Reading modified data ¶

Measuring Accuracy ¶

Random Forest Model: ¶

Fitting Random Forest Regression to the dataset ¶

Predicting the test set results¶

Product Demand Forecasting in Python ¶

Getting accuracy of prediction using random forest ¶

Conclusion: ¶

Don't forget to share this Article!

Sharing is Caring

Search Article

Popular ML Articles

Resources You Will Ever Need

Popular Searches

Go for Research

Consultation fee- 150 USD/hour

Select Thesis

Synopsis

Research Paper

Total cost (in USD): $0

PHD

Contact for custom package.