• For any query, contact us at
  • +91-9872993883
  • +91-8283824812
  • info@ris-ai.com

Big Mart Sales Dataset

Objective: To find out the properties of a product, and store which impacts the sales of a product.

Loading Important Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Importing the dataset

In [2]:
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

Combine test and train into one file

In [3]:
train['source']='train'
test['source']='test'
data = pd.concat([train, test],ignore_index=True)
print(train.shape, test.shape, data.shape)
(8523, 13) (5681, 12) (14204, 13)
/home/webtunix/my_project_dir/my_project_env/lib/python3.5/site-packages/ipykernel_launcher.py:3: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.

  This is separate from the ipykernel package so we can avoid doing imports until
In [4]:
data.head()
Out[4]:
Item_Fat_Content Item_Identifier Item_MRP Item_Outlet_Sales Item_Type Item_Visibility Item_Weight Outlet_Establishment_Year Outlet_Identifier Outlet_Location_Type Outlet_Size Outlet_Type source
0 Low Fat FDA15 249.8092 3735.1380 Dairy 0.016047 9.30 1999 OUT049 Tier 1 Medium Supermarket Type1 train
1 Regular DRC01 48.2692 443.4228 Soft Drinks 0.019278 5.92 2009 OUT018 Tier 3 Medium Supermarket Type2 train
2 Low Fat FDN15 141.6180 2097.2700 Meat 0.016760 17.50 1999 OUT049 Tier 1 Medium Supermarket Type1 train
3 Regular FDX07 182.0950 732.3800 Fruits and Vegetables 0.000000 19.20 1998 OUT010 Tier 3 NaN Grocery Store train
4 Low Fat NCD19 53.8614 994.7052 Household 0.000000 8.93 1987 OUT013 Tier 3 High Supermarket Type1 train

Numerical data summary:

In [5]:
data.describe()
Out[5]:
Item_MRP Item_Outlet_Sales Item_Visibility Item_Weight Outlet_Establishment_Year
count 14204.000000 8523.000000 14204.000000 11765.000000 14204.000000
mean 141.004977 2181.288914 0.065953 12.792854 1997.830681
std 62.086938 1706.499616 0.051459 4.652502 8.371664
min 31.290000 33.290000 0.000000 4.555000 1985.000000
25% 94.012000 834.247400 0.027036 8.710000 1987.000000
50% 142.247000 1794.331000 0.054021 12.600000 1999.000000
75% 185.855600 3101.296400 0.094037 16.750000 2004.000000
max 266.888400 13086.964800 0.328391 21.350000 2009.000000

Data Cleaning

This step typically involves imputing missing values and treating outliers. Though outlier removal is very important in regression techniques, advanced tree based algorithms are impervious to outliers.

Check missing values:

In [6]:
data.apply(lambda x: sum(x.isnull()))
Out[6]:
Item_Fat_Content                0
Item_Identifier                 0
Item_MRP                        0
Item_Outlet_Sales            5681
Item_Type                       0
Item_Visibility                 0
Item_Weight                  2439
Outlet_Establishment_Year       0
Outlet_Identifier               0
Outlet_Location_Type            0
Outlet_Size                  4016
Outlet_Type                     0
source                          0
dtype: int64

Filling missing values

We found two variables with missing values – Item_Weight and Outlet_Size. Lets impute the former by the average weight of the particular item. This can be done as:

In [7]:
data.Item_Outlet_Sales = data.Item_Outlet_Sales.fillna(data.Item_Outlet_Sales.mean())
In [8]:
data.Item_Weight = data.Item_Weight.fillna(data.Item_Weight.mean())
In [9]:
data['Outlet_Size'].value_counts()
Out[9]:
Medium    4655
Small     3980
High      1553
Name: Outlet_Size, dtype: int64
In [10]:
data.Outlet_Size = data.Outlet_Size.fillna('Medium')
In [11]:
data.apply(lambda x: sum(x.isnull()))
Out[11]:
Item_Fat_Content             0
Item_Identifier              0
Item_MRP                     0
Item_Outlet_Sales            0
Item_Type                    0
Item_Visibility              0
Item_Weight                  0
Outlet_Establishment_Year    0
Outlet_Identifier            0
Outlet_Location_Type         0
Outlet_Size                  0
Outlet_Type                  0
source                       0
dtype: int64
In [12]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14204 entries, 0 to 14203
Data columns (total 13 columns):
Item_Fat_Content             14204 non-null object
Item_Identifier              14204 non-null object
Item_MRP                     14204 non-null float64
Item_Outlet_Sales            14204 non-null float64
Item_Type                    14204 non-null object
Item_Visibility              14204 non-null float64
Item_Weight                  14204 non-null float64
Outlet_Establishment_Year    14204 non-null int64
Outlet_Identifier            14204 non-null object
Outlet_Location_Type         14204 non-null object
Outlet_Size                  14204 non-null object
Outlet_Type                  14204 non-null object
source                       14204 non-null object
dtypes: float64(4), int64(1), object(8)
memory usage: 1.4+ MB

Item type combine:

In [13]:
data['Item_Identifier'].value_counts()
data['Item_Type_Combined'] = data['Item_Identifier'].apply(lambda x: x[0:2])
data['Item_Type_Combined'] = data['Item_Type_Combined'].map({'FD':'Food',
                                                             'NC':'Non-Consumable',
                                                             'DR':'Drinks'})
data['Item_Type_Combined'].value_counts()
Out[13]:
Food              10201
Non-Consumable     2686
Drinks             1317
Name: Item_Type_Combined, dtype: int64

Numerical and One-Hot Coding of Categorical variables

Since scikit-learn accepts only numerical variables, I converted all categories of nominal variables into numeric types. Also, I wanted Outlet_Identifier as a variable as well. So I created a new variable ‘Outlet’ same as Outlet_Identifier and coded that. Outlet_Identifier should remain as it is, because it will be required in the submission file.

New variable for outlet

In [14]:
#Import library:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
le = LabelEncoder()

data['Outlet'] = le.fit_transform(data['Outlet_Identifier'])
var_mod = ['Item_Fat_Content','Outlet_Location_Type','Outlet_Size','Item_Type_Combined','Outlet_Type','Outlet']
le = LabelEncoder()
for i in var_mod:
    data[i] = le.fit_transform(data[i])

One-Hot-Coding refers to creating dummy variables, one for each category of a categorical variable. For example, the Item_Fat_Content has 3 categories – ‘Low Fat’, ‘Regular’ and ‘Non-Edible’. One hot coding will remove this variable and generate 3 new variables. Each will have binary numbers – 0 (if the category is not present) and 1(if category is present). This can be done using ‘get_dummies’ function of Pandas.

One Hot Coding:

In [15]:
data = pd.get_dummies(data, columns=['Item_Fat_Content','Outlet_Location_Type','Outlet_Size','Outlet_Type','Item_Type_Combined','Outlet'])
In [16]:
data.head()
Out[16]:
Item_Identifier Item_MRP Item_Outlet_Sales Item_Type Item_Visibility Item_Weight Outlet_Establishment_Year Outlet_Identifier source Item_Fat_Content_0 ... Outlet_0 Outlet_1 Outlet_2 Outlet_3 Outlet_4 Outlet_5 Outlet_6 Outlet_7 Outlet_8 Outlet_9
0 FDA15 249.8092 3735.1380 Dairy 0.016047 9.30 1999 OUT049 train 0 ... 0 0 0 0 0 0 0 0 0 1
1 DRC01 48.2692 443.4228 Soft Drinks 0.019278 5.92 2009 OUT018 train 0 ... 0 0 0 1 0 0 0 0 0 0
2 FDN15 141.6180 2097.2700 Meat 0.016760 17.50 1999 OUT049 train 0 ... 0 0 0 0 0 0 0 0 0 1
3 FDX07 182.0950 732.3800 Fruits and Vegetables 0.000000 19.20 1998 OUT010 train 0 ... 1 0 0 0 0 0 0 0 0 0
4 NCD19 53.8614 994.7052 Household 0.000000 8.93 1987 OUT013 train 0 ... 0 1 0 0 0 0 0 0 0 0

5 rows × 37 columns

Datatypes of columns now
In [17]:
data.dtypes
Out[17]:
Item_Identifier               object
Item_MRP                     float64
Item_Outlet_Sales            float64
Item_Type                     object
Item_Visibility              float64
Item_Weight                  float64
Outlet_Establishment_Year      int64
Outlet_Identifier             object
source                        object
Item_Fat_Content_0             uint8
Item_Fat_Content_1             uint8
Item_Fat_Content_2             uint8
Item_Fat_Content_3             uint8
Item_Fat_Content_4             uint8
Outlet_Location_Type_0         uint8
Outlet_Location_Type_1         uint8
Outlet_Location_Type_2         uint8
Outlet_Size_0                  uint8
Outlet_Size_1                  uint8
Outlet_Size_2                  uint8
Outlet_Type_0                  uint8
Outlet_Type_1                  uint8
Outlet_Type_2                  uint8
Outlet_Type_3                  uint8
Item_Type_Combined_0           uint8
Item_Type_Combined_1           uint8
Item_Type_Combined_2           uint8
Outlet_0                       uint8
Outlet_1                       uint8
Outlet_2                       uint8
Outlet_3                       uint8
Outlet_4                       uint8
Outlet_5                       uint8
Outlet_6                       uint8
Outlet_7                       uint8
Outlet_8                       uint8
Outlet_9                       uint8
dtype: object

Exporting Data

Final step is to convert data back into train and test data sets. Its generally a good idea to export both of these as modified data sets so that they can be re-used for multiple sessions. This can be achieved using following code:

Drop the columns which have been converted to different types:

In [18]:
import warnings
warnings.filterwarnings('ignore')

data.drop(['Item_Type','Outlet_Establishment_Year'],axis=1,inplace=True)

Divide into test and train:

In [19]:
train = data.loc[data['source']=="train"]
test = data.loc[data['source']=="test"]

Drop unnecessary columns:

In [20]:
test.drop(['Item_Outlet_Sales','source'],axis=1,inplace=True)
train.drop(['source'],axis=1,inplace=True)

Export files as modified versions:

In [21]:
train.to_csv("train_modified.csv",index=False)
test.to_csv("test_modified.csv",index=False)

Model Building

Lets start by making a baseline model. Baseline model is the one which requires no predictive model and its like an informed guess. For instance, in this case lets predict the sales as the overall average sales. This can be done as:

Reading modified data

In [22]:
train2 = pd.read_csv("train_modified.csv")
test2 = pd.read_csv("test_modified.csv")
                                             
In [23]:
train2.head()
                                          
Out[23]:
Item_Identifier Item_MRP Item_Outlet_Sales Item_Visibility Item_Weight Outlet_Identifier Item_Fat_Content_0 Item_Fat_Content_1 Item_Fat_Content_2 Item_Fat_Content_3 ... Outlet_0 Outlet_1 Outlet_2 Outlet_3 Outlet_4 Outlet_5 Outlet_6 Outlet_7 Outlet_8 Outlet_9
0 FDA15 249.8092 3735.1380 0.016047 9.30 OUT049 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 1
1 DRC01 48.2692 443.4228 0.019278 5.92 OUT018 0 0 1 0 ... 0 0 0 1 0 0 0 0 0 0
2 FDN15 141.6180 2097.2700 0.016760 17.50 OUT049 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 1
3 FDX07 182.0950 732.3800 0.000000 19.20 OUT010 0 0 1 0 ... 1 0 0 0 0 0 0 0 0 0
4 NCD19 53.8614 994.7052 0.000000 8.93 OUT013 0 1 0 0 ... 0 1 0 0 0 0 0 0 0 0

5 rows × 34 columns

In [24]:
X_train = train2.drop(['Item_Outlet_Sales', 'Outlet_Identifier','Item_Identifier'], axis=1)
y_train = train2.Item_Outlet_Sales
In [25]:
X_test = test2.drop(['Outlet_Identifier','Item_Identifier'], axis=1)
In [26]:
X_train.head()
Out[26]:
Item_MRP Item_Visibility Item_Weight Item_Fat_Content_0 Item_Fat_Content_1 Item_Fat_Content_2 Item_Fat_Content_3 Item_Fat_Content_4 Outlet_Location_Type_0 Outlet_Location_Type_1 ... Outlet_0 Outlet_1 Outlet_2 Outlet_3 Outlet_4 Outlet_5 Outlet_6 Outlet_7 Outlet_8 Outlet_9
0 249.8092 0.016047 9.30 0 1 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 1
1 48.2692 0.019278 5.92 0 0 1 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 0
2 141.6180 0.016760 17.50 0 1 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 1
3 182.0950 0.000000 19.20 0 0 1 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 0
4 53.8614 0.000000 8.93 0 1 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 0 0

5 rows × 31 columns

In [27]:
y_train.head()
Out[27]:
0    3735.1380
1     443.4228
2    2097.2700
3     732.3800
4     994.7052
Name: Item_Outlet_Sales, dtype: float64

Measuring Accuracy

In [28]:
import warnings
warnings.filterwarnings('ignore')

from sklearn.metrics import accuracy_score, r2_score, mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn import metrics

Random Forest Model:

Fitting Random Forest Regression to the dataset

In [29]:
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=100,max_depth=6, min_samples_leaf=50,n_jobs=4)
regressor.fit(X_train, y_train)
Out[29]:
RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=6, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=50,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=4, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)

Predicting the test set results

In [30]:
y_pred = regressor.predict(X_test)
y_pred
Out[30]:
array([1644.58503874, 1365.02660944,  607.55946827, ..., 1967.37127117,
       3679.8358505 , 1318.82000327])

We Get Accuracy of prdiction by Random Forest

In [31]:
rf_accuracy = round(regressor.score(X_train,y_train),2)
rf_accuracy
Out[31]:
0.61
In [32]:
r2_score(y_train, regressor.predict(X_train))
Out[32]:
0.6127340118877953
In [33]:
import warnings
warnings.filterwarnings('ignore')
cv_score = cross_val_score(regressor, X_train, y_train, cv=5)
print(np.sqrt(np.abs(cv_score)))
[0.78025888 0.76655097 0.75912421 0.78069719 0.77997061]
In [34]:
print("RMSE : %.4g" % np.sqrt(metrics.mean_squared_error(y_train, regressor.predict(X_train))))
RMSE : 1062

Since I’ll be making many models, instead of repeating the codes again and again, I would like to define a generic function which takes the algorithm and data as input and makes the model, performs cross-validation and generates submission.

In [35]:
submission = pd.DataFrame({
'Item_Identifier':test2['Item_Identifier'],
'Outlet_Identifier':test2['Outlet_Identifier'],
'Item_Outlet_Sales': y_pred
},columns=['Item_Identifier','Outlet_Identifier','Item_Outlet_Sales'])
In [36]:
submission.to_csv('submission3.csv',index=False)

Conclusion:

We started with making some hypothesis about the data without looking at it. Then we moved on to data exploration where we found out some nuances in the data which required remediation. Next, we performed data cleaning and feature engineering, where we imputed missing values and solved other irregularities, made new features and also made the data model-friendly by one-hot-coding. Finally we made random forest model and got a glimpse of how to tune them for better results and get 61% accuracy rate for sales predictions.

In [ ]: