• For any query, contact us at
  • +91-9872993883
  • +91-8283824812
  • info@ris-ai.com

Wine Quality Prediction using Machine Learning

According to experts, wine quality is checked with its smell, flavor and color but we are not a wine experts. Here’s the use of Machine Learning comes . In this article , we will focus on Wine Quality Prediction on the basis of given features. Also every industry need to prove product quality to promote their product so quality check is important.

Firstly, we import necessary library for this model. Numpy will be used for making the mathematical calculations more accurate, pandas will be used to work with file formats like csv, xls etc. and sklearn (scikit-learn) will be used to import our classifier for prediction.from sklearn.model_selection import train_test_split is used to split our dataset into training and testing data. from sklearn import preprocessing is used to preprocess the data before fitting into predictor, or converting it to a range of -1,1, which is easy to understand for the Machine Learning Algorithms. from sklearn import tree is used to import our decision tree classifier, which we will be using for prediction.

Importing the libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn import tree

Loading the Dataset

Now we read CSV file name winequality-red.csv. which have fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol and quality columns.

In [2]:
dataset_url = 'winequality-red.csv'
data = pd.read_csv(dataset_url, sep=';')

Showing Dataset Information and Head value of first five rows. By calling head() and info() function.

In [3]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
fixed acidity           1599 non-null float64
volatile acidity        1599 non-null float64
citric acid             1599 non-null float64
residual sugar          1599 non-null float64
chlorides               1599 non-null float64
free sulfur dioxide     1599 non-null float64
total sulfur dioxide    1599 non-null float64
density                 1599 non-null float64
pH                      1599 non-null float64
sulphates               1599 non-null float64
alcohol                 1599 non-null float64
quality                 1599 non-null int64
dtypes: float64(11), int64(1)
memory usage: 150.0 KB

Separating The Data Into Features And Labels

In every Machine Learning program, there are two things, features and labels. Features are the part of a dataset which are used to predict the label. And labels on the other hand are mapped to features. After the model has been trained, we give features to it, so that it can predict the labels. So, if we analyse this dataset, since we have to predict the wine quality, the attribute quality will become our label and the rest of the attributes will become the features.We just stored and quality in y, which is the common symbol used to represent the labels in Machine Learning and dropped quality and stored the remaining features in X , again common symbol for features in ML.

In [4]:
y = data.quality
X = data.drop('quality', axis=1)

Splitting Into Test And Train Data

Split our dataset into test and train data, we will be using the train data to to train our model for predicting wine quality. We have used, train_test_split() function that we imported from sklearn to split the data. Notice we have used test_size=0.2 to make the test data 20% of the original data. The rest 80% is used for training.

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.2)

Now let’s print and see the first five elements of data we have split using head() function.

In [12]:
      fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
104             7.2             0.490         0.24             2.2      0.070   
1273            7.5             0.580         0.20             2.0      0.073   
253             7.7             0.775         0.42             1.9      0.092   
944             8.3             0.300         0.49             3.8      0.090   
358            11.9             0.430         0.66             3.1      0.109   

      free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
104                   5.0                  36.0  0.99600  3.33       0.48   
1273                 34.0                  44.0  0.99494  3.10       0.43   
253                   8.0                  86.0  0.99590  3.23       0.59   
944                  11.0                  24.0  0.99498  3.27       0.64   
358                  10.0                  23.0  1.00000  3.15       0.85   

104       9.4  
1273      9.3  
253       9.5  
944      12.1  
358      10.4  

Train Data Preprocessing

data normalization will be happen it is part of pre-processing in which data is converted to fit in a range of -1 and 1. These are simply, the values which are understood by a Machine Learning Algorithm easily.

In [7]:
X_train_scaled = preprocessing.scale(X_train)
array([[-0.63399594, -0.21765447, -0.14303419, ...,  0.11677241,
        -1.06775661, -0.96550033],
       [-0.46219916,  0.28178151, -0.35003926, ..., -1.35247219,
        -1.37012138, -1.06017513],
       [-0.34766797,  1.36389278,  0.78848861, ..., -0.52202959,
        -0.40255412, -0.87082553],
       [-0.86305831,  0.44826016, -1.17805953, ...,  0.69169421,
        -0.34208117, -0.87082553],
       [ 0.45405034, -0.16216158,  0.16747341, ...,  0.05289221,
         0.32312132,  0.07592243],
       [-0.63399594, -0.93906198,  0.99549368, ...,  0.56393381,
         1.16974266,  0.54929641]])
Training The Classifier

Now the values of all the train attributes are in the range of -1 and 1 and that is exactly what we were aiming for. Time has now come for the most exciting step, training our algorithm so that it can predict the wine quality. We do so by importing a DecisionTreeClassifier() and using fit() to train it.

In [8]:
clf.fit(X_train, y_train)
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

Check how efficiently your algorithm is predicting the label (in this case wine quality). This can be done using the score() function.

This score can change over time depending on the size of your dataset and shuffling of data when we divide the data into test and train, but you can always expect a range of ±5 around your first result.

In [9]:
confidence = clf.score(X_test, y_test)
print("\nThe confidence score:\n")
The confidence score:


Now that we have trained our classifier with features, we obtain the labels using predict() function.

Our predicted information is stored in y_pred but it has far too many columns to compare it with the expected labels we stored in y_test .

In [10]:
y_pred = clf.predict(X_test)
Comparing The Predicted And Expected Labels

we will just take first five entries of both, print them and compare them. We just converted y_pred from a numpy array to a list, so that we can compare with ease. Then we printed the first five elements of that list using for loop. And finally, we just printed the first five values that we were expecting, which were stored in y_test using head() function.

In [11]:
#converting the numpy array to list
#printing first 5 predictions
print("\nThe prediction:\n")
for i in range(0,5):
#printing first five expectations
print("\nThe expectation:\n")
The prediction:


The expectation:

1522    5
875     7
747     5
401     6
1254    5
Name: quality, dtype: int64

Almost all of the values in the prediction are similar to the expectations. Our predictor got wrong just once, predicting 7 as 6, but that’s it. This gives us the accuracy of 80% for 5 examples. Of course, as the examples increases the accuracy goes down, precisely to 0.612575 or approx 62.1875%, but overall our predictor performs quite well, in-fact any accuracy % greater than 50% is considered as great.

In [ ]: