Project objective: predict gas consumption in US states from the data provided. Such predictions can inform decisions about climate change, public welfare, government policy, and more.

Solving a regression problem with a decision tree in Scikit-Learn is very similar to solving a classification problem. However, for regression we use the DecisionTreeRegressor class of the tree module, and the evaluation metrics for regression differ from those for classification. The rest of the process is almost the same as for other regression models.

First, we import the libraries needed for this model: pandas, NumPy, and Matplotlib.

In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Now we read the CSV file named petrol_consumption.csv. We will use this dataset to try to predict gas consumption (in millions of gallons) in 48 US states based on gas tax (in cents), per capita income (in dollars), paved highways (in miles), and the proportion of the population with a driver's license.

In [2]:

dataset = pd.read_csv('petrol_consumption.csv')

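The dataset contains 48 rows and 5 columns of information about US states that is relevant to predicting petrol consumption. We use the head function of the dataframe to see what the data actually looks like, and the shape attribute to confirm its dimensions. In a notebook only the last expression in a cell is displayed, so the output below shows the shape.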

In [3]:

dataset.head()
dataset.shape

Out[3]:

(48, 5)

To see statistical details of the dataset, execute the following command:

In [4]:

dataset.describe()

Out[4]:

| | Petrol_tax | Average_income | Paved_Highways | Population_Driver_licence(%) | Petrol_Consumption |
|---|---|---|---|---|---|
| count | 48.000000 | 48.000000 | 48.000000 | 48.000000 | 48.000000 |
| mean | 7.668333 | 4241.833333 | 5565.416667 | 0.570333 | 576.770833 |
| std | 0.950770 | 573.623768 | 3491.507166 | 0.055470 | 111.885816 |
| min | 5.000000 | 3063.000000 | 431.000000 | 0.451000 | 344.000000 |
| 25% | 7.000000 | 3739.000000 | 3110.250000 | 0.529750 | 509.500000 |
| 50% | 7.500000 | 4298.000000 | 4735.500000 | 0.564500 | 568.500000 |
| 75% | 8.125000 | 4578.750000 | 7156.000000 | 0.595250 | 632.750000 |
| max | 10.000000 | 5342.000000 | 17782.000000 | 0.724000 | 968.000000 |

As with the classification task, we first divide the data into attributes and labels, and then into training and test sets. Execute the following commands to separate the attributes from the labels:

In [5]:

X = dataset.drop('Petrol_Consumption', axis=1)
y = dataset['Petrol_Consumption']

Here the X variable contains all the columns from the dataset, except 'Petrol_Consumption' column, which is the label. The y variable contains values from the 'Petrol_Consumption' column, which means that the X variable contains the attribute set and y variable contains the corresponding labels.
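The attribute/label separation described above can be checked on a small frame. This is only a sketch: the column names come from the dataset, but the three rows are placeholder values invented for illustration, and the names `demo`, `X_demo`, and `y_demo` are not part of the original notebook.

```python
import pandas as pd

# Placeholder values for illustration only; these are NOT rows from
# petrol_consumption.csv.
demo = pd.DataFrame({
    'Petrol_tax': [9.0, 7.5, 8.0],
    'Average_income': [3500, 4200, 4800],
    'Paved_Highways': [2000, 5000, 9000],
    'Population_Driver_licence(%)': [0.52, 0.57, 0.60],
    'Petrol_Consumption': [540, 570, 610],
})

# drop(..., axis=1) returns a new frame without the label column;
# the original frame is left unchanged.
X_demo = demo.drop('Petrol_Consumption', axis=1)
y_demo = demo['Petrol_Consumption']

print(demo.shape)    # (rows, columns) of the full frame
print(X_demo.shape)  # one column fewer than the full frame
print(y_demo.name)
```

Because `drop` does not modify the frame in place, `demo` still has all five columns afterwards, which is why the same pattern is safe to re-run in a notebook cell.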

In [6]:

from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing; random_state=0 makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
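To see what this call does to a dataset of the same size, the sketch below splits a synthetic 48-row array twice with the same `random_state`. The arrays `X_demo` and `y_demo` are random stand-ins, not the petrol data, and are used only to show the resulting sizes and the reproducibility of the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the tutorial data:
# 48 rows, 4 attribute columns, 1 label. Values are random, not real.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(48, 4))
y_demo = rng.normal(size=48)

# Same arguments as in the tutorial, called twice.
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# With 48 rows and test_size=0.2, 10 rows go to the test set and 38 to training.
print(X_train_a.shape, X_test_a.shape)

# A fixed random_state yields the same partition on every call.
same_split = np.array_equal(X_test_a, X_test_b) and np.array_equal(y_test_a, y_test_b)
print(same_split)
```

The 10 test rows match the 10 records that appear in the comparison of actual and predicted values later on.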

As mentioned earlier, for a regression task we'll use a different sklearn class than we did for the classification task. The class we'll be using here is the DecisionTreeRegressor class, as opposed to the DecisionTreeClassifier from before.

In [7]:

from sklearn.tree import DecisionTreeRegressor

regressor = DecisionTreeRegressor()
regressor.fit(X_train, y_train)

Out[7]:

DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort='deprecated', random_state=None, splitter='best')
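The printed defaults include `max_depth=None` and `min_samples_leaf=1`, which let the tree keep splitting until each leaf is pure. As a rough illustration of what that implies, the sketch below fits a default tree and a depth-limited tree on synthetic data; the data, the linear relationship used to generate it, and the choice of `max_depth=3` are all assumptions made for this example, not part of the original analysis.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic data (not the petrol dataset): 48 rows, 4 features,
# target depends on the first feature plus noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(48, 4))
y_demo = 500 + 100 * X_demo[:, 0] + rng.normal(scale=20, size=48)

# Default settings: the tree grows until every leaf is pure.
full_tree = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)

# Hypothetical alternative: cap the depth at 3 (an arbitrary choice here).
shallow_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_demo, y_demo)

# Error measured on the SAME data the trees were trained on.
full_train_mae = mean_absolute_error(y_demo, full_tree.predict(X_demo))
shallow_train_mae = mean_absolute_error(y_demo, shallow_tree.predict(X_demo))

print('full tree depth:', full_tree.get_depth(), 'train MAE:', full_train_mae)
print('shallow tree depth:', shallow_tree.get_depth(), 'train MAE:', shallow_train_mae)
```

A fully grown tree reproduces its training labels essentially exactly, so its training error says little about how it will do on unseen rows; this is why the tutorial evaluates on the held-out test set.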

In [8]:

y_pred = regressor.predict(X_test)

In [9]:

df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df

Out[9]:

| | Actual | Predicted |
|---|---|---|
| 29 | 534 | 547.0 |
| 4 | 410 | 414.0 |
| 26 | 577 | 574.0 |
| 30 | 571 | 554.0 |
| 32 | 577 | 631.0 |
| 37 | 704 | 644.0 |
| 34 | 487 | 628.0 |
| 40 | 587 | 540.0 |
| 7 | 467 | 414.0 |
| 10 | 580 | 464.0 |

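Remember that in your case the predicted values may differ. Because train_test_split is called with random_state=0, the split itself is reproducible, so you should see the same test records. However, the DecisionTreeRegressor is created without a random_state, so ties between equally good splits can be broken differently from run to run. If you change or omit random_state in train_test_split, the records in the training and test sets will differ as well.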

To evaluate performance of the regression algorithm, the commonly used metrics are mean absolute error, mean squared error, and root mean squared error. The Scikit-Learn library contains functions that can help calculate these values for us. To do so, use this code from the metrics package:

In [10]:

from sklearn import metrics

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Mean Absolute Error: 50.8
Mean Squared Error: 4535.4
Root Mean Squared Error: 67.34537846058926

The mean absolute error for our algorithm is 50.8, which is less than 10 percent of the mean of all the values in the 'Petrol_Consumption' column (576.77). This suggests that the model predicts reasonably well, although with only 10 test records the estimate is rough.
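For readers who want to see what the three metrics compute, here is a minimal sketch that recalculates them by hand from the ten actual and predicted values listed in the Out[9] table, using only the standard library. The variable names are chosen for this example; the mean of 576.770833 is taken from the describe() output.

```python
import math

# Actual and predicted values copied from the Out[9] table.
actual = [534, 410, 577, 571, 577, 704, 487, 587, 467, 580]
predicted = [547.0, 414.0, 574.0, 554.0, 631.0, 644.0, 628.0, 540.0, 414.0, 464.0]

errors = [a - p for a, p in zip(actual, predicted)]
n = len(errors)

mae = sum(abs(e) for e in errors) / n   # mean absolute error
mse = sum(e ** 2 for e in errors) / n   # mean squared error
rmse = math.sqrt(mse)                   # root mean squared error

# Mean of the 'Petrol_Consumption' column, from dataset.describe().
mean_consumption = 576.770833
relative_mae = mae / mean_consumption

print('MAE:', round(mae, 1))
print('MSE:', round(mse, 1))
print('RMSE:', round(rmse, 3))
print('MAE as a share of the mean:', round(relative_mae, 3))
```

The squared terms in the MSE make it more sensitive than the MAE to the two largest misses (rows 34 and 10), which is why the RMSE comes out noticeably higher than the MAE.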
