Predicting Car Price¶
Introduction¶
In this notebook we walk through a toy end-to-end data science project, from collecting the data to deploying the model. We follow the Jupyter notebook Car Price, with the code available on GitHub. We build a regression model that predicts car prices from features such as mileage, mark, model, year_model, fuel_type, and city, using data scraped from the classifieds website Avito. The notebook is organized as follows.
- Data collection
- Data preprocessing and cleaning
- Exploratory data analysis and visualization
- Data modeling
- Model deployment
All the code is available on GitHub.
Data collection¶
In [1]:
#data collection and preprocessing
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
#data visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('fivethirtyeight')
In [13]:
def get_ads_urls():
    urls_list = []
    #base url for the car listings
    basic_url = "https://www.avito.ma/fr/maroc/voitures-à_vendre?mpr=500000000&o="
    #loop over the result pages
    for i in range(1, 250):
        #build the page url
        url = basic_url + str(i)
        #fetch the page
        r = requests.get(url)
        data = r.text
        #parse it with BeautifulSoup
        soup = BeautifulSoup(data, "lxml")
        #collect the ad links on the page
        for div in soup.find_all('div', {'class': 'item-img'}):
            a = div.find_all('a')[0]
            urls_list.append(a.get('href'))
    #write the links to a dataframe
    df = pd.DataFrame(data={"url": urls_list})
    df.to_csv('ads_urls.csv', sep=',', index=False)
In [14]:
#put all the urls in a file
get_ads_urls()
In [68]:
def scrape_ad_data(ad_url):
    """Extract the information from a single ad page."""
    r = requests.get(ad_url)
    data = r.text
    soup = BeautifulSoup(data, 'html.parser')
    features = soup.find_all('h2', {'class': ["font-normal", "fs12", "no-margin", "ln22"]})
    #collect the text of each feature tag
    results = []
    for a in features:
        results.append(a.get_text().replace('\n', ''))
    return results
In [44]:
import csv
def write_data_to_csv(data):
    #newline='' avoids blank rows on Windows; utf-8 keeps accented characters intact
    with open("output.csv", "w", newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerows(data)
In [69]:
#read in the url file
url_data = pd.read_csv('ads_urls.csv', encoding = 'latin-1')
data = []
for i in range(len(url_data)):
    data.append(scrape_ad_data(url_data['url'][i]))
In [70]:
#write the results to file
write_data_to_csv(data)
Data Preprocessing and Cleaning¶
In [162]:
#set the column names
colnames = ['price', 'year_model','mileage','fuel_type','mark','model','fiscal_power','sector','type','city']
#read in the csv file
df = pd.read_csv('output.csv', sep = ",", names = colnames, header = None, encoding = 'latin-1')
#display the first few rows
df.head()
Out[162]:
The 'price' column¶
In [163]:
#keep only the rows that contain a price in DH
df = df[df['price'].str.contains('DH') == True]
#remove the string 'DH' from the price (replace, not strip: str.strip removes a character set, not a substring)
df['price'] = df['price'].apply(lambda x: x.replace('DH', ''))
#remove the spaces
df['price'] = df['price'].apply(lambda x: x.replace(" ", ""))
#convert to numeric type
df['price'] = pd.to_numeric(df['price'], errors="coerce")
#check the head
df['price'].head()
Out[163]:
In [164]:
#convert to USD
df['price'] = df['price']*0.27
The 'year model' column¶
In [165]:
df['year_model'].value_counts()
Out[165]:
In [166]:
#keep only the rows that contain 'Année-Modèle'
df = df[df['year_model'].str.contains("Année-Modèle") == True]
#remove the 'Année-Modèle:' label (replace, not strip: str.strip removes a character set, not a prefix)
df['year_model'] = df['year_model'].apply(lambda x: x.replace("Année-Modèle:", ""))
#remove the 'ou plus ancien' suffix
df['year_model'] = df['year_model'].apply(lambda x: x.replace(" ou plus ancien", "").strip())
#remove the rows that contain '-'
df = df[df['year_model'] != '-']
#convert to integer
df['year_model'] = pd.to_numeric(df['year_model'], errors='coerce')
#check the values again
df['year_model'].value_counts()
Out[166]:
The 'mileage' column¶
In [167]:
df['mileage'].value_counts()
Out[167]:
In [168]:
#keep only the rows that contain 'Kilométrage'
df = df[df['mileage'].str.contains('Kilométrage') == True]
#remove the 'Kilométrage:' label (replace instead of strip, which would eat matching characters)
df['mileage'] = df['mileage'].apply(lambda x: x.replace('Kilométrage:', '').strip())
#remove the row with '-'
df = df[df['mileage'] != '-']
#replace 'Plus de 500 000' with '500000-500000'
df['mileage'] = df['mileage'].apply(lambda x:x.replace('Plus de 500 000',"500000-500000"))
#split into min and max value
df['min'] = df['mileage'].apply(lambda x:x.split('-')[0])
df['max'] = df['mileage'].apply(lambda x:x.split('-')[1])
#remove empty space
df['min'] = df['min'].apply(lambda x:x.replace(" ",""))
df['max'] = df['max'].apply(lambda x:x.replace(" ",""))
#compute the mean
df['mileage'] = df.apply(lambda row: (int(row['min'])+int(row['max']))/2, axis = 1)
#remove the min and max features
df.drop(['min','max'], axis = 1,inplace = True)
#count the values
df['mileage'].value_counts()
Out[168]:
The 'fuel_type' column¶
In [169]:
df['fuel_type'].value_counts()
Out[169]:
In [170]:
#keep only the rows that contain 'Type de carburant'
df = df[df['fuel_type'].str.contains("Type de carburant") == True]
#remove the 'Type de carburant:' label; str.strip would also eat trailing letters (e.g. 'Essence' -> 'Ess')
df['fuel_type'] = df['fuel_type'].apply(lambda x: x.replace("Type de carburant:", "").strip())
#remove '-'
df = df[df['fuel_type'] != '-']
#print values
df['fuel_type'].value_counts()
Out[170]:
The 'mark' column¶
In [171]:
df['mark'].value_counts()
Out[171]:
In [172]:
#keep only the rows that contain 'Marque'
df = df[df['mark'].str.contains('Marque') == True]
#remove the 'Marque:' label; str.strip would also eat trailing letters (e.g. 'Dacia' -> 'Daci')
df['mark'] = df['mark'].apply(lambda x: x.replace('Marque:', '').strip())
#value counts
df['mark'].value_counts()
Out[172]:
The 'model' column¶
In [173]:
df['model'].value_counts()
Out[173]:
In [174]:
#remove the 'Modèle:' label; replace avoids stripping letters from the model name itself
df['model'] = df['model'].apply(lambda x: x.replace('Modèle:', '').strip())
#value_count
df['model'].value_counts()
Out[174]:
The 'fiscal power' column¶
In [175]:
df['fiscal_power'].value_counts()
Out[175]:
For this feature we will replace '-' by np.nan and fill in the missing values later.
In [201]:
#remove the 'Puissance fiscale:' label
df['fiscal_power'] = df['fiscal_power'].apply(lambda x: x.replace('Puissance fiscale:', ''))
#replace 'Plus de 48 CV' by '48 CV'
df['fiscal_power'] = df['fiscal_power'].apply(lambda x: x.replace('Plus de 48 CV', '48 CV'))
#drop the 'CV' unit
df['fiscal_power'] = df['fiscal_power'].apply(lambda x: x.replace('CV', '').strip())
#convert to integer
df['fiscal_power'] = pd.to_numeric(df['fiscal_power'], errors = 'coerce')
#print values
df['fiscal_power'].value_counts()
Out[201]:
In [203]:
#missing value
df['fiscal_power'].isnull().sum()
Out[203]:
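The plan for this column (coerce '-' to NaN, then fill the gaps with the mean, as done in the modeling section) can be sketched on a small hypothetical sample:

```python
import pandas as pd

#hypothetical sample mimicking the fiscal_power column after label removal
s = pd.Series(['6', '8', '-', '10'])
#errors='coerce' turns the non-numeric '-' into NaN
fiscal = pd.to_numeric(s, errors='coerce')
#fill the missing entry with the column mean, here (6 + 8 + 10) / 3 = 8.0
fiscal = fiscal.fillna(fiscal.mean())
```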
Drop the 'sector' and 'type' columns¶
In [204]:
df.drop(['sector','type'],axis = 1,inplace = True)
df.head()
Out[204]:
Save the clean data set¶
In [205]:
df.to_csv('cars.csv',index = False)
Exploratory Data Analysis and Visualization¶
In [2]:
df = pd.read_csv('cars.csv', encoding = 'latin-1')
print('The shape of the data is {}'.format(df.shape))
df.head()
Out[2]:
In [3]:
#missing values
df.isnull().sum()
Out[3]:
In [4]:
#quick statistics
df.describe()
Out[4]:
Remove outliers¶
In [5]:
mean = np.mean(df['price'])
std = np.std(df['price'])
df = df[(df['price']>=mean-3*std)&(df['price']<=mean+3*std)]
Price¶
In [6]:
#distribution of the price
plt.figure(figsize = (10,6))
sns.kdeplot(df['price'])
Out[6]:
Mileage¶
In [7]:
plt.figure(figsize = (15,6))
sns.regplot(x='mileage', y = 'price', data = df, fit_reg = False)
Out[7]:
Fuel Type¶
In [8]:
#box plot of price in terms of fuel type
plt.figure(figsize = (5,5))
sns.boxplot(x='fuel_type', y='price', data = df)
Out[8]:
Year_model¶
In [9]:
plt.figure(figsize = (20,5))
sns.barplot(x= 'year_model', y= 'price', data =df[df['year_model']>2000])
Out[9]:
Mark¶
In [10]:
plt.figure(figsize = (20,20))
sns.barplot(x='price', y = 'mark', data = df)
Out[10]:
Fiscal power¶
In [11]:
plt.figure(figsize = (20,20))
sns.barplot(x='fiscal_power', y='price', data = df)
Out[11]:
Correlation matrix¶
In [12]:
#numeric_only=True restricts the correlation to the numeric columns
corr_matrix = df.corr(numeric_only=True)
corr_matrix['price'].sort_values(ascending = False)
Out[12]:
In [13]:
sns.heatmap(corr_matrix, annot = True, cmap = 'viridis')
Out[13]:
Data modeling¶
In [14]:
#pick the relevant features (.copy() avoids SettingWithCopy warnings when we modify X later)
X = df[['year_model', 'mileage', 'fiscal_power', 'fuel_type', 'mark']].copy()
y = df['price']
Imputation¶
In [15]:
#Imputer was removed from scikit-learn; SimpleImputer is its replacement
from sklearn.impute import SimpleImputer
In [16]:
imputer = SimpleImputer(strategy='mean')
In [17]:
imputer.fit(X[['fiscal_power']])
Out[17]:
In [18]:
X.iloc[:,2] = imputer.transform(X[['fiscal_power']])[:,0]
In [19]:
X.isnull().sum()
Out[19]:
Feature scaling¶
In [20]:
from sklearn.preprocessing import StandardScaler
In [21]:
scaler = StandardScaler()
In [22]:
scaler.fit(X[['year_model','mileage','fiscal_power']])
Out[22]:
In [23]:
X[['year_model','mileage','fiscal_power']]=scaler.transform(X[['year_model','mileage','fiscal_power']])
In [24]:
X.head()
Out[24]:
One hot encode the categorical variables¶
In [25]:
X = pd.get_dummies(X)
print('The shape of the data is {}'.format(X.shape))
X.head()
Out[25]:
Train test split¶
In [26]:
from sklearn.model_selection import train_test_split
In [27]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
print('The training set size is {}'.format(X_train.shape))
print('The testing set size is {}'.format(X_test.shape))
Elastic net¶
In [28]:
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import cross_val_score
In [29]:
en = ElasticNet()
scores = cross_val_score(en, X, y, scoring = "neg_mean_squared_error", cv = 5)
rmse_scores = np.sqrt(-scores)
print('The mean rmse is {} with standard deviation {}'.format(np.mean(rmse_scores),np.std(rmse_scores)))
Random forest regressor¶
In [30]:
from sklearn.ensemble import RandomForestRegressor
In [31]:
rdf = RandomForestRegressor()
scores = cross_val_score(rdf, X, y, scoring = 'neg_mean_squared_error', cv = 5)
rdf_score = np.sqrt(-scores)
print('The mean rmse of rdf is {} with standard deviation {}'.format(np.mean(rdf_score),np.std(rdf_score)))
Gradient boosting¶
In [32]:
from sklearn.ensemble import GradientBoostingRegressor
gbr = GradientBoostingRegressor()
scores = cross_val_score(gbr, X, y, scoring = 'neg_mean_squared_error', cv = 5)
gbr_score = np.sqrt(-scores)
print('The mean rmse of gbr is {} with standard deviation {}'.format(np.mean(gbr_score),np.std(gbr_score)))
In [43]:
gbr = GradientBoostingRegressor()
gbr.fit(X_train, y_train)
print(gbr)
In [44]:
from sklearn.metrics import mean_squared_error
predictions = gbr.predict(X_test)
mse = mean_squared_error(y_test, predictions)
rmse = np.sqrt(mse)
rmse
Out[44]:
Predict an observation provided by a user¶
In [51]:
#user_input = [2012, 124999.5,6, 'Diesel','BMW']
user_input = {'year_model':2012, 'mileage':124999.5, 'fiscal_power':6, 'fuel_type':'Diesel', 'mark':'BMW'}
year_mean = np.mean(df['year_model'])
year_std = np.std(df['year_model'])
mileage_mean = np.mean(df['mileage'])
mileage_std = np.std(df['mileage'])
fiscal_mean = np.mean(df['fiscal_power'])
fiscal_std = np.std(df['fiscal_power'])
In [55]:
def input_to_one_hot(data):
    #initialize the encoded vector with zeros, one entry per training column
    enc_input = np.zeros(len(X.columns))
    #scale the numeric features with the training statistics
    enc_input[0] = (data['year_model'] - year_mean) / year_std
    enc_input[1] = (data['mileage'] - mileage_mean) / mileage_std
    enc_input[2] = (data['fiscal_power'] - fiscal_mean) / fiscal_std
    #convert the mark to match the one-hot column name
    mark_col = 'mark_' + data['mark']
    #find the index of mark_col and set the corresponding entry to 1
    mark_ind = X.columns.tolist().index(mark_col)
    enc_input[mark_ind] = 1
    #convert the fuel type to match the one-hot column name
    fuel_col = 'fuel_type_' + data['fuel_type']
    #find the index of fuel_col and set the corresponding entry to 1
    fuel_ind = X.columns.tolist().index(fuel_col)
    enc_input[fuel_ind] = 1
    return enc_input
In [56]:
input_to_one_hot(user_input)
Out[56]:
In [57]:
a = input_to_one_hot(user_input)
In [64]:
print('The price of the car is {}'.format(round(gbr.predict([a])[0],2)))
Save the model¶
In [65]:
#sklearn.externals.joblib was removed; use the standalone joblib package
import joblib
joblib.dump(gbr, 'model.pkl')
Out[65]:
Model deployment¶
We will deploy the model using the Heroku platform. You can check out the app in action at car-price-prediction, and all the code is available on GitHub.
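A deployment along these lines can be sketched as a small Flask API. This is a minimal illustration, not the deployed app itself: the endpoint name, the payload fields, and the `model.pkl` path are assumptions, and the real app would encode requests with `input_to_one_hot` from the notebook.

```python
#minimal Flask API sketch; 'model.pkl' is the file saved by the notebook
from flask import Flask, request, jsonify

app = Flask(__name__)

#in the real app the model would be loaded once at startup, e.g.:
#import joblib
#model = joblib.load('model.pkl')
model = None  # placeholder so the sketch runs without the pickle file

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    if model is None:
        #no model available: return a dummy price so the route stays testable
        price = 0.0
    else:
        #encode the request exactly as input_to_one_hot does in the notebook
        features = input_to_one_hot(data)
        price = float(model.predict([features])[0])
    return jsonify({'price': round(price, 2)})
```

On Heroku the app would typically be started by a `Procfile` pointing a WSGI server such as gunicorn at `app`.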