Exploring Ebay Car Sales Data¶
In this project, we'll work with a dataset of used cars from eBay Kleinanzeigen, a classified section of the German eBay website.
The dataset was uploaded to Kaggle. We made a few modifications to the data as follows.
- We sampled 50,000 data points from the full dataset, to make sure the code runs quickly.
- We dirtied the dataset a bit so we can practise cleaning the data.
The data dictionary provided with the dataset is as follows:
- dateCrawled - When this ad was first crawled. All field-values are taken from this date.
- name -Name of the car.
- seller -Whether the seller is private or a dealer.
- offerType - The type of listing.
- abtest -Whether the listing is included in an A/B test.
- vehicleType -The vehicle Type.
- yearOfRegistration -The year in wich the car was first registered
- gearbox -The transmission type
- model -The car model name
- kilometer- How many kilometers the car has driven
- monthOfRegistration - The month in which the car was first registered.
- fuelType -The brand of the car
- notRepairedDamage -If the car has a damage which is not yet repaired.
- dateCreated -The date on which the eBay listing was created.
- nrOfPictures -The number of pictures in the ad
- postalCode -The postal code for the location of the vehicle.
- lastSeenOnline -When the crawler saw this ad last online.
#import the pandas and numpy libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
#read in the dataset
autos = pd.read_csv("autos.csv", encoding = "Latin-1")
#print the dataset
autos
#get the info of the dataset
autos.info()
# get the head of the dataset
autos.head()
We can make the following observations:
- The dataset contains 20 columns, most of which are strings.
- Some columns have null values, but none have more than ~ 20% null values.
- The column names use camelCase instead of Python's preferred snake_case, which means we can't just replace spaces with underscores.
Let's convert the column names from camelCase to snake_case and reword some of the column names based on the data dictionary to be more descriptive.
# print the column names
autos.columns
#convert column names from camelCase to snake_case
columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
'vehicle_type', 'year_of_registration', 'gear_box', 'power_ps', 'model',
'odometer', 'month_of_registration', 'fuel_type', 'brand',
'not_repaired_damage', 'date_created', 'nr_of_pictures', 'postal_code',
'last_seen']
#assign the column names back to the datafram
autos.columns = columns
#see the head of the dataframe again
autos.head()
Now let us do some basic data exploration to determine what other cleaning tasks need to be done. Initially we will look for: -Text columns where all or almost all values are the same. These can often be dropped as they don't have useful information for analysis. -Example of numeric data stored as text which can be cleaned and converted.
#use describe to give a quick summary of the data,
#include = 'all' to get also categorical columns
autos.describe(include = 'all')
Observe that the columns seller, offer_type, abtest, gear_box, not_repaired_damage have only two distinct values, so they can be dropped.
The price and odometer columns are numeric values stored at text. For each column:
- Remove any non-numeric characters.
- Convert the column to a numeric dtype.
- Use DataFrame.rename() to rename the odometer column to odometer_km.
#remove the '$' sign from price
autos['price'] = autos['price'].str.replace("$","")
#remove "," from price
autos['price'] = autos['price'].str.replace(",","")
#convert to float type
autos['price'] = autos['price'].astype(float)
#remove 'km' from odometer
autos['odometer'] = autos['odometer'].str.replace("km","")
#remove ',' from odometer
autos['odometer'] = autos['odometer'].str.replace(",","")
#convert to float type
autos['odometer'] = autos['odometer'].astype(float)
#rename odometer to odometer_km
autos.rename({'odometer':'odometer_km'}, axis = 1,inplace = True)
#check the info again
autos.info()
Let's explore the odometer_km feature
autos['odometer_km'].describe()
autos['odometer_km'].value_counts().sort_index(ascending = False)
autos.hist(column = 'odometer_km')
Let's explore the price column
autos['price'].describe()
autos.hist(column = 'price', bins = 100)
autos['price'].value_counts().head(10)
#drop the row with price = 0
autos = autos[autos['price']>0]
autos['price'].describe()
Now let's move on to the date column
There are three columns that contains date date_crawled, last_seen, and ad_created.
Let's first look at the date_crawled feature
#normalize = True means we use frequency instead of count
#dropna means we also include missing values
autos['date_crawled'].str[:10].value_counts(normalize = True,dropna = False).sort_index()
Now the last_seen feature
autos['last_seen'].str[:10].value_counts(normalize = True, dropna = False).sort_index()
Now the year_of_registration feature
autos['year_of_registration'].describe()
Observe that the year_of_registration feature contains some odd values
- The minimum value is 1000, before cars were invented.
- The maximum value is 9999, many years into the future
autos['year_of_registration'].value_counts(normalize = True).sort_index()
We choose the registration_year between 1910 and 2016
#restrict the year of registration to be between 1910 and 2016
autos = autos[(autos['year_of_registration']>=1910)&(autos['year_of_registration']<=2016)]
#investigate the distributrion of year of registration again
autos['year_of_registration'].value_counts(normalize = True).sort_index()
Now let's explore the brand column
#get the top 20 brands
top20 = autos['brand'].value_counts().head(20).index
#find the mean price of the top 20 brands
top_brand = {}
for brand in top20:
top = autos[autos['brand'] == brand]
top_brand[brand] = np.mean(top['price'])
top_brand
autos.groupby('brand')['price'].mean().sort_values(ascending = False).head(20)
#convert top 20 brands and their mean prices to a dataframe
mean_price = pd.Series(top_brand)
top20 = pd.DataFrame(mean_price, columns = ['mean_price'])
top20
#compute mean mileage
mileage = {}
for brand in top_brand:
top = autos[autos['brand']==brand]
mileage[brand] = top['odometer_km'].mean()
#convert mileage to a series
mileage = pd.Series(mileage)
#add mileage to the top20 dataframe
top20['mean_mileage'] = mileage
print(top20)
top20.sort_values(by = 'mean_price', ascending = False)
Comments
comments powered by Disqus