Exploring Ebay Car Sales Data¶

In this project, we'll work with a dataset of used cars from eBay Kleinanzeigen, a classified section of the German eBay website.

The dataset was uploaded to Kaggle. We made a few modifications to the data as follows.

We sampled 50,000 data points from the full dataset, to make sure the code runs quickly.
We dirtied the dataset a bit so we can practise cleaning the data.

The data dictionary provided with the dataset is as follows:

dateCrawled - When this ad was first crawled. All field-values are taken from this date.
name -Name of the car.
seller -Whether the seller is private or a dealer.
offerType - The type of listing.
abtest -Whether the listing is included in an A/B test.
vehicleType -The vehicle Type.
yearOfRegistration -The year in wich the car was first registered
gearbox -The transmission type
model -The car model name
kilometer- How many kilometers the car has driven
monthOfRegistration - The month in which the car was first registered.
fuelType -The brand of the car
notRepairedDamage -If the car has a damage which is not yet repaired.
dateCreated -The date on which the eBay listing was created.
nrOfPictures -The number of pictures in the ad
postalCode -The postal code for the location of the vehicle.
lastSeenOnline -When the crawler saw this ad last online.

In [1]:

#import the pandas and numpy libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [2]:

#read in the dataset
autos = pd.read_csv("autos.csv", encoding = "Latin-1")

In [3]:

#print the dataset
autos

Out[3]:

	dateCrawled	name	seller	offerType	price	abtest	vehicleType	yearOfRegistration	gearbox	powerPS	model	odometer	monthOfRegistration	fuelType	brand	notRepairedDamage	dateCreated	nrOfPictures	postalCode	lastSeen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	0	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	0	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	$8,990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	0	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	$4,350	control	kleinwagen	2007	automatik	71	fortwo	70,000km	6	benzin	smart	nein	2016-03-12 00:00:00	0	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	privat	Angebot	$1,350	test	kombi	2003	manuell	0	focus	150,000km	7	benzin	ford	nein	2016-04-01 00:00:00	0	39218	2016-04-01 14:38:50
5	2016-03-21 13:47:45	Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto...	privat	Angebot	$7,900	test	bus	2006	automatik	150	voyager	150,000km	4	diesel	chrysler	NaN	2016-03-21 00:00:00	0	22962	2016-04-06 09:45:21
6	2016-03-20 17:55:21	VW_Golf_III_GT_Special_Electronic_Green_Metall...	privat	Angebot	$300	test	limousine	1995	manuell	90	golf	150,000km	8	benzin	volkswagen	NaN	2016-03-20 00:00:00	0	31535	2016-03-23 02:48:59
7	2016-03-16 18:55:19	Golf_IV_1.9_TDI_90PS	privat	Angebot	$1,990	control	limousine	1998	manuell	90	golf	150,000km	12	diesel	volkswagen	nein	2016-03-16 00:00:00	0	53474	2016-04-07 03:17:32
8	2016-03-22 16:51:34	Seat_Arosa	privat	Angebot	$250	test	NaN	2000	manuell	0	arosa	150,000km	10	NaN	seat	nein	2016-03-22 00:00:00	0	7426	2016-03-26 18:18:10
9	2016-03-16 13:47:02	Renault_Megane_Scenic_1.6e_RT_Klimaanlage	privat	Angebot	$590	control	bus	1997	manuell	90	megane	150,000km	7	benzin	renault	nein	2016-03-16 00:00:00	0	15749	2016-04-06 10:46:35
10	2016-03-15 01:41:36	VW_Golf_Tuning_in_siber/grau	privat	Angebot	$999	test	NaN	2017	manuell	90	NaN	150,000km	4	benzin	volkswagen	nein	2016-03-14 00:00:00	0	86157	2016-04-07 03:16:21
11	2016-03-16 18:45:34	Mercedes_A140_Motorschaden	privat	Angebot	$350	control	NaN	2000	NaN	0	NaN	150,000km	0	benzin	mercedes_benz	NaN	2016-03-16 00:00:00	0	17498	2016-03-16 18:45:34
12	2016-03-31 19:48:22	Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan...	privat	Angebot	$5,299	control	kleinwagen	2010	automatik	71	fortwo	50,000km	9	benzin	smart	nein	2016-03-31 00:00:00	0	34590	2016-04-06 14:17:52
13	2016-03-23 10:48:32	Audi_A3_1.6_tuning	privat	Angebot	$1,350	control	limousine	1999	manuell	101	a3	150,000km	11	benzin	audi	nein	2016-03-23 00:00:00	0	12043	2016-04-01 14:17:13
14	2016-03-23 11:50:46	Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver...	privat	Angebot	$3,999	test	kleinwagen	2007	manuell	75	clio	150,000km	9	benzin	renault	NaN	2016-03-23 00:00:00	0	81737	2016-04-01 15:46:47
15	2016-04-01 12:06:20	Corvette_C3_Coupe_T_Top_Crossfire_Injection	privat	Angebot	$18,900	test	coupe	1982	automatik	203	NaN	80,000km	6	benzin	sonstige_autos	nein	2016-04-01 00:00:00	0	61276	2016-04-02 21:10:48
16	2016-03-16 14:59:02	Opel_Vectra_B_Kombi	privat	Angebot	$350	test	kombi	1999	manuell	101	vectra	150,000km	5	benzin	opel	nein	2016-03-16 00:00:00	0	57299	2016-03-18 05:29:37
17	2016-03-29 11:46:22	Volkswagen_Scirocco_2_G60	privat	Angebot	$5,500	test	coupe	1990	manuell	205	scirocco	150,000km	6	benzin	volkswagen	nein	2016-03-29 00:00:00	0	74821	2016-04-05 20:46:26
18	2016-03-26 19:57:44	Verkaufen_mein_bmw_e36_320_i_touring	privat	Angebot	$300	control	bus	1995	manuell	150	3er	150,000km	0	benzin	bmw	NaN	2016-03-26 00:00:00	0	54329	2016-04-02 12:16:41
19	2016-03-17 13:36:21	mazda_tribute_2.0_mit_gas_und_tuev_neu_2018	privat	Angebot	$4,150	control	suv	2004	manuell	124	andere	150,000km	2	lpg	mazda	nein	2016-03-17 00:00:00	0	40878	2016-03-17 14:45:58
20	2016-03-05 19:57:31	Audi_A4_Avant_1.9_TDI_6_GangAHKKlimatronik...	privat	Angebot	$3,500	test	kombi	2003	manuell	131	a4	150,000km	5	diesel	audi	NaN	2016-03-05 00:00:00	0	53913	2016-03-07 05:46:46
21	2016-03-06 19:07:10	Porsche_911_Carrera_4S_Cabrio	privat	Angebot	$41,500	test	cabrio	2004	manuell	320	911	150,000km	4	benzin	porsche	nein	2016-03-06 00:00:00	0	65428	2016-04-05 23:46:19
22	2016-03-28 20:50:54	MINI_Cooper_S_Cabrio	privat	Angebot	$25,450	control	cabrio	2015	manuell	184	cooper	10,000km	1	benzin	mini	nein	2016-03-28 00:00:00	0	44789	2016-04-01 06:45:30
23	2016-03-10 19:55:34	Peugeot_Boxer_2_2_HDi_120_Ps_9_Sitzer_inkl_Klima	privat	Angebot	$7,999	control	bus	2010	manuell	120	NaN	150,000km	2	diesel	peugeot	nein	2016-03-10 00:00:00	0	30900	2016-03-17 08:45:17
24	2016-04-03 11:57:02	BMW_535i_xDrive_Sport_Aut.	privat	Angebot	$48,500	control	limousine	2014	automatik	306	5er	30,000km	12	benzin	bmw	nein	2016-04-03 00:00:00	0	22547	2016-04-07 13:16:50
25	2016-03-21 21:56:18	Ford_escort_kombi_an_bastler_mit_ghia_ausstattung	privat	Angebot	$90	control	kombi	1996	manuell	116	NaN	150,000km	4	benzin	ford	ja	2016-03-21 00:00:00	0	27574	2016-04-01 05:16:49
26	2016-04-03 22:46:28	Volkswagen_Polo_Fox	privat	Angebot	$777	control	kleinwagen	1992	manuell	54	polo	125,000km	2	benzin	volkswagen	nein	2016-04-03 00:00:00	0	38110	2016-04-05 23:46:48
27	2016-03-27 18:45:01	Hat_einer_Ahnung_mit_Ford_Galaxy_HILFE	privat	Angebot	$0	control	NaN	2005	NaN	0	NaN	150,000km	0	NaN	ford	NaN	2016-03-27 00:00:00	0	66701	2016-03-27 18:45:01
28	2016-03-19 21:56:19	MINI_Cooper_D	privat	Angebot	$5,250	control	kleinwagen	2007	manuell	110	cooper	150,000km	7	diesel	mini	ja	2016-03-19 00:00:00	0	15745	2016-04-07 14:58:48
29	2016-04-02 12:45:44	Mercedes_Benz_E_320_T_CDI_Avantgarde_DPF7_Sitz...	privat	Angebot	$4,999	test	kombi	2004	automatik	204	e_klasse	150,000km	10	diesel	mercedes_benz	nein	2016-04-02 00:00:00	0	47638	2016-04-02 12:45:44
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
49970	2016-03-21 22:47:37	c4_Grand_Picasso_mit_Automatik_Leder_Navi_Temp...	privat	Angebot	$15,800	control	bus	2010	automatik	136	c4	60,000km	4	diesel	citroen	nein	2016-03-21 00:00:00	0	14947	2016-04-07 04:17:34
49971	2016-03-29 14:54:12	W.Lupo_1.0	privat	Angebot	$950	test	kleinwagen	2001	manuell	50	lupo	150,000km	4	benzin	volkswagen	nein	2016-03-29 00:00:00	0	65197	2016-03-29 20:41:51
49972	2016-03-26 22:25:23	Mercedes_Benz_Vito_115_CDI_Extralang_Aut.	privat	Angebot	$3,300	control	bus	2004	automatik	150	vito	150,000km	10	diesel	mercedes_benz	ja	2016-03-26 00:00:00	0	65326	2016-03-28 11:28:18
49973	2016-03-27 05:32:39	Mercedes_Benz_SLK_200_Kompressor	privat	Angebot	$6,000	control	cabrio	2004	manuell	163	slk	150,000km	11	benzin	mercedes_benz	nein	2016-03-27 00:00:00	0	53567	2016-03-27 08:25:24
49974	2016-03-20 10:52:31	Golf_1_Cabrio_Tuev_Neu_viele_Extras_alles_eing...	privat	Angebot	$0	control	cabrio	1983	manuell	70	golf	150,000km	2	benzin	volkswagen	nein	2016-03-20 00:00:00	0	8209	2016-03-27 19:48:16
49975	2016-03-27 20:51:39	Honda_Jazz_1.3_DSi_i_VTEC_IMA_CVT_Comfort	privat	Angebot	$9,700	control	kleinwagen	2012	automatik	88	jazz	100,000km	11	hybrid	honda	nein	2016-03-27 00:00:00	0	84385	2016-04-05 19:45:34
49976	2016-03-19 18:56:05	Audi_80_Avant_2.6_E__Vollausstattung!!_Einziga...	privat	Angebot	$5,900	test	kombi	1992	automatik	150	80	150,000km	12	benzin	audi	nein	2016-03-19 00:00:00	0	36100	2016-04-07 06:16:44
49977	2016-03-31 18:37:18	Mercedes_Benz_C200_Cdi_W203	privat	Angebot	$5,500	control	limousine	2003	manuell	116	c_klasse	150,000km	2	diesel	mercedes_benz	nein	2016-03-31 00:00:00	0	33739	2016-04-06 12:16:11
49978	2016-04-04 10:37:14	Mercedes_Benz_E_200_Classic	privat	Angebot	$900	control	limousine	1996	automatik	136	e_klasse	150,000km	9	benzin	mercedes_benz	ja	2016-04-04 00:00:00	0	24405	2016-04-06 12:44:20
49979	2016-03-20 18:38:40	Volkswagen_Polo_1.6_TDI_Style	privat	Angebot	$11,000	test	kleinwagen	2011	manuell	90	polo	70,000km	11	diesel	volkswagen	nein	2016-03-20 00:00:00	0	48455	2016-04-07 01:45:12
49980	2016-03-12 10:55:54	Ford_Escort_Turnier_16V	privat	Angebot	$400	control	kombi	1995	manuell	105	escort	125,000km	3	benzin	ford	NaN	2016-03-12 00:00:00	0	56218	2016-04-06 17:16:49
49981	2016-03-15 09:38:21	Opel_Astra_Kombi_mit_Anhaengerkupplung	privat	Angebot	$2,000	control	kombi	1998	manuell	115	astra	150,000km	12	benzin	opel	nein	2016-03-15 00:00:00	0	86859	2016-04-05 17:21:46
49982	2016-03-29 18:51:08	Skoda_Fabia_4_Tuerer_Bj:2004__85.000Tkm	privat	Angebot	$1,950	control	kleinwagen	2004	manuell	0	fabia	90,000km	7	benzin	skoda	NaN	2016-03-29 00:00:00	0	45884	2016-03-29 18:51:08
49983	2016-03-06 12:43:04	Ford_focus_99	privat	Angebot	$600	test	kleinwagen	1999	manuell	101	focus	150,000km	4	benzin	ford	NaN	2016-03-06 00:00:00	0	52477	2016-03-09 06:16:08
49984	2016-03-31 22:48:48	Student_sucht_ein__Anfaengerauto___ab_2000_BJ_...	privat	Angebot	$0	test	NaN	2000	NaN	0	NaN	150,000km	0	NaN	sonstige_autos	NaN	2016-03-31 00:00:00	0	12103	2016-04-02 19:44:53
49985	2016-04-02 16:38:23	Verkaufe_meinen_vw_vento!	privat	Angebot	$1,000	control	NaN	1995	automatik	0	NaN	150,000km	0	benzin	volkswagen	NaN	2016-04-02 00:00:00	0	30900	2016-04-06 15:17:52
49986	2016-04-04 20:46:02	Chrysler_300C_3.0_CRD_DPF_Automatik_Voll_Ausst...	privat	Angebot	$15,900	control	limousine	2010	automatik	218	300c	125,000km	11	diesel	chrysler	nein	2016-04-04 00:00:00	0	73527	2016-04-06 23:16:00
49987	2016-03-22 20:47:27	Audi_A3_Limousine_2.0_TDI_DPF_Ambition__NAVI__...	privat	Angebot	$21,990	control	limousine	2013	manuell	150	a3	50,000km	11	diesel	audi	nein	2016-03-22 00:00:00	0	94362	2016-03-26 22:46:06
49988	2016-03-28 19:49:51	BMW_330_Ci	privat	Angebot	$9,550	control	coupe	2001	manuell	231	3er	150,000km	10	benzin	bmw	nein	2016-03-28 00:00:00	0	83646	2016-04-07 02:17:40
49989	2016-03-11 19:50:37	VW_Polo_zum_Ausschlachten_oder_Wiederaufbau	privat	Angebot	$150	test	kleinwagen	1997	manuell	0	polo	150,000km	5	benzin	volkswagen	ja	2016-03-11 00:00:00	0	21244	2016-03-12 10:17:55
49990	2016-03-21 19:54:19	Mercedes_Benz_A_200__BlueEFFICIENCY__Urban	privat	Angebot	$17,500	test	limousine	2012	manuell	156	a_klasse	30,000km	12	benzin	mercedes_benz	nein	2016-03-21 00:00:00	0	58239	2016-04-06 22:46:57
49991	2016-03-06 15:25:19	Kleinwagen	privat	Angebot	$500	control	NaN	2016	manuell	0	twingo	150,000km	0	benzin	renault	NaN	2016-03-06 00:00:00	0	61350	2016-03-06 18:24:19
49992	2016-03-10 19:37:38	Fiat_Grande_Punto_1.4_T_Jet_16V_Sport	privat	Angebot	$4,800	control	kleinwagen	2009	manuell	120	andere	125,000km	9	lpg	fiat	nein	2016-03-10 00:00:00	0	68642	2016-03-13 01:44:51
49993	2016-03-15 18:47:35	Audi_A3__1_8l__Silber;_schoenes_Fahrzeug	privat	Angebot	$1,650	control	kleinwagen	1997	manuell	0	NaN	150,000km	7	benzin	audi	NaN	2016-03-15 00:00:00	0	65203	2016-04-06 19:46:53
49994	2016-03-22 17:36:42	Audi_A6__S6__Avant_4.2_quattro_eventuell_Tausc...	privat	Angebot	$5,000	control	kombi	2001	automatik	299	a6	150,000km	1	benzin	audi	nein	2016-03-22 00:00:00	0	46537	2016-04-06 08:16:39
49995	2016-03-27 14:38:19	Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon	privat	Angebot	$24,900	control	limousine	2011	automatik	239	q5	100,000km	1	diesel	audi	nein	2016-03-27 00:00:00	0	82131	2016-04-01 13:47:40
49996	2016-03-28 10:50:25	Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...	privat	Angebot	$1,980	control	cabrio	1996	manuell	75	astra	150,000km	5	benzin	opel	nein	2016-03-28 00:00:00	0	44807	2016-04-02 14:18:02
49997	2016-04-02 14:44:48	Fiat_500_C_1.2_Dualogic_Lounge	privat	Angebot	$13,200	test	cabrio	2014	automatik	69	500	5,000km	11	benzin	fiat	nein	2016-04-02 00:00:00	0	73430	2016-04-04 11:47:27
49998	2016-03-08 19:25:42	Audi_A3_2.0_TDI_Sportback_Ambition	privat	Angebot	$22,900	control	kombi	2013	manuell	150	a3	40,000km	11	diesel	audi	nein	2016-03-08 00:00:00	0	35683	2016-04-05 16:45:07
49999	2016-03-14 00:42:12	Opel_Vectra_1.6_16V	privat	Angebot	$1,250	control	limousine	1996	manuell	101	vectra	150,000km	1	benzin	opel	nein	2016-03-13 00:00:00	0	45897	2016-04-06 21:18:48

50000 rows × 20 columns

In [4]:

#get the info of the dataset
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB

In [5]:

# get the head of the dataset
autos.head()

Out[5]:

	dateCrawled	name	seller	offerType	price	abtest	vehicleType	yearOfRegistration	gearbox	powerPS	model	odometer	monthOfRegistration	fuelType	brand	notRepairedDamage	dateCreated	postalCode	lastSeen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	$8,990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	$4,350	control	kleinwagen	2007	automatik	71	fortwo	70,000km	6	benzin	smart	nein	2016-03-12 00:00:00	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	privat	Angebot	$1,350	test	kombi	2003	manuell	0	focus	150,000km	7	benzin	ford	nein	2016-04-01 00:00:00	39218	2016-04-01 14:38:50

We can make the following observations:

The dataset contains 20 columns, most of which are strings.
Some columns have null values, but none have more than ~ 20% null values.
The column names use camelCase instead of Python's preferred snake_case, which means we can't just replace spaces with underscores.

Let's convert the column names from camelCase to snake_case and reword some of the column names based on the data dictionary to be more descriptive.

In [6]:

# print the column names
autos.columns

Out[6]:

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [7]:

#convert column names from camelCase to snake_case
columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'year_of_registration', 'gear_box', 'power_ps', 'model',
       'odometer', 'month_of_registration', 'fuel_type', 'brand',
       'not_repaired_damage', 'date_created', 'nr_of_pictures', 'postal_code',
       'last_seen']

In [8]:

#assign the column names back to the datafram
autos.columns = columns

In [9]:

#see the head of the dataframe again
autos.head()

Out[9]:

	date_crawled	name	seller	offer_type	price	abtest	vehicle_type	year_of_registration	gear_box	power_ps	model	odometer	month_of_registration	fuel_type	brand	not_repaired_damage	date_created	postal_code	last_seen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	$8,990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	$4,350	control	kleinwagen	2007	automatik	71	fortwo	70,000km	6	benzin	smart	nein	2016-03-12 00:00:00	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	privat	Angebot	$1,350	test	kombi	2003	manuell	0	focus	150,000km	7	benzin	ford	nein	2016-04-01 00:00:00	39218	2016-04-01 14:38:50

Now let us do some basic data exploration to determine what other cleaning tasks need to be done. Initially we will look for: -Text columns where all or almost all values are the same. These can often be dropped as they don't have useful information for analysis. -Example of numeric data stored as text which can be cleaned and converted.

In [10]:

#use describe to give a quick summary of the data, 
#include = 'all' to get also categorical columns
autos.describe(include = 'all')

Out[10]:

	date_crawled	name	seller	offer_type	price	abtest	vehicle_type	year_of_registration	gear_box	power_ps	model	odometer	month_of_registration	fuel_type	brand	not_repaired_damage	date_created	nr_of_pictures	postal_code	last_seen
count	50000	50000	50000	50000	50000	50000	44905	50000.000000	47320	50000.000000	47242	50000	50000.000000	45518	50000	40171	50000	50000.0	50000.000000	50000
unique	48213	38754	2	2	2357	2	8	NaN	2	NaN	245	13	NaN	7	40	2	76	NaN	NaN	39481
top	2016-03-21 16:37:21	Ford_Fiesta	privat	Angebot	$0	test	limousine	NaN	manuell	NaN	golf	150,000km	NaN	benzin	volkswagen	nein	2016-04-03 00:00:00	NaN	NaN	2016-04-07 06:17:27
freq	3	78	49999	49999	1421	25756	12859	NaN	36993	NaN	4024	32424	NaN	30107	10687	35232	1946	NaN	NaN	8
mean	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2005.073280	NaN	116.355920	NaN	NaN	5.723360	NaN	NaN	NaN	NaN	0.0	50813.627300	NaN
std	NaN	NaN	NaN	NaN	NaN	NaN	NaN	105.712813	NaN	209.216627	NaN	NaN	3.711984	NaN	NaN	NaN	NaN	0.0	25779.747957	NaN
min	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1000.000000	NaN	0.000000	NaN	NaN	0.000000	NaN	NaN	NaN	NaN	0.0	1067.000000	NaN
25%	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1999.000000	NaN	70.000000	NaN	NaN	3.000000	NaN	NaN	NaN	NaN	0.0	30451.000000	NaN
50%	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2003.000000	NaN	105.000000	NaN	NaN	6.000000	NaN	NaN	NaN	NaN	0.0	49577.000000	NaN
75%	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2008.000000	NaN	150.000000	NaN	NaN	9.000000	NaN	NaN	NaN	NaN	0.0	71540.000000	NaN
max	NaN	NaN	NaN	NaN	NaN	NaN	NaN	9999.000000	NaN	17700.000000	NaN	NaN	12.000000	NaN	NaN	NaN	NaN	0.0	99998.000000	NaN

Observe that the columns seller, offer_type, abtest, gear_box, not_repaired_damage have only two distinct values, so they can be dropped.

The price and odometer columns are numeric values stored at text. For each column:

Remove any non-numeric characters.
Convert the column to a numeric dtype.
Use DataFrame.rename() to rename the odometer column to odometer_km.

In [11]:

#remove the '$' sign from price
autos['price'] = autos['price'].str.replace("$","")
#remove "," from price
autos['price'] = autos['price'].str.replace(",","")
#convert to float type
autos['price'] = autos['price'].astype(float)

In [12]:

#remove 'km' from odometer
autos['odometer'] = autos['odometer'].str.replace("km","")
#remove ',' from odometer
autos['odometer'] = autos['odometer'].str.replace(",","")
#convert to float type
autos['odometer'] = autos['odometer'].astype(float)
#rename odometer to odometer_km
autos.rename({'odometer':'odometer_km'}, axis = 1,inplace = True)

In [13]:

#check the info again
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
date_crawled             50000 non-null object
name                     50000 non-null object
seller                   50000 non-null object
offer_type               50000 non-null object
price                    50000 non-null float64
abtest                   50000 non-null object
vehicle_type             44905 non-null object
year_of_registration     50000 non-null int64
gear_box                 47320 non-null object
power_ps                 50000 non-null int64
model                    47242 non-null object
odometer_km              50000 non-null float64
month_of_registration    50000 non-null int64
fuel_type                45518 non-null object
brand                    50000 non-null object
not_repaired_damage      40171 non-null object
date_created             50000 non-null object
nr_of_pictures           50000 non-null int64
postal_code              50000 non-null int64
last_seen                50000 non-null object
dtypes: float64(2), int64(5), object(13)
memory usage: 7.6+ MB

Let's explore the odometer_km feature

In [14]:

autos['odometer_km'].describe()

Out[14]:

count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

In [15]:

autos['odometer_km'].value_counts().sort_index(ascending = False)

Out[15]:

150000.0    32424
125000.0     5170
100000.0     2169
90000.0      1757
80000.0      1436
70000.0      1230
60000.0      1164
50000.0      1027
40000.0       819
30000.0       789
20000.0       784
10000.0       264
5000.0        967
Name: odometer_km, dtype: int64

In [16]:

autos.hist(column = 'odometer_km')

Out[16]:

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f3cc3fe7630>]],
      dtype=object)

Let's explore the price column

In [17]:

autos['price'].describe()

Out[17]:

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

In [18]:

autos.hist(column = 'price', bins = 100)

Out[18]:

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f3cc46d79b0>]],
      dtype=object)

In [19]:

autos['price'].value_counts().head(10)

Out[19]:

0.0       1421
500.0      781
1500.0     734
2500.0     643
1200.0     639
1000.0     639
600.0      531
800.0      498
3500.0     498
2000.0     460
Name: price, dtype: int64

In [20]:

#drop the row with price = 0
autos = autos[autos['price']>0]

In [21]:

autos['price'].describe()

Out[21]:

count    4.857900e+04
mean     1.012788e+04
std      4.880873e+05
min      1.000000e+00
25%      1.200000e+03
50%      3.000000e+03
75%      7.490000e+03
max      1.000000e+08
Name: price, dtype: float64

Now let's move on to the date column

There are three columns that contains date date_crawled, last_seen, and ad_created.

Let's first look at the date_crawled feature

In [22]:

#normalize = True means we use frequency instead of count
#dropna means we also include missing values
autos['date_crawled'].str[:10].value_counts(normalize = True,dropna = False).sort_index()

Out[22]:

2016-03-05    0.025320
2016-03-06    0.014039
2016-03-07    0.036003
2016-03-08    0.033327
2016-03-09    0.033101
2016-03-10    0.032174
2016-03-11    0.032566
2016-03-12    0.036930
2016-03-13    0.015665
2016-03-14    0.036538
2016-03-15    0.034274
2016-03-16    0.029601
2016-03-17    0.031639
2016-03-18    0.012907
2016-03-19    0.034768
2016-03-20    0.037876
2016-03-21    0.037403
2016-03-22    0.033018
2016-03-23    0.032216
2016-03-24    0.029334
2016-03-25    0.031598
2016-03-26    0.032195
2016-03-27    0.031083
2016-03-28    0.034850
2016-03-29    0.034130
2016-03-30    0.033677
2016-03-31    0.031845
2016-04-01    0.033677
2016-04-02    0.035468
2016-04-03    0.038597
2016-04-04    0.036518
2016-04-05    0.013092
2016-04-06    0.003170
2016-04-07    0.001400
Name: date_crawled, dtype: float64

Now the last_seen feature

In [23]:

autos['last_seen'].str[:10].value_counts(normalize = True, dropna = False).sort_index()

Out[23]:

2016-03-05    0.001070
2016-03-06    0.004323
2016-03-07    0.005393
2016-03-08    0.007431
2016-03-09    0.009613
2016-03-10    0.010663
2016-03-11    0.012372
2016-03-12    0.023796
2016-03-13    0.008893
2016-03-14    0.012598
2016-03-15    0.015871
2016-03-16    0.016447
2016-03-17    0.028078
2016-03-18    0.007349
2016-03-19    0.015830
2016-03-20    0.020647
2016-03-21    0.020647
2016-03-22    0.021367
2016-03-23    0.018527
2016-03-24    0.019762
2016-03-25    0.019206
2016-03-26    0.016797
2016-03-27    0.015645
2016-03-28    0.020873
2016-03-29    0.022355
2016-03-30    0.024764
2016-03-31    0.023796
2016-04-01    0.022788
2016-04-02    0.024928
2016-04-03    0.025196
2016-04-04    0.024476
2016-04-05    0.124766
2016-04-06    0.221824
2016-04-07    0.131909
Name: last_seen, dtype: float64

Now the year_of_registration feature

In [24]:

autos['year_of_registration'].describe()

Out[24]:

count    48579.000000
mean      2004.753000
std         88.631663
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: year_of_registration, dtype: float64

Observe that the year_of_registration feature contains some odd values

The minimum value is 1000, before cars were invented.
The maximum value is 9999, many years into the future

In [25]:

autos['year_of_registration'].value_counts(normalize = True).sort_index()

Out[25]:

1000    0.000021
1001    0.000021
1111    0.000021
1800    0.000041
1910    0.000103
1927    0.000021
1929    0.000021
1931    0.000021
1934    0.000041
1937    0.000082
1938    0.000021
1939    0.000021
1941    0.000041
1943    0.000021
1948    0.000021
1950    0.000062
1951    0.000041
1952    0.000021
1953    0.000021
1954    0.000041
1955    0.000041
1956    0.000082
1957    0.000041
1958    0.000082
1959    0.000124
1960    0.000494
1961    0.000124
1962    0.000082
1963    0.000165
1964    0.000247
          ...   
2000    0.064966
2001    0.054283
2002    0.051174
2003    0.055559
2004    0.055641
2005    0.060438
2006    0.054983
2007    0.046872
2008    0.045596
2009    0.042940
2010    0.032710
2011    0.033409
2012    0.026966
2013    0.016530
2014    0.013668
2015    0.008069
2016    0.025114
2017    0.028675
2018    0.009696
2019    0.000041
2800    0.000021
4100    0.000021
4500    0.000021
4800    0.000021
5000    0.000082
5911    0.000021
6200    0.000021
8888    0.000021
9000    0.000021
9999    0.000062
Name: year_of_registration, Length: 95, dtype: float64

We choose the registration_year between 1910 and 2016

In [26]:

#restrict the year of registration to be between 1910 and 2016
autos = autos[(autos['year_of_registration']>=1910)&(autos['year_of_registration']<=2016)]

In [27]:

#investigate the distributrion of year of registration again
autos['year_of_registration'].value_counts(normalize = True).sort_index()

Out[27]:

1910    0.000107
1927    0.000021
1929    0.000021
1931    0.000021
1934    0.000043
1937    0.000086
1938    0.000021
1939    0.000021
1941    0.000043
1943    0.000021
1948    0.000021
1950    0.000064
1951    0.000043
1952    0.000021
1953    0.000021
1954    0.000043
1955    0.000043
1956    0.000086
1957    0.000043
1958    0.000086
1959    0.000128
1960    0.000514
1961    0.000128
1962    0.000086
1963    0.000171
1964    0.000257
1965    0.000364
1966    0.000471
1967    0.000557
1968    0.000557
          ...   
1987    0.001542
1988    0.002891
1989    0.003726
1990    0.007432
1991    0.007260
1992    0.007946
1993    0.009102
1994    0.013471
1995    0.026299
1996    0.029405
1997    0.041784
1998    0.050607
1999    0.062086
2000    0.067590
2001    0.056475
2002    0.053241
2003    0.057803
2004    0.057889
2005    0.062879
2006    0.057203
2007    0.048765
2008    0.047438
2009    0.044675
2010    0.034031
2011    0.034759
2012    0.028056
2013    0.017197
2014    0.014221
2015    0.008395
2016    0.026128
Name: year_of_registration, Length: 78, dtype: float64

Now let's explore the brand column

In [28]:

#get the top 20 brands
top20 = autos['brand'].value_counts().head(20).index

In [29]:

#find the mean price of the top 20 brands
top_brand = {}
for brand in top20:
    top = autos[autos['brand'] == brand]
    top_brand[brand] = np.mean(top['price'])

In [30]:

top_brand

Out[30]:

{'audi': 9336.687453600594,
 'bmw': 8571.480147917478,
 'citroen': 45486.68549618321,
 'fiat': 2813.748538011696,
 'ford': 7456.547932618683,
 'hyundai': 5365.254273504273,
 'mazda': 4112.596614950635,
 'mercedes_benz': 30829.021092362345,
 'mini': 10613.459657701711,
 'nissan': 4743.40252454418,
 'opel': 5432.479195699781,
 'peugeot': 3094.0172290021537,
 'renault': 2474.8646069968195,
 'seat': 4397.230949589683,
 'skoda': 6368.0,
 'smart': 3580.2239031770046,
 'sonstige_autos': 45208.36442516269,
 'toyota': 5167.091062394604,
 'volkswagen': 6729.81956411556,
 'volvo': 4946.501170960188}

In [34]:

autos.groupby('brand')['price'].mean().sort_values(ascending = False).head(20)

Out[34]:

brand
porsche           45643.937063
citroen           45486.685496
sonstige_autos    45208.364425
mercedes_benz     30829.021092
land_rover        19108.091837
jeep              11650.500000
jaguar            11635.493151
mini              10613.459658
audi               9336.687454
bmw                8571.480148
ford               7456.547933
volkswagen         6729.819564
chevrolet          6684.139098
skoda              6368.000000
kia                5982.330303
dacia              5915.528455
opel               5432.479196
hyundai            5365.254274
toyota             5167.091062
volvo              4946.501171
Name: price, dtype: float64

In [36]:

#convert top 20 brands and their mean prices to a dataframe
mean_price = pd.Series(top_brand)
top20 = pd.DataFrame(mean_price, columns = ['mean_price'])

In [37]:

top20

Out[37]:

	mean_price
audi	9336.687454
bmw	8571.480148
citroen	45486.685496
fiat	2813.748538
ford	7456.547933
hyundai	5365.254274
mazda	4112.596615
mercedes_benz	30829.021092
mini	10613.459658
nissan	4743.402525
opel	5432.479196
peugeot	3094.017229
renault	2474.864607
seat	4397.230950
skoda	6368.000000
smart	3580.223903
sonstige_autos	45208.364425
toyota	5167.091062
volkswagen	6729.819564
volvo	4946.501171

In [38]:

#compute mean mileage
mileage = {}
for brand in top_brand:
    top = autos[autos['brand']==brand]
    mileage[brand] = top['odometer_km'].mean()

In [39]:

#convert mileage to a series
mileage = pd.Series(mileage)

In [40]:

#add mileage to the top20 dataframe
top20['mean_mileage'] = mileage

In [41]:

print(top20)

                  mean_price   mean_mileage
audi             9336.687454  129157.386785
bmw              8571.480148  132575.905021
citroen         45486.685496  119572.519084
fiat             2813.748538  117121.971596
ford             7456.547933  124243.491577
hyundai          5365.254274  106442.307692
mazda            4112.596615  124464.033850
mercedes_benz   30829.021092  130792.628774
mini            10613.459658   88105.134474
nissan           4743.402525  118330.995792
opel             5432.479196  129314.154888
peugeot          3094.017229  127153.625269
renault          2474.864607  128071.331213
seat             4397.230950  121131.301290
skoda            6368.000000  110848.563969
smart            3580.223903   99326.777610
sonstige_autos  45208.364425   89707.158351
toyota           5167.091062  115944.350759
volkswagen       6729.819564  128713.634060
volvo            4946.501171  138067.915691

In [42]:

top20.sort_values(by = 'mean_price', ascending = False)

Out[42]:

	mean_price	mean_mileage
citroen	45486.685496	119572.519084
sonstige_autos	45208.364425	89707.158351
mercedes_benz	30829.021092	130792.628774
mini	10613.459658	88105.134474
audi	9336.687454	129157.386785
bmw	8571.480148	132575.905021
ford	7456.547933	124243.491577
volkswagen	6729.819564	128713.634060
skoda	6368.000000	110848.563969
opel	5432.479196	129314.154888
hyundai	5365.254274	106442.307692
toyota	5167.091062	115944.350759
volvo	4946.501171	138067.915691
nissan	4743.402525	118330.995792
seat	4397.230950	121131.301290
mazda	4112.596615	124464.033850
smart	3580.223903	99326.777610
peugeot	3094.017229	127153.625269
fiat	2813.748538	117121.971596
renault	2474.864607	128071.331213

Exploring Car Sale Data

Exploring Ebay Car Sales Data¶

Comments

Exploring Ebay Car Sales Data¶

Part 1 of the Exploratory Data Analysis series

Next articles

Comments