Guided Project: Star Wars Survey
For this guided project from Dataquest, we will work with the Star Wars survey data. The team at FiveThirtyEight surveyed Star Wars fans using the online tool SurveyMonkey. They received 835 responses, which we can download from their GitHub repository.
1. Overview
The following code reads the data into a pandas DataFrame:
import pandas as pd
star_wars = pd.read_csv("star_wars.csv", encoding = "ISO-8859-1")
We need to specify an encoding because the data set has some characters that aren't in Python's default utf-8 encoding. For more information about encoding, check out Joel Spolsky's blog.
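To see why the encoding matters, here is a minimal sketch using a made-up byte string (it is not taken from star_wars.csv). It shows that bytes written as ISO-8859-1 can fail to decode as UTF-8.

```python
# Illustrative only: this byte string is invented, not read from the survey file.
raw = "Padmé".encode("ISO-8859-1")   # b'Padm\xe9'

# Decoding with the encoding that produced the bytes recovers the text.
decoded = raw.decode("ISO-8859-1")

# Decoding the same bytes as UTF-8 fails: 0xE9 opens a multi-byte sequence
# in UTF-8, and no valid continuation bytes follow it.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(decoded, utf8_ok)
```

Passing encoding="ISO-8859-1" to pd.read_csv tells pandas to decode the file the first way rather than the second.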
Let us check out the first few rows of the data
star_wars.head(10)
The data has several columns, including
- RespondentID - An anonymized ID for the respondent (person taking the survey)
- Gender - The respondent's gender
- Age - The respondent's age
- Household Income - The respondent's income
- Education - The respondent's education level
- Location (Census Region) - The respondent's location
- Have you seen any of the 6 films in the Star Wars franchise? - Has a Yes or No response.
- Do you consider yourself to be a fan of the Star Wars film franchise? - Has a Yes or No response.
Remove the rows where RespondentID is NaN
star_wars = star_wars[pd.notnull(star_wars['RespondentID'])]
Check the number of rows and columns
star_wars.shape
Let us look at the list of columns
star_wars.columns
2. Cleaning and Mapping Yes/No Columns
Convert Have you seen any of the 6 films in the Star Wars franchise to Boolean type
star_wars['Have you seen any of the 6 films in the Star Wars franchise?']=star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].map({'Yes':True,'No':False})
#checking the column
star_wars['Have you seen any of the 6 films in the Star Wars franchise?'].value_counts(dropna = False)
Convert the Do you consider yourself to be a fan of the Star Wars film franchise? column to the Boolean type
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?']=star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].map({'Yes':True,'No':False})
star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?'].value_counts(dropna = False)
3. Cleaning and Mapping Checkbox Columns
The next six columns represent a single checkbox question. The respondent checked off a series of boxes in response to the question Which of the following Star Wars films have you seen? Please select all that apply.
The columns for this question are
- Which of the following Star Wars films have you seen? Please select all that apply. - Whether or not the respondent saw Star Wars: Episode I The Phantom Menace
- Unnamed: 4 - Whether or not the respondent saw Star Wars: Episode II Attack of the Clones
- Unnamed: 5 - Whether or not the respondent saw Star Wars: Episode III Revenge of the Sith
- Unnamed: 6 - Whether or not the respondent saw Star Wars: Episode IV A New Hope
- Unnamed: 7 - Whether or not the respondent saw Star Wars: Episode V The Empire Strikes Back
- Unnamed: 8 - Whether or not the respondent saw Star Wars: Episode VI Return of the Jedi
For each of these columns, if the value in a cell is the name of the movie, that means the respondent saw the movie. If the value is NaN, the respondent either didn't answer or didn't see the movie. We'll assume that they didn't see the movie.
We'll need to convert each of these columns to a Boolean, then rename the column to something more intuitive.
Convert each column above so that it only contains the values True and False
Be very careful with spacing when constructing your mapping dictionary.
#look at the names of the movies
star_wars.iloc[0,:]
import numpy as np
name_map = {'Star Wars: Episode I The Phantom Menace':True,
'Star Wars: Episode II Attack of the Clones': True,
'Star Wars: Episode III Revenge of the Sith': True,
'Star Wars: Episode IV A New Hope': True,
'Star Wars: Episode V The Empire Strikes Back': True,
'Star Wars: Episode VI Return of the Jedi': True,
np.nan: False}
Convert the columns to Boolean
# map each of the six checkbox columns (positions 3 through 8) to True/False
for col in star_wars.columns[3:9]:
    star_wars[col] = star_wars[col].map(name_map)
#check the columns
star_wars.iloc[:,3:9].head()
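The dictionary mapping works only when every key matches the cell text exactly, spacing included. As a hedged alternative, the sketch below assumes that any non-missing cell means the film was seen, so the title text never has to be matched. The toy DataFrame and its column names (seen_a, seen_b) are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for two checkbox columns; the values are illustrative.
toy = pd.DataFrame({
    "seen_a": ["Episode I title", np.nan, "Episode I title"],
    "seen_b": [np.nan, np.nan, "Episode II title"],
})

# Any non-missing cell counts as "seen", so the exact title text
# (including its spacing) no longer matters.
seen = toy.notna()
print(seen)
```

The trade-off is that this treats every non-empty cell as a valid answer, whereas the dictionary approach would turn an unexpected value into NaN and so make it visible.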
Rename the columns
star_wars.columns[3:9]
star_wars = star_wars.rename(columns = {'Which of the following Star Wars films have you seen? Please select all that apply.':'seen_1',
'Unnamed: 4': 'seen_2',
'Unnamed: 5': 'seen_3',
'Unnamed: 6': 'seen_4',
'Unnamed: 7': 'seen_5',
'Unnamed: 8': 'seen_6'})
#check the head of the dataframe
star_wars.head()
4. Cleaning the Ranking Columns
The next six columns ask the respondents to rank the Star Wars movies from most favorite to least favorite: 1 means the film was the most favorite and 6 means it was the least favorite. Each of the following columns can contain the value 1, 2, 3, 4, 5, 6, or NaN:
- Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. - How much the respondent liked Star Wars: Episode I The Phantom Menace
- Unnamed: 10 - How much the respondent liked Star Wars: Episode II Attack of the Clones
- Unnamed: 11 - How much the respondent liked Star Wars: Episode III Revenge of the Sith
- Unnamed: 12 - How much the respondent liked Star Wars: Episode IV A New Hope
- Unnamed: 13 - How much the respondent liked Star Wars: Episode V The Empire Strikes Back
- Unnamed: 14 - How much the respondent liked Star Wars: Episode VI Return of the Jedi
Fortunately these columns don't require a lot of cleanup. We'll need to convert each column to a numeric type, though, then rename the columns so that we can tell what they represent more easily.
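Convert columns 9 through 14 to float type
# assign by column label so the converted columns replace the originals
star_wars[star_wars.columns[9:15]] = star_wars[star_wars.columns[9:15]].astype(float)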
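The conversion step can be tried on a small example first. This is a minimal sketch on made-up data with hypothetical column names (rank_a, rank_b). It assigns by column label, which replaces the columns outright; in recent pandas versions, assigning through .iloc may instead write the values in place and keep the original dtype.

```python
import numpy as np
import pandas as pd

# Toy ranking columns stored as strings, as they might be after reading a CSV;
# the names and values are hypothetical.
toy = pd.DataFrame({"rank_a": ["3", "1", np.nan], "rank_b": ["2", "6", "4"]})

cols = toy.columns[0:2]
toy[cols] = toy[cols].astype(float)   # replaces the columns, so the dtype changes

print(toy.dtypes)
```

Checking dtypes after the conversion is a quick way to confirm that later calls to mean() operate on numbers rather than strings.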
Rename columns 9 through 14
star_wars.columns[9:15]
star_wars = star_wars.rename(columns = {'Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.':'rank_1',
'Unnamed: 10':'rank_2',
'Unnamed: 11':'rank_3',
'Unnamed: 12':'rank_4',
'Unnamed: 13': 'rank_5',
'Unnamed: 14':'rank_6'})
star_wars.iloc[:,9:15].head()
5. Finding the Highest-Ranked Movie
# select the ranking columns first so non-numeric columns are never averaged
ranking = star_wars.iloc[:,9:15].mean()
ranking.plot.bar()
We see that Star Wars: Episode V The Empire Strikes Back has the best average ranking (its mean rank is the lowest, and lower is better), while Star Wars: Episode III Revenge of the Sith has the worst.
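Because lower ranks are better, the best film is the one with the smallest mean, which is easy to misread from a bar chart. A minimal sketch on made-up rankings shows how idxmin and idxmax can name the best and worst columns directly.

```python
import pandas as pd

# Made-up rankings from three respondents; 1 = favorite, 6 = least favorite.
toy = pd.DataFrame({
    "rank_4": [2, 1, 3],
    "rank_5": [1, 2, 1],
    "rank_3": [6, 5, 6],
})

mean_rank = toy.mean()
best = mean_rank.idxmin()    # column with the lowest (best) mean rank
worst = mean_rank.idxmax()   # column with the highest (worst) mean rank
print(best, worst)
```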
6. Finding the Most Viewed Movie
# select the seen_* columns first, then count the True values in each
movie_seen = star_wars.iloc[:,3:9].sum()
movie_seen.plot.bar()
We see that Episode V is the most viewed movie and Episode III is the least viewed.
7. Exploring the Data by Binary Segments
We know which movies the survey population as a whole has ranked the highest. Now let's examine how certain segments of the survey population responded. We will focus on two columns which segment our data into two groups.
- Do you consider yourself to be a fan of the Star Wars film franchise? - True or False
- Gender - Male or Female
Let's first segment the data based on Star Wars fans.
fan = star_wars[star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?']==True]
fan_rank = fan.iloc[:,9:15].mean()
fan_rank.plot.bar()
For Star Wars fans, Episode V is still the highest ranked, and Episode III is still the lowest ranked.
fan_seen = fan.iloc[:,3:9].sum()
fan_seen.plot.bar()
not_fan = star_wars[star_wars['Do you consider yourself to be a fan of the Star Wars film franchise?']==False]
not_fan_rank = not_fan.iloc[:,9:15].mean()
not_fan_rank.plot.bar()
not_fan_seen = not_fan.iloc[:,3:9].sum()
not_fan_seen.plot.bar()
Now let us segment the data based on gender
male = star_wars[star_wars['Gender']=='Male']
male_rank = male.iloc[:,9:15].mean()
male_rank.plot.bar()
male_seen = male.iloc[:,3:9].sum()
male_seen.plot.bar()
female = star_wars[star_wars['Gender'] == 'Female']
female_rank = female.iloc[:,9:15].mean()
female_rank.plot.bar()
female_seen = female.iloc[:,3:9].sum()
female_seen.plot.bar()
The plots above show that Episode V is the most popular and Episode III is the least popular overall, and that this result holds for both male and female respondents.
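The segment-by-segment filtering above can also be expressed with groupby, which computes every segment in one call. The sketch below uses made-up data with the same column naming pattern and is only one possible way to arrange the comparison.

```python
import pandas as pd

# Made-up responses; column names follow the pattern used in this project.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "rank_5": [1, 2, 3, 2],
    "rank_3": [6, 5, 4, 6],
    "seen_5": [True, True, False, True],
})

# Mean rank per segment (lower is better) and number of viewers per segment.
rank_by_gender = toy.groupby("Gender")[["rank_5", "rank_3"]].mean()
seen_by_gender = toy.groupby("Gender")["seen_5"].sum()

print(rank_by_gender)
print(seen_by_gender)
```

Calling rank_by_gender.T.plot.bar() would then draw the segments side by side on one chart instead of on separate plots.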
8. Next Steps
Now let's segment the data based on Education
star_wars['Education'].value_counts()
graduate = star_wars[star_wars['Education'] == 'Graduate degree']
graduate.iloc[:,9:15].mean().plot.bar()
graduate.iloc[:,3:9].sum().plot.bar()
Let's look at Location (Census Region)
star_wars['Location (Census Region)'].value_counts()
Let us figure out which region sees the most Star Wars movies
star_wars['Total_View'] = star_wars['seen_1']*1+star_wars['seen_2']*1+star_wars['seen_3']*1+star_wars['seen_4']*1+star_wars['seen_5']*1+star_wars['seen_6']*1
# select Total_View before summing so text columns are not aggregated
star_wars.groupby('Location (Census Region)')['Total_View'].sum().sort_values(ascending = False)
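So respondents in the Pacific region reported the most Star Wars viewings in total, and respondents in East South Central reported the fewest. Because this is a sum, it also reflects how many respondents each region has.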
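A regional total grows with the number of respondents, so it can help to look at the mean per respondent as well. The sketch below uses made-up numbers to show how the two measures can disagree. It is an illustration, not a result from the survey.

```python
import pandas as pd

# Made-up view counts; the region names are real census regions, the numbers are not.
toy = pd.DataFrame({
    "Location (Census Region)": ["Pacific", "Pacific", "Pacific", "Mountain"],
    "Total_View": [2, 1, 3, 5],
})

# Total, per-respondent average, and respondent count for each region.
by_region = toy.groupby("Location (Census Region)")["Total_View"].agg(["sum", "mean", "count"])
print(by_region)
```

Here the region with the largest total is not the one with the largest average, which is why the count column is worth keeping next to both.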
Now let's look at columns 15 through 28
star_wars.iloc[:,15:29].head()
Let's look at the column names
star_wars.columns[15:29]
From the dataset we have
- Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her. - Han Solo
- Unnamed: 16 - Luke Skywalker
- Unnamed: 17 - Princess Leia Organa
- Unnamed: 18 - Anakin Skywalker
- Unnamed: 19 - Obi Wan Kenobi
- Unnamed: 20 - Emperor Palpatine
- Unnamed: 21 - Darth Vader
- Unnamed: 22 - Lando Calrissian
- Unnamed: 23 - Boba Fett
- Unnamed: 24 - C-3P0
- Unnamed: 25 - R2D2
- Unnamed: 26 - Jar Jar Binks
- Unnamed: 27 - Padme Amidala
- Unnamed: 28 - Yoda
Let's rename the columns to character names
star_wars = star_wars.rename(columns = {'Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.':'Han_Solo',
'Unnamed: 16':'Luke_Skywalker',
'Unnamed: 17': 'Princess_Leia',
'Unnamed: 18':'Anakin_Skywalker',
'Unnamed: 19':'Obi_Wan_Kenobi',
'Unnamed: 20':'Emperor_Palpatine',
'Unnamed: 21':'Darth_Vader',
'Unnamed: 22':'Lando',
'Unnamed: 23':'Boba',
'Unnamed: 24':'C3P0',
'Unnamed: 25':'R2D2',
'Unnamed: 26':'Jar_Jar',
'Unnamed: 27':'Padme',
'Unnamed: 28':'Yoda'})
star_wars.iloc[:,15:29].head()
star_wars['Han_Solo'].value_counts()
Let's convert these ratings to numbers, with 1 being the most favorable and 6 the least favorable. Note that this mapping places Unfamiliar (N/A) at 4, between neutral and unfavorable.
favor_map = {'Very favorably':1,'Somewhat favorably':2,
'Neither favorably nor unfavorably (neutral)': 3,
'Unfamiliar (N/A)':4,
'Somewhat unfavorably': 5,
'Very unfavorably': 6}
# map each of the fourteen character columns (positions 15 through 28) to a number
for col in star_wars.columns[15:29]:
    star_wars[col] = star_wars[col].map(favor_map)
star_wars.iloc[:,15:29].mean().plot.bar()
star_wars.iloc[:,15:29].mean().sort_values()
We see that the most favored character is Han Solo and the least favored character is Jar Jar Binks.
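Scoring Unfamiliar (N/A) as 4 pulls the mean of lesser-known characters toward the unfavorable end. A hedged alternative, sketched below on made-up answers, leaves that label out of the dictionary so it becomes NaN and is skipped by mean(). The 1 to 5 scale here is an assumption for illustration, not the scale used above.

```python
import numpy as np
import pandas as pd

# Hypothetical 1-5 scale that omits "Unfamiliar (N/A)" on purpose.
favor_scale = {
    "Very favorably": 1,
    "Somewhat favorably": 2,
    "Neither favorably nor unfavorably (neutral)": 3,
    "Somewhat unfavorably": 4,
    "Very unfavorably": 5,
}

# Made-up answers for one character.
toy = pd.Series(["Very favorably", "Unfamiliar (N/A)", "Very unfavorably", np.nan])

scores = toy.map(favor_scale)   # labels missing from the dict become NaN
mean_score = scores.mean()      # NaN values are skipped by default
print(scores.tolist(), mean_score)
```

With this choice, the mean reflects only respondents who expressed an opinion, so the character ordering may differ from the one shown above.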