import pandas as pd
import numpy as np
import json
= 'assests/datasets/example.txt'
path = pd.DataFrame([json.loads(line) for line in open(path)]) df
6 Projects with Python
6.1 Example 1: USA.gov
Data From Bitly
In 2011, URL shortening service Bitly
partnered with the US government website USA.gov
to provide a feed of anonymous data gathered from users who shorten links ending with .gov
or .mil
. The data is gotten from [1].
The data file can be downloaded from here. The file is mostly in JSON. It can be converted into a DataFrame by the following code.
We mainly use tz
and a
columns. So let us clean it.
'tz'] = df['tz'].fillna('Missing')
df['tz'][df['tz'] == ''] = 'Unknown'
df['a'] = df['a'].fillna('Missing')
df['a'][df['a'] == ''] = 'Unknown' df[
Example 6.1 We first want to extract the timezone infomation from it. The timezone info is in the column tz
. Please count different values in the columns tz
.
= df['tz']
tzone = tzone.value_counts()
tvc tvc
tz
America/New_York 1251
Unknown 521
America/Chicago 400
America/Los_Angeles 382
America/Denver 191
...
Europe/Uzhgorod 1
Australia/Queensland 1
Europe/Sofia 1
America/Costa_Rica 1
America/Tegucigalpa 1
Name: count, Length: 98, dtype: int64
Example 6.2 We would like to visulize the value counts. You may just show the top ten results.
import seaborn as sns
=tvc[:10].values, y=tvc[:10].index) sns.barplot(x
<Axes: ylabel='tz'>
Example 6.3 We then would like to extract information from the column a
. This column is about the agent of the connection. The important info is the part before the space ' '
. Please get that part out and count values.
= df['a']
agent = agent.str.split(' ').str[0]
agent = agent.value_counts()
avc avc.head()
a
Mozilla/5.0 2594
Mozilla/4.0 601
GoogleMaps/RochesterNY 121
Missing 120
Opera/9.80 34
Name: count, dtype: int64
Example 6.4 Now let us assume that, if Windows
appears in column a
the user is using Windows
os, if not then not. Please detect os, and classify it as Windows
and Not Windows
.
'os'] = np.where(df['a'].str.contains('Windows'), 'Windows', 'Not Windows') df[
Now make a bar plot about the counts based on os
and timezone
.
Example 6.5 We first group the data by os
and tz
.
= df.groupby(['tz', 'os']).size().unstack().fillna(0)
tz_os_counts tz_os_counts.head()
os | Not Windows | Windows |
---|---|---|
tz | ||
Africa/Cairo | 0.0 | 3.0 |
Africa/Casablanca | 0.0 | 1.0 |
Africa/Ceuta | 0.0 | 2.0 |
Africa/Johannesburg | 0.0 | 1.0 |
Africa/Lusaka | 0.0 | 1.0 |
Example 6.6 We then turn it into a DataFrame. You may use any methods.
We use the .stack()
, .unstack()
tricks here.
= tz_os_counts.stack()[tz_os_counts.sum(axis=1).nlargest(10).index]
tovc = 'count'
tovc.name = pd.DataFrame(tovc).reset_index()
dftovc dftovc.head()
tz | os | count | |
---|---|---|---|
0 | America/New_York | Not Windows | 339.0 |
1 | America/New_York | Windows | 912.0 |
2 | Unknown | Not Windows | 245.0 |
3 | Unknown | Windows | 276.0 |
4 | America/Chicago | Not Windows | 115.0 |
Example 6.7 We may now draw the bar plot showing tz
and os
.
='count', y='tz', hue='os', data=dftovc) sns.barplot(x
<Axes: xlabel='count', ylabel='tz'>
6.2 Example 2: US Baby Names 1880–2010
The United States Social Security Administration (SSA) has made available data on the frequency of baby names from 1880 through the present. Hadley Wickham, an author of several popular R packages, has often made use of this dataset in illustrating data manipulation in R
. The dataset can be downloaded from here as a zip file. Please unzip it and put it in your working folder.
Example 6.8 In the folder there are 131 .txt
files. The naming scheme is yob
+ the year. Each file contains 3 columns: name
, gender
, and counts
. We would like to add a column year
, and combine all files into a single DataFrame. In our example, the year is from 1880 to 2010.
import pandas as pd
= 'assests/datasets/babynames/'
path = list()
dflist for year in range(1880, 2011):
= path + 'yob' + str(year) + '.txt'
filename = pd.read_csv(filename, names=['name', 'gender', 'counts'])
df 'year'] = year
df[
dflist.append(df)= pd.concat(dflist, ignore_index=True) df
Example 6.9 Plot the total births by sex and year.
import seaborn as sns
=df.groupby(['gender', 'year']).sum().reset_index(),
sns.relplot(data='year', y='counts', hue='gender', kind='line') x
Example 6.10 For further analysis, we would like to compute the proportions of each name relative to the total number of births per year per gender.
def add_prop(group):
'prop'] = group.counts / group.counts.sum()
group[return group
= df.groupby(['gender', 'year']).apply(add_prop)
df df.head()
name | gender | counts | year | prop | |
---|---|---|---|---|---|
0 | Mary | F | 7065 | 1880 | 0.077643 |
1 | Anna | F | 2604 | 1880 | 0.028618 |
2 | Emma | F | 2003 | 1880 | 0.022013 |
3 | Elizabeth | F | 1939 | 1880 | 0.021309 |
4 | Minnie | F | 1746 | 1880 | 0.019188 |
Example 6.11 Now we would like to keep the first 100 names in each year, and save it as a new DataFrame top100
.
= (
top100 'year', 'gender'])
df.groupby([apply(lambda x: df.loc[x['counts'].nlargest(100).index])
.=['year', 'gender'])
.drop(columns
.reset_index()='level_2')
.drop(columns
) top100.head()
year | gender | name | counts | prop | |
---|---|---|---|---|---|
0 | 1880 | F | Mary | 7065 | 0.077643 |
1 | 1880 | F | Anna | 2604 | 0.028618 |
2 | 1880 | F | Emma | 2003 | 0.022013 |
3 | 1880 | F | Elizabeth | 1939 | 0.021309 |
4 | 1880 | F | Minnie | 1746 | 0.019188 |
Note that level_2
is related to the original index after reset_index()
. That’s why we don’t need it here.
Example 6.12 Now we would like to draw the trend of some names: John
, Harry
and Mary
.
= ['John', 'Harry', 'Mary']
namelist =top100[top100['name'].isin(namelist)],
sns.relplot(data='year', y='counts', hue='name', kind='line') x
Example 6.13 Please analyze the ending of names.
'ending'] = df['name'].str[-1]
df[= df.groupby(['gender', 'year', 'ending']).sum().reset_index() endingcount
Example 6.14 We would like to draw barplots to show the distributions in year 1910, 1960 and 2010.
= endingcount[endingcount['year'].isin([1910, 1960, 2010])]
certainyear import matplotlib.pyplot as plt
= plt.subplots(2, 1, figsize=(10,7))
fig, axs =certainyear[endingcount['gender']=='M'],
sns.barplot(data='ending', y='prop', hue='year', ax=axs[0])
x=certainyear[endingcount['gender']=='F'],
sns.barplot(data='ending', y='prop', hue='year', ax=axs[1]).legend_.remove() x
Example 6.15 We would also like to draw the line plot to show the trending of certain letters through years.
=endingcount[endingcount.ending.isin(['d', 'n', 'y'])],
sns.relplot(data='year', y='prop', hue='ending', kind='line') x
6.3 Exercises
Exercise 6.1 Please use the baby name dataset. We would like to consider the diversity of the names. Please compute the number of popular names in top 50% for each year each gender. Draw a line plot to show the trend and discuss the result.
Exercise 6.2 Please consider the baby name dataset. Please draw the trends of counts of names ending in a
, e
, n
across years for each gender.