4  Package: pandas

The basic data structure for pandas is pandas.DataFrame. You may treat it as a generalized version of tables.

To use pandas, we just import it. In most cases you would like to use the alias pd.

import pandas as pd

Since DataFrame is more like a table, the biggest questions here is not to do computations (which is still very important), but to retrieve, search, sort, merge, etc.. those data.

4.1 Basic pandas

4.1.1 Series and DataFrame

A Series is a 1-d array-like object which has index. The default index is starting from 0. You may change the index to be something assigned by you. Thus it can be treated as a generalization of a dict.

obj = pd.Series([3, 1, 2, 4])
obj
0    3
1    1
2    2
3    4
dtype: int64
obj2 = pd.Series([3, 1, 2, 4], index=['a', 'b', 'c', 'd'])
obj2
a    3
b    1
c    2
d    4
dtype: int64
data3 = {'a': 3, 'b': 1, 'c': 2, 'd': 4}
obj3 = pd.Series(data3)
obj3
a    3
b    1
c    2
d    4
dtype: int64

A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type. The DataFrame has both a row and column index; it can be thought of as a dict of Series all sharing the same index. When displaying a DataFrame, we may use .head() to just display the first few rows for efficicy.

import pandas as pd

data = {'a': [1, 2, 3, 4, 5, 6, 7],
        'b': [1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.1],
        'c': ['a', 'b', 'c', 'd', 'e', 'f', 'g']}
df = pd.DataFrame(data)
df.head()
a b c
0 1 1.1 a
1 2 2.1 b
2 3 3.1 c
3 4 4.1 d
4 5 5.1 e

4.1.2 Decorations

A Series or a DataFrame might have named row indexes and column names. I collect some tools for you to play with them and list them below.

We may use the setting columns= or index= to change the column names and the index names. See the following example.

import numpy as np
import pandas as pd
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15

We may use the .rename() method. Note that by default the return value of this method is a copy and it won’t affect the original DataFrame. The arguments can be in many different formats. Please see the official document for more details.

If you want to directly make the change, please use the argument inplace=True.

The following example shows the standard way to rename.

df = pd.DataFrame(np.arange(16).reshape((4, 4)))
df.rename(columns={0: 'zero'}, index={2: 'two'})
zero 1 2 3
0 0 1 2 3
1 4 5 6 7
two 8 9 10 11
3 12 13 14 15

However the orginal df is not affected.

df
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15

If you would like to change the original df, you may either set df = df.rename(columns={0: 'zero'}, index={2: 'two'}), or

df.rename(columns={0: 'zero'}, index={2: 'two'}, inplace=True)
df
zero 1 2 3
0 0 1 2 3
1 4 5 6 7
two 8 9 10 11
3 12 13 14 15

The title is all. A few remarks:

  1. You may set multiple columns to be the index. In this case, what you get is a multi-index system (which is also called Hierarchical indexing). We will talk about this later in Section 4.4.2.
  2. The argument drop is used to control whether the column is deleted after you set it to be the index. The default setting is True.
  3. The argument append is used to control whether the column you choose is appended to the exsiting index to form a multi-index system. The default is False.
  4. The argument inplace is used to control whether you want to make the change inplace. The default is False.

The title is all. A few remarks:

  1. The new index is integers starting from 0.
  2. drop is an argument to control whether the original index is dropped or added back to the DataFrame as a column. The default is False, which means that by default the original index will be added back to the DataFrame.

4.1.3 Look at the DataFrame

The following methods can be used to look at the DataFrame. Their syntax is very simple. Please try them by yourselves.

  • .head(): show the first few rows.
  • .tail(): show the last few rows.
  • .describe(): show the basic statistics of each columns.

These are methods for Series which might be helpful to understand the data.

  • .unique()
  • .value_counts()
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 1, 2, 2, 1, 1, 1],
                   'b': [3, 1, 1, 2, 4, 5, 2, 1, 3]})
df.head(3)
a b
0 1 3
1 2 1
2 3 1
df.tail()
a b
4 2 4
5 2 5
6 1 2
7 1 1
8 1 3
df.describe()
a b
count 9.000000 9.000000
mean 1.555556 2.444444
std 0.726483 1.424001
min 1.000000 1.000000
25% 1.000000 1.000000
50% 1.000000 2.000000
75% 2.000000 3.000000
max 3.000000 5.000000
df['a'].unique()
array([1, 2, 3], dtype=int64)
df['b'].value_counts()
b
1    3
3    2
2    2
4    1
5    1
Name: count, dtype: int64

4.2 Indexing

The act of selecting rows or columns to access from a dataframe or series is called indexing. There are many different ways to index in pandas. We will only cover the most popular ones.

There is the same copy and view issue with pandas as in numpy. However it is more complicated and more inconsistent. Please check the official documents for more details and do more experiments before implementing the codes. Usually if your code is ambiguous, you might see the infamous SettingWithCopyWarning warning.

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df[df['a']==3]['b'] = 3
C:\Users\Xinli\AppData\Local\Temp\ipykernel_9520\4018622545.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[df['a']==3]['b'] = 3

Note that pandas is testing a copy-on-write feature to fix the issue. The feature can be simlified as “any DataFrame or Series derived from another in any way always behaves as a copy”. Please keep an eye on the updates about when the feature will be fully implemented in pandas.

4.2.1 []

Input value type Return Value Type
Series[label] scalar value
Series[list of labels] Series corresponding to labels
Series[slice] Series corresponding to the slice
Series[boolean vector] Series corresponding to the boolean vector
  1. For Series, values are accessed by labels, not positions. Since Series are usually considered as a column, you may think these labels as row indexes.

  2. When using slice, things becomes more complicated. There are two ways of using slice. You may either slice by positions, or slice by labels. The main differences between them is that:

  • slice by positions Series[i:j] doesn’t contain the last index Series[j];
  • slice by labels Sereies[I:J] contains the last label Series[J].
  1. Sometimes the labels of a series are integers, but different than the position indexes. In pandas 1.5.1, slice by positions takes priority. However the whole scenario is very confusing, and this will be changed in future versions. In this cases it is recommanded to use .loc and .iloc.

  2. When indexing using boolean vector, the vector should be of the same length as the Series. In other words, it works as the boolean bector shows which row is selected.

See some examples below.

Example 4.1  

import pandas as pd

example = pd.Series({'a': 1.1, 'b': 2.2, 'c': 3.3, 'd': 4.4})
example
a    1.1
b    2.2
c    3.3
d    4.4
dtype: float64
example['b']
2.2
example[['b', 'a']]
b    2.2
a    1.1
dtype: float64
example[0:2]
a    1.1
b    2.2
dtype: float64
example['a':'c']
a    1.1
b    2.2
c    3.3
dtype: float64
example[[True, False, True, False]]
a    1.1
c    3.3
dtype: float64
Input value type Return Value Type
DataFrame[colname] The column corresponding to colname as a Series
DataFrame[list-of-colnames] The columns of DataFrame corresponding to colnames
DataFrame[slice] The rows of DataFrame corresponding to the slice
DataFrame[boolean list] DataFrame corresponding to the boolean list
  1. Slice for DataFrame behaves exactly like slice for Series, that it is selecting rows, and it works for both labels and positions. Similarly, slicing by positions are not recommended and might be deprecated in the future.
  2. On the other hand side, selecting rows are usually related to querying. Therefore it is better not to focus on slicing.
  3. Inside [], one column name and a list of columna names will result totally different objects: one is a Series and the other is a DataFrame.
  4. In prior versions, using [list-of-colnames] would work as long as at least 1 of the keys was found (otherwise it would raise a KeyError). This behavior was changed and will now raise a KeyError if at least one label is missing. The recommended alternative is to use .reindex().
  5. When indexing using boolean vector, the vector should be of the same length as the number of rows of the DataFrame. In other words, it works as the boolean bector shows which row is selected.
  6. Using [] for DataFrame cannot give you a single value, since what are inside [] is always treated as a row index or a column index. If you want to get access to the value of a single cell by both row index and column index, use other method like .loc[].
  7. If the column name is eligible for attributes, you may also use df.a to represent df['a'] for simplicity.

Example 4.2  

import pandas as pd

example = pd.DataFrame({'a': [1.1, 2.2], 'b': [2.2, 3.3], 'c': [3.3, 4.4]})
example
a b c
0 1.1 2.2 3.3
1 2.2 3.3 4.4
example['a']
0    1.1
1    2.2
Name: a, dtype: float64
example.a
0    1.1
1    2.2
Name: a, dtype: float64
example[['a']]
a
0 1.1
1 2.2
example[0:1]
a b c
0 1.1 2.2 3.3
example[[False, True]]
a b c
1 2.2 3.3 4.4

4.2.2 .loc[] and .iloc[]

[] contains several different types of ways to access data. .loc[] and .iloc[] are more specific.

  • .loc[] is to use labels to access data.
  • .iloc[] is to use positions to access data.
  1. When there is only one index is specified, it is refered to rows.
  2. When using both indexes, the first is row index and the second is column index.
  3. When selecting all rows/columns, you may put : in the corresponding place.
  4. df.loc[1, 'a'] refers to the cell in the DataFrame df whose row index is 1 and column index is a. df[1, 'a'] refers to the column in the DataFrame df whose column name is (1, 'a').
  5. Many other small details are very similar to []. For example, pay attention to the differences between df.loc[:, 'a'] and df.loc[:, ['a']].

Example 4.3  

import pandas as pd
example = pd.DataFrame({'a': [1.1, 2.2], 'b': [2.2, 3.3], 'c': [3.3, 4.4]})
example
a b c
0 1.1 2.2 3.3
1 2.2 3.3 4.4
example.loc[1]
a    2.2
b    3.3
c    4.4
Name: 1, dtype: float64
example.loc[:, 'a']
0    1.1
1    2.2
Name: a, dtype: float64
example.loc[1, 'a']
2.2
example.iloc[0:1, 0:2]
a b
0 1.1 2.2
example.iloc[1, 0:2]
a    2.2
b    3.3
Name: 1, dtype: float64
example.iloc[[1], 0:2]
a b
1 2.2 3.3

4.2.3 Boolean indexing

Let df be a DataFrame. Assume that boo is boolean vector of the dimension same to the number of rows of df, then we can use df[boo] to filter data: all rows with True will be selected. The syntax is similar to the boolean indexing in numpy.

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4),
                  index=pd.date_range('1/1/2023', periods=8),
                  columns=['A', 'B', 'C', 'D'])
df
A B C D
2023-01-01 -1.114627 -2.230739 0.717786 -1.512329
2023-01-02 1.439817 -0.892224 -0.855341 0.130689
2023-01-03 0.191556 1.417130 -0.189149 1.106091
2023-01-04 -0.781606 0.921218 0.500114 1.051130
2023-01-05 -2.429426 -1.556279 0.635073 0.562324
2023-01-06 2.016767 0.304860 -1.181919 -0.979527
2023-01-07 0.530413 -0.353301 0.751793 -0.642841
2023-01-08 0.166893 0.000033 0.108881 0.988175
df[df['A']>0]
A B C D
2023-01-02 1.439817 -0.892224 -0.855341 0.130689
2023-01-03 0.191556 1.417130 -0.189149 1.106091
2023-01-06 2.016767 0.304860 -1.181919 -0.979527
2023-01-07 0.530413 -0.353301 0.751793 -0.642841
2023-01-08 0.166893 0.000033 0.108881 0.988175

To get the boolean vector, we may directly compute logic expression using columns of df. The previous example is of this kind.

You may write complicated expressions. The operators are:

  • | for or
  • & for and
  • ~ for not

Note that parentheses must be used to ensure a correct result. Please see the following example.

df[(df['A'] > 1) & (df['B'] < 3)]
A B C D
2023-01-02 1.439817 -0.892224 -0.855341 0.130689
2023-01-06 2.016767 0.304860 -1.181919 -0.979527

There are many methods and functions that can create boolean vectors. We will introduce them when we need them.

4.2.4 .query()

DataFrame has a .query() method that allows filtering using an expression instead of a boolean vector. This method uses a different approach from the point of programming language. From the point of users, you are free to choose between .query() and boolean indexing to filter data.

df = pd.DataFrame({'A': [1,2,3], 'B': [3,2,1], 'C': [5,4,3]})
df
A B C
0 1 3 5
1 2 2 4
2 3 1 3
df.query('A<B and B<C')
A B C
0 1 3 5

It can be simplified as follows:

df.query('A<B<C')
A B C
0 1 3 5

This is the same as the following code.

df[(df['A']<df['B']) & (df['B']<df['C'])]
A B C
0 1 3 5

Note that .query() does not require the usage of parentheses. It also use English like or/and/not for |/&/~.

df.query('A in C')
A B C
2 3 1 3
df.query('A not in C')
A B C
0 1 3 5
1 2 2 4
df.query('A not in C and A<B')
A B C
0 1 3 5
df.query('[1,2] in B')
A B C
1 2 2 4
2 3 1 3

4.2.5 Reindex

.reindex() is a data alignment method in pandas. To reindex means to conform the data to match a given set of labels along a particular axis. This accomplishes several things:

  • Reordering the existing data to match a new set of labels
  • Inserting missing value (NaN) markers in label locations where no data for that label existed

Here is a simple example:

import pandas as pd
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
data.reindex(index = ['Colorado', 'Arkansas', 'New York'],
             columns = ['three', 'five', 'one'])
three five one
Colorado 6.0 NaN 4.0
Arkansas NaN NaN NaN
New York 14.0 NaN 12.0

From the first glance, .reindex() behave the same as other indexing methods. Here are a few differences:

  • The purpose of indexing methods is to select/filter data, while the purpose of reindex is to make the data in a very specific form.
  • When dealing with non-existent indexes/columns, most other indexing methods will return error or warning, while .reindex() can handle it automatically.
  • The default setting of .reindex() is to return a copy. This setting can be changed by the argument copy=False.

For more details please see the official guide.

4.2.6 Updating data

  • Assign values to a column of a DataFrame will update that column. If the column doesn’t exist, new column will be created. This is called enlargement.
  • When assign values with non-existent row index, that part of the data will be discarded.
  • When using .loc, a DataFrame can be enlarged on either axis.
  • Any time if there are no values with a specific column and row, it will show as NaN.

Example 4.4  

import pandas as pd

data = {'a': [1, 2, 3, 4],
        'b': [1.1, 2.1, 3.1, 4.1],
        'c': ['a', 'b', 'c', 'd']}
df = pd.DataFrame(data)

newcol = {1: 'good', 3: 'better', 5: 'best'}
df['d'] = pd.Series(newcol)
df
a b c d
0 1 1.1 a NaN
1 2 2.1 b good
2 3 3.1 c NaN
3 4 4.1 d better

4.3 Data cleaning

4.3.1 Handling Missing Data

  • np.nan, pd.NA
  • pd.isnull(), np.isnan()
  • .dropna(), .fillna()
import pandas as pd
import numpy as np

data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan], 
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])
data
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
data.dropna()
0 1 2
0 1.0 6.5 3.0
data.dropna(how='all')
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
3 NaN 6.5 3.0
data[4] = np.nan
data
0 1 2 4
0 1.0 6.5 3.0 NaN
1 1.0 NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN 6.5 3.0 NaN
data.dropna(axis=1, how='all')
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
data.dropna(thresh=2)
0 1 2 4
0 1.0 6.5 3.0 NaN
3 NaN 6.5 3.0 NaN
data.fillna(0)
0 1 2 4
0 1.0 6.5 3.0 0.0
1 1.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 0.0 6.5 3.0 0.0
data.fillna({1: 0.5, 2: -0.1})
0 1 2 4
0 1.0 6.5 3.0 NaN
1 1.0 0.5 -0.1 NaN
2 NaN 0.5 -0.1 NaN
3 NaN 6.5 3.0 NaN

Elements of the same index and columns will be computed. By default, if any entry is nan, the answer will be nan. You may use fill_value argument to fill the empty slots. Please see the following example.

import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))
df2.loc[1, 'b'] = np.nan

df1.add(df2, fill_value=0)
a b c d e
0 0.0 2.0 4.0 6.0 4.0
1 9.0 5.0 13.0 15.0 9.0
2 18.0 20.0 22.0 24.0 14.0
3 15.0 16.0 17.0 18.0 19.0

Relatedly, when reindexing a Series or DataFrame, you can also specify a fill_value.

4.3.2 Handling duplicates

import numpy as np
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'], 
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data.drop_duplicates(['k1'], keep='last')
k1 k2
4 one 3
6 two 4

4.3.3 Drop data

You may use .drop() to drop columns or rows.

  1. If you directly apply .drop() to an index, that index is considered as a row index.
  2. To drop a column, you need to specify the argument columns=.
  3. There is still the inplace= issue.

4.3.4 String Manipulation

When the column Series is of type str, all methods in pd.Series.str will be applied to each entry of the Series.

import pandas as pd
import numpy as np
s = pd.Series(["A ", " B ", "C", "Aaba", " Baca ", np.nan, "CABA", "dog", "cat"])
s
0        A 
1        B 
2         C
3      Aaba
4     Baca 
5       NaN
6      CABA
7       dog
8       cat
dtype: object
s.str.lower()
0        a 
1        b 
2         c
3      aaba
4     baca 
5       NaN
6      caba
7       dog
8       cat
dtype: object
s.str.split('a')
0          [A ]
1         [ B ]
2           [C]
3      [A, b, ]
4    [ B, c,  ]
5           NaN
6        [CABA]
7         [dog]
8        [c, t]
dtype: object
s.str.len()
0    2.0
1    3.0
2    1.0
3    4.0
4    6.0
5    NaN
6    4.0
7    3.0
8    3.0
dtype: float64
s.str.strip()
0       A
1       B
2       C
3    Aaba
4    Baca
5     NaN
6    CABA
7     dog
8     cat
dtype: object
s.str.replace("A", '1')
0        1 
1        B 
2         C
3      1aba
4     Baca 
5       NaN
6      C1B1
7       dog
8       cat
dtype: object

Example 4.5 We could also use .str to play with column names and row indexes.

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3, 2),
                  columns=[" Column A ", " Column B "], index=range(3))

df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
df
column_a column_b
0 0.098619 0.841380
1 2.018570 0.122149
2 -0.352726 -0.411704

String methods are usually used with regular expressions. For more details please see Appendix F.

4.4 Data Wrangling

4.4.1 Tidy data

The same underlying data can be represented in multiple ways. To better study the data, it is better to make these data tidy.

Definition 4.1 A dataset is tidy if

  1. Each variable have its own column.
  2. Each observation have its own row.
  3. Each value have its oven cell.

These DataFrame are provided by tidyr. We will talk about them again when we get to R. These tables can be downloaded by clicking the names.

  1. table1
import pandas as pd
table1 = pd.read_csv('assests/datasets/table1.csv', index_col='Unnamed: 0')
table1
country year cases population
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
  1. table2
import pandas as pd
table2 = pd.read_csv('assests/datasets/table2.csv', index_col='Unnamed: 0')
table2
country year type count
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
  1. table3
import pandas as pd
table3 = pd.read_csv('assests/datasets/table3.csv', index_col='Unnamed: 0')
table3
country year rate
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
  1. Spread across two DataFrames: table4a and table4b:
import pandas as pd
table4a = pd.read_csv('assests/datasets/table4a.csv', index_col='Unnamed: 0')
table4b = pd.read_csv('assests/datasets/table4b.csv', index_col='Unnamed: 0')
table4a
country 1999 2000
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
table4b
country 1999 2000
1 Afghanistan 19987071 20595360
2 Brazil 172006362 174504898
3 China 1272915272 1280428583

Among all these DataFrames, only table1 is tidy.

These three conditions are interrelated because it is impossible to only satisfy two of the three. In pratical, we need to follow the instructions:

  1. Put each dataset in a DataFrame.
  2. Put each variable in a column.
  3. Every row is about one obeservation.

Tidy data is a consistent way to organize your data. The main advantages are:

  1. It is one consistent way of storing data. In other words, this is a consistent data structure that can be used in many cases.
  2. To placing variables in columns enables Python to do vectorized operations.

Most datasets are untidy, since tidy data is usually not intuitive for collecting. Therefore raw data which are collected by some naive ideas are usually not tidy.

Untidy data are usually:

  • One variable might be spread across multiple columns.
  • One observation might be scattered across multiple rows.

A common problem is that the column names are not names of variables, but values of a variable. For example, table4a above has columns 1999 and 2000. These two names are actually the values of a variable year. In addition, each row represents two observations, not one.

country 1999 2000
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766

To tidy this type of dataset, we need to gather those columns into a new pair of variables. We need three parameters:

  • The set of columns that represent values. In this case, those are 1999 and 2000.
  • The name of the variable. In this case, it is year. -The name of the variable whose values are spread over the cells. In this case, it is the number of cases.

Then we apply .melt().

table4a.melt(id_vars=['country'],
             value_vars=['1999', '2000'],
             var_name='year',
             value_name='cases')
country year cases
0 Afghanistan 1999 745
1 Brazil 1999 37737
2 China 1999 212258
3 Afghanistan 2000 2666
4 Brazil 2000 80488
5 China 2000 213766

We can do the similar thing to table4b.

table4b.melt(id_vars=['country'],
             value_vars=['1999', '2000'],
             var_name='year',
             value_name='population')
country year population
0 Afghanistan 1999 19987071
1 Brazil 1999 172006362
2 China 1999 1272915272
3 Afghanistan 2000 20595360
4 Brazil 2000 174504898
5 China 2000 1280428583
Tip

In Python there are multiple different ways to change a wide DataFrame to be longer like .melt(). Among all of them, .melt() is the most common one.

Another issuse is that an observation is scattered across multiple rows. Take table2 as an example.

An observation is a country in a year, but each observation is spread across two rows.

table2
country year type count
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583

We could apply .pivot() to make it tidy. Here we need two arguments.

  • The column that contains variable names. Here, it’s type.
  • The column that contains values forms multiple variables. Here, it’s count.
table2.pivot(index=['country', 'year'], columns='type', values='count')
type cases population
country year
Afghanistan 1999 745 19987071
2000 2666 20595360
Brazil 1999 37737 172006362
2000 80488 174504898
China 1999 212258 1272915272
2000 213766 1280428583

If we would like to split one columns into multiple columns since there are more than one values in a cell, we could use Series string method to split it.

table3['newrate'] = table3['rate'].str.split('/')
table3
country year rate newrate
1 Afghanistan 1999 745/19987071 [745, 19987071]
2 Afghanistan 2000 2666/20595360 [2666, 20595360]
3 Brazil 1999 37737/172006362 [37737, 172006362]
4 Brazil 2000 80488/174504898 [80488, 174504898]
5 China 1999 212258/1272915272 [212258, 1272915272]
6 China 2000 213766/1280428583 [213766, 1280428583]

If we prepare two columns from the beginning, we could directly get two columns. Note that the argument expand=True means that we want to get a DataFrame by expanding dimensionality. More details can be found here.

table3[['cases', 'population']] = table3['rate'].str.split('/', expand=True)
table3.drop(columns=['rate', 'newrate'], inplace=True)
table3
country year cases population
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583

Similarly we could also combine columns just as they are strings.

table3['another_rate'] = table3['cases']+'/'+table3['population']
table3
country year cases population another_rate
1 Afghanistan 1999 745 19987071 745/19987071
2 Afghanistan 2000 2666 20595360 2666/20595360
3 Brazil 1999 37737 172006362 37737/172006362
4 Brazil 2000 80488 174504898 80488/174504898
5 China 1999 212258 1272915272 212258/1272915272
6 China 2000 213766 1280428583 213766/1280428583

4.4.2 Hierarchical indexing

Pandas support a more complex indexing system, that the index may have multiple levels. See the following example.

import pandas as pd
import numpy as np

data = pd.Series(np.random.randn(9),
                 index = [['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                          [1, 2, 3, 1, 2, 3, 1, 2, 3]])
data
a  1    0.380306
   2   -1.458122
   3    0.416262
b  1   -1.301703
   2   -0.936877
c  3   -1.262353
   1    1.395704
d  2    0.119812
   3    0.255070
dtype: float64

You may look at the Series using different levels of indexes.

data['a']
1    0.380306
2   -1.458122
3    0.416262
dtype: float64
data.loc[:, 2]
a   -1.458122
b   -0.936877
d    0.119812
dtype: float64

You may use groupby to group by levels and do calculations related to levels. More .groupby() will be discussed in the next section.

data.groupby(level=1).sum()
1    0.474307
2   -2.275187
3   -0.591021
dtype: float64

From the example above, you may notice that the 2-level hierarchical indexing for a Series works very similar to a DataFrame. In fact, you may translate it back and forth between a 2-level indexing Series and a DataFrame.

df = data.unstack()
df
1 2 3
a 0.380306 -1.458122 0.416262
b -1.301703 -0.936877 NaN
c 1.395704 NaN -1.262353
d NaN 0.119812 0.255070
df.stack()
a  1    0.380306
   2   -1.458122
   3    0.416262
b  1   -1.301703
   2   -0.936877
c  1    1.395704
   3   -1.262353
d  2    0.119812
   3    0.255070
dtype: float64

For DataFrame the index for both axes can be multiindex. The usual indexing way can be used if you want to start from the first level of the index. The more specific method to extract data is .xs.

import pandas as pd

df1 = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    },
    index=[0, 1, 2, 3],
)

df2 = pd.DataFrame(
    {
        "A": ["A4", "A5", "A6", "A7"],
        "B": ["B4", "B5", "B6", "B7"],
        "C": ["C4", "C5", "C6", "C7"],
        "D": ["D4", "D5", "D6", "D7"],
    },
    index=[4, 5, 6, 7],
)

df = pd.concat([df1, df2], keys=['x', 'y'])
df
A B C D
x 0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
y 4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
df['A']
x  0    A0
   1    A1
   2    A2
   3    A3
y  4    A4
   5    A5
   6    A6
   7    A7
Name: A, dtype: object
df.loc['x']
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
df.loc['x',3]
A    A3
B    B3
C    C3
D    D3
Name: (x, 3), dtype: object
df.xs(3, level=1, drop_level=False)
A B C D
x 3 A3 B3 C3 D3

4.4.3 Combining and Merging Datasets

merge and concat are the two most common ways to combine datasets.

Merge combines datasets by linking rows using one or more keys. This is from relational databases (e.g., SQL-based).

Here are some examples.

Example 4.6  

import pandas as pd
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})

The two DataFrames are displayed as follows.

df1
key data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 a 5
6 b 6
df2
key data2
0 a 0
1 b 1
2 d 2
pd.merge(df1, df2, on='key')
key data1 data2
0 b 0 1
1 b 1 1
2 b 6 1
3 a 2 0
4 a 4 0
5 a 5 0

If the column names are different in each object, you can specify them separately.

df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'],
                    'data2': range(3)})
pd.merge(df3, df4, left_on='lkey', right_on='rkey')
lkey data1 rkey data2
0 b 0 b 1
1 b 1 b 1
2 b 6 b 1
3 a 2 a 0
4 a 4 a 0
5 a 5 a 0

By default merge does an inner join, that the keys in the result are the interesection found in both tables. Below are different types of merge. To specify the method for merge, the option is how.

  • inner
  • left
  • right
  • outer

Let’s see the following examples.

df1 = pd.DataFrame({'Key': [1, 2], 'A': [0, 2], 'B': [1, 3]})
df1
Key A B
0 1 0 1
1 2 2 3
df2 = pd.DataFrame({'Key': [1, 3], 'C': [0, 2], 'D': [1, 3]})
df2
Key C D
0 1 0 1
1 3 2 3
pd.merge(df1, df2, on='Key', how='inner')
Key A B C D
0 1 0 1 0 1
pd.merge(df1, df2, on='Key', how='outer')
Key A B C D
0 1 0.0 1.0 0.0 1.0
1 2 2.0 3.0 NaN NaN
2 3 NaN NaN 2.0 3.0
pd.merge(df1, df2, on='Key', how='left')
Key A B C D
0 1 0 1 0.0 1.0
1 2 2 3 NaN NaN
pd.merge(df1, df2, on='Key', how='right')
Key A B C D
0 1 0.0 1.0 0 1
1 3 NaN NaN 2 3
Note

If a key combination appears more than once in both tables, the resulting table will have the Cartesian product of the associated data. Here is a very basic example with one unique key combination.

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                    'data1': range(6)})
df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'],
                    'data2': range(5)})
pd.merge(df1, df2, on='key', how='left')
key data1 data2
0 b 0 1.0
1 b 0 3.0
2 b 1 1.0
3 b 1 3.0
4 a 2 0.0
5 a 2 2.0
6 c 3 NaN
7 a 4 0.0
8 a 4 2.0
9 b 5 1.0
10 b 5 3.0
Note

If the merge keys in a DataFrame is in its index instead of column(s), we could pass left_index=True or right_index=True or both instead of setting left_on/right_on/on.

Example 4.7 If we want to really create a Cartesian product, we may use the option how='cross'. For example, we would like to generate a deck of cards, we may use the following codes.

suit = pd.DataFrame({'suit': ['spades', 'hearts', 'clubs', 'diamonds']})
face = pd.DataFrame({'face': list(range(1, 14))})
deck = pd.merge(suit, face, how='cross')

The concat() function (in the main pandas namespace) performs concatenation operations along an axis while performing optional set logic (union or intersection) of the indexes (if any) on the other axes.

import pandas as pd

df1 = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    },
    index=[0, 1, 2, 3],
)

df2 = pd.DataFrame(
    {
        "A": ["A4", "A5", "A6", "A7"],
        "B": ["B4", "B5", "B6", "B7"],
        "C": ["C4", "C5", "C6", "C7"],
        "D": ["D4", "D5", "D6", "D7"],
    },
    index=[4, 5, 6, 7],
)

df3 = pd.DataFrame(
    {
        "A": ["A8", "A9", "A10", "A11"],
        "B": ["B8", "B9", "B10", "B11"],
        "C": ["C8", "C9", "C10", "C11"],
        "D": ["D8", "D9", "D10", "D11"],
    },
    index=[8, 9, 10, 11],
)

pd.concat([df1, df2, df3], keys=['x', 'y', 'z'])
A B C D
x 0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
y 4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
z 8 A8 B8 C8 D8
9 A9 B9 C9 D9
10 A10 B10 C10 D10
11 A11 B11 C11 D11

The default way of pd.concat() is vertically. Note that it will check the column names. If the column names don’t match, new columns will be created and nan values will be assigned.

If you want to concatenate the DataFrame horizontally you need to add axis=1 option. Similarly, row index will be checked before concatenating. See the following example.

Example 4.8  

pd.concat([df1, df2, df3], axis=1)
A B C D A B C D A B C D
0 A0 B0 C0 D0 NaN NaN NaN NaN NaN NaN NaN NaN
1 A1 B1 C1 D1 NaN NaN NaN NaN NaN NaN NaN NaN
2 A2 B2 C2 D2 NaN NaN NaN NaN NaN NaN NaN NaN
3 A3 B3 C3 D3 NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN A4 B4 C4 D4 NaN NaN NaN NaN
5 NaN NaN NaN NaN A5 B5 C5 D5 NaN NaN NaN NaN
6 NaN NaN NaN NaN A6 B6 C6 D6 NaN NaN NaN NaN
7 NaN NaN NaN NaN A7 B7 C7 D7 NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN NaN NaN A8 B8 C8 D8
9 NaN NaN NaN NaN NaN NaN NaN NaN A9 B9 C9 D9
10 NaN NaN NaN NaN NaN NaN NaN NaN A10 B10 C10 D10
11 NaN NaN NaN NaN NaN NaN NaN NaN A11 B11 C11 D11

Example 4.9 Consider the deck example from Example 4.7. This time we would like to use pd.concat() to get the result.

suitlist = ['spades', 'hearts', 'clubs', 'diamonds']
facelist = list(range(1, 14))
decklist = [pd.DataFrame({'suit': suit, 'face': facelist}) for suit in suitlist]
deck = pd.concat(decklist, ignore_index=True)

4.5 Data Aggregation and Group Operations

4.5.1 split-apply-combine model

We would like to apply group operations based on the split-apply-combine model.

  • In the first stage of the process, data contained in a pandas object is split into groups based on one or more keys that you provide. We then use .groupby(keys) to perform the split step. The result is a grouped groupby object.
  • Once this is done, a function is applied to each group, producing a new value.
  • Finally the results of all those function applications are combined into a result object. We may apply groupby functions directly as methods to groupby objects.The result is the combined result object.
import pandas as pd
import numpy as np

df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})
df
key1 key2 data1 data2
0 a one -1.060120 0.052755
1 a two 0.373027 -0.507036
2 b one -0.682489 2.054801
3 b two -0.041028 0.029546
4 a one -0.432938 -0.159506

Now we want to group data1 in df by key1.

grouped = df['data1'].groupby(df['key1'])
grouped
<pandas.core.groupby.generic.SeriesGroupBy object at 0x00000213B513FEB0>

What we get is a groupby object and we could apply group functions to it.

The method to look at each group is .get_group().

grouped.get_group('a')
0   -1.060120
1    0.373027
4   -0.432938
Name: data1, dtype: float64

We may directly apply some group functions to the groupby object.

grouped.mean()
key1
a   -0.373344
b   -0.361758
Name: data1, dtype: float64
grouped.size()
key1
a    3
b    2
Name: data1, dtype: int64

We could iterate over groups.

for name, group in grouped:
    print('name', name)
    print('group', group)
name a
group 0   -1.060120
1    0.373027
4   -0.432938
Name: data1, dtype: float64
name b
group 2   -0.682489
3   -0.041028
Name: data1, dtype: float64

We could convert the group object into list and dictionary.

list(grouped)
[('a',
  0   -1.060120
  1    0.373027
  4   -0.432938
  Name: data1, dtype: float64),
 ('b',
  2   -0.682489
  3   -0.041028
  Name: data1, dtype: float64)]
dict(list(grouped))
{'a': 0   -1.060120
 1    0.373027
 4   -0.432938
 Name: data1, dtype: float64,
 'b': 2   -0.682489
 3   -0.041028
 Name: data1, dtype: float64}

4.5.2 Built-in aggregation functions

The following functions directly work with groupby objects. You may try them by yourselves.

  • .describe()
  • .count()
  • .sum()
  • .mean()
  • .median
  • .std(), .var()
  • .min(), .max()
  • .prod()
  • .first(), .last()

4.5.3 Function Application and Mapping

We may apply functions to each row/column of a DataFrame. If the function is a built-in function that is compatible with DataFrame, you can directly call the function that it will be applied automatically to each row/column. If it is not, we can call apply to get the desired result.

To understand the behaviour of map, you may treat it as a loop, through a Series. pandas goes through each item in the Series and perform operations as instructed. If there is a returned value, it will be recorded along the Sereis.

import pandas as pd

ind = pd.Series(['Ohio', 'Colorado', 'New York'])
ind
0        Ohio
1    Colorado
2    New York
dtype: object
ind.map(lambda x: x[:4].upper())
0    OHIO
1    COLO
2    NEW 
dtype: object

In the example we go through each item in ind. Each item is a string. We pick the first 4 characters, and change them to be upper case.

Note that this operation can also be done by string method. These are two different methods but the results are the same.

ind.str[:4].str.upper()
0    OHIO
1    COLO
2    NEW 
dtype: object

apply is very similar to map, but for DataFrame. The default setting is to go through each column of a DataFrame, and the input is the column. You may use the argument axis=1 to change it to go through each row. Please see the following example.

Example 4.10  

import pandas as pd
data = pd.DataFrame(np.random.rand(4, 4),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data
one two three four
Ohio 0.578536 0.380353 0.492142 0.560179
Colorado 0.794798 0.369753 0.084621 0.799780
Utah 0.270936 0.830585 0.943238 0.812854
New York 0.272575 0.658831 0.493834 0.700987
f = lambda x: x.max() - x.min()

data.apply(f)
one      0.523862
two      0.460832
three    0.858617
four     0.252674
dtype: float64

Change axis to find the range for each row.

data.apply(f, axis=1)
Ohio        0.198182
Colorado    0.715159
Utah        0.672303
New York    0.428412
dtype: float64

We can use more complicated function to get more complicated result.

Example 4.11  

data = pd.DataFrame(np.random.rand(4, 4),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

f = lambda x: pd.Series([x.max(), x.min()], index=['max', 'min'])

data.apply(f)
one two three four
max 0.996999 0.529646 0.985185 0.843412
min 0.208313 0.020437 0.215638 0.019380

4.5.4 Some examples

Example 4.12 Consider the following DataFrame.

import pandas as pd
import numpy as np
df = pd.DataFrame({'location': ['East', 'East', 'East', 'East',
                                'West', 'West', 'West', 'West'],
                   'data': np.random.randn(8)},
                   index=['Ohio', 'New York', 'Vermont', 'Florida',
                          'Oregon', 'Nevada', 'California', 'Idaho'])
df.loc[['Vermont', 'Nevada', 'Idaho'], 'data'] = np.nan
df
location data
Ohio East 0.430881
New York East -2.047067
Vermont East NaN
Florida East 1.384468
Oregon West 1.138818
Nevada West NaN
California West 0.969296
Idaho West NaN

We would like to fill in NA values with the mean from each location group.

df.groupby('location', group_keys=False).apply(lambda x: x.fillna(x.mean()))
data
Ohio 0.430881
New York -2.047067
Vermont -0.077239
Florida 1.384468
Oregon 1.138818
Nevada 1.054057
California 0.969296
Idaho 1.054057

The argument group_keys=False refers to the setting whether you want to group_keys to be presented. If it is True, the result looks like this.

df.groupby('location', group_keys=True).apply(lambda x: x.fillna(x.mean()))
data
location
East Ohio 0.430881
New York -2.047067
Vermont -0.077239
Florida 1.384468
West Oregon 1.138818
Nevada 1.054057
California 0.969296
Idaho 1.054057

We could also fill in NA values with predefined values, similar to the non-groupby case.

predefined = {'East': 0.1, 'West': -0.5}
df.groupby('location', group_keys=True).apply(lambda x: x.fillna(predefined[x.name]))
location data
location
East Ohio East 0.430881
New York East -2.047067
Vermont East 0.100000
Florida East 1.384468
West Oregon West 1.138818
Nevada West -0.500000
California West 0.969296
Idaho West -0.500000
Chaining commands

You may chain commands to a DataFrame, just like the examples shown above. If the commands are too long:

  • a () has to be used to indicate that this is a multiline command, and
  • the line is broken before the . sybmol.

Please see the following example.

(df.groupby('location', group_keys=False)
    .apply(lambda x: x.fillna(predefined[x.name]))
    .reset_index()
    .groupby('location')
    .max()
)
index data
location
East Vermont 1.384468
West Oregon 1.138818

4.6 Read and write files

4.6.1 Read files

In most cases we will read data from a csv file or an excel file.

A csv file is a plain txt file, with a fixed format. It consists of rows and columns. Rows are separated by newline symbol, which is usually \n. Columns are separated by a separator. Common separators include empty spaces, comma ,, semi-column ;, tab space \t. There might be other speical separators, depending on the creators of the specific csv files.

In pandas, you may use pd.read_csv() function to read a csv file.

  1. The argument sep is used to set separators. The default is ,.
  2. The argument names is used to set the column names. Otherwise the column names will be generated and is highly unlikely to be directly usable.
  3. The argument header will choose the header row and only parse the lines after it. If there is no header, you may set header=None.
  4. The argument index_col is used to set the index column(s). If it is False, the index will be automatically generated from 0. If it is set to a list of columns, the result will be a multi-index system.

You may read the document for more arguments.

Please see the following example.

Example 4.13 The file is yob1880.txt. This is from the US Baby names dataset. It provides the counts of each US baby names born in 1880. You may use any txt editor to open the file. The first few rows are like the following:

Mary,F,7065
Anna,F,2604
Emma,F,2003
Elizabeth,F,1939
Minnie,F,1746
Margaret,F,1578

It seems that sep is the default ,. So you may directly directly read it into a DataFrame by pd.read_csv().

import pandas as pd

df = pd.read_csv('assests/datasets/yob1880.txt')
df.head()
Mary F 7065
0 Anna F 2604
1 Emma F 2003
2 Elizabeth F 1939
3 Minnie F 1746
4 Margaret F 1578

Please look at the header of the DataFrame. It is supposed to be the first data. Therefore there is no header in the original file. So the correct way to read the file is

import pandas as pd

df = pd.read_csv('assests/datasets/yob1880.txt',
                 header=None,
                 names=['Name', 'Sex', 'Counts'])
df.head()
Name Sex Counts
0 Mary F 7065
1 Anna F 2604
2 Emma F 2003
3 Elizabeth F 1939
4 Minnie F 1746

pandas provides pd.read_excel() function to read Excel files. Since Excel files are much more complicated than csv files, it requires more setting. One of the most important different setting is the engine. pandas needs you to specify a way (an engine) to understand Excel files. For the newer Excel file .xlsx, it is recommended to use the engine openpyxl.

If you don’t have openpyxl installed, you may use the following code to install it.

pip install openpyxl

Many options, like header, names and index_col, are very similar to pd.read_csv(). Some additional remarks:

  1. There is no sep argument since columns are not separated based on separators.
  2. The argument sheet_name is used to choose which sheet(s) you want to read.
  3. The argument nrows is used to set the number of rows to parse.

You may read the document for more arguments.

Example 4.14 The file can be downloaded from here. This is the result of the Pre-Post test of a class for the course COURSE1001. You may first use Microsoft Office or other spreadsheet applications to open the file to have some ideas what it look like.

Here is the screenshot of the first few columns. Last and First refers to the last name and the first name of the student, while Last0X and First0X are students’ fake names.

Note that this files contains two DataFrames.

  • The first is the result of the pretest, which is from row 3 to row 11, with the header row 2.
  • The second is the result of the posttest, which is from row 15 to row 23, with the header row 14. To read the file, the code is as follows:
import pandas as pd

df_pre = pd.read_excel('assests/datasets/prepost.xlsx',
                       engine='openpyxl',
                       header=2, 
                       nrows=10)
df_pre 
Last First T# 1 2 3 4 5 6 7 8 9 10 Unnamed: 13
0 Last01 First01 T01 0 0 0 0 0 1 0 0 0 0 NaN
1 Last02 First02 T02 0 0 1 0 0 0 0 0 1 1 NaN
2 Last03 First03 T03 1 0 1 0 0 0 0 0 0 0
3 Last04 First04 T04 1 0 0 1 0 0 0 0 0 0 NaN
4 Last05 First05 T05 0 1 0 0 0 0 0 0 0 0 NaN
5 Last06 First06 T06 0 0 1 0 0 0 1 0 0 1 NaN
6 Last07 First07 T07 0 0 0 0 0 0 0 0 0 0 NaN
7 Last08 First08 T08 0 0 1 0 0 0 0 0 0 0 NaN
8 Last09 First09 T09 0 0 0 0 0 0 1 0 0 1 NaN
df_post = pd.read_excel('assests/datasets/prepost.xlsx',
                        engine='openpyxl',
                        header=14,
                        nrows=10)                      
df_post
Last First T# 1 2 3 4 5 6 7 8 9 10 Unnamed: 13
0 Last01 First01 T01 1 1 0 1 0 0 0 0 1 1 NaN
1 Last02 First02 T02 0 0 0 1 0 0 1 0 0 0 NaN
2 Last03 First03 T03 1 1 1 1 1 1 0 0 1 0 NaN
3 Last04 First04 T04 1 1 0 1 0 1 0 0 1 0 NaN
4 Last05 First05 T05 1 1 1 1 1 1 0 0 0 0 NaN
5 Last06 First06 T06 1 0 0 0 0 0 0 0 0 1 NaN
6 Last07 First07 T07 1 1 0 1 0 0 0 0 0 0 NaN
7 Last08 First08 T08 0 1 1 0 1 0 0 1 1 0 NaN
8 Last09 First09 T09 1 1 1 1 1 1 0 1 0 0 NaN

It seems that the original files have an additional column Unnamed: 13 containing nan values that should be dropped. Then it is not necessary to read it from the original file. Here we could use the argument usecols to select the first 13 columns. We only show the example of pretest result.

df_pre = pd.read_excel('assests/datasets/prepost.xlsx',
                       engine='openpyxl',
                       header=2, 
                       nrows=10,
                       usecols=list(range(13)))
df_pre 
Last First T# 1 2 3 4 5 6 7 8 9 10
0 Last01 First01 T01 0 0 0 0 0 1 0 0 0 0
1 Last02 First02 T02 0 0 1 0 0 0 0 0 1 1
2 Last03 First03 T03 1 0 1 0 0 0 0 0 0 0
3 Last04 First04 T04 1 0 0 1 0 0 0 0 0 0
4 Last05 First05 T05 0 1 0 0 0 0 0 0 0 0
5 Last06 First06 T06 0 0 1 0 0 0 1 0 0 1
6 Last07 First07 T07 0 0 0 0 0 0 0 0 0 0
7 Last08 First08 T08 0 0 1 0 0 0 0 0 0 0
8 Last09 First09 T09 0 0 0 0 0 0 1 0 0 1

4.6.2 Write files

We will only talk about writing in csv. The function is df.to_csv(). It is straightforward.

  • The argument index is used to control whether you want to write index into the file. The default is True. If the index doesn’t contain any real information, we usually set it to be False.

4.7 Example: Movies

Below we explore the MovieLens 1M datasets. You may download it from this link. This is a .dat file, and you may use the following code to read it into a DataFrame.

import pandas as pd
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('assests/datasets/movies.dat', sep='::',
                       header=None, names=mnames, engine="python",
                       encoding='ISO-8859-1')
movies.head()
movie_id title genres
0 1 Toy Story (1995) Animation|Children's|Comedy
1 2 Jumanji (1995) Adventure|Children's|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama
4 5 Father of the Bride Part II (1995) Comedy

In this example we concentrate on exploring the genres information. We first want to find all genres in this dataset. The idea is:

  • split each item in the genres column by | to get a list.
  • go through each item in the genres column, and union all lists together.

This can be done by the map function.

all_genres = list()
movies['genres'].map(lambda x: all_genres.extend(x.split('|')))
0       None
1       None
2       None
3       None
4       None
        ... 
3878    None
3879    None
3880    None
3881    None
3882    None
Name: genres, Length: 3883, dtype: object

all_genres is the list of all genres (with duplicates).

In the output of the above code you may see many None in each row. This is because the lambda function used in map doesn’t have a return value. However after applying the function to each row, new genres information is added to the list all_genres.

Then we would like to drop all the duplicates to get the list of all unique genres.

genres = pd.unique(all_genres)
genres
array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
       'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
       'Western'], dtype=object)

genres is the list of all unique genres.

4.8 Exercises

Many problems are based on [1].

Exercise 4.1 Let df be a DataFrame. Please answer the following questions in a Markdown cell.

  1. What does df[0] do?
  2. What does df[[0]] do?
  3. What does df[0:1] do?

Exercise 4.2 Please use the following code to generate a series ser, and then finish the following tasks.

import pandas as pd
import numpy as np


mylist = list('abcedfghijklmnopqrstuvwxyz')
myarr = np.arange(26)
mydict = dict(zip(mylist, myarr))
ser = pd.Series(mydict)
  1. Convert the series ser into a dataframe df with its index as another column on the dataframe.
  2. Pick the two columns of df and set them into two serieses ser1 and ser2.
  3. Combine two series ser1 and ser2 to form a new dataframe newdf, and name their columns ser1 and ser2.

Exercise 4.3 Consider two serieses ser1 and ser2. You may use the following ser1 and ser2 as an example. The output of each questions below should be a series. You may want to learn the following commands:

import pandas as pd

ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])
  1. Find all the elements from ser1 that are also in ser2.
  2. Find all the elements from ser2 that are also in ser1.
  3. From ser1 remove items present in ser2.
  4. Find the union of ser1 and ser2.
  5. Find the intersection of ser1 and ser2.
  6. Find all the elemetns that are in either ser1 or ser2, but not both.

Exercise 4.4 Consider the following DataFrame.

import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
  1. Please select the column two.
  2. Please select the second and the third row.
  3. Please find the rows that the column three value is bigger than 5.
  4. Please find the last row that the column three value is bigger than 5.
  5. Please find the rows that the column three value is bigger than 5, and display the resulted DataFrame with only Colorado and Utah row and four and one columns, in the specified order.

Exercise 4.5 Consider the following Series.

import pandas as pd
import numpy as np
ser = pd.Series(np.take(list('abcdefgh'), np.random.randint(8, size=30)))
  1. Please use pd.Series.value_counts() to calculte the frequency counts of each unique value of the following Series.
  2. Please keep the top 2 most frequent items of ser as it is and replace everything else as Other.

Exercise 4.6 Consider the Series ser:

import pandas as pd
import numpy as np
ser = pd.Series(np.random.randint(1, 10, 7))

Find the positions of numbers that are multiples of 3 from ser.

Exercise 4.7 Compute the mean of weights of each fruit.

import pandas as pd
fruit = pd.Series(np.random.choice(['apple', 'banana', 'carrot'], 10))
weights = pd.Series(np.linspace(1, 10, 10))
df = pd.DataFrame({'fruit': fruit, 'weights': weights})

Exercise 4.8 Consider the following DataFrame.

import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
  1. Check if df has any missing values.
  2. Please count the number of missing values in each column.
  3. Please replace all missing values in Min.Price and Max.Price with their mean respectively.

Exercise 4.9 Get the last two rows of df whose row sum is greater than 100.

import pandas as pd
df = pd.DataFrame(np.random.randint(10, 40, 60).reshape(-1, 4))

Exercise 4.10 The groupby object df_grouped is given below.

import pandas as pd
import numpy as np
df = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'] * 3,
                   'price': np.random.rand(9),
                   'taste': np.random.randint(0, 11, 9)})

df_grouped = df.groupby(['fruit'])
  1. Get the group belonging to apple as a DataFrame.
  2. Find the second largest value of taste for banana.
  3. Compute the mean price for every fruit.

Exercise 4.11 Join df1 and df2 by fruit/pazham and weight/kilo.

df1 = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'] * 3,
                    'weight': ['high', 'medium', 'low'] * 3,
                    'price': np.random.randint(0, 15, 9)})

df2 = pd.DataFrame({'pazham': ['apple', 'orange', 'pine'] * 2,
                    'kilo': ['high', 'low'] * 3,
                    'price': np.random.randint(0, 15, 6)})

Exercise 4.12 Consider the following DataFrame.

import pandas as pd
import numpy as np
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', usecols=[0,1,2,3,5])
  1. Replace NaN with string missing in columns Manufacturer, Model and Type.
  2. Create an index as a combination of these three columns.

Exercise 4.13 Given the following DataFrame.

import pandas as pd
df = pd.DataFrame({
    'name': ['James', 'Jane', 'Melissa', 'Ed', 'Neil'],
    'age': [30, 40, 32, 67, 43],
    'score': ['90%', '95%', '100%', '82%', '87%'],
    'age_missing_data': [30, 40, 32, 67, None],
    'income':[100000, 80000, 55000, 62000, 120000]
})
  • Please use .map to create a new column numeric_score whose value is the number version of score.
  • Please use .apply to create a new column numeric_score whose value is the number version of score.

Exercise 4.14 The following DataFrame is given.

import pandas as pd
df = pd.DataFrame(["STD, City    State",
                   "33, Kolkata    West Bengal",
                   "44, Chennai    Tamil Nadu",
                   "40, Hyderabad    Telengana",
                   "80, Bangalore    Karnataka"],
                   columns=['row'])

Please create a new DataFrame out of df by spliting it into three columns based on , and four spaces. In addition, the column names of the new DataFrame are given by the first row of df.