4 Package: pandas

The basic data structure for pandas is `pandas.DataFrame`. You may treat it as a generalized version of tables.

To use pandas, we just import it. In most cases you would like to use the alias `pd`.

```python
import pandas as pd
```

Since a `DataFrame` is more like a table, the biggest question here is not how to do computations (which are still very important), but how to retrieve, search, sort, merge, etc. the data.
4.1 Basic pandas
4.1.1 Series and DataFrame

A `Series` is a 1-d array-like object which has an index. The default index starts from `0`. You may change the index to something assigned by you. Thus it can be treated as a generalization of a `dict`.

```python
obj = pd.Series([3, 1, 2, 4])
obj
```
0 3
1 1
2 2
3 4
dtype: int64
```python
obj2 = pd.Series([3, 1, 2, 4], index=['a', 'b', 'c', 'd'])
obj2
```
a 3
b 1
c 2
d 4
dtype: int64
```python
data3 = {'a': 3, 'b': 1, 'c': 2, 'd': 4}
obj3 = pd.Series(data3)
obj3
```
a 3
b 1
c 2
d 4
dtype: int64
A `DataFrame` represents a rectangular table of data and contains an ordered collection of columns, each of which can be of a different value type. The `DataFrame` has both a row and a column index; it can be thought of as a dict of `Series` all sharing the same index. When displaying a `DataFrame`, we may use `.head()` to display just the first few rows for efficiency.
```python
import pandas as pd

data = {'a': [1, 2, 3, 4, 5, 6, 7],
        'b': [1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.1],
        'c': ['a', 'b', 'c', 'd', 'e', 'f', 'g']}
df = pd.DataFrame(data)
df.head()
```

 | a | b | c |
---|---|---|---|
0 | 1 | 1.1 | a |
1 | 2 | 2.1 | b |
2 | 3 | 3.1 | c |
3 | 4 | 4.1 | d |
4 | 5 | 5.1 | e |
4.1.2 Decorations
A `Series` or a `DataFrame` might have named row indexes and column names. I collect some tools for you to play with them and list them below.

We may use the arguments `columns=` and `index=` to set the column names and the index names. See the following example.
```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data
```

 | one | two | three | four |
---|---|---|---|---|
Ohio | 0 | 1 | 2 | 3 |
Colorado | 4 | 5 | 6 | 7 |
Utah | 8 | 9 | 10 | 11 |
New York | 12 | 13 | 14 | 15 |
.rename()
We may use the `.rename()` method. Note that by default the return value of this method is a copy and it won't affect the original `DataFrame`. The arguments can be in many different formats. Please see the official documentation for more details.

If you want to make the change directly, please use the argument `inplace=True`.
The following example shows the standard way to rename.
```python
df = pd.DataFrame(np.arange(16).reshape((4, 4)))
df.rename(columns={0: 'zero'}, index={2: 'two'})
```

 | zero | 1 | 2 | 3 |
---|---|---|---|---|
0 | 0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 | 7 |
two | 8 | 9 | 10 | 11 |
3 | 12 | 13 | 14 | 15 |
However, the original `df` is not affected.
```python
df
```

 | 0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | 0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 | 7 |
2 | 8 | 9 | 10 | 11 |
3 | 12 | 13 | 14 | 15 |
If you would like to change the original `df`, you may either set `df = df.rename(columns={0: 'zero'}, index={2: 'two'})`, or

```python
df.rename(columns={0: 'zero'}, index={2: 'two'}, inplace=True)
df
```

 | zero | 1 | 2 | 3 |
---|---|---|---|---|
0 | 0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 | 7 |
two | 8 | 9 | 10 | 11 |
3 | 12 | 13 | 14 | 15 |
.set_index()
The title says it all. A few remarks:

- You may set multiple columns to be the index. In this case, what you get is a multi-index system (which is also called hierarchical indexing). We will talk about this later in Section 4.4.2.
- The argument `drop` controls whether the column is deleted after you set it to be the index. The default is `True`.
- The argument `append` controls whether the column you choose is appended to the existing index to form a multi-index system. The default is `False`.
- The argument `inplace` controls whether you want to make the change in place. The default is `False`.

A short sketch of these arguments follows.
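The small DataFrame `df` here is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'k': ['x', 'y'], 'v': [1, 2]})

df.set_index('k')                # 'k' becomes the index and is removed as a column
df.set_index('k', drop=False)    # 'k' becomes the index and is kept as a column
df.set_index('v', append=True)   # 'v' is appended to the existing index (multi-index)
```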
.reset_index()
The title says it all. A few remarks:

- The new index is integers starting from `0`.
- The argument `drop` controls whether the original index is dropped or added back to the `DataFrame` as a column. The default is `False`, which means that by default the original index will be added back to the `DataFrame`.

A short sketch follows.
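Continuing the made-up `df` from the previous sketch:

```python
indexed = df.set_index('k')

indexed.reset_index()           # 'k' goes back to being a regular column
indexed.reset_index(drop=True)  # the old index is discarded entirely
```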
4.1.3 Look at the DataFrame
The following methods can be used to look at a `DataFrame`. Their syntax is very simple. Please try them by yourselves.

- `.head()`: show the first few rows.
- `.tail()`: show the last few rows.
- `.describe()`: show the basic statistics of each column.
These are methods for `Series` which might be helpful to understand the data:

- `.unique()`
- `.value_counts()`
```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 1, 2, 2, 1, 1, 1],
                   'b': [3, 1, 1, 2, 4, 5, 2, 1, 3]})
df.head(3)
```

 | a | b |
---|---|---|
0 | 1 | 3 |
1 | 2 | 1 |
2 | 3 | 1 |
```python
df.tail()
```

 | a | b |
---|---|---|
4 | 2 | 4 |
5 | 2 | 5 |
6 | 1 | 2 |
7 | 1 | 1 |
8 | 1 | 3 |
```python
df.describe()
```

 | a | b |
---|---|---|
count | 9.000000 | 9.000000 |
mean | 1.555556 | 2.444444 |
std | 0.726483 | 1.424001 |
min | 1.000000 | 1.000000 |
25% | 1.000000 | 1.000000 |
50% | 1.000000 | 2.000000 |
75% | 2.000000 | 3.000000 |
max | 3.000000 | 5.000000 |
```python
df['a'].unique()
```
array([1, 2, 3], dtype=int64)
```python
df['b'].value_counts()
```
1 3
3 2
2 2
4 1
5 1
Name: b, dtype: int64
4.2 Indexing
The act of selecting rows or columns to access from a `DataFrame` or `Series` is called indexing. There are many different ways to index in pandas. We will only cover the most popular ones.

pandas has the same copy-versus-view issue as numpy. However, it is more complicated and more inconsistent. Please check the official documentation for more details and do more experiments before implementing your code. Usually if your code is ambiguous, you might see the infamous `SettingWithCopyWarning`.
```python
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df[df['a']==3]['b'] = 3
```
C:\Users\Xinli\AppData\Local\Temp\ipykernel_10540\4018622545.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df[df['a']==3]['b'] = 3
Note that pandas is testing a copy-on-write feature to fix the issue. The feature can be summarized as "any `DataFrame` or `Series` derived from another in any way always behaves as a copy". Please keep an eye on the updates about when the feature will be fully implemented in pandas.
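As a sketch: assuming a pandas version (1.5 or later) where the experimental option is available, you can opt in to copy-on-write globally, and then a single `.loc` call updates the data unambiguously.

```python
import pandas as pd

# Experimental option in pandas 1.5+; the exact behavior may change in later versions.
pd.set_option("mode.copy_on_write", True)

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
# One .loc call with both indexers instead of chained indexing:
df.loc[df['a'] == 3, 'b'] = 3
```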
4.2.1 []
Series[]
Input value type | Return Value Type |
---|---|
`Series[label]` | scalar value |
`Series[list of labels]` | `Series` corresponding to the labels |
`Series[slice]` | `Series` corresponding to the slice |
`Series[boolean vector]` | `Series` corresponding to the boolean vector |
- For `Series`, values are accessed by labels, not positions. Since a `Series` is usually considered as a column, you may think of these labels as row indexes.
- When using slices, things become more complicated. There are two ways of using slices: you may either slice by positions or slice by labels. The main difference between them is that slicing by positions `Series[i:j]` doesn't contain the last entry `Series[j]`, while slicing by labels `Series[I:J]` contains the last entry `Series[J]`.
- Sometimes the labels of a series are integers, but different from the position indexes. In pandas 1.5.1, slicing by positions takes priority. However, the whole scenario is very confusing, and this will be changed in future versions. In such cases it is recommended to use `.loc` and `.iloc`.
- When indexing using a boolean vector, the vector should be of the same length as the `Series`. In other words, the boolean vector shows which rows are selected.
See some examples below.
Example 4.1
```python
import pandas as pd

example = pd.Series({'a': 1.1, 'b': 2.2, 'c': 3.3, 'd': 4.4})
example
```
a 1.1
b 2.2
c 3.3
d 4.4
dtype: float64
```python
example['b']
```
2.2
```python
example[['b', 'a']]
```
b 2.2
a 1.1
dtype: float64
```python
example[0:2]
```
a 1.1
b 2.2
dtype: float64
```python
example['a':'c']
```
a 1.1
b 2.2
c 3.3
dtype: float64
```python
example[[True, False, True, False]]
```
a 1.1
c 3.3
dtype: float64
DataFrame[]
Input value type | Return Value Type |
---|---|
`DataFrame[colname]` | The column corresponding to colname as a `Series` |
`DataFrame[list-of-colnames]` | The columns of the `DataFrame` corresponding to the colnames |
`DataFrame[slice]` | The rows of the `DataFrame` corresponding to the slice |
`DataFrame[boolean list]` | The `DataFrame` corresponding to the boolean list |
- Slicing a `DataFrame` behaves exactly like slicing a `Series`: it selects rows, and it works for both labels and positions. Similarly, slicing by positions is not recommended and might be deprecated in the future.
- On the other hand, selecting rows is usually related to querying. Therefore it is better not to focus on slicing.
- Inside `[]`, one column name and a list of column names result in totally different objects: the former gives a `Series` and the latter a `DataFrame`.
- In prior versions, using `[list-of-colnames]` would work as long as at least one of the keys was found (otherwise it would raise a `KeyError`). This behavior was changed and it will now raise a `KeyError` if at least one label is missing. The recommended alternative is to use `.reindex()`.
- When indexing using a boolean vector, the vector should be of the same length as the number of rows of the `DataFrame`. In other words, the boolean vector shows which rows are selected.
- Using `[]` on a `DataFrame` cannot give you a single value, since what is inside `[]` is always treated as a row index or a column index. If you want access to the value of a single cell by both row index and column index, use another method like `.loc[]`.
- If the column name is eligible as an attribute name, you may also use `df.a` to represent `df['a']` for simplicity.
Example 4.2
```python
import pandas as pd

example = pd.DataFrame({'a': [1.1, 2.2], 'b': [2.2, 3.3], 'c': [3.3, 4.4]})
example
```

 | a | b | c |
---|---|---|---|
0 | 1.1 | 2.2 | 3.3 |
1 | 2.2 | 3.3 | 4.4 |
```python
example['a']
```
0 1.1
1 2.2
Name: a, dtype: float64
```python
example.a
```
0 1.1
1 2.2
Name: a, dtype: float64
```python
example[['a']]
```

 | a |
---|---|
0 | 1.1 |
1 | 2.2 |
```python
example[0:1]
```

 | a | b | c |
---|---|---|---|
0 | 1.1 | 2.2 | 3.3 |
```python
example[[False, True]]
```

 | a | b | c |
---|---|---|---|
1 | 2.2 | 3.3 | 4.4 |
4.2.2 .loc[] and .iloc[]

`[]` covers several different ways to access data. `.loc[]` and `.iloc[]` are more specific.

- `.loc[]` uses labels to access data.
- `.iloc[]` uses positions to access data.
Some remarks on `.loc[]` and `.iloc[]`:

- When only one index is specified, it refers to rows.
- When using both indexes, the first is the row index and the second is the column index.
- When selecting all rows/columns, you may put `:` in the corresponding place.
- `df.loc[1, 'a']` refers to the cell in the DataFrame `df` whose row index is `1` and column index is `'a'`, while `df[1, 'a']` refers to the column in the DataFrame `df` whose column name is `(1, 'a')`.
- Many other small details are very similar to `[]`. For example, pay attention to the differences between `df.loc[:, 'a']` and `df.loc[:, ['a']]`.
Example 4.3
```python
import pandas as pd

example = pd.DataFrame({'a': [1.1, 2.2], 'b': [2.2, 3.3], 'c': [3.3, 4.4]})
example
```

 | a | b | c |
---|---|---|---|
0 | 1.1 | 2.2 | 3.3 |
1 | 2.2 | 3.3 | 4.4 |
```python
example.loc[1]
```
a 2.2
b 3.3
c 4.4
Name: 1, dtype: float64
```python
example.loc[:, 'a']
```
0 1.1
1 2.2
Name: a, dtype: float64
```python
example.loc[1, 'a']
```
2.2
```python
example.iloc[0:1, 0:2]
```

 | a | b |
---|---|---|
0 | 1.1 | 2.2 |
```python
example.iloc[1, 0:2]
```
a 2.2
b 3.3
Name: 1, dtype: float64
```python
example.iloc[[1], 0:2]
```

 | a | b |
---|---|---|
1 | 2.2 | 3.3 |
4.2.3 Boolean indexing
Let `df` be a `DataFrame`. Assume that `boo` is a boolean vector whose length equals the number of rows of `df`. Then we can use `df[boo]` to filter data: all rows marked `True` will be selected. The syntax is similar to boolean indexing in numpy.
```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4),
                  index=pd.date_range('1/1/2023', periods=8),
                  columns=['A', 'B', 'C', 'D'])
df
```

 | A | B | C | D |
---|---|---|---|---|
2023-01-01 | 0.237607 | 1.323886 | -0.658127 | -0.975299 |
2023-01-02 | -0.764331 | 0.833129 | -2.136817 | -0.494983 |
2023-01-03 | -0.857309 | -0.025589 | 0.802247 | -0.437641 |
2023-01-04 | 0.833024 | -0.345832 | -0.224655 | -1.669289 |
2023-01-05 | -0.610940 | 0.804115 | -0.147965 | -1.162376 |
2023-01-06 | 1.149887 | -1.471409 | -1.546448 | 0.805580 |
2023-01-07 | 1.150973 | -2.067331 | -1.544236 | -0.131534 |
2023-01-08 | 0.208735 | -0.331659 | -1.371364 | 0.762330 |
```python
df[df['A'] > 0]
```

 | A | B | C | D |
---|---|---|---|---|
2023-01-01 | 0.237607 | 1.323886 | -0.658127 | -0.975299 |
2023-01-04 | 0.833024 | -0.345832 | -0.224655 | -1.669289 |
2023-01-06 | 1.149887 | -1.471409 | -1.546448 | 0.805580 |
2023-01-07 | 1.150973 | -2.067331 | -1.544236 | -0.131534 |
2023-01-08 | 0.208735 | -0.331659 | -1.371364 | 0.762330 |
To get the boolean vector, we may directly compute a logical expression using columns of `df`. The previous example is of this kind.

You may write complicated expressions. The operators are:

- `|` for or
- `&` for and
- `~` for not

Note that parentheses must be used to ensure a correct result. Please see the following example.
```python
df[(df['A'] > 1) & (df['B'] < 3)]
```

 | A | B | C | D |
---|---|---|---|---|
2023-01-06 | 1.149887 | -1.471409 | -1.546448 | 0.805580 |
2023-01-07 | 1.150973 | -2.067331 | -1.544236 | -0.131534 |
There are many methods and functions that can create boolean vectors. We will introduce them when we need them.
4.2.4 .query()
A `DataFrame` has a `.query()` method that allows filtering using an expression instead of a boolean vector. This method takes a different approach from a programming-language point of view; from the user's point of view, you are free to choose between `.query()` and boolean indexing to filter data. Below are some examples of `.query()`.
```python
df = pd.DataFrame({'A': [1, 2, 3], 'B': [3, 2, 1], 'C': [5, 4, 3]})
df
```

 | A | B | C |
---|---|---|---|
0 | 1 | 3 | 5 |
1 | 2 | 2 | 4 |
2 | 3 | 1 | 3 |
```python
df.query('A<B and B<C')
```

 | A | B | C |
---|---|---|---|
0 | 1 | 3 | 5 |
It can be simplified as follows:
```python
df.query('A<B<C')
```

 | A | B | C |
---|---|---|---|
0 | 1 | 3 | 5 |
This is the same as the following code.
```python
df[(df['A']<df['B']) & (df['B']<df['C'])]
```

 | A | B | C |
---|---|---|---|
0 | 1 | 3 | 5 |
Note that `.query()` does not require the usage of parentheses. It also accepts the English-like `or`/`and`/`not` in place of `|`/`&`/`~`.
```python
df.query('A in C')
```

 | A | B | C |
---|---|---|---|
2 | 3 | 1 | 3 |
```python
df.query('A not in C')
```

 | A | B | C |
---|---|---|---|
0 | 1 | 3 | 5 |
1 | 2 | 2 | 4 |
```python
df.query('A not in C and A<B')
```

 | A | B | C |
---|---|---|---|
0 | 1 | 3 | 5 |
```python
df.query('[1,2] in B')
```

 | A | B | C |
---|---|---|---|
1 | 2 | 2 | 4 |
2 | 3 | 1 | 3 |
4.2.5 Reindex
`.reindex()` is a data alignment method in pandas. To reindex means to conform the data to match a given set of labels along a particular axis. This accomplishes several things:

- Reordering the existing data to match a new set of labels
- Inserting missing value (`NaN`) markers in label locations where no data for that label existed
Here is a simple example:
```python
import pandas as pd
import numpy as np

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data
```

 | one | two | three | four |
---|---|---|---|---|
Ohio | 0 | 1 | 2 | 3 |
Colorado | 4 | 5 | 6 | 7 |
Utah | 8 | 9 | 10 | 11 |
New York | 12 | 13 | 14 | 15 |
```python
data.reindex(index=['Colorado', 'Arkansas', 'New York'],
             columns=['three', 'five', 'one'])
```

 | three | five | one |
---|---|---|---|
Colorado | 6.0 | NaN | 4.0 |
Arkansas | NaN | NaN | NaN |
New York | 14.0 | NaN | 12.0 |
At first glance, `.reindex()` behaves the same as other indexing methods. Here are a few differences:

- The purpose of indexing methods is to select/filter data, while the purpose of reindexing is to put the data into a very specific form.
- When dealing with non-existent indexes/columns, most other indexing methods raise an error or a warning, while `.reindex()` handles them automatically.
- The default setting of `.reindex()` is to return a copy. This setting can be changed with the argument `copy=False`.
For more details please see the official guide.
4.2.6 Updating data
- Assigning values to a column of a `DataFrame` will update that column. If the column doesn't exist, a new column will be created. This is called enlargement.
- When assigning values with a non-existent row index, that part of the data will be discarded.
- When using `.loc`, a `DataFrame` can be enlarged on either axis.
- Any time there is no value for a specific column and row, it will show as `NaN`.
Example 4.4
```python
import pandas as pd

data = {'a': [1, 2, 3, 4],
        'b': [1.1, 2.1, 3.1, 4.1],
        'c': ['a', 'b', 'c', 'd']}
df = pd.DataFrame(data)

newcol = {1: 'good', 3: 'better', 5: 'best'}
df['d'] = pd.Series(newcol)
df
```

 | a | b | c | d |
---|---|---|---|---|
0 | 1 | 1.1 | a | NaN |
1 | 2 | 2.1 | b | good |
2 | 3 | 3.1 | c | NaN |
3 | 4 | 4.1 | d | better |
4.3 Data cleaning
4.3.1 Handling Missing Data
- `np.nan`, `pd.NA`
- `pd.isnull()`, `np.isnan()`
- `.dropna()`, `.fillna()`

A `.dropna()` example:
```python
import pandas as pd
import numpy as np

data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])
data
```

 | 0 | 1 | 2 |
---|---|---|---|
0 | 1.0 | 6.5 | 3.0 |
1 | 1.0 | NaN | NaN |
2 | NaN | NaN | NaN |
3 | NaN | 6.5 | 3.0 |
```python
data.dropna()
```

 | 0 | 1 | 2 |
---|---|---|---|
0 | 1.0 | 6.5 | 3.0 |
```python
data.dropna(how='all')
```

 | 0 | 1 | 2 |
---|---|---|---|
0 | 1.0 | 6.5 | 3.0 |
1 | 1.0 | NaN | NaN |
3 | NaN | 6.5 | 3.0 |
```python
data[4] = np.nan
data
```

 | 0 | 1 | 2 | 4 |
---|---|---|---|---|
0 | 1.0 | 6.5 | 3.0 | NaN |
1 | 1.0 | NaN | NaN | NaN |
2 | NaN | NaN | NaN | NaN |
3 | NaN | 6.5 | 3.0 | NaN |
```python
data.dropna(axis=1, how='all')
```

 | 0 | 1 | 2 |
---|---|---|---|
0 | 1.0 | 6.5 | 3.0 |
1 | 1.0 | NaN | NaN |
2 | NaN | NaN | NaN |
3 | NaN | 6.5 | 3.0 |
```python
data.dropna(thresh=2)
```

 | 0 | 1 | 2 | 4 |
---|---|---|---|---|
0 | 1.0 | 6.5 | 3.0 | NaN |
3 | NaN | 6.5 | 3.0 | NaN |
A `.fillna()` example:

```python
data.fillna(0)
```

 | 0 | 1 | 2 | 4 |
---|---|---|---|---|
0 | 1.0 | 6.5 | 3.0 | 0.0 |
1 | 1.0 | 0.0 | 0.0 | 0.0 |
2 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 0.0 | 6.5 | 3.0 | 0.0 |
```python
data.fillna({1: 0.5, 2: -0.1})
```

 | 0 | 1 | 2 | 4 |
---|---|---|---|---|
0 | 1.0 | 6.5 | 3.0 | NaN |
1 | 1.0 | 0.5 | -0.1 | NaN |
2 | NaN | 0.5 | -0.1 | NaN |
3 | NaN | 6.5 | 3.0 | NaN |
When doing arithmetic between two objects, elements with the same index and column are computed together. By default, if any entry is `nan`, the answer will be `nan`. You may use the `fill_value` argument to fill the empty slots. Please see the following example.
```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))
df2.loc[1, 'b'] = np.nan

df1.add(df2, fill_value=0)
```

 | a | b | c | d | e |
---|---|---|---|---|---|
0 | 0.0 | 2.0 | 4.0 | 6.0 | 4.0 |
1 | 9.0 | 5.0 | 13.0 | 15.0 | 9.0 |
2 | 18.0 | 20.0 | 22.0 | 24.0 | 14.0 |
3 | 15.0 | 16.0 | 17.0 | 18.0 | 19.0 |
Relatedly, when reindexing a `Series` or a `DataFrame`, you can also specify a `fill_value`.
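A minimal sketch, reusing `df1` from the example above:

```python
# The new row 3 is filled with 0 instead of NaN.
df1.reindex(index=[0, 1, 2, 3], fill_value=0)
```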
4.3.2 Handling duplicates
A `.drop_duplicates()` example:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data.drop_duplicates(['k1'], keep='last')
```

 | k1 | k2 |
---|---|---|
4 | one | 3 |
6 | two | 4 |
4.3.3 Drop data
You may use `.drop()` to drop columns or rows.

- If you directly apply `.drop()` to an index, that index is considered as a row index.
- To drop a column, you need to specify the argument `columns=`.
- There is still the `inplace=` issue, as with `.rename()`.

A minimal sketch follows.
4.3.4 String Manipulation
When a column (`Series`) is of type `str`, all methods in `pd.Series.str` will be applied to each entry of the `Series`.
```python
import pandas as pd
import numpy as np

s = pd.Series(["A ", " B ", "C", "Aaba", " Baca ", np.nan, "CABA", "dog", "cat"])
s
```
0 A
1 B
2 C
3 Aaba
4 Baca
5 NaN
6 CABA
7 dog
8 cat
dtype: object
```python
s.str.lower()
```
0 a
1 b
2 c
3 aaba
4 baca
5 NaN
6 caba
7 dog
8 cat
dtype: object
```python
s.str.split('a')
```
0 [A ]
1 [ B ]
2 [C]
3 [A, b, ]
4 [ B, c, ]
5 NaN
6 [CABA]
7 [dog]
8 [c, t]
dtype: object
```python
s.str.len()
```
0 2.0
1 3.0
2 1.0
3 4.0
4 6.0
5 NaN
6 4.0
7 3.0
8 3.0
dtype: float64
```python
s.str.strip()
```
0 A
1 B
2 C
3 Aaba
4 Baca
5 NaN
6 CABA
7 dog
8 cat
dtype: object
```python
s.str.replace("A", '1')
```
0 1
1 B
2 C
3 1aba
4 Baca
5 NaN
6 C1B1
7 dog
8 cat
dtype: object
Example 4.5 We could also use `.str` to play with column names and row indexes.
```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(3, 2),
                  columns=[" Column A ", " Column B "], index=range(3))

df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
df
```

 | column_a | column_b |
---|---|---|
0 | 0.465369 | 1.539442 |
1 | 2.196586 | -1.541876 |
2 | 0.371981 | 0.309551 |
String methods are usually used with regular expressions. For more details please see Appendix F.
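As a small sketch of this combination, reusing `s` from above (the patterns here are made up for illustration):

```python
# Entries that start with 'c' or 'C'.
s.str.contains(r'^c', case=False)

# Replace any run of whitespace with a single underscore.
s.str.replace(r'\s+', '_', regex=True)
```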
4.4 Data Wrangling
4.4.1 Tidy data
The same underlying data can be represented in multiple ways. To better study the data, it is better to make the data tidy.

Definition 4.1 A dataset is tidy if

- each variable has its own column;
- each observation has its own row;
- each value has its own cell.

The following `DataFrame`s are provided by tidyr. We will talk about them again when we get to R. These tables can be downloaded by clicking the names.
```python
import pandas as pd

table1 = pd.read_csv('assests/datasets/table1.csv', index_col='Unnamed: 0')
table1
```

 | country | year | cases | population |
---|---|---|---|---|
1 | Afghanistan | 1999 | 745 | 19987071 |
2 | Afghanistan | 2000 | 2666 | 20595360 |
3 | Brazil | 1999 | 37737 | 172006362 |
4 | Brazil | 2000 | 80488 | 174504898 |
5 | China | 1999 | 212258 | 1272915272 |
6 | China | 2000 | 213766 | 1280428583 |
```python
import pandas as pd

table2 = pd.read_csv('assests/datasets/table2.csv', index_col='Unnamed: 0')
table2
```

 | country | year | type | count |
---|---|---|---|---|
1 | Afghanistan | 1999 | cases | 745 |
2 | Afghanistan | 1999 | population | 19987071 |
3 | Afghanistan | 2000 | cases | 2666 |
4 | Afghanistan | 2000 | population | 20595360 |
5 | Brazil | 1999 | cases | 37737 |
6 | Brazil | 1999 | population | 172006362 |
7 | Brazil | 2000 | cases | 80488 |
8 | Brazil | 2000 | population | 174504898 |
9 | China | 1999 | cases | 212258 |
10 | China | 1999 | population | 1272915272 |
11 | China | 2000 | cases | 213766 |
12 | China | 2000 | population | 1280428583 |
```python
import pandas as pd

table3 = pd.read_csv('assests/datasets/table3.csv', index_col='Unnamed: 0')
table3
```

 | country | year | rate |
---|---|---|---|
1 | Afghanistan | 1999 | 745/19987071 |
2 | Afghanistan | 2000 | 2666/20595360 |
3 | Brazil | 1999 | 37737/172006362 |
4 | Brazil | 2000 | 80488/174504898 |
5 | China | 1999 | 212258/1272915272 |
6 | China | 2000 | 213766/1280428583 |
```python
import pandas as pd

table4a = pd.read_csv('assests/datasets/table4a.csv', index_col='Unnamed: 0')
table4b = pd.read_csv('assests/datasets/table4b.csv', index_col='Unnamed: 0')
table4a
```

 | country | 1999 | 2000 |
---|---|---|---|
1 | Afghanistan | 745 | 2666 |
2 | Brazil | 37737 | 80488 |
3 | China | 212258 | 213766 |
```python
table4b
```

 | country | 1999 | 2000 |
---|---|---|---|
1 | Afghanistan | 19987071 | 20595360 |
2 | Brazil | 172006362 | 174504898 |
3 | China | 1272915272 | 1280428583 |
Among all these `DataFrame`s, only `table1` is tidy.

These three conditions are interrelated because it is impossible to satisfy only two of the three. In practice, we follow these instructions:

- Put each dataset in a `DataFrame`.
- Put each variable in a column.
- Make every row an observation.

Tidy data is a consistent way to organize your data. The main advantages are:

- It is one consistent way of storing data. In other words, this is a consistent data structure that can be used in many cases.
- Placing variables in columns enables Python to do vectorized operations.

Most datasets are untidy, since tidy data is usually not the intuitive format for data collection. Therefore raw data collected in a naive way is usually not tidy.
In untidy data, typically:

- one variable might be spread across multiple columns;
- one observation might be scattered across multiple rows.
`.melt()` method
A common problem is that the column names are not names of variables, but values of a variable. For example, `table4a` above has columns `1999` and `2000`. These two names are actually the values of a variable `year`. In addition, each row represents two observations, not one.
 | country | 1999 | 2000 |
---|---|---|---|
1 | Afghanistan | 745 | 2666 |
2 | Brazil | 37737 | 80488 |
3 | China | 212258 | 213766 |
To tidy this type of dataset, we need to gather those columns into a new pair of variables. We need three parameters:

- The set of columns that represent values. In this case, those are `1999` and `2000`.
- The name of the variable. In this case, it is `year`.
- The name of the variable whose values are spread over the cells. In this case, it is the number of `cases`.
Then we apply `.melt()`.

```python
table4a.melt(id_vars=['country'],
             value_vars=['1999', '2000'],
             var_name='year',
             value_name='cases')
```

 | country | year | cases |
---|---|---|---|
0 | Afghanistan | 1999 | 745 |
1 | Brazil | 1999 | 37737 |
2 | China | 1999 | 212258 |
3 | Afghanistan | 2000 | 2666 |
4 | Brazil | 2000 | 80488 |
5 | China | 2000 | 213766 |
We can do a similar thing to `table4b`.

```python
table4b.melt(id_vars=['country'],
             value_vars=['1999', '2000'],
             var_name='year',
             value_name='population')
```

 | country | year | population |
---|---|---|---|
0 | Afghanistan | 1999 | 19987071 |
1 | Brazil | 1999 | 172006362 |
2 | China | 1999 | 1272915272 |
3 | Afghanistan | 2000 | 20595360 |
4 | Brazil | 2000 | 174504898 |
5 | China | 2000 | 1280428583 |
In Python there are multiple ways to change a wide `DataFrame` into a longer one like `.melt()` does. Among all of them, `.melt()` is the most common one.
`.pivot()` method

Another issue is that an observation is scattered across multiple rows. Take `table2` as an example. An observation is a country in a year, but each observation is spread across two rows.
```python
table2
```

 | country | year | type | count |
---|---|---|---|---|
1 | Afghanistan | 1999 | cases | 745 |
2 | Afghanistan | 1999 | population | 19987071 |
3 | Afghanistan | 2000 | cases | 2666 |
4 | Afghanistan | 2000 | population | 20595360 |
5 | Brazil | 1999 | cases | 37737 |
6 | Brazil | 1999 | population | 172006362 |
7 | Brazil | 2000 | cases | 80488 |
8 | Brazil | 2000 | population | 174504898 |
9 | China | 1999 | cases | 212258 |
10 | China | 1999 | population | 1272915272 |
11 | China | 2000 | cases | 213766 |
12 | China | 2000 | population | 1280428583 |
We could apply `.pivot()` to make it tidy. Here we need two arguments.

- The column that contains the variable names. Here, it's `type`.
- The column that contains the values of those variables. Here, it's `count`.

```python
table2.pivot(index=['country', 'year'], columns='type', values='count')
```
 | type | cases | population |
---|---|---|---|
country | year | | |
Afghanistan | 1999 | 745 | 19987071 |
2000 | 2666 | 20595360 | |
Brazil | 1999 | 37737 | 172006362 |
2000 | 80488 | 174504898 | |
China | 1999 | 212258 | 1272915272 |
2000 | 213766 | 1280428583 |
If we would like to split one column into multiple columns because there is more than one value in a cell, we could use the `Series` string methods to split it.

```python
table3['newrate'] = table3['rate'].str.split('/')
table3
```

 | country | year | rate | newrate |
---|---|---|---|---|
1 | Afghanistan | 1999 | 745/19987071 | [745, 19987071] |
2 | Afghanistan | 2000 | 2666/20595360 | [2666, 20595360] |
3 | Brazil | 1999 | 37737/172006362 | [37737, 172006362] |
4 | Brazil | 2000 | 80488/174504898 | [80488, 174504898] |
5 | China | 1999 | 212258/1272915272 | [212258, 1272915272] |
6 | China | 2000 | 213766/1280428583 | [213766, 1280428583] |
If we prepare two columns from the beginning, we could directly get two columns. Note that the argument `expand=True` means that we want to get a `DataFrame` by expanding dimensionality. More details can be found here.

```python
table3[['cases', 'population']] = table3['rate'].str.split('/', expand=True)
table3.drop(columns=['rate', 'newrate'], inplace=True)
table3
```

 | country | year | cases | population |
---|---|---|---|---|
1 | Afghanistan | 1999 | 745 | 19987071 |
2 | Afghanistan | 2000 | 2666 | 20595360 |
3 | Brazil | 1999 | 37737 | 172006362 |
4 | Brazil | 2000 | 80488 | 174504898 |
5 | China | 1999 | 212258 | 1272915272 |
6 | China | 2000 | 213766 | 1280428583 |
Similarly, we could also combine columns by treating them as strings.

```python
table3['another_rate'] = table3['cases'] + '/' + table3['population']
table3
```

 | country | year | cases | population | another_rate |
---|---|---|---|---|---|
1 | Afghanistan | 1999 | 745 | 19987071 | 745/19987071 |
2 | Afghanistan | 2000 | 2666 | 20595360 | 2666/20595360 |
3 | Brazil | 1999 | 37737 | 172006362 | 37737/172006362 |
4 | Brazil | 2000 | 80488 | 174504898 | 80488/174504898 |
5 | China | 1999 | 212258 | 1272915272 | 212258/1272915272 |
6 | China | 2000 | 213766 | 1280428583 | 213766/1280428583 |
4.4.2 Hierarchical indexing
pandas supports a more complex indexing system, in which the index may have multiple levels. See the following example.

```python
import pandas as pd
import numpy as np

data = pd.Series(np.random.randn(9),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 2, 3, 1, 2, 3]])
data
```
a 1 -0.729463
2 -0.550397
3 1.271335
b 1 -0.139672
2 -0.083897
c 3 0.699671
1 0.184548
d 2 -0.414853
3 0.931543
dtype: float64
You may look at the Series using different levels of indexes.
```python
data['a']
```
1 -0.729463
2 -0.550397
3 1.271335
dtype: float64
```python
data.loc[:, 2]
```
a -0.550397
b -0.083897
d -0.414853
dtype: float64
You may use groupby to group by levels and do calculations related to the levels. More about `.groupby()` will be discussed in the next section.

```python
data.groupby(level=1).sum()
```
1 -0.684587
2 -1.049147
3 2.902549
dtype: float64
From the example above, you may notice that the 2-level hierarchical indexing for a Series works very similar to a DataFrame. In fact, you may translate it back and forth between a 2-level indexing Series and a DataFrame.
```python
df = data.unstack()
df
```

 | 1 | 2 | 3 |
---|---|---|---|
a | -0.729463 | -0.550397 | 1.271335 |
b | -0.139672 | -0.083897 | NaN |
c | 0.184548 | NaN | 0.699671 |
d | NaN | -0.414853 | 0.931543 |
```python
df.stack()
```
a 1 -0.729463
2 -0.550397
3 1.271335
b 1 -0.139672
2 -0.083897
c 1 0.184548
3 0.699671
d 2 -0.414853
3 0.931543
dtype: float64
For a `DataFrame`, the index on both axes can be a multi-index. The usual indexing methods can be used if you want to start from the first level of the index. The more specific method to extract data is `.xs()`.
```python
import pandas as pd

df1 = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    },
    index=[0, 1, 2, 3],
)
df2 = pd.DataFrame(
    {
        "A": ["A4", "A5", "A6", "A7"],
        "B": ["B4", "B5", "B6", "B7"],
        "C": ["C4", "C5", "C6", "C7"],
        "D": ["D4", "D5", "D6", "D7"],
    },
    index=[4, 5, 6, 7],
)
df = pd.concat([df1, df2], keys=['x', 'y'])
df
```

 | | A | B | C | D |
---|---|---|---|---|---|
x | 0 | A0 | B0 | C0 | D0 |
1 | A1 | B1 | C1 | D1 | |
2 | A2 | B2 | C2 | D2 | |
3 | A3 | B3 | C3 | D3 | |
y | 4 | A4 | B4 | C4 | D4 |
5 | A5 | B5 | C5 | D5 | |
6 | A6 | B6 | C6 | D6 | |
7 | A7 | B7 | C7 | D7 |
```python
df['A']
```
x 0 A0
1 A1
2 A2
3 A3
y 4 A4
5 A5
6 A6
7 A7
Name: A, dtype: object
```python
df.loc['x']
```

 | A | B | C | D |
---|---|---|---|---|
0 | A0 | B0 | C0 | D0 |
1 | A1 | B1 | C1 | D1 |
2 | A2 | B2 | C2 | D2 |
3 | A3 | B3 | C3 | D3 |
```python
df.loc['x', 3]
```
A A3
B B3
C C3
D D3
Name: (x, 3), dtype: object
```python
df.xs(3, level=1, drop_level=False)
```

 | | A | B | C | D |
---|---|---|---|---|---|
x | 3 | A3 | B3 | C3 | D3 |
4.4.3 Combining and Merging Datasets
`merge` and `concat` are the two most common ways to combine datasets.

`pd.merge()` function

Merge combines datasets by linking rows using one or more keys. This idea comes from relational databases (e.g., SQL-based databases).
Here are some examples.
Example 4.6
```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})
```
The two DataFrames are displayed as follows.
```python
df1
```

 | key | data1 |
---|---|---|
0 | b | 0 |
1 | b | 1 |
2 | a | 2 |
3 | c | 3 |
4 | a | 4 |
5 | a | 5 |
6 | b | 6 |
```python
df2
```

 | key | data2 |
---|---|---|
0 | a | 0 |
1 | b | 1 |
2 | d | 2 |
```python
pd.merge(df1, df2, on='key')
```

 | key | data1 | data2 |
---|---|---|---|
0 | b | 0 | 1 |
1 | b | 1 | 1 |
2 | b | 6 | 1 |
3 | a | 2 | 0 |
4 | a | 4 | 0 |
5 | a | 5 | 0 |
If the column names are different in each object, you can specify them separately.
```python
df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'],
                    'data2': range(3)})
pd.merge(df3, df4, left_on='lkey', right_on='rkey')
```

 | lkey | data1 | rkey | data2 |
---|---|---|---|---|
0 | b | 0 | b | 1 |
1 | b | 1 | b | 1 |
2 | b | 6 | b | 1 |
3 | a | 2 | a | 0 |
4 | a | 4 | a | 0 |
5 | a | 5 | a | 0 |
By default `merge` does an inner join: the keys in the result are the intersection of the keys found in both tables. Below are the different types of `merge`. To specify the type of merge, use the option `how`.

- `inner`
- `left`
- `right`
- `outer`

Let's see the following examples.
```python
df1 = pd.DataFrame({'Key': [1, 2], 'A': [0, 2], 'B': [1, 3]})
df1
```

 | Key | A | B |
---|---|---|---|
0 | 1 | 0 | 1 |
1 | 2 | 2 | 3 |
```python
df2 = pd.DataFrame({'Key': [1, 3], 'C': [0, 2], 'D': [1, 3]})
df2
```

 | Key | C | D |
---|---|---|---|
0 | 1 | 0 | 1 |
1 | 3 | 2 | 3 |
```python
pd.merge(df1, df2, on='Key', how='inner')
```

 | Key | A | B | C | D |
---|---|---|---|---|---|
0 | 1 | 0 | 1 | 0 | 1 |
```python
pd.merge(df1, df2, on='Key', how='outer')
```

 | Key | A | B | C | D |
---|---|---|---|---|---|
0 | 1 | 0.0 | 1.0 | 0.0 | 1.0 |
1 | 2 | 2.0 | 3.0 | NaN | NaN |
2 | 3 | NaN | NaN | 2.0 | 3.0 |
```python
pd.merge(df1, df2, on='Key', how='left')
```

 | Key | A | B | C | D |
---|---|---|---|---|---|
0 | 1 | 0 | 1 | 0.0 | 1.0 |
1 | 2 | 2 | 3 | NaN | NaN |
```python
pd.merge(df1, df2, on='Key', how='right')
```

 | Key | A | B | C | D |
---|---|---|---|---|---|
0 | 1 | 0.0 | 1.0 | 0 | 1 |
1 | 3 | NaN | NaN | 2 | 3 |
If a key combination appears more than once in both tables, the resulting table will have the Cartesian product of the associated data. Here is a very basic example with one unique key combination.
```python
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                    'data1': range(6)})
df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'],
                    'data2': range(5)})
pd.merge(df1, df2, on='key', how='left')
```

 | key | data1 | data2 |
---|---|---|---|
0 | b | 0 | 1.0 |
1 | b | 0 | 3.0 |
2 | b | 1 | 1.0 |
3 | b | 1 | 3.0 |
4 | a | 2 | 0.0 |
5 | a | 2 | 2.0 |
6 | c | 3 | NaN |
7 | a | 4 | 0.0 |
8 | a | 4 | 2.0 |
9 | b | 5 | 1.0 |
10 | b | 5 | 3.0 |
If the merge key of a `DataFrame` is in its index instead of its column(s), we can pass `left_index=True` or `right_index=True` (or both) instead of setting `left_on`/`right_on`/`on`.
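A minimal sketch, merging the `df1` above with a made-up lookup table keyed by its index:

```python
lookup = pd.DataFrame({'data2': [10, 20]}, index=['a', 'b'])

# df1's 'key' column is matched against lookup's index.
pd.merge(df1, lookup, left_on='key', right_index=True, how='left')
```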
Example 4.7 If we really want to create a Cartesian product, we may use the option `how='cross'`. For example, to generate a deck of cards, we may use the following code.

```python
suit = pd.DataFrame({'suit': ['spades', 'hearts', 'clubs', 'diamonds']})
face = pd.DataFrame({'face': list(range(1, 14))})
deck = pd.merge(suit, face, how='cross')
```
`pd.concat()` function

The `concat()` function (in the main pandas namespace) performs concatenation operations along an axis while performing optional set logic (union or intersection) of the indexes (if any) on the other axes.
```python
import pandas as pd

df1 = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    },
    index=[0, 1, 2, 3],
)
df2 = pd.DataFrame(
    {
        "A": ["A4", "A5", "A6", "A7"],
        "B": ["B4", "B5", "B6", "B7"],
        "C": ["C4", "C5", "C6", "C7"],
        "D": ["D4", "D5", "D6", "D7"],
    },
    index=[4, 5, 6, 7],
)
df3 = pd.DataFrame(
    {
        "A": ["A8", "A9", "A10", "A11"],
        "B": ["B8", "B9", "B10", "B11"],
        "C": ["C8", "C9", "C10", "C11"],
        "D": ["D8", "D9", "D10", "D11"],
    },
    index=[8, 9, 10, 11],
)
pd.concat([df1, df2, df3], keys=['x', 'y', 'z'])
```

 | | A | B | C | D |
---|---|---|---|---|---|
x | 0 | A0 | B0 | C0 | D0 |
1 | A1 | B1 | C1 | D1 | |
2 | A2 | B2 | C2 | D2 | |
3 | A3 | B3 | C3 | D3 | |
y | 4 | A4 | B4 | C4 | D4 |
5 | A5 | B5 | C5 | D5 | |
6 | A6 | B6 | C6 | D6 | |
7 | A7 | B7 | C7 | D7 | |
z | 8 | A8 | B8 | C8 | D8 |
9 | A9 | B9 | C9 | D9 | |
10 | A10 | B10 | C10 | D10 | |
11 | A11 | B11 | C11 | D11 |
By default `pd.concat()` concatenates vertically. Note that it will check the column names: if the column names don't match, new columns will be created and `nan` values will be assigned.

If you want to concatenate the `DataFrame`s horizontally, you need to add the `axis=1` option. Similarly, the row indexes will be checked before concatenating. See the following example.
Example 4.8
```python
pd.concat([df1, df2, df3], axis=1)
```

 | A | B | C | D | A | B | C | D | A | B | C | D |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | A0 | B0 | C0 | D0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | A1 | B1 | C1 | D1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | A2 | B2 | C2 | D2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | A3 | B3 | C3 | D3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | NaN | NaN | NaN | NaN | A4 | B4 | C4 | D4 | NaN | NaN | NaN | NaN |
5 | NaN | NaN | NaN | NaN | A5 | B5 | C5 | D5 | NaN | NaN | NaN | NaN |
6 | NaN | NaN | NaN | NaN | A6 | B6 | C6 | D6 | NaN | NaN | NaN | NaN |
7 | NaN | NaN | NaN | NaN | A7 | B7 | C7 | D7 | NaN | NaN | NaN | NaN |
8 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | A8 | B8 | C8 | D8 |
9 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | A9 | B9 | C9 | D9 |
10 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | A10 | B10 | C10 | D10 |
11 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | A11 | B11 | C11 | D11 |
Example 4.9 Consider the deck example from Example 4.7. This time we would like to use `pd.concat()` to get the result.

```python
suitlist = ['spades', 'hearts', 'clubs', 'diamonds']
facelist = list(range(1, 14))
decklist = [pd.DataFrame({'suit': suit, 'face': facelist}) for suit in suitlist]
deck = pd.concat(decklist, ignore_index=True)
```
4.5 Data Aggregation and Group Operations
4.5.1 split-apply-combine model
We would like to apply group operations based on the split-apply-combine model.

- In the first stage of the process, data contained in a pandas object is split into groups based on one or more keys that you provide. We use `.groupby(keys)` to perform the split step. The result is a groupby object.
- Once this is done, a function is applied to each group, producing a new value.
- Finally, the results of all those function applications are combined into a result object. We may apply groupby functions directly as methods to groupby objects; the result is the combined result object.
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})
df
```

 | key1 | key2 | data1 | data2 |
---|---|---|---|---|
0 | a | one | -0.350086 | -0.226207 |
1 | a | two | -0.107462 | 1.464920 |
2 | b | one | -0.245411 | 0.951813 |
3 | b | two | -0.346121 | -1.089006 |
4 | a | one | 0.155076 | -1.744793 |
Now we want to group `data1` in `df` by `key1`.

```python
grouped = df['data1'].groupby(df['key1'])
grouped
```
<pandas.core.groupby.generic.SeriesGroupBy object at 0x00000231956EF4F0>
What we get is a groupby object and we could apply group functions to it.
The method to look at each group is `.get_group()`.

```python
grouped.get_group('a')
```
0 -0.350086
1 -0.107462
4 0.155076
Name: data1, dtype: float64
We may directly apply some group functions to the groupby object.
```python
grouped.mean()
```
key1
a -0.100824
b -0.295766
Name: data1, dtype: float64
```python
grouped.size()
```
key1
a 3
b 2
Name: data1, dtype: int64
We could iterate over groups.
```python
for name, group in grouped:
    print('name', name)
    print('group', group)
```
name a
group 0 -0.350086
1 -0.107462
4 0.155076
Name: data1, dtype: float64
name b
group 2 -0.245411
3 -0.346121
Name: data1, dtype: float64
We could convert the groupby object into a list or a dictionary.

```python
list(grouped)
```
[('a',
0 -0.350086
1 -0.107462
4 0.155076
Name: data1, dtype: float64),
('b',
2 -0.245411
3 -0.346121
Name: data1, dtype: float64)]
```python
dict(list(grouped))
```
{'a': 0 -0.350086
1 -0.107462
4 0.155076
Name: data1, dtype: float64,
'b': 2 -0.245411
3 -0.346121
Name: data1, dtype: float64}
4.5.2 Built-in aggregation functions
The following functions work directly with groupby objects. You may try them by yourselves.

- `.describe()`
- `.count()`
- `.sum()`
- `.mean()`
- `.median()`
- `.std()`, `.var()`
- `.min()`, `.max()`
- `.prod()`
- `.first()`, `.last()`

A quick sketch follows.
4.5.3 Function Application and Mapping
We may apply functions to each row/column of a `DataFrame`. If the function is a built-in function that is compatible with `DataFrame`s, you can call it directly and it will be applied automatically to each row/column. If it is not, we can call `apply` to get the desired result.

`map`

To understand the behaviour of `map`, you may treat it as a loop through a `Series`. pandas goes through each item in the `Series` and performs the operations as instructed. If there is a returned value, it is recorded along the `Series`.
```python
import pandas as pd

ind = pd.Series(['Ohio', 'Colorado', 'New York'])
ind
```
0 Ohio
1 Colorado
2 New York
dtype: object
```python
ind.map(lambda x: x[:4].upper())
```
0 OHIO
1 COLO
2 NEW
dtype: object
In the example we go through each item in `ind`. Each item is a string. We pick its first 4 characters and change them to upper case.

Note that this operation can also be done with string methods. These are two different approaches, but the results are the same.
```python
ind.str[:4].str.upper()
```
0 OHIO
1 COLO
2 NEW
dtype: object
`apply`

`apply` is very similar to `map`, but for a `DataFrame`. The default setting is to go through each column of the `DataFrame`, taking the column as the input. You may use the argument `axis=1` to change it to go through each row. Please see the following example.
Example 4.10
```python
import pandas as pd
import numpy as np

data = pd.DataFrame(np.random.rand(4, 4),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data
```

 | one | two | three | four |
---|---|---|---|---|
Ohio | 0.767534 | 0.772175 | 0.936214 | 0.665185 |
Colorado | 0.239010 | 0.317945 | 0.208979 | 0.893111 |
Utah | 0.111401 | 0.810791 | 0.903352 | 0.383868 |
New York | 0.717844 | 0.674613 | 0.558172 | 0.293760 |
```python
f = lambda x: x.max() - x.min()

data.apply(f)
```
one 0.656133
two 0.492846
three 0.727235
four 0.599351
dtype: float64
Change `axis` to find the range of each row.

```python
data.apply(f, axis=1)
```
Ohio 0.271028
Colorado 0.684133
Utah 0.791952
New York 0.424085
dtype: float64
We can use a more complicated function to get a more complicated result.
Example 4.11
```python
data = pd.DataFrame(np.random.rand(4, 4),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

f = lambda x: pd.Series([x.max(), x.min()], index=['max', 'min'])

data.apply(f)
```

 | one | two | three | four |
---|---|---|---|---|
max | 0.877111 | 0.458147 | 0.818885 | 0.939261 |
min | 0.422852 | 0.060961 | 0.132593 | 0.287785 |
4.5.4 Some examples
Example 4.12 Consider the following DataFrame.
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'location': ['East', 'East', 'East', 'East',
                                'West', 'West', 'West', 'West'],
                   'data': np.random.randn(8)},
                  index=['Ohio', 'New York', 'Vermont', 'Florida',
                         'Oregon', 'Nevada', 'California', 'Idaho'])
df.loc[['Vermont', 'Nevada', 'Idaho'], 'data'] = np.nan
df
```

 | location | data |
---|---|---|
Ohio | East | -1.373885 |
New York | East | -1.822774 |
Vermont | East | NaN |
Florida | East | -1.324571 |
Oregon | West | 0.075709 |
Nevada | West | NaN |
California | West | -1.382985 |
Idaho | West | NaN |
We would like to fill in the NA values with the mean of each `location` group.

```python
df.groupby('location', group_keys=False).apply(lambda x: x.fillna(x.mean()))
```

 | location | data |
---|---|---|
Ohio | East | -1.373885 |
New York | East | -1.822774 |
Vermont | East | -1.507077 |
Florida | East | -1.324571 |
Oregon | West | 0.075709 |
Nevada | West | -0.653638 |
California | West | -1.382985 |
Idaho | West | -0.653638 |
The argument `group_keys=False` controls whether you want the group keys to be included in the result's index. If it is `True`, the result looks like this.

```python
df.groupby('location', group_keys=True).apply(lambda x: x.fillna(x.mean()))
```

 | | location | data |
---|---|---|---|
location | | | |
East | Ohio | East | -1.373885 |
New York | East | -1.822774 | |
Vermont | East | -1.507077 | |
Florida | East | -1.324571 | |
West | Oregon | West | 0.075709 |
Nevada | West | -0.653638 | |
California | West | -1.382985 | |
Idaho | West | -0.653638 |
We could also fill in the NA values with predefined values, similar to the non-groupby case.

```python
predefined = {'East': 0.1, 'West': -0.5}
df.groupby('location', group_keys=True).apply(lambda x: x.fillna(predefined[x.name]))
```

 | | location | data |
---|---|---|---|
location | | | |
East | Ohio | East | -1.373885 |
New York | East | -1.822774 | |
Vermont | East | 0.100000 | |
Florida | East | -1.324571 | |
West | Oregon | West | 0.075709 |
Nevada | West | -0.500000 | |
California | West | -1.382985 | |
Idaho | West | -0.500000 |
You may chain commands on a `DataFrame`, just like the examples shown above. If the commands are too long:

- a pair of parentheses `()` has to be used to indicate that this is a multiline command, and
- each line is broken before the `.` symbol.

Please see the following example.

```python
(df.groupby('location', group_keys=False)
   .apply(lambda x: x.fillna(predefined[x.name]))
   .reset_index()
   .groupby('location')
   .max()
)
```

 | index | data |
---|---|---|
location | | |
East | Vermont | 0.100000 |
West | Oregon | 0.075709 |
4.6 Read and write files
4.6.1 Read files
In most cases we will read data from a csv file or an Excel file.

csv files

A csv file is a plain text file with a fixed format. It consists of rows and columns. Rows are separated by the newline symbol, which is usually `\n`. Columns are separated by a separator. Common separators include empty spaces, commas `,`, semicolons `;` and tab spaces `\t`. There might be other special separators, depending on the creators of the specific csv files.

In pandas, you may use the `pd.read_csv()` function to read a csv file.
- The argument `sep` is used to set the separator. The default is `,`.
- The argument `names` is used to set the column names. Otherwise the column names will be generated automatically and are highly unlikely to be directly usable.
- The argument `header` chooses the header row; only the lines after it are parsed. If there is no header, you may set `header=None`.
- The argument `index_col` is used to set the index column(s). If it is `False`, the index will be automatically generated starting from `0`. If it is set to a list of columns, the result will be a multi-index system.

You may read the documentation for more arguments.
Please see the following example.
Example 4.13 The file is `yob1880.txt`. It is from the US Baby Names dataset, and provides the counts of US baby names born in 1880. You may use any text editor to open the file. The first few rows look like the following:
Mary,F,7065
Anna,F,2604
Emma,F,2003
Elizabeth,F,1939
Minnie,F,1746
Margaret,F,1578
It seems that `sep` is the default `,`, so you may directly read it into a `DataFrame` by `pd.read_csv()`.

```python
import pandas as pd

df = pd.read_csv('assests/datasets/yob1880.txt')
df.head()
```

 | Mary | F | 7065 |
---|---|---|---|
0 | Anna | F | 2604 |
1 | Emma | F | 2003 |
2 | Elizabeth | F | 1939 |
3 | Minnie | F | 1746 |
4 | Margaret | F | 1578 |
Please look at the header of the `DataFrame`: it is supposed to be the first row of data. Therefore there is no header in the original file, and the correct way to read the file is

```python
import pandas as pd

df = pd.read_csv('assests/datasets/yob1880.txt',
                 header=None,
                 names=['Name', 'Sex', 'Counts'])
df.head()
```

 | Name | Sex | Counts |
---|---|---|---|
0 | Mary | F | 7065 |
1 | Anna | F | 2604 |
2 | Emma | F | 2003 |
3 | Elizabeth | F | 1939 |
4 | Minnie | F | 1746 |
pandas provides the `pd.read_excel()` function to read Excel files. Since Excel files are much more complicated than csv files, more settings are required. One of the most important settings is the engine: pandas needs you to specify a way (an engine) to understand Excel files. For the newer Excel format `.xlsx`, it is recommended to use the engine openpyxl.

If you don't have openpyxl installed, you may use the following command to install it.

```
pip install openpyxl
```
Many options, like `header`, `names` and `index_col`, are very similar to those of `pd.read_csv()`. Some additional remarks:

- There is no `sep` argument since columns are not separated based on separators.
- The argument `sheet_name` is used to choose which sheet(s) you want to read.
- The argument `nrows` is used to set the number of rows to parse.

You may read the documentation for more arguments.
Example 4.14 The file can be downloaded from here. This is the result of the pre-post test of a class for the course COURSE1001. You may first use Microsoft Office or another spreadsheet application to open the file to get some idea of what it looks like.

Here is a screenshot of the first few columns. `Last` and `First` refer to the last name and the first name of the student, while `Last0X` and `First0X` are students' fake names.

Note that this file contains two `DataFrame`s.

- The first is the result of the pretest, which is from row 3 to row 11, with the header in row 2.
- The second is the result of the posttest, which is from row 15 to row 23, with the header in row 14.

To read the file, the code is as follows:
```python
import pandas as pd

df_pre = pd.read_excel('assests/datasets/prepost.xlsx',
                       engine='openpyxl',
                       header=2,
                       nrows=10)
df_pre
```

 | Last | First | T# | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Unnamed: 13 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Last01 | First01 | T01 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | NaN |
1 | Last02 | First02 | T02 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | NaN |
2 | Last03 | First03 | T03 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
3 | Last04 | First04 | T04 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | NaN |
4 | Last05 | First05 | T05 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN |
5 | Last06 | First06 | T06 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | NaN |
6 | Last07 | First07 | T07 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN |
7 | Last08 | First08 | T08 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN |
8 | Last09 | First09 | T09 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | NaN |
```python
df_post = pd.read_excel('assests/datasets/prepost.xlsx',
                        engine='openpyxl',
                        header=14,
                        nrows=10)
df_post
```

 | Last | First | T# | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Unnamed: 13 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Last01 | First01 | T01 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | NaN |
1 | Last02 | First02 | T02 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | NaN |
2 | Last03 | First03 | T03 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | NaN |
3 | Last04 | First04 | T04 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | NaN |
4 | Last05 | First05 | T05 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | NaN |
5 | Last06 | First06 | T06 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | NaN |
6 | Last07 | First07 | T07 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | NaN |
7 | Last08 | First08 | T08 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | NaN |
8 | Last09 | First09 | T09 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | NaN |
It seems that the original file has an additional column `Unnamed: 13` containing `nan` values that should be dropped. In that case it is not necessary to read it from the original file: we could use the argument `usecols` to select the first 13 columns. We only show the example for the pretest result.
```python
df_pre = pd.read_excel('assests/datasets/prepost.xlsx',
                       engine='openpyxl',
                       header=2,
                       nrows=10,
                       usecols=list(range(13)))
df_pre
```

 | Last | First | T# | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Last01 | First01 | T01 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
1 | Last02 | First02 | T02 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
2 | Last03 | First03 | T03 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | Last04 | First04 | T04 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | Last05 | First05 | T05 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 | Last06 | First06 | T06 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
6 | Last07 | First07 | T07 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
7 | Last08 | First08 | T08 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
8 | Last09 | First09 | T09 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
4.6.2 Write files
We will only talk about writing to csv files. The function is `df.to_csv()`, and it is straightforward.

- The argument `index` controls whether you want to write the index into the file. The default is `True`. If the index doesn't contain any real information, we usually set it to `False`.
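A minimal sketch (the file name is made up):

```python
# Write df_pre to a csv file, without the auto-generated integer index.
df_pre.to_csv('prepost_pretest.csv', index=False)
```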
4.7 Example: Movies
Below we explore the MovieLens 1M dataset. You may download it from this link. This is a `.dat` file, and you may use the following code to read it into a `DataFrame`.

```python
import pandas as pd

mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('assests/datasets/movies.dat', sep='::',
                       header=None, names=mnames, engine="python",
                       encoding='ISO-8859-1')
movies.head()
```

 | movie_id | title | genres |
---|---|---|---|
0 | 1 | Toy Story (1995) | Animation|Children's|Comedy |
1 | 2 | Jumanji (1995) | Adventure|Children's|Fantasy |
2 | 3 | Grumpier Old Men (1995) | Comedy|Romance |
3 | 4 | Waiting to Exhale (1995) | Comedy|Drama |
4 | 5 | Father of the Bride Part II (1995) | Comedy |
In this example we concentrate on exploring the genres information. We first want to find all the genres in this dataset. The idea is:

- split each item in the `genres` column by `|` to get a list;
- go through each item in the `genres` column, and union all the lists together.

This can be done with the `map` function.

```python
all_genres = list()
movies['genres'].map(lambda x: all_genres.extend(x.split('|')))
```
0 None
1 None
2 None
3 None
4 None
...
3878 None
3879 None
3880 None
3881 None
3882 None
Name: genres, Length: 3883, dtype: object
`all_genres` is the list of all genres (with duplicates).

In the output of the above code you may see `None` in each row. This is because the lambda function used in `map` doesn't have a return value. However, as the function is applied to each row, the genre information is appended to the list `all_genres` as a side effect.
Then we drop all the duplicates to get the list of all unique genres.

```python
genres = pd.unique(all_genres)
genres
```
array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
'Western'], dtype=object)
`genres` is the array of all unique genres.
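As an aside, a more idiomatic sketch of the same computation chains string methods instead of mutating a list inside `map`; `.explode()` is available in pandas 0.25+.

```python
# Split each genre string into a list, flatten the lists into rows,
# then take the unique values.
movies['genres'].str.split('|').explode().unique()
```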
4.8 Exercises
Many problems are based on [1].
Exercise 4.1 Let `df` be a `DataFrame`. Please answer the following questions in a Markdown cell.

- What does `df[0]` do?
- What does `df[[0]]` do?
- What does `df[0:1]` do?
Exercise 4.2 Please use the following code to generate a series `ser`, and then finish the following tasks.

```python
import pandas as pd
import numpy as np

mylist = list('abcedfghijklmnopqrstuvwxyz')
myarr = np.arange(26)
mydict = dict(zip(mylist, myarr))
ser = pd.Series(mydict)
```
- Convert the series `ser` into a dataframe `df` with its index as another column of the dataframe.
- Pick the two columns of `df` and set them into two series `ser1` and `ser2`.
- Combine the two series `ser1` and `ser2` to form a new dataframe `newdf`, and name its columns `ser1` and `ser2`.
Exercise 4.3 Consider two series `ser1` and `ser2`. You may use the following `ser1` and `ser2` as an example. The output of each question below should be a series.

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])
```
- Find all the elements of `ser1` that are also in `ser2`.
- Find all the elements of `ser2` that are also in `ser1`.
- From `ser1`, remove the items present in `ser2`.
- Find the union of `ser1` and `ser2`.
- Find the intersection of `ser1` and `ser2`.
- Find all the elements that are in either `ser1` or `ser2`, but not both.
Exercise 4.4 Consider the following `DataFrame`.

```python
import pandas as pd
import numpy as np

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
```
- Please select the column `two`.
- Please select the second and the third rows.
- Please find the rows where the column `three` value is bigger than `5`.
- Please find the last row where the column `three` value is bigger than `5`.
- Please find the rows where the column `three` value is bigger than `5`, and display the resulting `DataFrame` with only the `Colorado` and `Utah` rows and the `four` and `one` columns, in the specified order.
Exercise 4.5 Consider the following `Series`.

```python
import pandas as pd
import numpy as np

ser = pd.Series(np.take(list('abcdefgh'), np.random.randint(8, size=30)))
```
- Please use `pd.Series.value_counts()` to calculate the frequency counts of each unique value of `ser`.
- Please keep the top 2 most frequent items of `ser` as they are and replace everything else with `Other`.
Exercise 4.6 Consider the series `ser`:

```python
import pandas as pd
import numpy as np

ser = pd.Series(np.random.randint(1, 10, 7))
```
Find the positions of the numbers that are multiples of 3 in `ser`.
Exercise 4.7 Compute the mean of `weights` for each `fruit`.

```python
import pandas as pd
import numpy as np

fruit = pd.Series(np.random.choice(['apple', 'banana', 'carrot'], 10))
weights = pd.Series(np.linspace(1, 10, 10))
df = pd.DataFrame({'fruit': fruit, 'weights': weights})
```
Exercise 4.8 Consider the following DataFrame.

```python
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
```
- Check if `df` has any missing values.
- Please count the number of missing values in each column.
- Please replace all missing values in `Min.Price` and `Max.Price` with their respective means.
Exercise 4.9 Get the last two rows of `df` whose row sum is greater than 100.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(10, 40, 60).reshape(-1, 4))
```
Exercise 4.10 The groupby object `df_grouped` is given below.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'] * 3,
                   'price': np.random.rand(9),
                   'taste': np.random.randint(0, 11, 9)})
df_grouped = df.groupby(['fruit'])
```
- Get the group belonging to `apple` as a DataFrame.
- Find the second largest value of `taste` for `banana`.
- Compute the mean `price` for every `fruit`.
Exercise 4.11 Join `df1` and `df2` by `fruit`/`pazham` and `weight`/`kilo`.

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'] * 3,
                    'weight': ['high', 'medium', 'low'] * 3,
                    'price': np.random.randint(0, 15, 9)})
df2 = pd.DataFrame({'pazham': ['apple', 'orange', 'pine'] * 2,
                    'kilo': ['high', 'low'] * 3,
                    'price': np.random.randint(0, 15, 6)})
```
Exercise 4.12 Consider the following DataFrame.

```python
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', usecols=[0, 1, 2, 3, 5])
```
- Replace `NaN` with the string `missing` in the columns `Manufacturer`, `Model` and `Type`.
- Create an index as a combination of these three columns.
Exercise 4.13 Given the following DataFrame.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['James', 'Jane', 'Melissa', 'Ed', 'Neil'],
    'age': [30, 40, 32, 67, 43],
    'score': ['90%', '95%', '100%', '82%', '87%'],
    'age_missing_data': [30, 40, 32, 67, None],
    'income': [100000, 80000, 55000, 62000, 120000]
})
```
- Please use `.map` to create a new column `numeric_score` whose values are the numeric versions of `score`.
- Please use `.apply` to create a new column `numeric_score` whose values are the numeric versions of `score`.
Exercise 4.14 The following DataFrame is given.

```python
import pandas as pd

df = pd.DataFrame(["STD, City State",
                   "33, Kolkata West Bengal",
                   "44, Chennai Tamil Nadu",
                   "40, Hyderabad Telengana",
                   "80, Bangalore Karnataka"],
                  columns=['row'])
```

Please create a new DataFrame out of `df` by splitting each row into three columns based on `,` and four spaces. In addition, the column names of the new DataFrame are given by the first row of `df`.