4  Package: pandas

The basic data structure for pandas is pandas.DataFrame. You may treat it as a generalized version of a table.

To use pandas, we just import it. In most cases you would like to use the alias pd.

import pandas as pd

Since a DataFrame is more like a table, the biggest question here is not how to do computations (which are still very important), but how to retrieve, search, sort, and merge the data.

4.1 Basic pandas

4.1.1 Series and DataFrame

A Series is a 1-d array-like object with an index. The default index starts from 0. You may change the index to labels you assign, so a Series can be treated as a generalization of a dict.

obj = pd.Series([3, 1, 2, 4])
obj
0    3
1    1
2    2
3    4
dtype: int64
obj2 = pd.Series([3, 1, 2, 4], index=['a', 'b', 'c', 'd'])
obj2
a    3
b    1
c    2
d    4
dtype: int64
data3 = {'a': 3, 'b': 1, 'c': 2, 'd': 4}
obj3 = pd.Series(data3)
obj3
a    3
b    1
c    2
d    4
dtype: int64

A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type. The DataFrame has both a row and a column index; it can be thought of as a dict of Series all sharing the same index. When displaying a DataFrame, we may use .head() to display only the first few rows for brevity.

import pandas as pd

data = {'a': [1, 2, 3, 4, 5, 6, 7],
        'b': [1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.1],
        'c': ['a', 'b', 'c', 'd', 'e', 'f', 'g']}
df = pd.DataFrame(data)
df.head()
a b c
0 1 1.1 a
1 2 2.1 b
2 3 3.1 c
3 4 4.1 d
4 5 5.1 e
Note

We may use the arguments columns= and index=, as well as the method .rename(columns=, index=), to change the column names and the index names. See the following example.

import numpy as np
import pandas as pd
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
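
The .rename() method works similarly; for example, the following returns a renamed copy:

data.rename(columns={'one': 'I'}, index={'Ohio': 'OH'})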

4.1.2 Accessing data

  • A column in a DataFrame can be retrieved as a Series either by dict-like notation or by attribute.
    • dict-like notation: df['a']
    • by attribute: df.a. Note that if the name of the column is not a valid attribute name, this method doesn’t work.
  • Rows are retrieved by .loc if using the row index, and by .iloc if using the row number. A minimal sketch of these follows.
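
Here is a small sketch of the four ways of access, using a made-up DataFrame df.

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [1.1, 2.1, 3.1]},
                  index=['x', 'y', 'z'])

df['a']       # column by dict-like notation
df.a          # the same column by attribute
df.loc['y']   # row by index label
df.iloc[1]    # the same row by row number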

4.1.3 Updating data

  • Assigning values to a column of a DataFrame updates that column. If the column doesn’t exist, a new column is created.
  • When assigning values with a non-existent row index, that part of the data is discarded.
  • Any position that has no value for its row and column shows as NaN.

Example 4.1  

import pandas as pd

data = {'a': [1, 2, 3, 4],
        'b': [1.1, 2.1, 3.1, 4.1],
        'c': ['a', 'b', 'c', 'd']}
df = pd.DataFrame(data)

newcol = {1: 'good', 3: 'better', 5: 'best'}
df['d'] = pd.Series(newcol)
df
a b c d
0 1 1.1 a NaN
1 2 2.1 b good
2 3 3.1 c NaN
3 4 4.1 d better

4.1.4 Indexing, Selection, and Filtering

  • Series indexing (obj[...]) works analogously to NumPy array indexing, except you can use the Series’s index values instead of only integers.
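
A short sketch of these indexing styles:

import pandas as pd
import numpy as np

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj['b']         # by index label
obj[['a', 'c']]  # by a list of labels
obj[obj > 1]     # boolean filtering, as in NumPy
obj.iloc[1]      # by integer position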

  • We can use logical expressions to filter a DataFrame.

import pandas as pd
import numpy as np

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data[data['one']>5]
one two three four
Utah 8 9 10 11
New York 12 13 14 15
  • .loc, .iloc
import pandas as pd
import numpy as np
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
print(data.loc['Colorado', ['two', 'three']])
print(data.iloc[2, [3, 0, 1]])
two      5
three    6
Name: Colorado, dtype: int32
four    11
one      8
two      9
Name: Utah, dtype: int32
  • Slicing with labels behaves differently than normal Python slicing in that the endpoint is inclusive.
import pandas as pd
import numpy as np

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj['b':'c']
b    1.0
c    2.0
dtype: float64
  • Reindex .reindex():
import pandas as pd
import numpy as np
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

data.reindex(index = ['Colorado', 'Arkansas', 'New York'],
             columns = ['three', 'five', 'one'])
three five one
Colorado 6.0 NaN 4.0
Arkansas NaN NaN NaN
New York 14.0 NaN 12.0
Note

.loc and .reindex are very similar to each other. The main difference between these two is that .loc returns a view while .reindex returns a copy in most cases.

Note

When locating data using indexes, duplicate labels return all matching results.
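
For example, the following Series has the label a twice, and obj['a'] returns both rows.

import pandas as pd

obj = pd.Series(range(3), index=['a', 'a', 'b'])
obj['a']
a    0
a    1
dtype: int64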

4.1.5 Essential functions

  • Arithmetic and Data Alignment: entries sharing the same index and column are combined. By default, if either entry is NaN, the result is NaN. You may use the fill_value argument to fill the empty slots.

Example 4.2  

import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))
df2.loc[1, 'b'] = np.nan

df1.add(df2, fill_value=0)
a b c d e
0 0.0 2.0 4.0 6.0 4.0
1 9.0 5.0 13.0 15.0 9.0
2 18.0 20.0 22.0 24.0 14.0
3 15.0 16.0 17.0 18.0 19.0

Relatedly, when reindexing a Series or DataFrame, you can also specify a fill_value.
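
For example:

import pandas as pd

obj = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
obj.reindex(['a', 'b', 'c', 'd'], fill_value=0)
a    1
b    2
c    3
d    0
dtype: int64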

4.1.6 Function Application and Mapping

We may apply functions to each row/column of a DataFrame. If the function is a built-in function compatible with DataFrames, you can call it directly and it will be applied automatically to each row/column. If it is not, we can call .apply to get the desired result.

Example 4.3  

import pandas as pd
import numpy as np
data = pd.DataFrame(np.random.rand(4, 4),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

f = lambda x: x.max() - x.min()

print(data.apply(f))
print(data.apply(f, axis='columns'))
one      0.189348
two      0.700152
three    0.330421
four     0.766805
dtype: float64
Ohio        0.719334
Colorado    0.576251
Utah        0.236155
New York    0.364350
dtype: float64

We can use a more complicated function to get a more complicated result.

Example 4.4  

import pandas as pd
import numpy as np
data = pd.DataFrame(np.random.rand(4, 4),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

f = lambda x: pd.Series([x.max(), x.min()], index=['max', 'min'])

print(data.apply(f))
          one       two     three      four
max  0.868566  0.872288  0.938442  0.697079
min  0.248375  0.022153  0.138255  0.134982

4.1.7 Sorting and Ranking

  • .sort_values(by=)
  • .rank(ascending=, method=)
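
A minimal sketch of these two methods, using a made-up DataFrame:

import pandas as pd

df = pd.DataFrame({'a': [3, 1, 2], 'b': [9, 7, 8]})
df.sort_values(by='a')                      # sort rows by column a
df.sort_values(by='a', ascending=False)     # descending order
df['a'].rank(ascending=True, method='min')  # rank entries; ties share the smallest rank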

4.1.8 Summarizing and Computing Descriptive Statistics

  • .sum(), .cumsum()
  • .mean(), .median()
  • .describe()
  • .cov(), .corr()
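
A quick sketch of these methods on a small DataFrame:

import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [2.0, 4.0, 8.0]})
df.sum()       # column sums
df.cumsum()    # cumulative sums down each column
df.mean()      # column means
df.median()    # column medians
df.describe()  # count, mean, std, min, quartiles, max
df.cov()       # covariance matrix of the columns
df.corr()      # correlation matrix of the columns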

4.1.9 Unique Values, Value Counts, and Membership

  • .unique()
  • .value_counts()
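
For example:

import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b'])
obj.unique()        # distinct values, in order of appearance
obj.value_counts()  # frequency of each value, sorted descending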

4.1.10 Reading and Writing Data in Text Format

  • pd.read_csv()
  • pd.read_excel()
  • df.to_csv()
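
A sketch of the basic calls; the file paths below are hypothetical.

import pandas as pd

df = pd.read_csv('examples/data.csv')       # read a comma-separated file
# df = pd.read_excel('examples/data.xlsx')  # Excel files work similarly
df.to_csv('examples/out.csv', index=False)  # write without the row index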

4.1.11 Copies and views

  • inplace
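
Most DataFrame methods return a new object and leave the original untouched; passing inplace=True modifies the object itself and returns None. A minimal sketch:

import pandas as pd

df = pd.DataFrame({'a': [3, 1, 2]})
df2 = df.sort_values(by='a')          # a sorted copy; df is unchanged
df.sort_values(by='a', inplace=True)  # modifies df in place, returns None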

4.2 Data cleaning

4.2.1 Handling Missing Data

  • np.nan, pd.NA
  • pd.isnull(), np.isnan()
  • dropna, fillna

Example 4.5  

import pandas as pd
import numpy as np

data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan], 
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])
cleaned = data.dropna()
cleanedrow = data.dropna(how='all')
data
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
cleaned
0 1 2
0 1.0 6.5 3.0
cleanedrow
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
3 NaN 6.5 3.0
data[4] = np.nan
cleaned1 = data.dropna(axis=1, how='all')
cleanedthresh = data.dropna(thresh=2)
data
0 1 2 4
0 1.0 6.5 3.0 NaN
1 1.0 NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN 6.5 3.0 NaN
cleaned1
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
cleanedthresh
0 1 2 4
0 1.0 6.5 3.0 NaN
3 NaN 6.5 3.0 NaN
fill0 = data.fillna(0)
filldict = data.fillna({1: 0.5, 2: -0.1})
data
0 1 2 4
0 1.0 6.5 3.0 NaN
1 1.0 NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN 6.5 3.0 NaN
fill0
0 1 2 4
0 1.0 6.5 3.0 0.0
1 1.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 0.0 6.5 3.0 0.0
filldict
0 1 2 4
0 1.0 6.5 3.0 NaN
1 1.0 0.5 -0.1 NaN
2 NaN 0.5 -0.1 NaN
3 NaN 6.5 3.0 NaN

4.2.2 Data Transformation

  • .duplicated(), drop_duplicates()

Example 4.6  

import numpy as np
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'], 
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data.drop_duplicates(['k1'], keep='last')
k1 k2
4 one 3
6 two 4
  • pd.Series.map(), pd.DataFrame.apply()

Example 4.7  

import pandas as pd
import numpy as np

data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                     'Pastrami', 'corned beef', 'Bacon',
                     'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon'
    }

data['animal'] = data['food'].str.lower().map(meat_to_animal)

data['food'].map(lambda x: meat_to_animal[x.lower()])
0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object
  • replace
  • rename
  • describe
  • permutation
  • sample
  • dummy variables
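
A quick sketch of these tools, using made-up data:

import pandas as pd
import numpy as np

df = pd.DataFrame({'k': ['a', 'b', 'c', 'a'], 'v': [1, -999, 3, 4]})
df['v'].replace(-999, np.nan)            # replace a sentinel value with NaN
df.rename(columns={'v': 'value'})        # rename a column
df.describe()                            # summary statistics of numeric columns
df.iloc[np.random.permutation(len(df))]  # shuffle the rows
df.sample(2)                             # random sample of 2 rows
pd.get_dummies(df['k'])                  # dummy/indicator variables from a column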

4.2.3 Example: Movies

Below we explore the MovieLens 1M dataset. You may download it from this link.

import pandas as pd
import numpy as np
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('assests/datasets/movies.dat', sep='::',
                       header=None, names=mnames, engine="python",
                       encoding='ISO-8859-1')

# Collect every genre that appears in the dataset.
all_genres = list()
movies['genres'].map(lambda x: all_genres.extend(x.split('|')))

genres = pd.unique(all_genres)

# Build an all-zero indicator table, one column per genre.
dummies = pd.DataFrame(np.zeros((len(movies), len(genres))), columns=genres)

# For each movie, set the columns of its genres to 1.
for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1

movies_windic = movies.join(dummies.add_prefix('Genre_'))

4.2.4 String Manipulation

The key idea in this section is that all methods under pd.Series.str are applied to each entry of the Series.

Example 4.8  

import pandas as pd
import numpy as np
s = pd.Series(["A ", " B ", "C", "Aaba", " Baca ", np.nan, "CABA", "dog", "cat"])

s.str.lower()
s.str.split('a')
s.str.len()
s.str.strip()
s.str.replace("A", '1')
0        1 
1        B 
2         C
3      1aba
4     Baca 
5       NaN
6      C1B1
7       dog
8       cat
dtype: object

Example 4.9 We could also use .str to play with column names and row indexes.

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3, 2),
                  columns=[" Column A ", " Column B "], index=range(3))

df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
df
column_a column_b
0 1.375395 0.738538
1 0.213217 -0.128138
2 -1.410801 -0.002241

4.2.5 Regular expression

Regular expressions provide a flexible way to search or match string patterns in text. A single expression, commonly called a regex, is a string formed according to the regular expression language. Python’s built-in re module is responsible for applying regular expressions to strings.

For details of the regular expression language in Python, please read the official documents from here. There are also many great websites for learning regex. This is one example.

We will briefly mention a few rules here.

  • .: matches any character except a newline.
  • \d: matches any digit. It is the same as [0-9].
  • \w: matches any alphanumeric character or underscore. It is the same as [a-zA-Z0-9_].
  • \s: matches any whitespace character. It is the same as [ \t\n\r\f\v].
  • *: Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible.
  • +: Causes the resulting RE to match 1 or more repetitions of the preceding RE, as many repetitions as are possible.
  • ?: Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
  • *?, +?, ??: The *, +, and ? qualifiers are all greedy; they match as much text as possible. Adding ? after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched.
  • {m}: Specifies that exactly m copies of the previous RE should be matched.
  • {m,n}: Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible.
  • {m,n}?: Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as few repetitions as possible.
  • []: Used to indicate a set of characters.
  • (): sets a group.

Example 4.10  

import re
text = "foo bar\t baz \tqux"
pattern = r'\s+'
regex = re.compile(pattern)
regex.split(text)
['foo', 'bar', 'baz', 'qux']
  • .match()
  • .search()
  • .findall()
  • .split()
  • .sub()
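
A short sketch contrasting these methods:

import re

regex = re.compile(r'ba\w')
text = 'foo bar baz'
regex.match(text)     # None: match only checks the beginning of the string
regex.search(text)    # first occurrence: 'bar'
regex.findall(text)   # all occurrences: ['bar', 'baz']
regex.split(text)     # pieces between occurrences: ['foo ', ' ', '']
regex.sub('X', text)  # 'foo X X'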

We can use () to specify groups, and use .groups() to get access to the results.

Example 4.11  

import re
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)
m = regex.match('wesm@bright.net')
m.groups()
('wesm', 'bright', 'net')

To apply regex to a DataFrame or Series, you may directly use .match, .findall, .replace after .str, with the regex pattern as one of the arguments.

.extract is a method that does not come from re. It extracts the matched groups and organizes them into a DataFrame.

Example 4.12  

import pandas as pd
import numpy as np
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('assests/datasets/movies.dat', sep='::',
                       header=None, names=mnames, engine="python",
                       encoding='ISO-8859-1')

pattern = r'([a-zA-Z0-9_\s,.?:;\']+)\((\d{4})\)'
movies = movies.join(movies.title.str.extract(pattern).rename(columns={0: 'movie title', 1: 'year'}))

4.3 Data Wrangling

4.3.1 Hierarchical indexing

pandas supports a more complex indexing system in which the index may have multiple levels. See the following example.

import pandas as pd
import numpy as np

data = pd.Series(np.random.randn(9),
                 index = [['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                          [1, 2, 3, 1, 2, 3, 1, 2, 3]])
data
a  1   -0.802479
   2    0.759039
   3   -1.171718
b  1   -0.210611
   2    0.321742
c  3   -1.160436
   1    0.516246
d  2    0.397977
   3    0.062464
dtype: float64

You may look at the Series using different levels of indexes.

data['a']
1   -0.802479
2    0.759039
3   -1.171718
dtype: float64
data.loc[:, 2]
a    0.759039
b    0.321742
d    0.397977
dtype: float64

You may use groupby to group by index levels and do calculations related to levels. More about .groupby() will be discussed in the next section.

data.groupby(level=1).sum()
1   -0.496844
2    1.478758
3   -2.269690
dtype: float64

From the example above, you may notice that 2-level hierarchical indexing for a Series works very similarly to a DataFrame. In fact, you may translate back and forth between a Series with a 2-level index and a DataFrame.

df = data.unstack()
df
1 2 3
a -0.802479 0.759039 -1.171718
b -0.210611 0.321742 NaN
c 0.516246 NaN -1.160436
d NaN 0.397977 0.062464
df.stack()
a  1   -0.802479
   2    0.759039
   3   -1.171718
b  1   -0.210611
   2    0.321742
c  1    0.516246
   3   -1.160436
d  2    0.397977
   3    0.062464
dtype: float64

For a DataFrame, the index along both axes can be a MultiIndex. The usual indexing can be used if you want to start from the first level of the index. The more specific method to extract data is .xs.

Example 4.13  

import pandas as pd

df1 = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    },
    index=[0, 1, 2, 3],
)

df2 = pd.DataFrame(
    {
        "A": ["A4", "A5", "A6", "A7"],
        "B": ["B4", "B5", "B6", "B7"],
        "C": ["C4", "C5", "C6", "C7"],
        "D": ["D4", "D5", "D6", "D7"],
    },
    index=[4, 5, 6, 7],
)

df = pd.concat([df1, df2], keys=['x', 'y'])
df
A B C D
x 0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
y 4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
df['A']
x  0    A0
   1    A1
   2    A2
   3    A3
y  4    A4
   5    A5
   6    A6
   7    A7
Name: A, dtype: object
df.loc['x']
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
df.loc['x',3]
A    A3
B    B3
C    C3
D    D3
Name: (x, 3), dtype: object
df.xs(3, level=1, drop_level=False)
A B C D
x 3 A3 B3 C3 D3

4.3.2 Combining and Merging Datasets

4.3.2.1 merge()

merge() combines datasets by linking rows using one or more keys. This operation comes from relational databases (e.g., SQL-based).

Here are some examples.

Example 4.14  

import pandas as pd
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})

The two DataFrames are displayed as follows.

df1
key data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 a 5
6 b 6
df2
key data2
0 a 0
1 b 1
2 d 2
pd.merge(df1, df2, on='key')
key data1 data2
0 b 0 1
1 b 1 1
2 b 6 1
3 a 2 0
4 a 4 0
5 a 5 0

If the column names are different in each object, you can specify them separately.

df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'],
                    'data2': range(3)})
pd.merge(df3, df4, left_on='lkey', right_on='rkey')
lkey data1 rkey data2
0 b 0 b 1
1 b 1 b 1
2 b 6 b 1
3 a 2 a 0
4 a 4 a 0
5 a 5 a 0

By default merge does an inner join: the keys in the result are the intersection of the keys found in both tables. Below are the different types of merges. To specify the type of merge, use the how option.

  • inner
  • left
  • right
  • outer

Let’s see the following examples.

df1 = pd.DataFrame({'Key': [1, 2], 'A': [0, 2], 'B': [1, 3]})
df1
Key A B
0 1 0 1
1 2 2 3
df2 = pd.DataFrame({'Key': [1, 3], 'C': [0, 2], 'D': [1, 3]})
df2
Key C D
0 1 0 1
1 3 2 3
pd.merge(df1, df2, on='Key', how='inner')
Key A B C D
0 1 0 1 0 1
pd.merge(df1, df2, on='Key', how='outer')
Key A B C D
0 1 0.0 1.0 0.0 1.0
1 2 2.0 3.0 NaN NaN
2 3 NaN NaN 2.0 3.0
pd.merge(df1, df2, on='Key', how='left')
Key A B C D
0 1 0 1 0.0 1.0
1 2 2 3 NaN NaN
pd.merge(df1, df2, on='Key', how='right')
Key A B C D
0 1 0.0 1.0 0 1
1 3 NaN NaN 2 3
Note

If a key combination appears more than once in both tables, the resulting table will have the Cartesian product of the associated rows. Here is a basic example.

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                    'data1': range(6)})
df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'],
                    'data2': range(5)})
pd.merge(df1, df2, on='key', how='left')
key data1 data2
0 b 0 1.0
1 b 0 3.0
2 b 1 1.0
3 b 1 3.0
4 a 2 0.0
5 a 2 2.0
6 c 3 NaN
7 a 4 0.0
8 a 4 2.0
9 b 5 1.0
10 b 5 3.0
Note

If the merge keys of a DataFrame are in its index instead of its column(s), we can pass left_index=True or right_index=True (or both) instead of setting left_on/right_on/on, as in the sketch below.
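
A small sketch, matching the key column of the left table against the index of the right table (made-up data):

import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'v1': [1, 2, 3]})
right = pd.DataFrame({'v2': [10, 30]}, index=['a', 'c'])
pd.merge(left, right, left_on='key', right_index=True)
key v1 v2
0 a 1 10
2 c 3 30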

Example 4.15 If we really want to create a Cartesian product, we may use the option how='cross'. For example, to generate a deck of cards, we may use the following code.

import pandas as pd

suit = pd.DataFrame({'suit': ['spades', 'hearts', 'clubs', 'diamonds']})
face = pd.DataFrame({'face': list(range(1, 14))})
deck = pd.merge(suit, face, how='cross')

4.3.2.2 concat()

The concat() function (in the main pandas namespace) performs concatenation operations along an axis while performing optional set logic (union or intersection) of the indexes (if any) on the other axes.

import pandas as pd

df1 = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    },
    index=[0, 1, 2, 3],
)

df2 = pd.DataFrame(
    {
        "A": ["A4", "A5", "A6", "A7"],
        "B": ["B4", "B5", "B6", "B7"],
        "C": ["C4", "C5", "C6", "C7"],
        "D": ["D4", "D5", "D6", "D7"],
    },
    index=[4, 5, 6, 7],
)

df3 = pd.DataFrame(
    {
        "A": ["A8", "A9", "A10", "A11"],
        "B": ["B8", "B9", "B10", "B11"],
        "C": ["C8", "C9", "C10", "C11"],
        "D": ["D8", "D9", "D10", "D11"],
    },
    index=[8, 9, 10, 11],
)

pd.concat([df1, df2, df3], keys=['x', 'y', 'z'])
A B C D
x 0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
y 4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
z 8 A8 B8 C8 D8
9 A9 B9 C9 D9
10 A10 B10 C10 D10
11 A11 B11 C11 D11

pd.concat() concatenates vertically by default. Note that it checks the column names: if the column names don’t match, new columns are created and NaN values are assigned.

If you want to concatenate the DataFrames horizontally you need to add the axis=1 option. Similarly, the row index is checked before concatenating. See the following example.

Example 4.16  

pd.concat([df1, df2, df3], axis=1)
A B C D A B C D A B C D
0 A0 B0 C0 D0 NaN NaN NaN NaN NaN NaN NaN NaN
1 A1 B1 C1 D1 NaN NaN NaN NaN NaN NaN NaN NaN
2 A2 B2 C2 D2 NaN NaN NaN NaN NaN NaN NaN NaN
3 A3 B3 C3 D3 NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN A4 B4 C4 D4 NaN NaN NaN NaN
5 NaN NaN NaN NaN A5 B5 C5 D5 NaN NaN NaN NaN
6 NaN NaN NaN NaN A6 B6 C6 D6 NaN NaN NaN NaN
7 NaN NaN NaN NaN A7 B7 C7 D7 NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN NaN NaN A8 B8 C8 D8
9 NaN NaN NaN NaN NaN NaN NaN NaN A9 B9 C9 D9
10 NaN NaN NaN NaN NaN NaN NaN NaN A10 B10 C10 D10
11 NaN NaN NaN NaN NaN NaN NaN NaN A11 B11 C11 D11

Example 4.17 Consider the deck example from Example 4.15. This time we would like to use pd.concat() to get the result.

import pandas as pd

suitlist = ['spades', 'hearts', 'clubs', 'diamonds']
facelist = list(range(1, 14))
decklist = [pd.DataFrame({'suit': suit, 'face': facelist}) for suit in suitlist]
deck = pd.concat(decklist, ignore_index=True)

4.4 Data Aggregation and Group Operations

4.4.1 split-apply-combine model

We would like to apply group operations based on the split-apply-combine model.

  • In the first stage of the process, data contained in a pandas object is split into groups based on one or more keys that you provide. We use .groupby(keys) to perform the split step. The result is a groupby object.
  • Once this is done, a function is applied to each group, producing a new value.
  • Finally, the results of all those function applications are combined into a result object. We may apply groupby functions directly as methods to groupby objects; the result is the combined result object.

Example 4.18  

import pandas as pd
import numpy as np
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})
df
key1 key2 data1 data2
0 a one 0.873122 0.561476
1 a two -0.426869 -1.433012
2 b one 1.756470 0.352796
3 b two 0.754264 -2.418179
4 a one -0.452852 -0.950064

Now we want to group data1 in df by key1.

grouped = df['data1'].groupby(df['key1'])
grouped
<pandas.core.groupby.generic.SeriesGroupBy object at 0x000002020D272F40>

What we get is a groupby object and we could apply group functions to it.

The method to look at each group is .get_group.

grouped.get_group('a')
0    0.873122
1   -0.426869
4   -0.452852
Name: data1, dtype: float64

We may directly apply some group functions to the groupby object.

grouped.mean()
key1
a   -0.002200
b    1.255367
Name: data1, dtype: float64
grouped.size()
key1
a    3
b    2
Name: data1, dtype: int64

We could iterate over groups.

for name, group in grouped:
    print('name', name)
    print('group', group)
name a
group 0    0.873122
1   -0.426869
4   -0.452852
Name: data1, dtype: float64
name b
group 2    1.756470
3    0.754264
Name: data1, dtype: float64

We can convert the groupby object into a list or a dict.

list(grouped)
[('a',
  0    0.873122
  1   -0.426869
  4   -0.452852
  Name: data1, dtype: float64),
 ('b',
  2    1.756470
  3    0.754264
  Name: data1, dtype: float64)]
dict(list(grouped))
{'a': 0    0.873122
 1   -0.426869
 4   -0.452852
 Name: data1, dtype: float64,
 'b': 2    1.756470
 3    0.754264
 Name: data1, dtype: float64}

4.4.2 More aggregation functions

  • .describe()
  • .count()
  • .sum()
  • .mean()
  • .median()
  • .std(), .var()
  • .min(), .max()
  • .prod()
  • .first(), .last()
  • .agg()
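
.agg() is the most flexible of these: it accepts function names, lists of functions, and custom functions. A minimal sketch:

import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'data': [1.0, 2.0, 4.0]})
grouped = df.groupby('key')['data']
grouped.agg('mean')                       # a single aggregation by name
grouped.agg(['mean', 'max'])              # several aggregations at once
grouped.agg(lambda x: x.max() - x.min())  # a custom aggregation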

4.4.3 Some examples

Example 4.19 Consider the following DataFrame.

import pandas as pd
import numpy as np
df = pd.DataFrame({'location': ['East', 'East', 'East', 'East',
                                'West', 'West', 'West', 'West'],
                   'data': np.random.randn(8)},
                   index=['Ohio', 'New York', 'Vermont', 'Florida',
                          'Oregon', 'Nevada', 'California', 'Idaho'])
df.loc[['Vermont', 'Nevada', 'Idaho'], 'data'] = np.nan

We would like to fill in NA values with the mean from each group.

df.groupby('location').apply(lambda x: x.fillna(x.mean()))
location data
Ohio East -1.063793
New York East 0.419128
Vermont East 0.123220
Florida East 1.014324
Oregon West -0.821446
Nevada West -0.185755
California West 0.449936
Idaho West -0.185755

We could also fill in NA values with predefined values, similar to the non-groupby case.

df.groupby('location').apply(lambda x: x.fillna({'East': 0.1,
                                                 'West': -0.5}[x.name]))
location data
Ohio East -1.063793
New York East 0.419128
Vermont East 0.100000
Florida East 1.014324
Oregon West -0.821446
Nevada West -0.500000
California West 0.449936
Idaho West -0.500000

4.5 Exercises

Most problems are based on [1].

Exercise 4.1 Please use the following code to generate a series ser, and then finish the following tasks.

import pandas as pd
import numpy as np


mylist = list('abcedfghijklmnopqrstuvwxyz')
myarr = np.arange(26)
mydict = dict(zip(mylist, myarr))
ser = pd.Series(mydict)
  1. Convert the series ser into a dataframe df with its index as another column of the dataframe.
  2. Pick the two columns of df and set them as two Series ser1 and ser2.
  3. Combine the two Series ser1 and ser2 to form a new dataframe newdf, and name its columns ser1 and ser2.

Exercise 4.2 Consider two Series ser1 and ser2. You may use the following ser1 and ser2 as an example. The output of each question below should be a Series.

import pandas as pd

ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])
  1. Find all the elements from ser1 that are also in ser2.
  2. Find all the elements from ser2 that are also in ser1.
  3. From ser1 remove items present in ser2.
  4. Find the union of ser1 and ser2.
  5. Find the intersection of ser1 and ser2.
  6. Find all the elements that are in either ser1 or ser2, but not both.

Exercise 4.3 (Some statistics) How do you get the minimum, 25th percentile, median, 75th percentile, and maximum of a numeric Series? You may use the following Series as an example.

import pandas as pd
import numpy as np
ser = pd.Series(np.random.normal(10, 5, 25))

Exercise 4.4 Please use pd.Series.value_counts() to calculate the frequency counts of each unique value of the following Series.

import pandas as pd
import numpy as np
ser = pd.Series(np.take(list('abcdefgh'), np.random.randint(8, size=30)))

Exercise 4.5 Please keep the top 2 most frequent items of ser as it is and replace everything else as Other.

import pandas as pd
import numpy as np
ser = pd.Series(np.take(list('abcdefgh'), np.random.randint(8, size=30)))

Exercise 4.6 Please use pd.cut or pd.qcut to bin the Series ser into 10 equal deciles. You may use the following ser as an example.

import pandas as pd
import numpy as np
ser = pd.Series(np.random.random(20))

Exercise 4.7 Consider the Series ser:

import pandas as pd
import numpy as np
ser = pd.Series(np.random.randint(1, 10, 7))

Find the positions of numbers that are multiples of 3 from ser.

Exercise 4.8 Compute the mean weight of each fruit.

import pandas as pd
import numpy as np
fruit = pd.Series(np.random.choice(['apple', 'banana', 'carrot'], 10))
weights = pd.Series(np.linspace(1, 10, 10))
df = pd.DataFrame({'fruit': fruit, 'weights': weights})

Exercise 4.9 Consider the following DataFrame.

import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
  1. Check if df has any missing values.
  2. Please count the number of missing values in each column.
  3. Please replace all missing values in Min.Price and Max.Price with their respective column means.

Exercise 4.10 Replace the spaces in my_str = 'dbc deb abed gade' with the least frequent character.

Exercise 4.11 Suppress scientific notation like e-03 in df and print values with 4 digits after the decimal point.

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.random(4)**10, columns=['random'])
df
random
0 4.622963e-07
1 2.912607e-02
2 6.323939e-02
3 1.510522e-01

Exercise 4.12 Format the values in column random of df as percentages.

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.random(4), columns=['random'])
df
random
0 0.191921
1 0.468266
2 0.048119
3 0.263165

Exercise 4.13 (Regular expressions) Please use regular expressions to finish the following tasks.

  1. Match a string that has an a followed by zero or more b’s.
  2. Match a string that has an a followed by one or more b’s.
  3. Match a string that has an a followed by zero or one b.
  4. Match a string that has an a followed by three b’s.

Exercise 4.14 (More regex) Find all words starting with a or e in a given string:

text = "The following example creates an ArrayList with a capacity of 50 elements. Four elements are then added to the ArrayList and the ArrayList is trimmed accordingly."

Exercise 4.15 (More regex) Write Python code to extract the year, month, and day from the URL url1:

url1= "https://www.washingtonpost.com/news/football-insider/wp/2016/09/02/odell-beckhams-fame-rests-on-one-stupid-little-ball-josh-norman-tells-author/"

Exercise 4.16 (More regex) Please use regex to parse the following str to create a dictionary.

text = r'''
{
    name: Firstname Lastname;
    age: 100;
    salary: 10000 
}
'''

Exercise 4.17 Consider the following DataFrame.

data = [['Evert van Dijk', 'Carmine-pink, salmon-pink streaks, stripes, flecks.  Warm pink, clear carmine pink, rose pink shaded salmon.  Mild fragrance.  Large, very double, in small clusters, high-centered bloom form.  Blooms in flushes throughout the season.'],
        ['Every Good Gift', 'Red.  Flowers velvety red.  Moderate fragrance.  Average diameter 4".  Medium-large, full (26-40 petals), borne mostly solitary bloom form.  Blooms in flushes throughout the season.'], 
        ['Evghenya', 'Orange-pink.  75 petals.  Large, very double bloom form.  Blooms in flushes throughout the season.'], 
        ['Evita', 'White or white blend.  None to mild fragrance.  35 petals.  Large, full (26-40 petals), high-centered bloom form.  Blooms in flushes throughout the season.'],
        ['Evrathin', 'Light pink. [Deep pink.]  Outer petals white. Expand rarely.  Mild fragrance.  35 to 40 petals.  Average diameter 2.5".  Medium, double (17-25 petals), full (26-40 petals), cluster-flowered, in small clusters bloom form.  Prolific, once-blooming spring or summer.  Glandular sepals, leafy sepals, long sepals buds.'],
        ['Evita 2', 'White, blush shading.  Mild, wild rose fragrance.  20 to 25 petals.  Average diameter 1.25".  Small, very double, cluster-flowered bloom form.  Blooms in flushes throughout the season.']]
  
df = pd.DataFrame(data, columns = ['NAME', 'BLOOM']) 
df 
NAME BLOOM
0 Evert van Dijk Carmine-pink, salmon-pink streaks, stripes, fl...
1 Every Good Gift Red. Flowers velvety red. Moderate fragrance...
2 Evghenya Orange-pink. 75 petals. Large, very double b...
3 Evita White or white blend. None to mild fragrance....
4 Evrathin Light pink. [Deep pink.] Outer petals white. ...
5 Evita 2 White, blush shading. Mild, wild rose fragran...

Please use regex methods to find all the texts enclosed in parentheses in each column.

Exercise 4.18 Get the last two rows of df whose row sum is greater than 100.

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(10, 40, 60).reshape(-1, 4))

Exercise 4.19 The groupby object df_grouped is given below.

import pandas as pd
import numpy as np
df = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'] * 3,
                   'price': np.random.rand(9),
                   'taste': np.random.randint(0, 11, 9)})

df_grouped = df.groupby(['fruit'])
  1. Get the group belonging to apple as a DataFrame.
  2. Find the second largest value of taste for banana.
  3. Compute the mean price for every fruit.

Exercise 4.20 Join df1 and df2 by fruit/pazham and weight/kilo.

import pandas as pd
import numpy as np

df1 = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'] * 3,
                    'weight': ['high', 'medium', 'low'] * 3,
                    'price': np.random.randint(0, 15, 9)})

df2 = pd.DataFrame({'pazham': ['apple', 'orange', 'pine'] * 2,
                    'kilo': ['high', 'low'] * 3,
                    'price': np.random.randint(0, 15, 6)})

4.6 Projects

Exercise 4.21 Extract the valid emails from the series emails. The regex pattern for valid emails is provided as reference.

import pandas as pd
emails = pd.Series(['buying books at amazom.com',
                    'rameses@egypt.com',
                    'matt@t.co',
                    'narendra@modi.com'])
pattern = '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,4}'

Exercise 4.22 Consider the following DataFrame.

import pandas as pd
import numpy as np
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', usecols=[0,1,2,3,5])
  1. Replace NaN with the string missing in columns Manufacturer, Model and Type.
  2. Create an index as a combination of these three columns.

Exercise 4.23 Given the following DataFrame.

import pandas as pd
df = pd.DataFrame({
    'name': ['James', 'Jane', 'Melissa', 'Ed', 'Neil'],
    'age': [30, 40, 32, 67, 43],
    'score': ['90%', '95%', '100%', '82%', '87%'],
    'age_missing_data': [30, 40, 32, 67, None],
    'income':[100000, 80000, 55000, 62000, 120000]
})
  • Please use .map to create a new column numeric_score whose value is the number version of score.
  • Please use .apply to create a new column numeric_score whose value is the number version of score.

Exercise 4.24 From ser = pd.Series(['Apple', 'Orange', 'Plan', 'Python', 'Money']), find the words that contain at least 2 vowels.

Exercise 4.25 Please download the given file with sample emails, and use the following code to load the file and save it to a string content.

with open('assests/datasets/test_emails.txt', 'r') as f:
    content = f.read()

Please use regex to play with content.

  1. Get all valid email addresses in content, from both the header part and the body part.
  2. There are two emails in content. Please get the sender’s email and the receiver’s email from content.
  3. Please get the sender’s name.
  4. Please get the subject of each email.

Exercise 4.26 The following DataFrame is given.

import pandas as pd
df = pd.DataFrame(["STD, City    State",
                   "33, Kolkata    West Bengal",
                   "44, Chennai    Tamil Nadu",
                   "40, Hyderabad    Telengana",
                   "80, Bangalore    Karnataka"],
                   columns=['row'])
  1. Split each entry of the column row into a list with 3 entries.
  2. Make the first row (row 0) into a header.
  3. Create a new DataFrame out of the data.