4 Package: pandas
The basic data structure of pandas is `pandas.DataFrame`. You may treat it as a generalized version of a table.
To use pandas, we just import it. In most cases you would like to use the alias `pd`.
Since a `DataFrame` is more like a table, the biggest question here is not how to do computations (which are still very important), but how to retrieve, search, sort, merge, etc. the data.
4.1 Basic pandas
4.1.1 Series and DataFrame
A `Series` is a 1-d array-like object which has an index. The default index starts from 0. You may change the index to something assigned by you. Thus a `Series` can be treated as a generalization of a `dict`.
A `DataFrame` represents a rectangular table of data and contains an ordered collection of columns, each of which can be of a different value type. The `DataFrame` has both a row and a column index; it can be thought of as a dict of `Series` all sharing the same index. When displaying a `DataFrame`, we may use `.head()` to display just the first few rows for efficiency.
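A minimal sketch of both structures (the data here is made up for illustration):

```python
import pandas as pd

# a Series with a custom, dict-like index
s = pd.Series([4, 7, -5], index=['d', 'b', 'a'])

# a DataFrame built from a dict of equal-length columns
df = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Nevada'],
                   'year': [2000, 2001, 2001],
                   'pop': [1.5, 1.7, 2.4]})
preview = df.head()   # first few rows (here, all three)
```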
4.1.2 Accessing data
- A column in a `DataFrame` can be retrieved as a `Series`, either by dict-like notation or by attribute.
  - dict-like notation: `df['a']`
  - by attribute: `df.a`. Note that if the name of the column is not a valid attribute name, this method doesn't work.
- Rows are retrieved by `.loc` if using the row index, and by `.iloc` if using the row number.
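A quick sketch of the four access patterns, on a toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]},
                  index=['x', 'y', 'z'])

col1 = df['a']      # dict-like notation
col2 = df.a         # attribute notation ('a' is a valid attribute name)
row1 = df.loc['y']  # row by index label
row2 = df.iloc[1]   # row by position
```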
4.1.3 Updating data
- Assigning values to a column of a `DataFrame` updates that column. If the column doesn't exist, a new column is created.
- When assigning a `Series` to a column, the values are aligned by row index: values whose index labels are not in the `DataFrame` are discarded.
- Whenever there is no value for a specific column and row, it shows as `NaN`.
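A sketch of these update rules (toy data):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]}, index=['x', 'y', 'z'])
df['b'] = 10   # new column, scalar broadcast to every row

# assigning a Series aligns on the row index:
# label 'w' is discarded; rows 'y' and 'z' get NaN
df['c'] = pd.Series([100, 200], index=['x', 'w'])
```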
4.1.4 Indexing, Selection, and Filtering
Series indexing (`obj[...]`) works analogously to NumPy array indexing, except you can use the Series's index values instead of only integers. We can use logical expressions to filter a DataFrame.

```python
import pandas as pd
import numpy as np

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data[data['one'] > 5]
```
| | one | two | three | four |
|---|---|---|---|---|
| Utah | 8 | 9 | 10 | 11 |
| New York | 12 | 13 | 14 | 15 |
`.loc`, `.iloc`

```python
import pandas as pd
import numpy as np

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
print(data.loc['Colorado', ['two', 'three']])
print(data.iloc[2, [3, 0, 1]])
```
```
two      5
three    6
Name: Colorado, dtype: int32
four    11
one      8
two      9
Name: Utah, dtype: int32
```
- Slicing with labels behaves differently than normal Python slicing in that the endpoint is inclusive.

```
b    1.0
c    2.0
dtype: float64
```
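Output like the above plausibly comes from a snippet such as the following (the particular Series is an assumption; only the inclusive-endpoint behavior matters):

```python
import pandas as pd
import numpy as np

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
sliced = obj['b':'c']   # label-based slice: BOTH endpoints are included
```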
- Reindex: `.reindex()`:

```python
import pandas as pd
import numpy as np

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data.reindex(index=['Colorado', 'Arkansas', 'New York'],
             columns=['three', 'five', 'one'])
```
| | three | five | one |
|---|---|---|---|
| Colorado | 6.0 | NaN | 4.0 |
| Arkansas | NaN | NaN | NaN |
| New York | 14.0 | NaN | 12.0 |
4.1.5 Essential functions
- Arithmetic and Data Alignment: elements with the same index and columns will be computed together. By default, if either entry is `nan`, the answer will be `nan`. You may use the `fill_value` argument to fill the empty slots.
Example 4.2

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))
df2.loc[1, 'b'] = np.nan
df1.add(df2, fill_value=0)
```
| | a | b | c | d | e |
|---|---|---|---|---|---|
| 0 | 0.0 | 2.0 | 4.0 | 6.0 | 4.0 |
| 1 | 9.0 | 5.0 | 13.0 | 15.0 | 9.0 |
| 2 | 18.0 | 20.0 | 22.0 | 24.0 | 14.0 |
| 3 | 15.0 | 16.0 | 17.0 | 18.0 | 19.0 |
Relatedly, when reindexing a Series or DataFrame, you can also specify a `fill_value`.
4.1.6 Function Application and Mapping
We may apply functions to each row/column of a `DataFrame`. If the function is a built-in function that is compatible with `DataFrame`, you can call it directly and it will be applied automatically to each row/column. If it is not, we can call `apply` to get the desired result.
Example 4.3
```python
import pandas as pd
import numpy as np

data = pd.DataFrame(np.random.rand(4, 4),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
f = lambda x: x.max() - x.min()
print(data.apply(f))
print(data.apply(f, axis='columns'))
```
```
one      0.189348
two      0.700152
three    0.330421
four     0.766805
dtype: float64
Ohio        0.719334
Colorado    0.576251
Utah        0.236155
New York    0.364350
dtype: float64
```
We can use a more complicated function to get a more complicated result.
Example 4.4
```python
import pandas as pd
import numpy as np

data = pd.DataFrame(np.random.rand(4, 4),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
f = lambda x: pd.Series([x.max(), x.min()], index=['max', 'min'])
print(data.apply(f))
```
```
          one       two     three      four
max  0.868566  0.872288  0.938442  0.697079
min  0.248375  0.022153  0.138255  0.134982
```
4.1.7 Sorting and Ranking
- `.sort_values(by=)`
- `.rank(ascending=, method=)`
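A minimal sketch of both methods (toy data; `method='first'` breaks ties by order of appearance):

```python
import pandas as pd

s = pd.Series([7, -5, 7, 4, 2])
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

by_b = frame.sort_values(by='b')                 # rows ordered by column 'b'
ranks = s.rank(ascending=True, method='first')   # 1 = smallest value
```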
4.1.8 Summarizing and Computing Descriptive Statistics
- `sum`, `cumsum`
- `mean`, `median`
- `.describe()`
- `.cov`, `.corr`
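A sketch of these summary methods on a small made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'one': [1.4, 7.1, 0.75], 'two': [1.5, -4.5, -1.3]})

col_sums = df.sum()       # column sums
running = df.cumsum()     # running totals down each column
stats = df.describe()     # count / mean / std / quartiles at a glance
corr = df['one'].corr(df['two'])   # pairwise correlation
```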
4.1.9 Unique Values, Value Counts, and Membership
- `unique`
- `value_counts`
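A sketch (toy Series; `isin` is the membership counterpart):

```python
import pandas as pd

s = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

uniques = s.unique()         # distinct values, in order of appearance
counts = s.value_counts()    # frequency of each value, descending
mask = s.isin(['b', 'c'])    # elementwise membership test
```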
4.1.10 Reading and Writing Data in Text Format
- `read_csv`
- `read_excel`
- `df.to_csv`
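A self-contained sketch of the CSV round trip; an in-memory `StringIO` buffer stands in for a file on disk:

```python
import pandas as pd
from io import StringIO

csv_text = "a,b,c\n1,2,3\n4,5,6\n"       # stand-in for a .csv file
df = pd.read_csv(StringIO(csv_text))     # parse text into a DataFrame

out = df.to_csv(index=False)             # write back to CSV text
```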
4.1.11 Copies and views
- `inplace`
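The distinction in one sketch: most methods return a new object and leave the original untouched, while `inplace=True` mutates the object itself (and returns `None`):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1.0, np.nan, 3.0]})

dropped = df.dropna()      # returns a NEW object; df is unchanged
assert len(df) == 3

df.dropna(inplace=True)    # modifies df itself, returns None
assert len(df) == 2
```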
4.2 Data cleaning
4.2.1 Handling Missing Data
- `np.nan`, `pd.NA`
- `pd.isnull()`, `np.isnan()`
- `dropna`, `fillna`
Example 4.5
```python
import pandas as pd
import numpy as np

data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])
cleaned = data.dropna()
cleanedrow = data.dropna(how='all')
data
```

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1.0 | 6.5 | 3.0 |
| 1 | 1.0 | NaN | NaN |
| 2 | NaN | NaN | NaN |
| 3 | NaN | 6.5 | 3.0 |
```python
data[4] = np.nan
cleaned1 = data.dropna(axis=1, how='all')
cleanedthresh = data.dropna(thresh=2)
data
```

| | 0 | 1 | 2 | 4 |
|---|---|---|---|---|
| 0 | 1.0 | 6.5 | 3.0 | NaN |
| 1 | 1.0 | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN |
| 3 | NaN | 6.5 | 3.0 | NaN |
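`fillna` is the other half of the story; a sketch on the same toy data showing a constant fill, per-column fills, and filling with column means:

```python
import pandas as pd
import numpy as np

data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])

filled_zero = data.fillna(0)                # one constant everywhere
filled_cols = data.fillna({1: 0.5, 2: -1})  # per-column fill values
filled_mean = data.fillna(data.mean())      # fill with column means
```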
4.2.2 Data Transformation
- `.duplicated()`, `drop_duplicates()`
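A minimal sketch of deduplication (toy DataFrame; `subset=` restricts which columns define "duplicate"):

```python
import pandas as pd

df = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                   'v1': [1, 1, 2, 3, 3, 4, 4]})

dup_mask = df.duplicated()                  # True for rows seen before
deduped = df.drop_duplicates()              # drop exact duplicate rows
by_key = df.drop_duplicates(subset=['k1'])  # uniqueness judged on 'k1' only
```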
Example 4.6
- `pd.Series.map()`, `pd.DataFrame.apply()`
Example 4.7
```python
import pandas as pd
import numpy as np

data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                              'Pastrami', 'corned beef', 'Bacon',
                              'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon'
}
data['animal'] = data['food'].str.lower().map(meat_to_animal)
data['food'].map(lambda x: meat_to_animal[x.lower()])
```
```
0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object
```
- `replace`
- `rename`
- `describe`
- `permutation`
- `sample`
- dummy variables
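A sketch covering several of these at once (toy data; `pd.get_dummies` builds the dummy/indicator variables):

```python
import pandas as pd
import numpy as np

s = pd.Series([1., -999., 2., -999., 3.])
cleaned = s.replace(-999, np.nan)        # sentinel value -> NaN

df = pd.DataFrame(np.arange(4).reshape(2, 2),
                  index=['ohio', 'colo'], columns=['one', 'two'])
renamed = df.rename(index=str.title, columns={'one': 'ONE'})

dummies = pd.get_dummies(pd.Series(['a', 'b', 'a']))   # indicator columns
sampled = df.sample(n=2)                               # random rows
```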
4.2.3 Example: Movies
Below we explore the MovieLens 1M dataset. You may download it from this link.
```python
import pandas as pd
import numpy as np

mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('assests/datasets/movies.dat', sep='::',
                       header=None, names=mnames, engine="python",
                       encoding='ISO-8859-1')

# collect every genre that appears (genres are '|'-separated in one string)
all_genres = list()
movies['genres'].map(lambda x: all_genres.extend(x.split('|')))
genres = pd.unique(all_genres)

# build a 0/1 indicator matrix: one column per genre, one row per movie
dummies = pd.DataFrame(np.zeros((len(movies), len(genres))), columns=genres)
for i, gen in enumerate(movies.genres):
    indices = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indices] = 1
movies_windic = movies.join(dummies.add_prefix('Genre_'))
```
4.2.4 String Manipulation
The key idea in this section is that all methods in `pd.Series.str` are applied to each entry of the Series.
Example 4.8
Example 4.9 We could also use `.str` to play with column names and row indexes.
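A sketch of the elementwise behavior (toy data; note that missing values pass through untouched):

```python
import pandas as pd
import numpy as np

s = pd.Series(['Tom', 'William Rick', np.nan, 'steve'])

upper = s.str.upper()              # applied entry by entry; NaN stays NaN
has_rick = s.str.contains('Rick')  # elementwise substring test
lengths = s.str.len()              # elementwise length
```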
4.2.5 Regular expression
Regular expressions provide a flexible way to search or match string patterns in text. A single expression, commonly called a regex, is a string formed according to the regular expression language. Python's built-in `re` module is responsible for applying regular expressions to strings.
For details of the regular expression language in Python, please read the official documentation from here. There are also many great websites for learning regex. This is one example.
We will briefly mention a few rules here.
- `.`: matches any character except a newline.
- `\d`: matches any digit. It is the same as `[0-9]`.
- `\w`: matches any alphabetic or numeric character. It is the same as `[a-zA-Z0-9_]`.
- `\s`: matches any whitespace character. It is the same as `[ \t\n\r\f\v]`.
- `*`: causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible.
- `+`: causes the resulting RE to match 1 or more repetitions of the preceding RE, as many repetitions as are possible.
- `?`: causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
- `*?`, `+?`, `??`: the `*`, `+`, and `?` qualifiers are all greedy; they match as much text as possible. Adding `?` after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched.
- `{m}`: specifies that exactly m copies of the previous RE should be matched.
- `{m,n}`: causes the resulting RE to match from `m` to `n` repetitions of the preceding RE, attempting to match as many repetitions as possible.
- `{m,n}?`: causes the resulting RE to match from `m` to `n` repetitions of the preceding RE, attempting to match as few repetitions as possible.
- `[]`: used to indicate a set of characters.
- `()`: sets groups.
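A few of the rules above in action, in particular the greedy vs. non-greedy distinction:

```python
import re

text = '<a> <b> <c>'
greedy = re.findall(r'<.*>', text)    # grabs as much as possible
lazy = re.findall(r'<.*?>', text)     # minimal match

digits = re.findall(r'\d{2,3}', 'call 5 at 42 or 1234')  # 2-to-3-digit runs
```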
Example 4.10
- `.match()`
- `.search()`
- `.findall()`
- `.split()`
- `.sub()`
We can use `()` to specify groups, and use `.groups()` to get access to the results.
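A sketch of groups (the email-like pattern and sample text here are made up):

```python
import re

pattern = r'(\w+)@(\w+)\.(\w+)'
m = re.search(pattern, 'Contact: wesm@bright.net soon')
parts = m.groups()        # one tuple entry per () group
domain = m.group(2)       # groups are numbered from 1

all_hits = re.findall(pattern, 'a@b.com c@d.org')  # list of group tuples
```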
Example 4.11
To use regex with a DataFrame or Series, you may directly apply `.match`, `.findall`, `.replace` after `.str`, with the regex pattern as one of the arguments.
`.extract` is a method that does not come from `re`. It is used to extract the matched groups and assemble them into a DataFrame.
Example 4.12
```python
import pandas as pd
import numpy as np

mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('assests/datasets/movies.dat', sep='::',
                       header=None, names=mnames, engine="python",
                       encoding='ISO-8859-1')
pattern = r'([a-zA-Z0-9_\s,.?:;\']+)\((\d{4})\)'
movies = movies.join(movies.title.str.extract(pattern)
                                 .rename(columns={0: 'movie title', 1: 'year'}))
```
4.3 Data Wrangling
4.3.1 Hierarchical indexing
Pandas supports a more complex indexing system, in which the index may have multiple levels. See the following example.
```python
import pandas as pd
import numpy as np

data = pd.Series(np.random.randn(9),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 2, 3, 1, 2, 3]])
data
```
```
a  1   -0.802479
   2    0.759039
   3   -1.171718
b  1   -0.210611
   2    0.321742
c  3   -1.160436
   1    0.516246
d  2    0.397977
   3    0.062464
dtype: float64
```
You may look at the Series using different levels of indexes.
You may use `groupby` to group by levels and do calculations related to levels. More about `.groupby()` will be discussed in the next section.
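A sketch of level-based selection and level-based grouping (smaller toy Series, so the numbers are predictable):

```python
import pandas as pd
import numpy as np

data = pd.Series(np.arange(6.),
                 index=[['a', 'a', 'b', 'b', 'c', 'c'],
                        [1, 2, 1, 2, 1, 2]])

outer = data['b']                        # select by the outer index level
level_sums = data.groupby(level=0).sum() # aggregate within each outer label
```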
From the example above, you may notice that 2-level hierarchical indexing for a Series works very similarly to a DataFrame. In fact, you may translate back and forth between a 2-level indexed Series and a DataFrame.
| | 1 | 2 | 3 |
|---|---|---|---|
| a | -0.802479 | 0.759039 | -1.171718 |
| b | -0.210611 | 0.321742 | NaN |
| c | 0.516246 | NaN | -1.160436 |
| d | NaN | 0.397977 | 0.062464 |
```
a  1   -0.802479
   2    0.759039
   3   -1.171718
b  1   -0.210611
   2    0.321742
c  1    0.516246
   3   -1.160436
d  2    0.397977
   3    0.062464
dtype: float64
```
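The back-and-forth translation is done with `.unstack()` and `.stack()`; a sketch with a small toy Series:

```python
import pandas as pd
import numpy as np

s = pd.Series(np.arange(4.),
              index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

wide = s.unstack()   # inner index level becomes the columns
tall = wide.stack()  # back to a 2-level Series
```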
For a DataFrame, the index on both axes can be a multiindex. The usual indexing notation can be used if you want to start from the first level of the index. The more specific method to extract data is `.xs`.
Example 4.13
```python
import pandas as pd

df1 = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    },
    index=[0, 1, 2, 3],
)
df2 = pd.DataFrame(
    {
        "A": ["A4", "A5", "A6", "A7"],
        "B": ["B4", "B5", "B6", "B7"],
        "C": ["C4", "C5", "C6", "C7"],
        "D": ["D4", "D5", "D6", "D7"],
    },
    index=[4, 5, 6, 7],
)
df = pd.concat([df1, df2], keys=['x', 'y'])
```
| | | A | B | C | D |
|---|---|---|---|---|---|
| x | 0 | A0 | B0 | C0 | D0 |
| | 1 | A1 | B1 | C1 | D1 |
| | 2 | A2 | B2 | C2 | D2 |
| | 3 | A3 | B3 | C3 | D3 |
| y | 4 | A4 | B4 | C4 | D4 |
| | 5 | A5 | B5 | C5 | D5 |
| | 6 | A6 | B6 | C6 | D6 |
| | 7 | A7 | B7 | C7 | D7 |
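With a multiindexed frame like the one above, `.xs` takes a cross-section on any level; a smaller self-contained sketch:

```python
import pandas as pd

df = pd.concat(
    [pd.DataFrame({'A': ['A0', 'A1']}, index=[0, 1]),
     pd.DataFrame({'A': ['A4', 'A5']}, index=[4, 5])],
    keys=['x', 'y'])

part_x = df.xs('x')        # everything under outer key 'x'
row = df.xs(4, level=1)    # cross-section on the inner level
```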
4.3.2 Combining and Merging Datasets
4.3.2.1 merge()
`merge()` combines datasets by linking rows using one or more keys. This idea comes from relational databases (e.g., SQL-based).
Here are some examples.
Example 4.14
The two DataFrames are displayed as follows.
If the column names are different in each object, you can specify them separately.
By default `merge` does an inner join: the keys in the result are the intersection of the keys found in both tables. Below are the different types of `merge`. To specify the type of merge, use the `how` option.
- `inner`
- `left`
- `right`
- `outer`
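The four join types in one sketch (toy tables with partially overlapping keys):

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'rval': [4, 5, 6]})

inner = pd.merge(left, right, on='key', how='inner')  # keys in both: b, c
outer = pd.merge(left, right, on='key', how='outer')  # all keys: a, b, c, d
left_j = pd.merge(left, right, on='key', how='left')  # keys from left only
```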
Let’s see the following examples.
Example 4.15 If we want to really create a Cartesian product, we may use the option `how='cross'`. For example, if we would like to generate a deck of cards, we may use the following code.
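The deck code itself is not reproduced above; a sketch of the idea, with a deliberately truncated deck so the output stays small:

```python
import pandas as pd

suits = pd.DataFrame({'suit': ['H', 'S', 'D', 'C']})
faces = pd.DataFrame({'face': ['A', '2', '3']})  # truncated for brevity

deck = suits.merge(faces, how='cross')  # Cartesian product: 4 x 3 rows
```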
4.3.2.2 concat()
The concat()
function (in the main pandas namespace) performs concatenation operations along an axis while performing optional set logic (union or intersection) of the indexes (if any) on the other axes.
```python
import pandas as pd

df1 = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    },
    index=[0, 1, 2, 3],
)
df2 = pd.DataFrame(
    {
        "A": ["A4", "A5", "A6", "A7"],
        "B": ["B4", "B5", "B6", "B7"],
        "C": ["C4", "C5", "C6", "C7"],
        "D": ["D4", "D5", "D6", "D7"],
    },
    index=[4, 5, 6, 7],
)
df3 = pd.DataFrame(
    {
        "A": ["A8", "A9", "A10", "A11"],
        "B": ["B8", "B9", "B10", "B11"],
        "C": ["C8", "C9", "C10", "C11"],
        "D": ["D8", "D9", "D10", "D11"],
    },
    index=[8, 9, 10, 11],
)
pd.concat([df1, df2, df3], keys=['x', 'y', 'z'])
```
| | | A | B | C | D |
|---|---|---|---|---|---|
| x | 0 | A0 | B0 | C0 | D0 |
| | 1 | A1 | B1 | C1 | D1 |
| | 2 | A2 | B2 | C2 | D2 |
| | 3 | A3 | B3 | C3 | D3 |
| y | 4 | A4 | B4 | C4 | D4 |
| | 5 | A5 | B5 | C5 | D5 |
| | 6 | A6 | B6 | C6 | D6 |
| | 7 | A7 | B7 | C7 | D7 |
| z | 8 | A8 | B8 | C8 | D8 |
| | 9 | A9 | B9 | C9 | D9 |
| | 10 | A10 | B10 | C10 | D10 |
| | 11 | A11 | B11 | C11 | D11 |
By default, `pd.concat()` concatenates vertically. Note that it checks the column names: if the column names don't match, new columns are created and `nan` values are assigned.
If you want to concatenate DataFrames horizontally you need to add the `axis=1` option. Similarly, the row index is checked before concatenating. See the following example.
Example 4.16
| | A | B | C | D | A | B | C | D | A | B | C | D |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A0 | B0 | C0 | D0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | A1 | B1 | C1 | D1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | A2 | B2 | C2 | D2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | A3 | B3 | C3 | D3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | A4 | B4 | C4 | D4 | NaN | NaN | NaN | NaN |
| 5 | NaN | NaN | NaN | NaN | A5 | B5 | C5 | D5 | NaN | NaN | NaN | NaN |
| 6 | NaN | NaN | NaN | NaN | A6 | B6 | C6 | D6 | NaN | NaN | NaN | NaN |
| 7 | NaN | NaN | NaN | NaN | A7 | B7 | C7 | D7 | NaN | NaN | NaN | NaN |
| 8 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | A8 | B8 | C8 | D8 |
| 9 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | A9 | B9 | C9 | D9 |
| 10 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | A10 | B10 | C10 | D10 |
| 11 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | A11 | B11 | C11 | D11 |
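A table like the one above is produced by horizontal concatenation with non-overlapping row indexes; a smaller self-contained sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1']}, index=[0, 1])
df2 = pd.DataFrame({'A': ['A4', 'A5']}, index=[4, 5])

# align on the row index; absent positions become NaN
side = pd.concat([df1, df2], axis=1)
```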
Example 4.17 Consider the deck example from Example 4.15. This time we would like to use `pd.concat()` to get the result.
4.4 Data Aggregation and Group Operations
4.4.1 split-apply-combine model
We would like to apply group operations based on the split-apply-combine model.
- In the first stage of the process, data contained in a pandas object is split into groups based on one or more keys that you provide. We use `.groupby(keys)` to perform the split step. The result is a `groupby` object.
- Once this is done, a function is applied to each group, producing a new value.
- Finally, the results of all those function applications are combined into a result object. We may apply groupby functions directly as methods of groupby objects; the result is the combined result object.
Example 4.18
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})
df
```
| | key1 | key2 | data1 | data2 |
|---|---|---|---|---|
| 0 | a | one | 0.873122 | 0.561476 |
| 1 | a | two | -0.426869 | -1.433012 |
| 2 | b | one | 1.756470 | 0.352796 |
| 3 | b | two | 0.754264 | -2.418179 |
| 4 | a | one | -0.452852 | -0.950064 |
Now we want to group `data1` in `df` by `key1`.

```
<pandas.core.groupby.generic.SeriesGroupBy object at 0x000002020D272F40>
```
What we get is a groupby object and we could apply group functions to it.
The method to look at each group is `.get_group`.
We may directly apply some group functions to the groupby object.
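A sketch of both steps (a toy frame so the group contents are predictable):

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b'], 'data1': [1.0, 2.0, 3.0]})
grouped = df['data1'].groupby(df['key1'])

group_a = grouped.get_group('a')   # just the rows labeled 'a'
means = grouped.mean()             # a group function: one value per group
```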
We could iterate over groups.
```
name a
group 0    0.873122
1   -0.426869
4   -0.452852
Name: data1, dtype: float64
name b
group 2    1.756470
3    0.754264
Name: data1, dtype: float64
```
We could convert the groupby object into a list or a dictionary.
```
[('a',
  0    0.873122
  1   -0.426869
  4   -0.452852
  Name: data1, dtype: float64),
 ('b',
  2    1.756470
  3    0.754264
  Name: data1, dtype: float64)]
```
4.4.2 More aggregation functions
- `.describe()`
- `.count()`
- `.sum()`
- `.mean()`
- `.median()`
- `.std()`, `.var()`
- `.min()`, `.max()`
- `.prod()`
- `.first()`, `.last()`
- `.agg()`
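`.agg()` is the general entry point: it accepts a custom function, or a list of aggregations to compute at once. A sketch:

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1.0, 3.0, 5.0]})
g = df.groupby('key')['val']

spread = g.agg(lambda x: x.max() - x.min())  # custom aggregate per group
multi = g.agg(['mean', 'std', 'count'])      # several aggregates at once
```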
4.4.3 Some examples
Example 4.19 Consider the following DataFrame.
We would like to fill in NA values with the mean from each group.
| | location | data |
|---|---|---|
| Ohio | East | -1.063793 |
| New York | East | 0.419128 |
| Vermont | East | 0.123220 |
| Florida | East | 1.014324 |
| Oregon | West | -0.821446 |
| Nevada | West | -0.185755 |
| California | West | 0.449936 |
| Idaho | West | -0.185755 |
We could also fill in NA values with predefined values, similar to the non-groupby case.
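The group-mean fill can be sketched with `transform`, which keeps the original shape (toy data, not the states table above):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'location': ['East', 'East', 'West', 'West'],
                   'data': [1.0, np.nan, 3.0, 5.0]})

# fill each NaN with the mean of its OWN group
df['data'] = (df.groupby('location')['data']
                .transform(lambda x: x.fillna(x.mean())))
```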
4.5 Exercises
Most problems are based on [1].
Exercise 4.1 Please use the following code to generate a series `ser`, and then finish the following tasks.
- Convert the series `ser` into a dataframe `df` with its index as another column on the dataframe.
- Pick the two columns of `df` and set them into two series `ser1` and `ser2`.
- Combine the two series `ser1` and `ser2` to form a new dataframe `newdf`, and name their columns `ser1` and `ser2`.
Exercise 4.2 Consider two series `ser1` and `ser2`. You may use the following `ser1` and `ser2` as an example. The output of each question below should be a series. You may want to learn the following commands:
- Find all the elements from `ser1` that are also in `ser2`.
- Find all the elements from `ser2` that are also in `ser1`.
- From `ser1` remove items present in `ser2`.
- Find the union of `ser1` and `ser2`.
- Find the intersection of `ser1` and `ser2`.
- Find all the elements that are in either `ser1` or `ser2`, but not both.
Exercise 4.3 (Some statistics) Please check the following commands and answer the following questions.
How to get the minimum, 25th percentile, median, 75th, and max of a numeric series? You may use the following Series as an example.
Exercise 4.4 Please use `pd.Series.value_counts()` to calculate the frequency counts of each unique value of the following Series.
Exercise 4.5 Please keep the top 2 most frequent items of `ser` as they are and replace everything else with `Other`.
Exercise 4.6 Please use `pd.cut` or `pd.qcut` to bin the Series `ser` into 10 equal deciles. You may use the following `ser` as an example.
Exercise 4.7 Consider the Series `ser`:
Find the positions of numbers that are multiples of 3 in `ser`.
Exercise 4.8 Compute the mean of `weights` for each `fruit`.
Exercise 4.9 Consider the following DataFrame.
- Check if `df` has any missing values.
- Please count the number of missing values in each column.
- Please replace all missing values in `Min.Price` and `Max.Price` with their respective means.
Exercise 4.10 Replace the spaces in `my_str = 'dbc deb abed gade'` with the least frequent character.
Exercise 4.11 Suppress scientific notation (like `e-03`) in `df` and print up to 4 digits after the decimal point.
Exercise 4.12 Format the values in column `random` of `df` as percentages.
Exercise 4.13 (Regular expressions) Please use regular expressions to finish the following tasks.
- Match a string that has an `a` followed by zero or more `b`'s.
- Match a string that has an `a` followed by one or more `b`'s.
- Match a string that has an `a` followed by zero or one `b`.
- Match a string that has an `a` followed by three `b`'s.
Exercise 4.14 (More regex) Find all words starting with `a` or `e` in a given string:
Exercise 4.15 (More regex) Write Python code to extract the year, month and date from `url1`:
Exercise 4.16 (More regex) Please use regex to parse the following str to create a dictionary.
Exercise 4.17 Consider the following DataFrame.
```python
data = [['Evert van Dijk', 'Carmine-pink, salmon-pink streaks, stripes, flecks. Warm pink, clear carmine pink, rose pink shaded salmon. Mild fragrance. Large, very double, in small clusters, high-centered bloom form. Blooms in flushes throughout the season.'],
        ['Every Good Gift', 'Red. Flowers velvety red. Moderate fragrance. Average diameter 4". Medium-large, full (26-40 petals), borne mostly solitary bloom form. Blooms in flushes throughout the season.'],
        ['Evghenya', 'Orange-pink. 75 petals. Large, very double bloom form. Blooms in flushes throughout the season.'],
        ['Evita', 'White or white blend. None to mild fragrance. 35 petals. Large, full (26-40 petals), high-centered bloom form. Blooms in flushes throughout the season.'],
        ['Evrathin', 'Light pink. [Deep pink.] Outer petals white. Expand rarely. Mild fragrance. 35 to 40 petals. Average diameter 2.5". Medium, double (17-25 petals), full (26-40 petals), cluster-flowered, in small clusters bloom form. Prolific, once-blooming spring or summer. Glandular sepals, leafy sepals, long sepals buds.'],
        ['Evita 2', 'White, blush shading. Mild, wild rose fragrance. 20 to 25 petals. Average diameter 1.25". Small, very double, cluster-flowered bloom form. Blooms in flushes throughout the season.']]
df = pd.DataFrame(data, columns=['NAME', 'BLOOM'])
df
```
| | NAME | BLOOM |
|---|---|---|
| 0 | Evert van Dijk | Carmine-pink, salmon-pink streaks, stripes, fl... |
| 1 | Every Good Gift | Red. Flowers velvety red. Moderate fragrance... |
| 2 | Evghenya | Orange-pink. 75 petals. Large, very double b... |
| 3 | Evita | White or white blend. None to mild fragrance.... |
| 4 | Evrathin | Light pink. [Deep pink.] Outer petals white. ... |
| 5 | Evita 2 | White, blush shading. Mild, wild rose fragran... |
Please use regex methods to find all the contents inside parentheses `()` in each column.
Exercise 4.18 Get the last two rows of `df` whose row sum is greater than 100.
Exercise 4.19 The groupby object `df_grouped` is given below.
- Get the group belonging to `apple` as a DataFrame.
- Find the second largest value of `taste` for `banana`.
- Compute the mean `price` for every `fruit`.
Exercise 4.20 Join `df1` and `df2` by `fruit`/`pazham` and `weight`/`kilo`.
4.6 Projects
Exercise 4.21 Extract the valid emails from the series `emails`. The regex `pattern` for valid emails is provided as reference.
Exercise 4.22 Consider the following DataFrame.
- Replace `NaN` with the string `missing` in columns `Manufacturer`, `Model` and `Type`.
- Create an index as a combination of these three columns.
Exercise 4.23 Given the following DataFrame.
- Please use `.map` to create a new column `numeric_score` whose value is the number version of `score`.
- Please use `.apply` to create a new column `numeric_score` whose value is the number version of `score`.
Exercise 4.24 From `ser = pd.Series(['Apple', 'Orange', 'Plan', 'Python', 'Money'])`, find the words that contain at least 2 vowels.
Exercise 4.25 Please download the given file with sample emails, and use the following code to load the file and save it to a string `content`.
Please use regex to play with `content`.
- Get all valid email addresses in `content`, from both the header part and the body part.
- There are two emails in `content`. Please get the sender's email and the receiver's email from `content`.
- Please get the sender's name.
- Please get the subject of each email.
Exercise 4.26 The following DataFrame is given.
- Split the columns into a list with 3 entries.
- Make the first row (row 0) into a header.
- Create a new DataFrame out of the data.