词汇表

选择左侧的一个关键字...

The Data Science PipelineWrangling

阅读时间: ~50 min

Data is said to be in tidy format if each row corresponds to an observation and each column corresponds a different observation variable. For example, in the iris dataset, each row represents a flower, and the entries of a row specify the flower's species and various measurements made for that flower. Here's the head of the Iris data frame.

Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa

Note that four of the columns are quantitative (that is, they contain numerical data), while one is categorical (that is, they contain strings which represent categories).

Many types of data do not fit naturally into the tidy data framework, like image data and passages of text. However, tidy data occupies a sweet spot of generality which covers a wide variety of use cases but is nevertheless specific enough to support tools which are both concise and powerful. In this mini-course, we will focus on tidy data.

Pandas

A data frame is an object for storing tidy data, and the package which provides data frames in the Python ecosystem is Pandas. Pandas is built on NumPy, which is the Python library for multi-dimensional arrays. If you aren't comfortable with the basics of NumPy, a brief detour through this interactive notebook is recommended.

import pydataset
iris = pydataset.data("iris")
print(type(iris))
iris.head()

A data frame's index attribute stores the labels of the rows, and the columns attribute stores the column names.

iris.columns, iris.index

Columns of the data frame can be accessed using by indexing the data frame with the name of the column. For column names that are valid Python variable names, dot syntax access is also supported:

import pandas as pd
data = [[179_335,47.7],[81_274,73.4],[24_590,19.9],[22_131,25.4]]
columns = ['population', 'area']
index = ['Providence', 'Cranston', 'Newport', 'Bristol']
ri = pd.DataFrame(data, columns = columns, index = index)
ri['population'] # or ri.population, same thing

Pandas DataFrame objects provide two attributes, called loc and iloc, for accessing entries using names or integers, respectively.

ri.loc['Providence',:'area'], ri.iloc[0,:1]

Note that slices built with row or column names are inclusive, while integer slices follow the usual Python convention of being exclusive of the upper bound.

Exercise
Use the pydataset package to load the 'Boston' data set. Select all of the columns between indus and rad and assign the resulting data frame to a new variable. Then select the first 25 rows from that new data frame.

 

Solution. We index columns using loc and the rows using iloc

boston = pydataset.data('Boston')
boston_trimmed = boston.loc[:,'indus':'rad']
boston_trimmed.iloc[:25,:]

The six verbs of data manipulation

Although data frames support many transformations, we will follow the prominent data scientist Hadley Wickham in suggesting the following six as fundamental. They can be combined to cover most of your data manipulation needs, so you can get up and running quickly in any data manipulation framework by learning how these actions are performed and composed.

  1. Filter. Pick rows based on their values.

  1. Sort. Re-order the rows.

  1. Select. Choose specific columns.

  1. Transform. Create new columns from existing ones.

  1. Aggregate. Reduce the data frame to a single row by applying a function (like sum, min, max, etc.) which maps each column of values to a single value.

  1. Group. Collect the rows of the data frame into groups.

The grouping operation is different from the others because it outputs a grouped data frame object, which can be visualized as a stack of data frames. Here's how the popular data table product Airtable presents grouped data frames (using some of the rows from the iris data set above, grouped on species):

The other operations can be applied group-by-group.

Exercise
Experiment with the interactive table below to practice some of the operations above. You might want to click the "View larger version" icon in the bottom right corner to open the table in a new browser tab.

  1. Use the Filter button to select only those flowers whose sepal length is less than 5.0.

There are such flowers in the data frame (note: the number of rows can be seen in the bottom left corner of the interactive table). 2. Use the Sort button to sort by sepal length and then by sepal width (this means that ties in sepal length are broken by looking at the sepal width value).

The smallest sepal length value is . 3. Use the Hide fields button to select only the petal-related and species columns. 4. Remove the sorts and filters (by clicking on those buttons again and then using the small x by each sort or filter), and then use the Group button to group the data frame by species.

There are flowers of each species in the data frame. 5. Click the small downward pointing triangle under the Petal Length column in the Setosa group (where it originally says "Sum 73.1") to change the aggregation function to "Average". Then click the downward pointing triangle to the left of the "setosa" icon to collapse that group. Do the same for the other two groups.

The species with the largest average petal length is .

Filter

There are two main ways to filter rows in Pandas. The first is to obtain a column of boolean values and use it to index the rows of the data frame. For example:

ri[ri.population > 50_000]

In the last line above, ri.population > 50_000 returns a column with the values [True, True, False, False], and indexing the data frame with a boolean array selects only those rows corresponding to the True values.

We can combine predicates using the usual Python operations for sets:

ri[(ri.population > 50_000) & (ri.area > 50)]

This approach has some drawbacks: First, we have to repeat the name of the data frame multiple times. This violates the programming maxim "Don't repeat yourself". Second, computing an expression like (ri.population > 50_000) & (ri.area > 50) requires three array allocations: one for ri.population > 50_000, one for ri.area > 50, and one for the (ri.population > 50_000) & (ri.area > 50). These allocations are unnecessary, since it is possible to just loop over the rows and directly check the whole condition for each row (although you don't want to program that yourself in Python, because loops in Python are slow).

Pandas does provide a solution to this problem: the query method. You supply your condition as a string, and Pandas makes things efficient for you under the hood:

ri.query('population > 50_000 & area > 50')

Python variables can be interpolated in query strings using the @ prefix. For example, the query above could also be written as

min_pop = 50_000
ri.query('population > @min_pop & area > 50')

Exercise
Use the query method to identify the records in the 'Boston' data set which have the property that indus is at least 10 and either medv is less than 8 or chas is 1.

 

Solution. We use parentheses to group the given logical conditions:

import pydataset
boston = pydataset.data('Boston')
boston.query('indus >= 10 & (medv < 8 | chas == 1)')

Sort

The DataFrame method that sorts values is called sort_values. It takes an argument for the column labels (or list of columns labels) to use for sorting, and you can use the ascending argument to specify whether the values in that column should be in increasing or decreasing order.

ri.sort_values('population',ascending=False)

If a list of columns is supplied, then each column after the first is used to break ties in the preceding columns:

d = pd.DataFrame([[3,2],[3,1],[2,4]],columns=['a','b'])
d.sort_values(['a','b'])

Note that sort_values returns a new data frame. It does not modify the original one.

Exercise
Sort the 'Boston' data set in decreasing order of the value in the first column.

 

Solution. We can inspect columns or look at the data frame's head to see that the first column is 'crim'. Then we sort:

boston = pydataset.data('Boston')
boston.sort_values(['crim'], ascending = False)

Select

To select columns in Pandas, you can just index the data frame with a list of column names:

ri[['population','area']]

If you want to keep all columns except specific ones, you can use the drop method:

ri.drop('population', axis=1)

We have to specify that 'population' refers to columns (axis=1), because the drop method's default is to look for rows to drop. Like sort_values, drop doesn't modify the original data frame.

Exercise
Select the columns in the Boston data frame which contain floating point numbers (as opposed to integers, which do not print with a decimal point).

    

Solution. We inspect the data frame to find that the columns which are not floats are 'chas', rad, and 'tax'. So we use drop instead of selecting:

boston_float_only = boston.drop(['chas', 'rad', 'tax'],axis=1)
boston_float_only.head()

As a follow-up, we note that this is a sufficiently common operation that Pandas supplies a convenience method for it:

import numpy as np
boston_float_only = boston.select_dtypes(exclude=['int'])
boston_float_only.head()

You can inspect the types of the columns of a data frame using its dtypes attribute.

Transform

We can create new columns in a data frame using the assign method. For example:

ri.assign(density = ri.population / ri.area)

If the name of the data frame is quite long, you can avoid having to type it repeatedly by supplying an anonymous function to be applied to the data frame:

ri.assign(density = lambda d: d.population / d.area) \
  .assign(**{'inverse density': lambda d: 1/d.density})

We are using two assign calls to create a column called density and then a second new column called inverse density. (Note how we used splatting to get a space in the column name.)

Exercise
For each observation in the 'toothpaste' data set, find the ratio of the difference between means to the square root of the sum of the squares of the standard deviations for conditions "A" and "B".

 

Solution. Since the formula is quite involved, we use an anonymous function:

import numpy as np
toothpaste = pydataset.data('toothpaste')
toothpaste.assign(score = lambda d: 
      (d.meanA - d.meanB)/np.sqrt(d.sdA**2 + d.sdB**2))

Aggregate

The agg method applies a specified function (called the aggregation function) to each column in a data frame. Several aggregation functions are built-in and can be specified using a string:

NameDescription
countnumber of entries
sumsum of entries
prodproduct of entries
meanaverage
medianmiddle entry when sorted
varvariance
stdstandard deviation
minminimum entry
maxmaximum entry

For example, to find the total population of the four cities in our data frame, we run:

ri.population.agg('sum')

To find the average population and area:

ri.agg('mean')

You can also supply a custom aggregation function instead of a string.

Exercise
Find the range (the difference between max and min) for each of the four quantitative columns in the iris dataset. Try using an anonymous function rather than using the built-in 'min' and 'max' aggregation functions.

 

Solution. We have to drop the categorical column first since we can't compute a range for that.

import numpy as np
import pydataset
iris = pydataset.data('iris')
iris.drop(['Species'],axis=1).agg(lambda r: np.max(r) - np.min(r))

Group

The aggregation function is a little bland by itself. We often want to apply the aggregation function only to specific rows. For example consider the following transaction log:

Customer nameTransaction amount
Acme, Inc.45.03
Acme, Inc.118.22
XYZ Widgets35.55
Acme, Inc.22.04
XYZ Widgets34.78

What is the total transaction amount for each company? We could filter for each company and aggregate the two resulting data frames, but that would get unwieldy quickly as the number of companies grows. Instead, we group the data frame by customer name and then apply the aggregation function:

columns = ['Customer name', 'Transaction amount']
transactions = pd.DataFrame([['Acme, Inc.',45.03],
                             ['Acme, Inc.',118.22],
                             ['XYZ Widgets',35.55],
                             ['Acme, Inc.',22.04],
                             ['XYZ Widgets',34.78]],
                             columns=columns)
transactions.groupby('Customer name').agg('sum')

When a data frame is grouped along a particular column, the resulting grouped data frame is a dictionary-like object whose keys are the unique values of that column and whose values are the data frames obtained by filtering for the given key in that column. For example, transactions.groupby('Customer name') maps the key 'Acme, Inc.' to the data frame

Customer name Transaction amount
0 Acme, Inc. 45.03
1 Acme, Inc. 118.22
3 Acme, Inc. 22.04

and 'XYZ Widgets' to

Customer name Transaction amount
2 XYZ Widgets 35.55
4 XYZ Widgets 34.78

The agg method of DataFrameGroupBy object operates on each each of these data frames to produce a single row, and these rows are collected into a new data frame. The row index for this output data frame comes from the keys of the dictionary-like DataFrameGroupBy object.

Exercise
Use the interactive table below to perform the same aggregation operation described above (in other words, find the sum of the transaction amounts for each company).

Exercise
Group the flowers in the Iris data set in unit intervals of sepal length and find the average sepal width for each such group. (In other words, one group contains all flowers with sepal length in [4,5), another group has all flowers with sepal length in [5,6), and so on.)

Hint: make a new column to group by.

import pydataset
iris = pydataset.data('iris')
 

Solution. We create a new column using the np.floor function. Then we perform the grouping, select the column we want, and aggregate using the mean function:

import pydataset
import numpy as np
iris = pydataset.data('iris') 
iris.assign(sepal_length_floor = np.floor(iris["Sepal.Length"])) \
    .groupby('sepal_length_floor') \
    [['Sepal.Width']] \
    .agg('mean')

We conclude this section by noting that the other four operations (filter, sort, select, and transform) can be applied to grouped data frames as well. However, only selection works directly on grouped data frames. For the others, we use the apply method of the grouped data frame object to operate group-by-group and collect the results into a single data frame.

For example, we can sort by petal length within species as follows:

iris.groupby('Species').apply(lambda d: d.sort_values('Petal.Length'))

You'll notice in data frame returned above that the row indices are retained from the original data frame. Actually, the group values are also incorporated into the indexing scheme—this is advanced feature of Pandas called a multi-indexing. The reset_index method is useful for dropping this extra structure and simply re-indexing the rows from 0. We need to supply the value True to the keyword argument drop, because otherwise the reset_index method will try to keep the old indices around as new columns, and that will fail since we already have a Species column:

(iris.groupby('Species')
     .apply(lambda d: d.sort_values('Petal.Length'))
     .reset_index(drop = True))
Bruno
Bruno Bruno