书名：Practical Data Science Cookbook（Second Edition）
作者名：Prabhanjan Tattar Tony Ojeda Sean Patrick Murphy Benjamin Bengfort Abhijit Dasgupta
本章字数：1688字
更新时间：2025-04-04 19:03:20

How to do it...

Let's use the following sequence of steps to import the data and start our exploration of this dataset in Python:

With the following snippet, we will create a Python list in memory that contains dictionaries of each row, where the keys are the column names (the first row of the CSV contains the header information) and the values are the values for that particular row:

In [3]: import csv 
   ...: data_file = "../data/income_dist.csv" 
   ...: with open(data_file, 'r') as csvfile: 
   ...: reader = csv.DictReader(csvfile) 
   ...: data = list(reader)

Note that the input file, income_dist.csv, might be in a different directory in your system depending on where you place it.

We perform a quick check with len to reveal the number of records:

In [4]: len(data) 
   ...:  
Out[4]: 2180

When utilizing CSV data with headers, we check the field names on the CSV reader itself, as well as getting the number of variables:

In [5]: len(reader.fieldnames) 
   ...:  
Out[5]: 354

While this data is not too large, let's start using best practices when accessing it. Rather than holding all of the data in memory, we use a generator to access the data one row at a time.

Generators are Python expressions that allow you to create functions that act as iterables; rather than returning all of the data, they yield data one part at a time in a memory-efficient iteration context. As our datasets get larger, it's useful to use generators to perform filtering on demand and clean data as you read it:

In [6]: def dataset(path): 
   ...: with open(path, 'r') as csvfile: 
   ...: reader = csv.DictReader(csvfile)
   ...: for row in reader: 
   ...: yield row

Also, take note of the with open(path, 'r') as csvfile statement. This statement ensures that the CSV file is closed when the with block is exited, even (or especially) if there is an exception. Python with blocks replace the try, except, and finally statements, and are syntactically brief while semantically more correct programming constructs.

Using our new function, we can take a look to determine which countries are involved in our dataset:

In [7]: print(set([row["Country"] for row in dataset(data_file)])) 

   ...: set(['Canada', 'Italy', 'France', 'Netherlands', 'Ireland',...])

We can also inspect the range of years that this dataset covers, as follows:

In [8]: print(min(set([int(row["Year"]) for row in dataset(data_file)]))) 
   ...:  
1875 

In [9]: print(max(set([int(row["Year"]) for row in dataset(data_file)]))) 
   ...:  
2010

In both of these previous examples, we used a Python list comprehension to generate a set. A comprehension is a concise statement that generates an iterable, much like the earlier memory-safe generators. The output variable (or variables) is specified, along with the for keyword, and the iterable to express the variable, along with an optional if condition. In Python 3.6, set and dictionary comprehensions also exist. The previous country set could also be expressed as follows:

In [10]: {row["Country"] for row in dataset(data_file)} 
    ...: set(['Canada', 'Italy', 'France', 'Netherlands', 'Ireland',...]) 
Out[10]: {'Netherlands', Ellipsis, 'Ireland', 'Canada', 'Italy', 'France'}

Finally, let's filter just the data for the United States so we can analyze it exclusively:

In [11]: filter(lambda row: row["Country"] == "United States", 
    ...: dataset(data_file)) 

Out[11]: <filter at 0xb1aeac8>

The Python filter function creates a list from all of the values of a sequence or iterable (the second parameter) that make the function specified by the first parameter true. In this case, we use an anonymous function (a Lambda function) to check whether the value in the specified row's Country column is equal to United States.

With this initial discovery and exploration of the dataset, we can now take a look at some of the data using matplotlib, one of the main scientific plotting packages available for Python and very similar to the plotting capabilities of MATLAB:

In [12]: import csv 
    ...: import numpy as np 
    ...: import matplotlib.pyplot as plt 

In [13]: def dataset(path, filter_field=None, filter_value=None): 
    ...: with open(path, 'r') as csvfile: 
        ...: reader = csv.DictReader(csvfile)
    ...: if filter_field: 
    ...: for row in filter(lambda row: 
    ...: row[filter_field]==filter_value, reader): 
    ...: yield row 
    ...: else: 
    ...: for row in reader: 
    ...: yield row 

In [14]: def main(path): 
    ...: data = [(row["Year"], float(row["Average income per tax unit"])) 
    ...: for row in dataset(path, "Country", "United States")] 
    ...: width = 0.35 
    ...: ind = np.arange(len(data)) 
    ...: fig = plt.figure() 
    ...: ax = plt.subplot(111) 
    ...: ax.bar(ind, list(d[1] for d in data))
    ...: ax.set_xticks(np.arange(0, len(data), 4)) 
    ...: ax.set_xticklabels(list(d[0] for d in data)[0::4],rotation=45) 
    ...: ax.set_ylabel("Income in USD") 
    ...: plt.title("U.S. Average Income 1913-2008") 
    ...: plt.show() 

In [15]: if __name__ == "__main__": 
        ...: main("income_dist.csv")

The preceding snippet will give us the following output:

The preceding example of data exploration with Python should seem familiar from many of the R chapters. Loading the dataset, filtering, and computing ranges required a few more lines of code and specific typecasting, but we quickly created analyses in a memory-safe fashion.

When we moved on to creating charts, we started using NumPy and matplotlib a bit more. NumPy can be used in a very similar fashion to R, to load data from a CSV file to an array in memory and dynamically determine the type of each column. To do this, the following two module functions can be used:

genfromtext: This function creates an array from tabular data stored in a text file with two main loops. The first converts each line of the file to string sequences, and the second converts each string to an appropriate datatype. It is a bit slower and not as memory efficient, but the result is a convenient data table stored in memory. This function also handles missing data, which other faster and simpler functions cannot.
recfromcsv: This function is a helper function based on genfromtext that has default arguments set to provide access to a CSV file.

Have a look at the following snippet:

In [16]: import numpy as np 

    ...: dataset = np.recfromcsv(data_file, skip_header=1) 

    ...: dataset 

array([[             nan,   1.93200000e+03,              nan, ..., 
                 nan,   1.65900000e+00,   2.51700000e+00], 
   [             nan,   1.93300000e+03,              nan, ..., 
                 nan,   1.67400000e+00,   2.48400000e+00], 
   [             nan,   1.93400000e+03,              nan, ..., 
                 nan,   1.65200000e+00,   2.53400000e+00], 
   ...,  
   [             nan,   2.00600000e+03,   4.52600000e+01, ..., 
      1.11936337e+07,   1.54600000e+00,   2.83000000e+00], 
   [             nan,   2.00700000e+03,   4.55100000e+01, ..., 
      1.19172976e+07,   1.53000000e+00,   2.88500000e+00], 
   [             nan,   2.00800000e+03,   4.56000000e+01, ..., 
      9.14119000e+06,   1.55500000e+00,   2.80300000e+00]])

The first argument to the function should be the data source. It should be either a string that points to a local or remote file or a file-like object with a read method. URLs will be downloaded to the current working directory before they are loaded. Additionally, the input can be either text or a compressed file. The function recognizes gzip and bzip2. These files need to have the .gz or .bz2 extensions to be readable. Notable optional arguments to genfromtext include the delimiter, , (comma) by default in recfromcsv; skip_header and skip_footer, which take an optional number of lines to skip from the top or bottom respectively; and dtype, which specifies the datatype of the cells. By default, the dtype is None, and NumPy will attempt to detect the correct format.

We can now get an overall sense of the scope of our data table:

In [17]: dataset.size 

    ...:  
Out[17]: 2179 

In [18]: (len(dataset)+1)*len(dataset.T[1]) # works on 3.6
    ...:  
Out[18]: 771720 
In [19]: dataset.shape 
    ...:  
Out[19]: (2179,)

Depending on your version of NumPy, you might see slightly different output. The dataset.size statement might report back the number of rows of data (2179), and the shape might output as (2179,).

The size property on ndarray returns the number of elements in the matrix. The shape property returns a tuple of the dimensions in our array. CSVs are naturally two-dimensional, therefore the (m, n) tuple indicates the number of rows and columns, respectively.

However, there are a couple of gotchas with using this method. First, note that we had to skip our header line; genfromtxt does allow named columns by setting the keyword argument names to True (and in this case, you won't set skip_headers=1). Unfortunately, in this particular dataset, the column names might contain commas. The CSV reader deals with this correctly since the strings that contain commas are quoted, but genfromtxt is not a CSV reader in general. To fix this, either the headers have to be fixed, or some other names need to be added. Secondly, the Country column has been reduced to NaN, and the Year column has been turned into a floating point integer, which is not ideal.

A manual fix on the dataset is necessary, and this is not uncommon. Since we know that there are 354 columns and the first two columns are Country and Year, we can precompute our column names and datatypes:

In [20]: names = ["country", "year"]

    ...: names.extend(["col%i" % (idx+1) for idx in range(352)]) 
    ...: dtype = "S64,i4," + ",".join(["f18" for idx in range(352)]) 
    ...:  
In [21]: dataset = np.genfromcsv(data_file, dtype=np.dtype, names=names, 
    ...: delimiter=",", skip_header=1, autostrip=2) 
    ...:

We name the first two columns country and year, respectively, and assign them datatypes of S64 or string-64, and then assign the year column as i4 or integer-4. For the rest of the columns, we assign them the name coln, where n is an integer from 1 to 352, and the datatype is f18 or float-18. These character lengths allow us to capture as much data as possible, including exponential floating point representations.

Unfortunately, as we look through the data, we can see a lot of nan values that represent Not a Number, a fixture in floating point arithmetic used to represent values that are not numbers nor are equivalent to infinity. Missing data is a very common issue in the data wrangling and cleaning stage of the pipeline. It appears that the dataset contains many missing or invalid entries, which makes sense given the historical data, and countries that may not have had effective data collection for given columns.

In order to clean the data, we use a NumPy masked array, which is actually a combination of a standard NumPy array and a mask, a set of Boolean values that indicate whether the data in that position should be used in computations or not. This can be done as follows:

import numpy.ma as ma 
ma.masked_invalid(dataset['col1']) 
masked_array(data = [-- -- -- ..., 45.2599983215332
 45.5099983215332 45.599998474121094],  
mask = [ True  True  True ..., False False False],fill_value = 
 1e+20)