Book: Practical Data Science Cookbook (Second Edition)
Authors: Prabhanjan Tattar, Tony Ojeda, Sean Patrick Murphy, Benjamin Bengfort, Abhijit Dasgupta
Updated: 2025-04-04 19:03:20

How it works...

Our dataset function has been modified to filter on a single field and value if desired. If no filter has been specified, it generates the entire CSV. The main piece of interest is what happens in the main function. Here, we generate a bar chart of average incomes in the United States per year using matplotlib. Let's walk through the code.
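The filtering behavior described above can be sketched as follows. This is a minimal illustration rather than the recipe's exact code: the parameter names filter_field and filter_value, the column names, and the sample values are all assumptions, and an in-memory string stands in for the CSV file.

```python
import csv
import io

# Hypothetical sample data; the column names and values are made up
# for illustration and are not taken from the recipe's dataset.
SAMPLE_CSV = """Country,Year,Average income per tax unit
United States,1913,10000.5
United States,1914,9800.0
Canada,1913,8000.0
"""

def dataset(source, filter_field=None, filter_value=None):
    """Yield each CSV row as a dict.

    If filter_field is given, only rows whose value in that field
    equals filter_value are yielded; otherwise every row is yielded.
    """
    reader = csv.DictReader(source)
    for row in reader:
        if filter_field is None or row[filter_field] == filter_value:
            yield row

# Filter on a single field and value.
us_rows = list(dataset(io.StringIO(SAMPLE_CSV), "Country", "United States"))

# No filter: the whole CSV is generated.
all_rows = list(dataset(io.StringIO(SAMPLE_CSV)))

print(len(us_rows), len(all_rows))
```

Writing dataset as a generator means rows are produced one at a time, so the caller decides whether to materialize them in a list.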

We collect our data as (year, avg_income) tuples in a list comprehension that uses our dataset function to keep only the rows for the United States.

We have to cast the average income per tax unit to a float in order to compute on it. In this case, we leave the year as a string since it simply acts as a label; however, to perform datetime computations, we might want to convert that year to a date using datetime.strptime(row['Year'], '%Y').date().
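As a rough sketch of these conversions, assume the rows are dicts like those produced by csv.DictReader. The rows list, its values, and the exact column names below are hypothetical.

```python
from datetime import datetime

# Hypothetical rows, shaped like the dicts csv.DictReader would yield.
rows = [
    {"Year": "1913", "Average income per tax unit": "10000.5"},
    {"Year": "1914", "Average income per tax unit": "9800.0"},
]

# (year, avg_income) tuples: the year stays a string label,
# while the income is cast to a float so we can compute on it.
data = [(row["Year"], float(row["Average income per tax unit"]))
        for row in rows]

# Optional: convert the year to a date when date arithmetic is needed.
# With only '%Y' in the format, the month and day default to January 1.
dates = [datetime.strptime(row["Year"], "%Y").date() for row in rows]

print(data)
print(dates)
```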

After we have performed our data collection, filtering, and conversions, we set up the chart. The width is the maximum width of a bar. The ind array (an ndarray) holds the x axis location of each bar; in this case, we want one location for every data point in our set. NumPy's np.arange function is similar to the built-in range function (xrange in Python 2), but it returns an ndarray of evenly spaced values in the given interval. In this case, we provide a stop value that is the length of the list and use the default start value of 0 and step size of 1, but these can also be specified. Unlike range, arange accepts floating point arguments, and it is typically much faster than building the equivalent Python list and converting it to an array.
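The following short sketch shows how np.arange behaves with a stop value only, with an explicit start and step, and with floating point arguments. The data list is a hypothetical stand-in for the (year, avg_income) tuples.

```python
import numpy as np

# Hypothetical stand-in for the (year, avg_income) tuples.
data = [("1913", 10000.5), ("1914", 9800.0), ("1915", 9900.0)]

# Stop value only: start defaults to 0 and step to 1,
# giving one x location per data point.
ind = np.arange(len(data))

# Explicit start, stop, and step.
every_other = np.arange(0, 10, 2)

# Floating point arguments are allowed, unlike the built-in range.
fractional = np.arange(0.0, 1.0, 0.25)

print(ind)          # [0 1 2]
print(every_other)  # [0 2 4 6 8]
print(fractional)   # [0.   0.25 0.5  0.75]
```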

The figure and subplot functions come from the matplotlib.pyplot module and create the base figure and axes, respectively. The figure function creates a new figure, or returns a reference to a previously created figure. The subplot function returns a subplot axis positioned by the grid definition with the following arguments: the number of rows, number of columns, and plot number. As a convenience, when all three arguments are less than 10, you can supply a single three-digit number with the respective values; for example, plt.subplot(111) (with matplotlib.pyplot imported as plt) creates a 1 x 1 grid of axes and selects subplot 1.
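A minimal sketch of creating the figure and axes might look like this. Selecting the Agg backend is an assumption made here so that the example runs without a display; it is not part of the recipe.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for headless runs
import matplotlib.pyplot as plt

# Create a new figure, then a single set of axes on a 1 x 1 grid.
fig = plt.figure()
ax = plt.subplot(111)  # shorthand for plt.subplot(1, 1, 1)

# The subplot spec records the grid shape the axes were placed on.
n_rows, n_cols, start, stop = ax.get_subplotspec().get_geometry()
print(n_rows, n_cols)
```

Working with the returned ax object, rather than calling pyplot functions implicitly, makes it explicit which axes each later call modifies.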

We then use the subplot to create a bar chart from our data. Note the use of another comprehension that passes the values of the incomes from our dataset along with the indices we created with np.arange. On setting the x axis labels, however, we notice that if we add all years as individual labels, the x axis is unreadable. Instead, we add ticks for every 4 years, starting with the first year. In this case, you can see that we use a step size of 4 in np.arange to set our ticks, and similarly, in our labels, we use slices on the Python list to step through every four labels. For example, for a given list, we will use:

mylist[s:e:t]

The slice starts at index s, stops before index e (the end is exclusive), and advances with step size t. Negative numbers are also supported and count from the end of the list; for example, mylist[-1] returns the last item of the list, and mylist[-3:] returns the last three items.
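To tie the slicing rule back to the chart, here is a sketch that draws the bars and labels every fourth year. The years and incomes are invented for illustration, and the Agg backend is again an assumption so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for headless runs
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: 12 consecutive years with made-up incomes.
years = [str(y) for y in range(1913, 1925)]
incomes = [10000.0 + 100 * i for i in range(len(years))]
data = list(zip(years, incomes))

ind = np.arange(len(data))
fig = plt.figure()
ax = plt.subplot(111)
bars = ax.bar(ind, [income for _, income in data])

# One tick every four bars, starting at the first bar ...
tick_positions = np.arange(0, len(data), 4)
# ... and the matching labels, taken with a step-4 slice.
tick_labels = [year for year, _ in data][0::4]
ax.set_xticks(tick_positions)
ax.set_xticklabels(tick_labels, rotation=45)

# The same slicing rules on a plain list.
mylist = list(range(10))
print(mylist[2:8:2])  # [2, 4, 6]
print(mylist[-1])     # 9
print(mylist[-3:])    # [7, 8, 9]
```

Because the tick positions and the labels are both generated with a step of 4 from the same starting point, they stay aligned no matter how many years the dataset contains.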