Book: Practical Data Science Cookbook (Second Edition)
Authors: Prabhanjan Tattar, Tony Ojeda, Sean Patrick Murphy, Benjamin Bengfort, Abhijit Dasgupta
Updated: 2025-04-04 19:03:20

How to do it...

With the following steps, we will dive deeper into the dataset and examine additional income figures:

The dataset also contains the average incomes by year of the different groups. Let's graph these and see how they have changed over time, relative to each other:

In [32]: def average_incomes(source):
    ...:     """
    ...:     Compares percentage average incomes
    ...:     """
    ...:     columns = (
    ...:         "Top 10% average income",
    ...:         "Top 5% average income",
    ...:         "Top 1% average income",
    ...:         "Top 0.5% average income",
    ...:         "Top 0.1% average income",
    ...:         "Top 0.05% average income",
    ...:     )
    ...:     # Read the rows once so that every column can reuse them
    ...:     source = list(dataset(source))
    ...:     return linechart([timeseries(source, col) for col in columns],
    ...:                      labels=columns,
    ...:                      ylabel="2008 US Dollars")
    ...:
    ...: average_incomes(data_file)
    ...: plt.show()

Since we have the foundation in place to create line charts, we can immediately analyze this new dataset with the tools we already have. We simply choose a different collection of columns and then customize our chart accordingly! The following is the resulting graph:

The results shown by this graph are quite fascinating. Until the 1980s, the wealthiest groups were about $1-1.5 million richer than the lower income groups. From the 1980s onward, the disparity has increased dramatically.
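
To make the column-selection step concrete, here is a minimal sketch of what a helper such as timeseries might do. It is an assumption for illustration, not the project's actual code: the name timeseries_sketch, the "Year" key, and the sample rows (with made-up values) are all hypothetical.

```python
def timeseries_sketch(rows, column):
    """Yield (year, value) pairs for one column, skipping blank cells."""
    for row in rows:
        raw = row.get(column, "")
        if raw == "":
            continue  # no figure recorded for this year
        yield (int(row["Year"]), float(raw))


# Made-up rows standing in for the parsed dataset (not real figures)
rows = [
    {"Year": "2000", "Top 10% average income": "10.0",
     "Top 1% average income": "50.0"},
    {"Year": "2001", "Top 10% average income": "",
     "Top 1% average income": "55.0"},
    {"Year": "2002", "Top 10% average income": "12.0",
     "Top 1% average income": "60.0"},
]

# Choosing a different chart only means choosing different column names
columns = ("Top 10% average income", "Top 1% average income")
series = [list(timeseries_sketch(rows, col)) for col in columns]
print(series[0])
print(series[1])
```

Each entry in series is a list of pairs that a plotting helper can draw as one line, which is why swapping the tuple of column names is enough to produce a new chart.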

We can also use the delta functionality to see how much richer the rich are than the average American:

In [33]: def average_top_income_lift(source):
    ...:     """
    ...:     Compares top percentage avg income over total avg
    ...:     """
    ...:     columns = (
    ...:         ("Top 10% average income", "Top 0.1% average income"),
    ...:         ("Top 5% average income", "Top 0.1% average income"),
    ...:         ("Top 1% average income", "Top 0.1% average income"),
    ...:         ("Top 0.5% average income", "Top 0.1% average income"),
    ...:         ("Top 0.1% average income", "Top 0.1% average income"),
    ...:     )
    ...:     source = list(dataset(source))
    ...:     # One delta series per (a, b) pair of columns
    ...:     series = [delta(timeseries(source, a), timeseries(source, b))
    ...:               for a, b in columns]
    ...:     return linechart(series,
    ...:                      labels=list(col[0] for col in columns),
    ...:                      ylabel="2008 US Dollars")
    ...:
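
For readers who do not have the earlier recipe at hand, the delta helper used above can be pictured roughly as follows. This is a hypothetical stand-in, not the project's implementation: the name delta_sketch, the direction of the subtraction (first minus second), the matching by year, and the sample values are all assumptions.

```python
def delta_sketch(first, second):
    """Yield (year, first - second) for every year present in both series."""
    lookup = dict(second)  # year -> value of the second series
    for year, value in first:
        if year in lookup:
            yield (year, value - lookup[year])


# Made-up series of (year, value) pairs, for illustration only
top = [(2000, 50.0), (2001, 55.0), (2002, 60.0)]
base = [(2000, 10.0), (2002, 12.0)]

lift = list(delta_sketch(top, base))
print(lift)

# A series compared with itself gives zero for every year
flat = list(delta_sketch(top, top))
print(flat)
```

Under these assumptions, years missing from either series are simply dropped, and pairing a column with itself produces a flat line at zero.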

We still haven't written new code other than the selection of our columns and utilization of the functionality that we have already added to our project. This reveals the following:

In our last analysis, we'll use a different kind of chart to look at the composition of the income of the wealthiest Americans. Since the composition is a percentage-based time series, a good chart for this task is a stacked area chart. Once again, we can reuse our time series code and simply add a function to create stacked area charts, as follows:

In [34]: def stackedarea(series, **kwargs):
    ...:     # Assumes numpy (np) and matplotlib.pyplot (plt) were imported earlier
    ...:     fig = plt.figure()
    ...:     axe = fig.add_subplot(111)
    ...:     # Keep only the second item (the value) of each pair in a series
    ...:     fnx = lambda s: np.array(list(v[1] for v in s), dtype="f8")
    ...:     # Pass a list: newer NumPy versions warn about or reject a generator
    ...:     yax = np.row_stack([fnx(s) for s in series])
    ...:     xax = np.arange(1917, 2008)
    ...:     polys = axe.stackplot(xax, yax)
    ...:     axe.margins(0, 0)
    ...:     if 'ylabel' in kwargs:
    ...:         axe.set_ylabel(kwargs['ylabel'])
    ...:     if 'labels' in kwargs:
    ...:         # Proxy rectangles carry each band's fill color into the legend
    ...:         legendProxies = []
    ...:         for poly in polys:
    ...:             legendProxies.append(plt.Rectangle((0, 0), 1, 1,
    ...:                                  fc=poly.get_facecolor()[0]))
    ...:         axe.legend(legendProxies, kwargs.get('labels'))
    ...:     if 'title' in kwargs:
    ...:         plt.title(kwargs['title'])
    ...:     return fig
    ...:

The preceding function expects a group of time series whose values add up to 100 percent at each point in time. We create an anonymous function that converts each series into a NumPy array of its values. The NumPy row_stack function (an alias of vstack) then stacks these arrays vertically into a single two-dimensional array with one row per series; this is the input from which the subplot's stackplot method draws the chart. The only other surprise in this function is the need for legend proxies: rectangles that carry the fill colors of the stackplot into the legend.
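
The stacking idea can be sketched without any plotting. With its default baseline, a stacked area chart draws each band on top of the running total of the bands below it, so the upper edge of the last band should sit at 100 when the inputs are percentages. The following pure-Python sketch illustrates this; the function name stack_boundaries and the sample shares are made up for illustration.

```python
from itertools import accumulate


def stack_boundaries(series):
    """Return the upper edge of each band at every time step.

    `series` holds one list of values per band, all of the same length.
    """
    steps = zip(*series)                               # one tuple per time step
    running = [list(accumulate(step)) for step in steps]
    return [list(band) for band in zip(*running)]      # back to one list per band


# Made-up percentage shares for three bands over two time steps
shares = [
    [60.0, 55.0],
    [25.0, 30.0],
    [15.0, 15.0],
]

edges = stack_boundaries(shares)
print(edges)

# The top edge equals 100 wherever the shares form a full composition
is_complete = all(abs(top - 100.0) < 1e-9 for top in edges[-1])
print(is_complete)
```

A check such as is_complete is a cheap way to confirm that a set of composition columns really covers the whole before charting it.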

Now, we can take a look at the income composition of the wealthiest Americans:

In [35]: def income_composition(source):
    ...:     """
    ...:     Compares income composition
    ...:     """
    ...:     columns = (
    ...:         "Top 10% income composition-Wages, salaries and pensions",
    ...:         "Top 10% income composition-Dividends",
    ...:         "Top 10% income composition-Interest Income",
    ...:         "Top 10% income composition-Rents",
    ...:         "Top 10% income composition-Entrepreneurial income",
    ...:     )
    ...:     source = list(dataset(source))
    ...:     # Short legend labels, in the same order as the columns above
    ...:     labels = ("Salary", "Dividends", "Interest", "Rent", "Business")
    ...:     return stackedarea([timeseries(source, col) for col in columns],
    ...:                        labels=labels,
    ...:                        ylabel="Percentage")
    ...:

The preceding code generates the following plot:

As you can see, the top 10 percent of American earners make most of their money from salaries; however, business income also plays a large role. Dividends played a bigger role earlier in the century than they did towards its end, and the same is true of interest and rent. Interestingly, the share of income that comes from entrepreneurial activity declines through the first part of the 20th century and then starts to grow again in the 1980s, possibly because of the technology sector.