Book: Practical Data Science Cookbook (Second Edition)
Authors: Prabhanjan Tattar, Tony Ojeda, Sean Patrick Murphy, Benjamin Bengfort, Abhijit Dasgupta
Updated: 2025-04-04 19:03:20

How to do it...

The analysis of the income_dist.csv file can be carried out step by step, as shown in the following program.

Load the dataset income_dist.csv using the read.csv function and use the functions nrow, str, length, unique, and so on to get the following results:

id <- read.csv("income_dist.csv",header=TRUE) 
nrow(id) 
str(names(id)) 
length(names(id))  
ncol(id) # equivalent of previous line 
unique(id$Country) 
levels(factor(id$Country)) # alternatively; factor() is needed in R >= 4.0, where read.csv keeps strings as character
min(id$Year) 
max(id$Year) 
id_us <- id[id$Country=="United States",]

The data is first stored in the R object id. We see that there are 2180 observations/rows in the dataset. The dataset has 354 variables, a few of which are displayed by combining the two functions str and names. The number of variables is also verified using the ncol and length functions. The data related to the United States is selected through the code id[id$Country=="United States",]. Next, we use the plot function to get a first view of the average income per tax unit, although the result is a poor plot.
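
The inspection calls above can be tried without the CSV file. The following is a minimal sketch on a made-up data frame; the object names toy and toy_us and all values are hypothetical and only mimic the Country and Year columns of income_dist.csv.

```r
# Hypothetical toy data standing in for income_dist.csv (values are made up)
toy <- data.frame(
  Country = c("United States", "United States", "France", "France"),
  Year    = c(1913, 1914, 1913, 1914),
  Average.income.per.tax.unit = c(100, 110, 90, 95),
  stringsAsFactors = FALSE
)

nrow(toy)             # number of observations
ncol(toy)             # number of variables, same as length(names(toy))
unique(toy$Country)   # distinct countries; works for character columns
range(toy$Year)       # minimum and maximum in one call

# Logical row subsetting, the same pattern used to build id_us
toy_us <- toy[toy$Country == "United States", ]
nrow(toy_us)
```

The comparison toy$Country == "United States" yields one TRUE/FALSE per row, and only the TRUE rows are kept; leaving the column position empty after the comma keeps every column.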

Using the plot function, we obtain a simple display as follows:

plot(id_us$Year, id_us$Average.income.per.tax.unit)

The output is not given here as we intend to improve it. Instead of plot, we now use the barplot function.

An elegant display is obtained using the barplot function along with a suitable choice of labels:

barplot(id_us$Average.income.per.tax.unit,ylim=c(0,60000), 
        ylab="Income in USD",col="blue",main="U.S. Average Income 1913-2008", 
        names.arg=id_us$Year)

It is good practice to set the options of a graphical function explicitly. For instance, we specified the range of the y-axis through ylim, the y-axis label through ylab, and a title for the graph through main.
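
As an illustration of these options, the sketch below derives the y-axis limit from the data instead of hard-coding it. The vectors years and income are made-up values, not figures from income_dist.csv, and rounding up to the next multiple of 10000 is just one possible choice.

```r
# Hypothetical yearly values (made up, not taken from income_dist.csv)
years  <- 2001:2005
income <- c(41000, 42500, 40800, 43900, 45200)

# Round the upper axis limit up to the next multiple of 10000
y_max <- ceiling(max(income) / 10000) * 10000

# barplot() invisibly returns the x positions of the bar midpoints
mids <- barplot(income, names.arg = years, ylim = c(0, y_max),
                ylab = "Income in USD", col = "blue",
                main = "Toy income series")
```

Keeping the returned midpoints in mids is optional; they are useful when text or points need to be added on top of individual bars later.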

For further analyses, we continue to focus on the United States region only:

The analysis of the top income data of the US, as in the earlier recipe, Analyzing and visualizing the top income data of the US, is reproduced in R in the following program. After subsetting on the US region, we select the top 10%, 5%, 1%, 0.5%, and 0.1% income share variables by name with the [ subsetting operator. The new R object is id2_us2:

id2 <- read.csv("income_dist.csv",header=TRUE,check.names = FALSE) 
# check.names=FALSE keeps spaces and special characters in the column names 
id2_us <- id2[id2$Country=="United States",] 
id2_us2 <- id2_us[,c("Top 10% income share", 
                     "Top 5% income share", 
                     "Top 1% income share", 
                     "Top 0.5% income share", 
                     "Top 0.1% income share")] 
row.names(id2_us2) <- id2_us$Year
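
To see why check.names matters here, the following minimal sketch reads the same small CSV text twice. The string csv_text and its single data row are hypothetical; only the column name "Top 10% income share" is taken from the code above.

```r
# Hypothetical two-line CSV held in a string (the data row is made up)
csv_text <- "Country,Year,Top 10% income share\nUnited States,1913,40.0"

mangled  <- read.csv(text = csv_text)                      # check.names = TRUE is the default
verbatim <- read.csv(text = csv_text, check.names = FALSE)

names(mangled)    # "Country" "Year" "Top.10..income.share"
names(verbatim)   # "Country" "Year" "Top 10% income share"

# Names that are not syntactically valid are addressed as strings
verbatim[, "Top 10% income share"]
```

With the default setting, spaces and the percent sign are replaced by dots, so selecting columns by their original names as done for id2_us2 would fail.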

The R object id2_us2 is converted into a time series object with the ts function. Now, for this specific choice of the data, we visualize it year-on-year with the next chunk of R code:

id2_us2 <- ts(id2_us2,start=1913,frequency = 1) 
dev.new(height=20,width=10) # portable; windows() is available only on Windows 
plot.ts(id2_us2,plot.type="single",ylab="Percentage",frame.plot=TRUE, 
        col=c("blue","green","red","blueviolet","purple")) 
legend(x=c(1960,1980),y=c(45,30),c("Top 10%","Top 5%","Top 1%","Top 0.5%","Top 0.1%"), 
       col = c("blue","green","red","blueviolet","purple"),pch="-")
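
The behavior of ts and plot.type="single" can be checked on a small matrix first. This is a sketch under stated assumptions: the matrix m, its column names, and its values are made up, and a line-based legend (lty) is used here in place of pch="-".

```r
# Hypothetical two-column matrix of yearly values (made up)
m <- cbind(a = c(10, 12, 11, 13), b = c(5, 6, 7, 6))

# start = 1913 with frequency = 1 labels the rows 1913, 1914, ...
m_ts <- ts(m, start = 1913, frequency = 1)

time(m_ts)                               # the year attached to each row
window(m_ts, start = 1914, end = 1915)   # select rows by year

# plot.type = "single" overlays all columns in one panel
plot(m_ts, plot.type = "single", col = c("blue", "red"), ylab = "Value")
legend("topleft", legend = colnames(m_ts), col = c("blue", "red"), lty = 1)
```

Note that ts simply assigns consecutive years from start onward, so it assumes the rows are sorted by year with no gaps.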

Note that the object id2_us2 contains five time series. We plot all of them in a single frame using the option plot.type="single". A legend and distinct colors make the individual series easier to tell apart. The resulting graphical output is given as follows:

The preceding exercise is repeated with scaled data:

id2_scale <- scale(id2_us2) 
dev.new(height=20,width=10) # portable; windows() is available only on Windows 
# scaled values are standardized scores, not percentages 
plot.ts(id2_scale,plot.type="single",ylab="Scaled value",frame.plot=TRUE, 
        col=c("blue","green","red","blueviolet","purple")) 
legend(x=c(1960,1980),y=c(2,1),c("Top 10%","Top 5%","Top 1%","Top 0.5%","Top 0.1%"), 
       col = c("blue","green","red","blueviolet","purple"),pch="-")

The output is as follows:

Note that this display makes the comparison between the five time series easier.
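
The comparison becomes easier because scale puts every column on a common footing. The following minimal sketch, on a made-up matrix x with hypothetical column names big and small, shows what the default call does to each column.

```r
# Hypothetical series on very different levels (values are made up)
x <- cbind(big = c(40, 42, 44, 46), small = c(1, 2, 3, 4))

z <- scale(x)   # subtract each column mean, divide by each column sd

round(colMeans(z), 10)   # both columns now have mean 0
apply(z, 2, sd)          # and standard deviation 1

# The same result computed by hand for the first column
manual <- (x[, "big"] - mean(x[, "big"])) / sd(x[, "big"])
all.equal(as.numeric(z[, "big"]), manual)
```

After scaling, the values no longer carry their original units, so only the shape of each series over time can be compared, not its level.
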
To replicate in R the Python analyses from the recipe Furthering the analysis of the top income groups of the US, we give the R code in the final chunk:

id2_us3 <- id2_us[,c("Top 10% income share-including capital gains", 
                     "Top 10% income share", 
                     "Top 5% income share-including capital gains", 
                     "Top 5% income share", 
                     "Top 1% income share-including capital gains", 
                     "Top 1% income share", 
                     "Top 0.5% income share-including capital gains", 
                     "Top 0.5% income share", 
                     "Top 0.1% income share-including capital gains", 
                     "Top 0.1% income share", 
                     "Top 0.05% income share-including capital gains", 
                     "Top 0.05% income share") 
                  ] 
id2_us3[,"Top 10% capital gains"] <- id2_us3[,1]-id2_us3[,2] 
id2_us3[,"Top 5% capital gains"] <- id2_us3[,3]-id2_us3[,4]   
id2_us3[,"Top 1% capital gains"] <- id2_us3[,5]-id2_us3[,6]   
id2_us3[,"Top 0.5% capital gains"] <- id2_us3[,7]-id2_us3[,8]   
id2_us3[,"Top 0.1% capital gains"] <- id2_us3[,9]-id2_us3[,10]   
id2_us3[,"Top 0.05% capital gains"] <- id2_us3[,11]-id2_us3[,12]   
id2_us3 <- ts(id2_us3,start=1913,frequency = 1) 
dev.new(height=20,width=10) # portable; windows() is available only on Windows 
plot.ts(id2_us3[,13:18],plot.type="single",ylab="Percentage",frame.plot=TRUE, 
        col=c("blue","green","red","blueviolet","purple","yellow")) 
legend(x=c(1960,1980),y=c(7,5),c("Top 10%","Top 5%","Top 1%","Top 0.5%", 
                                 "Top 0.1%","Top 0.05%"), 
       col = c("blue","green","red","blueviolet","purple","yellow"),pch="-")
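
The six subtractions in the code above follow one pattern: the share including capital gains minus the share excluding them. As a sketch of an alternative that avoids positional indexes, the same columns can be built by name in a loop. The data frame shares and its values are hypothetical; only the column naming scheme is taken from the code above.

```r
# Hypothetical shares in percent (values are made up)
shares <- data.frame(
  "Top 10% income share-including capital gains" = c(45.0, 46.5),
  "Top 10% income share"                         = c(42.0, 43.0),
  "Top 1% income share-including capital gains"  = c(20.0, 21.5),
  "Top 1% income share"                          = c(17.5, 18.0),
  check.names = FALSE
)

groups <- c("Top 10%", "Top 1%")
for (g in groups) {
  with_cg    <- paste0(g, " income share-including capital gains")
  without_cg <- paste0(g, " income share")
  # capital gains component = share with gains minus share without gains
  shares[[paste0(g, " capital gains")]] <- shares[[with_cg]] - shares[[without_cg]]
}

shares[, paste0(groups, " capital gains")]
```

Addressing the columns by name keeps the computation correct even if the column order of the selection changes.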

The graphical output of the plot.ts call for the six capital gains series is given as follows: