How to do it...
The following steps will use both plyr and the graphics library, ggplot2, to explore the dataset:
- Let's start by looking at whether there is an overall trend in how MPG changes over time on average. To do this, we use the ddply function from the plyr package to take the vehicles data frame, aggregate rows by year, and then, for each group, compute the mean highway, city, and combined fuel efficiency. The result is assigned to a new data frame, mpgByYr. Note that this is our first example of split-apply-combine: we split the data frame into groups by year, apply the mean function to specific variables, and then combine the results into a new data frame:
mpgByYr <- ddply(vehicles, ~year, summarise,
                 avgMPG = mean(comb08),
                 avgHghy = mean(highway08),
                 avgCity = mean(city08))
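To make the three stages concrete, here is a minimal sketch of the same split-apply-combine pattern written in base R on a tiny made-up data frame. The toy data frame and its values are assumptions for illustration only, and this is not how ddply is implemented internally; it only mirrors the idea:

```r
# Toy stand-in for the vehicles data frame (values are made up for illustration)
toy <- data.frame(
  year   = c(1984, 1984, 1985, 1985, 1985),
  comb08 = c(18, 22, 20, 24, 19)
)

# Split: one small data frame per year
groups <- split(toy, toy$year)

# Apply: compute the mean combined MPG within each group
summaries <- lapply(groups, function(g) {
  data.frame(year = g$year[1], avgMPG = mean(g$comb08))
})

# Combine: stack the per-group results into a single data frame
toyByYr <- do.call(rbind, summaries)
rownames(toyByYr) <- NULL
print(toyByYr)
```

The ddply call above does the same work in a single line, which is why it is used for the real mpgByYr data frame.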
- To gain a better understanding of this new data frame, we pass it to the ggplot function, telling it to plot the avgMPG variable against the year variable, using points. In addition, we specify that we want axis labels, a title, and even a smoothed conditional mean (geom_smooth()), which is drawn as a line with a shaded region around it showing the confidence interval:
ggplot(mpgByYr, aes(year, avgMPG)) + geom_point() +
geom_smooth() + xlab("Year") + ylab("Average MPG") +
ggtitle("All cars")
## geom_smooth: method="auto" and size of largest group is <1000, so using
## loess. Use 'method = x' to change the smoothing method.
The preceding commands will give you the following plot:

- Based on this visualization, one might conclude that there has been a tremendous increase in the fuel economy of cars sold in the last few years. However, this can be a little misleading, as there have been more hybrid and non-gasoline vehicles in later years. The following table counts the vehicles in the dataset by primary fuel type:
table(vehicles$fuelType1)
##            Diesel       Electricity Midgrade Gasoline       Natural Gas
##              1025                56                41                57
##  Premium Gasoline  Regular Gasoline
##              8521             24587
- Let's look at just gasoline cars, even though there are not many non-gasoline powered cars, and redraw the preceding plot. To do this, we use the subset function to create a new data frame, gasCars, which contains only the rows of vehicles in which the fuelType1 variable is one of the three gasoline grades, there is no secondary fuel (fuelType2 is empty), and the vehicle is not a hybrid (atvType is not "Hybrid"):
gasCars <- subset(vehicles, fuelType1 %in% c("Regular Gasoline",
"Premium Gasoline", "Midgrade Gasoline") & fuelType2 == "" & atvType != "Hybrid")
mpgByYr_Gas <- ddply(gasCars, ~year, summarise, avgMPG = mean(comb08))
ggplot(mpgByYr_Gas, aes(year, avgMPG)) + geom_point() +
geom_smooth() + xlab("Year") + ylab("Average MPG") + ggtitle("Gasoline cars")
## geom_smooth: method="auto" and size of largest group is <1000, so using
## loess. Use 'method = x' to change the smoothing method.
The preceding commands will give you the following plot:

- Have fewer large-engine cars been made recently? If so, this could explain the increase. First, let's verify whether cars with larger engines have worse fuel efficiency. We note that the displ variable, which represents the displacement of the engine in liters, is currently a character variable that we need to convert to a numeric variable:
typeof(gasCars$displ)
## "character"
gasCars$displ <- as.numeric(gasCars$displ)
ggplot(gasCars, aes(displ, comb08)) + geom_point() +
geom_smooth()
## geom_smooth: method="auto" and size of largest group is >=1000, so using
## gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the
## smoothing method.
## Warning: Removed 2 rows containing missing values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).
The preceding commands will give you the following plot:

This scatter plot offers convincing evidence of a negative correlation between engine displacement and fuel efficiency; that is, cars with smaller engines tend to be more fuel-efficient.
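If a single number is wanted alongside the plot, the strength of this relationship can be summarized with a correlation coefficient. The following is a minimal sketch on made-up values (the toy data frame is an assumption, not the recipe's data); the use = "complete.obs" argument skips rows with missing values, much as ggplot2 drops them in the warnings above:

```r
# Toy engine sizes (liters) and combined MPG; values are made up for illustration
toy <- data.frame(
  displ  = c(1.6, 2.0, 2.5, 3.5, 5.0, NA),
  comb08 = c(30, 27, 24, 20, 15, 22)
)

# Pearson correlation, ignoring the row with a missing displacement
r <- cor(toy$displ, toy$comb08, use = "complete.obs")
print(r)
```

A value close to -1 indicates a strong negative linear relationship. On the real data, the same call could be applied to gasCars$displ and gasCars$comb08.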
- Now, let's see whether more cars with smaller engines were made in later years, which could explain the drastic increase in fuel efficiency:
avgCarSize <- ddply(gasCars, ~year, summarise, avgDispl = mean(displ))
ggplot(avgCarSize, aes(year, avgDispl)) + geom_point() +
geom_smooth() + xlab("Year") + ylab("Average engine displacement (l)")
## geom_smooth: method="auto" and size of largest group is <1000, so using
## loess. Use 'method = x' to change the smoothing method.
## Warning: Removed 1 rows containing missing values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
The preceding commands will give you the following plot:

- From the preceding figure, the average engine displacement has decreased substantially since 2008. To get a better sense of the impact this might have had on fuel efficiency, we can put both MPG and displacement by year on the same graph. Using ddply, we create a new data frame, byYear, which contains both the average fuel efficiency and the average engine displacement by year:
byYear <- ddply(gasCars, ~year, summarise, avgMPG = mean(comb08),
avgDispl = mean(displ))
head(byYear)
##   year   avgMPG avgDispl
## 1 1984 19.12162 3.068449
## 2 1985 19.39469       NA
## 3 1986 19.32046 3.126514
## 4 1987 19.16457 3.096474
## 5 1988 19.36761 3.113558
## 6 1989 19.14196 3.133393
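The NA in the avgDispl column for 1985 most likely arises because mean returns NA whenever any of its inputs is missing, and a few rows of gasCars have no displacement value. As a hedged sketch on a made-up vector (displ_1985 is a hypothetical name, not a variable from the recipe), passing na.rm = TRUE computes the mean over the non-missing values instead:

```r
# Made-up displacement values for one year, including one missing entry
displ_1985 <- c(2.0, 3.5, NA, 1.8)

# Default behavior: a single NA makes the whole mean NA
avg_default <- mean(displ_1985)

# With na.rm = TRUE, the NA is dropped before averaging
avg_clean <- mean(displ_1985, na.rm = TRUE)

print(avg_default)
print(avg_clean)
```

Whether to drop missing values is a judgment call; the recipe keeps the NA, which is why ggplot2 warns about one removed row when plotting.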
- The head function shows us that the resulting data frame has three columns: year, avgMPG, and avgDispl. To use the faceting capability of ggplot2 to display Average MPG and Avg engine displacement by year on separate but aligned plots, we must melt the data frame, converting it from what is known as a wide format to a long format:
byYear2 <- melt(byYear, id = "year")
levels(byYear2$variable) <- c("Average MPG", "Avg engine displacement")
head(byYear2)
##   year    variable    value
## 1 1984 Average MPG 19.12162
## 2 1985 Average MPG 19.39469
## 3 1986 Average MPG 19.32046
## 4 1987 Average MPG 19.16457
## 5 1988 Average MPG 19.36761
## 6 1989 Average MPG 19.14196
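To see what the wide-to-long conversion does, here is a minimal base R sketch that builds the long format by hand from a two-row toy data frame. The toy values are made up, and this is only an illustration of the resulting shape, not a description of how melt works internally:

```r
# Toy wide-format data frame: one row per year, one column per measure
wide <- data.frame(
  year     = c(1984, 1985),
  avgMPG   = c(19.1, 19.4),
  avgDispl = c(3.07, 3.10)
)

# Long format: stack the two measure columns into a single value column
# and record in 'variable' which column each value came from
measures <- c("avgMPG", "avgDispl")
long <- data.frame(
  year     = rep(wide$year, times = length(measures)),
  variable = factor(rep(measures, each = nrow(wide)), levels = measures),
  value    = c(wide$avgMPG, wide$avgDispl)
)
print(long)
```

Each year now appears once per measure, so the long data frame has twice as many rows as the wide one.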
- If we use the nrow function, we can see that the byYear2 data frame has 62 rows and the byYear data frame has only 31. The two separate columns from byYear (avgMPG and avgDispl) have now been melted into one new column (value) in the byYear2 data frame. Note that the variable column in the byYear2 data frame serves to identify the column that the value represents:
ggplot(byYear2, aes(year, value)) + geom_point() +
geom_smooth() + facet_wrap(~variable, ncol = 1, scales =
"free_y") + xlab("Year") + ylab("")
## geom_smooth: method="auto" and size of largest group is <1000, so using
## loess. Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is <1000, so using
## loess. Use 'method = x' to change the smoothing method.
## Warning: Removed 1 rows containing missing values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
The preceding commands will give you the following plot:

From this plot, we can see the following:
- Engine sizes generally increased until 2008, with a sudden increase in large-engine cars between 2006 and 2008.
- Since 2009, there has been a decrease in the average engine size, which partially explains the increase in fuel efficiency.
- Until 2005, there was an increase in the average engine size, but the fuel efficiency remained roughly constant. This seems to indicate that engine efficiency has increased over the years.
- The years 2006-2008 are interesting. Though the average engine size increased quite suddenly, the MPG remained roughly the same as in previous years. This apparent discrepancy might require more investigation.
- Given the trend towards smaller-displacement engines, let's see whether automatic or manual transmissions are more efficient for four-cylinder engines, and how the efficiencies have changed over time:
gasCars4 <- subset(gasCars, cylinders == "4")
ggplot(gasCars4, aes(factor(year), comb08)) + geom_boxplot() +
facet_wrap(~trany2, ncol = 1) +
theme(axis.text.x = element_text(angle = 45)) +
labs(x = "Year", y = "MPG")
The preceding command will give you the following plot:

- This time, ggplot2 was used to create box plots that help visualize the distribution of values (and not just a single value, such as a mean) for each year.
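The numbers behind each box can also be inspected directly. As a rough sketch on made-up values (the toy data frame is an assumption), the following computes the minimum, quartiles, and maximum of MPG for each year, which are the main quantities a box plot summarizes:

```r
# Toy combined MPG values for two model years (made up for illustration)
toy <- data.frame(
  year   = c(2010, 2010, 2010, 2010, 2010, 2011, 2011, 2011, 2011, 2011),
  comb08 = c(22, 25, 27, 30, 33, 24, 26, 29, 31, 35)
)

# One column per year; rows are the minimum, quartiles, and maximum
q <- sapply(split(toy$comb08, toy$year), quantile,
            probs = c(0, 0.25, 0.5, 0.75, 1))
print(q)
```

Note that a box plot's whiskers usually stop short of extreme outliers, so the drawn whiskers need not coincide with the minimum and maximum shown here.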
- Next, let's look at the change in proportion of manual cars available each year:
ggplot(gasCars4, aes(factor(year), fill = factor(trany2))) +
geom_bar(position = "fill") +
labs(x = "Year", y = "Proportion of cars", fill = "Transmission") +
theme(axis.text.x = element_text(angle = 45)) +
geom_hline(yintercept = 0.5, linetype = 2)
The preceding command will give you the following plot:

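The proportions that position = "fill" displays can be tabulated as well. This is a minimal sketch on a made-up data frame (the toy rows and the "Auto"/"Manual" labels are assumptions for illustration); it counts cars by year and transmission, then normalizes each year's row so that it sums to 1:

```r
# Toy data: one row per car, with its model year and transmission type
toy <- data.frame(
  year   = c(1984, 1984, 1984, 1984, 2012, 2012, 2012, 2012),
  trany2 = c("Manual", "Manual", "Manual", "Auto",
             "Auto", "Auto", "Auto", "Manual")
)

# Counts of cars by year (rows) and transmission (columns)
counts <- table(toy$year, toy$trany2)

# Row-wise proportions: each year's entries sum to 1
props <- prop.table(counts, margin = 1)
print(props)
```

On the real data, the same two calls could be applied to gasCars4$year and gasCars4$trany2 to read off the exact share of manual cars per year, with the dashed line in the plot marking the 0.5 level.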