Plotting numerical variables
Andrew B. Collier
In the previous installment we generated some simple descriptive statistics for the National Health and Nutrition Examination Survey data. Now we are going to move on to an area in which R really excels: making plots and visualisations. There are a variety of systems for plotting in R, but we will start off with base graphics.
First, make a simple scatter plot of mass against height.
plot(DS0012$height, DS0012$mass, ylab = "mass [kg]", xlab = "height [m]")
This clearly shows the relationship between these two variables, however, there is a high degree of overplotting.
We can improve the overplotting situation by making the points solid but partially transparent.
plot(DS0012$height, DS0012$mass, ylab = "mass [kg]", xlab = "height [m]", pch = 19, col = rgb(0, 0, 0, 0.05))
That’s much better: now we can see more structure in the data.
Now let’s look at the distribution of the BMI data using a histogram.
hist(DS0012$BMI, main = "Distribution of Body Mass Index", col = "lightblue", xlab = "BMI", prob = TRUE) lines(density(DS0012$BMI)) abline(v = mean(DS0012$BMI), lty = "dashed", col = "red")
I have thrown in a few bells and whistles here: a kernel density estimate of the underlying distribution and a vertical dashed line at the mean value of BMI.
Hexagon binning produces a two dimensional analog of the histogram which can be used to further improve on the visualisation of the mass versus height data above. One option is to use the hexbin package. However, in this case I prefer the output from the ggplot2 package.
library(ggplot2) ggplot(DS0012, aes(x=height,y=mass)) + geom_hex(bins=20) + xlab("height [m]") + ylab("mass [kg]")
The syntax for ggplot2 is quite different to that of the base R graphics. It takes quite a lot of getting used to, but it is well worth the effort because it is extremely powerful. The appearance of the ggplot2 output is also rather novel.
Well, that was a very quick and high level overview of some of the plotting capabilities in R. Next time we will take a look at plots generated using categorical variables.