Andrew B. Collier
I have just started preparing a series of talks aimed at introducing the use of R to a rather broad audience consisting of physicists, chemists, statisticians, biologists and computer scientists (plus a few other disciplines thrown in for good measure). I want to use a single consistent set of data throughout the talks. Finding something that would resonate with such a disparate set of people was quite a challenge. After playing around with a couple of options, I settled on using data for age, height and mass. These are things that we can all identify with. The next challenge was to actually find a suitable data set, which was surprisingly difficult. Eventually I stumbled upon the data from the National Health and Nutrition Examination Survey (NHANES), The data from the survey are available here. These data have been divided into a number of sets, each of which has been excellently curated and has a detailed codebook.
I started with the Body Measurements data (DS12), which I downloaded in tab-delimited format. The first task was to load this into R.
> DS0012 <- read.delim("data/DS0012/25505-0012-Data.tsv") > dim(DS0012)  9762 65
So there are 9762 records, each of which has 65 fields. We will only retain a subset of those (sequence number, gender, age, mass and height). Although the field name for mass suggests that it might in fact be weight (BMXWT), it is actually mass in kilograms. Height is given in centimetres.
> DS0012 = DS0012[,c("SEQN", "RIAGENDR", "RIDAGEYR", "BMXWT", "BMXHT")]
There is some missing data (mass and height fields), which we remove.
> summary(DS0012) SEQN RIAGENDR RIDAGEYR BMXWT BMXHT Min. :41475 Min. :1.000 Min. : 0.00 Min. : 3.10 Min. : 81.5 1st Qu.:44008 1st Qu.:1.000 1st Qu.:10.00 1st Qu.: 36.70 1st Qu.:150.3 Median :46540 Median :1.000 Median :29.00 Median : 66.10 Median :162.4 Mean :46547 Mean :1.497 Mean :32.88 Mean : 62.17 Mean :156.1 3rd Qu.:49094 3rd Qu.:2.000 3rd Qu.:54.00 3rd Qu.: 83.65 3rd Qu.:171.7 Max. :51623 Max. :2.000 Max. :80.00 Max. :218.20 Max. :203.8 NA's :131 NA's :889 > DS0012 = subset(DS0012, !(is.na(BMXWT) | is.na(BMXHT))) > dim(DS0012)  8861 6
So, we lost around 10% of the original data, but at least what we are left with is clean. Next we change the column labels to something a little less cryptic and convert the units for height to metres
> names(DS0012) <- c("id", "gender", "age", "mass", "height") > DS0012$height = DS0012$height / 100
Lastly we add in a derived column for Body Mass Index (BMI). There was already BMI data in the original data set, however, it is illustrative to calculate it again here.
> DS0012 = transform(DS0012, BMI = mass / height**2)
This is what the final data look like:
> head(DS0012) id gender age mass height BMI 1 41475 2 62 138.9 1.547 58.03923 2 41476 2 6 22.0 1.204 15.17643 3 41477 1 71 83.9 1.671 30.04755 5 41479 1 52 65.7 1.544 27.55946 6 41480 1 6 27.0 1.227 17.93390 7 41481 1 21 77.9 1.827 23.33782
Next installment: defining some meaningful categories.