Andrew B. Collier
In the previous installment we derived two new categorical variables for the National Health and Nutrition Examination Survey data. This time we will get some simple descriptive statistics from the data.
Let’s start by looking at a summary of the entire data set. The negative indices drop column 1, the identifier field, which has no real significance, along with column 7.
> summary(DS0012[, c(-1, -7)])
 gender        age             mass            height           BMI             BMI.category
 M:4448   Min.   : 2.00   Min.   : 10.40   Min.   :0.815   Min.   :12.50   underweight:1759
 F:4413   1st Qu.:12.00   1st Qu.: 49.00   1st Qu.:1.503   1st Qu.:19.97   normal     :2589
          Median :33.00   Median : 68.70   Median :1.624   Median :25.16   overweight :2260
          Mean   :35.45   Mean   : 66.68   Mean   :1.561   Mean   :25.71   obese      :2253
          3rd Qu.:56.00   3rd Qu.: 85.20   3rd Qu.:1.717   3rd Qu.:30.08
          Max.   :80.00   Max.   :218.20   Max.   :2.038   Max.   :73.43
This gives the minimum, quartiles, maximum and mean for each of the numerical variables, and the counts for each of the categorical variables. The mean age of the subjects is about 35 years. The subjects have masses between 10.4 and 218.2 kg.
We could have extracted these statistics for each of the numerical variables individually.
> mean(DS0012$BMI)
[1] 25.7057
> median(DS0012$BMI)
[1] 25.15504
> quantile(DS0012$BMI)
      0%      25%      50%      75%     100%
12.50312 19.97228 25.15504 30.08150 73.42526
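Rather than calling each function column by column, the same statistics can be gathered for several numerical variables in one pass. The following is only a minimal sketch: the `toy` data frame and its values are made up for illustration and are not the survey data.

```r
# Hypothetical toy data standing in for two numerical columns of DS0012.
toy <- data.frame(
  mass   = c(50, 60, 70, 80, 90),            # kg
  height = c(1.50, 1.60, 1.70, 1.80, 1.90)   # m
)

# Apply mean, median and the quartiles to every column at once.
# sapply() returns a matrix with one column per variable.
stats <- sapply(toy, function(x) {
  c(mean = mean(x), median = median(x), quantile(x, c(0.25, 0.75)))
})

print(round(stats, 2))
```

Each row of `stats` is one statistic and each column is one variable, so a single value can be picked out with, for example, `stats["median", "mass"]`.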
It gets a little painful to type out the variable name every time, but we can attach the DS0012 variable to R’s search path, which makes things much more compact.
> attach(DS0012)
> mean(BMI)
[1] 25.7057
That’s better. We can also get a table of counts for an individual categorical variable.
> table(age.category)
age.category
   child teenager    adult   mature   senior
    2220      757     2105     1793     1986
These are the same kind of counts that summary() reports for a categorical variable: children make up the largest portion of the sample, followed by adults and then seniors. Teenagers form the smallest group. What about generating a contingency table which cross-tabulates two categorical variables?
> table(age.category, BMI.category)
            BMI.category
age.category underweight normal overweight obese
    child           1537    519        117    47
    teenager         111    390        143   113
    adult             50    765        630   660
    mature            31    421        638   703
    senior            30    494        732   730
Now that is interesting: it seems that the majority of children in the data (1537 of 2220, about 69%) are underweight. Should we be concerned? No, the interpretation of BMI for children is different: the nominal thresholds between the categories no longer apply, and BMI is instead compared to typical values for children of similar age. Among teenagers, just over half of the sample have normal BMIs. Normal is also the most common single category for adults, although the adult overweight and obese categories are already well populated and together outnumber it. In the mature and senior portions of the sample, BMIs more often indicate overweight or obese.
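Comparisons like these are easier to check with row proportions than with raw counts. The sketch below rebuilds the two-way table by hand from the counts printed above (the names `counts` and `props` are introduced here purely for illustration) and converts each age group to proportions with `prop.table()`.

```r
# Counts copied from the age.category by BMI.category table.
# The senior "normal" count is 494, which agrees with the three-way
# table by gender (227 + 267).
counts <- matrix(
  c(1537, 519, 117,  47,
     111, 390, 143, 113,
      50, 765, 630, 660,
      31, 421, 638, 703,
      30, 494, 732, 730),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    age.category = c("child", "teenager", "adult", "mature", "senior"),
    BMI.category = c("underweight", "normal", "overweight", "obese")
  )
)

# margin = 1 divides each row by its total, so every row sums to 1.
props <- prop.table(counts, margin = 1)

print(round(props, 3))
```

With the data frame attached as in this post, the equivalent call on the real data would presumably be `prop.table(table(age.category, BMI.category), 1)`.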
Finally, let’s generate a three way contingency table of BMI, age and gender.
> (bmi.age.gender = table(BMI = BMI.category, age = age.category, gender))
, , gender = M

             age
BMI           child teenager adult mature senior
  underweight   818       67    14     10     15
  normal        259      199   393    184    227
  overweight     52       75   354    374    390
  obese          23       53   291    313    337

, , gender = F

             age
BMI           child teenager adult mature senior
  underweight   719       44    36     21     15
  normal        260      191   372    237    267
  overweight     65       68   276    264    342
  obese          24       60   369    390    393
It’s a little difficult to make sense of all that, but as we will see later on, there are great tools for understanding the contents of multiway contingency tables.
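One simple way to get a handle on a three-way table in the meantime is to collapse it over one dimension or to print it flat. The following sketch re-enters the counts shown above as a plain array (`tab3`, `male` and `female` are names made up for this example) and sums over gender, which should recover the earlier two-way table.

```r
bmi <- c("underweight", "normal", "overweight", "obese")
age <- c("child", "teenager", "adult", "mature", "senior")

# Counts copied from the three-way table, entered one age group
# (one column) at a time, which is the order R uses to fill arrays.
male   <- c(818, 259,  52,  23,   67, 199,  75,  53,   14, 393, 354, 291,
             10, 184, 374, 313,   15, 227, 390, 337)
female <- c(719, 260,  65,  24,   44, 191,  68,  60,   36, 372, 276, 369,
             21, 237, 264, 390,   15, 267, 342, 393)

tab3 <- array(c(male, female), dim = c(4, 5, 2),
              dimnames = list(BMI = bmi, age = age, gender = c("M", "F")))

# Sum over the third dimension (gender) to get BMI by age.
bmi.by.age <- apply(tab3, c(1, 2), sum)
print(bmi.by.age)

# Totals per gender, which should match the counts in summary().
gender.totals <- apply(tab3, 3, sum)
print(gender.totals)

# ftable() prints the same counts as a single flat table.
print(ftable(as.table(tab3), row.vars = c("gender", "age")))
```

Applied to the `bmi.age.gender` object created above, the same `apply()` and `ftable()` calls should work directly, since a table is stored as an array of counts.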
Right, that has given us a general feel for what the data looks like. The next step is to generate some plots.
The last thing that we need to do is detach the DS0012 data frame from the search path again.
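As a small illustration of the attach and detach cycle, here is a sketch on a made-up data frame (`toy` is hypothetical and stands in for DS0012). It also shows `with()`, which evaluates an expression inside a data frame without touching the search path at all.

```r
# Hypothetical toy data frame standing in for DS0012.
toy <- data.frame(BMI = c(18.2, 24.9, 31.4))

attach(toy)                     # columns are now visible by name
m.attached <- mean(BMI)
detach(toy)                     # remove the data frame from the search path

# After detaching, "toy" is no longer listed in the search path.
still.attached <- "toy" %in% search()

# with() gives the same compact syntax without attaching anything.
m.with <- with(toy, mean(BMI))

print(c(attached = m.attached, with = m.with))
```

Detaching matters because an attached data frame stays on the search path for the rest of the session, where its column names can mask, or be masked by, other objects of the same name.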