Categorical data arise when a population consists of some number of subpopulations and we record only the subpopulation membership of selected individuals. In such cases the basic data summary is a frequency table that counts the number of individuals within each category. If there is more than one set of categories, then we can summarize the data using a multi-dimensional frequency table. For example, here is part of a dataset that records the hair color, eye color, and sex of a group of 592 students.
Hair    Eye     Sex
Black   Brown   Female
Red     Green   Male
Blond   Blue    Male
Brown   Hazel   Female

Sometimes numerical codes are used in place of names, but in that case it is important to remember that these codes are not quantitative values, just labels. The frequency table for hair color in this dataset is:
Black  Brown  Red  Blond
  108    286   71    127
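Frequency tables like this one are produced in R by tabulating the raw records. The following is a minimal sketch, not part of the original notes: the data frame `students` is a hypothetical name, and it holds only the four example records shown above rather than the full data set of 592 students.

```r
# Toy records: the four example rows shown above (not the full data set).
# The name `students` is an assumption made for this illustration.
students <- data.frame(
  Hair = c("Black", "Red", "Blond", "Brown"),
  Eye  = c("Brown", "Green", "Blue", "Hazel"),
  Sex  = c("Female", "Male", "Male", "Female")
)

# One-way frequency table: number of records in each category of Sex
sex_counts <- table(students$Sex)
print(sex_counts)

# Relative frequencies: counts divided by the total number of records
sex_props <- prop.table(sex_counts)
print(sex_props)

# Two-way frequency table: counts for each Hair x Sex combination
hair_by_sex <- table(students$Hair, students$Sex)
print(hair_by_sex)
```

Because the categories are stored as text labels, `table()` treats them as names and never as numbers. The same holds when numerical codes are used, provided they are converted with `factor()` first.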
The basic graphical tool for categorical data is the barplot. It draws one bar per category, with the height of each bar equal to the frequency or relative frequency of that category. Barplots are more effective than pie charts because we can compare the heights of rectangles more readily than the angles in a pie.
If a second categorical variable is also observed, for example hair color and gender, the most effective display is a barplot with side-by-side bars: the bars for the levels of the first variable are plotted contiguously within each level of the second variable, with space between the groups. This makes it easy to compare each level of the first variable across the levels of the second. For example, the following plot shows how hair color is distributed for a sample of males and females. A comparison of the relative frequencies for males and females shows that a higher proportion of females have blond hair and a somewhat lower proportion of females have black or brown hair.
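The comparison of relative frequencies can also be checked numerically. The sketch below assumes the built-in R data set `HairEyeColor` (the data used in the code later in this section). It uses `prop.table()`, which is one of several ways to compute column proportions, and the names `HairSex` and `HairSexP` are chosen here for illustration.

```r
# Hair color (rows) by sex (columns), summed over eye color
HairSex <- margin.table(HairEyeColor, c(1, 3))
print(HairSex)

# Relative frequencies within each sex: every column sums to 1
HairSexP <- prop.table(HairSex, margin = 2)
print(round(HairSexP, 3))

# Side-by-side bars: one group of four hair-color bars per sex
barplot(HairSexP, beside = TRUE,
        col = c("black", "saddlebrown", "red", "yellow"),
        legend.text = TRUE,
        main = "Relative Frequencies of Hair Color")
```

Dividing by the column totals before plotting is what makes the two groups comparable, since the sample contains different numbers of males and females.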
We can also display the relationship between hair and eye color using a 2-dimensional frequency table and barplot. The areas of the rectangles in this plot represent the relative frequency of the corresponding category combination.
            Eye
Hair    Brown  Blue  Hazel  Green
Black      68    20     15      5
Brown     119    84     54     29
Red        26    17     14     14
Blond       7    94     10     16
Are hair color and eye color related? Although we will consider this question in detail later, we can think about how to interpret this question here. First note that a total of 108 people have black hair, 68 of whom also have brown eyes. That is, 63% (68/108) of those with black hair also have brown eyes. In probability theory this ratio is referred to as a conditional probability and would be expressed as

$$P(\text{brown eyes} \mid \text{black hair}) = \frac{68}{108} \approx 0.63.$$
Note the correspondence between the structure of the sentence, "63% of those with black hair also have brown eyes," and the arithmetic that goes with it. The reference group for this percentage is defined by the prepositional phrase, "of those with black hair," and the count for this group is the denominator. The verb plus object in this sentence is "have brown eyes." The count of people who have brown eyes within the reference group (those with black hair) is the numerator of this percentage. So those who are counted for the numerator must satisfy both requirements: have brown eyes and have black hair. The corresponding probability statement is

$$P(\text{brown eyes} \mid \text{black hair}) = \frac{P(\text{brown eyes and black hair})}{P(\text{black hair})} = \frac{68/592}{108/592} = \frac{68}{108}.$$
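The numerator and denominator described here can be read directly off the two-way table. This is a minimal sketch assuming the built-in R data set `HairEyeColor`; the variable names are chosen for illustration only.

```r
# Two-way table of hair color (rows) by eye color (columns), summed over sex
HairEye <- margin.table(HairEyeColor, c(1, 2))

# Denominator: the reference group, everyone with black hair
n_black <- sum(HairEye["Black", ])

# Numerator: members of the reference group who also have brown eyes
n_black_brown <- HairEye["Black", "Brown"]

# Conditional relative frequency of brown eyes among those with black hair
p_brown_given_black <- n_black_brown / n_black
print(round(p_brown_given_black, 2))
```

Changing the reference group changes only the row that is summed. For example, replacing "Black" with "Blond" gives the proportion of blond-haired people who have brown eyes.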
The total counts for eye color are:

Brown  Blue  Hazel  Green
  220   215     93     64

so 220 of the 592 people in this data set have brown eyes. That is, 220/592 = 37% of all people in this data set have brown eyes, but brown eyes occur much more frequently among people with black hair (63%). The corresponding probability statements are

$$P(\text{brown eyes}) = \frac{220}{592} \approx 0.37, \qquad P(\text{brown eyes} \mid \text{black hair}) = \frac{68}{108} \approx 0.63.$$
This shows that the percentage of people who have brown eyes depends on whether or not they have black hair. If the two percentages had been equal, that is, if 37% of people with black hair also had brown eyes, then we would say that having brown eyes does not depend on whether or not a person has black hair. Therefore, for those two outcomes to be independent, there should have been about 40 people (37% of 108) with black hair and brown eyes. This is the expected count under the assumption of independence between brown eyes and black hair. We can do the same for each combination of categories in this table to give the expected frequencies:
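The expected count for a single cell follows from the arithmetic just described: the overall proportion with brown eyes applied to the number of people with black hair. A minimal sketch, assuming the built-in R data set `HairEyeColor` and illustrative variable names:

```r
# Two-way table of hair color (rows) by eye color (columns), summed over sex
HairEye <- margin.table(HairEyeColor, c(1, 2))
N <- sum(HairEye)                        # total number of people

# Overall (marginal) proportion with brown eyes
p_brown <- sum(HairEye[, "Brown"]) / N

# Number of people with black hair
n_black <- sum(HairEye["Black", ])

# Expected count if brown eyes were independent of black hair
expected_black_brown <- n_black * p_brown
print(round(expected_black_brown, 2))

# Observed count for comparison
print(HairEye["Black", "Brown"])
```

Written out, the expected count is (row total × column total) / N, which is why the full table of expected counts can be obtained from the two sets of marginal totals alone.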
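         Brown    Blue  Hazel  Green
Black    40.14   39.22  16.97  11.68
Brown   106.28  103.87  44.93  30.92
Red      26.39   25.79  11.15   7.68
Blond    47.20   46.12  19.95  13.73

If all of the observed counts had been equal to these expected counts, then hair and eye color would be completely independent. Obviously that is not the case. We can define a measure of distance between the observed counts and the expected counts under the assumption of independence by

$$D = \sum_{\text{cells}} \frac{(O - E)^2}{E},$$

where $O$ is the observed count and $E$ is the expected count in a cell, and the sum runs over all cells of the table. The contribution of each cell to this distance is: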
            Eye
Hair    Brown   Blue  Hazel  Green
Black   19.35   9.42   0.23   3.82
Brown    1.52   3.80   1.83   0.12
Red      0.01   2.99   0.73   5.21
Blond   34.23  49.70   4.96   0.38

Note that the combinations of blond hair with brown eyes and blond hair with blue eyes are the greatest contributors to the distance of these counts from independence.
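As a worked check of the first entry of this table, assume each cell's contribution is the squared difference between the observed count $O$ and the expected count $E$, divided by $E$ (the form computed in the R code in this section). The cell for black hair and brown eyes then gives:

```latex
\[
E = \frac{108 \times 220}{592} \approx 40.135, \qquad
\frac{(O - E)^2}{E} = \frac{(68 - 40.135)^2}{40.135}
= \frac{776.458}{40.135} \approx 19.35 .
\]
```

Using the unrounded expected count matters here: the rounded value 40.14 gives about 19.34 instead.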
data(HairEyeColor)   # load HairEyeColor data set
HairEyeColor         # this is a 3-d array
HairEye = apply(HairEyeColor, c(1,2), sum)   # sum over gender, keep dimensions 1,2
HairEye
Hair = apply(HairEye, 1, sum)          # get totals for hair color
Eye = apply(HairEye, 2, sum)           # get totals for eye color
Gender = apply(HairEyeColor, 3, sum)   # get totals for gender

# graphics
Hair.color = c("black","saddlebrown","red","yellow")
Eye.color = c("saddlebrown","blue","yellow4","green")
barplot(Hair, col=Hair.color)
title("Barplot of Hair Color")

# barplot is better than pie chart
par(mfrow=c(2,1))
barplot(Hair, col=Hair.color)
title("Barplot of Hair Color")
pie(Hair, col=Hair.color)
title("Pie Chart of Hair Color")
par(mfrow=c(1,1))

# compare males and females
HairGender = margin.table(HairEyeColor, c(1, 3))
print(HairGender)
barplot(HairGender, col=Hair.color, main="Hair Color")
barplot(HairGender, col=Hair.color, legend.text=TRUE, xlim=c(0,3), main="Hair Color")

# relative frequency
HairGenderP = scale(HairGender, scale=Gender, center=FALSE)
print(HairGenderP)
barplot(HairGenderP, col=Hair.color, legend.text=TRUE, xlim=c(0,3), main="Relative Frequencies of Hair Color")
barplot(HairGenderP, beside=TRUE, col=Hair.color, legend.text=TRUE, main="Relative Frequencies of Hair Color")

# find distances from independence
# there are several ways to compute R*C. The easiest way is to use the
# function outer() which is a generalized outer product
# this function takes two vectors as arguments and generates a matrix
# with number of rows = length of first argument and
# number of columns = length of second argument.
# Elements of the matrix are obtained by multiplying each element of the first
# vector by each element of the second vector.
N = sum(HairEyeColor)
ExpHairEye = outer(Hair, Eye)/N
round(ExpHairEye, 2)   # note that outer preserves names of Hair and Eye

# now get distance from independence
D = ((HairEye - ExpHairEye)^2)/ExpHairEye
round(D, 2)   # gives contribution from each cell
sum(D)        # print total distance

# now use R function paste to combine text and value
paste("Distance of Hair-Eye data from Independence =", round(sum(D),2))
# if round is not used then lots of decimal places will be printed!
paste("Distance of Hair-Eye data from Independence =", sum(D))

We will see later that this data is very far from independence!
R has several ways to save graphics to files so they can be added to a document. After a graphic is created in RStudio, use the Export menu to interactively save the graphic as an image file. The default file type is PNG, which is the recommended image format. Be sure to change the name of the image file from the default name Rplot.png. Another way is to use the graphics function png() to specify the file name along with options that specify the size of the image in pixels. After all commands for a particular graphic have been entered, finish the graphic by entering

graphics.off()

Try to use informative file names for saved graphics. The following script creates text output and image files for the hair-eye color example. These can be imported into a document processor such as Word.
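Independently of that script, the png() workflow described above can be sketched in a few lines. This is an illustration under stated assumptions, not the author's script: the output location (`tempdir()`), the file name `hair-color-barplot.png`, and the 800 by 600 pixel size are all hypothetical choices.

```r
# Hair color totals from the built-in HairEyeColor data set
Hair <- apply(HairEyeColor, 1, sum)

# Hypothetical output file with an informative name; adjust the folder as needed
out_file <- file.path(tempdir(), "hair-color-barplot.png")

# Open a PNG graphics device; width and height are in pixels by default
png(filename = out_file, width = 800, height = 600)

# All plotting commands now draw into the file instead of the screen
barplot(Hair, col = c("black", "saddlebrown", "red", "yellow"),
        main = "Barplot of Hair Color")

# Close all open graphics devices so the file is written to disk
graphics.off()

# Confirm that the image file was created
print(file.exists(out_file))
```

Nothing appears in the plot window while the png device is open, and the file is only complete once the device has been closed.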