13.3 Example: Eye-Hair
13.3.1 Description of Data
The data set given in Table 13.1 is a contingency table
of hair colors (4 categories) and eye colors
(4 categories) for 592 women (Lebart, L., Morineau, A., and Piron, M.; 1995).
Table 13.1: Contingency table for eye-hair color data.

EYE         |           HAIR COLOR
            | black | brown | red | blond | total
dark brown  |    68 |   119 |  26 |     7 |   220
light brown |    15 |    54 |  14 |    10 |    93
green       |     5 |    29 |  14 |    16 |    64
blue        |    20 |    84 |  17 |    94 |   215
total       |   108 |   286 |  71 |   127 |   592
13.3.2 Calling the Quantlet
The following XploRe code shows how to run correspondence analysis with the quantlet corresp.
library("stats")
corresp("e.dat","null","null","EYE-HAIR","eltxt.dat",
"ectxt.dat","null","null","null")
In this example, the active data file is e.dat. It contains the eye-hair contingency table given in Table 13.1:
68 119 26 7
15 54 14 10
5 29 14 16
20 84 17 94
Row labels are given in the file eltxt.dat:
dark-brown
light-brown
green
blue
Column labels are in the file ectxt.dat:
BLACK
BROWN
RED
BLOND
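The three input files can also be generated programmatically. The following is a minimal Python sketch (not XploRe code), assuming only the layout shown above: whitespace-separated counts with one table row per line, and one label per line. The helper names render_inputs and write_inputs are hypothetical.

```python
from pathlib import Path

# Counts from Table 13.1: rows are eye colors, columns are hair colors.
TABLE = [
    [68, 119, 26, 7],
    [15, 54, 14, 10],
    [5, 29, 14, 16],
    [20, 84, 17, 94],
]
ROW_LABELS = ["dark-brown", "light-brown", "green", "blue"]
COL_LABELS = ["BLACK", "BROWN", "RED", "BLOND"]


def render_inputs():
    """Return the text of the three input files, keyed by file name."""
    data_lines = [" ".join(str(count) for count in row) for row in TABLE]
    return {
        "e.dat": "\n".join(data_lines) + "\n",
        "eltxt.dat": "\n".join(ROW_LABELS) + "\n",
        "ectxt.dat": "\n".join(COL_LABELS) + "\n",
    }


def write_inputs(directory):
    """Write the three files into an existing directory (hypothetical helper)."""
    for name, text in render_inputs().items():
        (Path(directory) / name).write_text(text)
```

Calling write_inputs(".") would place the files in the current directory, from where the corresp call above reads them by name.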
13.3.3 Documentation of Results
The output of CA from corre01.xpl is shown in the output window. In this example, we get three factors in all, that is, three eigenvalues and three coordinates for each row (column) item. A 4 x 4 table cannot have more than 4 - 1 = 3 nontrivial factors.
13.3.4 Eigenvalues
The eigenvalues give the part of total variation recovered on the first, second, and subsequent factors. They help in choosing the number of factors (or axes, in the geometrical representation) to retain.
[1,] EIGENVALUES AND PERCENTAGES
Contents of seig
[1,] 0.2088 89.3727 89.3727
[2,] 0.0222 9.5149 98.8876
[3,] 0.0026 1.1124 100.0000
We see that the first factor alone explains nearly 90% of the total variation in this contingency table, which equals 0.2336 (the sum of the three eigenvalues).
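The link between the eigenvalues and Pearson's chi-square can be checked numerically. The following Python sketch (not XploRe code) computes chi-square divided by the grand total from Table 13.1 and compares it with the sum of the printed eigenvalues. The eigenvalues are copied from the output above, so they are rounded to four decimals and the percentages agree with the printed ones only up to rounding.

```python
# Counts from Table 13.1 (rows: eye colors, columns: hair colors).
table = [
    [68, 119, 26, 7],
    [15, 54, 14, 10],
    [5, 29, 14, 16],
    [20, 84, 17, 94],
]
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

# Pearson chi-square: sum over cells of (observed - expected)^2 / expected,
# where expected = row total * column total / grand total.
chi_square = 0.0
for i, row in enumerate(table):
    for j, observed in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi_square += (observed - expected) ** 2 / expected

total_variation = chi_square / n

# Eigenvalues as printed in the output (rounded to four decimals).
eigenvalues = [0.2088, 0.0222, 0.0026]
percentages = [100 * ev / sum(eigenvalues) for ev in eigenvalues]
cumulative = [sum(percentages[: k + 1]) for k in range(len(percentages))]

print(round(total_variation, 4), round(sum(eigenvalues), 4))
print([round(p, 2) for p in percentages])
```

Both printed totals should agree to about three decimals, which is what one expects if the eigenvalues add up to chi-square divided by the grand total.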
13.3.5 Contributions
From the formula of Pearson's chi-square (here divided by the grand total, 592) one can decompose the total variation additively across row (resp. column) items. This yields the global row (resp. column) contributions to total variation. In the geometrical representation of row (resp. column) profiles in a Euclidean space, taking the marginal row (resp. column) profile as the origin, the global contribution of a row (resp. column) is equal to its squared distance to the origin times its relative weight (the share of the grand total falling in that row or column). The squared distance itself shows how far a row item deviates from what is expected under independence.
[1,] "Row relative weights and distances to the origin"
Contents of spdai
[1,] 0.3716 0.0206
[2,] 0.1571 0.0119
[3,] 0.1081 0.0159
[4,] 0.3632 0.0228
[1,] Column relative weights and distances to the origin
Contents of spdaj
[1,] 0.1824 0.0227
[2,] 0.4831 0.0066
[3,] 0.1199 0.0146
[4,] 0.2145 0.0345
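The row weights and distances can be reproduced from Table 13.1. The following Python sketch (not XploRe code) computes the relative weights, the squared chi-square distances of the row profiles to the marginal profile, and the global contributions. One assumption is made loudly here: the printed distances appear to equal the chi-square distance divided by the square root of the grand total. This scaling is inferred from the numbers in the output, not taken from the corresp documentation.

```python
import math

# Counts from Table 13.1 (rows: eye colors, columns: hair colors).
table = [
    [68, 119, 26, 7],
    [15, 54, 14, 10],
    [5, 29, 14, 16],
    [20, 84, 17, 94],
]
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

# Relative weight of each row, and the marginal profile used as the origin.
row_weights = [r / n for r in row_totals]
marginal_profile = [c / n for c in col_totals]

# Squared chi-square distance between each row profile and the origin.
row_sq_dist = []
for i, row in enumerate(table):
    profile = [count / row_totals[i] for count in row]
    row_sq_dist.append(
        sum((p - m) ** 2 / m for p, m in zip(profile, marginal_profile))
    )

# Global contribution of a row = relative weight * squared distance.
global_contrib = [w * d2 for w, d2 in zip(row_weights, row_sq_dist)]

# ASSUMPTION (inferred from the output): printed distance = sqrt(d2 / n).
scaled_dist = [math.sqrt(d2 / n) for d2 in row_sq_dist]

for w, d in zip(row_weights, scaled_dist):
    print(round(w, 4), round(d, 4))
```

Under that assumption the printed pairs match the contents of spdai, and the global contributions add up to the total variation. Applying the same steps to the transposed table gives the column values.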
It is also useful to know how much each row (or column) contributes to the variation pertaining to a given factor. These specific contributions help interpret the factor in terms of contrasts between row (or column) items. They are usually given as percentages of the total variation of the factor (i.e. of the corresponding eigenvalue).
[1,] Coordinates of the columns
Contents of scoordj
[1,] -0.0207 -0.0088 0.0023
[2,] -0.0061 0.0013 -0.0020
[3,] -0.0053 0.0131 0.0034
[4,] 0.0343 -0.0029 0.0007
[1,] Contributions of the columns
Contents of scontrj
[1,] 22.2463 37.8774 21.6330
[2,] 5.0860 2.3194 44.2838
[3,] 0.9637 55.1305 31.9125
[4,] 71.7039 4.6727 2.1706
The coordinates on the first axis show that blond hair color (4th column item) is opposed to all the other hair colors, in particular to black hair color (1st column item). The first factor can essentially be explained by a strong contrast between blond and black hair in terms of eye color (respective contributions 71.7% and 22.2%).
The second axis (its share of total variation, 9.5%, is nearly ten times smaller than the 89.4% of the first axis) is mainly constructed by red hair color (55.1%) as opposed to black hair color (37.9%). The third factor accounts for a negligible share of total variation (1.1%).
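The contribution of an item to a factor is its relative weight times its squared coordinate on that factor, expressed as a percentage of the eigenvalue. The following Python sketch (not XploRe code) recomputes the column contributions from the printed weights, coordinates, and eigenvalues. It assumes that the printed coordinates are scaled by one over the square root of the grand total, hence the factor n in the formula. That scaling is inferred from the output, not documented here, and agreement is limited by the four-decimal rounding of the inputs.

```python
# Values copied from the output above (rounded to four decimals).
n = 592
eigenvalues = [0.2088, 0.0222, 0.0026]
col_weights = [0.1824, 0.4831, 0.1199, 0.2145]
col_coords = [
    [-0.0207, -0.0088, 0.0023],   # black
    [-0.0061, 0.0013, -0.0020],   # brown
    [-0.0053, 0.0131, 0.0034],    # red
    [0.0343, -0.0029, 0.0007],    # blond
]
printed_contrib = [
    [22.2463, 37.8774, 21.6330],
    [5.0860, 2.3194, 44.2838],
    [0.9637, 55.1305, 31.9125],
    [71.7039, 4.6727, 2.1706],
]


def contributions(weights, coords, eigenvalues, n):
    """Percent contribution of each item to each factor.

    ASSUMPTION: coords are printed as (principal coordinate / sqrt(n)),
    so the squared coordinate is multiplied by n before weighting.
    """
    return [
        [100 * w * n * c ** 2 / ev for c, ev in zip(point, eigenvalues)]
        for w, point in zip(weights, coords)
    ]


recomputed = contributions(col_weights, col_coords, eigenvalues, n)
for values in recomputed:
    print([round(v, 1) for v in values])
```

The recomputed values stay within about half a percentage point of scontrj, and each factor's contributions add up to roughly 100%. The same function applied to the row weights and row coordinates should reproduce scontri in the same way.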
[1,] Coordinates of the rows
Contents of scoordi
[1,] -0.0202 -0.0036 0.0009
[2,] -0.0087 0.0069 -0.0041
[3,] 0.0066 0.0139 0.0036
[4,] 0.0225 -0.0034 -0.0002
[1,] Contributions of the rows
Contents of scontri
[1,] 43.1157 13.0425 6.6796
[2,] 3.4010 19.8040 61.0856
[3,] 1.3549 55.9095 31.9248
[4,] 52.1284 11.2440 0.3100
For the row items, the first axis is constructed almost solely by the eye colors dark brown (1st row item) and blue (4th row item), with respective contributions of 43.1% and 52.1%. The coordinates show that they are opposed in terms of hair profile. The second axis is mainly due to green eye color (3rd row item).
The global contribution of a given row (resp. column) may itself be additively decomposed across the factors into terms called squared correlations (by analogy with PCA) when expressed as fractions of that global contribution. Squared correlations are useful to determine how well the variation of each row (resp. column) is recovered on a factor or on a restricted number of factors (or axes in a geometrical representation). This helps guard against illusory proximities of points (row or column profiles) in mappings.
[1,] Squared correlations of the rows
Contents of scorri
[1,] 0.9670 0.0311 0.0019
[2,] 0.5424 0.3363 0.1213
[3,] 0.1759 0.7726 0.0516
[4,] 0.9775 0.0224 0.0001
[1,] Squared correlations of the columns
Contents of scorrj
[1,] 0.8380 0.1519 0.0101
[2,] 0.8644 0.0420 0.0937
[3,] 0.1333 0.8118 0.0549
[4,] 0.9927 0.0069 0.0004
From these correlations it can be inferred, for instance, that blond hair color (squared correlation 0.9927) is recovered almost exclusively on factor 1.
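Because the three factors together recover all of the variation in this 4 x 4 table, the squared correlations can be obtained from the coordinates alone: each is a squared coordinate divided by the sum of the item's squared coordinates over all factors. The following Python sketch (not XploRe code) illustrates this with the printed coordinates. The result does not depend on how the coordinates are scaled, but it is limited by their four-decimal rounding.

```python
# Coordinates copied from the output above (rounded to four decimals).
row_coords = [
    [-0.0202, -0.0036, 0.0009],   # dark brown
    [-0.0087, 0.0069, -0.0041],   # light brown
    [0.0066, 0.0139, 0.0036],     # green
    [0.0225, -0.0034, -0.0002],   # blue
]
col_coords = [
    [-0.0207, -0.0088, 0.0023],   # black
    [-0.0061, 0.0013, -0.0020],   # brown
    [-0.0053, 0.0131, 0.0034],    # red
    [0.0343, -0.0029, 0.0007],    # blond
]


def squared_correlations(coords):
    """Share of each item's squared distance to the origin carried by each factor."""
    result = []
    for point in coords:
        total = sum(c ** 2 for c in point)
        result.append([c ** 2 / total for c in point])
    return result


row_corr = squared_correlations(row_coords)
col_corr = squared_correlations(col_coords)

for values in row_corr:
    print([round(v, 3) for v in values])
```

Each row of the result sums to one. A point whose first two squared correlations sum to a low value is poorly represented in the plane of the first two axes, so its apparent proximity to other points there should be read with caution.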
13.3.6 Biplots
A simultaneous representation of row and column items in the same mapping has some interesting interpretational aspects. When a row and a column are represented by points in the same (resp. opposite) direction with respect to the origin, it means that the frequency in the corresponding cell is above (resp. below) the value expected under independence. This reading is valid only if the sum of the squared correlations on the first two factors is sufficiently high for each of the two items.
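This reading rule can be illustrated on two cells of Table 13.1. The following Python sketch (not XploRe code) compares observed counts with the counts expected under independence, and checks the direction of the corresponding points using the printed coordinates on the first two axes. The helper same_direction is hypothetical. It uses the sign of the dot product as a simple stand-in for "same direction with respect to the origin".

```python
# Counts from Table 13.1 (rows: eye colors, columns: hair colors).
table = [
    [68, 119, 26, 7],    # dark brown
    [15, 54, 14, 10],    # light brown
    [5, 29, 14, 16],     # green
    [20, 84, 17, 94],    # blue
]
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Printed coordinates on the first two axes (rounded to four decimals).
dark_brown = (-0.0202, -0.0036)   # row item 1
blue = (0.0225, -0.0034)          # row item 4
blond = (0.0343, -0.0029)         # column item 4


def same_direction(p, q):
    """True if the two points lie on the same side of the origin (positive dot product)."""
    return p[0] * q[0] + p[1] * q[1] > 0


# blue eyes / blond hair: same direction, observed above expected
print(same_direction(blue, blond), table[3][3], round(expected[3][3], 1))
# dark brown eyes / blond hair: opposite direction, observed below expected
print(same_direction(dark_brown, blond), table[0][3], round(expected[0][3], 1))
```

All three items used here have squared correlations above 0.96 on the first factor alone, so the condition on the quality of representation is met.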
Results of the analysis can be visualized in different graphs:
We can visualize the configuration of the items in any two axes. The importance of an axis is proportional to the variation it explains, which is measured by its eigenvalue. We can select any two axes for display. If there are more than five axes, only the first five are available to choose from.
We can select different items to display in graphs.
The requested graph (corre01.xpl) is shown in Figure 13.1.
Figure 13.1: CA for the eye-hair example.
The graph of the first two coordinates shows the suggestive features of a simultaneous representation of row and column items in the same mapping. It lets us interpret the proximities or distances between items of the same set together with their associations with items of the other set.
It is possible to project additional rows or columns onto the various factors without having these elements enter the construction of the factors, as opposed to the so-called active items. This may be useful for various reasons: to get some exogenous explanations of features revealed in the data, to set aside a much too influential row or column item (in particular an item with low frequencies), to see the positions of several items forming a natural group, etc.
13.3.7 Brief Remark
Why is the position of hair color blond more extreme than that of eye color blue on the first, dominant axis? Because blond hair is much more strongly characterized by blue eyes than blue eyes are by blond hair: as can be seen from the data, 74% of blond people have blue eyes, while only 44% of people with blue eyes have blond hair.
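The two percentages follow directly from the column and row profiles of Table 13.1, as this short Python sketch (not XploRe code) shows.

```python
# Counts from Table 13.1 (rows: eye colors, columns: hair colors).
table = [
    [68, 119, 26, 7],    # dark brown
    [15, 54, 14, 10],    # light brown
    [5, 29, 14, 16],     # green
    [20, 84, 17, 94],    # blue
]
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

blue_and_blond = table[3][3]

# Share of blue eyes within the blond column (column profile).
blue_among_blond = 100 * blue_and_blond / col_totals[3]
# Share of blond hair within the blue row (row profile).
blond_among_blue = 100 * blue_and_blond / row_totals[3]

print(round(blue_among_blond), round(blond_among_blue))
```

The blond column profile is thus further from the marginal profile than the blue row profile is, which is consistent with its more extreme position on the first axis.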