3. Multivariate Graphics


plot (x1 {, x2 {, ... {, x5}}})
plots the three-dimensional data sets x1, ..., x5
gr = grsurface (x {, col})
generates surface from the function f(x,y)
gr = grcontour2 (x {, c {, col}})
generates the contour lines f(x,y)=c
gr = grcontour3 (x {, c {, col}})
generates the contour lines f(x,y,z)=c
gr = grsunflower (x {, d {, o {, col}}})
generates a sunflower plot
gr = grlinreg (x {, col})
generates the linear regression line
gr = grlinreg2 (x {, n {, col}})
generates the linear regression plane
plot2 (x {, prep {, col}})
plots two variables
plotstar (x {, prep {, col}})
plots a star diagram
plotscml (x {, varnames})
plots a scatter-plot matrix
plotandrews (x {, prep {, col}})
plots Andrews curves
plotpcp (x {, prep {, col}})
plots parallel coordinates

The optional parameter col allows us to produce a graphical object in a color other than black. For details, see Subsection 4.3.


3.1 Three-Dimensional Plots

Up to now, we have just generated graphical objects or data sets and plotted them in a two-dimensional plot. Sometimes the analysis of data is easier if we can show them as three-dimensional data. Rotating the data point cloud will give us more insight into the data. We can use the plot quantlet for this.

  library("plot")             ; loads library plot
  data = read ("bostonh")     ; reads Boston Housing data
  x = data[,6|13|14]          ; selects columns 6, 13 and 14
  plot(x)                     ; plots the data set
graph41.xpl

Figure: Variables 6 (RM), 13 (LSTAT) and 14 (MEDV) of the Boston Housing data plotted in a 3D scatter plot.
\includegraphics[scale=0.6]{grfig41.ps}

Now click with the mouse in the window and use the cursor keys to rotate the data set. Note that the only change we made for three-dimensional plotting is that the input matrix we use in plot consists of three vectors. We can apply the quantlet setmask here to color the data points.
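
For instance, the points could be colored red before plotting. The sketch below uses the related low-level quantlet setmaskp rather than setmask itself; the assumed argument order (data, color, point type, size) and the numeric codes are only illustrative and may need to be adapted:

  setmaskp(x, grc.col.red, 3, 8)   ; sets a red point mask (type and size codes are assumptions)
  plot(x)                          ; plots the colored data set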

We see in the plot of RM, LSTAT and MEDV a nonlinear relationship which may allow us to estimate the median house prices (MEDV) as a parametric function of percentage of lower status people (LSTAT) and average number of rooms (RM).


3.2 Surface Plots

Surfaces are the three-dimensional analogs of curves in two dimensions. Since we have already plotted three-dimensional data, we can imagine what we have to do: generate a data set, generate some lines, and plot them. The quantlet grsurface does this for data on a rectangular mesh.

  library ("plot")               ; loads library plot
  x0 = #(-3, -3)              
  h  = #(0.2, 0.2)
  n  = #(31, 31)
  x  = grid(x0, h, n)            ; generates a bivariate grid
  f  = exp(-(x[,1]^2+x[,2]^2)/1.5)/(1.5*pi)
                                 ; computes the density of a bivariate
                                 ;   normal with variance 0.75 and
                                 ;   zero correlation
  gr = grsurface(x~f)            ; generates surface
  plot(gr)                       ; plots the surface
graph42.xpl

Figure: Surface plot of the density of a bivariate normal distribution.
\includegraphics[scale=0.6]{grfig42.ps}

Most of the above program is used to generate the underlying grid and the density function. We may plot the data set itself by plot(x$\sim$f). The surface quantlet grsurface needs three pieces of information: the x- and y-coordinates of the grid and the function values f(xi,yj), passed as the three columns of its input matrix.


3.3 Contour Plots

Surfaces can be difficult to understand, even if we can rotate them. Contour plots, which show the contour lines f(x,y) = constant, are often easier to read.
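
For the density used in the program below, the contour lines can be computed explicitly: setting $f(x,y)=\exp\{-(x^2+y^2)/1.5\}/(1.5\pi)$ equal to a constant $c$ gives

\begin{displaymath}x^2 + y^2 = -1.5 \log(1.5\pi c), \end{displaymath}

i.e. the contours are concentric circles around the origin.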

  library ("plot")         ; loads the library plot
  x0 = #(-3, -3)              
  h  = #(0.2, 0.2)
  n  = #(31, 31)
  x  = grid(x0, h, n)      ; generates a bivariate grid
  f  = exp(-(x[,1]^2+x[,2]^2)/1.5)/(1.5*pi)
                           ; computes the density of a bivariate
                           ;    normal with variance 0.75 and
                           ;    zero correlation
  c  = 0.2*(1:4).*max(f)   ; selects contour lines at 20%, 40%,
                           ;    60% and 80% of the maximum density
  gr = grcontour2(x~f, c)  ; generates contours
  plot(gr)                 ; plots the contours
graph43.xpl

Figure: Contour plot of the density of a bivariate normal distribution.
\includegraphics[scale=0.6]{grfig43.ps}

The quantlet grcontour3 can be used to compute contour lines of a function f(x,y,z) = constant. This gives a two-dimensional surface in three-dimensional space. An example is the XploRe logo.
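
A minimal sketch of this idea (assuming that grid also accepts three-dimensional origin, step and count vectors, and that grcontour3 expects the grid coordinates and the function values as columns, analogously to grcontour2):

  x0 = #(-2, -2, -2)
  h  = #(0.2, 0.2, 0.2)
  n  = #(21, 21, 21)
  x  = grid(x0, h, n)            ; generates a trivariate grid
  f  = x[,1]^2+x[,2]^2+x[,3]^2   ; f(x,y,z) = x^2 + y^2 + z^2
  gr = grcontour3(x~f, 1)        ; the contour f(x,y,z) = 1 is the unit sphere
  plot(gr)                       ; plots the sphere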


3.4 Sunflower Plots

Sunflower plots avoid overplotting when we have many data points. We can consider a sunflower plot as a combination of a two-dimensional histogram with a contour plot. As with histograms, we must define a two-dimensional binwidth d and a two-dimensional origin o.
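
For example, a binwidth of 0.5 in each direction and the origin at (0, 0) could be requested as follows (the concrete values are only illustrative):

  gr = grsunflower(x, 0.5|0.5, 0|0)   ; binwidth 0.5, origin (0, 0)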

Let us compare a standard plot of 1000 bivariate normal data points with the corresponding sunflower plot.

  library ("graphic")          ; loads library graphic
  x  = normal(1000, 2)         ; generates bivariate normal data 
  d  = createdisplay(2,1)      ; creates a display with two plots
  show (d, 1, 1, x)            ; plots the original data
  gr = grsunflower(x)          ; generates sunflower plot
  show (d, 2, 1, gr)           ; plots the sunflower plot
graph44.xpl

Figure: Standard 2D plot and sunflower plot of a large random sample of a bivariate standard normal distribution.
\includegraphics[scale=0.6]{grfig44.ps}


3.5 Linear Regression

An important statistical task is to find a relationship between two variables. The most frequently used technique to quantify the relationship is least squares linear regression:

\begin{displaymath}\sum_{i=1}^n (y_i - b_0 - b_1 x_i)^2 \rightarrow \textrm{ minimal.} \end{displaymath}
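
The minimizing coefficients have the well-known closed form

\begin{displaymath}b_1 = \frac{\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^n (x_i-\bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}, \end{displaymath}

where $\bar{x}$ and $\bar{y}$ denote the sample means; these are the coefficients of the regression line generated by grlinreg.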

To understand (graphically) how well the linear regression picks up the true relationship between two variables, we show the data points and the regression line in one plot:
  library("plot")                ; loads library plot
  data = read ("bostonh")        ; reads Boston Housing data
  x0 = data[,13:14]              ; selects columns 13 and 14
  l0 = grlinreg(x0)              ; generates regression line
  x1 = log(data[,13:14])         ; logarithm of columns 13 and 14
  l1 = grlinreg(x1)              ; generates regression line
  d  = createdisplay(1,2)        ; creates display with two plots
  show (d, 1, 1, x0, l0)         ; plots data and regression l0
  show (d, 1, 2, x1, l1)         ; plots data and regression l1
graph45.xpl

Figure: Linear regressions of the 14th variable (MEDV) by the 13th variable (LSTAT), left: original variables, right: both variables transformed by logarithm.
\includegraphics[scale=0.6]{grfig45.ps}

We see that the regression line (median house price = b0 + b1 percentage of lower status people) does not fit the data well. Either a transformation of the data or a nonlinear regression technique seems to be appropriate. In our example we have taken logarithms of both x and y. In the logarithmic scale, the fitted relation is
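
\begin{displaymath}\log(\textrm{median house price}) = b_0 + b_1 \log(\textrm{percentage of lower status people}). \end{displaymath}

Exponentiating both sides, the model for the original variables reads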

\begin{displaymath}\textrm{median house price} = \exp(b_0)\cdot \textrm{(percentage of lower status people)}^{b_1}. \end{displaymath}

The transformation of the house price turns out to be especially useful, since it prevents the model from producing negative house prices.

Obviously we can choose other explanatory variables, e.g. the nitric oxides concentration in parts per ten million (NOXSQ) and the average number of rooms (RM). This results in the following program, which draws the three-dimensional data set and the regression plane (b0 + b1 x1 + b2 x2) with a $4\times4$ mesh.

  library("plot")           ; loads library plot
  data = read ("bostonh")   ; reads Boston Housing data
  x  = data[,5|6|14]        ; selects columns 5, 6 and 14
  p  = grlinreg2(x, 5|5)    ; generates regression plane 
                            ;    with 4x4 mesh
  plot(x, p)                ; plots data set and regression plane
graph46.xpl

Figure: Bivariate linear regressions of the 14th variable (MEDV) by the 5th (NOXSQ) and 6th variables (RM).
\includegraphics[scale=0.6]{grfig46.ps}


3.6 Bivariate Plots

We have already plotted two variables, the percentage of lower status people against the median house price. The quantlet plot2 provides a much more powerful way of plotting multivariate data sets. The following program shows the first two principal components (based on the correlation matrix) of the Boston Housing data set. Additionally, if the median house price is less than the mean of the median house prices, we color the observation green; otherwise we color it blue.

  library("plot")            ; loads library plot
  data = read ("bostonh")    ; reads Boston Housing data
  col  = grc.col.green-grc.col.blue
  col  = grc.col.blue+col*(data[,14]<mean(data[,14])) 
                             ; colors observations blue and green
  plot2 (data, grc.prep.pcacorr, col)    
                             ; plots the first two principal components
graph47.xpl

Figure: First two principal components based on the correlation matrix of the Boston Housing data.
\includegraphics[scale=0.6]{grfig47.ps}

When we load the library graphic, it installs an object grc which contains some often-used graphical constants, e.g. grc.col.red for the color red. grc.prep.none applies no transformation to the data and plots the first two variables (here: per capita crime rate, CRIM, and proportion of residential land zoned for lots over 25000 square feet, ZN). The further preparation options, with a short usage example after the following list, are:

grc.prep.standard
standardizes the data,
grc.prep.zeroone
transforms the data to the interval [0,1] before plotting,
grc.prep.pcacov
takes the first two principal components based on the covariance matrix instead of the first two variables,
grc.prep.pcacorr
takes the first two principal components based on the correlation matrix instead of the first two variables, and
grc.prep.sphere
spheres the data and then takes the first two components.
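
For instance, replacing the preparation constant in the program above (assuming data and col are still in memory) plots the standardized first two variables instead of the principal components:

  plot2 (data, grc.prep.standard, col)   ; plots the standardized CRIM and ZN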

Since we assume a relationship between all variables, we choose grc.prep.pcacorr and color the data points as described above. There seems to be an interesting (nonlinear) relationship for the house prices. A more complex example in Subsection 4.3 describes how the coloring works.


3.7 Star Diagrams

Star diagrams are used to visualize a multivariate data set. For each observation, we plot a star; each axis of a star represents one variable. Obviously, we need to standardize the variables to a common interval, which is done internally by $z_{i,j}=(x_{i,j}-\min_j)/(\max_j-\min_j)$. The values of $\min_j$ and $\max_j$ depend on the method the user chooses; the default method is grc.prep.zeroone.

  library("plot")            ; loads library plot
  data = read ("bostonh")    ; reads Boston Housing data
  data = data[1:70,]         ; selects the first 70 observations
  col  = grc.col.green-grc.col.blue
  col  = grc.col.blue+col*(data[,14]<mean(data[,14])) 
                             ; colors observations blue and green
  plotstar (data, grc.prep.zeroone, col)
                             ; shows star diagram of the data
graph48.xpl

Figure: Star diagram of the Boston Housing data. Green stars represent observations below the average house price, blue stars represent house prices above the average house price.
\includegraphics[scale=0.6]{grfig48.ps}

We note immediately that there are several groups of similar-looking data points. The Boston Housing data set is well-known for its outliers and subgroups in the data.


3.8 Scatter-Plot Matrices

Another possibility for analyzing multivariate data is the scatter-plot matrix. Here, we plot a set of scatter plots such that we see every possible combination of two variables. Obviously, we should not put too many variables into a scatter-plot matrix, since our screen size is limited. In fact, plotscml limits the number of variables to eight.

  library("plot")                   ; loads library plot
  data = read ("bostonh")           ; reads Boston Housing data
  x = data[,5|6|13|14]              ; selects columns 5, 6, 13 and 14
  names="NOXSQ"~"RM"~"LSTAT"~"MEDV" ; names of the variables
  plotscml (x, names)               ; shows scatter-plot matrix
graph49.xpl

We see a clear nonlinear relationship between RM and LSTAT as well as between LSTAT and MEDV.

Figure: Scatter-plot matrix of the 5th (NOXSQ), 6th (RM), 13th (LSTAT) and 14th variables (MEDV) of the Boston Housing data.
\includegraphics[scale=0.6]{grfig49.ps}


3.9 Andrews Curves

The idea of Andrews curves is based on replacing each point by a curve such that some properties of the data points are transferred to properties of the curves. For example, it holds that

\begin{displaymath}\int_{-\pi}^{\pi} (f_i(t)-f_j(t))^2 dt = \pi \Vert x_i - x_j \Vert^2 \end{displaymath}

with xi, xj as data points and fi, fj representing curves generated from the data points. We see that distant points in p-space will generate quite different curves. The curve generation is defined by

\begin{displaymath}f_i(t) = \frac{x_{i,1}}{\sqrt{2}} + x_{i,2} \sin(t) + x_{i,3} \cos(t) + x_{i,4} \sin(2t) + x_{i,5} \cos(2t) + \ldots. \end{displaymath}
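
The distance formula above follows directly from this definition: the functions $1/\sqrt{2}$, $\sin(t)$, $\cos(t)$, $\sin(2t), \ldots$ are orthogonal on $[-\pi,\pi]$ and each has squared norm $\pi$, so

\begin{displaymath}\int_{-\pi}^{\pi} (f_i(t)-f_j(t))^2 dt = \pi \sum_{k=1}^p (x_{i,k}-x_{j,k})^2 = \pi \Vert x_i - x_j \Vert^2. \end{displaymath}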

Again, we need the variables to be on comparable scales. In our example, we choose grc.prep.pcacorr. We see that the curves are quite different, but they cross near t=2. For a fixed t, we can interpret $f_i(t)$ as the projection of the data point $x_i$ onto the vector $(1/\sqrt{2},\sin(t),\cos(t),\ldots)^T$. Since all curves almost meet in one point near t=2, the data show almost no variation in this direction, i.e. one principal component of the correlation matrix is nearly zero. In fact the fourth variable (CHAS), the Charles River index, has a rather small correlation with all other variables. Since this variable is not continuous, we should not include it in our picture.

  library("plot")                ; loads library plot
  data = read ("bostonh")        ; reads Boston Housing data
  data = data[21:40]             ; observations 21 to 40
  plotandrews (data, grc.prep.pcacorr)
                                 ; shows Andrews curves based on 
                                 ;    principal components of
                                 ; correlation matrix
graph4A.xpl

Figure: Andrews curves based on the principal components of 20 observations of the Boston Housing data.
\includegraphics[scale=0.6]{grfig410.ps}

Note also that the order of the variables plays an important role: the last variable is represented with a rather high frequency, and the human brain will not easily perceive two high-frequency curves as really distinct.


3.10 Parallel Coordinate Plots

A completely different approach is the parallel coordinate plot. Unlike, e.g., the scatter-plot matrix, it gives up the orthogonality of the coordinate axes. On the jth parallel axis we plot the values xij of all data points and then connect the points belonging to the same observation, so that each polygonal line represents one data point. Certain relationships between the variables create specific patterns in the plot. For example, a correlation of +1 between two variables is represented by parallel line segments between their axes, whereas a correlation of -1 results in all lines crossing in one point in the middle between the two axes.
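
A quick way to see the second pattern is to plot two artificial, perfectly negatively correlated variables (a small sketch reusing the quantlets introduced above; the concrete data are only illustrative):

  library ("plot")               ; loads library plot
  x = normal(50, 1)              ; generates 50 standard normal values
  y = -x                         ; a perfectly negatively correlated variable
  plotpcp (x~y, grc.prep.zeroone)
                                 ; all lines cross in one point in the
                                 ;    middle between the two axes

For the Boston Housing data, we now plot three of the variables in the same way: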

  library("plot")                ; loads library plot
  data = read ("bostonh")        ; reads Boston Housing data
  data = data[21:40]             ; observations 21 to 40
  x = data[,6|13|14]             ; selects columns 6, 13 and 14
  plotpcp (x, grc.prep.standard) ; shows parallel coordinate plot
graph4B.xpl

Figure: Parallel coordinate plot of 20 observations of the standardized 6th (RM), 13th (LSTAT) and 14th variables (MEDV) of the Boston Housing data.
\includegraphics[scale=0.6]{grfig411.ps}

The Boston Housing data show some kind of negative correlation between RM and LSTAT as well as between LSTAT and MEDV.


