2. Univariate Graphics


gr = grbox (x {,col})
generates a boxplot of the data set x
gr = grdot (x {,col})
generates a dotplot of the data set x
gr = grbar (x {,col})
generates a bar chart of the data set x
gr = grqq (y, x {,col})
generates a QQ-plot from the data sets y and x
gr = grqqn (x {,col})
generates a QQ-plot from the data set x and a normal distribution
gr = grqqu (x {,col})
generates a QQ-plot from the data set x and a uniform distribution
gr = grhist (x {, h {, o {,col}}})
generates a histogram of the data set x
gr = grash (x {, h {, k {,col}}})
generates an averaged shifted histogram of the data set x

The optional parameter col allows us to produce a graphical object in a color other than black. For details, see Subsection 4.3. The other optional parameters will be explained when we introduce grhist and grash.

In the following examples, we use a mix of graphical primitives which are part of the library graphic and high-level routines which are part of the library plot. Since a call of library plot also loads the library graphic, we do not need to call the library graphic explicitly.


2.1 Boxplots

Let us now examine some variables of the Boston Housing data with statistical graphics. Since the aim of the data exploration is to predict the median house price (MEDV) from the other variables, let us start with a boxplot of this variable, generated with the quantlet grbox.


  library("plot")                ; loads library plot

  data = read ("bostonh")        ; reads Boston Housing data

  gr = grbox(data[,14])          ; generates a graphical object

  plot(gr)                       ; plots graphical object

graph31.xpl

Note that we generate the boxplot in two steps. First we generate the graphical object gr and then we plot it.

Figure: Boxplot of the 14th variable (MEDV) of the Boston Housing data.

We might not be satisfied with the boxplot, since the window size is chosen such that all the data are visible. Let us now apply an often helpful trick to get a better plot.


  library("plot")                ; loads library plot

  data = read ("bostonh")        ; reads Boston Housing data

  x = data[,14]                  ; selects the 14th column  

  gr = grbox(x)                  ; generates graphical object

  scale = #(min(x),max(x))~#(-1, 2) ; generates scaling data set

  scale = setmask(scale,"white") ; makes scaling data "invisible"

  plot(gr, scale)                ; plots boxplot and scaling data

graph32.xpl

Figure: Rescaled boxplot of the 14th variable (MEDV) of the Boston Housing data.

We have generated an invisible data set which helps us to scale the boxplot better in the window.

We learn from the boxplot that the variable MEDV contains several large outliers, which are marked with circles and crosses on the right. The mean (broken line) and the median (solid line in the box) differ. Since the box borders (25%- and 75%-quantile) and the whiskers (the smallest observation $\geq$ 25%-quantile - 1.5 interquartile ranges and the largest observation $\leq$ 75%-quantile + 1.5 interquartile ranges) have more or less the same distance from the median, we may consider the bulk of the variable to have a roughly symmetrical distribution.
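The quantities read off the boxplot above can be computed directly. The following is a minimal Python sketch, not XploRe code and not the implementation of grbox: the function name boxplot_stats, the sample values, and the "inclusive" quantile rule are assumptions made for illustration only.

```python
import statistics

def boxplot_stats(values):
    """Return the summary values a boxplot typically displays (hypothetical helper)."""
    data = sorted(values)
    # Quartiles; the interpolation rule is an assumption, grbox may use another one.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme observations inside the fences.
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    return {
        "mean": statistics.fmean(data),
        "median": median,
        "q1": q1,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
    }

# Made-up values with one large observation, not the Boston Housing data.
sample = [5.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 30.0]
stats = boxplot_stats(sample)
```

In this made-up sample the single large value pulls the mean above the median, while box and whiskers stay symmetric around the median, which mirrors the reading of the MEDV boxplot.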


2.2 Dotplots

Let us now examine the median house price in a little more detail. We use the quantlet grdot to generate a dotplot. In the horizontal direction, a dotplot shows the value of each observation; in the vertical direction, it uses a generated uniformly distributed random number.


  library("plot")                ; loads library plot

  data = read ("bostonh")        ; reads Boston Housing data

  x = data[,14]                  ; selects 14th column  

  gr = grdot(x)                  ; generates dotplot

  scale = #(min(x),max(x))~#(-1, 2) ; generates scaling data set

  scale = setmask(scale,"white") ; makes scaling data "invisible"

  plot(gr, scale)                ; plots dotplot and scaling data

graph33.xpl

Figure: Rescaled dotplot of the 14th variable (MEDV) of the Boston Housing data.

After having rescaled the display, we can detect patterns within the variable MEDV. It seems we have a rather sparse area of data until $x \approx 12$, then a denser area of data from $x \approx 18$ with a sharp break at $x \approx 25$, and finally another break at $x \approx 36$. We also see that there is more than one observation behind the cross in the dotplot.
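The construction of a dotplot described in this subsection, data value horizontally and a uniform random number vertically, can be sketched in a few lines. This is a Python illustration, not XploRe code; the function name dotplot_points, the seed, and the sample values are assumptions.

```python
import random

def dotplot_points(values, seed=0):
    """Pair each observation with a uniform random height in [0, 1) (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    return [(v, rng.random()) for v in values]

# Made-up values containing ties, not the Boston Housing data.
points = dotplot_points([21.6, 24.0, 24.0, 50.0, 50.0])
```

The random vertical offset separates tied observations, which would otherwise be drawn on top of each other.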


2.3 Bar Charts

If we want to plot discrete variables, it does not make sense to use a boxplot or a dotplot. For this purpose we can use bar charts. We generate a bar chart with the quantlet grbar and use the fourth variable (CHAS), which is an indicator variable as to whether the Charles river is part of the school district.


  library("plot")                ; loads library plot

  data = read ("bostonh")        ; reads Boston Housing data

  x = data[,4]                   ; selects 4th column  

  gr = grbar(x)                  ; generates a bar chart

  gr = setmask(gr, "line", "medium") ; changes line thickness

  plot(gr)                       ; plots bar chart

graph34.xpl

We see in the bar chart a large bar representing zeros (school district does not include Charles river) and a small bar representing ones (school district does include Charles river).

Figure: Bar chart of the 4th variable (CHAS) of the Boston Housing data generated by a graphic primitive routine.

Although most gr... quantlets already generate a useful graphic, they are intended as building blocks for high-level routines. If the Charles river indicator variable were coded by the numbers -1 and 0, we would not be able to tell which bar represents -1 and which represents 0: the left bar would still start at 0 and the right bar at 1.

Fortunately, the more sophisticated quantlet plotbar is available, which generates a much better bar chart.


  library("plot")                ; loads library plot

  data = read ("bostonh")        ; reads Boston Housing data

  x = data[,4]                   ; selects 4th column  

  plotbar(x)                     ; plots the bar chart

graph35.xpl

Figure: Bar chart of the 4th variable (CHAS) of the Boston Housing data generated by a graphic high-level routine.
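The labeling problem mentioned in this subsection disappears once each bar height is stored together with the category value it belongs to. The Python sketch below only illustrates that idea; it is not XploRe code and makes no claim about how plotbar works internally. The function name bar_heights and the sample coding with -1 and 0 are assumptions.

```python
from collections import Counter

def bar_heights(values):
    """Return (category value, frequency) pairs sorted by category (hypothetical helper)."""
    counts = Counter(values)
    return sorted(counts.items())

# Made-up indicator variable coded with -1 and 0.
bars = bar_heights([-1, 0, 0, 0, -1, 0])
# bars == [(-1, 2), (0, 4)]
```

Because every height carries its category value, a plotting routine can label the bars with -1 and 0 instead of the positions 0 and 1.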


2.4 Quantile-Quantile Plots

Quantile-Quantile plots are used to compare the distributions of two variables (grqq) or to compare one variable with a given distribution (grqqu for the uniform distribution, grqqn for the normal distribution).

Let us compare the percentage of lower status people with the appropriate normal distribution.


  library("plot")                ; loads library plot

  data = read ("bostonh")        ; reads Boston Housing data

  x = data[,13]                  ; selects 13th column  

  gr = grqqn(x)                  ; generates a qq plot

  plot(gr)                       ; plots the qq plot

graph36.xpl

We see a clear deviation from the line, which indicates that the 13th variable is not normally distributed. Since the data points cross the 45 degree line twice, we can say that the distribution is steeper in the center and thicker in the tails.

Figure: QQ-plot of the 13th variable (LSTAT) of the Boston Housing data.
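The points of a normal QQ-plot pair each sorted observation with the corresponding quantile of a normal distribution. The following Python sketch (not XploRe code) shows one common way to compute them. The function name normal_qq_points, the plotting positions (i + 0.5)/n, and the choice of fitting the normal distribution by sample mean and standard deviation are assumptions; grqqn may use different conventions.

```python
import statistics

def normal_qq_points(values):
    """Return (theoretical quantile, sample quantile) pairs (hypothetical helper)."""
    data = sorted(values)
    n = len(data)
    # Normal distribution with the sample's mean and standard deviation (assumption).
    fitted = statistics.NormalDist(statistics.fmean(data), statistics.stdev(data))
    # Plotting positions (i + 0.5) / n are one of several common conventions.
    probs = [(i + 0.5) / n for i in range(n)]
    return [(fitted.inv_cdf(p), x) for p, x in zip(probs, data)]

# Made-up, perfectly symmetric sample.
qq = normal_qq_points([1.0, 2.0, 3.0, 4.0, 5.0])
```

If the data were normally distributed, these points would scatter around the 45 degree line; systematic crossings of that line indicate deviations in the center or the tails.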


2.5 Histograms

The most frequently used statistical graphic for visualizing continuous data is the histogram. Let us now generate a histogram of the median house prices.


  library("plot")                ; loads library plot

  data = read ("bostonh")        ; reads Boston Housing data

  x = data[,14]                  ; selects 14th column  

  gr = grhist(x)                 ; generates histogram

  plot(gr)                       ; plots histogram

graph37.xpl

Figure: Histogram of the 14th variable (MEDV) of the Boston Housing data.

We already noticed some characteristics of this variable in the boxplot and the dotplot above. Here we find some of them again, e.g. the central dense region of data and the outliers at the right border. In contrast to the univariate graphics discussed so far, grhist has two further optional parameters: h, the binwidth, and o, the origin of the histogram. Changing the binwidth as well as the origin might reveal more patterns within the data. Let us first change the binwidth to 1.


  library("plot")                ; loads library plot

  data = read ("bostonh")        ; reads Boston Housing data

  x = data[,14]                  ; selects 14th column  

  gr = grhist(x,1)               ; generates histogram with 

                                 ;    binwidth 1

  plot(gr)                       ; plots histogram

graph38.xpl

Figure: Histogram of the 14th variable (MEDV) of the Boston Housing data with a different binwidth.

Now change the origin to 0.5.


  library("plot")                ; loads library plot

  data = read ("bostonh")        ; reads Boston Housing data

  x = data[,14]                  ; selects 14th column  

  gr = grhist(x,1,0.5)           ; generates histogram with 

                                 ;    binwidth 1 and origin 0.5

  plot(gr)                       ; plots histogram

graph39.xpl

Figure: Histogram of the 14th variable (MEDV) of the Boston Housing data with a different origin of the bin.
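How binwidth and origin determine a histogram can be made explicit: an observation x falls into the bin with index floor((x - o)/h). The Python sketch below (not XploRe code) counts observations per bin under that rule. The function name histogram_counts, the left-closed bins, and the sample values are assumptions; grhist may treat bin borders differently.

```python
import math
from collections import Counter

def histogram_counts(values, h, origin=0.0):
    """Map the left border of each non-empty bin to its count (hypothetical helper)."""
    # Bin index of x is floor((x - origin) / h); bins are left-closed by assumption.
    counts = Counter(math.floor((v - origin) / h) for v in values)
    return {origin + k * h: c for k, c in sorted(counts.items())}

# Made-up values, not the Boston Housing data.
sample = [0.6, 0.9, 1.2, 1.4, 1.6, 2.1]
at_origin_0 = histogram_counts(sample, 1.0, 0.0)    # {0.0: 2, 1.0: 3, 2.0: 1}
at_origin_05 = histogram_counts(sample, 1.0, 0.5)   # {0.5: 4, 1.5: 2}
```

With the same binwidth, shifting the origin from 0 to 0.5 turns three bins into two and changes the apparent shape, which is the effect visible in the two histograms above.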

Well-known problems with histograms are the choices of the origin and the binwidth. To overcome these problems, the concept of average shifted histograms has been developed. In principle, we generate a set of histograms with different origins, instead of one histogram, and then we average these histograms. We apply average shifted histograms to our last example with binwidth 1 but 10 different origins.


  library("plot")          ; loads library plot

  data = read ("bostonh")  ; reads Boston Housing data

  x = data[,14]            ; selects 14th column  

  gr = grash(x,1,10)       ; generates average shifted histogram

  plot(gr)                 ; plots average shifted histogram

graph3A.xpl

Figure: Average shifted histogram of the 14th variable (MEDV) of the Boston Housing data.

From the two histograms in Figures 15 and 16 we can only speculate about multimodality of the data, since they show different pictures. The average shifted histogram, however, suggests the existence of three modes between 10 and 25.
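The averaging idea behind the average shifted histogram can be sketched as follows: build k histograms with binwidth h whose origins are shifted by h/k each, and average their density estimates at a point t. This Python code is an illustration, not XploRe code and not the implementation of grash; the function name ash_density, the left-closed bins, the equal weighting of the shifted histograms, and the sample values are assumptions.

```python
import math

def ash_density(values, h, k, t, origin=0.0):
    """Average k shifted histogram density estimates at the point t (hypothetical helper)."""
    n = len(values)
    total = 0
    for j in range(k):
        shifted_origin = origin + j * h / k        # origins o, o + h/k, o + 2h/k, ...
        target_bin = math.floor((t - shifted_origin) / h)
        # Count the observations that share a bin with t in this shifted histogram.
        total += sum(1 for v in values
                     if math.floor((v - shifted_origin) / h) == target_bin)
    # Each single histogram estimates the density by count / (n * h).
    return total / (k * n * h)

# Made-up values, not the Boston Housing data.
sample = [0.2, 0.4, 1.6]
single = ash_density(sample, 1.0, 1, 0.7)    # ordinary histogram, k = 1
averaged = ash_density(sample, 1.0, 2, 0.7)  # average of two shifted histograms
```

For k = 1 the function reduces to an ordinary histogram estimate; for larger k the estimate depends much less on the arbitrary choice of a single origin.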



MD*TECH Method and Data Technologies
  http://www.mdtech.de  mdtech@mdtech.de