3. Summarizing Statistical Information

All functions introduced in Section 2 are intended to directly compute statistical characteristics from data. These direct computations may be quite cumbersome for everyday work. XploRe offers additional functions which summarize the statistical characteristics of data sets in an efficient way.


3.1 Summarizing Metric Data


s = summarize(x {,xvars})
computes a short summary of descriptive statistics for each column of a matrix x; optionally a vector of variable names xvars can be given
s = fivenum(x {,xvars})
computes the five number summary for each column of a matrix x; optionally a vector of variable names xvars can be given
s = descriptive(x {,xvars})
computes detailed descriptive statistics for each column of a matrix x; optionally a vector of variable names xvars can be given

summarize is a tool for obtaining a quick overview of a (metric) data matrix. It gives the most important statistical characteristics in the form of a short table. The only required input is the data matrix itself. Optionally, variable names can be provided as a text vector.

The following code for the data matrix earn is available from the quantlet desc08.xpl. For example,

  vearn="educ"|"south"|"female"|"exp"|"union"|"lnwage"|"age"
  summarize(earn, vearn)
shows the summary of the earn data together with their variable names:
  [ 1,]         
  [ 2,]        Minimum  Maximum     Mean   Median   Std.Error
  [ 3,]        -----------------------------------------------
  [ 4,] educ         2       18   13.019       12      2.6154
  [ 5,] south        0        1  0.29213        0     0.45517
  [ 6,] female       0        1   0.4588        0     0.49877
  [ 7,] exp          0       55   17.822       15       12.38
  [ 8,] union        0        1  0.17978        0     0.38436
  [ 9,] lnwage       0   3.7955   2.0592   2.0513     0.52773
  [10,] age         18       64   36.833       35      11.727
  [11,]
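
The columns of this table can be recomputed by hand. The following is a minimal sketch in Python, not XploRe code; the helper name summarize_column is hypothetical. It assumes that the "Std.Error" column is the sample standard deviation (divisor n - 1). Under that assumption it agrees with the female row above when the column is rebuilt from the counts that frequency reports in Section 3.2 (289 zeros, 245 ones).

```python
import statistics

def summarize_column(values):
    """Return minimum, maximum, mean, median and standard deviation of one column."""
    return {
        "Minimum": min(values),
        "Maximum": max(values),
        "Mean": sum(values) / len(values),
        "Median": statistics.median(values),
        # Assumption: "Std.Error" is the sample standard deviation (divisor n - 1).
        "Std.Error": statistics.stdev(values),
    }

# The female column rebuilt from its category counts; the order of the
# observations does not matter for any of these statistics.
female = [0] * 289 + [1] * 245
row = summarize_column(female)
print({name: round(value, 5) for name, value in row.items()})
```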

An alternative to summarize is fivenum, which reports the five number summary of all columns of a data set. These five numbers are the minimum, the 25% quartile, the median, the 75% quartile and the maximum of the data. As with summarize, the required input is the data matrix itself, and variable names can be provided optionally. For the sake of brevity we show fivenum applied only to column seven of the earn matrix:

  fivenum(earn[,7],vearn[7])
reports
Contents of five
[ 1,]  
[ 2,] ==================================================
[ 3,]  Five number summary: age
[ 4,] --------------------------------------------------
[ 5,]    Minimum                    18
[ 6,]    25% Quartile               28
[ 7,]    Median                     35
[ 8,]    75% Quartile               44
[ 9,]    Maximum                    64
[10,] ==================================================
[11,]
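
For readers who want to see how such a summary is obtained, here is a small Python sketch (not XploRe code). The functions quantile and five_number_summary are hypothetical helpers, and the quartile rule used, linear interpolation between order statistics, is only one of several common conventions; the rule XploRe applies may differ. The sample is toy data, not the earn data.

```python
def quantile(sorted_values, q):
    """Quantile by linear interpolation between neighbouring order statistics."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] + frac * (sorted_values[hi] - sorted_values[lo])

def five_number_summary(values):
    """Return (minimum, 25% quartile, median, 75% quartile, maximum)."""
    s = sorted(values)
    return (s[0], quantile(s, 0.25), quantile(s, 0.5), quantile(s, 0.75), s[-1])

sample = [3, 1, 4, 1, 5, 9, 2, 6]  # hypothetical toy data
summary = five_number_summary(sample)
print(summary)
```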
The function descriptive provides detailed information about the statistical characteristics of all columns of a data matrix. As in the previous tools, the input for descriptive is the data matrix and optionally variable names:
  descriptive(earn[,7],vearn[7])
produces
Contents of desc
[ 1,]  
[ 2,] =========================================================
[ 3,]  Variable age
[ 4,] =========================================================
[ 5,]  
[ 6,]  Mean              36.8333
[ 7,]  Std.Error         11.7266     Variance          137.513
[ 8,]  
[ 9,]  Minimum                18     Maximum                64
[10,]  Range                  46
[11,]  
[12,]  Lowest cases                  Highest cases 
[13,]         350:            18             368:           64
[14,]          94:            18             212:           64
[15,]          48:            18             223:           64
[16,]          78:            18             125:           64
[17,]         298:            19             501:           64
[18,]  
[19,]  Median                 35
[20,]  25% Quartile           28     75% Quartile           44
[21,]  
[22,]  Skewness         0.545221     Kurtosis        -0.595615
[23,]  
[24,]  Observations                    534
[25,]  Distinct observations            47
[26,]  
[27,]  Total number of {-Inf,Inf,NaN}    0
[28,]  
[29,] =========================================================
[30,]
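
Some of the additional entries of this output can be sketched as follows in Python (not XploRe code). The helper shape_statistics is hypothetical. Skewness and kurtosis are computed here from central moments, with 3 subtracted from the kurtosis; the negative kurtosis printed above suggests an excess-kurtosis convention, but the exact estimator used by descriptive is an assumption. The sample is toy data.

```python
def shape_statistics(values):
    """Range, number of distinct values, moment skewness and excess kurtosis."""
    n = len(values)
    mean = sum(values) / n
    # Central moments of order 2, 3 and 4 (divisor n).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "Range": max(values) - min(values),
        "Distinct observations": len(set(values)),
        "Skewness": m3 / m2 ** 1.5,
        "Kurtosis": m4 / m2 ** 2 - 3,  # assumption: excess kurtosis
    }

sample = [1, 2, 3, 4, 5]  # hypothetical toy data, symmetric around 3
shape = shape_statistics(sample)
print(shape)
```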


3.2 Summarizing Categorical Data


s = frequency(x {, xvars {, outwidth}})
computes a frequency table for each column of a matrix x; optionally a vector of variable names xvars and a maximal string length outwidth for the categories can be given
s = crosstable(x {, xvars})
computes pairwise cross tables from all columns of a data matrix x and computes the result of a $\chi^2$ independence test; optionally a vector of variable names xvars can be given

The functions frequency and crosstable can be applied to numeric as well as to text matrices. The XploRe codes for this section are available from the quantlet desc09.xpl.

frequency produces a text matrix containing the categories and frequencies as well as cumulative frequencies for all columns of a data matrix. We apply this function to the first and third columns of the earn data.

  frequency(earn[,1|3], vearn[1|3])
frequency reports the categories and their frequencies in a different format than the function discrete used in Section 2.5:
  Contents of freq
  [ 1,]  
  [ 2,] ==================================================
  [ 3,]  Variable educ
  [ 4,] ==================================================
  [ 5,]                 |  Frequency  Percent  Cumulative 
  [ 6,] --------------------------------------------------
  [ 7,]               2 |          1    0.002      0.002
  [ 8,]               3 |          1    0.002      0.004
  [ 9,]               4 |          1    0.002      0.006
  [10,]               5 |          1    0.002      0.007
  [11,]               6 |          3    0.006      0.013
  [12,]               7 |          5    0.009      0.022
  [13,]               8 |         15    0.028      0.051
  [14,]               9 |         12    0.022      0.073
  [15,]              10 |         17    0.032      0.105
  [16,]              11 |         27    0.051      0.155
  [17,]              12 |        219    0.410      0.566
  [18,]              13 |         37    0.069      0.635
  [19,]              14 |         56    0.105      0.740
  [20,]              15 |         13    0.024      0.764
  [21,]              16 |         71    0.133      0.897
  [22,]              17 |         24    0.045      0.942
  [23,]              18 |         31    0.058      1.000
  [24,] --------------------------------------------------
  [25,]                 |        534    1.000
  [26,] ==================================================
  [27,]  
  [28,] ==================================================
  [29,]  Variable female
  [30,] ==================================================
  [31,]                 |  Frequency  Percent  Cumulative
  [32,] --------------------------------------------------
  [33,]               0 |        289    0.541      0.541
  [34,]               1 |        245    0.459      1.000
  [35,] --------------------------------------------------
  [36,]                 |        534    1.000
  [37,] ==================================================
  [38,]
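
The three columns of such a table are absolute counts, relative frequencies (printed under "Percent" as fractions) and cumulative relative frequencies. A minimal Python sketch (not XploRe code, with the hypothetical helper frequency_table) reproduces the female block from its counts:

```python
from collections import Counter

def frequency_table(values):
    """Return rows (category, frequency, relative frequency, cumulative)."""
    counts = Counter(values)
    n = len(values)
    rows = []
    cumulative_count = 0
    for category in sorted(counts):
        cumulative_count += counts[category]
        rows.append((
            category,
            counts[category],
            round(counts[category] / n, 3),
            round(cumulative_count / n, 3),
        ))
    return rows

# The female column rebuilt from the counts shown above.
female = [0] * 289 + [1] * 245
table_rows = frequency_table(female)
for table_row in table_rows:
    print(table_row)
```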

To study the dependence of two categorical variables, one typically analyzes the contingency table (or cross table). crosstable provides the cross tables of all columns of a data matrix and additionally computes the $\chi^2$ statistic for testing independence and contingency coefficients. For example,

  crosstable(earn[,1|3], vearn[1|3])
gives
  Contents of cross
  [ 1,]  
  [ 2,]                    
  [ 3,] Crosstable for variables educ, female
  [ 4,]  
  [ 5,]           |      0.0000  1.0000 |
  [ 6,] ----------|---------------------|---------
  [ 7,]   2.0000  |      1       0      |       1 
  [ 8,]   3.0000  |      1       0      |       1 
  [ 9,]   4.0000  |      1       0      |       1 
  [10,]   5.0000  |      1       0      |       1 
  [11,]   6.0000  |      1       2      |       3 
  [12,]   7.0000  |      4       1      |       5 
  [13,]   8.0000  |      6       9      |      15 
  [14,]   9.0000  |      7       5      |      12 
  [15,]  10.0000  |     12       5      |      17 
  [16,]  11.0000  |     16      11      |      27 
  [17,]  12.0000  |    109     110      |     219 
  [18,]  13.0000  |     21      16      |      37 
  [19,]  14.0000  |     33      23      |      56 
  [20,]  15.0000  |      7       6      |      13 
  [21,]  16.0000  |     37      34      |      71 
  [22,]  17.0000  |     10      14      |      24 
  [23,]  18.0000  |     22       9      |      31 
  [24,] ----------|---------------------|---------
  [25,]           |    289     245      |     534 
  [26,]                    
  [27,] Chi^2 test of independence
  [28,]  
  [29,]   chi^2 statistic:                    16.15
  [30,]   degrees of freedom:                    16
  [31,]   significance level for rejection:  0.4427
  [32,]  
  [33,]   contingency coefficient:             0.17
  [34,]   corrected contingency coefficient:   0.24
  [35,]
The significance level reported for the $\chi^2$ test of independence between the two variables is 0.4427, which means that independence cannot be rejected at the usual 5% or 10% levels.
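
The test statistic and the two coefficients can be recomputed from the cross table itself. The Python sketch below is not XploRe code; the helper names are hypothetical. It uses the Pearson statistic, the contingency coefficient $C = \sqrt{\chi^2/(\chi^2+n)}$, and, as an assumption that agrees with the printed values, the correction $C/\sqrt{(k-1)/k}$ with $k$ the smaller of the numbers of rows and columns. The significance level is omitted because it requires the $\chi^2$ distribution function.

```python
import math

def chi2_independence(table):
    """Pearson chi^2 statistic, degrees of freedom and sample size of a cross table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # count under independence
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof, n

def contingency_coefficients(chi2, n, k):
    """Contingency coefficient and its corrected version (assumed definition)."""
    c = math.sqrt(chi2 / (chi2 + n))
    return c, c / math.sqrt((k - 1) / k)

# Cell counts copied from the cross table above (rows: educ 2..18; columns: female 0, 1).
table = [[1, 0], [1, 0], [1, 0], [1, 0], [1, 2], [4, 1], [6, 9], [7, 5],
         [12, 5], [16, 11], [109, 110], [21, 16], [33, 23], [7, 6],
         [37, 34], [10, 14], [22, 9]]

chi2, dof, n = chi2_independence(table)
k = min(len(table), len(table[0]))
c, c_corrected = contingency_coefficients(chi2, n, k)
print(round(chi2, 2), dof, round(c, 2), round(c_corrected, 2))
```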


