1. Data Matrices

In XploRe, data can be stored in matrices ($n\times p$) or arrays ($n\times p\times\ldots$). Here, we will concentrate on data matrices. Small data matrices can be created directly from the command line or within an XploRe quantlet. Large data matrices are typically read from data files.

The following subsections provide a short introduction on matrix and data handling. Consult the Read and Write Tutorial to learn more about loading data files into XploRe. More details on data and matrix manipulation can be found in the Matrix Handling Tutorial.


1.1 Creating Data Matrices


z = # (x1, x2, ..., xn)
creates a column vector z from scalar numbers x1, x2,..., xn
z = x | y
concatenates two arrays x and y rowwise
z = x ~ y
concatenates two arrays x and y columnwise

Small data matrices can be entered directly at the command line or within an XploRe program. All XploRe code in this subsection is available from the quantlet desc01.xpl. As a first example, consider the data matrix

\begin{displaymath}\left(
\begin{array}{llr@{.}l}
1& 2.0 & 3&4\\
5& 6.0 & 7&8\\
9& 0.0 & 1&44\\
8& 7.0 & 10&432\\
\end{array}\right)\end{displaymath}

which has dimension 4 x 3, i.e. four rows and three columns. To construct this matrix in XploRe, we create each column vector separately and then concatenate these column vectors. A column vector can be created by means of the # or the | operator. The following two lines are equivalent:

  col1=#(1,5,9,8)

  col1=1|5|9|8

Both create the column vector

\begin{displaymath}\left(
\begin{array}{l}
1\\
5\\
9\\
8\\
\end{array}\right).\end{displaymath}

We can check the contents of col1 by issuing just

  col1

at the command line, which results in

  Contents of col1
  [1,]        1
  [2,]        5
  [3,]        9
  [4,]        8

in the output window. In the same way as for col1, we build the second and third columns:

  col2=#(2.0,6.0,0.0,7.0)

  col3=#(3.4,7.8,1.44,10.432)

and group all three vectors together by means of the ~ operator:

  mat=col1~col2~col3

When we check the contents of mat we see

  Contents of mat
  [1,]        1        2      3.4
  [2,]        5        6      7.8
  [3,]        9        0     1.44
  [4,]        8        7   10.432

Note that we could have created mat in a single step:

  mat= #(1,5,9,8) ~ #(2.0,6.0,0.0,7.0) ~ #(3.4,7.8,1.44,10.432)

Let us also remark that XploRe does not distinguish between integer and float values. Therefore, the first two columns of the matrix mat appear in the same format.

It is also possible to create text matrices. For example


  textmat= #("aa","c") ~ #("b","d2")

creates the text matrix

\begin{displaymath}\left(
\begin{array}{ll}
\texttt{aa}& \texttt{b}\\
\texttt{c}& \texttt{d2}\\
\end{array}\right)\end{displaymath}

Note that text and numeric values need to be stored in different matrices.
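
The numeric matrix mat from above can also be assembled row by row, which shows how the two concatenation operators work together. The following is only a minimal sketch and not part of the quantlet: the names row1 to row4 and mat2 are hypothetical, and it assumes that ~ accepts scalars in the same way as | does in the col1 example.

  ; build each row with ~ (assumption: ~ works on scalars, as | does)
  row1 = 1 ~ 2.0 ~ 3.4
  row2 = 5 ~ 6.0 ~ 7.8
  row3 = 9 ~ 0.0 ~ 1.44
  row4 = 8 ~ 7.0 ~ 10.432
  ; stack the four row vectors with |
  mat2 = row1 | row2 | row3 | row4

If the assumption holds, mat2 should have the same contents as mat. Building the rows in separate statements avoids relying on the relative precedence of ~ and |.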


1.2 Loading Data Files


x = read ("file")
reads numeric data from file.dat
x = readm ("file")
reads mixed text and numeric data from file.dat

Large data sets are usually stored in data files. XploRe can read ASCII files that contain numeric data, text data, or both. In the following we will use two data sets: cps85 and uscomp2 (see Data Sets).

The file cps85.dat consists of a subsample of the 1985 U.S. Current Population Survey. The file contains only numeric data. We will assign columns 1 (years of education), 2 (=1 if living in south), 5 (=1 if female), 8 (years of labor market experience), 10 (=1 if working on a union job), 11 (natural logarithm of average hourly earnings) and 12 (age in years) to the XploRe variable earn:


  earn=read("cps85")

  earn=earn[,1|2|5|8|10|11|12]

desc02.xpl

The file uscomp2.dat contains information on 79 U.S. companies. The data set has 8 columns, two of them text (columns 1 and 8) and six numeric (columns 2 to 7). We will only use column 8 (branch, text) and columns 3 and 5 (sales and profits, both numeric) and assign them to the XploRe variables branch and salpro:

  uscomp=readm("uscomp2")

  branch=uscomp.text[,2]

  salpro=uscomp.double[,2|4]

desc02.xpl

Since text and numeric data are stored in different XploRe objects, we find column 8 of uscomp2 as the second text column and columns 3 and 5 as the second and fourth numeric columns, respectively. readm is a function written in the XploRe language, which can be used for reading mixed text and numeric data.
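
To confirm how readm has split the file, the two components can be inspected directly. The following is a small sketch under the assumption that the dim function (introduced in Subsection 1.3) can be applied to uscomp.text and uscomp.double just as to any other matrix. Given the description of uscomp2 above, the reported dimensions should be 79 x 2 and 79 x 6.

  ; text part: expected 79 rows, 2 columns (original columns 1 and 8)
  dim(uscomp.text)
  ; numeric part: expected 79 rows, 6 columns (original columns 2 to 7)
  dim(uscomp.double)

A check like this makes it easier to map the original column numbers onto the positions within each component before extracting branch and salpro.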


1.3 Matrix Operations


d = dim(x)
shows the dimension of an array x
n = rows(x)
shows the number of rows of an array x
p = cols(x)
shows the number of columns of an array x
y = x[i,j] or y = x[i,] or y = x[,j]
extracts element i,j or row i or column j from x

The first step in data analysis is to find out the dimension of the data. This can be done with the function dim. We now apply this function to the data matrices mat, earn, branch, and salpro that we specified in Subsections 1.1 and 1.2. The code for this section is available from the quantlet desc02.xpl.


  dim(mat)

  dim(earn)

  dim(branch)

  dim(salpro)

yields

  Contents of dim
  [1,]        4
  [2,]        3

  Contents of dim
  [1,]      534
  [2,]        7

  Contents of dim
  [1,]       79

  Contents of dim
  [1,]       79
  [2,]        2

and tells us that mat is a 4 x 3 matrix, earn is 534 x 7, branch is a 79 x 1 vector and salpro is 79 x 2. If we are just interested in the number of rows or columns, we can use the commands rows and cols. For example,

  rows(earn)

  cols(earn)

gives

  Contents of rows
  [1,]      534

  Contents of cols
  [1,]        7

To extract elements or submatrices of a matrix, we can use the subarray operator []. For example, the following three lines extract the first row, the second column and the (4,3) element (fourth row, third column):


  mat[1,]

  mat[,2]

  mat[4,3]

This operator can also be used for extracting several rows and columns at once. The statement mat[1:3,1|3] extracts the elements which are in the 1st to 3rd rows of mat and in the 1st and 3rd columns. The operator : is used to specify a range of consecutive integers.
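
As a short illustration of combined row and column selection, the sketch below stores the submatrix in a variable and checks its dimension. The name sub is hypothetical; apart from that, only operators and functions introduced in this section are used.

  ; rows 1 to 3, columns 1 and 3 of mat
  sub = mat[1:3,1|3]
  ; display the submatrix and its dimension
  sub
  dim(sub)

Given the contents of mat shown in Subsection 1.1, sub should contain the rows (1, 3.4), (5, 7.8) and (9, 1.44), and dim(sub) should report 3 rows and 2 columns.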



MD*TECH Method and Data Technologies
  http://www.mdtech.de  mdtech@mdtech.de