3. The Cox Proportional Hazards Model
- {ll, ll1, ll2} =
hazregll(data,beta)
- calculates the value of the partial log-likelihood function and
of the first two derivatives
- {betahat, betak, ck} =
hazbeta(data {,maxit})
- estimates the regression coefficients for the Cox proportional hazards model
- {bhaz, bsurv} =
hazbase(data)
- estimates the baseline hazard and survival functions
- surv =
hazsurv(data, z)
- estimates the conditional survival function
- {val, df, pval} =
haztest(data, index)
- performs the likelihood ratio test, Wald's test and the score test
|
The semiparametric Cox proportional hazards model is the
most commonly used model in hazard regression. In this model, the
conditional hazard function, given the covariate value
,
is assumed to be of the form
where
is the vector of regression coefficients, and
denotes the baseline hazard function.
No particular shape is assumed for
the baseline hazard; it is estimated
nonparametrically. The contributions
of covariates to the hazard are multiplicative. An accessible
introduction to the Cox model is given, for example, in
Klein and Moeschberger (1997).
The quantlib hazreg
provides quantlets for estimating the regression
coefficients,
, standard deviations of these estimates,
and estimates for the cumulative baseline hazard and the conditional
survival function. Our calculations are based on standard
partial likelihood methods in the proportional hazards model.
Additionally, we provide three commonly used
tests for the hypothesis that one or more of the
's are zero.
These tests are useful for model choice procedures.
3.1 Estimating the Regression Coefficients
Let us assume that there are no ties between the event times.
In this case, the partial likelihood function is given by
 |
(4) |
where
denotes the risk set at time
.
Note that only event times contribute their own factor
to the partial likelihood.
However, both censored and uncensored observations appear in the
denominator, where the sum
over the risk set
includes all individuals who are still at risk
immediately prior to
.
Let
denote the
maximum (partial) likelihood estimate of
, obtained by
maximizing the partial log-likelihood function,
.
From (4), it follows immediately that
 |
(5) |
The first derivative of
with respect to
is called
vector of efficient scores, given by
 |
(6) |
where
denotes the vector of censoring indicators,
and
is the
-matrix of covariate values, with the
-th row
containing the covariate values of
the
-th individual,
.
For the case of ties in the event times, there are several ways to define
a partial likelihood function. Currently, we are using formula
(4) both for data with and without ties. Each event time contributes one
factor to the likelihood function; for tied events, all events in the tie appear
with the same denominator.
This approach was
suggested by Breslow (1974),
and is implemented in most statistical
packages. When there are few ties, this approximation to the partial
likelihood works rather well, see Klein and Moeschberger (1997), p.238.
The information matrix
is given by the negative of the second
derivative of
. Let
denote the
indicator vector of the risk set
; this means, the
-th
element of
is
when
, and
0, otherwise. Then, the information matrix takes the form
where the
are scalars;
for any vector
,
denotes
the diagonal matrix with the main diagonal
, and
is defined elementwise;
and
. The matrices
are modifications of the design matrix
,
setting the rows of
to zero when the corresponding
observation is not in the risk set for time
. Note that the
index
runs through all
observations. When ties are present, each of
the tied event times appears once, with the same risk set, and
contributes the same term to the information matrix.
For large samples, the maximum likelihood estimate
is known to follow an asymptotic
-variate normal distribution,
The inverse of the information matrix,
is a consistent estimate of the covariance matrix of
.
It may be used to construct confidence intervals for the
components of
.
The quantlet
hazregll
computes the
partial log-likelihood function, its first derivative (efficient scores),
and the negative of the second derivative (information matrix).
The first and second derivatives of the log-likelihood function
(5) are later used to obtain
, as well as
for computing test statistics for local tests on
.
The syntax of
hazregll
is given below:
{ll,ll1,ll2} = hazregll(data,beta)
Input:
- data
-
matrix, the sorted data matrix
obtained as output data of the quantlet
hazdat;
- beta
-
vector, the regression coefficient
vector
.
Output:
- ll
- scalar, the log-likelihood function at parameter
value
;
- ll1
-
vector, the first derivatives
of the log-likelihood function at
parameter value
;
- ll2
-
matrix, the negative Hessian matrix
of the log-likelihood function at parameter value
(information matrix).
Example 5.The simulated data in the file haz01.dat
were generated from a
proportional hazards model with
covariates, the
conditional hazard function
,
and
. The baseline hazard is constant,
.
We use the quantlet
hazregll
to calculate the partial log-likelihood
function, the efficient scores and the information matrix at the
true parameter value,
.
library("hazreg")
dat=read("haz01.dat")
t = dat[,1] ; observed times
delta = dat[,2] ; censoring indicator
z = dat[,3:4] ; covariates
{data,ties} = hazdat(t,delta, z) ; preparing data
beta = 1|2
{ll,ll1,ll2} = hazregll(data,beta)
The calculations yield ll = -34.679 for the value of the
log-likelihood function,
for the first derivatives,
and
for the information matrix.
The quantlet
hazbeta
calculates the maximum likelihood estimate
by solving the nonlinear equation system
,
defined in (6).
We use a Newton-Raphson algorithm with
the stopping criterion
The syntax of
hazbeta
is given below:
{betahat, betak, ck} = hazbeta(data {,maxit})
Input:
- data
-
matrix, the sorted data matrix
obtained as output data of
hazdat;
- maxit
- scalar, maximum number of iteration for the
Newton-Raphson procedure, default is
.
Output:
- betahat
-
vector, estimate of the
regression parameter
;
- betak
-
matrix, iterated
parameter values through the Newton-Raphson procedure;
- ck
-
vector,
values of the convergence
criterion at each iteration of the Newton-Raphson procedure.
Example 6.In this example, we compute
for the data in haz01.dat,
and estimate the covariance matrix of
by
. We use the quantlets
hazbeta
and
hazregll. The data was generated from a proportional
hazards model with
. Details are given in Examples 4 and 5.
library("hazreg")
dat=read("haz01.dat")
t = dat[,1] ; observed times
delta = dat[,2] ; censoring indicator
z = dat[,3:4] ; covariates
{data,ties} = hazdat(t,delta, z) ; preparing data
{betahat,betak,ck} = hazbeta(data)
{ll, ll1, ll2} = hazregll(data, betahat)
sigma = inv(ll2) ; covariance matrix estimate
The calculation results in
,
with the estimated covariance matrix
Both components
and
are within their respective
asymptotic 95% confidence intervals that may be
constructed with betahat and the square root of the
diagonal elements of sigma .
3.2 Estimating the Hazard and Survival Functions
We estimate the cumulative baseline hazard function,
, by
The estimate
is a right-continuous step function, with jumps in the
event times.
The index
cycles through all observations,
In the case of tied events, each of the events in the tie
contributes its own term to the sum; this term is the same for all
events in a particular tie.
The estimate
can be derived through a profile
likelihood approach, see Klein and Moeschberger (1997),
pages 260 and 237, and
Johansen (1983).
We estimate the baseline survival function,
by
In the Cox proportional hazards model, the survival function
of an individual with covariate values
is given by
 |
(8) |
Consequently, we estimate the conditional survival function by
substituting estimates for
and
,
 |
(9) |
Note that the estimates
,
and
are all step functions, with jumps at the event times. All three estimates
are non-negative,
is monotonously increasing, and the
survival function estimates are monotonously decreasing.
The quantlet
hazcoxb
provides the estimates
and
. It has the following syntax:
{bcumhaz, bsurv} = hazcoxb(data)
Input:
- data
-
matrix, the sorted data matrix
given by the output data of
hazdat.
Output:
- bcumhaz
-
matrix, with rows
, sorted in the same order
as the
in data;
- bsurv
-
matrix, with rows
, sorted in the same order
as the
in data.
Example 7.In this example, we calculate and plot estimates of the cumulative baseline
hazard
and of the corresponding survival function
,
for the simulated data in
haz01.dat.
The estimates are calculated using the quantlet
hazcoxb. Plotting
of the step functions is supported by
steps4plot. The resulting
plots are displayed in
Figures 2 and 3.
The data in haz01.dat
were generated from a proportional hazards
model with
and
; details are
given in Examples 4 and 5.
library("hazreg")
dat=read("haz01.dat")
t = dat[,1] ; observed times
delta = dat[,2] ; censoring indicator
z = dat[,3:4] ; covariates
{data,ties} = hazdat(t,delta, z) ; preparing data
{bcumhaz,bsurv} = hazcoxb(data) ; compute estimates
setsize(600,400) ; initiating graph
plot1=createdisplay(1,1) ; initiating graph
plot2=createdisplay(1,1)
n = rows(data) ; sample size
pm = (#(1,n+2)'+ (0:n))|(#(2*n+2,3*n+3)'+ (0:n))
; points to be connected
cn = matrix(2*n+2) ; color_num, controls colors
ar = matrix(2*n+2) ; art, controls line types
th = matrix(2*n+2) ; thick, controls line thickness
bsurvline = steps4plot(bsurv, 0~1)
; points for step function plot
setmaskl(bsurvline, pm, cn, ar, th) ; lines connected
setmaskp(bsurvline, 4, 0, 8) ; points controlled
bcumhazline = steps4plot(bcumhaz, 0~0)
; points for step function plot
setmaskl(bcumhazline, pm, cn, ar, th)
setmaskp(bcumhazline, 4, 0, 8)
show(plot1, 1, 1, bcumhazline) ; plot baseline hazard
setgopt(plot1, 1, 1, "title","Cumulative Baseline Hazard")
setgopt(plot1, 1, 1, "xlabel","Time")
setgopt(plot1, 1, 1, "ylabel","Cumulative Hazard")
setgopt(plot1, 1, 1, "ymajor", 0.5)
print (plot1,"hazbcumhaztest.ps")
show(plot2,1, 1, bsurvline) ; plot baseline survival
setgopt(plot2, 1, 1, "title","Baseline Survival Function")
setgopt(plot2, 1, 1, "xlabel","Time")
setgopt(plot2, 1, 1, "ylabel","Survival Function")
setgopt(plot2, 1, 1, "ymajor", 0.2, "ylim", (0|1.01))
print (plot2,"hazbsurvtest.ps")
Figure 2:
Estimate of the cumulative baseline hazard in the
proportional hazards model. Data were generated in a model
with
.
|
Figure:
Estimate of the baseline survival function in the
proportional hazards model. Data were generated in a
model with
.
|
The quantlet
hazsurv
provides an estimate of the conditional
survival function,
, as defined in
formula (9).
It has the following syntax:
surv = hazsurv(data,z)
Input:
- data
-
matrix, the sorted data matrix
given by the output data of
hazdat;
- z
-
vector, value of the covariates;
Output:
- surv
-
matrix, the first column is the
sorted
, followed by the estimated conditional survival function
.
Example 8. We calculate and plot the estimate
of the conditional
survival function for
, using the simulated
data in haz01.dat.
The resulting graph is displayed in Figure 4.
library("hazreg")
dat=read("haz01.dat")
t = dat[,1] ; observed times
delta = dat[,2] ; censoring indicator
z = dat[,3:4] ; covariates
{data,ties} = hazdat(t,delta, z) ; preparing data
z1 = 0.1|-0.3 ; covariate values
surv = hazsurv(data, z1)
; estimate conditional survival function
setsize(600, 400) ; initiating graph
plot1=createdisplay(1,1) ; initiating graph
n = rows(data)
pm = (#(1,n+2)'+ (0:n))|(#(2*n+2,3*n+3)'+ (0:n))
; points to be connected
cn = matrix(2*n+2) ; color_num, controls colors
ar = matrix(2*n+2) ; art, controls line types
th = matrix(2*n+2)
; thick, controls line thickness
survline = steps4plot(surv, 0~1)
; points for step function plot
setmaskl(survline, pm, cn , ar, th) ; lines connected
setmaskp(survline, 4, 0, 8) ; points controlled
setsize(600,400)
show(plot1, 1, 1, survline)
setgopt(plot1, 1, 1, "title","Conditional Survival Function")
setgopt(plot1, 1, 1, "xlabel","Time")
setgopt(plot1, 1, 1, "ylabel","Survival Function")
setgopt(plot1, 1, 1, "ylim", (0|1.01), "ymajor", 0.2)
print (plot1,"hazsurvtest.ps")
Figure 4:
Estimated conditional survival function based on the data
in haz01.dat, for
.
|
3.3 Hypothesis Testing
The quantlib hazreg
offers three tests for hypotheses about
regression parameters: the likelihood ratio test, Wald's test and the score
test. Assume that
,
where the
-dimensional subvector
consists of the regression
coefficients of
interest, and the
-dimensional subvector
contains
the remaining parameters.
We are testing the hypotheses
versus
, in the presence of the remaining
unknown
regression coefficients;
denotes the
-dimensional zero
vector. This type of tests is often used in model
choice procedures, testing whether a given model can be improved by
including certain additional covariates or covariate combinations.
3.3.1 Likelihood Ratio Test
The test statistic for the likelihood ratio test is given by
where
, and
and
are
the
-dimensional zero vector and the
conditional maximum likelihood estimate for
, given
, respectively. The estimate
is obtained by
substituting the fixed null hypothesis value,
,
for the corresponding
's in the partial log-likelihood
function (5).
Under the null hypothesis, the asymptotic distribution of
is
.
We calculate
-values as tail probabilities of the
distribution,
.
Let
denote the usual maximum partial likelihood estimate of the full
parameter vector
.
Now, let us partition the inverse of the
information matrix
into
where
denotes the
submatrix
corresponding to
.
The information matrix is defined in (8).
The test statistic for the Wald test is
given by
Under the null hypothesis, the distribution of
converges to
.
Let
denote the subvector of
the first
elements of the score function
, defined in (6).
The test statistic for the score test is
where
is the
maximum likelihood estimate for
under the null hypothesis,
introduced in
Section 3.3.1. Again,
the large sample distribution of the test statistic under the null
hypothesis is
.
For more details on hypothesis testing in the Cox
model, see Klein and Moeschberger (1997), Section 8.4.
Values of the three test statistics
,
and
,
and the corresponding asymptotic
-values are provided by
the quantlet
haztest.
-values are computed as tail probabilities of the
distribution,
which is the asymptotic distribution for each of the three tests.
The syntax of
haztest
is given below:
{ttest, val, df, pval} = haztest(data,index)
Input:
- data
-
matrix, the sorted data matrix
given by the output data of
hazdat;
- index
-
vector, with index[i]=0
when
is part of the null hypothesis, and index[i]=1, otherwise;
Output:
- ttest
- printed output, table with values of the test statistics, degrees of freedom
and
-values for the likelihood ratio test, Wald's test and the score test.
- val
-
vector, values of the test
statistics, in the following order: likelihood
ratio test, Wald's test, score test;
- df
- scalar, degrees of freedom of the
reference distribution;
- pval
-
vector,
-values of the tests.
Example 9. We are testing the null hypothesis
for the
data in haz01.dat. The quantlet
haztest
provides
values for
three test statistics and computes the corresponding
-values
as tail probabilities of
the
distribution.
library("hazreg")
dat=read("haz01.dat")
t = dat[,1] ; observed times
delta = dat[,2] ; censoring indicator
z = dat[,3:4] ; covariates
{data,ties} = hazdat(t,delta, z) ; preparing data
index = 1|0 ; testing if the second
; coefficient is zero
{testt, val,df,pval} = haztest(data, index)
testt ; print summary table
The variable testt contains the summary table. The last line of
the code prints the table into the
XploRe
output window:
"--------------------------------------------------------------"
"Cox Proportional Hazards Model"
""
"Hypothesis: beta1=0 for a subvector of regression coefficients"
""
"- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - "
" Test statistic DF P-value "
"- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - "
"LR Test 7.56687 1 0.00595"
"Wald Test 7.04612 1 0.00794"
"Score Test 4.25763 1 0.03908"
"--------------------------------------------------------------"
""
Additionally, the test statistic values and the
-values are stored in the variables
val and pval, respectively.
The data in haz01.dat
were generated from a proportional hazards
model with
. For this sample, all three tests result
in small
-values,
providing evidence against the null hypothesis
.
3.4 Example: Length of Stay in Nursing Homes
Nursing homes provide both short-term and long-term care,
and patients may stay from a few days to several years.
In this section, we investigate how certain characteristics of
nursing home patients influence their length of stay.
We use a subset of the nursing home data
presented by Morris, Norton, and Zhou (1994).
The original study was sponsored by the National Center for
Health Services in 1980-1982, and investigated the
the impact of certain financial
incentives on the nursing home care of Medicaid patients.
Thirty six nursing homes were randomized
into a treatment or control group; nursing homes in the treatment
group received financial incentives for admitting
more disabled Medicare patients, for improving their health
status, and for discharging the patients to their homes within 90 days.
The nursing home data consist of
patients
from this study. Patients were admitted to participating
nursing homes between May 1, 1981, and April 30, 1982, and
followed over a period of up to three years. The following
characteristics were recorded: age, marital status, gender,
health status. The data set is available on a floppy disk
distributed with
Lange et al. (1994), and
at StatLib,
http://www.stat.cmu.edu/datasets/csb/ .
We restrict our analysis to nursing homes in the
control group, which represents the standard
care without additional financial incentives, and
to patients that were at least 65 years old and of
comparatively good health (health status = 2).
Our subset consists of
patients.
The patients were followed from admission to either discharge,
or death. For patients that were still in a nursing home on
April 30, 1983, the length of stay is
recorded as censored (25 % of the observations).
Our subset of the nursing home data is provided in
nursing.dat. The first column of the data file contains the
length of stay in the nursing home (in days),
the second column is the censoring indicator,
and columns 3-5 contain the age
(in years), the gender (1=male,
0=female), and the
marital status (1=married, 0=unmarried), respectively.
Twenty one percent of the patients are male, and
14 % are married.
In order to identify the impact of age, marital status and
gender on the length of stay in a nursing home, we are
fitting a Cox proportional hazards model with
covariates:
agespline, gender and married. The first
variable measures the age of a patient as
this
transformation was suggested in Morris, Norton, and Zhou (1994).
The other two covariates are indicator variables for
the gender and marital status, respectively.
The second column of
Table 1 provides the sample means of the covariates
and the standard deviation of agespline.
The following code reads in the data and calculates estimates
of the regression coefficients,
,
and their covariance matrix:
library("hazreg")
dat=read("nursing.dat") ; read data from file
t = dat[,1] ; time = length of stay
delta = dat[,2] ; censoring
age = dat[,3] ; covariate AGE
gender = dat[,4] ; covariate GENDER
married = dat[,5] ; covariate MARRIED
limit = matrix(rows(dat), 1)*90 ; transform AGE
agespline = min(age~limit,2) - 65
{data, ties} = hazdat(t, delta, agespline~gender~married)
; prepare data
{betahat, betak, ck}=hazbeta(data) ; estimate beta
{ll, ll1, ll2} = hazregll(data,betahat)
sigma = inv(ll2)
; covariance matrix of betahat
Table 1 presents the estimated regression coefficients in
the fitted model (
, the value of betahat)
and their estimated standard deviation, SE(
), obtained
as square root of the diagonal elements of sigma.
Let us test, for each of the covariates, whether it contributes
to the hazard in the presence of the other two variables:
{ttest1, val1, df1, pval1} = haztest(data, (0|1|1))
;test for AGESPLINE
{ttest2, val2, df2, pval2} = haztest(data, (1|0|1))
;test for GENDER
{ttest3, val3, df3, pval3} = haztest(data, (1|1|0))
;test for MARRIED
The variables ttest1, ttest2 and ttest3 contain
the summary tables for the covariates agespline,
gender, and married, respectively. The
-values are
provided in Table 1.
The only covariate with a significant contribution to the hazard
is the age.
In comparison, in Morris, Norton, and Zhou (1994),
a Cox model is fitted to all
observations,
with additional variables that identify the health
status at entry and the treatment group.
Here, the age does not appear to be significant, while gender,
marital status and poor health significantly contribute
to the length of stay.
These results are an example that caution is advised in
interpreting fitted models. In our case,
gender and marital status are correlated with the health status:
married patients
tend to enter the nursing home with more advanced health problems,
and men are more likely than women to be admitted in poorer health.
In our restricted
data set of patients with similar, good health at entry,
neiter gender nor marital status
are helpful for modeling
the expected length of stay in the framework of proportional hazards.