Case Study 8
SALES OF ORTHOPEDIC EQUIPMENT
The objective of this study is to find ways to increase sales of orthopedic equipment from our company to hospitals in the United States. I want you to concentrate on a small set of related states, covering about 500 hospitals, and find those that have high consumption of such equipment but where our sales are low. Come up with a selected group where you think our efforts will be rewarded. In reality this is a much bigger dataset (in the thousands, as in most data mining applications), but for your convenience we use this smaller group.
The following description of the dataset includes variable names and some summaries of the variables. SAS code to perform the cluster analysis is also included.
DATASET ORTHOPEDIC

VARIABLES:

ZIP     : US POSTAL CODE
HID     : HOSPITAL ID
CITY    : CITY NAME
STATE   : STATE NAME
BEDS    : NUMBER OF HOSPITAL BEDS
RBEDS   : NUMBER OF REHAB BEDS
OUTV    : NUMBER OF OUTPATIENT VISITS
ADM     : ADMINISTRATIVE COST (IN $1000'S PER YEAR)
SIR     : REVENUE FROM INPATIENT
SALESY  : SALES OF REHABILITATION EQUIPMENT SINCE JAN 1
SALES12 : SALES OF REHABILITATION EQUIPMENT FOR THE LAST 12 MONTHS
HIP95   : NUMBER OF HIP OPERATIONS FOR 1995
KNEE95  : NUMBER OF KNEE OPERATIONS FOR 1995
TH      : TEACHING HOSPITAL? 0, 1
TRAUMA  : DO THEY HAVE A TRAUMA UNIT? 0, 1
REHAB   : DO THEY HAVE A REHAB UNIT? 0, 1
HIP96   : NUMBER OF HIP OPERATIONS FOR 1996
KNEE96  : NUMBER OF KNEE OPERATIONS FOR 1996
FEMUR96 : NUMBER OF FEMUR OPERATIONS FOR 1996
SUMMARIES:

Numeric variables (Min, 1st Qu., Median, Mean, 3rd Qu., Max):

VARIABLE      MIN   1ST QU.    MEDIAN      MEAN   3RD QU.       MAX
ZIP           612     28550     49000     50600     75240     99900
BEDS          0.0      69.0     136.0     191.2     262.0    1476.0
RBEDS       0.000     0.000     0.000     7.244     0.000   850.000
OUTV            0      7510     20880     47350     47700   1987000
ADM             0      1932      4508      6689      9402     66440
SIR             0      1312      3384      4849      6832     70300
SALESY       0.00      0.00      1.00     25.91     23.00   1209.00
SALES12      0.00      0.00      2.00     41.05     33.00   2770.00
HIP95        0.00      7.00     28.00     51.27     70.00   1421.00
KNEE95       0.00      1.00     18.00     41.73     52.50    868.00
TH         0.0000    0.0000    0.0000    0.2737    1.0000    1.0000
TRAUMA     0.0000    0.0000    0.0000    0.1225    0.0000    1.0000
REHAB      0.0000    0.0000    0.0000    0.1839    0.0000    1.0000
HIP96         0.0       8.0      29.0      52.6      71.0    1373.0
KNEE96       0.00      0.00     18.00     41.91     56.00   1081.00
FEMUR96      0.00     11.00     34.00     49.39     74.00    489.00

Most frequent values of CITY : Chicago 45, Houston 41, Philadelphia 38,
                               Los Angeles 28, New York 24, Dallas 24, (Other) 4503
Most frequent values of STATE: CA 458, TX 342, NY 241, PA 238, FL 228,
                               IL 208, (Other) 2988
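For reference, a minimal SAS sketch that would reproduce summaries of this kind, assuming the data have already been read into a SAS data set named ORTHOPEDIC with the variable names listed above:

/* Five-number summaries plus the mean for the numeric variables
   (the data set name ORTHOPEDIC is an assumption). */
proc means data=orthopedic min q1 median mean q3 max;
   var zip beds rbeds outv adm sir salesy sales12
       hip95 knee95 th trauma rehab hip96 knee96 femur96;
run;

/* Frequency counts for the character variables, most frequent first. */
proc freq data=orthopedic order=freq;
   tables city state;
run;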
Overview of the Analysis
Cluster Analysis
We have a dataset and we want to group the data into k distinct natural groups. There are many approaches to cluster analysis, and this is very noticeable in the software implementations of cluster analysis in the usual statistical packages, which offer 10 or 15 different methods. Although the idea of cluster analysis appears to be very intuitive, it is difficult to formalize in a general, unique sense. We present here a very limited review.

There are two popular methods for doing cluster analysis:
1. Hierarchical clustering.

Many clustering methods require the definition of an inter-point distance d(x_1, x_2) and an inter-cluster distance d*(C_1, C_2). The inter-point distance is often taken to be the Euclidean distance

    d_E(x_1, x_2) = \sqrt{ \sum_j (x_{1j} - x_{2j})^2 }

Sometimes we may use the Manhattan distance

    d_M(x_1, x_2) = \sum_j | x_{1j} - x_{2j} |
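As a small worked illustration (the numbers here are made up, not taken from the data), the following DATA step computes both distances between two hypothetical hospitals described by three measures:

/* Euclidean vs. Manhattan distance between two hypothetical hospitals. */
data _null_;
   beds1 = 136; hip1 = 28; knee1 = 18;   /* hospital 1 */
   beds2 = 262; hip2 = 70; knee2 = 52;   /* hospital 2 */
   d_euclid    = sqrt((beds1 - beds2)**2 + (hip1 - hip2)**2 + (knee1 - knee2)**2);
   d_manhattan = abs(beds1 - beds2) + abs(hip1 - hip2) + abs(knee1 - knee2);
   put d_euclid= d_manhattan=;   /* writes both distances to the log */
run;

In practice the variables should be standardized before computing distances, since measures such as BEDS and OUTV are on very different scales.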
The inter-cluster distance between two clusters is defined as a function of the inter-point distances between pairs of points where each point comes from a different cluster. The popular definitions of inter-cluster distance include:

    Single linkage:   d*(C_1, C_2) = the minimum of d(x_1, x_2) over x_1 in C_1 and x_2 in C_2
    Complete linkage: d*(C_1, C_2) = the maximum of d(x_1, x_2) over x_1 in C_1 and x_2 in C_2
    Average linkage:  d*(C_1, C_2) = the average of d(x_1, x_2) over all such pairs
    Ward's method:    the increase in the within-cluster sum of squares produced by merging C_1 and C_2
We build a hierarchical tree starting with a cluster at each sample point, and at each stage of the tree the two closest clusters are joined to form a new cluster. Once we finish building the tree the question becomes "How many clusters do we choose?" One way of making this determination is by inspecting the hierarchical tree and finding a reasonable point to break the clusters. We can also plot the criterion function against the number of clusters and visually look for unusually large jumps.
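A minimal sketch of this step in SAS, assuming the data set is named ORTHOPEDIC, that the hospital-profile variables below are the ones to cluster on, and that Ward's method and a cut at 5 clusters are placeholder choices:

/* Hierarchical clustering of hospitals (variable list is an assumption). */
proc cluster data=orthopedic method=ward std outtree=tree;
   var beds rbeds outv adm sir hip95 knee95 hip96 knee96 femur96;
   id hid;
run;

/* Cut the tree at a trial number of clusters, to be judged from the
   printed tree and the clustering criterion. */
proc tree data=tree nclusters=5 out=clusout;
   id hid;
run;

The STD option standardizes the variables before clustering, which matters because BEDS, OUTV, and ADM are on very different scales.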
2. Centroid methods: the k-means algorithm.

K seed points are chosen and the data are distributed among K clusters. The clusters are then gradually optimized using some criterion such as R2. At each stage of the algorithm one point is moved to the cluster that most improves the criterion function, and this is iterated until convergence occurs. The final configuration has some dependence on the initial configuration, so it is important to choose a good starting point. One possibility is to run Ward's method and use its outcome as the initial configuration for k-means, as sketched below.
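A hedged sketch of that two-step idea in SAS, again assuming a data set named ORTHOPEDIC, the variable names used above, and 5 clusters as a placeholder:

/* Standardize the clustering variables once, so that the Ward step and
   the k-means step work on the same scale. */
proc standard data=orthopedic mean=0 std=1 out=orthostd;
   var beds rbeds outv adm sir hip95 knee95 hip96 knee96 femur96;
run;

/* Step 1: Ward's method; keep a 5-cluster cut of the tree. */
proc cluster data=orthostd method=ward outtree=tree noprint;
   var beds rbeds outv adm sir hip95 knee95 hip96 knee96 femur96;
   id hid;
run;

proc tree data=tree nclusters=5 out=wardout noprint;
   id hid;
run;

/* Step 2: merge the Ward cluster assignments back and compute cluster
   means, which will serve as the k-means seeds. */
proc sort data=orthostd; by hid; run;
proc sort data=wardout;  by hid; run;

data withclus;
   merge orthostd wardout(keep=hid cluster);
   by hid;
run;

proc means data=withclus nway noprint;
   class cluster;
   var beds rbeds outv adm sir hip95 knee95 hip96 knee96 femur96;
   output out=seeds mean=;
run;

/* Step 3: k-means (PROC FASTCLUS) started from the Ward cluster means. */
proc fastclus data=orthostd seed=seeds replace=none maxclusters=5
              maxiter=100 out=kmout;
   var beds rbeds outv adm sir hip95 knee95 hip96 knee96 femur96;
run;

The OUT= data set contains the final cluster assignment for each hospital; merging it back with the original data lets one compare SALES12 against the hip, knee, and femur operation counts within each cluster to find high-consumption, low-sales hospitals.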