Case Study 8

SALES OF ORTHOPEDIC EQUIPMENT

The objective of this study is to find ways to increase sales of orthopedic material from our company to hospitals in the United States. I want you to concentrate on a small set of related states with about 500 hospitals and find those that have high consumption of such equipment but where our sales are low. Come up with a selected group where you think our efforts will be rewarded. In reality this is a much bigger dataset (in the thousands, as in most data mining applications), but for your convenience we use this small group.

The following description of the dataset includes the variable names and some summaries of the variables. SAS code to perform cluster analysis is also included.

DATASET ORTHOPEDIC
VARIABLES:

     ZIP :  US POSTAL CODE
     HID :  HOSPITAL ID
    CITY :  CITY NAME
   STATE :  STATE NAME
    BEDS :  NUMBER OF HOSPITAL BEDS
   RBEDS :  NUMBER OF REHAB BEDS
    OUTV :  NUMBER OF OUTPATIENT VISITS
     ADM :  ADMINISTRATIVE COST (IN $1000'S PER YEAR)
     SIR :  REVENUE FROM INPATIENT
  SALESY :  SALES OF REHABILITATION EQUIPMENT SINCE JAN 1
 SALES12 :  SALES OF REHAB. EQUIP. FOR THE LAST 12 MO
   HIP95 :  NUMBER OF HIP OPERATIONS FOR 1995
  KNEE95 :  NUMBER OF KNEE OPERATIONS FOR 1995
      TH :  TEACHING HOSPITAL?  0, 1
  TRAUMA :  DO THEY HAVE A TRAUMA UNIT?  0, 1
   REHAB :  DO THEY HAVE A REHAB UNIT?  0, 1
   HIP96 :  NUMBER HIP OPERATIONS FOR 1996
  KNEE96 :  NUMBER KNEE OPERATIONS FOR 1996
 FEMUR96 :  NUMBER FEMUR OPERATIONS FOR 1996
 

SUMMARIES:

       ZIP                   CITY          STATE           BEDS
Min.   :  612    Chicago      :  45    CA    : 458   Min.   :   0.0
1st Qu.:28550    Houston      :  41    TX    : 342   1st Qu.:  69.0
Median :49000    Philadelphia :  38    NY    : 241   Median : 136.0
Mean   :50600    Los Angeles  :  28    PA    : 238   Mean   : 191.2
3rd Qu.:75240    New York     :  24    FL    : 228   3rd Qu.: 262.0
Max.   :99900    Dallas       :  24    IL    : 208   Max.   :1476.0
                 (Other)      :4503   (Other):2988

     RBEDS             OUTV              ADM             SIR
Min.   :  0.000   Min.   :      0   Min.   :    0   Min.   :    0
1st Qu.:  0.000   1st Qu.:   7510   1st Qu.: 1932   1st Qu.: 1312
Median :  0.000   Median :  20880   Median : 4508   Median : 3384
Mean   :  7.244   Mean   :  47350   Mean   : 6689   Mean   : 4849
3rd Qu.:  0.000   3rd Qu.:  47700   3rd Qu.: 9402   3rd Qu.: 6832
Max.   :850.000   Max.   :1987000   Max.   :66440   Max.   :70300
 

     SALESY           SALES12            HIP95             KNEE95
Min.   :   0.00   Min.   :   0.00   Min.   :   0.00   Min.   :  0.00
1st Qu.:   0.00   1st Qu.:   0.00   1st Qu.:   7.00   1st Qu.:  1.00
Median :   1.00   Median :   2.00   Median :  28.00   Median : 18.00
Mean   :  25.91   Mean   :  41.05   Mean   :  51.27   Mean   : 41.73
3rd Qu.:  23.00   3rd Qu.:  33.00   3rd Qu.:  70.00   3rd Qu.: 52.50
Max.   :1209.00   Max.   :2770.00   Max.   :1421.00   Max.   :868.00
 

        TH             TRAUMA           REHAB            HIP96
Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :   0.0
1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:   8.0
Median :0.0000   Median :0.0000   Median :0.0000   Median :  29.0
Mean   :0.2737   Mean   :0.1225   Mean   :0.1839   Mean   :  52.6
3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:  71.0
Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1373.0
 

     KNEE96           FEMUR96
Min.   :   0.00   Min.   :  0.00
1st Qu.:   0.00   1st Qu.: 11.00
Median :  18.00   Median : 34.00
Mean   :  41.91   Mean   : 49.39
3rd Qu.:  56.00   3rd Qu.: 74.00
Max.   :1081.00   Max.   :489.00

  Overview of the Analysis

 


Cluster Analysis

We have a dataset and we want to group the data into k distinct natural groups. There are many approaches to cluster analysis, and this is very noticeable in the software implementations found in the usual statistical packages, which implement 10 or 15 different methods. Although the idea of cluster analysis appears to be very intuitive, it is difficult to formalize in a single general way. We present here a very limited review.

 

There are two popular methods for doing cluster analysis:

 

1. Hierarchical clustering.

 

Many clustering methods require the definition of an inter-point distance d(x1,x2) and an inter-cluster distance d*(C1,C2).

 

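The inter-point distance is often taken to be the Euclidean distance

    d_e(x_1, x_2) = \sqrt{\sum_{j=1}^{p} (x_{1j} - x_{2j})^2}

where x_{ij} is the value of variable j for point x_i and p is the number of variables.

Sometimes we may use the Manhattan distance

    d_M(x_1, x_2) = \sum_{j=1}^{p} |x_{1j} - x_{2j}|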
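
As a small illustration, the two inter-point distances can be sketched in Python. The Euclidean distance is the square root of the summed squared coordinate differences, and the Manhattan distance is the sum of the absolute coordinate differences. The function names and the two example hospitals are made up for this sketch and are not taken from the dataset.

```python
import math

def euclidean(x1, x2):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def manhattan(x1, x2):
    """Sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x1, x2))

# Two hypothetical hospitals described by (BEDS, HIP96, KNEE96).
# The numbers are illustrative only.
h1 = (136.0, 29.0, 18.0)
h2 = (262.0, 71.0, 56.0)

print(euclidean(h1, h2))
print(manhattan(h1, h2))
```

Both functions accept any two points with the same number of coordinates, so they can be reused with whichever variables are chosen for clustering.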

 

The inter-cluster distance between two clusters is defined as a function of the inter-point distances between pairs of points, where each point in a pair comes from a different cluster. The popular definitions of inter-cluster distances are:
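
As an illustration, three definitions commonly offered by statistical packages are single linkage (the minimum inter-point distance), complete linkage (the maximum), and average linkage (the mean). That these are the definitions intended here is an assumption; the sketch below only shows how each one is built from the inter-point distances between pairs drawn from different clusters. All function names are hypothetical.

```python
import math
from itertools import product

def euclidean(x1, x2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def pair_distances(c1, c2, d=euclidean):
    """All inter-point distances with one point taken from each cluster."""
    return [d(p, q) for p, q in product(c1, c2)]

def single_linkage(c1, c2, d=euclidean):
    """Distance between the closest pair of points."""
    return min(pair_distances(c1, c2, d))

def complete_linkage(c1, c2, d=euclidean):
    """Distance between the farthest pair of points."""
    return max(pair_distances(c1, c2, d))

def average_linkage(c1, c2, d=euclidean):
    """Mean distance over all pairs of points."""
    dists = pair_distances(c1, c2, d)
    return sum(dists) / len(dists)

# Two small made-up clusters in the plane.
c1 = [(0.0, 0.0), (0.0, 1.0)]
c2 = [(3.0, 0.0), (4.0, 0.0)]

print(single_linkage(c1, c2))
print(complete_linkage(c1, c2))
print(average_linkage(c1, c2))
```

For any pair of clusters the single-linkage value is never larger than the average, which in turn is never larger than the complete-linkage value.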

We build a hierarchical tree starting with one cluster at each sample point; at each stage of the tree, the two closest clusters are joined to form a new cluster.
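
The tree-building step can be sketched as follows. This is a minimal, unoptimized version that assumes Euclidean inter-point distance and single linkage (minimum pairwise distance) as the inter-cluster distance; both choices, the function names, and the sample points are assumptions made for the example.

```python
import math

def euclidean(x1, x2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def single_linkage(c1, c2):
    """Inter-cluster distance: the closest pair across the two clusters."""
    return min(euclidean(p, q) for p in c1 for q in c2)

def agglomerate(points, linkage=single_linkage):
    """Return the merge history as a list of (cluster_a, cluster_b, distance)."""
    clusters = [[p] for p in points]  # start with one cluster per sample point
    history = []
    while len(clusters) > 1:
        # Find the two closest clusters under the chosen linkage.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist = linkage(clusters[i], clusters[j])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        dist, i, j = best
        history.append((clusters[i], clusters[j], dist))
        # Join them into a new cluster and drop the two originals.
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return history

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0), (10.0, 10.0)]
history = agglomerate(points)
heights = [dist for _, _, dist in history]
print(heights)
```

With n points the history contains n - 1 merges, and the recorded distances are the heights at which the tree joins clusters.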

 

Once we finish building the tree, the question becomes "How many clusters do we choose?"

 

One way of making this determination is by inspecting the hierarchical tree and finding a reasonable point at which to break the clusters. We can also plot the criterion function against the number of clusters and look visually for unusually large jumps.
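
The search for an unusually large jump can be mimicked numerically. The sketch below is a simple heuristic, not a formal rule: it assumes the criterion is the sequence of merge distances recorded while building the tree, finds the largest increase between consecutive merges, and suggests stopping just before that merge. The function name and the example values are hypothetical.

```python
def suggest_k(merge_heights):
    """Suggest a number of clusters from the largest jump in merge heights.

    merge_heights[m] is the inter-cluster distance at merge m + 1,
    so n points produce n - 1 heights.
    """
    n_points = len(merge_heights) + 1
    jumps = [b - a for a, b in zip(merge_heights, merge_heights[1:])]
    # Number of merges performed before the largest jump occurs.
    merges_done = max(range(len(jumps)), key=jumps.__getitem__) + 1
    return n_points - merges_done

# Made-up heights for 7 points: four cheap merges, then a large jump.
heights = [0.5, 0.7, 0.9, 1.1, 6.0, 6.5]
print(suggest_k(heights))
```

A plot of the same heights against the number of clusters remains the better check, since the largest jump is not always the most meaningful one.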

 

2. Centroid methods. K-means algorithm.

 

K seed points are chosen and the data are distributed among K clusters. Then the clusters are gradually optimized using some criterion such as R^2. At each stage of the algorithm, one point is moved to the cluster that will optimize the criterion function. This is iterated until convergence occurs. The final configuration depends to some extent on the initial configuration, so it is important to choose a good starting configuration.
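
A minimal k-means sketch is given below. It differs from the description above in one respect: instead of moving one point at a time, each pass reassigns every point to its nearest centroid and then recomputes the centroids (the common batch variant). R^2 is computed here as one minus the ratio of the within-cluster sum of squares to the total sum of squares, which is an assumed definition. The function names, seeds, and sample points are made up for the example.

```python
def mean(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def sq_dist(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, seeds, max_iter=100):
    """Batch k-means starting from the given seed points."""
    centroids = list(seeds)
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:  # no point changed cluster: converged
            break
        labels = new_labels
        # Recompute each centroid from its current members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = mean(members)
    return labels, centroids

def r_squared(points, labels, centroids):
    """Share of the total sum of squares explained by the clustering."""
    grand = mean(points)
    sst = sum(sq_dist(p, grand) for p in points)
    ssw = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return 1.0 - ssw / sst

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
labels, centroids = kmeans(points, seeds=[(0.0, 0.0), (9.0, 9.0)])
print(labels, centroids, r_squared(points, labels, centroids))
```

Running the same function with different seeds is a quick way to see how much the final configuration depends on the starting one.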

 

One possibility is to run Ward's method and use the outcome as the initial configuration for k-means.
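
This two-stage idea can be sketched as follows. The first function merges clusters by Ward's criterion, always joining the pair whose merge gives the smallest increase in the within-cluster sum of squares, until k clusters remain, and returns their centroids. Those centroids then serve as seeds for a batch k-means pass. This is an unoptimized illustration under those assumptions; the function names and sample points are hypothetical.

```python
def mean(points):
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def ward_seeds(points, k):
    """Merge by Ward's criterion until k clusters remain; return centroids."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a, b = clusters[i], clusters[j]
                # Increase in within-cluster sum of squares if a and b merge.
                cost = len(a) * len(b) / (len(a) + len(b)) * sq_dist(mean(a), mean(b))
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return [mean(c) for c in clusters]

def kmeans(points, seeds, max_iter=100):
    """Batch k-means: reassign all points, then update centroids."""
    centroids = list(seeds)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = mean(members)
    return labels, centroids

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
seeds = ward_seeds(points, k=2)
labels, centroids = kmeans(points, seeds)
print(seeds, labels)
```

Because the seeds already sit at the centers of compact groups, k-means typically needs few passes from this starting configuration.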