Challenge Learning Objects FAQ

What is CLOP?
CLOP stands for Challenge Learning Object Package. It is a machine learning package written in the Matlab(R) language, based on the interface of the Spider package developed at the Max Planck Institute. It was developed for the WCCI Performance Prediction Challenge. It is provided here to give baseline methods with which to try to reproduce the results of the feature selection challenge.

Where do I get CLOP?
You can just download CLOP. Follow the installation instructions in the README file.

Are there restrictions to using CLOP?
Challenge participants may use CLOP on the challenge data with no restriction, provided that they read the disclaimer and agree to the license. For other uses of CLOP, please contact the organizers at modelselect@clopinet.com.

I don't like/have Matlab, do you support other platforms?
We may provide a command line interface, if there is sufficient demand. But it will not be as flexible. Contact the organizers at modelselect@clopinet.com if you are interested.

Is the sample code part of CLOP?
Yes, but you do not need CLOP to run the sample code. It can be downloaded separately.

Do I need to use CLOP to participate in the challenge?
No. If you use a CLOP model (or a combination of CLOP models using chains or ensembles) you can save your model and submit it with your challenge entry. This will ensure reproducibility of your results.

What are learning objects?
Learning objects are Matlab objects that provide methods for training and testing.

What are Matlab objects?
If you do not know anything about Matlab objects and/or object oriented programming, don't be scared away. You can learn how to use CLOP from examples. But you may definitely benefit from reading the (short) Matlab help on objects. Briefly: an object bundles data members with the methods (functions) that operate on them; in Matlab, a class named myclass is implemented as a directory @myclass containing the constructor myclass.m and one file per method, and calling a method on an object dispatches to the file in the corresponding @class directory.

How do I get started?
In the directory sample_code, you will find an example called main.m, which loads data and runs example models. Use the functions help, type, and which to view the documentation, the code, and the location of the functions or objects.
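For example (kridge and chain are CLOP objects described in the tables below):

> help kridge      % documentation of the kridge object
> which chain      % location of the chain object on the Matlab path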

Tell me something about the interface
In the Spider interface, both data and models are objects. Given two matrices X, containing the examples (patterns in rows, features in columns), and Y, containing the target values, you can construct a data object:

> training_data = data(X, Y);

The resulting object has 2 members: training_data.X and training_data.Y.
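For example, a data object can be built from any pair of matrices with matching numbers of rows; the random matrices below are purely illustrative:

> X = rand(100, 20);                 % 100 patterns, 20 features
> Y = sign(rand(100, 1) - 0.5);      % binary target values in {-1, +1}
> training_data = data(X, Y);
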
Models are derived from the class algorithm. They are constructed using a set of hyperparameters. Those are provided as a cell array of strings:

> hyperparameters = {'h1=val1', 'h2=val2'};
> untrained_model = algorithm(hyperparameters);     % Call to the constructor

In this way, hyperparameters can be provided in any order or omitted. Omitted hyperparameters take default values. To find out about the default values and allowed hyperparameter range, use the "default" method:
> default(algorithm)
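
For instance, to see the defaults of the kernel ridge regression object described below:
> default(kridge)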

All models have at least two methods, train and test:

> [training_data_output, trained_model] = train(untrained_model, training_data);
> test_data_output = test(trained_model, test_data);

trained_model is a model object identical to untrained_model, except that its parameters (some data members) have been updated by training. Repeatedly calling train on the same model may have different effects, depending on the model.
test_data_output is a data object identical to test_data, except that the X member has been replaced by the output of trained_model applied to test_data.X. The test method does not look at the Y member.
Likewise, training_data_output is a data object identical to training_data, except that the X member has been replaced by the output of test(trained_model, training_data).
Of course, you may give untrained_model and trained_model the same name if you want.
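Putting the pieces together, a minimal sketch (the choice of kridge and the shrinkage value are illustrative; see the tables below):

> untrained_model = kridge({'shrinkage=0.1'});
> [training_data_output, trained_model] = train(untrained_model, training_data);
> test_data_output = test(trained_model, test_data);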

I do not find the methods train and test in the @a_certain_model directory, where are they?
Models are derived from the class algorithm; look for them in the directory spider/basic/@algorithm. The train and test methods of algorithm objects call the methods "training" and "testing" of the derived models.

I do not find certain methods in @a_certain_directory, where are they?
CLOP overloads some of the Spider methods. You can find the missing methods in the Spider directory; the directory tree is the same. CLOP methods take precedence over the Spider methods.

Is there a different interface for preprocessing modules and classifiers?
No. The interface is the same. For preprocessing objects, training_data_output.X and test_data_output.X contain the preprocessed data. For classifiers, they contain a vector of discriminant values.
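For example, a preprocessing object is trained and applied with exactly the same calls as a classifier (a minimal sketch):

> [train_out, trained_std] = train(standardize, training_data);
> test_out = test(trained_std, test_data);    % test_out.X holds the standardized features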

How do I save models?
At the Matlab prompt, type:
> save('filename', 'modelname');

To include your model with your submission, use [dataname]_model as the file name for the corresponding (trained) model. Hence, each submission should include 5 models named ada_model.mat, gina_model.mat, hiva_model.mat, nova_model.mat, and sylva_model.mat.
If your models make your submission too big to be uploaded, you may use the clean method to remove some of the parameters of your model before saving it (the hyperparameters remain untouched):
> my_lean_model = clean(my_model);
Please also save your full model, and contact us at modelselect@clopinet.com for instructions on how to submit it.
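For instance, assuming a model trained on the ADA dataset is held in the variable my_model (the variable name is illustrative):

> ada_model = clean(my_model);       % optional: shed large parameters if the file is too big
> save('ada_model', 'ada_model');    % creates ada_model.mat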

How can I chain models?
You can create a chain object. A chain object behaves like any other learning object: it has a train and a test method, and the outputs of one member of the chain are fed as inputs into the next one. In this example, feature standardization is used as a preprocessing; its output is fed into a neural network:
> my_chained_model=chain({standardize, neural({'units=10', 'shrinkage=0.01'})});
A chain can be trained and tested like any other model. If an element of the chain is to be re-used once trained as part of another chain, it can be extracted with curly brace notation. For instance:
> my_model= my_chained_model{1};
will return the first element of the chain.
Chains of trained models can also be formed. Just beware of inconsistencies.
You must use chain objects to save your composite models.
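A chain is trained and tested with the usual calls; a minimal sketch reusing the chain constructed above:

> [train_out, trained_chain] = train(my_chained_model, training_data);
> test_out = test(trained_chain, test_data);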

Can I create ensembles of models?
Yes. You can create an ensemble object, e.g.
> my_model=ensemble({neural, kridge, naive});
Models of the same class with different hyperparameters, or altogether different models, can be combined in this way. Chains can also be part of ensembles, and ensembles can be part of chains.
Like other objects, ensembles have a train and a test method. The test method forwards the data to all the elements of the ensemble. The output is a weighted sum of the discriminant values of the models, plus a bias. Optionally, the sign of the discriminant values is used in place of the discriminant values. By default, the training method trains all the members of the ensemble and sets the voting weights to one and the bias to zero. Users are free to set the weights and the bias to other values, using set_w and set_b0, or by rewriting the training method.
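For example, chains and plain models can be mixed freely in one ensemble (the hyperparameter values are illustrative):

> my_model = ensemble({chain({standardize, neural({'units=5'})}), kridge({'shrinkage=0.1'})});
> [train_out, trained_ensemble] = train(my_model, training_data);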

Which models are part of CLOP?
Models in the challenge_objects directory were designed for the challenge:

kridge (Classifier)
Hyperparameters: shrinkage, kernel parameters (coef0, degree, gamma), balance
Kernel ridge regression.

naive (Classifier)
Hyperparameters: none
Naive Bayes.

neural (Classifier)
Hyperparameters: units (num. hidden), shrinkage, maxiter, balance
Neural network (Netlab).

rf (Classifier)
Hyperparameters: units (num. trees), mtry (num. features per split)
Random Forest (RF). We are presently having segmentation fault problems with this classifier for large datasets. We deeply regret the passing of Leo Breiman, the father of CART and RF, who died on July 5, 2005 at age 77. We will work with the other authors of the original package to fix the problem.

svc (Classifier)
Hyperparameters: shrinkage, kernel parameters (coef0, degree, gamma)
Support vector classifier (LibSVM).

s2n (Feature selection)
Hyperparameters: f_max, w_min
Signal-to-noise ratio coefficient for feature ranking.

relief (Feature selection)
Hyperparameters: f_max, w_min, k_num
Relief ranking criterion.

gs (Feature selection)
Hyperparameters: f_max
Forward feature selection with Gram-Schmidt orthogonalization.

rffs (Feature selection)
Hyperparameters: f_max, w_min, child
Random Forest used as a feature selection filter. The "child" argument, which may be passed in the argument array, is an rf object with defined hyperparameters. If no child is provided, an rf with default values is used.

svcrfe (Feature selection)
Hyperparameters: f_max, child
Recursive Feature Elimination filter using svc. The "child" argument, which may be passed in the argument array, is an svc object with defined hyperparameters. If no child is provided, an svc with default values is used.

standardize (Preprocessing)
Hyperparameters: center
Standardization of the features (the columns of the data matrix are divided by their standard deviation; optionally, the mean is first subtracted if center=1).

normalize (Preprocessing)
Hyperparameters: center
Normalization of the rows of the data matrix (optionally, the mean of the rows is subtracted first).

shift_n_scale (Preprocessing)
Hyperparameters: offset, factor, take_log
Performs X <- (X-offset)/factor globally on the data matrix. Optionally also applies log(1+X). offset and factor are set as hyperparameters, or subject to training.

pc_extract (Preprocessing)
Hyperparameters: f_max
Extraction of features with principal component analysis.

subsample (Preprocessing)
Hyperparameters: p_max, balance
Takes a subsample of the training patterns. The member pidx is set to a random subset of p_max patterns by training, unless it is set "by hand" to the indices of the patterns to be kept, with the method set_idx. May be used to downsize the training set or exclude outliers.

chain (Grouping)
Hyperparameters: child
A chain of models, each feeding its outputs as inputs to the next one. The "child" argument is an array of models.

ensemble (Grouping)
Hyperparameters: child, signed_output
A group of models voting to make the final decision. The "child" argument is an array of models. The default training method trains all the members of the ensemble and sets the voting weights to one and the bias to zero; these can be set to different values with the methods set_w and set_b0, or the training method may be overloaded. The hyperparameter signed_output indicates whether the sign of the outputs should be taken prior to voting.
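
Many of these objects are meant to be combined. For instance, a typical challenge-style model chains feature selection, preprocessing, and a classifier (all hyperparameter values below are illustrative):

> my_model = chain({s2n({'f_max=100'}), standardize, svc({'degree=2', 'shrinkage=0.01'})});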

What are reasonable hyperparameter values?

Kernel methods use a kernel of the form k(x, y) = (coef0 + x . y)^degree * exp(-gamma ||x - y||^2)

coef0
Default: 0 for svc, 1 for kridge. Range: [0, Inf].
Kernel parameter (bias). Often taken as 0 or 1.

degree
Default: 1. Range: [0, Inf].
Kernel parameter (polynomial degree). Usually taken between 0 and 10. Larger values increase the model capacity.

gamma
Default: 0. Range: [0, Inf].
Kernel parameter (inverse window/neighborhood width). The range of useful values may depend on the geometry of the space (e.g. the distance between the closest examples of opposite classes). Larger values increase the model capacity.

shrinkage
Default: 1e-14. Range: [0, Inf].
For kernel methods: small value (ridge) added to the diagonal of the kernel matrix. For neural networks: weight decay. Acts as a regularizer or shrinkage parameter to prevent overfitting.

balance
Default: 0. Range: {0, 1}.
Flag indicating whether class balancing should be enabled, i.e. compensating for the imbalance in the number of examples of the two classes.

units
Default: 10 for neural, 100 for rf. Range: [0, Inf].
Number of hidden units (for a 2-layer neural network) or number of trees (for rf).

maxiter
Default: 100. Range: [0, Inf].
Maximum number of training iterations (for a 2-layer neural network).

mtry
Default: []. Range: [0, Inf].
Number of candidate features per split (for rf). If [], it is set to sqrt(feature_number).

f_max
Default: Inf. Range: [0, Inf].
Maximum number of features selected. If f_max=Inf, no limit is set on the number of features.

p_max
Default: Inf. Range: [0, Inf].
Maximum number of patterns to train on. If p_max=Inf, no limit is set on the number of patterns.

w_min
Default: -Inf. Range: [-Inf, Inf].
Threshold on the ranking criterion W. If W(i) <= w_min, feature i is eliminated. W is non-negative, so a negative value of w_min means all the features are kept.

k_num
Default: 4. Range: [0, Inf].
Number of neighbors in the Relief algorithm.

child
Default: a model or an array of models. Range: NA.
A model passed as argument to a feature selection object, or an array of models passed to a grouping object.

center
Default: 1 for standardize, 0 for normalize. Range: {0, 1}.
Flag indicating whether the data should be centered (the columns for standardize and the rows for normalize).

offset
Default: []. Range: [-Inf, Inf].
Offset value subtracted from the data matrix X. If [], it is set to min(X).

factor
Default: []. Range: (0, Inf].
Scaling factor by which the data matrix X is divided. If [], it is set to max(X-offset).

take_log
Default: 0. Range: {0, 1}.
Flag indicating whether log(1+X) should be taken.

signed_output
Default: 0. Range: {0, 1}.
Flag indicating, for the ensemble object, whether the sign of the output of the classifiers should be taken prior to voting. Not to be confused with the private member use_signed_output, which is fixed to 0 for all challenge learning objects so that we can compute the AUC.
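
As an illustration, here is an svc constructed with non-default values for the kernel hyperparameters above (the particular values are arbitrary):

> my_svc = svc({'coef0=1', 'degree=3', 'gamma=0', 'shrinkage=0.01'});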

Can I also use other Spider functions or objects?

You can do anything you want. We have canceled the "bonus" entries for submissions using only CLOP models. This part of the challenge will be replaced by a post-challenge game, to be announced.

How do I do model selection with CLOP?
You need to write your own code. You can check some of the Spider functions, which implement cross-validation.
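As an illustration, here is a plain k-fold cross-validation loop written directly against the train/test interface; the fold splitting and error computation below are our own illustrative code, not CLOP functions:

> k = 5;                                  % number of folds
> n = size(X, 1);                         % number of training patterns
> idx = randperm(n);                      % random permutation of the patterns
> errate = zeros(k, 1);
> for i = 1:k
>     test_idx  = idx(i:k:n);             % every k-th pattern goes to the test fold
>     train_idx = setdiff(idx, test_idx); % the rest go to the training fold
>     d_train = data(X(train_idx, :), Y(train_idx));
>     d_test  = data(X(test_idx, :), Y(test_idx));
>     [tr_out, model] = train(kridge({'shrinkage=0.1'}), d_train);
>     te_out = test(model, d_test);
>     errate(i) = mean(sign(te_out.X) ~= Y(test_idx));  % error rate of the discriminant sign
> end
> fprintf('CV error: %5.3f +- %5.3f\n', mean(errate), std(errate));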

How do I create an ensemble of models?
Implement a training method for the ensemble object. You can check some of the Spider functions, which implement ensemble methods.

Do I need to understand the algorithms?
You can treat the algorithms as black boxes and play with the hyperparameters.

But I want to understand the algorithms, where do I learn about them?
Each object comes with help text that includes an appropriate reference.

Can I modify the code?
Since "bonus entries" have been canceled, any modifications are allowed:

Should we include our models with our submissions?
Yes, unless your models are very big. Contact the challenge web page administrator for instructions if your models are too big to upload.

Can a participant give an arbitrarily hard time to the organizers?
DISCLAIMER: ALL INFORMATION, SOFTWARE, DOCUMENTATION, AND DATA ARE PROVIDED "AS-IS". ISABELLE GUYON AND/OR OTHER ORGANIZERS DISCLAIM ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR ANY PARTICULAR PURPOSE, AND THE WARRANTY OF NON-INFRINGEMENT OF ANY THIRD PARTY'S INTELLECTUAL PROPERTY RIGHTS. IN NO EVENT SHALL ISABELLE GUYON AND/OR OTHER ORGANIZERS BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF SOFTWARE, DOCUMENTS, MATERIALS, PUBLICATIONS, OR INFORMATION MADE AVAILABLE FOR THE CHALLENGE.

Who can I ask for more help and how do I report bugs?
Email modelselect@clopinet.com.


Last updated: September 30, 2005.