Datasets of the feature selection challenge
===========================================

This directory contains the data used in the NIPS 2003 feature selection challenge.
Each subdirectory corresponds to one of the datasets and contains the following files:

dataname.param         		-- data statistics and info to help read the files
dataname_feat.info     		-- feature information (was not available to participants)
dataname_feat.labels   		-- feature labels (were not available to participants)
dataname_train.data    		-- training data 
dataname_train.labels  		-- training labels
dataname_valid.data    		-- validation data	
dataname_valid.labels		-- validation labels (were not available until December 1st, 2003)
dataname_test.data     		-- test data 
dataname_test.scrambled_labels -- the test labels are scrambled 
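
As a quick sanity check on a download, the file list above can be turned into a small script. This is only a sketch in Python: the function names expected_files and missing_files are hypothetical, and it assumes that each subdirectory is named after its dataset and that "dataname" in the file names is replaced by that same name.

```python
from pathlib import Path

# Suffixes copied from the file list above; the dataset name is prepended.
SUFFIXES = [
    ".param",
    "_feat.info",
    "_feat.labels",
    "_train.data",
    "_train.labels",
    "_valid.data",
    "_valid.labels",
    "_test.data",
    "_test.scrambled_labels",
]


def expected_files(root, dataname):
    """Return the paths expected for one dataset.

    Assumption: the subdirectory is called `dataname` and sits under `root`.
    """
    base = Path(root) / dataname
    return [base / f"{dataname}{suffix}" for suffix in SUFFIXES]


def missing_files(root, dataname):
    """Return the expected paths that do not exist on disk."""
    return [p for p in expected_files(root, dataname) if not p.exists()]
```

Calling missing_files(".", "madelon") from this directory would then list any file of the madelon dataset that is absent, under the naming assumption stated above.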

All tasks are two-class classification problems with no missing values. For details, see Dataset.pdf.

We may make the test labels available upon request to teachers who want to use this benchmark for educational purposes. Contact isabelle@clopinet.com. To descramble the labels, use the Matlab function descramble('dataname_test.scrambled_labels', 'key'), where key will be given to you upon request.
Note: once you get the test set labels, you must regenerate the Matlab data files; otherwise they will not include the test labels.

We are releasing information on the features that was not available to participants:
- xxx_feat.info: a feature identifier.
<- arcene -> feat or probe, followed by original index number (before shuffling) followed by m/z value. The index number runs through features and probes because some probes are actual features that were randomized in column. This allows reconstituting the original spectrum.
<- dexter -> feat or probe, followed by original index number (separate numbering for features and probes), followed, for the features, by the word coded for.
<- dorothea -> feat or probe, followed by original index number. The index number runs through features and probes because probes are actual features randomized in column.
<- gisette -> feat or probe, followed by pair or pixel (pixels are original pixels, pairs are products of pixels), followed by one or two numbers identifying the pixels. The first features are all pixels, so it is possible to reconstitute the original 28x28 image, the lines of which have been concatenated.
<- madelon -> feat or probe, followed by an order number (a single ordering for probes and features), followed for the features by "useful", "redundant", or "repeated".

- xxx_feat.labels: +-1 labels, one per line, one for each feature, indicating whether the feature is a real feature (+1) or a probe (-1).
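
The two feature files can be cross-checked against each other, since each xxx_feat.info line starts with feat or probe and each xxx_feat.labels line holds the matching +1 or -1. The following Python sketch illustrates this. The function names are hypothetical, and it assumes that the fields of a feat.info line are separated by whitespace and that labels are written as plain integers (for example "+1", "1" or "-1"); neither detail is stated above.

```python
def parse_feat_labels(lines):
    """Parse +1/-1 feature labels, one per line; blank lines are skipped."""
    labels = []
    for line in lines:
        token = line.strip()
        if not token:
            continue
        value = int(token)  # int() accepts "+1", "1" and "-1"
        if value not in (1, -1):
            raise ValueError(f"unexpected feature label: {token!r}")
        labels.append(value)
    return labels


def kinds_from_feat_info(lines):
    """Return the first token ("feat" or "probe") of each non-blank line.

    Assumption: fields are whitespace-separated.
    """
    return [line.split()[0] for line in lines if line.strip()]


def label_mismatches(labels, kinds):
    """Return the 0-based positions where the two files disagree."""
    if len(labels) != len(kinds):
        raise ValueError("feat.labels and feat.info differ in length")
    expected = {"feat": 1, "probe": -1}
    return [
        i for i, (label, kind) in enumerate(zip(labels, kinds))
        if expected.get(kind) != label
    ]
```

Both parsers take any iterable of lines, so an open file object can be passed directly. The count of real features is then labels.count(1) and the count of probes is labels.count(-1).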
