Experiment Settings

The "Experiments" folder holds experiment configuration files. Each experiment can be configured to use particular data sets, algorithms, a specific number of cross-validation folds, and so on. This way, when you run an experiment, you not only have flexibility in how you configure it, but you also have a record of the precise settings that were used to run the experiment.

Only one experiment setting, DATA_PROCESSORS, is mandatory; if it is not specified, ML-Flex will return an error. The remaining experiment settings are optional; if you do not specify a value, a default value will be used.

This folder and its subfolders contain various sample experiment files for demo purposes. At the top of each sample file is a description of the experiment.

When multiple values can be specified for a given configuration setting, they should be separated by semicolons.
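For example, assuming the KEY=value syntax used in the sample experiment files (the setting and algorithm names below appear elsewhere in this document, but this particular combination is illustrative), two values for a setting would be specified like this:

```text
CLASSIFICATION_ALGORITHMS=weka_naive_bayes;weka_knn
FEATURE_SELECTION_ALGORITHMS=weka_info_gain;weka_relieff
```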

Mandatory experiment setting

DATA_PROCESSORS

Description

Each experiment file must specify at least one data processor. A data processor is a Java class within ML-Flex that contains logic for processing a particular data set. Data processors can be used to process common input-data formats, including tab-delimited and ARFF. Additionally, users can create custom data processors by developing a class that inherits from mlflex.dataprocessors.AbstractDataProcessor. In the experiment file, parameters can also be specified for a data processor (using a syntax similar to that of the Java programming language). Many examples of how to specify data processors are provided in the Experiments directory.

One data processor provided with ML-Flex is mlflex.dataprocessors.DelimitedDataProcessor. This data processor can be used to parse simple text-based input files. The default delimiter is a tab, but other delimiters can be specified. Delimited files should contain a header row with an ID for each data instance. Each subsequent row should start with an entry that specifies a data point name, followed by the respective value for each instance. The class variable is specified as a row in the delimited file where the data point name is "Class" and the values are listed for each data instance. The example below illustrates how such a file might appear.

          Instance1   Instance2   Instance3   Instance4
Width     5.234       3.928       4.001       -0.281
Height    9.637       8.289       10.207      7.126
HairColor grey        brown       black       brown
EyeColor  hazel       brown       green       green
...       ...         ...         ...         ...
Class     dog         dog         cat         cat
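As a generic illustration of this layout (this sketch is not part of ML-Flex itself), the following Python snippet writes and re-reads a tab-delimited file in which the header row lists instance IDs and each subsequent row holds a data point name followed by its value for each instance:

```python
import csv
import os
import tempfile

# Header row: one ID per data instance. Each later row: data point name, then values.
rows = [
    ["Instance1", "Instance2", "Instance3", "Instance4"],
    ["Width", "5.234", "3.928", "4.001", "-0.281"],
    ["Height", "9.637", "8.289", "10.207", "7.126"],
    ["HairColor", "grey", "brown", "black", "brown"],
    ["EyeColor", "hazel", "brown", "green", "green"],
    ["Class", "dog", "dog", "cat", "cat"],
]

def write_delimited(path, rows, delimiter="\t"):
    """Write rows to a delimited text file (tab by default)."""
    with open(path, "w", newline="") as f:
        csv.writer(f, delimiter=delimiter).writerows(rows)

def read_delimited(path, delimiter="\t"):
    """Return {data_point_name: {instance_id: value}}."""
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter=delimiter)
        instance_ids = next(reader)  # header row contains only the instance IDs
        return {row[0]: dict(zip(instance_ids, row[1:])) for row in reader}

path = os.path.join(tempfile.gettempdir(), "mlflex_demo.tab")
write_delimited(path, rows)
data = read_delimited(path)
print(data["Class"]["Instance3"])  # -> cat
```

Note that the "Class" row is handled like any other data point during parsing; ML-Flex identifies it by name, as described above.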

Another data processor provided with ML-Flex is mlflex.dataprocessors.ArffDataProcessor. This data processor accepts files in the standard ARFF format.

Possible values

Values should be the exact name of a Java class that inherits from mlflex.dataprocessors.AbstractDataProcessor. If the class does not provide a default constructor, constructor parameters must be specified.

Multiple Data Categories

It is possible to specify multiple data processors for a given experiment. This feature might be applicable when you have disparate categories of data (for example, clinical data and genomic data) that describe the instances. Rather than combine those categories into a single data set, they can be kept separate. ML-Flex will perform classification for each data category separately and will also apply ensemble-learning techniques to derive aggregate predictions. In an experiment file, multiple data processors should be separated by semicolons. When multiple processors have been specified, it is also possible to use mlflex.dataprocessors.AggregateDataProcessor, which will create an additional data set that combines data from each of the other processors (see example experiment file called TCGA.txt).
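For instance, a sketch of such a configuration (assuming the KEY=value syntax of the sample experiment files; see TCGA.txt for a working example) might list two category-specific processors plus the aggregate processor, separated by semicolons:

```text
DATA_PROCESSORS=mlflex.dataprocessors.tcga.TcgaClinicalDataProcessor;mlflex.dataprocessors.tcga.TcgaMrnaLevel3DataProcessor;mlflex.dataprocessors.AggregateDataProcessor
```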

Notes

Each experiment should include one data processor that produces a data point called "Class," which should contain the class values. Alternatively, if a data point with a different name contains the class values, this can be specified with the DEPENDENT_VARIABLE_NAME setting. The example experiment file called DelimitedInputFile.txt illustrates this.
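As a sketch (the file path and variable name are hypothetical, and the KEY=value syntax is assumed from the sample experiment files), an experiment whose class values are stored in a data point named "Diagnosis" rather than "Class" might specify:

```text
DATA_PROCESSORS=mlflex.dataprocessors.DelimitedDataProcessor("InputData/MyStudy.tab")
DEPENDENT_VARIABLE_NAME=Diagnosis
```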

Examples

mlflex.dataprocessors.DelimitedDataProcessor("UCI/soybean-large.all.data.tab")
mlflex.dataprocessors.ArffDataProcessor("UCI/iris.arff")
mlflex.dataprocessors.UciMachineLearningDataProcessor("InputData/UCI/iris.data", -1, 4)
mlflex.dataprocessors.tcga.TcgaClinicalDataProcessor;mlflex.dataprocessors.tcga.TcgaMrnaLevel3DataProcessor

Commonly used (optional) experiment settings

CLASSIFICATION_ALGORITHMS

Description: One or more classification algorithms can be specified for a given experiment. Such algorithms are designed to predict the class (dependent variable) for a given data instance. If multiple algorithms are specified, classification will be performed separately for each algorithm, and ensemble-learning techniques will then create aggregate predictions based on the output of the individual classifiers.
Possible values: Names specified in the Config/Classification_Algorithms.txt file.
Default: [empty]
Example(s): weka_naive_bayes
            weka_naive_bayes;weka_knn

FEATURE_SELECTION_ALGORITHMS

Description: One or more feature selection/ranking algorithms can be specified for a given experiment. The purpose of these algorithms is to identify which features/variables are most relevant. When features are selected/ranked, it is hoped that classification performance will improve. Note that for feature selection to be performed, it is also necessary to specify a value for NUM_FEATURES_OPTIONS (see below). Excluding this setting or specifying a value of "None" indicates that no feature selection/ranking should be performed.
Possible values: Names specified in the Config/Feature_Selection_Algorithms.txt file.
Default: [empty]
Example(s): weka_info_gain
            weka_info_gain;weka_relieff

NUM_FEATURES_OPTIONS

Description: It is commonly believed that not all features in a given data set are informative; thus feature selection can be performed to narrow the features to those believed to be most informative. However, one challenge with this approach is that it is not known a priori how many features should be considered "informative." ML-Flex can use an "inner" cross-validation approach to estimate the optimal number of features. This setting specifies the numbers of features that should be evaluated when estimating that optimum.
Possible values: One or more positive integers
Default: [Total number of features]
Example(s): 1;5;10;50;100

NUM_OUTER_CROSS_VALIDATION_FOLDS

Description: The number of "outer" cross-validation folds that will be used when cross-validation experiments are performed.
Possible values: 0 = leave-one-out cross validation
                 1 = no cross validation (train/test split only)
                 2 ... total number of data instances = cross validation with the specified number of folds
Default: 10
Example(s): 10

NUM_INNER_CROSS_VALIDATION_FOLDS

Description: The number of "inner" cross-validation folds that will be used when cross-validation experiments are performed. The inner folds are used to estimate the "optimal" number of features to be used for outer-fold predictions and to estimate weights for ensemble learners.
Possible values: 0 = leave-one-out cross validation
                 1 = no cross validation (train/test split only)
                 2 ... total number of data instances = cross validation with the specified number of folds
Default: 10
Example(s): 10

NUM_ITERATIONS

Description: The number of times to repeat a given experiment. Repeating the same experiment multiple times may be useful for assessing the robustness of a given result as train/test assignments are selected repeatedly. The experiment output will list the results of each iteration as well as averages across the iterations.
Possible values: Any positive integer
Default: 10
Example(s): 100
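Putting the commonly used settings together, a complete experiment file might look like the following sketch (the KEY=value syntax is assumed from the sample experiment files, and this particular combination of values is illustrative, not prescriptive):

```text
DATA_PROCESSORS=mlflex.dataprocessors.ArffDataProcessor("UCI/iris.arff")
CLASSIFICATION_ALGORITHMS=weka_naive_bayes;weka_knn
FEATURE_SELECTION_ALGORITHMS=weka_info_gain
NUM_FEATURES_OPTIONS=1;2;3;4
NUM_OUTER_CROSS_VALIDATION_FOLDS=10
NUM_INNER_CROSS_VALIDATION_FOLDS=10
NUM_ITERATIONS=10
```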

Seldom used (optional) experiment settings

META_DATA_PROCESSORS

Description: The name of one or more Java classes that inherit from mlflex.dataprocessors.AbstractMetadataProcessor. These classes process raw metadata, which can then be accessed by regular data processors. For example, The Cancer Genome Atlas data sometimes contain variables with identifiers unique to a particular location in the human genome, and a metadata processor can be used to store information that translates these identifiers to gene or chromosome names.
Possible values: The exact name of a Java class that inherits from mlflex.dataprocessors.AbstractMetadataProcessor.
Default: [empty]
Example(s): mlflex.dataprocessors.tcga.TcgaMrnaMetadataProcessor

DEPENDENT_VARIABLE_TRANSFORMER

Description: In some situations it is desirable to transform dependent-variable (class) values within an ML-Flex experiment. For example, if the class values are continuous, they must be transformed to discrete values. The Java class specified here can perform that transformation (for example, values higher than the median might be transformed to "HIGH" and the remaining values to "LOW").
Possible values: The name of a Java class that inherits from mlflex.transformation.AbstractDependentVariableTransformer.
Default: [empty]
Example(s): mlflex.transformation.MedianContinuousDependentVariableTransformer

PERMUTE_DEPENDENT_VARIABLE_VALUES

Description: Sometimes it is valuable to perform a validation experiment in which the dependent-variable values are randomly permuted. One would not expect to obtain a positive result when this has occurred. When multiple permutations have been performed, it is also possible to compare permuted results with non-permuted results. This technique may be useful for assessing the quality of the predictions.
Possible values: true
                 false
Default: false
Example(s): false

RANDOM_SEED

Description: A random seed is used within each experiment for assigning cross-validation folds. It may also be used for other purposes, depending on the configuration settings. By default, the random seed corresponds with the current iteration number (see the NUM_ITERATIONS setting). This means that each time an ML-Flex experiment is run, it should yield reproducible results; if reproducibility is not desired, 0 (zero) may be specified.
Possible values: Any positive integer or zero
Default: 1 for the first iteration, 2 for the second iteration, etc.
Example(s): 1

STACKING_CLASSIFICATION_ALGORITHM

Description: This setting supports the "Stacked Ensemble Learner," in which individual classification algorithms predict the outcome and a secondary classification algorithm forms an aggregate prediction based on the individual ones. The algorithm specified here is the secondary one.
Possible values: The name of an algorithm specified in Config/Classification_Algorithms.txt.
Default: weka_decision_tree
Example(s): weka_decision_tree

TEST_INSTANCE_IDS

Description: In cases where training/testing (not cross validation) is performed, it is possible to specify some instances to be used for training and other instances to be used for testing. This setting enables specification of testing IDs. If no instances are specified, instances will be assigned randomly. Alternatively, it is possible to specify the path to a file containing a list of instance IDs to be used (one per line).
Possible values: Any value that corresponds with an instance ID in the data set.
Default: [empty]
Example(s): ID6;ID7;ID8;ID9;ID10

TRAIN_INSTANCE_IDS

Description: In cases where training/testing (not cross validation) is performed, it is possible to specify some instances to be used for training and other instances to be used for testing. This setting enables specification of training IDs. If no instances are specified, instances will be assigned randomly. Alternatively, it is possible to specify the path to a file containing a list of instance IDs to be used (one per line).
Possible values: Any value that corresponds with an instance ID in the data set.
Default: [empty]
Example(s): ID1;ID2;ID3;ID4;ID5

INSTANCE_IDS_TO_EXCLUDE

Description: In some cases, it may be desirable to exclude some instances from an analysis (without having to filter them from the raw data). This setting enables that approach.
Possible values: Any value that corresponds with an instance ID in the data set.
Default: [empty]
Example(s): ID11;ID12;ID13;ID14;ID15

NUM_TRAINING_INSTANCES_TO_EXCLUDE_RANDOMLY

Description: In some cases it may be desirable to test how robust an experiment is to outliers. One way of doing this is to repeat an experiment multiple times, randomly excluding one or more data instances in each iteration. If performance drops off dramatically, individual observations may be highly influential. Instances are excluded from training sets only.
Possible values: Any positive integer
Default: 0
Example(s): 5
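For example, a train/test experiment with explicit instance assignments might be sketched as follows (the KEY=value syntax is assumed from the sample experiment files, and the instance IDs are illustrative):

```text
NUM_OUTER_CROSS_VALIDATION_FOLDS=1
TRAIN_INSTANCE_IDS=ID1;ID2;ID3;ID4;ID5
TEST_INSTANCE_IDS=ID6;ID7;ID8;ID9;ID10
```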


Table of Contents

Introduction to ML-Flex

Prerequisites

Configuring Algorithms

Creating an Experiment File

List of Experiment Settings

Running an Experiment

List of Command-line Arguments

Executing Experiments Across Multiple Computers

Modifying Java Source Code

Creating a New Data Processor

Third-party Machine Learning Software

Integrating with Third-party Machine Learning Software

About Ensemble Learners