ML-Flex's execution is based on the concept of "experiments." Each experiment consists of one or more data sets. Various types of machine-learning analyses can be performed on the data, and an experiment file specifies the settings used for those analyses.
The Experiments directory contains a wide array of example experiment files that illustrate various scenarios for machine-learning analyses. These files can be used as templates for users to create their own experiment files. Below are the basic steps for creating an experiment file from scratch:
Each line in the experiment file should contain a name and a value. Between the name and value is an equals sign. In many cases, multiple values can be specified, and these should be separated by semicolons.
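To make the name/value format concrete, the following is a minimal Java sketch of how such lines could be read. It is not ML-Flex's own parser: the class name ExperimentFileSketch and the method parse are hypothetical, and the sketch assumes that the first equals sign on a line separates the name from the value and that semicolons never appear inside a single value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of ML-Flex.
class ExperimentFileSketch {
    // Turns lines of the form NAME=value1;value2 into a map from
    // setting name to the list of values given for that setting.
    static Map<String, List<String>> parse(List<String> lines) {
        Map<String, List<String>> settings = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            // Assumption: the first '=' separates the name from the value.
            int eq = trimmed.indexOf('=');
            if (eq < 0) {
                throw new IllegalArgumentException("Missing '=' in line: " + line);
            }
            String name = trimmed.substring(0, eq).trim();
            String value = trimmed.substring(eq + 1).trim();
            // Assumption: semicolons only ever separate values.
            List<String> values = new ArrayList<>();
            for (String v : value.split(";")) {
                values.add(v.trim());
            }
            settings.put(name, values);
        }
        return settings;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "CLASSIFICATION_ALGORITHMS=weka_naive_bayes",
            "NUM_FEATURES_OPTIONS=1;2;3;4");
        System.out.println(parse(lines));
    }
}
```

In this sketch a setting with a single value is simply a one-element list, so single-valued and multi-valued settings can be handled the same way.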
Below is an example of an experiment file that performs cross validation on the classic Iris data set.
DATA_PROCESSORS=mlflex.dataprocessors.ArffDataProcessor("InputData/UCI/iris.arff")
CLASSIFICATION_ALGORITHMS=weka_naive_bayes
FEATURE_SELECTION_ALGORITHMS=weka_relieff
NUM_FEATURES_OPTIONS=1;2;3;4
NUM_OUTER_CROSS_VALIDATION_FOLDS=10
NUM_INNER_CROSS_VALIDATION_FOLDS=10
NUM_ITERATIONS=10
In the above example, seven lines are specified in the experiment file. In the first line, a data processor is specified. A data processor must correspond with the name of a Java class in ML-Flex that inherits from mlflex.dataprocessors.AbstractDataProcessor. One that comes with ML-Flex is mlflex.dataprocessors.ArffDataProcessor, which is designed to process input files in the ARFF format. Because the input file, iris.arff, is in the ARFF format, this data processor is specified, and the path to the ARFF file is passed as a parameter to ArffDataProcessor. The data file is provided in the InputData/UCI directory for demo purposes.
The second line specifies a classification algorithm that will be used for the experiment. This value must correspond with the name of a classification algorithm specified in Config/Classification_Algorithms.txt. In this case, the Weka implementation of the Naive Bayes Classifier algorithm is used. Multiple algorithms can be specified.
The third line is used to specify a feature-selection algorithm that will be applied within each training set. This value must correspond with the name of a feature-selection algorithm specified in Config/Feature_Selection_Algorithms.txt. In this case, the Weka implementation of the RELIEF-F algorithm is used. Multiple algorithms can be specified.
When feature selection is performed, it produces a list of features (variables) ranked according to their perceived relevance. Some of the features are likely irrelevant and may negatively affect classification accuracy. However, it is not known in advance how many features are "optimal" to use for classification. ML-Flex can estimate the optimal number of features via nested cross-validation. The fourth line, NUM_FEATURES_OPTIONS, allows the user to specify which number-of-features options should be evaluated. In this example, four options are specified: Iris has only four variables, so all possible options are listed. Note that the values are separated by semicolons.
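The idea of choosing among number-of-features options can be sketched in Java as follows. This is a generic illustration of selecting the option with the best mean inner-fold accuracy, not a description of ML-Flex's internal code. The names NumFeaturesSelectionSketch, InnerFoldEvaluator, and chooseNumFeatures are hypothetical, the evaluator stands in for "train on the top-ranked features and score one inner fold," and breaking ties in favor of the earlier option is an assumption.

```java
// Hypothetical sketch for illustration only; not part of ML-Flex.
class NumFeaturesSelectionSketch {
    // Stand-in for training on the top-ranked numFeatures features
    // and measuring accuracy on one inner cross-validation fold.
    interface InnerFoldEvaluator {
        double accuracy(int numFeatures, int innerFold);
    }

    // Returns the option with the highest mean accuracy across the inner folds.
    // Assumption: ties go to the option listed first.
    static int chooseNumFeatures(int[] options, int numInnerFolds, InnerFoldEvaluator evaluator) {
        if (options.length == 0 || numInnerFolds <= 0) {
            throw new IllegalArgumentException("Need at least one option and one inner fold");
        }
        int best = options[0];
        double bestMean = Double.NEGATIVE_INFINITY;
        for (int option : options) {
            double sum = 0.0;
            for (int fold = 0; fold < numInnerFolds; fold++) {
                sum += evaluator.accuracy(option, fold);
            }
            double mean = sum / numInnerFolds;
            if (mean > bestMean) {
                bestMean = mean;
                best = option;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy evaluator with made-up accuracies; not real results.
        InnerFoldEvaluator toy = (numFeatures, fold) -> numFeatures == 2 ? 0.9 : 0.8;
        System.out.println(chooseNumFeatures(new int[] {1, 2, 3, 4}, 10, toy));
    }
}
```

Because the choice is made using only the inner folds of each training set, the outer test fold plays no part in selecting the number of features.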
The fifth and sixth lines specify how many cross-validation folds should be used. In this example, 10 folds are used for both outer and inner cross validation.
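As a rough illustration of what splitting data into folds means, the Java sketch below shuffles instance indices and deals them into folds in turn. It is only a sketch under stated assumptions: the class FoldAssignmentSketch and the method assignFolds are hypothetical, the assignment is random and not stratified by class, and ML-Flex's own fold assignment may differ.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch for illustration only; not part of ML-Flex.
class FoldAssignmentSketch {
    // Shuffles instance indices 0..numInstances-1 with a fixed seed and
    // deals them into numFolds folds, one index at a time.
    static List<List<Integer>> assignFolds(int numInstances, int numFolds, long seed) {
        if (numFolds < 2 || numFolds > numInstances) {
            throw new IllegalArgumentException("numFolds must be between 2 and numInstances");
        }
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < numInstances; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed));

        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < numFolds; f++) {
            folds.add(new ArrayList<>());
        }
        for (int pos = 0; pos < indices.size(); pos++) {
            folds.get(pos % numFolds).add(indices.get(pos));
        }
        return folds;
    }

    public static void main(String[] args) {
        // The Iris data set has 150 instances, so 10 folds hold 15 instances each.
        List<List<Integer>> folds = assignFolds(150, 10, 1L);
        for (int f = 0; f < folds.size(); f++) {
            System.out.println("Fold " + f + ": " + folds.get(f).size() + " instances");
        }
    }
}
```

Each fold serves once as the test set while the remaining folds form the training set. With nested cross validation, each of those training sets would be split again in the same manner for the inner folds.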
The final line indicates that cross validation will be repeated 10 times. The results will list performance metrics for each iteration as well as a summary that averages across all iterations.
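A summary that averages across iterations can be pictured with the short Java sketch below. The class IterationSummarySketch and its mean method are hypothetical, the accuracy values are placeholders rather than real results, and the sketch assumes the summary is a plain arithmetic mean of one metric.

```java
import java.util.Arrays;

// Hypothetical sketch for illustration only; not part of ML-Flex.
class IterationSummarySketch {
    // Arithmetic mean of one performance metric across iterations.
    static double mean(double[] perIterationValues) {
        if (perIterationValues.length == 0) {
            throw new IllegalArgumentException("At least one iteration is required");
        }
        return Arrays.stream(perIterationValues).average().getAsDouble();
    }

    public static void main(String[] args) {
        // Placeholder accuracies for three iterations; not real results.
        double[] accuracies = {0.90, 0.95, 1.00};
        System.out.println("Mean accuracy: " + mean(accuracies));
    }
}
```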
It is possible to execute an experiment that is even simpler than the one described above. For example, a cross-validation experiment that uses the Naive Bayes Classifier algorithm to classify the Iris data could be executed using only the first two lines.