Integrating with Third-party Machine Learning Software

ML-Flex is capable of interfacing with any third-party machine-learning software that supports a command-line interface for feature selection and/or classification. Out of the box, ML-Flex supports interfaces with the Weka, Orange, C5.0, and R software. However, other third-party software can also be integrated with ML-Flex.

For demonstration purposes, suppose you had developed a script to perform classification (called "NiftyClassifier.py") in the Python programming language. Suppose you created a sub-directory within ML-Flex called "MySoftware" and placed your script in that directory. Also suppose the script accepted three arguments: 1) training data file path, 2) test data file path, and 3) output file path. Using the specified argument values, the script would parse the input data, build a classification model on the training instances, and output predictions for the test instances. If ML-Flex were configured to interface with this script (described below), ML-Flex would create input files, invoke your script, parse the output file, assess performance, etc.
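As a rough illustration of the command-line contract described above, the following Python sketch shows how such a script might read its three arguments. The function names (parse_args, main) are hypothetical, and the body of the classifier is left as a placeholder.

```python
import sys


def parse_args(argv):
    """Return (training_path, test_path, output_path) from an argv-style list."""
    if len(argv) != 4:
        raise ValueError(
            "Usage: NiftyClassifier.py <training_file> <test_file> <output_file>"
        )
    return argv[1], argv[2], argv[3]


def main(argv):
    training_path, test_path, output_path = parse_args(argv)
    # Placeholder: parse the inputs, build a model on the training instances,
    # and write predictions for the test instances to output_path.
    print("training:", training_path)
    print("test:", test_path)
    print("output:", output_path)
```

A real script would call main(sys.argv) at the bottom of the file, typically under the usual if __name__ == "__main__" guard.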

Various approaches could be used for interfacing between ML-Flex and the script:

  1. Via mlflex.learners.GenericArffLearner
  2. Via mlflex.learners.GenericDelimitedLearner
  3. Via developing a new Java class that inherits from mlflex.learners.AbstractMachineLearner

With option #1, ML-Flex would save the input files in the ARFF format, so in the above example the Python script would need to parse such files. With option #2, input files would be saved in a tab-delimited format: a header row listing the instance IDs, followed by one row for each feature/variable. The first element in each data row would be the variable name, and the last row would contain the class variable.
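For option #2, a script might read the tab-delimited layout along the following lines. This is a minimal sketch rather than a reference parser: the function name read_delimited is hypothetical, and it assumes (without confirmation from the text above) that the header row may or may not begin with a label cell, so it takes the instance IDs from the right-hand end of the header.

```python
import csv


def read_delimited(path):
    """Parse a feature-per-row, tab-delimited file into per-instance records.

    Returns (instance_ids, feature_names, features, classes), where
    features[i] holds the feature values of instance i as strings.
    """
    with open(path, newline="") as handle:
        rows = [row for row in csv.reader(handle, delimiter="\t") if row]

    header, data_rows = rows[0], rows[1:]
    width = len(data_rows[0]) - 1  # number of instances
    # Assumption: the header may carry a leading label cell, so count from the right.
    instance_ids = header[-width:]

    *feature_rows, class_row = data_rows  # the last row is the class variable
    feature_names = [row[0] for row in feature_rows]
    # Transpose so that each inner list describes one instance.
    features = [list(col) for col in zip(*(row[1:] for row in feature_rows))]
    classes = class_row[1:]
    return instance_ids, feature_names, features, classes
```

The values are kept as strings because features may be numeric or nominal; a real script would convert them as its model requires.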

Option #3 is more advanced but also supports greater flexibility. The user would need to modify the ML-Flex source code (see Modifying Java Source Code) and develop a new Java class that extended mlflex.learners.AbstractMachineLearner. This class would contain custom code to interface with the third-party software. Thus any input or output file formats could be supported, and the code could even interface with an external Web service, etc. Please see the Java source-code documentation for guidance in pursuing this option.

For options #1 and #2, the output files must be in a particular format. For classification, an output file must be tab delimited and must contain a row for each test instance. The rows must be listed in the same order that the test instances were specified in the input file. The first element in each row should contain the predicted class for a given test instance. The following elements in each row should contain the predicted probability that the given instance belongs to a particular class. The file should also contain a header row that specifies the names of the classes that correspond with each column. Below is an example:


Prediction	iris-setosa	iris-versicolor	iris-virginica
iris-setosa	0.800	0.100	0.100
iris-versicolor	0.000	1.000	0.000
iris-setosa	0.501	0.499	0.000
iris-virginica	0.100	0.300	0.600
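A script could produce a file in this layout with something like the Python sketch below. The function name write_predictions and its parameters are hypothetical; the sketch assumes the predicted class is simply the one with the highest probability, which a real classifier might decide differently.

```python
import csv


def write_predictions(path, class_names, probabilities):
    """Write one row per test instance: predicted class, then class probabilities.

    probabilities is a list, in test-instance order, of lists whose entries
    line up with class_names.
    """
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        # Header row: a label for the prediction column, then the class names.
        writer.writerow(["Prediction"] + list(class_names))
        for probs in probabilities:
            # Assumption: predict the class with the highest probability.
            best = max(range(len(class_names)), key=lambda i: probs[i])
            writer.writerow([class_names[best]] + ["%.3f" % p for p in probs])
```

Rows are written in the order the probabilities are supplied, so the caller is responsible for keeping them in the same order as the test instances in the input file.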


If feature selection is performed, the output file uses a simpler format: the name of each feature/variable should be listed on a separate row, and the order of the rows should reflect the algorithm's assessment of each feature's utility. Below is an example:


sepal-width
petal-width
petal-length
sepal-length
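A feature-selection script might write such a file as sketched below in Python. The function name write_feature_ranking and the scores dictionary are hypothetical, and the sketch assumes that the most useful feature is listed first; the text above does not state the direction of the ordering, so this should be checked against the demo scripts.

```python
def write_feature_ranking(path, scores):
    """Write feature names, one per line, ordered by descending score."""
    # Assumption: a higher score means a more useful feature, listed first.
    ranked = sorted(scores, key=scores.get, reverse=True)
    with open(path, "w") as handle:
        for name in ranked:
            handle.write(name + "\n")
    return ranked
```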


For ML-Flex to interface with such a script, new entries would need to be added to Config/Learner_Templates.txt and to Config/Classification_Algorithms.txt. Below is an example of how the entry in Config/Learner_Templates.txt might appear. The first element would be a unique key for the algorithm. The second element would be the full name of an ML-Flex class that supports interfacing with this type of learner (mlflex.learners.GenericArffLearner or mlflex.learners.GenericDelimitedLearner, depending on the desired input file type). The third element would contain a template for how the script would be invoked at the command line. This template includes a path to the Python executable as well as tokens that ML-Flex will replace with actual values. For example, ML-Flex would replace "{INPUT_TRAINING_FILE}" with the path to a training file that it created when executing.


my_classifier;mlflex.learners.GenericArffLearner;/usr/bin/python MySoftware/NiftyClassifier.py {INPUT_TRAINING_FILE} {INPUT_TEST_FILE} {OUTPUT_FILE}
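To make the token replacement concrete, the Python sketch below expands a command template into an argument list. It is only an illustration of the idea: ML-Flex performs this step internally in Java and its actual implementation may differ, and the function name expand_template and the file paths in the usage note are hypothetical.

```python
import shlex


def expand_template(template, values):
    """Replace each {TOKEN} in the template with its value and split into argv."""
    command = template
    for token, value in values.items():
        # Quote each value so that paths containing spaces stay in one argument.
        command = command.replace("{" + token + "}", shlex.quote(value))
    return shlex.split(command)
```

For instance, expanding "/usr/bin/python MySoftware/NiftyClassifier.py {INPUT_TRAINING_FILE} {INPUT_TEST_FILE} {OUTPUT_FILE}" with three file paths would yield a five-element argument list: the Python executable, the script, and the three paths in order.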


Below is an example of an entry that would need to be added to Config/Classification_Algorithms.txt. The first element is a unique key that references this algorithm. The second is the key of the entry described above in Config/Learner_Templates.txt. Creating an entry in Config/Classification_Algorithms.txt may seem redundant in this example, but the approach is especially handy for learners that allow the user to specify different combinations of parameters. In that case, Config/Learner_Templates.txt would contain the generic template for interfacing with the learner, and Config/Classification_Algorithms.txt would contain a unique entry for each desired parameter combination. (Examples are provided in Config/Classification_Algorithms.txt.)


nifty;my_classifier


Finally, this algorithm could then be specified in an experiment file:


CLASSIFICATION_ALGORITHMS=nifty


The ML-Flex distribution contains example learners that are similar to the above example. See the "Demo" scripts in the Internals/Python directory, corresponding entries in the Config files, and Experiments/DemoLearners.txt.

If any of the above steps are unclear, please email questions to the ML-Flex mailing list.

