# Machine learning: classification and regression

Overview
Questions:
• What are classification and regression techniques?

• How can they be used for prediction?

• How can visualizations be used to analyze predictions?

Objectives:
• Explain the types of supervised machine learning - classification and regression.

• Learn how to make predictions using the training and test dataset.

• Visualize the predictions.

Time estimation: 1 hour
Published: Mar 7, 2019
Last modification: Mar 5, 2024
PURL: https://gxy.io/GTN:T00263
Revision: 12

Machine learning is a subset of artificial intelligence (AI) that gives machines the ability to learn from data automatically, without being explicitly programmed. It combines computer science, mathematics and statistics to create a predictive model by learning patterns in a dataset. The dataset may have an output field, which makes the learning process supervised. In supervised learning, the outputs (also called targets, classes or categories) are defined in a column of the dataset. These targets can be either integers or real (continuous) numbers. When the targets are integers, the learning task is known as classification. Each row in the dataset is a sample, and classification assigns a class label/target to each sample. The algorithm used for this learning task is called a classifier. When the targets are real numbers, the learning task is called regression, and the algorithm used for this task is called a regressor. We will go through classification first and look at regression later in this tutorial.

Question

What are features and outputs/targets in a dataset?

Features and targets of a breast cancer dataset:

Agenda

In this tutorial, we will deal with:

1. Classification
2. Regression
3. Conclusion

# Classification

A classification task assigns a category/class to each sample by learning a decision boundary in a dataset. This dataset is called a training dataset and contains the samples together with the desired class/category for each sample. The training dataset contains “features” as columns, and a mapping between these features and the target is learned from the samples. The performance of this mapping is evaluated using a test dataset, which is kept separate from the training dataset. The test dataset contains only the feature columns and not the target column. The target column is predicted using the mapping learned on the training dataset. In this tutorial, we will use a classifier to train a model on a training dataset, predict the targets for a test dataset and visualize the results using plots.

Comment

The terms ‘targets’, ‘classes’, ‘categories’ and ‘labels’ are used interchangeably in the classification part of this tutorial. They have the same meaning. For regression, we will just use ‘targets’.

In figure 2, the line is the decision boundary, which separates one class from another (for example, tumor from no tumor). The task of a classifier is to learn this boundary, which can then be used to classify an unseen/new sample. There are different ways to learn this decision boundary. If the dataset is linearly separable, linear classifiers can produce good classification results. But when the dataset is complex and requires non-linear decision boundaries, more powerful classifiers such as `support vector machine`, `tree` or `ensemble` based classifiers may prove beneficial. In the following part, we will perform classification on a breast cancer dataset using a linear classifier and then analyze the results with plots. Let’s begin by uploading the necessary datasets.

The datasets used for classification contain 9 features. Each feature holds a distinct piece of information about breast cancer, such as the thickness of clump, cell-size, cell-shape and so on. More information about the dataset can be found here - a and b. In addition to these features, the training dataset contains one more column, the `target`. It has a binary value (0 or 1) for each row: `0` indicates no breast cancer (benign) and `1` indicates breast cancer (malignant). The test dataset does not contain the `target` column, which should be predicted by a classifier. The third dataset contains all the samples from the test dataset, this time including the `target` column, which is needed to compare the real and predicted targets.
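To make the layout of such a file concrete, the following sketch separates the feature columns from the `target` column of a tab-separated table with a header. It is only an illustration: the three column names and all values are made up, and the helper `split_features_target` is a hypothetical name, not part of any Galaxy tool.

```python
import csv
import io

# Hypothetical three-row excerpt in the layout of a training file:
# feature columns followed by a `target` column (names and values are made up).
TRAIN_TSV = (
    "clump_thickness\tcell_size\tcell_shape\ttarget\n"
    "5\t1\t1\t0\n"
    "8\t7\t6\t1\n"
    "3\t1\t2\t0\n"
)


def split_features_target(tsv_text, target_column="target"):
    """Return (feature_names, X, y) from tab-separated text with a header row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Every column except the target is treated as a feature.
    feature_names = [name for name in reader.fieldnames if name != target_column]
    X, y = [], []
    for row in reader:
        X.append([float(row[name]) for name in feature_names])
        y.append(int(row[target_column]))
    return feature_names, X, y


feature_names, X, y = split_features_target(TRAIN_TSV)
print(feature_names)  # ['clump_thickness', 'cell_size', 'cell_shape']
print(X[0], y[0])     # [5.0, 1.0, 1.0] 0
```

A test file would have the same feature columns but no `target` column, so only `X` could be built from it.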

1. Create a new history for this tutorial.

To create a new history simply click the new-history icon at the top of the history panel:

2. Import the following datasets and choose the type of data as `tabular`.

```
https://zenodo.org/record/3248907/files/breast-w_targets.tsv
https://zenodo.org/record/3248907/files/breast-w_test.tsv
https://zenodo.org/record/3248907/files/breast-w_train.tsv
```
• Copy the link location
• Click Upload Data at the top of the tool panel
• Select Paste/Fetch Data
• Paste the link(s) into the text field
• Press Start
• Close the window

3. Rename datasets to `breast-w_train`, `breast-w_test` and `breast-w_targets`.

• Click on the pencil icon for the dataset to edit its attributes
• In the central panel, change the Name field
• Click the Save button

## Learn using training dataset

The training dataset is used for learning the associations between the features and the targets. The classifier learns general patterns in the dataset and saves a trained model. This model can be used for classifying a new sample. In this step, we will use `breast-w_train` as the training dataset and apply an SVM (support vector machine) classifier. It will learn features from the dataset and map them to the targets. This mapping is called a trained model. The training step produces a model file of type `zip`.

Hands-on: Train the model
1. Run the Support vector machines (SVMs) for classification tool with the following parameters to train the classifier on the training dataset:
• “Select a Classification Task”: `Train a model`
• “Classifier type”: `Linear Support Vector Classification`
• “Select input type”: `tabular data`
• “Training samples dataset”: `breast-w_train`
• “Does the dataset contain header”: `Yes`
• “Choose how to select data by column”: `All columns EXCLUDING some by column header name(s)`
• “Type header name(s)”: `target`
• “Dataset containing class labels”: `breast-w_train`
• “Does the dataset contain header”: `Yes`
• “Choose how to select data by column”: `Select columns by column header name(s)`
• “Select target column(s)”: `target`
2. Rename the generated file to `model`.

Question

What is learned by the classifier?

Two attributes, `coef_` and `intercept_`, are learned by the classifier from the training dataset. `coef_` contains a weight for each feature and `intercept_` is a constant scalar. These attributes are specific to the Linear support vector classifier; other classifiers learn different attributes. They are stored in the trained model and can be accessed by reading this file.
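As a rough illustration of how a weight vector and an intercept produce a class, the sketch below applies a linear decision rule to a sample. The weights, the intercept and the three-feature samples are made-up values chosen for the example, not values learned from `breast-w_train`, and the function names are hypothetical.

```python
# Hypothetical weights for three features and a hypothetical intercept.
# A trained model would store values learned from the training dataset instead.
coef = [0.8, 0.5, -0.2]
intercept = -4.0


def decision_function(sample):
    """Weighted sum of the features plus the intercept."""
    return sum(w * x for w, x in zip(coef, sample)) + intercept


def predict(sample):
    """Class 1 on the positive side of the boundary, class 0 otherwise."""
    return 1 if decision_function(sample) > 0 else 0


print(predict([8, 7, 6]))  # 1 (positive side of the boundary)
print(predict([2, 1, 1]))  # 0 (negative side of the boundary)
```

The decision boundary is the set of points where the weighted sum equals zero, which is a straight line for two features and a hyperplane for more.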

## Predict categories of test dataset

After the training process is complete, we can see the trained model file (`zip` file) which contains information about patterns in the form of weights. The trained model is used to predict the classes of the test (`breast-w_test`) dataset. It assigns a class (either tumor or no tumor) to each row in the `breast-w_test` dataset.

Hands-on: Predict classes using the trained model
1. Run the Support vector machines (SVMs) for classification tool with the following parameters to predict the classes of the test dataset using the trained model:
• “Select a Classification Task”: `Load a model and predict`
• “Models”: `model` file (output of the previous step)
• “Data (tabular)”: `breast-w_test` file
• “Does the dataset contain header”: `Yes`
• “Select the type of prediction”: `Predict class labels`
2. Rename the generated file to `predicted_labels`.

## Visualise the predictions

We should evaluate the quality of the predictions by comparing them against the true targets. To do this, we will use another dataset (`breast-w_targets`). It is the same as the test dataset (`breast-w_test`) but has an extra `target` column containing the true classes of the test samples. With the predicted and true classes, the learned model is evaluated to verify how correct the predictions are. To visualise these predictions, a plotting tool is used. It creates three plots: a confusion matrix; precision, recall and F1 scores; and a ROC curve with its AUC. We will mainly analyze the precision and recall plot.

Hands-on: Check and visualize the predictions
1. Run the Plot confusion matrix, precision, recall and ROC and AUC curves tool with the following parameters to visualise the predictions:
• “Select input data file”: `breast-w_targets`
• “Select predicted data file”: `predicted_labels`
• “Select trained model”: `model`

We will analyze the following plots:

Using these plots, the robustness of classification can be visualized.
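The quantities behind the confusion matrix and the precision, recall and F1 plot can be computed directly from the true and predicted labels. The sketch below shows the standard definitions on a small made-up list of labels; the function names are hypothetical and the labels are not taken from `breast-w_targets`.

```python
def confusion_counts(true_labels, predicted_labels, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = fn = tn = 0
    for true, predicted in zip(true_labels, predicted_labels):
        if predicted == positive and true == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif true == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Precision, recall and their harmonic mean (F1); 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up labels for eight samples (1 = malignant, 0 = benign).
true_labels = [1, 0, 1, 1, 0, 0, 1, 0]
predicted_labels = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(true_labels, predicted_labels)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(tp, fp, fn, tn)         # 3 1 1 3
print(precision, recall, f1)  # 0.75 0.75 0.75
```

Precision answers "how many predicted positives were correct?", while recall answers "how many true positives were found?".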

## Summary

By following these steps, from data upload to plotting, we have learned how to do classification and visualise the predictions using Galaxy’s machine learning and plotting tools. A similar analysis can be performed using a different dataset or a different classifier. This machine learning suite provides multiple classifiers, from linear to complex ones, suited to different classification tasks. For example, for binary classification, a `support vector machine` classifier may perform well. It is recommended to try out different classifiers on a dataset to find the best one.

# Regression

For classification, the targets are integers. However, when the targets in a dataset are real numbers, the machine learning task becomes regression. Each sample in the dataset has a real-valued output or target. Figure 6 shows how a (regression) curve is fitted that explains most of the data points (blue balls). Here, the curve is a straight line (red). The regression task is to learn this curve, which explains the underlying distribution of the data points. The predicted target for a new sample will lie on the curve learned by the regression task. A regressor learns the mapping between the features of a dataset row and its target value. In essence, it tries to fit a curve to the targets. This curve can be linear or non-linear. In this part of the tutorial, we will perform regression on a body density dataset.
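For the simplest case, a straight line through points with a single feature, the fit can be written in a few lines. The sketch below uses ordinary least squares, which is one common way to fit such a line and is an assumption here rather than the method behind the figure; the data points and the name `fit_line` are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y divided by the variance of x gives the slope.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data points that lie roughly on a straight line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(xs, ys)
print(round(slope, 2), round(intercept, 2))  # 1.97 0.09

# The prediction for a new sample lies on the fitted line.
new_x = 6
print(round(slope * new_x + intercept, 2))   # 11.91
```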

The dataset contains information about human body density. It includes 14 features like underwater body density, age, weight, height, neck circumference and so on. The target is the percent body fat. The aim of the task is to learn a mapping between several body features and fat content inside the human body. Using this learning, the body fat percentage can be predicted using other features. To carry out this task, we will need training and test datasets. Again, we will also prepare another test dataset with targets included to evaluate the regression performance. `body_fat_train` dataset is used as the training dataset and `body_fat_test` as the test dataset. The dataset `body_fat_test_labels` contains the true targets for the test dataset (`body_fat_test`).

1. Create a new history for this tutorial.
2. Import the following datasets and choose the type of data as `tabular`.

```
https://zenodo.org/record/3248907/files/body_fat_train.tsv
https://zenodo.org/record/3248907/files/body_fat_test_labels.tsv
https://zenodo.org/record/3248907/files/body_fat_test.tsv
```
• Copy the link location
• Click Upload Data at the top of the tool panel
• Select Paste/Fetch Data
• Paste the link(s) into the text field
• Press Start
• Close the window

3. Rename datasets to `body_fat_train`, `body_fat_test_labels` and `body_fat_test`.

• Click on the pencil icon for the dataset to edit its attributes
• In the central panel, change the Name field
• Click the Save button

## Learn from training dataset

To learn the mapping between the features and the targets, we will apply a regressor called the Gradient boosting regressor. It is an ensemble-based regressor because its prediction combines the outputs of multiple weak learners (e.g. decision trees). It learns from the training dataset (`body_fat_train`) and maps the rows to their respective targets (real numbers). This learned mapping is the trained model.
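The idea of combining weak learners can be sketched in plain Python. The code below is a deliberately simplified illustration, not the implementation used by the Galaxy tool: it assumes a single feature, squared-error loss and one-split decision trees ("stumps"), and all function names and data values are hypothetical.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split of a 1-D feature under squared error.

    Assumes xs contains at least two distinct values.
    """
    best = None
    # Candidate thresholds: every distinct value except the largest,
    # so that both sides of the split are non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosting(xs, ys, n_estimators=50, learning_rate=0.1):
    """Start from the mean target, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Each weak learner contributes a small, shrunken correction.
        predictions = [p + learning_rate * stump_predict(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, learning_rate, stumps


def boosting_predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


def mean_squared_error(true_values, predicted_values):
    return sum((t - p) ** 2
               for t, p in zip(true_values, predicted_values)) / len(true_values)


# Made-up training data with one feature.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.1, 6.8, 8.2]

model = fit_boosting(xs, ys)
baseline = [sum(ys) / len(ys)] * len(ys)
boosted = [boosting_predict(model, x) for x in xs]
print(mean_squared_error(ys, baseline))  # error of always predicting the mean
print(mean_squared_error(ys, boosted))   # smaller error after boosting
```

Each stump alone is a poor predictor, but the sum of many small corrections follows the training targets much more closely than the mean does.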

Hands-on: Train a model
1. Run the Ensemble methods for classification and regression tool with the following parameters to train the regressor:
• “Select a Classification Task”: `Train a model`
• “Select an ensemble method”: `Gradient Boosting Regressor`
• “Select input type”: `tabular data`
• “Training samples dataset”: `body_fat_train`
• “Does the dataset contain header”: `Yes`
• “Choose how to select data by column”: `All columns EXCLUDING some by column header name(s)`
• “Type header name(s)”: `target`
• “Dataset containing class labels”: `body_fat_train`
• “Does the dataset contain header”: `Yes`
• “Choose how to select data by column”: `Select columns by column header name(s)`
• “Select target column(s)”: `target`
2. Rename the generated file to `model`.

Question

What is learned by the regressor?

Unlike the Linear support vector classifier (used for classification in the first part of the tutorial), which learned only two attributes, the Gradient boosting regressor learns multiple attributes such as `feature_importances_` (a weight for each feature/column), `oob_improvement_` (which stores incremental improvements in learning), `estimators_` (the collection of weak learners) and a few more. These attributes are used to predict the target for a new sample and are stored in the trained model. They can be accessed by reading this file.

## Predict using test dataset

After learning on the training dataset, we should evaluate the performance on the test dataset to find out whether the algorithm learned general patterns from the training dataset. These patterns are used to predict the target of a new sample, and a similar accuracy is expected. As in the classification task, the trained model is applied to `body_fat_test` to predict a target value for each row. The predicted targets are then compared to the expected targets to measure the robustness of learning.

Hands-on: Predict targets using the model
1. Run the Ensemble methods for classification and regression tool with the following parameters to predict the targets of the test dataset using the trained model:
• “Select a Classification Task”: `Load a model and predict`
• “Models”: `model`
• “Data (tabular)”: `body_fat_test`
• “Does the dataset contain header”: `Yes`
• “Select the type of prediction”: `Predict class labels`
2. Rename the generated file to `predicted_data`.

## Visualise the prediction

We will evaluate the predictions by comparing them to the expected targets.

Hands-on: Check and visualize the predictions
1. Run the Plot actual vs predicted curves and residual plots tool with the following parameters to visualise the predictions:
• “Select input data file”: `body_fat_test_labels`
• “Select predicted data file”: `predicted_data`

The visualization tool creates the following plots:

1. True vs predicted targets curves:

2. Scatter plot for true vs. predicted targets:

3. Residual plot between residual (predicted - true) and predicted targets:

These plots are important to visualize the quality of regression and the true and predicted targets - how close or far they are from each other. The closer they are, the better the prediction.
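The closeness of true and predicted targets can also be summarised numerically. The sketch below computes the residuals (predicted - true, as in the residual plot), the root mean squared error and the coefficient of determination (R²) for a few made-up values; these summary metrics and the function names are additions for illustration and are not necessarily what the plotting tool reports.

```python
import math


def residuals(true_values, predicted_values):
    """Residual for each sample, defined as predicted - true."""
    return [p - t for t, p in zip(true_values, predicted_values)]


def rmse(true_values, predicted_values):
    """Root mean squared error: typical size of a residual."""
    res = residuals(true_values, predicted_values)
    return math.sqrt(sum(r ** 2 for r in res) / len(res))


def r_squared(true_values, predicted_values):
    """1.0 means perfect predictions; 0.0 means no better than the mean."""
    mean_true = sum(true_values) / len(true_values)
    ss_res = sum(r ** 2 for r in residuals(true_values, predicted_values))
    ss_tot = sum((t - mean_true) ** 2 for t in true_values)
    return 1 - ss_res / ss_tot


# Made-up true and predicted body fat percentages for four samples.
true_values = [12.0, 20.5, 31.0, 8.5]
predicted_values = [13.0, 19.5, 29.0, 9.5]

print(residuals(true_values, predicted_values))  # [1.0, -1.0, -2.0, 1.0]
print(round(rmse(true_values, predicted_values), 3))
print(round(r_squared(true_values, predicted_values), 3))
```

Residuals scattered evenly around zero, a small RMSE and an R² close to 1 all indicate that predicted and true targets are close to each other.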

## Summary

By following these steps, we learned how to perform regression and visualise the predictions using Galaxy’s machine learning and plotting tools. The features of the training dataset are mapped to the real-valued targets. This mapping is used to make predictions on an unseen (test) dataset. The quality of the predictions is visualised using a plotting tool. There are many other regression algorithms: some are simpler to use (with fewer parameters) and some are more powerful. They can be tried out on this dataset and on other datasets as well.

# Conclusion

We learned how to perform classification and regression using different datasets and machine learning tools in Galaxy. Moreover, we visualized the results using multiple plots to assess the robustness of the machine learning tasks. There are many other classifiers and regressors in the machine learning suite which can be tried out on these datasets to see how they perform. Different datasets can also be analysed using these classifiers and regressors. The classifiers and regressors have many parameters which can be altered while performing the analyses to see whether they affect the prediction accuracy. It may be beneficial to perform a hyperparameter search to tune these parameters for different datasets. Some data pre-processors can also be used to clean the datasets.