Basics of machine learning

Overview

Questions
  • What is machine learning?
  • Why is it useful?
  • What are its different approaches?
Objectives
  • Provide the basics of machine learning and its variants.
  • Learn how to do classification using the training and test data.
  • Learn how to use Galaxy's machine learning tools.

Time estimation: 30 minutes

Introduction

Machine learning uses techniques from statistics, mathematics and computer science to make computer programs learn from data. It is one of the most popular fields of computer science and is applied to many kinds of data analysis, such as classification, regression, clustering, dimensionality reduction and density estimation. Real-life applications include spam filtering, medical diagnosis, autonomous driving, recommendation systems, facial recognition and stock price prediction. The following image shows the basic flow of a machine learning task: a user has data, which is given to a machine learning algorithm for analysis.

Figure 1: Flow of a machine learning task.

There are multiple ways in which machine learning can be used to perform data analysis. The choice depends on the nature of the data and the kind of analysis required. The following image shows the most popular ones. In supervised learning, the category of each data record is known beforehand, whereas in unsupervised learning the categories are not known.

Figure 2: Different types of machine learning.
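The difference between the two settings can be sketched in a few lines of Python. This is only an illustration, assuming scikit-learn is installed; the records are made-up toy values, and the choice of `LinearSVC` and `KMeans` is an assumption made for the example rather than something prescribed by this tutorial.

```python
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

# Toy records with 2 features each, forming two well-separated groups (made-up values).
records = [[1, 2], [2, 1], [1, 1], [9, 8], [8, 9], [9, 9]]

# Supervised: the category of every record is known and is given to the learner.
categories = [0, 0, 0, 1, 1, 1]
classifier = LinearSVC(random_state=0, max_iter=10000).fit(records, categories)
supervised_prediction = classifier.predict([[2, 2], [8, 8]])

# Unsupervised: no categories are given; the algorithm groups similar records itself.
clusterer = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = clusterer.fit_predict(records)

print("supervised prediction:", supervised_prediction)
print("cluster ids:", cluster_ids)
```

Note that the cluster ids are arbitrary group labels: the clustering algorithm can tell that the records fall into two groups, but not what those groups mean.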

In general, machine learning can be applied to many real-life tasks by using its variants, as depicted in the following image.

Figure 3: Real-life usage of machine learning.

The following image shows how a classification task is performed. The complete dataset is divided into a training set and a test set. A classifier learns features from the training set, which results in a trained model. How well the model has learned is then evaluated using the test set, which the classifier has not seen during training.

Figure 4: Supervised learning.
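Outside Galaxy, the split-train-evaluate flow described above might look like the following sketch. It assumes scikit-learn is installed and uses synthetic data from `make_classification` as a stand-in for a real dataset; the 25% test fraction and the fixed random seeds are arbitrary choices for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in data: 200 samples, 9 features, 2 classes.
X, y = make_classification(
    n_samples=200, n_features=9, n_informative=5, random_state=0
)

# Hold out 25% of the rows as a test set that the classifier never sees in training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Learn from the training set only; the result is the trained model.
clf = LinearSVC(random_state=0, max_iter=10000)
clf.fit(X_train, y_train)

# Evaluate on the unseen test set (fraction of correctly predicted rows).
test_accuracy = clf.score(X_test, y_test)
print(f"train rows: {len(X_train)}, test rows: {len(X_test)}")
print(f"accuracy on unseen test rows: {test_accuracy:.2f}")
```

Scoring on held-out rows rather than on the training rows gives a more honest picture of how the model behaves on data it has not seen.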

This tutorial shows how to use a machine learning module implemented as a Galaxy tool. The data used in this tutorial is available at Zenodo.

Agenda

Performing a machine learning task (classification) using a tool involves the following steps:

  1. Data upload
  2. Train a classifier
  3. Predict using a trained model
  4. See predictions

Data upload

The datasets required for this tutorial contain 9 features of breast cancer, including clump thickness, cell size, cell shape and so on (more information). In addition to these features, the training dataset contains one more column, the target. It holds a binary value (0 or 1) for each row: 0 indicates no breast cancer and 1 indicates breast cancer. The test dataset does not contain the target column.

Hands-on: Data upload

  1. Create a new history for this tutorial
  2. Import the following datasets and choose the type of data as tabular.

    https://zenodo.org/record/1401230/files/breast-w_train.tsv
    https://zenodo.org/record/1401230/files/breast-w_test.tsv
    
    • Copy the link location
    • Open the Galaxy Upload Manager (the upload icon at the top-right of the tool panel)

    • Select Paste/Fetch Data
    • Paste the link into the text field

    • Press Start

    • Close the window

    By default, Galaxy uses the URL as the name, so rename the files with a more useful name.

  3. Rename datasets to breast-w_train and breast-w_test.

    Tip: Rename a dataset

    • Click on the pencil icon for the dataset to edit its attributes
    • In the central panel, change the Name field
    • Click the Save button
  4. The datasets should look like these:

    Figure 5: Training data (breast-w_train) with targets (9 features and one target).
    Figure 6: Test data (breast-w_test) (9 features and no target).
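If you want to check the layout of the two tables outside Galaxy, a small script along these lines could do it. This is a sketch using only the Python standard library; the rows are made-up values, the feature names are hypothetical placeholders (the real column headers are not reproduced here), and only 3 of the 9 feature columns are shown to keep the example short.

```python
import csv
import io

# Made-up rows mimicking the layout described above. The feature names are
# hypothetical placeholders, and only 3 of the 9 feature columns are shown.
train_tsv = (
    "feature_1\tfeature_2\tfeature_3\ttarget\n"
    "5\t1\t1\t0\n"
    "8\t10\t10\t1\n"
    "3\t1\t2\t0\n"
)
test_tsv = (
    "feature_1\tfeature_2\tfeature_3\n"
    "4\t2\t1\n"
    "9\t7\t8\n"
)


def describe_table(tsv_text):
    """Return (number of rows, feature column names, whether a target column exists)."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    rows = list(reader)
    feature_names = [name for name in header if name != "target"]
    return len(rows), feature_names, "target" in header


train_summary = describe_table(train_tsv)
test_summary = describe_table(test_tsv)
print(train_summary)
print(test_summary)
```

For the downloaded files, the in-memory strings would be replaced by the contents of the files themselves, for example via `open("breast-w_train.tsv", newline="")`, assuming the files were saved under those names.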

Train a classifier

In this step, an SVM (support vector machine) classifier is trained using the breast-w_train dataset. The last column of this dataset assigns a category to each row. The classifier learns a mapping between the features of each row and its category. This mapping is called a trained model. It is used to predict the categories of unseen data (breast-w_test).

Hands-on: Train a classifier

Run the SVM Classifier (Support vector machine) tool with the following parameters:

  1. param-select “Select a Classification Task”: Train a model
  2. param-select “Classifier type”: Linear Support Vector Classification
  3. param-select “Select input type”: tabular data
  4. param-file “Training samples dataset”: breast-w_train
  5. param-check “Does the dataset contain header”: Yes
  6. param-select “Choose how to select data by column”: All columns but by column header name(s)
  7. param-text “Type header name(s)”: target
  8. param-file “Dataset containing class labels”: breast-w_train
  9. param-check “Does the dataset contain header”: Yes
  10. param-select “Choose how to select data by column”: Select columns by column header name(s)
  11. param-text “Select target column(s)”: target
  12. Execute the tool to train the classifier
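For readers curious about what this step corresponds to in code, the sketch below trains a linear support vector classifier on a tabular dataset with a `target` column. It assumes scikit-learn is installed and that its `LinearSVC` class is a reasonable stand-in for the "Linear Support Vector Classification" option; the exact settings used by the Galaxy tool are not known here. The rows and feature names are made up for illustration.

```python
import csv
import io

from sklearn.svm import LinearSVC

# Made-up training table: hypothetical feature names plus the "target" column.
train_tsv = (
    "feature_1\tfeature_2\tfeature_3\ttarget\n"
    "1\t1\t2\t0\n"
    "2\t1\t1\t0\n"
    "3\t2\t1\t0\n"
    "8\t9\t10\t1\n"
    "10\t8\t9\t1\n"
    "9\t10\t8\t1\n"
)

rows = list(csv.DictReader(io.StringIO(train_tsv), delimiter="\t"))

# Features: all columns except "target". Labels: the "target" column only.
feature_names = [name for name in rows[0] if name != "target"]
X_train = [[float(row[name]) for name in feature_names] for row in rows]
y_train = [int(row["target"]) for row in rows]

# Fitting the classifier produces the trained model.
model = LinearSVC(random_state=0, max_iter=10000)
model.fit(X_train, y_train)

print("classes seen in training:", model.classes_)
print("one weight per feature:", model.coef_.shape)
```

The two column selections mirror the tool form above: every column except `target` is used as a feature, and `target` alone supplies the class labels.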

Predict using a trained model

The previous step produces a model file of type zip. Rename this file to model.zip by editing the dataset's attributes. The trained model is then used to predict the category of each row in the breast-w_test dataset.

Hands-on: Predict using a trained model

Run the SVM Classifier (Support vector machine) tool with the following parameters:

  1. param-select “Select a Classification Task”: Load a model and predict
  2. param-file “Models”: model.zip
  3. param-file “Data (tabular)”: breast-w_test
  4. param-check “Does the dataset contain header”: Yes
  5. param-select “Select the type of prediction”: Predict class labels
  6. Execute the tool to predict the categories
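The idea of saving a trained model and loading it later for prediction can be sketched as follows. This example assumes scikit-learn is installed and uses Python's `pickle` module for persistence; that is an assumption made for illustration and says nothing about the internal format of Galaxy's model.zip file. All data values are made up.

```python
import pickle

from sklearn.svm import LinearSVC

# Made-up training data: 3 features per row, with known categories.
X_train = [[1, 1, 2], [2, 1, 1], [3, 2, 1], [8, 9, 10], [10, 8, 9], [9, 10, 8]]
y_train = [0, 0, 0, 1, 1, 1]
model = LinearSVC(random_state=0, max_iter=10000).fit(X_train, y_train)

# Persist the trained model, then load it back as a separate step would.
saved_model = pickle.dumps(model)
loaded_model = pickle.loads(saved_model)

# Unseen rows without a target column, as in the test dataset.
X_test = [[2, 2, 1], [9, 9, 9]]
predicted = loaded_model.predict(X_test)

# Write each row with its predicted category appended as the last column.
for features, label in zip(X_test, predicted):
    print("\t".join(str(value) for value in features + [int(label)]))
```

Only load pickled models from sources you trust, since unpickling can execute arbitrary code.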

See predictions

The last column of the predicted dataset shows the category of each row. A row either gets 0 (no breast cancer) or 1 (breast cancer) as its predicted category.

Hands-on: See the predicted column

  1. Click on the view data link of the dataset created in the previous step.
  2. The last column of the tabular data shows the predicted category (target) for each row.
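If the predicted dataset is downloaded, the last column can also be summarised with a short script. The following is a sketch using only the Python standard library; the table contents and the column names, including `predicted`, are made up, and the actual header of the Galaxy output may differ.

```python
import csv
import io
from collections import Counter

# Made-up prediction output: feature columns followed by the predicted
# category in the last column. All column names are hypothetical.
predicted_tsv = (
    "feature_1\tfeature_2\tfeature_3\tpredicted\n"
    "4\t2\t1\t0\n"
    "9\t7\t8\t1\n"
    "2\t1\t1\t0\n"
    "3\t1\t2\t0\n"
)

reader = csv.reader(io.StringIO(predicted_tsv), delimiter="\t")
header = next(reader)  # skip the header row

# The predicted category is the last value of each row.
predicted_labels = [int(row[-1]) for row in reader]
label_counts = Counter(predicted_labels)

print("rows predicted as 0 (no breast cancer):", label_counts[0])
print("rows predicted as 1 (breast cancer):", label_counts[1])
```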

Additional resources:

Read more about machine learning using scikit-learn here.

Key points

  • Machine learning algorithms learn features from data.
  • Machine learning is used for multiple tasks such as classification, regression, clustering and so on.
  • Multiple learning tasks can be performed using Galaxy's machine learning tools.
  • For the classification and regression tasks, data is divided into training and test sets.
  • Each sample/record in the training data has a category/class/label.
  • A machine learning algorithm learns features from the training data and makes predictions on the test data.

Useful literature

Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.

Congratulations on successfully completing this tutorial!