Machine learning uses techniques from statistics, mathematics and computer science to make computer programs learn from data. It is one of the most popular fields of computer science and finds applications in multiple streams of data analysis such as classification, regression, clustering, dimensionality reduction, density estimation and many more. Some real-life applications are spam filtering, medical diagnosis, autonomous driving, recommendation systems, facial recognition, stock prices prediction and many more. The following image shows a basic flow of any machine learning task. Data is provided by a user to a machine learning algorithm for analysis.
There are multiple ways in which machine learning can be used to perform data analysis. They depend on the nature of data and the kind of data analysis. The following image shows the most popular ones. In supervised learning techniques, the categories of data records are known beforehand. But in unsupervised learning, the categories of data records are not known.
The following image shows how a classification task is performed. The complete data is divided into training and test sets. The training set is used by a classifier to learn features. It results in a trained model and its robustness (of learning) is evaluated using the test set (unseen by the classifier during the training).
The datasets required for this tutorial contain 9 features of breast cells which include the thickness of clump, cell-size, cell-shape and so on (more information). In addition to these features, the training dataset contains one more column as target. It has a binary value (0 or 1) for each row. 0 indicates no breast cancer and 1 indicates breast cancer. The test dataset does not contain the target column.
Hands-on: Data upload
Create a new history for this tutorial.
Click the new-history icon at the top of the history panel.
If the new-history is missing:
Click on the galaxy-gear icon (History options) on the top of the history panel
Select the option Create New from the menu
Import the following datasets and choose the type of data as tabular.
In this step, we will use the SVM (support vector machine) classifier for training on the breast-w_train dataset. The classifier learns a mapping between each row and its category. SVM is a memory efficient classifier which needs only those data points which lie on the decision boundaries among different classes to predict a class for a new sample. The rest of the data points can be thrown away. We will use the LinearSVC variant of SVM which is faster. Other variants SVC and NuSVC have high running time for large datasets. The last column of the training dataset contains a category/class for each row. The classifier learns a mapping between data row and its category which is called a trained model. The trained model is used to predict the categories of the unseen data.
Hands-on: Train a classifier
Support vector machines (SVMs) for classificationTool: toolshed.g2.bx.psu.edu/repos/bgruening/sklearn_svm_classifier/sklearn_svm_classifier/1.0.8.1 with the following parameters to train:
“Select a Classification Task”: Train a model
“Classifier type”: Linear Support Vector Classification
“Choose how to select data by column”: All columns EXCLUDING some by column header name(s)
“Type header name(s)”: target
param-file“Dataset containing class labels or target values”: breast-w_train tabular file
“Does the dataset contain header”: Yes
“Choose how to select data by column”: Select columns by column header name(s)
“Type header name(s):”: target
Predict using a trained model
The previous step produced a trained model (a zip archive) which we will now use to predict classes for the test data (breast-w_test).
Hands-on: Predict using a trained model
Support vector machines (SVMs) for classificationTool: toolshed.g2.bx.psu.edu/repos/bgruening/sklearn_svm_classifier/sklearn_svm_classifier/1.0.8.1 with the following parameters:
“Select a Classification Task”: Load a model and predict
param-file“Models”: the zip dataset produced by Support vector machines (SVMs) for classificationtool
param-file“Data (tabular)”: breast-w_test file
“Does the dataset contain header”: Yes
“Select the type of prediction”: Predict class labels
See predictions
The last column of the predicted dataset shows the category of each row. A row either got 0 (no breast cancer) or 1 (breast cancer) as its predicted category.
Hands-on: See the predicted column
Click on the galaxy-eye (eye) icon of the dataset created by the previous step.
The last column of the table shows the predicted category (target) for each row.
Did you use this material as an instructor? Feel free to give us feedback on how it went.
Did you use this material as a learner or student? Click the form below to leave feedback.
Hiltemann, Saskia, Rasche, Helena et al., 2023 Galaxy Training: A Powerful Framework for Teaching! PLOS Computational Biology 10.1371/journal.pcbi.1010752
Batut et al., 2018 Community-Driven Data Analysis Training for Biology Cell Systems 10.1016/j.cels.2018.05.012
@misc{statistics-machinelearning,
author = "Anup Kumar",
title = "Basics of machine learning (Galaxy Training Materials)",
year = "",
month = "",
day = ""
url = "\url{https://training.galaxyproject.org/training-material/topics/statistics/tutorials/machinelearning/tutorial.html}",
note = "[Online; accessed TODAY]"
}
@article{Hiltemann_2023,
doi = {10.1371/journal.pcbi.1010752},
url = {https://doi.org/10.1371%2Fjournal.pcbi.1010752},
year = 2023,
month = {jan},
publisher = {Public Library of Science ({PLoS})},
volume = {19},
number = {1},
pages = {e1010752},
author = {Saskia Hiltemann and Helena Rasche and Simon Gladman and Hans-Rudolf Hotz and Delphine Larivi{\`{e}}re and Daniel Blankenberg and Pratik D. Jagtap and Thomas Wollmann and Anthony Bretaudeau and Nadia Gou{\'{e}} and Timothy J. Griffin and Coline Royaux and Yvan Le Bras and Subina Mehta and Anna Syme and Frederik Coppens and Bert Droesbeke and Nicola Soranzo and Wendi Bacon and Fotis Psomopoulos and Crist{\'{o}}bal Gallardo-Alba and John Davis and Melanie Christine Föll and Matthias Fahrner and Maria A. Doyle and Beatriz Serrano-Solano and Anne Claire Fouilloux and Peter van Heusden and Wolfgang Maier and Dave Clements and Florian Heyl and Björn Grüning and B{\'{e}}r{\'{e}}nice Batut and},
editor = {Francis Ouellette},
title = {Galaxy Training: A powerful framework for teaching!},
journal = {PLoS Comput Biol} Computational Biology}
}
Congratulations on successfully completing this tutorial!
Galaxy Administrators: Install the missing tools
You can use Ephemeris's shed-tools install command to install the tools used in this tutorial.