Python - Warm-up for statistics and machine learning
Under Development!
This tutorial is not in its final state. The content may change a lot in the next months. Because of this status, it is also not listed in the topic pages.
Author(s) |
|
Editor(s) |
![]() |
OverviewQuestions:
Objectives:
to do
Requirements:
to do
Time estimation: 1 hourLevel: Introductory IntroductorySupporting Materials:Published: Mar 11, 2025Last modification: Mar 11, 2025License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MITversion Revision: 1
Best viewed in a Jupyter NotebookThis tutorial is best viewed in a Jupyter notebook! You can load this notebook one of the following ways
Run on the GTN with JupyterLite (in-browser computations)
Launching the notebook in Jupyter in Galaxy
- Instructions to Launch JupyterLab
- Open a Terminal in JupyterLab with File -> New -> Terminal
- Run
wget https://training.galaxyproject.org/training-material/topics/data-science/tutorials/python-warmup-stat-ml/data-science-python-warmup-stat-ml.ipynb
- Select the notebook that appears in the list of files on the left.
Downloading the notebook
- Right click one of these links: Jupyter Notebook (With Solutions), Jupyter Notebook (Without Solutions)
- Save Link As..
AgendaIn this tutorial, we will cover:
Basic python
X = []
for i in range(10):
X.append( i**2 )
print(X)
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
for x in X:
print(x)
0
1
4
9
16
25
36
49
64
81
for x in X:
if x%2 == 1:
print(x,'is odd')
else:
print(x,'is even')
0 is even
1 is odd
4 is even
9 is odd
16 is even
25 is odd
36 is even
49 is odd
64 is even
81 is odd
# list comprehension is a very fine way of compressing all this
X = [ i**2 for i in range(10) ]
Xeven = [ x for x in X if x%2 == 0 ]
Xodd = [ x for x in X if x%2 == 1 ]
print( 'X ', X )
print( 'Xeven', Xeven )
print( 'Xodd ', Xodd )
X [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
Xeven [0, 4, 16, 36, 64]
Xodd [1, 9, 25, 49, 81]
numpy and vectorized operations
import numpy as np
X_array = np.array(X)
print(X_array)
[ 0 1 4 9 16 25 36 49 64 81]
print(X_array / 2 )
[ 0. 0.5 2. 4.5 8. 12.5 18. 24.5 32. 40.5]
print( np.exp(X_array ) )
print( np.log(X_array ) )
[1.00000000e+00 2.71828183e+00 5.45981500e+01 8.10308393e+03
8.88611052e+06 7.20048993e+10 4.31123155e+15 1.90734657e+21
6.23514908e+27 1.50609731e+35]
[ -inf 0. 1.38629436 2.19722458 2.77258872 3.21887582
3.58351894 3.8918203 4.15888308 4.39444915]
/tmp/ipykernel_490123/2855859755.py:2: RuntimeWarning: divide by zero encountered in log
print( np.log(X_array ) )
print( 'shape' , X_array.shape )
print( 'mean ' , np.mean(X_array) )
print( 'standard deviation' , np.std(X_array) )
shape (10,)
mean 28.5
standard deviation 26.852374196707448
linspace and arange
print( 'linspace 0,2,9 :' , np.linspace(0,2,9) , sep='\t' )
print( 'linspace -0.5,0.5,11 :' , np.linspace(-0.5,0.5,11) , sep='\t' )
print( 'linspace 10,0,11 :' , np.linspace(10,0,11) , sep='\t' )
linspace 0,2,9 : [0. 0.25 0.5 0.75 1. 1.25 1.5 1.75 2. ]
linspace -0.5,0.5,11 : [-0.5 -0.4 -0.3 -0.2 -0.1 0. 0.1 0.2 0.3 0.4 0.5]
linspace 10,0,11 : [10. 9. 8. 7. 6. 5. 4. 3. 2. 1. 0.]
print( "arange 0,2,0.1 :", np.arange(1.5,2,0.1) , sep='\t' )
print( "arange -1,1,0.125 :", np.arange(-1,1,0.125) , sep='\t' )
print( "arange 10,2 :", np.arange(10,2,1) , sep='\t' ) # reverse does not work!
arange 0,2,0.1 : [1.5 1.6 1.7 1.8 1.9]
arange -1,1,0.125 : [-1. -0.875 -0.75 -0.625 -0.5 -0.375 -0.25 -0.125 0. 0.125
0.25 0.375 0.5 0.625 0.75 0.875]
arange 10,2 : []
Basic plotting
import matplotlib.pyplot as plt
plt.plot( [0,1,2,3] , [10,5,7,0.2] )
plt.show()
Adding color, symbols, …
matplotlib
offers many options to customize the appearance of your plot.
Here are the (some) common arguments to plot()
(which can also be applied to many other graphical representations):
color
: could be given as a (red,green,blue) tuple, a name, a hex code, … (see Something better here for all the options)marker
: symbols for the data point.'.'
is a point,'v'
a down triangle, … see Something better here for the list of possibilities.linestyle
: style of the line.'-'
is solid,'--'
is dashed,''
for no line. See Something better here for more optionslinewidth
: width of the linesmarkersize
: size of the markers
You are invited to experiment and explore these options. Here are a few examples:
y1 = [1,2,3,10,5]
y2 = [10,9,7,5.5,6]
y3 = [4,3,1.5,1]
# green, dashed line, with circle markers
plt.plot( y1, color = 'green', marker = 'o', linestyle = '--', linewidth = 2, markersize = 8 )
# blue triangle with no line
plt.plot( y2, color = 'blue', marker = 'v', linestyle = '' , markersize = 16 )
# solid orange line
plt.plot(y3, color = 'orange', marker = '', linestyle = '-', linewidth = 4 )
plt.show()
Note that:
- you can call plot several time in a row to make several lines appear (only
plt.show()
causes the figure to appear) - the frame of the picture automatically adjust to what it needs to show
multiple subplots
Now would normally be when we show you how to add labels, titles and legends to figures.
However, the way matplotlib
is built, it is actually a bit more efficient to first learn how to create multiple subplots.
Creating multiple plots is possible with the function plt.subplots()
.
Amon its many arguments, it takes:
nrows
: number of subplot rowsncols
: number of subplot columnsfigsize
: tuple (width,height) of the figure
This function creates a Figure and an Axes object. The Axes object can be either :
- a simple Axe is there is 1 row and 1 columns
- a list of Axe objects if there is 1 row and multiple columns, or 1 column and multiple rows
- a list of lists of Axes objects if there is multiple rows and multiple columns
y1 = [1,2,3,10,5]
y2 = [10,9,7,5.5,6]
y3 = [4,3,1.5,1]
# subplots returns a Figure and an Axes object
fig, ax = plt.subplots(nrows=1, ncols=2) # 2 columns and 1 row
# ax is a list with two objects. Each object correspond to 1 subplot
# accessing to the first column ax[0]
ax[0].plot( y1, color = 'green', marker = 'o', linestyle = '--', linewidth = 2, markersize = 8 )
# accessing to the second column ax[1]
ax[1].plot( y2, color = 'blue', marker = 'v', linestyle = '' , markersize = 16 )
ax[1].plot( y3, color = 'orange', marker = '', linestyle = '-' )
plt.show()
Notice how we call ax[0].plot(...)
instead of plt.plot(...)
to specify in which subplots we want to plot.
multiple subplots - continued
Let’s see the same thing with several lines and several columns
y1 = [1,2,3,10,5]
y2 = [10,9,7,5.5,6]
y3 = [4,3,1.5,1]
y4 = [1,2,3,7,5]
# 2 columns and 2 rows, and we also set the figure size
fig, ax = plt.subplots(nrows=2, ncols=2 , figsize = (12,12))
# ax is a list of two lists with two objects each.
# accessing to the first row, first column : ax[0][0]
ax[0][0].plot( y1, color = 'green', marker = 'o', linestyle = '--', linewidth = 2, markersize = 8 )
# accessing to the first row, second column : ax[0][1]
ax[0][1].plot( y2, color = 'blue', marker = 'v', linestyle = '' , markersize = 16 )
# accessing to the second row, first column : ax[1][0]
ax[1][0].plot( y3, color = 'orange', marker = 'x', linestyle = '-' )
# accessing to the first row, second column : ax[1][1]
ax[1][1].plot( y4, color = 'teal', linestyle = '-.' , linewidth=5 )
plt.show()
setting up labels
To set the labels at the x-axis, y-axis and title, we use the method of the Axe object:
.set_xlabel(...)
.set_ylabel(...)
.set_title(...)
y1 = [1,2,3,10,5]
y2 = [10,9,7,5.5,6]
y3 = [4,3,1.5,1]
# subplots returns a Figure and an Axes object
fig, ax = plt.subplots(nrows=1, ncols=2 , figsize=(10,5)) # 2 columns and 1 row
# accessing to the first column ax[0]
ax[0].plot( y1, color = 'green', marker = 'o', linestyle = '--', linewidth = 2, markersize = 8 )
ax[0].set_xlabel('x-axis label')
ax[0].set_ylabel('y-axis label')
ax[0].set_title('plot 1')
# accessing to the second column ax[1]
ax[1].plot( y2, color = 'blue', marker = 'v', linestyle = '' , markersize = 16 )
ax[1].plot( y3, color = 'orange', marker = '', linestyle = '-' )
ax[1].set_xlabel('x-axis label')
ax[1].set_ylabel('y-axis label')
ax[1].set_title('plot 2')
plt.show()
setting up a legend
Each element we add to the figure using plot()
can be given a label using the label
argument.
Then, a legend may be added to the figure using the legend()
method.
This legend()
method can take a loc
argument that specifies where it should be plotted.
Possible values for this argument are: 'best' , 'upper right' , 'upper left' , 'lower left' , 'lower right' , 'right' , 'center left' , 'center right' , 'lower center' , 'upper center' , 'center'
(the default is best
).
fig, ax = plt.subplots(nrows=1, ncols=1 , figsize=(10,5)) # 2 columns and 1 row
# NB : with 1 col and 1 row, ax is directly the sole subplot we have
# so to call it we just use ax.plot , ax.set_xlabel , ...
ax.plot( y1, color = 'green', marker = 'o', linestyle = '--', linewidth = 2 , label = 'line A' )
ax.plot( y2, color = 'blue', marker = 'v', linestyle = '' , markersize = 8 , label = 'line B' )
ax.plot( y3, color = 'orange', marker = '', linestyle = '-' , linewidth = 2 , label = 'line C' )
ax.set_xlabel('x-axis label')
ax.set_ylabel('y-axis label')
ax.set_title('plot with a legend')
#adding a legend in the upper right
ax.legend( loc='upper right')
plt.show()
additional : writing a figure to a file
Writing a matplotlib figure to a file can be achieved simply by replacing the call to plt.show()
to plt.savefig(...)
.
plt.savefig
takes a number of argument, the most commons are :
fname
: name of the file to write the figure. The extension is used to determine the output format (.pdf,.png, .jpg , .svg , …). Many formats are supported, you can get a list with this command :plt.gcf().canvas.get_supported_filetypes()
dpi
: dots per inches , useful to set-up when saving to raster formats (ie., pixel-based such as png or jpeg). The actual size of the image is set using the argumentfigsize
ofplt.subplots()
Commentin a jupyter notebook the figure will still be shown, whereas in a standard .py script it will not appear on screen.
Here is a demonstration. Apply in on your side and verify that the file testPlot.png
was created:
import matplotlib.pyplot as plt
y1 = [1,2,3,10,5]
y2 = [10,9,7,5.5,6]
y3 = [4,3,1.5,1]
# subplots returns a Figure and an Axes object
fig, ax = plt.subplots(nrows=1, ncols=2 , figsize = (10,6) ) # 2 columns and 1 row
# ax is a list with two objects. Each object correspond to 1 subplot
# accessing to the first column ax[0]
ax[0].plot( y1, color = 'green', marker = 'o', linestyle = '--', linewidth = 2, markersize = 8 )
# accessing to the second column ax[1]
ax[1].plot( y2, color = 'blue', marker = 'v', linestyle = '' , markersize = 16 )
ax[1].plot( y3, color = 'orange', marker = '', linestyle = '-' )
plt.savefig( 'testPlot.png' , dpi = 90 )
Exercise 00.01 : bringing together numpy and matplotlib
Numpy arrays can be plotted as if they were lists.
- plot x and y, where:
- y = 1/(1+exp(-x))
- x varies between -5 and 5 (plotting around a 100 points should suffice).
- Bonus : plot multiples lines : y = 1/(1+exp(-x*b)) , for the following values of b: 0.5 , 1 , 2 , 4.
- x still varies between -5 and 5 (plotting around a 100 points should suffice).
- put a legend in your plot
You can load the solution directly in this notebook by uncommenting and running the following line:
# %load -r -8 solutions/solution_00_01.py
bonus question solution:
# %load -r 9- solutions/solution_00_01.py
Generating random numbers
the basics
import numpy.random as rd
# random floats between 0 and 1
for i in range(4):
print( rd.random() )
0.6696103730869407
0.7426639266737763
0.6767219223242785
0.8602105555191791
print( rd.random(size=10) ) # draw directly 10 numbers
[0.37971723 0.80354745 0.4168427 0.70867247 0.17547126 0.43760884
0.75933345 0.06571168 0.45772397 0.67191214]
setting the seed: pseudorandomness and reproducibility
rd.seed(42) # setting the seed to 42
print( '1st draw' , rd.random(size=5) )
print( '2nd draw' , rd.random(size=5) )
rd.seed(42)
print( 'after resetting seed' , rd.random(size=5) )
1st draw [0.37454012 0.95071431 0.73199394 0.59865848 0.15601864]
2nd draw [0.15599452 0.05808361 0.86617615 0.60111501 0.70807258]
after resetting seed [0.37454012 0.95071431 0.73199394 0.59865848 0.15601864]
beyond the uniform distribution
numpy offers you quite a large set of distributions you can draw from.
Let’s look at the normal distribution:
normalDraw = rd.normal(size = 1000 )
print( 'mean ' , np.mean( normalDraw ) )
print( 'stdev' , np.std( normalDraw ) )
mean 0.025354699638558926
stdev 1.0003731428167348
normalDraw2 = rd.normal( loc = -2 , scale = 3 , size = 300 ) # loc chnages the location (mean), and scale changes the standard deviation
print( 'mean ' , np.mean( normalDraw2 ) )
print( 'stdev' , np.std( normalDraw2 ) )
mean -1.9773491637651965
stdev 2.964622032924749
of course, we could want to plot these drawn numbers:
plt.hist( normalDraw , alpha = 0.5 , label='loc=0 , scale=1')
plt.hist( normalDraw2 , alpha = 0.5 , label='loc=-2 , scale=3')
plt.legend()
plt.show()
Statistical testing
numpy.random
let’s you draw random numbers ;
scipy.stats
implements the probability density functions, and Percent point function, as well as the most statistical tests.
import scipy.stats as stats
# plotting the probability density function for 1 of the random draw we just made:
x = np.linspace(-10,10,1001)
normPDF = stats.norm.pdf( x , loc = -2 , scale = 3 )
plt.hist( normalDraw2 , alpha = 0.5 , label='random draw' , density = True) # don't forget density=True
plt.plot(x,normPDF , label='PDF' )
plt.legend()
plt.show()
We can also get the expected quantiles of a distribution:
print( '95% quantile of a Chi-square distribution with 3 degrees of freedom:', stats.chi2.ppf(0.95 , df=3))
print( 'fraction of a Chi-square distribution with 3 degrees of freedom above of equal to 5' ,
1 - stats.chi2.cdf( 5 , df=3 ) )
95% quantile of a Chi-square distribution with 3 degrees of freedom: 7.814727903251179
fraction of a Chi-square distribution with 3 degrees of freedom above of equal to 5 0.17179714429673354
And you can apply some classical statistical tests:
# t-test of independance between two random samples:
rd.seed(73)
s1 = rd.normal(size=67)
s2 = rd.normal(size=54 , loc = 0.2)
testStat , pval = stats.ttest_ind(s1,s2 , equal_var=True) # equal variance : Student's t-test ; unequal : Welch's
#almost all of these stat functions return the same test-statistic , pvalue tuple
print('result of the t-test')
print('\tt:',testStat)
print('\tp-value:',pval)
result of the t-test
t: 0.26673986193074073
p-value: 0.7901311339594405
What is our conclusion for these tests results? What do you think about this?
# Kolmogorov-smirnov test for a chi-square distribution
sample = rd.chisquare(df=13 , size = 43)
# kstest expect as second argument the cdf function of the reference distribution
# this is how to handle the fact that me must set an argument (degree of freedom)
refDistribution = stats.chi2(df=13).cdf
testStat , pval = stats.kstest( sample , refDistribution )
# alternative :
# testStat , pval = stats.kstest( sample , lambda x : stats.chi2.cdf(x , df=13 ) )
print('result of the Kolmogorov-Smirnov test comparing our sample to a Chi-square distribution with 13 degrees of freedom')
print('\tK:',testStat)
print('\tp-value:',pval)
result of the Kolmogorov-Smirnov test comparing our sample to a Chi-square distribution with 13 degrees of freedom
K: 0.12249766392962913
p-value: 0.5003109000967569
If you are interested, this webpage references all implemented tests, with examples.
Bringing together numpy, numpy.random, and matplotlib
The random generation function return a numpy array, meaning it is fairly trivial to combine it with other arrays:
# combining
x = np.sort( rd.normal(loc=170 , scale = 23 , size = 100) )
y_theoretical = 0.75 * x + 100 # simple linear relationship : y = a * x + b
measurement_noise = rd.normal(scale = 10 , size = 100) # some noise associated to the measure
y_observed = y_theoretical + measurement_noise # observed = expected + noise
fig,ax = plt.subplots(figsize=(8,8))
plt.plot( x , y_theoretical , label = 'expected' )
plt.plot( x , y_observed , marker = '.' , linestyle='' , alpha = 0.7 , label = 'observed')
plt.legend()
plt.show()
The briefest intro to pandas
pandas
is a powerful library when doing data analysis, especially in the forms of table.
Basically, it reimplements R data.frame as a DataFrame object and ties together neatly with the libraries we’ve just seen.
import pandas as pd
df = pd.read_table( 'data/beetle.csv' , sep=',' , index_col=0 ) # pandas automatically detects header.
df.head()
dose | nexp | ndied | prop | nalive | |
---|---|---|---|---|---|
1 | 49.1 | 59 | 6 | 0.102 | 53 |
2 | 53.0 | 60 | 13 | 0.217 | 47 |
3 | 56.9 | 62 | 18 | 0.290 | 44 |
4 | 60.8 | 56 | 28 | 0.500 | 28 |
5 | 64.8 | 63 | 52 | 0.825 | 11 |
Nrows, Ncols = df.shape
print( 'number of rows:',Nrows, 'number of columns:', Ncols )
print( 'column names' , df.columns )
number of rows: 8 number of columns: 5
column names Index(['dose', 'nexp', 'ndied', 'prop', 'nalive'], dtype='object')
df.describe()
dose | nexp | ndied | prop | nalive | |
---|---|---|---|---|---|
count | 8.000000 | 8.000000 | 8.000000 | 8.000000 | 8.000000 |
mean | 62.800000 | 60.125000 | 36.375000 | 0.602000 | 23.750000 |
std | 9.599702 | 2.232071 | 22.557466 | 0.367937 | 21.985385 |
min | 49.100000 | 56.000000 | 6.000000 | 0.102000 | 0.000000 |
25% | 55.925000 | 59.000000 | 16.750000 | 0.271750 | 4.750000 |
50% | 62.800000 | 60.000000 | 40.000000 | 0.662500 | 19.500000 |
75% | 69.675000 | 62.000000 | 54.750000 | 0.919500 | 44.750000 |
max | 76.500000 | 63.000000 | 61.000000 | 1.000000 | 53.000000 |
# select a single column:
df['dose']
1 49.1
2 53.0
3 56.9
4 60.8
5 64.8
6 68.7
7 72.6
8 76.5
Name: dose, dtype: float64
df[ ['ndied','nalive'] ] # select several columns
ndied | nalive | |
---|---|---|
1 | 6 | 53 |
2 | 13 | 47 |
3 | 18 | 44 |
4 | 28 | 28 |
5 | 52 | 11 |
6 | 53 | 6 |
7 | 61 | 1 |
8 | 60 | 0 |
Plotting DataFrame Columns
Because DataFrame
columns are iterable, they can seamlessly be given as argument to plot()
.
# plotting the column dose along the x-axis and prop along the y-axis
# I use the + marker, with a teal color.
plt.plot(df['dose'] , df['prop'] , color = 'teal' , linestyle='' , marker = '+' , markersize=10 )
plt.xlabel( 'dose' )
plt.ylabel( 'proportion of dead' )
plt.show()
DataFrame column can be manipulated like numpy array:
## we can combine columns using normal operators
Odds = df['nalive'] /df['ndied'] # the odds of being alive is nalive / ndead
## adding a new column to the DataFrame is trivial:
df['Odds'] = Odds
## we can also apply numpy function to them
df['logOdds'] = np.log( df['Odds'] )
plt.plot(df['dose'] , df['logOdds'] , color = 'teal' , linestyle='' , marker = '+' , markersize=10 )
plt.xlabel( 'dose' )
plt.ylabel( 'log Odds' )
plt.show()
Exercise 00.02 : tying everything together
- Read the file
'data/kyphosis.csv'
. - how many columns are there ?
- What is the maximum Age ?
- create a new column
Stop
, corresponding to the addition of columns'Start'
and'Number'
- plot the relationship between
'Age'
and'Number'
(bonus point : use colors to indicate the presence or absence of kyphosis ).
Solutions:
# %load -r -7 solutions/solution_00_02.py
# %load -r 8-9 solutions/solution_00_02.py
# %load -r 11-12 solutions/solution_00_02.py
# %load -r 14-15 solutions/solution_00_02.py
# %load -r 17-22 solutions/solution_00_02.py
# %load -r 24- solutions/solution_00_02.py