Python - Warm-up for statistics and machine learning

Overview
Creative Commons License: CC-BY Questions:
  • to do

Objectives:
  • to do

Requirements:
Time estimation: 1 hour
Level: Introductory Introductory
Supporting Materials:
Published: Mar 11, 2025
Last modification: Mar 11, 2025
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
version Revision: 1
Best viewed in a Jupyter Notebook

This tutorial is best viewed in a Jupyter notebook! You can load this notebook one of the following ways

Run on the GTN with JupyterLite (in-browser computations)

  1. Click to Launch JupyterLite

Launching the notebook in Jupyter in Galaxy

  1. Instructions to Launch JupyterLab
  2. Open a Terminal in JupyterLab with File -> New -> Terminal
  3. Run wget https://training.galaxyproject.org/training-material/topics/data-science/tutorials/python-warmup-stat-ml/data-science-python-warmup-stat-ml.ipynb
  4. Select the notebook that appears in the list of files on the left.

Downloading the notebook

  1. Right click one of these links: Jupyter Notebook (With Solutions), Jupyter Notebook (Without Solutions)
  2. Save Link As..

Agenda

In this tutorial, we will cover:

  1. Basic python
  2. numpy and vectorized operations
  3. Basic plotting
  4. Exercise 00.01 : bringing together numpy and matplotlib
  5. Generating random numbers
  6. Statistical testing
  7. Bringing together numpy, numpy.random, and matplotlib
  8. The briefest intro to pandas
  9. Exercise 00.02 : tying everything together

Basic python


X = []

for i in range(10):
    X.append( i**2 )

print(X)
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

for x in X:
    print(x)

0
1
4
9
16
25
36
49
64
81
for x in X:
    if x%2 == 1:
        print(x,'is odd')
    else:
        print(x,'is even')
0 is even
1 is odd
4 is even
9 is odd
16 is even
25 is odd
36 is even
49 is odd
64 is even
81 is odd
# list comprehension is a very fine way of compressing all this

X = [ i**2 for i in range(10) ]

Xeven = [ x for x in X if x%2 == 0 ]
Xodd = [ x for x in X if x%2 == 1 ]


print( 'X    ', X )
print( 'Xeven', Xeven )
print( 'Xodd ', Xodd )
X     [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
Xeven [0, 4, 16, 36, 64]
Xodd  [1, 9, 25, 49, 81]

back to the top

numpy and vectorized operations

import numpy as np

X_array = np.array(X)

print(X_array)
[ 0  1  4  9 16 25 36 49 64 81]
print(X_array / 2 )
[ 0.   0.5  2.   4.5  8.  12.5 18.  24.5 32.  40.5]
print( np.exp(X_array ) )
print( np.log(X_array ) )
[1.00000000e+00 2.71828183e+00 5.45981500e+01 8.10308393e+03
 8.88611052e+06 7.20048993e+10 4.31123155e+15 1.90734657e+21
 6.23514908e+27 1.50609731e+35]
[      -inf 0.         1.38629436 2.19722458 2.77258872 3.21887582
 3.58351894 3.8918203  4.15888308 4.39444915]


/tmp/ipykernel_490123/2855859755.py:2: RuntimeWarning: divide by zero encountered in log
  print( np.log(X_array ) )
print( 'shape' , X_array.shape )
print( 'mean ' , np.mean(X_array) )
print( 'standard deviation' , np.std(X_array) )
shape (10,)
mean  28.5
standard deviation 26.852374196707448

linspace and arange

print( 'linspace 0,2,9 :' , np.linspace(0,2,9) , sep='\t' )
print( 'linspace -0.5,0.5,11 :' , np.linspace(-0.5,0.5,11) , sep='\t' )
print( 'linspace 10,0,11 :' , np.linspace(10,0,11) , sep='\t' )
linspace 0,2,9 :	[0.   0.25 0.5  0.75 1.   1.25 1.5  1.75 2.  ]
linspace -0.5,0.5,11 :	[-0.5 -0.4 -0.3 -0.2 -0.1  0.   0.1  0.2  0.3  0.4  0.5]
linspace 10,0,11 :	[10.  9.  8.  7.  6.  5.  4.  3.  2.  1.  0.]
print( "arange 0,2,0.1 :", np.arange(1.5,2,0.1) , sep='\t' )
print( "arange -1,1,0.125 :", np.arange(-1,1,0.125) , sep='\t' )
print( "arange 10,2 :", np.arange(10,2,1) , sep='\t' ) # reverse does not work!
arange 0,2,0.1 :	[1.5 1.6 1.7 1.8 1.9]
arange -1,1,0.125 :	[-1.    -0.875 -0.75  -0.625 -0.5   -0.375 -0.25  -0.125  0.     0.125
  0.25   0.375  0.5    0.625  0.75   0.875]
arange 10,2 :	[]

Basic plotting

import matplotlib.pyplot as plt

plt.plot( [0,1,2,3] , [10,5,7,0.2] )
plt.show()

Adding color, symbols, …

matplotlib offers many options to customize the appearance of your plot.

Here are the (some) common arguments to plot() (which can also be applied to many other graphical representations):

  • color : could be given as a (red,green,blue) tuple, a name, a hex code, … (see Something better here for all the options)
  • marker : symbols for the data point. '.' is a point, 'v' a down triangle, … see Something better here for the list of possibilities.
  • linestyle : style of the line. '-' is solid, '--' is dashed, '' for no line. See Something better here for more options
  • linewidth : width of the lines
  • markersize : size of the markers

You are invited to experiment and explore these options. Here are a few examples:

y1 = [1,2,3,10,5]
y2 = [10,9,7,5.5,6]
y3 = [4,3,1.5,1]

# green, dashed line, with circle markers
plt.plot( y1, color = 'green', marker = 'o', linestyle = '--', linewidth = 2, markersize = 8 )

# blue triangle with no line
plt.plot( y2, color = 'blue', marker = 'v', linestyle = '' , markersize = 16 )

# solid orange line
plt.plot(y3, color = 'orange', marker = '', linestyle = '-', linewidth = 4 )

plt.show()

Note that:

  • you can call plot several time in a row to make several lines appear (only plt.show() causes the figure to appear)
  • the frame of the picture automatically adjust to what it needs to show

multiple subplots

Now would normally be when we show you how to add labels, titles and legends to figures.

However, the way matplotlib is built, it is actually a bit more efficient to first learn how to create multiple subplots.

Creating multiple plots is possible with the function plt.subplots(). Amon its many arguments, it takes:

  • nrows : number of subplot rows
  • ncols : number of subplot columns
  • figsize : tuple (width,height) of the figure

This function creates a Figure and an Axes object. The Axes object can be either :

  • a simple Axe is there is 1 row and 1 columns
  • a list of Axe objects if there is 1 row and multiple columns, or 1 column and multiple rows
  • a list of lists of Axes objects if there is multiple rows and multiple columns
y1 = [1,2,3,10,5]
y2 = [10,9,7,5.5,6]
y3 = [4,3,1.5,1]


# subplots returns a Figure and an Axes object
fig, ax = plt.subplots(nrows=1, ncols=2) # 2 columns and 1 row

# ax is a list with two objects. Each object correspond to 1 subplot

# accessing to the first column ax[0]
ax[0].plot( y1, color = 'green', marker = 'o', linestyle = '--', linewidth = 2, markersize = 8 )

# accessing to the second column ax[1]
ax[1].plot( y2, color = 'blue', marker = 'v', linestyle = '' , markersize = 16 )
ax[1].plot( y3, color = 'orange', marker = '', linestyle = '-' )

plt.show()

Notice how we call ax[0].plot(...) instead of plt.plot(...) to specify in which subplots we want to plot.

multiple subplots - continued

Let’s see the same thing with several lines and several columns

y1 = [1,2,3,10,5]
y2 = [10,9,7,5.5,6]
y3 = [4,3,1.5,1]
y4 = [1,2,3,7,5]

# 2 columns and 2 rows, and we also set the figure size
fig, ax = plt.subplots(nrows=2, ncols=2 , figsize = (12,12))

# ax is a list of two lists with two objects each.

# accessing to the first row, first column : ax[0][0]
ax[0][0].plot( y1, color = 'green', marker = 'o', linestyle = '--', linewidth = 2, markersize = 8 )

# accessing to the first row, second column : ax[0][1]
ax[0][1].plot( y2, color = 'blue', marker = 'v', linestyle = '' , markersize = 16 )

# accessing to the second row, first column : ax[1][0]
ax[1][0].plot( y3, color = 'orange', marker = 'x', linestyle = '-' )

# accessing to the first row, second column : ax[1][1]
ax[1][1].plot( y4, color = 'teal', linestyle = '-.' , linewidth=5 )

plt.show()

setting up labels

To set the labels at the x-axis, y-axis and title, we use the method of the Axe object:

  • .set_xlabel(...)
  • .set_ylabel(...)
  • .set_title(...)
y1 = [1,2,3,10,5]
y2 = [10,9,7,5.5,6]
y3 = [4,3,1.5,1]

# subplots returns a Figure and an Axes object
fig, ax = plt.subplots(nrows=1, ncols=2 , figsize=(10,5)) # 2 columns and 1 row


# accessing to the first column ax[0]
ax[0].plot( y1, color = 'green', marker = 'o', linestyle = '--', linewidth = 2, markersize = 8 )
ax[0].set_xlabel('x-axis label')
ax[0].set_ylabel('y-axis label')
ax[0].set_title('plot 1')


# accessing to the second column ax[1]
ax[1].plot( y2, color = 'blue', marker = 'v', linestyle = '' , markersize = 16 )
ax[1].plot( y3, color = 'orange', marker = '', linestyle = '-' )
ax[1].set_xlabel('x-axis label')
ax[1].set_ylabel('y-axis label')
ax[1].set_title('plot 2')

plt.show()

setting up a legend

Each element we add to the figure using plot() can be given a label using the label argument. Then, a legend may be added to the figure using the legend() method.

This legend() method can take a loc argument that specifies where it should be plotted. Possible values for this argument are: 'best' , 'upper right' , 'upper left' , 'lower left' , 'lower right' , 'right' , 'center left' , 'center right' , 'lower center' , 'upper center' , 'center' (the default is best).


fig, ax = plt.subplots(nrows=1, ncols=1 , figsize=(10,5)) # 2 columns and 1 row

# NB : with 1 col and 1 row, ax is directly the sole subplot we have
#      so to call it we just use ax.plot , ax.set_xlabel , ...

ax.plot( y1, color = 'green', marker = 'o', linestyle = '--', linewidth = 2 , label = 'line A' )
ax.plot( y2, color = 'blue', marker = 'v', linestyle = '' , markersize =  8 , label = 'line B' )
ax.plot( y3, color = 'orange', marker = '', linestyle = '-' , linewidth = 2 , label = 'line C' )

ax.set_xlabel('x-axis label')
ax.set_ylabel('y-axis label')
ax.set_title('plot with a legend')

#adding a legend in the upper right
ax.legend( loc='upper right')

plt.show()

additional : writing a figure to a file

Writing a matplotlib figure to a file can be achieved simply by replacing the call to plt.show() to plt.savefig(...).

plt.savefig takes a number of argument, the most commons are :

  • fname : name of the file to write the figure. The extension is used to determine the output format (.pdf,.png, .jpg , .svg , …). Many formats are supported, you can get a list with this command : plt.gcf().canvas.get_supported_filetypes()
  • dpi : dots per inches , useful to set-up when saving to raster formats (ie., pixel-based such as png or jpeg). The actual size of the image is set using the argument figsize of plt.subplots()
Comment

in a jupyter notebook the figure will still be shown, whereas in a standard .py script it will not appear on screen.

Here is a demonstration. Apply in on your side and verify that the file testPlot.png was created:

import matplotlib.pyplot as plt

y1 = [1,2,3,10,5]
y2 = [10,9,7,5.5,6]
y3 = [4,3,1.5,1]


# subplots returns a Figure and an Axes object
fig, ax = plt.subplots(nrows=1, ncols=2 , figsize = (10,6) ) # 2 columns and 1 row

# ax is a list with two objects. Each object correspond to 1 subplot

# accessing to the first column ax[0]
ax[0].plot( y1, color = 'green', marker = 'o', linestyle = '--', linewidth = 2, markersize = 8 )

# accessing to the second column ax[1]
ax[1].plot( y2, color = 'blue', marker = 'v', linestyle = '' , markersize = 16 )
ax[1].plot( y3, color = 'orange', marker = '', linestyle = '-' )

plt.savefig( 'testPlot.png' , dpi = 90  )

Exercise 00.01 : bringing together numpy and matplotlib

Numpy arrays can be plotted as if they were lists.

  1. plot x and y, where:
    • y = 1/(1+exp(-x))
    • x varies between -5 and 5 (plotting around a 100 points should suffice).
  2. Bonus : plot multiples lines : y = 1/(1+exp(-x*b)) , for the following values of b: 0.5 , 1 , 2 , 4.
    • x still varies between -5 and 5 (plotting around a 100 points should suffice).
    • put a legend in your plot

You can load the solution directly in this notebook by uncommenting and running the following line:

# %load  -r -8 solutions/solution_00_01.py

bonus question solution:

# %load  -r 9- solutions/solution_00_01.py

Generating random numbers

the basics

import numpy.random as rd

# random floats between 0 and 1
for i in range(4):
    print( rd.random() )

0.6696103730869407
0.7426639266737763
0.6767219223242785
0.8602105555191791
print( rd.random(size=10) ) # draw directly 10 numbers
[0.37971723 0.80354745 0.4168427  0.70867247 0.17547126 0.43760884
 0.75933345 0.06571168 0.45772397 0.67191214]

setting the seed: pseudorandomness and reproducibility

rd.seed(42) # setting the seed to 42
print( '1st draw' , rd.random(size=5) )
print( '2nd draw' , rd.random(size=5) )
rd.seed(42)
print( 'after resetting seed' , rd.random(size=5) )
1st draw [0.37454012 0.95071431 0.73199394 0.59865848 0.15601864]
2nd draw [0.15599452 0.05808361 0.86617615 0.60111501 0.70807258]
after resetting seed [0.37454012 0.95071431 0.73199394 0.59865848 0.15601864]

beyond the uniform distribution

numpy offers you quite a large set of distributions you can draw from.

Let’s look at the normal distribution:


normalDraw = rd.normal(size = 1000 )

print( 'mean ' , np.mean( normalDraw ) )
print( 'stdev' , np.std( normalDraw ) )
mean  0.025354699638558926
stdev 1.0003731428167348
normalDraw2 = rd.normal( loc = -2 , scale = 3 , size = 300 ) # loc chnages the location (mean), and scale changes the standard deviation

print( 'mean ' , np.mean( normalDraw2 ) )
print( 'stdev' , np.std( normalDraw2 ) )
mean  -1.9773491637651965
stdev 2.964622032924749

of course, we could want to plot these drawn numbers:

plt.hist( normalDraw  , alpha = 0.5 , label='loc=0  , scale=1')
plt.hist( normalDraw2 , alpha = 0.5 , label='loc=-2 , scale=3')
plt.legend()
plt.show()

Statistical testing

numpy.random let’s you draw random numbers ; scipy.stats implements the probability density functions, and Percent point function, as well as the most statistical tests.

import scipy.stats as stats

# plotting the probability density function for 1 of the random draw we just made:

x = np.linspace(-10,10,1001)

normPDF = stats.norm.pdf( x , loc = -2 , scale = 3 )

plt.hist( normalDraw2 , alpha = 0.5 , label='random draw' , density = True) # don't forget density=True
plt.plot(x,normPDF , label='PDF' )
plt.legend()
plt.show()

We can also get the expected quantiles of a distribution:

print( '95% quantile of a Chi-square distribution with 3 degrees of freedom:', stats.chi2.ppf(0.95 , df=3))
print( 'fraction of a Chi-square distribution with 3 degrees of freedom above of equal to 5' ,  
      1 - stats.chi2.cdf( 5 , df=3 ) )
95% quantile of a Chi-square distribution with 3 degrees of freedom: 7.814727903251179
fraction of a Chi-square distribution with 3 degrees of freedom above of equal to 5 0.17179714429673354

And you can apply some classical statistical tests:

# t-test of independance between two random samples:
rd.seed(73)

s1 = rd.normal(size=67)
s2 = rd.normal(size=54 , loc = 0.2)

testStat , pval = stats.ttest_ind(s1,s2 , equal_var=True)  # equal variance : Student's t-test ; unequal : Welch's
#almost all of these stat functions return the same test-statistic , pvalue tuple

print('result of the t-test')
print('\tt:',testStat)
print('\tp-value:',pval)
result of the t-test
    t: 0.26673986193074073
    p-value: 0.7901311339594405

What is our conclusion for these tests results? What do you think about this?


# Kolmogorov-smirnov test for a chi-square distribution

sample = rd.chisquare(df=13 , size = 43)


# kstest expect as second argument the cdf function of the reference distribution
# this is how to handle the fact that me must set an argument (degree of freedom)
refDistribution = stats.chi2(df=13).cdf

testStat , pval = stats.kstest( sample , refDistribution )
# alternative : 
# testStat , pval = stats.kstest( sample , lambda x : stats.chi2.cdf(x , df=13 ) )

print('result of the Kolmogorov-Smirnov test comparing our sample to a Chi-square distribution with 13 degrees of freedom')
print('\tK:',testStat)
print('\tp-value:',pval)

result of the Kolmogorov-Smirnov test comparing our sample to a Chi-square distribution with 13 degrees of freedom
    K: 0.12249766392962913
    p-value: 0.5003109000967569

If you are interested, this webpage references all implemented tests, with examples.

back to the top

Bringing together numpy, numpy.random, and matplotlib

The random generation function return a numpy array, meaning it is fairly trivial to combine it with other arrays:

# combining 

x = np.sort( rd.normal(loc=170 , scale = 23 , size = 100) )

y_theoretical = 0.75 * x + 100 # simple linear relationship : y = a * x + b

measurement_noise = rd.normal(scale = 10 , size = 100) # some noise associated to the measure

y_observed = y_theoretical + measurement_noise # observed = expected + noise

fig,ax = plt.subplots(figsize=(8,8))
plt.plot( x , y_theoretical , label = 'expected' )
plt.plot( x , y_observed , marker = '.' , linestyle='' , alpha = 0.7 , label = 'observed')
plt.legend()
plt.show()

The briefest intro to pandas

pandas is a powerful library when doing data analysis, especially in the forms of table.

Basically, it reimplements R data.frame as a DataFrame object and ties together neatly with the libraries we’ve just seen.

 import pandas as pd
    
df = pd.read_table( 'data/beetle.csv' , sep=',' , index_col=0 ) # pandas automatically detects header.

df.head()
dose nexp ndied prop nalive
1 49.1 59 6 0.102 53
2 53.0 60 13 0.217 47
3 56.9 62 18 0.290 44
4 60.8 56 28 0.500 28
5 64.8 63 52 0.825 11
Nrows, Ncols = df.shape
print( 'number of rows:',Nrows, 'number of columns:', Ncols )
print( 'column names' , df.columns )
number of rows: 8 number of columns: 5
column names Index(['dose', 'nexp', 'ndied', 'prop', 'nalive'], dtype='object')
df.describe()
dose nexp ndied prop nalive
count 8.000000 8.000000 8.000000 8.000000 8.000000
mean 62.800000 60.125000 36.375000 0.602000 23.750000
std 9.599702 2.232071 22.557466 0.367937 21.985385
min 49.100000 56.000000 6.000000 0.102000 0.000000
25% 55.925000 59.000000 16.750000 0.271750 4.750000
50% 62.800000 60.000000 40.000000 0.662500 19.500000
75% 69.675000 62.000000 54.750000 0.919500 44.750000
max 76.500000 63.000000 61.000000 1.000000 53.000000
# select a single column:
df['dose']
1    49.1
2    53.0
3    56.9
4    60.8
5    64.8
6    68.7
7    72.6
8    76.5
Name: dose, dtype: float64
df[ ['ndied','nalive'] ] # select several columns
ndied nalive
1 6 53
2 13 47
3 18 44
4 28 28
5 52 11
6 53 6
7 61 1
8 60 0

Plotting DataFrame Columns

Because DataFrame columns are iterable, they can seamlessly be given as argument to plot().


# plotting the column dose along the x-axis and prop along the y-axis
# I use the + marker, with a teal color.
plt.plot(df['dose'] , df['prop'] , color = 'teal' , linestyle='' , marker = '+' , markersize=10 )
plt.xlabel( 'dose' )
plt.ylabel( 'proportion of dead' )
plt.show()

DataFrame column can be manipulated like numpy array:


## we can combine columns using normal operators
Odds = df['nalive'] /df['ndied'] # the odds of being alive is nalive / ndead

## adding a new column to the DataFrame is trivial:
df['Odds'] = Odds


## we can also apply numpy function to them
df['logOdds'] = np.log( df['Odds'] )


plt.plot(df['dose'] , df['logOdds'] , color = 'teal' , linestyle='' , marker = '+' , markersize=10 )
plt.xlabel( 'dose' )
plt.ylabel( 'log Odds' )
plt.show()

Exercise 00.02 : tying everything together

  1. Read the file 'data/kyphosis.csv'.
  2. how many columns are there ?
  3. What is the maximum Age ?
  4. create a new column Stop , corresponding to the addition of columns 'Start' and 'Number'
  5. plot the relationship between 'Age' and 'Number' (bonus point : use colors to indicate the presence or absence of kyphosis ).

Solutions:

# %load  -r -7 solutions/solution_00_02.py
# %load  -r 8-9 solutions/solution_00_02.py
# %load  -r 11-12 solutions/solution_00_02.py
# %load  -r 14-15 solutions/solution_00_02.py
# %load  -r 17-22 solutions/solution_00_02.py
# %load  -r 24- solutions/solution_00_02.py