{ "metadata": {}, "nbformat": 4, "nbformat_minor": 5, "cells": [ { "id": "metadata", "cell_type": "markdown", "source": "
\n\n# Python - Warm-up for statistics and machine learning\n\nby [Wandrille Duchemin](https://training.galaxyproject.org/hall-of-fame/wandrilled/)\n\nCC-BY licensed content from the [Galaxy Training Network](https://training.galaxyproject.org/)\n\n**Objectives**\n\n- to do\n\n**Time Estimation: 1H**\n
\n", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-0", "source": "
\n
Agenda
\n

In this tutorial, we will cover:

\n
    \n
  1. Basic python
\n
\n
\n

Basic python

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-1", "source": [ "\n", "X = []\n", "\n", "for i in range(10):\n", " X.append( i**2 )\n", "\n", "print(X)" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "> In this tutorial, we will cover:" ], "id": "" } } }, { "id": "cell-2", "source": "

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-3", "source": [ "\n", "for x in X:\n", " print(x)\n", "" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "> In this tutorial, we will cover:" ], "id": "" } } }, { "id": "cell-4", "source": "

0\n 1\n 4\n 9\n 16\n 25\n 36\n 49\n 64\n 81

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-5", "source": [ "for x in X:\n", " if x%2 == 1:\n", " print(x,'is odd')\n", " else:\n", " print(x,'is even')" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "> In this tutorial, we will cover:" ], "id": "" } } }, { "id": "cell-6", "source": "

0 is even\n 1 is odd\n 4 is even\n 9 is odd\n 16 is even\n 25 is odd\n 36 is even\n 49 is odd\n 64 is even\n 81 is odd

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-7", "source": [ "# list comprehension is a very fine way of compressing all this\n", "\n", "X = [ i**2 for i in range(10) ]\n", "\n", "Xeven = [ x for x in X if x%2 == 0 ]\n", "Xodd = [ x for x in X if x%2 == 1 ]\n", "\n", "\n", "print( 'X ', X )\n", "print( 'Xeven', Xeven )\n", "print( 'Xodd ', Xodd )" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "> In this tutorial, we will cover:" ], "id": "" } } }, { "id": "cell-8", "source": "

X [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]\n Xeven [0, 4, 16, 36, 64]\n Xodd [1, 9, 25, 49, 81]

\n
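
The comprehensions above follow the pattern [ expression for item in iterable if condition ]. As a minimal sketch (the names squares, labels and parity_of are illustrative and not part of the tutorial), the same idea extends to conditional expressions and to dict comprehensions:

```python
# Squares of 0..9, as in the cells above
squares = [i**2 for i in range(10)]

# A conditional expression (a if cond else b) labels every element,
# whereas a trailing 'if' filters elements out
labels = ['odd' if x % 2 == 1 else 'even' for x in squares]

# A dict comprehension maps each square to its parity label
parity_of = {x: label for x, label in zip(squares, labels)}

print(labels[:4])
print(parity_of[81])
```

The trailing if keeps or drops items, while the conditional expression changes what is stored for each item; mixing up the two is a common source of syntax errors.

\n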


numpy and vectorized operations

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-9", "source": [ "import numpy as np\n", "\n", "X_array = np.array(X)\n", "\n", "print(X_array)" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "> In this tutorial, we will cover:" ], "id": "" } } }, { "id": "cell-10", "source": "

[ 0 1 4 9 16 25 36 49 64 81]

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-11", "source": [ "print(X_array / 2 )" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "> In this tutorial, we will cover:" ], "id": "" } } }, { "id": "cell-12", "source": "

[ 0. 0.5 2. 4.5 8. 12.5 18. 24.5 32. 40.5]

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-13", "source": [ "print( np.exp(X_array ) )\n", "print( np.log(X_array ) )" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "> In this tutorial, we will cover:" ], "id": "" } } }, { "id": "cell-14", "source": "

[1.00000000e+00 2.71828183e+00 5.45981500e+01 8.10308393e+03\n 8.88611052e+06 7.20048993e+10 4.31123155e+15 1.90734657e+21\n 6.23514908e+27 1.50609731e+35]\n [ -inf 0. 1.38629436 2.19722458 2.77258872 3.21887582\n 3.58351894 3.8918203 4.15888308 4.39444915]

\n
/tmp/ipykernel_490123/2855859755.py:2: RuntimeWarning: divide by zero encountered in log\n  print( np.log(X_array ) )\n
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-15", "source": [ "print( 'shape' , X_array.shape )\n", "print( 'mean ' , np.mean(X_array) )\n", "print( 'standard deviation' , np.std(X_array) )" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "> In this tutorial, we will cover:" ], "id": "" } } }, { "id": "cell-16", "source": "

shape (10,)\n mean 28.5\n standard deviation 26.852374196707448

\n
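
One detail matters for statistics: by default np.std divides by N (population standard deviation). The sketch below, which rebuilds X_array so that it runs on its own, shows the ddof argument that switches to the N-1 (sample) estimator; the variable names are illustrative.

```python
import numpy as np

X_array = np.array([i**2 for i in range(10)])

# Default: population standard deviation (divides by N, ddof=0)
std_population = np.std(X_array)

# Sample standard deviation (divides by N-1)
std_sample = np.std(X_array, ddof=1)

print('population std:', std_population)
print('sample std    :', std_sample)
```

Note that pandas uses ddof=1 by default in its own std method, so the two libraries can report slightly different values for the same data.

\n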

linspace and arange

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-17", "source": [ "print( 'linspace 0,2,9 :' , np.linspace(0,2,9) , sep='\\t' )\n", "print( 'linspace -0.5,0.5,11 :' , np.linspace(-0.5,0.5,11) , sep='\\t' )\n", "print( 'linspace 10,0,11 :' , np.linspace(10,0,11) , sep='\\t' )" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "> In this tutorial, we will cover:" ], "id": "" } } }, { "id": "cell-18", "source": "

linspace 0,2,9 :\t[0. 0.25 0.5 0.75 1. 1.25 1.5 1.75 2. ]\n linspace -0.5,0.5,11 :\t[-0.5 -0.4 -0.3 -0.2 -0.1 0. 0.1 0.2 0.3 0.4 0.5]\n linspace 10,0,11 :\t[10. 9. 8. 7. 6. 5. 4. 3. 2. 1. 0.]

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-19", "source": [ "print( \"arange 1.5,2,0.1 :\", np.arange(1.5,2,0.1) , sep='\\t' )\n", "print( \"arange -1,1,0.125 :\", np.arange(-1,1,0.125) , sep='\\t' )\n", "print( \"arange 10,2,1 :\", np.arange(10,2,1) , sep='\\t' ) # empty: a positive step cannot count down (use a step of -1 instead)" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "> In this tutorial, we will cover:" ], "id": "" } } }, { "id": "cell-20", "source": "

arange 1.5,2,0.1 :\t[1.5 1.6 1.7 1.8 1.9]\n arange -1,1,0.125 :\t[-1. -0.875 -0.75 -0.625 -0.5 -0.375 -0.25 -0.125 0. 0.125\n 0.25 0.375 0.5 0.625 0.75 0.875]\n arange 10,2,1 :\t[]

\n
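
The empty result of np.arange(10,2,1) comes from the positive step: arange walks from start towards stop by adding the step, and never includes stop. A minimal sketch of the two usual ways to count down (the names down and pts are illustrative):

```python
import numpy as np

# Counting down needs a negative step; the stop value (2) is excluded
down = np.arange(10, 2, -1)

# linspace takes a number of points and includes both endpoints by default
pts = np.linspace(10, 2, 9)

print(down)
print(pts)
```

As a rule of thumb, arange is convenient when the step matters and linspace when the number of points and the exact endpoints matter.

\n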

Basic plotting

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-21", "source": [ "import matplotlib.pyplot as plt\n", "\n", "plt.plot( [0,1,2,3] , [10,5,7,0.2] )\n", "plt.show()" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "> In this tutorial, we will cover:" ], "id": "" } } }, { "id": "cell-22", "source": "

Adding color, symbols, …

\n

matplotlib offers many options to customize the appearance of your plot.

\n

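Here are some common arguments to plot() (many of them also apply to other matplotlib plotting functions):

\n
- color: color of the line and markers, e.g. 'green'
- marker: marker symbol, e.g. 'o' (circle) or 'v' (triangle)
- linestyle: line style, e.g. '-' (solid), '--' (dashed) or '' (no line)
- linewidth: width of the line
- markersize: size of the markers
\n\n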

You are invited to experiment and explore these options. Here are a few examples:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-23", "source": [ "y1 = [1,2,3,10,5]\n", "y2 = [10,9,7,5.5,6]\n", "y3 = [4,3,1.5,1]\n", "\n", "# green, dashed line, with circle markers\n", "plt.plot( y1, color = 'green', marker = 'o', linestyle = '--', linewidth = 2, markersize = 8 )\n", "\n", "# blue triangle with no line\n", "plt.plot( y2, color = 'blue', marker = 'v', linestyle = '' , markersize = 16 )\n", "\n", "# solid orange line\n", "plt.plot(y3, color = 'orange', marker = '', linestyle = '-', linewidth = 4 )\n", "\n", "plt.show()" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "> In this tutorial, we will cover:" ], "id": "" } } }, { "id": "cell-24", "source": "

Note that:

\n\n

multiple subplots

\n

Now would normally be when we show you how to add labels, titles and legends to figures.

\n

However, the way matplotlib is built, it is actually a bit more efficient to first learn how to create multiple subplots.

\n

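Creating multiple plots is possible with the function plt.subplots().\nAmong its many arguments, it takes:

\n
- nrows: number of subplot rows
- ncols: number of subplot columns
- figsize: size of the whole figure, as a (width, height) tuple
\n\n

This function creates a Figure and an Axes object.\nThe Axes object can be either:

\n
- a single Axes object, when there is 1 row and 1 column
- a 1-dimensional array of Axes, when there is a single row or a single column
- a 2-dimensional array of Axes, indexed as ax[row][col], otherwise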

\n\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-25", "source": [ "y1 = [1,2,3,10,5]\n", "y2 = [10,9,7,5.5,6]\n", "y3 = [4,3,1.5,1]\n", "\n", "\n", "# subplots returns a Figure and an Axes object\n", "fig, ax = plt.subplots(nrows=1, ncols=2) # 2 columns and 1 row\n", "\n", "# ax is a list with two objects. Each object correspond to 1 subplot\n", "\n", "# accessing to the first column ax[0]\n", "ax[0].plot( y1, color = 'green', marker = 'o', linestyle = '--', linewidth = 2, markersize = 8 )\n", "\n", "# accessing to the second column ax[1]\n", "ax[1].plot( y2, color = 'blue', marker = 'v', linestyle = '' , markersize = 16 )\n", "ax[1].plot( y3, color = 'orange', marker = '', linestyle = '-' )\n", "\n", "plt.show()" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "> In this tutorial, we will cover:" ], "id": "" } } }, { "id": "cell-26", "source": "

Notice how we call ax[0].plot(...) instead of plt.plot(...) to specify in which subplots we want to plot.

\n
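
What plt.subplots() returns as its second value depends on nrows and ncols. The following sketch checks this directly; it assumes the default squeeze behaviour of plt.subplots, and the variable names are illustrative. The Agg backend line is only there so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

fig1, ax_single = plt.subplots(nrows=1, ncols=1)   # a single Axes object
fig2, ax_row = plt.subplots(nrows=1, ncols=2)      # 1D array of 2 Axes
fig3, ax_grid = plt.subplots(nrows=2, ncols=2)     # 2D array, indexed [row][col]

print(isinstance(ax_single, np.ndarray))
print(ax_row.shape)
print(ax_grid.shape)

# flatten() gives a 1D array that is convenient to loop over
for i, a in enumerate(ax_grid.flatten()):
    a.set_title('panel ' + str(i))

plt.close('all')
```

Looping over the flattened array is a handy way to apply the same setting to every subplot without writing out each pair of indices.

\n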

multiple subplots - continued

\n

Let’s see the same thing with several rows and several columns:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-27", "source": [ "y1 = [1,2,3,10,5]\n", "y2 = [10,9,7,5.5,6]\n", "y3 = [4,3,1.5,1]\n", "y4 = [1,2,3,7,5]\n", "\n", "# 2 columns and 2 rows, and we also set the figure size\n", "fig, ax = plt.subplots(nrows=2, ncols=2 , figsize = (12,12))\n", "\n", "# ax is a list of two lists with two objects each.\n", "\n", "# accessing the first row, first column : ax[0][0]\n", "ax[0][0].plot( y1, color = 'green', marker = 'o', linestyle = '--', linewidth = 2, markersize = 8 )\n", "\n", "# accessing the first row, second column : ax[0][1]\n", "ax[0][1].plot( y2, color = 'blue', marker = 'v', linestyle = '' , markersize = 16 )\n", "\n", "# accessing the second row, first column : ax[1][0]\n", "ax[1][0].plot( y3, color = 'orange', marker = 'x', linestyle = '-' )\n", "\n", "# accessing the second row, second column : ax[1][1]\n", "ax[1][1].plot( y4, color = 'teal', linestyle = '-.' , linewidth=5 )\n", "\n", "plt.show()" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "> In this tutorial, we will cover:" ], "id": "" } } }, { "id": "cell-28", "source": "

setting up labels

\n

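To set the x-axis label, the y-axis label and the title, we use methods of the Axes object:

\n
- set_xlabel: label of the x-axis
- set_ylabel: label of the y-axis
- set_title: title of the subplot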

\n\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-29", "source": [ "y1 = [1,2,3,10,5]\n", "y2 = [10,9,7,5.5,6]\n", "y3 = [4,3,1.5,1]\n", "\n", "# subplots returns a Figure and an Axes object\n", "fig, ax = plt.subplots(nrows=1, ncols=2 , figsize=(10,5)) # 2 columns and 1 row\n", "\n", "\n", "# accessing to the first column ax[0]\n", "ax[0].plot( y1, color = 'green', marker = 'o', linestyle = '--', linewidth = 2, markersize = 8 )\n", "ax[0].set_xlabel('x-axis label')\n", "ax[0].set_ylabel('y-axis label')\n", "ax[0].set_title('plot 1')\n", "\n", "\n", "# accessing to the second column ax[1]\n", "ax[1].plot( y2, color = 'blue', marker = 'v', linestyle = '' , markersize = 16 )\n", "ax[1].plot( y3, color = 'orange', marker = '', linestyle = '-' )\n", "ax[1].set_xlabel('x-axis label')\n", "ax[1].set_ylabel('y-axis label')\n", "ax[1].set_title('plot 2')\n", "\n", "plt.show()" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "> In this tutorial, we will cover:" ], "id": "" } } }, { "id": "cell-30", "source": "

setting up a legend

\n

Each element we add to the figure using plot() can be given a label using the label argument.\nThen, a legend may be added to the figure using the legend() method.

\n

This legend() method can take a loc argument that specifies where it should be plotted.\nPossible values for this argument are: 'best' , 'upper right' , 'upper left' , 'lower left' , 'lower right' , 'right' , 'center left' , 'center right' , 'lower center' , 'upper center' , 'center' (the default is best).

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-31", "source": [ "\n", "fig, ax = plt.subplots(nrows=1, ncols=1 , figsize=(10,5)) # 1 column and 1 row\n", "\n", "# NB : with 1 col and 1 row, ax is directly the sole subplot we have\n", "# so to call it we just use ax.plot , ax.set_xlabel , ...\n", "\n", "ax.plot( y1, color = 'green', marker = 'o', linestyle = '--', linewidth = 2 , label = 'line A' )\n", "ax.plot( y2, color = 'blue', marker = 'v', linestyle = '' , markersize = 8 , label = 'line B' )\n", "ax.plot( y3, color = 'orange', marker = '', linestyle = '-' , linewidth = 2 , label = 'line C' )\n", "\n", "ax.set_xlabel('x-axis label')\n", "ax.set_ylabel('y-axis label')\n", "ax.set_title('plot with a legend')\n", "\n", "#adding a legend in the upper right\n", "ax.legend( loc='upper right')\n", "\n", "plt.show()\n", "" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "> In this tutorial, we will cover:" ], "id": "" } } }, { "id": "cell-32", "source": "

additional : writing a figure to a file

\n

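Writing a matplotlib figure to a file can be achieved simply by replacing the call to plt.show() with a call to plt.savefig(...).

\n

plt.savefig takes a number of arguments; the most common are:

\n
- the file name; its extension (e.g. .png) determines the output format
- dpi: the resolution, in dots per inch
\n\n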
\n
Comment
\n

In a Jupyter notebook the figure will still be shown, whereas in a standard .py script it will not appear on screen.

\n
\n
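
As a sketch of what this looks like in a standalone script, the example below saves a small figure and checks that the file exists. The output location (a temporary directory) and the file name sketch.png are assumptions made for this illustration, and bbox_inches='tight' is an optional extra that trims surrounding whitespace.

```python
import os
import tempfile

import matplotlib
matplotlib.use('Agg')  # only needed for a script without a display; omit in a notebook
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(4, 3))
ax.plot([1, 2, 3, 10, 5], color='green', marker='o')

# Hypothetical output location: a temporary directory
out_path = os.path.join(tempfile.mkdtemp(), 'sketch.png')

# fig.savefig is the Figure-level equivalent of plt.savefig
fig.savefig(out_path, dpi=90, bbox_inches='tight')
plt.close(fig)

print(os.path.exists(out_path))
```

Calling savefig on the Figure object rather than on plt makes it explicit which figure is written when several are open.

\n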

Here is a demonstration. Run it on your side and verify that the file testPlot.png was created:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-33", "source": [ "import matplotlib.pyplot as plt\n", "\n", "y1 = [1,2,3,10,5]\n", "y2 = [10,9,7,5.5,6]\n", "y3 = [4,3,1.5,1]\n", "\n", "\n", "# subplots returns a Figure and an Axes object\n", "fig, ax = plt.subplots(nrows=1, ncols=2 , figsize = (10,6) ) # 2 columns and 1 row\n", "\n", "# ax is a list with two objects. Each object correspond to 1 subplot\n", "\n", "# accessing to the first column ax[0]\n", "ax[0].plot( y1, color = 'green', marker = 'o', linestyle = '--', linewidth = 2, markersize = 8 )\n", "\n", "# accessing to the second column ax[1]\n", "ax[1].plot( y2, color = 'blue', marker = 'v', linestyle = '' , markersize = 16 )\n", "ax[1].plot( y3, color = 'orange', marker = '', linestyle = '-' )\n", "\n", "plt.savefig( 'testPlot.png' , dpi = 90 )" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "> In this tutorial, we will cover:" ], "id": "" } } }, { "id": "cell-34", "source": "

Exercise 00.01 : bringing together numpy and matplotlib

\n

Numpy arrays can be plotted as if they were lists.

\n
    \n
  1. Plot x and y, where:\n\n
  2. Bonus: plot multiple lines y = 1/(1+exp(-x*b)), for the following values of b: 0.5, 1, 2, 4.\n\n
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-35", "source": [ "" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "> In this tutorial, we will cover:" ], "id": "" } } }, { "id": "cell-36", "source": "

You can load the solution directly in this notebook by uncommenting and running the following line:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-37", "source": [ "# %load -r -8 solutions/solution_00_01.py" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "> In this tutorial, we will cover:" ], "id": "" } } }, { "id": "cell-38", "source": "

bonus question solution:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-39", "source": [ "# %load -r 9- solutions/solution_00_01.py" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "> In this tutorial, we will cover:" ], "id": "" } } }, { "id": "cell-40", "source": "

Generating random numbers

\n

the basics

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-41", "source": [ "import numpy.random as rd\n", "\n", "# random floats between 0 and 1\n", "for i in range(4):\n", " print( rd.random() )\n", "" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "> In this tutorial, we will cover:" ], "id": "" } } }, { "id": "cell-42", "source": "

0.6696103730869407\n 0.7426639266737763\n 0.6767219223242785\n 0.8602105555191791

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-43", "source": [ "print( rd.random(size=10) ) # draw directly 10 numbers" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "> In this tutorial, we will cover:" ], "id": "" } } }, { "id": "cell-44", "source": "

[0.37971723 0.80354745 0.4168427 0.70867247 0.17547126 0.43760884\n 0.75933345 0.06571168 0.45772397 0.67191214]

\n

setting the seed: pseudorandomness and reproducibility

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-45", "source": [ "rd.seed(42) # setting the seed to 42\n", "print( '1st draw' , rd.random(size=5) )\n", "print( '2nd draw' , rd.random(size=5) )\n", "rd.seed(42)\n", "print( 'after resetting seed' , rd.random(size=5) )" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "> In this tutorial, we will cover:" ], "id": "" } } }, { "id": "cell-46", "source": "

1st draw [0.37454012 0.95071431 0.73199394 0.59865848 0.15601864]\n 2nd draw [0.15599452 0.05808361 0.86617615 0.60111501 0.70807258]\n after resetting seed [0.37454012 0.95071431 0.73199394 0.59865848 0.15601864]

\n
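
rd.seed() sets the state of a single global generator. NumPy also provides np.random.default_rng, which returns an independent generator object; this is an alternative to the approach used in this tutorial, not a replacement for it. In the sketch below the names rng_a and rng_b are illustrative, and the numbers drawn differ from those obtained with rd.seed(42) because a different underlying algorithm is used.

```python
import numpy as np

# Two independent generators created with the same seed
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

draw_a = rng_a.random(size=5)
draw_b = rng_b.random(size=5)

# Same seed, same sequence
print(np.array_equal(draw_a, draw_b))

# Generators offer the same distributions, e.g. normal with loc and scale
sample = rng_a.normal(loc=-2, scale=3, size=300)
print(sample.shape)
```

Passing a generator object around keeps different parts of an analysis from silently sharing, and disturbing, one global random state.

\n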

beyond the uniform distribution

\n

numpy offers you quite a large set of distributions you can draw from.

\n

Let’s look at the normal distribution:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-47", "source": [ "\n", "normalDraw = rd.normal(size = 1000 )\n", "\n", "print( 'mean ' , np.mean( normalDraw ) )\n", "print( 'stdev' , np.std( normalDraw ) )" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "> In this tutorial, we will cover:" ], "id": "" } } }, { "id": "cell-48", "source": "

mean 0.025354699638558926\n stdev 1.0003731428167348

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-49", "source": [ "normalDraw2 = rd.normal( loc = -2 , scale = 3 , size = 300 ) # loc changes the location (mean), and scale changes the standard deviation\n", "\n", "print( 'mean ' , np.mean( normalDraw2 ) )\n", "print( 'stdev' , np.std( normalDraw2 ) )" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "> In this tutorial, we will cover:" ], "id": "" } } }, { "id": "cell-50", "source": "

mean -1.9773491637651965\n stdev 2.964622032924749

\n

Of course, we may want to plot these random draws:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-51", "source": [ "plt.hist( normalDraw , alpha = 0.5 , label='loc=0 , scale=1')\n", "plt.hist( normalDraw2 , alpha = 0.5 , label='loc=-2 , scale=3')\n", "plt.legend()\n", "plt.show()" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "> In this tutorial, we will cover:" ], "id": "" } } }, { "id": "cell-52", "source": "

Statistical testing

\n

numpy.random lets you draw random numbers;\nscipy.stats implements probability density functions, percent point functions (quantile functions), as well as most common statistical tests.

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-53", "source": [ "import scipy.stats as stats\n", "\n", "# plotting the probability density function for 1 of the random draw we just made:\n", "\n", "x = np.linspace(-10,10,1001)\n", "\n", "normPDF = stats.norm.pdf( x , loc = -2 , scale = 3 )\n", "\n", "plt.hist( normalDraw2 , alpha = 0.5 , label='random draw' , density = True) # don't forget density=True\n", "plt.plot(x,normPDF , label='PDF' )\n", "plt.legend()\n", "plt.show()" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "> In this tutorial, we will cover:" ], "id": "" } } }, { "id": "cell-54", "source": "

We can also get the expected quantiles of a distribution:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-55", "source": [ "print( '95% quantile of a Chi-square distribution with 3 degrees of freedom:', stats.chi2.ppf(0.95 , df=3))\n", "print( 'fraction of a Chi-square distribution with 3 degrees of freedom above or equal to 5' ,\n", " 1 - stats.chi2.cdf( 5 , df=3 ) )" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "> In this tutorial, we will cover:" ], "id": "" } } }, { "id": "cell-56", "source": "

95% quantile of a Chi-square distribution with 3 degrees of freedom: 7.814727903251179\n fraction of a Chi-square distribution with 3 degrees of freedom above or equal to 5 0.17179714429673354

\n
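
The percent point function (ppf) is the inverse of the cumulative distribution function (cdf), and the upper tail 1 - cdf is also available directly as the survival function sf. A small sketch to check both relationships (the variable names are illustrative):

```python
import scipy.stats as stats

q95 = stats.chi2.ppf(0.95, df=3)          # quantile function (inverse of the CDF)
back = stats.chi2.cdf(q95, df=3)          # applying the CDF recovers 0.95

upper_tail = 1 - stats.chi2.cdf(5, df=3)  # as in the cell above
upper_tail_sf = stats.chi2.sf(5, df=3)    # survival function: same quantity, computed directly

print(q95, back)
print(upper_tail, upper_tail_sf)
```

For very small tail probabilities sf is usually preferable, since 1 - cdf loses precision when cdf is close to 1.

\n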

And you can apply some classical statistical tests:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-57", "source": [ "# t-test comparing the means of two independent random samples:\n", "rd.seed(73)\n", "\n", "s1 = rd.normal(size=67)\n", "s2 = rd.normal(size=54 , loc = 0.2)\n", "\n", "testStat , pval = stats.ttest_ind(s1,s2 , equal_var=True) # equal variance : Student's t-test ; unequal : Welch's\n", "#almost all of these stat functions return the same test-statistic , pvalue tuple\n", "\n", "print('result of the t-test')\n", "print('\\tt:',testStat)\n", "print('\\tp-value:',pval)" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "> In this tutorial, we will cover:" ], "id": "" } } }, { "id": "cell-58", "source": "

result of the t-test\n t: 0.26673986193074073\n p-value: 0.7901311339594405

\n
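
To see what stats.ttest_ind computes when equal_var=True, the statistic can be rebuilt by hand from the pooled variance. This is a sketch on freshly simulated data: the seed 0 and the use of np.random.default_rng are arbitrary choices for this illustration, so the numbers differ from the output above.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)   # arbitrary seed chosen for this sketch
s1 = rng.normal(size=67)
s2 = rng.normal(loc=0.2, size=54)

n1, n2 = len(s1), len(s2)

# Pooled variance from the two sample variances (ddof=1)
pooled_var = ((n1 - 1) * np.var(s1, ddof=1) + (n2 - 1) * np.var(s2, ddof=1)) / (n1 + n2 - 2)

# Student's t statistic and its two-sided p-value with n1 + n2 - 2 degrees of freedom
t_manual = (np.mean(s1) - np.mean(s2)) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))
p_manual = 2 * stats.t.sf(abs(t_manual), df=n1 + n2 - 2)

t_scipy, p_scipy = stats.ttest_ind(s1, s2, equal_var=True)

print(t_manual, t_scipy)
print(p_manual, p_scipy)
```

The two pairs of values should agree up to floating point error, which makes explicit that the p-value is simply the two-sided tail area of a t distribution.

\n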

What is our conclusion from this test result? What do you think about it?

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-59", "source": [ "\n", "# Kolmogorov-Smirnov test for a chi-square distribution\n", "\n", "sample = rd.chisquare(df=13 , size = 43)\n", "\n", "\n", "# kstest expects as second argument the cdf function of the reference distribution\n", "# this is how to handle the fact that we must set an argument (degrees of freedom)\n", "refDistribution = stats.chi2(df=13).cdf\n", "\n", "testStat , pval = stats.kstest( sample , refDistribution )\n", "# alternative :\n", "# testStat , pval = stats.kstest( sample , lambda x : stats.chi2.cdf(x , df=13 ) )\n", "\n", "print('result of the Kolmogorov-Smirnov test comparing our sample to a Chi-square distribution with 13 degrees of freedom')\n", "print('\\tK:',testStat)\n", "print('\\tp-value:',pval)\n", "" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "> In this tutorial, we will cover:" ], "id": "" } } }, { "id": "cell-60", "source": "

result of the Kolmogorov-Smirnov test comparing our sample to a Chi-square distribution with 13 degrees of freedom\n K: 0.12249766392962913\n p-value: 0.5003109000967569

\n

If you are interested, this webpage references all implemented tests, with examples.

\n


Bringing together numpy, numpy.random, and matplotlib

\n

The random generation functions return numpy arrays, so it is straightforward to combine their output with other arrays:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-61", "source": [ "# combining\n", "\n", "x = np.sort( rd.normal(loc=170 , scale = 23 , size = 100) )\n", "\n", "y_theoretical = 0.75 * x + 100 # simple linear relationship : y = a * x + b\n", "\n", "measurement_noise = rd.normal(scale = 10 , size = 100) # some noise associated to the measure\n", "\n", "y_observed = y_theoretical + measurement_noise # observed = expected + noise\n", "\n", "fig,ax = plt.subplots(figsize=(8,8))\n", "plt.plot( x , y_theoretical , label = 'expected' )\n", "plt.plot( x , y_observed , marker = '.' , linestyle='' , alpha = 0.7 , label = 'observed')\n", "plt.legend()\n", "plt.show()" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "> In this tutorial, we will cover:" ], "id": "" } } }, { "id": "cell-62", "source": "

The briefest intro to pandas

\n

pandas is a powerful library for data analysis, especially when the data comes in the form of tables.

\n

Basically, it reimplements R data.frame as a DataFrame object and ties together neatly with the libraries we’ve just seen.
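
\n

A DataFrame does not have to come from a file. As a minimal sketch, one can be built from a dict of columns; the values below are copied by hand from the first rows of the beetle data used in this section, and the name df_small is illustrative.

```python
import pandas as pd

# Each dict key becomes a column; index sets the row labels
df_small = pd.DataFrame(
    {
        'dose':  [49.1, 53.0, 56.9],
        'nexp':  [59, 60, 62],
        'ndied': [6, 13, 18],
    },
    index=[1, 2, 3],
)

print(df_small.shape)
print(df_small.dtypes)
```

This is convenient for trying out DataFrame operations when the data file is not at hand.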

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-63", "source": [ "import pandas as pd\n", "\n", "df = pd.read_table( 'data/beetle.csv' , sep=',' , index_col=0 ) # pandas automatically detects header.\n", "\n", "df.head()" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "> In this tutorial, we will cover:" ], "id": "" } } }, { "id": "cell-64", "source": "
\n
| | dose | nexp | ndied | prop | nalive |
|---|---|---|---|---|---|
| 1 | 49.1 | 59 | 6 | 0.102 | 53 |
| 2 | 53.0 | 60 | 13 | 0.217 | 47 |
| 3 | 56.9 | 62 | 18 | 0.290 | 44 |
| 4 | 60.8 | 56 | 28 | 0.500 | 28 |
| 5 | 64.8 | 63 | 52 | 0.825 | 11 |
\n
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-65", "source": [ "Nrows, Ncols = df.shape\n", "print( 'number of rows:',Nrows, 'number of columns:', Ncols )\n", "print( 'column names' , df.columns )" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "> In this tutorial, we will cover:" ], "id": "" } } }, { "id": "cell-66", "source": "

number of rows: 8 number of columns: 5\n column names Index(['dose', 'nexp', 'ndied', 'prop', 'nalive'], dtype='object')

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-67", "source": [ "df.describe()" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "> In this tutorial, we will cover:" ], "id": "" } } }, { "id": "cell-68", "source": "
\n
| | dose | nexp | ndied | prop | nalive |
|---|---|---|---|---|---|
| count | 8.000000 | 8.000000 | 8.000000 | 8.000000 | 8.000000 |
| mean | 62.800000 | 60.125000 | 36.375000 | 0.602000 | 23.750000 |
| std | 9.599702 | 2.232071 | 22.557466 | 0.367937 | 21.985385 |
| min | 49.100000 | 56.000000 | 6.000000 | 0.102000 | 0.000000 |
| 25% | 55.925000 | 59.000000 | 16.750000 | 0.271750 | 4.750000 |
| 50% | 62.800000 | 60.000000 | 40.000000 | 0.662500 | 19.500000 |
| 75% | 69.675000 | 62.000000 | 54.750000 | 0.919500 | 44.750000 |
| max | 76.500000 | 63.000000 | 61.000000 | 1.000000 | 53.000000 |
\n
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-69", "source": [ "# select a single column:\n", "df['dose']" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "> In this tutorial, we will cover:" ], "id": "" } } }, { "id": "cell-70", "source": "

1 49.1\n 2 53.0\n 3 56.9\n 4 60.8\n 5 64.8\n 6 68.7\n 7 72.6\n 8 76.5\n Name: dose, dtype: float64

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-71", "source": [ "df[ ['ndied','nalive'] ] # select several columns" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "> In this tutorial, we will cover:" ], "id": "" } } }, { "id": "cell-72", "source": "
\n
| | ndied | nalive |
|---|---|---|
| 1 | 6 | 53 |
| 2 | 13 | 47 |
| 3 | 18 | 44 |
| 4 | 28 | 28 |
| 5 | 52 | 11 |
| 6 | 53 | 6 |
| 7 | 61 | 1 |
| 8 | 60 | 0 |
\n
\n

Plotting DataFrame Columns

\n

Because DataFrame columns are iterable, they can be passed directly as arguments to plot().

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-73", "source": [ "\n", "# plotting the column dose along the x-axis and prop along the y-axis\n", "# I use the + marker, with a teal color.\n", "plt.plot(df['dose'] , df['prop'] , color = 'teal' , linestyle='' , marker = '+' , markersize=10 )\n", "plt.xlabel( 'dose' )\n", "plt.ylabel( 'proportion of dead' )\n", "plt.show()" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "> In this tutorial, we will cover:" ], "id": "" } } }, { "id": "cell-74", "source": "

DataFrame columns can be manipulated like numpy arrays:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-75", "source": [ "\n", "## we can combine columns using normal operators\n", "Odds = df['nalive'] /df['ndied'] # the odds of being alive is nalive / ndead\n", "\n", "## adding a new column to the DataFrame is trivial:\n", "df['Odds'] = Odds\n", "\n", "\n", "## we can also apply numpy function to them\n", "df['logOdds'] = np.log( df['Odds'] )\n", "\n", "\n", "plt.plot(df['dose'] , df['logOdds'] , color = 'teal' , linestyle='' , marker = '+' , markersize=10 )\n", "plt.xlabel( 'dose' )\n", "plt.ylabel( 'log Odds' )\n", "plt.show()\n", "" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "> In this tutorial, we will cover:" ], "id": "" } } }, { "id": "cell-76", "source": "
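
One detail of the log odds computed above deserves attention: in the last row nalive is 0, so the odds are 0 and their logarithm is -inf. The sketch below reproduces this on the last three rows of the ndied / nalive table, typed in by hand; the names counts, total, odds and log_odds are illustrative.

```python
import numpy as np
import pandas as pd

# Last three rows of the ndied / nalive table shown earlier
counts = pd.DataFrame({'ndied': [53, 61, 60], 'nalive': [6, 1, 0]}, index=[6, 7, 8])

# Operators act row by row, aligned on the index
total = counts['ndied'] + counts['nalive']
odds = counts['nalive'] / counts['ndied']

# log(0) is -inf; errstate silences the divide-by-zero warning seen earlier with np.log
with np.errstate(divide='ignore'):
    log_odds = np.log(odds)

print(total.tolist())
print(np.isinf(log_odds).tolist())
```

Infinite values cannot be drawn on a plot, so it is worth checking for them with np.isinf before interpreting a figure or fitting a model.

\n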

Exercise 00.02 : tying everything together

\n
    \n
  1. Read the file 'data/kyphosis.csv'.\n
  2. How many columns are there?\n
  3. What is the maximum Age?\n
  4. Create a new column Stop, corresponding to the sum of the columns 'Start' and 'Number'.\n
  5. Plot the relationship between 'Age' and 'Number' (bonus point: use colors to indicate the presence or absence of kyphosis).\n
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-77", "source": [ "" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "> In this tutorial, we will cover:" ], "id": "" } } }, { "id": "cell-78", "source": "

Solutions:

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-79", "source": [ "# %load -r -7 solutions/solution_00_02.py" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "> In this tutorial, we will cover:" ], "id": "" } } }, { "id": "cell-80", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-81", "source": [ "# %load -r 8-9 solutions/solution_00_02.py" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "> In this tutorial, we will cover:" ], "id": "" } } }, { "id": "cell-82", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-83", "source": [ "# %load -r 11-12 solutions/solution_00_02.py" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "> In this tutorial, we will cover:" ], "id": "" } } }, { "id": "cell-84", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-85", "source": [ "# %load -r 14-15 solutions/solution_00_02.py" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "> In this tutorial, we will cover:" ], "id": "" } } }, { "id": "cell-86", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-87", "source": [ "# %load -r 17-22 solutions/solution_00_02.py" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "> In this tutorial, we will cover:" ], "id": "" } } }, { "id": "cell-88", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-89", "source": [ "# %load -r 24- solutions/solution_00_02.py" ], "cell_type": "code", "execution_count": null, 
"outputs": [], "metadata": { "attributes": { "classes": [ "> In this tutorial, we will cover:" ], "id": "" } } }, { "id": "cell-90", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "cell_type": "markdown", "id": "final-ending-cell", "metadata": { "editable": false, "collapsed": false }, "source": [ "# Key Points\n\n", "- to do\n", "\n# Congratulations on successfully completing this tutorial!\n\n", "Please [fill out the feedback on the GTN website](https://training.galaxyproject.org/training-material/topics/data-science/tutorials/python-warmup-stat-ml/tutorial.html#feedback) and check there for further resources!\n" ] } ] }