Creating shareable Python libraries#

The goal of this tutorial is to show you how to make your code more modular, more shareable, and more robust.

Not all code needs to be shareable#

Not every analysis that you do on your data needs to be easy for others to use. Ideally, if you want your work to be reproducible, it will be possible for others to run your code and get the same results. But this is not what we are talking about here. In the course of your work, you will sometimes find that you are creating pieces of code that are useful for you in more than one place, and that may also be useful to others, such as collaborators in your lab or other researchers in your field. These pieces of code deserve to be written and shared in a way that lets others easily adopt them into their own code. To do so, the code needs to be packaged into a library. Here, we will look at the nuts and bolts of doing that.

From notebook to module#

In the course of my work on analysis of diffusion MRI data, I have written the following code in a script or a Jupyter notebook:

# analysis.ipynb or analysis.py
import numpy as np
import pandas as pd

blob_data = pd.read_csv('./input_data/blob.csv')

blob_radius = blob_data['radius']

blob_area = np.pi * blob_radius ** 2
blob_circ = 2 * np.pi * blob_radius

pd.DataFrame(dict(area=blob_area, circ=blob_circ)).to_csv('./output_data/blob_properties.csv')

Unfortunately, this code is not very reusable, even though the results may be perfectly reproducible (provided the input data is accessible).

This is because it mixes file input and output with computations, and it mixes different computations with each other (e.g., the computation of area and the computation of circumference).

The first step is to identify the reusable components of this script and to move them into a module. For example, the calculations of area and circumference each seem like they could be (separately) useful in many different contexts.

Let’s isolate them and rewrite them as functions:

# analysis.ipynb or analysis.py
import numpy as np
import pandas as pd


def calculate_area(r):
    area = np.pi * r ** 2
    return area


def calculate_circ(r):
    circ = 2 * np.pi * r
    return circ

blob_data = pd.read_csv('./input_data/blob.csv')
blob_radius = blob_data['radius']
blob_area = calculate_area(blob_radius)
blob_circ = calculate_circ(blob_radius)
pd.DataFrame(dict(area=blob_area, circ=blob_circ)).to_csv('./output_data/blob_properties.csv')
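
One immediate benefit of isolating the calculations is that each function can be checked on inputs with known answers, without reading any file from disk. The following is a minimal sketch of such a check. The test radii and the use of `np.allclose` are our own choices for illustration and are not part of the original analysis.

```python
import numpy as np


def calculate_area(r):
    area = np.pi * r ** 2
    return area


def calculate_circ(r):
    circ = 2 * np.pi * r
    return circ


# Hypothetical test radii, chosen only for illustration
radii = np.array([1.0, 2.0, 3.0])
areas = calculate_area(radii)
circs = calculate_circ(radii)

# A circle of radius 1 has area pi
assert np.isclose(calculate_area(1.0), np.pi)

# For any circle, area equals circumference * radius / 2
assert np.allclose(areas, circs * radii / 2)
```

Because the functions accept either a single number or an array, the same check covers both cases.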

In the next step, we might move these functions out into a separate file, and document what they do:

# geometry.py
import numpy as np

def calculate_area(r):
    """
    Calculates the area of a circle.

    Parameters
    ----------
    r : float or array
        The radius of a single circle or multiple circles

    Returns
    -------
    area : float or array
        The calculated area/s
    """
    area = np.pi * r ** 2
    return area


def calculate_circ(r):
    """
    Calculates the circumference of a circle.

    Parameters
    ----------
    r : float or array
        The radius of a single circle or multiple circles

    Returns
    -------
    circ : float or array
        The calculated circumference/s
    """
    circ = 2 * np.pi * r
    return circ

Note that nothing is accidental about these docstrings. They carefully comply with the numpy docstring guide. This guide provides details about how to write documentation so that: 1) people who are used to reading documentation formatted in this way know how to read your documentation; and 2) programs that automatically process documentation do what they are supposed to do with your docstrings, for example when converting the docstrings into webpages for online documentation (more about that later).
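
To get a feel for what "programs that automatically process documentation" means, here is a rough sketch that uses only the Python standard library. It reads the section titles out of a docstring and runs an example embedded in it. This is not how documentation generators work internally; it only illustrates that a predictable layout makes docstrings machine-readable. The scalar-only, `math`-based version of `calculate_area` and the added `Examples` section are assumptions made to keep the example self-contained.

```python
import doctest
import inspect
import math


def calculate_area(r):
    """
    Calculates the area of a circle.

    Parameters
    ----------
    r : float
        The radius of the circle

    Returns
    -------
    area : float
        The calculated area

    Examples
    --------
    >>> calculate_area(2.0) == 4 * math.pi
    True
    """
    area = math.pi * r ** 2
    return area


# inspect.getdoc returns the docstring with its common indentation removed
doc = inspect.getdoc(calculate_area)
lines = doc.splitlines()

# In this layout, a section title is a line followed by a dashed underline
sections = [
    title
    for title, underline in zip(lines, lines[1:])
    if underline and set(underline) == {"-"}
]
print(sections)

# doctest locates the ">>>" example in the docstring and runs it
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(verbose=False)
globs = {"calculate_area": calculate_area, "math": math}
for test in finder.find(calculate_area, module=False, globs=globs):
    runner.run(test)
results = runner.summarize(verbose=False)
print(results.failed, results.attempted)
```

If the docstring did not follow a consistent layout, neither the section lookup nor the example runner would have anything reliable to work with.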

Importing and using functions#

Before we continue, we need to know a bit about what happens when you run import statements in Python. When you call import geometry, Python starts by looking for a file called geometry.py in the directory that contains the script you are running (in a notebook or an interactive session, this is your present working directory).
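
You can observe this lookup yourself. The list of directories that Python searches is available as `sys.path`. The sketch below writes a tiny module into a temporary directory and shows that it only becomes importable once that directory is on the search path. The module name `geometry_demo` and its contents are hypothetical, chosen to avoid clashing with the `geometry.py` from this tutorial.

```python
import importlib
import importlib.util
import sys
import tempfile
from pathlib import Path

# Hypothetical module name and contents, used only for this demonstration
module_name = "geometry_demo"
source = (
    "import math\n"
    "\n"
    "\n"
    "def calculate_circ(r):\n"
    "    return 2 * math.pi * r\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    Path(tmp_dir, module_name + ".py").write_text(source)

    # The temporary directory is not on the search path yet,
    # so Python cannot find the module
    found_before = importlib.util.find_spec(module_name) is not None

    # Put the directory at the front of the search path and import
    sys.path.insert(0, tmp_dir)
    importlib.invalidate_caches()
    try:
        geo = importlib.import_module(module_name)
        circ = geo.calculate_circ(1.0)
    finally:
        # Undo the changes so the demonstration leaves no trace
        sys.path.remove(tmp_dir)
        sys.modules.pop(module_name, None)

print(found_before)
print(circ)
```

Editing `sys.path` by hand like this is useful for understanding the mechanism, but it is not something you would normally do in an analysis script.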

Because of this lookup, if you saved geometry.py alongside your analysis script, you can now rewrite the script as:

# analysis.ipynb or analysis.py
import geometry as geo
import pandas as pd

blob_data = pd.read_csv('./input_data/blob.csv')
blob_radius = blob_data['radius']
blob_area = geo.calculate_area(blob_radius)
blob_circ = geo.calculate_circ(blob_radius)
pd.DataFrame(dict(area=blob_area, circ=blob_circ)).to_csv('./output_data/blob_properties.csv')

This is already good, because now you can import and reuse these functions across many different analysis scripts without having to copy this code everywhere. In summary: you have transitioned this part of your code from a one-off notebook or script to a module. Next, let’s see how you transition from a module to a library.