banner

Workshop 3 - Training course in data analysis for genomic surveillance of African malaria vectors


Module 1 - Plotting with Plotly Express#

Theme: Tools & Technology

This module provides an introduction to visualising data with some basic charts using the Plotly Express package for Python.

Learning objectives#

In this module we will learn how to:

  • Prepare data for plotting

  • Create scatter plots

  • Create bar plots

  • Create line plots

Lecture#

English#

Français#

Please note that the code in the cells below might differ from that shown in the video. This can happen because Python packages and their dependencies change due to updates, necessitating tweaks to the code.

Python packages for data visualisation#

Being able to visualise your data is obviously a great skill to have, and there are some fantastic Python packages available for creating a wide range of different visualisations.

In fact, we are spoilt for choice, with packages like:

…and others all providing some incredibly powerful plotting tools for data scientists.

For this module I’ve chosen to begin with Plotly Express because:

  • It supports many different types of chart

  • It has a relatively simple interface with good documentation

  • You can create plots quickly with just a few lines of code (often just a single function call)

  • Plots are interactive

…which makes it relatively easy to learn and a good choice for exploratory data analysis.

In this module we are just going to look at some basic charts, but you might like to browse the Plotly Python website to see what other charts are possible.

Setup#

In this module we’ll use the Plotly Express package, and we’ll also use pandas for loading data to plot. (See workshop 2, module 1 for an introduction to pandas DataFrames if you missed it or need a recap.) Both of these packages are already installed on colab, so we can go ahead and import them.

import pandas as pd
import plotly.express as px

Preparing data for plotting#

Plotly Express can accept data in a variety of different input formats, but it works particularly well when you provide data as a pandas DataFrame.

Let’s remind ourselves what a DataFrame looks like, by loading one of the example DataFrames that come with the Plotly Express package.

df_medals_long = px.data.medals_long()
df_medals_long
nation medal count
0 South Korea gold 24
1 China gold 10
2 Canada gold 9
3 South Korea silver 13
4 China silver 15
5 Canada silver 12
6 South Korea bronze 11
7 China bronze 8
8 Canada bronze 12

One thing worth mentioning is that often the same data can be structured in different ways. For example, the same data above could also be stored in the following DataFrame:

df_medals_wide = px.data.medals_wide()
df_medals_wide
nation gold silver bronze
0 South Korea 24 13 11
1 China 10 15 8
2 Canada 9 12 12

The df_medals_long DataFrame is an example of a “long-form” DataFrame, so-called because it has more rows and fewer columns.

The df_medals_wide DataFrame is an example of a “wide-form” DataFrame, so-called because it has fewer rows and more columns.

Plotly Express can plot either, but for the examples we’re going to look at today, it is slightly more convenient to work with long-form data.

Let’s now load some more interesting data to practise plotting with, which is the Systema Globalis data on income, life expectancy and child mortality by country, used by Gapminder.

def load_gapminder_data():
    """Create a pandas DataFrame with some of the key indicators from the 
    Open Numbers Systema Globalis dataset."""

    # pin to a specific github tag
    base_url = "https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/v1.20.1/"
    
    # load income per person
    df_income = pd.read_csv(base_url + "ddf--datapoints--income_per_person_gdppercapita_ppp_inflation_adjusted--by--geo--time.csv")

    # load life expectancy
    df_life_expectancy = pd.read_csv(base_url + "ddf--datapoints--life_expectancy_at_birth_with_projections--by--geo--time.csv")

    # load population size
    df_population = pd.read_csv(base_url + "ddf--datapoints--population_total--by--geo--time.csv")

    # load child mortality
    df_child_mortality = pd.read_csv(base_url + "ddf--datapoints--child_mortality_0_5_year_olds_dying_per_1000_born--by--geo--time.csv")

    # load country attributes
    df_countries = pd.read_csv(base_url + "ddf--entities--geo--country.csv")

    # rename some columns in the countries dataframe to help with merging
    df_countries = (
        df_countries
        [["country", "name", "world_4region", "world_6region"]]
        .rename(columns={"country": "geo", "name": "country"})
    )

    # capitalise regions
    df_countries["world_4region"] = df_countries["world_4region"].str.capitalize()

    # join all indicators into a single dataframe
    df_gapminder = pd.merge(df_population, df_income, on=["geo", "time"])
    df_gapminder = pd.merge(df_gapminder, df_life_expectancy, on=["geo", "time"])
    df_gapminder = pd.merge(df_gapminder, df_child_mortality, on=["geo", "time"])
    df_gapminder = pd.merge(df_gapminder, df_countries, on="geo")

    # rename some columns to be more concise
    df_gapminder = df_gapminder.rename(
        columns={
            "time": "year",
            "population_total": "population",
            "income_per_person_gdppercapita_ppp_inflation_adjusted": "income_per_person",
            "life_expectancy_at_birth_with_projections": "life_expectancy",
            "child_mortality_0_5_year_olds_dying_per_1000_born": "child_mortality",
        }
    )

    # keep only data between 1950 and 2021 - it's less jumpy
    df_gapminder = df_gapminder.query("1950 <= year <= 2021").reset_index(drop=True)

    # tidy up columns
    df_gapminder.drop(columns=["geo"], inplace=True)
    df_gapminder.insert(0, "country", df_gapminder.pop("country"))  # move country column to the front

    return df_gapminder
df_gapminder = load_gapminder_data()
df_gapminder
country year population income_per_person life_expectancy child_mortality world_4region world_6region
0 Afghanistan 1950 7752117 2392 32.48 415.95 Asia south_asia
1 Afghanistan 1951 7840151 2422 32.87 413.05 Asia south_asia
2 Afghanistan 1952 7935996 2462 33.58 407.19 Asia south_asia
3 Afghanistan 1953 8039684 2568 34.28 401.21 Asia south_asia
4 Afghanistan 1954 8151316 2576 34.99 395.12 Asia south_asia
... ... ... ... ... ... ... ... ...
13531 Zimbabwe 2017 14236599 2568 61.35 49.31 Africa sub_saharan_africa
13532 Zimbabwe 2018 14438812 2621 61.74 46.23 Africa sub_saharan_africa
13533 Zimbabwe 2019 14645473 2392 62.04 44.43 Africa sub_saharan_africa
13534 Zimbabwe 2020 14862927 2412 62.29 43.06 Africa sub_saharan_africa
13535 Zimbabwe 2021 15092171 2424 62.51 42.05 Africa sub_saharan_africa

13536 rows × 8 columns

Scatter plots#

Let’s use the Systema Globalis data to make a scatter plot. To make a scatter plot, we can use the px.scatter() function. Let’s look at the function documentation.

px.scatter?

The px.scatter() function documentation is also available on the Plotly website.

For any given type of plot or chart, there is also usually a user guide on the Plotly website, which provides some helpful examples. For example, here is the Plotly user guide on scatter plots.

First scatter plot#

fig = px.scatter(
    data_frame=df_gapminder.query("year == 2021"),
    x="income_per_person",
    y="life_expectancy",
)
fig

Exercise 1 (English)

Uncomment the code in the cell below and run it to create a scatter plot comparing income_per_person with child_mortality in 2021.

Exercice 1 (Français)

Décommenter le code dans la cellule ci-dessous et l’exécuter afin de créer un diagramme à nuage de points comparant income_per_person avec child_mortality en 2021.

# fig = px.scatter(
#     data_frame=df_gapminder.query("year == 2021"),
#     x="income_per_person",
#     y="child_mortality",
# )
# fig

Hover text (a.k.a. tooltips)#

To help explore these data, let’s use the hover_name and hover_data parameters to add more information into the hover text.

fig = px.scatter(
    data_frame=df_gapminder.query("year == 2021"),
    x="income_per_person",
    y="life_expectancy",
    hover_name="country",
    hover_data=["child_mortality"],
)
fig

N.B., there is lots more information on how to use hover text in the Plotly docs.

Interactive controls#

Every Plotly plot has a set of interactive controls, which appear at the top right of the plot and look like this:

These controls are useful for zooming and panning a plot, as well as for downloading a static version of a plot.

Marker color#

To explore these data further, let’s use the color parameter to represent another variable.

fig = px.scatter(
    data_frame=df_gapminder.query("year == 2021"),
    x="income_per_person",
    y="life_expectancy",
    hover_name="country",
    hover_data=["child_mortality"],
    color="world_4region",
)
fig

Now we can see easily which region of the world each country belongs to.

Marker size#

Let’s use the size parameter to also visualise the population size of each country.

fig = px.scatter(
    data_frame=df_gapminder.query("year == 2021"),
    x="income_per_person",
    y="life_expectancy",
    hover_name="country",
    hover_data=["child_mortality"],
    color="world_4region",
    size="population",
    size_max=80,
)
fig

Note that we also used the size_max parameter to increase the allowed maximum size of markers, which is better for this particular data.

Exercise 2 (English)

Create a scatter plot using the Gapminder data for the year 1950, with income_per_person on the X axis and child_mortality on the Y axis. Use population for the marker size, and world_6region for marker color.

Exercice 2 (Français)

Créer un diagramme à nuage de points utilisant les données de Gapminder pour l’année 1950 avec income_per_person sur l’axe horizontal X et child_mortality sur l’axe vertical Y. Utiliser population pour la taille du point et world_6region pour sa couleur.

Plot title and axis labels#

If we’re presenting this plot to others, it is a good idea to tidy up the axis titles, and to add a title to the plot. We can do this with the labels and title parameters.

fig = px.scatter(
    data_frame=df_gapminder.query("year == 2021"),
    x="income_per_person",
    y="life_expectancy",
    hover_name="country",
    hover_data=["child_mortality"],
    color="world_4region",
    size="population",
    size_max=80,
    labels={
        "income_per_person": "Income",
        "life_expectancy": "Life expectancy",
        "child_mortality": "Child mortality",
        "world_4region": "World region",
        "population": "Population",
    },
    title="Life expectancy and income by country in 2021"
)
fig

Using log scale#

Some variables are more naturally visualised on a log scale, rather than a linear scale. Let’s use the log_x parameter to apply a log scale to the X axis.

fig = px.scatter(
    data_frame=df_gapminder.query("year == 2021"),
    x="income_per_person",
    y="life_expectancy",
    hover_name="country",
    hover_data=["child_mortality"],
    color="world_4region",
    size="population",
    size_max=80,
    labels={
        "income_per_person": "Income",
        "life_expectancy": "Life expectancy",
        "world_4region": "World region",
        "population": "Population",
    },
    title="Life expectancy and income by country in 2021",
    log_x=True,
)
fig

Animation#

Let’s now add another variable, which is year. When you have a variable that represents time, it can also be useful to visualise this as an animation. We can do this via the animation_frame parameter.

fig = px.scatter(
    data_frame=df_gapminder,
    x="income_per_person",
    y="life_expectancy",
    hover_name="country",
    hover_data=["child_mortality"],
    color="world_4region",
    size="population",
    size_max=80,
    labels={
        "income_per_person": "Income",
        "life_expectancy": "Life expectancy",
        "world_4region": "World region",
        "population": "Population",
        "year": "Year",
    },
    title="Life expectancy and income by country, 1950-2021",
    log_x=True,
    animation_frame="year",
    range_x=[200, 200_000],
    range_y=[20, 95],
    height=700,
)
fig

Visual styling#

The scatter plot we’ve created above is more than good enough if we are doing some exploratory data analysis, but in case you need to make a really strong visual impact and you want to change any aspect of how the plot looks, you can do that via various additional function calls which update the figure. Here’s an example, where we alter the template, change the X axis tick positions and labels, and change the marker line color to black.

fig = px.scatter(
    data_frame=df_gapminder,
    x="income_per_person",
    y="life_expectancy",
    hover_name="country",
    hover_data=["child_mortality"],
    color="world_4region",
    size="population",
    size_max=80,
    labels={
        "income_per_person": "Income per person (GDP/capita, PPP$ inflation-adjusted)",
        "life_expectancy": "Life expectancy (years)",
        "world_4region": "World region",
        "population": "Population",
        "year": "Year",
    },
    title="Life expectancy and income by country, 1950-2021",
    log_x=True,
    animation_frame="year",
    range_x=[200, 200_000],
    range_y=[20, 95],
#     color_discrete_sequence=px.colors.qualitative.Set1,
    color_discrete_map={"Asia": "#ff5872", "Africa": "#00d5e9", "Europe": "#ffe700", "Americas": "#7feb00"},
    opacity=0.9,
    template="plotly_white",
    height=600,
    width=800,
)

fig.update_layout(
    xaxis = dict(
        tickmode = "array",
        tickvals = [500, 1_000, 2_000, 4_000, 8_000, 16_000, 32_000, 64_000, 128_000],
        ticktext = ["500", "1000", "2000", "4000", "8000", "16k", "32k", "64k", "128k"]
    )
)

fig.update_xaxes(showline=True, linewidth=1, linecolor="black")
fig.update_yaxes(showline=True, linewidth=1, linecolor="black")

fig.update_traces(
    marker=dict(line=dict(width=.5, color="black")),
)

fig

There’s more info on styling on the Plotly website, as well as info on continuous color scales and discrete colors.

Exercise 3 (English)

Create an animated scatter plot from the Gapminder data as above, but using child_mortality on the Y axis and world_6region for marker color.

Also, use a different palette for the marker colors. Hint: use the color_discrete_sequence parameter, and choose your favourite discrete color sequence (palette) from the Plotly website.

Exercice 3 (Français)

Créer un diagramme à nuage de points animé utilisant les données de Gapminder comme ci-dessus mais affichant child_mortality sur l’axe Y et world_6region pour la couleur du marqueur.

Utiliser aussi une palette de couleur différente. Indice: utiliser le paramètre color_discrete_sequence et choisir votre palette favorite sur le site Plotly.

3D scatter plots#

For a bit of extra interest, let’s use the px.scatter_3d() function to make a 3-dimensional version of the Gapminder animation, adding in the child_mortality variable.

fig = px.scatter_3d(
    data_frame=df_gapminder, 
    x="income_per_person", 
    y="life_expectancy", 
    z="child_mortality",
    hover_name="country",
    color="world_4region",
    size="population",
    size_max=100,
    animation_frame="year",
    log_x=True,
    range_x=[200, 200_000],
    range_y=[0, 95],
    range_z=[0, 500],
    height=700,
    width=700,
)

fig.update_layout(
    scene=dict(aspectmode="cube"),
    legend=dict(itemsizing="constant"),
)

fig

Bar plots#

To illustrate bar plots let’s use data from Alliance for Malaria Prevention’s Net Mapping Project. We’ll combine data from the 2020 report and the 2022 Q1 report, which together provide data on LLIN shipments by country for 2004-2021 broken down by LLIN type (standard, PBO and dual active ingredient).

def load_llin_data():
    """Load data on LLIN shipments from the Alliance for Malaria Prevention's
    Net Mapping Project."""

    # N.B., data are split over several spreadsheets, so some munging is required.

    # N.B., files have been obtained from the AMP website and uploaded to 
    # Google Cloud Storage for efficient download.

    # load the "Final-2020.xlsx" dataset, "SSA" sheet - this has LLINs for 2004-2020
    df_nmp_2020_ssa = pd.read_excel(
        "https://storage.googleapis.com/vo_agam_release/reference/amp_net_mapping_project/Final-2020.xlsx", 
        sheet_name="SSA",
        skiprows=2,
        skipfooter=2,
        names=["country"] + list(range(2004, 2021)),
        usecols=list(range(18))
    )

    # load the "Final-2020.xlsx" dataset, "SSA by net type" sheet - this has LLINs by type for 2018, 2019, 2020
    df_nmp_2020_ssa_by_type = pd.read_excel(
        "https://storage.googleapis.com/vo_agam_release/reference/amp_net_mapping_project/Final-2020.xlsx", 
        sheet_name="SSA by net type",
        skiprows=3,
        skipfooter=8,
        usecols="A,B,C,F,G,H,K,L,M",
        names=[
            "country",
            "2018_standard",
            "2018_pbo",
            "2019_standard",
            "2019_pbo",
            "2019_dual",
            "2020_standard",
            "2020_pbo",
            "2020_dual",
        ],
    )

    # load the "NMP-1st-Q-2022.xlsx" dataset, "SSA by Type" sheet - this has LLINs by type for 2019, 2020, 2021
    df_nmp_2022q1_ssa_by_type = pd.read_excel(
        "https://storage.googleapis.com/vo_agam_release/reference/amp_net_mapping_project/NMP-1st-Q-2022.xlsx",
        sheet_name="SSA by Type",
        skiprows=3,
        skipfooter=2,
        usecols="A,C,D,E,H,I,J,M,N,O",
        names=[
            "country",
            "2019_standard",
            "2019_pbo",
            "2019_dual",
            "2020_standard",
            "2020_pbo",
            "2020_dual",
            "2021_standard",
            "2021_pbo",
            "2021_dual",
        ],
    )

    # N.B., we would like LLINs by type for the full range 2004-2021.
    # We also would like the data in "long form" for easier plotting.
    # Let's munge!

    # start with data prior to 2018
    df_llins_pre_2018 = (
        df_nmp_2020_ssa
        .melt(id_vars="country", var_name="year", value_name="llins_shipped")
        .query("year < 2018")
    )
    df_llins_pre_2018["llin_type"] = "standard"  # assume all standard llins prior to 2018

    # now grab the data by type for 2018
    df_llins_2018 = (
        df_nmp_2020_ssa_by_type
        [["country", "2018_standard", "2018_pbo"]]
        .melt(id_vars="country", var_name="year_type", value_name="llins_shipped")
    )
    df_year_type = (
        df_llins_2018["year_type"]
        .str.split("_", expand=True)
        .rename(columns={0: "year", 1: "llin_type"})
    )
    df_llins_2018["year"] = df_year_type["year"]
    df_llins_2018["llin_type"] = df_year_type["llin_type"]
    df_llins_2018.drop(columns="year_type", inplace=True)

    # now grab the data by type for 2019, 2020, 2021
    df_llins_post_2018 = (
        df_nmp_2022q1_ssa_by_type
        [["country", "2019_standard", "2019_pbo", "2019_dual", "2020_standard", "2020_pbo", "2020_dual", "2021_standard", "2021_pbo", "2021_dual"]]
        .melt(id_vars="country", var_name="year_type", value_name="llins_shipped")
    )
    df_year_type = (
        df_llins_post_2018["year_type"]
        .str.split("_", expand=True)
        .rename(columns={0: "year", 1: "llin_type"})
    )
    df_llins_post_2018["year"] = df_year_type["year"]
    df_llins_post_2018["llin_type"] = df_year_type["llin_type"]
    df_llins_post_2018.drop(columns="year_type", inplace=True)

    # finally, concatenate everything
    df_llins = pd.concat([df_llins_pre_2018, df_llins_2018, df_llins_post_2018]).reset_index(drop=True)

    # ensure years have the right dtype
    df_llins["year"] = df_llins["year"].astype(int)

    # normalise country names
    df_llins["country"].replace("Congo (Democratic Republic of the)", "DR Congo", inplace=True)

    return df_llins
df_llins = load_llin_data()
df_llins
country year llins_shipped llin_type
0 Angola 2004 154010 standard
1 Benin 2004 26500 standard
2 Botswana 2004 0 standard
3 Burkina Faso 2004 216500 standard
4 Burundi 2004 160250 standard
... ... ... ... ...
1154 Togo 2021 0 dual
1155 Uganda 2021 0 dual
1156 Zambia 2021 0 dual
1157 Zanzibar 2021 0 dual
1158 Zimbabwe 2021 0 dual

1159 rows × 4 columns

df_llins.query("country == 'Nigeria'")
country year llins_shipped llin_type
30 Nigeria 2004 71400 standard
76 Nigeria 2005 262000 standard
122 Nigeria 2006 2147404 standard
168 Nigeria 2007 2724304 standard
214 Nigeria 2008 15310222 standard
260 Nigeria 2009 19813977 standard
306 Nigeria 2010 29908286 standard
352 Nigeria 2011 2555096 standard
398 Nigeria 2012 5452563 standard
444 Nigeria 2013 26355032 standard
490 Nigeria 2014 42973544 standard
536 Nigeria 2015 23794214 standard
582 Nigeria 2016 11240307 standard
628 Nigeria 2017 35498731 standard
674 Nigeria 2018 18635909 standard
720 Nigeria 2018 51000 pbo
767 Nigeria 2019 31642624 standard
814 Nigeria 2019 1760400 pbo
861 Nigeria 2019 0 dual
908 Nigeria 2020 4449900 standard
955 Nigeria 2020 11717441 pbo
1002 Nigeria 2020 5567000 dual
1049 Nigeria 2021 1433000 standard
1096 Nigeria 2021 33048807 pbo
1143 Nigeria 2021 2833598 dual

First bar plot#

To make a bar plot, we can use the px.bar() function. Let’s look at the function documentation.

px.bar?

Again the px.bar() function docs are also on the Plotly website, and there is also a guide to bar charts.

Let’s now make a bar plot, with year on the X axis and llins_shipped on the Y axis.

fig = px.bar(
    data_frame=df_llins, 
    x="year", 
    y="llins_shipped"
)
fig

Improved bar plot#

Let’s now improve the bar plot by using color, hover text, and doing some visual styling.

fig = px.bar(
    data_frame=df_llins, 
    x="year", 
    y="llins_shipped", 
    color="llin_type", 
    hover_name="country",
    labels={
        "year": "Year",
        "llins_shipped": "No. LLINs",
        "llin_type": "LLIN type"
    },
    title="LLIN shipments to countries in Sub-Saharan Africa",
    width=800,
    template="plotly_white",
)
fig

Exercise 4 (English)

Make a bar chart from the LLIN data as above, but using country for the X axis and year for the hover name.

Exercice 4 (Français)

Créer un diagramme à barres pour les données sur les LLINs comme au-dessus mais en utilisant country pour l’axe X et year pour le texte de survol.

Line and area plots#

Let’s also use the LLIN data to make some line and area plots, via the px.line() and px.area() functions.

Here is a line plot of LLINs shipped to Nigeria.

fig = px.line(
    data_frame=df_llins.query("country == 'Nigeria'"),
    x="year",
    y="llins_shipped",
    color="llin_type",
    markers=True,
    width=800,
    title="LLIN shipments to Nigeria",
    labels={
        "year": "Year",
        "llins_shipped": "No. LLINs",
        "llin_type": "LLIN type"
    },
    template="plotly_white",
)
fig

Exercise 5 (English)

Make a line plot as above but for Democratic Republic of the Congo. Hint: use the query "country == 'DR Congo'"

Exercice 5 (Français)

Créer un diagramme à lignes comme ci-dessus mais pour la République Démocratique du Congo. Indice: Utiliser la requête "country == 'DR Congo'"

Exercise 6 (English)

Make an area plot using the LLIN data from Nigeria. Hint: it’s exactly the same parameters as the line plot, just call the px.area() function instead of px.line().

Exercice 6 (Français)

Créer un diagramme à zones utilisant les données des LLINs du Nigeria. Indice: Les paramètres sont les mêmes que pour le diagramme à lignes mais il faut appeler la fonction px.area() au lieu de px.line().

Well done!#

Hopefully this has been a useful introduction to plotting in Python.

As I mentioned earlier, there are lots more plot types that Plotly Express provides, take a look at the user guide and the API docs for more information.

Happy plotting!

Exercises#

English#

Open this notebook in Google Colab and run it for yourself from top to bottom. As you run through the notebook, cell by cell, think about what each cell is doing, and try the practical exercises along the way.

Have go at the practical exercises, but please don’t worry if you don’t have time to do them all during the practical session, and please ask the teaching assistants for help if you are stuck.

Hint: To open the notebook in Google Colab, click the rocket icon at the top of the page, then select “Colab” from the drop-down menu.

Français#

Ouvrir ce notebook dans Google Colab et l’exécuter vous-même du début à la fin. Pendant que vous exécutez le notebook, cellule par cellule, pensez à ce que chaque cellule fait et essayez de faire les exercices quand vous les rencontrez.

Essayez de faire les exercices mais ne vous inquiétez pas si vous n’avez pas le temps de tout faire pendant la séance appliquée et n’hésitez pas à demander aux enseignants assistants si vous avez besoin d’aide parce que vous êtes bloqués.

Indice: Pour ouvrir le notebook dans Google Colab, cliquer sur l’icône de fusée au sommet de cette page puis choisissez “Colab” dans le menu déroulant.