Module 1 - Plotting with Plotly Express

Workshop 3 - Training course in data analysis for genomic surveillance of African malaria vectors

Theme: Tools & Technology

This module provides an introduction to visualising data with some basic charts using the Plotly Express package for Python.

Learning objectives

In this module we will learn how to:

Prepare data for plotting
Create scatter plots
Create bar plots
Create line plots

Lecture

The lecture for this module is available as a video in English and in French.

Please note that the code in the cells below might differ from that shown in the video. This can happen because Python packages and their dependencies change due to updates, necessitating tweaks to the code.

Python packages for data visualisation

Being able to visualise your data is a great skill to have, and there are some fantastic Python packages available for creating a wide range of different visualisations.

In fact, we are spoilt for choice, with many packages providing some incredibly powerful plotting tools for data scientists.

For this module I’ve chosen to begin with Plotly Express because:

It supports many different types of chart
It has a relatively simple interface with good documentation
You can create plots quickly with just a few lines of code (often just a single function call)
Plots are interactive

…which makes it relatively easy to learn and a good choice for exploratory data analysis.

In this module we are just going to look at some basic charts, but you might like to browse the Plotly Python website to see what other charts are possible.

Setup

In this module we’ll use the Plotly Express package, and we’ll also use pandas for loading data to plot. (See workshop 2, module 1 for an introduction to pandas DataFrames if you missed it or need a recap.) Both of these packages are already installed on Colab, so we can go ahead and import them.

import pandas as pd
import plotly.io as pio
pio.renderers.default = "notebook+colab"
import plotly.express as px

Preparing data for plotting

Plotly Express can accept data in a variety of different input formats, but it works particularly well when you provide data as a pandas DataFrame.

Let’s remind ourselves what a DataFrame looks like, by loading one of the example DataFrames that come with the Plotly Express package.

df_medals_long = px.data.medals_long()
df_medals_long

	nation	medal	count
0	South Korea	gold	24
1	China	gold	10
2	Canada	gold	9
3	South Korea	silver	13
4	China	silver	15
5	Canada	silver	12
6	South Korea	bronze	11
7	China	bronze	8
8	Canada	bronze	12

One thing worth mentioning is that often the same data can be structured in different ways. For example, the same data above could also be stored in the following DataFrame:

df_medals_wide = px.data.medals_wide()
df_medals_wide

	nation	gold	silver	bronze
0	South Korea	24	13	11
1	China	10	15	8
2	Canada	9	12	12

The df_medals_long DataFrame is an example of a “long-form” DataFrame, so-called because it has more rows and fewer columns.

The df_medals_wide DataFrame is an example of a “wide-form” DataFrame, so-called because it has fewer rows and more columns.

Plotly Express can plot either, but for the examples we’re going to look at today, it is slightly more convenient to work with long-form data.
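If your own data arrive in wide form, one way to reshape them is the pandas `melt()` method. The sketch below rebuilds the wide-form medals table by hand (so that it does not depend on the Plotly example data) and converts it to long form; the variable names are chosen for illustration.

```python
import pandas as pd

# The wide-form medals table shown above, typed in by hand.
df_wide = pd.DataFrame(
    {
        "nation": ["South Korea", "China", "Canada"],
        "gold": [24, 10, 9],
        "silver": [13, 15, 12],
        "bronze": [11, 8, 12],
    }
)

# Keep "nation" as an identifier and stack the three medal columns
# into a pair of columns: one for the medal type, one for the count.
df_long = df_wide.melt(id_vars="nation", var_name="medal", value_name="count")

print(df_long)
```

The result has one row per nation and medal type, which is the same layout as df_medals_long.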

Let’s now load some more interesting data to practise plotting with: the Systema Globalis data on income, life expectancy and child mortality by country, as used by Gapminder.

def load_gapminder_data():
    """Create a pandas DataFrame with some of the key indicators from the 
    Open Numbers Systema Globalis dataset."""

    # pin to a specific github tag
    base_url = "https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/v1.20.1/"
    
    # load income per person
    df_income = pd.read_csv(base_url + "ddf--datapoints--income_per_person_gdppercapita_ppp_inflation_adjusted--by--geo--time.csv")

    # load life expectancy
    df_life_expectancy = pd.read_csv(base_url + "ddf--datapoints--life_expectancy_at_birth_with_projections--by--geo--time.csv")

    # load population size
    df_population = pd.read_csv(base_url + "ddf--datapoints--population_total--by--geo--time.csv")

    # load child mortality
    df_child_mortality = pd.read_csv(base_url + "ddf--datapoints--child_mortality_0_5_year_olds_dying_per_1000_born--by--geo--time.csv")

    # load country attributes
    df_countries = pd.read_csv(base_url + "ddf--entities--geo--country.csv")

    # rename some columns in the countries dataframe to help with merging
    df_countries = (
        df_countries
        [["country", "name", "world_4region", "world_6region"]]
        .rename(columns={"country": "geo", "name": "country"})
    )

    # capitalise regions
    df_countries["world_4region"] = df_countries["world_4region"].str.capitalize()

    # join all indicators into a single dataframe
    df_gapminder = pd.merge(df_population, df_income, on=["geo", "time"])
    df_gapminder = pd.merge(df_gapminder, df_life_expectancy, on=["geo", "time"])
    df_gapminder = pd.merge(df_gapminder, df_child_mortality, on=["geo", "time"])
    df_gapminder = pd.merge(df_gapminder, df_countries, on="geo")

    # rename some columns to be more concise
    df_gapminder = df_gapminder.rename(
        columns={
            "time": "year",
            "population_total": "population",
            "income_per_person_gdppercapita_ppp_inflation_adjusted": "income_per_person",
            "life_expectancy_at_birth_with_projections": "life_expectancy",
            "child_mortality_0_5_year_olds_dying_per_1000_born": "child_mortality",
        }
    )

    # keep only data between 1950 and 2021 - it's less jumpy
    df_gapminder = df_gapminder.query("1950 <= year <= 2021").reset_index(drop=True)

    # tidy up columns
    df_gapminder.drop(columns=["geo"], inplace=True)
    df_gapminder.insert(0, "country", df_gapminder.pop("country"))  # move country column to the front

    return df_gapminder

df_gapminder = load_gapminder_data()
df_gapminder

	country	year	population	income_per_person	life_expectancy	child_mortality	world_4region	world_6region
0	Afghanistan	1950	7752117	2392	32.48	415.95	Asia	south_asia
1	Afghanistan	1951	7840151	2422	32.87	413.05	Asia	south_asia
2	Afghanistan	1952	7935996	2462	33.58	407.19	Asia	south_asia
3	Afghanistan	1953	8039684	2568	34.28	401.21	Asia	south_asia
4	Afghanistan	1954	8151316	2576	34.99	395.12	Asia	south_asia
...	...	...	...	...	...	...	...	...
13531	Zimbabwe	2017	14236599	2568	61.35	49.31	Africa	sub_saharan_africa
13532	Zimbabwe	2018	14438812	2621	61.74	46.23	Africa	sub_saharan_africa
13533	Zimbabwe	2019	14645473	2392	62.04	44.43	Africa	sub_saharan_africa
13534	Zimbabwe	2020	14862927	2412	62.29	43.06	Africa	sub_saharan_africa
13535	Zimbabwe	2021	15092171	2424	62.51	42.05	Africa	sub_saharan_africa

13536 rows × 8 columns
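The loading function above combines the indicator tables with `pd.merge()`, which performs an inner join by default, so a `geo`/`time` pair is kept only if it is present in both tables. Here is a toy sketch of that behaviour; the codes and values are made up and are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical population table: three geo/time pairs.
df_pop = pd.DataFrame(
    {
        "geo": ["aaa", "aaa", "bbb"],
        "time": [2020, 2021, 2021],
        "population": [10, 11, 20],
    }
)

# Hypothetical income table: no row for ("aaa", 2020).
df_inc = pd.DataFrame(
    {
        "geo": ["aaa", "bbb"],
        "time": [2021, 2021],
        "income": [100, 200],
    }
)

# Default how="inner": only pairs found in both tables survive.
df_merged = pd.merge(df_pop, df_inc, on=["geo", "time"])

print(df_merged)
```

This is worth keeping in mind when a merged DataFrame has fewer rows than expected: rows without a match in every table are dropped.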

Scatter plots

Let’s use the Systema Globalis data to make a scatter plot. To make a scatter plot, we can use the px.scatter() function. Let’s look at the function documentation.

px.scatter?

Signature:
px.scatter(
    data_frame=None,
    x=None,
    y=None,
    color=None,
    symbol=None,
    size=None,
    hover_name=None,
    hover_data=None,
    custom_data=None,
    text=None,
    facet_row=None,
    facet_col=None,
    facet_col_wrap=0,
    facet_row_spacing=None,
    facet_col_spacing=None,
    error_x=None,
    error_x_minus=None,
    error_y=None,
    error_y_minus=None,
    animation_frame=None,
    animation_group=None,
    category_orders=None,
    labels=None,
    orientation=None,
    color_discrete_sequence=None,
    color_discrete_map=None,
    color_continuous_scale=None,
    range_color=None,
    color_continuous_midpoint=None,
    symbol_sequence=None,
    symbol_map=None,
    opacity=None,
    size_max=None,
    marginal_x=None,
    marginal_y=None,
    trendline=None,
    trendline_options=None,
    trendline_color_override=None,
    trendline_scope='trace',
    log_x=False,
    log_y=False,
    range_x=None,
    range_y=None,
    render_mode='auto',
    title=None,
    subtitle=None,
    template=None,
    width=None,
    height=None,
) -> plotly.graph_objs._figure.Figure
Docstring:
    In a scatter plot, each row of `data_frame` is represented by a symbol
    mark in 2D space.
    
Parameters
----------
data_frame: DataFrame or array-like or dict
    This argument needs to be passed for column names (and not keyword
    names) to be used. Array-like and dict are transformed internally to a
    pandas DataFrame. Optional: if missing, a DataFrame gets constructed
    under the hood using the other arguments.
x: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    position marks along the x axis in cartesian coordinates. Either `x` or
    `y` can optionally be a list of column references or array_likes,  in
    which case the data will be treated as if it were 'wide' rather than
    'long'.
y: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    position marks along the y axis in cartesian coordinates. Either `x` or
    `y` can optionally be a list of column references or array_likes,  in
    which case the data will be treated as if it were 'wide' rather than
    'long'.
color: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    assign color to marks.
symbol: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    assign symbols to marks.
size: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    assign mark sizes.
hover_name: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like appear in bold
    in the hover tooltip.
hover_data: str, or list of str or int, or Series or array-like, or dict
    Either a name or list of names of columns in `data_frame`, or pandas
    Series, or array_like objects or a dict with column names as keys, with
    values True (for default formatting) False (in order to remove this
    column from hover information), or a formatting string, for example
    ':.3f' or '|%a' or list-like data to appear in the hover tooltip or
    tuples with a bool or formatting string as first element, and list-like
    data to appear in hover as second element Values from these columns
    appear as extra data in the hover tooltip.
custom_data: str, or list of str or int, or Series or array-like
    Either name or list of names of columns in `data_frame`, or pandas
    Series, or array_like objects Values from these columns are extra data,
    to be used in widgets or Dash callbacks for example. This data is not
    user-visible but is included in events emitted by the figure (lasso
    selection etc.)
text: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like appear in the
    figure as text labels.
facet_row: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    assign marks to facetted subplots in the vertical direction.
facet_col: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    assign marks to facetted subplots in the horizontal direction.
facet_col_wrap: int
    Maximum number of facet columns. Wraps the column variable at this
    width, so that the column facets span multiple rows. Ignored if 0, and
    forced to 0 if `facet_row` or a `marginal` is set.
facet_row_spacing: float between 0 and 1
    Spacing between facet rows, in paper units. Default is 0.03 or 0.07
    when facet_col_wrap is used.
facet_col_spacing: float between 0 and 1
    Spacing between facet columns, in paper units Default is 0.02.
error_x: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    size x-axis error bars. If `error_x_minus` is `None`, error bars will
    be symmetrical, otherwise `error_x` is used for the positive direction
    only.
error_x_minus: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    size x-axis error bars in the negative direction. Ignored if `error_x`
    is `None`.
error_y: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    size y-axis error bars. If `error_y_minus` is `None`, error bars will
    be symmetrical, otherwise `error_y` is used for the positive direction
    only.
error_y_minus: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    size y-axis error bars in the negative direction. Ignored if `error_y`
    is `None`.
animation_frame: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    assign marks to animation frames.
animation_group: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    provide object-constancy across animation frames: rows with matching
    `animation_group`s will be treated as if they describe the same object
    in each frame.
category_orders: dict with str keys and list of str values (default `{}`)
    By default, in Python 3.6+, the order of categorical values in axes,
    legends and facets depends on the order in which these values are first
    encountered in `data_frame` (and no order is guaranteed by default in
    Python below 3.6). This parameter is used to force a specific ordering
    of values per column. The keys of this dict should correspond to column
    names, and the values should be lists of strings corresponding to the
    specific display order desired.
labels: dict with str keys and str values (default `{}`)
    By default, column names are used in the figure for axis titles, legend
    entries and hovers. This parameter allows this to be overridden. The
    keys of this dict should correspond to column names, and the values
    should correspond to the desired label to be displayed.
orientation: str, one of `'h'` for horizontal or `'v'` for vertical. 
    (default `'v'` if `x` and `y` are provided and both continuous or both
    categorical,  otherwise `'v'`(`'h'`) if `x`(`y`) is categorical and
    `y`(`x`) is continuous,  otherwise `'v'`(`'h'`) if only `x`(`y`) is
    provided)
color_discrete_sequence: list of str
    Strings should define valid CSS-colors. When `color` is set and the
    values in the corresponding column are not numeric, values in that
    column are assigned colors by cycling through `color_discrete_sequence`
    in the order described in `category_orders`, unless the value of
    `color` is a key in `color_discrete_map`. Various useful color
    sequences are available in the `plotly.express.colors` submodules,
    specifically `plotly.express.colors.qualitative`.
color_discrete_map: dict with str keys and str values (default `{}`)
    String values should define valid CSS-colors Used to override
    `color_discrete_sequence` to assign a specific colors to marks
    corresponding with specific values. Keys in `color_discrete_map` should
    be values in the column denoted by `color`. Alternatively, if the
    values of `color` are valid colors, the string `'identity'` may be
    passed to cause them to be used directly.
color_continuous_scale: list of str
    Strings should define valid CSS-colors This list is used to build a
    continuous color scale when the column denoted by `color` contains
    numeric data. Various useful color scales are available in the
    `plotly.express.colors` submodules, specifically
    `plotly.express.colors.sequential`, `plotly.express.colors.diverging`
    and `plotly.express.colors.cyclical`.
range_color: list of two numbers
    If provided, overrides auto-scaling on the continuous color scale.
color_continuous_midpoint: number (default `None`)
    If set, computes the bounds of the continuous color scale to have the
    desired midpoint. Setting this value is recommended when using
    `plotly.express.colors.diverging` color scales as the inputs to
    `color_continuous_scale`.
symbol_sequence: list of str
    Strings should define valid plotly.js symbols. When `symbol` is set,
    values in that column are assigned symbols by cycling through
    `symbol_sequence` in the order described in `category_orders`, unless
    the value of `symbol` is a key in `symbol_map`.
symbol_map: dict with str keys and str values (default `{}`)
    String values should define plotly.js symbols Used to override
    `symbol_sequence` to assign a specific symbols to marks corresponding
    with specific values. Keys in `symbol_map` should be values in the
    column denoted by `symbol`. Alternatively, if the values of `symbol`
    are valid symbol names, the string `'identity'` may be passed to cause
    them to be used directly.
opacity: float
    Value between 0 and 1. Sets the opacity for markers.
size_max: int (default `20`)
    Set the maximum mark size when using `size`.
marginal_x: str
    One of `'rug'`, `'box'`, `'violin'`, or `'histogram'`. If set, a
    horizontal subplot is drawn above the main plot, visualizing the
    x-distribution.
marginal_y: str
    One of `'rug'`, `'box'`, `'violin'`, or `'histogram'`. If set, a
    vertical subplot is drawn to the right of the main plot, visualizing
    the y-distribution.
trendline: str
    One of `'ols'`, `'lowess'`, `'rolling'`, `'expanding'` or `'ewm'`. If
    `'ols'`, an Ordinary Least Squares regression line will be drawn for
    each discrete-color/symbol group. If `'lowess`', a Locally Weighted
    Scatterplot Smoothing line will be drawn for each discrete-color/symbol
    group. If `'rolling`', a Rolling (e.g. rolling average, rolling median)
    line will be drawn for each discrete-color/symbol group. If
    `'expanding`', an Expanding (e.g. expanding average, expanding sum)
    line will be drawn for each discrete-color/symbol group. If `'ewm`', an
    Exponentially Weighted Moment (e.g. exponentially-weighted moving
    average) line will be drawn for each discrete-color/symbol group. See
    the docstrings for the functions in
    `plotly.express.trendline_functions` for more details on these
    functions and how to configure them with the `trendline_options`
    argument.
trendline_options: dict
    Options passed as the first argument to the function from
    `plotly.express.trendline_functions`  named in the `trendline`
    argument.
trendline_color_override: str
    Valid CSS color. If provided, and if `trendline` is set, all trendlines
    will be drawn in this color rather than in the same color as the traces
    from which they draw their inputs.
trendline_scope: str (one of `'trace'` or `'overall'`, default `'trace'`)
    If `'trace'`, then one trendline is drawn per trace (i.e. per color,
    symbol, facet, animation frame etc) and if `'overall'` then one
    trendline is computed for the entire dataset, and replicated across all
    facets.
log_x: boolean (default `False`)
    If `True`, the x-axis is log-scaled in cartesian coordinates.
log_y: boolean (default `False`)
    If `True`, the y-axis is log-scaled in cartesian coordinates.
range_x: list of two numbers
    If provided, overrides auto-scaling on the x-axis in cartesian
    coordinates.
range_y: list of two numbers
    If provided, overrides auto-scaling on the y-axis in cartesian
    coordinates.
render_mode: str
    One of `'auto'`, `'svg'` or `'webgl'`, default `'auto'` Controls the
    browser API used to draw marks. `'svg'` is appropriate for figures of
    less than 1000 data points, and will allow for fully-vectorized output.
    `'webgl'` is likely necessary for acceptable performance above 1000
    points but rasterizes part of the output.  `'auto'` uses heuristics to
    choose the mode.
title: str
    The figure title.
subtitle: str
    The figure subtitle.
template: str or dict or plotly.graph_objects.layout.Template instance
    The figure template name (must be a key in plotly.io.templates) or
    definition.
width: int (default `None`)
    The figure width in pixels.
height: int (default `None`)
    The figure height in pixels.

Returns
-------
    plotly.graph_objects.Figure
File:      /home/conda/developer/55fe7ffdc8f19782d8fa1d5de44c1f26cc58e5a472146c0c30061f0238bc3185-20250317-170017-682536-85-training-nb-maintenance-mgen-15.0.1/lib/python3.11/site-packages/plotly/express/_chart_types.py
Type:      function

The px.scatter() function documentation is also available on the Plotly website.

For any given type of plot or chart, there is also usually a user guide on the Plotly website, which provides some helpful examples. For example, here is the Plotly user guide on scatter plots.

First scatter plot

fig = px.scatter(
    data_frame=df_gapminder.query("year == 2021"),
    x="income_per_person",
    y="life_expectancy",
)
fig
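The plot above uses the DataFrame `query()` method to select a single year before plotting. As a sketch of what that does, the example below uses three rows copied from the table above and shows that `query()` selects the same rows as a boolean mask.

```python
import pandas as pd

# Three rows copied from the Gapminder table shown earlier.
df_demo = pd.DataFrame(
    {
        "country": ["Zimbabwe", "Zimbabwe", "Zimbabwe"],
        "year": [2019, 2020, 2021],
        "life_expectancy": [62.04, 62.29, 62.51],
    }
)

# Select rows with a query string...
df_query = df_demo.query("year == 2021")

# ...or, equivalently, with a boolean mask.
df_mask = df_demo[df_demo["year"] == 2021]

print(df_query)
print(df_query.equals(df_mask))
```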

Exercise 1 (English)

Uncomment the code in the cell below and run it to create a scatter plot comparing income_per_person with child_mortality in 2021.

Exercice 1 (Français)

Décommenter le code dans la cellule ci-dessous et l’exécuter afin de créer un diagramme à nuage de points comparant income_per_person avec child_mortality en 2021.

# fig = px.scatter(
#     data_frame=df_gapminder.query("year == 2021"),
#     x="income_per_person",
#     y="child_mortality",
# )
# fig

Hover text (a.k.a. tooltips)

To help explore these data, let’s use the hover_name and hover_data parameters to add more information into the hover text.

fig = px.scatter(
    data_frame=df_gapminder.query("year == 2021"),
    x="income_per_person",
    y="life_expectancy",
    hover_name="country",
    hover_data=["child_mortality"],
)
fig

N.B., there is lots more information on how to use hover text in the Plotly docs.

Interactive controls

Every Plotly plot has a set of interactive controls, which appear at the top right of the plot when you hover over it.

These controls are useful for zooming and panning a plot, as well as for downloading a static version of a plot.

Marker color

To explore these data further, let’s use the color parameter to represent another variable.

fig = px.scatter(
    data_frame=df_gapminder.query("year == 2021"),
    x="income_per_person",
    y="life_expectancy",
    hover_name="country",
    hover_data=["child_mortality"],
    color="world_4region",
)
fig

Now we can see easily which region of the world each country belongs to.

Marker size

Let’s use the size parameter to also visualise the population size of each country.

fig = px.scatter(
    data_frame=df_gapminder.query("year == 2021"),
    x="income_per_person",
    y="life_expectancy",
    hover_name="country",
    hover_data=["child_mortality"],
    color="world_4region",
    size="population",
    size_max=80,
)
fig

Note that we also used the size_max parameter to increase the allowed maximum size of markers, which works better for these particular data.

Exercise 2 (English)

Create a scatter plot using the Gapminder data for the year 1950, with income_per_person on the X axis and child_mortality on the Y axis. Use population for the marker size, and world_6region for marker color.

Exercice 2 (Français)

Créer un diagramme à nuage de points utilisant les données de Gapminder pour l’année 1950 avec income_per_person sur l’axe horizontal X et child_mortality sur l’axe vertical Y. Utiliser population pour la taille du point et world_6region pour sa couleur.

Plot title and axis labels

If we’re presenting this plot to others, it is a good idea to tidy up the axis titles, and to add a title to the plot. We can do this with the labels and title parameters.

fig = px.scatter(
    data_frame=df_gapminder.query("year == 2021"),
    x="income_per_person",
    y="life_expectancy",
    hover_name="country",
    hover_data=["child_mortality"],
    color="world_4region",
    size="population",
    size_max=80,
    labels={
        "income_per_person": "Income",
        "life_expectancy": "Life expectancy",
        "child_mortality": "Child mortality",
        "world_4region": "World region",
        "population": "Population",
    },
    title="Life expectancy and income by country in 2021"
)
fig

Using log scale

Some variables are more naturally visualised on a log scale, rather than a linear scale. Let’s use the log_x parameter to apply a log scale to the X axis.

fig = px.scatter(
    data_frame=df_gapminder.query("year == 2021"),
    x="income_per_person",
    y="life_expectancy",
    hover_name="country",
    hover_data=["child_mortality"],
    color="world_4region",
    size="population",
    size_max=80,
    labels={
        "income_per_person": "Income",
        "life_expectancy": "Life expectancy",
        "world_4region": "World region",
        "population": "Population",
    },
    title="Life expectancy and income by country in 2021",
    log_x=True,
)
fig

Animation

Let’s now add another variable, which is year. When you have a variable that represents time, it can also be useful to visualise this as an animation. We can do this via the animation_frame parameter.

fig = px.scatter(
    data_frame=df_gapminder,
    x="income_per_person",
    y="life_expectancy",
    hover_name="country",
    hover_data=["child_mortality"],
    color="world_4region",
    size="population",
    size_max=80,
    labels={
        "income_per_person": "Income",
        "life_expectancy": "Life expectancy",
        "world_4region": "World region",
        "population": "Population",
        "year": "Year",
    },
    title="Life expectancy and income by country, 1950-2021",
    log_x=True,
    animation_frame="year",
    range_x=[200, 200_000],
    range_y=[20, 95],
    height=700,
)
fig

Visual styling

The scatter plot we’ve created above is more than good enough for exploratory data analysis. If you need to make a stronger visual impact, you can change almost any aspect of how the plot looks, either through further parameters of px.scatter() or through additional method calls which update the figure. Here’s an example, where we set the template, marker colors and opacity, change the X axis tick positions and labels, draw the axis lines, and give the markers a thin black outline.

fig = px.scatter(
    data_frame=df_gapminder,
    x="income_per_person",
    y="life_expectancy",
    hover_name="country",
    hover_data=["child_mortality"],
    color="world_4region",
    size="population",
    size_max=80,
    labels={
        "income_per_person": "Income per person (GDP/capita, PPP$ inflation-adjusted)",
        "life_expectancy": "Life expectancy (years)",
        "world_4region": "World region",
        "population": "Population",
        "year": "Year",
    },
    title="Life expectancy and income by country, 1950-2021",
    log_x=True,
    animation_frame="year",
    range_x=[200, 200_000],
    range_y=[20, 95],
#     color_discrete_sequence=px.colors.qualitative.Set1,
    color_discrete_map={"Asia": "#ff5872", "Africa": "#00d5e9", "Europe": "#ffe700", "Americas": "#7feb00"},
    opacity=0.9,
    template="plotly_white",
    height=600,
    width=800,
)

fig.update_layout(
    xaxis = dict(
        tickmode = "array",
        tickvals = [500, 1_000, 2_000, 4_000, 8_000, 16_000, 32_000, 64_000, 128_000],
        ticktext = ["500", "1000", "2000", "4000", "8000", "16k", "32k", "64k", "128k"]
    )
)

fig.update_xaxes(showline=True, linewidth=1, linecolor="black")
fig.update_yaxes(showline=True, linewidth=1, linecolor="black")

fig.update_traces(
    marker=dict(line=dict(width=.5, color="black")),
)

fig

There’s more info on styling on the Plotly website, as well as info on continuous color scales and discrete colors.

Exercise 3 (English)

Create an animated scatter plot from the Gapminder data as above, but using child_mortality on the Y axis and world_6region for marker color.

Also, use a different palette for the marker colors. Hint: use the color_discrete_sequence parameter, and choose your favourite discrete color sequence (palette) from the Plotly website.

Exercice 3 (Français)

Créer un diagramme à nuage de points animé utilisant les données de Gapminder comme ci-dessus mais affichant child_mortality sur l’axe Y et world_6region pour la couleur du marqueur.

Utiliser aussi une palette de couleur différente. Indice: utiliser le paramètre color_discrete_sequence et choisir votre palette favorite sur le site Plotly.

3D scatter plots

For a bit of extra interest, let’s use the px.scatter_3d() function to make a 3-dimensional version of the Gapminder animation, adding in the child_mortality variable.

fig = px.scatter_3d(
    data_frame=df_gapminder, 
    x="income_per_person", 
    y="life_expectancy", 
    z="child_mortality",
    hover_name="country",
    color="world_4region",
    size="population",
    size_max=100,
    animation_frame="year",
    log_x=True,
    range_x=[200, 200_000],
    range_y=[0, 95],
    range_z=[0, 500],
    height=700,
    width=700,
)

fig.update_layout(
    scene=dict(aspectmode="cube"),
    legend=dict(itemsizing="constant"),
)

fig

Bar plots

To illustrate bar plots, let’s use data from the Alliance for Malaria Prevention’s Net Mapping Project on shipments of long-lasting insecticidal nets (LLINs). We’ll combine data from the 2020 report and the 2022 Q1 report, which together provide data on LLIN shipments by country for 2004-2021, broken down by LLIN type (standard, piperonyl butoxide (PBO) and dual active ingredient).

def load_llin_data():
    """Load data on LLIN shipments from the Alliance for Malaria Prevention's
    Net Mapping Project."""

    # N.B., data are split over several spreadsheets, so some munging is required.

    # N.B., files have been obtained from the AMP website and uploaded to 
    # Google Cloud Storage for efficient download.

    # load the "Final-2020.xlsx" dataset, "SSA" sheet - this has LLINs for 2004-2020
    df_nmp_2020_ssa = pd.read_excel(
        "http://vobs-resources.cog.sanger.ac.uk/training/img/workshop-3/reference-amp_net_mapping_project-Final-2020.xlsx", 
        sheet_name="SSA",
        skiprows=2,
        skipfooter=2,
        names=["country"] + list(range(2004, 2021)),
        usecols=list(range(18))
    )

    # load the "Final-2020.xlsx" dataset, "SSA by net type" sheet - this has LLINs by type for 2018, 2019, 2020
    df_nmp_2020_ssa_by_type = pd.read_excel(
        "http://vobs-resources.cog.sanger.ac.uk/training/img/workshop-3/reference-amp_net_mapping_project-Final-2020.xlsx", 
        sheet_name="SSA by net type",
        skiprows=3,
        skipfooter=8,
        usecols="A,B,C,F,G,H,K,L,M",
        names=[
            "country",
            "2018_standard",
            "2018_pbo",
            "2019_standard",
            "2019_pbo",
            "2019_dual",
            "2020_standard",
            "2020_pbo",
            "2020_dual",
        ],
    )

    # load the "NMP-1st-Q-2022.xlsx" dataset, "SSA by Type" sheet - this has LLINs by type for 2019, 2020, 2021
    df_nmp_2022q1_ssa_by_type = pd.read_excel(
        "http://vobs-resources.cog.sanger.ac.uk/training/img/workshop-3/reference-amp_net_mapping_project-NMP-1st-Q-2022.xlsx",
        sheet_name="SSA by Type",
        skiprows=3,
        skipfooter=2,
        usecols="A,C,D,E,H,I,J,M,N,O",
        names=[
            "country",
            "2019_standard",
            "2019_pbo",
            "2019_dual",
            "2020_standard",
            "2020_pbo",
            "2020_dual",
            "2021_standard",
            "2021_pbo",
            "2021_dual",
        ],
    )

    # N.B., we would like LLINs by type for the full range 2004-2021.
    # We also would like the data in "long form" for easier plotting.
    # Let's munge!

    # start with data prior to 2018
    df_llins_pre_2018 = (
        df_nmp_2020_ssa
        .melt(id_vars="country", var_name="year", value_name="llins_shipped")
        .query("year < 2018")
        # assume all LLINs shipped prior to 2018 were standard; using assign()
        # avoids setting a column on a filtered copy
        .assign(llin_type="standard")
    )

    # now grab the data by type for 2018
    df_llins_2018 = (
        df_nmp_2020_ssa_by_type
        [["country", "2018_standard", "2018_pbo"]]
        .melt(id_vars="country", var_name="year_type", value_name="llins_shipped")
    )
    df_year_type = (
        df_llins_2018["year_type"]
        .str.split("_", expand=True)
        .rename(columns={0: "year", 1: "llin_type"})
    )
    df_llins_2018["year"] = df_year_type["year"]
    df_llins_2018["llin_type"] = df_year_type["llin_type"]
    df_llins_2018.drop(columns="year_type", inplace=True)

    # now grab the data by type for 2019, 2020, 2021
    df_llins_post_2018 = (
        df_nmp_2022q1_ssa_by_type
        [["country", "2019_standard", "2019_pbo", "2019_dual", "2020_standard", "2020_pbo", "2020_dual", "2021_standard", "2021_pbo", "2021_dual"]]
        .melt(id_vars="country", var_name="year_type", value_name="llins_shipped")
    )
    df_year_type = (
        df_llins_post_2018["year_type"]
        .str.split("_", expand=True)
        .rename(columns={0: "year", 1: "llin_type"})
    )
    df_llins_post_2018["year"] = df_year_type["year"]
    df_llins_post_2018["llin_type"] = df_year_type["llin_type"]
    df_llins_post_2018.drop(columns="year_type", inplace=True)

    # finally, concatenate everything
    df_llins = pd.concat([df_llins_pre_2018, df_llins_2018, df_llins_post_2018]).reset_index(drop=True)

    # ensure years have the right dtype
    df_llins["year"] = df_llins["year"].astype(int)

    # normalise country names
    df_llins["country"] = df_llins["country"].replace("Congo (Democratic Republic of the)", "DR Congo")

    return df_llins

df_llins = load_llin_data()
df_llins

	country	year	llins_shipped	llin_type
0	Angola	2004	154010	standard
1	Benin	2004	26500	standard
2	Botswana	2004	0	standard
3	Burkina Faso	2004	216500	standard
4	Burundi	2004	160250	standard
...	...	...	...	...
1154	Togo	2021	0	dual
1155	Uganda	2021	0	dual
1156	Zambia	2021	0	dual
1157	Zanzibar	2021	0	dual
1158	Zimbabwe	2021	0	dual

1159 rows × 4 columns
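The reshaping inside load_llin_data() may be easier to follow on a small example. Below is a minimal sketch of the same wide-to-long pattern, using a made-up table (the country names and numbers are hypothetical, not taken from the AMP data). The melt() call turns one column per year and type into one row per observation, and str.split() separates the combined label into year and llin_type columns.

```python
import pandas as pd

# Hypothetical wide-form data: one column per year/type combination.
df_wide = pd.DataFrame({
    "country": ["A", "B"],
    "2019_standard": [100, 200],
    "2019_pbo": [10, 20],
})

# Melt to long form: one row per (country, year_type) combination.
df_long = df_wide.melt(
    id_vars="country", var_name="year_type", value_name="llins_shipped"
)

# Split labels like "2019_pbo" into separate year and type columns.
parts = df_long["year_type"].str.split("_", expand=True)
df_long["year"] = parts[0].astype(int)
df_long["llin_type"] = parts[1]
df_long = df_long.drop(columns="year_type")

df_long
```

Long-form data like this has one row per mark to be drawn, which is the layout Plotly Express works with most naturally.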

df_llins.query("country == 'Nigeria'")

	country	year	llins_shipped	llin_type
30	Nigeria	2004	71400	standard
76	Nigeria	2005	262000	standard
122	Nigeria	2006	2147404	standard
168	Nigeria	2007	2724304	standard
214	Nigeria	2008	15310222	standard
260	Nigeria	2009	19813977	standard
306	Nigeria	2010	29908286	standard
352	Nigeria	2011	2555096	standard
398	Nigeria	2012	5452563	standard
444	Nigeria	2013	26355032	standard
490	Nigeria	2014	42973544	standard
536	Nigeria	2015	23794214	standard
582	Nigeria	2016	11240307	standard
628	Nigeria	2017	35498731	standard
674	Nigeria	2018	18635909	standard
720	Nigeria	2018	51000	pbo
767	Nigeria	2019	31642624	standard
814	Nigeria	2019	1760400	pbo
861	Nigeria	2019	0	dual
908	Nigeria	2020	4449900	standard
955	Nigeria	2020	11717441	pbo
1002	Nigeria	2020	5567000	dual
1049	Nigeria	2021	1433000	standard
1096	Nigeria	2021	33048807	pbo
1143	Nigeria	2021	2833598	dual

First bar plot#

To make a bar plot, we can use the px.bar() function. Let’s look at the function documentation.

px.bar?

Signature:
px.bar(
    data_frame=None,
    x=None,
    y=None,
    color=None,
    pattern_shape=None,
    facet_row=None,
    facet_col=None,
    facet_col_wrap=0,
    facet_row_spacing=None,
    facet_col_spacing=None,
    hover_name=None,
    hover_data=None,
    custom_data=None,
    text=None,
    base=None,
    error_x=None,
    error_x_minus=None,
    error_y=None,
    error_y_minus=None,
    animation_frame=None,
    animation_group=None,
    category_orders=None,
    labels=None,
    color_discrete_sequence=None,
    color_discrete_map=None,
    color_continuous_scale=None,
    pattern_shape_sequence=None,
    pattern_shape_map=None,
    range_color=None,
    color_continuous_midpoint=None,
    opacity=None,
    orientation=None,
    barmode='relative',
    log_x=False,
    log_y=False,
    range_x=None,
    range_y=None,
    text_auto=False,
    title=None,
    subtitle=None,
    template=None,
    width=None,
    height=None,
) -> plotly.graph_objs._figure.Figure
Docstring:
    In a bar plot, each row of `data_frame` is represented as a rectangular
    mark.
    
Parameters
----------
data_frame: DataFrame or array-like or dict
    This argument needs to be passed for column names (and not keyword
    names) to be used. Array-like and dict are transformed internally to a
    pandas DataFrame. Optional: if missing, a DataFrame gets constructed
    under the hood using the other arguments.
x: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    position marks along the x axis in cartesian coordinates. Either `x` or
    `y` can optionally be a list of column references or array_likes,  in
    which case the data will be treated as if it were 'wide' rather than
    'long'.
y: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    position marks along the y axis in cartesian coordinates. Either `x` or
    `y` can optionally be a list of column references or array_likes,  in
    which case the data will be treated as if it were 'wide' rather than
    'long'.
color: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    assign color to marks.
pattern_shape: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    assign pattern shapes to marks.
facet_row: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    assign marks to facetted subplots in the vertical direction.
facet_col: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    assign marks to facetted subplots in the horizontal direction.
facet_col_wrap: int
    Maximum number of facet columns. Wraps the column variable at this
    width, so that the column facets span multiple rows. Ignored if 0, and
    forced to 0 if `facet_row` or a `marginal` is set.
facet_row_spacing: float between 0 and 1
    Spacing between facet rows, in paper units. Default is 0.03 or 0.07
    when facet_col_wrap is used.
facet_col_spacing: float between 0 and 1
    Spacing between facet columns, in paper units Default is 0.02.
hover_name: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like appear in bold
    in the hover tooltip.
hover_data: str, or list of str or int, or Series or array-like, or dict
    Either a name or list of names of columns in `data_frame`, or pandas
    Series, or array_like objects or a dict with column names as keys, with
    values True (for default formatting) False (in order to remove this
    column from hover information), or a formatting string, for example
    ':.3f' or '|%a' or list-like data to appear in the hover tooltip or
    tuples with a bool or formatting string as first element, and list-like
    data to appear in hover as second element Values from these columns
    appear as extra data in the hover tooltip.
custom_data: str, or list of str or int, or Series or array-like
    Either name or list of names of columns in `data_frame`, or pandas
    Series, or array_like objects Values from these columns are extra data,
    to be used in widgets or Dash callbacks for example. This data is not
    user-visible but is included in events emitted by the figure (lasso
    selection etc.)
text: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like appear in the
    figure as text labels.
base: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    position the base of the bar.
error_x: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    size x-axis error bars. If `error_x_minus` is `None`, error bars will
    be symmetrical, otherwise `error_x` is used for the positive direction
    only.
error_x_minus: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    size x-axis error bars in the negative direction. Ignored if `error_x`
    is `None`.
error_y: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    size y-axis error bars. If `error_y_minus` is `None`, error bars will
    be symmetrical, otherwise `error_y` is used for the positive direction
    only.
error_y_minus: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    size y-axis error bars in the negative direction. Ignored if `error_y`
    is `None`.
animation_frame: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    assign marks to animation frames.
animation_group: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    provide object-constancy across animation frames: rows with matching
    `animation_group`s will be treated as if they describe the same object
    in each frame.
category_orders: dict with str keys and list of str values (default `{}`)
    By default, in Python 3.6+, the order of categorical values in axes,
    legends and facets depends on the order in which these values are first
    encountered in `data_frame` (and no order is guaranteed by default in
    Python below 3.6). This parameter is used to force a specific ordering
    of values per column. The keys of this dict should correspond to column
    names, and the values should be lists of strings corresponding to the
    specific display order desired.
labels: dict with str keys and str values (default `{}`)
    By default, column names are used in the figure for axis titles, legend
    entries and hovers. This parameter allows this to be overridden. The
    keys of this dict should correspond to column names, and the values
    should correspond to the desired label to be displayed.
color_discrete_sequence: list of str
    Strings should define valid CSS-colors. When `color` is set and the
    values in the corresponding column are not numeric, values in that
    column are assigned colors by cycling through `color_discrete_sequence`
    in the order described in `category_orders`, unless the value of
    `color` is a key in `color_discrete_map`. Various useful color
    sequences are available in the `plotly.express.colors` submodules,
    specifically `plotly.express.colors.qualitative`.
color_discrete_map: dict with str keys and str values (default `{}`)
    String values should define valid CSS-colors Used to override
    `color_discrete_sequence` to assign a specific colors to marks
    corresponding with specific values. Keys in `color_discrete_map` should
    be values in the column denoted by `color`. Alternatively, if the
    values of `color` are valid colors, the string `'identity'` may be
    passed to cause them to be used directly.
color_continuous_scale: list of str
    Strings should define valid CSS-colors This list is used to build a
    continuous color scale when the column denoted by `color` contains
    numeric data. Various useful color scales are available in the
    `plotly.express.colors` submodules, specifically
    `plotly.express.colors.sequential`, `plotly.express.colors.diverging`
    and `plotly.express.colors.cyclical`.
pattern_shape_sequence: list of str
    Strings should define valid plotly.js patterns-shapes. When
    `pattern_shape` is set, values in that column are assigned patterns-
    shapes by cycling through `pattern_shape_sequence` in the order
    described in `category_orders`, unless the value of `pattern_shape` is
    a key in `pattern_shape_map`.
pattern_shape_map: dict with str keys and str values (default `{}`)
    Strings values define plotly.js patterns-shapes. Used to override
    `pattern_shape_sequences` to assign a specific patterns-shapes to lines
    corresponding with specific values. Keys in `pattern_shape_map` should
    be values in the column denoted by `pattern_shape`. Alternatively, if
    the values of `pattern_shape` are valid patterns-shapes names, the
    string `'identity'` may be passed to cause them to be used directly.
range_color: list of two numbers
    If provided, overrides auto-scaling on the continuous color scale.
color_continuous_midpoint: number (default `None`)
    If set, computes the bounds of the continuous color scale to have the
    desired midpoint. Setting this value is recommended when using
    `plotly.express.colors.diverging` color scales as the inputs to
    `color_continuous_scale`.
opacity: float
    Value between 0 and 1. Sets the opacity for markers.
orientation: str, one of `'h'` for horizontal or `'v'` for vertical. 
    (default `'v'` if `x` and `y` are provided and both continuous or both
    categorical,  otherwise `'v'`(`'h'`) if `x`(`y`) is categorical and
    `y`(`x`) is continuous,  otherwise `'v'`(`'h'`) if only `x`(`y`) is
    provided)
barmode: str (default `'relative'`)
    One of `'group'`, `'overlay'` or `'relative'` In `'relative'` mode,
    bars are stacked above zero for positive values and below zero for
    negative values. In `'overlay'` mode, bars are drawn on top of one
    another. In `'group'` mode, bars are placed beside each other.
log_x: boolean (default `False`)
    If `True`, the x-axis is log-scaled in cartesian coordinates.
log_y: boolean (default `False`)
    If `True`, the y-axis is log-scaled in cartesian coordinates.
range_x: list of two numbers
    If provided, overrides auto-scaling on the x-axis in cartesian
    coordinates.
range_y: list of two numbers
    If provided, overrides auto-scaling on the y-axis in cartesian
    coordinates.
text_auto: bool or string (default `False`)
    If `True` or a string, the x or y or z values will be displayed as
    text, depending on the orientation A string like `'.2f'` will be
    interpreted as a `texttemplate` numeric formatting directive.
title: str
    The figure title.
subtitle: str
    The figure subtitle.
template: str or dict or plotly.graph_objects.layout.Template instance
    The figure template name (must be a key in plotly.io.templates) or
    definition.
width: int (default `None`)
    The figure width in pixels.
height: int (default `None`)
    The figure height in pixels.

Returns
-------
    plotly.graph_objects.Figure
File:      /home/conda/developer/55fe7ffdc8f19782d8fa1d5de44c1f26cc58e5a472146c0c30061f0238bc3185-20250317-170017-682536-85-training-nb-maintenance-mgen-15.0.1/lib/python3.11/site-packages/plotly/express/_chart_types.py
Type:      function

Again, the px.bar() function docs are also available on the Plotly website, along with a guide to bar charts.

Let’s now make a bar plot, with year on the X axis and llins_shipped on the Y axis.

fig = px.bar(
    data_frame=df_llins, 
    x="year", 
    y="llins_shipped"
)
fig

Improved bar plot#

Let’s now improve the bar plot by adding color and hover text, and by applying some visual styling.

fig = px.bar(
    data_frame=df_llins, 
    x="year", 
    y="llins_shipped", 
    color="llin_type", 
    hover_name="country",
    labels={
        "year": "Year",
        "llins_shipped": "No. LLINs",
        "llin_type": "LLIN type"
    },
    title="LLIN shipments to countries in Sub-Saharan Africa",
    width=800,
    template="plotly_white",
)
fig

Exercise 4 (English)

Make a bar chart from the LLIN data as above, but using country for the X axis and year for the hover name.

Exercice 4 (Français)

Créer un diagramme à barres pour les données sur les LLINs comme au-dessus mais en utilisant country pour l’axe X et year pour le texte de survol.

Line and area plots#

Let’s also use the LLIN data to make some line and area plots, via the px.line() and px.area() functions.

Here is a line plot of LLINs shipped to Nigeria.

fig = px.line(
    data_frame=df_llins.query("country == 'Nigeria'"),
    x="year",
    y="llins_shipped",
    color="llin_type",
    markers=True,
    width=800,
    title="LLIN shipments to Nigeria",
    labels={
        "year": "Year",
        "llins_shipped": "No. LLINs",
        "llin_type": "LLIN type"
    },
    template="plotly_white",
)
fig

Exercise 5 (English)

Make a line plot as above but for the Democratic Republic of the Congo. Hint: use the query "country == 'DR Congo'"

Exercice 5 (Français)

Créer un diagramme à lignes comme ci-dessus mais pour la République Démocratique du Congo. Indice: Utiliser la requête "country == 'DR Congo'"

Exercise 6 (English)

Make an area plot using the LLIN data from Nigeria. Hint: it’s exactly the same parameters as the line plot, just call the px.area() function instead of px.line().

Exercice 6 (Français)

Créer un diagramme à zones utilisant les données des LLINs du Nigeria. Indice: Les paramètres sont les mêmes que pour le diagramme à lignes mais il faut appeler la fonction px.area() au lieu de px.line().

Well done!#

Hopefully this has been a useful introduction to plotting in Python.

As I mentioned earlier, Plotly Express provides many more plot types. Take a look at the user guide and the API docs for more information.

Happy plotting!

Exercises#

English#

Open this notebook in Google Colab and run it for yourself from top to bottom. As you run through the notebook, cell by cell, think about what each cell is doing, and try the practical exercises along the way.

Have a go at the practical exercises, but please don’t worry if you don’t have time to do them all during the practical session, and please ask the teaching assistants for help if you are stuck.

Hint: To open the notebook in Google Colab, click the rocket icon at the top of the page, then select “Colab” from the drop-down menu.

Français#

Ouvrir ce notebook dans Google Colab et l’exécuter vous-même du début à la fin. Pendant que vous exécutez le notebook, cellule par cellule, pensez à ce que chaque cellule fait et essayez de faire les exercices quand vous les rencontrez.

Essayez de faire les exercices mais ne vous inquiétez pas si vous n’avez pas le temps de tout faire pendant la séance appliquée et n’hésitez pas à demander aux enseignants assistants si vous avez besoin d’aide parce que vous êtes bloqués.

Indice: Pour ouvrir le notebook dans Google Colab, cliquer sur l’icône de fusée au sommet de cette page puis choisissez “Colab” dans le menu déroulant.
