
Workshop 3 - Training course in data analysis for genomic surveillance of African malaria vectors


Module 1 - Plotting with Plotly Express

Theme: Tools & Technology

This module provides an introduction to visualising data with some basic charts using the Plotly Express package for Python.

Learning objectives

In this module we will learn how to:

  • Prepare data for plotting

  • Create scatter plots

  • Create bar plots

  • Create line plots

Lecture

[Lecture videos: English and Français]

Please note that the code in the cells below might differ from that shown in the video. This can happen because Python packages and their dependencies change due to updates, necessitating tweaks to the code.

Python packages for data visualisation

Being able to visualise your data is obviously a great skill to have, and there are some fantastic Python packages available for creating a wide range of different visualisations.

In fact, we are spoilt for choice, with several packages providing some incredibly powerful plotting tools for data scientists.

For this module I’ve chosen to begin with Plotly Express because:

  • It supports many different types of chart

  • It has a relatively simple interface with good documentation

  • You can create plots quickly with just a few lines of code (often just a single function call)

  • Plots are interactive

…which makes it relatively easy to learn and a good choice for exploratory data analysis.

In this module we are just going to look at some basic charts, but you might like to browse the Plotly Python website to see what other charts are possible.

Setup

In this module we’ll use the Plotly Express package, and we’ll also use pandas for loading data to plot. (See workshop 2, module 1 for an introduction to pandas DataFrames if you missed it or need a recap.) Both of these packages are already installed on Colab, so we can go ahead and import them.

import pandas as pd
import plotly.io as pio
pio.renderers.default = "notebook+colab"
import plotly.express as px

Preparing data for plotting

Plotly Express can accept data in a variety of different input formats, but it works particularly well when you provide data as a pandas DataFrame.

Let’s remind ourselves what a DataFrame looks like, by loading one of the example DataFrames that come with the Plotly Express package.

df_medals_long = px.data.medals_long()
df_medals_long

        nation   medal  count
0  South Korea    gold     24
1        China    gold     10
2       Canada    gold      9
3  South Korea  silver     13
4        China  silver     15
5       Canada  silver     12
6  South Korea  bronze     11
7        China  bronze      8
8       Canada  bronze     12

One thing worth mentioning is that often the same data can be structured in different ways. For example, the same data above could also be stored in the following DataFrame:

df_medals_wide = px.data.medals_wide()
df_medals_wide

        nation  gold  silver  bronze
0  South Korea    24      13      11
1        China    10      15       8
2       Canada     9      12      12

The df_medals_long DataFrame is an example of a “long-form” DataFrame, so-called because it has more rows and fewer columns.

The df_medals_wide DataFrame is an example of a “wide-form” DataFrame, so-called because it has fewer rows and more columns.

Plotly Express can plot either, but for the examples we’re going to look at today, it is slightly more convenient to work with long-form data.
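
If your own data arrive in wide form, pandas can reshape them into long form. The following is a minimal sketch, assuming a hand-built DataFrame that mirrors the medals example above (the names df_wide and df_long are illustrative only); it uses the pandas melt() method.

```python
import pandas as pd

# Hand-built wide-form data mirroring the medals example (illustrative only).
df_wide = pd.DataFrame(
    {
        "nation": ["South Korea", "China", "Canada"],
        "gold": [24, 10, 9],
        "silver": [13, 15, 12],
        "bronze": [11, 8, 12],
    }
)

# Keep "nation" as an identifier column and stack the medal columns
# into two new columns: one for the medal type, one for the count.
df_long = df_wide.melt(id_vars="nation", var_name="medal", value_name="count")

print(df_long)
```

Each row of the result describes one nation and one medal type, which is the layout used in the plotting examples in this module.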

Let’s now load some more interesting data to practise plotting with: the Systema Globalis dataset used by Gapminder, which includes data on income, life expectancy, population and child mortality by country.

def load_gapminder_data():
    """Create a pandas DataFrame with some of the key indicators from the 
    Open Numbers Systema Globalis dataset."""

    # pin to a specific github tag
    base_url = "https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/v1.20.1/"
    
    # load income per person
    df_income = pd.read_csv(base_url + "ddf--datapoints--income_per_person_gdppercapita_ppp_inflation_adjusted--by--geo--time.csv")

    # load life expectancy
    df_life_expectancy = pd.read_csv(base_url + "ddf--datapoints--life_expectancy_at_birth_with_projections--by--geo--time.csv")

    # load population size
    df_population = pd.read_csv(base_url + "ddf--datapoints--population_total--by--geo--time.csv")

    # load child mortality
    df_child_mortality = pd.read_csv(base_url + "ddf--datapoints--child_mortality_0_5_year_olds_dying_per_1000_born--by--geo--time.csv")

    # load country attributes
    df_countries = pd.read_csv(base_url + "ddf--entities--geo--country.csv")

    # rename some columns in the countries dataframe to help with merging
    df_countries = (
        df_countries
        [["country", "name", "world_4region", "world_6region"]]
        .rename(columns={"country": "geo", "name": "country"})
    )

    # capitalise regions
    df_countries["world_4region"] = df_countries["world_4region"].str.capitalize()

    # join all indicators into a single dataframe
    df_gapminder = pd.merge(df_population, df_income, on=["geo", "time"])
    df_gapminder = pd.merge(df_gapminder, df_life_expectancy, on=["geo", "time"])
    df_gapminder = pd.merge(df_gapminder, df_child_mortality, on=["geo", "time"])
    df_gapminder = pd.merge(df_gapminder, df_countries, on="geo")

    # rename some columns to be more concise
    df_gapminder = df_gapminder.rename(
        columns={
            "time": "year",
            "population_total": "population",
            "income_per_person_gdppercapita_ppp_inflation_adjusted": "income_per_person",
            "life_expectancy_at_birth_with_projections": "life_expectancy",
            "child_mortality_0_5_year_olds_dying_per_1000_born": "child_mortality",
        }
    )

    # keep only data between 1950 and 2021 - it's less jumpy
    df_gapminder = df_gapminder.query("1950 <= year <= 2021").reset_index(drop=True)

    # tidy up columns
    df_gapminder.drop(columns=["geo"], inplace=True)
    df_gapminder.insert(0, "country", df_gapminder.pop("country"))  # move country column to the front

    return df_gapminder

df_gapminder = load_gapminder_data()
df_gapminder

           country  year  population  income_per_person  life_expectancy  child_mortality  world_4region       world_6region
0      Afghanistan  1950     7752117               2392            32.48           415.95           Asia          south_asia
1      Afghanistan  1951     7840151               2422            32.87           413.05           Asia          south_asia
2      Afghanistan  1952     7935996               2462            33.58           407.19           Asia          south_asia
3      Afghanistan  1953     8039684               2568            34.28           401.21           Asia          south_asia
4      Afghanistan  1954     8151316               2576            34.99           395.12           Asia          south_asia
...            ...   ...         ...                ...              ...              ...            ...                 ...
13531     Zimbabwe  2017    14236599               2568            61.35            49.31         Africa  sub_saharan_africa
13532     Zimbabwe  2018    14438812               2621            61.74            46.23         Africa  sub_saharan_africa
13533     Zimbabwe  2019    14645473               2392            62.04            44.43         Africa  sub_saharan_africa
13534     Zimbabwe  2020    14862927               2412            62.29            43.06         Africa  sub_saharan_africa
13535     Zimbabwe  2021    15092171               2424            62.51            42.05         Africa  sub_saharan_africa

13536 rows × 8 columns
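
The merging step inside load_gapminder_data() joins the indicator tables on the "geo" and "time" columns. As a rough illustration of what that does, here is a sketch on toy data (the country codes and values below are invented for the example, not taken from Systema Globalis). By default pd.merge() performs an inner join, so only key combinations present in both tables are kept.

```python
import pandas as pd

# Toy indicator tables (invented values, for illustration only).
df_pop = pd.DataFrame(
    {
        "geo": ["aaa", "aaa", "bbb"],
        "time": [2000, 2001, 2000],
        "population_total": [100, 110, 200],
    }
)
df_inc = pd.DataFrame(
    {
        "geo": ["aaa", "bbb", "bbb"],
        "time": [2000, 2000, 2001],
        "income": [1000, 2000, 2100],
    }
)

# Default inner join: keeps only (geo, time) pairs found in both tables.
df_inner = pd.merge(df_pop, df_inc, on=["geo", "time"])

# Outer join: keeps every (geo, time) pair, filling gaps with NaN.
df_outer = pd.merge(df_pop, df_inc, on=["geo", "time"], how="outer")

print(len(df_inner), len(df_outer))
```

With an inner join, a country-year that is missing from any one indicator table drops out of the combined DataFrame, which is worth keeping in mind when row counts look smaller than expected.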

Scatter plots

Let’s use the Systema Globalis data to make a scatter plot. To make a scatter plot, we can use the px.scatter() function. Let’s look at the function documentation.

px.scatter?
Signature:
px.scatter(
    data_frame=None,
    x=None,
    y=None,
    color=None,
    symbol=None,
    size=None,
    hover_name=None,
    hover_data=None,
    custom_data=None,
    text=None,
    facet_row=None,
    facet_col=None,
    facet_col_wrap=0,
    facet_row_spacing=None,
    facet_col_spacing=None,
    error_x=None,
    error_x_minus=None,
    error_y=None,
    error_y_minus=None,
    animation_frame=None,
    animation_group=None,
    category_orders=None,
    labels=None,
    orientation=None,
    color_discrete_sequence=None,
    color_discrete_map=None,
    color_continuous_scale=None,
    range_color=None,
    color_continuous_midpoint=None,
    symbol_sequence=None,
    symbol_map=None,
    opacity=None,
    size_max=None,
    marginal_x=None,
    marginal_y=None,
    trendline=None,
    trendline_options=None,
    trendline_color_override=None,
    trendline_scope='trace',
    log_x=False,
    log_y=False,
    range_x=None,
    range_y=None,
    render_mode='auto',
    title=None,
    subtitle=None,
    template=None,
    width=None,
    height=None,
) -> plotly.graph_objs._figure.Figure
Docstring:
    In a scatter plot, each row of `data_frame` is represented by a symbol
    mark in 2D space.
    
Parameters
----------
data_frame: DataFrame or array-like or dict
    This argument needs to be passed for column names (and not keyword
    names) to be used. Array-like and dict are transformed internally to a
    pandas DataFrame. Optional: if missing, a DataFrame gets constructed
    under the hood using the other arguments.
x: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    position marks along the x axis in cartesian coordinates. Either `x` or
    `y` can optionally be a list of column references or array_likes,  in
    which case the data will be treated as if it were 'wide' rather than
    'long'.
y: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    position marks along the y axis in cartesian coordinates. Either `x` or
    `y` can optionally be a list of column references or array_likes,  in
    which case the data will be treated as if it were 'wide' rather than
    'long'.
color: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    assign color to marks.
symbol: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    assign symbols to marks.
size: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    assign mark sizes.
hover_name: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like appear in bold
    in the hover tooltip.
hover_data: str, or list of str or int, or Series or array-like, or dict
    Either a name or list of names of columns in `data_frame`, or pandas
    Series, or array_like objects or a dict with column names as keys, with
    values True (for default formatting) False (in order to remove this
    column from hover information), or a formatting string, for example
    ':.3f' or '|%a' or list-like data to appear in the hover tooltip or
    tuples with a bool or formatting string as first element, and list-like
    data to appear in hover as second element Values from these columns
    appear as extra data in the hover tooltip.
custom_data: str, or list of str or int, or Series or array-like
    Either name or list of names of columns in `data_frame`, or pandas
    Series, or array_like objects Values from these columns are extra data,
    to be used in widgets or Dash callbacks for example. This data is not
    user-visible but is included in events emitted by the figure (lasso
    selection etc.)
text: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like appear in the
    figure as text labels.
facet_row: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    assign marks to facetted subplots in the vertical direction.
facet_col: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    assign marks to facetted subplots in the horizontal direction.
facet_col_wrap: int
    Maximum number of facet columns. Wraps the column variable at this
    width, so that the column facets span multiple rows. Ignored if 0, and
    forced to 0 if `facet_row` or a `marginal` is set.
facet_row_spacing: float between 0 and 1
    Spacing between facet rows, in paper units. Default is 0.03 or 0.07
    when facet_col_wrap is used.
facet_col_spacing: float between 0 and 1
    Spacing between facet columns, in paper units Default is 0.02.
error_x: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    size x-axis error bars. If `error_x_minus` is `None`, error bars will
    be symmetrical, otherwise `error_x` is used for the positive direction
    only.
error_x_minus: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    size x-axis error bars in the negative direction. Ignored if `error_x`
    is `None`.
error_y: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    size y-axis error bars. If `error_y_minus` is `None`, error bars will
    be symmetrical, otherwise `error_y` is used for the positive direction
    only.
error_y_minus: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    size y-axis error bars in the negative direction. Ignored if `error_y`
    is `None`.
animation_frame: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    assign marks to animation frames.
animation_group: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    provide object-constancy across animation frames: rows with matching
    `animation_group`s will be treated as if they describe the same object
    in each frame.
category_orders: dict with str keys and list of str values (default `{}`)
    By default, in Python 3.6+, the order of categorical values in axes,
    legends and facets depends on the order in which these values are first
    encountered in `data_frame` (and no order is guaranteed by default in
    Python below 3.6). This parameter is used to force a specific ordering
    of values per column. The keys of this dict should correspond to column
    names, and the values should be lists of strings corresponding to the
    specific display order desired.
labels: dict with str keys and str values (default `{}`)
    By default, column names are used in the figure for axis titles, legend
    entries and hovers. This parameter allows this to be overridden. The
    keys of this dict should correspond to column names, and the values
    should correspond to the desired label to be displayed.
orientation: str, one of `'h'` for horizontal or `'v'` for vertical. 
    (default `'v'` if `x` and `y` are provided and both continuous or both
    categorical,  otherwise `'v'`(`'h'`) if `x`(`y`) is categorical and
    `y`(`x`) is continuous,  otherwise `'v'`(`'h'`) if only `x`(`y`) is
    provided)
color_discrete_sequence: list of str
    Strings should define valid CSS-colors. When `color` is set and the
    values in the corresponding column are not numeric, values in that
    column are assigned colors by cycling through `color_discrete_sequence`
    in the order described in `category_orders`, unless the value of
    `color` is a key in `color_discrete_map`. Various useful color
    sequences are available in the `plotly.express.colors` submodules,
    specifically `plotly.express.colors.qualitative`.
color_discrete_map: dict with str keys and str values (default `{}`)
    String values should define valid CSS-colors Used to override
    `color_discrete_sequence` to assign a specific colors to marks
    corresponding with specific values. Keys in `color_discrete_map` should
    be values in the column denoted by `color`. Alternatively, if the
    values of `color` are valid colors, the string `'identity'` may be
    passed to cause them to be used directly.
color_continuous_scale: list of str
    Strings should define valid CSS-colors This list is used to build a
    continuous color scale when the column denoted by `color` contains
    numeric data. Various useful color scales are available in the
    `plotly.express.colors` submodules, specifically
    `plotly.express.colors.sequential`, `plotly.express.colors.diverging`
    and `plotly.express.colors.cyclical`.
range_color: list of two numbers
    If provided, overrides auto-scaling on the continuous color scale.
color_continuous_midpoint: number (default `None`)
    If set, computes the bounds of the continuous color scale to have the
    desired midpoint. Setting this value is recommended when using
    `plotly.express.colors.diverging` color scales as the inputs to
    `color_continuous_scale`.
symbol_sequence: list of str
    Strings should define valid plotly.js symbols. When `symbol` is set,
    values in that column are assigned symbols by cycling through
    `symbol_sequence` in the order described in `category_orders`, unless
    the value of `symbol` is a key in `symbol_map`.
symbol_map: dict with str keys and str values (default `{}`)
    String values should define plotly.js symbols Used to override
    `symbol_sequence` to assign a specific symbols to marks corresponding
    with specific values. Keys in `symbol_map` should be values in the
    column denoted by `symbol`. Alternatively, if the values of `symbol`
    are valid symbol names, the string `'identity'` may be passed to cause
    them to be used directly.
opacity: float
    Value between 0 and 1. Sets the opacity for markers.
size_max: int (default `20`)
    Set the maximum mark size when using `size`.
marginal_x: str
    One of `'rug'`, `'box'`, `'violin'`, or `'histogram'`. If set, a
    horizontal subplot is drawn above the main plot, visualizing the
    x-distribution.
marginal_y: str
    One of `'rug'`, `'box'`, `'violin'`, or `'histogram'`. If set, a
    vertical subplot is drawn to the right of the main plot, visualizing
    the y-distribution.
trendline: str
    One of `'ols'`, `'lowess'`, `'rolling'`, `'expanding'` or `'ewm'`. If
    `'ols'`, an Ordinary Least Squares regression line will be drawn for
    each discrete-color/symbol group. If `'lowess`', a Locally Weighted
    Scatterplot Smoothing line will be drawn for each discrete-color/symbol
    group. If `'rolling`', a Rolling (e.g. rolling average, rolling median)
    line will be drawn for each discrete-color/symbol group. If
    `'expanding`', an Expanding (e.g. expanding average, expanding sum)
    line will be drawn for each discrete-color/symbol group. If `'ewm`', an
    Exponentially Weighted Moment (e.g. exponentially-weighted moving
    average) line will be drawn for each discrete-color/symbol group. See
    the docstrings for the functions in
    `plotly.express.trendline_functions` for more details on these
    functions and how to configure them with the `trendline_options`
    argument.
trendline_options: dict
    Options passed as the first argument to the function from
    `plotly.express.trendline_functions`  named in the `trendline`
    argument.
trendline_color_override: str
    Valid CSS color. If provided, and if `trendline` is set, all trendlines
    will be drawn in this color rather than in the same color as the traces
    from which they draw their inputs.
trendline_scope: str (one of `'trace'` or `'overall'`, default `'trace'`)
    If `'trace'`, then one trendline is drawn per trace (i.e. per color,
    symbol, facet, animation frame etc) and if `'overall'` then one
    trendline is computed for the entire dataset, and replicated across all
    facets.
log_x: boolean (default `False`)
    If `True`, the x-axis is log-scaled in cartesian coordinates.
log_y: boolean (default `False`)
    If `True`, the y-axis is log-scaled in cartesian coordinates.
range_x: list of two numbers
    If provided, overrides auto-scaling on the x-axis in cartesian
    coordinates.
range_y: list of two numbers
    If provided, overrides auto-scaling on the y-axis in cartesian
    coordinates.
render_mode: str
    One of `'auto'`, `'svg'` or `'webgl'`, default `'auto'` Controls the
    browser API used to draw marks. `'svg'` is appropriate for figures of
    less than 1000 data points, and will allow for fully-vectorized output.
    `'webgl'` is likely necessary for acceptable performance above 1000
    points but rasterizes part of the output.  `'auto'` uses heuristics to
    choose the mode.
title: str
    The figure title.
subtitle: str
    The figure subtitle.
template: str or dict or plotly.graph_objects.layout.Template instance
    The figure template name (must be a key in plotly.io.templates) or
    definition.
width: int (default `None`)
    The figure width in pixels.
height: int (default `None`)
    The figure height in pixels.

Returns
-------
    plotly.graph_objects.Figure
File:      /home/conda/developer/55fe7ffdc8f19782d8fa1d5de44c1f26cc58e5a472146c0c30061f0238bc3185-20250317-170017-682536-85-training-nb-maintenance-mgen-15.0.1/lib/python3.11/site-packages/plotly/express/_chart_types.py
Type:      function

The px.scatter() function documentation is also available on the Plotly website.

For any given type of plot or chart, there is also usually a user guide on the Plotly website, which provides some helpful examples. For example, here is the Plotly user guide on scatter plots.

First scatter plot

fig = px.scatter(
    data_frame=df_gapminder.query("year == 2021"),
    x="income_per_person",
    y="life_expectancy",
)
fig

[Figure: scatter plot of life_expectancy (Y axis) against income_per_person (X axis).]

Exercise 1 (English)

Uncomment the code in the cell below and run it to create a scatter plot comparing income_per_person with child_mortality in 2021.

Exercice 1 (Français)

Décommenter le code dans la cellule ci-dessous et l’exécuter afin de créer un diagramme à nuage de points comparant income_per_person avec child_mortality en 2021.

# fig = px.scatter(
#     data_frame=df_gapminder.query("year == 2021"),
#     x="income_per_person",
#     y="child_mortality",
# )
# fig

Hover text (a.k.a. tooltips)

To help explore these data, let’s use the hover_name and hover_data parameters to add more information into the hover text.

fig = px.scatter(
    data_frame=df_gapminder.query("year == 2021"),
    x="income_per_person",
    y="life_expectancy",
    hover_name="country",
    hover_data=["child_mortality"],
)
fig

[Figure: scatter plot of life_expectancy (Y axis) against income_per_person (X axis).]

N.B., there is lots more information on how to use hover text in the Plotly docs.

Interactive controls

Every Plotly plot has a set of interactive controls, which appear at the top right of the plot.

These controls are useful for zooming and panning a plot, as well as for downloading a static version of a plot.

Marker color

To explore these data further, let’s use the color parameter to represent another variable.

fig = px.scatter(
    data_frame=df_gapminder.query("year == 2021"),
    x="income_per_person",
    y="life_expectancy",
    hover_name="country",
    hover_data=["child_mortality"],
    color="world_4region",
)
fig

[Figure: scatter plot of life_expectancy (Y axis) against income_per_person (X axis), with a world_4region color legend: Asia, Africa, Europe, Americas.]

Now we can see easily which region of the world each country belongs to.

Marker size

Let’s use the size parameter to also visualise the population size of each country.

fig = px.scatter(
    data_frame=df_gapminder.query("year == 2021"),
    x="income_per_person",
    y="life_expectancy",
    hover_name="country",
    hover_data=["child_mortality"],
    color="world_4region",
    size="population",
    size_max=80,
)
fig

[Figure: scatter plot of life_expectancy (Y axis) against income_per_person (X axis), with a world_4region color legend: Asia, Africa, Europe, Americas.]

Note that we also used the size_max parameter to increase the maximum allowed marker size, which works better for these data.

Exercise 2 (English)

Create a scatter plot using the Gapminder data for the year 1950, with income_per_person on the X axis and child_mortality on the Y axis. Use population for the marker size, and world_6region for marker color.

Exercice 2 (Français)

Créer un diagramme à nuage de points utilisant les données de Gapminder pour l’année 1950 avec income_per_person sur l’axe horizontal X et child_mortality sur l’axe vertical Y. Utiliser population pour la taille du point et world_6region pour sa couleur.

Plot title and axis labels

If we’re presenting this plot to others, it is a good idea to tidy up the axis titles, and to add a title to the plot. We can do this with the labels and title parameters.

fig = px.scatter(
    data_frame=df_gapminder.query("year == 2021"),
    x="income_per_person",
    y="life_expectancy",
    hover_name="country",
    hover_data=["child_mortality"],
    color="world_4region",
    size="population",
    size_max=80,
    labels={
        "income_per_person": "Income",
        "life_expectancy": "Life expectancy",
        "child_mortality": "Child mortality",
        "world_4region": "World region",
        "population": "Population",
    },
    title="Life expectancy and income by country in 2021"
)
fig

[Figure: "Life expectancy and income by country in 2021". X axis: Income; Y axis: Life expectancy; legend: World region (Asia, Africa, Europe, Americas).]

Using log scale

Some variables are more naturally visualised on a log scale, rather than a linear scale. Let’s use the log_x parameter to apply a log scale to the X axis.

fig = px.scatter(
    data_frame=df_gapminder.query("year == 2021"),
    x="income_per_person",
    y="life_expectancy",
    hover_name="country",
    hover_data=["child_mortality"],
    color="world_4region",
    size="population",
    size_max=80,
    labels={
        "income_per_person": "Income",
        "life_expectancy": "Life expectancy",
        "world_4region": "World region",
        "population": "Population",
    },
    title="Life expectancy and income by country in 2021",
    log_x=True,
)
fig

[Figure: "Life expectancy and income by country in 2021". X axis: Income (log scale); Y axis: Life expectancy; legend: World region (Asia, Africa, Europe, Americas).]

Animation

Let’s now add another variable, which is year. When you have a variable that represents time, it can also be useful to visualise this as an animation. We can do this via the animation_frame parameter.

fig = px.scatter(
    data_frame=df_gapminder,
    x="income_per_person",
    y="life_expectancy",
    hover_name="country",
    hover_data=["child_mortality"],
    color="world_4region",
    size="population",
    size_max=80,
    labels={
        "income_per_person": "Income",
        "life_expectancy": "Life expectancy",
        "world_4region": "World region",
        "population": "Population",
        "year": "Year",
    },
    title="Life expectancy and income by country, 1950-2021",
    log_x=True,
    animation_frame="year",
    range_x=[200, 200_000],
    range_y=[20, 95],
    height=700,
)
fig

[Figure: "Life expectancy and income by country, 1950-2021", animated with a Year slider starting at 1950. X axis: Income (log scale); Y axis: Life expectancy; legend: World region (Asia, Africa, Europe, Americas).]

Visual styling

The scatter plot we’ve created above is more than good enough for exploratory data analysis. If you need to make a really strong visual impact, you can change almost any aspect of how the plot looks, using additional parameters and further function calls that update the figure. Here’s an example, where we alter the template, set custom marker colors, change the X axis tick positions and labels, draw the axis lines, and change the marker line color to black.

fig = px.scatter(
    data_frame=df_gapminder,
    x="income_per_person",
    y="life_expectancy",
    hover_name="country",
    hover_data=["child_mortality"],
    color="world_4region",
    size="population",
    size_max=80,
    labels={
        "income_per_person": "Income per person (GDP/capita, PPP$ inflation-adjusted)",
        "life_expectancy": "Life expectancy (years)",
        "world_4region": "World region",
        "population": "Population",
        "year": "Year",
    },
    title="Life expectancy and income by country, 1950-2021",
    log_x=True,
    animation_frame="year",
    range_x=[200, 200_000],
    range_y=[20, 95],
    # color_discrete_sequence=px.colors.qualitative.Set1,
    color_discrete_map={"Asia": "#ff5872", "Africa": "#00d5e9", "Europe": "#ffe700", "Americas": "#7feb00"},
    opacity=0.9,
    template="plotly_white",
    height=600,
    width=800,
)

fig.update_layout(
    xaxis=dict(
        tickmode="array",
        tickvals=[500, 1_000, 2_000, 4_000, 8_000, 16_000, 32_000, 64_000, 128_000],
        ticktext=["500", "1000", "2000", "4000", "8000", "16k", "32k", "64k", "128k"],
    )
)

fig.update_xaxes(showline=True, linewidth=1, linecolor="black")
fig.update_yaxes(showline=True, linewidth=1, linecolor="black")

fig.update_traces(
    marker=dict(line=dict(width=.5, color="black")),
)

fig

[Figure: "Life expectancy and income by country, 1950-2021", animated with a Year slider starting at 1950. X axis: Income per person (GDP/capita, PPP$ inflation-adjusted), with ticks from 500 to 128k; Y axis: Life expectancy (years); legend: World region (Asia, Africa, Europe, Americas).]

There’s more info on styling on the Plotly website, as well as info on continuous color scales and discrete colors.

Exercise 3 (English)

Create an animated scatter plot from the Gapminder data as above, but using child_mortality on the Y axis and world_6region for marker color.

Also, use a different palette for the marker colors. Hint: use the color_discrete_sequence parameter, and choose your favourite discrete color sequence (palette) from the Plotly website.

Exercice 3 (Français)

Créer un diagramme à nuage de points animé utilisant les données de Gapminder comme ci-dessus mais affichant child_mortality sur l’axe Y et world_6region pour la couleur du marqueur.

Utiliser aussi une palette de couleur différente. Indice: utiliser le paramètre color_discrete_sequence et choisir votre palette favorite sur le site Plotly.

3D scatter plots

For a bit of extra interest, let’s use the px.scatter_3d() function to make a 3-dimensional version of the Gapminder animation, adding in the child_mortality variable.

fig = px.scatter_3d(
    data_frame=df_gapminder, 
    x="income_per_person", 
    y="life_expectancy", 
    z="child_mortality",
    hover_name="country",
    color="world_4region",
    size="population",
    size_max=100,
    animation_frame="year",
    log_x=True,
    range_x=[200, 200_000],
    range_y=[0, 95],
    range_z=[0, 500],
    height=700,
    width=700,
)

fig.update_layout(
    scene=dict(aspectmode="cube"),
    legend=dict(itemsizing="constant"),
)

fig

[Figure: animated 3D scatter plot, with a world_4region color legend (Asia, Africa, Europe, Americas) and a year slider starting at 1950.]

Bar plots

To illustrate bar plots, let’s use data from the Alliance for Malaria Prevention’s Net Mapping Project. We’ll combine data from the 2020 report and the 2022 Q1 report, which together provide data on shipments of long-lasting insecticidal nets (LLINs) by country for 2004-2021, broken down by LLIN type (standard, PBO and dual active ingredient).

def load_llin_data():
    """Load data on LLIN shipments from the Alliance for Malaria Prevention's
    Net Mapping Project."""

    # N.B., data are split over several spreadsheets, so some munging is required.

    # N.B., files have been obtained from the AMP website and uploaded to 
    # Google Cloud Storage for efficient download.

    # load the "Final-2020.xlsx" dataset, "SSA" sheet - this has LLINs for 2004-2020
    df_nmp_2020_ssa = pd.read_excel(
        "http://vobs-resources.cog.sanger.ac.uk/training/img/workshop-3/reference-amp_net_mapping_project-Final-2020.xlsx", 
        sheet_name="SSA",
        skiprows=2,
        skipfooter=2,
        names=["country"] + list(range(2004, 2021)),
        usecols=list(range(18))
    )

    # load the "Final-2020.xlsx" dataset, "SSA by net type" sheet - this has LLINs by type for 2018, 2019, 2020
    df_nmp_2020_ssa_by_type = pd.read_excel(
        "http://vobs-resources.cog.sanger.ac.uk/training/img/workshop-3/reference-amp_net_mapping_project-Final-2020.xlsx", 
        sheet_name="SSA by net type",
        skiprows=3,
        skipfooter=8,
        usecols="A,B,C,F,G,H,K,L,M",
        names=[
            "country",
            "2018_standard",
            "2018_pbo",
            "2019_standard",
            "2019_pbo",
            "2019_dual",
            "2020_standard",
            "2020_pbo",
            "2020_dual",
        ],
    )

    # load the "NMP-1st-Q-2022.xlsx" dataset, "SSA by Type" sheet - this has LLINs by type for 2019, 2020, 2021
    df_nmp_2022q1_ssa_by_type = pd.read_excel(
        "http://vobs-resources.cog.sanger.ac.uk/training/img/workshop-3/reference-amp_net_mapping_project-NMP-1st-Q-2022.xlsx",
        sheet_name="SSA by Type",
        skiprows=3,
        skipfooter=2,
        usecols="A,C,D,E,H,I,J,M,N,O",
        names=[
            "country",
            "2019_standard",
            "2019_pbo",
            "2019_dual",
            "2020_standard",
            "2020_pbo",
            "2020_dual",
            "2021_standard",
            "2021_pbo",
            "2021_dual",
        ],
    )

    # N.B., we would like LLINs by type for the full range 2004-2021.
    # We also would like the data in "long form" for easier plotting.
    # Let's munge!

    # start with data prior to 2018
    df_llins_pre_2018 = (
        df_nmp_2020_ssa
        .melt(id_vars="country", var_name="year", value_name="llins_shipped")
        .query("year < 2018")
    )
    df_llins_pre_2018["llin_type"] = "standard"  # assume all standard llins prior to 2018

    # now grab the data by type for 2018
    df_llins_2018 = (
        df_nmp_2020_ssa_by_type
        [["country", "2018_standard", "2018_pbo"]]
        .melt(id_vars="country", var_name="year_type", value_name="llins_shipped")
    )
    df_year_type = (
        df_llins_2018["year_type"]
        .str.split("_", expand=True)
        .rename(columns={0: "year", 1: "llin_type"})
    )
    df_llins_2018["year"] = df_year_type["year"]
    df_llins_2018["llin_type"] = df_year_type["llin_type"]
    df_llins_2018.drop(columns="year_type", inplace=True)

    # now grab the data by type for 2019, 2020, 2021
    df_llins_post_2018 = (
        df_nmp_2022q1_ssa_by_type
        [["country", "2019_standard", "2019_pbo", "2019_dual", "2020_standard", "2020_pbo", "2020_dual", "2021_standard", "2021_pbo", "2021_dual"]]
        .melt(id_vars="country", var_name="year_type", value_name="llins_shipped")
    )
    df_year_type = (
        df_llins_post_2018["year_type"]
        .str.split("_", expand=True)
        .rename(columns={0: "year", 1: "llin_type"})
    )
    df_llins_post_2018["year"] = df_year_type["year"]
    df_llins_post_2018["llin_type"] = df_year_type["llin_type"]
    df_llins_post_2018.drop(columns="year_type", inplace=True)

    # finally, concatenate everything
    df_llins = pd.concat([df_llins_pre_2018, df_llins_2018, df_llins_post_2018]).reset_index(drop=True)

    # ensure years have the right dtype
    df_llins["year"] = df_llins["year"].astype(int)

    # normalise country names
    df_llins["country"] = df_llins["country"].replace("Congo (Democratic Republic of the)", "DR Congo")

    return df_llins

df_llins = load_llin_data()
df_llins
           country  year  llins_shipped  llin_type
0           Angola  2004         154010   standard
1            Benin  2004          26500   standard
2         Botswana  2004              0   standard
3     Burkina Faso  2004         216500   standard
4          Burundi  2004         160250   standard
...            ...   ...            ...        ...
1154          Togo  2021              0       dual
1155        Uganda  2021              0       dual
1156        Zambia  2021              0       dual
1157      Zanzibar  2021              0       dual
1158      Zimbabwe  2021              0       dual

1159 rows × 4 columns
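
The reshaping inside load_llin_data() relies on converting "wide" data (one column per year and net type) into "long" data (one row per observation) with the pandas melt() method. Here is a minimal sketch of the same idea on a made-up data frame; df_wide and its values are invented for illustration and are not taken from the Net Mapping Project.

```python
import pandas as pd

# Made-up wide-form data: one column per year/type combination.
df_wide = pd.DataFrame({
    "country": ["A", "B"],
    "2019_standard": [100, 200],
    "2019_pbo": [10, 20],
})

# Melt to long form: one row per country and year/type combination.
df_long = df_wide.melt(
    id_vars="country",
    var_name="year_type",
    value_name="llins_shipped",
)

# Split "2019_standard" into separate "year" and "llin_type" columns.
parts = df_long["year_type"].str.split("_", expand=True)
df_long["year"] = parts[0].astype(int)
df_long["llin_type"] = parts[1]
df_long = df_long.drop(columns="year_type")
df_long
```

Long-form data like this is convenient for Plotly Express, because each plot parameter (x, y, color and so on) can then be given a single column name.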

df_llins.query("country == 'Nigeria'")
      country  year  llins_shipped  llin_type
30    Nigeria  2004          71400   standard
76    Nigeria  2005         262000   standard
122   Nigeria  2006        2147404   standard
168   Nigeria  2007        2724304   standard
214   Nigeria  2008       15310222   standard
260   Nigeria  2009       19813977   standard
306   Nigeria  2010       29908286   standard
352   Nigeria  2011        2555096   standard
398   Nigeria  2012        5452563   standard
444   Nigeria  2013       26355032   standard
490   Nigeria  2014       42973544   standard
536   Nigeria  2015       23794214   standard
582   Nigeria  2016       11240307   standard
628   Nigeria  2017       35498731   standard
674   Nigeria  2018       18635909   standard
720   Nigeria  2018          51000        pbo
767   Nigeria  2019       31642624   standard
814   Nigeria  2019        1760400        pbo
861   Nigeria  2019              0       dual
908   Nigeria  2020        4449900   standard
955   Nigeria  2020       11717441        pbo
1002  Nigeria  2020        5567000       dual
1049  Nigeria  2021        1433000   standard
1096  Nigeria  2021       33048807        pbo
1143  Nigeria  2021        2833598       dual

First bar plot#

To make a bar plot, we can use the px.bar() function. Let’s look at the function documentation.

px.bar?
Signature:
px.bar(
    data_frame=None,
    x=None,
    y=None,
    color=None,
    pattern_shape=None,
    facet_row=None,
    facet_col=None,
    facet_col_wrap=0,
    facet_row_spacing=None,
    facet_col_spacing=None,
    hover_name=None,
    hover_data=None,
    custom_data=None,
    text=None,
    base=None,
    error_x=None,
    error_x_minus=None,
    error_y=None,
    error_y_minus=None,
    animation_frame=None,
    animation_group=None,
    category_orders=None,
    labels=None,
    color_discrete_sequence=None,
    color_discrete_map=None,
    color_continuous_scale=None,
    pattern_shape_sequence=None,
    pattern_shape_map=None,
    range_color=None,
    color_continuous_midpoint=None,
    opacity=None,
    orientation=None,
    barmode='relative',
    log_x=False,
    log_y=False,
    range_x=None,
    range_y=None,
    text_auto=False,
    title=None,
    subtitle=None,
    template=None,
    width=None,
    height=None,
) -> plotly.graph_objs._figure.Figure
Docstring:
    In a bar plot, each row of `data_frame` is represented as a rectangular
    mark.
    
Parameters
----------
data_frame: DataFrame or array-like or dict
    This argument needs to be passed for column names (and not keyword
    names) to be used. Array-like and dict are transformed internally to a
    pandas DataFrame. Optional: if missing, a DataFrame gets constructed
    under the hood using the other arguments.
x: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    position marks along the x axis in cartesian coordinates. Either `x` or
    `y` can optionally be a list of column references or array_likes,  in
    which case the data will be treated as if it were 'wide' rather than
    'long'.
y: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    position marks along the y axis in cartesian coordinates. Either `x` or
    `y` can optionally be a list of column references or array_likes,  in
    which case the data will be treated as if it were 'wide' rather than
    'long'.
color: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    assign color to marks.
pattern_shape: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    assign pattern shapes to marks.
facet_row: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    assign marks to facetted subplots in the vertical direction.
facet_col: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    assign marks to facetted subplots in the horizontal direction.
facet_col_wrap: int
    Maximum number of facet columns. Wraps the column variable at this
    width, so that the column facets span multiple rows. Ignored if 0, and
    forced to 0 if `facet_row` or a `marginal` is set.
facet_row_spacing: float between 0 and 1
    Spacing between facet rows, in paper units. Default is 0.03 or 0.07
    when facet_col_wrap is used.
facet_col_spacing: float between 0 and 1
    Spacing between facet columns, in paper units Default is 0.02.
hover_name: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like appear in bold
    in the hover tooltip.
hover_data: str, or list of str or int, or Series or array-like, or dict
    Either a name or list of names of columns in `data_frame`, or pandas
    Series, or array_like objects or a dict with column names as keys, with
    values True (for default formatting) False (in order to remove this
    column from hover information), or a formatting string, for example
    ':.3f' or '|%a' or list-like data to appear in the hover tooltip or
    tuples with a bool or formatting string as first element, and list-like
    data to appear in hover as second element Values from these columns
    appear as extra data in the hover tooltip.
custom_data: str, or list of str or int, or Series or array-like
    Either name or list of names of columns in `data_frame`, or pandas
    Series, or array_like objects Values from these columns are extra data,
    to be used in widgets or Dash callbacks for example. This data is not
    user-visible but is included in events emitted by the figure (lasso
    selection etc.)
text: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like appear in the
    figure as text labels.
base: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    position the base of the bar.
error_x: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    size x-axis error bars. If `error_x_minus` is `None`, error bars will
    be symmetrical, otherwise `error_x` is used for the positive direction
    only.
error_x_minus: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    size x-axis error bars in the negative direction. Ignored if `error_x`
    is `None`.
error_y: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    size y-axis error bars. If `error_y_minus` is `None`, error bars will
    be symmetrical, otherwise `error_y` is used for the positive direction
    only.
error_y_minus: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    size y-axis error bars in the negative direction. Ignored if `error_y`
    is `None`.
animation_frame: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    assign marks to animation frames.
animation_group: str or int or Series or array-like
    Either a name of a column in `data_frame`, or a pandas Series or
    array_like object. Values from this column or array_like are used to
    provide object-constancy across animation frames: rows with matching
    `animation_group`s will be treated as if they describe the same object
    in each frame.
category_orders: dict with str keys and list of str values (default `{}`)
    By default, in Python 3.6+, the order of categorical values in axes,
    legends and facets depends on the order in which these values are first
    encountered in `data_frame` (and no order is guaranteed by default in
    Python below 3.6). This parameter is used to force a specific ordering
    of values per column. The keys of this dict should correspond to column
    names, and the values should be lists of strings corresponding to the
    specific display order desired.
labels: dict with str keys and str values (default `{}`)
    By default, column names are used in the figure for axis titles, legend
    entries and hovers. This parameter allows this to be overridden. The
    keys of this dict should correspond to column names, and the values
    should correspond to the desired label to be displayed.
color_discrete_sequence: list of str
    Strings should define valid CSS-colors. When `color` is set and the
    values in the corresponding column are not numeric, values in that
    column are assigned colors by cycling through `color_discrete_sequence`
    in the order described in `category_orders`, unless the value of
    `color` is a key in `color_discrete_map`. Various useful color
    sequences are available in the `plotly.express.colors` submodules,
    specifically `plotly.express.colors.qualitative`.
color_discrete_map: dict with str keys and str values (default `{}`)
    String values should define valid CSS-colors Used to override
    `color_discrete_sequence` to assign a specific colors to marks
    corresponding with specific values. Keys in `color_discrete_map` should
    be values in the column denoted by `color`. Alternatively, if the
    values of `color` are valid colors, the string `'identity'` may be
    passed to cause them to be used directly.
color_continuous_scale: list of str
    Strings should define valid CSS-colors This list is used to build a
    continuous color scale when the column denoted by `color` contains
    numeric data. Various useful color scales are available in the
    `plotly.express.colors` submodules, specifically
    `plotly.express.colors.sequential`, `plotly.express.colors.diverging`
    and `plotly.express.colors.cyclical`.
pattern_shape_sequence: list of str
    Strings should define valid plotly.js patterns-shapes. When
    `pattern_shape` is set, values in that column are assigned patterns-
    shapes by cycling through `pattern_shape_sequence` in the order
    described in `category_orders`, unless the value of `pattern_shape` is
    a key in `pattern_shape_map`.
pattern_shape_map: dict with str keys and str values (default `{}`)
    Strings values define plotly.js patterns-shapes. Used to override
    `pattern_shape_sequences` to assign a specific patterns-shapes to lines
    corresponding with specific values. Keys in `pattern_shape_map` should
    be values in the column denoted by `pattern_shape`. Alternatively, if
    the values of `pattern_shape` are valid patterns-shapes names, the
    string `'identity'` may be passed to cause them to be used directly.
range_color: list of two numbers
    If provided, overrides auto-scaling on the continuous color scale.
color_continuous_midpoint: number (default `None`)
    If set, computes the bounds of the continuous color scale to have the
    desired midpoint. Setting this value is recommended when using
    `plotly.express.colors.diverging` color scales as the inputs to
    `color_continuous_scale`.
opacity: float
    Value between 0 and 1. Sets the opacity for markers.
orientation: str, one of `'h'` for horizontal or `'v'` for vertical. 
    (default `'v'` if `x` and `y` are provided and both continuous or both
    categorical,  otherwise `'v'`(`'h'`) if `x`(`y`) is categorical and
    `y`(`x`) is continuous,  otherwise `'v'`(`'h'`) if only `x`(`y`) is
    provided)
barmode: str (default `'relative'`)
    One of `'group'`, `'overlay'` or `'relative'` In `'relative'` mode,
    bars are stacked above zero for positive values and below zero for
    negative values. In `'overlay'` mode, bars are drawn on top of one
    another. In `'group'` mode, bars are placed beside each other.
log_x: boolean (default `False`)
    If `True`, the x-axis is log-scaled in cartesian coordinates.
log_y: boolean (default `False`)
    If `True`, the y-axis is log-scaled in cartesian coordinates.
range_x: list of two numbers
    If provided, overrides auto-scaling on the x-axis in cartesian
    coordinates.
range_y: list of two numbers
    If provided, overrides auto-scaling on the y-axis in cartesian
    coordinates.
text_auto: bool or string (default `False`)
    If `True` or a string, the x or y or z values will be displayed as
    text, depending on the orientation A string like `'.2f'` will be
    interpreted as a `texttemplate` numeric formatting directive.
title: str
    The figure title.
subtitle: str
    The figure subtitle.
template: str or dict or plotly.graph_objects.layout.Template instance
    The figure template name (must be a key in plotly.io.templates) or
    definition.
width: int (default `None`)
    The figure width in pixels.
height: int (default `None`)
    The figure height in pixels.

Returns
-------
    plotly.graph_objects.Figure
File:      /home/conda/developer/55fe7ffdc8f19782d8fa1d5de44c1f26cc58e5a472146c0c30061f0238bc3185-20250317-170017-682536-85-training-nb-maintenance-mgen-15.0.1/lib/python3.11/site-packages/plotly/express/_chart_types.py
Type:      function

As before, the px.bar() function docs are also available on the Plotly website, along with a guide to bar charts.

Let’s now make a bar plot, with year on the X axis and llins_shipped on the Y axis.

fig = px.bar(
    data_frame=df_llins, 
    x="year", 
    y="llins_shipped"
)
fig
[Figure: bar plot with year on the X axis (2004 to 2020) and llins_shipped on the Y axis (0 to 200M).]

Improved bar plot#

Let’s now improve the bar plot by using color, hover text, and doing some visual styling.

fig = px.bar(
    data_frame=df_llins, 
    x="year", 
    y="llins_shipped", 
    color="llin_type", 
    hover_name="country",
    labels={
        "year": "Year",
        "llins_shipped": "No. LLINs",
        "llin_type": "LLIN type"
    },
    title="LLIN shipments to countries in Sub-Saharan Africa",
    width=800,
    template="plotly_white",
)
fig
[Figure: bar plot titled "LLIN shipments to countries in Sub-Saharan Africa". X axis: Year. Y axis: No. LLINs (0 to 200M). Legend (LLIN type): standard, pbo, dual.]

Exercise 4 (English)

Make a bar chart from the LLIN data as above, but using country for the X axis and year for the hover name.

Exercice 4 (Français)

Créer un diagramme à barres à partir des données sur les LLINs comme ci-dessus, mais en utilisant country pour l’axe X et year pour le texte de survol.

Line and area plots#

Let’s also use the LLIN data to make some line and area plots, via the px.line() and px.area() functions.

Here is a line plot of LLINs shipped to Nigeria.

fig = px.line(
    data_frame=df_llins.query("country == 'Nigeria'"),
    x="year",
    y="llins_shipped",
    color="llin_type",
    markers=True,
    width=800,
    title="LLIN shipments to Nigeria",
    labels={
        "year": "Year",
        "llins_shipped": "No. LLINs",
        "llin_type": "LLIN type"
    },
    template="plotly_white",
)
fig
[Figure: line plot titled "LLIN shipments to Nigeria". X axis: Year. Y axis: No. LLINs (0 to 40M). Legend (LLIN type): standard, pbo, dual.]

Exercise 5 (English)

Make a line plot as above, but for the Democratic Republic of the Congo. Hint: use the query "country == 'DR Congo'"

Exercice 5 (Français)

Créer un diagramme à lignes comme ci-dessus mais pour la République Démocratique du Congo. Indice: Utiliser la requête "country == 'DR Congo'"

Exercise 6 (English)

Make an area plot using the LLIN data from Nigeria. Hint: it’s exactly the same parameters as the line plot, just call the px.area() function instead of px.line().

Exercice 6 (Français)

Créer un diagramme à zones utilisant les données des LLINs du Nigeria. Indice: Les paramètres sont les mêmes que pour le diagramme à lignes mais il faut appeler la fonction px.area() au lieu de px.line().

Well done!#

Hopefully this has been a useful introduction to plotting in Python.

As I mentioned earlier, Plotly Express provides many more plot types. Take a look at the user guide and the API docs for more information.

Happy plotting!

Exercises#

English#

Open this notebook in Google Colab and run it for yourself from top to bottom. As you run through the notebook, cell by cell, think about what each cell is doing, and try the practical exercises along the way.

Have a go at the practical exercises, but please don’t worry if you don’t have time to do them all during the practical session, and please ask the teaching assistants for help if you are stuck.

Hint: To open the notebook in Google Colab, click the rocket icon at the top of the page, then select “Colab” from the drop-down menu.

Français#

Ouvrir ce notebook dans Google Colab et l’exécuter vous-même du début à la fin. Pendant que vous exécutez le notebook, cellule par cellule, pensez à ce que chaque cellule fait et essayez de faire les exercices quand vous les rencontrez.

Essayez de faire les exercices mais ne vous inquiétez pas si vous n’avez pas le temps de tout faire pendant la séance appliquée et n’hésitez pas à demander aux enseignants assistants si vous avez besoin d’aide parce que vous êtes bloqués.

Indice: Pour ouvrir le notebook dans Google Colab, cliquer sur l’icône de fusée au sommet de cette page puis choisissez “Colab” dans le menu déroulant.