Workshop 3 - Training course in data analysis for genomic surveillance of African malaria vectors
Module 1 - Plotting with Plotly Express#
Theme: Tools & Technology
This module provides an introduction to visualising data with some basic charts using the Plotly Express package for Python.
Learning objectives#
In this module we will learn how to:
Prepare data for plotting
Create scatter plots
Create bar plots
Create line plots
Lecture#
English#
Français#
Please note that the code in the cells below might differ from that shown in the video. This can happen because Python packages and their dependencies change due to updates, necessitating tweaks to the code.
Python packages for data visualisation#
Being able to visualise your data is a valuable skill, and there are some excellent Python packages available for creating a wide range of visualisations.
In fact, we are spoilt for choice: several packages provide powerful plotting tools for data scientists.
For this module I’ve chosen to begin with Plotly Express because:
It supports many different types of chart
It has a relatively simple interface with good documentation
You can create plots quickly with just a few lines of code (often just a single function call)
Plots are interactive
…which makes it relatively easy to learn and a good choice for exploratory data analysis.
In this module we are just going to look at some basic charts, but you might like to browse the Plotly Python website to see what other charts are possible.
Setup#
In this module we’ll use the Plotly Express package, and we’ll also use pandas for loading the data to plot. (See workshop 2, module 1 for an introduction to pandas DataFrames if you missed it or need a recap.) Both of these packages are already installed on Colab, so we can go ahead and import them.
import pandas as pd
import plotly.io as pio
pio.renderers.default = "notebook+colab"
import plotly.express as px
Preparing data for plotting#
Plotly Express can accept data in a variety of different input formats, but it works particularly well when you provide data as a pandas DataFrame.
Let’s remind ourselves what a DataFrame looks like, by loading one of the example DataFrames that come with the Plotly Express package.
df_medals_long = px.data.medals_long()
df_medals_long
| | nation | medal | count |
|---|---|---|---|
| 0 | South Korea | gold | 24 |
| 1 | China | gold | 10 |
| 2 | Canada | gold | 9 |
| 3 | South Korea | silver | 13 |
| 4 | China | silver | 15 |
| 5 | Canada | silver | 12 |
| 6 | South Korea | bronze | 11 |
| 7 | China | bronze | 8 |
| 8 | Canada | bronze | 12 |
One thing worth mentioning is that often the same data can be structured in different ways. For example, the same data above could also be stored in the following DataFrame:
df_medals_wide = px.data.medals_wide()
df_medals_wide
| | nation | gold | silver | bronze |
|---|---|---|---|---|
| 0 | South Korea | 24 | 13 | 11 |
| 1 | China | 10 | 15 | 8 |
| 2 | Canada | 9 | 12 | 12 |
The df_medals_long DataFrame is an example of a “long-form” DataFrame, so called because it has more rows and fewer columns.
The df_medals_wide DataFrame is an example of a “wide-form” DataFrame, so called because it has fewer rows and more columns.
Plotly Express can plot either, but for the examples we’re going to look at today, it is slightly more convenient to work with long-form data.
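As a rough illustration of how the two layouts relate, the sketch below rebuilds the wide-form medals table by hand (so that it needs only pandas) and converts it to long form with DataFrame.melt(). The variable names df_wide, df_long and df_wide_again are made up for this example.

```python
import pandas as pd

# Rebuild the wide-form medals table by hand (values copied from the table above).
df_wide = pd.DataFrame(
    {
        "nation": ["South Korea", "China", "Canada"],
        "gold": [24, 10, 9],
        "silver": [13, 15, 12],
        "bronze": [11, 8, 12],
    }
)

# Convert to long form: one row per (nation, medal) combination.
df_long = df_wide.melt(id_vars="nation", var_name="medal", value_name="count")

# Going back the other way is possible with pivot().
df_wide_again = df_long.pivot(index="nation", columns="medal", values="count")

print(df_long)
```

The long-form result has one column per variable (nation, medal, count), which is the layout that the x, y and color parameters used later in this module expect.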
Let’s now load some more interesting data to practise plotting with: the Systema Globalis data on population, income, life expectancy and child mortality by country, used by Gapminder.
def load_gapminder_data():
    """Create a pandas DataFrame with some of the key indicators from the
    Open Numbers Systema Globalis dataset."""
    # pin to a specific github tag
    base_url = "https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/v1.20.1/"
    # load income per person
    df_income = pd.read_csv(base_url + "ddf--datapoints--income_per_person_gdppercapita_ppp_inflation_adjusted--by--geo--time.csv")
    # load life expectancy
    df_life_expectancy = pd.read_csv(base_url + "ddf--datapoints--life_expectancy_at_birth_with_projections--by--geo--time.csv")
    # load population size
    df_population = pd.read_csv(base_url + "ddf--datapoints--population_total--by--geo--time.csv")
    # load child mortality
    df_child_mortality = pd.read_csv(base_url + "ddf--datapoints--child_mortality_0_5_year_olds_dying_per_1000_born--by--geo--time.csv")
    # load country attributes
    df_countries = pd.read_csv(base_url + "ddf--entities--geo--country.csv")
    # rename some columns in the countries dataframe to help with merging
    df_countries = (
        df_countries
        [["country", "name", "world_4region", "world_6region"]]
        .rename(columns={"country": "geo", "name": "country"})
    )
    # capitalise regions
    df_countries["world_4region"] = df_countries["world_4region"].str.capitalize()
    # join all indicators into a single dataframe
    df_gapminder = pd.merge(df_population, df_income, on=["geo", "time"])
    df_gapminder = pd.merge(df_gapminder, df_life_expectancy, on=["geo", "time"])
    df_gapminder = pd.merge(df_gapminder, df_child_mortality, on=["geo", "time"])
    df_gapminder = pd.merge(df_gapminder, df_countries, on="geo")
    # rename some columns to be more concise
    df_gapminder = df_gapminder.rename(
        columns={
            "time": "year",
            "population_total": "population",
            "income_per_person_gdppercapita_ppp_inflation_adjusted": "income_per_person",
            "life_expectancy_at_birth_with_projections": "life_expectancy",
            "child_mortality_0_5_year_olds_dying_per_1000_born": "child_mortality",
        }
    )
    # keep only data between 1950 and 2021 - it's less jumpy
    df_gapminder = df_gapminder.query("1950 <= year <= 2021").reset_index(drop=True)
    # tidy up columns
    df_gapminder.drop(columns=["geo"], inplace=True)
    df_gapminder.insert(0, "country", df_gapminder.pop("country"))  # move country column to the front
    return df_gapminder
df_gapminder = load_gapminder_data()
df_gapminder
| | country | year | population | income_per_person | life_expectancy | child_mortality | world_4region | world_6region |
|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 1950 | 7752117 | 2392 | 32.48 | 415.95 | Asia | south_asia |
| 1 | Afghanistan | 1951 | 7840151 | 2422 | 32.87 | 413.05 | Asia | south_asia |
| 2 | Afghanistan | 1952 | 7935996 | 2462 | 33.58 | 407.19 | Asia | south_asia |
| 3 | Afghanistan | 1953 | 8039684 | 2568 | 34.28 | 401.21 | Asia | south_asia |
| 4 | Afghanistan | 1954 | 8151316 | 2576 | 34.99 | 395.12 | Asia | south_asia |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 13531 | Zimbabwe | 2017 | 14236599 | 2568 | 61.35 | 49.31 | Africa | sub_saharan_africa |
| 13532 | Zimbabwe | 2018 | 14438812 | 2621 | 61.74 | 46.23 | Africa | sub_saharan_africa |
| 13533 | Zimbabwe | 2019 | 14645473 | 2392 | 62.04 | 44.43 | Africa | sub_saharan_africa |
| 13534 | Zimbabwe | 2020 | 14862927 | 2412 | 62.29 | 43.06 | Africa | sub_saharan_africa |
| 13535 | Zimbabwe | 2021 | 15092171 | 2424 | 62.51 | 42.05 | Africa | sub_saharan_africa |

13536 rows × 8 columns
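The loading function above combines the indicator tables with pd.merge() on the geo and time columns. As a small sketch of what that does, the example below uses two toy tables with made-up numbers (not real Systema Globalis values; the names df_pop, df_inc and df_check are hypothetical). The default is an inner join, so only rows whose keys appear in both tables are kept.

```python
import pandas as pd

# Two toy indicator tables with made-up values (not real data).
df_pop = pd.DataFrame(
    {
        "geo": ["aaa", "aaa", "bbb"],
        "time": [2000, 2001, 2000],
        "population_total": [100, 110, 200],
    }
)
df_inc = pd.DataFrame(
    {
        "geo": ["aaa", "bbb", "bbb"],
        "time": [2000, 2000, 2001],
        "income": [1.5, 2.5, 2.6],
    }
)

# Default is an inner join: only (geo, time) pairs found in both tables survive.
df_merged = pd.merge(df_pop, df_inc, on=["geo", "time"])
print(df_merged)

# An outer join with indicator=True shows which rows the inner join dropped.
df_check = pd.merge(df_pop, df_inc, on=["geo", "time"], how="outer", indicator=True)
print(df_check["_merge"].value_counts())
```

This is worth keeping in mind when preparing your own data: a country or year missing from any one indicator table will be absent from the merged DataFrame.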
Scatter plots#
Let’s use the Systema Globalis data to make a scatter plot. To do this, we can use the px.scatter() function. Let’s look at the function documentation.
px.scatter?
Signature:
px.scatter(
data_frame=None,
x=None,
y=None,
color=None,
symbol=None,
size=None,
hover_name=None,
hover_data=None,
custom_data=None,
text=None,
facet_row=None,
facet_col=None,
facet_col_wrap=0,
facet_row_spacing=None,
facet_col_spacing=None,
error_x=None,
error_x_minus=None,
error_y=None,
error_y_minus=None,
animation_frame=None,
animation_group=None,
category_orders=None,
labels=None,
orientation=None,
color_discrete_sequence=None,
color_discrete_map=None,
color_continuous_scale=None,
range_color=None,
color_continuous_midpoint=None,
symbol_sequence=None,
symbol_map=None,
opacity=None,
size_max=None,
marginal_x=None,
marginal_y=None,
trendline=None,
trendline_options=None,
trendline_color_override=None,
trendline_scope='trace',
log_x=False,
log_y=False,
range_x=None,
range_y=None,
render_mode='auto',
title=None,
subtitle=None,
template=None,
width=None,
height=None,
) -> plotly.graph_objs._figure.Figure
Docstring:
In a scatter plot, each row of `data_frame` is represented by a symbol
mark in 2D space.
Parameters
----------
data_frame: DataFrame or array-like or dict
This argument needs to be passed for column names (and not keyword
names) to be used. Array-like and dict are transformed internally to a
pandas DataFrame. Optional: if missing, a DataFrame gets constructed
under the hood using the other arguments.
x: str or int or Series or array-like
Either a name of a column in `data_frame`, or a pandas Series or
array_like object. Values from this column or array_like are used to
position marks along the x axis in cartesian coordinates. Either `x` or
`y` can optionally be a list of column references or array_likes, in
which case the data will be treated as if it were 'wide' rather than
'long'.
y: str or int or Series or array-like
Either a name of a column in `data_frame`, or a pandas Series or
array_like object. Values from this column or array_like are used to
position marks along the y axis in cartesian coordinates. Either `x` or
`y` can optionally be a list of column references or array_likes, in
which case the data will be treated as if it were 'wide' rather than
'long'.
color: str or int or Series or array-like
Either a name of a column in `data_frame`, or a pandas Series or
array_like object. Values from this column or array_like are used to
assign color to marks.
symbol: str or int or Series or array-like
Either a name of a column in `data_frame`, or a pandas Series or
array_like object. Values from this column or array_like are used to
assign symbols to marks.
size: str or int or Series or array-like
Either a name of a column in `data_frame`, or a pandas Series or
array_like object. Values from this column or array_like are used to
assign mark sizes.
hover_name: str or int or Series or array-like
Either a name of a column in `data_frame`, or a pandas Series or
array_like object. Values from this column or array_like appear in bold
in the hover tooltip.
hover_data: str, or list of str or int, or Series or array-like, or dict
Either a name or list of names of columns in `data_frame`, or pandas
Series, or array_like objects or a dict with column names as keys, with
values True (for default formatting) False (in order to remove this
column from hover information), or a formatting string, for example
':.3f' or '|%a' or list-like data to appear in the hover tooltip or
tuples with a bool or formatting string as first element, and list-like
data to appear in hover as second element Values from these columns
appear as extra data in the hover tooltip.
custom_data: str, or list of str or int, or Series or array-like
Either name or list of names of columns in `data_frame`, or pandas
Series, or array_like objects Values from these columns are extra data,
to be used in widgets or Dash callbacks for example. This data is not
user-visible but is included in events emitted by the figure (lasso
selection etc.)
text: str or int or Series or array-like
Either a name of a column in `data_frame`, or a pandas Series or
array_like object. Values from this column or array_like appear in the
figure as text labels.
facet_row: str or int or Series or array-like
Either a name of a column in `data_frame`, or a pandas Series or
array_like object. Values from this column or array_like are used to
assign marks to facetted subplots in the vertical direction.
facet_col: str or int or Series or array-like
Either a name of a column in `data_frame`, or a pandas Series or
array_like object. Values from this column or array_like are used to
assign marks to facetted subplots in the horizontal direction.
facet_col_wrap: int
Maximum number of facet columns. Wraps the column variable at this
width, so that the column facets span multiple rows. Ignored if 0, and
forced to 0 if `facet_row` or a `marginal` is set.
facet_row_spacing: float between 0 and 1
Spacing between facet rows, in paper units. Default is 0.03 or 0.07
when facet_col_wrap is used.
facet_col_spacing: float between 0 and 1
Spacing between facet columns, in paper units Default is 0.02.
error_x: str or int or Series or array-like
Either a name of a column in `data_frame`, or a pandas Series or
array_like object. Values from this column or array_like are used to
size x-axis error bars. If `error_x_minus` is `None`, error bars will
be symmetrical, otherwise `error_x` is used for the positive direction
only.
error_x_minus: str or int or Series or array-like
Either a name of a column in `data_frame`, or a pandas Series or
array_like object. Values from this column or array_like are used to
size x-axis error bars in the negative direction. Ignored if `error_x`
is `None`.
error_y: str or int or Series or array-like
Either a name of a column in `data_frame`, or a pandas Series or
array_like object. Values from this column or array_like are used to
size y-axis error bars. If `error_y_minus` is `None`, error bars will
be symmetrical, otherwise `error_y` is used for the positive direction
only.
error_y_minus: str or int or Series or array-like
Either a name of a column in `data_frame`, or a pandas Series or
array_like object. Values from this column or array_like are used to
size y-axis error bars in the negative direction. Ignored if `error_y`
is `None`.
animation_frame: str or int or Series or array-like
Either a name of a column in `data_frame`, or a pandas Series or
array_like object. Values from this column or array_like are used to
assign marks to animation frames.
animation_group: str or int or Series or array-like
Either a name of a column in `data_frame`, or a pandas Series or
array_like object. Values from this column or array_like are used to
provide object-constancy across animation frames: rows with matching
`animation_group`s will be treated as if they describe the same object
in each frame.
category_orders: dict with str keys and list of str values (default `{}`)
By default, in Python 3.6+, the order of categorical values in axes,
legends and facets depends on the order in which these values are first
encountered in `data_frame` (and no order is guaranteed by default in
Python below 3.6). This parameter is used to force a specific ordering
of values per column. The keys of this dict should correspond to column
names, and the values should be lists of strings corresponding to the
specific display order desired.
labels: dict with str keys and str values (default `{}`)
By default, column names are used in the figure for axis titles, legend
entries and hovers. This parameter allows this to be overridden. The
keys of this dict should correspond to column names, and the values
should correspond to the desired label to be displayed.
orientation: str, one of `'h'` for horizontal or `'v'` for vertical.
(default `'v'` if `x` and `y` are provided and both continuous or both
categorical, otherwise `'v'`(`'h'`) if `x`(`y`) is categorical and
`y`(`x`) is continuous, otherwise `'v'`(`'h'`) if only `x`(`y`) is
provided)
color_discrete_sequence: list of str
Strings should define valid CSS-colors. When `color` is set and the
values in the corresponding column are not numeric, values in that
column are assigned colors by cycling through `color_discrete_sequence`
in the order described in `category_orders`, unless the value of
`color` is a key in `color_discrete_map`. Various useful color
sequences are available in the `plotly.express.colors` submodules,
specifically `plotly.express.colors.qualitative`.
color_discrete_map: dict with str keys and str values (default `{}`)
String values should define valid CSS-colors Used to override
`color_discrete_sequence` to assign a specific colors to marks
corresponding with specific values. Keys in `color_discrete_map` should
be values in the column denoted by `color`. Alternatively, if the
values of `color` are valid colors, the string `'identity'` may be
passed to cause them to be used directly.
color_continuous_scale: list of str
Strings should define valid CSS-colors This list is used to build a
continuous color scale when the column denoted by `color` contains
numeric data. Various useful color scales are available in the
`plotly.express.colors` submodules, specifically
`plotly.express.colors.sequential`, `plotly.express.colors.diverging`
and `plotly.express.colors.cyclical`.
range_color: list of two numbers
If provided, overrides auto-scaling on the continuous color scale.
color_continuous_midpoint: number (default `None`)
If set, computes the bounds of the continuous color scale to have the
desired midpoint. Setting this value is recommended when using
`plotly.express.colors.diverging` color scales as the inputs to
`color_continuous_scale`.
symbol_sequence: list of str
Strings should define valid plotly.js symbols. When `symbol` is set,
values in that column are assigned symbols by cycling through
`symbol_sequence` in the order described in `category_orders`, unless
the value of `symbol` is a key in `symbol_map`.
symbol_map: dict with str keys and str values (default `{}`)
String values should define plotly.js symbols Used to override
`symbol_sequence` to assign a specific symbols to marks corresponding
with specific values. Keys in `symbol_map` should be values in the
column denoted by `symbol`. Alternatively, if the values of `symbol`
are valid symbol names, the string `'identity'` may be passed to cause
them to be used directly.
opacity: float
Value between 0 and 1. Sets the opacity for markers.
size_max: int (default `20`)
Set the maximum mark size when using `size`.
marginal_x: str
One of `'rug'`, `'box'`, `'violin'`, or `'histogram'`. If set, a
horizontal subplot is drawn above the main plot, visualizing the
x-distribution.
marginal_y: str
One of `'rug'`, `'box'`, `'violin'`, or `'histogram'`. If set, a
vertical subplot is drawn to the right of the main plot, visualizing
the y-distribution.
trendline: str
One of `'ols'`, `'lowess'`, `'rolling'`, `'expanding'` or `'ewm'`. If
`'ols'`, an Ordinary Least Squares regression line will be drawn for
each discrete-color/symbol group. If `'lowess`', a Locally Weighted
Scatterplot Smoothing line will be drawn for each discrete-color/symbol
group. If `'rolling`', a Rolling (e.g. rolling average, rolling median)
line will be drawn for each discrete-color/symbol group. If
`'expanding`', an Expanding (e.g. expanding average, expanding sum)
line will be drawn for each discrete-color/symbol group. If `'ewm`', an
Exponentially Weighted Moment (e.g. exponentially-weighted moving
average) line will be drawn for each discrete-color/symbol group. See
the docstrings for the functions in
`plotly.express.trendline_functions` for more details on these
functions and how to configure them with the `trendline_options`
argument.
trendline_options: dict
Options passed as the first argument to the function from
`plotly.express.trendline_functions` named in the `trendline`
argument.
trendline_color_override: str
Valid CSS color. If provided, and if `trendline` is set, all trendlines
will be drawn in this color rather than in the same color as the traces
from which they draw their inputs.
trendline_scope: str (one of `'trace'` or `'overall'`, default `'trace'`)
If `'trace'`, then one trendline is drawn per trace (i.e. per color,
symbol, facet, animation frame etc) and if `'overall'` then one
trendline is computed for the entire dataset, and replicated across all
facets.
log_x: boolean (default `False`)
If `True`, the x-axis is log-scaled in cartesian coordinates.
log_y: boolean (default `False`)
If `True`, the y-axis is log-scaled in cartesian coordinates.
range_x: list of two numbers
If provided, overrides auto-scaling on the x-axis in cartesian
coordinates.
range_y: list of two numbers
If provided, overrides auto-scaling on the y-axis in cartesian
coordinates.
render_mode: str
One of `'auto'`, `'svg'` or `'webgl'`, default `'auto'` Controls the
browser API used to draw marks. `'svg'` is appropriate for figures of
less than 1000 data points, and will allow for fully-vectorized output.
`'webgl'` is likely necessary for acceptable performance above 1000
points but rasterizes part of the output. `'auto'` uses heuristics to
choose the mode.
title: str
The figure title.
subtitle: str
The figure subtitle.
template: str or dict or plotly.graph_objects.layout.Template instance
The figure template name (must be a key in plotly.io.templates) or
definition.
width: int (default `None`)
The figure width in pixels.
height: int (default `None`)
The figure height in pixels.
Returns
-------
plotly.graph_objects.Figure
File: /home/conda/developer/55fe7ffdc8f19782d8fa1d5de44c1f26cc58e5a472146c0c30061f0238bc3185-20250317-170017-682536-85-training-nb-maintenance-mgen-15.0.1/lib/python3.11/site-packages/plotly/express/_chart_types.py
Type: function
The px.scatter() function documentation is also available on the Plotly website.
For any given type of plot or chart, there is also usually a user guide on the Plotly website, which provides some helpful examples. For example, here is the Plotly user guide on scatter plots.
First scatter plot#
fig = px.scatter(
    data_frame=df_gapminder.query("year == 2021"),
    x="income_per_person",
    y="life_expectancy",
)
fig
Exercise 1 (English)
Uncomment the code in the cell below and run it to create a scatter plot comparing income_per_person with child_mortality in 2021.
Exercice 1 (Français)
Décommenter le code dans la cellule ci-dessous et l’exécuter afin de créer un diagramme à nuage de points comparant income_per_person avec child_mortality en 2021.
# fig = px.scatter(
#     data_frame=df_gapminder.query("year == 2021"),
#     x="income_per_person",
#     y="child_mortality",
# )
# fig
Hover text (a.k.a. tooltips)#
To help explore these data, let’s use the hover_name and hover_data parameters to add more information to the hover text.
fig = px.scatter(
    data_frame=df_gapminder.query("year == 2021"),
    x="income_per_person",
    y="life_expectancy",
    hover_name="country",
    hover_data=["child_mortality"],
)
fig
N.B., there is lots more information on how to use hover text in the Plotly docs.
Interactive controls#
Every Plotly plot has a set of interactive controls, which appear at the top right of the plot.
These controls are useful for zooming and panning a plot, as well as for downloading a static version of a plot.
Marker color#
To explore these data further, let’s use the color parameter to represent another variable.
fig = px.scatter(
    data_frame=df_gapminder.query("year == 2021"),
    x="income_per_person",
    y="life_expectancy",
    hover_name="country",
    hover_data=["child_mortality"],
    color="world_4region",
)
fig
Now we can see easily which region of the world each country belongs to.
Marker size#
Let’s use the size parameter to also visualise the population size of each country.
fig = px.scatter(
    data_frame=df_gapminder.query("year == 2021"),
    x="income_per_person",
    y="life_expectancy",
    hover_name="country",
    hover_data=["child_mortality"],
    color="world_4region",
    size="population",
    size_max=80,
)
fig
Note that we also used the size_max parameter to increase the maximum marker size (the default is 20), which suits these data better.
Exercise 2 (English)
Create a scatter plot using the Gapminder data for the year 1950, with income_per_person on the X axis and child_mortality on the Y axis. Use population for the marker size, and world_6region for marker color.
Exercice 2 (Français)
Créer un diagramme à nuage de points utilisant les données de Gapminder pour l’année 1950 avec income_per_person sur l’axe horizontal X et child_mortality sur l’axe vertical Y. Utiliser population pour la taille du point et world_6region pour sa couleur.
Plot title and axis labels#
If we’re presenting this plot to others, it is a good idea to tidy up the axis titles, and to add a title to the plot. We can do this with the labels and title parameters.
fig = px.scatter(
    data_frame=df_gapminder.query("year == 2021"),
    x="income_per_person",
    y="life_expectancy",
    hover_name="country",
    hover_data=["child_mortality"],
    color="world_4region",
    size="population",
    size_max=80,
    labels={
        "income_per_person": "Income",
        "life_expectancy": "Life expectancy",
        "child_mortality": "Child mortality",
        "world_4region": "World region",
        "population": "Population",
    },
    title="Life expectancy and income by country in 2021",
)
fig
Using log scale#
Some variables are more naturally visualised on a log scale, rather than a linear scale. Let’s use the log_x parameter to apply a log scale to the X axis.
fig = px.scatter(
    data_frame=df_gapminder.query("year == 2021"),
    x="income_per_person",
    y="life_expectancy",
    hover_name="country",
    hover_data=["child_mortality"],
    color="world_4region",
    size="population",
    size_max=80,
    labels={
        "income_per_person": "Income",
        "life_expectancy": "Life expectancy",
        "world_4region": "World region",
        "population": "Population",
    },
    title="Life expectancy and income by country in 2021",
    log_x=True,
)
fig
Animation#
Let’s now add another variable, which is year. When you have a variable that represents time, it can also be useful to visualise this as an animation. We can do this via the animation_frame parameter.
fig = px.scatter(
    data_frame=df_gapminder,
    x="income_per_person",
    y="life_expectancy",
    hover_name="country",
    hover_data=["child_mortality"],
    color="world_4region",
    size="population",
    size_max=80,
    labels={
        "income_per_person": "Income",
        "life_expectancy": "Life expectancy",
        "world_4region": "World region",
        "population": "Population",
        "year": "Year",
    },
    title="Life expectancy and income by country, 1950-2021",
    log_x=True,
    animation_frame="year",
    range_x=[200, 200_000],
    range_y=[20, 95],
    height=700,
)
fig
Visual styling#
The scatter plot we’ve created above is more than good enough for exploratory data analysis. If you need to make a really strong visual impact, you can change almost any aspect of how the plot looks, either through extra parameters or through additional function calls which update the figure. Here’s an example, where we alter the template, set the marker colors, change the X axis tick positions and labels, and change the marker line color to black.
fig = px.scatter(
    data_frame=df_gapminder,
    x="income_per_person",
    y="life_expectancy",
    hover_name="country",
    hover_data=["child_mortality"],
    color="world_4region",
    size="population",
    size_max=80,
    labels={
        "income_per_person": "Income per person (GDP/capita, PPP$ inflation-adjusted)",
        "life_expectancy": "Life expectancy (years)",
        "world_4region": "World region",
        "population": "Population",
        "year": "Year",
    },
    title="Life expectancy and income by country, 1950-2021",
    log_x=True,
    animation_frame="year",
    range_x=[200, 200_000],
    range_y=[20, 95],
    # color_discrete_sequence=px.colors.qualitative.Set1,
    color_discrete_map={"Asia": "#ff5872", "Africa": "#00d5e9", "Europe": "#ffe700", "Americas": "#7feb00"},
    opacity=0.9,
    template="plotly_white",
    height=600,
    width=800,
)
# set custom tick positions and labels on the X axis
fig.update_layout(
    xaxis=dict(
        tickmode="array",
        tickvals=[500, 1_000, 2_000, 4_000, 8_000, 16_000, 32_000, 64_000, 128_000],
        ticktext=["500", "1000", "2000", "4000", "8000", "16k", "32k", "64k", "128k"],
    )
)
# draw black axis lines
fig.update_xaxes(showline=True, linewidth=1, linecolor="black")
fig.update_yaxes(showline=True, linewidth=1, linecolor="black")
# outline each marker with a thin black line
fig.update_traces(
    marker=dict(line=dict(width=0.5, color="black")),
)
fig
There’s more info on styling on the Plotly website, as well as info on continuous color scales and discrete colors.
Exercise 3 (English)
Create an animated scatter plot from the Gapminder data as above, but using child_mortality on the Y axis and world_6region for marker color.
Also, use a different palette for the marker colors. Hint: use the color_discrete_sequence parameter, and choose your favourite discrete color sequence (palette) from the Plotly website.
Exercice 3 (Français)
Créer un diagramme à nuage de points animé utilisant les données de Gapminder comme ci-dessus mais affichant child_mortality sur l’axe Y et world_6region pour la couleur du marqueur.
Utiliser aussi une palette de couleur différente. Indice: utiliser le paramètre color_discrete_sequence et choisir votre palette favorite sur le site Plotly.
3D scatter plots#
For a bit of extra interest, let’s use the px.scatter_3d() function to make a 3-dimensional version of the Gapminder animation, adding in the child_mortality variable.
fig = px.scatter_3d(
    data_frame=df_gapminder,
    x="income_per_person",
    y="life_expectancy",
    z="child_mortality",
    hover_name="country",
    color="world_4region",
    size="population",
    size_max=100,
    animation_frame="year",
    log_x=True,
    range_x=[200, 200_000],
    range_y=[0, 95],
    range_z=[0, 500],
    height=700,
    width=700,
)
fig.update_layout(
    scene=dict(aspectmode="cube"),
    legend=dict(itemsizing="constant"),
)
fig
Bar plots#
To illustrate bar plots, let’s use data from the Alliance for Malaria Prevention’s Net Mapping Project. We’ll combine data from the 2020 report and the 2022 Q1 report, which together provide data on shipments of long-lasting insecticidal nets (LLINs) by country for 2004-2021, broken down by LLIN type (standard, PBO and dual active ingredient).
def load_llin_data():
    """Load data on LLIN shipments from the Alliance for Malaria Prevention's
    Net Mapping Project."""
    # N.B., data are split over several spreadsheets, so some munging is required.
    # N.B., files have been obtained from the AMP website and uploaded to
    # Google Cloud Storage for efficient download.
    # load the "Final-2020.xlsx" dataset, "SSA" sheet - this has LLINs for 2004-2020
    df_nmp_2020_ssa = pd.read_excel(
        "http://vobs-resources.cog.sanger.ac.uk/training/img/workshop-3/reference-amp_net_mapping_project-Final-2020.xlsx",
        sheet_name="SSA",
        skiprows=2,
        skipfooter=2,
        names=["country"] + list(range(2004, 2021)),
        usecols=list(range(18)),
    )
    # load the "Final-2020.xlsx" dataset, "SSA by net type" sheet - this has LLINs by type for 2018, 2019, 2020
    df_nmp_2020_ssa_by_type = pd.read_excel(
        "http://vobs-resources.cog.sanger.ac.uk/training/img/workshop-3/reference-amp_net_mapping_project-Final-2020.xlsx",
        sheet_name="SSA by net type",
        skiprows=3,
        skipfooter=8,
        usecols="A,B,C,F,G,H,K,L,M",
        names=[
            "country",
            "2018_standard",
            "2018_pbo",
            "2019_standard",
            "2019_pbo",
            "2019_dual",
            "2020_standard",
            "2020_pbo",
            "2020_dual",
        ],
    )
    # load the "NMP-1st-Q-2022.xlsx" dataset, "SSA by Type" sheet - this has LLINs by type for 2019, 2020, 2021
    df_nmp_2022q1_ssa_by_type = pd.read_excel(
        "http://vobs-resources.cog.sanger.ac.uk/training/img/workshop-3/reference-amp_net_mapping_project-NMP-1st-Q-2022.xlsx",
        sheet_name="SSA by Type",
        skiprows=3,
        skipfooter=2,
        usecols="A,C,D,E,H,I,J,M,N,O",
        names=[
            "country",
            "2019_standard",
            "2019_pbo",
            "2019_dual",
            "2020_standard",
            "2020_pbo",
            "2020_dual",
            "2021_standard",
            "2021_pbo",
            "2021_dual",
        ],
    )
    # N.B., we would like LLINs by type for the full range 2004-2021.
    # We also would like the data in "long form" for easier plotting.
    # Let's munge!
    # start with data prior to 2018
    df_llins_pre_2018 = (
        df_nmp_2020_ssa
        .melt(id_vars="country", var_name="year", value_name="llins_shipped")
        .query("year < 2018")
    )
    df_llins_pre_2018["llin_type"] = "standard"  # assume all standard llins prior to 2018
    # now grab the data by type for 2018
    df_llins_2018 = (
        df_nmp_2020_ssa_by_type
        [["country", "2018_standard", "2018_pbo"]]
        .melt(id_vars="country", var_name="year_type", value_name="llins_shipped")
    )
    df_year_type = (
        df_llins_2018["year_type"]
        .str.split("_", expand=True)
        .rename(columns={0: "year", 1: "llin_type"})
    )
    df_llins_2018["year"] = df_year_type["year"]
    df_llins_2018["llin_type"] = df_year_type["llin_type"]
    df_llins_2018.drop(columns="year_type", inplace=True)
    # now grab the data by type for 2019, 2020, 2021
    df_llins_post_2018 = (
        df_nmp_2022q1_ssa_by_type
        [["country", "2019_standard", "2019_pbo", "2019_dual", "2020_standard", "2020_pbo", "2020_dual", "2021_standard", "2021_pbo", "2021_dual"]]
        .melt(id_vars="country", var_name="year_type", value_name="llins_shipped")
    )
    df_year_type = (
        df_llins_post_2018["year_type"]
        .str.split("_", expand=True)
        .rename(columns={0: "year", 1: "llin_type"})
    )
    df_llins_post_2018["year"] = df_year_type["year"]
    df_llins_post_2018["llin_type"] = df_year_type["llin_type"]
    df_llins_post_2018.drop(columns="year_type", inplace=True)
    # finally, concatenate everything
    df_llins = pd.concat([df_llins_pre_2018, df_llins_2018, df_llins_post_2018]).reset_index(drop=True)
    # ensure years have the right dtype
    df_llins["year"] = df_llins["year"].astype(int)
    # normalise country names
    df_llins["country"] = df_llins["country"].replace("Congo (Democratic Republic of the)", "DR Congo")
    return df_llins
df_llins = load_llin_data()
df_llins
 | country | year | llins_shipped | llin_type |
---|---|---|---|---|
0 | Angola | 2004 | 154010 | standard |
1 | Benin | 2004 | 26500 | standard |
2 | Botswana | 2004 | 0 | standard |
3 | Burkina Faso | 2004 | 216500 | standard |
4 | Burundi | 2004 | 160250 | standard |
... | ... | ... | ... | ... |
1154 | Togo | 2021 | 0 | dual |
1155 | Uganda | 2021 | 0 | dual |
1156 | Zambia | 2021 | 0 | dual |
1157 | Zanzibar | 2021 | 0 | dual |
1158 | Zimbabwe | 2021 | 0 | dual |
1159 rows × 4 columns
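The reshaping from "wide" to "long" form used in the loading function can be easier to follow on a tiny table. The sketch below is a minimal illustration only: the countries and numbers are made up, and the names `df_wide` and `df_long` are hypothetical, not part of the LLIN dataset.

```python
import pandas as pd

# Toy wide-form table with made-up numbers (illustration only).
df_wide = pd.DataFrame(
    {
        "country": ["A", "B"],
        "2019_standard": [100, 200],
        "2019_pbo": [10, 20],
    }
)

# Melt to long form: one row per (country, year_type) combination.
df_long = df_wide.melt(
    id_vars="country", var_name="year_type", value_name="llins_shipped"
)

# Split values like "2019_pbo" into separate "year" and "llin_type" columns.
df_long[["year", "llin_type"]] = df_long["year_type"].str.split("_", expand=True)
df_long = df_long.drop(columns="year_type")

# Years come out of the split as strings, so convert them to integers.
df_long["year"] = df_long["year"].astype(int)
df_long
```

Each row of the long-form table now describes a single observation, which is the layout that Plotly Express functions handle most naturally.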
df_llins.query("country == 'Nigeria'")
 | country | year | llins_shipped | llin_type |
---|---|---|---|---|
30 | Nigeria | 2004 | 71400 | standard |
76 | Nigeria | 2005 | 262000 | standard |
122 | Nigeria | 2006 | 2147404 | standard |
168 | Nigeria | 2007 | 2724304 | standard |
214 | Nigeria | 2008 | 15310222 | standard |
260 | Nigeria | 2009 | 19813977 | standard |
306 | Nigeria | 2010 | 29908286 | standard |
352 | Nigeria | 2011 | 2555096 | standard |
398 | Nigeria | 2012 | 5452563 | standard |
444 | Nigeria | 2013 | 26355032 | standard |
490 | Nigeria | 2014 | 42973544 | standard |
536 | Nigeria | 2015 | 23794214 | standard |
582 | Nigeria | 2016 | 11240307 | standard |
628 | Nigeria | 2017 | 35498731 | standard |
674 | Nigeria | 2018 | 18635909 | standard |
720 | Nigeria | 2018 | 51000 | pbo |
767 | Nigeria | 2019 | 31642624 | standard |
814 | Nigeria | 2019 | 1760400 | pbo |
861 | Nigeria | 2019 | 0 | dual |
908 | Nigeria | 2020 | 4449900 | standard |
955 | Nigeria | 2020 | 11717441 | pbo |
1002 | Nigeria | 2020 | 5567000 | dual |
1049 | Nigeria | 2021 | 1433000 | standard |
1096 | Nigeria | 2021 | 33048807 | pbo |
1143 | Nigeria | 2021 | 2833598 | dual |
First bar plot#
To make a bar plot, we can use the px.bar() function. Let’s look at the function documentation.
px.bar?
Signature:
px.bar(
data_frame=None,
x=None,
y=None,
color=None,
pattern_shape=None,
facet_row=None,
facet_col=None,
facet_col_wrap=0,
facet_row_spacing=None,
facet_col_spacing=None,
hover_name=None,
hover_data=None,
custom_data=None,
text=None,
base=None,
error_x=None,
error_x_minus=None,
error_y=None,
error_y_minus=None,
animation_frame=None,
animation_group=None,
category_orders=None,
labels=None,
color_discrete_sequence=None,
color_discrete_map=None,
color_continuous_scale=None,
pattern_shape_sequence=None,
pattern_shape_map=None,
range_color=None,
color_continuous_midpoint=None,
opacity=None,
orientation=None,
barmode='relative',
log_x=False,
log_y=False,
range_x=None,
range_y=None,
text_auto=False,
title=None,
subtitle=None,
template=None,
width=None,
height=None,
) -> plotly.graph_objs._figure.Figure
Docstring:
In a bar plot, each row of `data_frame` is represented as a rectangular
mark.
Parameters
----------
data_frame: DataFrame or array-like or dict
This argument needs to be passed for column names (and not keyword
names) to be used. Array-like and dict are transformed internally to a
pandas DataFrame. Optional: if missing, a DataFrame gets constructed
under the hood using the other arguments.
x: str or int or Series or array-like
Either a name of a column in `data_frame`, or a pandas Series or
array_like object. Values from this column or array_like are used to
position marks along the x axis in cartesian coordinates. Either `x` or
`y` can optionally be a list of column references or array_likes, in
which case the data will be treated as if it were 'wide' rather than
'long'.
y: str or int or Series or array-like
Either a name of a column in `data_frame`, or a pandas Series or
array_like object. Values from this column or array_like are used to
position marks along the y axis in cartesian coordinates. Either `x` or
`y` can optionally be a list of column references or array_likes, in
which case the data will be treated as if it were 'wide' rather than
'long'.
color: str or int or Series or array-like
Either a name of a column in `data_frame`, or a pandas Series or
array_like object. Values from this column or array_like are used to
assign color to marks.
pattern_shape: str or int or Series or array-like
Either a name of a column in `data_frame`, or a pandas Series or
array_like object. Values from this column or array_like are used to
assign pattern shapes to marks.
facet_row: str or int or Series or array-like
Either a name of a column in `data_frame`, or a pandas Series or
array_like object. Values from this column or array_like are used to
assign marks to facetted subplots in the vertical direction.
facet_col: str or int or Series or array-like
Either a name of a column in `data_frame`, or a pandas Series or
array_like object. Values from this column or array_like are used to
assign marks to facetted subplots in the horizontal direction.
facet_col_wrap: int
Maximum number of facet columns. Wraps the column variable at this
width, so that the column facets span multiple rows. Ignored if 0, and
forced to 0 if `facet_row` or a `marginal` is set.
facet_row_spacing: float between 0 and 1
Spacing between facet rows, in paper units. Default is 0.03 or 0.07
when facet_col_wrap is used.
facet_col_spacing: float between 0 and 1
Spacing between facet columns, in paper units Default is 0.02.
hover_name: str or int or Series or array-like
Either a name of a column in `data_frame`, or a pandas Series or
array_like object. Values from this column or array_like appear in bold
in the hover tooltip.
hover_data: str, or list of str or int, or Series or array-like, or dict
Either a name or list of names of columns in `data_frame`, or pandas
Series, or array_like objects or a dict with column names as keys, with
values True (for default formatting) False (in order to remove this
column from hover information), or a formatting string, for example
':.3f' or '|%a' or list-like data to appear in the hover tooltip or
tuples with a bool or formatting string as first element, and list-like
data to appear in hover as second element Values from these columns
appear as extra data in the hover tooltip.
custom_data: str, or list of str or int, or Series or array-like
Either name or list of names of columns in `data_frame`, or pandas
Series, or array_like objects Values from these columns are extra data,
to be used in widgets or Dash callbacks for example. This data is not
user-visible but is included in events emitted by the figure (lasso
selection etc.)
text: str or int or Series or array-like
Either a name of a column in `data_frame`, or a pandas Series or
array_like object. Values from this column or array_like appear in the
figure as text labels.
base: str or int or Series or array-like
Either a name of a column in `data_frame`, or a pandas Series or
array_like object. Values from this column or array_like are used to
position the base of the bar.
error_x: str or int or Series or array-like
Either a name of a column in `data_frame`, or a pandas Series or
array_like object. Values from this column or array_like are used to
size x-axis error bars. If `error_x_minus` is `None`, error bars will
be symmetrical, otherwise `error_x` is used for the positive direction
only.
error_x_minus: str or int or Series or array-like
Either a name of a column in `data_frame`, or a pandas Series or
array_like object. Values from this column or array_like are used to
size x-axis error bars in the negative direction. Ignored if `error_x`
is `None`.
error_y: str or int or Series or array-like
Either a name of a column in `data_frame`, or a pandas Series or
array_like object. Values from this column or array_like are used to
size y-axis error bars. If `error_y_minus` is `None`, error bars will
be symmetrical, otherwise `error_y` is used for the positive direction
only.
error_y_minus: str or int or Series or array-like
Either a name of a column in `data_frame`, or a pandas Series or
array_like object. Values from this column or array_like are used to
size y-axis error bars in the negative direction. Ignored if `error_y`
is `None`.
animation_frame: str or int or Series or array-like
Either a name of a column in `data_frame`, or a pandas Series or
array_like object. Values from this column or array_like are used to
assign marks to animation frames.
animation_group: str or int or Series or array-like
Either a name of a column in `data_frame`, or a pandas Series or
array_like object. Values from this column or array_like are used to
provide object-constancy across animation frames: rows with matching
`animation_group`s will be treated as if they describe the same object
in each frame.
category_orders: dict with str keys and list of str values (default `{}`)
By default, in Python 3.6+, the order of categorical values in axes,
legends and facets depends on the order in which these values are first
encountered in `data_frame` (and no order is guaranteed by default in
Python below 3.6). This parameter is used to force a specific ordering
of values per column. The keys of this dict should correspond to column
names, and the values should be lists of strings corresponding to the
specific display order desired.
labels: dict with str keys and str values (default `{}`)
By default, column names are used in the figure for axis titles, legend
entries and hovers. This parameter allows this to be overridden. The
keys of this dict should correspond to column names, and the values
should correspond to the desired label to be displayed.
color_discrete_sequence: list of str
Strings should define valid CSS-colors. When `color` is set and the
values in the corresponding column are not numeric, values in that
column are assigned colors by cycling through `color_discrete_sequence`
in the order described in `category_orders`, unless the value of
`color` is a key in `color_discrete_map`. Various useful color
sequences are available in the `plotly.express.colors` submodules,
specifically `plotly.express.colors.qualitative`.
color_discrete_map: dict with str keys and str values (default `{}`)
String values should define valid CSS-colors Used to override
`color_discrete_sequence` to assign a specific colors to marks
corresponding with specific values. Keys in `color_discrete_map` should
be values in the column denoted by `color`. Alternatively, if the
values of `color` are valid colors, the string `'identity'` may be
passed to cause them to be used directly.
color_continuous_scale: list of str
Strings should define valid CSS-colors This list is used to build a
continuous color scale when the column denoted by `color` contains
numeric data. Various useful color scales are available in the
`plotly.express.colors` submodules, specifically
`plotly.express.colors.sequential`, `plotly.express.colors.diverging`
and `plotly.express.colors.cyclical`.
pattern_shape_sequence: list of str
Strings should define valid plotly.js patterns-shapes. When
`pattern_shape` is set, values in that column are assigned patterns-
shapes by cycling through `pattern_shape_sequence` in the order
described in `category_orders`, unless the value of `pattern_shape` is
a key in `pattern_shape_map`.
pattern_shape_map: dict with str keys and str values (default `{}`)
Strings values define plotly.js patterns-shapes. Used to override
`pattern_shape_sequences` to assign a specific patterns-shapes to lines
corresponding with specific values. Keys in `pattern_shape_map` should
be values in the column denoted by `pattern_shape`. Alternatively, if
the values of `pattern_shape` are valid patterns-shapes names, the
string `'identity'` may be passed to cause them to be used directly.
range_color: list of two numbers
If provided, overrides auto-scaling on the continuous color scale.
color_continuous_midpoint: number (default `None`)
If set, computes the bounds of the continuous color scale to have the
desired midpoint. Setting this value is recommended when using
`plotly.express.colors.diverging` color scales as the inputs to
`color_continuous_scale`.
opacity: float
Value between 0 and 1. Sets the opacity for markers.
orientation: str, one of `'h'` for horizontal or `'v'` for vertical.
(default `'v'` if `x` and `y` are provided and both continuous or both
categorical, otherwise `'v'`(`'h'`) if `x`(`y`) is categorical and
`y`(`x`) is continuous, otherwise `'v'`(`'h'`) if only `x`(`y`) is
provided)
barmode: str (default `'relative'`)
One of `'group'`, `'overlay'` or `'relative'` In `'relative'` mode,
bars are stacked above zero for positive values and below zero for
negative values. In `'overlay'` mode, bars are drawn on top of one
another. In `'group'` mode, bars are placed beside each other.
log_x: boolean (default `False`)
If `True`, the x-axis is log-scaled in cartesian coordinates.
log_y: boolean (default `False`)
If `True`, the y-axis is log-scaled in cartesian coordinates.
range_x: list of two numbers
If provided, overrides auto-scaling on the x-axis in cartesian
coordinates.
range_y: list of two numbers
If provided, overrides auto-scaling on the y-axis in cartesian
coordinates.
text_auto: bool or string (default `False`)
If `True` or a string, the x or y or z values will be displayed as
text, depending on the orientation A string like `'.2f'` will be
interpreted as a `texttemplate` numeric formatting directive.
title: str
The figure title.
subtitle: str
The figure subtitle.
template: str or dict or plotly.graph_objects.layout.Template instance
The figure template name (must be a key in plotly.io.templates) or
definition.
width: int (default `None`)
The figure width in pixels.
height: int (default `None`)
The figure height in pixels.
Returns
-------
plotly.graph_objects.Figure
File: /home/conda/developer/55fe7ffdc8f19782d8fa1d5de44c1f26cc58e5a472146c0c30061f0238bc3185-20250317-170017-682536-85-training-nb-maintenance-mgen-15.0.1/lib/python3.11/site-packages/plotly/express/_chart_types.py
Type: function
As before, the px.bar() function documentation is also available on the Plotly website, along with a guide to bar charts.
Let’s now make a bar plot, with year on the X axis and llins_shipped on the Y axis.
fig = px.bar(
data_frame=df_llins,
x="year",
y="llins_shipped"
)
fig
Improved bar plot#
Let’s now improve the bar plot by adding color, hover text, and some visual styling.
fig = px.bar(
data_frame=df_llins,
x="year",
y="llins_shipped",
color="llin_type",
hover_name="country",
labels={
"year": "Year",
"llins_shipped": "No. LLINs",
"llin_type": "LLIN type"
},
title="LLIN shipments to countries in Sub-Saharan Africa",
width=800,
template="plotly_white",
)
fig
Exercise 4 (English)
Make a bar chart from the LLIN data as above, but using country for the X axis and year for the hover name.
Exercice 4 (Français)
Créer un diagramme à barres pour les données sur les LLINs comme au-dessus mais en utilisant country pour l’axe X et year pour le texte de survol.
Line and area plots#
Let’s also use the LLIN data to make some line and area plots, via the px.line() and px.area() functions.
Here is a line plot of LLINs shipped to Nigeria.
fig = px.line(
data_frame=df_llins.query("country == 'Nigeria'"),
x="year",
y="llins_shipped",
color="llin_type",
markers=True,
width=800,
title="LLIN shipments to Nigeria",
labels={
"year": "Year",
"llins_shipped": "No. LLINs",
"llin_type": "LLIN type"
},
template="plotly_white",
)
fig
Exercise 5 (English)
Make a line plot as above but for the Democratic Republic of the Congo. Hint: use the query "country == 'DR Congo'"
Exercice 5 (Français)
Créer un diagramme à lignes comme ci-dessus mais pour la République Démocratique du Congo. Indice: Utiliser la requête "country == 'DR Congo'"
Exercise 6 (English)
Make an area plot using the LLIN data from Nigeria. Hint: it uses exactly the same parameters as the line plot; just call the px.area() function instead of px.line().
Exercice 6 (Français)
Créer un diagramme à zones utilisant les données des LLINs du Nigeria. Indice: Les paramètres sont les mêmes que pour le diagramme à lignes mais il faut appeler la fonction px.area() au lieu de px.line().
Well done!#
Hopefully this has been a useful introduction to plotting in Python.
As I mentioned earlier, Plotly Express provides many more plot types. Take a look at the user guide and the API docs for more information.
Happy plotting!
Exercises#
English#
Open this notebook in Google Colab and run it for yourself from top to bottom. As you run through the notebook, cell by cell, think about what each cell is doing, and try the practical exercises along the way.
Have a go at the practical exercises, but please don’t worry if you don’t have time to do them all during the practical session, and please ask the teaching assistants for help if you are stuck.
Hint: To open the notebook in Google Colab, click the rocket icon at the top of the page, then select “Colab” from the drop-down menu.
Français#
Ouvrir ce notebook dans Google Colab et l’exécuter vous-même du début à la fin. Pendant que vous exécutez le notebook, cellule par cellule, pensez à ce que chaque cellule fait et essayez de faire les exercices quand vous les rencontrez.
Essayez de faire les exercices mais ne vous inquiétez pas si vous n’avez pas le temps de tout faire pendant la séance appliquée et n’hésitez pas à demander aux enseignants assistants si vous avez besoin d’aide parce que vous êtes bloqués.
Indice: Pour ouvrir le notebook dans Google Colab, cliquer sur l’icône de fusée au sommet de cette page puis choisissez “Colab” dans le menu déroulant.