Workshop 1 - Training course in data analysis for genomic surveillance of African malaria vectors
Module 2 - Accessing and exploring Anopheles genomic data#
Theme: Data
This module provides an introduction to accessing and exploring data about Anopheles mosquito specimens collected in the field and submitted for whole-genome sequencing by MalariaGEN.
Learning objectives#
After completing this module, you will be able to:
Explain how Anopheles genomic data are generated.
Explain what types of data are available from MalariaGEN.
Explain where data from MalariaGEN are stored.
Use the malariagen_data Python package to access Ag3.0 data in Google Cloud.
Explore the Ag3.0 data release and summarise the mosquito samples for which genomic data are available using pivot tables and maps.
Lecture#
Please note that the code in the cells below might differ from that shown in the video. This can happen because Python packages and their dependencies change due to updates, necessitating tweaks to the code.
Where do the data come from?#
The data we’ll be analysing in this training course were generated by multiple research groups collaborating as part of the Malaria Genomic Epidemiology Network (MalariaGEN).
MalariaGEN is a collaborative programme providing access to genome sequencing and data processing services to support surveillance of malaria parasites and vectors.
Through this programme, members of research groups and disease control programmes in malaria-endemic countries work in partnership with the Wellcome Sanger Institute.
The basic workflow involves collecting mosquitoes, shipping them to sequencing facilities, preparing DNA samples and performing Illumina whole-genome sequencing, then processing the resulting data so they are ready for analysis, as shown below.
%%html
<img width="50%" height="50%" src="https://vobs-resources.cog.sanger.ac.uk/training/img/workshop-1/w1m2-1.png"/>
Note that raw genome sequence data is not particularly useful by itself, and so the sequence reads are processed through variant-calling pipelines which identify different types of genetic variation between individual mosquitoes.
The results of variant-calling pipelines are then passed through a number of quality control, filtering and annotation steps to ensure data quality. We call this process data curation.
The analysis-ready genome variation data is then made available to all partners in the collaboration. This data can then be analysed to answer questions about the surveillance of mosquito populations, such as whether new forms of insecticide resistance are emerging and spreading.
What types of analysis-ready genomic data are available?#
When DNA is passed from one generation of mosquitoes to the next, it undergoes mutations, which are errors in the DNA copying process. There are different types of mutations that can occur. These include:
Single Nucleotide Polymorphisms (SNPs) - substitutions of a single letter in the DNA sequence
Copy Number Variants (CNVs) - duplications or deletions of sections of a DNA sequence
Different variant calling pipelines are used to identify these different types of mutations.
It is also very useful to know whether combinations of mutations occur together in the same DNA sequence. In order to reconstruct this information, another pipeline is used to produce phased haplotypes.
To help make sense of the genomic data, we also need some data about the mosquitoes which were sequenced, such as the time and place of collection. This data is known as sample metadata.
We will revisit CNVs and haplotypes in future workshops. For this workshop, we are only interested in SNPs and sample metadata.
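To make the idea of a SNP concrete, here is a toy sketch that compares two short, made-up DNA sequences and reports the positions where a single letter differs. The sequences and variable names are invented for illustration. Real variant-calling pipelines work from sequence reads aligned to a reference genome and are far more involved than a direct string comparison.

```python
# Toy illustration only: real variant calling works from sequence reads
# aligned to a reference genome, not from direct string comparison.
reference = "ACGTTGCAAC"  # made-up reference sequence
sample = "ACGTAGCAGC"     # made-up sequence from one mosquito

# Record each position (0-based) where the sample differs from the reference.
snps = [
    (pos, ref_base, alt_base)
    for pos, (ref_base, alt_base) in enumerate(zip(reference, sample))
    if ref_base != alt_base
]

for pos, ref_base, alt_base in snps:
    print(f"position {pos}: {ref_base} -> {alt_base}")
```

Each tuple holds the position, the reference letter and the alternative letter, which is the essential information a SNP call records.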
Where are the data stored?#
To make access as simple as possible, the data are stored in Google Cloud using a service called Google Cloud Storage (GCS). These data can then be downloaded to any computer, or analysed within the cloud using cloud computing services such as Colab.
If you are using Colab to access and analyse these data, then you don’t need to download any data to your own computer or install any special software. You access Colab through a web browser, and the code you run is executed on a different computer (a “virtual machine”) which sits alongside the data in Google Cloud.
Accessing the Ag3.0 data resource#
In this workshop we’ll be accessing and analysing data from the Anopheles gambiae 1000 Genomes Project phase 3 data resource, also known as “Ag3.0” for short. This includes data from whole-genome sequencing of 3,081 mosquitoes from 19 African countries.
To set up your notebook to access these data, first install the malariagen_data package.
%pip install -q --no-warn-conflicts malariagen_data
Then import packages and set up access to Anopheles gambiae genomic data.
Note that authentication is required to access data through the package. Please follow the instructions here.
import malariagen_data
import plotly.express as px
ag3 = malariagen_data.Ag3()
ag3
MalariaGEN Ag3 API client | |
---|---|
Please note that data are subject to terms of use, for more information see the MalariaGEN website or contact support@malariagen.net. See also the Ag3 API docs. | |
Storage URL | gs://vo_agam_release/ |
Data releases available | 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9 |
Results cache | None |
Cohorts analysis | 20240418 |
AIM analysis | 20220528 |
Site filters analysis | dt_20200416 |
Software version | malariagen_data 9.0.0 |
Client location | unknown |
You can now access a number of different types of data through the ag3 object. The full list of functions is available from the Ag3 API docs. For the rest of this module, we are just going to look at sample metadata.
Loading sample metadata#
We can use the sample_metadata() function to retrieve a pandas DataFrame containing metadata about all 3,081 samples in the Ag3.0 resource. In this DataFrame, each row represents one mosquito sample, and columns such as country and year provide information about where and when the mosquito was originally collected.
df_samples = ag3.sample_metadata(sample_sets="3.0")
df_samples
sample_id | partner_sample_id | contributor | country | location | year | month | latitude | longitude | sex_call | ... | admin1_name | admin1_iso | admin2_name | taxon | cohort_admin1_year | cohort_admin1_month | cohort_admin1_quarter | cohort_admin2_year | cohort_admin2_month | cohort_admin2_quarter | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AR0047-C | LUA047 | Joao Pinto | Angola | Luanda | 2009 | 4 | -8.884 | 13.302 | F | ... | Luanda | AO-LUA | Luanda | coluzzii | AO-LUA_colu_2009 | AO-LUA_colu_2009_04 | AO-LUA_colu_2009_Q2 | AO-LUA_Luanda_colu_2009 | AO-LUA_Luanda_colu_2009_04 | AO-LUA_Luanda_colu_2009_Q2 |
1 | AR0049-C | LUA049 | Joao Pinto | Angola | Luanda | 2009 | 4 | -8.884 | 13.302 | F | ... | Luanda | AO-LUA | Luanda | coluzzii | AO-LUA_colu_2009 | AO-LUA_colu_2009_04 | AO-LUA_colu_2009_Q2 | AO-LUA_Luanda_colu_2009 | AO-LUA_Luanda_colu_2009_04 | AO-LUA_Luanda_colu_2009_Q2 |
2 | AR0051-C | LUA051 | Joao Pinto | Angola | Luanda | 2009 | 4 | -8.884 | 13.302 | F | ... | Luanda | AO-LUA | Luanda | coluzzii | AO-LUA_colu_2009 | AO-LUA_colu_2009_04 | AO-LUA_colu_2009_Q2 | AO-LUA_Luanda_colu_2009 | AO-LUA_Luanda_colu_2009_04 | AO-LUA_Luanda_colu_2009_Q2 |
3 | AR0061-C | LUA061 | Joao Pinto | Angola | Luanda | 2009 | 4 | -8.884 | 13.302 | F | ... | Luanda | AO-LUA | Luanda | coluzzii | AO-LUA_colu_2009 | AO-LUA_colu_2009_04 | AO-LUA_colu_2009_Q2 | AO-LUA_Luanda_colu_2009 | AO-LUA_Luanda_colu_2009_04 | AO-LUA_Luanda_colu_2009_Q2 |
4 | AR0078-C | LUA078 | Joao Pinto | Angola | Luanda | 2009 | 4 | -8.884 | 13.302 | F | ... | Luanda | AO-LUA | Luanda | coluzzii | AO-LUA_colu_2009 | AO-LUA_colu_2009_04 | AO-LUA_colu_2009_Q2 | AO-LUA_Luanda_colu_2009 | AO-LUA_Luanda_colu_2009_04 | AO-LUA_Luanda_colu_2009_Q2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3076 | AD0494-C | 80-2-o-16 | Martin Donnelly | Lab Cross | LSTM | -1 | -1 | 53.409 | -2.969 | F | ... | NaN | NaN | NaN | unassigned | NaN | NaN | NaN | NaN | NaN | NaN |
3077 | AD0495-C | 80-2-o-17 | Martin Donnelly | Lab Cross | LSTM | -1 | -1 | 53.409 | -2.969 | M | ... | NaN | NaN | NaN | unassigned | NaN | NaN | NaN | NaN | NaN | NaN |
3078 | AD0496-C | 80-2-o-18 | Martin Donnelly | Lab Cross | LSTM | -1 | -1 | 53.409 | -2.969 | M | ... | NaN | NaN | NaN | unassigned | NaN | NaN | NaN | NaN | NaN | NaN |
3079 | AD0497-C | 80-2-o-19 | Martin Donnelly | Lab Cross | LSTM | -1 | -1 | 53.409 | -2.969 | F | ... | NaN | NaN | NaN | unassigned | NaN | NaN | NaN | NaN | NaN | NaN |
3080 | AD0498-C | 80-2-o-20 | Martin Donnelly | Lab Cross | LSTM | -1 | -1 | 53.409 | -2.969 | M | ... | NaN | NaN | NaN | unassigned | NaN | NaN | NaN | NaN | NaN | NaN |
3081 rows × 32 columns
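The preview shows that lab cross samples carry a year and month of -1. The sketch below rebuilds four of the rows shown above as a small DataFrame, keeping only four columns, and filters the lab crosses out. Treating -1 as a placeholder for "no collection date" is an assumption based on the preview, and `df_toy` and `df_wild` are names made up for this example.

```python
import pandas as pd

# Four rows copied from the preview above, keeping only four columns.
df_toy = pd.DataFrame(
    {
        "sample_id": ["AR0047-C", "AR0049-C", "AD0494-C", "AD0495-C"],
        "country": ["Angola", "Angola", "Lab Cross", "Lab Cross"],
        "year": [2009, 2009, -1, -1],
        "taxon": ["coluzzii", "coluzzii", "unassigned", "unassigned"],
    }
)

# Assumption: a year of -1 means that no collection date is recorded.
df_wild = df_toy.query("country != 'Lab Cross' and year > 0")

print(df_toy.shape)  # (number of rows, number of columns)
print(df_wild["sample_id"].tolist())
```

The same query string should work on the full `df_samples` DataFrame if you want to restrict an analysis to field-collected mosquitoes.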
Exploring sample metadata#
Let’s use some pandas features such as groupby() and query() to explore the sample metadata.
For example, let’s first find out a bit more information about the different countries represented.
df_samples.groupby("country").size()
country
Angola 81
Burkina Faso 296
Cameroon 444
Central African Republic 73
Cote d'Ivoire 80
Democratic Republic of the Congo 76
Equatorial Guinea 10
Gabon 69
Gambia, The 279
Ghana 100
Guinea 136
Guinea-Bissau 101
Kenya 86
Lab Cross 297
Malawi 41
Mali 225
Mayotte 23
Mozambique 74
Tanzania 300
Uganda 290
dtype: int64
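The result of groupby().size() is ordered by group label, here the country name. A small variation is to sort the counts so that the best-sampled groups come first. The sketch below uses a few made-up rows rather than real Ag3.0 records, so it runs without cloud access.

```python
import pandas as pd

# Made-up toy rows, not real Ag3.0 records.
df_toy = pd.DataFrame(
    {"country": ["Mali", "Ghana", "Mali", "Kenya", "Mali", "Ghana"]}
)

# groupby(...).size() counts rows per group, ordered by group label.
counts = df_toy.groupby("country").size()

# Sort so that the countries with the most samples come first.
counts_sorted = counts.sort_values(ascending=False)
print(counts_sorted)
```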
We can then use the pandas query() function to select all samples from a given country. E.g., find all samples from Burkina Faso.
df_samples.query("country == 'Burkina Faso'")
sample_id | partner_sample_id | contributor | country | location | year | month | latitude | longitude | sex_call | ... | admin1_name | admin1_iso | admin2_name | taxon | cohort_admin1_year | cohort_admin1_month | cohort_admin1_quarter | cohort_admin2_year | cohort_admin2_month | cohort_admin2_quarter | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
81 | AB0085-Cx | BF2-4 | Austin Burt | Burkina Faso | Pala | 2012 | 7 | 11.151 | -4.235 | F | ... | Hauts-Bassins | BF-09 | Houet | gambiae | BF-09_gamb_2012 | BF-09_gamb_2012_07 | BF-09_gamb_2012_Q3 | BF-09_Houet_gamb_2012 | BF-09_Houet_gamb_2012_07 | BF-09_Houet_gamb_2012_Q3 |
82 | AB0086-Cx | BF2-6 | Austin Burt | Burkina Faso | Pala | 2012 | 7 | 11.151 | -4.235 | F | ... | Hauts-Bassins | BF-09 | Houet | gambiae | BF-09_gamb_2012 | BF-09_gamb_2012_07 | BF-09_gamb_2012_Q3 | BF-09_Houet_gamb_2012 | BF-09_Houet_gamb_2012_07 | BF-09_Houet_gamb_2012_Q3 |
83 | AB0087-C | BF3-3 | Austin Burt | Burkina Faso | Bana Village | 2012 | 7 | 11.233 | -4.472 | F | ... | Hauts-Bassins | BF-09 | Houet | coluzzii | BF-09_colu_2012 | BF-09_colu_2012_07 | BF-09_colu_2012_Q3 | BF-09_Houet_colu_2012 | BF-09_Houet_colu_2012_07 | BF-09_Houet_colu_2012_Q3 |
84 | AB0088-C | BF3-5 | Austin Burt | Burkina Faso | Bana Village | 2012 | 7 | 11.233 | -4.472 | F | ... | Hauts-Bassins | BF-09 | Houet | coluzzii | BF-09_colu_2012 | BF-09_colu_2012_07 | BF-09_colu_2012_Q3 | BF-09_Houet_colu_2012 | BF-09_Houet_colu_2012_07 | BF-09_Houet_colu_2012_Q3 |
85 | AB0089-Cx | BF3-8 | Austin Burt | Burkina Faso | Bana Village | 2012 | 7 | 11.233 | -4.472 | F | ... | Hauts-Bassins | BF-09 | Houet | coluzzii | BF-09_colu_2012 | BF-09_colu_2012_07 | BF-09_colu_2012_Q3 | BF-09_Houet_colu_2012 | BF-09_Houet_colu_2012_07 | BF-09_Houet_colu_2012_Q3 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
372 | AB0314-C | 6775 | Nora Besansky | Burkina Faso | Monomtenga | 2004 | 8 | 12.060 | -1.170 | F | ... | Centre-Sud | BF-07 | Bazega | gambiae | BF-07_gamb_2004 | BF-07_gamb_2004_08 | BF-07_gamb_2004_Q3 | BF-07_Bazega_gamb_2004 | BF-07_Bazega_gamb_2004_08 | BF-07_Bazega_gamb_2004_Q3 |
373 | AB0315-C | 6777 | Nora Besansky | Burkina Faso | Monomtenga | 2004 | 8 | 12.060 | -1.170 | F | ... | Centre-Sud | BF-07 | Bazega | gambiae | BF-07_gamb_2004 | BF-07_gamb_2004_08 | BF-07_gamb_2004_Q3 | BF-07_Bazega_gamb_2004 | BF-07_Bazega_gamb_2004_08 | BF-07_Bazega_gamb_2004_Q3 |
374 | AB0316-C | 6779 | Nora Besansky | Burkina Faso | Monomtenga | 2004 | 8 | 12.060 | -1.170 | F | ... | Centre-Sud | BF-07 | Bazega | gambiae | BF-07_gamb_2004 | BF-07_gamb_2004_08 | BF-07_gamb_2004_Q3 | BF-07_Bazega_gamb_2004 | BF-07_Bazega_gamb_2004_08 | BF-07_Bazega_gamb_2004_Q3 |
375 | AB0318-C | 5072 | Nora Besansky | Burkina Faso | Monomtenga | 2004 | 7 | 12.060 | -1.170 | F | ... | Centre-Sud | BF-07 | Bazega | gambiae | BF-07_gamb_2004 | BF-07_gamb_2004_07 | BF-07_gamb_2004_Q3 | BF-07_Bazega_gamb_2004 | BF-07_Bazega_gamb_2004_07 | BF-07_Bazega_gamb_2004_Q3 |
376 | AB0325-C | 1403 | Nora Besansky | Burkina Faso | Monomtenga | 2004 | 6 | 12.060 | -1.170 | F | ... | Centre-Sud | BF-07 | Bazega | gambiae | BF-07_gamb_2004 | BF-07_gamb_2004_06 | BF-07_gamb_2004_Q2 | BF-07_Bazega_gamb_2004 | BF-07_Bazega_gamb_2004_06 | BF-07_Bazega_gamb_2004_Q2 |
296 rows × 32 columns
From a quick glance at the preview above, we can see there are samples collected in different years. Let’s summarise that.
df_samples.query("country == 'Burkina Faso'").groupby("year").size()
year
2004 13
2012 181
2014 102
dtype: int64
If we wanted to now inspect the samples collected from Burkina Faso in 2014, we could combine these conditions in a query.
df_samples.query("country == 'Burkina Faso' and year == 2014")
sample_id | partner_sample_id | contributor | country | location | year | month | latitude | longitude | sex_call | ... | admin1_name | admin1_iso | admin2_name | taxon | cohort_admin1_year | cohort_admin1_month | cohort_admin1_quarter | cohort_admin2_year | cohort_admin2_month | cohort_admin2_quarter | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
262 | AB0326-C | BF18-1 | Austin Burt | Burkina Faso | Bana Village | 2014 | 7 | 11.233 | -4.472 | F | ... | Hauts-Bassins | BF-09 | Houet | coluzzii | BF-09_colu_2014 | BF-09_colu_2014_07 | BF-09_colu_2014_Q3 | BF-09_Houet_colu_2014 | BF-09_Houet_colu_2014_07 | BF-09_Houet_colu_2014_Q3 |
263 | AB0327-C | BF18-3 | Austin Burt | Burkina Faso | Bana Village | 2014 | 7 | 11.233 | -4.472 | F | ... | Hauts-Bassins | BF-09 | Houet | coluzzii | BF-09_colu_2014 | BF-09_colu_2014_07 | BF-09_colu_2014_Q3 | BF-09_Houet_colu_2014 | BF-09_Houet_colu_2014_07 | BF-09_Houet_colu_2014_Q3 |
264 | AB0328-C | BF18-4 | Austin Burt | Burkina Faso | Bana Village | 2014 | 7 | 11.233 | -4.472 | F | ... | Hauts-Bassins | BF-09 | Houet | coluzzii | BF-09_colu_2014 | BF-09_colu_2014_07 | BF-09_colu_2014_Q3 | BF-09_Houet_colu_2014 | BF-09_Houet_colu_2014_07 | BF-09_Houet_colu_2014_Q3 |
265 | AB0329-C | BF18-5 | Austin Burt | Burkina Faso | Bana Village | 2014 | 7 | 11.233 | -4.472 | F | ... | Hauts-Bassins | BF-09 | Houet | coluzzii | BF-09_colu_2014 | BF-09_colu_2014_07 | BF-09_colu_2014_Q3 | BF-09_Houet_colu_2014 | BF-09_Houet_colu_2014_07 | BF-09_Houet_colu_2014_Q3 |
266 | AB0330-C | BF18-6 | Austin Burt | Burkina Faso | Bana Village | 2014 | 7 | 11.233 | -4.472 | F | ... | Hauts-Bassins | BF-09 | Houet | coluzzii | BF-09_colu_2014 | BF-09_colu_2014_07 | BF-09_colu_2014_Q3 | BF-09_Houet_colu_2014 | BF-09_Houet_colu_2014_07 | BF-09_Houet_colu_2014_Q3 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
359 | AB0533-C | BF13-18 | Austin Burt | Burkina Faso | Souroukoudinga | 2014 | 7 | 11.238 | -4.235 | F | ... | Hauts-Bassins | BF-09 | Houet | gambiae | BF-09_gamb_2014 | BF-09_gamb_2014_07 | BF-09_gamb_2014_Q3 | BF-09_Houet_gamb_2014 | BF-09_Houet_gamb_2014_07 | BF-09_Houet_gamb_2014_Q3 |
360 | AB0536-C | BF13-31 | Austin Burt | Burkina Faso | Souroukoudinga | 2014 | 7 | 11.238 | -4.235 | F | ... | Hauts-Bassins | BF-09 | Houet | gambiae | BF-09_gamb_2014 | BF-09_gamb_2014_07 | BF-09_gamb_2014_Q3 | BF-09_Houet_gamb_2014 | BF-09_Houet_gamb_2014_07 | BF-09_Houet_gamb_2014_Q3 |
361 | AB0537-C | BF13-32 | Austin Burt | Burkina Faso | Souroukoudinga | 2014 | 7 | 11.238 | -4.235 | F | ... | Hauts-Bassins | BF-09 | Houet | gambiae | BF-09_gamb_2014 | BF-09_gamb_2014_07 | BF-09_gamb_2014_Q3 | BF-09_Houet_gamb_2014 | BF-09_Houet_gamb_2014_07 | BF-09_Houet_gamb_2014_Q3 |
362 | AB0538-C | BF13-33 | Austin Burt | Burkina Faso | Souroukoudinga | 2014 | 7 | 11.238 | -4.235 | F | ... | Hauts-Bassins | BF-09 | Houet | gambiae | BF-09_gamb_2014 | BF-09_gamb_2014_07 | BF-09_gamb_2014_Q3 | BF-09_Houet_gamb_2014 | BF-09_Houet_gamb_2014_07 | BF-09_Houet_gamb_2014_Q3 |
363 | AB0408-C | BF14-20 | Austin Burt | Burkina Faso | Bana Village | 2014 | 7 | 11.233 | -4.472 | F | ... | Hauts-Bassins | BF-09 | Houet | coluzzii | BF-09_colu_2014 | BF-09_colu_2014_07 | BF-09_colu_2014_Q3 | BF-09_Houet_colu_2014 | BF-09_Houet_colu_2014_07 | BF-09_Houet_colu_2014_Q3 |
102 rows × 32 columns
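When the country or year of interest is held in a Python variable, pandas lets you refer to it inside the query string with an @ prefix, which avoids editing the string by hand. This is a minimal sketch on made-up rows. The variable names `target_country` and `target_year` are only illustrative.

```python
import pandas as pd

# Made-up toy rows, not real Ag3.0 records.
df_toy = pd.DataFrame(
    {
        "sample_id": ["S1", "S2", "S3", "S4"],
        "country": ["Burkina Faso", "Burkina Faso", "Mali", "Burkina Faso"],
        "year": [2012, 2014, 2014, 2014],
    }
)

target_country = "Burkina Faso"
target_year = 2014

# Inside a query string, @name refers to a Python variable,
# while a bare name refers to a DataFrame column.
df_selected = df_toy.query("country == @target_country and year == @target_year")
print(df_selected)
```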
Finally, let’s break this down by mosquito species.
df_samples.query("country == 'Burkina Faso' and year == 2014").groupby("taxon").size()
taxon
arabiensis 3
coluzzii 53
gambiae 46
dtype: int64
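Grouping by more than one column at a time gives a count for every combination, and unstack() spreads one of the grouping levels across the columns. The sketch below shows this on made-up rows. It is one possible approach, not the only way to build such a table.

```python
import pandas as pd

# Made-up toy rows, not real Ag3.0 records.
df_toy = pd.DataFrame(
    {
        "year": [2012, 2012, 2014, 2014, 2014],
        "taxon": ["gambiae", "coluzzii", "gambiae", "gambiae", "arabiensis"],
    }
)

# One count per (year, taxon) combination that occurs in the data.
counts = df_toy.groupby(["year", "taxon"]).size()

# Move the taxon level into the columns, using 0 for absent combinations.
table = counts.unstack(fill_value=0)
print(table)
```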
Summarising sample metadata with pivot tables#
In the examples above we explored a part of the sample metadata, but it can also be useful to get an overall summary of how many samples have been sequenced, broken down by time and place of collection and mosquito species. For that kind of summary the pivot_table() function is useful.
Let’s start by summarising the number of mosquitoes by country and species (taxon).
pivot_country_taxon = (
    df_samples
    .pivot_table(
        index="country",
        columns="taxon",
        values="sample_id",
        aggfunc="count",
        fill_value=0
    )
)
pivot_country_taxon
country | arabiensis | coluzzii | gambiae | gcx1 | gcx3 | unassigned |
---|---|---|---|---|---|---|
Angola | 0 | 81 | 0 | 0 | 0 | 0 |
Burkina Faso | 3 | 135 | 158 | 0 | 0 | 0 |
Cameroon | 2 | 26 | 416 | 0 | 0 | 0 |
Central African Republic | 0 | 18 | 55 | 0 | 0 | 0 |
Cote d'Ivoire | 0 | 80 | 0 | 0 | 0 | 0 |
Democratic Republic of the Congo | 0 | 0 | 76 | 0 | 0 | 0 |
Equatorial Guinea | 0 | 0 | 10 | 0 | 0 | 0 |
Gabon | 0 | 0 | 69 | 0 | 0 | 0 |
Gambia, The | 0 | 200 | 2 | 77 | 0 | 0 |
Ghana | 0 | 64 | 36 | 0 | 0 | 0 |
Guinea | 0 | 11 | 124 | 0 | 0 | 1 |
Guinea-Bissau | 0 | 0 | 7 | 93 | 0 | 1 |
Kenya | 13 | 0 | 19 | 0 | 54 | 0 |
Lab Cross | 0 | 0 | 0 | 0 | 0 | 297 |
Malawi | 41 | 0 | 0 | 0 | 0 | 0 |
Mali | 2 | 90 | 131 | 0 | 0 | 2 |
Mayotte | 0 | 0 | 23 | 0 | 0 | 0 |
Mozambique | 0 | 0 | 74 | 0 | 0 | 0 |
Tanzania | 225 | 0 | 64 | 0 | 11 | 0 |
Uganda | 82 | 0 | 207 | 0 | 0 | 1 |
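If you also want row and column totals, pivot_table() accepts margins=True, with margins_name setting the label used for the totals. The sketch below demonstrates this on made-up rows. The label "total" is an arbitrary choice.

```python
import pandas as pd

# Made-up toy rows, not real Ag3.0 records.
df_toy = pd.DataFrame(
    {
        "sample_id": ["S1", "S2", "S3", "S4", "S5"],
        "country": ["Ghana", "Ghana", "Ghana", "Kenya", "Kenya"],
        "taxon": ["coluzzii", "coluzzii", "gambiae", "gambiae", "arabiensis"],
    }
)

pivot_toy = df_toy.pivot_table(
    index="country",
    columns="taxon",
    values="sample_id",
    aggfunc="count",
    fill_value=0,
    margins=True,          # add a totals row and a totals column
    margins_name="total",  # label used for both
)
print(pivot_toy)
```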
We could also turn this into a bar chart.
fig = px.bar(pivot_country_taxon, height=600, width=800)
fig.update_layout(
    title="Ag3.0 genomes sequenced",
    yaxis_title="no. genomes",
)
fig.show()
Mosquitoes were also sampled in different years. Let’s make a new pivot table, breaking down by country, year and taxon.
pivot_country_year_taxon = (
    df_samples
    .pivot_table(
        index=["country", "year"],
        columns=["taxon"],
        values="sample_id",
        aggfunc="count",
        fill_value=0
    )
)
pivot_country_year_taxon
country | year | arabiensis | coluzzii | gambiae | gcx1 | gcx3 | unassigned |
---|---|---|---|---|---|---|---|
Angola | 2009 | 0 | 81 | 0 | 0 | 0 | 0 |
Burkina Faso | 2004 | 0 | 0 | 13 | 0 | 0 | 0 |
Burkina Faso | 2012 | 0 | 82 | 99 | 0 | 0 | 0 |
Burkina Faso | 2014 | 3 | 53 | 46 | 0 | 0 | 0 |
Cameroon | 2005 | 0 | 7 | 90 | 0 | 0 | 0 |
Cameroon | 2009 | 0 | 0 | 303 | 0 | 0 | 0 |
Cameroon | 2013 | 2 | 19 | 23 | 0 | 0 | 0 |
Central African Republic | 1993 | 0 | 5 | 2 | 0 | 0 | 0 |
Central African Republic | 1994 | 0 | 13 | 53 | 0 | 0 | 0 |
Cote d'Ivoire | 2012 | 0 | 80 | 0 | 0 | 0 | 0 |
Democratic Republic of the Congo | 2015 | 0 | 0 | 76 | 0 | 0 | 0 |
Equatorial Guinea | 2002 | 0 | 0 | 10 | 0 | 0 | 0 |
Gabon | 2000 | 0 | 0 | 69 | 0 | 0 | 0 |
Gambia, The | 2006 | 0 | 22 | 0 | 9 | 0 | 0 |
Gambia, The | 2011 | 0 | 6 | 0 | 68 | 0 | 0 |
Gambia, The | 2012 | 0 | 172 | 2 | 0 | 0 | 0 |
Ghana | 2012 | 0 | 64 | 36 | 0 | 0 | 0 |
Guinea | 2012 | 0 | 11 | 124 | 0 | 0 | 1 |
Guinea-Bissau | 2010 | 0 | 0 | 7 | 93 | 0 | 1 |
Kenya | 2000 | 0 | 0 | 19 | 0 | 0 | 0 |
Kenya | 2007 | 3 | 0 | 0 | 0 | 0 | 0 |
Kenya | 2012 | 10 | 0 | 0 | 0 | 54 | 0 |
Lab Cross | -1 | 0 | 0 | 0 | 0 | 0 | 297 |
Malawi | 2015 | 41 | 0 | 0 | 0 | 0 | 0 |
Mali | 2004 | 2 | 36 | 33 | 0 | 0 | 0 |
Mali | 2012 | 0 | 27 | 65 | 0 | 0 | 2 |
Mali | 2014 | 0 | 27 | 33 | 0 | 0 | 0 |
Mayotte | 2011 | 0 | 0 | 23 | 0 | 0 | 0 |
Mozambique | 2003 | 0 | 0 | 3 | 0 | 0 | 0 |
Mozambique | 2004 | 0 | 0 | 71 | 0 | 0 | 0 |
Tanzania | 2012 | 87 | 0 | 0 | 0 | 0 | 0 |
Tanzania | 2013 | 1 | 0 | 32 | 0 | 10 | 0 |
Tanzania | 2015 | 137 | 0 | 32 | 0 | 1 | 0 |
Uganda | 2012 | 82 | 0 | 207 | 0 | 0 | 1 |
For some countries there are data from multiple collection sites. Let’s inspect that for Burkina Faso by applying a query then creating a pivot table.
pivot_location_year_taxon_bf = (
    df_samples
    .query("country == 'Burkina Faso'")
    .pivot_table(
        index=["country", "admin1_name", "admin2_name", "location", "year"],
        columns=["taxon"],
        values="sample_id",
        aggfunc="count",
        fill_value=0
    )
)
pivot_location_year_taxon_bf
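country | admin1_name | admin2_name | location | year | arabiensis | coluzzii | gambiae |
---|---|---|---|---|---|---|---|
Burkina Faso | Centre-Sud | Bazega | Monomtenga | 2004 | 0 | 0 | 13 |
Burkina Faso | Hauts-Bassins | Houet | Bana Village | 2012 | 0 | 42 | 23 |
Burkina Faso | Hauts-Bassins | Houet | Bana Village | 2014 | 1 | 47 | 15 |
Burkina Faso | Hauts-Bassins | Houet | Pala | 2012 | 0 | 11 | 48 |
Burkina Faso | Hauts-Bassins | Houet | Pala | 2014 | 2 | 0 | 16 |
Burkina Faso | Hauts-Bassins | Houet | Souroukoudinga | 2012 | 0 | 29 | 28 |
Burkina Faso | Hauts-Bassins | Houet | Souroukoudinga | 2014 | 0 | 6 | 15 |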
We can see there are four collection sites in Burkina Faso.
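Rather than counting sites by eye, you can ask pandas for the number of distinct values in a column. The sketch below uses the location and year combinations visible in the pivot table above, with one row per combination rather than one row per sample. `df_sites` is a name made up for this example.

```python
import pandas as pd

# Location and year combinations taken from the pivot table above
# (one row per combination, not one row per sample).
df_sites = pd.DataFrame(
    {
        "location": [
            "Monomtenga", "Bana Village", "Bana Village", "Pala",
            "Pala", "Souroukoudinga", "Souroukoudinga",
        ],
        "year": [2004, 2012, 2014, 2012, 2014, 2012, 2014],
    }
)

# Number of distinct collection sites.
n_sites = df_sites["location"].nunique()

# Collection years available for each site.
years_per_site = df_sites.groupby("location")["year"].apply(list)

print(n_sites)
print(years_per_site)
```

On the full metadata, the equivalent would be to apply nunique() to the location column of the queried `df_samples` rows.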
Plotting maps of sampling locations#
To explore the different mosquito collection locations it can also be useful to plot some maps. You can plot maps within a notebook using various packages such as ipyleaflet. Let’s install the ipyleaflet package.
%pip install -qq ipyleaflet
Note: you may need to restart the kernel to use updated packages.
Now import the ipyleaflet package.
import ipyleaflet
Creating an interactive map is straightforward using the Map() function. Here is a world map centered on Africa. The map is interactive, so you can pan and zoom.
m = ipyleaflet.Map(
    basemap=ipyleaflet.basemaps.OpenStreetMap.Mapnik,
    center=[0, 20],
    zoom=3,
)
m
Let’s now plot a map, adding in markers for all of the locations where we have mosquitoes. First create a pivot table with the location data we need.
pivot_location_taxon = (
    df_samples
    .pivot_table(
        index=["country", "location", "latitude", "longitude"],
        columns=["taxon"],
        values="sample_id",
        aggfunc="count",
        fill_value=0,
    )
)
pivot_location_taxon
country | location | latitude | longitude | arabiensis | coluzzii | gambiae | gcx1 | gcx3 | unassigned |
---|---|---|---|---|---|---|---|---|---|
Angola | Luanda | -8.884 | 13.302 | 0 | 81 | 0 | 0 | 0 | 0 |
Burkina Faso | Bana Village | 11.233 | -4.472 | 1 | 89 | 38 | 0 | 0 | 0 |
Burkina Faso | Monomtenga | 12.060 | -1.170 | 0 | 0 | 13 | 0 | 0 | 0 |
Burkina Faso | Pala | 11.151 | -4.235 | 2 | 11 | 64 | 0 | 0 | 0 |
Burkina Faso | Souroukoudinga | 11.238 | -4.235 | 0 | 35 | 43 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Tanzania | Muheza | -4.940 | 38.948 | 1 | 0 | 32 | 0 | 10 | 0 |
Tanzania | Muleba | -1.962 | 31.621 | 137 | 0 | 32 | 0 | 1 | 0 |
Tanzania | Tarime | -1.431 | 34.199 | 47 | 0 | 0 | 0 | 0 | 0 |
Uganda | Kihihi | -0.751 | 29.701 | 1 | 0 | 95 | 0 | 0 | 0 |
Uganda | Nagongera | 0.770 | 34.026 | 81 | 0 | 112 | 0 | 0 | 1 |
127 rows × 6 columns
Now create a map with markers.
# create a map
m = ipyleaflet.Map(
    basemap=ipyleaflet.basemaps.OpenStreetMap.Mapnik,
    center=[0, 20],
    zoom=3,
)
# add markers for sampling locations
for row in pivot_location_taxon.reset_index().itertuples():
    title = (
        f"{row.location}, {row.country} ({row.latitude:.3f}, {row.longitude:.3f})\n"
        f"{row.gambiae} gambiae, {row.coluzzii} coluzzii, {row.arabiensis} arabiensis"
    )
    marker = ipyleaflet.Marker(
        location=(row.latitude, row.longitude),
        draggable=False,
        title=title,
    )
    m.add_layer(marker)
# add a scale bar
m.add_control(ipyleaflet.ScaleControl(position="bottomleft"))
# display the map
m
Try hovering over the markers. You should see some text summarising how many samples are available by species.
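The marker loop relies on reset_index() to turn the pivot table's index levels (country, location, latitude, longitude) back into ordinary columns, so that itertuples() exposes them as attributes on each row. The sketch below isolates that step without drawing a map. It uses two locations copied from the pivot table above and keeps only two taxon columns for brevity. `pivot_toy` and `titles` are names made up for this example.

```python
import pandas as pd

# Two locations copied from the pivot table above,
# keeping only two of the taxon columns for brevity.
pivot_toy = pd.DataFrame(
    {"coluzzii": [81, 11], "gambiae": [0, 64]},
    index=pd.MultiIndex.from_tuples(
        [
            ("Angola", "Luanda", -8.884, 13.302),
            ("Burkina Faso", "Pala", 11.151, -4.235),
        ],
        names=["country", "location", "latitude", "longitude"],
    ),
)

# reset_index() turns the index levels into ordinary columns,
# so each row tuple exposes them as attributes.
titles = [
    f"{row.location}, {row.country}: {row.gambiae} gambiae, {row.coluzzii} coluzzii"
    for row in pivot_toy.reset_index().itertuples()
]

for title in titles:
    print(title)
```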
Practical exercises#
English#
Open this notebook in Google Colab and run it for yourself from top to bottom. Hint: click the rocket icon at the top of the page, then select “Colab” from the drop-down menu. When Colab opens, click the “Edit” menu, then select “Clear all outputs”, then begin running the cells.
Find out how many mosquito specimens are available for each of the different Anopheles species represented. Hint: try grouping the sample metadata dataframe by the “taxon” column, then calling the size() method.
Make a pivot table that shows how many samples are available in the Ag3.0 resource that were collected in Mali, summarised by year, location and taxon. Now try Cameroon, or any other country of interest.
How many countries are there for which we have some samples of Anopheles coluzzii? What about Anopheles arabiensis and Anopheles gambiae? Hint: Make a pivot table by country and taxon, and then query it.
Plot a map of all sampling locations, changing the basemap parameter to show a different background map. Hint: see the ipyleaflet basemaps documentation for a list of available options.
Plot a map that starts centered and zoomed in to Uganda, or any other country of interest. Hint: change the center and zoom parameters when calling the ipyleaflet Map() function.
Plot a map showing only locations where we have samples of Anopheles coluzzii. Now try Anopheles arabiensis or Anopheles gambiae.
If you feel like a challenge, plot a map with markers for sampling locations, and add a popup to each marker showing a pivot table of how many samples were collected by year and species.