Workshop 2 - Training course in data analysis for genomic surveillance of African malaria vectors
Module 4 - Analysing CNV frequencies at metabolic resistance genes#
Theme: Analysis
In this module we’re going to analyse copy number variation in genes that are known to be associated with metabolic insecticide resistance in the Anopheles gambiae complex, particularly cytochrome P450 (Cyp) genes. We will compute the frequency of copy number variation in these genes of interest, and compare these frequencies between mosquitoes from different time points, geographical locations and species.
Learning objectives#
At the end of this module you will be able to:
Compute the frequency of gene copy number variation in different mosquito cohorts.
Extract and analyse CNV frequencies for metabolic resistance genes.
Visualise frequencies using heatmaps, time series plots and maps.
Lecture#
English#
Français#
Please note that the code in the cells below might differ from that shown in the video. This can happen because Python packages and their dependencies change due to updates, necessitating tweaks to the code.
Setup#
First, let’s install the python packages we will need for our analyses.
%pip install -q --no-warn-conflicts malariagen_data
Now we’ve installed malariagen_data
, we can import it into our environment and set it up to access data in the cloud.
Note that authentication is required to access data through the package, please follow the instructions here.
import malariagen_data
ag3 = malariagen_data.Ag3()
ag3
MalariaGEN Ag3 API client | |
---|---|
Please note that data are subject to terms of use, for more information see the MalariaGEN website or contact support@malariagen.net. See also the Ag3 API docs. | |
Storage URL | gs://vo_agam_release/ |
Data releases available | 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9 |
Results cache | None |
Cohorts analysis | 20240418 |
AIM analysis | 20220528 |
Site filters analysis | dt_20200416 |
Software version | malariagen_data 9.0.0 |
Client location | unknown |
Above we can see some information about how access to Anopheles gambiae genomic data from MalariaGEN has been configured. This tells us where the data are stored (the “gs” stands for Google Cloud Storage), which data releases are available to analyse (Ag3.0) and what the default analyses are for cohorts, species and site filters. We can also see that we are using version 8.7.0 of the malariagen_data
Python package.
Summarising copy number variation by gene#
For surveillance analyses, it can be convenient to summarise the copy number data for each individual by gene. This allows us to ask questions like, for some metabolic resistance gene of interest, does a given mosquito have any copy number variation at that gene?
We’ve seen in the previous module how the copy number variation data is calculated in 300 base pair windows over the genome, but in general genes are larger than 300 base pairs and so will span multiple windows. So how do we summarise the windowed copy number data by gene?
One way to do this is to compute some kind of average over all the windows overlapping a given gene of interest. Because copy number state is a discrete variable, we typically prefer to do this by computing the modal copy number for windows overlapping a gene.
The diagram below shows the windowed copy number for a single mosquito for the region of the genome spanning several metabolic resistance genes. We illustrate how the summarisation is done to compute modal copy number by gene.
Computing gene CNV frequencies#
Now that we know how to summarise the data by gene, the next step is to then summarise these data over multiple individual mosquitoes, so that we can estimate the fraction of mosquitoes from a given time, place and species that carry some form of copy number variation at a given gene of interest. This is ultimately what we are interested in, because we would like to know if metabolic resistance is at low or high frequency and whether this varies over time or between geographical locations or species.
Because copy number variation is more complex than single nucleotide polymorphism, we compute the frequency of CNVs in a simplified way. We compute the fraction of individuals that carry some kind of copy number variation at a given gene, ignoring whether the individual was heterozygous or homozygous for that CNV, and also ignoring the number of additional copies that may be present. We also divide this into two types of CNV - amplification and deletion. So, for example, we compute the fraction of individuals which have an amplification at some gene; and similarly we compute the fraction with some deletion. Although generally we’re more interested in amplification, because amplification of metabolic resistance genes can cause increased insecticide resistance.
The diagram below illustrates this simple calculation of amplification and deletion frequencies for some genes of interest.
Let’s now compute gene CNV frequencies. To do this we can use the gene_cnv_frequencies()
function.
We can look at the documentation for this function by using “?”.
ag3.gene_cnv_frequencies?
The documentation shows us that there are two required parameters for this function. The region
parameter is the genomic region that we want to analyse, and the cohorts
parameter determines how samples are to be grouped into cohorts.
Let’s initially analyse all genes in the Cyp6aa/p cluster on chromosome arm 2R, which starts around position 28,450,000 and ends at around position 28,510,000.
cyp6aap_region = "2R:28,480,000-28,510,000"
There are different ways to group our mosquitoes in to cohorts (more on this later) but for this analysis we’ll group spatially by level 1 administrative divisions within countries, and temporally by year.
cohorts = "admin1_year"
Were also going to use one of the optional parameters, sample_sets
, to analyse just a subset the data.
Let’s have a look at what sample sets are available in the Ag3.0 data resource.
ag3.sample_sets(release='3.0')[['sample_set','sample_count','study_id']]
sample_set | sample_count | study_id | |
---|---|---|---|
0 | AG1000G-AO | 81 | AG1000G-AO |
1 | AG1000G-BF-A | 181 | AG1000G-BF-1 |
2 | AG1000G-BF-B | 102 | AG1000G-BF-1 |
3 | AG1000G-BF-C | 13 | AG1000G-BF-2 |
4 | AG1000G-CD | 76 | AG1000G-CD |
5 | AG1000G-CF | 73 | AG1000G-CF |
6 | AG1000G-CI | 80 | AG1000G-CI |
7 | AG1000G-CM-A | 303 | AG1000G-CM-1 |
8 | AG1000G-CM-B | 97 | AG1000G-CM-2 |
9 | AG1000G-CM-C | 44 | AG1000G-CM-3 |
10 | AG1000G-FR | 23 | AG1000G-FR |
11 | AG1000G-GA-A | 69 | AG1000G-GA-1 |
12 | AG1000G-GH | 100 | AG1000G-GH |
13 | AG1000G-GM-A | 74 | AG1000G-GM-1 |
14 | AG1000G-GM-B | 31 | AG1000G-GM-2 |
15 | AG1000G-GM-C | 174 | AG1000G-GM-3 |
16 | AG1000G-GN-A | 45 | AG1000G-GN-ML |
17 | AG1000G-GN-B | 185 | AG1000G-GN-ML |
18 | AG1000G-GQ | 10 | AG1000G-GQ |
19 | AG1000G-GW | 101 | AG1000G-GW |
20 | AG1000G-KE | 86 | AG1000G-KE |
21 | AG1000G-ML-A | 60 | AG1000G-ML-1 |
22 | AG1000G-ML-B | 71 | AG1000G-ML-2 |
23 | AG1000G-MW | 41 | AG1000G-MW |
24 | AG1000G-MZ | 74 | AG1000G-MZ |
25 | AG1000G-TZ | 300 | AG1000G-TZ |
26 | AG1000G-UG | 290 | AG1000G-UG |
27 | AG1000G-X | 669 | AG1000G-X |
To keep things simple, for this module we’ll focus on mosquitoes from Burkina Faso. There are three sample sets in the Ag3.0 resource providing data on mosquitoes from Burkina Faso.
sample_sets = ["AG1000G-BF-A", "AG1000G-BF-B", "AG1000G-BF-C"]
Now we’re ready to run the analysis.
burkina_cyp6aap_cnv_freqs_df = ag3.gene_cnv_frequencies(
region=cyp6aap_region,
cohorts=cohorts,
sample_sets=sample_sets,
drop_invariant=False,
)
burkina_cyp6aap_cnv_freqs_df
gene_strand | gene_description | contig | start | end | frq_BF-07_gamb_2004 | frq_BF-09_colu_2012 | frq_BF-09_colu_2014 | frq_BF-09_gamb_2012 | frq_BF-09_gamb_2014 | max_af | windows | label | |||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
gene_id | gene_name | cnv_type | |||||||||||||
AGAP002859 | NaN | amp | + | solute carrier family 8 (sodium/calcium exchan... | 2R | 28397312 | 28516028 | 0.0 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 397 | AGAP002859 amp |
del | + | solute carrier family 8 (sodium/calcium exchan... | 2R | 28397312 | 28516028 | 0.0 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 397 | AGAP002859 del | ||
AGAP002862 | CYP6AA1 | amp | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28480576 | 28482637 | 0.0 | 0.9125 | 0.811321 | 0.040816 | 0.065217 | 0.912500 | 8 | AGAP002862 (CYP6AA1) amp |
del | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28480576 | 28482637 | 0.0 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 8 | AGAP002862 (CYP6AA1) del | ||
AGAP013128 | CYP6AA2 | amp | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28483301 | 28484921 | 0.0 | 0.8250 | 0.792453 | 0.040816 | 0.065217 | 0.825000 | 6 | AGAP013128 (CYP6AA2) amp |
del | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28483301 | 28484921 | 0.0 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 6 | AGAP013128 (CYP6AA2) del | ||
AGAP002863 | COEAE6O | amp | - | carboxylesterase alpha esterase [Source:VB Com... | 2R | 28485262 | 28487080 | 0.0 | 0.6000 | 0.509434 | 0.010204 | 0.021739 | 0.600000 | 7 | AGAP002863 (COEAE6O) amp |
del | - | carboxylesterase alpha esterase [Source:VB Com... | 2R | 28485262 | 28487080 | 0.0 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 7 | AGAP002863 (COEAE6O) del | ||
AGAP002864 | CYP6P15P | amp | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28487640 | 28489092 | 0.0 | 0.5250 | 0.528302 | 0.010204 | 0.021739 | 0.528302 | 6 | AGAP002864 (CYP6P15P) amp |
del | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28487640 | 28489092 | 0.0 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 6 | AGAP002864 (CYP6P15P) del | ||
AGAP002865 | CYP6P3 | amp | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28491415 | 28493141 | 0.0 | 0.0375 | 0.075472 | 0.010204 | 0.021739 | 0.075472 | 7 | AGAP002865 (CYP6P3) amp |
del | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28491415 | 28493141 | 0.0 | 0.0000 | 0.000000 | 0.112245 | 0.000000 | 0.112245 | 7 | AGAP002865 (CYP6P3) del | ||
AGAP002866 | CYP6P5 | amp | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28494017 | 28495645 | 0.0 | 0.0375 | 0.056604 | 0.020408 | 0.021739 | 0.056604 | 6 | AGAP002866 (CYP6P5) amp |
del | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28494017 | 28495645 | 0.0 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 6 | AGAP002866 (CYP6P5) del | ||
AGAP002867 | CYP6P4 | amp | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28497087 | 28498674 | 0.0 | 0.0375 | 0.075472 | 0.010204 | 0.021739 | 0.075472 | 6 | AGAP002867 (CYP6P4) amp |
del | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28497087 | 28498674 | 0.0 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 6 | AGAP002867 (CYP6P4) del | ||
AGAP002868 | CYP6P1 | amp | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28499251 | 28500900 | 0.0 | 0.0375 | 0.056604 | 0.010204 | 0.021739 | 0.056604 | 6 | AGAP002868 (CYP6P1) amp |
del | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28499251 | 28500900 | 0.0 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 6 | AGAP002868 (CYP6P1) del | ||
AGAP002869 | CYP6P2 | amp | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28501033 | 28502910 | 0.0 | 0.0375 | 0.056604 | 0.010204 | 0.021739 | 0.056604 | 7 | AGAP002869 (CYP6P2) amp |
del | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28501033 | 28502910 | 0.0 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 7 | AGAP002869 (CYP6P2) del | ||
AGAP002870 | CYP6AD1 | amp | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28504248 | 28505816 | 0.0 | 0.0375 | 0.056604 | 0.010204 | 0.021739 | 0.056604 | 6 | AGAP002870 (CYP6AD1) amp |
del | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28504248 | 28505816 | 0.0 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 6 | AGAP002870 (CYP6AD1) del |
The function outputs a pandas DataFrame, where rows represent gene CNVs. There are two rows for each gene, one representing an amplification (“amp” - increase in copy number), the other representing a deletion (“del” - decrease in copy number). There are 11 annotated genes on the chromosome arms we selected, so there are 2 times 11 = 22 rows in this dataframe.
len(burkina_cyp6aap_cnv_freqs_df)
22
Let’s look at the columns that provide information about the genes.
burkina_cyp6aap_cnv_freqs_df[["gene_strand", "gene_description", "contig", "start", "end"]]
gene_strand | gene_description | contig | start | end | |||
---|---|---|---|---|---|---|---|
gene_id | gene_name | cnv_type | |||||
AGAP002859 | NaN | amp | + | solute carrier family 8 (sodium/calcium exchan... | 2R | 28397312 | 28516028 |
del | + | solute carrier family 8 (sodium/calcium exchan... | 2R | 28397312 | 28516028 | ||
AGAP002862 | CYP6AA1 | amp | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28480576 | 28482637 |
del | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28480576 | 28482637 | ||
AGAP013128 | CYP6AA2 | amp | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28483301 | 28484921 |
del | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28483301 | 28484921 | ||
AGAP002863 | COEAE6O | amp | - | carboxylesterase alpha esterase [Source:VB Com... | 2R | 28485262 | 28487080 |
del | - | carboxylesterase alpha esterase [Source:VB Com... | 2R | 28485262 | 28487080 | ||
AGAP002864 | CYP6P15P | amp | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28487640 | 28489092 |
del | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28487640 | 28489092 | ||
AGAP002865 | CYP6P3 | amp | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28491415 | 28493141 |
del | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28491415 | 28493141 | ||
AGAP002866 | CYP6P5 | amp | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28494017 | 28495645 |
del | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28494017 | 28495645 | ||
AGAP002867 | CYP6P4 | amp | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28497087 | 28498674 |
del | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28497087 | 28498674 | ||
AGAP002868 | CYP6P1 | amp | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28499251 | 28500900 |
del | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28499251 | 28500900 | ||
AGAP002869 | CYP6P2 | amp | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28501033 | 28502910 |
del | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28501033 | 28502910 | ||
AGAP002870 | CYP6AD1 | amp | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28504248 | 28505816 |
del | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28504248 | 28505816 |
Reminder - cohorts#
When we want to analyse frequencies we first define cohorts, which are simply groups of mosquitoes that we have data for and which we take as representatives of a mosquito population from a given time, place and species.
To define these cohorts we group mosquitoes according to these different variables, i.e., geographically, temporally and according to species (taxon). There are different options for doing this, depending on how fine-grained we want our analysis to be, and how many samples are available.
When computing frequencies and choosing cohorts, we need to think about two factors:
Cohort size - The larger the cohort the lower the uncertainty and the more confident we can be that the frequency we calculate within the cohort is representative of the frequency in the population from which the cohort was sampled. By default, the functions that compute frequencies in the
malariagen_data
package require a minimum cohort size of 10 individuals.Heterogeneity - If we group samples together from a wide geographical area, or from a long time period, we may be losing information about any smaller-scale variation.
Often we have to make a compromise, and do the best we can with available data.
In the example above, we provided the cohorts="admin1_year"
parameter, which means that samples will be grouped by species (taxon), administrative level 1 divisions, and year of collection. This gives us frequencies for five cohorts. Let’s look at the cohort frequency columns.
frequency_columns = [
col for col in burkina_cyp6aap_cnv_freqs_df.columns
if col.startswith("frq_")
]
frequency_columns
['frq_BF-07_gamb_2004',
'frq_BF-09_colu_2012',
'frq_BF-09_colu_2014',
'frq_BF-09_gamb_2012',
'frq_BF-09_gamb_2014']
Remember, e.g., “frq_BF-09_gamb_2012” means the frequency of the allele within the cohort of Anopheles gambiae mosquitoes from the BF-09 (Haut-Bassins) region collected in 2012.
Let’s look at the values of these frequency columns in our DataFrame.
burkina_cyp6aap_cnv_freqs_df[frequency_columns + ["max_af", "windows"]]
frq_BF-07_gamb_2004 | frq_BF-09_colu_2012 | frq_BF-09_colu_2014 | frq_BF-09_gamb_2012 | frq_BF-09_gamb_2014 | max_af | windows | |||
---|---|---|---|---|---|---|---|---|---|
gene_id | gene_name | cnv_type | |||||||
AGAP002859 | NaN | amp | 0.0 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 397 |
del | 0.0 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 397 | ||
AGAP002862 | CYP6AA1 | amp | 0.0 | 0.9125 | 0.811321 | 0.040816 | 0.065217 | 0.912500 | 8 |
del | 0.0 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 8 | ||
AGAP013128 | CYP6AA2 | amp | 0.0 | 0.8250 | 0.792453 | 0.040816 | 0.065217 | 0.825000 | 6 |
del | 0.0 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 6 | ||
AGAP002863 | COEAE6O | amp | 0.0 | 0.6000 | 0.509434 | 0.010204 | 0.021739 | 0.600000 | 7 |
del | 0.0 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 7 | ||
AGAP002864 | CYP6P15P | amp | 0.0 | 0.5250 | 0.528302 | 0.010204 | 0.021739 | 0.528302 | 6 |
del | 0.0 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 6 | ||
AGAP002865 | CYP6P3 | amp | 0.0 | 0.0375 | 0.075472 | 0.010204 | 0.021739 | 0.075472 | 7 |
del | 0.0 | 0.0000 | 0.000000 | 0.112245 | 0.000000 | 0.112245 | 7 | ||
AGAP002866 | CYP6P5 | amp | 0.0 | 0.0375 | 0.056604 | 0.020408 | 0.021739 | 0.056604 | 6 |
del | 0.0 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 6 | ||
AGAP002867 | CYP6P4 | amp | 0.0 | 0.0375 | 0.075472 | 0.010204 | 0.021739 | 0.075472 | 6 |
del | 0.0 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 6 | ||
AGAP002868 | CYP6P1 | amp | 0.0 | 0.0375 | 0.056604 | 0.010204 | 0.021739 | 0.056604 | 6 |
del | 0.0 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 6 | ||
AGAP002869 | CYP6P2 | amp | 0.0 | 0.0375 | 0.056604 | 0.010204 | 0.021739 | 0.056604 | 7 |
del | 0.0 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 7 | ||
AGAP002870 | CYP6AD1 | amp | 0.0 | 0.0375 | 0.056604 | 0.010204 | 0.021739 | 0.056604 | 6 |
del | 0.0 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 6 |
Remember frequency values range between 0 and 1. 0 means no individuals have the variant. 1 means all individuals have the variant. 0.5 means 50% of individuals have the variant.
The max_af
column simply shows the highest frequency found in the cohorts and is useful for filtering the output down to just CNVs at appreciable frequency in at least one cohort, and the windows
columns shows how many 300 bp windows the frequency is calculated over.
Plotting frequency heatmaps#
Just like we could with SNP allele frequencies, we can also make gene cnv frequency DataFrames easier to interpret by plotting them as a heat map.
Let’s also filter our DataFrame by max_af
to just CNVs present in at least one cohort over 5%, using a pandas query.
ag3.plot_frequencies_heatmap(
burkina_cyp6aap_cnv_freqs_df.query("max_af > 0.05"),
title="Gene CNV frequencies, Burkina Faso, Cyp6aa/p locus")
We can see clearly now that amplifications appear to be at high frequency within An. coluzzii but not An. gambiae cohorts, and the amplifications appear to be highest for the Cyp6aa1 gene.
Exercise 1 (English)
Recall that there are several other cytochrome P450 (Cyp) genes and gene clusters associated with insecticide resistance in the genome, so let’s make a new region query to look at some other Cyp genes of interest and visualise them as heatmaps.
First, have a look at the Cyp9k1 gene on the X chromosome (15,240,000-15,250,000).
Hint: cyp9k1_region = "X:15,240,000-15,250,000"
Exercice 1 (Français)
Il y a d’autres gènes cytochrome P450 (Cyp) et groupes de gènes associés à la réistance aux insecticides dans le génome, utilisons donc une autre requête pour obtenir une région contenant d’autres gènes Cyp et les visualiser sous forme de heatmap.
D’abord, regardons le gène Cyp9k1 sur le chromosome X (15,240,000-15,250,000).
Indice: cyp9k1_region = "X:15,240,000-15,250,000"
# cyp9k1_region = ???
# burkina_cyp9k1_cnv_freqs_df = ag3.gene_cnv_frequencies(
# region=cyp9k1_region,
# cohorts="admin1_year",
# sample_sets=["AG1000G-BF-A", "AG1000G-BF-B", "AG1000G-BF-C"],
# )
# ag3.plot_frequencies_heatmap(
# burkina_cyp9k1_cnv_freqs_df.query("max_af > 0.05"),
# )
Exercise 2 (English)
Next, try the Cyp6mz region.
Hint: cyp6mz_region = "3R:6,924,000-6,980,000"
Exercice 2 (Français)
Essayons ensuite la région de Cyp6mz.
Indice: cyp6mz_region = "3R:6,924,000-6,980,000"
# cyp6mz_region = ???
# burkina_cyp6mz_cnv_freqs_df = ag3.gene_cnv_frequencies(
# region=cyp6mz_region,
# cohorts="admin1_year",
# sample_sets=["AG1000G-BF-A", "AG1000G-BF-B", "AG1000G-BF-C"],
# )
# ag3.plot_frequencies_heatmap(
# burkina_cyp6mz_cnv_freqs_df.query("max_af > 0.05"),
# )
Exercise 3 (English)
Let’s also have a look at the Gste gene cluster on chromosome arm 3R. What patterns can you see?
Hint: gste_region = "3R:28,590,000-28,610,000"
Exercice 3 (Français)
Regardons aussi le groupe de gènes Gste sur le bras de chromosome 3R. Quel motif observez-vous?
Indice: gste_region = "3R:28,590,000-28,610,000"
# gste_region = ???
# burkina_gste_cnv_freqs_df = ag3.gene_cnv_frequencies(
# region=???,
# cohorts="admin1_year",
# sample_sets=["AG1000G-BF-A", "AG1000G-BF-B", "AG1000G-BF-C"],
# )
# ag3.plot_frequencies_heatmap(
# burkina_gste_cnv_freqs_df.query("max_af > 0.05"),
# )
Using sample queries#
Sometimes you may want to analyse data across many sample sets, but select only samples from a single country or species. The useful sample_query
parameter enables us to achieve this easily.
For example, if we wanted to analyse the Cyp6aa/p gene cluster in all of the Ag3.0 sample sets, but only look at Anopheles coluzzii samples in those sample sets, then we would set the sample_sets
parameter to “3.0”…
sample_sets = "3.0"
… and pass a pandas query string to sample_query
.
sample_query = "taxon == 'coluzzii'"
coluzzii_cyp6aap_cnv_freqs_df = ag3.gene_cnv_frequencies(
region=cyp6aap_region,
cohorts=cohorts,
sample_sets=sample_sets,
sample_query=sample_query,
)
coluzzii_cyp6aap_cnv_freqs_df
gene_strand | gene_description | contig | start | end | frq_AO-LUA_colu_2009 | frq_BF-09_colu_2012 | frq_BF-09_colu_2014 | frq_CF-BGF_colu_1994 | frq_CI-LG_colu_2012 | ... | frq_GH-WP_colu_2012 | frq_GM-M_colu_2012 | frq_GN-N_colu_2012 | frq_ML-2_colu_2004 | frq_ML-2_colu_2014 | frq_ML-3_colu_2012 | frq_ML-4_colu_2004 | max_af | windows | label | |||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
gene_id | gene_name | cnv_type | |||||||||||||||||||||
AGAP002862 | CYP6AA1 | amp | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28480576 | 28482637 | 0.0 | 0.9125 | 0.811321 | 0.0 | 0.870130 | ... | 0.125000 | 0.046784 | 0.8 | 0.8 | 0.526316 | 0.846154 | 0.260870 | 0.912500 | 8 | AGAP002862 (CYP6AA1) amp |
AGAP013128 | CYP6AA2 | amp | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28483301 | 28484921 | 0.0 | 0.8250 | 0.792453 | 0.0 | 0.870130 | ... | 0.125000 | 0.029240 | 0.8 | 0.8 | 0.421053 | 0.769231 | 0.260870 | 0.870130 | 6 | AGAP013128 (CYP6AA2) amp |
AGAP002863 | COEAE6O | amp | - | carboxylesterase alpha esterase [Source:VB Com... | 2R | 28485262 | 28487080 | 0.0 | 0.6000 | 0.509434 | 0.0 | 0.740260 | ... | 0.083333 | 0.000000 | 0.1 | 0.0 | 0.263158 | 0.461538 | 0.043478 | 0.740260 | 7 | AGAP002863 (COEAE6O) amp |
AGAP002864 | CYP6P15P | amp | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28487640 | 28489092 | 0.0 | 0.5250 | 0.528302 | 0.0 | 0.922078 | ... | 0.083333 | 0.000000 | 0.0 | 0.2 | 0.263158 | 0.346154 | 0.043478 | 0.922078 | 6 | AGAP002864 (CYP6P15P) amp |
AGAP002865 | CYP6P3 | amp | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28491415 | 28493141 | 0.0 | 0.0375 | 0.075472 | 0.0 | 0.922078 | ... | 0.041667 | 0.000000 | 0.0 | 0.2 | 0.000000 | 0.038462 | 0.000000 | 0.922078 | 7 | AGAP002865 (CYP6P3) amp |
AGAP002866 | CYP6P5 | amp | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28494017 | 28495645 | 0.0 | 0.0375 | 0.056604 | 0.0 | 0.922078 | ... | 0.000000 | 0.000000 | 0.0 | 0.2 | 0.000000 | 0.038462 | 0.000000 | 0.922078 | 6 | AGAP002866 (CYP6P5) amp |
AGAP002867 | CYP6P4 | amp | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28497087 | 28498674 | 0.0 | 0.0375 | 0.075472 | 0.0 | 0.922078 | ... | 0.000000 | 0.000000 | 0.0 | 0.2 | 0.000000 | 0.038462 | 0.000000 | 0.922078 | 6 | AGAP002867 (CYP6P4) amp |
AGAP002868 | CYP6P1 | amp | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28499251 | 28500900 | 0.0 | 0.0375 | 0.056604 | 0.0 | 0.909091 | ... | 0.000000 | 0.000000 | 0.0 | 0.2 | 0.000000 | 0.038462 | 0.000000 | 0.909091 | 6 | AGAP002868 (CYP6P1) amp |
AGAP002869 | CYP6P2 | amp | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28501033 | 28502910 | 0.0 | 0.0375 | 0.056604 | 0.0 | 0.922078 | ... | 0.000000 | 0.000000 | 0.0 | 0.2 | 0.000000 | 0.038462 | 0.000000 | 0.922078 | 7 | AGAP002869 (CYP6P2) amp |
del | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28501033 | 28502910 | 0.0 | 0.0000 | 0.000000 | 0.0 | 0.000000 | ... | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.040000 | 7 | AGAP002869 (CYP6P2) del | ||
AGAP002870 | CYP6AD1 | amp | - | cytochrome P450 [Source:VB Community Annotation] | 2R | 28504248 | 28505816 | 0.0 | 0.0375 | 0.056604 | 0.0 | 0.922078 | ... | 0.000000 | 0.000000 | 0.0 | 0.2 | 0.000000 | 0.038462 | 0.000000 | 0.922078 | 6 | AGAP002870 (CYP6AD1) amp |
11 rows × 22 columns
We can see that our query string has worked by listing the cohort frequency (“frq”) columns in our output DataFrame. Note that the taxon is “colu” in each cohort.
colu_frequency_columns = [
col for col in coluzzii_cyp6aap_cnv_freqs_df.columns
if col.startswith("frq_")
]
colu_frequency_columns
['frq_AO-LUA_colu_2009',
'frq_BF-09_colu_2012',
'frq_BF-09_colu_2014',
'frq_CF-BGF_colu_1994',
'frq_CI-LG_colu_2012',
'frq_GH-AA_colu_2012',
'frq_GH-CP_colu_2012',
'frq_GH-WP_colu_2012',
'frq_GM-M_colu_2012',
'frq_GN-N_colu_2012',
'frq_ML-2_colu_2004',
'frq_ML-2_colu_2014',
'frq_ML-3_colu_2012',
'frq_ML-4_colu_2004']
Now we can visualise data for a locus of interest from this output using a heatmap, just as we did before.
ag3.plot_frequencies_heatmap(
coluzzii_cyp6aap_cnv_freqs_df.query("max_af > 0.05"),
title="Gene CNV frequencies, An. coluzzii, Cyp6aa/p locus"
)
We can see CNV frequencies appear to differ between geographical locations drastically. For example, Cote D’Ivoire has the highest frequencies across the most genes in the cluster (“CI-LG_colu_2012”), but the collection from the same year in neighbouring Ghana (“GH-AA_colu_2012”) has no evidence for CNVs at this locus.
Exercise 4 (English)
Look at the same Cypaa/p locus, but this time at only An. gambiae. What patterns can you see in the results?
Hint: change sample_query
taxon to “gambiae”.
Exercice 4 (Français)
Observer le même locus Cypaa/p, mais cette fois seulement pour An. gambiae. Quels motifs pouvez-vous observer dans ces résultats?
Indice: remplacer le taxon dans sample_query
par “gambiae”.
# sample_query = ???
# gambiae_cyp6aap_cnv_freqs_df = ag3.gene_cnv_frequencies(
# region=cyp6aap_region,
# cohorts=cohorts,
# sample_sets=sample_sets,
# sample_query=sample_query,
# )
# ag3.plot_frequencies_heatmap(
# gambiae_cyp6aap_cnv_freqs_df.query("max_af > 0.05"),
# title="Gene CNV frequencies, An. gambiae, Cyp6aa/p locus"
# )
Plotting frequency time series#
For surveillance it can be very informative to observe changes in CNV frequencies over time for a given sampling location and species. CNVs under selection due to insecticide resistance will increase in frequency over time, and plotting the frequency data as a time series can help to see this. The malariagen_data
package includes a function plot_frequencies_time_series()
for plotting frequency time series. These are a little more advanced to use, but we’ll illustrate them below.
N.B., in the Ag3.0 data resource there is only one country where we have data from multiple time points, which is Burkina Faso. However, data in future releases will contain a number of time series, and so it’s useful to illustrate this functionality now.
First we need to compute the gene CNV frequencies. To obtain the necessary data for time series plotting, we need to use a slightly different function for computing the frequencies - gene_cnv_frequencies_advanced()
.
ag3.gene_cnv_frequencies_advanced?
We can see alongside region
there are a couple of new required parameters for this function.
The area_by
parameter determines how to group samples in space - we’re going to use “admin1_iso” here, which uses the ISO codes for level 1 administrative divisions. (Another option would be “admin2_name” to group by level 2 administrative divisions.)
The period_by
parameter determines how to group samples in time, let’s do this by “year”.
So let’s look at the Cyp6aa/p locus, again using just samples from Burkina Faso.
burkina_cyp6aap_cnv_ds = ag3.gene_cnv_frequencies_advanced(
region=cyp6aap_region,
area_by="admin1_iso",
period_by="year",
sample_sets="3.0",
sample_query="country == 'Burkina Faso'",
variant_query="max_af > 0.05"
)
burkina_cyp6aap_cnv_ds
/home/jonbrenas/.local/lib/python3.10/site-packages/malariagen_data/anoph/snp_frq.py:1203: FutureWarning:
'A' is deprecated and will be removed in a future version, please use 'Y' instead.
<xarray.Dataset> Size: 4kB Dimensions: (cohorts: 5, variants: 11) Dimensions without coordinates: cohorts, variants Data variables: (12/28) cohort_area (cohorts) object 40B 'BF-09' 'BF-09' ... 'BF-09' cohort_label (cohorts) object 40B 'BF-09_colu_2012' ... 'BF-09... cohort_lat_max (cohorts) float64 40B 11.24 11.24 12.06 11.24 11.24 cohort_lat_mean (cohorts) float64 40B 11.22 11.23 12.06 11.2 11.21 cohort_lat_min (cohorts) float64 40B 11.15 11.23 12.06 11.15 11.15 cohort_lon_max (cohorts) float64 40B -4.235 -4.235 ... -4.235 ... ... variant_gene_name (variants) object 88B 'CYP6AA1' ... 'CYP6AD1' variant_gene_strand (variants) object 88B '-' '-' '-' ... '-' '-' '-' variant_label (variants) object 88B 'AGAP002862 (CYP6AA1) amp' ... variant_max_af (variants) float64 88B 0.9125 0.825 ... 0.0566 variant_start (variants) int64 88B 28480576 28483301 ... 28504248 variant_windows (variants) int64 88B 8 6 7 6 7 7 6 6 6 7 6 Attributes: title: Gene CNV frequencies (2R:28,480,000-28,510,000)
Instead of a pandas DataFrame, this function outputs an xarray Dataset containing the gene cnv frequencies alongside associated metadata.
We will look at xarray in more detail in later workshops, but for now we can just use this Dataset with our time series plotting function.
ag3.plot_frequencies_time_series(
burkina_cyp6aap_cnv_ds,
height=500,
width=800,
title="Gene CNV frequencies, Burkina Faso, Cyp6aa/p locus"
)
Note that this figure is interactive, you can hover over points for more information, and zoom into regions of plots.
In this plot we have data from two Burkina Faso admin 1 regions, BF-09 (Haut-Bassins) and BF-07 (Centre-Sud). In BF-09 we have data from 2012 and 2014, and so we have the beginning of a time series. In BF-07 we have data only from 2004, so just a single point in time.
There are 3 sub-plots, faceted by taxon and area to allow for comparison. The error bars in the plots show 95% confidence intervals for the frequency estimates.
We only have two times points in this time series, so we cannot draw strong conclusions. However, future sample sets will contain more time series data.
Exercise 5 (English)
Plot a CNV frequency time-series for Gste.
Hint: gste_region = "3R:28,590,000-28,610,000"
Exercice 5 (Français)
Afficher la série temporelle des CNVs pour Gste.
Indice: gste_region = "3R:28,590,000-28,610,000"
# ??? = ???
# burkina_gste_cnv_ds = ag3.gene_cnv_frequencies_advanced(
# region=???,
# area_by="admin1_iso",
# period_by="year",
# sample_sets="3.0",
# sample_query="country == 'Burkina Faso'",
# variant_query="max_af > 0.05"
# )
# ag3.plot_frequencies_time_series(
# burkina_gste_cnv_ds,
# height=500,
# width=800,
# title="Gene CNV frequencies, Burkina Faso, gste locus"
# )
Plotting frequencies using interactive maps#
It is often useful, when making geographical comparisons, to plot gene CNV frequencies on a map. We can do this via the plot_frequencies_interactive_map()
function in the malariagen_data
package. Let’s demonstrate, analysing the Cyp6aa/p gene cluster using all data in the Ag3.0 resource.
As with the time series plots, we first calculate our gene CNV frequencies using the gene_cnv_frequencies_advanced()
function as this provides additional metadata that we need for plotting data on a map. We will keep all our parameters the same as the time series analysis above, except that this time we will not use a sample query, because we want to include all available samples in the analysis.
cyp6aap_cnv_ds = ag3.gene_cnv_frequencies_advanced(
region=cyp6aap_region,
area_by="admin1_iso", # group samples in space by admin level 1
period_by="year", # group samples in time by year
sample_sets="3.0",
variant_query="max_af > 0.05"
)
cyp6aap_cnv_ds
/home/jonbrenas/.local/lib/python3.10/site-packages/malariagen_data/anoph/snp_frq.py:1203: FutureWarning:
'A' is deprecated and will be removed in a future version, please use 'Y' instead.
<xarray.Dataset> Size: 30kB Dimensions: (cohorts: 49, variants: 12) Dimensions without coordinates: cohorts, variants Data variables: (12/28) cohort_area (cohorts) object 392B 'MW-S' 'TZ-05' ... 'KE-14' cohort_label (cohorts) object 392B 'MW-S_arab_2015' ... 'KE-14... cohort_lat_max (cohorts) float64 392B -15.93 -1.962 ... -3.511 cohort_lat_mean (cohorts) float64 392B -15.93 -1.962 ... -3.511 cohort_lat_min (cohorts) float64 392B -15.93 -1.962 ... -3.511 cohort_lon_max (cohorts) float64 392B 34.76 31.62 ... -15.58 39.91 ... ... variant_gene_name (variants) object 96B nan 'CYP6AA1' ... 'CYP6AD1' variant_gene_strand (variants) object 96B '+' '-' '-' ... '-' '-' '-' variant_label (variants) object 96B 'AGAP002859 amp' ... 'AGAP0... variant_max_af (variants) float64 96B 0.1579 0.9355 ... 0.9221 variant_start (variants) int64 96B 28397312 28480576 ... 28504248 variant_windows (variants) int64 96B 397 8 6 7 6 7 7 6 6 6 7 6 Attributes: title: Gene CNV frequencies (2R:28,480,000-28,510,000)
Then we can pass the xarray Dataset to the interactive map plotting funtion.
ag3.plot_frequencies_interactive_map(
cyp6aap_cnv_ds,
title="Gene CNV frequencies, Cyp6aa/p locus"
)
The drop down menus allow selection of variant, taxon and time period. Gene cnv frequencies are then plotted on the map where samples were collected. The opacity of the marker shows relative frequency, but actual values can be accessed by clicking on the marker.
Exercise 6 (English)
Using the interactive map we have just created - what is the frequency of Cyp6aa1 amplifications in An. gambiae from Uganda area UG-E in 2012?
What is the frequency of Cyp6p3 amplifications in An. coluzzii from Cote D’Ivoire area CI-LG in 2012?
Exercice 6 (Français)
En utilisant la carte interactive que nous venons de créer, quelle est la fréquence de l’amplification de Cyp6aa1 chez les An. gambiae venant de la région de l’Ouganda UG-E en 2012?
Quelle est la fréquence de l’amplification de Cyp6p3 chez les An. coluzzii venant de la région CI-LG de la Côte D’Ivoire en 2012?
Well done!#
In this module we have learnt how to analyse CNVs in genes that are associated with metabolic insecticide resistance in the Anopheles gambiae complex.
We have computed the frequency of copy number variation in genes of interest, and learnt different ways to compare these frequencies between mosquitoes from different time points, geographical locations and species.
Exercises#
English#
Open this notebook in Google Colab and run it for yourself from top to bottom. As you run through the notebook, cell by cell, think about what each cell is doing, and try the practical exercises along the way.
Have go at the practical exercises, but please don’t worry if you don’t have time to do them all during the practical session, and please ask the teaching assistants for help if you are stuck.
Hint: To open the notebook in Google Colab, click the rocket icon at the top of the page, then select “Colab” from the drop-down menu.
Français#
Ouvrir ce notebook dans Google Colab et l’exécuter vous-même du début à la fin. Pendant que vous exécutez le notebook, cellule par cellule, pensez à ce que chaque cellule fait et essayez de faire les exercices quand vous les rencontrez.
Essayez de faire les exercices mais ne vous inquiétez pas si vous n’avez pas le temps de tout faire pendant la séance appliquée et n’hésitez pas à demander aux enseignants assistants si vous avez besoin d’aide parce que vous êtes bloqués.
Indice: Pour ouvrir le notebook dans Google Colab, cliquer sur l’icône de fusée au sommet de cette page puis choisissez “Colab” dans le menu déroulant.