banner

Workshop 2 - Training course in data analysis for genomic surveillance of African malaria vectors


Module 4 - Analysing CNV frequencies at metabolic resistance genes#

Theme: Analysis

In this module we’re going to analyse copy number variation in genes that are known to be associated with metabolic insecticide resistance in the Anopheles gambiae complex, particularly cytochrome P450 (Cyp) genes. We will compute the frequency of copy number variation in these genes of interest, and compare these frequencies between mosquitoes from different time points, geographical locations and species.

Learning objectives#

At the end of this module you will be able to:

  • Compute the frequency of gene copy number variation in different mosquito cohorts.

  • Extract and analyse CNV frequencies for metabolic resistance genes.

  • Visualise frequencies using heatmaps, time series plots and maps.

Lecture#

English#

Français#

Please note that the code in the cells below might differ from that shown in the video. This can happen because Python packages and their dependencies change due to updates, necessitating tweaks to the code.

Setup#

First, let’s install the python packages we will need for our analyses.

%pip install -q --no-warn-conflicts malariagen_data

Now we’ve installed malariagen_data, we can import it into our environment and set it up to access data in the cloud.

Note that authentication is required to access data through the package, please follow the instructions here.

import malariagen_data
ag3 = malariagen_data.Ag3()
ag3
MalariaGEN Ag3 API client
Please note that data are subject to terms of use, for more information see the MalariaGEN website or contact support@malariagen.net. See also the Ag3 API docs.
Storage URL gs://vo_agam_release/
Data releases available 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9
Results cache None
Cohorts analysis 20240418
AIM analysis 20220528
Site filters analysis dt_20200416
Software version malariagen_data 9.0.0
Client location unknown

Above we can see some information about how access to Anopheles gambiae genomic data from MalariaGEN has been configured. This tells us where the data are stored (the “gs” stands for Google Cloud Storage), which data releases are available to analyse (Ag3.0) and what the default analyses are for cohorts, species and site filters. We can also see that we are using version 8.7.0 of the malariagen_data Python package.

Summarising copy number variation by gene#

For surveillance analyses, it can be convenient to summarise the copy number data for each individual by gene. This allows us to ask questions like, for some metabolic resistance gene of interest, does a given mosquito have any copy number variation at that gene?

We’ve seen in the previous module how the copy number variation data is calculated in 300 base pair windows over the genome, but in general genes are larger than 300 base pairs and so will span multiple windows. So how do we summarise the windowed copy number data by gene?

One way to do this is to compute some kind of average over all the windows overlapping a given gene of interest. Because copy number state is a discrete variable, we typically prefer to do this by computing the modal copy number for windows overlapping a gene.

The diagram below shows the windowed copy number for a single mosquito for the region of the genome spanning several metabolic resistance genes. We illustrate how the summarisation is done to compute modal copy number by gene.

Computing gene CNV frequencies#

Now that we know how to summarise the data by gene, the next step is to then summarise these data over multiple individual mosquitoes, so that we can estimate the fraction of mosquitoes from a given time, place and species that carry some form of copy number variation at a given gene of interest. This is ultimately what we are interested in, because we would like to know if metabolic resistance is at low or high frequency and whether this varies over time or between geographical locations or species.

Because copy number variation is more complex than single nucleotide polymorphism, we compute the frequency of CNVs in a simplified way. We compute the fraction of individuals that carry some kind of copy number variation at a given gene, ignoring whether the individual was heterozygous or homozygous for that CNV, and also ignoring the number of additional copies that may be present. We also divide this into two types of CNV - amplification and deletion. So, for example, we compute the fraction of individuals which have an amplification at some gene; and similarly we compute the fraction with some deletion. Although generally we’re more interested in amplification, because amplification of metabolic resistance genes can cause increased insecticide resistance.

The diagram below illustrates this simple calculation of amplification and deletion frequencies for some genes of interest.

Let’s now compute gene CNV frequencies. To do this we can use the gene_cnv_frequencies() function.

We can look at the documentation for this function by using “?”.

ag3.gene_cnv_frequencies?

The documentation shows us that there are two required parameters for this function. The region parameter is the genomic region that we want to analyse, and the cohorts parameter determines how samples are to be grouped into cohorts.

Let’s initially analyse all genes in the Cyp6aa/p cluster on chromosome arm 2R, which starts around position 28,450,000 and ends at around position 28,510,000.

cyp6aap_region = "2R:28,480,000-28,510,000"

There are different ways to group our mosquitoes in to cohorts (more on this later) but for this analysis we’ll group spatially by level 1 administrative divisions within countries, and temporally by year.

cohorts = "admin1_year"

Were also going to use one of the optional parameters, sample_sets, to analyse just a subset the data.

Let’s have a look at what sample sets are available in the Ag3.0 data resource.

ag3.sample_sets(release='3.0')[['sample_set','sample_count','study_id']]
sample_set sample_count study_id
0 AG1000G-AO 81 AG1000G-AO
1 AG1000G-BF-A 181 AG1000G-BF-1
2 AG1000G-BF-B 102 AG1000G-BF-1
3 AG1000G-BF-C 13 AG1000G-BF-2
4 AG1000G-CD 76 AG1000G-CD
5 AG1000G-CF 73 AG1000G-CF
6 AG1000G-CI 80 AG1000G-CI
7 AG1000G-CM-A 303 AG1000G-CM-1
8 AG1000G-CM-B 97 AG1000G-CM-2
9 AG1000G-CM-C 44 AG1000G-CM-3
10 AG1000G-FR 23 AG1000G-FR
11 AG1000G-GA-A 69 AG1000G-GA-1
12 AG1000G-GH 100 AG1000G-GH
13 AG1000G-GM-A 74 AG1000G-GM-1
14 AG1000G-GM-B 31 AG1000G-GM-2
15 AG1000G-GM-C 174 AG1000G-GM-3
16 AG1000G-GN-A 45 AG1000G-GN-ML
17 AG1000G-GN-B 185 AG1000G-GN-ML
18 AG1000G-GQ 10 AG1000G-GQ
19 AG1000G-GW 101 AG1000G-GW
20 AG1000G-KE 86 AG1000G-KE
21 AG1000G-ML-A 60 AG1000G-ML-1
22 AG1000G-ML-B 71 AG1000G-ML-2
23 AG1000G-MW 41 AG1000G-MW
24 AG1000G-MZ 74 AG1000G-MZ
25 AG1000G-TZ 300 AG1000G-TZ
26 AG1000G-UG 290 AG1000G-UG
27 AG1000G-X 669 AG1000G-X

To keep things simple, for this module we’ll focus on mosquitoes from Burkina Faso. There are three sample sets in the Ag3.0 resource providing data on mosquitoes from Burkina Faso.

sample_sets = ["AG1000G-BF-A", "AG1000G-BF-B", "AG1000G-BF-C"]

Now we’re ready to run the analysis.

burkina_cyp6aap_cnv_freqs_df = ag3.gene_cnv_frequencies(
    region=cyp6aap_region,
    cohorts=cohorts,
    sample_sets=sample_sets,
    drop_invariant=False,
)
burkina_cyp6aap_cnv_freqs_df
                                    
                                     
                                     
gene_strand gene_description contig start end frq_BF-07_gamb_2004 frq_BF-09_colu_2012 frq_BF-09_colu_2014 frq_BF-09_gamb_2012 frq_BF-09_gamb_2014 max_af windows label
gene_id gene_name cnv_type
AGAP002859 NaN amp + solute carrier family 8 (sodium/calcium exchan... 2R 28397312 28516028 0.0 0.0000 0.000000 0.000000 0.000000 0.000000 397 AGAP002859 amp
del + solute carrier family 8 (sodium/calcium exchan... 2R 28397312 28516028 0.0 0.0000 0.000000 0.000000 0.000000 0.000000 397 AGAP002859 del
AGAP002862 CYP6AA1 amp - cytochrome P450 [Source:VB Community Annotation] 2R 28480576 28482637 0.0 0.9125 0.811321 0.040816 0.065217 0.912500 8 AGAP002862 (CYP6AA1) amp
del - cytochrome P450 [Source:VB Community Annotation] 2R 28480576 28482637 0.0 0.0000 0.000000 0.000000 0.000000 0.000000 8 AGAP002862 (CYP6AA1) del
AGAP013128 CYP6AA2 amp - cytochrome P450 [Source:VB Community Annotation] 2R 28483301 28484921 0.0 0.8250 0.792453 0.040816 0.065217 0.825000 6 AGAP013128 (CYP6AA2) amp
del - cytochrome P450 [Source:VB Community Annotation] 2R 28483301 28484921 0.0 0.0000 0.000000 0.000000 0.000000 0.000000 6 AGAP013128 (CYP6AA2) del
AGAP002863 COEAE6O amp - carboxylesterase alpha esterase [Source:VB Com... 2R 28485262 28487080 0.0 0.6000 0.509434 0.010204 0.021739 0.600000 7 AGAP002863 (COEAE6O) amp
del - carboxylesterase alpha esterase [Source:VB Com... 2R 28485262 28487080 0.0 0.0000 0.000000 0.000000 0.000000 0.000000 7 AGAP002863 (COEAE6O) del
AGAP002864 CYP6P15P amp - cytochrome P450 [Source:VB Community Annotation] 2R 28487640 28489092 0.0 0.5250 0.528302 0.010204 0.021739 0.528302 6 AGAP002864 (CYP6P15P) amp
del - cytochrome P450 [Source:VB Community Annotation] 2R 28487640 28489092 0.0 0.0000 0.000000 0.000000 0.000000 0.000000 6 AGAP002864 (CYP6P15P) del
AGAP002865 CYP6P3 amp - cytochrome P450 [Source:VB Community Annotation] 2R 28491415 28493141 0.0 0.0375 0.075472 0.010204 0.021739 0.075472 7 AGAP002865 (CYP6P3) amp
del - cytochrome P450 [Source:VB Community Annotation] 2R 28491415 28493141 0.0 0.0000 0.000000 0.112245 0.000000 0.112245 7 AGAP002865 (CYP6P3) del
AGAP002866 CYP6P5 amp - cytochrome P450 [Source:VB Community Annotation] 2R 28494017 28495645 0.0 0.0375 0.056604 0.020408 0.021739 0.056604 6 AGAP002866 (CYP6P5) amp
del - cytochrome P450 [Source:VB Community Annotation] 2R 28494017 28495645 0.0 0.0000 0.000000 0.000000 0.000000 0.000000 6 AGAP002866 (CYP6P5) del
AGAP002867 CYP6P4 amp - cytochrome P450 [Source:VB Community Annotation] 2R 28497087 28498674 0.0 0.0375 0.075472 0.010204 0.021739 0.075472 6 AGAP002867 (CYP6P4) amp
del - cytochrome P450 [Source:VB Community Annotation] 2R 28497087 28498674 0.0 0.0000 0.000000 0.000000 0.000000 0.000000 6 AGAP002867 (CYP6P4) del
AGAP002868 CYP6P1 amp - cytochrome P450 [Source:VB Community Annotation] 2R 28499251 28500900 0.0 0.0375 0.056604 0.010204 0.021739 0.056604 6 AGAP002868 (CYP6P1) amp
del - cytochrome P450 [Source:VB Community Annotation] 2R 28499251 28500900 0.0 0.0000 0.000000 0.000000 0.000000 0.000000 6 AGAP002868 (CYP6P1) del
AGAP002869 CYP6P2 amp - cytochrome P450 [Source:VB Community Annotation] 2R 28501033 28502910 0.0 0.0375 0.056604 0.010204 0.021739 0.056604 7 AGAP002869 (CYP6P2) amp
del - cytochrome P450 [Source:VB Community Annotation] 2R 28501033 28502910 0.0 0.0000 0.000000 0.000000 0.000000 0.000000 7 AGAP002869 (CYP6P2) del
AGAP002870 CYP6AD1 amp - cytochrome P450 [Source:VB Community Annotation] 2R 28504248 28505816 0.0 0.0375 0.056604 0.010204 0.021739 0.056604 6 AGAP002870 (CYP6AD1) amp
del - cytochrome P450 [Source:VB Community Annotation] 2R 28504248 28505816 0.0 0.0000 0.000000 0.000000 0.000000 0.000000 6 AGAP002870 (CYP6AD1) del

The function outputs a pandas DataFrame, where rows represent gene CNVs. There are two rows for each gene, one representing an amplification (“amp” - increase in copy number), the other representing a deletion (“del” - decrease in copy number). There are 11 annotated genes on the chromosome arms we selected, so there are 2 times 11 = 22 rows in this dataframe.

len(burkina_cyp6aap_cnv_freqs_df)
22

Let’s look at the columns that provide information about the genes.

burkina_cyp6aap_cnv_freqs_df[["gene_strand", "gene_description", "contig", "start", "end"]]
gene_strand gene_description contig start end
gene_id gene_name cnv_type
AGAP002859 NaN amp + solute carrier family 8 (sodium/calcium exchan... 2R 28397312 28516028
del + solute carrier family 8 (sodium/calcium exchan... 2R 28397312 28516028
AGAP002862 CYP6AA1 amp - cytochrome P450 [Source:VB Community Annotation] 2R 28480576 28482637
del - cytochrome P450 [Source:VB Community Annotation] 2R 28480576 28482637
AGAP013128 CYP6AA2 amp - cytochrome P450 [Source:VB Community Annotation] 2R 28483301 28484921
del - cytochrome P450 [Source:VB Community Annotation] 2R 28483301 28484921
AGAP002863 COEAE6O amp - carboxylesterase alpha esterase [Source:VB Com... 2R 28485262 28487080
del - carboxylesterase alpha esterase [Source:VB Com... 2R 28485262 28487080
AGAP002864 CYP6P15P amp - cytochrome P450 [Source:VB Community Annotation] 2R 28487640 28489092
del - cytochrome P450 [Source:VB Community Annotation] 2R 28487640 28489092
AGAP002865 CYP6P3 amp - cytochrome P450 [Source:VB Community Annotation] 2R 28491415 28493141
del - cytochrome P450 [Source:VB Community Annotation] 2R 28491415 28493141
AGAP002866 CYP6P5 amp - cytochrome P450 [Source:VB Community Annotation] 2R 28494017 28495645
del - cytochrome P450 [Source:VB Community Annotation] 2R 28494017 28495645
AGAP002867 CYP6P4 amp - cytochrome P450 [Source:VB Community Annotation] 2R 28497087 28498674
del - cytochrome P450 [Source:VB Community Annotation] 2R 28497087 28498674
AGAP002868 CYP6P1 amp - cytochrome P450 [Source:VB Community Annotation] 2R 28499251 28500900
del - cytochrome P450 [Source:VB Community Annotation] 2R 28499251 28500900
AGAP002869 CYP6P2 amp - cytochrome P450 [Source:VB Community Annotation] 2R 28501033 28502910
del - cytochrome P450 [Source:VB Community Annotation] 2R 28501033 28502910
AGAP002870 CYP6AD1 amp - cytochrome P450 [Source:VB Community Annotation] 2R 28504248 28505816
del - cytochrome P450 [Source:VB Community Annotation] 2R 28504248 28505816

Reminder - cohorts#

When we want to analyse frequencies we first define cohorts, which are simply groups of mosquitoes that we have data for and which we take as representatives of a mosquito population from a given time, place and species.

To define these cohorts we group mosquitoes according to these different variables, i.e., geographically, temporally and according to species (taxon). There are different options for doing this, depending on how fine-grained we want our analysis to be, and how many samples are available.

When computing frequencies and choosing cohorts, we need to think about two factors:

  • Cohort size - The larger the cohort the lower the uncertainty and the more confident we can be that the frequency we calculate within the cohort is representative of the frequency in the population from which the cohort was sampled. By default, the functions that compute frequencies in the malariagen_data package require a minimum cohort size of 10 individuals.

  • Heterogeneity - If we group samples together from a wide geographical area, or from a long time period, we may be losing information about any smaller-scale variation.

Often we have to make a compromise, and do the best we can with available data.

In the example above, we provided the cohorts="admin1_year" parameter, which means that samples will be grouped by species (taxon), administrative level 1 divisions, and year of collection. This gives us frequencies for five cohorts. Let’s look at the cohort frequency columns.

frequency_columns = [
    col for col in burkina_cyp6aap_cnv_freqs_df.columns 
    if col.startswith("frq_")
]
frequency_columns
['frq_BF-07_gamb_2004',
 'frq_BF-09_colu_2012',
 'frq_BF-09_colu_2014',
 'frq_BF-09_gamb_2012',
 'frq_BF-09_gamb_2014']

Remember, e.g., “frq_BF-09_gamb_2012” means the frequency of the allele within the cohort of Anopheles gambiae mosquitoes from the BF-09 (Haut-Bassins) region collected in 2012.

Let’s look at the values of these frequency columns in our DataFrame.

burkina_cyp6aap_cnv_freqs_df[frequency_columns + ["max_af", "windows"]]
frq_BF-07_gamb_2004 frq_BF-09_colu_2012 frq_BF-09_colu_2014 frq_BF-09_gamb_2012 frq_BF-09_gamb_2014 max_af windows
gene_id gene_name cnv_type
AGAP002859 NaN amp 0.0 0.0000 0.000000 0.000000 0.000000 0.000000 397
del 0.0 0.0000 0.000000 0.000000 0.000000 0.000000 397
AGAP002862 CYP6AA1 amp 0.0 0.9125 0.811321 0.040816 0.065217 0.912500 8
del 0.0 0.0000 0.000000 0.000000 0.000000 0.000000 8
AGAP013128 CYP6AA2 amp 0.0 0.8250 0.792453 0.040816 0.065217 0.825000 6
del 0.0 0.0000 0.000000 0.000000 0.000000 0.000000 6
AGAP002863 COEAE6O amp 0.0 0.6000 0.509434 0.010204 0.021739 0.600000 7
del 0.0 0.0000 0.000000 0.000000 0.000000 0.000000 7
AGAP002864 CYP6P15P amp 0.0 0.5250 0.528302 0.010204 0.021739 0.528302 6
del 0.0 0.0000 0.000000 0.000000 0.000000 0.000000 6
AGAP002865 CYP6P3 amp 0.0 0.0375 0.075472 0.010204 0.021739 0.075472 7
del 0.0 0.0000 0.000000 0.112245 0.000000 0.112245 7
AGAP002866 CYP6P5 amp 0.0 0.0375 0.056604 0.020408 0.021739 0.056604 6
del 0.0 0.0000 0.000000 0.000000 0.000000 0.000000 6
AGAP002867 CYP6P4 amp 0.0 0.0375 0.075472 0.010204 0.021739 0.075472 6
del 0.0 0.0000 0.000000 0.000000 0.000000 0.000000 6
AGAP002868 CYP6P1 amp 0.0 0.0375 0.056604 0.010204 0.021739 0.056604 6
del 0.0 0.0000 0.000000 0.000000 0.000000 0.000000 6
AGAP002869 CYP6P2 amp 0.0 0.0375 0.056604 0.010204 0.021739 0.056604 7
del 0.0 0.0000 0.000000 0.000000 0.000000 0.000000 7
AGAP002870 CYP6AD1 amp 0.0 0.0375 0.056604 0.010204 0.021739 0.056604 6
del 0.0 0.0000 0.000000 0.000000 0.000000 0.000000 6

Remember frequency values range between 0 and 1. 0 means no individuals have the variant. 1 means all individuals have the variant. 0.5 means 50% of individuals have the variant.

The max_af column simply shows the highest frequency found in the cohorts and is useful for filtering the output down to just CNVs at appreciable frequency in at least one cohort, and the windows columns shows how many 300 bp windows the frequency is calculated over.

Plotting frequency heatmaps#

Just like we could with SNP allele frequencies, we can also make gene cnv frequency DataFrames easier to interpret by plotting them as a heat map.

Let’s also filter our DataFrame by max_af to just CNVs present in at least one cohort over 5%, using a pandas query.

ag3.plot_frequencies_heatmap(
    burkina_cyp6aap_cnv_freqs_df.query("max_af > 0.05"), 
    title="Gene CNV frequencies, Burkina Faso, Cyp6aa/p locus")

We can see clearly now that amplifications appear to be at high frequency within An. coluzzii but not An. gambiae cohorts, and the amplifications appear to be highest for the Cyp6aa1 gene.

Exercise 1 (English)

Recall that there are several other cytochrome P450 (Cyp) genes and gene clusters associated with insecticide resistance in the genome, so let’s make a new region query to look at some other Cyp genes of interest and visualise them as heatmaps.

First, have a look at the Cyp9k1 gene on the X chromosome (15,240,000-15,250,000).

Hint: cyp9k1_region = "X:15,240,000-15,250,000"

Exercice 1 (Français)

Il y a d’autres gènes cytochrome P450 (Cyp) et groupes de gènes associés à la réistance aux insecticides dans le génome, utilisons donc une autre requête pour obtenir une région contenant d’autres gènes Cyp et les visualiser sous forme de heatmap.

D’abord, regardons le gène Cyp9k1 sur le chromosome X (15,240,000-15,250,000).

Indice: cyp9k1_region = "X:15,240,000-15,250,000"

# cyp9k1_region = ???

# burkina_cyp9k1_cnv_freqs_df = ag3.gene_cnv_frequencies(
#     region=cyp9k1_region,
#     cohorts="admin1_year",
#     sample_sets=["AG1000G-BF-A", "AG1000G-BF-B", "AG1000G-BF-C"],
# )

# ag3.plot_frequencies_heatmap(
#     burkina_cyp9k1_cnv_freqs_df.query("max_af > 0.05"),
# )

Exercise 2 (English)

Next, try the Cyp6mz region.

Hint: cyp6mz_region = "3R:6,924,000-6,980,000"

Exercice 2 (Français)

Essayons ensuite la région de Cyp6mz.

Indice: cyp6mz_region = "3R:6,924,000-6,980,000"

# cyp6mz_region = ???

# burkina_cyp6mz_cnv_freqs_df = ag3.gene_cnv_frequencies(
#     region=cyp6mz_region,
#     cohorts="admin1_year",
#     sample_sets=["AG1000G-BF-A", "AG1000G-BF-B", "AG1000G-BF-C"],
# )

# ag3.plot_frequencies_heatmap(
#     burkina_cyp6mz_cnv_freqs_df.query("max_af > 0.05"),
# )

Exercise 3 (English)

Let’s also have a look at the Gste gene cluster on chromosome arm 3R. What patterns can you see?

Hint: gste_region = "3R:28,590,000-28,610,000"

Exercice 3 (Français)

Regardons aussi le groupe de gènes Gste sur le bras de chromosome 3R. Quel motif observez-vous?

Indice: gste_region = "3R:28,590,000-28,610,000"

# gste_region = ???

# burkina_gste_cnv_freqs_df = ag3.gene_cnv_frequencies(
#     region=???,
#     cohorts="admin1_year",
#     sample_sets=["AG1000G-BF-A", "AG1000G-BF-B", "AG1000G-BF-C"],
# )

# ag3.plot_frequencies_heatmap(
#     burkina_gste_cnv_freqs_df.query("max_af > 0.05"),
# )

Using sample queries#

Sometimes you may want to analyse data across many sample sets, but select only samples from a single country or species. The useful sample_query parameter enables us to achieve this easily.

For example, if we wanted to analyse the Cyp6aa/p gene cluster in all of the Ag3.0 sample sets, but only look at Anopheles coluzzii samples in those sample sets, then we would set the sample_sets parameter to “3.0”…

sample_sets = "3.0"

… and pass a pandas query string to sample_query.

sample_query = "taxon == 'coluzzii'"
coluzzii_cyp6aap_cnv_freqs_df = ag3.gene_cnv_frequencies(
    region=cyp6aap_region,
    cohorts=cohorts,
    sample_sets=sample_sets,
    sample_query=sample_query,
)
coluzzii_cyp6aap_cnv_freqs_df
                                     
                                     
gene_strand gene_description contig start end frq_AO-LUA_colu_2009 frq_BF-09_colu_2012 frq_BF-09_colu_2014 frq_CF-BGF_colu_1994 frq_CI-LG_colu_2012 ... frq_GH-WP_colu_2012 frq_GM-M_colu_2012 frq_GN-N_colu_2012 frq_ML-2_colu_2004 frq_ML-2_colu_2014 frq_ML-3_colu_2012 frq_ML-4_colu_2004 max_af windows label
gene_id gene_name cnv_type
AGAP002862 CYP6AA1 amp - cytochrome P450 [Source:VB Community Annotation] 2R 28480576 28482637 0.0 0.9125 0.811321 0.0 0.870130 ... 0.125000 0.046784 0.8 0.8 0.526316 0.846154 0.260870 0.912500 8 AGAP002862 (CYP6AA1) amp
AGAP013128 CYP6AA2 amp - cytochrome P450 [Source:VB Community Annotation] 2R 28483301 28484921 0.0 0.8250 0.792453 0.0 0.870130 ... 0.125000 0.029240 0.8 0.8 0.421053 0.769231 0.260870 0.870130 6 AGAP013128 (CYP6AA2) amp
AGAP002863 COEAE6O amp - carboxylesterase alpha esterase [Source:VB Com... 2R 28485262 28487080 0.0 0.6000 0.509434 0.0 0.740260 ... 0.083333 0.000000 0.1 0.0 0.263158 0.461538 0.043478 0.740260 7 AGAP002863 (COEAE6O) amp
AGAP002864 CYP6P15P amp - cytochrome P450 [Source:VB Community Annotation] 2R 28487640 28489092 0.0 0.5250 0.528302 0.0 0.922078 ... 0.083333 0.000000 0.0 0.2 0.263158 0.346154 0.043478 0.922078 6 AGAP002864 (CYP6P15P) amp
AGAP002865 CYP6P3 amp - cytochrome P450 [Source:VB Community Annotation] 2R 28491415 28493141 0.0 0.0375 0.075472 0.0 0.922078 ... 0.041667 0.000000 0.0 0.2 0.000000 0.038462 0.000000 0.922078 7 AGAP002865 (CYP6P3) amp
AGAP002866 CYP6P5 amp - cytochrome P450 [Source:VB Community Annotation] 2R 28494017 28495645 0.0 0.0375 0.056604 0.0 0.922078 ... 0.000000 0.000000 0.0 0.2 0.000000 0.038462 0.000000 0.922078 6 AGAP002866 (CYP6P5) amp
AGAP002867 CYP6P4 amp - cytochrome P450 [Source:VB Community Annotation] 2R 28497087 28498674 0.0 0.0375 0.075472 0.0 0.922078 ... 0.000000 0.000000 0.0 0.2 0.000000 0.038462 0.000000 0.922078 6 AGAP002867 (CYP6P4) amp
AGAP002868 CYP6P1 amp - cytochrome P450 [Source:VB Community Annotation] 2R 28499251 28500900 0.0 0.0375 0.056604 0.0 0.909091 ... 0.000000 0.000000 0.0 0.2 0.000000 0.038462 0.000000 0.909091 6 AGAP002868 (CYP6P1) amp
AGAP002869 CYP6P2 amp - cytochrome P450 [Source:VB Community Annotation] 2R 28501033 28502910 0.0 0.0375 0.056604 0.0 0.922078 ... 0.000000 0.000000 0.0 0.2 0.000000 0.038462 0.000000 0.922078 7 AGAP002869 (CYP6P2) amp
del - cytochrome P450 [Source:VB Community Annotation] 2R 28501033 28502910 0.0 0.0000 0.000000 0.0 0.000000 ... 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.000000 0.040000 7 AGAP002869 (CYP6P2) del
AGAP002870 CYP6AD1 amp - cytochrome P450 [Source:VB Community Annotation] 2R 28504248 28505816 0.0 0.0375 0.056604 0.0 0.922078 ... 0.000000 0.000000 0.0 0.2 0.000000 0.038462 0.000000 0.922078 6 AGAP002870 (CYP6AD1) amp

11 rows × 22 columns

We can see that our query string has worked by listing the cohort frequency (“frq”) columns in our output DataFrame. Note that the taxon is “colu” in each cohort.

colu_frequency_columns = [
    col for col in coluzzii_cyp6aap_cnv_freqs_df.columns 
    if col.startswith("frq_")
]

colu_frequency_columns
['frq_AO-LUA_colu_2009',
 'frq_BF-09_colu_2012',
 'frq_BF-09_colu_2014',
 'frq_CF-BGF_colu_1994',
 'frq_CI-LG_colu_2012',
 'frq_GH-AA_colu_2012',
 'frq_GH-CP_colu_2012',
 'frq_GH-WP_colu_2012',
 'frq_GM-M_colu_2012',
 'frq_GN-N_colu_2012',
 'frq_ML-2_colu_2004',
 'frq_ML-2_colu_2014',
 'frq_ML-3_colu_2012',
 'frq_ML-4_colu_2004']

Now we can visualise data for a locus of interest from this output using a heatmap, just as we did before.

ag3.plot_frequencies_heatmap(
    coluzzii_cyp6aap_cnv_freqs_df.query("max_af > 0.05"), 
    title="Gene CNV frequencies, An. coluzzii, Cyp6aa/p locus"
)

We can see CNV frequencies appear to differ between geographical locations drastically. For example, Cote D’Ivoire has the highest frequencies across the most genes in the cluster (“CI-LG_colu_2012”), but the collection from the same year in neighbouring Ghana (“GH-AA_colu_2012”) has no evidence for CNVs at this locus.

Exercise 4 (English)

Look at the same Cypaa/p locus, but this time at only An. gambiae. What patterns can you see in the results?

Hint: change sample_query taxon to “gambiae”.

Exercice 4 (Français)

Observer le même locus Cypaa/p, mais cette fois seulement pour An. gambiae. Quels motifs pouvez-vous observer dans ces résultats?

Indice: remplacer le taxon dans sample_query par “gambiae”.

# sample_query = ???

# gambiae_cyp6aap_cnv_freqs_df = ag3.gene_cnv_frequencies(
#     region=cyp6aap_region,
#     cohorts=cohorts,
#     sample_sets=sample_sets,
#     sample_query=sample_query,
# )

# ag3.plot_frequencies_heatmap(
#     gambiae_cyp6aap_cnv_freqs_df.query("max_af > 0.05"), 
#     title="Gene CNV frequencies, An. gambiae, Cyp6aa/p locus"
# )

Plotting frequency time series#

For surveillance it can be very informative to observe changes in CNV frequencies over time for a given sampling location and species. CNVs under selection due to insecticide resistance will increase in frequency over time, and plotting the frequency data as a time series can help to see this. The malariagen_data package includes a function plot_frequencies_time_series() for plotting frequency time series. These are a little more advanced to use, but we’ll illustrate them below.

N.B., in the Ag3.0 data resource there is only one country where we have data from multiple time points, which is Burkina Faso. However, data in future releases will contain a number of time series, and so it’s useful to illustrate this functionality now.

First we need to compute the gene CNV frequencies. To obtain the necessary data for time series plotting, we need to use a slightly different function for computing the frequencies - gene_cnv_frequencies_advanced().

ag3.gene_cnv_frequencies_advanced?

We can see alongside region there are a couple of new required parameters for this function.

The area_by parameter determines how to group samples in space - we’re going to use “admin1_iso” here, which uses the ISO codes for level 1 administrative divisions. (Another option would be “admin2_name” to group by level 2 administrative divisions.)

The period_by parameter determines how to group samples in time, let’s do this by “year”.

So let’s look at the Cyp6aa/p locus, again using just samples from Burkina Faso.

burkina_cyp6aap_cnv_ds = ag3.gene_cnv_frequencies_advanced(
    region=cyp6aap_region,
    area_by="admin1_iso",
    period_by="year",
    sample_sets="3.0",
    sample_query="country == 'Burkina Faso'",
    variant_query="max_af > 0.05"
)
burkina_cyp6aap_cnv_ds
                                    
                                     
/home/jonbrenas/.local/lib/python3.10/site-packages/malariagen_data/anoph/snp_frq.py:1203: FutureWarning:

'A' is deprecated and will be removed in a future version, please use 'Y' instead.
<xarray.Dataset> Size: 4kB
Dimensions:                 (cohorts: 5, variants: 11)
Dimensions without coordinates: cohorts, variants
Data variables: (12/28)
    cohort_area             (cohorts) object 40B 'BF-09' 'BF-09' ... 'BF-09'
    cohort_label            (cohorts) object 40B 'BF-09_colu_2012' ... 'BF-09...
    cohort_lat_max          (cohorts) float64 40B 11.24 11.24 12.06 11.24 11.24
    cohort_lat_mean         (cohorts) float64 40B 11.22 11.23 12.06 11.2 11.21
    cohort_lat_min          (cohorts) float64 40B 11.15 11.23 12.06 11.15 11.15
    cohort_lon_max          (cohorts) float64 40B -4.235 -4.235 ... -4.235
    ...                      ...
    variant_gene_name       (variants) object 88B 'CYP6AA1' ... 'CYP6AD1'
    variant_gene_strand     (variants) object 88B '-' '-' '-' ... '-' '-' '-'
    variant_label           (variants) object 88B 'AGAP002862 (CYP6AA1) amp' ...
    variant_max_af          (variants) float64 88B 0.9125 0.825 ... 0.0566
    variant_start           (variants) int64 88B 28480576 28483301 ... 28504248
    variant_windows         (variants) int64 88B 8 6 7 6 7 7 6 6 6 7 6
Attributes:
    title:    Gene CNV frequencies (2R:28,480,000-28,510,000)

Instead of a pandas DataFrame, this function outputs an xarray Dataset containing the gene cnv frequencies alongside associated metadata.

We will look at xarray in more detail in later workshops, but for now we can just use this Dataset with our time series plotting function.

ag3.plot_frequencies_time_series(
    burkina_cyp6aap_cnv_ds, 
    height=500, 
    width=800, 
    title="Gene CNV frequencies, Burkina Faso, Cyp6aa/p locus"
)

Note that this figure is interactive, you can hover over points for more information, and zoom into regions of plots.

In this plot we have data from two Burkina Faso admin 1 regions, BF-09 (Haut-Bassins) and BF-07 (Centre-Sud). In BF-09 we have data from 2012 and 2014, and so we have the beginning of a time series. In BF-07 we have data only from 2004, so just a single point in time.

There are 3 sub-plots, faceted by taxon and area to allow for comparison. The error bars in the plots show 95% confidence intervals for the frequency estimates.

We only have two times points in this time series, so we cannot draw strong conclusions. However, future sample sets will contain more time series data.

Exercise 5 (English)

Plot a CNV frequency time-series for Gste.

Hint: gste_region = "3R:28,590,000-28,610,000"

Exercice 5 (Français)

Afficher la série temporelle des CNVs pour Gste.

Indice: gste_region = "3R:28,590,000-28,610,000"

#  ??? = ???

#  burkina_gste_cnv_ds = ag3.gene_cnv_frequencies_advanced(
#     region=???,
#     area_by="admin1_iso",
#     period_by="year",
#     sample_sets="3.0",
#     sample_query="country == 'Burkina Faso'",
#     variant_query="max_af > 0.05"
# )

# ag3.plot_frequencies_time_series(
#     burkina_gste_cnv_ds, 
#     height=500, 
#     width=800, 
#     title="Gene CNV frequencies, Burkina Faso, gste locus"
# )

Plotting frequencies using interactive maps#

It is often useful, when making geographical comparisons, to plot gene CNV frequencies on a map. We can do this via the plot_frequencies_interactive_map() function in the malariagen_data package. Let’s demonstrate, analysing the Cyp6aa/p gene cluster using all data in the Ag3.0 resource.

As with the time series plots, we first calculate our gene CNV frequencies using the gene_cnv_frequencies_advanced() function as this provides additional metadata that we need for plotting data on a map. We will keep all our parameters the same as the time series analysis above, except that this time we will not use a sample query, because we want to include all available samples in the analysis.

cyp6aap_cnv_ds = ag3.gene_cnv_frequencies_advanced(
    region=cyp6aap_region,
    area_by="admin1_iso",  # group samples in space by admin level 1
    period_by="year",  # group samples in time by year
    sample_sets="3.0",
    variant_query="max_af > 0.05"
)
cyp6aap_cnv_ds
                                    
                                     
/home/jonbrenas/.local/lib/python3.10/site-packages/malariagen_data/anoph/snp_frq.py:1203: FutureWarning:

'A' is deprecated and will be removed in a future version, please use 'Y' instead.
<xarray.Dataset> Size: 30kB
Dimensions:                 (cohorts: 49, variants: 12)
Dimensions without coordinates: cohorts, variants
Data variables: (12/28)
    cohort_area             (cohorts) object 392B 'MW-S' 'TZ-05' ... 'KE-14'
    cohort_label            (cohorts) object 392B 'MW-S_arab_2015' ... 'KE-14...
    cohort_lat_max          (cohorts) float64 392B -15.93 -1.962 ... -3.511
    cohort_lat_mean         (cohorts) float64 392B -15.93 -1.962 ... -3.511
    cohort_lat_min          (cohorts) float64 392B -15.93 -1.962 ... -3.511
    cohort_lon_max          (cohorts) float64 392B 34.76 31.62 ... -15.58 39.91
    ...                      ...
    variant_gene_name       (variants) object 96B nan 'CYP6AA1' ... 'CYP6AD1'
    variant_gene_strand     (variants) object 96B '+' '-' '-' ... '-' '-' '-'
    variant_label           (variants) object 96B 'AGAP002859 amp' ... 'AGAP0...
    variant_max_af          (variants) float64 96B 0.1579 0.9355 ... 0.9221
    variant_start           (variants) int64 96B 28397312 28480576 ... 28504248
    variant_windows         (variants) int64 96B 397 8 6 7 6 7 7 6 6 6 7 6
Attributes:
    title:    Gene CNV frequencies (2R:28,480,000-28,510,000)

Then we can pass the xarray Dataset to the interactive map plotting funtion.

ag3.plot_frequencies_interactive_map(
    cyp6aap_cnv_ds,
    title="Gene CNV frequencies, Cyp6aa/p locus"
)

The drop down menus allow selection of variant, taxon and time period. Gene cnv frequencies are then plotted on the map where samples were collected. The opacity of the marker shows relative frequency, but actual values can be accessed by clicking on the marker.

Exercise 6 (English)

Using the interactive map we have just created - what is the frequency of Cyp6aa1 amplifications in An. gambiae from Uganda area UG-E in 2012?

What is the frequency of Cyp6p3 amplifications in An. coluzzii from Cote D’Ivoire area CI-LG in 2012?

Exercice 6 (Français)

En utilisant la carte interactive que nous venons de créer, quelle est la fréquence de l’amplification de Cyp6aa1 chez les An. gambiae venant de la région de l’Ouganda UG-E en 2012?

Quelle est la fréquence de l’amplification de Cyp6p3 chez les An. coluzzii venant de la région CI-LG de la Côte D’Ivoire en 2012?

Well done!#

In this module we have learnt how to analyse CNVs in genes that are associated with metabolic insecticide resistance in the Anopheles gambiae complex.

We have computed the frequency of copy number variation in genes of interest, and learnt different ways to compare these frequencies between mosquitoes from different time points, geographical locations and species.

Exercises#

English#

Open this notebook in Google Colab and run it for yourself from top to bottom. As you run through the notebook, cell by cell, think about what each cell is doing, and try the practical exercises along the way.

Have go at the practical exercises, but please don’t worry if you don’t have time to do them all during the practical session, and please ask the teaching assistants for help if you are stuck.

Hint: To open the notebook in Google Colab, click the rocket icon at the top of the page, then select “Colab” from the drop-down menu.

Français#

Ouvrir ce notebook dans Google Colab et l’exécuter vous-même du début à la fin. Pendant que vous exécutez le notebook, cellule par cellule, pensez à ce que chaque cellule fait et essayez de faire les exercices quand vous les rencontrez.

Essayez de faire les exercices mais ne vous inquiétez pas si vous n’avez pas le temps de tout faire pendant la séance appliquée et n’hésitez pas à demander aux enseignants assistants si vous avez besoin d’aide parce que vous êtes bloqués.

Indice: Pour ouvrir le notebook dans Google Colab, cliquer sur l’icône de fusée au sommet de cette page puis choisissez “Colab” dans le menu déroulant.