Module 4 - Analysing CNV frequencies at metabolic resistance genes

Module 4 - Analysing CNV frequencies at metabolic resistance genes#

Theme: Analysis

In this module we’re going to analyse copy number variation in genes that are known to be associated with metabolic insecticide resistance in the Anopheles gambiae complex, particularly cytochrome P450 (Cyp) genes. We will compute the frequency of copy number variation in these genes of interest, and compare these frequencies between mosquitoes from different time points, geographical locations and species.

Learning objectives#

At the end of this module you will be able to:

Compute the frequency of gene copy number variation in different mosquito cohorts.
Extract and analyse CNV frequencies for metabolic resistance genes.
Visualise frequencies using heatmaps, time series plots and maps.

Lecture#

English#

Français#

Please note that the code in the cells below might differ from that shown in the video. This can happen because Python packages and their dependencies change due to updates, necessitating tweaks to the code.

Setup#

First, let’s install the python packages we will need for our analyses.

%pip install -q --no-warn-conflicts malariagen_data

Now we’ve installed malariagen_data, we can import it into our environment and set it up to access data in the cloud.

Note that authentication is required to access data through the package, please follow the instructions here.

import malariagen_data

#import os
#os.environ["MGEN_SHOW_PROGRESS"] = "0"
import plotly.io as pio
pio.renderers.default = "notebook+colab"

ag3 = malariagen_data.Ag3()
ag3

MalariaGEN Ag3 API client
Please note that data are subject to terms of use, for more information see the MalariaGEN website or contact support@malariagen.net. See also the Ag3 API docs.
Storage URL	gs://vo_agam_release_master_us_central1
Data releases available	3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 3.10, 3.11, 3.12, 3.13, 3.14
Results cache	None
Cohorts analysis	20250502
AIM analysis	20220528
Site filters analysis	dt_20200416
Software version	malariagen_data 15.0.1
Client location	Iowa, United States (Google Cloud us-central1)

Above we can see some information about how access to Anopheles gambiae genomic data from MalariaGEN has been configured. This tells us where the data are stored (the “gs” stands for Google Cloud Storage), which data releases are available to analyse (Ag3.0, Ag3.1, …) and what the default analyses are for cohorts, species and site filters. We can also see that we are using version 15.0.1 of the malariagen_data Python package.

Summarising copy number variation by gene#

For surveillance analyses, it can be convenient to summarise the copy number data for each individual by gene. This allows us to ask questions like, for some metabolic resistance gene of interest, does a given mosquito have any copy number variation at that gene?

We’ve seen in the previous module how the copy number variation data is calculated in 300 base pair windows over the genome, but in general genes are larger than 300 base pairs and so will span multiple windows. So how do we summarise the windowed copy number data by gene?

One way to do this is to compute some kind of average over all the windows overlapping a given gene of interest. Because copy number state is a discrete variable, we typically prefer to do this by computing the modal copy number for windows overlapping a gene.

The diagram below shows the windowed copy number for a single mosquito for the region of the genome spanning several metabolic resistance genes. We illustrate how the summarisation is done to compute modal copy number by gene.

Computing gene CNV frequencies#

Now that we know how to summarise the data by gene, the next step is to then summarise these data over multiple individual mosquitoes, so that we can estimate the fraction of mosquitoes from a given time, place and species that carry some form of copy number variation at a given gene of interest. This is ultimately what we are interested in, because we would like to know if metabolic resistance is at low or high frequency and whether this varies over time or between geographical locations or species.

Because copy number variation is more complex than single nucleotide polymorphism, we compute the frequency of CNVs in a simplified way. We compute the fraction of individuals that carry some kind of copy number variation at a given gene, ignoring whether the individual was heterozygous or homozygous for that CNV, and also ignoring the number of additional copies that may be present. We also divide this into two types of CNV - amplification and deletion. So, for example, we compute the fraction of individuals which have an amplification at some gene; and similarly we compute the fraction with some deletion. Although generally we’re more interested in amplification, because amplification of metabolic resistance genes can cause increased insecticide resistance.

The diagram below illustrates this simple calculation of amplification and deletion frequencies for some genes of interest.

Let’s now compute gene CNV frequencies. To do this we can use the gene_cnv_frequencies() function.

We can look at the documentation for this function by using “?”.

ag3.gene_cnv_frequencies?

The documentation shows us that there are two required parameters for this function. The region parameter is the genomic region that we want to analyse, and the cohorts parameter determines how samples are to be grouped into cohorts.

Let’s initially analyse all genes in the Cyp6aa/p cluster on chromosome arm 2R, which starts around position 28,450,000 and ends at around position 28,510,000.

cyp6aap_region = "2R:28,480,000-28,510,000"

There are different ways to group our mosquitoes in to cohorts (more on this later) but for this analysis we’ll group spatially by level 1 administrative divisions within countries, and temporally by year.

cohorts = "admin1_year"

Were also going to use one of the optional parameters, sample_sets, to analyse just a subset the data.

Let’s have a look at what sample sets are available in the Ag3.0 data resource.

ag3.sample_sets(release='3.0')[['sample_set','sample_count','study_id']]

	sample_set	sample_count	study_id
0	AG1000G-AO	81	AG1000G-AO
1	AG1000G-BF-A	181	AG1000G-BF-1
2	AG1000G-BF-B	102	AG1000G-BF-1
3	AG1000G-BF-C	13	AG1000G-BF-2
4	AG1000G-CD	76	AG1000G-CD
5	AG1000G-CF	73	AG1000G-CF
6	AG1000G-CI	80	AG1000G-CI
7	AG1000G-CM-A	303	AG1000G-CM-1
8	AG1000G-CM-B	97	AG1000G-CM-2
9	AG1000G-CM-C	44	AG1000G-CM-3
10	AG1000G-FR	23	AG1000G-FR
11	AG1000G-GA-A	69	AG1000G-GA-1
12	AG1000G-GH	100	AG1000G-GH
13	AG1000G-GM-A	74	AG1000G-GM-1
14	AG1000G-GM-B	31	AG1000G-GM-2
15	AG1000G-GM-C	174	AG1000G-GM-3
16	AG1000G-GN-A	45	AG1000G-GN-ML
17	AG1000G-GN-B	185	AG1000G-GN-ML
18	AG1000G-GQ	10	AG1000G-GQ
19	AG1000G-GW	101	AG1000G-GW
20	AG1000G-KE	86	AG1000G-KE
21	AG1000G-ML-A	60	AG1000G-ML-1
22	AG1000G-ML-B	71	AG1000G-ML-2
23	AG1000G-MW	41	AG1000G-MW
24	AG1000G-MZ	74	AG1000G-MZ
25	AG1000G-TZ	300	AG1000G-TZ
26	AG1000G-UG	290	AG1000G-UG
27	AG1000G-X	669	AG1000G-X

To keep things simple, for this module we’ll focus on mosquitoes from Burkina Faso. There are three sample sets in the Ag3.0 resource providing data on mosquitoes from Burkina Faso.

sample_sets = ["AG1000G-BF-A", "AG1000G-BF-B", "AG1000G-BF-C"]

Now we’re ready to run the analysis.

burkina_cyp6aap_cnv_freqs_df = ag3.gene_cnv_frequencies(
    region=cyp6aap_region,
    cohorts=cohorts,
    sample_sets=sample_sets,
    drop_invariant=False,
)
burkina_cyp6aap_cnv_freqs_df

			gene_strand	gene_description	contig	start	end	frq_BF-07_gamb_2004	frq_BF-09_colu_2012	frq_BF-09_colu_2014	frq_BF-09_gamb_2012	frq_BF-09_gamb_2014	max_af	windows	label
gene_id	gene_name	cnv_type
AGAP002859	NaN	amp	+	solute carrier family 8 (sodium/calcium exchan...	2R	28397312	28516028	0.0	0.0000	0.000000	0.000000	0.000000	0.000000	397	AGAP002859 amp
AGAP002859	NaN	del	+	solute carrier family 8 (sodium/calcium exchan...	2R	28397312	28516028	0.0	0.0000	0.000000	0.000000	0.000000	0.000000	397	AGAP002859 del
AGAP002862	CYP6AA1	amp	-	cytochrome P450 [Source:VB Community Annotation]	2R	28480576	28482637	0.0	0.9125	0.811321	0.040816	0.065217	0.912500	8	AGAP002862 (CYP6AA1) amp
AGAP002862	CYP6AA1	del	-	cytochrome P450 [Source:VB Community Annotation]	2R	28480576	28482637	0.0	0.0000	0.000000	0.000000	0.000000	0.000000	8	AGAP002862 (CYP6AA1) del
AGAP013128	CYP6AA2	amp	-	cytochrome P450 [Source:VB Community Annotation]	2R	28483301	28484921	0.0	0.8250	0.792453	0.040816	0.065217	0.825000	6	AGAP013128 (CYP6AA2) amp
AGAP013128	CYP6AA2	del	-	cytochrome P450 [Source:VB Community Annotation]	2R	28483301	28484921	0.0	0.0000	0.000000	0.000000	0.000000	0.000000	6	AGAP013128 (CYP6AA2) del
AGAP002863	COEAE6O	amp	-	carboxylesterase alpha esterase [Source:VB Com...	2R	28485262	28487080	0.0	0.6000	0.509434	0.010204	0.021739	0.600000	7	AGAP002863 (COEAE6O) amp
AGAP002863	COEAE6O	del	-	carboxylesterase alpha esterase [Source:VB Com...	2R	28485262	28487080	0.0	0.0000	0.000000	0.000000	0.000000	0.000000	7	AGAP002863 (COEAE6O) del
AGAP002864	CYP6P15P	amp	-	cytochrome P450 [Source:VB Community Annotation]	2R	28487640	28489092	0.0	0.5250	0.528302	0.010204	0.021739	0.528302	6	AGAP002864 (CYP6P15P) amp
AGAP002864	CYP6P15P	del	-	cytochrome P450 [Source:VB Community Annotation]	2R	28487640	28489092	0.0	0.0000	0.000000	0.000000	0.000000	0.000000	6	AGAP002864 (CYP6P15P) del
AGAP002865	CYP6P3	amp	-	cytochrome P450 [Source:VB Community Annotation]	2R	28491415	28493141	0.0	0.0375	0.075472	0.010204	0.021739	0.075472	7	AGAP002865 (CYP6P3) amp
AGAP002865	CYP6P3	del	-	cytochrome P450 [Source:VB Community Annotation]	2R	28491415	28493141	0.0	0.0000	0.000000	0.112245	0.000000	0.112245	7	AGAP002865 (CYP6P3) del
AGAP002866	CYP6P5	amp	-	cytochrome P450 [Source:VB Community Annotation]	2R	28494017	28495645	0.0	0.0375	0.056604	0.020408	0.021739	0.056604	6	AGAP002866 (CYP6P5) amp
AGAP002866	CYP6P5	del	-	cytochrome P450 [Source:VB Community Annotation]	2R	28494017	28495645	0.0	0.0000	0.000000	0.000000	0.000000	0.000000	6	AGAP002866 (CYP6P5) del
AGAP002867	CYP6P4	amp	-	cytochrome P450 [Source:VB Community Annotation]	2R	28497087	28498674	0.0	0.0375	0.075472	0.010204	0.021739	0.075472	6	AGAP002867 (CYP6P4) amp
AGAP002867	CYP6P4	del	-	cytochrome P450 [Source:VB Community Annotation]	2R	28497087	28498674	0.0	0.0000	0.000000	0.000000	0.000000	0.000000	6	AGAP002867 (CYP6P4) del
AGAP002868	CYP6P1	amp	-	cytochrome P450 [Source:VB Community Annotation]	2R	28499251	28500900	0.0	0.0375	0.056604	0.010204	0.021739	0.056604	6	AGAP002868 (CYP6P1) amp
AGAP002868	CYP6P1	del	-	cytochrome P450 [Source:VB Community Annotation]	2R	28499251	28500900	0.0	0.0000	0.000000	0.000000	0.000000	0.000000	6	AGAP002868 (CYP6P1) del
AGAP002869	CYP6P2	amp	-	cytochrome P450 [Source:VB Community Annotation]	2R	28501033	28502910	0.0	0.0375	0.056604	0.010204	0.021739	0.056604	7	AGAP002869 (CYP6P2) amp
AGAP002869	CYP6P2	del	-	cytochrome P450 [Source:VB Community Annotation]	2R	28501033	28502910	0.0	0.0000	0.000000	0.000000	0.000000	0.000000	7	AGAP002869 (CYP6P2) del
AGAP002870	CYP6AD1	amp	-	cytochrome P450 [Source:VB Community Annotation]	2R	28504248	28505816	0.0	0.0375	0.056604	0.010204	0.021739	0.056604	6	AGAP002870 (CYP6AD1) amp
AGAP002870	CYP6AD1	del	-	cytochrome P450 [Source:VB Community Annotation]	2R	28504248	28505816	0.0	0.0000	0.000000	0.000000	0.000000	0.000000	6	AGAP002870 (CYP6AD1) del

The function outputs a pandas DataFrame, where rows represent gene CNVs. There are two rows for each gene, one representing an amplification (“amp” - increase in copy number), the other representing a deletion (“del” - decrease in copy number). There are 11 annotated genes on the chromosome arms we selected, so there are 2 times 11 = 22 rows in this dataframe.

len(burkina_cyp6aap_cnv_freqs_df)

Let’s look at the columns that provide information about the genes.

burkina_cyp6aap_cnv_freqs_df[["gene_strand", "gene_description", "contig", "start", "end"]]

			gene_strand	gene_description	contig	start	end
gene_id	gene_name	cnv_type
AGAP002859	NaN	amp	+	solute carrier family 8 (sodium/calcium exchan...	2R	28397312	28516028
AGAP002859	NaN	del	+	solute carrier family 8 (sodium/calcium exchan...	2R	28397312	28516028
AGAP002862	CYP6AA1	amp	-	cytochrome P450 [Source:VB Community Annotation]	2R	28480576	28482637
AGAP002862	CYP6AA1	del	-	cytochrome P450 [Source:VB Community Annotation]	2R	28480576	28482637
AGAP013128	CYP6AA2	amp	-	cytochrome P450 [Source:VB Community Annotation]	2R	28483301	28484921
AGAP013128	CYP6AA2	del	-	cytochrome P450 [Source:VB Community Annotation]	2R	28483301	28484921
AGAP002863	COEAE6O	amp	-	carboxylesterase alpha esterase [Source:VB Com...	2R	28485262	28487080
AGAP002863	COEAE6O	del	-	carboxylesterase alpha esterase [Source:VB Com...	2R	28485262	28487080
AGAP002864	CYP6P15P	amp	-	cytochrome P450 [Source:VB Community Annotation]	2R	28487640	28489092
AGAP002864	CYP6P15P	del	-	cytochrome P450 [Source:VB Community Annotation]	2R	28487640	28489092
AGAP002865	CYP6P3	amp	-	cytochrome P450 [Source:VB Community Annotation]	2R	28491415	28493141
AGAP002865	CYP6P3	del	-	cytochrome P450 [Source:VB Community Annotation]	2R	28491415	28493141
AGAP002866	CYP6P5	amp	-	cytochrome P450 [Source:VB Community Annotation]	2R	28494017	28495645
AGAP002866	CYP6P5	del	-	cytochrome P450 [Source:VB Community Annotation]	2R	28494017	28495645
AGAP002867	CYP6P4	amp	-	cytochrome P450 [Source:VB Community Annotation]	2R	28497087	28498674
AGAP002867	CYP6P4	del	-	cytochrome P450 [Source:VB Community Annotation]	2R	28497087	28498674
AGAP002868	CYP6P1	amp	-	cytochrome P450 [Source:VB Community Annotation]	2R	28499251	28500900
AGAP002868	CYP6P1	del	-	cytochrome P450 [Source:VB Community Annotation]	2R	28499251	28500900
AGAP002869	CYP6P2	amp	-	cytochrome P450 [Source:VB Community Annotation]	2R	28501033	28502910
AGAP002869	CYP6P2	del	-	cytochrome P450 [Source:VB Community Annotation]	2R	28501033	28502910
AGAP002870	CYP6AD1	amp	-	cytochrome P450 [Source:VB Community Annotation]	2R	28504248	28505816
AGAP002870	CYP6AD1	del	-	cytochrome P450 [Source:VB Community Annotation]	2R	28504248	28505816

Reminder - cohorts#

When we want to analyse frequencies we first define cohorts, which are simply groups of mosquitoes that we have data for and which we take as representatives of a mosquito population from a given time, place and species.

To define these cohorts we group mosquitoes according to these different variables, i.e., geographically, temporally and according to species (taxon). There are different options for doing this, depending on how fine-grained we want our analysis to be, and how many samples are available.

When computing frequencies and choosing cohorts, we need to think about two factors:

Cohort size - The larger the cohort the lower the uncertainty and the more confident we can be that the frequency we calculate within the cohort is representative of the frequency in the population from which the cohort was sampled. By default, the functions that compute frequencies in the malariagen_data package require a minimum cohort size of 10 individuals.
Heterogeneity - If we group samples together from a wide geographical area, or from a long time period, we may be losing information about any smaller-scale variation.

Often we have to make a compromise, and do the best we can with available data.

In the example above, we provided the cohorts="admin1_year" parameter, which means that samples will be grouped by species (taxon), administrative level 1 divisions, and year of collection. This gives us frequencies for five cohorts. Let’s look at the cohort frequency columns.

frequency_columns = [
    col for col in burkina_cyp6aap_cnv_freqs_df.columns 
    if col.startswith("frq_")
]
frequency_columns

['frq_BF-07_gamb_2004',
 'frq_BF-09_colu_2012',
 'frq_BF-09_colu_2014',
 'frq_BF-09_gamb_2012',
 'frq_BF-09_gamb_2014']

Remember, e.g., “frq_BF-09_gamb_2012” means the frequency of the allele within the cohort of Anopheles gambiae mosquitoes from the BF-09 (Haut-Bassins) region collected in 2012.

Let’s look at the values of these frequency columns in our DataFrame.

burkina_cyp6aap_cnv_freqs_df[frequency_columns + ["max_af", "windows"]]

			frq_BF-07_gamb_2004	frq_BF-09_colu_2012	frq_BF-09_colu_2014	frq_BF-09_gamb_2012	frq_BF-09_gamb_2014	max_af	windows
gene_id	gene_name	cnv_type
AGAP002859	NaN	amp	0.0	0.0000	0.000000	0.000000	0.000000	0.000000	397
AGAP002859	NaN	del	0.0	0.0000	0.000000	0.000000	0.000000	0.000000	397
AGAP002862	CYP6AA1	amp	0.0	0.9125	0.811321	0.040816	0.065217	0.912500	8
AGAP002862	CYP6AA1	del	0.0	0.0000	0.000000	0.000000	0.000000	0.000000	8
AGAP013128	CYP6AA2	amp	0.0	0.8250	0.792453	0.040816	0.065217	0.825000	6
AGAP013128	CYP6AA2	del	0.0	0.0000	0.000000	0.000000	0.000000	0.000000	6
AGAP002863	COEAE6O	amp	0.0	0.6000	0.509434	0.010204	0.021739	0.600000	7
AGAP002863	COEAE6O	del	0.0	0.0000	0.000000	0.000000	0.000000	0.000000	7
AGAP002864	CYP6P15P	amp	0.0	0.5250	0.528302	0.010204	0.021739	0.528302	6
AGAP002864	CYP6P15P	del	0.0	0.0000	0.000000	0.000000	0.000000	0.000000	6
AGAP002865	CYP6P3	amp	0.0	0.0375	0.075472	0.010204	0.021739	0.075472	7
AGAP002865	CYP6P3	del	0.0	0.0000	0.000000	0.112245	0.000000	0.112245	7
AGAP002866	CYP6P5	amp	0.0	0.0375	0.056604	0.020408	0.021739	0.056604	6
AGAP002866	CYP6P5	del	0.0	0.0000	0.000000	0.000000	0.000000	0.000000	6
AGAP002867	CYP6P4	amp	0.0	0.0375	0.075472	0.010204	0.021739	0.075472	6
AGAP002867	CYP6P4	del	0.0	0.0000	0.000000	0.000000	0.000000	0.000000	6
AGAP002868	CYP6P1	amp	0.0	0.0375	0.056604	0.010204	0.021739	0.056604	6
AGAP002868	CYP6P1	del	0.0	0.0000	0.000000	0.000000	0.000000	0.000000	6
AGAP002869	CYP6P2	amp	0.0	0.0375	0.056604	0.010204	0.021739	0.056604	7
AGAP002869	CYP6P2	del	0.0	0.0000	0.000000	0.000000	0.000000	0.000000	7
AGAP002870	CYP6AD1	amp	0.0	0.0375	0.056604	0.010204	0.021739	0.056604	6
AGAP002870	CYP6AD1	del	0.0	0.0000	0.000000	0.000000	0.000000	0.000000	6

Remember frequency values range between 0 and 1. 0 means no individuals have the variant. 1 means all individuals have the variant. 0.5 means 50% of individuals have the variant.

The max_af column simply shows the highest frequency found in the cohorts and is useful for filtering the output down to just CNVs at appreciable frequency in at least one cohort, and the windows columns shows how many 300 bp windows the frequency is calculated over.

Plotting frequency heatmaps#

Just like we could with SNP allele frequencies, we can also make gene cnv frequency DataFrames easier to interpret by plotting them as a heat map.

Let’s also filter our DataFrame by max_af to just CNVs present in at least one cohort over 5%, using a pandas query.

ag3.plot_frequencies_heatmap(
    burkina_cyp6aap_cnv_freqs_df.query("max_af > 0.05"), 
    title="Gene CNV frequencies, Burkina Faso, Cyp6aa/p locus")

We can see clearly now that amplifications appear to be at high frequency within An. coluzzii but not An. gambiae cohorts, and the amplifications appear to be highest for the Cyp6aa1 gene.

Exercise 1 (English)

Recall that there are several other cytochrome P450 (Cyp) genes and gene clusters associated with insecticide resistance in the genome, so let’s make a new region query to look at some other Cyp genes of interest and visualise them as heatmaps.

First, have a look at the Cyp9k1 gene on the X chromosome (15,240,000-15,250,000).

Hint: cyp9k1_region = "X:15,240,000-15,250,000"

Exercice 1 (Français)

Il y a d’autres gènes cytochrome P450 (Cyp) et groupes de gènes associés à la résistance aux insecticides dans le génome, utilisons donc une autre requête pour obtenir une région contenant d’autres gènes Cyp et les visualiser sous forme de heatmap.

D’abord, regardons le gène Cyp9k1 sur le chromosome X (15,240,000-15,250,000).

Indice: cyp9k1_region = "X:15,240,000-15,250,000"

# cyp9k1_region = ???

# burkina_cyp9k1_cnv_freqs_df = ag3.gene_cnv_frequencies(
#     region=cyp9k1_region,
#     cohorts="admin1_year",
#     sample_sets=["AG1000G-BF-A", "AG1000G-BF-B", "AG1000G-BF-C"],
# )

# ag3.plot_frequencies_heatmap(
#     burkina_cyp9k1_cnv_freqs_df.query("max_af > 0.05"),
# )

Exercise 2 (English)

Next, try the Cyp6mz region.

Hint: cyp6mz_region = "3R:6,924,000-6,980,000"

Exercice 2 (Français)

Essayons ensuite la région de Cyp6mz.

Indice: cyp6mz_region = "3R:6,924,000-6,980,000"

# cyp6mz_region = ???

# burkina_cyp6mz_cnv_freqs_df = ag3.gene_cnv_frequencies(
#     region=cyp6mz_region,
#     cohorts="admin1_year",
#     sample_sets=["AG1000G-BF-A", "AG1000G-BF-B", "AG1000G-BF-C"],
# )

# ag3.plot_frequencies_heatmap(
#     burkina_cyp6mz_cnv_freqs_df.query("max_af > 0.05"),
# )

Exercise 3 (English)

Let’s also have a look at the Gste gene cluster on chromosome arm 3R. What patterns can you see?

Hint: gste_region = "3R:28,590,000-28,610,000"

Exercice 3 (Français)

Regardons aussi le groupe de gènes Gste sur le bras de chromosome 3R. Quel motif observez-vous?

Indice: gste_region = "3R:28,590,000-28,610,000"

# gste_region = ???

# burkina_gste_cnv_freqs_df = ag3.gene_cnv_frequencies(
#     region=???,
#     cohorts="admin1_year",
#     sample_sets=["AG1000G-BF-A", "AG1000G-BF-B", "AG1000G-BF-C"],
# )

# ag3.plot_frequencies_heatmap(
#     burkina_gste_cnv_freqs_df.query("max_af > 0.05"),
# )

Using sample queries#

Sometimes you may want to analyse data across many sample sets, but select only samples from a single country or species. The useful sample_query parameter enables us to achieve this easily.

For example, if we wanted to analyse the Cyp6aa/p gene cluster in all of the Ag3.0 sample sets, but only look at Anopheles coluzzii samples in those sample sets, then we would set the sample_sets parameter to “3.0”…

sample_sets = "3.0"

… and pass a pandas query string to sample_query.

sample_query = "taxon == 'coluzzii'"

coluzzii_cyp6aap_cnv_freqs_df = ag3.gene_cnv_frequencies(
    region=cyp6aap_region,
    cohorts=cohorts,
    sample_sets=sample_sets,
    sample_query=sample_query,
)
coluzzii_cyp6aap_cnv_freqs_df

			gene_strand	gene_description	contig	start	end	frq_AO-LUA_colu_2009	frq_BF-09_colu_2012	frq_BF-09_colu_2014	frq_CF-BGF_colu_1994	frq_CI-LG_colu_2012	...	frq_GH-WP_colu_2012	frq_GM-M_colu_2012	frq_GN-N_colu_2012	frq_ML-2_colu_2004	frq_ML-2_colu_2014	frq_ML-3_colu_2012	frq_ML-4_colu_2004	max_af	windows	label
gene_id	gene_name	cnv_type
AGAP002862	CYP6AA1	amp	-	cytochrome P450 [Source:VB Community Annotation]	2R	28480576	28482637	0.0	0.9125	0.811321	0.0	0.870130	...	0.125000	0.046784	0.8	0.8	0.526316	0.846154	0.260870	0.912500	8	AGAP002862 (CYP6AA1) amp
AGAP013128	CYP6AA2	amp	-	cytochrome P450 [Source:VB Community Annotation]	2R	28483301	28484921	0.0	0.8250	0.792453	0.0	0.870130	...	0.125000	0.029240	0.8	0.8	0.421053	0.769231	0.260870	0.870130	6	AGAP013128 (CYP6AA2) amp
AGAP002863	COEAE6O	amp	-	carboxylesterase alpha esterase [Source:VB Com...	2R	28485262	28487080	0.0	0.6000	0.509434	0.0	0.740260	...	0.083333	0.000000	0.1	0.0	0.263158	0.461538	0.043478	0.740260	7	AGAP002863 (COEAE6O) amp
AGAP002864	CYP6P15P	amp	-	cytochrome P450 [Source:VB Community Annotation]	2R	28487640	28489092	0.0	0.5250	0.528302	0.0	0.922078	...	0.083333	0.000000	0.0	0.2	0.263158	0.346154	0.043478	0.922078	6	AGAP002864 (CYP6P15P) amp
AGAP002865	CYP6P3	amp	-	cytochrome P450 [Source:VB Community Annotation]	2R	28491415	28493141	0.0	0.0375	0.075472	0.0	0.922078	...	0.041667	0.000000	0.0	0.2	0.000000	0.038462	0.000000	0.922078	7	AGAP002865 (CYP6P3) amp
AGAP002866	CYP6P5	amp	-	cytochrome P450 [Source:VB Community Annotation]	2R	28494017	28495645	0.0	0.0375	0.056604	0.0	0.922078	...	0.000000	0.000000	0.0	0.2	0.000000	0.038462	0.000000	0.922078	6	AGAP002866 (CYP6P5) amp
AGAP002867	CYP6P4	amp	-	cytochrome P450 [Source:VB Community Annotation]	2R	28497087	28498674	0.0	0.0375	0.075472	0.0	0.922078	...	0.000000	0.000000	0.0	0.2	0.000000	0.038462	0.000000	0.922078	6	AGAP002867 (CYP6P4) amp
AGAP002868	CYP6P1	amp	-	cytochrome P450 [Source:VB Community Annotation]	2R	28499251	28500900	0.0	0.0375	0.056604	0.0	0.909091	...	0.000000	0.000000	0.0	0.2	0.000000	0.038462	0.000000	0.909091	6	AGAP002868 (CYP6P1) amp
AGAP002869	CYP6P2	amp	-	cytochrome P450 [Source:VB Community Annotation]	2R	28501033	28502910	0.0	0.0375	0.056604	0.0	0.922078	...	0.000000	0.000000	0.0	0.2	0.000000	0.038462	0.000000	0.922078	7	AGAP002869 (CYP6P2) amp
AGAP002869	CYP6P2	del	-	cytochrome P450 [Source:VB Community Annotation]	2R	28501033	28502910	0.0	0.0000	0.000000	0.0	0.000000	...	0.000000	0.000000	0.0	0.0	0.000000	0.000000	0.000000	0.040000	7	AGAP002869 (CYP6P2) del
AGAP002870	CYP6AD1	amp	-	cytochrome P450 [Source:VB Community Annotation]	2R	28504248	28505816	0.0	0.0375	0.056604	0.0	0.922078	...	0.000000	0.000000	0.0	0.2	0.000000	0.038462	0.000000	0.922078	6	AGAP002870 (CYP6AD1) amp

11 rows × 22 columns

We can see that our query string has worked by listing the cohort frequency (“frq”) columns in our output DataFrame. Note that the taxon is “colu” in each cohort.

colu_frequency_columns = [
    col for col in coluzzii_cyp6aap_cnv_freqs_df.columns 
    if col.startswith("frq_")
]

colu_frequency_columns

['frq_AO-LUA_colu_2009',
 'frq_BF-09_colu_2012',
 'frq_BF-09_colu_2014',
 'frq_CF-BGF_colu_1994',
 'frq_CI-LG_colu_2012',
 'frq_GH-AA_colu_2012',
 'frq_GH-CP_colu_2012',
 'frq_GH-WP_colu_2012',
 'frq_GM-M_colu_2012',
 'frq_GN-N_colu_2012',
 'frq_ML-2_colu_2004',
 'frq_ML-2_colu_2014',
 'frq_ML-3_colu_2012',
 'frq_ML-4_colu_2004']

Now we can visualise data for a locus of interest from this output using a heatmap, just as we did before.

ag3.plot_frequencies_heatmap(
    coluzzii_cyp6aap_cnv_freqs_df.query("max_af > 0.05"), 
    title="Gene CNV frequencies, An. coluzzii, Cyp6aa/p locus"
)

We can see CNV frequencies appear to differ between geographical locations drastically. For example, Cote D’Ivoire has the highest frequencies across the most genes in the cluster (“CI-LG_colu_2012”), but the collection from the same year in neighbouring Ghana (“GH-AA_colu_2012”) has no evidence for CNVs at this locus.

Exercise 4 (English)

Look at the same Cypaa/p locus, but this time at only An. gambiae. What patterns can you see in the results?

Hint: change sample_query taxon to “gambiae”.

Exercice 4 (Français)

Observer le même locus Cypaa/p, mais cette fois seulement pour An. gambiae. Quels motifs pouvez-vous observer dans ces résultats?

Indice: remplacer le taxon dans sample_query par “gambiae”.

# sample_query = ???

# gambiae_cyp6aap_cnv_freqs_df = ag3.gene_cnv_frequencies(
#     region=cyp6aap_region,
#     cohorts=cohorts,
#     sample_sets=sample_sets,
#     sample_query=sample_query,
# )

# ag3.plot_frequencies_heatmap(
#     gambiae_cyp6aap_cnv_freqs_df.query("max_af > 0.05"), 
#     title="Gene CNV frequencies, An. gambiae, Cyp6aa/p locus"
# )

Well done!#

In this module we have learnt how to analyse CNVs in genes that are associated with metabolic insecticide resistance in the Anopheles gambiae complex.

We have computed the frequency of copy number variation in genes of interest, and learnt different ways to compare these frequencies between mosquitoes from different time points, geographical locations and species.

Exercises#

English#

Open this notebook in Google Colab and run it for yourself from top to bottom. As you run through the notebook, cell by cell, think about what each cell is doing, and try the practical exercises along the way.

Have go at the practical exercises, but please don’t worry if you don’t have time to do them all during the practical session, and please ask the teaching assistants for help if you are stuck.

Hint: To open the notebook in Google Colab, click the rocket icon at the top of the page, then select “Colab” from the drop-down menu.

Français#

Ouvrir ce notebook dans Google Colab et l’exécuter vous-même du début à la fin. Pendant que vous exécutez le notebook, cellule par cellule, pensez à ce que chaque cellule fait et essayez de faire les exercices quand vous les rencontrez.

Essayez de faire les exercices mais ne vous inquiétez pas si vous n’avez pas le temps de tout faire pendant la séance appliquée et n’hésitez pas à demander aux enseignants assistants si vous avez besoin d’aide parce que vous êtes bloqués.

Indice: Pour ouvrir le notebook dans Google Colab, cliquer sur l’icône de fusée au sommet de cette page puis choisissez “Colab” dans le menu déroulant.

Module 4 - Analysing CNV frequencies at metabolic resistance genes

Contents

Module 4 - Analysing CNV frequencies at metabolic resistance genes#

Learning objectives#

Lecture#

English#

Français#

Setup#

Summarising copy number variation by gene#

Computing gene CNV frequencies#

Reminder - cohorts#

Plotting frequency heatmaps#

Using sample queries#

Plotting frequency time series#

Plotting frequencies using interactive maps#

Well done!#

Exercises#

English#

Français#