Training course in data analysis for genomic surveillance of African malaria vectors
Signals of insecticide resistance in Anopheles funestus#
Theme: Analysis
DISCLAIMER: This is work in progress and subject to change and updates.
This module provides a quick overview of various methods that can be used to detect and analyse signals of insecticide resistance in whole-genome sequenced Anopheles funestus mosquitoes using the Af1 resource.
The goal of this module is to show an example of a possible analysis plan using the Af1 resource. This analysis plan is not exhaustive and will need to be adapted to the specific questions one is interested in but it should provide a solid basis on which to build further analyses.
A certain degree of familiarity with the content of the training course on Anopheles gambiae s.l. is expected. Many of the concepts presented in this module were introduced in the training course and we will refer to the relevant workshop and module instead of giving detailed explanations.
More specifically, this module will use some functions to access SNP and CNV frequencies (which can be found, respectively, in Workshop 1 and Workshop 2) as well as H12 (which was covered in Workshop 6 - Module 3) to detect signals of selection. We will also look at gene flow for which methods were introduced in Workshop 7. Users are encouraged to go back to these workshops to refamiliarize themselves with their content.
Learning objectives#
After completing this module, you will be able to use the malariagen_data
Python package to:
Discover signals of selection in An. funestus.
Identify key SNPs.
Analyse CNVs at important loci.
Use haplotypes to detect gene flow between populations.
Lecture?#
English#
Français#
Please note that the code in the cells below might differ from that shown in the video. This can happen because Python packages and their dependencies change due to updates, necessitating tweaks to the code.
Setup#
First, let’s install the python packages we will need for our analyses.
%pip install -q --no-warn-conflicts malariagen_data
Now we’ve installed malariagen_data
, we can import it into our environment and set it up to access data in the cloud.
Note that authentication is required to access data through the package, please follow the instructions here.
import malariagen_data
import numpy as np
import plotly.io as pio
pio.renderers.default = "notebook+colab"
af1 = malariagen_data.Af1()
af1
Voltage-gated sodium channel#
We know that pyrethroids, the class of insecticides used as the main active ingredient in all long-lasting bednets, target the voltage-gated sodium channel (VGSC). The mutations of the VGSC gene have been well-document in An. gambiae s.l. [1]. We looked at them in An. gambiae in Workshop 1. Recent studies have shown high levels of pyrethroid resistance in An. funestus from Tanzania [2] and Ghana [3]. It seems thus sensible to start by looking at SNPs of the VGSC gene in An. funestus.
We looked for this gene during our exploration of the An. funestus genome Advanced Funestus - Exploration. When looking at SNPS, we need to provide a transcript and not a gene. In the genome annotations, transcripts are obtained by appending _t?
where ?
is a number. We will thus use ‘LOC125769886_t1’.
transcript_vgsc = 'LOC125769886_t1' #VGSC
vgsc_snp_freq_df = af1.snp_allele_frequencies(
transcript=transcript_vgsc,
sample_sets="1.0",
cohorts="admin1_year",
drop_invariant=False,
min_cohort_size=20,
)
af1.plot_frequencies_heatmap(vgsc_snp_freq_df.query("max_af > 0.05 and effect == 'NON_SYNONYMOUS_CODING'"))
None of the mutations that we observe is kdr-east or west (Do we need to say what it would be?) and no mutation is frequent enough to be a plausible cause of insecticide resistance. Indeed, if we look more closely at [2] and [3], no mutation of VGSC was connected to pyrethroid resistance. We are thus going to ignore what we know about insecticide resistance in An. gambiae (while still remembering the methods that we applied to it) and investigate An. funestus with a fresh mind.
Selection scans#
To detect signals of selection, we will compute H12 (see Workshop 6 - Module 3 for a refresher).
We are not going to look at all available cohorts in this analysis as it would be too much work. Instead we are going to focus on three countries: Mozambique, the Democratic Republic of the Congo and Nigeria. Let’s start by looking at the metadata for these three countries. More specifically, we are going to look at the first administrative level of origin of the samples.
af1.sample_metadata(sample_query="country in ['Mozambique', 'Democratic Republic of the Congo', 'Nigeria']").groupby(['country', 'admin1_name']).size()
For the Democratic Republic of the Congo, we will look at the samples coming from the Kinshasa and Upper Uele regions. For Mozambique, we will use the samples from the Cabo Delgado and Maputo regions. Finally, we will use the samples from the Ogun region for Nigeria.
Let’s start by calibrating the size of our windows. We will use the samples from Cabo Delgado and chromosome 2 for this calibration. In a real analysis, we would check that the chosen window size is correct for every chromosome and every sample set but we will assume that it is the case for this example.
af1.plot_h12_calibration(
contig="2RL",
analysis="funestus",
sample_sets="1.0",
sample_query="admin1_name == 'Cabo Delgado'",
)
It looks like 2000 base pairs is an good size for our windows.
for contig in af1.contigs:
af1.plot_h12_gwss(
contig=contig,
analysis="funestus",
sample_sets="1.0",
sample_query="admin1_name == 'Cabo Delgado'",
window_size=2000
)
The only obvious signal is around 8.7 Mbp on 2RL. If we look at the genes in the region, it includes the Cyp6aa/p cluster. We will mark this region for further CNV analysis.
for contig in af1.contigs:
af1.plot_h12_gwss(
contig=contig,
analysis="funestus",
sample_sets="1.0",
sample_query="admin1_name == 'Maputo'",
window_size=2000
)
The situation is very similar in Maputo.
for contig in af1.contigs:
af1.plot_h12_gwss(
contig=contig,
analysis="funestus",
sample_sets="1.0",
sample_query="admin1_name == 'Upper Uele'",
window_size=2000
)
We observe two new signals on X, as well as the one around the Cyp6aa/p cluster. The first one, around 8.5 Mbp corresponds to Cyp9k1, another gene that we will remember for our CNV analyses. The second one, around 13.6 Mbp, is around the gene diacylglycerol kinase (dgk). We will mark this gene for further SNP analysis.
for contig in af1.contigs:
af1.plot_h12_gwss(
contig=contig,
analysis="funestus",
sample_sets="1.0",
sample_query="admin1_name == 'Kinshasa'",
window_size=2000
)
The clearest signals in Kinshasa are the signal around the Cyp6aa/p cluster and a signal on 2RL around 76.5 Mbp. This signal is connected to Gste2.
for contig in af1.contigs:
af1.plot_h12_gwss(
contig=contig,
analysis="funestus",
sample_sets="1.0",
sample_query="admin1_name == 'Ogun'",
window_size=2000
)
We see a new signal on 3RL around 13.5Mbp. This is the region of Rdl. The signals for Gste2, cyp6aa/p and Cyp9k1 are also present.
SNPs#
We are now going to study the SNPs for a few of the genes where signals were detected in at least one population. We will look at every available cohort composed of more than 20 samples. Let’s start with Rdl.
transcript_rdl = 'LOC125769835_t1' #Rdl
rdl_snp_freq_df = af1.snp_allele_frequencies(
transcript=transcript_rdl,
sample_sets="1.0",
cohorts="admin1_year",
drop_invariant=False,
min_cohort_size=20,
)
af1.plot_frequencies_heatmap(rdl_snp_freq_df.query("max_af > 0.05 and effect == 'NON_SYNONYMOUS_CODING'"))
The mutation A296S is the one associated with insecticide resistance. It is present at very high frequencies in a few populations (“GH-AH” is Ghana - Ashanti and “NG-OG” is Nigeria - Ogun, the cohort we used during the selection scans). It is also present at lower frequencies elsewhere in Central Africa. We can also see that every cohort has the mutation V493A as fixed or almost fixed, this is a strong indication that the reference allele at that site is not representative of the wild population.
We will now look at Gste2.
transcript_gste = 'LOC125763977_t1'
gste_snp_freq_df = af1.snp_allele_frequencies(
transcript=transcript_gste,
sample_sets="1.0",
cohorts="admin1_year",
drop_invariant=False,
min_cohort_size=20,
)
af1.plot_frequencies_heatmap(gste_snp_freq_df.query("max_af > 0.05 and effect == 'NON_SYNONYMOUS_CODING'"))
The SNP of interest this time is L143F (I think) which is fixed in Benin and at high frequency in Kinshasa (DRC), Ashanti (Ghana) and Ogun (Nigeria) and present at lower frequencies in many other places.
Finally, let’s look at dgk.
transcript_dgk ='LOC125760558_t1'
dgk_snp_freq_df = af1.snp_allele_frequencies(
transcript=transcript_dgk,
sample_sets="1.0",
cohorts="admin1_year",
drop_invariant=False,
min_cohort_size=20,
)
af1.plot_frequencies_heatmap(dgk_snp_freq_df.query("max_af > 0.05 and effect == 'NON_SYNONYMOUS_CODING'"))
The situation is not as clear this time and dgk has not been studied as much as other genes so no particular SNP has been closely connected to insecticide resistance. There are several possible candidates but the data is not sufficient to come to any conclusion.
CNVs#
We are now going to study the CNVs for a few of the regions where signals were detected in at least one population. We will, again, look at every available cohort composed of more than 20 samples. Let’s start with Cyp6aa/p. We will look at the annotations to confirm that these are cytochrome P450 genes.
af1.genome_features(region="2RL:8600000-8700000").query("type == 'protein_coding_gene'")
This confirms what we expected.
cyp6aap_cnv_freq_df = af1.gene_cnv_frequencies(
region="2RL:8600000-8700000",
sample_sets="1.0",
cohorts="admin1_year",
drop_invariant=False,
min_cohort_size=20,
)
af1.plot_frequencies_heatmap(cyp6aap_cnv_freq_df)
Many genes seem to be amplified at very high frequencies accross the continent which is a strong indication that metabolic resistance to pyrethroids is wide-spread.
Let’s now look at Cyp9k1. Again, we will check that it is indeed in the region where we observed the signal of selection.
af1.genome_features(region="X:8448000-8451000").query("type == 'protein_coding_gene'")
cyp9k1_cnv_freq_df = af1.gene_cnv_frequencies(
region="X:8448000-8451000",
sample_sets="1.0",
cohorts="admin1_year",
drop_invariant=False,
min_cohort_size=20,
)
af1.plot_frequencies_heatmap(cyp9k1_cnv_freq_df)
Once more, this gene seems to be amplified at very high frequencies in several different countries, though mostly in Central and North-East Africa.
We will look next at Gste2. We already known that it is ‘LOC125763977’.
gste_cnv_freq_df = af1.gene_cnv_frequencies(
region="2RL:76405000-76408000",
sample_sets="1.0",
cohorts="admin1_year",
drop_invariant=False,
min_cohort_size=20,
)
af1.plot_frequencies_heatmap(gste_cnv_freq_df)
Gste2 is amplified in a few populations in Benin and the DRC but at fairly low frequencies.
Finally, let’s look at dgk. We know it is ‘LOC125760558’.
gste_cnv_freq_df = af1.gene_cnv_frequencies(
region="X:13598000-13676000",
sample_sets="1.0",
cohorts="admin1_year",
drop_invariant=False,
min_cohort_size=20,
)
af1.plot_frequencies_heatmap(gste_cnv_freq_df)
It is not amplified at all in any population.
We will look at the CNV HMM for Cyp9k1 in Kinshasa to see what the amplification looks like. We extend the window a little bit to see the actual read calls a little better.
af1.plot_cnv_hmm_heatmap(
region="X:8445000-8457000",
sample_sets="1.0",
sample_query="admin1_name == 'Kinshasa'",
row_height=5
);
We see several samples with an additional copy as well as multiple ones with more than 1 extra copy of the gene. Let’s look at the HMM coverage for a few samples.
samples = ["VBS17491", "VBS17512", "VBS17502", "VBS17522"]
for s in samples:
af1.plot_cnv_hmm_coverage(s, sample_set="1.0", region="X:8445000-8457000")
The first sample shows an amplification that starts well before Cyp9k1 and ends after the end of the window. The second has a different discordant read that ends much sooner. The third one has the same discordant read as the second one but an extra copy. Finally, the last one doesn’t have any amplification: the only one that is found is in a single window and is probably wrong. In general, we do not have as many windows as in Workshop 2 so we need to be more careful with our conclusions.
Gene flow#
Let us now look at potential gene flow between populations in Mozambique and the Democratic Republic of the Congo. For that, we are going to use the methods described in Workshop 7, starting with H1X.
for contig in af1.contigs:
af1.plot_h1x_gwss(
contig=contig,
analysis="funestus",
sample_sets="1.0",
cohort1_query="admin1_name == 'Cabo Delgado'",
cohort2_query="admin1_name == 'Maputo'",
window_size=2000
)
It is, indeed, the only clear signal of a shared haplotype. Let’s repeat the experiment with the data from the DRC.
for contig in af1.contigs:
af1.plot_h1x_gwss(
contig=contig,
analysis="funestus",
sample_sets="1.0",
cohort1_query="admin1_name == 'Upper Uele'",
cohort2_query="admin1_name == 'Kinshasa'",
window_size=2000
)
This time, and despite the clear signals for both populations around the Cyp6aa/p cluster, there is no signal of a shared haplotype. Let’s use haplotype clusterings to check this conclusion.
af1.plot_haplotype_clustering(
region='2RL:8600000-8700000',
analysis='funestus',
sample_sets='1.0',
sample_query= 'country == "Mozambique"',
color='admin1_name')
Despite the very high number of SNPs, there are two clear shared haplotypes between Maputo and Cabo Delgado.
af1.plot_haplotype_clustering(
region='2RL:8600000-8700000',
analysis='funestus',
sample_sets='1.0',
sample_query= 'country == "Democratic Republic of the Congo"',
color='admin1_name')
In the DRC, however, we observe that some haplotypes are widely shared within each region, which is consistent with it being under selection, but the haplotypes are not shared accross the country at all. Let’s show the same results using haplotype networks. (Except they don’t show on bespin)
af1.plot_haplotype_network(
region='2RL:8600000-8700000',
analysis='funestus',
sample_sets='1.0',
sample_query= 'country in ["Mozambique", "Democratic Republic of the Congo"]',
color="admin1_name",
max_dist=2,
)
Congratulations on reaching the end of this notebook. You should now be able to run your own insecticide resistance analyses of Anopheles funestus data using Af1.0.
Clarkson CS, Miles A, Harding NJ, O’Reilly AO, Weetman D, Kwiatkowski D, Donnelly MJ; Anopheles gambiae 1000 Genomes Consortium. The genetic architecture of target-site resistance to pyrethroid insecticides in the African malaria vectors Anopheles gambiae and Anopheles coluzzii. Mol Ecol. 2021 Nov;30(21):5303-5317. doi: 10.1111/mec.15845. Epub 2021 Mar 8. PMID: 33590926; PMCID: PMC9019111.
Matowo, N.S., Martin, J., Kulkarni, M.A. et al. An increasing role of pyrethroid-resistant Anopheles funestus in malaria transmission in the Lake Zone, Tanzania. Sci Rep 11, 13457 (2021). https://doi.org/10.1038/s41598-021-92741-8
Mugenzi LMJ, Akosah-Brempong G, Tchouakui M, Menze BD, Tekoh TA, Tchoupo M, Nkemngo FN, Wondji MJ, Nwaefuna EK, Osae M, Wondji CS. Escalating pyrethroid resistance in two major malaria vectors Anopheles funestus and Anopheles gambiae (s.l.) in Atatam, Southern Ghana. BMC Infect Dis. 2022 Oct 25;22(1):799. doi: 10.1186/s12879-022-07795-4. PMID: 36284278; PMCID: PMC9597992.