Workshop 6 - Training course in data analysis for genomic surveillance of African malaria vectors
Module 3 - Detecting new forms of insecticide resistance using selection scans#
Theme: Analysis
In this module we’re going learn how to scan the genome for signals of recent selection with the H12 statistic, using haplotype variation data from Ag3.0.
We will use functions in the malariagen_data
Python package to calibrate, run and plot H12 genome-wide selection scans. Will then learn how interpret and investigate the results to detect candidate genes potentially involved in new forms of insecticide resistance.
Learning objectives#
At the end of this module you will be able to:
Calibrate the window size parameter for an H12 analysis.
Run, plot and interpret a genome scan for recent selection using the H12 statistic.
Identify candidate genes that are under selection.
Lecture#
English#
Français#
Please note that the code in the cells below might differ from that shown in the video. This can happen because Python packages and their dependencies change due to updates, necessitating tweaks to the code.
Setup#
First, let’s begin by installing and importing some Python packages, and configuring access to Anopheles genomic data from the MalariaGEN Ag3.0 data resource.
%pip install -q --no-warn-conflicts malariagen_data
Note: you may need to restart the kernel to use updated packages.
import warnings
warnings.simplefilter(action='ignore', category=Warning)
import malariagen_data
Saving H12 results#
H12 genome wide scans for selection may take a while to complete, particularly if you’re running this code on a service with modest computational resources such as Google Colab.
To avoid having to rerun these analyses, we’ll save the results so we can come back to them later. In Google Colab, you can save results to your Google Drive, which will mean you don’t lose results even if you leave the notebook and come back several days later.
When mounting your Google Drive you will need to follow the authorization instructions.
try:
from google.colab import drive
drive.mount("drive")
except ImportError:
pass
With our Google Drive now mounted, we can define and make a directory where we want to save our H12 results.
results_dir = "drive/MyDrive/malariagen_data_cache"
In Google Colab, we can actually see our mounted drive and H12 results directory by clicking on the file tab on the left hand side of the screen.
Next we should setup the malariagen_data package. As we want to save our H12 results in the Google Drive folder we just set up, we’ll use the results_cache parameter and assign our results directory to it. If we were running this notebook locally, then we could assign a local folder to this parameter and the H12 results would instead get stored on our hard drive.
Note that authentication is required to access data through the package, please follow the instructions here.
ag3 = malariagen_data.Ag3(results_cache=results_dir)
ag3
MalariaGEN Ag3 API client | |
---|---|
Please note that data are subject to terms of use, for more information see the MalariaGEN website or contact support@malariagen.net. See also the Ag3 API docs. | |
Storage URL | gs://vo_agam_release/ |
Data releases available | 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9 |
Results cache | /home/jonbrenas/anopheles-genomic-surveillance.github.io/docs/workshop-6/drive/MyDrive/malariagen_data_cache |
Cohorts analysis | 20240418 |
AIM analysis | 20220528 |
Site filters analysis | dt_20200416 |
Software version | malariagen_data 9.0.0 |
Client location | unknown |
Why do we need to scan the genome for signals of recent selection?#
Mosquitoes are exposed to high levels of insecticides both directly via malaria vector control campaigns and more widely through agricultural use of these chemicals. The insecticides being used are also changing, as new compounds are developed. The high strength of selective pressure exerted by these insecticides is evidenced by the speed at which insecticide resistance evolves in mosquitoes.
Historically, molecular techniques were used to discover genes and SNPs associated with insecticide resistance, and produce the assays necessary to track them in natural populations. These processes are skill and time intensive, often taking years to discover new loci.
Clearly these kind of time scales are not operationally feasible when it comes to public health. The purchase and use of insecticides needs to be informed by recent resistance data, to ensure that the vector control chemicals can be changed before high levels of resistance cause them to become less effective or even fail completely.
Known unknowns#
In this series of workshops, we have already seen how to use genomic data to investigate regions of the genome that we know are associated with insecticide resistance. For example, in workshop 1 we looked at SNP variation in VGSC and in workshop 2 we looked at CNV variation in the Cyp6aap region. Genomic data are great for these kind of analyses because they allow large numbers of samples to be investigated quickly, easily, and in parallel, effectively running traditional molecular assays but in silico.
Unknown unknowns#
The speed at which insecticide resistance evolves coupled with the use of new insecticidal compounds makes it risky to only investigate genes known to be associated with resistance, as it is likely that new forms of resistance will go undetected. Fortunately, whole genome data allows us to naively scan the genome for signals of adaptation known as selective sweeps. By using these genome wide scans for selection (GWSS), novel loci under selection in mosquito populations can be quickly identified for further investigation as candidate insecticide resistance loci.
Running a GWSS using the H12 statistic#
What is H12 analysis?#
H12 measures haplotype homozygosity, in effect the similarity of haplotypes, and will run it on the phased Ag3.0 haplotype data we learnt about in module 1. This statistic is sensitive to recent selection, making it ideal for detecting selective sweeps driven by recent insecticidal pressures and, unlike some other selection detecting statistics is excellent at detecting soft sweeps.
Different kinds of selective sweeps#
Selective sweeps can be divided into hard or soft sweeps. When positive selection acts on a single adaptive mutation, causing it to increase in frequency, it is referred to a hard sweep. Where multiple adaptive mutations occur at the same locus and increase in frequency due to positive selection, it is known as a soft sweep.
However, with H12, we aren’t focusing on the individual mutations, e.g. SNPs, but rather haplotypes. We would expect the haplotype around a locus under selection to increase in frequency with it. So, in the case of hard sweep a single haplotype should be detected at high frequency and in a soft sweep, multiple haplotypes should be present at high frequencies. The figure below demonstrates this concept with mutation frequency on the left and haplotype frequency at the end time point on the right.
How H12 works#
H12 is based on measure of population diversity known as haplotype homozygosity or H1. In their 2015 paper, Garud and colleagues demonstrated that although H1 was a good statistic to detect hard sweeps, with a single haplotype at high frequency, it was less successful at finding soft sweeps.
In order to detect both hard and soft sweeps, Garud et al. modified the H1 statistic so that the first and second most common haplotypes frequencies are combined (hence H12).
For a deeper dive in the statistic and it’s underlying mathematics, I would highly recommend the Garud lab website or for the brave, the paper.
H12 output#
It is possible to run “peak” or “outlier” detection on the H12 GWSS results, but for this module we will interpret the results by visualising a plot of the data (we will explore how to make these plots in detail later).
Just to get a sense of what the H12 output looks like, let’s run an H12 analysis over chromosome arm 2R, using An. gambiae mosquitoes from a cohort collected in Burkina Faso in 2014.
ag3.plot_h12_gwss(
contig="2R",
analysis="gamb_colu",
window_size=1000,
sample_sets="3.0",
sample_query="cohort_admin2_year == 'BF-09_Houet_gamb_2014'",
)
Load genome features: ⠙ (0:00:00.08)
BokehDeprecationWarning: 'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.
In the example above, we can see from the y-axis that the H12 statistic ranges from 0 to 1. Each point on the plot is the H12 statistic computed in a window, in this case window_size=1000
SNPs, across the 2R chromosome arm (x-axis). Selective sweeps are suggested by peaks in the data. Single high value windows are generally ignored as noise, because the imprint of selection on the genome should leave a peak centered near to the locus under selection, with shoulders of the peak dropping away either side.
If we zoom in to ~28,500,000bp, and roll the mouse over the gene track, we can see that H12 has detected the expected selective sweep over the Cyp6aa/p gene cluster, containing genes known to be involved in insecticide resistance. The sweep is composed of many 1000 SNP windows, providing strong evidence a true sweep has occurred here.
H12 analytical workflow#
Running an H12 analysis can be broken down into the following steps:
Select cohorts for analysis
Calibrate
window_size
parameter for each cohortRun and plot H12 scans
Peak identification
Investigate signals to identify candidate resistance genes
Let’s explore each of these in detail.
Step 1. Select cohorts for analysis#
Recap on cohorts#
First we need to define a cohort of individuals to run the H12 analysis on. The MalariaGEN Ag3.0 data resource contain mosquito samples collected across large spatial and temporal scales, and from different mosquito species. When we want to run population genetic analyses on datasets like these, the data must be divided into biologically relevant cohorts, where a cohort is simply a group of samples we want to analyse together.
Cohort size#
When we are choosing the cohort for a H12 analyses we need to consider how many samples it contains, too few samples and it could become difficult to identify selective sweep signals over background noise, too many and computational time is wasted needlessly. We have found that a reasonable cohort size is 30 samples, hence this is the default cohort_size
parameter in all the H12 GWSS functions that we will use. If you try to analyse a cohort with less samples than this parameter, it will error. If your cohort has more samples, they will be randomly downsampled to cohort size
.
For this example, let’s focus on Burkina Faso, and first examine what cohorts are available.
burkina_samples_df = ag3.sample_metadata(sample_sets = "3.0", sample_query="country == 'Burkina Faso'")
burkina_samples_df
sample_id | partner_sample_id | contributor | country | location | year | month | latitude | longitude | sex_call | ... | admin1_name | admin1_iso | admin2_name | taxon | cohort_admin1_year | cohort_admin1_month | cohort_admin1_quarter | cohort_admin2_year | cohort_admin2_month | cohort_admin2_quarter | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AB0085-Cx | BF2-4 | Austin Burt | Burkina Faso | Pala | 2012 | 7 | 11.151 | -4.235 | F | ... | Hauts-Bassins | BF-09 | Houet | gambiae | BF-09_gamb_2012 | BF-09_gamb_2012_07 | BF-09_gamb_2012_Q3 | BF-09_Houet_gamb_2012 | BF-09_Houet_gamb_2012_07 | BF-09_Houet_gamb_2012_Q3 |
1 | AB0086-Cx | BF2-6 | Austin Burt | Burkina Faso | Pala | 2012 | 7 | 11.151 | -4.235 | F | ... | Hauts-Bassins | BF-09 | Houet | gambiae | BF-09_gamb_2012 | BF-09_gamb_2012_07 | BF-09_gamb_2012_Q3 | BF-09_Houet_gamb_2012 | BF-09_Houet_gamb_2012_07 | BF-09_Houet_gamb_2012_Q3 |
2 | AB0087-C | BF3-3 | Austin Burt | Burkina Faso | Bana Village | 2012 | 7 | 11.233 | -4.472 | F | ... | Hauts-Bassins | BF-09 | Houet | coluzzii | BF-09_colu_2012 | BF-09_colu_2012_07 | BF-09_colu_2012_Q3 | BF-09_Houet_colu_2012 | BF-09_Houet_colu_2012_07 | BF-09_Houet_colu_2012_Q3 |
3 | AB0088-C | BF3-5 | Austin Burt | Burkina Faso | Bana Village | 2012 | 7 | 11.233 | -4.472 | F | ... | Hauts-Bassins | BF-09 | Houet | coluzzii | BF-09_colu_2012 | BF-09_colu_2012_07 | BF-09_colu_2012_Q3 | BF-09_Houet_colu_2012 | BF-09_Houet_colu_2012_07 | BF-09_Houet_colu_2012_Q3 |
4 | AB0089-Cx | BF3-8 | Austin Burt | Burkina Faso | Bana Village | 2012 | 7 | 11.233 | -4.472 | F | ... | Hauts-Bassins | BF-09 | Houet | coluzzii | BF-09_colu_2012 | BF-09_colu_2012_07 | BF-09_colu_2012_Q3 | BF-09_Houet_colu_2012 | BF-09_Houet_colu_2012_07 | BF-09_Houet_colu_2012_Q3 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
291 | AB0314-C | 6775 | Nora Besansky | Burkina Faso | Monomtenga | 2004 | 8 | 12.060 | -1.170 | F | ... | Centre-Sud | BF-07 | Bazega | gambiae | BF-07_gamb_2004 | BF-07_gamb_2004_08 | BF-07_gamb_2004_Q3 | BF-07_Bazega_gamb_2004 | BF-07_Bazega_gamb_2004_08 | BF-07_Bazega_gamb_2004_Q3 |
292 | AB0315-C | 6777 | Nora Besansky | Burkina Faso | Monomtenga | 2004 | 8 | 12.060 | -1.170 | F | ... | Centre-Sud | BF-07 | Bazega | gambiae | BF-07_gamb_2004 | BF-07_gamb_2004_08 | BF-07_gamb_2004_Q3 | BF-07_Bazega_gamb_2004 | BF-07_Bazega_gamb_2004_08 | BF-07_Bazega_gamb_2004_Q3 |
293 | AB0316-C | 6779 | Nora Besansky | Burkina Faso | Monomtenga | 2004 | 8 | 12.060 | -1.170 | F | ... | Centre-Sud | BF-07 | Bazega | gambiae | BF-07_gamb_2004 | BF-07_gamb_2004_08 | BF-07_gamb_2004_Q3 | BF-07_Bazega_gamb_2004 | BF-07_Bazega_gamb_2004_08 | BF-07_Bazega_gamb_2004_Q3 |
294 | AB0318-C | 5072 | Nora Besansky | Burkina Faso | Monomtenga | 2004 | 7 | 12.060 | -1.170 | F | ... | Centre-Sud | BF-07 | Bazega | gambiae | BF-07_gamb_2004 | BF-07_gamb_2004_07 | BF-07_gamb_2004_Q3 | BF-07_Bazega_gamb_2004 | BF-07_Bazega_gamb_2004_07 | BF-07_Bazega_gamb_2004_Q3 |
295 | AB0325-C | 1403 | Nora Besansky | Burkina Faso | Monomtenga | 2004 | 6 | 12.060 | -1.170 | F | ... | Centre-Sud | BF-07 | Bazega | gambiae | BF-07_gamb_2004 | BF-07_gamb_2004_06 | BF-07_gamb_2004_Q2 | BF-07_Bazega_gamb_2004 | BF-07_Bazega_gamb_2004_06 | BF-07_Bazega_gamb_2004_Q2 |
296 rows × 32 columns
burkina_samples_df.groupby("cohort_admin2_year").size()
cohort_admin2_year
BF-07_Bazega_gamb_2004 13
BF-09_Houet_arab_2014 3
BF-09_Houet_colu_2012 82
BF-09_Houet_colu_2014 53
BF-09_Houet_gamb_2012 99
BF-09_Houet_gamb_2014 46
dtype: int64
In the Burkina Faso data from Ag3.0, there are six cohorts, four of which have more than 30 samples, so lets pick one of these to run a H12 GWSS on.
Step 2. Calibrate window size#
Recap: Windowed analyses#
The Anopheles gambiae genome is relatively large, e.g. there are millions of SNPs on the 3R chromosome arm alone. We are going to use a windowed approach for H12, where we define a window size in number of SNPs, then move this window along the chromosome arm haplotypes, calculating our statistic across each window. This gives us a summary of data that is easier to interpret. This is similar to the approach we used to analyse heterozygosity in workshop 5, however, in the case of heterozygosity, our window was defined by fixed genomic length rather than by number of SNPs.
Choosing the window size for H12 analyses#
Once we have decided on our cohorts, we need to calibrate the window size, an important parameter for the H12 analysis. Every cohort needs its own window calibration because their demographic history will be unique. In workshop 5 we learnt how this demographic history can affect genome-wide background genetic diversity.
Let’s have a look at what happens if the wrong window size is used, then run through our methodology for picking the best size.
Here’s the example we saw earlier, using a 1,000 SNP window on the 2R chromosome arm.
ag3.plot_h12_gwss(
contig="2R",
analysis="gamb_colu",
window_size=1000,
sample_sets="3.0",
sample_query="cohort_admin2_year == 'BF-09_Houet_gamb_2014'",
title="BF-09_Houet_gamb_2014; window_size=1000"
)
BokehDeprecationWarning: 'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.
As we saw earlier, by zooming into an established pyrethroid metabolic insecticide resistance locus, known to be under selection, we are able to see a clear peak in the H12 statistic if we have a good window size. Let’s zoom in to 28,500,000bp (Cyp6aa/p gene cluster) and remind ourselves.
However, if our window size is too small, then our signal may not be visible over background noise. To illustrate what happens, let’s run the same scan but with a much smaller window size of 100 bp.
ag3.plot_h12_gwss(
contig="2R",
analysis="gamb_colu",
window_size=100,
sample_sets="3.0",
sample_query="cohort_admin2_year == 'BF-09_Houet_gamb_2014'",
title="BF-09_Houet_gamb_2014; window_size=100 (too small!)"
)
BokehDeprecationWarning: 'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.
This scan is far too noisy and it is impossible to detect any signals.
Conversely, if our window size is too big, then a true signal of selective sweep may not cover enough of the genome to generate a peak in the H12 data. To illustrate what happens, let’s run the same scan with a larger window size of 20,000 bp.
ag3.plot_h12_gwss(
contig="2R",
analysis="gamb_colu",
window_size=20_000,
sample_sets="3.0",
sample_query="cohort_admin2_year == 'BF-09_Houet_gamb_2014'",
title="BF-09_Houet_gamb_2014; window_size=20,000 (too big!)"
)
BokehDeprecationWarning: 'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.
With a window size this large, all but the Cyp6aa/p signal have been lost.
Calibrate window size#
We can use the ag3.plot_h12_calibration()
function to come to a decision about which window size will likely be most effective without having to visualise multiple H12 GWSS plots. Let’s have a look at the function.
ag3.plot_h12_calibration?
The required parameters are:
contig
- the chromosome to run the calibration scans on.analysis
- which phasing analysis to use, depends on what taxa are in your cohort.sample_query
sample_sets
Let’s run the calibration with the Burkina cohort we looked at previously. It doesn’t matter too much which contig to use, but we often use 3L because it isn’t affected by chromosomal inversions.
ag3.plot_h12_calibration(
contig="3L",
analysis="gamb_colu",
sample_sets="3.0",
sample_query="cohort_admin2_year == 'BF-09_Houet_gamb_2014'",
)
BokehDeprecationWarning: 'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.
The plot shows the distribution of H12 values across a range of window sizes from 100 SNPs to 20,000 SNPs.
Our rule of thumb is to choose a window size where the where the 95% percentile of H12 values is at or below 0.1.
For this cohort, a reasonable window size to choose is 1,000 SNPs, and as we saw earlier, that produced clean signals of selective sweeps at known insecticide resistance loci in this cohort.
Remember, the window size parameter needs to be calibrated for each cohort you want to analyse.
Step 3. Run GWSS with H12#
Now we know what a reasonable window size is for our GWSS, we can run and plot the H12 analysis.
Let’s have a look at the function we will use.
ag3.plot_h12_gwss?
The required parameters are almost identical to the calibration function, except now we need to specify the window_size
we’d like to use for the analysis.
Let’s run a H12 GWSS analysis, end to end, all five chromosome arms, for our chosen cohort of interest.
for contig in ag3.contigs:
ag3.plot_h12_gwss(
contig=contig,
analysis="gamb_colu",
sample_sets="3.0",
sample_query="cohort_admin2_year == 'BF-09_Houet_gamb_2014'",
window_size=1000
)
BokehDeprecationWarning: 'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.
BokehDeprecationWarning: 'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.
BokehDeprecationWarning: 'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.
BokehDeprecationWarning: 'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.
BokehDeprecationWarning: 'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.
Step 4. Peak identification#
Reading an H12 plot - recap:
The H12 statistic ranges from 0-1
Each point on the plot is H12 caculated over a window, measured in SNPs
True signals will generally have a max H12 value that is an outlier in the genome-wide distribution
True signals will have a peak shape with roughly exponential decay on both flanks
In the Burkina Faso cohort GWSS plots above, we can spot several signals that have both outlier maximums and fit our peak shape criteria.
Let’s pick the signal on chromosome arm 2L at ~28,550,000 bp and look for candidate insecticide resistance genes.
Step 5. Identify candidate genes#
Case study - Burkina Faso, 2L, peak centered ~28,550,000#
So we’ve done the “easy” bit, run our H12 GWSS and found some promising looking signals of selective sweeps. Now we need to identify candidate genes for further investigation.
What follows is a suggested approach to identify candidate genes potentially involved IR.
Find the approximate centre of your peak of interest by zooming in on the plot above - this is ~28,550,000bp along 2L.
Look at the genes around the centre of the peak. The gene under under selection may not be exactly under the peak, but weight priority by closeness to centre.
Build a list of genes for further investigation.
As the signal is quite wide, I am starting by looking 25kb either side of the peak.
We can also use the
ag3.genome_features()
function and a Pandasquery
to pull out a list of genes given a region.
peak_genes_df = (
ag3.genome_features(region="2L:28,525,000-28,575,000")
.query('type == "gene"')
[["contig", "start", "end", "ID", "Name", "description"]]
.set_index("ID")
)
peak_genes_df
contig | start | end | Name | description | |
---|---|---|---|---|---|
ID | |||||
AGAP006222 | 2L | 28524225 | 28526317 | NaN | glucosyl/glucuronosyl transferases [Source:VB ... |
AGAP006223 | 2L | 28526558 | 28528641 | NaN | glucosyl/glucuronosyl transferases [Source:VB ... |
AGAP006224 | 2L | 28528758 | 28533199 | NaN | aldehyde oxidase [Source:VB Community Annotation] |
AGAP006225 | 2L | 28534732 | 28539416 | NaN | aldehyde oxidase [Source:VB Community Annotation] |
AGAP006226 | 2L | 28540651 | 28545294 | Aldehyde_oxidase | NaN |
AGAP006227 | 2L | 28545396 | 28547938 | NaN | alpha esterase [Source:VB Community Annotation] |
AGAP006228 | 2L | 28548433 | 28550748 | COEAE2F | carboxylesterase [Source:VB Community Annotation] |
AGAP006229 | 2L | 28550814 | 28552032 | Vps20 | vacuolar protein sorting 20 [Source:VB Communi... |
AGAP006231 | 2L | 28552352 | 28560186 | NaN | serine/threonine-protein phosphatase dullard h... |
AGAP006232 | 2L | 28563646 | 28565368 | Pex14 | peroxin-14 [Source:VB Community Annotation] |
AGAP006233 | 2L | 28565893 | 28567186 | NaN | NaN |
AGAP006234 | 2L | 28567535 | 28569087 | NaN | protein SHQ1 [Source:VB Community Annotation] |
AGAP006235 | 2L | 28569164 | 28572971 | NaN | NaN |
AGAP006236 | 2L | 28573531 | 28574496 | NaN | NaN |
AGAP006237 | 2L | 28574680 | 28575725 | NaN | Negative elongation factor E [Source:VB Commun... |
Look for any genes within gene families that have association with insecticide metabolism or resistance.
E.g. Cytochrome P450s (Cyps), glutathione S -transferases (GSTs), ion channels (e.g. VGSC), ABC transporters, esterases.
Our gene list is quite long, but if we start looking around the peak at 28,550,000bp we can see a carboxylesterase (COEAE2F) which might be IR linked, so looks like a good candidate to start investigating.
Search the literature for evidence of IR links to these genes.
We could just start searching on Google Scholar (other academic search services are available). If we do that for COEAE2F, we get a few hits straight away from the Anopheles gambiae literature including one which found that more of this carboxylesterase was made by mosquitoes exposed to insecticides.
However, a lot of insecticide resistance research happens in other insect systems where the orthologs of this gene might not be called COEAE2F.
Another approach is to first go to Vectorbase and look for orthologs of the gene, then we can also search for those on Google Scholar. If we do that for COEAE2F we find lots of the orthologs in other insects are called esterase B1. If we do a literature search with that term, we find a wealth of papers, with evidence that the same gene causes organophosphate resistance in Culex mosquitoes and has been studied in that system extensively for over 30 years!
Success!#
In our case study, we found evidence for a selective sweep at a novel locus on 2L that we don’t currently monitor in any Anopheles gambiae IR surveillance. We interrogated the genes beneath the peak of the sweep signal and found a gene belonging to a family implicated in insecticide resistance. We then searched Vectorbase to find out the different names this gene might have in other insect systems (orthologs). Using one of these orthologs we found research validating this genes involvement in organophosphate resistance in Culex, a strong suggestion that this selective sweep could be being driven by a mutation affecting this phenotype in Anopheles. Understanding and tracking organophosphate resistance is particularly important as pirimiphos-methyl (Actellic) is increasing in use both in vector control and agriculture.
What next?#
Once a new candidate resistance locus has been found, there are a number of avenues for future work. Though these are beyond the scope of this workshop but we should briefly consider them.
Identify candidate genetic variation driving the selective sweep at this locus. In workshop 1 we learnt how to analyse SNP frequencies in genes and in workshop 2 we learnt how to analyse gene copy number frequencies. By comparing cohorts where we see this selective sweep, with cohorts where we don’t, the causative variation may become clear.
Add COEAE2F to future IR surveillance. Though the involvement of this gene in Anopheles has not yet been validated, given what we know about Culex it would make sense to track SNP variation, gene copy number and evidence of selective sweeps in future mosquito collections.
Validate phenotypic effects of this gene. Populations of mosquitoes (natural or lab colony) that are susceptible and resistant to carbamates could be identified, then investigated for differences using molecular techniques. Often this approach will involved comparing the presence or absence of certain molecular motifs using polymerase chain reaction (PCR). In module 4 of this workshop, we will explore a tool which helps create the DNA primers and probes required for this type of analysis.
Well done!#
In this module we have learnt how to:
Calibrate H12 analysis window size.
Run, plot and interpret a genome scan for recent selection using the H12 statistic.
Identify candidate genes that are under selection.
Exercises#
English#
Open this notebook in Google Colab and run it for yourself from top to bottom. As you run through the notebook, cell by cell, think about what each cell is doing.
Have go at the practical exercise, but please don’t worry if you don’t have time to complete it during the practical session, and please ask the teaching assistants for help if you are stuck.
Hint: To open the notebook in Google Colab, click the rocket icon at the top of the page, then select “Colab” from the drop-down menu.
For the practical exercise, I’d like you to run your own case study from start to end, using all the methodology we have explored today.
Calibrate a window size for a Burkina Faso coluzzii cohort, contig 2R hint: sample_query=”cohort_admin2_year == ‘BF-09_Houet_colu_2014’”.
Run a H12 GWSS on contig 2R, for this cohort, using your calibrated window size.
Is there evidence of a selective sweep around 2R:40,950,000? If so, would this signal be considered a good hit or not, why?
List the genes under this peak. Hint: 2R:40,900,000-41,000,000. Do any of these genes immediately jump out as IR candidates?
Run a Google Scholar literature search on these genes. Remember to start at the centre of the peak and work out hint: try searching using the Drosophila ortholog name of the gene (from Vectorbase) with the word “Anopheles”.
If you find IR related hits to your gene literature search, list the top hit (highest up the page) alongside the gene name.
Français#
Ouvrir ce notebook dans Google Colab et l’exécuter vous-même du début à la fin. Pendant que vous exécutez le notebook, cellule par cellule, pensez à ce que chaque cellule fait et essayez de faire les exercices quand vous les rencontrez.
Essayez de faire les exercices mais ne vous inquiétez pas si vous n’avez pas le temps de tout faire pendant la séance appliquée et n’hésitez pas à demander aux enseignants assistants si vous avez besoin d’aide parce que vous êtes bloqués.
Indice: Pour ouvrir le notebook dans Google Colab, cliquer sur l’icône de fusée au sommet de cette page puis choisissez “Colab” dans le menu déroulant.
Pour l’exercice pratique, je vais vous demander de réaliser l’étude d’un cas du début à la fin en utilisant la méthodologie que nous avons explorée aujourd’hui.
Calibrer la taille des fenêtres pour une cohorte de coluzzii venant du Burkina Faso en utilisant le bras de chromosome 2R. Indice: sample_query=”cohort_admin2_year == ‘BF-09_Houet_colu_2014’”.
Exécuter un scan du génome en utilisant H12 sur le bras de chromosome 2R pour cette cohorte. Utiliser la taille des fenêtres que vous venez d’obtenir par calibration.
Y a-t-il des indications d’un balayage sélectif autour de 2R:40,950,000? Si oui, est-ce que ce signal peut être considéré comme de bonne qualité? Pourquoi?
Lister les gènes sous ce pic. Indice: 2R:40,900,000-41,000,000. Est-ce que certains de ces gènes vous sautent aux yeux comme candidats pour la résistance aux insecticides?
Réaliser une recherche de la littérature sur Google Scholar pour ces gènes. Se souvenir de commencer au centre du pic et de s’en éloigner progressivement. Indice: essayer la recherche en utilisant le nom de l’orthologue du gène pour Drosophile (en utilisant Vectorbase)) avec le mot “Anopheles”.
Si vous trouvez des articles connectés à la résistance aux insecticides lors de votre recherche de la littérature, lister les premiers résultats (en haut de la page) ainsi que le nom du gène.