banner

Workshop 6 - Training course in data analysis for genomic surveillance of African malaria vectors


Module 3 - Detecting new forms of insecticide resistance using selection scans#

Theme: Analysis

In this module we’re going learn how to scan the genome for signals of recent selection with the H12 statistic, using haplotype variation data from Ag3.0.

We will use functions in the malariagen_data Python package to calibrate, run and plot H12 genome-wide selection scans. Will then learn how interpret and investigate the results to detect candidate genes potentially involved in new forms of insecticide resistance.

Learning objectives#

At the end of this module you will be able to:

  • Calibrate the window size parameter for an H12 analysis.

  • Run, plot and interpret a genome scan for recent selection using the H12 statistic.

  • Identify candidate genes that are under selection.

Lecture#

English#

Français#

Please note that the code in the cells below might differ from that shown in the video. This can happen because Python packages and their dependencies change due to updates, necessitating tweaks to the code.

Setup#

First, let’s begin by installing and importing some Python packages, and configuring access to Anopheles genomic data from the MalariaGEN Ag3.0 data resource.

!pip install -q --no-warn-conflicts malariagen_data
import malariagen_data

Saving H12 results#

H12 genome wide scans for selection may take a while to complete, particularly if you’re running this code on a service with modest computational resources such as Google Colab.

To avoid having to rerun these analyses, we’ll save the results so we can come back to them later. In Google Colab, you can save results to your Google Drive, which will mean you don’t lose results even if you leave the notebook and come back several days later.

When mounting your Google Drive you will need to follow the authorization instructions.

try:
    from google.colab import drive
    drive.mount("drive")
except ImportError:
    pass

With our Google Drive now mounted, we can define and make a directory where we want to save our H12 results.

results_dir = "drive/MyDrive/malariagen_data_cache"

In Google Colab, we can actually see our mounted drive and H12 results directory by clicking on the file tab on the left hand side of the screen.

Next we should setup the malariagen_data package. As we want to save our H12 results in the Google Drive folder we just set up, we’ll use the results_cache parameter and assign our results directory to it. If we were running this notebook locally, then we could assign a local folder to this parameter and the H12 results would instead get stored on our hard drive.

ag3 = malariagen_data.Ag3(results_cache=results_dir)
ag3
MalariaGEN Ag3 API client
Please note that data are subject to terms of use, for more information see the MalariaGEN website or contact data@malariagen.net. See also the Ag3 API docs.
Storage URL gs://vo_agam_release/
Data releases available 3.0
Results cache /home/cclarkson/Git/anopheles-genomic-surveillance.github.io/docs/workshop-6/drive/MyDrive/malariagen_data_cache
Cohorts analysis 20230516
AIM analysis 20220528
Site filters analysis dt_20200416
Software version malariagen_data 7.12.0
Client location unknown

Why do we need to scan the genome for signals of recent selection?#

Mosquitoes are exposed to high levels of insecticides both directly via malaria vector control campaigns and more widely through agricultural use of these chemicals. The insecticides being used are also changing, as new compounds are developed. The high strength of selective pressure exerted by these insecticides is evidenced by the speed at which insecticide resistance evolves in mosquitoes.

Historically, molecular techniques were used to discover genes and SNPs associated with insecticide resistance, and produce the assays necessary to track them in natural populations. These processes are skill and time intensive, often taking years to discover new loci.

Clearly these kind of time scales are not operationally feasible when it comes to public health. The purchase and use of insecticides needs to be informed by recent resistance data, to ensure that the vector control chemicals can be changed before high levels of resistance cause them to become less effective or even fail completely.

Known unknowns#

In this series of workshops, we have already seen how to use genomic data to investigate regions of the genome that we know are associated with insecticide resistance. For example, in workshop 1 we looked at SNP variation in VGSC and in workshop 2 we looked at CNV variation in the Cyp6aap region. Genomic data are great for these kind of analyses because they allow large numbers of samples to be investigated quickly, easily, and in parallel, effectively running traditional molecular assays but in silico.

Unknown unknowns#

The speed at which insecticide resistance evolves coupled with the use of new insecticidal compounds makes it risky to only investigate genes known to be associated with resistance, as it is likely that new forms of resistance will go undetected. Fortunately, whole genome data allows us to naively scan the genome for signals of adaptation known as selective sweeps. By using these genome wide scans for selection (GWSS), novel loci under selection in mosquito populations can be quickly identified for further investigation as candidate insecticide resistance loci.

Running a GWSS using the H12 statistic#

What is H12 analysis?#

H12 measures haplotype homozygosity, in effect the similarity of haplotypes, and will run it on the phased Ag3.0 haplotype data we learnt about in module 1. This statistic is sensitive to recent selection, making it ideal for detecting selective sweeps driven by recent insecticidal pressures and, unlike some other selection detecting statistics is excellent at detecting soft sweeps.

Different kinds of selective sweeps#

Selective sweeps can be divided into hard or soft sweeps. When positive selection acts on a single adaptive mutation, causing it to increase in frequency, it is referred to a hard sweep. Where multiple adaptive mutations occur at the same locus and increase in frequency due to positive selection, it is known as a soft sweep.

However, with H12, we aren’t focusing on the individual mutations, e.g. SNPs, but rather haplotypes. We would expect the haplotype around a locus under selection to increase in frequency with it. So, in the case of hard sweep a single haplotype should be detected at high frequency and in a soft sweep, multiple haplotypes should be present at high frequencies. The figure below demonstrates this concept with mutation frequency on the left and haplotype frequency at the end time point on the right.

hard and soft sweeps

How H12 works#

H12 is based on measure of population diversity known as haplotype homozygosity or H1. In their 2015 paper, Garud and colleagues demonstrated that although H1 was a good statistic to detect hard sweeps, with a single haplotype at high frequency, it was less successful at finding soft sweeps.

In order to detect both hard and soft sweeps, Garud et al. modified the H1 statistic so that the first and second most common haplotypes frequencies are combined (hence H12).

For a deeper dive in the statistic and it’s underlying mathematics, I would highly recommend the Garud lab website or for the brave, the paper.

Garud paper

H12 output#

It is possible to run “peak” or “outlier” detection on the H12 GWSS results, but for this module we will interpret the results by visualising a plot of the data (we will explore how to make these plots in detail later).

Just to get a sense of what the H12 output looks like, let’s run an H12 analysis over chromosome arm 2R, using An. gambiae mosquitoes from a cohort collected in Burkina Faso in 2014.

ag3.plot_h12_gwss(
    contig="2R", 
    analysis="gamb_colu", 
    window_size=1000, 
    sample_sets="3.0",
    sample_query="cohort_admin2_year == 'BF-09_Houet_gamb_2014'"
)

In the example above, we can see from the y-axis that the H12 statistic ranges from 0 to 1. Each point on the plot is the H12 statistic computed in a window, in this case window_size=1000 SNPs, across the 2R chromosome arm (x-axis). Selective sweeps are suggested by peaks in the data. Single high value windows are generally ignored as noise, because the imprint of selection on the genome should leave a peak centered near to the locus under selection, with shoulders of the peak dropping away either side.

If we zoom in to ~28,500,000bp, and roll the mouse over the gene track, we can see that H12 has detected the expected selective sweep over the Cyp6aa/p gene cluster, containing genes known to be involved in insecticide resistance. The sweep is composed of many 1000 SNP windows, providing strong evidence a true sweep has occurred here.

H12 analytical workflow#

Running an H12 analysis can be broken down into the following steps:

  1. Select cohorts for analysis

  2. Calibrate window_size parameter for each cohort

  3. Run and plot H12 scans

  4. Peak identification

  5. Investigate signals to identify candidate resistance genes

Let’s explore each of these in detail.

Step 1. Select cohorts for analysis#

Recap on cohorts#

First we need to define a cohort of individuals to run the H12 analysis on. The MalariaGEN Ag3.0 data resource contain mosquito samples collected across large spatial and temporal scales, and from different mosquito species. When we want to run population genetic analyses on datasets like these, the data must be divided into biologically relevant cohorts, where a cohort is simply a group of samples we want to analyse together.

Cohort size#

When we are choosing the cohort for a H12 analyses we need to consider how many samples it contains, too few samples and it could become difficult to identify selective sweep signals over background noise, too many and computational time is wasted needlessly. We have found that a reasonable cohort size is 30 samples, hence this is the default cohort_size parameter in all the H12 GWSS functions that we will use. If you try to analyse a cohort with less samples than this parameter, it will error. If your cohort has more samples, they will be randomly downsampled to cohort size.

For this example, let’s focus on Burkina Faso, and first examine what cohorts are available.

burkina_samples_df = ag3.sample_metadata(sample_query="country == 'Burkina Faso'")
burkina_samples_df
sample_id partner_sample_id contributor country location year month latitude longitude sex_call ... admin1_name admin1_iso admin2_name taxon cohort_admin1_year cohort_admin1_month cohort_admin1_quarter cohort_admin2_year cohort_admin2_month cohort_admin2_quarter
0 AB0085-Cx BF2-4 Austin Burt Burkina Faso Pala 2012 7 11.151 -4.235 F ... Hauts-Bassins BF-09 Houet gambiae BF-09_gamb_2012 BF-09_gamb_2012_07 BF-09_gamb_2012_Q3 BF-09_Houet_gamb_2012 BF-09_Houet_gamb_2012_07 BF-09_Houet_gamb_2012_Q3
1 AB0086-Cx BF2-6 Austin Burt Burkina Faso Pala 2012 7 11.151 -4.235 F ... Hauts-Bassins BF-09 Houet gambiae BF-09_gamb_2012 BF-09_gamb_2012_07 BF-09_gamb_2012_Q3 BF-09_Houet_gamb_2012 BF-09_Houet_gamb_2012_07 BF-09_Houet_gamb_2012_Q3
2 AB0087-C BF3-3 Austin Burt Burkina Faso Bana Village 2012 7 11.233 -4.472 F ... Hauts-Bassins BF-09 Houet coluzzii BF-09_colu_2012 BF-09_colu_2012_07 BF-09_colu_2012_Q3 BF-09_Houet_colu_2012 BF-09_Houet_colu_2012_07 BF-09_Houet_colu_2012_Q3
3 AB0088-C BF3-5 Austin Burt Burkina Faso Bana Village 2012 7 11.233 -4.472 F ... Hauts-Bassins BF-09 Houet coluzzii BF-09_colu_2012 BF-09_colu_2012_07 BF-09_colu_2012_Q3 BF-09_Houet_colu_2012 BF-09_Houet_colu_2012_07 BF-09_Houet_colu_2012_Q3
4 AB0089-Cx BF3-8 Austin Burt Burkina Faso Bana Village 2012 7 11.233 -4.472 F ... Hauts-Bassins BF-09 Houet coluzzii BF-09_colu_2012 BF-09_colu_2012_07 BF-09_colu_2012_Q3 BF-09_Houet_colu_2012 BF-09_Houet_colu_2012_07 BF-09_Houet_colu_2012_Q3
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
291 AB0314-C 6775 Nora Besansky Burkina Faso Monomtenga 2004 8 12.060 -1.170 F ... Centre-Sud BF-07 Bazega gambiae BF-07_gamb_2004 BF-07_gamb_2004_08 BF-07_gamb_2004_Q3 BF-07_Bazega_gamb_2004 BF-07_Bazega_gamb_2004_08 BF-07_Bazega_gamb_2004_Q3
292 AB0315-C 6777 Nora Besansky Burkina Faso Monomtenga 2004 8 12.060 -1.170 F ... Centre-Sud BF-07 Bazega gambiae BF-07_gamb_2004 BF-07_gamb_2004_08 BF-07_gamb_2004_Q3 BF-07_Bazega_gamb_2004 BF-07_Bazega_gamb_2004_08 BF-07_Bazega_gamb_2004_Q3
293 AB0316-C 6779 Nora Besansky Burkina Faso Monomtenga 2004 8 12.060 -1.170 F ... Centre-Sud BF-07 Bazega gambiae BF-07_gamb_2004 BF-07_gamb_2004_08 BF-07_gamb_2004_Q3 BF-07_Bazega_gamb_2004 BF-07_Bazega_gamb_2004_08 BF-07_Bazega_gamb_2004_Q3
294 AB0318-C 5072 Nora Besansky Burkina Faso Monomtenga 2004 7 12.060 -1.170 F ... Centre-Sud BF-07 Bazega gambiae BF-07_gamb_2004 BF-07_gamb_2004_07 BF-07_gamb_2004_Q3 BF-07_Bazega_gamb_2004 BF-07_Bazega_gamb_2004_07 BF-07_Bazega_gamb_2004_Q3
295 AB0325-C 1403 Nora Besansky Burkina Faso Monomtenga 2004 6 12.060 -1.170 F ... Centre-Sud BF-07 Bazega gambiae BF-07_gamb_2004 BF-07_gamb_2004_06 BF-07_gamb_2004_Q2 BF-07_Bazega_gamb_2004 BF-07_Bazega_gamb_2004_06 BF-07_Bazega_gamb_2004_Q2

296 rows × 30 columns

burkina_samples_df.groupby("cohort_admin2_year").size()
cohort_admin2_year
BF-07_Bazega_gamb_2004    13
BF-09_Houet_arab_2014      3
BF-09_Houet_colu_2012     82
BF-09_Houet_colu_2014     53
BF-09_Houet_gamb_2012     99
BF-09_Houet_gamb_2014     46
dtype: int64

In the Burkina Faso data from Ag3.0, there are six cohorts, four of which have more than 30 samples, so lets pick one of these to run a H12 GWSS on.

Step 2. Calibrate window size#

Recap: Windowed analyses#

The Anopheles gambiae genome is relatively large, e.g. there are millions of SNPs on the 3R chromosome arm alone. We are going to use a windowed approach for H12, where we define a window size in number of SNPs, then move this window along the chromosome arm haplotypes, calculating our statistic across each window. This gives us a summary of data that is easier to interpret. This is similar to the approach we used to analyse heterozygosity in workshop 5, however, in the case of heterozygosity, our window was defined by fixed genomic length rather than by number of SNPs.

Choosing the window size for H12 analyses#

Once we have decided on our cohorts, we need to calibrate the window size, an important parameter for the H12 analysis. Every cohort needs its own window calibration because their demographic history will be unique. In workshop 5 we learnt how this demographic history can affect genome-wide background genetic diversity.

Let’s have a look at what happens if the wrong window size is used, then run through our methodology for picking the best size.

Here’s the example we saw earlier, using a 1,000 SNP window on the 2R chromosome arm.

ag3.plot_h12_gwss(
    contig="2R", 
    analysis="gamb_colu", 
    window_size=1000, 
    sample_sets="3.0", 
    sample_query="cohort_admin2_year == 'BF-09_Houet_gamb_2014'",
    title="BF-09_Houet_gamb_2014; window_size=1000"
)

As we saw earlier, by zooming into an established pyrethroid metabolic insecticide resistance locus, known to be under selection, we are able to see a clear peak in the H12 statistic if we have a good window size. Let’s zoom in to 28,500,000bp (Cyp6aa/p gene cluster) and remind ourselves.

However, if our window size is too small, then our signal may not be visible over background noise. To illustrate what happens, let’s run the same scan but with a much smaller window size of 100 bp.

ag3.plot_h12_gwss(
    contig="2R", 
    analysis="gamb_colu", 
    window_size=100, 
    sample_sets="3.0", 
    sample_query="cohort_admin2_year == 'BF-09_Houet_gamb_2014'",
    title="BF-09_Houet_gamb_2014; window_size=100 (too small!)"
)

This scan is far too noisy and it is impossible to detect any signals.

Conversely, if our window size is too big, then a true signal of selective sweep may not cover enough of the genome to generate a peak in the H12 data. To illustrate what happens, let’s run the same scan with a larger window size of 20,000 bp.

ag3.plot_h12_gwss(
    contig="2R", 
    analysis="gamb_colu", 
    window_size=20_000, 
    sample_sets="3.0", 
    sample_query="cohort_admin2_year == 'BF-09_Houet_gamb_2014'",
    title="BF-09_Houet_gamb_2014; window_size=20,000 (too big!)"
)

With a window size this large, all but the Cyp6aa/p signal have been lost.

Calibrate window size#

We can use the ag3.plot_h12_calibration() function to come to a decision about which window size will likely be most effective without having to visualise multiple H12 GWSS plots. Let’s have a look at the function.

ag3.plot_h12_calibration?
Signature:
ag3.plot_h12_calibration(
    contig: str,
    analysis: str = 'default',
    sample_query: Optional[str] = None,
    sample_sets: Union[Sequence[str], str, NoneType] = None,
    cohort_size: Optional[int] = None,
    min_cohort_size: Optional[int] = 15,
    max_cohort_size: Optional[int] = 50,
    window_sizes: Tuple[int, ...] = (100, 200, 500, 1000, 2000, 5000, 10000, 20000),
    random_seed: int = 42,
    title: Optional[str] = None,
    show: bool = True,
) -> Optional[bokeh.model.model.Model]
Docstring:
Plot h12 GWSS calibration data for different window sizes.

Parameters
----------
contig : str
    Reference genome contig name. See the `contigs` property for valid
    contig names.
analysis : str, optional, default: 'default'
    Which haplotype phasing analysis to use. See the
    `phasing_analysis_ids` property for available values.
sample_query : str or None, optional
    A pandas query string to be evaluated against the sample metadata, to
    select samples to be included in the returned data.
sample_sets : sequence of str or str or None, optional
    List of sample sets and/or releases. Can also be a single sample set
    or release.
cohort_size : int or None, optional
    Randomly down-sample to this value if the number of samples in the
    cohort is greater. Raise an error if the number of samples is less
    than this value.
min_cohort_size : int or None, optional, default: 15
    Minimum cohort size. Raise an error if the number of samples is less
    than this value.
max_cohort_size : int or None, optional, default: 50
    Randomly down-sample to this value if the number of samples in the
    cohort is greater.
window_sizes : tuple of int, optional, default: (100, 200, 500, 1000, 2000, 5000, 10000, 20000)
    The sizes of windows (number of SNPs) used to calculate statistics
    within.
random_seed : int, optional, default: 42
    Random seed used for reproducible down-sampling.
title : str or None, optional
    Plot title.
show : bool, optional, default: True
    If True, show the plot.

Returns
-------
Model or None
    A bokeh figure (only returned if show=False).
File:      /home/conda/developer/7ceef7902fe4be9ce8d86da5906b7a73b21b380a7f60111a370bab4e894be850-20230706-172507-025772-51-training-nb-maintenance-mgen-7.12.0/lib/python3.10/site-packages/malariagen_data/anopheles.py
Type:      method

The required parameters are:

  • contig - the chromosome to run the calibration scans on.

  • analysis - which phasing analysis to use, depends on what taxa are in your cohort.

  • sample_query

  • sample_sets

Let’s run the calibration with the Burkina cohort we looked at previously. It doesn’t matter too much which contig to use, but we often use 3L because it isn’t affected by chromosomal inversions.

ag3.plot_h12_calibration(
    contig="3L", 
    analysis="gamb_colu",
    sample_sets="3.0",
    sample_query="cohort_admin2_year == 'BF-09_Houet_gamb_2014'",
)

The plot shows the distribution of H12 values across a range of window sizes from 100 SNPs to 20,000 SNPs.

Our rule of thumb is to choose a window size where the where the 95% percentile of H12 values is at or below 0.1.

For this cohort, a reasonable window size to choose is 1,000 SNPs, and as we saw earlier, that produced clean signals of selective sweeps at known insecticide resistance loci in this cohort.

Remember, the window size parameter needs to be calibrated for each cohort you want to analyse.

Step 3. Run GWSS with H12#

Now we know what a reasonable window size is for our GWSS, we can run and plot the H12 analysis.

Let’s have a look at the function we will use.

ag3.plot_h12_gwss?
Signature:
ag3.plot_h12_gwss(
    contig: str,
    window_size: int,
    analysis: str = 'default',
    sample_sets: Union[Sequence[str], str, NoneType] = None,
    sample_query: Optional[str] = None,
    cohort_size: Optional[int] = None,
    min_cohort_size: Optional[int] = 15,
    max_cohort_size: Optional[int] = 50,
    random_seed: int = 42,
    title: Union[str, bool, NoneType] = None,
    sizing_mode: Literal['fixed', 'stretch_width', 'stretch_height', 'stretch_both', 'scale_width', 'scale_height', 'scale_both'] = 'stretch_width',
    width: Optional[int] = None,
    track_height: int = 170,
    genes_height: int = 90,
    show: bool = True,
    output_backend: Literal['canvas', 'webgl', 'svg'] = 'webgl',
) -> Optional[bokeh.model.model.Model]
Docstring:
Plot h12 GWSS data.

Parameters
----------
contig : str
    Reference genome contig name. See the `contigs` property for valid
    contig names.
window_size : int
    The size of windows (number of SNPs) used to calculate statistics
    within.
analysis : str, optional, default: 'default'
    Which haplotype phasing analysis to use. See the
    `phasing_analysis_ids` property for available values.
sample_sets : sequence of str or str or None, optional
    List of sample sets and/or releases. Can also be a single sample set
    or release.
sample_query : str or None, optional
    A pandas query string to be evaluated against the sample metadata, to
    select samples to be included in the returned data.
cohort_size : int or None, optional
    Randomly down-sample to this value if the number of samples in the
    cohort is greater. Raise an error if the number of samples is less
    than this value.
min_cohort_size : int or None, optional, default: 15
    Minimum cohort size. Raise an error if the number of samples is less
    than this value.
max_cohort_size : int or None, optional, default: 50
    Randomly down-sample to this value if the number of samples in the
    cohort is greater.
random_seed : int, optional, default: 42
    Random seed used for reproducible down-sampling.
title : str or bool or None, optional
    Plot title. If True, a title may be automatically generated.
sizing_mode : {'fixed', 'stretch_width', 'stretch_height', 'stretch_both', 'scale_width', 'scale_height', 'scale_both'}, optional, default: 'stretch_width'
    Bokeh plot sizing mode, see also https://docs.bokeh.org/en/latest/docs
    /user_guide/basic/layouts.html#sizing-modes.
width : int or None, optional
    Plot width in pixels (px).
track_height : int, optional, default: 170
    Main track height in pixels (px).
genes_height : int, optional, default: 90
    Genes track height in pixels (px).
show : bool, optional, default: True
    If true, show the plot. If False, do not show the plot, but return the
    figure.
output_backend : {'canvas', 'webgl', 'svg'}, optional, default: 'webgl'
    Specify an output backend to render a plot area onto. See also
    https://docs.bokeh.org/en/latest/docs/user_guide/output/webgl.html.

Returns
-------
Model or None
    A bokeh figure (only returned if show=False).
File:      /home/conda/developer/7ceef7902fe4be9ce8d86da5906b7a73b21b380a7f60111a370bab4e894be850-20230706-172507-025772-51-training-nb-maintenance-mgen-7.12.0/lib/python3.10/site-packages/malariagen_data/anopheles.py
Type:      method

The required parameters are almost identical to the calibration function, except now we need to specify the window_size we’d like to use for the analysis.

Let’s run a H12 GWSS analysis, end to end, all five chromosome arms, for our chosen cohort of interest.

for contig in ag3.contigs:
  ag3.plot_h12_gwss(
      contig=contig, 
      analysis="gamb_colu",
      sample_sets="3.0",
      sample_query="cohort_admin2_year == 'BF-09_Houet_gamb_2014'",
      window_size=1000
  )

Step 4. Peak identification#

Reading an H12 plot - recap:

  • The H12 statistic ranges from 0-1

  • Each point on the plot is H12 caculated over a window, measured in SNPs

  • True signals will generally have a max H12 value that is an outlier in the genome-wide distribution

  • True signals will have a peak shape with roughly exponential decay on both flanks

In the Burkina Faso cohort GWSS plots above, we can spot several signals that have both outlier maximums and fit our peak shape criteria.

Let’s pick the signal on chromosome arm 2L at ~28,550,000 bp and look for candidate insecticide resistance genes.

Step 5. Identify candidate genes#

Case study - Burkina Faso, 2L, peak centered ~28,550,000#

So we’ve done the “easy” bit, run our H12 GWSS and found some promising looking signals of selective sweeps. Now we need to identify candidate genes for further investigation.

What follows is a suggested approach to identify candidate genes potentially involved IR.

  1. Find the approximate centre of your peak of interest by zooming in on the plot above - this is ~28,550,000bp along 2L.

  2. Look at the genes around the centre of the peak. The gene under under selection may not be exactly under the peak, but weight priority by closeness to centre.

  3. Build a list of genes for further investigation.

  • As the signal is quite wide, I am starting by looking 25kb either side of the peak.

  • We can also use the ag3.genome_features() function and a Pandas query to pull out a list of genes given a region.

peak_genes_df = (
    ag3.genome_features(region="2L:28,525,000-28,575,000")
    .query('type == "gene"')
    [["contig", "start", "end", "ID", "Name", "description"]]
    .set_index("ID")
)
peak_genes_df
contig start end Name description
ID
AGAP006222 2L 28524225 28526317 NaN glucosyl/glucuronosyl transferases [Source:VB ...
AGAP006223 2L 28526558 28528641 NaN glucosyl/glucuronosyl transferases [Source:VB ...
AGAP006224 2L 28528758 28533199 NaN aldehyde oxidase [Source:VB Community Annotation]
AGAP006225 2L 28534732 28539416 NaN aldehyde oxidase [Source:VB Community Annotation]
AGAP006226 2L 28540651 28545294 Aldehyde_oxidase NaN
AGAP006227 2L 28545396 28547938 NaN alpha esterase [Source:VB Community Annotation]
AGAP006228 2L 28548433 28550748 COEAE2F carboxylesterase [Source:VB Community Annotation]
AGAP006229 2L 28550814 28552032 Vps20 vacuolar protein sorting 20 [Source:VB Communi...
AGAP006231 2L 28552352 28560186 NaN serine/threonine-protein phosphatase dullard h...
AGAP006232 2L 28563646 28565368 Pex14 peroxin-14 [Source:VB Community Annotation]
AGAP006233 2L 28565893 28567186 NaN NaN
AGAP006234 2L 28567535 28569087 NaN protein SHQ1 [Source:VB Community Annotation]
AGAP006235 2L 28569164 28572971 NaN NaN
AGAP006236 2L 28573531 28574496 NaN NaN
AGAP006237 2L 28574680 28575725 NaN Negative elongation factor E [Source:VB Commun...
  1. Look for any genes within gene families that have association with insecticide metabolism or resistance.

  • E.g. Cytochrome P450s (Cyps), glutathione S -transferases (GSTs), ion channels (e.g. VGSC), ABC transporters, esterases.

  • Our gene list is quite long, but if we start looking around the peak at 28,550,000bp we can see a carboxylesterase (COEAE2F) which might be IR linked, so looks like a good candidate to start investigating.

  1. Search the literature for evidence of IR links to these genes.

  • We could just start searching on Google Scholar (other academic search services are available). If we do that for COEAE2F, we get a few hits straight away from the Anopheles gambiae literature including one which found that more of this carboxylesterase was made by mosquitoes exposed to insecticides.

  • However, a lot of insecticide resistance research happens in other insect systems where the orthologs of this gene might not be called COEAE2F.

  • Another approach is to first go to Vectorbase and look for orthologs of the gene, then we can also search for those on Google Scholar. If we do that for COEAE2F we find lots of the orthologs in other insects are called esterase B1. If we do a literature search with that term, we find a wealth of papers, with evidence that the same gene causes organophosphate resistance in Culex mosquitoes and has been studied in that system extensively for over 30 years!

Success!#

In our case study, we found evidence for a selective sweep at a novel locus on 2L that we don’t currently monitor in any Anopheles gambiae IR surveillance. We interrogated the genes beneath the peak of the sweep signal and found a gene belonging to a family implicated in insecticide resistance. We then searched Vectorbase to find out the different names this gene might have in other insect systems (orthologs). Using one of these orthologs we found research validating this genes involvement in organophosphate resistance in Culex, a strong suggestion that this selective sweep could be being driven by a mutation affecting this phenotype in Anopheles. Understanding and tracking organophosphate resistance is particularly important as pirimiphos-methyl (Actellic) is increasing in use both in vector control and agriculture.

What next?#

Once a new candidate resistance locus has been found, there are a number of avenues for future work. Though these are beyond the scope of this workshop but we should briefly consider them.

  • Identify candidate genetic variation driving the selective sweep at this locus. In workshop 1 we learnt how to analyse SNP frequencies in genes and in workshop 2 we learnt how to analyse gene copy number frequencies. By comparing cohorts where we see this selective sweep, with cohorts where we don’t, the causative variation may become clear.

  • Add COEAE2F to future IR surveillance. Though the involvement of this gene in Anopheles has not yet been validated, given what we know about Culex it would make sense to track SNP variation, gene copy number and evidence of selective sweeps in future mosquito collections.

  • Validate phenotypic effects of this gene. Populations of mosquitoes (natural or lab colony) that are susceptible and resistant to carbamates could be identified, then investigated for differences using molecular techniques. Often this approach will involved comparing the presence or absence of certain molecular motifs using polymerase chain reaction (PCR). In module 4 of this workshop, we will explore a tool which helps create the DNA primers and probes required for this type of analysis.

Well done!#

In this module we have learnt how to:

  • Calibrate H12 analysis window size.

  • Run, plot and interpret a genome scan for recent selection using the H12 statistic.

  • Identify candidate genes that are under selection.

Exercises#

English#

Open this notebook in Google Colab and run it for yourself from top to bottom. As you run through the notebook, cell by cell, think about what each cell is doing.

Have go at the practical exercise, but please don’t worry if you don’t have time to complete it during the practical session, and please ask the teaching assistants for help if you are stuck.

Hint: To open the notebook in Google Colab, click the rocket icon at the top of the page, then select “Colab” from the drop-down menu.

For the practical exercise, I’d like you to run your own case study from start to end, using all the methodology we have explored today.

  1. Calibrate a window size for a Burkina Faso coluzzii cohort, contig 2R hint: sample_query=”cohort_admin2_year == ‘BF-09_Houet_colu_2014’”.

  2. Run a H12 GWSS on contig 2R, for this cohort, using your calibrated window size.

  3. Is there evidence of a selective sweep around 2R:40,950,000? If so, would this signal be considered a good hit or not, why?

  4. List the genes under this peak. Hint: 2R:40,900,000-41,000,000. Do any of these genes immediately jump out as IR candidates?

  5. Run a Google Scholar literature search on these genes. Remember to start at the centre of the peak and work out hint: try searching using the Drosophila ortholog name of the gene (from Vectorbase) with the word “Anopheles”.

  6. If you find IR related hits to your gene literature search, list the top hit (highest up the page) alongside the gene name.

Français#

Ouvrir ce notebook dans Google Colab et l’exécuter vous-même du début à la fin. Pendant que vous exécutez le notebook, cellule par cellule, pensez à ce que chaque cellule fait et essayez de faire les exercices quand vous les rencontrez.

Essayez de faire les exercices mais ne vous inquiétez pas si vous n’avez pas le temps de tout faire pendant la séance appliquée et n’hésitez pas à demander aux enseignants assistants si vous avez besoin d’aide parce que vous êtes bloqués.

Indice: Pour ouvrir le notebook dans Google Colab, cliquer sur l’icône de fusée au sommet de cette page puis choisissez “Colab” dans le menu déroulant.

Pour l’exercice pratique, je vais vous demander de réaliser l’étude d’un cas du début à la fin en utilisant la méthodologie que nous avons explorée aujourd’hui.

  1. Calibrer la taille des fenêtres pour une cohorte de coluzzii venant du Burkina Faso en utilisant le bras de chromosome 2R. Indice: sample_query=”cohort_admin2_year == ‘BF-09_Houet_colu_2014’”.

  2. Exécuter un scan du génome en utilisant H12 sur le bras de chromosome 2R pour cette cohorte. Utiliser la taille des fenêtres que vous venez d’obtenir par calibration.

  3. Y a-t-il des indications d’un balayage sélectif autour de 2R:40,950,000? Si oui, est-ce que ce signal peut être considéré comme de bonne qualité? Pourquoi?

  4. Lister les gènes sous ce pic. Indice: 2R:40,900,000-41,000,000. Est-ce que certains de ces gènes vous sautent aux yeux comme candidats pour la résistance aux insecticides?

  5. Réaliser une recherche de la littérature sur Google scholar pour ces gènes. Se souvenir de commencer au centre du pic et de s’en éloigner progressivement. Indice: essayer la recherche en utilisant le nom de l’orthologue du gène pour Drosophile (en utilisant “VectorBase”) avec le mot “Anopheles”.

  6. Si vous trouvez des articles connectés à la résistance aux insecticides lors de votre recherche de la littérature, lister les premiers résultats (en haut de la page) ainsi que le nom du gène.