Module 3 - Detecting new forms of insecticide resistance using selection scans

Workshop 6 - Training course in data analysis for genomic surveillance of African malaria vectors

Module 3 - Detecting new forms of insecticide resistance using selection scans#

Theme: Analysis

In this module we’re going learn how to scan the genome for signals of recent selection with the H12 statistic, using haplotype variation data from Ag3.0.

We will use functions in the malariagen_data Python package to calibrate, run and plot H12 genome-wide selection scans. Will then learn how interpret and investigate the results to detect candidate genes potentially involved in new forms of insecticide resistance.

Learning objectives#

At the end of this module you will be able to:

Calibrate the window size parameter for an H12 analysis.
Run, plot and interpret a genome scan for recent selection using the H12 statistic.
Identify candidate genes that are under selection.

Lecture#

English#

Français#

Please note that the code in the cells below might differ from that shown in the video. This can happen because Python packages and their dependencies change due to updates, necessitating tweaks to the code.

Setup#

First, let’s begin by installing and importing some Python packages, and configuring access to Anopheles genomic data from the MalariaGEN Ag3.0 data resource.

%pip install -q --no-warn-conflicts malariagen_data

import warnings
warnings.simplefilter(action='ignore', category=Warning)

import malariagen_data

Saving H12 results#

H12 genome wide scans for selection may take a while to complete, particularly if you’re running this code on a service with modest computational resources such as Google Colab.

To avoid having to rerun these analyses, we’ll save the results so we can come back to them later. In Google Colab, you can save results to your Google Drive, which will mean you don’t lose results even if you leave the notebook and come back several days later.

When mounting your Google Drive you will need to follow the authorization instructions.

try:
    from google.colab import drive
    drive.mount("drive")
except ImportError:
    pass

With our Google Drive now mounted, we can define and make a directory where we want to save our H12 results.

results_dir = "drive/MyDrive/malariagen_data_cache"

In Google Colab, we can actually see our mounted drive and H12 results directory by clicking on the file tab on the left hand side of the screen.

Next we should setup the malariagen_data package. As we want to save our H12 results in the Google Drive folder we just set up, we’ll use the results_cache parameter and assign our results directory to it. If we were running this notebook locally, then we could assign a local folder to this parameter and the H12 results would instead get stored on our hard drive.

Note that authentication is required to access data through the package, please follow the instructions here.

ag3 = malariagen_data.Ag3(results_cache=results_dir)
ag3

MalariaGEN Ag3 API client
Please note that data are subject to terms of use, for more information see the MalariaGEN website or contact support@malariagen.net. See also the Ag3 API docs.
Storage URL	gs://vo_agam_release_master_us_central1
Data releases available	3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 3.10, 3.11, 3.12, 3.13, 3.14
Results cache	/home/jonbrenas/anopheles-genomic-surveillance.github.io/docs/workshop-6/drive/MyDrive/malariagen_data_cache
Cohorts analysis	20250502
AIM analysis	20220528
Site filters analysis	dt_20200416
Software version	malariagen_data 15.2.0
Client location	Iowa, United States (Google Cloud us-central1)

Why do we need to scan the genome for signals of recent selection?#

Mosquitoes are exposed to high levels of insecticides both directly via malaria vector control campaigns and more widely through agricultural use of these chemicals. The insecticides being used are also changing, as new compounds are developed. The high strength of selective pressure exerted by these insecticides is evidenced by the speed at which insecticide resistance evolves in mosquitoes.

Historically, molecular techniques were used to discover genes and SNPs associated with insecticide resistance, and produce the assays necessary to track them in natural populations. These processes are skill and time intensive, often taking years to discover new loci.

Clearly these kind of time scales are not operationally feasible when it comes to public health. The purchase and use of insecticides needs to be informed by recent resistance data, to ensure that the vector control chemicals can be changed before high levels of resistance cause them to become less effective or even fail completely.

Known unknowns#

In this series of workshops, we have already seen how to use genomic data to investigate regions of the genome that we know are associated with insecticide resistance. For example, in workshop 1 we looked at SNP variation in VGSC and in workshop 2 we looked at CNV variation in the Cyp6aap region. Genomic data are great for these kind of analyses because they allow large numbers of samples to be investigated quickly, easily, and in parallel, effectively running traditional molecular assays but in silico.

Unknown unknowns#

The speed at which insecticide resistance evolves coupled with the use of new insecticidal compounds makes it risky to only investigate genes known to be associated with resistance, as it is likely that new forms of resistance will go undetected. Fortunately, whole genome data allows us to naively scan the genome for signals of adaptation known as selective sweeps. By using these genome wide scans for selection (GWSS), novel loci under selection in mosquito populations can be quickly identified for further investigation as candidate insecticide resistance loci.

Running a GWSS using the H12 statistic#

What is H12 analysis?#

H12 measures haplotype homozygosity, in effect the similarity of haplotypes, and will run it on the phased Ag3.0 haplotype data we learnt about in module 1. This statistic is sensitive to recent selection, making it ideal for detecting selective sweeps driven by recent insecticidal pressures and, unlike some other selection detecting statistics is excellent at detecting soft sweeps.

Different kinds of selective sweeps#

Selective sweeps can be divided into hard or soft sweeps. When positive selection acts on a single adaptive mutation, causing it to increase in frequency, it is referred to a hard sweep. Where multiple adaptive mutations occur at the same locus and increase in frequency due to positive selection, it is known as a soft sweep.

However, with H12, we aren’t focusing on the individual mutations, e.g. SNPs, but rather haplotypes. We would expect the haplotype around a locus under selection to increase in frequency with it. So, in the case of hard sweep a single haplotype should be detected at high frequency and in a soft sweep, multiple haplotypes should be present at high frequencies. The figure below demonstrates this concept with mutation frequency on the left and haplotype frequency at the end time point on the right.

How H12 works#

H12 is based on measure of population diversity known as haplotype homozygosity or H1. In their 2015 paper, Garud and colleagues demonstrated that although H1 was a good statistic to detect hard sweeps, with a single haplotype at high frequency, it was less successful at finding soft sweeps.

In order to detect both hard and soft sweeps, Garud et al. modified the H1 statistic so that the first and second most common haplotypes frequencies are combined (hence H12).

For a deeper dive in the statistic and it’s underlying mathematics, I would highly recommend the Garud lab website or for the brave, the paper.

H12 output#

It is possible to run “peak” or “outlier” detection on the H12 GWSS results, but for this module we will interpret the results by visualising a plot of the data (we will explore how to make these plots in detail later).

Just to get a sense of what the H12 output looks like, let’s run an H12 analysis over chromosome arm 2R, using An. gambiae mosquitoes from a cohort collected in Burkina Faso in 2014.

ag3.plot_h12_gwss(
    contig="2R", 
    analysis="gamb_colu", 
    window_size=1000, 
    sample_sets="3.0",
    sample_query="cohort_admin2_year == 'BF-09_Houet_gamb_2014'",
)

In the example above, we can see from the y-axis that the H12 statistic ranges from 0 to 1. Each point on the plot is the H12 statistic computed in a window, in this case window_size=1000 SNPs, across the 2R chromosome arm (x-axis). Selective sweeps are suggested by peaks in the data. Single high value windows are generally ignored as noise, because the imprint of selection on the genome should leave a peak centered near to the locus under selection, with shoulders of the peak dropping away either side.

If we zoom in to ~28,500,000bp, and roll the mouse over the gene track, we can see that H12 has detected the expected selective sweep over the Cyp6aa/p gene cluster, containing genes known to be involved in insecticide resistance. The sweep is composed of many 1000 SNP windows, providing strong evidence a true sweep has occurred here.

H12 analytical workflow#

Running an H12 analysis can be broken down into the following steps:

Select cohorts for analysis
Calibrate window_size parameter for each cohort
Run and plot H12 scans
Peak identification
Investigate signals to identify candidate resistance genes

Let’s explore each of these in detail.

Step 1. Select cohorts for analysis#

Recap on cohorts#

First we need to define a cohort of individuals to run the H12 analysis on. The MalariaGEN Ag3.0 data resource contain mosquito samples collected across large spatial and temporal scales, and from different mosquito species. When we want to run population genetic analyses on datasets like these, the data must be divided into biologically relevant cohorts, where a cohort is simply a group of samples we want to analyse together.

Cohort size#

When we are choosing the cohort for a H12 analyses we need to consider how many samples it contains, too few samples and it could become difficult to identify selective sweep signals over background noise, too many and computational time is wasted needlessly. We have found that a reasonable cohort size is 30 samples, hence this is the default cohort_size parameter in all the H12 GWSS functions that we will use. If you try to analyse a cohort with less samples than this parameter, it will error. If your cohort has more samples, they will be randomly downsampled to cohort size.

For this example, let’s focus on Burkina Faso, and first examine what cohorts are available.

burkina_samples_df = ag3.sample_metadata(sample_sets = "3.0", sample_query="country == 'Burkina Faso'")
burkina_samples_df

	sample_id	partner_sample_id	contributor	country	location	year	month	latitude	longitude	sex_call	...	admin1_name	admin1_iso	admin2_name	taxon	cohort_admin1_year	cohort_admin1_month	cohort_admin1_quarter	cohort_admin2_year	cohort_admin2_month	cohort_admin2_quarter
0	AB0085-Cx	BF2-4	Austin Burt	Burkina Faso	Pala	2012	7	11.151	-4.235	F	...	Hauts-Bassins	BF-09	Houet	gambiae	BF-09_gamb_2012	BF-09_gamb_2012_07	BF-09_gamb_2012_Q3	BF-09_Houet_gamb_2012	BF-09_Houet_gamb_2012_07	BF-09_Houet_gamb_2012_Q3
1	AB0086-Cx	BF2-6	Austin Burt	Burkina Faso	Pala	2012	7	11.151	-4.235	F	...	Hauts-Bassins	BF-09	Houet	gambiae	BF-09_gamb_2012	BF-09_gamb_2012_07	BF-09_gamb_2012_Q3	BF-09_Houet_gamb_2012	BF-09_Houet_gamb_2012_07	BF-09_Houet_gamb_2012_Q3
2	AB0087-C	BF3-3	Austin Burt	Burkina Faso	Bana Village	2012	7	11.233	-4.472	F	...	Hauts-Bassins	BF-09	Houet	coluzzii	BF-09_colu_2012	BF-09_colu_2012_07	BF-09_colu_2012_Q3	BF-09_Houet_colu_2012	BF-09_Houet_colu_2012_07	BF-09_Houet_colu_2012_Q3
3	AB0088-C	BF3-5	Austin Burt	Burkina Faso	Bana Village	2012	7	11.233	-4.472	F	...	Hauts-Bassins	BF-09	Houet	coluzzii	BF-09_colu_2012	BF-09_colu_2012_07	BF-09_colu_2012_Q3	BF-09_Houet_colu_2012	BF-09_Houet_colu_2012_07	BF-09_Houet_colu_2012_Q3
4	AB0089-Cx	BF3-8	Austin Burt	Burkina Faso	Bana Village	2012	7	11.233	-4.472	F	...	Hauts-Bassins	BF-09	Houet	coluzzii	BF-09_colu_2012	BF-09_colu_2012_07	BF-09_colu_2012_Q3	BF-09_Houet_colu_2012	BF-09_Houet_colu_2012_07	BF-09_Houet_colu_2012_Q3
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
291	AB0314-C	6775	Nora Besansky	Burkina Faso	Monomtenga	2004	8	12.060	-1.170	F	...	Centre-Sud	BF-07	Bazega	gambiae	BF-07_gamb_2004	BF-07_gamb_2004_08	BF-07_gamb_2004_Q3	BF-07_Bazega_gamb_2004	BF-07_Bazega_gamb_2004_08	BF-07_Bazega_gamb_2004_Q3
292	AB0315-C	6777	Nora Besansky	Burkina Faso	Monomtenga	2004	8	12.060	-1.170	F	...	Centre-Sud	BF-07	Bazega	gambiae	BF-07_gamb_2004	BF-07_gamb_2004_08	BF-07_gamb_2004_Q3	BF-07_Bazega_gamb_2004	BF-07_Bazega_gamb_2004_08	BF-07_Bazega_gamb_2004_Q3
293	AB0316-C	6779	Nora Besansky	Burkina Faso	Monomtenga	2004	8	12.060	-1.170	F	...	Centre-Sud	BF-07	Bazega	gambiae	BF-07_gamb_2004	BF-07_gamb_2004_08	BF-07_gamb_2004_Q3	BF-07_Bazega_gamb_2004	BF-07_Bazega_gamb_2004_08	BF-07_Bazega_gamb_2004_Q3
294	AB0318-C	5072	Nora Besansky	Burkina Faso	Monomtenga	2004	7	12.060	-1.170	F	...	Centre-Sud	BF-07	Bazega	gambiae	BF-07_gamb_2004	BF-07_gamb_2004_07	BF-07_gamb_2004_Q3	BF-07_Bazega_gamb_2004	BF-07_Bazega_gamb_2004_07	BF-07_Bazega_gamb_2004_Q3
295	AB0325-C	1403	Nora Besansky	Burkina Faso	Monomtenga	2004	6	12.060	-1.170	F	...	Centre-Sud	BF-07	Bazega	gambiae	BF-07_gamb_2004	BF-07_gamb_2004_06	BF-07_gamb_2004_Q2	BF-07_Bazega_gamb_2004	BF-07_Bazega_gamb_2004_06	BF-07_Bazega_gamb_2004_Q2

296 rows × 57 columns

burkina_samples_df.groupby("cohort_admin2_year").size()

cohort_admin2_year
BF-07_Bazega_gamb_2004    13
BF-09_Houet_arab_2014      3
BF-09_Houet_colu_2012     82
BF-09_Houet_colu_2014     53
BF-09_Houet_gamb_2012     99
BF-09_Houet_gamb_2014     46
dtype: int64

In the Burkina Faso data from Ag3.0, there are six cohorts, four of which have more than 30 samples, so lets pick one of these to run a H12 GWSS on.

Step 2. Calibrate window size#

Recap: Windowed analyses#

The Anopheles gambiae genome is relatively large, e.g. there are millions of SNPs on the 3R chromosome arm alone. We are going to use a windowed approach for H12, where we define a window size in number of SNPs, then move this window along the chromosome arm haplotypes, calculating our statistic across each window. This gives us a summary of data that is easier to interpret. This is similar to the approach we used to analyse heterozygosity in workshop 5, however, in the case of heterozygosity, our window was defined by fixed genomic length rather than by number of SNPs.

Choosing the window size for H12 analyses#

Once we have decided on our cohorts, we need to calibrate the window size, an important parameter for the H12 analysis. Every cohort needs its own window calibration because their demographic history will be unique. In workshop 5 we learnt how this demographic history can affect genome-wide background genetic diversity.

Let’s have a look at what happens if the wrong window size is used, then run through our methodology for picking the best size.

Here’s the example we saw earlier, using a 1,000 SNP window on the 2R chromosome arm.

ag3.plot_h12_gwss(
    contig="2R", 
    analysis="gamb_colu", 
    window_size=1000, 
    sample_sets="3.0", 
    sample_query="cohort_admin2_year == 'BF-09_Houet_gamb_2014'",
    title="BF-09_Houet_gamb_2014; window_size=1000"
)

As we saw earlier, by zooming into an established pyrethroid metabolic insecticide resistance locus, known to be under selection, we are able to see a clear peak in the H12 statistic if we have a good window size. Let’s zoom in to 28,500,000bp (Cyp6aa/p gene cluster) and remind ourselves.

However, if our window size is too small, then our signal may not be visible over background noise. To illustrate what happens, let’s run the same scan but with a much smaller window size of 100 bp.

ag3.plot_h12_gwss(
    contig="2R", 
    analysis="gamb_colu", 
    window_size=100, 
    sample_sets="3.0", 
    sample_query="cohort_admin2_year == 'BF-09_Houet_gamb_2014'",
    title="BF-09_Houet_gamb_2014; window_size=100 (too small!)"
)

This scan is far too noisy and it is impossible to detect any signals.

Conversely, if our window size is too big, then a true signal of selective sweep may not cover enough of the genome to generate a peak in the H12 data. To illustrate what happens, let’s run the same scan with a larger window size of 20,000 bp.

ag3.plot_h12_gwss(
    contig="2R", 
    analysis="gamb_colu", 
    window_size=20_000, 
    sample_sets="3.0", 
    sample_query="cohort_admin2_year == 'BF-09_Houet_gamb_2014'",
    title="BF-09_Houet_gamb_2014; window_size=20,000 (too big!)"
)

With a window size this large, all but the Cyp6aa/p signal have been lost.

Calibrate window size#

We can use the ag3.plot_h12_calibration() function to come to a decision about which window size will likely be most effective without having to visualise multiple H12 GWSS plots. Let’s have a look at the function.

ag3.plot_h12_calibration?

Signature:
ag3.plot_h12_calibration(
    contig: str,
    analysis: str = 'default',
    sample_query: Optional[str] = None,
    sample_query_options: Optional[dict] = None,
    sample_sets: Union[str, Sequence[str], NoneType] = None,
    cohort_size: Optional[int] = None,
    min_cohort_size: Optional[int] = 15,
    max_cohort_size: Optional[int] = 50,
    window_sizes: Sequence[int] = (100, 200, 500, 1000, 2000, 5000, 10000, 20000),
    random_seed: int = 42,
    title: Optional[str] = None,
    show: bool = True,
    chunks: Union[int, str, Tuple[Union[int, str], ...], Callable[[Tuple[int, ...]], Union[int, str, Tuple[Union[int, str], ...]]]] = 'native',
    inline_array: bool = True,
) -> Optional[bokeh.model.model.Model]
Docstring:
Plot h12 GWSS calibration data for different window sizes.

Parameters
----------
contig : str
    Reference genome contig name. See the `contigs` property for valid
    contig names.
analysis : str, optional, default: 'default'
    Which haplotype phasing analysis to use. See the
    `phasing_analysis_ids` property for available values.
sample_query : str or None, optional
    A pandas query string to be evaluated against the sample metadata, to
    select samples to be included in the returned data.
sample_query_options : dict or None, optional
    A dictionary of arguments that will be passed through to pandas
    query() or eval(), e.g. parser, engine, local_dict, global_dict,
    resolvers.
sample_sets : sequence of str or str or None, optional
    List of sample sets and/or releases. Can also be a single sample set
    or release.
cohort_size : int or None, optional
    Randomly down-sample to this value if the number of samples in the
    cohort is greater. Raise an error if the number of samples is less
    than this value.
min_cohort_size : int or None, optional, default: 15
    Minimum cohort size. Raise an error if the number of samples is less
    than this value.
max_cohort_size : int or None, optional, default: 50
    Randomly down-sample to this value if the number of samples in the
    cohort is greater.
window_sizes : sequence of int, optional, default: (100, 200, 500, 1000, 2000, 5000, 10000, 20000)
    The sizes of windows (number of SNPs) used to calculate statistics
    within.
random_seed : int, optional, default: 42
    Random seed used for reproducible down-sampling.
title : str or None, optional
    Plot title.
show : bool, optional, default: True
    If True, show the plot.
chunks : int or str or tuple of int or str or Callable[[typing.Tuple[int, ...]], int or str or tuple of int or str], optional, default: 'native'
    Define how input data being read from zarr should be divided into
    chunks for a dask computation. If 'native', use underlying zarr
    chunks. If a string specifying a target memory size, e.g., '300 MiB',
    resize chunks in arrays with more than one dimension to match this
    size. If 'auto', let dask decide chunk size.  If 'ndauto', let dask
    decide chunk size but only for arrays with more than one dimension. If
    'ndauto0', as 'ndauto' but only vary the first chunk dimension. If
    'ndauto1', as 'ndauto' but only vary the second chunk dimension. If
    'ndauto01', as 'ndauto' but only vary the first and second chunk
    dimensions. Also, can be a tuple of integers, or a callable which
    accepts the native chunks as a single argument and returns a valid
    dask chunks value.
inline_array : bool, optional, default: True
    Passed through to dask `from_array()`.

Returns
-------
Model or None
    A bokeh figure (only returned if show=False).
File:      /home/conda/developer/35674e27b19f7c625ba32a1b88449ff45c90b40edb90a065b66c5a9a5388f41c-20250421-195247-360829-93-training-nb-maintenance-mgen-15.2.0/lib/python3.11/site-packages/malariagen_data/anoph/h12.py
Type:      method

The required parameters are:

contig - the chromosome to run the calibration scans on.
analysis - which phasing analysis to use, depends on what taxa are in your cohort.
sample_query
sample_sets

Let’s run the calibration with the Burkina cohort we looked at previously. It doesn’t matter too much which contig to use, but we often use 3L because it isn’t affected by chromosomal inversions.

ag3.plot_h12_calibration(
    contig="3L", 
    analysis="gamb_colu",
    sample_sets="3.0",
    sample_query="cohort_admin2_year == 'BF-09_Houet_gamb_2014'",
)

The plot shows the distribution of H12 values across a range of window sizes from 100 SNPs to 20,000 SNPs.

Our rule of thumb is to choose a window size where the where the 95% percentile of H12 values is at or below 0.1.

For this cohort, a reasonable window size to choose is 1,000 SNPs, and as we saw earlier, that produced clean signals of selective sweeps at known insecticide resistance loci in this cohort.

Remember, the window size parameter needs to be calibrated for each cohort you want to analyse.

Step 3. Run GWSS with H12#

Now we know what a reasonable window size is for our GWSS, we can run and plot the H12 analysis.

Let’s have a look at the function we will use.

ag3.plot_h12_gwss?

Signature:
ag3.plot_h12_gwss(
    contig: str,
    window_size: int,
    analysis: str = 'default',
    sample_sets: Union[str, Sequence[str], NoneType] = None,
    sample_query: Optional[str] = None,
    sample_query_options: Optional[dict] = None,
    cohort_size: Optional[int] = None,
    min_cohort_size: Optional[int] = 15,
    max_cohort_size: Optional[int] = 50,
    random_seed: int = 42,
    title: Union[str, bool, NoneType] = None,
    sizing_mode: Literal['fixed', 'stretch_width', 'stretch_height', 'stretch_both', 'scale_width', 'scale_height', 'scale_both'] = 'stretch_width',
    width: Optional[int] = None,
    track_height: int = 170,
    contig_colors: list[str] = ['#393b79', '#5254a3', '#6b6ecf', '#9c9ede', '#637939'],
    genes_height: int = 90,
    show: bool = True,
    output_backend: Literal['canvas', 'webgl', 'svg'] = 'webgl',
    chunks: Union[int, str, Tuple[Union[int, str], ...], Callable[[Tuple[int, ...]], Union[int, str, Tuple[Union[int, str], ...]]]] = 'native',
    inline_array: bool = True,
    gene_labels: Optional[Mapping[str, str]] = None,
    gene_labelset: Optional[bokeh.models.annotations.labels.LabelSet] = None,
) -> Optional[bokeh.model.model.Model]
Docstring:
Plot h12 GWSS data.

Parameters
----------
contig : str
    Reference genome contig name. See the `contigs` property for valid
    contig names.
window_size : int
    The size of windows (number of SNPs) used to calculate statistics
    within.
analysis : str, optional, default: 'default'
    Which haplotype phasing analysis to use. See the
    `phasing_analysis_ids` property for available values.
sample_sets : sequence of str or str or None, optional
    List of sample sets and/or releases. Can also be a single sample set
    or release.
sample_query : str or None, optional
    A pandas query string to be evaluated against the sample metadata, to
    select samples to be included in the returned data.
sample_query_options : dict or None, optional
    A dictionary of arguments that will be passed through to pandas
    query() or eval(), e.g. parser, engine, local_dict, global_dict,
    resolvers.
cohort_size : int or None, optional
    Randomly down-sample to this value if the number of samples in the
    cohort is greater. Raise an error if the number of samples is less
    than this value.
min_cohort_size : int or None, optional, default: 15
    Minimum cohort size. Raise an error if the number of samples is less
    than this value.
max_cohort_size : int or None, optional, default: 50
    Randomly down-sample to this value if the number of samples in the
    cohort is greater.
random_seed : int, optional, default: 42
    Random seed used for reproducible down-sampling.
title : str or bool or None, optional
    Plot title. If True, a title may be automatically generated.
sizing_mode : {'fixed', 'stretch_width', 'stretch_height', 'stretch_both', 'scale_width', 'scale_height', 'scale_both'}, optional, default: 'stretch_width'
    Bokeh plot sizing mode, see also https://docs.bokeh.org/en/latest/docs
    /user_guide/basic/layouts.html#sizing-modes.
width : int or None, optional
    Plot width in pixels (px).
track_height : int, optional, default: 170
    Main track height in pixels (px).
contig_colors : list of str, optional, default: ['#393b79', '#5254a3', '#6b6ecf', '#9c9ede', '#637939']
    A sequence of colors.
genes_height : int, optional, default: 90
    Genes track height in pixels (px).
show : bool, optional, default: True
    If true, show the plot. If False, do not show the plot, but return the
    figure.
output_backend : {'canvas', 'webgl', 'svg'}, optional, default: 'webgl'
    Specify an output backend to render a plot area onto. See also
    https://docs.bokeh.org/en/latest/docs/user_guide/output/webgl.html.
chunks : int or str or tuple of int or str or Callable[[typing.Tuple[int, ...]], int or str or tuple of int or str], optional, default: 'native'
    Define how input data being read from zarr should be divided into
    chunks for a dask computation. If 'native', use underlying zarr
    chunks. If a string specifying a target memory size, e.g., '300 MiB',
    resize chunks in arrays with more than one dimension to match this
    size. If 'auto', let dask decide chunk size.  If 'ndauto', let dask
    decide chunk size but only for arrays with more than one dimension. If
    'ndauto0', as 'ndauto' but only vary the first chunk dimension. If
    'ndauto1', as 'ndauto' but only vary the second chunk dimension. If
    'ndauto01', as 'ndauto' but only vary the first and second chunk
    dimensions. Also, can be a tuple of integers, or a callable which
    accepts the native chunks as a single argument and returns a valid
    dask chunks value.
inline_array : bool, optional, default: True
    Passed through to dask `from_array()`.
gene_labels : Mapping[str, str] or None, optional
    A mapping of gene identifiers to custom labels, which will appear in
    the plot.
gene_labelset : LabelSet or None, optional
    A LabelSet to use in the plot.

Returns
-------
Model or None
    A bokeh figure (only returned if show=False).
File:      /home/conda/developer/35674e27b19f7c625ba32a1b88449ff45c90b40edb90a065b66c5a9a5388f41c-20250421-195247-360829-93-training-nb-maintenance-mgen-15.2.0/lib/python3.11/site-packages/malariagen_data/anoph/h12.py
Type:      method

The required parameters are almost identical to the calibration function, except now we need to specify the window_size we’d like to use for the analysis.

Let’s run a H12 GWSS analysis, end to end, all five chromosome arms, for our chosen cohort of interest.

for contig in ag3.contigs:
  ag3.plot_h12_gwss(
      contig=contig, 
      analysis="gamb_colu",
      sample_sets="3.0",
      sample_query="cohort_admin2_year == 'BF-09_Houet_gamb_2014'",
      window_size=1000
  )

Step 4. Peak identification#

Reading an H12 plot - recap:

The H12 statistic ranges from 0-1
Each point on the plot is H12 caculated over a window, measured in SNPs
True signals will generally have a max H12 value that is an outlier in the genome-wide distribution
True signals will have a peak shape with roughly exponential decay on both flanks

In the Burkina Faso cohort GWSS plots above, we can spot several signals that have both outlier maximums and fit our peak shape criteria.

Let’s pick the signal on chromosome arm 2L at ~28,550,000 bp and look for candidate insecticide resistance genes.

Step 5. Identify candidate genes#

Case study - Burkina Faso, 2L, peak centered ~28,550,000#

So we’ve done the “easy” bit, run our H12 GWSS and found some promising looking signals of selective sweeps. Now we need to identify candidate genes for further investigation.

What follows is a suggested approach to identify candidate genes potentially involved IR.

Find the approximate centre of your peak of interest by zooming in on the plot above - this is ~28,550,000bp along 2L.
Look at the genes around the centre of the peak. The gene under under selection may not be exactly under the peak, but weight priority by closeness to centre.
Build a list of genes for further investigation.

As the signal is quite wide, I am starting by looking 25kb either side of the peak.
We can also use the ag3.genome_features() function and a Pandas query to pull out a list of genes given a region.

peak_genes_df = (
    ag3.genome_features(region="2L:28,525,000-28,575,000")
    .query('type == "gene"')
    [["contig", "start", "end", "ID", "Name", "description"]]
    .set_index("ID")
)
peak_genes_df

	contig	start	end	Name	description
ID
AGAP006222	2L	28524225	28526317	NaN	glucosyl/glucuronosyl transferases [Source:VB ...
AGAP006223	2L	28526558	28528641	NaN	glucosyl/glucuronosyl transferases [Source:VB ...
AGAP006224	2L	28528758	28533199	NaN	aldehyde oxidase [Source:VB Community Annotation]
AGAP006225	2L	28534732	28539416	NaN	aldehyde oxidase [Source:VB Community Annotation]
AGAP006226	2L	28540651	28545294	Aldehyde_oxidase	NaN
AGAP006227	2L	28545396	28547938	NaN	alpha esterase [Source:VB Community Annotation]
AGAP006228	2L	28548433	28550748	COEAE2F	carboxylesterase [Source:VB Community Annotation]
AGAP006229	2L	28550814	28552032	Vps20	vacuolar protein sorting 20 [Source:VB Communi...
AGAP006231	2L	28552352	28560186	NaN	serine/threonine-protein phosphatase dullard h...
AGAP006232	2L	28563646	28565368	Pex14	peroxin-14 [Source:VB Community Annotation]
AGAP006233	2L	28565893	28567186	NaN	NaN
AGAP006234	2L	28567535	28569087	NaN	protein SHQ1 [Source:VB Community Annotation]
AGAP006235	2L	28569164	28572971	NaN	NaN
AGAP006236	2L	28573531	28574496	NaN	NaN
AGAP006237	2L	28574680	28575725	NaN	Negative elongation factor E [Source:VB Commun...

Look for any genes within gene families that have association with insecticide metabolism or resistance.

E.g. Cytochrome P450s (Cyps), glutathione S -transferases (GSTs), ion channels (e.g. VGSC), ABC transporters, esterases.
Our gene list is quite long, but if we start looking around the peak at 28,550,000bp we can see a carboxylesterase (COEAE2F) which might be IR linked, so looks like a good candidate to start investigating.

Search the literature for evidence of IR links to these genes.

We could just start searching on Google Scholar (other academic search services are available). If we do that for COEAE2F, we get a few hits straight away from the Anopheles gambiae literature including one which found that more of this carboxylesterase was made by mosquitoes exposed to insecticides.
However, a lot of insecticide resistance research happens in other insect systems where the orthologs of this gene might not be called COEAE2F.
Another approach is to first go to Vectorbase and look for orthologs of the gene, then we can also search for those on Google Scholar. If we do that for COEAE2F we find lots of the orthologs in other insects are called esterase B1. If we do a literature search with that term, we find a wealth of papers, with evidence that the same gene causes organophosphate resistance in Culex mosquitoes and has been studied in that system extensively for over 30 years!

Success!#

In our case study, we found evidence for a selective sweep at a novel locus on 2L that we don’t currently monitor in any Anopheles gambiae IR surveillance. We interrogated the genes beneath the peak of the sweep signal and found a gene belonging to a family implicated in insecticide resistance. We then searched Vectorbase to find out the different names this gene might have in other insect systems (orthologs). Using one of these orthologs we found research validating this genes involvement in organophosphate resistance in Culex, a strong suggestion that this selective sweep could be being driven by a mutation affecting this phenotype in Anopheles. Understanding and tracking organophosphate resistance is particularly important as pirimiphos-methyl (Actellic) is increasing in use both in vector control and agriculture.

What next?#

Once a new candidate resistance locus has been found, there are a number of avenues for future work. Though these are beyond the scope of this workshop but we should briefly consider them.

Identify candidate genetic variation driving the selective sweep at this locus. In workshop 1 we learnt how to analyse SNP frequencies in genes and in workshop 2 we learnt how to analyse gene copy number frequencies. By comparing cohorts where we see this selective sweep, with cohorts where we don’t, the causative variation may become clear.
Add COEAE2F to future IR surveillance. Though the involvement of this gene in Anopheles has not yet been validated, given what we know about Culex it would make sense to track SNP variation, gene copy number and evidence of selective sweeps in future mosquito collections.
Validate phenotypic effects of this gene. Populations of mosquitoes (natural or lab colony) that are susceptible and resistant to carbamates could be identified, then investigated for differences using molecular techniques. Often this approach will involved comparing the presence or absence of certain molecular motifs using polymerase chain reaction (PCR). In module 4 of this workshop, we will explore a tool which helps create the DNA primers and probes required for this type of analysis.

Well done!#

In this module we have learnt how to:

Calibrate H12 analysis window size.
Run, plot and interpret a genome scan for recent selection using the H12 statistic.
Identify candidate genes that are under selection.

Exercises#

English#

Open this notebook in Google Colab and run it for yourself from top to bottom. As you run through the notebook, cell by cell, think about what each cell is doing.

Have go at the practical exercise, but please don’t worry if you don’t have time to complete it during the practical session, and please ask the teaching assistants for help if you are stuck.

Hint: To open the notebook in Google Colab, click the rocket icon at the top of the page, then select “Colab” from the drop-down menu.

For the practical exercise, I’d like you to run your own case study from start to end, using all the methodology we have explored today.

Calibrate a window size for a Burkina Faso coluzzii cohort, contig 2R hint: sample_query=”cohort_admin2_year == ‘BF-09_Houet_colu_2014’”.
Run a H12 GWSS on contig 2R, for this cohort, using your calibrated window size.
Is there evidence of a selective sweep around 2R:40,950,000? If so, would this signal be considered a good hit or not, why?
List the genes under this peak. Hint: 2R:40,900,000-41,000,000. Do any of these genes immediately jump out as IR candidates?
Run a Google Scholar literature search on these genes. Remember to start at the centre of the peak and work out hint: try searching using the Drosophila ortholog name of the gene (from Vectorbase) with the word “Anopheles”.
If you find IR related hits to your gene literature search, list the top hit (highest up the page) alongside the gene name.

Français#

Ouvrir ce notebook dans Google Colab et l’exécuter vous-même du début à la fin. Pendant que vous exécutez le notebook, cellule par cellule, pensez à ce que chaque cellule fait et essayez de faire les exercices quand vous les rencontrez.

Essayez de faire les exercices mais ne vous inquiétez pas si vous n’avez pas le temps de tout faire pendant la séance appliquée et n’hésitez pas à demander aux enseignants assistants si vous avez besoin d’aide parce que vous êtes bloqués.

Indice: Pour ouvrir le notebook dans Google Colab, cliquer sur l’icône de fusée au sommet de cette page puis choisissez “Colab” dans le menu déroulant.

Pour l’exercice pratique, je vais vous demander de réaliser l’étude d’un cas du début à la fin en utilisant la méthodologie que nous avons explorée aujourd’hui.

Calibrer la taille des fenêtres pour une cohorte de coluzzii venant du Burkina Faso en utilisant le bras de chromosome 2R. Indice: sample_query=”cohort_admin2_year == ‘BF-09_Houet_colu_2014’”.
Exécuter un scan du génome en utilisant H12 sur le bras de chromosome 2R pour cette cohorte. Utiliser la taille des fenêtres que vous venez d’obtenir par calibration.
Y a-t-il des indications d’un balayage sélectif autour de 2R:40,950,000? Si oui, est-ce que ce signal peut être considéré comme de bonne qualité? Pourquoi?
Lister les gènes sous ce pic. Indice: 2R:40,900,000-41,000,000. Est-ce que certains de ces gènes vous sautent aux yeux comme candidats pour la résistance aux insecticides?
Réaliser une recherche de la littérature sur Google Scholar pour ces gènes. Se souvenir de commencer au centre du pic et de s’en éloigner progressivement. Indice: essayer la recherche en utilisant le nom de l’orthologue du gène pour Drosophile (en utilisant Vectorbase)) avec le mot “Anopheles”.
Si vous trouvez des articles connectés à la résistance aux insecticides lors de votre recherche de la littérature, lister les premiers résultats (en haut de la page) ainsi que le nom du gène.

Module 3 - Detecting new forms of insecticide resistance using selection scans

Contents

Module 3 - Detecting new forms of insecticide resistance using selection scans#

Learning objectives#

Lecture#

English#

Français#

Setup#

Saving H12 results#

Why do we need to scan the genome for signals of recent selection?#

Known unknowns#

Unknown unknowns#

Running a GWSS using the H12 statistic#

What is H12 analysis?#

Different kinds of selective sweeps#

How H12 works#

H12 output#

H12 analytical workflow#

Step 1. Select cohorts for analysis#

Recap on cohorts#

Cohort size#

Step 2. Calibrate window size#

Recap: Windowed analyses#

Choosing the window size for H12 analyses#

Calibrate window size#

Step 3. Run GWSS with H12#

Step 4. Peak identification#

Step 5. Identify candidate genes#

Case study - Burkina Faso, 2L, peak centered ~28,550,000#

Success!#

What next?#

Well done!#

Exercises#

English#

Français#