banner

Training course in data analysis for genomic surveillance of African malaria vectors

Workshop 7 - Gene flow and the spread of insecticide resistance


Module 2 - Adaptive gene flow#

Theme: Analysis

When a new beneficial mutation occurs in a mosquito, such as a mutation which confers resistance to an insecticide used in vector control, it is highly likely that mutation will be passed on to the next generation. With each generation, the frequency of the mutation within the local mosquito population will increase - a selective sweep. But mosquitoes are not stationary - they move and can disperse away from the original location where their parents bred. After travelling, a mosquito may then find a mate and breed in a new location. As mosquitoes disperse and breed, beneficial mutations may also be carried and spread to new locations. Mosquitoes are also sometimes pliable in their breeding habits, and can mate and breed successfully with mosquitoes from closely related species. In this way, beneficial mutations can also spread between species.

These features of mosquito biology are a significant challenge for vector control, because mosquitoes don’t respect our borders and can spread insecticide resistance across and between countries. We are interested to study the spread of beneficial mutations between locations and mosquito species, also known as adaptive gene flow, to learn more about how to contain resistance and how to coordinate resistance management strategies.

Learning objectives#

In this module we will:

  • Explain what gene flow is

  • Discuss why gene flow is relevant to malaria vector surveillance and control

  • Explain how gene flow between species (introgression) occurs

  • Explain how spatial gene flow occurs

  • Explain what adaptive gene flow is, and how it relates to selective sweeps

  • Use ancestry-informative markers to look for evidence of adaptive introgression

  • Scan the genome for evidence of adaptive gene flow between species and countries

Lecture#

English#

Français#

Please note that the code in the cells below might differ from that shown in the video. This can happen because Python packages and their dependencies change due to updates, necessitating tweaks to the code.

Setup#

%pip install -q --no-warn-conflicts malariagen_data
Note: you may need to restart the kernel to use updated packages.
import malariagen_data
import warnings
warnings.simplefilter(action='ignore', category=Warning)

Note that authentication is required to access data through the package, please follow the instructions here.

ag3 = malariagen_data.Ag3(
    results_cache="drive/MyDrive/malariagen_data_cache"
)
ag3
MalariaGEN Ag3 API client
Please note that data are subject to terms of use, for more information see the MalariaGEN website or contact support@malariagen.net. See also the Ag3 API docs.
Storage URL gs://vo_agam_release/
Data releases available 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9
Results cache /home/jonbrenas/anopheles-genomic-surveillance.github.io/docs/workshop-7/drive/MyDrive/malariagen_data_cache
Cohorts analysis 20240418
AIM analysis 20220528
Site filters analysis dt_20200416
Software version malariagen_data 9.0.0
Client location unknown
try:
    from google.colab import drive
    drive.mount("drive")
except ImportError:
    pass

Introduction#

What is gene flow?#

Gene flow is usually defined as the transfer of genetic material from one population to another. But this is a broad definition, and for this training course we are interested in something a bit more specific, which is the transfer of genetic variation (i.e., mutations) between populations. So for our purposes, the term “gene flow” can be a bit misleading, because it is not actually genes which are being transferred, but genetic variants, e.g., SNP or CNV alleles. Strictly speaking, a term like “allele flow” would be more accurate. However, we will use “gene flow” throughout this workshop, because it has become a well-established term in the population genetics literature.

Gene flow between species (introgression)#

Gene flow can occur between two different species, if members of those species occasionally mate (hybridise) and have fertile offspring. Gene flow between species is commonly known as “introgression” or sometimes as “interspecific gene flow” (“inter” meaning between).

The cartoon below provides an illustration of how gene flow between species can occur. In this cartoon there are two hypothetical mosquito species, named “species A” and “species B”. Usually, mating occurs between members of the same species. But occasionally these two species can mate and produce fertile offspring.

Mating events are shown in the cartoon as an “X”. In the first generation, a mating event occurs between an individual from species A and an individual from species B. Such a mating event between species is called “hybridisation”. The offspring of such a mating event are called “F1 hybrids”.

In the second generation, a mating event occurs between one of the F1 hybrids, and an individual from species B. Such a mating event is called “backcrossing” and the offspring are called “backcrossed hybrids”. If you trace back the ancestors of these backcrossed hybrids, you can see that three of their grandparents were individuals from species B and one of their grandparents was from species A. Thus, some genetic variation from species A will be present.

By this process of hybridisation and backcrossing, mutations originating in one species can be transferred to another species. In the cartoon, the mosquitoes coloured in red are meant to show mosquitoes where a new mutation has occurred. This mutation occurs first in an individual from species B, but the mutation finds its way into individuals from species B via introgression. The cartoon below shows this occurring in two generations, but this will be a process occurring over many generations. The same process could also transfer mutations in the opposite direction, from species B to species A.

introgression cartoon

Spatial gene flow#

Gene flow can also occur between different populations of the same species which are separated spatially, i.e., which live in different locations. This can occur if there is some movement of individuals between locations in each generation, also known as “dispersal” or “migration”. I.e., individuals can move and breed with individuals from a different location. Spatial gene flow is sometimes called “intraspecific gene flow” (“intra” meaning within).

The cartoon below illustrates how spatial gene flow can occur between hypothetical populations of mosquitoes living at three locations labelled “A”, “B” and “C”. In this hypothetical scenario, locations A and B are close enough that mosquitoes can travel between them. Most breeding occurs between mosquitoes within a single location, but occasionally mosquitoes migrate to a different location before breeding.

In the first generation, a mosquito from location A travels to location B before finding a mate and breeding there. In the second generation, one of the offspring then travels to location C before breeding there. Thus the offspring shown in the third generation will have two grandparents originating from location C, one grandparent from location B, and one grandparent from location A.

The red colour illustrates mosquitoes carrying a new mutation. This mutation occurs first in mosquitoes from location A, but via spatial gene flow the mutation travels to locations B and C. Note that the mutation reaches location C even though locations A and C are too far apart for mosquitoes to travel directly between them.

In the real world, it won’t be this simple of course. But the cartoon illustrates in general how mutations originating in one location can spread to other locations over successive generations.

spatial gene flow cartoon

Adaptive gene flow and shared selective sweeps#

Adaptive gene flow occurs when a beneficial mutation is transferred between species or locations via gene flow. For example, a mutation which confers insecticide resistance.

Because a benefical mutation is positively selected, it rises rapidly in frequency within the population where it initially arose. This is also known as a selectice sweep.

If the mutation is then transferred to a second population, and if the mutation is also beneficial in that population, then it will also be positively selected and rise in frequency. In other words, a selective sweep will also occur in the second population.

This type of event is known as “adaptive gene flow” because the transfer of the mutation has increased the fitness of the second population. We can also call this a “shared selective sweep” because selective sweeps are occurring in both populations and involve mutations with a recent common origin.

The cartoon below illustrates a hypothetical scenario involving adaptive gene flow between two populations labelled “A” and “B”. A beneficial mutation first occurs in population A and begins rising in frequency. At the third generation a gene flow event occurs which transfers the mutation to population B. Thereafter the mutation also begins rising in frequency in population B.

adaptive gene flow cartoon

Why is gene flow relevant to malaria vector surveillance and control?#

In short, because gene flow causes the spread of insecticide resistance between mosquito species and countries.

For example, different species of the Anopheles gambiae complex can hybridise, and so insecticide resistance mutations that arise in one species can spread to others. This means that insecticide resistance management strategies cannot focus on any one species in isolation, but have to consider all species with the potential for gene flow together.

We also know that there is considerable potential for spatial gene flow within major malaria vector species, and so insecticide resistance mutations which arise in one country can spread to others. This means that insecticide resistance management strategies cannot consider a single country in isolation but may need to coordinate efforts across borders.

Overall, the potential for gene flow reduces the time for insecticide resistance mutations to emerge and reach high frequencies. In other words, it means that evolution happens faster than it would if mosquitoes from different species and countries were isolated.

To understand the public health impact, it can be useful to draw an analogy with an outbreak of an infectious disease, such as the 2014-2016 Ebola outbreak in West Africa illustrated in the video below. In these outbreaks, an infectious disease originated in one country but spread to others, requiring a coordinated international response. By analogy, insecticide resistance mutations may occur in one country but spread to others, and thus insecticide resistance management may require international coordination.

Detecting adaptive introgression via ancestry-informative markers (AIMs)#

We would like to detect adaptive introgression of insecticide resistance mutations between different species in the Anopheles gambiae complex, so that we can learn more about what introgression events have happened in the past, and where and when introgression is most likely to occur.

In some circumstances it is possible to detect adaptive introgression using AIMs. This is possible because when a beneficial mutation is transferred via gene flow, it is actually a segment of DNA which is transferred, carrying along other nearby variants which are said to be “hitchhiking”. What this means is that, surrounding the beneficial allele, we would expect to find ancestry-informative marker alleles that more commonly found in a different species.

Let’s see an example of an adaptive introgression event which has occurred between An. gambiae and An. coluzzii.

ag3.plot_aim_heatmap(
    aims="gamb_vs_colu",
    sample_sets="3.0",
    sample_query="country == 'Burkina Faso' and aim_species != 'arabiensis'",
)
                                     

In this example we can see that the individuals in the upper part of the plot have mostly red (homozygous coluzzii) AIM genotypes, but on chromosome arm 2L have either blue (homozygous gambiae) or yellow (heterozygous) genotypes. This is because an introgression event has occurred from An. gambiae to An. coluzzii transferring the L995F “kdr” mutation which confers resistance to pyrethroids, as first described in Clarkson et al. (2014) and Norris et al. (2015).

Exercise 1 (English)

Can you use AIMs to find any evidence for adaptive introgression between An. gambiae and An. arabiensis in Tanzania?

If so, on what chromosome arm does it occur?

Which direction is the introgression event (i.e., from which species, to which species)?

Hint: Use the plot_aim_heatmap() function, with the aims parameter value "gambcolu_vs_arab" and the sample_query parameter value "country == 'Tanzania'.

Exercice 1 (Français)

Pouvez-vous utiliser les AIMs pour trouver des indices d’une introgression adaptative entre An. gambiae et An. arabiensis en Tanzanie?

Si oui, sur quel bras de chromosome se trouve-t-elle?

Dans quelle direction est l’introgression (i.e., de quelle espèce, vers quelle espèce)?

Indice: Utiliser la fonction plot_aim_heatmap() avec "gambcolu_vs_arab" comme valeur pour le paramètre aims et "country == 'Tanzania' comme valeur pour le paramètre sample_query.

Limitations of using AIMs for detecting adaptive introgression#

There are some important limitations to using AIMs for detecting adaptive introgression. In particular, the AIMs are sparsely distributed throughout the genome, leaving big gaps where introgression events may occur but go undetected. The “gamb_vs_colu” AIMs are also not evenly distributed across the genome, but are clustered in just a few genome locations, particularly centromeric regions, making the problem even worse.

The AIM approach is also not suitable for detecting spatial gene flow within species. This is because there are generally not enough SNPs which are fixed differences between geographically distinct populations to provide enough coverage over the genome.

Fortunately, however, there are other methods we can use to overcome these limitations and discover both adaptive introgression and spatial gene flow events occurring anywhere in the genome.

Genome-wide scans for adaptive gene flow#

There are a number of different methods that can be used to scan the whole genome for signals of adaptive gene flow between species or geographically separated populations.

In this module we will demonstrate one approach, which looks for higher than usual levels of haplotype sharing between two groups (cohorts) of mosquitoes.

Before we do this, however, we will first look for genome regions with evidence for selective sweeps in multiple populations. This allows us to first isolate genome regions we know are under selection. Any elevated gene flow at these genome regions is then highly likely to be adaptive gene flow.

Recalling workshop 6, let’s first run a genome-wide selection scan (GWSS) using the H12 statistic.

We will do this for two cohorts of mosquitoes collected near Bobo Dioulasso (Houet province) in Burkina Faso, one An. gambiae and one An. coluzzii.

First check which cohorts have been defined for Burkina Faso, and how many samples are available for each cohort.

df_samples = ag3.sample_metadata(sample_sets="3.0")
df_samples.query("country == 'Burkina Faso'")["cohort_admin2_year"].value_counts()
cohort_admin2_year
BF-09_Houet_gamb_2012     99
BF-09_Houet_colu_2012     82
BF-09_Houet_colu_2014     53
BF-09_Houet_gamb_2014     46
BF-07_Bazega_gamb_2004    13
BF-09_Houet_arab_2014      3
Name: count, dtype: int64

Calibrate the H12 window size for two cohorts of interest.

ag3.plot_h12_calibration(
    contig="3L",
    analysis="gamb_colu",
    sample_sets="3.0",
    sample_query="cohort_admin2_year == 'BF-09_Houet_colu_2014'",
)
ag3.plot_h12_calibration(
    contig="3L",
    analysis="gamb_colu",
    sample_sets="3.0",
    sample_query="cohort_admin2_year == 'BF-09_Houet_gamb_2014'",
)

A window size of around 1000 SNPs looks reasonable for both cohorts.

Now run genome-wide selection scans. Let’s just run chromosome arm 2L, as this will provide a useful example.

ag3.plot_h12_gwss(
    contig="2L",
    analysis="gamb_colu",
    sample_sets="3.0",
    sample_query="cohort_admin2_year == 'BF-09_Houet_colu_2014'",
    window_size=1000,
)
                                     
ag3.plot_h12_gwss(
    contig="2L",
    analysis="gamb_colu",
    sample_sets="3.0",
    sample_query="cohort_admin2_year == 'BF-09_Houet_gamb_2014'",
    window_size=1000,
)
                                     

We can see evidence for several selective sweeps. In particular, there is a strong signal in both cohorts at around position 2.4 Mbp. The signal gets a bit fuzzy on the left hand side, because it occurs on the edge of the pericentromeric region, where there are lower rates of recombination. But this is a genome location where selective sweeps are often found, because it’s the location of the voltage-gated sodium channel gene (Vgsc), where mutations confer resistance to pyrethroids.

However, the question is, are the selective sweeps we observe at this locus shared between the two cohorts? In other words, has there been adaptive introgression spreading a beneficial mutation between the two species, or are these independent and unconnected selective sweeps?

Before we can answer this question, we need to first explain briefly some methodology.

Detecting shared selective sweeps via haplotype frequencies#

To detect shared selective sweeps, we can make use of some of the same ideas used by the H12 statistic to discover selective sweeps. In particular, we can compute haplotype frequencies in windows across the genome, and look for unusual patterns of haplotype frequencies in some windows relative to the genome-wide distribution.

Let’s recap how to compute haplotype frequencies. Recall this example from Workshop 6, showing 6 haplotypes observed from 3 individuals. In the region shown, there are 3 distinct haplotypes, which have been illustrated by colouring red, blue or white. The red haplotype is observed twice in this dataset, and so has a frequency of 2/6 = 0.33. The blue haplotype is observed three times and so has a frequency of 3/6 = 0.5. The white haplotype’s frequency is 1/6 = 0.17.

haplotype frequencies example

We can then use haplotype frequencies to identify genome regions where there is an unusual amount of haplotype sharing, which is a signal the region is under positive selection. E.g., this is basis of the H1, H12 and H123 statistics defined by Garud et al (@@YEAR). The cartoon below illustrates how the H1 statistic is calculated.

detecting sweeps

In a similar way, we can use haplotype frequencies to look for genome regions with an unusual amount of haplotype sharing between cohorts. We have defined a summary statistic which seems to capture this well, called H1X. The cartoon below illustrates how H1X is calculated from haplotype frequencies.

detecting shared sweeps

We’ve added a function called plot_h1x_gwss() which will perform a genome-wide scan using the H1X statistic. Let’s examine the parameters of this function.

ag3.plot_h1x_gwss?

Example - detecting adaptive introgression#

Let’s now use this new statistic and function to look for evidence of adaptive introgression between An. gambiae and An. coluzzii cohorts from Burkina Faso. Recall that on chromosome arm 2L we saw evidence for selective sweeps in both cohorts at around 2.4 Mbp.

ag3.plot_h1x_gwss(
    contig="2L",
    analysis="gamb_colu",
    sample_sets="3.0",
    cohort1_query="cohort_admin2_year == 'BF-09_Houet_colu_2014'",
    cohort2_query="cohort_admin2_year == 'BF-09_Houet_gamb_2014'",
    window_size=1000,
)
                                     

Exercise 2 (English)

Look for shared selective sweeps between the same An. gambiae and An. coluzzii cohorts from Burkina Faso, but on chromosome arm 2R instead of 2L. This involves three steps:

  • Run a H12 selection scan in the BF-09_Houet_colu_2014 cohort

  • Run a H12 selection scan in the BF-09_Houet_gamb_2014 cohort

  • Rnu a H1X scan between these two cohorts

What’s happening at the Cyp6aa/p gene cluster, which is around 28.5 Mbp? Is there any evidence for selective sweeps? Is there any evidence for adaptive introgression?

Exercice 2 (Français)

Chercher des balayages sélectifs partagés entre les mêmes cohortes d’An. gambiae et d’An. coluzzii du Burkina Faso mais sur le bras de chromosome 2R au lieu de 2L. Cela se fait en trois étapes:

  • Exécuter un scan de sélection H12 pour la cohorte BF-09_Houet_colu_2014

  • Exécuter un scan de sélection H12 pour la cohorte BF-09_Houet_gamb_2014

  • Exécuter un scan de sélection H1X entre ces deux cohorts

Que se passe-t-il au groupe de gènes Cyp6aa/p qui se trouve à environ 28.5 Mbp? Y a-t-il des indications d’un balayage sélectif? Y a-t-il des indications d’une introgression adaptative?

Example - detecting adaptive gene flow between countries#

Let’s now use the same approach to look for evidence of adaptive gene flow between two countries. Let’s examine An. gambiae from Mali and compare with the An. gambiae from Burkina Faso we investigated above.

Here are the cohorts defined in Mali:

df_samples.query("country == 'Mali'")["cohort_admin2_year"].value_counts()
cohort_admin2_year
ML-3_Yanfolila_gamb_2012    65
ML-2_Kati_gamb_2014         33
ML-3_Yanfolila_colu_2012    27
ML-2_Kati_colu_2014         27
ML-2_Kangaba_gamb_2004      23
ML-4_Bla_colu_2004          19
ML-2_Kati_colu_2004         11
ML-2_Kati_gamb_2004          9
ML-4_Baroueli_colu_2004      6
ML-4_Baroueli_gamb_2004      1
ML-4_Bla_arab_2004           1
ML-4_Baroueli_arab_2004      1
Name: count, dtype: int64

Let’s focus on the ML-2_Kati_gamb_2014 cohort. Calibrate H12 window size:

ag3.plot_h12_calibration(
    contig="3L",
    analysis="gamb_colu",
    sample_sets="3.0",
    sample_query="cohort_admin2_year == 'ML-2_Kati_gamb_2014'",
)

Run H12 GWSS to look for evidence of selective sweeps.

ag3.plot_h12_gwss(
    contig="2R",
    analysis="gamb_colu",
    sample_sets="3.0",
    sample_query="cohort_admin2_year == 'ML-2_Kati_gamb_2014'",
    window_size=1000,
)
                                     

Let’s remind ourselves of the H12 GWSS from Burkina Faso:

ag3.plot_h12_gwss(
    contig="2R",
    analysis="gamb_colu",
    sample_sets="3.0",
    sample_query="cohort_admin2_year == 'BF-09_Houet_gamb_2014'",
    window_size=1000,
)
                                     

Now look for evidence of shared selective sweeps between these two cohorts:

ag3.plot_h1x_gwss(
    contig="2R",
    analysis="gamb_colu",
    sample_sets="3.0",
    cohort1_query="cohort_admin2_year == 'ML-2_Kati_gamb_2014'",
    cohort2_query="cohort_admin2_year == 'BF-09_Houet_gamb_2014'",
    window_size=1000,
)
                                     

Here there are strong selection signals in both cohorts at the Cyp6aa/p locus around 28.5 Mbp, and there is a clear signal of a shared sweep at this locus as well. This is strong evidence that one or more haplotypes are sweeping and have spread to both locations from a common origin.

Exercise 3 (English)

Following the same approach as demonstrated above for Burkina Faso and Mali, investigate evidence for adaptive gene flow on chromosome arm 2R between Burkina Faso and any other West African country. Here are some steps to follow:

  • Select a country to compare with Burkina Faso.

  • Find out what cohorts have been defined for your country of interest and how many samples are available for each cohort.

  • Choose a specific cohort to analyse.

  • Run H12 window size calibration for your chosen cohort.

  • Run H12 GWSS on chromosome arm 2R for your chosen cohort.

  • Run H1X GWSS on chromosome arm 2R comparing your chosen cohort with a cohort from Burkina Faso.

Exercice 3 (Français)

En suivant la même approche que celle montré plus haut pour le Burkina Faso et le Mali, explorer les indices d’un flux génétique adaptatif sur le bras de chromosome 2R entre le Burkina Faso et un autre pays de votre choix en Afrique de l’Ouest. Voici quelques étapes importantes:

  • Choisir un pays à comparer au Burkina Faso.

  • Trouver les cohortes définies pour ce pays et combien d’échantillons chaque cohorte contient.

  • Choisir une cohorte spécifique à analyser.

  • Exécuter une calibration des fenêtres H12 pour la cohorte de votre choix.

  • Exécuter un scan de sélection H12 GWSS sur le bras de chromosome 2R pour la cohorte de votre choix.

  • Exécuter un scan de sélection H1X GWSS sur le bras de chromosome 2R comparant la cohorte de votre choix à une cohorte du Burkina Faso.

Conclusions and caveats#

Before we finish, there are a couple of caveats that we have to bear in mind when interpreting results from these analyses.

In particular, in a situation like the above when you have evidence of a shared selective sweep at a known insecticide resistance locus involving two countries like Burkina Faso and Mali…

  • Can you infer direct gene flow between the two countries?

No, because the haplotype under selection could have spread to both countries from a third location.

  • Can you infer that adaptive gene flow is the only driver of resistance?

No, because there could be multiple haplotypes under selection, only some of which are spreading between countries.

In other words, at any given location there may be a combination of haplotypes under selection, which have originated from different locations.

Unfortunately mosquito populations are highly interconnected and patterns of adaptive gene flow can get complicated!

Practical exercises#

As usual, please now run this notebook for yourself from top to bottom, attempting the exercises you encounter along the way.

When you’ve finished, if you feel like a challenge, have a go at the advanced exercise below.

Advanced exercise (English)#

Look for evidence of adaptive introgression between An. gambiae and An. arabiensis in Tanzania on chromosome arm 2R. Here are some steps to follow:

  • Query the sample metadata to find out what cohorts have been defined in Tanzania and how many samples there are in each cohort.

  • Identify the gambiae and arabiensis cohorts with sufficient sample size for analysis.

  • Run H12 window size calibration for each of the selected cohorts.

  • Run H12 GWSS on chromosome arm 2R for each of the selected cohorts.

  • Run H1X HWSS on chromosome arm 2R between pairs of gambiae and arabiensis cohorts.

What do you think is happening at the Cyp6aa/p gene cluster?

Exercice avancé (Français)#

Chercher des preuves d’une introgression adaptative entre An. gambiae et An. arabiensis en Tanzanie sur le bras de chromosome 2R. Voici quelques étapes à suivre:

  • Utiliser une requête portant sur les métadonnées pour trouver quelles cohortes sont définies pour la Tanzanie et combien d’échantillons elles contiennent.

  • Identifier les cohortes de gambiae et d’arabiensis ayant suffisamment d’échantillons pour cette analyse.

  • Exécuter une calibration des fenêtres H12 pour les cohortes choisies.

  • Exécuter un scan de sélection H12 GWSS sur le bras de chromosome 2R pour les cohortes choisies.

  • Exécuter un scan de sélection H1X GWSS sur le bras de chromosome 2R comparant des paires de cohortes gambiae et arabiensis.

Que se passe-t-il dans le groupe de gènes Cyp6aa/p selon vous?