Module 4 - Investigating gene flow at loci of interest using haplotype networks

Training course in data analysis for genomic surveillance of African malaria vectors

Workshop 7 - Gene flow and the spread of insecticide resistance

Module 4 - Investigating gene flow at loci of interest using haplotype networks#

Theme: Analysis

In module 2 of workshop 7, we learnt about selective sweeps being shared between populations through adaptive gene flow, and in and module 3 we used hierarchical haplotype clustering to investigate and visualise haplotypes spreading between countries and species, allowing us to track gene flow at insecticide resistance loci. We also have another methodology in our armamentarium which lets us visualise the relationship between haplotypes in a different way, and that is haplotype networks. In this module we will learn how to create and interpret haplotype networks to analyse gene flow at loci associated with insecticide resistance.

Please note: This module contains interactive plots using Plotly Dash and Cytoscape.js. These plots will not be visible on the course website, but can be generated by running the notebook for yourself. To run this notebook, click the rocket icon at the top of the page then select “Colab”.

Learning objectives#

Understand the theory behind haplotype networks.
Use the malariagen_data python package to build haplotype networks for genomic regions with evidence of adaptive gene flow.
Learn how to interpret haplotype networks to find evidence for gene flow events between countries or species.

Lecture#

English#

Français#

Please note that the code in the cells below might differ from that shown in the video. This can happen because Python packages and their dependencies change due to updates, necessitating tweaks to the code.

Setup#

First, let’s begin by installing and importing some Python packages, and configuring access to Anopheles genomic data from the MalariaGEN Ag3.0 data resource.

%pip install -q --no-warn-conflicts malariagen_data dash

import malariagen_data

import dash

Note that authentication is required to access data through the package, please follow the instructions here.

ag3 = malariagen_data.Ag3()
ag3

MalariaGEN Ag3 API client
Please note that data are subject to terms of use, for more information see the MalariaGEN website or contact support@malariagen.net. See also the Ag3 API docs.
Storage URL	gs://vo_agam_release_master_us_central1
Data releases available	3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 3.10, 3.11, 3.12, 3.13, 3.14
Results cache	None
Cohorts analysis	20250131
AIM analysis	20220528
Site filters analysis	dt_20200416
Software version	malariagen_data 15.2.0
Client location	Iowa, United States (Google Cloud us-central1)

Haplotype networks#

We have an experimental function for building haplotype networks in the malariagen_data python package (we’re still ironing out some bugs!). Let’s have a look at the doc string for the ag3.plot_haplotype_network() function by using “?”.

ag3.plot_haplotype_network?

Signature:
ag3.plot_haplotype_network(
    region: Union[str, malariagen_data.util.Region, Mapping, List[Union[str, malariagen_data.util.Region, Mapping]], Tuple[Union[str, malariagen_data.util.Region, Mapping], ...]],
    analysis: str = 'default',
    sample_sets: Union[str, Sequence[str], NoneType] = None,
    sample_query: Optional[str] = None,
    sample_query_options: Optional[dict] = None,
    max_dist: int = 2,
    color: Union[str, Mapping, NoneType] = None,
    color_discrete_sequence: Optional[List] = None,
    color_discrete_map: Optional[Mapping] = None,
    category_orders: Union[List, Mapping, NoneType] = None,
    node_size_factor: int = 50,
    layout: str = 'cose',
    layout_params: Optional[Mapping] = None,
    server_port: Optional[int] = None,
    server_mode: Optional[Literal['inline', 'external', 'jupyterlab', 'tab']] = 'inline',
    height: int = 600,
    width: Union[int, str, NoneType] = '100%',
    serve_scripts_locally: bool = True,
    chunks: Union[int, str, Tuple[Union[int, str], ...], Callable[[Tuple[int, ...]], Union[int, str, Tuple[Union[int, str], ...]]]] = 'native',
    inline_array: bool = True,
)
Docstring:
Construct a median-joining haplotype network and display it using
Cytoscape.

A haplotype network provides a visualisation of the genetic distance
between haplotypes. Each node in the network represents a unique
haplotype. The size (area) of the node is scaled by the number of times
that unique haplotype was observed within the selected samples. A
connection between two nodes represents a single SNP difference between
the corresponding haplotypes.

Parameters
----------
region : str or Region or Mapping or list of str or Region or Mapping or tuple of str or Region or Mapping
    Region of the reference genome. Can be a contig name, region string
    (formatted like "{contig}:{start}-{end}"), or identifier of a genome
    feature such as a gene or transcript. Can also be a sequence (e.g.,
    list) of regions.
analysis : str, optional, default: 'default'
    Which haplotype phasing analysis to use. See the
    `phasing_analysis_ids` property for available values.
sample_sets : sequence of str or str or None, optional
    List of sample sets and/or releases. Can also be a single sample set
    or release.
sample_query : str or None, optional
    A pandas query string to be evaluated against the sample metadata, to
    select samples to be included in the returned data.
sample_query_options : dict or None, optional
    A dictionary of arguments that will be passed through to pandas
    query() or eval(), e.g. parser, engine, local_dict, global_dict,
    resolvers.
max_dist : int, optional, default: 2
    Join network components up to a maximum distance of 2 SNP differences.
color : str or Mapping or None, optional
    Name of variable to use to color the markers.
color_discrete_sequence : List or None, optional
    Provide a list of colours to use.
color_discrete_map : Mapping or None, optional
    Provide an explicit mapping from values to colours.
category_orders : List or Mapping or None, optional
    Control the order in which values appear in the legend.
node_size_factor : int, optional, default: 50
    Control the sizing of nodes.
layout : str, optional, default: 'cose'
    Name of the network layout to use to position nodes.
layout_params : Mapping or None, optional
    Additional parameters to the layout algorithm.
server_port : int or None, optional
    Manually override the port on which the Dash app will run.
server_mode : {'inline', 'external', 'jupyterlab', 'tab'} or None, optional, default: 'inline'
    Controls how the Jupyter Dash app will be launched. See
    https://dash.plotly.com/dash-in-jupyter#display-modes for more
    information.
height : int, optional, default: 600
    Height of the Dash app in pixels (px).
width : int or str or None, optional, default: '100%'
    Width of the Dash app.
serve_scripts_locally : bool, optional, default: True
    If True, serve scripts locally, otherwise use CDN.
chunks : int or str or tuple of int or str or Callable[[typing.Tuple[int, ...]], int or str or tuple of int or str], optional, default: 'native'
    Define how input data being read from zarr should be divided into
    chunks for a dask computation. If 'native', use underlying zarr
    chunks. If a string specifying a target memory size, e.g., '300 MiB',
    resize chunks in arrays with more than one dimension to match this
    size. If 'auto', let dask decide chunk size.  If 'ndauto', let dask
    decide chunk size but only for arrays with more than one dimension. If
    'ndauto0', as 'ndauto' but only vary the first chunk dimension. If
    'ndauto1', as 'ndauto' but only vary the second chunk dimension. If
    'ndauto01', as 'ndauto' but only vary the first and second chunk
    dimensions. Also, can be a tuple of integers, or a callable which
    accepts the native chunks as a single argument and returns a valid
    dask chunks value.
inline_array : bool, optional, default: True
    Passed through to dask `from_array()`.
File:      /home/conda/developer/35674e27b19f7c625ba32a1b88449ff45c90b40edb90a065b66c5a9a5388f41c-20250421-195247-360829-93-training-nb-maintenance-mgen-15.2.0/lib/python3.11/site-packages/malariagen_data/anopheles.py
Type:      method

The required parameters, region and analysis are familiar to us by now, but there are a some new optional parameters here that might be useful to understand.

max_dist= This is the maximum distance in n SNPs where the algorithm will join haplotype nodes together as one network, after which it will split the nodes in to separate networks. The default here is 2 SNPs. We will see an example of how this works shortly.
color= Identifies a column in the sample metadata which determines the colour of pie chart segments within nodes. e.g. we could use “country” or “taxon” columns.

It is also worth knowing that if you get 403 errors instead of figures, try reloading your browser tab or restart your browser.

Let’s build a haplotype network around Kdr locus in the Vgsc gene for all the An. coluzzii samples in Ag3.0, and look at the output.

ag3.plot_haplotype_network(
    region='2L:2,400,000-2,440,000',
    analysis='gamb_colu',
    sample_sets='3.0',
    sample_query= 'taxon == "coluzzii"',
    color="country",
    max_dist=2,
)

N.B., you will need to run this notebook for yourself in colab in order for the haplotype network to work.

Technical interpretation#

The output figure represents identical haplotypes as nodes (circles), the larger the node, the greater the number of identical haplotypes. In this example, the color parameter has been set to the “country” column of the haplotype metadata, so the node colour proportions represent the countries where that haplotype was present in our data set. We often also set the color parameter to “taxon”, in which case, node colour proportion represents the species which carried that haplotype. If we click or tap a node, information about its composition is provided at the bottom of the figure. This information relates the the color parameter, i.e. if color=taxon, the information will be about species composition. Nodes can also be moved to aid visualisation (shift + click to select multiple), and the whole figure can be zoomed in and out using the mouse wheel.

The edges (lines) that join nodes together represent the genetic distance between the nodes in SNPs. Each edge is a single SNP difference.

In the Vgsc example, with the max_dist (maximum distance) parameter set to two, we se each node in these networks is separated by a maximum genetic distance of 2 SNPs, shown as two edges with a small grey node in the middle. If we click on a small grey node we see “No. haplotypes: 0”. The evolutionary relationship between these haplotype nodes either side of the small node has been inferred as we don’t actually have the intermediate haplotype (with only one SNP difference) in our data set. Let’s increase the max_dist parameter and see what happens.

ag3.plot_haplotype_network(
    region='2L:2,400,000-2,440,000',
    analysis='gamb_colu',
    sample_sets='3.0',
    sample_query= 'taxon == "coluzzii"',
    color="country",
    max_dist=4,
)

N.B., you will need to run this notebook for yourself in colab in order for the haplotype network to work.

If we increase the max_dist parameter to four SNPs, we see haplotypes linked together by up to three “intermediate” node, the network has found connections between more distantly related haplotypes. We generally use max_dist=2, mainly because we’re interested here only in groups of very closely related haplotypes, which are therefore very likely to share a recent common origin.

To aid interpretation, only networks containing more than one haplotype are represented in the output figure. This means that if a particular haplotype is found only once in our data set (i.e. a singleton haplotype), and that haplotype is more than the max_dist number of SNPs different from any other haplotype, it not be shown.

Evolutionary interpretation#

Let’s return to the Vgsc network example with max_dist=2 and try to interpret the evolutionary inference with respect to insecticide resistance.

ag3.plot_haplotype_network(
    region='2L:2,400,000-2,440,000',
    analysis='gamb_colu',
    sample_sets='3.0',
    sample_query= 'taxon == "coluzzii"',
    color="country",
    max_dist=2,
)

N.B., you will need to run this notebook for yourself in colab in order for the haplotype network to work.

When a fitness increasing mutation occurs, haplotype it resides on comes under selection and it increases in frequency. So, here we’re interested especially in the large nodes, because these are very likely to be high frequency haplotypes experiencing positive selection.
During a selective sweep, some additional mutations will occur on the original haplotype. These could be neutral or they could be secondary functional mutations (i.e. further increasing fitness), but key point is that these additional mutations occur, and give us a network of closely related haplotypes. We can infer that each network of connected haplotypes have a recent common origin.
If a network of connected haplotypes contains haplotypes from two or more countries, this is evidence for adaptive gene flow. Particularly a large node containing haplotypes from multiple countries, this is very strong evidence for adaptive gene flow because exactly the same haplotype is found across countries.
Likewise, if a network of connected haplotypes contains haplotypes from two or more species, this is also evidence for a type of adaptive gene flow called adaptive introgression. Again, a large node containing haplotypes from multiple species, is very strong evidence for adaptive gene flow.

The most striking feature of this Vgsc network is the very large and complex network at the top the figure. The larger node colours show that there are several high frequency haplotypes, closely related to each other, shared across countries. This is a classic signal of adaptive gene flow and in this case, we know from previous research, the haplotypes all carry the L995F kdr mutations that confers resistance to pyrethroids and DDT. Furthermore, the multiple large nodes suggest that secondary mutations have recently occurred on an already selected haplotype, then these new haplotypes have increased in frequency, likely also under selection.

We can also see some smaller complex networks, which suggest potential selective sweeps, two of which show evidence of adaptive gene flow between countries (sharing of identical haplotypes).

The smallest networks are fairly uninformative, and could be considered noise.

Now we have an idea how to run and interpret haplotype networks, let’s look at some insecticide resistance loci we looked at in module 3.

Exercise 1#

English#

Repeat the Vgsc network analysis, but this time investigate adaptive gene flow (introgression) between An. gambiae and coluzzii. HINT: color=?????.
Is there evidence of adaptive gene flow between species?

Français#

Répéter l’analyse des réseaux pour Vgsc mais cette fois étudier le flux génétique adaptatif (introgression) entre An. gambiae et coluzzii. INDICE: color=?????.
Y a-t-il des indications de flux génétique adaptatif entre espèces?

Example: Ace1#

We’ve heard about the Ace1 genes involvement in organophosphate insecticide resistance, and in the last module, we’ve also seen the results of visualising the relationships between haplotypes at this locus using hierarchical clustering. Let’s first remind ourselved what that looked like for Ace1 with An. gambiae haplotypes from from Ghana and Burkina Faso..

Ace1 hierarchical clustering re-cap#

ag3.plot_haplotype_clustering(
    region='2R:3,470,000-3,490,000',
    analysis='gamb_colu',
    sample_sets='3.0',
    sample_query= 'country in ["Ghana", "Burkina Faso"] & taxon == "gambiae"',
    color='country')

N.B., you will need to run this notebook for yourself in colab in order for the haplotype network to work.

Towards the right hand side of the dendrogram, we can see a region where there are haplotypes closely related to each other. By zooming in we can find that this distance is <5 SNPs between haplotypes, rather than >50 SNPs distance we see elsewhere in the figure. This pattern is suggestive of a selective sweep, and as it includes multiple countries, is also suggestive of adaptive gene flow.

Adaptive gene flow is where a fitness improving mutation increases in frequency and spreads across space, between populations and/or between different species. Here we are detecting the increase and spread of the haplotype which carries this locus under positive selection. Given the importance of this gene to insecticide action and the high evolutionary pressure that these chemicals put on mosquitoes, this signal may well be driven by a resistance phenotype associated locus.

Let’s investigate the same region but using haplotype networks.

Ace1 haplotype networks#

ag3.plot_haplotype_network(
    region='2R:3,470,000-3,490,000',
    analysis='gamb_colu',
    sample_sets='3.0',
    sample_query= 'country in ["Ghana", "Burkina Faso"] & taxon == "gambiae"',
    color="country",
    max_dist=2,
)

N.B., you will need to run this notebook for yourself in colab in order for the haplotype network to work.

Looking back at the hierarchical clustering analysis of Ace1, the vast majority of the dendrogram leaves (haplotypes) have long branches (genetic distances) to their nearest neighbours. All of those singleton haplotypes, more that 2 SNPs distance from any other haplotypes, are filtered out in the network analysis as we are only interested in high frequency haplotypes, and networks of closely related haplotypes, because they are likely to be haplotypes under recent positive selection. What remains are the haplotypes which make us the networks we can see above, a much simplified picture of the genetic relationships.

The largest network corresponds to the selective sweep signal we saw in the clustering analysis. Clicking on the largest node, we see it contains 15 identical haplotypes, 11 from Burkina Faso and four from Ghana. As expected, this corresponds perfectly with what we see if we zoom in to the far right side of the clustering figure and roll over the leaves with our mouse, further evidence this haplotype has spread, under selection between countries. Several larger nodes, closely related to each other gives us evidence that the selection is recent (as we might expect if driven by malaria control insecticidal pressures).

Exercise 2#

English#

For Ace1 add coluzzii samples as well, colour the nodes by taxon and take another look at these networks. HINT sample_query= 'country in ["Ghana", "Burkina Faso"] & taxon in ["gambiae", "coluzzii"]'

is there evidence of adaptive introgression of haplotypes between species at the Ace1 locus?

Français#

Pour Ace1 ajouter les coluzzii, colorer les noeuds par taxon et regarder à nouveau les réseaux. INDICE sample_query= 'country in ["Ghana", "Burkina Faso"] & taxon in ["gambiae", "coluzzii"]'

Y a-t-il des indications d’introgression adaptive entre espèces au locus Ace1 ?

Example: Gste2#

We have seen that Gste2 gene is associated with insecticide resistance, and we saw evidence for adaptive gene flow at this locus using hierarchical clustering. Lets now perform a haplotype networking analysis at this gene and compare the results.

ag3.plot_haplotype_clustering(
    region='3R:28,590,000-28,600,000',
    analysis='gamb_colu',
    sample_sets='3.0',
    sample_query= 'country in ["Ghana", "Burkina Faso", "Cote d\'Ivoire"]',
    color='taxon',
    symbol='country'
    )

ag3.plot_haplotype_network(
    region='3R:28,590,000-28,600,000',
    analysis='gamb_colu',
    sample_sets='3.0',
    sample_query= 'country in ["Ghana", "Burkina Faso", "Cote d\'Ivoire"]',
    color="country",
    max_dist=2,
    )

ag3.plot_haplotype_network(
    region='3R:28,590,000-28,600,000',
    analysis='gamb_colu',
    sample_sets='3.0',
    sample_query= 'country in ["Ghana", "Burkina Faso", "Cote d\'Ivoire"]',
    color="taxon",
    max_dist=2,
    )

N.B., you will need to run this notebook for yourself in colab in order for the haplotype networks to work.

The haplotype network figures make it easy to spot the selective sweeps and gene flow between Burkina Faso and Ghana (but not Cote d’Ivoire) and the adaptive introgression of haplotypes between gambiae and coluzzii at this locus.

Exercise 3#

English#

The cytochrome P450 gene Cyp9k1 is associated with metabolic pyrethroid resistance. Run a haplotype clustering analysis on samples from Ghana and Cote d’Ivoire on this region. HINT: region=’X:15,240,000-15,250,000’

Is there evidence of selection?
Is there evidence of adaptive gene flow?
If so, between countries and/or species?

Français#

Le cytochrome P450 gene Cyp9k1 est associé à la résistance métabolique aux pyrethrinoïdes. Exécuter une analyse par regroupement d’haplotypes pour les échantillons du Ghana et de la Côte d’Ivoire dans cette région. INDICE: region=’X:15,240,000-15,250,000’

Y a-t-il des indications de sélection?
Y a-t-il des indications de flux génétique adaptatif?
Si oui, entre pays et/ou espèces?

Well done!#

In this module we have:

Learnt about the theory behind haplotype networks.
Used the malariagen_data python package to build haplotype networks for genomic regions with evidence of adaptive gene flow.
Learnt how to interpret haplotype networks to discern movement of insecticide resistance loci between countries and species.

Exercises#

English#

Open this notebook in Google Colab and run it for yourself from top to bottom. As you run through the notebook, cell by cell, think about what each cell is doing, and try the practical exercises along the way.

Have go at the practical exercises, but please don’t worry if you don’t have time to do them all during the practical session, and please ask the teaching assistants for help if you are stuck.

Hint: To open the notebook in Google Colab, click the rocket icon at the top of the page, then select “Colab” from the drop-down menu.

Français#

Ouvrir ce notebook dans Google Colab et l’exécuter vous-même du début à la fin. Pendant que vous exécutez le notebook, cellule par cellule, pensez à ce que chaque cellule fait et essayez de faire les exercices quand vous les rencontrez.

Essayez de faire les exercices mais ne vous inquiétez pas si vous n’avez pas le temps de tout faire pendant la séance appliquée et n’hésitez pas à demander aux enseignants assistants si vous avez besoin d’aide parce que vous êtes bloqués.

Indice: Pour ouvrir le notebook dans Google Colab, cliquer sur l’icône de fusée au sommet de cette page puis choisissez “Colab” dans le menu déroulant.

Module 4 - Investigating gene flow at loci of interest using haplotype networks

Contents

Module 4 - Investigating gene flow at loci of interest using haplotype networks#

Learning objectives#

Lecture#

English#

Français#

Setup#

Haplotype networks#

Technical interpretation#

Evolutionary interpretation#

Exercise 1#

English#

Français#

Example: Ace1#

Ace1 hierarchical clustering re-cap#

Ace1 haplotype networks#

Exercise 2#

English#

Français#

Example: Gste2#

Exercise 3#

English#

Français#

Well done!#

Exercises#

English#

Français#