banner

Workshop 4 - Training course in data analysis for genomic surveillance of African malaria vectors


Module 3 - Ancestry-informative markers (AIMs)#

Theme: Data

In this module we’re going to access and visualise genotype data from a selection of SNPs which provide information about the species ancestry of individual mosquitoes. These SNPs are known as ancestry-informative markers, abbreviated as AIMs.

Learning objectives#

At the end of this module you will be able to:

  • Explain what ancestry-informative markers are.

  • Access, visualise and interpret AIM genotypes.

  • Explain how AIM genotypes can be used to make provisional species assignments.

Lecture#

English#

Français#

Please note that the code in the cells below might differ from that shown in the video. This can happen because Python packages and their dependencies change due to updates, necessitating tweaks to the code.

Introduction#

Limitations of existing molecular assays for Anopheles species determination#

Several molecular methods exist for determining the species of mosquitoes within the Anopheles gambiae complex. However, each of these methods relies on a single DNA marker. This is limited, because:

  • There is gene flow between species within the gambiae complex, and sexual recombination. This means that the genotype at any single marker may be completely unrepresentative of the rest of the genome.

  • Results of genotyping a single marker may be intermediate, and this can be hard to interpret. For example, an intermediate genotype could indicate a true F1 hybrid, or a backcross (descendent of a hybrid), or a cryptic species, or other possibilities.

  • The existing markers cannot detect novel cryptic species, i.e., species we haven’t seen before.

What are ancestry-informative markers?#

When we have data from whole-genome sequencing of individual mosquitoes, we are able to use genotypes from multiple markers across the genome. This can help to give a more complete picture of the species ancestry of each mosquito, and help to resolve some of the more complicated cases where gene flow is occurring between species, or where cryptic species are present that we were previously unaware of.

To do this, we first identified ancestry-informative markers (AIMs), which are SNPs from across the genome which are fixed (or near-fixed) differences between species. I.e., AIMs are SNPs with two alleles, where all individuals of one species have one of the alleles, and all individuals of the other species have the other allele. We identified these AIMs this using data from previous studies.

We ascertained two sets of AIMs:

  • gambcolu_vs_arab – These are SNPs where An. arabiensis have a different allele from An. gambiae and An. coluzzii

  • gamb_vs_colu – These are SNPs where An. coluzzii have a different allele from An. gambiae.

It can help to think about these AIMs in the context of the species tree for An. arabiensis, An. gambiae and An. coluzzii.

AIMs are SNPs where alleles have become fixed differences since species split.

We can access these data via the malariagen_data package, let’s take a look.

Setup#

Begin by installing and importing the packages we’ll need.

%pip install -q --no-warn-conflicts malariagen_data
import malariagen_data
import numpy as np
import allel

Configure access to MalariaGEN data in Google Cloud.

ag3 = malariagen_data.Ag3()
ag3
MalariaGEN Ag3 API client
Please note that data are subject to terms of use, for more information see the MalariaGEN website or contact data@malariagen.net. See also the Ag3 API docs.
Storage URL gs://vo_agam_release/
Data releases available 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 3.10
Results cache None
Cohorts analysis 20240319
AIM analysis 20220528
Site filters analysis dt_20200416
Software version malariagen_data 7.15.0
Client location unknown

Accessing AIM data#

gambcolu_vs_arab#

Let’s access the genotypes for the gambcolu_vs_arab AIMs for sample set AG1000G-UG (Uganda).

ds_aim = ag3.aim_calls(aims="gambcolu_vs_arab", sample_sets="AG1000G-UG")
ds_aim
<xarray.Dataset> Size: 2MB
Dimensions:           (variants: 2612, samples: 290, ploidy: 2, alleles: 2)
Coordinates:
    sample_id         (samples) <U24 28kB dask.array<chunksize=(290,), meta=np.ndarray>
    variant_contig    (variants) uint8 3kB dask.array<chunksize=(2612,), meta=np.ndarray>
    variant_position  (variants) int32 10kB dask.array<chunksize=(2612,), meta=np.ndarray>
Dimensions without coordinates: variants, samples, ploidy, alleles
Data variables:
    call_genotype     (variants, samples, ploidy) int8 2MB dask.array<chunksize=(1306, 145, 2), meta=np.ndarray>
    variant_allele    (variants, alleles) |S1 5kB dask.array<chunksize=(2612, 2), meta=np.ndarray>
Attributes:
    aims:      gambcolu_vs_arab
    analysis:  20220528
    contigs:   ['2R', '2L', '3R', '3L', 'X']

Note that there are 2,612 variants. This is the number of AIM SNPs. These include SNPs from all five chromosome arms.

aim_contig = ds_aim["variant_contig"].values
contigs = ds_aim.attrs["contigs"]
for i, contig in enumerate(contigs):
    print(f"{contig}: {np.count_nonzero(aim_contig == i)}")
2R: 734
2L: 485
3R: 542
3L: 363
X: 488

Here are the AIM alleles.

aim_alleles = ds_aim["variant_allele"].values
aim_alleles
array([[b'C', b'A'],
       [b'A', b'T'],
       [b'C', b'T'],
       ...,
       [b'A', b'T'],
       [b'T', b'A'],
       [b'T', b'C']], dtype='|S1')

This is a 2-D NumPy array of nucleotides (i.e., SNP alleles). Each row is a SNP. Note that here, the first column gives the allele usually found in An. gambiae and An. coluzzii, and the second column gives the allele usually found in An. arabiensis.

Let’s access the AIM genotypes.

aim_gt = ds_aim["call_genotype"].values
aim_gt.shape
(2612, 290, 2)

This is a 3-D NumPy array of genotype calls, with 2,612 variants and 290 diploid samples.

Let’s take a peek at the genotypes for the first sample:

aim_gt[:, 0]
array([[1, 1],
       [1, 1],
       [1, 1],
       ...,
       [1, 1],
       [1, 1],
       [1, 1]], dtype=int8)

Here a [1, 1] genotype call means homozygous for the An. arabiensis allele.

Now take a peek at the genotypes for the 100th sample:

aim_gt[:, 99]
array([[0, 0],
       [0, 0],
       [0, 0],
       ...,
       [0, 0],
       [0, 0],
       [0, 0]], dtype=int8)

Here a [0, 0] genotype call means homozygous for the allele found in An. gambiae and An. coluzzii.

gamb_vs_colu#

Let’s access the genotypes for the gamb_vs_colu AIMs for sample set AG1000G-CF (Central African Republic).

ds_aim = ag3.aim_calls(aims="gamb_vs_colu", sample_sets="AG1000G-CF")
ds_aim
<xarray.Dataset> Size: 117kB
Dimensions:           (variants: 700, samples: 73, ploidy: 2, alleles: 2)
Coordinates:
    sample_id         (samples) <U24 7kB dask.array<chunksize=(73,), meta=np.ndarray>
    variant_contig    (variants) uint8 700B dask.array<chunksize=(700,), meta=np.ndarray>
    variant_position  (variants) int64 6kB dask.array<chunksize=(700,), meta=np.ndarray>
Dimensions without coordinates: variants, samples, ploidy, alleles
Data variables:
    call_genotype     (variants, samples, ploidy) int8 102kB dask.array<chunksize=(700, 73, 2), meta=np.ndarray>
    variant_allele    (variants, alleles) |S1 1kB dask.array<chunksize=(700, 2), meta=np.ndarray>
Attributes:
    aims:      gamb_vs_colu
    analysis:  20220528
    contigs:   ['2R', '2L', '3R', '3L', 'X']

Note that there are less AIMs here, 700 in total. That’s because An. gambiae and An. coluzzii are less genetically diverged, so there are fewer SNPs which are sufficiently differentiated to use as AIMs.

How many AIMs are there on each of the chromosome arms?

aim_contig = ds_aim["variant_contig"].values
contigs = ds_aim.attrs["contigs"]
for i, contig in enumerate(contigs):
    print(f"{contig}: {np.count_nonzero(aim_contig == i)}")
2R: 54
2L: 92
3R: 48
3L: 24
X: 482

Note here that most AIMs are on the X chromosome.

Let’s peek at the alleles.

ds_aim["variant_allele"].values
array([[b'C', b'A'],
       [b'G', b'A'],
       [b'T', b'A'],
       ...,
       [b'G', b'A'],
       [b'T', b'G'],
       [b'T', b'A']], dtype='|S1')

Here the first column gives the gambiae alleles, and the second column gives the coluzzii alleles.

Finally, let’s peek at the genotype calls.

aim_gt = ds_aim["call_genotype"].values

Here’s the first sample:

aim_gt[:, 0]
array([[0, 0],
       [1, 0],
       [0, 0],
       ...,
       [0, 0],
       [0, 0],
       [0, 0]], dtype=int8)

Here a [0, 0] genotype call means homozygous for the allele found in An. gambiae.

Here’s the 21st sample:

aim_gt[:, 20]
array([[1, 1],
       [1, 1],
       [1, 1],
       ...,
       [1, 1],
       [1, 1],
       [1, 1]], dtype=int8)

Here a [1, 1] genotype call means homozygous for the allele found in An. coluzzii.

Visualising and interpreting AIMs#

Let’s now make a plot to visualise the AIM genotypes for some sample of interest, which will allow us to see the data for all AIMs and all samples at once. We will use the function plot_aim_heatmap().

There are different patterns seen in different sample sets. Let’s view a selection of these to learn more about what patterns are present and what they might mean.

Central African Republic#

ag3.plot_aim_heatmap(aims="gamb_vs_colu", sample_sets="AG1000G-CF")

Here there is a clear distinction between An. gambiae and An. coluzzii.

Malawi#

ag3.plot_aim_heatmap(aims="gambcolu_vs_arab", sample_sets="AG1000G-MW")

Here all samples are An. arabiensis.

For illustration, what would happen if we tried to use the gamb_vs_colu AIMs with some An. arabiensis samples?

ag3.plot_aim_heatmap(aims="gamb_vs_colu", sample_sets="AG1000G-MW")

Note the stripy pattern - the gamb_vs_colu AIMs are not appropriate to use with An. arabiensis.

Uganda#

ag3.plot_aim_heatmap(aims="gambcolu_vs_arab", sample_sets="AG1000G-UG")

Here there is a clear distinction between gambiae (gambcolu) and arabiensis, with one F1 hybrid individual.

Burkina Faso#

ag3.plot_aim_heatmap(aims="gamb_vs_colu", sample_sets="AG1000G-BF-A")

Here there is a clear distinction between An. gambiae and An. coluzzii, but with introgression from gambiae into coluzzii on chromosome arm 2L.

Guinea-Bissau#

Firstly, confirm there are no An. arabiensis.

ag3.plot_aim_heatmap(aims="gambcolu_vs_arab", sample_sets="AG1000G-GW")

Now examine gamb_vs_colu AIMs.

ag3.plot_aim_heatmap(aims="gamb_vs_colu", sample_sets="AG1000G-GW")

Here there are samples with lots of mixed and heterozygous AIMs. There could be several possible explanations for this. We will investigate further in the next module.

Guinea#

Exercise 1 (English)

Uncomment the code cell below and run it to plot the gambiae versus coluzzii AIMs for the AG1000G-GN-B sample set from Guinea. Then try to answer the questions below.

  • Try to find sample AV0241-C (hint: it’s the first row). What do you notice about its AIM genotypes? What species would you assign this, based on the AIM genotypes?

  • Try to find sample AV0235-CW (hint: it’s the last row). What do you notice about its AIM genotypes? What species would you assign this, based on the AIM genotypes?

  • Try to find sample AV0331-C (hint: it has mostly yellow gamb/colu genotypes). Can you explain why this sample might have this pattern of AIM genotypes?

  • Try to find samples AV0161-C and AV0245-C (hint: they occur next to each other, and are mostly red, but have either mostly yellow or blue on chromosome arm 2L). Can you explain why these samples might have these patterns of AIM genotypes?

Exercice 1 (Français)

Dans la cellule ci-dessous, supprimer le “#” et lancer la cellule pour créer un graphique de AIMs gambiae versus coluzzii for le jeu de donnée du Guinée AG1000G-GN-B. Essayez ensuite de répondre aux questions ci-dessous.

  • Essayez de trouver l’échantillon AV0241-C (indice: c’est la première ligne). Que remarquez-vous sur ses génotypes AIM? À quelle espèce attribueriez-vous cela, sur la base des génotypes AIM?

  • Essayez de trouver l’échantillon AV0235-CW (indice: c’est la dernière ligne). Que remarquez-vous sur ses génotypes AIM? À quelle espèce attribueriez-vous cela, sur la base des génotypes AIM?

  • Essayez de trouver l’échantillon AV0331-C (indice: il a principalement des génotypes gamb/colu jaunes). Pouvez-vous expliquer pourquoi cet échantillon pourrait avoir ce modèle de génotypes AIM?

  • Essayez de trouver les échantillons AV0161-C et AV0245-C (indice: ils se trouvent côte à côte et sont principalement rouges, mais ont principalement du jaune ou du bleu sur le bras du chromosome 2L). Pouvez-vous expliquer pourquoi ces échantillons pourraient avoir ces modèles de génotypes AIM?

# ag3.plot_aim_heatmap(aims="gamb_vs_colu", sample_sets="AG1000G-GN-B")

Summarising AIMs and making a provisional species assignment#

To help with our investigation of cryptic species, we summarise the AIM genotypes for each individual by counting the number of alleles called for each species, then computing a fraction. These fractions are then added to the sample metadata.

Let’s take a look at the AIM columns in the sample metadata for the AG1000G-CF sample set.

df_samples = ag3.sample_metadata(sample_sets="AG1000G-CF")
df_samples[["sample_id", "aim_species_fraction_arab", "aim_species_fraction_colu", "aim_species_fraction_colu_no2l", "aim_species"]]
sample_id aim_species_fraction_arab aim_species_fraction_colu aim_species_fraction_colu_no2l aim_species
0 BK0001-C 0.002299 0.014306 0.016474 gambiae
1 BK0002-C 0.001342 0.962089 0.957990 coluzzii
2 BK0003-C 0.001342 0.969914 0.966172 coluzzii
3 BK0005-C 0.002874 0.016440 0.018107 gambiae
4 BK0006-C 0.000958 0.966284 0.963696 coluzzii
... ... ... ... ... ...
68 BK0082-C 0.001727 0.010745 0.009061 gambiae
69 BK0083-C 0.003063 0.011445 0.012356 gambiae
70 BK0085-C 0.002106 0.012894 0.013201 gambiae
71 BK0086-C 0.001532 0.009312 0.010726 gambiae
72 BK0094-CW 0.002492 0.028653 0.032178 gambiae

73 rows × 5 columns

Here aim_species_fraction_arab is the fraction of arabiensis alleles in genotypes at the gambcolu_vs_arab AIMs.

Here aim_species_fraction_colu is the fraction of coluzzii alleles in genotypes at the gamb_vs_colu AIMs.

We also add a aim_species_fraction_colu_no2l column, which is the fraction of coluzzii alleles in genotypes at the gamb_vs_colu AIMs, excluding chromosome arm 2L. Because we know that many countries have been affected by the adaptive introgression event on chromosome arm 2L, it is sometimes useful to exclude it to get a clearer picture of species ancestry.

From these fractions, we make a provisional species assignment, and store the result in the aim_species column.

df_samples.groupby("aim_species").size()
aim_species
coluzzii    18
gambiae     55
dtype: int64

To get more intuition for this, let’s make some scatter plots of the AIM fractions.

def plot_aim_fractions(
    sample_sets=None,
    sample_query=None,
    x="aim_species_fraction_arab",
    y="aim_species_fraction_colu_no2l",
):

    import plotly.express as px

    # load sample metadata
    df = ag3.sample_metadata(sample_sets=sample_sets, sample_query=sample_query)

    # set up colours
    palette = px.colors.qualitative.T10
    color_map = {
        "arabiensis": palette[4],
        "gambiae": palette[0],
        "coluzzii": palette[2],
        "intermediate_gambcolu_arabiensis": palette[5],
        "intermediate_gambiae_coluzzii": palette[6],
    }

    # make a scatter plot
    fig = px.scatter(
        data_frame=df,
        x=x,
        y=y,
        width=500,
        height=500,
        range_x=[-.05, 1.05],
        range_y=[-.05, 1.05],
        hover_data={
            "aim_species_fraction_arab": ":.2f",
            "aim_species_fraction_colu": ":.2f",
            "aim_species_fraction_colu_no2l": ":.2f",
            "sample_id": True,
            "aim_species": True,
        },
        title=sample_sets,
        color="aim_species",
        color_discrete_map=color_map,
        category_orders={"aim_species": list(color_map)},
    )
    
    # add thresholds used for AIM species calling
    arab_cutoff = 0.85
    gambcolu_cutoff = 0.1
    colu_cutoff = 0.9
    gamb_cutoff = 0.1
    fig.add_vline(x=arab_cutoff, line_dash="dash")
    fig.add_vline(x=gambcolu_cutoff, line_dash="dash")
    fig.add_shape(
        type="line",
        x0=-.05, x1=gambcolu_cutoff, y0=gamb_cutoff, y1=gamb_cutoff,
        line_dash="dash",
    )
    fig.add_shape(
        type="line",
        x0=-.05, x1=gambcolu_cutoff, y0=colu_cutoff, y1=colu_cutoff,
        line_dash="dash",
    )
    
    # visual styling
    fig.update_layout(showlegend=False) 
    fig.update_traces(
        marker=dict(
            size=7,
            line=dict(
                width=1,
                color='black'
            )
        ),
        selector=dict(mode='markers')
    )

    return fig
plot_aim_fractions(sample_sets="AG1000G-CF")

We can see two clearly separated groups of markers.

  • In the bottom left are samples with fraction arabiensis AIMs near 0, and fraction coluzzii AIMs near 0. We can provisionally assign these as An. gambiae.

  • In the top left are samples with fraction arabiensis AIMs near 0, and fraction coluzzii AIMs near 1. We can provisionally assign these as An. coluzzii.

If you hover over the markers, you’ll see the provisional species assignment in the “aim_species” field.

Let’s look at another example - Uganda.

plot_aim_fractions(sample_sets="AG1000G-UG")

Here we have three clusters:

  • On the right are samples with fraction arabiensis AIMs near 1. We provisionally assign these An. arabiensis.

  • In the bottom left are samples with fraction arabiensis AIMs near 0, and fraction coluzzii AIMs near 0. We can provisionally assign these as An. gambiae.

  • In the middle is one sample with fraction arabiensis AIMs near 0.5. This is probably an F1 hybrid between gambiae and arabiensis parents. We provisionally assign this “intermediate_gambcolu_arabiensis”.

One last example Guinea-Bissau.

plot_aim_fractions(sample_sets="AG1000G-GW")

Here there are a few samples in the bottom left corner which we provisionally assign as An. gambiae, but most samples occur somewhere in the middle with fraction coluzzii AIMs between 0 and 1. These need further investigation.

Exercise 2 (English)

Uncomment the code cells below and run them to create AIM heatmaps and a scatter plot of the AIM fractions for the Kenyan sample set AG1000G-KE. Then use these plots to try to answer the questions below.

  • Find the samples which get assigned as An. arabiensis based on the AIM fractions. Give some sample identifiers.

  • Find the samples which get assigned as An. gambiae based on the AIM fractions. Give some sample identifiers.

  • Find the samples which get assigned as “intermediate_gambiae_coluzzii” based on the AIM fractions. Give some sample identifiers.

Now count how many samples there are for different possible values of the “aim_species” metadata field. Hint: load the sample metadata for sample set AG1000G-KE, access the “aim_species” column, then perform a value count.

Exercice 2 (Français)

Dans les cellules ci-dessous, supprimer le “#” et lancer les cellules pour créer des graphiques des AIMs dans les échantillons to Kenya (AG1000G-KE). Utilisez ensuite ces diagrammes pour essayer de répondre aux questions ci-dessous.

  • Trouvez les échantillons qui sont attribués comme An. arabiensis sur la base des fractions AIM. Donnez quelques exemples d’identificateurs.

  • Trouvez les échantillons qui sont attribués comme An. gambiae sur la base des fractions AIM. Donnez quelques exemples d’identificateurs.

  • Trouvez les échantillons qui sont attribués comme “intermediate_gambiae_coluzzii” sur la base des fractions AIM. Donnez quelques exemples d’identificateurs.

Comptez maintenant le nombre d’échantillons pour différentes valeurs possibles de la colonne de métadonnées “aim_species”. Indice: chargez les métadonnées du jeu d’échantillons AG1000G-KE, accédez à la colonne “aim_species”, puis comptez les valeurs.

# ag3.plot_aim_heatmap(aims="gambcolu_vs_arab", sample_sets="AG1000G-KE")
# ag3.plot_aim_heatmap(
#     aims="gamb_vs_colu", 
#     sample_sets="AG1000G-KE", 
#     sample_query="aim_species != 'arabiensis'"
# )
# plot_aim_fractions(sample_sets="AG1000G-KE")

Well done!#

Thanks for working through this module on ancestry informative markers.

If you haven’t already done so, please now launch this notebook in Google Colab, and start running the cells from the top. When you encounter a practical exercise, please try to complete it.