banner

Workshop 4 - Training course in data analysis for genomic surveillance of African malaria vectors


Module 4 - Discovering cryptic taxa#

Theme: Analysis

In this module we’re going to detect cryptic taxa in the Ag3.0 data resource using ancestry informative markers (AIMs) and principal component analysis (PCA). We will use functions in the malariagen_data python package to run analyses then learn how interpret the results to discover cryptic taxonomic structure.

Learning objectives#

At the end of this module you will:

  • Understand MalariaGEN’s mosquito taxon analysis pipeline and the reasoning behind it.

  • Be able to run and plot AIM and PCA analyses, and interpret the results.

  • Be able to detect cryptic taxa.

Lecture#

English#

Français#

Please note that the code in the cells below might differ from that shown in the video. This can happen because Python packages and their dependencies change due to updates, necessitating tweaks to the code.

Why do we need to discover cryptic taxa?#

Generally, we need to discover cryptic taxa in our data sets because they will generate population structure. In workshop 3 we explored how genetically distinct populations need to be separated, else population analysis results will be confounded.

Specifically, we need to identify cryptic taxa for genomic surveillance and vector control. Cryptic taxa may differ from known taxa in medically important phenotypes, e.g., biting times, vector competence or insecticide resistance. Vector control methods that work for some vector taxa may fail to control others.

Setup#

Before we begin the analysis, let’s set up the Python packages we’ll need to use.

First install and import the malariagen_data package.

%pip install -q --no-warn-conflicts malariagen_data

Note that authentication is required to access data through the package, please follow the instructions here.

import malariagen_data
import os
import plotly.io as pio
pio.renderers.default = "notebook+colab"

Some analyses may take a while to complete, particularly if you’re running this code on a service with modest computational resources such as Google Colab. To avoid having to rerun these analyses, we’ll save the results so we can come back to them later. In Google Colab, you can save results to your Google Drive, which will mean you don’t lose results even if you leave the notebook and come back several days later.

Mount your Google Drive - you will need to follow the authorization instructions.

try:
    # if running on colab, mount Google drive
    from google.colab import drive
    drive.mount('drive')
except ImportError:
    pass

With our Google Drive now mounted, we can define and make a directory where we want to save our results.

results_dir = 'drive/MyDrive/Colab Data/module_4_results'
os.makedirs(results_dir, exist_ok=True)

In Google Colab, we can actually see our mounted drive and results directory by clicking on the file tab on the left hand side of the screen.

Next we should setup the malariagen_data package. As we want to save our reults in the Google Drive folder we just set up, we’ll use the results_cache parameter and assign our results directory to it. If we were running this notebook locally, then we could assign a local folder to this parameter and the results would instead get stored on our hard drive.

ag3 = malariagen_data.Ag3(results_cache=results_dir)
ag3
MalariaGEN Ag3 API client
Please note that data are subject to terms of use, for more information see the MalariaGEN website or contact support@malariagen.net. See also the Ag3 API docs.
Storage URL gs://vo_agam_release_master_us_central1
Data releases available 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 3.10, 3.11, 3.12, 3.13, 3.14
Results cache /home/jonbrenas/anopheles-genomic-surveillance.github.io/docs/workshop-4/drive/MyDrive/Colab Data/module_4_results
Cohorts analysis 20250131
AIM analysis 20220528
Site filters analysis dt_20200416
Software version malariagen_data 15.2.0
Client location Iowa, United States (Google Cloud us-central1)

Remember to check Client location in the output above - our cloud data is stored in the US, so want our Google Colab virtual machine (VM) to be based in the US too. If your client location is somewhere else in the world, select Runtime then Disconnect and delete runtime from the toolbar at the top of notebook, then rerun the notebook from the top. This will ensure our analyses run as fast as possible.

Step 1 - Ancestry informative marker (AIM) analysis#

Before we investigate what taxa are present in our dataset in detail, we first make provisional species calls. We could do this using the single marker molecular typing results that contributors often supply when sending samples, generated by assays such as Scott et al. (1993) and Santolamazza et al. (2008). However, as we have discussed in the previous module, these single markers are a blunt tool - the marker may indicate one species, while the rest of the genome indicates a different species entirely.

Rather than a single marker, we use multiple ancestry-informative SNP markers (AIMs) taken from across the genome. These are derived by taking a population of each species and looking for the biallelic SNPs that have different alleles fixed (or almost fixed) in the different species. Depending on how diverged the mosquito species are, this method gives us hundreds or thousands of ancestry informative marker SNPs across the genome.

With these AIMs, we can use the percentage of AIM alleles in an individual to assign provisional species. E.g., currently we assign any sample with 85% or more An. arabiensis AIM alleles as An. arabiensis. These AIM fractions can be found in the sample metadata in the “aim_species_fraction_arab” and “aim_species_fraction_colu” columns. The provisional species calls made from these AIM data can be found in the sample metadata “aim_species” column.

Let’s remind ourselves of what the sample metadata looks like.

sample_meta_df = ag3.sample_metadata()
sample_meta_df.head()
                                     
sample_id partner_sample_id contributor country location year month latitude longitude sex_call ... admin1_name admin1_iso admin2_name taxon cohort_admin1_year cohort_admin1_month cohort_admin1_quarter cohort_admin2_year cohort_admin2_month cohort_admin2_quarter
0 VBS00256-4651STDY7017184 GP97 Tovi Lehmann Mali Dallowere 2012 6 13.616 -7.037 F ... Koulikouro ML-2 Banamba coluzzii ML-2_colu_2012 ML-2_colu_2012_06 ML-2_colu_2012_Q2 ML-2_Banamba_colu_2012 ML-2_Banamba_colu_2012_06 ML-2_Banamba_colu_2012_Q2
1 VBS00257-4651STDY7017185 GP98 Tovi Lehmann Mali Dallowere 2012 6 13.616 -7.037 F ... Koulikouro ML-2 Banamba coluzzii ML-2_colu_2012 ML-2_colu_2012_06 ML-2_colu_2012_Q2 ML-2_Banamba_colu_2012 ML-2_Banamba_colu_2012_06 ML-2_Banamba_colu_2012_Q2
2 VBS00259-4651STDY7017186 GP100 Tovi Lehmann Mali Dallowere 2012 6 13.616 -7.037 F ... Koulikouro ML-2 Banamba coluzzii ML-2_colu_2012 ML-2_colu_2012_06 ML-2_colu_2012_Q2 ML-2_Banamba_colu_2012 ML-2_Banamba_colu_2012_06 ML-2_Banamba_colu_2012_Q2
3 VBS00262-4651STDY7017187 GP103 Tovi Lehmann Mali Dallowere 2012 6 13.616 -7.037 F ... Koulikouro ML-2 Banamba coluzzii ML-2_colu_2012 ML-2_colu_2012_06 ML-2_colu_2012_Q2 ML-2_Banamba_colu_2012 ML-2_Banamba_colu_2012_06 ML-2_Banamba_colu_2012_Q2
4 VBS00277-4651STDY7017189 GP118 Tovi Lehmann Mali Dallowere 2012 6 13.616 -7.037 F ... Koulikouro ML-2 Banamba coluzzii ML-2_colu_2012 ML-2_colu_2012_06 ML-2_colu_2012_Q2 ML-2_Banamba_colu_2012 ML-2_Banamba_colu_2012_06 ML-2_Banamba_colu_2012_Q2

5 rows × 57 columns

There are several columns in the metadata which provide data on AIMs. Let’s take a look.

aim_columns = [c for c in sample_meta_df if c.startswith("aim_")]
aim_columns
['aim_species_fraction_arab',
 'aim_species_fraction_colu',
 'aim_species_fraction_colu_no2l',
 'aim_species_gambcolu_arabiensis',
 'aim_species_gambiae_coluzzii',
 'aim_species']
sample_meta_df[["sample_id"] + aim_columns].head()
sample_id aim_species_fraction_arab aim_species_fraction_colu aim_species_fraction_colu_no2l aim_species_gambcolu_arabiensis aim_species_gambiae_coluzzii aim_species
0 VBS00256-4651STDY7017184 0.002 0.858 0.973 gambcolu coluzzii coluzzii
1 VBS00257-4651STDY7017185 0.002 0.977 0.982 gambcolu coluzzii coluzzii
2 VBS00259-4651STDY7017186 0.001 0.917 0.974 gambcolu coluzzii coluzzii
3 VBS00262-4651STDY7017187 0.002 0.860 0.974 gambcolu coluzzii coluzzii
4 VBS00277-4651STDY7017189 0.002 0.924 0.984 gambcolu coluzzii coluzzii

Note the “aim_species” column - this has the value of our provisional species call made from looking at AIM genotype calls. Let’s see all the values this can take:

sample_meta_df.groupby("aim_species").size()
aim_species
arabiensis                          5330
coluzzii                            7921
gambiae                             8238
intermediate_gambcolu_arabiensis     465
intermediate_gambiae_coluzzii        651
dtype: int64

In the previous module we looked at how we can visualise the AIMs underlying this provisional species call with an AIM plot, using the handy ag3.plot_aim_heatmap() function in the malariagen_data python package. Let’s remind ourselves of how this function works, and what the AIMs looks like.

ag3.plot_aim_heatmap?
Signature:
ag3.plot_aim_heatmap(
    aims: str,
    sample_sets: Union[str, Sequence[str], NoneType] = None,
    sample_query: Optional[str] = None,
    sample_query_options: Optional[dict] = None,
    sort: bool = True,
    row_height: int = 4,
    xgap: float = 0,
    ygap: float = 0.5,
    palette: Optional[Tuple[str, str, str, str]] = None,
    show: bool = True,
    renderer: Optional[str] = None,
) -> Optional[plotly.graph_objs._figure.Figure]
Docstring:
Plot a heatmap of ancestry-informative marker (AIM) genotypes.

Parameters
----------
aims : str
    Identifier for a set of ancestry informative markers to use. For
    possible values see the `aim_ids` property.
sample_sets : sequence of str or str or None, optional
    List of sample sets and/or releases. Can also be a single sample set
    or release.
sample_query : str or None, optional
    A pandas query string to be evaluated against the sample metadata, to
    select samples to be included in the returned data.
sample_query_options : dict or None, optional
    A dictionary of arguments that will be passed through to pandas
    query() or eval(), e.g. parser, engine, local_dict, global_dict,
    resolvers.
sort : bool, optional, default: True
    If true (default), sort the samples by the total fraction of AIM
    alleles for the second species in the comparison.
row_height : int, optional, default: 4
    Height per sample in px.
xgap : float, optional, default: 0
    Creates lines between columns (variants).
ygap : float, optional, default: 0.5
    Creates lines between rows (samples).
palette : tuple[str, str, str, str] or None, optional
    4-tuple of colors for AIM genotypes, in the order: missing, hom taxon
    1, het, hom taxon 2.
show : bool, optional, default: True
    If true, show the plot. If False, do not show the plot, but return the
    figure.
renderer : str or None, optional
    The name of the renderer to use.

Returns
-------
Figure or None
    A plotly figure (only returned if show=False).
File:      /home/conda/developer/35674e27b19f7c625ba32a1b88449ff45c90b40edb90a065b66c5a9a5388f41c-20250421-195247-360829-93-training-nb-maintenance-mgen-15.2.0/lib/python3.11/site-packages/malariagen_data/anoph/aim_data.py
Type:      method

To visualise the AIM calls as a heatmap we just need to specify which set of markers we want to use (aims), either “gambcolu_vs_arab” or “gamb_vs_colu”. There’s also a sample_set parameter and a sample_query parameter to select which samples to look at.

Let’s look again at the “gamb_vs_colu” AIMs from the “AG1000G-BF-A” sample set of mosquitoes from Burkina Faso.

ag3.plot_aim_heatmap(
    aims="gamb_vs_colu", 
    sample_sets="AG1000G-BF-A"
)

In the plot above, the subplots represent different contigs (chromosome arms) and the rows are individual samples. These plots are interactive and hovering the mouse pointer brings up data about the AIM variant index (column), mosquito sample ID (row), and AIM genotype (here 0 means homozygous gamb/gamb, 1 means heterozygous gamb/colu and 2 means homozygous colu/colu).

In the “gamb_vs_colu” plot, blue represents homozygous An. gambiae genotypes, red represents homozygous An. coluzzii genotypes, and in yellow we see genotypes that are heterozygous for the gambiae and coluzzii alleles.

Despite the fact that many samples are affected by the known introgression event on chromosome arm 2L, in most cases we still make fairly clear provisional species assignments using the AIMs. In fact, because this introgression is so common, we actually ignore chromosome arm 2L when making provisional species calls between gambiae and coluzzii. Here are the provisional assignments for this sample set:

sample_meta_df.query("sample_set == 'AG1000G-BF-A'").groupby("aim_species").size()
aim_species
coluzzii                         82
gambiae                          98
intermediate_gambiae_coluzzii     1
dtype: int64

Let’s look at another AIM plot, this time using the “gambcolu_vs_arab” AIMs to look at mosquitoes from Uganda.

ag3.plot_aim_heatmap(
    aims="gambcolu_vs_arab", 
    sample_sets="AG1000G-UG"
)

In the “gambcolu_vs_arab” plot, green represents genotypes which are homozygous for the An. arabiensis allele; purple represents genotypes homozygous for the allele found in An. gambiae and An. coluzzii, and in yellow we again see genotypes that are heterozygous.

Though the AIMs are not 100% informative due to how they have been obtained (see module 3), we can see lots of samples where most of the AIMs are homozygous for one species. This what we expected, as these samples should belong to one of these known taxa.

Sometimes, however, we see samples where the species is not clear. We can see a sample like this (ACO198-C) in the Uganda plot above. This sample has heterozygous genotypes at almost all AIMs, and so gets assigned as “intermediate_gambcolu_arabiensis” in the samples metadata “aim_species” column.

sample_meta_df.query("sample_id == 'AC0198-C'")[["sample_id"] + aim_columns]
sample_id aim_species_fraction_arab aim_species_fraction_colu aim_species_fraction_colu_no2l aim_species_gambcolu_arabiensis aim_species_gambiae_coluzzii aim_species
21621 AC0198-C 0.494832 0.211288 0.2075 intermediate NaN intermediate_gambcolu_arabiensis

Here are counts of the different AIM species assignments in Uganda:

sample_meta_df.query("sample_set == 'AG1000G-UG'").groupby("aim_species").size()
aim_species
arabiensis                           82
gambiae                             207
intermediate_gambcolu_arabiensis      1
dtype: int64

There are other sample sets where there are multiple samples that do not have a clear species assignment, e.g., Guinea-Bissau.

ag3.plot_aim_heatmap(
    aims="gamb_vs_colu", 
    sample_sets="AG1000G-GW"
)

In sample sets like this, there may be many “intermediate” AIM species assignments:

sample_meta_df.query("sample_set == 'AG1000G-GW'").groupby("aim_species").size()
aim_species
gambiae                          28
intermediate_gambiae_coluzzii    73
dtype: int64

Sample sets with many samples assigned a “intermediate” provisional species are flagged for further investigation with principal component analysis (PCA).

Step 2 - Principal component analysis#

Recap: what is principal component analysis?#

In the previous workshop we learnt how PCA can be used to identify genetic structure in a group of samples and we learnt why being able to detect structure is useful for genomic surveillance and vector control.

Fundamentally, PCA is a method for reducing the dimensions of a dataset to help make interpreting the data easier. The PCA finds axes through the data that describe its variance, in the case of genomic data, this effectively collapses thousands or millions of dimensions (SNPs) down to a handful of principal components which allow tractable investigation of structure in the data.

When we are trying to identify the causes of the apparent “intermediate” taxon samples from AIMs analysis, the way PCA reveals the structure of these samples can be be very helpful.

Signals of hybridisation - Uganda#

In the Ugandan “gambcolu_vs_arab” aim plot we saw an individual that appeared to be heterozygous for AIMs across it’s genome. Let’s run a PCA on this sample set and see how this individual appears.

Let’s briefly remind ourselves of the pca function documentation.

ag3.pca?
Signature:
ag3.pca(
    region: Union[str, malariagen_data.util.Region, Mapping, List[Union[str, malariagen_data.util.Region, Mapping]], Tuple[Union[str, malariagen_data.util.Region, Mapping], ...]],
    n_snps: int,
    n_components: int = 20,
    thin_offset: int = 0,
    sample_sets: Union[str, Sequence[str], NoneType] = None,
    sample_query: Optional[str] = None,
    sample_query_options: Optional[dict] = None,
    sample_indices: Optional[List[int]] = None,
    site_mask: Optional[str] = 'default',
    site_class: Optional[str] = None,
    min_minor_ac: Union[int, float, NoneType] = 2,
    max_missing_an: Union[int, float, NoneType] = 0,
    cohort_size: Optional[int] = None,
    min_cohort_size: Optional[int] = None,
    max_cohort_size: Optional[int] = None,
    exclude_samples: Union[str, int, List[Union[str, int]], Tuple[Union[str, int], ...], NoneType] = None,
    fit_exclude_samples: Union[str, int, List[Union[str, int]], Tuple[Union[str, int], ...], NoneType] = None,
    random_seed: int = 42,
    inline_array: bool = True,
    chunks: Union[int, str, Tuple[Union[int, str], ...], Callable[[Tuple[int, ...]], Union[int, str, Tuple[Union[int, str], ...]]]] = 'native',
) -> Tuple[pandas.core.frame.DataFrame, numpy.ndarray]
Docstring:
Run a principal components analysis (PCA) using biallelic SNPs from the
selected genome region and samples.

.. versionchanged:: 8.0.0    SNP ascertainment has changed slightly.
This function uses biallelic SNPs as input to the PCA. The ascertainment
of SNPs to include has changed slightly in version 8.0.0 and therefore
the results of this function may also be slightly different. Previously,
SNPs were required to be biallelic and one of the observed alleles was
required to be the reference allele. Now SNPs just have to be biallelic.
The following additional parameters were also added in version 8.0.0:
`site_class`, `cohort_size`, `min_cohort_size`, `max_cohort_size`,
`random_seed`.

Parameters
----------
region : str or Region or Mapping or list of str or Region or Mapping or tuple of str or Region or Mapping
    Region of the reference genome. Can be a contig name, region string
    (formatted like "{contig}:{start}-{end}"), or identifier of a genome
    feature such as a gene or transcript. Can also be a sequence (e.g.,
    list) of regions.
n_snps : int
    The desired number of SNPs to use when running the analysis. SNPs will
    be evenly thinned to approximately this number.
n_components : int, optional, default: 20
    Number of components to return.
thin_offset : int, optional, default: 0
    Starting index for SNP thinning. Change this to repeat the analysis
    using a different set of SNPs.
sample_sets : sequence of str or str or None, optional
    List of sample sets and/or releases. Can also be a single sample set
    or release.
sample_query : str or None, optional
    A pandas query string to be evaluated against the sample metadata, to
    select samples to be included in the returned data.
sample_query_options : dict or None, optional
    A dictionary of arguments that will be passed through to pandas
    query() or eval(), e.g. parser, engine, local_dict, global_dict,
    resolvers.
sample_indices : list of int or None, optional
    Advanced usage parameter. A list of indices of samples to select,
    corresponding to the order in which the samples are found within the
    sample metadata. Either provide this parameter or sample_query, not
    both.
site_mask : str or None, optional, default: 'default'
    Which site filters mask to apply. See the `site_mask_ids` property for
    available values.
site_class : str or None, optional
    Select sites belonging to one of the following classes: CDS_DEG_4,
    (4-fold degenerate coding sites), CDS_DEG_2_SIMPLE (2-fold simple
    degenerate coding sites), CDS_DEG_0 (non-degenerate coding sites),
    INTRON_SHORT (introns shorter than 100 bp), INTRON_LONG (introns
    longer than 200 bp), INTRON_SPLICE_5PRIME (intron within 2 bp of 5'
    splice site), INTRON_SPLICE_3PRIME (intron within 2 bp of 3' splice
    site), UTR_5PRIME (5' untranslated region), UTR_3PRIME (3'
    untranslated region), INTERGENIC (intergenic, more than 10 kbp from a
    gene).
min_minor_ac : int or float or None, optional, default: 2
    The minimum minor allele count. SNPs with a minor allele count below
    this value will be excluded. Can also be a float, which will be
    interpreted as a fraction.
max_missing_an : int or float or None, optional, default: 0
    The maximum number of missing allele calls to accept. SNPs with more
    than this value will be excluded. Set to 0 to require no missing
    calls. Can also be a float, which will be interpreted as a fraction.
cohort_size : int or None, optional
    Randomly down-sample to this value if the number of samples in the
    cohort is greater. Raise an error if the number of samples is less
    than this value.
min_cohort_size : int or None, optional
    Minimum cohort size. Raise an error if the number of samples is less
    than this value.
max_cohort_size : int or None, optional
    Randomly down-sample to this value if the number of samples in the
    cohort is greater.
exclude_samples : str or int or list of str or int or tuple of str or int or None, optional
    Sample identifier or index within sample set. Multiple values can also
    be provided as a list or tuple.
fit_exclude_samples : str or int or list of str or int or tuple of str or int or None, optional
    Sample identifier or index within sample set. Multiple values can also
    be provided as a list or tuple.
random_seed : int, optional, default: 42
    Random seed used for reproducible down-sampling.
inline_array : bool, optional, default: True
    Passed through to dask `from_array()`.
chunks : int or str or tuple of int or str or Callable[[typing.Tuple[int, ...]], int or str or tuple of int or str], optional, default: 'native'
    Define how input data being read from zarr should be divided into
    chunks for a dask computation. If 'native', use underlying zarr
    chunks. If a string specifying a target memory size, e.g., '300 MiB',
    resize chunks in arrays with more than one dimension to match this
    size. If 'auto', let dask decide chunk size.  If 'ndauto', let dask
    decide chunk size but only for arrays with more than one dimension. If
    'ndauto0', as 'ndauto' but only vary the first chunk dimension. If
    'ndauto1', as 'ndauto' but only vary the second chunk dimension. If
    'ndauto01', as 'ndauto' but only vary the first and second chunk
    dimensions. Also, can be a tuple of integers, or a callable which
    accepts the native chunks as a single argument and returns a valid
    dask chunks value.

Returns
-------
df_pca : DataFrame
    A dataframe of projections along principal components, one row per
    sample. The columns are:     `sample_id` is the identifier of the
    sample,     `partner_sample_id` is the identifier of the sample used
    by the partners who contributed it,     `contributor` is the partner
    who contributed the sample,     `country` is the country the sample
    was collected in,     `location` is the location the sample was
    collected in,     `year` is the year the sample was collected,
    `month` is the month the sample was collected,     `latitude` is the
    latitude of the location the sample was collected in,
    `longitude` is the longitude of the location the sample was
    collected in,     `sex_call` is the sex of the sample,
    `sample_set` is the sample set containing the sample,     `release`
    is the release containing the sample,     `quarter` is the quarter
    of the year the sample was collected,     `study_id* is the
    identifier of the study the sample set containing the sample came
    from,     `study_url` is the URL of the study the sample set
    containing the sample came from,     `terms_of_use_expiry_date` is
    the date the terms of use for the sample expire,
    `terms_of_use_url` is the URL of the terms of use for the sample,
    `unrestricted_use` indicates whether the sample can be used without
    restrictions (e.g., if the terms of use of expired),     `mean_cov`
    is mean value of the coverage,     `median_cov` is the median value
    of the coverage,     `modal_cov` is the mode of the coverage,
    `mean_cov_2L` is mean value of the coverage on 2L,
    `median_cov_2L` is the median value of the coverage on 2L,
    `mode_cov_2L` is the mode of the coverage on 2L,     `mean_cov_2R`
    is mean value of the coverage on 2R,     `median_cov_2R` is the
    median value of the coverage on 2R,     `mode_cov_2R` is the mode of
    the coverage on 2R,     `mean_cov_3L` is mean value of the coverage
    on 3L,     `median_cov_3L` is the median value of the coverage on
    3L,     `mode_cov_3L` is the mode of the coverage on 3L,
    `mean_cov_3R` is mean value of the coverage on 3R,
    `median_cov_3R` is the median value of the coverage on 3R,
    `mode_cov_3R` is the mode of the coverage on 3R,     `mean_cov_X` is
    mean value of the coverage on X,     `median_cov_X` is the median
    value of the coverage on X,     `mode_cov_X` is the mode of the
    coverage on X,     `frac_gen_cov` is the faction of the genome
    covered,     `divergence` is the divergence,     `contam_pct` is the
    percentage of contamination,     `contam_LLR` is the log-likelihood
    ratio of contamination,     `aim_species_fraction_arab` is the
    fraction of the gambcolu vs. arabiensis AIMs that indicated
    arabiensis (this column is only present for *Ag3*),
    `aim_species_fraction_colu` is the fraction of the gambiae vs.
    coluzzii AIMs that indicated coluzzii (this column is only present
    for *Ag3*),     `aim_species_fraction_colu_no2l` is the fraction of
    the gambiae vs. coluzzii AIMs that indicated coluzzii, not including
    the chromosome arm 2L which contains an introgression (this column
    is only present for *Ag3*),     `aim_species_gambcolu_arabiensis` is
    the taxonomic group assigned by the gambcolu vs. arabiensis AIMs
    (this column is only present for *Ag3*),
    `aim_species_gambiae_coluzzi` is the taxonomic group assigned by the
    gambiae vs. coluzzii AIMs (this column is only present for *Ag3*),
    `aim_species_gambcolu_arabiensis` is the taxonomic group assigned by
    the combination of both AIMs analyses (this column is only present
    for *Ag3*),     `country_iso` is the ISO code of the country the
    sample was collected in,     `admin1_name` is the name of the first
    administrative level the sample was collected in,     `admin1_iso`
    is the ISO code of the first administrative level the sample was
    collected in,     `admin2_name` is the name of the second
    administrative level the sample was collected in,     `taxon` is the
    taxon assigned to the sample by the combination of the AIMs analysis
    and the cohort analysis,     `cohort_admin1_year` is the cohort the
    sample belongs to when samples are grouped by first administrative
    level and year,     `cohort_admin1_month` is the cohort the sample
    belongs to when samples are grouped by first administrative level
    and month,     `cohort_admin1_quarter` is the cohort the sample
    belongs to when samples are grouped by first administrative level
    and quarter,     `cohort_admin2_year` is the cohort the sample
    belongs to when samples are grouped by second administrative level
    and year,     `cohort_admin2_month` is the cohort the sample belongs
    to when samples are grouped by second administrative level and
    month,     `cohort_admin2_quarter` is the cohort the sample belong
    to when samples are grouped by second administrative level and
    quarter.     `PC?` is the projection along principal component ? (?
    being an integer between 1 and the number of components). There are
    as many such columns as components,     `pca_fit` is whether this
    sample was used for fitting.
evr : ndarray
    An array of explained variance ratios, one per component.

Notes
-----
This computation may take some time to run, depending on your
computing environment. Results of this computation will be cached and
re-used if the `results_cache` parameter was set when instantiating
the API client.
File:      /home/conda/developer/35674e27b19f7c625ba32a1b88449ff45c90b40edb90a065b66c5a9a5388f41c-20250421-195247-360829-93-training-nb-maintenance-mgen-15.2.0/lib/python3.11/site-packages/malariagen_data/anoph/pca.py
Type:      method

Let’s define some parameters to use in all the PCA computations.

region = "3L:15,000,000-41,000,000"
n_snps = 100_000

Now run a PCA with Ugandan samples.

pca_df, evr = ag3.pca(
    region=region, 
    n_snps=n_snps, 
    sample_sets="AG1000G-UG"
)
                                     

We can look at the explained variance ratio array (evr) to get an feeling for how many principal components we are interested in for this particular PCA. This array contains the proportion of the total variance in the dataset explained by each principal component. The easiest way to do this is to plot the array using the malariagen_data function plot_pca_variance().

ag3.plot_pca_variance?
Signature:
ag3.plot_pca_variance(
    evr: numpy.ndarray,
    width: Optional[int] = 900,
    height: Optional[int] = 400,
    show: bool = True,
    renderer: Optional[str] = None,
    **kwargs,
) -> Optional[plotly.graph_objs._figure.Figure]
Docstring:
Plot explained variance ratios from a principal components analysis
(PCA) using a plotly bar plot.

Parameters
----------
evr : ndarray
    An array of explained variance ratios, one per component.
width : int or None, optional, default: 900
    Figure width in pixels (px).
height : int or None, optional, default: 400
    Figure height in pixels (px).
show : bool, optional, default: True
    If true, show the plot. If False, do not show the plot, but return the
    figure.
renderer : str or None, optional
    The name of the renderer to use.
**kwargs
    Passed through to px.bar().

Returns
-------
Figure or None
    A plotly figure (only returned if show=False).
File:      /home/conda/developer/35674e27b19f7c625ba32a1b88449ff45c90b40edb90a065b66c5a9a5388f41c-20250421-195247-360829-93-training-nb-maintenance-mgen-15.2.0/lib/python3.11/site-packages/malariagen_data/anoph/pca.py
Type:      method
ag3.plot_pca_variance(evr)

Where the variance plot flattens out is a good rule of thumb as to where principal components may be more noise than signal of structure. In this case it looks like PC1 explains much of the variance in the data relative to all other PCs.

Now we can plot the PCA data using the malariagen_data package.

ag3.plot_pca_coords?
Signature:
ag3.plot_pca_coords(
    data: pandas.core.frame.DataFrame,
    x: str = 'PC1',
    y: str = 'PC2',
    color: Union[str, Mapping, NoneType] = None,
    symbol: Union[str, Mapping, NoneType] = None,
    opacity: float = 0.9,
    jitter_frac: Optional[float] = 0.02,
    random_seed: int = 42,
    width: Optional[int] = 900,
    height: Optional[int] = 600,
    marker_size: Union[int, float] = 10,
    color_discrete_sequence: Optional[List] = None,
    color_discrete_map: Optional[Mapping] = None,
    category_orders: Union[List, Mapping, NoneType] = None,
    legend_sizing: Literal['constant', 'trace'] = 'constant',
    show: bool = True,
    renderer: Optional[str] = None,
    render_mode: Literal['auto', 'svg', 'webgl'] = 'svg',
    **kwargs,
) -> Optional[plotly.graph_objs._figure.Figure]
Docstring:
Plot sample coordinates from a principal components analysis (PCA) as a
plotly scatter plot.

Parameters
----------
data : DataFrame
    A dataframe of projections along principal components, one row per
    sample. The columns are:     `sample_id` is the identifier of the
    sample,     `partner_sample_id` is the identifier of the sample used
    by the partners who contributed it,     `contributor` is the partner
    who contributed the sample,     `country` is the country the sample
    was collected in,     `location` is the location the sample was
    collected in,     `year` is the year the sample was collected,
    `month` is the month the sample was collected,     `latitude` is the
    latitude of the location the sample was collected in,     `longitude`
    is the longitude of the location the sample was collected in,
    `sex_call` is the sex of the sample,     `sample_set` is the sample
    set containing the sample,     `release` is the release containing the
    sample,     `quarter` is the quarter of the year the sample was
    collected,     `study_id* is the identifier of the study the sample
    set containing the sample came from,     `study_url` is the URL of the
    study the sample set containing the sample came from,
    `terms_of_use_expiry_date` is the date the terms of use for the sample
    expire,     `terms_of_use_url` is the URL of the terms of use for the
    sample,     `unrestricted_use` indicates whether the sample can be
    used without restrictions (e.g., if the terms of use of expired),
    `mean_cov` is mean value of the coverage,     `median_cov` is the
    median value of the coverage,     `modal_cov` is the mode of the
    coverage,     `mean_cov_2L` is mean value of the coverage on 2L,
    `median_cov_2L` is the median value of the coverage on 2L,
    `mode_cov_2L` is the mode of the coverage on 2L,     `mean_cov_2R` is
    mean value of the coverage on 2R,     `median_cov_2R` is the median
    value of the coverage on 2R,     `mode_cov_2R` is the mode of the
    coverage on 2R,     `mean_cov_3L` is mean value of the coverage on 3L,
    `median_cov_3L` is the median value of the coverage on 3L,
    `mode_cov_3L` is the mode of the coverage on 3L,     `mean_cov_3R` is
    mean value of the coverage on 3R,     `median_cov_3R` is the median
    value of the coverage on 3R,     `mode_cov_3R` is the mode of the
    coverage on 3R,     `mean_cov_X` is mean value of the coverage on X,
    `median_cov_X` is the median value of the coverage on X,
    `mode_cov_X` is the mode of the coverage on X,     `frac_gen_cov` is
    the faction of the genome covered,     `divergence` is the divergence,
    `contam_pct` is the percentage of contamination,     `contam_LLR` is
    the log-likelihood ratio of contamination,
    `aim_species_fraction_arab` is the fraction of the gambcolu vs.
    arabiensis AIMs that indicated arabiensis (this column is only present
    for *Ag3*),     `aim_species_fraction_colu` is the fraction of the
    gambiae vs. coluzzii AIMs that indicated coluzzii (this column is only
    present for *Ag3*),     `aim_species_fraction_colu_no2l` is the
    fraction of the gambiae vs. coluzzii AIMs that indicated coluzzii, not
    including the chromosome arm 2L which contains an introgression (this
    column is only present for *Ag3*),
    `aim_species_gambcolu_arabiensis` is the taxonomic group assigned by
    the gambcolu vs. arabiensis AIMs (this column is only present for
    *Ag3*),     `aim_species_gambiae_coluzzi` is the taxonomic group
    assigned by the gambiae vs. coluzzii AIMs (this column is only present
    for *Ag3*),     `aim_species_gambcolu_arabiensis` is the taxonomic
    group assigned by the combination of both AIMs analyses (this column
    is only present for *Ag3*),     `country_iso` is the ISO code of the
    country the sample was collected in,     `admin1_name` is the name of
    the first administrative level the sample was collected in,
    `admin1_iso` is the ISO code of the first administrative level the
    sample was collected in,     `admin2_name` is the name of the second
    administrative level the sample was collected in,     `taxon` is the
    taxon assigned to the sample by the combination of the AIMs analysis
    and the cohort analysis,     `cohort_admin1_year` is the cohort the
    sample belongs to when samples are grouped by first administrative
    level and year,     `cohort_admin1_month` is the cohort the sample
    belongs to when samples are grouped by first administrative level and
    month,     `cohort_admin1_quarter` is the cohort the sample belongs to
    when samples are grouped by first administrative level and quarter,
    `cohort_admin2_year` is the cohort the sample belongs to when samples
    are grouped by second administrative level and year,
    `cohort_admin2_month` is the cohort the sample belongs to when samples
    are grouped by second administrative level and month,
    `cohort_admin2_quarter` is the cohort the sample belong to when
    samples are grouped by second administrative level and quarter.
    `PC?` is the projection along principal component ? (? being an
    integer between 1 and the number of components). There are as many
    such columns as components,     `pca_fit` is whether this sample was
    used for fitting.
x : str, optional, default: 'PC1'
    Name of variable to plot on the X axis.
y : str, optional, default: 'PC2'
    Name of variable to plot on the Y axis.
color : str or Mapping or None, optional
    Name of variable to use to color the markers.
symbol : str or Mapping or None, optional
    Name of the variable to use to choose marker symbols.
opacity : float, optional, default: 0.9
    Marker opacity.
jitter_frac : float or None, optional, default: 0.02
    Randomly jitter points by this fraction of their range.
random_seed : int, optional, default: 42
    Random seed used for reproducible down-sampling.
width : int or None, optional, default: 900
    Figure width in pixels (px).
height : int or None, optional, default: 600
    Figure height in pixels (px).
marker_size : int or float, optional, default: 10
    Marker size.
color_discrete_sequence : List or None, optional
    Provide a list of colours to use.
color_discrete_map : Mapping or None, optional
    Provide an explicit mapping from values to colours.
category_orders : List or Mapping or None, optional
    Control the order in which values appear in the legend.
legend_sizing : {'constant', 'trace'}, optional, default: 'constant'
    Determines if the legend items symbols scale with their corresponding
    "trace" attributes or remain "constant" independent of the symbol size
    on the graph.
show : bool, optional, default: True
    If true, show the plot. If False, do not show the plot, but return the
    figure.
renderer : str or None, optional
    The name of the renderer to use.
render_mode : {'auto', 'svg', 'webgl'}, optional, default: 'svg'
    The type of rendering backend to use. See also
    https://plotly.com/python/webgl-vs-svg/.
**kwargs
    Passed through to `px.scatter()`.

Returns
-------
Figure or None
    A plotly figure (only returned if show=False).
File:      /home/conda/developer/35674e27b19f7c625ba32a1b88449ff45c90b40edb90a065b66c5a9a5388f41c-20250421-195247-360829-93-training-nb-maintenance-mgen-15.2.0/lib/python3.11/site-packages/malariagen_data/anoph/pca.py
Type:      method

So we just need to plug in our pca_df Pandas DataFrame, however, we can also colour our points using the “gambcolu_vs_arab” AIM derived provisional taxon data to help us interpret the plot.

ag3.plot_pca_coords(pca_df, color="aim_species")

With the points coloured by our AIM analysis, we see that principal component 1 (PC1), which describes the most variation in our data, has detected structure driven by species. The separate An. gambiae and An. arabiensis clusters demonstrate a strong degree of reproductive isolation between the species in Uganda.

Equidistant from the two species clusters, we can see a single sample - AC0198-C. The AIM analysis defined this individual as being “intermediate_gambcolu_arabiensis” because it did not carry enough of either An. arabiensis or An. gambiae/coluzzi to be classified as such. The position of this sample, on the same axis of variation (PC1) as our two clusters of species suggests that this individual is a An. gambiae X An. arabiensis hybrid. Furthermore, that the sample falls at the approximate mid-point between the two main clusters, suggests that this individual is an F1 (first filial generation) hybrid. A multi-generational backcrossed individual (>F1), would fall on the same axis but appear closer to one AIM species cluster or the other, depending on which species had been backcrossed into.

In the sample metadata, the “taxon” column contains the results of these PCA-based cryptic species analyses.

In this case (AG1000G-UG), we don’t need to alter our provisional AIM species assignments, and so the “taxon” column will be identical to “aim_species”.

sample_meta_df.query("sample_set == 'AG1000G-UG'").groupby(["aim_species", "taxon"]).size()
aim_species                       taxon     
arabiensis                        arabiensis     82
gambiae                           gambiae       207
intermediate_gambcolu_arabiensis  unassigned      1
dtype: int64

Signals of cryptic species - Tanzania#

Let’s look at a difference situation, where think there is evidence for a cryptic species that we weren’t previously aware of.

Let’s plot the “gamb_vs_colu” (An. gambiae vs An. coluzzii) AIMs for the Tanzanian sample set, excluding samples assigned as An. arabiensis.

ag3.plot_aim_heatmap(
    aims="gamb_vs_colu", 
    sample_sets="AG1000G-TZ", 
    sample_query="aim_species != 'arabiensis'"
)
                                     

We can see that a number of samples show mixed ancestry (blue, red and yellow genotypes) over the 2 and 3 chromosomes, but appear homozygous for An. gambiae ancestry on the X chromosome.

In the AIM analysis, these samples were provisionally labelled as being “intermediate_gambiae_coluzzii” in the “aim_species” column of the sample metadata.

This result is particularly interesting, as the An. coluzzii species range does not reach this far East. Let’s investigate this sample set using PCA.

pca_df, evr = ag3.pca(
    region=region, 
    n_snps=n_snps, 
    sample_sets="AG1000G-TZ",
)
                                            
                                      

We should plot the variance array to get an idea which principal components we should look at.

ag3.plot_pca_variance(evr)

For these samples, the explained variance doesn’t flatten out until PC4, which suggests that we should look at the first three PCs. We could make multiple 2D scatter plots to investigate these three PCs e.g. PC1 vs. PC2 and PC2 vs. PC3. But it would be easier to intepret the results if we could make one 3D scatter plot and visualise all three PCs together. There is a function in malariagen_data that makes this very simple.

ag3.plot_pca_coords_3d?
Signature:
ag3.plot_pca_coords_3d(
    data: pandas.core.frame.DataFrame,
    x: str = 'PC1',
    y: str = 'PC2',
    z: str = 'PC3',
    color: Union[str, Mapping, NoneType] = None,
    symbol: Union[str, Mapping, NoneType] = None,
    jitter_frac: Optional[float] = 0.02,
    random_seed: int = 42,
    width: Optional[int] = 900,
    height: Optional[int] = 600,
    marker_size: Union[int, float] = 5,
    color_discrete_sequence: Optional[List] = None,
    color_discrete_map: Optional[Mapping] = None,
    category_orders: Union[List, Mapping, NoneType] = None,
    legend_sizing: Literal['constant', 'trace'] = 'constant',
    show: bool = True,
    renderer: Optional[str] = None,
    **kwargs,
) -> Optional[plotly.graph_objs._figure.Figure]
Docstring:
Plot sample coordinates from a principal components analysis (PCA) as a
plotly 3D scatter plot.

Parameters
----------
data : DataFrame
    A dataframe of projections along principal components, one row per
    sample. The columns are:     `sample_id` is the identifier of the
    sample,     `partner_sample_id` is the identifier of the sample used
    by the partners who contributed it,     `contributor` is the partner
    who contributed the sample,     `country` is the country the sample
    was collected in,     `location` is the location the sample was
    collected in,     `year` is the year the sample was collected,
    `month` is the month the sample was collected,     `latitude` is the
    latitude of the location the sample was collected in,     `longitude`
    is the longitude of the location the sample was collected in,
    `sex_call` is the sex of the sample,     `sample_set` is the sample
    set containing the sample,     `release` is the release containing the
    sample,     `quarter` is the quarter of the year the sample was
    collected,     `study_id* is the identifier of the study the sample
    set containing the sample came from,     `study_url` is the URL of the
    study the sample set containing the sample came from,
    `terms_of_use_expiry_date` is the date the terms of use for the sample
    expire,     `terms_of_use_url` is the URL of the terms of use for the
    sample,     `unrestricted_use` indicates whether the sample can be
    used without restrictions (e.g., if the terms of use of expired),
    `mean_cov` is mean value of the coverage,     `median_cov` is the
    median value of the coverage,     `modal_cov` is the mode of the
    coverage,     `mean_cov_2L` is mean value of the coverage on 2L,
    `median_cov_2L` is the median value of the coverage on 2L,
    `mode_cov_2L` is the mode of the coverage on 2L,     `mean_cov_2R` is
    mean value of the coverage on 2R,     `median_cov_2R` is the median
    value of the coverage on 2R,     `mode_cov_2R` is the mode of the
    coverage on 2R,     `mean_cov_3L` is mean value of the coverage on 3L,
    `median_cov_3L` is the median value of the coverage on 3L,
    `mode_cov_3L` is the mode of the coverage on 3L,     `mean_cov_3R` is
    mean value of the coverage on 3R,     `median_cov_3R` is the median
    value of the coverage on 3R,     `mode_cov_3R` is the mode of the
    coverage on 3R,     `mean_cov_X` is mean value of the coverage on X,
    `median_cov_X` is the median value of the coverage on X,
    `mode_cov_X` is the mode of the coverage on X,     `frac_gen_cov` is
    the faction of the genome covered,     `divergence` is the divergence,
    `contam_pct` is the percentage of contamination,     `contam_LLR` is
    the log-likelihood ratio of contamination,
    `aim_species_fraction_arab` is the fraction of the gambcolu vs.
    arabiensis AIMs that indicated arabiensis (this column is only present
    for *Ag3*),     `aim_species_fraction_colu` is the fraction of the
    gambiae vs. coluzzii AIMs that indicated coluzzii (this column is only
    present for *Ag3*),     `aim_species_fraction_colu_no2l` is the
    fraction of the gambiae vs. coluzzii AIMs that indicated coluzzii, not
    including the chromosome arm 2L which contains an introgression (this
    column is only present for *Ag3*),
    `aim_species_gambcolu_arabiensis` is the taxonomic group assigned by
    the gambcolu vs. arabiensis AIMs (this column is only present for
    *Ag3*),     `aim_species_gambiae_coluzzi` is the taxonomic group
    assigned by the gambiae vs. coluzzii AIMs (this column is only present
    for *Ag3*),     `aim_species_gambcolu_arabiensis` is the taxonomic
    group assigned by the combination of both AIMs analyses (this column
    is only present for *Ag3*),     `country_iso` is the ISO code of the
    country the sample was collected in,     `admin1_name` is the name of
    the first administrative level the sample was collected in,
    `admin1_iso` is the ISO code of the first administrative level the
    sample was collected in,     `admin2_name` is the name of the second
    administrative level the sample was collected in,     `taxon` is the
    taxon assigned to the sample by the combination of the AIMs analysis
    and the cohort analysis,     `cohort_admin1_year` is the cohort the
    sample belongs to when samples are grouped by first administrative
    level and year,     `cohort_admin1_month` is the cohort the sample
    belongs to when samples are grouped by first administrative level and
    month,     `cohort_admin1_quarter` is the cohort the sample belongs to
    when samples are grouped by first administrative level and quarter,
    `cohort_admin2_year` is the cohort the sample belongs to when samples
    are grouped by second administrative level and year,
    `cohort_admin2_month` is the cohort the sample belongs to when samples
    are grouped by second administrative level and month,
    `cohort_admin2_quarter` is the cohort the sample belong to when
    samples are grouped by second administrative level and quarter.
    `PC?` is the projection along principal component ? (? being an
    integer between 1 and the number of components). There are as many
    such columns as components,     `pca_fit` is whether this sample was
    used for fitting.
x : str, optional, default: 'PC1'
    Name of variable to plot on the X axis.
y : str, optional, default: 'PC2'
    Name of variable to plot on the Y axis.
z : str, optional, default: 'PC3'
    Name of variable to plot on the Z axis.
color : str or Mapping or None, optional
    Name of variable to use to color the markers.
symbol : str or Mapping or None, optional
    Name of the variable to use to choose marker symbols.
jitter_frac : float or None, optional, default: 0.02
    Randomly jitter points by this fraction of their range.
random_seed : int, optional, default: 42
    Random seed used for reproducible down-sampling.
width : int or None, optional, default: 900
    Figure width in pixels (px).
height : int or None, optional, default: 600
    Figure height in pixels (px).
marker_size : int or float, optional, default: 5
    Marker size.
color_discrete_sequence : List or None, optional
    Provide a list of colours to use.
color_discrete_map : Mapping or None, optional
    Provide an explicit mapping from values to colours.
category_orders : List or Mapping or None, optional
    Control the order in which values appear in the legend.
legend_sizing : {'constant', 'trace'}, optional, default: 'constant'
    Determines if the legend items symbols scale with their corresponding
    "trace" attributes or remain "constant" independent of the symbol size
    on the graph.
show : bool, optional, default: True
    If true, show the plot. If False, do not show the plot, but return the
    figure.
renderer : str or None, optional
    The name of the renderer to use.
**kwargs
    Passed through to `px.scatter_3d()`.

Returns
-------
Figure or None
    A plotly figure (only returned if show=False).
File:      /home/conda/developer/35674e27b19f7c625ba32a1b88449ff45c90b40edb90a065b66c5a9a5388f41c-20250421-195247-360829-93-training-nb-maintenance-mgen-15.2.0/lib/python3.11/site-packages/malariagen_data/anoph/pca.py
Type:      method

The parameters are very similar to the 2D PCA plotting function, except here we have another axis parameter “z” that we can assign another principal component to. Let’s colour the points by aim_species as we did before.

ag3.plot_pca_coords_3d(pca_df, x="PC1", y="PC2", z="PC3", color="aim_species")

Under the hood, malariagen_data is building these plots with Plotly, so they are interactive. The plot can be rotated by clicking on the plot and moving the mouse and scrolling allows zooming in and out. Holding the mouse pointer over a point will reveal metadata about that sample. Let’s try to interpret this plot one PC at a time.

PC1#

The first principal component has separated all the “arabiensis from all other “aim_species”. This is what we might expect when we sample two species, due to reproductive isolation.

PC2#

The second principal component has pulled all of the “intermediate_gambiae_coluzzii” samples as well as some “gambiae” samples away from all other samples. This is striking, the cluster that includes the intermediate samples is separated on its own axis of variation.

PC3#

The third principal component splits two clusters of “gambiae” samples. If we look at the metadata attached to samples in these two clusters, we can see that in one cluster, samples were collected in Muleba and in the other, the samples were collected in Muheza. This looks like a classic case of geographic isolation between the An. gambiae populations from these two sites.

Interpretation#

There is clearly a lot to unpack when it comes to the population structure we find in this sample set. The strongest signal of variance (PC1) is being driven by reproductive isolation between An. arabiensis and other species.

What is interesting here is that all the “arabiensis” samples are gathered together in a single cluster, but there are two clusters of “gambiae” (separated by PC3). Let’s have a look at some ecological information about the collection sites in Tanzania (the code for this figure can be found here).

Our “gambiae” samples come from Muheza and Muleba, in the East and West of the country respectively. We might conclude that our two clusters of “gambiae” in our PCA are being driven simply by the geographic distance separating the sample sites. However, “arabiensis” have also been collected from sites on either side of the country, and yet all these “arabiensis” samples cluster together.

There is a body of research showing An. arabiensis has a higher aridity tolerance than An. gambiae (e.g. Gray & Bradley (2005)).

One explanation of our results, consistent with these other findings, is that as Muheza and Muleba are separated by a brown (more arid) region that runs approximately North to South, splitting the country in two. The two “gambiae” populations are thus separated by a barrier of unsuitable environment, enabling the evolution of structure (two clusters in our PCA). As An. arabiensis can tolerate the dryer region, geneflow can occur across the country resulting in less within-species structure (one cluster in our PCA).

But what about the other cluster on our PCA, containing intermediate samples separated from other clusters by PC2. From studies of species distributions we know that there are no An. coluzzii in Tanzania and these samples do not look like hybrids between An. gambiae and An. arabiensis as they are not in between the species clusters pulled apart by PC1 (like our hybrid sample in Uganda).

Perhaps these samples are cryptic species, if they are species we don’t have AIMs for they could be labelled as intermediate in the AIM analysis. However, if this cluster represents a cryptic species, why are there four samples labelled as “gambiae” also present in the cluster?

  • BL0357-C

  • BL0366-C

  • BL0370-C

  • BL0384-C

To dig down into this further, we could make the same PCA plot as before, but this time colour the points by “aim_fraction_colu”, which as it suggests, is the fraction of AIMs from the “gamb_vs_colu” analysis, that suggest An. coluzzii ancestry.

Colour by AIM fraction#

ag3.plot_pca_coords_3d(
    pca_df, 
    x="PC1", 
    y="PC2", 
    z="PC3", 
    color="aim_species_fraction_colu"
)

When we colour the points by fraction coluzzii ancestry we can see that the “gambiae” samples which cluster with the “intermediate_gambiae_coluzzii” actually have relatively high “coluzzii” ancestry (lighter purple) compared with the other “gambiae” (darker purple) in Tanzania. This means that although they had been provisionally labelled as “gambiae” they were at the lower end of the AIM % cut-off.

As mentioned earlier, the AIMs are not 100% informative, and the species cut-offs we use are somewhat arbitrary. This is why we just use the AIM analyses to give provisonal species calls, and then follow up with PCA to give us a more nuanced picture of both structure and taxa.

Labelling cryptic taxa#

With the evidence we have collected on this “intermediate_gambiae_coluzzii” cluster of samples in Tanzania, we would re-label these samples as members of a cryptic taxa. In the sample metadata, the taxon column for these samples has consequently been changed to “gcx3”. This stands for “gambiae complex cryptic taxa 3”, as it is actually the third cryptic taxon we have identified in the Ag3.0 samples.

We can see this re-labelling if we run the same PCA but colour our samples by “taxon”.

ag3.plot_pca_coords_3d(pca_df, x="PC1", y="PC2", z="PC3", color="taxon")

We can also see how some of the AIM species assignments are converted into a cryptic taxon assignment. I.e., there are some samples labelled as either “gambiae” or “intermediate_gambiae_coluzzii” in the aim_species column, which get assigned as “gcx3” in the taxon column.

sample_meta_df.query("sample_set == 'AG1000G-TZ'").groupby(["aim_species", "taxon"]).size()
aim_species                    taxon     
arabiensis                     arabiensis    225
gambiae                        gambiae        64
                               gcx3            4
intermediate_gambiae_coluzzii  gcx3            7
dtype: int64

The importance of identifying cryptic taxa#

Our principal component analysis has highlighted population structure that appears to be due to a previously-unknown cryptic species in Tanzania. Can we use our genomic data to identify operationally important differences with this taxa?

If we remember module 4 of the first workshop, we used the malariagen_data package to plot gene SNP allele frequencies by cohorts, as a heatmap. Let’s do that for the Vgsc gene (target of pyrethroid insecticides) in our Tanzanian “gambiae” and “gcx3” cohorts.

ag3.aa_allele_frequencies?
Signature:
ag3.aa_allele_frequencies(
    transcript: str,
    cohorts: Union[str, Mapping[str, str]],
    sample_query: Optional[str] = None,
    sample_query_options: Optional[dict] = None,
    min_cohort_size: Optional[int] = 10,
    site_mask: Optional[str] = None,
    sample_sets: Union[str, Sequence[str], NoneType] = None,
    drop_invariant: bool = True,
    include_counts: bool = False,
    chunks: Union[int, str, Tuple[Union[int, str], ...], Callable[[Tuple[int, ...]], Union[int, str, Tuple[Union[int, str], ...]]]] = 'native',
    inline_array: bool = True,
) -> pandas.core.frame.DataFrame
Docstring:
Compute amino acid substitution frequencies for a gene transcript.

Parameters
----------
transcript : str
    Gene transcript identifier.
cohorts : str or Mapping[str, str]
    Either a string giving the name of a predefined cohort set (e.g.,
    "admin1_month") or a dict mapping custom cohort labels to sample
    queries.
sample_query : str or None, optional
    A pandas query string to be evaluated against the sample metadata, to
    select samples to be included in the returned data.
sample_query_options : dict or None, optional
    A dictionary of arguments that will be passed through to pandas
    query() or eval(), e.g. parser, engine, local_dict, global_dict,
    resolvers.
min_cohort_size : int or None, optional, default: 10
    Minimum cohort size. Raise an error if the number of samples is less
    than this value.
site_mask : str or None, optional
    Which site filters mask to apply. See the `site_mask_ids` property for
    available values.
sample_sets : sequence of str or str or None, optional
    List of sample sets and/or releases. Can also be a single sample set
    or release.
drop_invariant : bool, optional, default: True
    If True, drop variants not observed in the selected samples.
include_counts : bool, optional, default: False
    Include columns with allele counts and number of non-missing allele
    calls (nobs).
chunks : int or str or tuple of int or str or Callable[[typing.Tuple[int, ...]], int or str or tuple of int or str], optional, default: 'native'
    Define how input data being read from zarr should be divided into
    chunks for a dask computation. If 'native', use underlying zarr
    chunks. If a string specifying a target memory size, e.g., '300 MiB',
    resize chunks in arrays with more than one dimension to match this
    size. If 'auto', let dask decide chunk size.  If 'ndauto', let dask
    decide chunk size but only for arrays with more than one dimension. If
    'ndauto0', as 'ndauto' but only vary the first chunk dimension. If
    'ndauto1', as 'ndauto' but only vary the second chunk dimension. If
    'ndauto01', as 'ndauto' but only vary the first and second chunk
    dimensions. Also, can be a tuple of integers, or a callable which
    accepts the native chunks as a single argument and returns a valid
    dask chunks value.
inline_array : bool, optional, default: True
    Passed through to dask `from_array()`.

Returns
-------
DataFrame
    A dataframe of amino acid allele frequencies, one row per variant.
    The variants are indexed by their amino acid change, their contig,
    their position. The columns are split into two categories: there is
    1 column for each cohort containing the frequency of the amino acid
    change within the cohort, additionally there is a column `max_af`
    containing the maximum frequency of the amino acide change accross
    all cohorts; finally, there are 9 columns describing the variant
    allele: `transcript` contains the gene transcript used for this
    analysis, `effect` is the effect of the allele change, `impact`is
    the impact of the allele change, `ref_allele` is the reference
    allel, `alt_allele` is the alternate allele, `aa_pos` is the
    position of the amino acid, `ref_aa` is the reference amino acid,
    `alt_aa` is the altered amino acid with the varaint allele, and
    `label` is the label of the variant allele.

Notes
-----
Cohorts with fewer samples than `min_cohort_size` will be excluded
from output data frame.
File:      /home/conda/developer/35674e27b19f7c625ba32a1b88449ff45c90b40edb90a065b66c5a9a5388f41c-20250421-195247-360829-93-training-nb-maintenance-mgen-15.2.0/lib/python3.11/site-packages/malariagen_data/anoph/snp_frq.py
Type:      method
aa_allele_freqs_df = ag3.aa_allele_frequencies(
    transcript="AGAP004707-RD", 
    cohorts="admin1_year", 
    sample_sets="AG1000G-TZ",
    sample_query="taxon != 'arabiensis'",
)
                                     
                                      
                                     
Load genome features: ⠙ (0:00:00.09) 
aa_filt_df = aa_allele_freqs_df.query("max_af > 0.05")
ag3.plot_frequencies_heatmap(aa_filt_df)

The heatmap shows that the gcx3 cohort of samples (“TZ-25-gcx3-2013”) does not carry any pyrethoid target-site resistance alleles, whereas the “gambiae” cohort from the same collection year and region (“TZ-25-gamb-2013”) has a 41% allele frequency of the kdr (knock-down resistance) allele L995S.

This may mean the gcx3 mosquitoes have very different insecticide resistance profiles to the gambiae mosquitoes from the same location. Consequently, it is important to consider them separately for further investigations into the best vector control strategies.

Taxon analysis workflow - recap#

1. AIMs analysis#

  • First compute AIMs and use these to make provisional species calls.

  • Look for sample sets where there’s evidence of “intermediate” samples.

2. PCA analysis#

  • Run PCAs for each suspect sample set. Adding sample sets from nearby countries with clean species calls can help as a comparison.

  • Look for distinct clusters on PCA plots containing the “intermediate” samples.

3. Interpret#

  • Do these intermediate samples fall between clusters of known species?

    • If yes, they could be hybrids of the two known species.

  • Is there a PCA dimension which pulls the cluster out uniquely?

    • If yes, they could belong to a cryptic taxon

Well done!#

  • In this module we have followed MalariaGEN’s mosquito taxon analysis pipeline and the reasoning behind it.

  • Run and plotted both AIM and PCA analyses, and learnt how to interpret the results.

  • Learnt how to detect cryptic taxa.

Exercises#

English#

Open this notebook in Google Colab and run it for yourself from top to bottom. As you run through the notebook, cell by cell, think about what each cell is doing.

Hint: To open the notebook in Google Colab, click the rocket icon at the top of the page, then select “Colab” from the drop-down menu.

Next, run a taxon analysis workflow for Guinea Bissau - “AG1000G-GW”.

  • Plot the “gamb_vs_colu” AIMs for Guinea Bissau.

    • How might we interpret this plot?

  • Run and plot a PCA for Guinea Bissau.

    • How many PCs should we investigate?

    • How might we interpret the PCA plot? Hint: try colouring markers by “aim_species_fraction_colu”

  • Add Burkina Faso A sample set to the PCA analysis - [“AG1000G-GW, “AG1000G-BF-A”].

    • Does this help clarify the situation, if so why?

    • Could we have a hybrid cluster or a cryptic taxon cluster?

    • Explain why you think that.

Français#

Ouvrez ce notebook dans Google Colab et exécutez-le vous-même du début à la fin. Pendant que vous exécutez le notebook, cellule par cellule, considérez ce que chaque cellule fait.

Indice: Ouvrir ce notebook dans Google Colab, cliquez sur l’icône “fusée” au sommet de la page et sélectionner “Colab” dans le menu.

Ensuite, exécuter l’analyse des taxons pour la Guinée-Bissau – “AG1000G-GW”.

  • Afficher le diagramme des AIMs “gamb_vs_colu” pour Guinea-Bissau

    • Comment interpreter ce diagramme?

  • Exécuter et afficher la PCA pour la Guinée

    • Combien de PCs doivent être étudiés?

    • Comment interpréter le diagramme de la PCA? Indice: essayer de colorer les marqueurs en fonction de “aim_species_fraction_colu”.

  • Ajouter l’ensemble de données Burkina Faso A à l’analyse PCA – [“AG1000G-GW”, “AG1000G-BF-A”].

    • Est-ce que cela aide à rendre la situation plus claire? Si oui, pourquoi?

    • Est-ce vous observez un groupe d’hybrides ou un groupe d’un taxon cryptique?

    • Pourquoi pensez-vous cela?