Workshop 4 - Training course in data analysis for genomic surveillance of African malaria vectors
Module 4 - Discovering cryptic taxa#
Theme: Analysis
In this module we’re going to detect cryptic taxa in the Ag3.0 data resource using ancestry informative markers (AIMs) and principal component analysis (PCA). We will use functions in the malariagen_data
python package to run analyses then learn how interpret the results to discover cryptic taxonomic structure.
Learning objectives#
At the end of this module you will:
Understand MalariaGEN’s mosquito taxon analysis pipeline and the reasoning behind it.
Be able to run and plot AIM and PCA analyses, and interpret the results.
Be able to detect cryptic taxa.
Lecture#
English#
Français#
Please note that the code in the cells below might differ from that shown in the video. This can happen because Python packages and their dependencies change due to updates, necessitating tweaks to the code.
Why do we need to discover cryptic taxa?#
Generally, we need to discover cryptic taxa in our data sets because they will generate population structure. In workshop 3 we explored how genetically distinct populations need to be separated, else population analysis results will be confounded.
Specifically, we need to identify cryptic taxa for genomic surveillance and vector control. Cryptic taxa may differ from known taxa in medically important phenotypes, e.g., biting times, vector competence or insecticide resistance. Vector control methods that work for some vector taxa may fail to control others.
Setup#
Before we begin the analysis, let’s set up the Python packages we’ll need to use.
First install and import the malariagen_data
package.
%pip install -q --no-warn-conflicts malariagen_data
Note that authentication is required to access data through the package, please follow the instructions here.
import malariagen_data
import os
import plotly.io as pio
pio.renderers.default = "notebook+colab"
Some analyses may take a while to complete, particularly if you’re running this code on a service with modest computational resources such as Google Colab. To avoid having to rerun these analyses, we’ll save the results so we can come back to them later. In Google Colab, you can save results to your Google Drive, which will mean you don’t lose results even if you leave the notebook and come back several days later.
Mount your Google Drive - you will need to follow the authorization instructions.
try:
# if running on colab, mount Google drive
from google.colab import drive
drive.mount('drive')
except ImportError:
pass
With our Google Drive now mounted, we can define and make a directory where we want to save our results.
results_dir = 'drive/MyDrive/Colab Data/module_4_results'
os.makedirs(results_dir, exist_ok=True)
In Google Colab, we can actually see our mounted drive and results directory by clicking on the file tab on the left hand side of the screen.
Next we should setup the malariagen_data
package. As we want to save our reults in the Google Drive folder we just set up, we’ll use the results_cache
parameter and assign our results directory to it. If we were running this notebook locally, then we could assign a local folder to this parameter and the results would instead get stored on our hard drive.
ag3 = malariagen_data.Ag3(results_cache=results_dir)
ag3
MalariaGEN Ag3 API client | |
---|---|
Please note that data are subject to terms of use, for more information see the MalariaGEN website or contact support@malariagen.net. See also the Ag3 API docs. | |
Storage URL | gs://vo_agam_release_master_us_central1 |
Data releases available | 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 3.10, 3.11, 3.12, 3.13, 3.14 |
Results cache | /home/jonbrenas/anopheles-genomic-surveillance.github.io/docs/workshop-4/drive/MyDrive/Colab Data/module_4_results |
Cohorts analysis | 20250131 |
AIM analysis | 20220528 |
Site filters analysis | dt_20200416 |
Software version | malariagen_data 15.2.0 |
Client location | Iowa, United States (Google Cloud us-central1) |
Remember to check Client location
in the output above - our cloud data is stored in the US, so want our Google Colab virtual machine (VM) to be based in the US too. If your client location is somewhere else in the world, select Runtime
then Disconnect and delete runtime
from the toolbar at the top of notebook, then rerun the notebook from the top. This will ensure our analyses run as fast as possible.
Step 1 - Ancestry informative marker (AIM) analysis#
Before we investigate what taxa are present in our dataset in detail, we first make provisional species calls. We could do this using the single marker molecular typing results that contributors often supply when sending samples, generated by assays such as Scott et al. (1993) and Santolamazza et al. (2008). However, as we have discussed in the previous module, these single markers are a blunt tool - the marker may indicate one species, while the rest of the genome indicates a different species entirely.
Rather than a single marker, we use multiple ancestry-informative SNP markers (AIMs) taken from across the genome. These are derived by taking a population of each species and looking for the biallelic SNPs that have different alleles fixed (or almost fixed) in the different species. Depending on how diverged the mosquito species are, this method gives us hundreds or thousands of ancestry informative marker SNPs across the genome.
With these AIMs, we can use the percentage of AIM alleles in an individual to assign provisional species. E.g., currently we assign any sample with 85% or more An. arabiensis AIM alleles as An. arabiensis. These AIM fractions can be found in the sample metadata in the “aim_species_fraction_arab” and “aim_species_fraction_colu” columns. The provisional species calls made from these AIM data can be found in the sample metadata “aim_species” column.
Let’s remind ourselves of what the sample metadata looks like.
sample_meta_df = ag3.sample_metadata()
sample_meta_df.head()
sample_id | partner_sample_id | contributor | country | location | year | month | latitude | longitude | sex_call | ... | admin1_name | admin1_iso | admin2_name | taxon | cohort_admin1_year | cohort_admin1_month | cohort_admin1_quarter | cohort_admin2_year | cohort_admin2_month | cohort_admin2_quarter | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | VBS00256-4651STDY7017184 | GP97 | Tovi Lehmann | Mali | Dallowere | 2012 | 6 | 13.616 | -7.037 | F | ... | Koulikouro | ML-2 | Banamba | coluzzii | ML-2_colu_2012 | ML-2_colu_2012_06 | ML-2_colu_2012_Q2 | ML-2_Banamba_colu_2012 | ML-2_Banamba_colu_2012_06 | ML-2_Banamba_colu_2012_Q2 |
1 | VBS00257-4651STDY7017185 | GP98 | Tovi Lehmann | Mali | Dallowere | 2012 | 6 | 13.616 | -7.037 | F | ... | Koulikouro | ML-2 | Banamba | coluzzii | ML-2_colu_2012 | ML-2_colu_2012_06 | ML-2_colu_2012_Q2 | ML-2_Banamba_colu_2012 | ML-2_Banamba_colu_2012_06 | ML-2_Banamba_colu_2012_Q2 |
2 | VBS00259-4651STDY7017186 | GP100 | Tovi Lehmann | Mali | Dallowere | 2012 | 6 | 13.616 | -7.037 | F | ... | Koulikouro | ML-2 | Banamba | coluzzii | ML-2_colu_2012 | ML-2_colu_2012_06 | ML-2_colu_2012_Q2 | ML-2_Banamba_colu_2012 | ML-2_Banamba_colu_2012_06 | ML-2_Banamba_colu_2012_Q2 |
3 | VBS00262-4651STDY7017187 | GP103 | Tovi Lehmann | Mali | Dallowere | 2012 | 6 | 13.616 | -7.037 | F | ... | Koulikouro | ML-2 | Banamba | coluzzii | ML-2_colu_2012 | ML-2_colu_2012_06 | ML-2_colu_2012_Q2 | ML-2_Banamba_colu_2012 | ML-2_Banamba_colu_2012_06 | ML-2_Banamba_colu_2012_Q2 |
4 | VBS00277-4651STDY7017189 | GP118 | Tovi Lehmann | Mali | Dallowere | 2012 | 6 | 13.616 | -7.037 | F | ... | Koulikouro | ML-2 | Banamba | coluzzii | ML-2_colu_2012 | ML-2_colu_2012_06 | ML-2_colu_2012_Q2 | ML-2_Banamba_colu_2012 | ML-2_Banamba_colu_2012_06 | ML-2_Banamba_colu_2012_Q2 |
5 rows × 57 columns
There are several columns in the metadata which provide data on AIMs. Let’s take a look.
aim_columns = [c for c in sample_meta_df if c.startswith("aim_")]
aim_columns
['aim_species_fraction_arab',
'aim_species_fraction_colu',
'aim_species_fraction_colu_no2l',
'aim_species_gambcolu_arabiensis',
'aim_species_gambiae_coluzzii',
'aim_species']
sample_meta_df[["sample_id"] + aim_columns].head()
sample_id | aim_species_fraction_arab | aim_species_fraction_colu | aim_species_fraction_colu_no2l | aim_species_gambcolu_arabiensis | aim_species_gambiae_coluzzii | aim_species | |
---|---|---|---|---|---|---|---|
0 | VBS00256-4651STDY7017184 | 0.002 | 0.858 | 0.973 | gambcolu | coluzzii | coluzzii |
1 | VBS00257-4651STDY7017185 | 0.002 | 0.977 | 0.982 | gambcolu | coluzzii | coluzzii |
2 | VBS00259-4651STDY7017186 | 0.001 | 0.917 | 0.974 | gambcolu | coluzzii | coluzzii |
3 | VBS00262-4651STDY7017187 | 0.002 | 0.860 | 0.974 | gambcolu | coluzzii | coluzzii |
4 | VBS00277-4651STDY7017189 | 0.002 | 0.924 | 0.984 | gambcolu | coluzzii | coluzzii |
Note the “aim_species” column - this has the value of our provisional species call made from looking at AIM genotype calls. Let’s see all the values this can take:
sample_meta_df.groupby("aim_species").size()
aim_species
arabiensis 5330
coluzzii 7921
gambiae 8238
intermediate_gambcolu_arabiensis 465
intermediate_gambiae_coluzzii 651
dtype: int64
In the previous module we looked at how we can visualise the AIMs underlying this provisional species call with an AIM plot, using the handy ag3.plot_aim_heatmap()
function in the malariagen_data
python package. Let’s remind ourselves of how this function works, and what the AIMs looks like.
ag3.plot_aim_heatmap?
Signature:
ag3.plot_aim_heatmap(
aims: str,
sample_sets: Union[str, Sequence[str], NoneType] = None,
sample_query: Optional[str] = None,
sample_query_options: Optional[dict] = None,
sort: bool = True,
row_height: int = 4,
xgap: float = 0,
ygap: float = 0.5,
palette: Optional[Tuple[str, str, str, str]] = None,
show: bool = True,
renderer: Optional[str] = None,
) -> Optional[plotly.graph_objs._figure.Figure]
Docstring:
Plot a heatmap of ancestry-informative marker (AIM) genotypes.
Parameters
----------
aims : str
Identifier for a set of ancestry informative markers to use. For
possible values see the `aim_ids` property.
sample_sets : sequence of str or str or None, optional
List of sample sets and/or releases. Can also be a single sample set
or release.
sample_query : str or None, optional
A pandas query string to be evaluated against the sample metadata, to
select samples to be included in the returned data.
sample_query_options : dict or None, optional
A dictionary of arguments that will be passed through to pandas
query() or eval(), e.g. parser, engine, local_dict, global_dict,
resolvers.
sort : bool, optional, default: True
If true (default), sort the samples by the total fraction of AIM
alleles for the second species in the comparison.
row_height : int, optional, default: 4
Height per sample in px.
xgap : float, optional, default: 0
Creates lines between columns (variants).
ygap : float, optional, default: 0.5
Creates lines between rows (samples).
palette : tuple[str, str, str, str] or None, optional
4-tuple of colors for AIM genotypes, in the order: missing, hom taxon
1, het, hom taxon 2.
show : bool, optional, default: True
If true, show the plot. If False, do not show the plot, but return the
figure.
renderer : str or None, optional
The name of the renderer to use.
Returns
-------
Figure or None
A plotly figure (only returned if show=False).
File: /home/conda/developer/35674e27b19f7c625ba32a1b88449ff45c90b40edb90a065b66c5a9a5388f41c-20250421-195247-360829-93-training-nb-maintenance-mgen-15.2.0/lib/python3.11/site-packages/malariagen_data/anoph/aim_data.py
Type: method
To visualise the AIM calls as a heatmap we just need to specify which set of markers we want to use (aims
), either “gambcolu_vs_arab” or “gamb_vs_colu”. There’s also a sample_set
parameter and a sample_query
parameter to select which samples to look at.
Let’s look again at the “gamb_vs_colu” AIMs from the “AG1000G-BF-A” sample set of mosquitoes from Burkina Faso.
ag3.plot_aim_heatmap(
aims="gamb_vs_colu",
sample_sets="AG1000G-BF-A"
)
In the plot above, the subplots represent different contigs (chromosome arms) and the rows are individual samples. These plots are interactive and hovering the mouse pointer brings up data about the AIM variant index (column), mosquito sample ID (row), and AIM genotype (here 0 means homozygous gamb/gamb, 1 means heterozygous gamb/colu and 2 means homozygous colu/colu).
In the “gamb_vs_colu” plot, blue represents homozygous An. gambiae genotypes, red represents homozygous An. coluzzii genotypes, and in yellow we see genotypes that are heterozygous for the gambiae and coluzzii alleles.
Despite the fact that many samples are affected by the known introgression event on chromosome arm 2L, in most cases we still make fairly clear provisional species assignments using the AIMs. In fact, because this introgression is so common, we actually ignore chromosome arm 2L when making provisional species calls between gambiae and coluzzii. Here are the provisional assignments for this sample set:
sample_meta_df.query("sample_set == 'AG1000G-BF-A'").groupby("aim_species").size()
aim_species
coluzzii 82
gambiae 98
intermediate_gambiae_coluzzii 1
dtype: int64
Let’s look at another AIM plot, this time using the “gambcolu_vs_arab” AIMs to look at mosquitoes from Uganda.
ag3.plot_aim_heatmap(
aims="gambcolu_vs_arab",
sample_sets="AG1000G-UG"
)
In the “gambcolu_vs_arab” plot, green represents genotypes which are homozygous for the An. arabiensis allele; purple represents genotypes homozygous for the allele found in An. gambiae and An. coluzzii, and in yellow we again see genotypes that are heterozygous.
Though the AIMs are not 100% informative due to how they have been obtained (see module 3), we can see lots of samples where most of the AIMs are homozygous for one species. This what we expected, as these samples should belong to one of these known taxa.
Sometimes, however, we see samples where the species is not clear. We can see a sample like this (ACO198-C) in the Uganda plot above. This sample has heterozygous genotypes at almost all AIMs, and so gets assigned as “intermediate_gambcolu_arabiensis” in the samples metadata “aim_species” column.
sample_meta_df.query("sample_id == 'AC0198-C'")[["sample_id"] + aim_columns]
sample_id | aim_species_fraction_arab | aim_species_fraction_colu | aim_species_fraction_colu_no2l | aim_species_gambcolu_arabiensis | aim_species_gambiae_coluzzii | aim_species | |
---|---|---|---|---|---|---|---|
21621 | AC0198-C | 0.494832 | 0.211288 | 0.2075 | intermediate | NaN | intermediate_gambcolu_arabiensis |
Here are counts of the different AIM species assignments in Uganda:
sample_meta_df.query("sample_set == 'AG1000G-UG'").groupby("aim_species").size()
aim_species
arabiensis 82
gambiae 207
intermediate_gambcolu_arabiensis 1
dtype: int64
There are other sample sets where there are multiple samples that do not have a clear species assignment, e.g., Guinea-Bissau.
ag3.plot_aim_heatmap(
aims="gamb_vs_colu",
sample_sets="AG1000G-GW"
)
In sample sets like this, there may be many “intermediate” AIM species assignments:
sample_meta_df.query("sample_set == 'AG1000G-GW'").groupby("aim_species").size()
aim_species
gambiae 28
intermediate_gambiae_coluzzii 73
dtype: int64
Sample sets with many samples assigned a “intermediate” provisional species are flagged for further investigation with principal component analysis (PCA).
Step 2 - Principal component analysis#
Recap: what is principal component analysis?#
In the previous workshop we learnt how PCA can be used to identify genetic structure in a group of samples and we learnt why being able to detect structure is useful for genomic surveillance and vector control.
Fundamentally, PCA is a method for reducing the dimensions of a dataset to help make interpreting the data easier. The PCA finds axes through the data that describe its variance, in the case of genomic data, this effectively collapses thousands or millions of dimensions (SNPs) down to a handful of principal components which allow tractable investigation of structure in the data.
When we are trying to identify the causes of the apparent “intermediate” taxon samples from AIMs analysis, the way PCA reveals the structure of these samples can be be very helpful.
Signals of hybridisation - Uganda#
In the Ugandan “gambcolu_vs_arab” aim plot we saw an individual that appeared to be heterozygous for AIMs across it’s genome. Let’s run a PCA on this sample set and see how this individual appears.
Let’s briefly remind ourselves of the pca function documentation.
ag3.pca?
Signature:
ag3.pca(
region: Union[str, malariagen_data.util.Region, Mapping, List[Union[str, malariagen_data.util.Region, Mapping]], Tuple[Union[str, malariagen_data.util.Region, Mapping], ...]],
n_snps: int,
n_components: int = 20,
thin_offset: int = 0,
sample_sets: Union[str, Sequence[str], NoneType] = None,
sample_query: Optional[str] = None,
sample_query_options: Optional[dict] = None,
sample_indices: Optional[List[int]] = None,
site_mask: Optional[str] = 'default',
site_class: Optional[str] = None,
min_minor_ac: Union[int, float, NoneType] = 2,
max_missing_an: Union[int, float, NoneType] = 0,
cohort_size: Optional[int] = None,
min_cohort_size: Optional[int] = None,
max_cohort_size: Optional[int] = None,
exclude_samples: Union[str, int, List[Union[str, int]], Tuple[Union[str, int], ...], NoneType] = None,
fit_exclude_samples: Union[str, int, List[Union[str, int]], Tuple[Union[str, int], ...], NoneType] = None,
random_seed: int = 42,
inline_array: bool = True,
chunks: Union[int, str, Tuple[Union[int, str], ...], Callable[[Tuple[int, ...]], Union[int, str, Tuple[Union[int, str], ...]]]] = 'native',
) -> Tuple[pandas.core.frame.DataFrame, numpy.ndarray]
Docstring:
Run a principal components analysis (PCA) using biallelic SNPs from the
selected genome region and samples.
.. versionchanged:: 8.0.0 SNP ascertainment has changed slightly.
This function uses biallelic SNPs as input to the PCA. The ascertainment
of SNPs to include has changed slightly in version 8.0.0 and therefore
the results of this function may also be slightly different. Previously,
SNPs were required to be biallelic and one of the observed alleles was
required to be the reference allele. Now SNPs just have to be biallelic.
The following additional parameters were also added in version 8.0.0:
`site_class`, `cohort_size`, `min_cohort_size`, `max_cohort_size`,
`random_seed`.
Parameters
----------
region : str or Region or Mapping or list of str or Region or Mapping or tuple of str or Region or Mapping
Region of the reference genome. Can be a contig name, region string
(formatted like "{contig}:{start}-{end}"), or identifier of a genome
feature such as a gene or transcript. Can also be a sequence (e.g.,
list) of regions.
n_snps : int
The desired number of SNPs to use when running the analysis. SNPs will
be evenly thinned to approximately this number.
n_components : int, optional, default: 20
Number of components to return.
thin_offset : int, optional, default: 0
Starting index for SNP thinning. Change this to repeat the analysis
using a different set of SNPs.
sample_sets : sequence of str or str or None, optional
List of sample sets and/or releases. Can also be a single sample set
or release.
sample_query : str or None, optional
A pandas query string to be evaluated against the sample metadata, to
select samples to be included in the returned data.
sample_query_options : dict or None, optional
A dictionary of arguments that will be passed through to pandas
query() or eval(), e.g. parser, engine, local_dict, global_dict,
resolvers.
sample_indices : list of int or None, optional
Advanced usage parameter. A list of indices of samples to select,
corresponding to the order in which the samples are found within the
sample metadata. Either provide this parameter or sample_query, not
both.
site_mask : str or None, optional, default: 'default'
Which site filters mask to apply. See the `site_mask_ids` property for
available values.
site_class : str or None, optional
Select sites belonging to one of the following classes: CDS_DEG_4,
(4-fold degenerate coding sites), CDS_DEG_2_SIMPLE (2-fold simple
degenerate coding sites), CDS_DEG_0 (non-degenerate coding sites),
INTRON_SHORT (introns shorter than 100 bp), INTRON_LONG (introns
longer than 200 bp), INTRON_SPLICE_5PRIME (intron within 2 bp of 5'
splice site), INTRON_SPLICE_3PRIME (intron within 2 bp of 3' splice
site), UTR_5PRIME (5' untranslated region), UTR_3PRIME (3'
untranslated region), INTERGENIC (intergenic, more than 10 kbp from a
gene).
min_minor_ac : int or float or None, optional, default: 2
The minimum minor allele count. SNPs with a minor allele count below
this value will be excluded. Can also be a float, which will be
interpreted as a fraction.
max_missing_an : int or float or None, optional, default: 0
The maximum number of missing allele calls to accept. SNPs with more
than this value will be excluded. Set to 0 to require no missing
calls. Can also be a float, which will be interpreted as a fraction.
cohort_size : int or None, optional
Randomly down-sample to this value if the number of samples in the
cohort is greater. Raise an error if the number of samples is less
than this value.
min_cohort_size : int or None, optional
Minimum cohort size. Raise an error if the number of samples is less
than this value.
max_cohort_size : int or None, optional
Randomly down-sample to this value if the number of samples in the
cohort is greater.
exclude_samples : str or int or list of str or int or tuple of str or int or None, optional
Sample identifier or index within sample set. Multiple values can also
be provided as a list or tuple.
fit_exclude_samples : str or int or list of str or int or tuple of str or int or None, optional
Sample identifier or index within sample set. Multiple values can also
be provided as a list or tuple.
random_seed : int, optional, default: 42
Random seed used for reproducible down-sampling.
inline_array : bool, optional, default: True
Passed through to dask `from_array()`.
chunks : int or str or tuple of int or str or Callable[[typing.Tuple[int, ...]], int or str or tuple of int or str], optional, default: 'native'
Define how input data being read from zarr should be divided into
chunks for a dask computation. If 'native', use underlying zarr
chunks. If a string specifying a target memory size, e.g., '300 MiB',
resize chunks in arrays with more than one dimension to match this
size. If 'auto', let dask decide chunk size. If 'ndauto', let dask
decide chunk size but only for arrays with more than one dimension. If
'ndauto0', as 'ndauto' but only vary the first chunk dimension. If
'ndauto1', as 'ndauto' but only vary the second chunk dimension. If
'ndauto01', as 'ndauto' but only vary the first and second chunk
dimensions. Also, can be a tuple of integers, or a callable which
accepts the native chunks as a single argument and returns a valid
dask chunks value.
Returns
-------
df_pca : DataFrame
A dataframe of projections along principal components, one row per
sample. The columns are: `sample_id` is the identifier of the
sample, `partner_sample_id` is the identifier of the sample used
by the partners who contributed it, `contributor` is the partner
who contributed the sample, `country` is the country the sample
was collected in, `location` is the location the sample was
collected in, `year` is the year the sample was collected,
`month` is the month the sample was collected, `latitude` is the
latitude of the location the sample was collected in,
`longitude` is the longitude of the location the sample was
collected in, `sex_call` is the sex of the sample,
`sample_set` is the sample set containing the sample, `release`
is the release containing the sample, `quarter` is the quarter
of the year the sample was collected, `study_id* is the
identifier of the study the sample set containing the sample came
from, `study_url` is the URL of the study the sample set
containing the sample came from, `terms_of_use_expiry_date` is
the date the terms of use for the sample expire,
`terms_of_use_url` is the URL of the terms of use for the sample,
`unrestricted_use` indicates whether the sample can be used without
restrictions (e.g., if the terms of use of expired), `mean_cov`
is mean value of the coverage, `median_cov` is the median value
of the coverage, `modal_cov` is the mode of the coverage,
`mean_cov_2L` is mean value of the coverage on 2L,
`median_cov_2L` is the median value of the coverage on 2L,
`mode_cov_2L` is the mode of the coverage on 2L, `mean_cov_2R`
is mean value of the coverage on 2R, `median_cov_2R` is the
median value of the coverage on 2R, `mode_cov_2R` is the mode of
the coverage on 2R, `mean_cov_3L` is mean value of the coverage
on 3L, `median_cov_3L` is the median value of the coverage on
3L, `mode_cov_3L` is the mode of the coverage on 3L,
`mean_cov_3R` is mean value of the coverage on 3R,
`median_cov_3R` is the median value of the coverage on 3R,
`mode_cov_3R` is the mode of the coverage on 3R, `mean_cov_X` is
mean value of the coverage on X, `median_cov_X` is the median
value of the coverage on X, `mode_cov_X` is the mode of the
coverage on X, `frac_gen_cov` is the faction of the genome
covered, `divergence` is the divergence, `contam_pct` is the
percentage of contamination, `contam_LLR` is the log-likelihood
ratio of contamination, `aim_species_fraction_arab` is the
fraction of the gambcolu vs. arabiensis AIMs that indicated
arabiensis (this column is only present for *Ag3*),
`aim_species_fraction_colu` is the fraction of the gambiae vs.
coluzzii AIMs that indicated coluzzii (this column is only present
for *Ag3*), `aim_species_fraction_colu_no2l` is the fraction of
the gambiae vs. coluzzii AIMs that indicated coluzzii, not including
the chromosome arm 2L which contains an introgression (this column
is only present for *Ag3*), `aim_species_gambcolu_arabiensis` is
the taxonomic group assigned by the gambcolu vs. arabiensis AIMs
(this column is only present for *Ag3*),
`aim_species_gambiae_coluzzi` is the taxonomic group assigned by the
gambiae vs. coluzzii AIMs (this column is only present for *Ag3*),
`aim_species_gambcolu_arabiensis` is the taxonomic group assigned by
the combination of both AIMs analyses (this column is only present
for *Ag3*), `country_iso` is the ISO code of the country the
sample was collected in, `admin1_name` is the name of the first
administrative level the sample was collected in, `admin1_iso`
is the ISO code of the first administrative level the sample was
collected in, `admin2_name` is the name of the second
administrative level the sample was collected in, `taxon` is the
taxon assigned to the sample by the combination of the AIMs analysis
and the cohort analysis, `cohort_admin1_year` is the cohort the
sample belongs to when samples are grouped by first administrative
level and year, `cohort_admin1_month` is the cohort the sample
belongs to when samples are grouped by first administrative level
and month, `cohort_admin1_quarter` is the cohort the sample
belongs to when samples are grouped by first administrative level
and quarter, `cohort_admin2_year` is the cohort the sample
belongs to when samples are grouped by second administrative level
and year, `cohort_admin2_month` is the cohort the sample belongs
to when samples are grouped by second administrative level and
month, `cohort_admin2_quarter` is the cohort the sample belong
to when samples are grouped by second administrative level and
quarter. `PC?` is the projection along principal component ? (?
being an integer between 1 and the number of components). There are
as many such columns as components, `pca_fit` is whether this
sample was used for fitting.
evr : ndarray
An array of explained variance ratios, one per component.
Notes
-----
This computation may take some time to run, depending on your
computing environment. Results of this computation will be cached and
re-used if the `results_cache` parameter was set when instantiating
the API client.
File: /home/conda/developer/35674e27b19f7c625ba32a1b88449ff45c90b40edb90a065b66c5a9a5388f41c-20250421-195247-360829-93-training-nb-maintenance-mgen-15.2.0/lib/python3.11/site-packages/malariagen_data/anoph/pca.py
Type: method
Let’s define some parameters to use in all the PCA computations.
region = "3L:15,000,000-41,000,000"
n_snps = 100_000
Now run a PCA with Ugandan samples.
pca_df, evr = ag3.pca(
region=region,
n_snps=n_snps,
sample_sets="AG1000G-UG"
)
We can look at the explained variance ratio array (evr
) to get an feeling for how many principal components we are interested in for this particular PCA. This array contains the proportion of the total variance in the dataset explained by each principal component. The easiest way to do this is to plot the array using the malariagen_data function plot_pca_variance()
.
ag3.plot_pca_variance?
Signature:
ag3.plot_pca_variance(
evr: numpy.ndarray,
width: Optional[int] = 900,
height: Optional[int] = 400,
show: bool = True,
renderer: Optional[str] = None,
**kwargs,
) -> Optional[plotly.graph_objs._figure.Figure]
Docstring:
Plot explained variance ratios from a principal components analysis
(PCA) using a plotly bar plot.
Parameters
----------
evr : ndarray
An array of explained variance ratios, one per component.
width : int or None, optional, default: 900
Figure width in pixels (px).
height : int or None, optional, default: 400
Figure height in pixels (px).
show : bool, optional, default: True
If true, show the plot. If False, do not show the plot, but return the
figure.
renderer : str or None, optional
The name of the renderer to use.
**kwargs
Passed through to px.bar().
Returns
-------
Figure or None
A plotly figure (only returned if show=False).
File: /home/conda/developer/35674e27b19f7c625ba32a1b88449ff45c90b40edb90a065b66c5a9a5388f41c-20250421-195247-360829-93-training-nb-maintenance-mgen-15.2.0/lib/python3.11/site-packages/malariagen_data/anoph/pca.py
Type: method
ag3.plot_pca_variance(evr)
Where the variance plot flattens out is a good rule of thumb as to where principal components may be more noise than signal of structure. In this case it looks like PC1 explains much of the variance in the data relative to all other PCs.
Now we can plot the PCA data using the malariagen_data package.
ag3.plot_pca_coords?
Signature:
ag3.plot_pca_coords(
data: pandas.core.frame.DataFrame,
x: str = 'PC1',
y: str = 'PC2',
color: Union[str, Mapping, NoneType] = None,
symbol: Union[str, Mapping, NoneType] = None,
opacity: float = 0.9,
jitter_frac: Optional[float] = 0.02,
random_seed: int = 42,
width: Optional[int] = 900,
height: Optional[int] = 600,
marker_size: Union[int, float] = 10,
color_discrete_sequence: Optional[List] = None,
color_discrete_map: Optional[Mapping] = None,
category_orders: Union[List, Mapping, NoneType] = None,
legend_sizing: Literal['constant', 'trace'] = 'constant',
show: bool = True,
renderer: Optional[str] = None,
render_mode: Literal['auto', 'svg', 'webgl'] = 'svg',
**kwargs,
) -> Optional[plotly.graph_objs._figure.Figure]
Docstring:
Plot sample coordinates from a principal components analysis (PCA) as a
plotly scatter plot.
Parameters
----------
data : DataFrame
A dataframe of projections along principal components, one row per
sample. The columns are: `sample_id` is the identifier of the
sample, `partner_sample_id` is the identifier of the sample used
by the partners who contributed it, `contributor` is the partner
who contributed the sample, `country` is the country the sample
was collected in, `location` is the location the sample was
collected in, `year` is the year the sample was collected,
`month` is the month the sample was collected, `latitude` is the
latitude of the location the sample was collected in, `longitude`
is the longitude of the location the sample was collected in,
`sex_call` is the sex of the sample, `sample_set` is the sample
set containing the sample, `release` is the release containing the
sample, `quarter` is the quarter of the year the sample was
collected, `study_id* is the identifier of the study the sample
set containing the sample came from, `study_url` is the URL of the
study the sample set containing the sample came from,
`terms_of_use_expiry_date` is the date the terms of use for the sample
expire, `terms_of_use_url` is the URL of the terms of use for the
sample, `unrestricted_use` indicates whether the sample can be
used without restrictions (e.g., if the terms of use of expired),
`mean_cov` is mean value of the coverage, `median_cov` is the
median value of the coverage, `modal_cov` is the mode of the
coverage, `mean_cov_2L` is mean value of the coverage on 2L,
`median_cov_2L` is the median value of the coverage on 2L,
`mode_cov_2L` is the mode of the coverage on 2L, `mean_cov_2R` is
mean value of the coverage on 2R, `median_cov_2R` is the median
value of the coverage on 2R, `mode_cov_2R` is the mode of the
coverage on 2R, `mean_cov_3L` is mean value of the coverage on 3L,
`median_cov_3L` is the median value of the coverage on 3L,
`mode_cov_3L` is the mode of the coverage on 3L, `mean_cov_3R` is
mean value of the coverage on 3R, `median_cov_3R` is the median
value of the coverage on 3R, `mode_cov_3R` is the mode of the
coverage on 3R, `mean_cov_X` is mean value of the coverage on X,
`median_cov_X` is the median value of the coverage on X,
`mode_cov_X` is the mode of the coverage on X, `frac_gen_cov` is
the faction of the genome covered, `divergence` is the divergence,
`contam_pct` is the percentage of contamination, `contam_LLR` is
the log-likelihood ratio of contamination,
`aim_species_fraction_arab` is the fraction of the gambcolu vs.
arabiensis AIMs that indicated arabiensis (this column is only present
for *Ag3*), `aim_species_fraction_colu` is the fraction of the
gambiae vs. coluzzii AIMs that indicated coluzzii (this column is only
present for *Ag3*), `aim_species_fraction_colu_no2l` is the
fraction of the gambiae vs. coluzzii AIMs that indicated coluzzii, not
including the chromosome arm 2L which contains an introgression (this
column is only present for *Ag3*),
`aim_species_gambcolu_arabiensis` is the taxonomic group assigned by
the gambcolu vs. arabiensis AIMs (this column is only present for
*Ag3*), `aim_species_gambiae_coluzzi` is the taxonomic group
assigned by the gambiae vs. coluzzii AIMs (this column is only present
for *Ag3*), `aim_species_gambcolu_arabiensis` is the taxonomic
group assigned by the combination of both AIMs analyses (this column
is only present for *Ag3*), `country_iso` is the ISO code of the
country the sample was collected in, `admin1_name` is the name of
the first administrative level the sample was collected in,
`admin1_iso` is the ISO code of the first administrative level the
sample was collected in, `admin2_name` is the name of the second
administrative level the sample was collected in, `taxon` is the
taxon assigned to the sample by the combination of the AIMs analysis
and the cohort analysis, `cohort_admin1_year` is the cohort the
sample belongs to when samples are grouped by first administrative
level and year, `cohort_admin1_month` is the cohort the sample
belongs to when samples are grouped by first administrative level and
month, `cohort_admin1_quarter` is the cohort the sample belongs to
when samples are grouped by first administrative level and quarter,
`cohort_admin2_year` is the cohort the sample belongs to when samples
are grouped by second administrative level and year,
`cohort_admin2_month` is the cohort the sample belongs to when samples
are grouped by second administrative level and month,
`cohort_admin2_quarter` is the cohort the sample belong to when
samples are grouped by second administrative level and quarter.
`PC?` is the projection along principal component ? (? being an
integer between 1 and the number of components). There are as many
such columns as components, `pca_fit` is whether this sample was
used for fitting.
x : str, optional, default: 'PC1'
Name of variable to plot on the X axis.
y : str, optional, default: 'PC2'
Name of variable to plot on the Y axis.
color : str or Mapping or None, optional
Name of variable to use to color the markers.
symbol : str or Mapping or None, optional
Name of the variable to use to choose marker symbols.
opacity : float, optional, default: 0.9
Marker opacity.
jitter_frac : float or None, optional, default: 0.02
Randomly jitter points by this fraction of their range.
random_seed : int, optional, default: 42
Random seed used for reproducible down-sampling.
width : int or None, optional, default: 900
Figure width in pixels (px).
height : int or None, optional, default: 600
Figure height in pixels (px).
marker_size : int or float, optional, default: 10
Marker size.
color_discrete_sequence : List or None, optional
Provide a list of colours to use.
color_discrete_map : Mapping or None, optional
Provide an explicit mapping from values to colours.
category_orders : List or Mapping or None, optional
Control the order in which values appear in the legend.
legend_sizing : {'constant', 'trace'}, optional, default: 'constant'
Determines if the legend items symbols scale with their corresponding
"trace" attributes or remain "constant" independent of the symbol size
on the graph.
show : bool, optional, default: True
If true, show the plot. If False, do not show the plot, but return the
figure.
renderer : str or None, optional
The name of the renderer to use.
render_mode : {'auto', 'svg', 'webgl'}, optional, default: 'svg'
The type of rendering backend to use. See also
https://plotly.com/python/webgl-vs-svg/.
**kwargs
Passed through to `px.scatter()`.
Returns
-------
Figure or None
A plotly figure (only returned if show=False).
File: /home/conda/developer/35674e27b19f7c625ba32a1b88449ff45c90b40edb90a065b66c5a9a5388f41c-20250421-195247-360829-93-training-nb-maintenance-mgen-15.2.0/lib/python3.11/site-packages/malariagen_data/anoph/pca.py
Type: method
So we just need to plug in our pca_df
Pandas DataFrame, however, we can also colour our points using the “gambcolu_vs_arab” AIM derived provisional taxon data to help us interpret the plot.
ag3.plot_pca_coords(pca_df, color="aim_species")
With the points coloured by our AIM analysis, we see that principal component 1 (PC1), which describes the most variation in our data, has detected structure driven by species. The separate An. gambiae and An. arabiensis clusters demonstrate a strong degree of reproductive isolation between the species in Uganda.
Equidistant from the two species clusters, we can see a single sample - AC0198-C. The AIM analysis defined this individual as being “intermediate_gambcolu_arabiensis” because it did not carry enough of either An. arabiensis or An. gambiae/coluzzi to be classified as such. The position of this sample, on the same axis of variation (PC1) as our two clusters of species suggests that this individual is a An. gambiae X An. arabiensis hybrid. Furthermore, that the sample falls at the approximate mid-point between the two main clusters, suggests that this individual is an F1 (first filial generation) hybrid. A multi-generational backcrossed individual (>F1), would fall on the same axis but appear closer to one AIM species cluster or the other, depending on which species had been backcrossed into.
In the sample metadata, the “taxon” column contains the results of these PCA-based cryptic species analyses.
In this case (AG1000G-UG), we don’t need to alter our provisional AIM species assignments, and so the “taxon” column will be identical to “aim_species”.
sample_meta_df.query("sample_set == 'AG1000G-UG'").groupby(["aim_species", "taxon"]).size()
aim_species taxon
arabiensis arabiensis 82
gambiae gambiae 207
intermediate_gambcolu_arabiensis unassigned 1
dtype: int64
Signals of cryptic species - Tanzania#
Let’s look at a difference situation, where think there is evidence for a cryptic species that we weren’t previously aware of.
Let’s plot the “gamb_vs_colu” (An. gambiae vs An. coluzzii) AIMs for the Tanzanian sample set, excluding samples assigned as An. arabiensis.
ag3.plot_aim_heatmap(
aims="gamb_vs_colu",
sample_sets="AG1000G-TZ",
sample_query="aim_species != 'arabiensis'"
)
We can see that a number of samples show mixed ancestry (blue, red and yellow genotypes) over the 2 and 3 chromosomes, but appear homozygous for An. gambiae ancestry on the X chromosome.
In the AIM analysis, these samples were provisionally labelled as being “intermediate_gambiae_coluzzii” in the “aim_species” column of the sample metadata.
This result is particularly interesting, as the An. coluzzii species range does not reach this far East. Let’s investigate this sample set using PCA.
pca_df, evr = ag3.pca(
region=region,
n_snps=n_snps,
sample_sets="AG1000G-TZ",
)
We should plot the variance array to get an idea which principal components we should look at.
ag3.plot_pca_variance(evr)
For these samples, the explained variance doesn’t flatten out until PC4, which suggests that we should look at the first three PCs. We could make multiple 2D scatter plots to investigate these three PCs e.g. PC1 vs. PC2 and PC2 vs. PC3. But it would be easier to intepret the results if we could make one 3D scatter plot and visualise all three PCs together. There is a function in malariagen_data
that makes this very simple.
ag3.plot_pca_coords_3d?
Signature:
ag3.plot_pca_coords_3d(
data: pandas.core.frame.DataFrame,
x: str = 'PC1',
y: str = 'PC2',
z: str = 'PC3',
color: Union[str, Mapping, NoneType] = None,
symbol: Union[str, Mapping, NoneType] = None,
jitter_frac: Optional[float] = 0.02,
random_seed: int = 42,
width: Optional[int] = 900,
height: Optional[int] = 600,
marker_size: Union[int, float] = 5,
color_discrete_sequence: Optional[List] = None,
color_discrete_map: Optional[Mapping] = None,
category_orders: Union[List, Mapping, NoneType] = None,
legend_sizing: Literal['constant', 'trace'] = 'constant',
show: bool = True,
renderer: Optional[str] = None,
**kwargs,
) -> Optional[plotly.graph_objs._figure.Figure]
Docstring:
Plot sample coordinates from a principal components analysis (PCA) as a
plotly 3D scatter plot.
Parameters
----------
data : DataFrame
A dataframe of projections along principal components, one row per
sample. The columns are: `sample_id` is the identifier of the
sample, `partner_sample_id` is the identifier of the sample used
by the partners who contributed it, `contributor` is the partner
who contributed the sample, `country` is the country the sample
was collected in, `location` is the location the sample was
collected in, `year` is the year the sample was collected,
`month` is the month the sample was collected, `latitude` is the
latitude of the location the sample was collected in, `longitude`
is the longitude of the location the sample was collected in,
`sex_call` is the sex of the sample, `sample_set` is the sample
set containing the sample, `release` is the release containing the
sample, `quarter` is the quarter of the year the sample was
collected, `study_id* is the identifier of the study the sample
set containing the sample came from, `study_url` is the URL of the
study the sample set containing the sample came from,
`terms_of_use_expiry_date` is the date the terms of use for the sample
expire, `terms_of_use_url` is the URL of the terms of use for the
sample, `unrestricted_use` indicates whether the sample can be
used without restrictions (e.g., if the terms of use of expired),
`mean_cov` is mean value of the coverage, `median_cov` is the
median value of the coverage, `modal_cov` is the mode of the
coverage, `mean_cov_2L` is mean value of the coverage on 2L,
`median_cov_2L` is the median value of the coverage on 2L,
`mode_cov_2L` is the mode of the coverage on 2L, `mean_cov_2R` is
mean value of the coverage on 2R, `median_cov_2R` is the median
value of the coverage on 2R, `mode_cov_2R` is the mode of the
coverage on 2R, `mean_cov_3L` is mean value of the coverage on 3L,
`median_cov_3L` is the median value of the coverage on 3L,
`mode_cov_3L` is the mode of the coverage on 3L, `mean_cov_3R` is
mean value of the coverage on 3R, `median_cov_3R` is the median
value of the coverage on 3R, `mode_cov_3R` is the mode of the
coverage on 3R, `mean_cov_X` is mean value of the coverage on X,
`median_cov_X` is the median value of the coverage on X,
`mode_cov_X` is the mode of the coverage on X, `frac_gen_cov` is
the faction of the genome covered, `divergence` is the divergence,
`contam_pct` is the percentage of contamination, `contam_LLR` is
the log-likelihood ratio of contamination,
`aim_species_fraction_arab` is the fraction of the gambcolu vs.
arabiensis AIMs that indicated arabiensis (this column is only present
for *Ag3*), `aim_species_fraction_colu` is the fraction of the
gambiae vs. coluzzii AIMs that indicated coluzzii (this column is only
present for *Ag3*), `aim_species_fraction_colu_no2l` is the
fraction of the gambiae vs. coluzzii AIMs that indicated coluzzii, not
including the chromosome arm 2L which contains an introgression (this
column is only present for *Ag3*),
`aim_species_gambcolu_arabiensis` is the taxonomic group assigned by
the gambcolu vs. arabiensis AIMs (this column is only present for
*Ag3*), `aim_species_gambiae_coluzzi` is the taxonomic group
assigned by the gambiae vs. coluzzii AIMs (this column is only present
for *Ag3*), `aim_species_gambcolu_arabiensis` is the taxonomic
group assigned by the combination of both AIMs analyses (this column
is only present for *Ag3*), `country_iso` is the ISO code of the
country the sample was collected in, `admin1_name` is the name of
the first administrative level the sample was collected in,
`admin1_iso` is the ISO code of the first administrative level the
sample was collected in, `admin2_name` is the name of the second
administrative level the sample was collected in, `taxon` is the
taxon assigned to the sample by the combination of the AIMs analysis
and the cohort analysis, `cohort_admin1_year` is the cohort the
sample belongs to when samples are grouped by first administrative
level and year, `cohort_admin1_month` is the cohort the sample
belongs to when samples are grouped by first administrative level and
month, `cohort_admin1_quarter` is the cohort the sample belongs to
when samples are grouped by first administrative level and quarter,
`cohort_admin2_year` is the cohort the sample belongs to when samples
are grouped by second administrative level and year,
`cohort_admin2_month` is the cohort the sample belongs to when samples
are grouped by second administrative level and month,
`cohort_admin2_quarter` is the cohort the sample belong to when
samples are grouped by second administrative level and quarter.
`PC?` is the projection along principal component ? (? being an
integer between 1 and the number of components). There are as many
such columns as components, `pca_fit` is whether this sample was
used for fitting.
x : str, optional, default: 'PC1'
Name of variable to plot on the X axis.
y : str, optional, default: 'PC2'
Name of variable to plot on the Y axis.
z : str, optional, default: 'PC3'
Name of variable to plot on the Z axis.
color : str or Mapping or None, optional
Name of variable to use to color the markers.
symbol : str or Mapping or None, optional
Name of the variable to use to choose marker symbols.
jitter_frac : float or None, optional, default: 0.02
Randomly jitter points by this fraction of their range.
random_seed : int, optional, default: 42
Random seed used for reproducible down-sampling.
width : int or None, optional, default: 900
Figure width in pixels (px).
height : int or None, optional, default: 600
Figure height in pixels (px).
marker_size : int or float, optional, default: 5
Marker size.
color_discrete_sequence : List or None, optional
Provide a list of colours to use.
color_discrete_map : Mapping or None, optional
Provide an explicit mapping from values to colours.
category_orders : List or Mapping or None, optional
Control the order in which values appear in the legend.
legend_sizing : {'constant', 'trace'}, optional, default: 'constant'
Determines if the legend items symbols scale with their corresponding
"trace" attributes or remain "constant" independent of the symbol size
on the graph.
show : bool, optional, default: True
If true, show the plot. If False, do not show the plot, but return the
figure.
renderer : str or None, optional
The name of the renderer to use.
**kwargs
Passed through to `px.scatter_3d()`.
Returns
-------
Figure or None
A plotly figure (only returned if show=False).
File: /home/conda/developer/35674e27b19f7c625ba32a1b88449ff45c90b40edb90a065b66c5a9a5388f41c-20250421-195247-360829-93-training-nb-maintenance-mgen-15.2.0/lib/python3.11/site-packages/malariagen_data/anoph/pca.py
Type: method
The parameters are very similar to the 2D PCA plotting function, except here we have another axis parameter “z” that we can assign another principal component to. Let’s colour the points by aim_species
as we did before.
ag3.plot_pca_coords_3d(pca_df, x="PC1", y="PC2", z="PC3", color="aim_species")
Under the hood, malariagen_data
is building these plots with Plotly, so they are interactive. The plot can be rotated by clicking on the plot and moving the mouse and scrolling allows zooming in and out. Holding the mouse pointer over a point will reveal metadata about that sample. Let’s try to interpret this plot one PC at a time.
PC1#
The first principal component has separated all the “arabiensis from all other “aim_species”. This is what we might expect when we sample two species, due to reproductive isolation.
PC2#
The second principal component has pulled all of the “intermediate_gambiae_coluzzii” samples as well as some “gambiae” samples away from all other samples. This is striking, the cluster that includes the intermediate samples is separated on its own axis of variation.
PC3#
The third principal component splits two clusters of “gambiae” samples. If we look at the metadata attached to samples in these two clusters, we can see that in one cluster, samples were collected in Muleba and in the other, the samples were collected in Muheza. This looks like a classic case of geographic isolation between the An. gambiae populations from these two sites.
Interpretation#
There is clearly a lot to unpack when it comes to the population structure we find in this sample set. The strongest signal of variance (PC1) is being driven by reproductive isolation between An. arabiensis and other species.
What is interesting here is that all the “arabiensis” samples are gathered together in a single cluster, but there are two clusters of “gambiae” (separated by PC3). Let’s have a look at some ecological information about the collection sites in Tanzania (the code for this figure can be found here).

Our “gambiae” samples come from Muheza and Muleba, in the East and West of the country respectively. We might conclude that our two clusters of “gambiae” in our PCA are being driven simply by the geographic distance separating the sample sites. However, “arabiensis” have also been collected from sites on either side of the country, and yet all these “arabiensis” samples cluster together.
There is a body of research showing An. arabiensis has a higher aridity tolerance than An. gambiae (e.g. Gray & Bradley (2005)).
One explanation of our results, consistent with these other findings, is that as Muheza and Muleba are separated by a brown (more arid) region that runs approximately North to South, splitting the country in two. The two “gambiae” populations are thus separated by a barrier of unsuitable environment, enabling the evolution of structure (two clusters in our PCA). As An. arabiensis can tolerate the dryer region, geneflow can occur across the country resulting in less within-species structure (one cluster in our PCA).
But what about the other cluster on our PCA, containing intermediate samples separated from other clusters by PC2. From studies of species distributions we know that there are no An. coluzzii in Tanzania and these samples do not look like hybrids between An. gambiae and An. arabiensis as they are not in between the species clusters pulled apart by PC1 (like our hybrid sample in Uganda).
Perhaps these samples are cryptic species, if they are species we don’t have AIMs for they could be labelled as intermediate in the AIM analysis. However, if this cluster represents a cryptic species, why are there four samples labelled as “gambiae” also present in the cluster?
BL0357-C
BL0366-C
BL0370-C
BL0384-C
To dig down into this further, we could make the same PCA plot as before, but this time colour the points by “aim_fraction_colu”, which as it suggests, is the fraction of AIMs from the “gamb_vs_colu” analysis, that suggest An. coluzzii ancestry.
Colour by AIM fraction#
ag3.plot_pca_coords_3d(
pca_df,
x="PC1",
y="PC2",
z="PC3",
color="aim_species_fraction_colu"
)
When we colour the points by fraction coluzzii ancestry we can see that the “gambiae” samples which cluster with the “intermediate_gambiae_coluzzii” actually have relatively high “coluzzii” ancestry (lighter purple) compared with the other “gambiae” (darker purple) in Tanzania. This means that although they had been provisionally labelled as “gambiae” they were at the lower end of the AIM % cut-off.
As mentioned earlier, the AIMs are not 100% informative, and the species cut-offs we use are somewhat arbitrary. This is why we just use the AIM analyses to give provisonal species calls, and then follow up with PCA to give us a more nuanced picture of both structure and taxa.
Labelling cryptic taxa#
With the evidence we have collected on this “intermediate_gambiae_coluzzii” cluster of samples in Tanzania, we would re-label these samples as members of a cryptic taxa. In the sample metadata, the taxon column for these samples has consequently been changed to “gcx3”. This stands for “gambiae complex cryptic taxa 3”, as it is actually the third cryptic taxon we have identified in the Ag3.0 samples.
We can see this re-labelling if we run the same PCA but colour our samples by “taxon”.
ag3.plot_pca_coords_3d(pca_df, x="PC1", y="PC2", z="PC3", color="taxon")
We can also see how some of the AIM species assignments are converted into a cryptic taxon assignment. I.e., there are some samples labelled as either “gambiae” or “intermediate_gambiae_coluzzii” in the aim_species column, which get assigned as “gcx3” in the taxon column.
sample_meta_df.query("sample_set == 'AG1000G-TZ'").groupby(["aim_species", "taxon"]).size()
aim_species taxon
arabiensis arabiensis 225
gambiae gambiae 64
gcx3 4
intermediate_gambiae_coluzzii gcx3 7
dtype: int64
The importance of identifying cryptic taxa#
Our principal component analysis has highlighted population structure that appears to be due to a previously-unknown cryptic species in Tanzania. Can we use our genomic data to identify operationally important differences with this taxa?
If we remember module 4 of the first workshop, we used the malariagen_data
package to plot gene SNP allele frequencies by cohorts, as a heatmap. Let’s do that for the Vgsc gene (target of pyrethroid insecticides) in our Tanzanian “gambiae” and “gcx3” cohorts.
ag3.aa_allele_frequencies?
Signature:
ag3.aa_allele_frequencies(
transcript: str,
cohorts: Union[str, Mapping[str, str]],
sample_query: Optional[str] = None,
sample_query_options: Optional[dict] = None,
min_cohort_size: Optional[int] = 10,
site_mask: Optional[str] = None,
sample_sets: Union[str, Sequence[str], NoneType] = None,
drop_invariant: bool = True,
include_counts: bool = False,
chunks: Union[int, str, Tuple[Union[int, str], ...], Callable[[Tuple[int, ...]], Union[int, str, Tuple[Union[int, str], ...]]]] = 'native',
inline_array: bool = True,
) -> pandas.core.frame.DataFrame
Docstring:
Compute amino acid substitution frequencies for a gene transcript.
Parameters
----------
transcript : str
Gene transcript identifier.
cohorts : str or Mapping[str, str]
Either a string giving the name of a predefined cohort set (e.g.,
"admin1_month") or a dict mapping custom cohort labels to sample
queries.
sample_query : str or None, optional
A pandas query string to be evaluated against the sample metadata, to
select samples to be included in the returned data.
sample_query_options : dict or None, optional
A dictionary of arguments that will be passed through to pandas
query() or eval(), e.g. parser, engine, local_dict, global_dict,
resolvers.
min_cohort_size : int or None, optional, default: 10
Minimum cohort size. Raise an error if the number of samples is less
than this value.
site_mask : str or None, optional
Which site filters mask to apply. See the `site_mask_ids` property for
available values.
sample_sets : sequence of str or str or None, optional
List of sample sets and/or releases. Can also be a single sample set
or release.
drop_invariant : bool, optional, default: True
If True, drop variants not observed in the selected samples.
include_counts : bool, optional, default: False
Include columns with allele counts and number of non-missing allele
calls (nobs).
chunks : int or str or tuple of int or str or Callable[[typing.Tuple[int, ...]], int or str or tuple of int or str], optional, default: 'native'
Define how input data being read from zarr should be divided into
chunks for a dask computation. If 'native', use underlying zarr
chunks. If a string specifying a target memory size, e.g., '300 MiB',
resize chunks in arrays with more than one dimension to match this
size. If 'auto', let dask decide chunk size. If 'ndauto', let dask
decide chunk size but only for arrays with more than one dimension. If
'ndauto0', as 'ndauto' but only vary the first chunk dimension. If
'ndauto1', as 'ndauto' but only vary the second chunk dimension. If
'ndauto01', as 'ndauto' but only vary the first and second chunk
dimensions. Also, can be a tuple of integers, or a callable which
accepts the native chunks as a single argument and returns a valid
dask chunks value.
inline_array : bool, optional, default: True
Passed through to dask `from_array()`.
Returns
-------
DataFrame
A dataframe of amino acid allele frequencies, one row per variant.
The variants are indexed by their amino acid change, their contig,
their position. The columns are split into two categories: there is
1 column for each cohort containing the frequency of the amino acid
change within the cohort, additionally there is a column `max_af`
containing the maximum frequency of the amino acide change accross
all cohorts; finally, there are 9 columns describing the variant
allele: `transcript` contains the gene transcript used for this
analysis, `effect` is the effect of the allele change, `impact`is
the impact of the allele change, `ref_allele` is the reference
allel, `alt_allele` is the alternate allele, `aa_pos` is the
position of the amino acid, `ref_aa` is the reference amino acid,
`alt_aa` is the altered amino acid with the varaint allele, and
`label` is the label of the variant allele.
Notes
-----
Cohorts with fewer samples than `min_cohort_size` will be excluded
from output data frame.
File: /home/conda/developer/35674e27b19f7c625ba32a1b88449ff45c90b40edb90a065b66c5a9a5388f41c-20250421-195247-360829-93-training-nb-maintenance-mgen-15.2.0/lib/python3.11/site-packages/malariagen_data/anoph/snp_frq.py
Type: method
aa_allele_freqs_df = ag3.aa_allele_frequencies(
transcript="AGAP004707-RD",
cohorts="admin1_year",
sample_sets="AG1000G-TZ",
sample_query="taxon != 'arabiensis'",
)
Load genome features: ⠙ (0:00:00.09)
aa_filt_df = aa_allele_freqs_df.query("max_af > 0.05")
ag3.plot_frequencies_heatmap(aa_filt_df)
The heatmap shows that the gcx3 cohort of samples (“TZ-25-gcx3-2013”) does not carry any pyrethoid target-site resistance alleles, whereas the “gambiae” cohort from the same collection year and region (“TZ-25-gamb-2013”) has a 41% allele frequency of the kdr (knock-down resistance) allele L995S.
This may mean the gcx3 mosquitoes have very different insecticide resistance profiles to the gambiae mosquitoes from the same location. Consequently, it is important to consider them separately for further investigations into the best vector control strategies.
Taxon analysis workflow - recap#
1. AIMs analysis#
First compute AIMs and use these to make provisional species calls.
Look for sample sets where there’s evidence of “intermediate” samples.
2. PCA analysis#
Run PCAs for each suspect sample set. Adding sample sets from nearby countries with clean species calls can help as a comparison.
Look for distinct clusters on PCA plots containing the “intermediate” samples.
3. Interpret#
Do these intermediate samples fall between clusters of known species?
If yes, they could be hybrids of the two known species.
Is there a PCA dimension which pulls the cluster out uniquely?
If yes, they could belong to a cryptic taxon
Well done!#
In this module we have followed MalariaGEN’s mosquito taxon analysis pipeline and the reasoning behind it.
Run and plotted both AIM and PCA analyses, and learnt how to interpret the results.
Learnt how to detect cryptic taxa.
Exercises#
English#
Open this notebook in Google Colab and run it for yourself from top to bottom. As you run through the notebook, cell by cell, think about what each cell is doing.
Hint: To open the notebook in Google Colab, click the rocket icon at the top of the page, then select “Colab” from the drop-down menu.
Next, run a taxon analysis workflow for Guinea Bissau - “AG1000G-GW”.
Plot the “gamb_vs_colu” AIMs for Guinea Bissau.
How might we interpret this plot?
Run and plot a PCA for Guinea Bissau.
How many PCs should we investigate?
How might we interpret the PCA plot? Hint: try colouring markers by “aim_species_fraction_colu”
Add Burkina Faso A sample set to the PCA analysis - [“AG1000G-GW, “AG1000G-BF-A”].
Does this help clarify the situation, if so why?
Could we have a hybrid cluster or a cryptic taxon cluster?
Explain why you think that.
Français#
Ouvrez ce notebook dans Google Colab et exécutez-le vous-même du début à la fin. Pendant que vous exécutez le notebook, cellule par cellule, considérez ce que chaque cellule fait.
Indice: Ouvrir ce notebook dans Google Colab, cliquez sur l’icône “fusée” au sommet de la page et sélectionner “Colab” dans le menu.
Ensuite, exécuter l’analyse des taxons pour la Guinée-Bissau – “AG1000G-GW”.
Afficher le diagramme des AIMs “gamb_vs_colu” pour Guinea-Bissau
Comment interpreter ce diagramme?
Exécuter et afficher la PCA pour la Guinée
Combien de PCs doivent être étudiés?
Comment interpréter le diagramme de la PCA? Indice: essayer de colorer les marqueurs en fonction de “aim_species_fraction_colu”.
Ajouter l’ensemble de données Burkina Faso A à l’analyse PCA – [“AG1000G-GW”, “AG1000G-BF-A”].
Est-ce que cela aide à rendre la situation plus claire? Si oui, pourquoi?
Est-ce vous observez un groupe d’hybrides ou un groupe d’un taxon cryptique?
Pourquoi pensez-vous cela?