
Training course in data analysis for genomic surveillance of African malaria vectors
Unrestricted and surveillance metadata flags#
Theme: Data
DISCLAIMER: This is work in progress and subject to change and updates.
This module provides an introduction to two metadata flags that have recently been added to the Ag3 and Af1 resources. These flags are used to filter samples that are “unrestricted”, i.e., available for anyone to use freely, and “surveillance”, i.e., samples that are deemed to be appropriate for a surveillance purpose. What both of these terms mean will be examined in more detail shortly.
This module will not contain a lot of technical code, but some functions, such as sample_metadata and cnv_hmm, will be used as examples. We will not detail what these functions do in this notebook, and we will not be interested in the results of their execution, only in the impact that the two flags have on them. More details on the use of these functions and their results can be found in the Introductory course. The use of sample_metadata is, for instance, explained in Workshop 1 Module 2, while more can be learned about CNVs in Workshop 2. We will also use some principal components analyses (PCA). Examples and explanations of PCA can be found in Workshop 3 Module 4.
Learning objectives#
After completing this module, you will be able to:
Explain what it means for a sample set to be “unrestricted”
Enumerate a few sources of bias that would prevent a sample from being “surveillance”
Understand when to use the unrestricted_use_only and surveillance_use_only flags for your analyses
Use these flags when needed
What is an “unrestricted” sample set and why are some sample sets not “unrestricted”?#
Put simply, a sample set is “unrestricted” if anyone can use it however they so choose.
All the sample sets that are accessible using either Ag3 or Af1 are open, but terms of use apply. These terms of use are available on the MalariaGEN website and may differ from sample set to sample set. For example, the terms of use for the Anopheles gambiae Genomic Surveillance project (which can be found here) state that there is a “publication embargo [which] will expire 24 months after the data is integrated into the Malaria Genome Vector Observatory data repository, or earlier, if the project partner agrees to remove the embargo before the expiry date.”
Checking when a dataset was integrated into a data repository, or whether the project partner waived the embargo, is onerous for the user, so a flag, unrestricted_use, has been created. It is True if the sample set can be used without any restriction and False otherwise. To be clear, users can include sample sets with a False unrestricted_use flag in their analyses, but they must be aware that terms of use apply and respect them.
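As an illustration of the check that the flag spares users from doing by hand, here is a minimal sketch. It assumes that an embargo counts as lifted once its expiry date has passed. The function name embargo_has_expired, the strict comparison and the reference day are hypothetical choices made for this example, not the actual rule used to set unrestricted_use, and the sketch does not cover the case where a partner waives the embargo early. The two expiry dates are taken from the terms_of_use_expiry_date column shown later in this module.

```python
from datetime import date


def embargo_has_expired(expiry_date: date, today: date) -> bool:
    """Return True if `expiry_date` lies strictly before `today` (assumed rule)."""
    return expiry_date < today


# Expiry dates as listed in the `terms_of_use_expiry_date` column in this module.
expiry_dates = {
    "1177-VO-ML-LEHMANN-VMF00004": date(2025, 11, 17),
    "1274-VO-KE-KAMAU-VMF00246": date(2025, 5, 12),
}

# Hypothetical reference day, fixed so that the example is reproducible.
reference_day = date(2025, 8, 1)

embargo_status = {
    sample_set: embargo_has_expired(expiry, reference_day)
    for sample_set, expiry in expiry_dates.items()
}

for sample_set, expired in embargo_status.items():
    print(sample_set, "embargo expired:", expired)
```

In practice the unrestricted_use column should be preferred over any date arithmetic of this kind, since it is maintained alongside the data.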
What is a “surveillance” sample and why are some samples not “surveillance”?#
There are many ways to use the Ag3 and Af1 data, but one of the most important is surveillance of the vector population. A surveillance use of the data is a study of how the population is behaving and evolving, for instance in the presence of an intervention. Let’s say a new bednet is widely distributed in a location. One might be interested in knowing whether the vector population is experiencing a decline in size due to the presence of the new bednet (i.e., a change of population demographics), whether a sweep occurs to provide resistance (i.e., a change of population genomics), or whether the population becomes more exophilic or changes its biting patterns (i.e., a change of population behaviour). Another key question is the time scale of such a change and, potentially, its reversibility.
In order to be useful to answer these many questions, the data needs to be an accurate representation of the underlying population, i.e. a sample should not differ from its population too much. If we deem that a sample is a fair representative of the population it came from, it is called a “surveillance” sample. Otherwise, it is not a “surveillance” sample because it is biased.
There are many different reasons why a sample may be biased. One obvious reason is that the population the sample came from cannot be easily ascertained. For instance, a sample whose taxon is undetermined cannot be used for surveillance, as it would likely be significantly different from every other sample no matter which population it was assigned to. Similarly, an accurate estimate of when and where a sample was collected is required. The space occupied by a population of mosquitoes varies with time, and the exact range of movement of a population is difficult to measure, but if we don’t know where a sample was collected, we cannot assign it to a population with any degree of certainty. Something similar is true temporally. Two samples from the same location collected during different months or years are likely to belong to populations that are connected, but the demographics and genomics of a population can change from month to month (e.g., populations of mosquitoes tend to be smaller during the dry seasons than during the rainy seasons), so a month and year of collection are required for a sample to be considered “surveillance”.
Many samples in the Ag3 and Af1 resources were collected in order to perform bioassays. In theory, if all samples used for a bioassay were then sequenced (and assuming they were all wild-caught), the bias would be fairly minimal but, generally, only the most resistant and the most susceptible samples of a bioassay are sequenced. This means that these samples are heavily biased and their genomes are thus not a good representation of what the wild population looks like. All of these samples are thus considered not to be “surveillance”.
The question of bias is not trivial. Every measurement is biased in some way, and the acceptable bias depends heavily on the questions one wants to ask. For instance, all “surveillance” samples in Ag3 belong to the Anopheles gambiae complex. If one were interested in biodiversity in Africa, or even just in the biodiversity of insects, it would be an extremely biased resource. We, however, assume that users are aware of this bias and that their questions relate to the Anopheles gambiae complex.
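The criteria discussed so far can be summarised as a small checklist. The sketch below is purely illustrative: the record layout, the field selected_after_bioassay and the function surveillance_issues are hypothetical names invented for this example, and the actual rules used to set the flag in Ag3 and Af1 may be more detailed.

```python
from typing import List, Optional, TypedDict


class SampleRecord(TypedDict):
    # Hypothetical record layout, used only for this illustration.
    sample_id: str
    taxon: Optional[str]
    latitude: Optional[float]
    longitude: Optional[float]
    year: Optional[int]
    month: Optional[int]
    selected_after_bioassay: bool


def surveillance_issues(sample: SampleRecord) -> List[str]:
    """List the reasons, among those discussed above, that argue against "surveillance"."""
    issues = []
    if sample["taxon"] is None:
        issues.append("taxon undetermined")
    if sample["latitude"] is None or sample["longitude"] is None:
        issues.append("collection location unknown")
    if sample["year"] is None or sample["month"] is None:
        issues.append("month or year of collection unknown")
    if sample["selected_after_bioassay"]:
        issues.append("selected for sequencing after a bioassay")
    return issues


# Two made-up samples: one wild-caught with full metadata, one from a bioassay.
wild_sample: SampleRecord = {
    "sample_id": "example-1", "taxon": "gambiae",
    "latitude": -0.5, "longitude": 34.5,
    "year": 2019, "month": 6, "selected_after_bioassay": False,
}
bioassay_sample: SampleRecord = {
    "sample_id": "example-2", "taxon": "gambiae",
    "latitude": -0.5, "longitude": 34.5,
    "year": 2019, "month": None, "selected_after_bioassay": True,
}

print(surveillance_issues(wild_sample))
print(surveillance_issues(bioassay_sample))
```

A sample with an empty list of issues would be a candidate for “surveillance” under these assumed criteria; any non-empty list explains why it would not be.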
Furthermore, some bias is considered to be acceptable. Many “surveillance” samples in Ag3 or Af1 were collected using human-landing catch. This means that the samples at some point landed on a human, probably with the intent of biting them, i.e. most if not all samples are going to be females. This is not considered to be a disqualifying bias for surveillance but it could hinder someone wanting to study male genomes more specifically. Similarly, many “surveillance” samples were collected using pyrethroid spraying. This method only collects samples that are indoor resting and some taxa, which may be more exophilic, may be under-represented so these data cannot be used to produce an accurate picture of species distribution. This is not a typical use of our data and it is generally assumed that exophilic taxa are more minor vectors but it is a source of bias that may need to be taken into account for some analyses.
Normal setup#
Let us first look at the normal setup of Ag3. We could have used Af1 in the same way but we assume that users are, at this point, more familiar with Ag3.
%pip install -q --no-warn-conflicts malariagen_data
import malariagen_data
import os
import plotly.io as pio
pio.renderers.default = "notebook+colab"
try:
    # if running on colab, mount Google Drive
    from google.colab import drive
    drive.mount('drive')
except ImportError:
    pass
results_dir = "drive/MyDrive/Colab Data/ag3-structure-results"
os.makedirs(results_dir, exist_ok=True)
ag3_default = malariagen_data.Ag3(results_cache=results_dir)
We will start by looking at how many sample sets are “unrestricted”.
ag3_default.sample_sets()['unrestricted_use'].value_counts()
unrestricted_use
True 48
False 41
Name: count, dtype: Int64
We see that more than half the sample sets are “unrestricted”. We can look at the first few “restricted” sample sets.
ag3_default.sample_sets().query('unrestricted_use == False').head()
| | sample_set | sample_count | study_id | study_url | terms_of_use_expiry_date | terms_of_use_url | release | unrestricted_use |
|---|---|---|---|---|---|---|---|---|
| 28 | 1177-VO-ML-LEHMANN-VMF00004 | 647 | 1177-VO-ML-LEHMANN | https://www.malariagen.net/partner_study/1177-... | 2025-11-17 | https://malariagen.github.io/vector-data/ag3/a... | 3.1 | False |
| 29 | 1188-VO-NIANG-NIEL-SN-2304-VMF00259 | 660 | 1188-VO-SN-NIANG | https://www.malariagen.net/network/where-we-wo... | 2026-06-24 | https://malariagen.github.io/vector-data/ag3/a... | 3.10 | False |
| 30 | 1270-VO-MULTI-PAMGEN-VMF00244 | 252 | 1270-VO-MULTI-PAMGEN | https://www.malariagen.net/network/where-we-wo... | 2026-06-24 | https://malariagen.github.io/vector-data/ag3/a... | 3.10 | False |
| 31 | 1330-VO-GN-LAMA-VMF00250 | 180 | 1330-VO-GN-LAMA | https://www.malariagen.net/network/where-we-wo... | 2026-06-24 | https://malariagen.github.io/vector-data/ag3/a... | 3.10 | False |
| 33 | 1296-VO-BF-DIABATE-VMF00272 | 665 | 1296-VO-BF-DIABATE | https://www.malariagen.net/network/where-we-wo... | 2026-09-06 | https://malariagen.github.io/vector-data/ag3/a... | 3.11 | False |
The first one that we see is ‘1177-VO-ML-LEHMANN-VMF00004’.
Each sample set may contain a mix of “surveillance” and “non-surveillance” samples. This is the case, for example, when some but not all of the samples in a study were bioassayed. Let us look at an example sample set: 1274-VO-KE-KAMAU-VMF00246.
mixed_sample_set = '1274-VO-KE-KAMAU-VMF00246'
ag3_default.sample_sets().query("sample_set == '1274-VO-KE-KAMAU-VMF00246'")
| | sample_set | sample_count | study_id | study_url | terms_of_use_expiry_date | terms_of_use_url | release | unrestricted_use |
|---|---|---|---|---|---|---|---|---|
| 81 | 1274-VO-KE-KAMAU-VMF00246 | 564 | 1274-VO-KE-KAMAU | https://www.malariagen.net/partner_study/1274-... | 2025-05-12 | https://malariagen.github.io/vector-data/ag3/a... | 3.9 | True |
We can see that this sample set is unrestricted.
ag3_default.sample_metadata(sample_sets=mixed_sample_set)['is_surveillance'].value_counts()
is_surveillance
False 478
True 86
Name: count, dtype: Int64
We see that a majority of the samples in this sample set are “non-surveillance” but a sizeable minority (86 of 564) are “surveillance”. Let us run a few principal components analyses to see whether the “surveillance” and “non-surveillance” samples can be told apart. We will use the same region and number of SNPs as in Workshop 3 Module 4.
region = "3L:15,000,000-41,000,000"
n_snps = 100_000
ag3_pca_df, evr = ag3_default.pca(region=region, sample_sets=mixed_sample_set, n_snps=n_snps)
ag3_default.plot_pca_coords(ag3_pca_df, color="is_surveillance")
The principal components seem to be driven by taxon more than by “is_surveillance”, so we will select only the An. gambiae s.s. samples.
ag3_pca_gam_df, evr_gam = ag3_default.pca(region=region, sample_sets=mixed_sample_set, sample_query = "taxon == 'gambiae'", n_snps=n_snps)
ag3_default.plot_pca_coords(ag3_pca_gam_df, color="is_surveillance")
The principal components now seem to be driven by outliers more than “is_surveillance”. Let us add more data from ‘AG1000G-KE’, some wild samples also from Kenya, and ‘AG1000G-X’, some crosses.
ag3_pca_gam_ext_df, evr_gam_ext = ag3_default.pca(region=region, sample_sets=[mixed_sample_set, 'AG1000G-KE', 'AG1000G-X'], sample_query = "taxon in ['gambiae', 'unassigned']", n_snps=n_snps)
ag3_default.plot_pca_coords(ag3_pca_gam_ext_df, color='is_surveillance', symbol="sample_set")
Unrestricted setup#
Let us now set up things so that we only access the unrestricted sample sets.
# Construct an Ag3 object using the `unrestricted_use_only` setting.
ag3_unrestricted = malariagen_data.Ag3(unrestricted_use_only=True)
Let us check that only the unrestricted sample sets are part of this resource.
# See the value counts for `unrestricted_use` for all of the sample sets relevant to this object.
ag3_unrestricted.sample_sets()['unrestricted_use'].value_counts()
unrestricted_use
True 48
Name: count, dtype: Int64
We can also check how many samples are surveillance.
# See the value counts for `is_surveillance` for all of the samples relevant to this object.
ag3_unrestricted.sample_metadata()['is_surveillance'].value_counts()
is_surveillance
True 6666
False 1963
Name: count, dtype: Int64
We can try to access a restricted sample set.
# Note that an error will be raised if you try to access data for a restricted sample set while using the `unrestricted_use_only` setting. For example:
restricted_sample_set = '1177-VO-ML-LEHMANN-VMF00004'
try:
    ag3_unrestricted.sample_metadata(sample_sets=restricted_sample_set)
except ValueError as error_message:
    print(error_message)
Sample set '1177-VO-ML-LEHMANN-VMF00004' not found. This sample set might be unavailable or irrelevant with respect to settings.
Surveillance setup#
It is also possible to set up the resource to only access surveillance samples.
# Construct an Ag3 object using the `surveillance_use_only` setting.
ag3_surveillance = malariagen_data.Ag3(surveillance_use_only=True)
We can look at the sample sets that have at least one sample with is_surveillance set to True.
# See the number of the sample sets relevant to this object.
# Note that only sample sets that have at least one sample with `is_surveillance` set to `True` are returned for this object.
len(ag3_surveillance.sample_sets())
77
ag3_surveillance.sample_metadata()['is_surveillance'].value_counts()
is_surveillance
True 19039
Name: count, dtype: Int64
Let us look at the samples in the mixed sample set we identified before.
ag3_surveillance.sample_metadata(sample_sets=mixed_sample_set)['is_surveillance'].value_counts()
is_surveillance
True 86
Name: count, dtype: Int64
We can also try to look at a sample set that doesn’t contain any surveillance sample.
# See the sample metadata for all the samples in a sample set that has no surveillance samples.
# Note that an error will be raised if you try to access data for a non-surveillance sample set while using the `surveillance_use_only` setting. For example:
non_surveillance_sample_set = 'AG1000G-X'
try:
    ag3_surveillance.sample_metadata(sample_sets=non_surveillance_sample_set)
except ValueError as error_message:
    print(error_message)
Sample set 'AG1000G-X' not found. This sample set might be unavailable or irrelevant with respect to settings.
Combined setup#
It is also possible to combine the surveillance and unrestricted setups to access only the surveillance samples from unrestricted sample sets.
# Construct an Ag3 object using both the `unrestricted_use_only` setting and the `surveillance_use_only` setting.
ag3_unrestricted_surveillance = malariagen_data.Ag3(unrestricted_use_only=True, surveillance_use_only=True)
We can check that only unrestricted sample sets and surveillance samples are available.
# See the value counts for `unrestricted_use` for all of the sample sets relevant to this object.
# Note that only sample sets with `unrestricted_use` set to `True` are returned for this object.
ag3_unrestricted_surveillance.sample_sets()['unrestricted_use'].value_counts()
unrestricted_use
True 41
Name: count, dtype: Int64
ag3_unrestricted_surveillance.sample_metadata()['is_surveillance'].value_counts()
is_surveillance
True 6666
Name: count, dtype: Int64
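The behaviour of the default, unrestricted, surveillance and combined setups shown above can be modelled with a toy example. The sketch below is not how malariagen_data is implemented: ToySample, ToyResource and the sample set names are hypothetical, and the only assumptions are the ones suggested by the outputs above, namely that samples are filtered by each active flag, that a sample set is listed only if at least one of its samples remains, and that requesting a sample set that is no longer listed raises a ValueError.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToySample:
    sample_id: str
    sample_set: str
    is_surveillance: bool


class ToyResource:
    """Toy stand-in for a filtered resource; not the malariagen_data implementation."""

    def __init__(self, unrestricted_use, samples,
                 unrestricted_use_only=False, surveillance_use_only=False):
        # Keep a sample only if it passes every filter that is switched on.
        self._samples = [
            s for s in samples
            if (not unrestricted_use_only or unrestricted_use[s.sample_set])
            and (not surveillance_use_only or s.is_surveillance)
        ]

    def sample_sets(self):
        # A sample set is listed only if at least one of its samples survived.
        return sorted({s.sample_set for s in self._samples})

    def sample_metadata(self, sample_set):
        if sample_set not in self.sample_sets():
            raise ValueError(f"Sample set {sample_set!r} not found.")
        return [s for s in self._samples if s.sample_set == sample_set]


# Hypothetical sample sets and samples.
unrestricted_use = {"open-mixed": True, "embargoed": False, "open-crosses": True}
samples = [
    ToySample("s1", "open-mixed", True),
    ToySample("s2", "open-mixed", False),
    ToySample("s3", "embargoed", True),
    ToySample("s4", "open-crosses", False),
]

toy_default = ToyResource(unrestricted_use, samples)
toy_unrestricted = ToyResource(unrestricted_use, samples, unrestricted_use_only=True)
toy_surveillance = ToyResource(unrestricted_use, samples, surveillance_use_only=True)
toy_combined = ToyResource(
    unrestricted_use, samples, unrestricted_use_only=True, surveillance_use_only=True
)

for name, resource in [
    ("default", toy_default),
    ("unrestricted", toy_unrestricted),
    ("surveillance", toy_surveillance),
    ("combined", toy_combined),
]:
    print(name, resource.sample_sets())
```

In this toy model the combined setup keeps only the sample set that is both unrestricted and contains at least one surveillance sample, which mirrors the smaller counts seen with ag3_unrestricted_surveillance.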
Which flags have been set can be seen by looking at the resource.
ag3_unrestricted_surveillance
| MalariaGEN Ag3 API client | |
|---|---|
| Please note that data are subject to terms of use, for more information see the MalariaGEN website or contact support@malariagen.net. See also the Ag3 API docs. | |
| Storage URL | gs://vo_agam_release_master_us_central1 |
| Data releases available | 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 3.10, 3.11, 3.12, 3.13, 3.14 |
| Results cache | None |
| Cohorts analysis | 20250502 |
| AIM analysis | 20220528 |
| Site filters analysis | dt_20200416 |
| Software version | malariagen_data 15.2.2.post56+4b9904f |
| Client location | Iowa, United States (Google Cloud us-central1) |
| Data filtered for unrestricted use only | True |
| Data filtered for surveillance use only | True |
| Relevant data releases | 3.0, 3.3, 3.4, 3.5, 3.7, 3.8, 3.9, 3.10 |
In addition to the sample_sets and the sample_metadata functions, other functions and properties in the package also return data differently depending on the setting of the unrestricted_use_only and surveillance_use_only parameters. As a general rule, all functions and properties that appear in the API documentation should honour these settings, but you should check first. Be aware that so-called “private” functions, which are used internally in the package and have an underscore prefix, e.g. _surveillance_flags, might return unfiltered data regardless of the settings.
Let us, for instance, look at the set of releases in each configuration of Ag3. The releases property lists the releases with at least one sample set in the resource.
ag3_default.releases
('3.0',
'3.1',
'3.2',
'3.3',
'3.4',
'3.5',
'3.6',
'3.7',
'3.8',
'3.9',
'3.10',
'3.11',
'3.12',
'3.13',
'3.14')
ag3_surveillance.releases
('3.0',
'3.1',
'3.3',
'3.4',
'3.5',
'3.6',
'3.7',
'3.8',
'3.9',
'3.10',
'3.11',
'3.13',
'3.14')
ag3_unrestricted.releases
('3.0', '3.2', '3.3', '3.4', '3.5', '3.7', '3.8', '3.9', '3.10')
ag3_unrestricted_surveillance.releases
('3.0', '3.3', '3.4', '3.5', '3.7', '3.8', '3.9', '3.10')
ag3_default contains all the currently available releases while ag3_surveillance, ag3_unrestricted and ag3_unrestricted_surveillance contain some subsets of the available releases.
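To see exactly which releases drop out of each configuration, the tuples printed above can be compared directly. The sketch below copies those outputs into plain Python tuples; the helper missing_from is a hypothetical name introduced for this example.

```python
# Release tuples copied from the outputs shown above.
default_releases = (
    "3.0", "3.1", "3.2", "3.3", "3.4", "3.5", "3.6", "3.7",
    "3.8", "3.9", "3.10", "3.11", "3.12", "3.13", "3.14",
)
surveillance_releases = (
    "3.0", "3.1", "3.3", "3.4", "3.5", "3.6", "3.7",
    "3.8", "3.9", "3.10", "3.11", "3.13", "3.14",
)
unrestricted_releases = ("3.0", "3.2", "3.3", "3.4", "3.5", "3.7", "3.8", "3.9", "3.10")
combined_releases = ("3.0", "3.3", "3.4", "3.5", "3.7", "3.8", "3.9", "3.10")


def missing_from(subset):
    """Releases present in the default configuration but absent from `subset`."""
    return [release for release in default_releases if release not in subset]


print("Not in surveillance:", missing_from(surveillance_releases))
print("Not in unrestricted:", missing_from(unrestricted_releases))

# Releases that appear in both the surveillance and the unrestricted configurations.
shared = [
    release for release in default_releases
    if release in surveillance_releases and release in unrestricted_releases
]
print("Combined equals the shared releases:", shared == list(combined_releases))
```

Here the combined configuration contains exactly the releases shared by the other two. That need not hold in general: a release could hold an unrestricted sample set with no surveillance samples alongside a restricted sample set with surveillance samples, in which case it would appear in both lists but not in the combined one.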
Depending on which flags are set, the number of samples returned during an analysis may vary greatly. Let us look at an example showing the CNV HMMs for ‘1274-VO-KE-KAMAU-VMF00246’ with default configuration and the surveillance one.
# The `cnv_hmm` function will return samples depending on the object's `surveillance_use_only` setting. For example:
cnv_region = '2R:28,480,000-28,490,000'
ag3_cnv_hmm_df = ag3_default.cnv_hmm(region=cnv_region, sample_sets=mixed_sample_set).to_dataframe()
ag3_surveillance_cnv_hmm_df = ag3_surveillance.cnv_hmm(region=cnv_region, sample_sets=mixed_sample_set).to_dataframe()
print('Samples returned by the default object:', len(ag3_cnv_hmm_df['sample_id'].unique()))
print('Samples returned by the surveillance object:', len(ag3_surveillance_cnv_hmm_df['sample_id'].unique()))
Samples returned by the default object: 359
Samples returned by the surveillance object: 49
We get very different numbers of samples and, if we display the results, we will also get different plots.
ag3_default.plot_cnv_hmm_heatmap(
    region=cnv_region,
    sample_sets=mixed_sample_set,
    row_height=5
);
ag3_surveillance.plot_cnv_hmm_heatmap(
    region=cnv_region,
    sample_sets=mixed_sample_set,
    row_height=5
);
The general distribution of results is broadly similar between the two plots. One can observe roughly three different patterns: no amplification, an amplification that covers the whole region, and an amplification that covers only Cyp6aa1.
One may be interested to see if the method of collection of these samples has an impact on their CNVs. This information is not available in Ag3 but one may want to add it. Because we do not have access to this information, we will simulate it by using random values. Obviously, this should never be done in actual analyses!
# Extra metadata can be defined for each of the samples in the mixed sample set which are surveillance.
import pandas as pd
import numpy as np
mixed_sample_metadata_df = ag3_surveillance.sample_metadata(sample_sets=mixed_sample_set)
collections_df = pd.DataFrame(
    {
        "sample_id": mixed_sample_metadata_df["sample_id"],
        "collection method": np.random.choice(
            ["HLC", "pyrethroid spray", "Shannon trap"], len(mixed_sample_metadata_df)
        ),
    }
)
collections_df['collection method'].value_counts()
collection method
Shannon trap 37
pyrethroid spray 26
HLC 23
Name: count, dtype: int64
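Because the values above are drawn at random without a seed, the counts will differ each time the cell is run. If reproducible simulated values are wanted, a seeded generator can be used. The sketch below uses only the Python standard library rather than NumPy, so it will not reproduce the counts shown above; the function simulate_collection_methods and the seed value are hypothetical.

```python
import random
from collections import Counter

methods = ["HLC", "pyrethroid spray", "Shannon trap"]
n_samples = 86  # number of surveillance samples in the mixed sample set


def simulate_collection_methods(n, seed):
    """Draw `n` collection methods from a private, seeded random generator."""
    rng = random.Random(seed)  # does not touch the global random state
    return [rng.choice(methods) for _ in range(n)]


first_run = simulate_collection_methods(n_samples, seed=42)
second_run = simulate_collection_methods(n_samples, seed=42)

print(Counter(first_run))
print("Same values on both runs:", first_run == second_run)
```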
This information can then be added to the metadata, in this case to the default resource, which also contains the samples that are not surveillance.
ag3_default.add_extra_metadata(collections_df)
We can check that none of the samples that are not surveillance has a collection method and that we now have access to the collection methods for the surveillance samples.
not_surveillance_mixed_sample_metadata_df = ag3_default.sample_metadata(sample_sets=mixed_sample_set, sample_query='not is_surveillance')
not_surveillance_mixed_sample_metadata_df['collection method'].value_counts()
Series([], Name: count, dtype: int64)
mixed_sample_metadata_df = ag3_default.sample_metadata(sample_sets=mixed_sample_set)
mixed_sample_metadata_df['collection method'].isnull().any()
True
mixed_sample_metadata_df['collection method'].value_counts()
collection method
Shannon trap 37
pyrethroid spray 26
HLC 23
Name: count, dtype: int64
It is generally less error-prone to make sure that we are always working with the same number of samples. In this case, it would make more sense to add the extra metadata to ag3_surveillance.
ag3_surveillance.add_extra_metadata(collections_df)
We get the same result for surveillance samples.
surveillance_mixed_sample_metadata_df = ag3_surveillance.sample_metadata(sample_sets=mixed_sample_set)
surveillance_mixed_sample_metadata_df['collection method'].value_counts()
collection method
Shannon trap 37
pyrethroid spray 26
HLC 23
Name: count, dtype: int64
But no sample is missing a value.
surveillance_mixed_sample_metadata_df['collection method'].isnull().any()
False
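The difference between the two checks above is what one would expect if the extra metadata were joined onto the sample metadata by sample_id while keeping every sample of the resource (a left join). The sketch below models that behaviour with plain dictionaries. It is an assumption drawn from the outputs, not a description of how add_extra_metadata is implemented, and all identifiers in it are hypothetical.

```python
# Hypothetical samples: sample_id -> is_surveillance.
is_surveillance = {"s1": True, "s2": False, "s3": True, "s4": False}

# Extra metadata defined for the surveillance samples only.
extra = {"s1": "HLC", "s3": "Shannon trap"}


def with_extra_metadata(sample_ids, extra_values):
    """Left join: keep every sample, with None where no extra value exists."""
    return {sample_id: extra_values.get(sample_id) for sample_id in sample_ids}


# Default view: all samples. Surveillance view: surveillance samples only.
default_view = with_extra_metadata(is_surveillance, extra)
surveillance_view = with_extra_metadata(
    [sample_id for sample_id, flag in is_surveillance.items() if flag], extra
)

print("Missing values in default view:",
      any(value is None for value in default_view.values()))
print("Missing values in surveillance view:",
      any(value is None for value in surveillance_view.values()))
```

Attaching extra metadata to the resource whose samples it was defined for avoids having to handle the missing values in later analyses.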
This extra metadata can then be used with other functions (though it is meaningless here).
ag3_pca_cm_df, evr = ag3_default.pca(region=region, sample_sets=mixed_sample_set, n_snps=n_snps)
ag3_default.plot_pca_coords(ag3_pca_cm_df, color='collection method')
ag3_pca_cm_surv_df, evr = ag3_surveillance.pca(region=region, sample_sets=mixed_sample_set, n_snps=n_snps)
ag3_surveillance.plot_pca_coords(ag3_pca_cm_surv_df, color='collection method')
Congratulations on reaching the end of this notebook.