Unrestricted and surveillance metadata flags

Theme: Data

DISCLAIMER: This is work in progress and subject to change and updates.

This module provides an introduction to two metadata flags that were recently added to the Ag3 and Af1 resources. These flags are used to filter for samples that are “unrestricted”, i.e., available for anyone to use freely, and “surveillance”, i.e., deemed appropriate for surveillance purposes. What both of these terms mean will be examined in more detail shortly.

This module will not contain a lot of technical code, but some functions, such as sample_metadata and cnv_hmm, will be used as examples. We will not detail what these functions do in this notebook, and we will not be interested in the results of their execution, only in the impact of the two flags on these functions. More details on the use of these functions and their results can be found in the Introductory course. The use of sample_metadata is, for instance, explained in Workshop 1 Module 2, while more can be learned about CNVs in Workshop 2. We will also use some principal component analyses (PCA). Some examples and explanations of PCA can be found in Detecting population structure using PCA.

Learning objectives#

After completing this module, you will be able to:

Explain what it means for a sample set to be “unrestricted”
Enumerate a few sources of bias that would prevent a sample from being “surveillance”
Understand when to use the unrestricted_use_only and surveillance_use_only flags for your analyses
Use these flags when needed

What is an “unrestricted” sample set and why are some sample sets not “unrestricted”?#

Put simply, a sample set is “unrestricted” if anyone can use it however they so choose.

All the sample sets that are accessible using either Ag3 or Af1 are open, but terms of use apply. These terms of use are available on the MalariaGEN website and may differ from sample set to sample set. For example, the terms of use for the Anopheles gambiae Genomic Surveillance project (which can be found here) state that there is a “publication embargo [which] will expire 24 months after the data is integrated into the Malaria Genome Vector Observatory data repository, or earlier, if the project partner agrees to remove the embargo before the expiry date.”

We created a flag, unrestricted_use, to make it easy to check whether a sample set is out of embargo. It is True if the sample set can be used without any restriction and False otherwise. Users can access and explore sample sets with a False unrestricted_use flag through the package, but they need to be aware that terms of use apply and that these differ between sample sets. It is key to respect these terms of use when accessing MalariaGEN resources: this allows us to continue working with the data producers/owners to generate key resources for the community.
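As a rough illustration of the embargo arithmetic quoted above, the following sketch computes an expiry date 24 months after an integration date and compares it with a reference date. The helper names and the integration date are assumptions made for this example, and the rule shown (the flag depends only on the expiry date) is a simplification: the real flag may reflect other terms of use, such as an embargo lifted early by the project partner.

```python
import calendar
from datetime import date


def embargo_expiry(integration_date: date, months: int = 24) -> date:
    """Return the date `months` after `integration_date` (hypothetical helper)."""
    total = integration_date.month - 1 + months
    year = integration_date.year + total // 12
    month = total % 12 + 1
    # Clamp the day to the last day of the target month (e.g. 29 Feb -> 28 Feb).
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(integration_date.day, last_day))


def is_out_of_embargo(expiry: date, today: date) -> bool:
    """True once the expiry date has been reached (assumed rule, for illustration)."""
    return today >= expiry


# Hypothetical integration date, chosen only to demonstrate the arithmetic.
expiry = embargo_expiry(date(2024, 6, 24))
print(expiry)
print(is_out_of_embargo(expiry, today=date(2025, 1, 1)))
print(is_out_of_embargo(expiry, today=date(2026, 7, 1)))
```

In practice there is no need to compute this yourself: the unrestricted_use column already records the outcome for each sample set.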

What is a “surveillance” sample and why are some samples not “surveillance”?#

There are many ways to use the Ag3 and Af1 data, but one of the most important is surveillance of the vector population. A surveillance use of the data is a study of how a wild population is behaving and evolving, for instance in the presence of an intervention. Suppose a new bednet is widely distributed in a location. One might want to know whether the vector population is declining in size due to the new bednet (i.e., a change of population demographics), whether a sweep occurs to provide resistance (i.e., a change of population genomics), or whether the population becomes more exophilic or changes its biting patterns (i.e., a change of population behaviour). Another key question is the time scale of such a change and, potentially, its reversibility.

In order to be useful to answer these many questions, the data needs to be an accurate representation of the underlying population, i.e. a sample should not differ from its population too much. If we deem that a sample is a fair representative of the population it came from, it is called a “surveillance” sample. Otherwise, it is not a “surveillance” sample because it is biased.

There are many different reasons why a sample may be biased. One obvious reason is that the population the sample came from cannot be easily ascertained. For instance, a sample whose taxon is undetermined cannot be used for surveillance, as it would likely be significantly different from every other sample no matter which population it was assigned to. Similarly, an accurate estimate of when and where a sample was collected is required. The space occupied by a population of mosquitoes varies with time, and the exact range of movement of a population is difficult to measure with certainty, but if we don’t know where a sample was collected, we cannot assign it to a population with any degree of certainty. Something similar is true temporally. Two samples from the same location collected during different months or years are likely to belong to populations that are connected, but the demographics and genomics of a population can change from month to month (e.g., populations of mosquitoes tend to be smaller during the dry season than during the rainy season). A month and year of collection are therefore required for a sample to be considered “surveillance”.

Many samples in the Ag3 and Af1 resources were collected in order to carry out bioassays. In theory, if all samples used for a bioassay were then sequenced (and assuming they were all wild-caught), the bias would be fairly minimal. In practice, generally only the most resistant and the most susceptible samples of a bioassay are sequenced. These samples are therefore heavily biased, and their genomes are not a good representation of what the wild population looks like. All of these samples are thus considered not to be “surveillance”.
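The criteria discussed so far can be summarised as a small rule-based check. This is only a sketch: the SampleRecord fields, the surveillance_issues function and the example values are hypothetical, and the curation behind the real is_surveillance flag may rely on additional information.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SampleRecord:
    """Hypothetical, simplified sample record used only for this sketch."""
    taxon: Optional[str]
    location: Optional[str]
    year: Optional[int]
    month: Optional[int]
    selected_by_bioassay: bool


def surveillance_issues(sample: SampleRecord) -> List[str]:
    """List the reasons why a sample would not qualify as surveillance."""
    issues = []
    if sample.taxon is None:
        issues.append("undetermined taxon")
    if sample.location is None:
        issues.append("unknown collection location")
    if sample.year is None or sample.month is None:
        issues.append("unknown month or year of collection")
    if sample.selected_by_bioassay:
        issues.append("selected by bioassay outcome")
    return issues


# Made-up records: one wild-caught sample, one bioassay survivor without a month.
wild = SampleRecord("gambiae", "site A", 2012, 6, False)
survivor = SampleRecord("gambiae", "site A", 2012, None, True)
print(surveillance_issues(wild))
print(surveillance_issues(survivor))
```

Returning the list of reasons, rather than a single boolean, makes it easier to see which source of bias excluded a given sample.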

The question of bias is not trivial. Every measurement is biased in some way, and the acceptable bias is highly dependent on which questions one wants to ask. For instance, all “surveillance” samples in Ag3 belong to the Anopheles gambiae complex. If one was interested in biodiversity in Africa, or even just in the biodiversity of insects, it would be an extremely biased resource. We, however, assume that users are aware of this bias and that their questions relate to the Anopheles gambiae complex.

Furthermore, some bias is considered to be acceptable. Many “surveillance” samples in Ag3 or Af1 were collected using human-landing catch. This means that the samples at some point landed on a human, probably with the intent of biting them, i.e. most if not all samples are going to be females. This is not considered to be a disqualifying bias for surveillance but it could hinder someone wanting to study male genomes more specifically. Similarly, many “surveillance” samples were collected using pyrethroid spraying. This method only collects samples that are indoor resting and some taxa, which may be more exophilic, may be under-represented so these data cannot be used to produce an accurate picture of species distribution. This is not a typical use of our data and it is generally assumed that exophilic taxa are more minor vectors but it is a source of bias that may need to be taken into account for some analyses.

Standard setup used during the entry level course#

Let us first look at the setup you might be more familiar with for Ag3. We could have used Af1 in the same way but we assume that users are, at this point, more familiar with Ag3.

%pip install -q --no-warn-conflicts malariagen_data

import malariagen_data
import os

import plotly.io as pio
pio.renderers.default = "notebook+colab"

try:
    # if running on colab, mount Google Drive
    from google.colab import drive
    drive.mount('drive')
except ImportError:
    pass

results_dir = "drive/MyDrive/Colab Data/ag3-structure-results"
os.makedirs(results_dir, exist_ok=True)

ag3_default = malariagen_data.Ag3(results_cache=results_dir)

We will start by looking at how many sample sets are “unrestricted”.

ag3_default.sample_sets()['unrestricted_use'].value_counts()

unrestricted_use
True     73
False    24
Name: count, dtype: Int64

We see that about three quarters of the sample sets (73 of 97) are “unrestricted”. We can look at the first few “restricted” sample sets.

# Use the Python engine explicitly: numexpr does not support extension array dtypes.
ag3_default.sample_sets().query('unrestricted_use == False', engine='python').head()

	sample_set	sample_count	study_id	study_url	terms_of_use_expiry_date	terms_of_use_url	release	unrestricted_use
29	1188-VO-NIANG-NIEL-SN-2304-VMF00259	660	1188-VO-SN-NIANG	https://www.malariagen.net/network/where-we-wo...	2026-06-24	https://malariagen.github.io/vector-data/ag3/a...	3.10	False
30	1270-VO-MULTI-PAMGEN-VMF00244	252	1270-VO-MULTI-PAMGEN	https://www.malariagen.net/network/where-we-wo...	2026-06-24	https://malariagen.github.io/vector-data/ag3/a...	3.10	False
31	1330-VO-GN-LAMA-VMF00250	180	1330-VO-GN-LAMA	https://www.malariagen.net/network/where-we-wo...	2026-06-24	https://malariagen.github.io/vector-data/ag3/a...	3.10	False
33	1296-VO-BF-DIABATE-VMF00272	665	1296-VO-BF-DIABATE	https://www.malariagen.net/network/where-we-wo...	2026-09-06	https://malariagen.github.io/vector-data/ag3/a...	3.11	False
34	1351-VO-SS-WEETMAN-VMF00282	90	1351-VO-SS-WEETMAN	https://www.malariagen.net/network/where-we-wo...	2026-09-06	https://malariagen.github.io/vector-data/ag3/a...	3.11	False

The first one that we see is ‘1188-VO-NIANG-NIEL-SN-2304-VMF00259’.
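The terms_of_use_expiry_date column shows when each embargo is due to end. As a small sketch, the five rows displayed above are copied here into plain Python and grouped by expiry date. The variable names are arbitrary; with the real data frame, a similar grouping could be done directly on the result of sample_sets().

```python
from collections import defaultdict
from datetime import date

# Sample sets and expiry dates copied from the five rows shown above.
restricted = {
    "1188-VO-NIANG-NIEL-SN-2304-VMF00259": date(2026, 6, 24),
    "1270-VO-MULTI-PAMGEN-VMF00244": date(2026, 6, 24),
    "1330-VO-GN-LAMA-VMF00250": date(2026, 6, 24),
    "1296-VO-BF-DIABATE-VMF00272": date(2026, 9, 6),
    "1351-VO-SS-WEETMAN-VMF00282": date(2026, 9, 6),
}

# Group the sample sets by the date on which their embargo expires.
by_expiry = defaultdict(list)
for sample_set, expiry in restricted.items():
    by_expiry[expiry].append(sample_set)

# Print the groups in chronological order.
for expiry in sorted(by_expiry):
    print(expiry.isoformat(), len(by_expiry[expiry]), "sample set(s)")
```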

Each sample set may contain a mix of “surveillance” and “non-surveillance” samples. This is the case, for example, when only some of the samples in a study were bioassayed. Let us look at an example sample set: 1274-VO-KE-KAMAU-VMF00246.

mixed_sample_set = '1274-VO-KE-KAMAU-VMF00246'
# Reference the variable with `@` rather than repeating the sample set name.
ag3_default.sample_sets().query("sample_set == @mixed_sample_set")

	sample_set	sample_count	study_id	study_url	terms_of_use_expiry_date	terms_of_use_url	release	unrestricted_use
89	1274-VO-KE-KAMAU-VMF00246	564	1274-VO-KE-KAMAU	https://www.malariagen.net/partner_study/1274-...	2025-05-12	https://malariagen.github.io/vector-data/ag3/a...	3.9	True

We can see that this sample set is unrestricted.

ag3_default.sample_metadata(sample_sets=mixed_sample_set)['is_surveillance'].value_counts()

is_surveillance
False    478
True      86
Name: count, dtype: Int64

We see that a majority of the samples in this sample set are “non-surveillance” (478 of 564), but 86 are “surveillance”. Let us run a few principal component analyses to see whether we can tell the “surveillance” and “non-surveillance” samples apart. We will use the same region and number of SNPs as in Detecting population structure using PCA.

region = "3L:15,000,000-41,000,000"
n_snps = 100_000

ag3_pca_df, evr = ag3_default.pca(region=region, sample_sets=mixed_sample_set, n_snps=n_snps)

ag3_default.plot_pca_coords(ag3_pca_df, color="is_surveillance") 

The principal components seem to be driven by taxon more than by “is_surveillance”, so we will select only the An. gambiae s.s. samples.

ag3_pca_gam_df, evr_gam = ag3_default.pca(region=region, sample_sets=mixed_sample_set, sample_query = "taxon == 'gambiae'", n_snps=n_snps)

ag3_default.plot_pca_coords(ag3_pca_gam_df, color="is_surveillance") 

The principal components now seem to be driven by outliers more than “is_surveillance”. Let us add more data from ‘AG1000G-KE’, some wild samples also from Kenya, and ‘AG1000G-X’, some crosses.

ag3_pca_gam_ext_df, evr_gam_ext = ag3_default.pca(region=region, sample_sets=[mixed_sample_set, 'AG1000G-KE', 'AG1000G-X'], sample_query = "taxon in ['gambiae', 'unassigned']", n_snps=n_snps)

ag3_default.plot_pca_coords(ag3_pca_gam_ext_df, color='is_surveillance', symbol="sample_set") 

Unrestricted setup#

Let us now set up things so that we only access the unrestricted sample sets.

# Construct an Ag3 object using the `unrestricted_use_only` setting.
ag3_unrestricted = malariagen_data.Ag3(unrestricted_use_only=True)

Let us check that only the unrestricted sample sets are part of this resource.

# See the value counts for `unrestricted_use` for all of the sample sets relevant to this object.
ag3_unrestricted.sample_sets()['unrestricted_use'].value_counts()

unrestricted_use
True    73
Name: count, dtype: Int64

We can also check how many samples are surveillance.

# See the value counts for `is_surveillance` for all of the samples relevant to this object.
ag3_unrestricted.sample_metadata()['is_surveillance'].value_counts()

is_surveillance
True     13720
False     2228
Name: count, dtype: Int64

We can try to access a restricted sample set.

# Note that an error will be raised if you try to access data for a restricted sample set while using the `unrestricted_use_only` setting. For example:
restricted_sample_set = '1188-VO-NIANG-NIEL-SN-2304-VMF00259'
try:
    ag3_unrestricted.sample_metadata(sample_sets=restricted_sample_set)
except ValueError as error_message:
    print(error_message)

Sample set '1188-VO-NIANG-NIEL-SN-2304-VMF00259' not found. This sample set might be unavailable or irrelevant with respect to settings.

Surveillance setup#

It is also possible to set up the resource to only access surveillance samples.

# Construct an Ag3 object using the `surveillance_use_only` setting.
ag3_surveillance = malariagen_data.Ag3(surveillance_use_only=True)

We can look at the sample sets that have at least one sample with is_surveillance set to True.

# See the number of the sample sets relevant to this object.
# Note that only sample sets that have at least one sample with `is_surveillance` set to `True` are returned for this object.
len(ag3_surveillance.sample_sets())

ag3_surveillance.sample_metadata()['is_surveillance'].value_counts()

is_surveillance
True    20848
Name: count, dtype: Int64

Let us look at the samples in the mixed sample set we identified before.

ag3_surveillance.sample_metadata(sample_sets=mixed_sample_set)['is_surveillance'].value_counts()

is_surveillance
True    86
Name: count, dtype: Int64

We can also try to look at a sample set that doesn’t contain any surveillance samples.

# See the sample metadata for all the samples in a sample set that has no surveillance samples.
# Note that an error will be raised if you try to access data for a non-surveillance sample set while using the `surveillance_use_only` setting. For example:
non_surveillance_sample_set = 'AG1000G-X'
try:
    ag3_surveillance.sample_metadata(sample_sets=non_surveillance_sample_set)
except ValueError as error_message:
    print(error_message)

Sample set 'AG1000G-X' not found. This sample set might be unavailable or irrelevant with respect to settings.

Combined setup#

It is also possible to combine the surveillance and unrestricted setups to access only the surveillance samples from unrestricted sample sets.

# Construct an Ag3 object using both the `unrestricted_use_only` setting and the `surveillance_use_only` setting.
ag3_unrestricted_surveillance = malariagen_data.Ag3(unrestricted_use_only=True, surveillance_use_only=True)

We can check that only unrestricted sample sets and surveillance samples are available.

# See the value counts for `unrestricted_use` for all of the sample sets relevant to this object.
# Note that only sample sets with `unrestricted_use` set to `True` are returned for this object.
ag3_unrestricted_surveillance.sample_sets()['unrestricted_use'].value_counts()

unrestricted_use
True    64
Name: count, dtype: Int64

ag3_unrestricted_surveillance.sample_metadata()['is_surveillance'].value_counts()

is_surveillance
True    13720
Name: count, dtype: Int64
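To make the interaction between the two settings concrete, here is a toy model in plain Python. It is not the malariagen_data implementation: the ToyResource class, its data and its method bodies are invented for illustration and only mimic the behaviour observed above. In this toy, a sample set disappears when it is restricted or when it has no surveillance samples left, which is one way to picture why 73 unrestricted sample sets become 64 in the combined setup, and asking for a filtered-out sample set raises a ValueError.

```python
class ToyResource:
    """Toy stand-in for a resource object (illustration only)."""

    # sample set -> (unrestricted_use, [is_surveillance for each sample])
    _DATA = {
        "set-A": (True, [True, True, False]),
        "set-B": (True, [False, False]),
        "set-C": (False, [True, False]),
    }

    def __init__(self, unrestricted_use_only=False, surveillance_use_only=False):
        self.unrestricted_use_only = unrestricted_use_only
        self.surveillance_use_only = surveillance_use_only

    def _samples(self, sample_set):
        # Per-sample flags, keeping only surveillance samples if requested.
        flags = self._DATA[sample_set][1]
        if self.surveillance_use_only:
            flags = [flag for flag in flags if flag]
        return flags

    def sample_sets(self):
        kept = []
        for name, (unrestricted, _) in self._DATA.items():
            if self.unrestricted_use_only and not unrestricted:
                continue  # restricted sample set
            if not self._samples(name):
                continue  # no relevant samples left in this sample set
            kept.append(name)
        return kept

    def sample_metadata(self, sample_set):
        if sample_set not in self.sample_sets():
            raise ValueError(f"Sample set {sample_set!r} not found.")
        return self._samples(sample_set)


print(ToyResource().sample_sets())
print(ToyResource(unrestricted_use_only=True).sample_sets())
print(ToyResource(surveillance_use_only=True).sample_sets())
print(ToyResource(unrestricted_use_only=True, surveillance_use_only=True).sample_sets())
```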

We can see which flags have been set by displaying the resource object.

ag3_unrestricted_surveillance

MalariaGEN Ag3 API client
Please note that data are subject to terms of use, for more information see the MalariaGEN website or contact support@malariagen.net. See also the Ag3 API docs.
Storage URL	gs://vo_agam_release_master_us_central1
Data releases available	3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 3.10, 3.11, 3.12, 3.13, 3.14, 3.15
Results cache	None
Cohorts analysis	20250815
AIM analysis	20220528
Site filters analysis	dt_20200416
Software version	malariagen_data 15.6.0.post407+1430baf
Client location	Iowa, United States (Google Cloud us-central1)
Data filtered for unrestricted use only	True
Data filtered for surveillance use only	True
Relevant data releases	3.0, 3.1, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 3.10

In addition to the sample_sets and the sample_metadata functions, other functions and properties in the package also return data differently depending on the setting of the unrestricted_use_only and surveillance_use_only parameters. As a general rule, all functions and properties that appear in the API documentation should honour these settings, but you should check first. Be aware that so-called “private” functions, which are used internally in the package and have an underscore prefix, e.g. _surveillance_flags, might return unfiltered data regardless of the settings.

Let us, for instance, look at the set of releases in each configuration of Ag3. The releases property lists the releases with at least one sample set in the resource.

ag3_default.releases

('3.0',
 '3.1',
 '3.2',
 '3.3',
 '3.4',
 '3.5',
 '3.6',
 '3.7',
 '3.8',
 '3.9',
 '3.10',
 '3.11',
 '3.12',
 '3.13',
 '3.14',
 '3.15')

ag3_surveillance.releases

('3.0',
 '3.1',
 '3.3',
 '3.4',
 '3.5',
 '3.6',
 '3.7',
 '3.8',
 '3.9',
 '3.10',
 '3.11',
 '3.13',
 '3.14',
 '3.15')

ag3_unrestricted.releases

('3.0', '3.1', '3.2', '3.3', '3.4', '3.5', '3.6', '3.7', '3.8', '3.9', '3.10')

ag3_unrestricted_surveillance.releases

('3.0', '3.1', '3.3', '3.4', '3.5', '3.6', '3.7', '3.8', '3.9', '3.10')

ag3_default contains all the currently available releases while ag3_surveillance, ag3_unrestricted and ag3_unrestricted_surveillance contain some subsets of the available releases.
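One way to picture this is that the relevant releases are simply the releases of the sample sets that survive the filters. The sketch below uses invented sample sets and a hypothetical relevant_releases helper, so it says nothing about how the package computes the property. It also sorts releases numerically, since a plain string sort would place '3.10' before '3.2'.

```python
def release_key(release):
    """Turn '3.10' into (3, 10) so that releases sort numerically."""
    return tuple(int(part) for part in release.split("."))


def relevant_releases(sample_set_releases, keep):
    """Releases that contain at least one sample set accepted by `keep`."""
    releases = {rel for name, rel in sample_set_releases.items() if keep(name)}
    return tuple(sorted(releases, key=release_key))


# Invented data: sample set -> release it belongs to.
toy = {"set-A": "3.0", "set-B": "3.2", "set-C": "3.10", "set-D": "3.11"}
restricted = {"set-C", "set-D"}  # still under embargo
no_surveillance = {"set-B"}      # contains no surveillance samples

print(relevant_releases(toy, lambda name: True))
print(relevant_releases(toy, lambda name: name not in no_surveillance))
print(relevant_releases(toy, lambda name: name not in restricted))
print(relevant_releases(toy, lambda name: name not in restricted | no_surveillance))
```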

Depending on which flags are set, the number of samples returned during an analysis may vary greatly. Let us look at an example showing the CNV HMMs for ‘1274-VO-KE-KAMAU-VMF00246’ with default configuration and the surveillance one.

# The `cnv_hmm` function will return samples depending on the object's `surveillance_use_only` setting. For example:
cnv_region = '2R:28,480,000-28,490,000'
ag3_cnv_hmm_df = ag3_default.cnv_hmm(region=cnv_region, sample_sets=mixed_sample_set).to_dataframe()
ag3_surveillance_cnv_hmm_df = ag3_surveillance.cnv_hmm(region=cnv_region, sample_sets=mixed_sample_set).to_dataframe()
print('Samples returned by the default object:', len(ag3_cnv_hmm_df['sample_id'].unique()))
print('Samples returned by the surveillance object:', len(ag3_surveillance_cnv_hmm_df['sample_id'].unique()))

Samples returned by the default object: 359
Samples returned by the surveillance object: 49
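When comparing two configurations like this, it can be worth confirming that the smaller result is a subset of the larger one and seeing which samples were dropped. The sketch below does this with made-up sample identifiers and a hypothetical compare_sample_ids helper. Identifiers are repeated in the input lists to mimic a flattened table with several rows per sample, which is why unique values are counted.

```python
def compare_sample_ids(default_ids, surveillance_ids):
    """Summarise how two collections of sample identifiers relate."""
    default_set = set(default_ids)
    surveillance_set = set(surveillance_ids)
    return {
        "n_default": len(default_set),
        "n_surveillance": len(surveillance_set),
        "is_subset": surveillance_set <= default_set,
        "only_in_default": sorted(default_set - surveillance_set),
    }


# Made-up identifiers, repeated as they would be with several rows per sample.
default_ids = ["S1", "S1", "S2", "S2", "S3", "S3"]
surveillance_ids = ["S2", "S2"]

summary = compare_sample_ids(default_ids, surveillance_ids)
print(summary)
```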

We get very different numbers of samples and, if we display the results, we will also get different plots.

ag3_default.plot_cnv_hmm_heatmap(
    region=cnv_region, 
    sample_sets=mixed_sample_set,
    row_height=5
);

ag3_surveillance.plot_cnv_hmm_heatmap(
    region=cnv_region, 
    sample_sets=mixed_sample_set,
    row_height=5
);

The general distribution of results is broadly similar between the two plots. One can observe roughly three different patterns: no amplification, an amplification that covers the whole region, and an amplification that covers only Cyp6aa1.

One may be interested to see if the method of collection of these samples has an impact on their CNVs. This information is not available in Ag3 but one may want to add it. Because we do not have access to this information, we will simulate it by using random values. Obviously, this should never be done in actual analyses!

# Extra metadata can be defined for each of the surveillance samples in the mixed sample set.
import pandas as pd
import numpy as np
mixed_sample_metadata_df = ag3_surveillance.sample_metadata(sample_sets=mixed_sample_set)

collections_df = pd.DataFrame(
    {
        "sample_id": mixed_sample_metadata_df["sample_id"],
        "collection method": np.random.choice(
            ["HLC", "pyrethroid spray", "Shannon trap"], len(mixed_sample_metadata_df)
        ),
    }
)
collections_df['collection method'].value_counts()

collection method
HLC                 32
Shannon trap        30
pyrethroid spray    24
Name: count, dtype: int64
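Because the values are drawn at random without a fixed seed, the counts shown above will change from run to run. If reproducible output is wanted, one option is to draw from a seeded generator. The sketch below uses Python's standard random module with made-up sample identifiers rather than the NumPy call used in the notebook; the helper name and the seed value are arbitrary. With NumPy, a generator created with np.random.default_rng(seed) plays the same role.

```python
import random
from collections import Counter

METHODS = ["HLC", "pyrethroid spray", "Shannon trap"]


def simulate_collection_methods(sample_ids, seed):
    """Assign a random (meaningless) collection method to each sample."""
    rng = random.Random(seed)  # seeded generator, independent of global state
    return {sample_id: rng.choice(METHODS) for sample_id in sample_ids}


# Made-up identifiers standing in for the 86 surveillance samples.
sample_ids = [f"sample-{i:03d}" for i in range(86)]

first = simulate_collection_methods(sample_ids, seed=42)
second = simulate_collection_methods(sample_ids, seed=42)

print(first == second)  # the same seed reproduces the same draw
print(sum(Counter(first.values()).values()))  # one value per sample
```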

This information can then be added to the metadata, in this case to ag3_default, which also contains the samples that are not surveillance.

ag3_default.add_extra_metadata(collections_df)

We can check that none of the non-surveillance samples has a collection method and that we now have access to the collection methods for the surveillance samples.

not_surveillance_mixed_sample_metadata_df = ag3_default.sample_metadata(sample_sets=mixed_sample_set, sample_query='not is_surveillance')
not_surveillance_mixed_sample_metadata_df['collection method'].value_counts()

Series([], Name: count, dtype: int64)

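# The data frame retrieved earlier, before the extra metadata was added, does not have the new column.
try:
    mixed_sample_metadata_df['collection method'].isnull().any()
except KeyError as error_message:
    print(error_message)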

'collection method'

mixed_sample_metadata_df = ag3_default.sample_metadata(sample_sets=mixed_sample_set)
mixed_sample_metadata_df['collection method'].value_counts()

collection method
HLC                 32
Shannon trap        30
pyrethroid spray    24
Name: count, dtype: int64

It is generally less error-prone to make sure that we are always working with the same number of samples. In this case, it would make more sense to add the extra metadata to ag3_surveillance.

ag3_surveillance.add_extra_metadata(collections_df)

We get the same result for surveillance samples.

surveillance_mixed_sample_metadata_df = ag3_surveillance.sample_metadata(sample_sets=mixed_sample_set)
surveillance_mixed_sample_metadata_df['collection method'].value_counts()

collection method
HLC                 32
Shannon trap        30
pyrethroid spray    24
Name: count, dtype: int64

But no sample is missing a value.

surveillance_mixed_sample_metadata_df['collection method'].isnull().any()

np.False_
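The behaviour seen in this section is consistent with the extra metadata being attached by sample identifier, with samples that have no matching row left without a value. The sketch below reproduces that pattern in plain Python. The attach_extra helper and the records are invented, and this is not a description of how add_extra_metadata is implemented.

```python
def attach_extra(samples, extra, column):
    """Add `column` to each sample record, using None when there is no match."""
    return [{**sample, column: extra.get(sample["sample_id"])} for sample in samples]


# Made-up sample records: S2 is not a surveillance sample.
samples = [
    {"sample_id": "S1", "is_surveillance": True},
    {"sample_id": "S2", "is_surveillance": False},
    {"sample_id": "S3", "is_surveillance": True},
]

# Extra metadata defined for the surveillance samples only.
extra = {"S1": "HLC", "S3": "Shannon trap"}

merged = attach_extra(samples, extra, "collection method")
missing = [row["sample_id"] for row in merged if row["collection method"] is None]
print(missing)
```

This is why working with an object whose samples match the extra metadata, here the surveillance one, avoids missing values altogether.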

This extra metadata can then be used with other functions (though it is meaningless here).

ag3_pca_cm_df, evr = ag3_default.pca(region=region, sample_sets=mixed_sample_set, n_snps=n_snps)
ag3_default.plot_pca_coords(ag3_pca_cm_df, color='collection method') 

ag3_pca_cm_surv_df, evr = ag3_surveillance.pca(region=region, sample_sets=mixed_sample_set, n_snps=n_snps)
ag3_surveillance.plot_pca_coords(ag3_pca_cm_surv_df, color='collection method') 

Well done

In this module, we have learnt:

Why the use of some data may be restricted
Why some data may not be appropriate for surveillance
How to configure the resources to only access the data that is relevant to our scope
That one might get different results with different configurations. One needs to be careful.

Congratulations on reaching the end of this notebook.