Module 3 - Genetic diversity summary statistics

Module 3 - Genetic diversity summary statistics#

Theme: Analysis

This module covers how to compute genetic diversity summary statistics, which are a convenient and informative way to quantify and compare diversity in different mosquito populations.

Learning objectives#

At the end of this module you will be able to:

Select cohorts for comparative analysis of genetic diversity
Compute nucleotide diversity, Watterson’s estimator and Tajima’s D
Plot diversity statistics for multiple cohorts
Interpret the results

Lecture#

English#

Français#

Please note that the code in the cells below might differ from that shown in the video. This can happen because Python packages and their dependencies change due to updates, necessitating tweaks to the code.

Setup#

Install and import the packages we’ll need.

%pip install -q --no-warn-conflicts malariagen_data

import malariagen_data
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "notebook+colab"

Here we will compute summary statistics using data from millions of SNPs. This can take some time, so we will use Google Drive to save results and avoid having to rerun computations. To do this we first need to mount Google Drive to our colab VM:

Configure access to MalariaGEN data in Google Cloud.

try:
    from google.colab import drive
    drive.mount("drive")
except ImportError:
    pass

Note that authentication is required to access data through the package, please follow the instructions here.

ag3 = malariagen_data.Ag3(
    results_cache="drive/MyDrive/malariagen_data_cache"
)
ag3

MalariaGEN Ag3 API client
Please note that data are subject to terms of use, for more information see the MalariaGEN website or contact support@malariagen.net. See also the Ag3 API docs.
Storage URL	gs://vo_agam_release_master_us_central1
Data releases available	3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 3.10, 3.11, 3.12, 3.13, 3.14
Results cache	/home/jonbrenas/anopheles-genomic-surveillance.github.io/docs/workshop-5/drive/MyDrive/malariagen_data_cache
Cohorts analysis	20250131
AIM analysis	20220528
Site filters analysis	dt_20200416
Software version	malariagen_data 15.2.0
Client location	Iowa, United States (Google Cloud us-central1)

Selecting cohorts#

In this module we will be comparing genetic diversity between mosquitoes representing different geographical locations, time points and species. Before we begin any comparative analysis of genetic diversity, we first need to decide how to group the mosquitoes we’ve sequenced into cohorts. Each cohort we define will represent a sample of mosquitoes from a given time, place and species.

Let’s first browse the different sample locations for which we have data. In workshop 1, module 2 we looked at how to create an interactive map with sampling locations using ipyleaflet. Because this is something we want to do often, we’ve added a convenience function plot_samples_interactive_map() to do this for you. This will plot a map showing sampling locations for all available samples.

ag3.plot_samples_interactive_map(sample_query="release == '3.0'", basemap='worldimagery')

N.B., you will need to run this notebook for yourself in colab in order for the interactive map to work.

For the purposes of this module we will begin by focusing on Tanzania. To get more information on the numbers of samples available from Tanzania, we can use pandas to create a pivot table as we also illustrated in workshop 1, module 2. Again, because we do this often, we’ve created a convenience function count_samples().

ag3.count_samples(
    sample_query="release == '3.0' and country == 'Tanzania'"
)

				taxon	arabiensis	gambiae	gcx3
country	admin1_iso	admin1_name	admin2_name	year
Tanzania	TZ-05	Kagera	Muleba	2015	137	32	1
	TZ-13	Mara	Tarime	2012	47	0	0
	TZ-25	Tanga	Muheza	2013	1	32	10
	TZ-26	Manyara	Moshi	2012	40	0	0

When grouping samples by geographical location, we sometimes can choose whether to aggregate data from nearby locations, or whether to keep them separate. In general, we want the finest spatial granularity possible, but we also need enough samples to make reasonable estimates of genetic diversity, and so we might have to make a trade-off between granularity and sample size.

Similarly, when grouping samples by date of collection, we can choose whether to aggregate data from the same month, season or year. Again, in general we would prefer the finest temporal granularity possible, but might also need sufficient sample size.

For computing diversity summary statistics, there is no hard-and-fast rule about how many samples are needed, but a sample size of 10 is probably the bare minimum we would consider, and we would prefer more if possible.

To simplify grouping of mosquitoes into cohorts, we have predefined some metadata columns which provide cohort labels at different levels of granularity. Let’s take a peek at the sample metadata:

df_samples_tz = ag3.sample_metadata(
    sample_query="release == '3.0' and country == 'Tanzania'"
)
df_samples_tz.head()

	sample_id	partner_sample_id	contributor	country	location	year	month	latitude	longitude	sex_call	...	admin1_name	admin1_iso	admin2_name	taxon	cohort_admin1_year	cohort_admin1_month	cohort_admin1_quarter	cohort_admin2_year	cohort_admin2_month	cohort_admin2_quarter
0	BL0046-C	Plate_C_H6	Bilali Kabula	Tanzania	Muleba	2015	6	-1.962	31.621	F	...	Kagera	TZ-05	Muleba	gcx3	TZ-05_gcx3_2015	TZ-05_gcx3_2015_06	TZ-05_gcx3_2015_Q2	TZ-05_Muleba_gcx3_2015	TZ-05_Muleba_gcx3_2015_06	TZ-05_Muleba_gcx3_2015_Q2
1	BL0047-C	Plate_F_D4	Bilali Kabula	Tanzania	Muleba	2015	3	-1.962	31.621	F	...	Kagera	TZ-05	Muleba	arabiensis	TZ-05_arab_2015	TZ-05_arab_2015_03	TZ-05_arab_2015_Q1	TZ-05_Muleba_arab_2015	TZ-05_Muleba_arab_2015_03	TZ-05_Muleba_arab_2015_Q1
2	BL0048-C	Plate_F_E4	Bilali Kabula	Tanzania	Muleba	2015	3	-1.962	31.621	F	...	Kagera	TZ-05	Muleba	arabiensis	TZ-05_arab_2015	TZ-05_arab_2015_03	TZ-05_arab_2015_Q1	TZ-05_Muleba_arab_2015	TZ-05_Muleba_arab_2015_03	TZ-05_Muleba_arab_2015_Q1
3	BL0049-C	Plate_F_F4	Bilali Kabula	Tanzania	Muleba	2015	3	-1.962	31.621	F	...	Kagera	TZ-05	Muleba	arabiensis	TZ-05_arab_2015	TZ-05_arab_2015_03	TZ-05_arab_2015_Q1	TZ-05_Muleba_arab_2015	TZ-05_Muleba_arab_2015_03	TZ-05_Muleba_arab_2015_Q1
4	BL0050-C	Plate_F_G4	Bilali Kabula	Tanzania	Muleba	2015	3	-1.962	31.621	F	...	Kagera	TZ-05	Muleba	arabiensis	TZ-05_arab_2015	TZ-05_arab_2015_03	TZ-05_arab_2015_Q1	TZ-05_Muleba_arab_2015	TZ-05_Muleba_arab_2015_03	TZ-05_Muleba_arab_2015_Q1

5 rows × 57 columns

If you scroll to the right, you’ll see columns named “cohort_admin1_year”, “cohort_admin1_month”, “cohort_admin1_quarter”, “cohort_admin2_year”, “cohort_admin2_quarter” and “cohort_admin2_month”. Each of these columns provides a different set of cohort labels. All are grouped by species (taxon).

For this analysis, let’s use the “cohort_admin2_year” column to group samples by administrative level 2 divisions (usually called districts) and year. Here are the cohort labels for the Tanzania samples and the sample sizes.

df_samples_tz.groupby("cohort_admin2_year").size()

cohort_admin2_year
TZ-05_Muleba_arab_2015    137
TZ-05_Muleba_gamb_2015     32
TZ-05_Muleba_gcx3_2015      1
TZ-13_Tarime_arab_2012     47
TZ-25_Muheza_arab_2013      1
TZ-25_Muheza_gamb_2013     32
TZ-25_Muheza_gcx3_2013     10
TZ-26_Moshi_arab_2012      40
dtype: int64

We can see there are two cohorts with only 1 sample - these will get dropped from our analysis. Otherwise, all cohorts have at least 10 samples.

Recap: genetic diversity summary statistics#

We are going to compute the following three summary statistics for each cohort:

Nucleotide diversity (theta_pi) - This quantifies the average number of differences (mismatches) between any pair of DNA sequences in a cohort. The higher this statistic, the more differences you will expect to see between any pair of randomly sampled DNA sequences.
Watterson’s estimator (theta_w) - This quantifies the number of segregating sites within a cohort, scaled by a constant. The higher this statistic, the more sites are found to be segregating (i.e., polymorphic).
Tajima’s D (tajima_d) - This quantifies the difference between Watterson’s estimator and nucleotide diversity, and scales the result by a constant. This statistic is useful because it focuses not on the magnitude but on the architecture of diversity in a cohort, compared to a neutrally evolving population of constant size. Negative values imply an excess of rare alleles, and positive values imply a deficit of rare alleles, relative to a population at equilibrium.

Computing diversity summary statistics#

Let’s begin by computing these summary statistics in a single cohort. For this we can use the function cohort_diversity_stats(). Let’s look at the docstring.

ag3.cohort_diversity_stats?

Signature:
ag3.cohort_diversity_stats(
    cohort: Union[str, Tuple[str, str]],
    cohort_size: int,
    region: Union[str, malariagen_data.util.Region, Mapping, List[Union[str, malariagen_data.util.Region, Mapping]], Tuple[Union[str, malariagen_data.util.Region, Mapping], ...]],
    min_cohort_size: Optional[int] = None,
    max_cohort_size: Optional[int] = None,
    site_mask: Optional[str] = 'default',
    site_class: Optional[str] = None,
    sample_sets: Union[str, Sequence[str], NoneType] = None,
    random_seed: int = 42,
    n_jack: int = 200,
    confidence_level: float = 0.95,
    chunks: Union[int, str, Tuple[Union[int, str], ...], Callable[[Tuple[int, ...]], Union[int, str, Tuple[Union[int, str], ...]]]] = 'native',
    inline_array: bool = True,
) -> pandas.core.series.Series
Docstring:
Compute genetic diversity summary statistics for a cohort of
individuals.

Parameters
----------
cohort : str or tuple[str, str]
    Either a string giving one of the predefined cohort labels, or a pair
    of strings giving a custom cohort label and a sample query.
cohort_size : int
    Randomly down-sample to this value if the number of samples in the
    cohort is greater. Raise an error if the number of samples is less
    than this value.
region : str or Region or Mapping or list of str or Region or Mapping or tuple of str or Region or Mapping
    Region of the reference genome. Can be a contig name, region string
    (formatted like "{contig}:{start}-{end}"), or identifier of a genome
    feature such as a gene or transcript. Can also be a sequence (e.g.,
    list) of regions.
min_cohort_size : int or None, optional
    Minimum cohort size. Raise an error if the number of samples is less
    than this value.
max_cohort_size : int or None, optional
    Randomly down-sample to this value if the number of samples in the
    cohort is greater.
site_mask : str or None, optional, default: 'default'
    Which site filters mask to apply. See the `site_mask_ids` property for
    available values.
site_class : str or None, optional
    Select sites belonging to one of the following classes: CDS_DEG_4,
    (4-fold degenerate coding sites), CDS_DEG_2_SIMPLE (2-fold simple
    degenerate coding sites), CDS_DEG_0 (non-degenerate coding sites),
    INTRON_SHORT (introns shorter than 100 bp), INTRON_LONG (introns
    longer than 200 bp), INTRON_SPLICE_5PRIME (intron within 2 bp of 5'
    splice site), INTRON_SPLICE_3PRIME (intron within 2 bp of 3' splice
    site), UTR_5PRIME (5' untranslated region), UTR_3PRIME (3'
    untranslated region), INTERGENIC (intergenic, more than 10 kbp from a
    gene).
sample_sets : sequence of str or str or None, optional
    List of sample sets and/or releases. Can also be a single sample set
    or release.
random_seed : int, optional, default: 42
    Random seed used for reproducible down-sampling.
n_jack : int, optional, default: 200
    Number of blocks to divide the data into for the block jackknife
    estimation of confidence intervals. N.B., larger is not necessarily
    better.
confidence_level : float, optional, default: 0.95
    Confidence level to use for confidence interval calculation. E.g.,
    0.95 means 95% confidence interval.
chunks : int or str or tuple of int or str or Callable[[typing.Tuple[int, ...]], int or str or tuple of int or str], optional, default: 'native'
    Define how input data being read from zarr should be divided into
    chunks for a dask computation. If 'native', use underlying zarr
    chunks. If a string specifying a target memory size, e.g., '300 MiB',
    resize chunks in arrays with more than one dimension to match this
    size. If 'auto', let dask decide chunk size.  If 'ndauto', let dask
    decide chunk size but only for arrays with more than one dimension. If
    'ndauto0', as 'ndauto' but only vary the first chunk dimension. If
    'ndauto1', as 'ndauto' but only vary the second chunk dimension. If
    'ndauto01', as 'ndauto' but only vary the first and second chunk
    dimensions. Also, can be a tuple of integers, or a callable which
    accepts the native chunks as a single argument and returns a valid
    dask chunks value.
inline_array : bool, optional, default: True
    Passed through to dask `from_array()`.

Returns
-------
Series
    A pandas series with summary statistics (theta pi, Watterson's theta
    and Tajima's D) and their estimate, bias, standard error, confidence
    interval error, confidence interval lower value, and confidence
    interval upper value. The series also contains the cohort under
    study, its taxon, its year of collection, its month of collection,
    its country of collection, the ISO code of its first administrative
    level of collection, the name of its first administrative level of
    collection, the name of its second administrative level of
    collection, the longitude of its location of collection, and the
    latitude of its location of collection.
File:      /home/conda/developer/35674e27b19f7c625ba32a1b88449ff45c90b40edb90a065b66c5a9a5388f41c-20250421-195247-360829-93-training-nb-maintenance-mgen-15.2.0/lib/python3.11/site-packages/malariagen_data/anopheles.py
Type:      method

Let’s now compute the statistics for one of our Tanzanian cohorts, TZ-05_Muleba_gamb_2015.

stats = ag3.cohort_diversity_stats(
    cohort="TZ-05_Muleba_gamb_2015",
    cohort_size=30,
    region="3L:15,000,000-41,000,000",
    site_mask="gamb_colu_arab",
    site_class="CDS_DEG_4",
)
stats

cohort               TZ-05_Muleba_gamb_2015
theta_pi                           0.021167
theta_pi_estimate                  0.021173
theta_pi_bias                     -0.000005
theta_pi_std_err                     0.0003
theta_pi_ci_err                    0.001176
theta_pi_ci_low                    0.020585
theta_pi_ci_upp                    0.021761
theta_w                            0.038855
theta_w_estimate                   0.038854
theta_w_bias                       0.000001
theta_w_std_err                    0.000436
theta_w_ci_err                     0.001709
theta_w_ci_low                     0.037999
theta_w_ci_upp                     0.039708
tajima_d                          -1.635894
tajima_d_estimate                 -1.635436
tajima_d_bias                     -0.000459
tajima_d_std_err                   0.015294
tajima_d_ci_err                     0.05995
tajima_d_ci_low                   -1.665411
tajima_d_ci_upp                   -1.605461
taxon                               gambiae
year                                   2015
month                                [3, 6]
country                            Tanzania
admin1_iso                            TZ-05
admin1_name                          Kagera
admin2_name                          Muleba
longitude                            31.621
latitude                             -1.962
dtype: object

This function returns a pandas Series with values of the summary statistics for our requested cohort. The field theta_pi contains the value of nucleotide diversity. The field theta_w contains the value of Watterson’s estimator. The field tajima_d contains the value of Tajima’s D.

A 95% confidence interval has also been estimated for each of these statistics using a block jackknife procedure. The lower and upper ends of these confidence intervals are given by the fields with the suffixes _ci_low and _ci_upp respectively. The fields with suffix _ci_err give the size of the confidence interval, and is useful for plotting error bars.

It is important to note that, in addition to the cohort parameter, we also provided values for the parameters cohort_size, region, site_mask and site_class. These parameters are important and can make a big difference to the statistic values obtained.

To illustrate the impact of changing some of these parameters, let’s recompute the statistics but providing cohort_size of 10 instead of 30.

stats = ag3.cohort_diversity_stats(
    cohort="TZ-05_Muleba_gamb_2015",
    cohort_size=10,
    region="3L:15,000,000-41,000,000",
    site_mask="gamb_colu_arab",
    site_class="CDS_DEG_4",
)
stats

cohort               TZ-05_Muleba_gamb_2015
theta_pi                           0.021242
theta_pi_estimate                  0.021247
theta_pi_bias                     -0.000004
theta_pi_std_err                   0.000302
theta_pi_ci_err                    0.001182
theta_pi_ci_low                    0.020655
theta_pi_ci_upp                    0.021838
theta_w                             0.03035
theta_w_estimate                   0.030352
theta_w_bias                      -0.000003
theta_w_std_err                    0.000377
theta_w_ci_err                     0.001478
theta_w_ci_low                     0.029613
theta_w_ci_upp                     0.031092
tajima_d                          -1.259672
tajima_d_estimate                 -1.259398
tajima_d_bias                     -0.000274
tajima_d_std_err                   0.016672
tajima_d_ci_err                    0.065353
tajima_d_ci_low                   -1.292075
tajima_d_ci_upp                   -1.226721
taxon                               gambiae
year                                   2015
month                                [3, 6]
country                            Tanzania
admin1_iso                            TZ-05
admin1_name                          Kagera
admin2_name                          Muleba
longitude                            31.621
latitude                             -1.962
dtype: object

Note that the value of theta_pi hasn’t changed much, but the values of theta_w and tajima_d are considerably different. This is because, for many cohorts, there is a strong dependency between these statistics and the cohort size, i.e., the number of samples used to calculate the statistics. Because of this, if we want to make a fair comparison between different cohorts, it is important to use the same cohort size in all cases. If we have a cohort with more samples than our chosen cohort size, the cohort will be randomly downsampled.

Another parameter that makes a big difference is the site_class parameter, which defines the type of genomic site used to calculate the diversity statistics. Above we used “CDS_DEG_4” which stands for 4-fold degenerate coding sites, i.e., sites within gene coding sequences where none of the possible nucleotide changes will affect the resulting amino acid. These sites are typically under relatively weak selective constraint, because they have little to no impact on the phenotype of the organism.

For illustration, let’s contrast this with “CDS_DEG_0” which stands for non-degenerate coding sites, i.e., sites where any nucleotide change will change the amino acid.

stats = ag3.cohort_diversity_stats(
    cohort="TZ-05_Muleba_gamb_2015",
    cohort_size=10,
    region="3L:15,000,000-41,000,000",
    site_mask="gamb_colu_arab",
    site_class="CDS_DEG_0",
)
stats

cohort               TZ-05_Muleba_gamb_2015
theta_pi                           0.002492
theta_pi_estimate                  0.002492
theta_pi_bias                          -0.0
theta_pi_std_err                   0.000091
theta_pi_ci_err                    0.000356
theta_pi_ci_low                    0.002314
theta_pi_ci_upp                     0.00267
theta_w                            0.004047
theta_w_estimate                   0.004047
theta_w_bias                      -0.000001
theta_w_std_err                     0.00013
theta_w_ci_err                     0.000509
theta_w_ci_low                     0.003793
theta_w_ci_upp                     0.004302
tajima_d                          -1.612858
tajima_d_estimate                 -1.612861
tajima_d_bias                      0.000004
tajima_d_std_err                   0.026732
tajima_d_ci_err                    0.104788
tajima_d_ci_low                   -1.665255
tajima_d_ci_upp                   -1.560467
taxon                               gambiae
year                                   2015
month                                [3, 6]
country                            Tanzania
admin1_iso                            TZ-05
admin1_name                          Kagera
admin2_name                          Muleba
longitude                            31.621
latitude                             -1.962
dtype: object

Note that the value of theta_pi has now reduced by approximately an order of magnitude. This is because non-degenerate coding sites are subject to much stronger purifying selection than 4-fold degenerate sites.

In this module we are most interested in using comparative analysis of diversity statistics to learn more about demographic differences and changes within our sampled mosquito populations. Hence we will prefer to use sites that are less affected by selection, and will use 4-fold degenerate sites (“CDS_DEG_4”) for the remainder of the module.

A final note, above we chose to provide the parameter region="3L:15,000,000-41,000,000", which defines the genome region we will use for diversity calculations. We typically prefer to use chromosome arm 3L because there are no large polymorphic inversions in any of the species we are studying. We also exclude the pericentromeric and telomeric regions because these have reduced diversity due to reduced recombination and hence selection at linked sites.

Comparing diversity between cohorts#

Now we have seen how to compute statistics for a single cohort, let’s run the same computation but for multiple cohorts, and see how to compare the results.

Example: Tanzania#

Sticking with our focus on Tanzania, let’s remind ourselves what cohorts and sample sizes are available if we group samples by the “cohort_admin2_year” field, which will group by taxon, district and year.

df_samples_tz.groupby("cohort_admin2_year").size()

cohort_admin2_year
TZ-05_Muleba_arab_2015    137
TZ-05_Muleba_gamb_2015     32
TZ-05_Muleba_gcx3_2015      1
TZ-13_Tarime_arab_2012     47
TZ-25_Muheza_arab_2013      1
TZ-25_Muheza_gamb_2013     32
TZ-25_Muheza_gcx3_2013     10
TZ-26_Moshi_arab_2012      40
dtype: int64

If we choose a cohort size of 10 for our comparative analysis, we will be able to analyse 6 cohorts.

To compute diversity for multiple cohorts at the same time, we have provided a diversity_stats() function.

ag3.diversity_stats?

Signature:
ag3.diversity_stats(
    cohorts: Union[str, Mapping[str, str]],
    cohort_size: int,
    region: Union[str, malariagen_data.util.Region, Mapping, List[Union[str, malariagen_data.util.Region, Mapping]], Tuple[Union[str, malariagen_data.util.Region, Mapping], ...]],
    site_mask: Optional[str] = 'default',
    site_class: Optional[str] = None,
    sample_query: Optional[str] = None,
    sample_query_options: Optional[dict] = None,
    sample_sets: Union[str, Sequence[str], NoneType] = None,
    random_seed: int = 42,
    n_jack: int = 200,
    confidence_level: float = 0.95,
    chunks: Union[int, str, Tuple[Union[int, str], ...], Callable[[Tuple[int, ...]], Union[int, str, Tuple[Union[int, str], ...]]]] = 'native',
    inline_array: bool = True,
) -> pandas.core.frame.DataFrame
Docstring:
Compute genetic diversity summary statistics for multiple cohorts.

Parameters
----------
cohorts : str or Mapping[str, str]
    Either a string giving the name of a predefined cohort set (e.g.,
    "admin1_month") or a dict mapping custom cohort labels to sample
    queries.
cohort_size : int
    Randomly down-sample to this value if the number of samples in the
    cohort is greater. Raise an error if the number of samples is less
    than this value.
region : str or Region or Mapping or list of str or Region or Mapping or tuple of str or Region or Mapping
    Region of the reference genome. Can be a contig name, region string
    (formatted like "{contig}:{start}-{end}"), or identifier of a genome
    feature such as a gene or transcript. Can also be a sequence (e.g.,
    list) of regions.
site_mask : str or None, optional, default: 'default'
    Which site filters mask to apply. See the `site_mask_ids` property for
    available values.
site_class : str or None, optional
    Select sites belonging to one of the following classes: CDS_DEG_4,
    (4-fold degenerate coding sites), CDS_DEG_2_SIMPLE (2-fold simple
    degenerate coding sites), CDS_DEG_0 (non-degenerate coding sites),
    INTRON_SHORT (introns shorter than 100 bp), INTRON_LONG (introns
    longer than 200 bp), INTRON_SPLICE_5PRIME (intron within 2 bp of 5'
    splice site), INTRON_SPLICE_3PRIME (intron within 2 bp of 3' splice
    site), UTR_5PRIME (5' untranslated region), UTR_3PRIME (3'
    untranslated region), INTERGENIC (intergenic, more than 10 kbp from a
    gene).
sample_query : str or None, optional
    A pandas query string to be evaluated against the sample metadata, to
    select samples to be included in the returned data.
sample_query_options : dict or None, optional
    A dictionary of arguments that will be passed through to pandas
    query() or eval(), e.g. parser, engine, local_dict, global_dict,
    resolvers.
sample_sets : sequence of str or str or None, optional
    List of sample sets and/or releases. Can also be a single sample set
    or release.
random_seed : int, optional, default: 42
    Random seed used for reproducible down-sampling.
n_jack : int, optional, default: 200
    Number of blocks to divide the data into for the block jackknife
    estimation of confidence intervals. N.B., larger is not necessarily
    better.
confidence_level : float, optional, default: 0.95
    Confidence level to use for confidence interval calculation. E.g.,
    0.95 means 95% confidence interval.
chunks : int or str or tuple of int or str or Callable[[typing.Tuple[int, ...]], int or str or tuple of int or str], optional, default: 'native'
    Define how input data being read from zarr should be divided into
    chunks for a dask computation. If 'native', use underlying zarr
    chunks. If a string specifying a target memory size, e.g., '300 MiB',
    resize chunks in arrays with more than one dimension to match this
    size. If 'auto', let dask decide chunk size.  If 'ndauto', let dask
    decide chunk size but only for arrays with more than one dimension. If
    'ndauto0', as 'ndauto' but only vary the first chunk dimension. If
    'ndauto1', as 'ndauto' but only vary the second chunk dimension. If
    'ndauto01', as 'ndauto' but only vary the first and second chunk
    dimensions. Also, can be a tuple of integers, or a callable which
    accepts the native chunks as a single argument and returns a valid
    dask chunks value.
inline_array : bool, optional, default: True
    Passed through to dask `from_array()`.

Returns
-------
DataFrame
    A DataFrame where each row provides summary statistics and their
    confidence intervals for a single cohort. The columns are the value,
    the estimate, the bias, the standard error, the confidence interval
    error, the confidence interval lower value, the confidence interval
    upper value for each summary statistics (theta pi, Watterson's theta
    and Tajima's D), the taxon of the cohort, its year of collection,
    its month of collection, its country of collection, the ISO code of
    its first administrative level of collection, the name of its first
    administrative level of collection, the name of its second
    administrative level of collection, the longitude of its location of
    collection, and the latitude of its location of collection.
File:      /home/conda/developer/35674e27b19f7c625ba32a1b88449ff45c90b40edb90a065b66c5a9a5388f41c-20250421-195247-360829-93-training-nb-maintenance-mgen-15.2.0/lib/python3.11/site-packages/malariagen_data/anopheles.py
Type:      method

This function has many of the same parameters we discussed above, except now we provide a sample_query parameter to restrict the overall analysis to samples in Tanzania, and we provide a cohorts parameter to define how these samples will be grouped into cohorts.

df_stats_tz_admin2_year = ag3.diversity_stats(
    sample_query="release == '3.0' and country == 'Tanzania'",
    cohorts="admin2_year",
    cohort_size=10,
    region="3L:15,000,000-41,000,000",
    site_mask="gamb_colu_arab",
    site_class="CDS_DEG_4",
)
df_stats_tz_admin2_year

Cohort (TZ-05_Muleba_gcx3_2015) has insufficient samples (1) for requested cohort size (10), dropping.
Cohort (TZ-25_Muheza_arab_2013) has insufficient samples (1) for requested cohort size (10), dropping.

	cohort	theta_pi	theta_pi_estimate	theta_pi_bias	theta_pi_std_err	theta_pi_ci_err	theta_pi_ci_low	theta_pi_ci_upp	theta_w	theta_w_estimate	...	tajima_d_ci_upp	taxon	year	month	country	admin1_iso	admin1_name	admin2_name	longitude	latitude
0	TZ-05_Muleba_arab_2015	0.012306	0.012306	6.185033e-07	0.000224	0.000877	0.011867	0.012744	0.014789	0.014790	...	-0.658653	arabiensis	2015	[3, 4, 6]	Tanzania	TZ-05	Kagera	Muleba	31.621	-1.962
1	TZ-05_Muleba_gamb_2015	0.021242	0.021247	-4.337053e-06	0.000302	0.001182	0.020655	0.021838	0.030350	0.030352	...	-1.226721	gambiae	2015	[3, 6]	Tanzania	TZ-05	Kagera	Muleba	31.621	-1.962
2	TZ-13_Tarime_arab_2012	0.012435	0.012435	-3.120599e-07	0.000220	0.000861	0.012005	0.012866	0.015058	0.015059	...	-0.681130	arabiensis	2012	8	Tanzania	TZ-13	Mara	Tarime	34.199	-1.431
3	TZ-25_Muheza_gamb_2013	0.017705	0.017705	4.563558e-07	0.000290	0.001136	0.017137	0.018273	0.018939	0.018935	...	-0.218096	gambiae	2013	1	Tanzania	TZ-25	Tanga	Muheza	38.948	-4.940
4	TZ-25_Muheza_gcx3_2013	0.013395	0.013398	-2.958723e-06	0.000342	0.001340	0.012728	0.014068	0.012917	0.012918	...	0.257321	gcx3	2013	1	Tanzania	TZ-25	Tanga	Muheza	38.948	-4.940
5	TZ-26_Moshi_arab_2012	0.012374	0.012373	1.310417e-06	0.000219	0.000857	0.011945	0.012801	0.014669	0.014667	...	-0.604319	arabiensis	2012	8	Tanzania	TZ-26	Manyara	Moshi	37.308	-3.482

6 rows × 31 columns

This function returns a pandas DataFrame, where each row contains data about a cohort, and the columns provide the different statistic values and their confidence intervals. Let’s take a look at all the available columns:

df_stats_tz_admin2_year.columns

Index(['cohort', 'theta_pi', 'theta_pi_estimate', 'theta_pi_bias',
       'theta_pi_std_err', 'theta_pi_ci_err', 'theta_pi_ci_low',
       'theta_pi_ci_upp', 'theta_w', 'theta_w_estimate', 'theta_w_bias',
       'theta_w_std_err', 'theta_w_ci_err', 'theta_w_ci_low', 'theta_w_ci_upp',
       'tajima_d', 'tajima_d_estimate', 'tajima_d_bias', 'tajima_d_std_err',
       'tajima_d_ci_err', 'tajima_d_ci_low', 'tajima_d_ci_upp', 'taxon',
       'year', 'month', 'country', 'admin1_iso', 'admin1_name', 'admin2_name',
       'longitude', 'latitude'],
      dtype='object')

A simple way to visualise these statistics and compare cohorts is to make a bar plot. Let’s do this for nucleotide diversity using plotly express.

px.bar(
    data_frame=df_stats_tz_admin2_year,
    x="cohort",
    y="theta_pi_estimate",
    error_y="theta_pi_ci_err",
    color="taxon",
    height=450,
    width=500,
)

Above we chose to colour the bars by taxon, which helps to see some of the patterns in the data. For example, we have three An. arabiensis cohorts, all of which have nearly identical nucleotide diversity. We also have two An. gambiae cohorts, both of which have higher nucleotide diversity. There is also a significant difference between the two An. gambiae cohorts from the east and west of the country. Finally, we have one cohort representing the putative cryptic taxon which we have provisionally labelled gcx3 (see workshop 4, module 4).

We also would like to make similar plots to visualise Watterson’s estimator and Tajima’s D. Because we’ll want to make these plots often, let’s define a simple function to make all the plots we need.

def plot_diversity_stats(
    df_stats, 
    color=None, 
    height=450, 
    template="plotly_white"
):

    # set up common plotting parameters
    hover_name = "cohort"
    hover_data = [
        "taxon",
        "country",
        "admin1_iso",
        "admin1_name",
        "admin2_name",
        "year",
        "month",
    ]
    labels = {
        'theta_pi_estimate': r'$\widehat{\theta}_{\pi}$',
        'theta_w_estimate': r'$\widehat{\theta}_{w}$',
        'tajima_d_estimate': r'$D$',
        'cohort': "Cohort",
        'taxon': 'Taxon',
        'country': "Country",
    }
    category_orders = {
        "taxon": ["arabiensis", "gambiae", "gcx3", "coluzzii", "gcx1", "gcx2"],
    }
    width = 300 + 30 * len(df_stats)

    # nucleotide diversity bar plot
    fig = px.bar(
        data_frame=df_stats,
        x="cohort",
        y="theta_pi_estimate",
        error_y="theta_pi_ci_err",
        title="Nucleotide diversity",
        color=color,
        height=height,
        width=width,
        hover_name=hover_name,
        hover_data=hover_data,
        labels=labels,
        template=template,
        category_orders=category_orders,
    )
    fig.show()

    # watterson estimator bar plot
    fig = px.bar(
        data_frame=df_stats,
        x="cohort",
        y="theta_w_estimate",
        error_y="theta_w_ci_err",
        title="Watterson estimator",
        color=color,
        height=height,
        width=width,
        hover_name=hover_name,
        hover_data=hover_data,
        labels=labels,
        template=template,
        category_orders=category_orders,
    )
    fig.show()

    # tajima's d bar plot
    fig = px.bar(
        data_frame=df_stats,
        x="cohort",
        y="tajima_d_estimate",
        error_y="tajima_d_ci_err",
        title="Tajima's D",
        color=color,
        height=height,
        width=width,
        hover_name=hover_name,
        hover_data=hover_data,
        labels=labels,
        template=template,
        category_orders=category_orders,
    )
    fig.show()

    # scatter plot comparing diversity estimators
    fig = px.scatter(
        data_frame=df_stats,
        x="theta_pi_estimate",
        y="theta_w_estimate",
        error_x="theta_pi_ci_err",
        error_y="theta_w_ci_err",
        title="Diversity estimators",
        color=color,
        width=500,
        height=500,
        hover_name=hover_name,
        hover_data=hover_data,
        labels=labels,
        template=template,
        category_orders=category_orders,
    )
    fig.show()

Let’s now make these plots for our Tanzanian cohorts.

plot_diversity_stats(df_stats_tz_admin2_year, color="taxon")

From the above we can see the following patterns:

All An. arabiensis cohorts have similar diversity, regardless of sampling location.
An. gambiae cohorts have different diversity between the west and east of the country.
The gcx3 cohort is quite different from all other cohorts, and is the only cohort with a positive value for Tajima’s D.

Exercise 1 (English)

Uncomment and run the code cells below to repeat the above analysis, but using cohorts grouped by district (administrative level 2) and month.

Do the results change any of our observations about the similarities and differences between species and geographical locations?

Is there any evidence for differences in diversity between different months of sampling?

Exercice 1 (Français)

Décommentez et exécutez les cellules de code ci-dessous pour répéter l’analyse ci-dessus, mais en utilisant des cohortes regroupées par district (niveau administratif 2) et mois.

Les résultats modifient-ils certaines de nos observations sur les similitudes et les différences entre les espèces et les emplacements géographiques?

Existe-t-il des preuves de différences de diversité entre les différents mois d’échantillonnage?

# df_samples_tz.groupby("cohort_admin2_month").size()

# df_stats_tz_admin2_month = ag3.diversity_stats(
#     sample_query="release == '3.0' and country == 'Tanzania'",
#     cohorts="admin2_month",
#     cohort_size=10,
#     region="3L:15,000,000-41,000,000",
#     site_mask="gamb_colu_arab",
#     site_class="CDS_DEG_4",
# )
# plot_diversity_stats(df_stats_tz_admin2_month, color="taxon")

Example: East Africa#

Let’s now expand our analysis to include samples from neighbouring countries, to investigate a broader geographical range.

Firstly, let’s plot a map to remind ourselves of the available sampling locations.

ag3.plot_samples_interactive_map(sample_query="release == '3.0'", basemap='worldimagery')

N.B., you will need to run this notebook for yourself in colab in order for the interactive map to work.

From the map we can see that there are two sampling locations in Uganda and one location in Kenya. Let’s add these countries and investigate sample sizes.

ag3.count_samples(
    sample_query="release == '3.0' and country in ['Tanzania', 'Kenya', 'Uganda']"
)

				taxon	arabiensis	gambiae	gcx3	unassigned
country	admin1_iso	admin1_name	admin2_name	year
Kenya	KE-14	Kilifi	Kilifi North	2000	0	19	0	0
				2007	3	0	0	0
				2012	10	0	54	0
Tanzania	TZ-05	Kagera	Muleba	2015	137	32	1	0
	TZ-13	Mara	Tarime	2012	47	0	0	0
	TZ-25	Tanga	Muheza	2013	1	32	10	0
	TZ-26	Manyara	Moshi	2012	40	0	0	0
Uganda	UG-E	Eastern Region	Tororo	2012	81	112	0	1
Uganda	UG-W	Western Region	Kanungu	2012	1	95	0	0

Let’s also confirm how many cohorts would be available if we continued to use the “cohort_admin2_year” field to group by taxon, district and year.

ag3.sample_metadata(
    sample_query="release == '3.0' and country in ['Tanzania', 'Kenya', 'Uganda']"
).groupby("cohort_admin2_year").size()

cohort_admin2_year
KE-14_Kilifi-North_arab_2007      3
KE-14_Kilifi-North_arab_2012     10
KE-14_Kilifi-North_gamb_2000     19
KE-14_Kilifi-North_gcx3_2012     54
TZ-05_Muleba_arab_2015          137
TZ-05_Muleba_gamb_2015           32
TZ-05_Muleba_gcx3_2015            1
TZ-13_Tarime_arab_2012           47
TZ-25_Muheza_arab_2013            1
TZ-25_Muheza_gamb_2013           32
TZ-25_Muheza_gcx3_2013           10
TZ-26_Moshi_arab_2012            40
UG-E_Tororo_arab_2012            81
UG-E_Tororo_gamb_2012           112
UG-W_Kanungu_arab_2012            1
UG-W_Kanungu_gamb_2012           95
dtype: int64

Here, sticking with a cohort size of 10 will allow us to include 3 cohorts from Kenya and 3 cohorts from Uganda.

Let’s now compute and plot diversity stats for all cohorts.

df_stats_tz_ke_ug_admin2_year = ag3.diversity_stats(
    sample_query="release == '3.0' and country in ['Tanzania', 'Kenya', 'Uganda']",
    cohorts="admin2_year",
    cohort_size=10,
    region="3L:15,000,000-41,000,000",
    site_mask="gamb_colu_arab",
    site_class="CDS_DEG_4",
)
df_stats_tz_ke_ug_admin2_year

Cohort (KE-14_Kilifi-North_arab_2007) has insufficient samples (3) for requested cohort size (10), dropping.
Cohort (TZ-05_Muleba_gcx3_2015) has insufficient samples (1) for requested cohort size (10), dropping.
Cohort (TZ-25_Muheza_arab_2013) has insufficient samples (1) for requested cohort size (10), dropping.
Cohort (UG-W_Kanungu_arab_2012) has insufficient samples (1) for requested cohort size (10), dropping.

	cohort	theta_pi	theta_pi_estimate	theta_pi_bias	theta_pi_std_err	theta_pi_ci_err	theta_pi_ci_low	theta_pi_ci_upp	theta_w	theta_w_estimate	...	tajima_d_ci_upp	taxon	year	month	country	admin1_iso	admin1_name	admin2_name	longitude	latitude
0	KE-14_Kilifi-North_arab_2012	0.012515	0.012513	1.739748e-06	0.000221	0.000868	0.012080	0.012947	0.015013	0.015010	...	-0.649706	arabiensis	2012	-1	Kenya	KE-14	Kilifi	Kilifi North	39.909	-3.511
1	KE-14_Kilifi-North_gamb_2000	0.017751	0.017754	-3.812392e-06	0.000281	0.001103	0.017203	0.018306	0.018947	0.018950	...	-0.213639	gambiae	2000	-1	Kenya	KE-14	Kilifi	Kilifi North	39.909	-3.511
2	KE-14_Kilifi-North_gcx3_2012	0.013473	0.013474	-1.341458e-06	0.000276	0.001082	0.012933	0.014015	0.011567	0.011568	...	0.814804	gcx3	2012	-1	Kenya	KE-14	Kilifi	Kilifi North	39.909	-3.511
3	TZ-05_Muleba_arab_2015	0.012306	0.012306	6.185033e-07	0.000224	0.000877	0.011867	0.012744	0.014789	0.014790	...	-0.658653	arabiensis	2015	[3, 4, 6]	Tanzania	TZ-05	Kagera	Muleba	31.621	-1.962
4	TZ-05_Muleba_gamb_2015	0.021242	0.021247	-4.337053e-06	0.000302	0.001182	0.020655	0.021838	0.030350	0.030352	...	-1.226721	gambiae	2015	[3, 6]	Tanzania	TZ-05	Kagera	Muleba	31.621	-1.962
5	TZ-13_Tarime_arab_2012	0.012435	0.012435	-3.120599e-07	0.000220	0.000861	0.012005	0.012866	0.015058	0.015059	...	-0.681130	arabiensis	2012	8	Tanzania	TZ-13	Mara	Tarime	34.199	-1.431
6	TZ-25_Muheza_gamb_2013	0.017705	0.017705	4.563558e-07	0.000290	0.001136	0.017137	0.018273	0.018939	0.018935	...	-0.218096	gambiae	2013	1	Tanzania	TZ-25	Tanga	Muheza	38.948	-4.940
7	TZ-25_Muheza_gcx3_2013	0.013395	0.013398	-2.958723e-06	0.000342	0.001340	0.012728	0.014068	0.012917	0.012918	...	0.257321	gcx3	2013	1	Tanzania	TZ-25	Tanga	Muheza	38.948	-4.940
8	TZ-26_Moshi_arab_2012	0.012374	0.012373	1.310417e-06	0.000219	0.000857	0.011945	0.012801	0.014669	0.014667	...	-0.604319	arabiensis	2012	8	Tanzania	TZ-26	Manyara	Moshi	37.308	-3.482
9	UG-E_Tororo_arab_2012	0.012560	0.012560	8.203444e-07	0.000224	0.000876	0.012121	0.012998	0.015198	0.015195	...	-0.679454	arabiensis	2012	10	Uganda	UG-E	Eastern Region	Tororo	34.026	0.770
10	UG-E_Tororo_gamb_2012	0.021693	0.021698	-4.572455e-06	0.000317	0.001244	0.021076	0.022320	0.031458	0.031460	...	-1.271359	gambiae	2012	10	Uganda	UG-E	Eastern Region	Tororo	34.026	0.770
11	UG-W_Kanungu_gamb_2012	0.021464	0.021469	-4.432877e-06	0.000310	0.001214	0.020862	0.022076	0.030680	0.030685	...	-1.225296	gambiae	2012	[10, 11]	Uganda	UG-W	Western Region	Kanungu	29.701	-0.751

12 rows × 31 columns

plot_diversity_stats(df_stats_tz_ke_ug_admin2_year, color="taxon")

Some observations about these results:

All An. arabiensis cohorts show similar diversity. This indicates that all An. arabiensis populations from across this region share very similar demographic histories.
Diversity in An. gambiae from western Tanzania is very similar to both An. gambiae cohorts from Uganda. Conversely, diversity in An. gambiae from eastern Tanzania is very similar to An. gambiae from coastal Kenya. This supports a strong population division within An. gambiae, with different demographic histories.
Diversity is very similar between the two gcx3 cohorts from coastal Tanzania and Kenya. This indicates that the two gcx3 cohorts are sampled from populations with similar demographic histories. In particular, both cohorts have positive values for Tajima’s D, which is consistent with a reduction in population size. We cannot infer the timing or strength of this reduction from Tajima’s D alone, but it is interesting to note that these are the only two cohorts with evidence for population reduction, whereas all An. gambiae and An. arabiensis cohorts have negative Tajima’s D, which is consistent with historical population expansion. This certainly adds evidence for an important distinction between the gcx3 taxon and other mosquito taxa in the region.

Exercise 2 (English)

Rerun the diversity analysis for East Africa in release 3.0, adding Mozambique, Malawi and Mayotte. Discuss the patterns you observe with a colleague.

Exercice 2 (Français)

Réexécutez l’analyse de la diversité pour l’Afrique de l’Est dans la version 3.0, en ajoutant le Mozambique, le Malawi et Mayotte. Discutez des tendances que vous observez avec un collègue.

Example: Burkina Faso#

Finally, let’s look at an example where we have data from the same location from multiple years, to illustrate how we might use diversity to monitor populations for signs of recent demographic changes.

In Burkina Faso we have data from the Houet district from 2012 and 2014, for both An. gambiae and An. coluzzii.

ag3.count_samples(
    sample_query="release == '3.0' and country == 'Burkina Faso'"
)

				taxon	arabiensis	coluzzii	gambiae
country	admin1_iso	admin1_name	admin2_name	year
Burkina Faso	BF-07	Centre-Sud	Bazega	2004	0	0	13
	BF-09	Hauts-Bassins	Houet	2012	0	82	99
	BF-09	Hauts-Bassins	Houet	2014	3	53	46

Exercise 3 (English)

Analyse diversity in Burkina Faso.

Do you observe any differences between An. gambiae and An. coluzzii?

Do you observe any differences between 2012 and 2014?

Exercice 3 (Français)

Analyser la diversité au Burkina Faso.

Observez-vous des différences entre An. gambiae et An. coluzzii?

Observez-vous des différences entre 2012 et 2014 ?

Well done!#

Well done for completing this module on genetic diversity summary statistics!

Exercises#

Please now launch this notebook in colab, clear the outputs, copy it to your drive and run it from the top, complete the exercises you find along the way.

Challenge (English)#

Choose any country or geographical region and perform comparative analysis of genetic diversity. Share your findings and conclusions with colleagues.

Défi (Français)#

Choisissez n’importe quel pays ou région géographique et effectuez une analyse comparative de la diversité génétique. Partagez vos découvertes et conclusions avec vos collègues.