Workshop 5 - Training course in data analysis for genomic surveillance of African malaria vectors
Module 3 - Genetic diversity summary statistics#
Theme: Analysis
This module covers how to compute genetic diversity summary statistics, which are a convenient and informative way to quantify and compare diversity in different mosquito populations.
Learning objectives#
At the end of this module you will be able to:
Select cohorts for comparative analysis of genetic diversity
Compute nucleotide diversity, Watterson’s estimator and Tajima’s D
Plot diversity statistics for multiple cohorts
Interpret the results
Lecture#
English#
Français#
Please note that the code in the cells below might differ from that shown in the video. This can happen because Python packages and their dependencies change due to updates, necessitating tweaks to the code.
Setup#
Install and import the packages we’ll need.
%pip install -q --no-warn-conflicts malariagen_data
Note: you may need to restart the kernel to use updated packages.
import malariagen_data
import plotly.express as px
Here we will compute summary statistics using data from millions of SNPs. This can take some time, so we will use Google Drive to save results and avoid having to rerun computations. To do this we first need to mount Google Drive to our colab VM:
Configure access to MalariaGEN data in Google Cloud.
try:
from google.colab import drive
drive.mount("drive")
except ImportError:
pass
Note that authentication is required to access data through the package, please follow the instructions here.
ag3 = malariagen_data.Ag3(
results_cache="drive/MyDrive/malariagen_data_cache"
)
ag3
MalariaGEN Ag3 API client | |
---|---|
Please note that data are subject to terms of use, for more information see the MalariaGEN website or contact support@malariagen.net. See also the Ag3 API docs. | |
Storage URL | gs://vo_agam_release/ |
Data releases available | 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9 |
Results cache | /home/kellylbennett/github/malariagen/anopheles-genomic-surveillance/docs/workshop-5/drive/MyDrive/malariagen_data_cache |
Cohorts analysis | 20240418 |
AIM analysis | 20220528 |
Site filters analysis | dt_20200416 |
Software version | malariagen_data 9.0.0 |
Client location | unknown |
Selecting cohorts#
In this module we will be comparing genetic diversity between mosquitoes representing different geographical locations, time points and species. Before we begin any comparative analysis of genetic diversity, we first need to decide how to group the mosquitoes we’ve sequenced into cohorts. Each cohort we define will represent a sample of mosquitoes from a given time, place and species.
Let’s first browse the different sample locations for which we have data. In workshop 1, module 2 we looked at how to create an interactive map with sampling locations using ipyleaflet. Because this is something we want to do often, we’ve added a convenience function plot_samples_interactive_map()
to do this for you. This will plot a map showing sampling locations for all available samples.
ag3.plot_samples_interactive_map(sample_query="release == '3.0'", basemap='worldimagery')
For the purposes of this module we will begin by focusing on Tanzania. To get more information on the numbers of samples available from Tanzania, we can use pandas to create a pivot table as we also illustrated in workshop 1, module 2. Again, because we do this often, we’ve created a convenience function count_samples()
.
ag3.count_samples(
sample_query="release == '3.0' and country == 'Tanzania'"
)
taxon | arabiensis | gambiae | gcx3 | ||||
---|---|---|---|---|---|---|---|
country | admin1_iso | admin1_name | admin2_name | year | |||
Tanzania | TZ-05 | Kagera | Muleba | 2015 | 137 | 32 | 1 |
TZ-13 | Mara | Tarime | 2012 | 47 | 0 | 0 | |
TZ-25 | Tanga | Muheza | 2013 | 1 | 32 | 10 | |
TZ-26 | Manyara | Moshi | 2012 | 40 | 0 | 0 |
When grouping samples by geographical location, we sometimes can choose whether to aggregate data from nearby locations, or whether to keep them separate. In general, we want the finest spatial granularity possible, but we also need enough samples to make reasonable estimates of genetic diversity, and so we might have to make a trade-off between granularity and sample size.
Similarly, when grouping samples by date of collection, we can choose whether to aggregate data from the same month, season or year. Again, in general we would prefer the finest temporal granularity possible, but might also need sufficient sample size.
For computing diversity summary statistics, there is no hard-and-fast rule about how many samples are needed, but a sample size of 10 is probably the bare minimum we would consider, and we would prefer more if possible.
To simplify grouping of mosquitoes into cohorts, we have predefined some metadata columns which provide cohort labels at different levels of granularity. Let’s take a peek at the sample metadata:
df_samples_tz = ag3.sample_metadata(
sample_query="release == '3.0' and country == 'Tanzania'"
)
df_samples_tz.head()
sample_id | partner_sample_id | contributor | country | location | year | month | latitude | longitude | sex_call | ... | admin1_name | admin1_iso | admin2_name | taxon | cohort_admin1_year | cohort_admin1_month | cohort_admin1_quarter | cohort_admin2_year | cohort_admin2_month | cohort_admin2_quarter | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | BL0046-C | Plate_C_H6 | Bilali Kabula | Tanzania | Muleba | 2015 | 6 | -1.962 | 31.621 | F | ... | Kagera | TZ-05 | Muleba | gcx3 | TZ-05_gcx3_2015 | TZ-05_gcx3_2015_06 | TZ-05_gcx3_2015_Q2 | TZ-05_Muleba_gcx3_2015 | TZ-05_Muleba_gcx3_2015_06 | TZ-05_Muleba_gcx3_2015_Q2 |
1 | BL0047-C | Plate_F_D4 | Bilali Kabula | Tanzania | Muleba | 2015 | 3 | -1.962 | 31.621 | F | ... | Kagera | TZ-05 | Muleba | arabiensis | TZ-05_arab_2015 | TZ-05_arab_2015_03 | TZ-05_arab_2015_Q1 | TZ-05_Muleba_arab_2015 | TZ-05_Muleba_arab_2015_03 | TZ-05_Muleba_arab_2015_Q1 |
2 | BL0048-C | Plate_F_E4 | Bilali Kabula | Tanzania | Muleba | 2015 | 3 | -1.962 | 31.621 | F | ... | Kagera | TZ-05 | Muleba | arabiensis | TZ-05_arab_2015 | TZ-05_arab_2015_03 | TZ-05_arab_2015_Q1 | TZ-05_Muleba_arab_2015 | TZ-05_Muleba_arab_2015_03 | TZ-05_Muleba_arab_2015_Q1 |
3 | BL0049-C | Plate_F_F4 | Bilali Kabula | Tanzania | Muleba | 2015 | 3 | -1.962 | 31.621 | F | ... | Kagera | TZ-05 | Muleba | arabiensis | TZ-05_arab_2015 | TZ-05_arab_2015_03 | TZ-05_arab_2015_Q1 | TZ-05_Muleba_arab_2015 | TZ-05_Muleba_arab_2015_03 | TZ-05_Muleba_arab_2015_Q1 |
4 | BL0050-C | Plate_F_G4 | Bilali Kabula | Tanzania | Muleba | 2015 | 3 | -1.962 | 31.621 | F | ... | Kagera | TZ-05 | Muleba | arabiensis | TZ-05_arab_2015 | TZ-05_arab_2015_03 | TZ-05_arab_2015_Q1 | TZ-05_Muleba_arab_2015 | TZ-05_Muleba_arab_2015_03 | TZ-05_Muleba_arab_2015_Q1 |
5 rows × 32 columns
If you scroll to the right, you’ll see columns named “cohort_admin1_year”, “cohort_admin1_month”, “cohort_admin1_quarter”, “cohort_admin2_year”, “cohort_admin2_quarter” and “cohort_admin2_month”. Each of these columns provides a different set of cohort labels. All are grouped by species (taxon).
For this analysis, let’s use the “cohort_admin2_year” column to group samples by administrative level 2 divisions (usually called districts) and year. Here are the cohort labels for the Tanzania samples and the sample sizes.
df_samples_tz.groupby("cohort_admin2_year").size()
cohort_admin2_year
TZ-05_Muleba_arab_2015 137
TZ-05_Muleba_gamb_2015 32
TZ-05_Muleba_gcx3_2015 1
TZ-13_Tarime_arab_2012 47
TZ-25_Muheza_arab_2013 1
TZ-25_Muheza_gamb_2013 32
TZ-25_Muheza_gcx3_2013 10
TZ-26_Moshi_arab_2012 40
dtype: int64
We can see there are two cohorts with only 1 sample - these will get dropped from our analysis. Otherwise, all cohorts have at least 10 samples.
Recap: genetic diversity summary statistics#
We are going to compute the following three summary statistics for each cohort:
Nucleotide diversity (
theta_pi
) - This quantifies the average number of differences (mismatches) between any pair of DNA sequences in a cohort. The higher this statistic, the more differences you will expect to see between any pair of randomly sampled DNA sequences.Watterson’s estimator (
theta_w
) - This quantifies the number of segregating sites within a cohort, scaled by a constant. The higher this statistic, the more sites are found to be segregating (i.e., polymorphic).Tajima’s D (
tajima_d
) - This quantifies the difference between Watterson’s estimator and nucleotide diversity, and scales the result by a constant. This statistic is useful because it focuses not on the magnitude but on the architecture of diversity in a cohort, compared to a neutrally evolving population of constant size. Negative values imply an excess of rare alleles, and positive values imply a deficit of rare alleles, relative to a population at equilibrium.
Computing diversity summary statistics#
Let’s begin by computing these summary statistics in a single cohort. For this we can use the function cohort_diversity_stats()
. Let’s look at the docstring.
ag3.cohort_diversity_stats?
Signature:
ag3.cohort_diversity_stats(
cohort: Union[str, Tuple[str, str]],
cohort_size: int,
region: Union[str, malariagen_data.util.Region, Mapping, List[Union[str, malariagen_data.util.Region, Mapping]], Tuple[Union[str, malariagen_data.util.Region, Mapping], ...]],
min_cohort_size: Optional[int] = None,
max_cohort_size: Optional[int] = None,
site_mask: Optional[str] = 'default',
site_class: Optional[str] = None,
sample_sets: Union[Sequence[str], str, NoneType] = None,
random_seed: int = 42,
n_jack: int = 200,
confidence_level: float = 0.95,
) -> pandas.core.series.Series
Docstring:
Compute genetic diversity summary statistics for a cohort of
individuals.
Parameters
----------
cohort : str or tuple[str, str]
Either a string giving one of the predefined cohort labels, or a pair
of strings giving a custom cohort label and a sample query.
cohort_size : int
Randomly down-sample to this value if the number of samples in the
cohort is greater. Raise an error if the number of samples is less
than this value.
region : str or Region or Mapping or list of str or Region or Mapping or tuple of str or Region or Mapping
Region of the reference genome. Can be a contig name, region string
(formatted like "{contig}:{start}-{end}"), or identifier of a genome
feature such as a gene or transcript. Can also be a sequence (e.g.,
list) of regions.
min_cohort_size : int or None, optional
Minimum cohort size. Raise an error if the number of samples is less
than this value.
max_cohort_size : int or None, optional
Randomly down-sample to this value if the number of samples in the
cohort is greater.
site_mask : str or None, optional, default: 'default'
Which site filters mask to apply. See the `site_mask_ids` property for
available values.
site_class : str or None, optional
Select sites belonging to one of the following classes: CDS_DEG_4,
(4-fold degenerate coding sites), CDS_DEG_2_SIMPLE (2-fold simple
degenerate coding sites), CDS_DEG_0 (non-degenerate coding sites),
INTRON_SHORT (introns shorter than 100 bp), INTRON_LONG (introns
longer than 200 bp), INTRON_SPLICE_5PRIME (intron within 2 bp of 5'
splice site), INTRON_SPLICE_3PRIME (intron within 2 bp of 3' splice
site), UTR_5PRIME (5' untranslated region), UTR_3PRIME (3'
untranslated region), INTERGENIC (intergenic, more than 10 kbp from a
gene).
sample_sets : sequence of str or str or None, optional
List of sample sets and/or releases. Can also be a single sample set
or release.
random_seed : int, optional, default: 42
Random seed used for reproducible down-sampling.
n_jack : int, optional, default: 200
Number of blocks to divide the data into for the block jackknife
estimation of confidence intervals. N.B., larger is not necessarily
better.
confidence_level : float, optional, default: 0.95
Confidence level to use for confidence interval calculation. E.g.,
0.95 means 95% confidence interval.
Returns
-------
Series
A pandas series with summary statistics and their confidence
intervals.
File: /home/conda/developer/2835ffd46e5535e81e9672d0828c143d4f1444491ac44f0aeddd90156919e446-20240426-081151-481742-64-training-nb-maintenance-mgen-9.0.0/lib/python3.10/site-packages/malariagen_data/anopheles.py
Type: method
Let’s now compute the statistics for one of our Tanzanian cohorts, TZ-05_Muleba_gamb_2015.
stats = ag3.cohort_diversity_stats(
cohort="TZ-05_Muleba_gamb_2015",
cohort_size=30,
region="3L:15,000,000-41,000,000",
site_mask="gamb_colu_arab",
site_class="CDS_DEG_4",
)
stats
cohort TZ-05_Muleba_gamb_2015
theta_pi 0.021167
theta_pi_estimate 0.021173
theta_pi_bias -0.000005
theta_pi_std_err 0.0003
theta_pi_ci_err 0.001176
theta_pi_ci_low 0.020585
theta_pi_ci_upp 0.021761
theta_w 0.038855
theta_w_estimate 0.038854
theta_w_bias 0.000001
theta_w_std_err 0.000436
theta_w_ci_err 0.001709
theta_w_ci_low 0.037999
theta_w_ci_upp 0.039708
tajima_d -1.635894
tajima_d_estimate -1.635436
tajima_d_bias -0.000459
tajima_d_std_err 0.015294
tajima_d_ci_err 0.05995
tajima_d_ci_low -1.665411
tajima_d_ci_upp -1.605461
taxon gambiae
year 2015
month [3, 6]
country Tanzania
admin1_iso TZ-05
admin1_name Kagera
admin2_name Muleba
longitude 31.621
latitude -1.962
dtype: object
This function returns a pandas Series with values of the summary statistics for our requested cohort. The field theta_pi
contains the value of nucleotide diversity. The field theta_w
contains the value of Watterson’s estimator. The field tajima_d
contains the value of Tajima’s D.
A 95% confidence interval has also been estimated for each of these statistics using a block jackknife procedure. The lower and upper ends of these confidence intervals are given by the fields with the suffixes _ci_low
and _ci_upp
respectively. The fields with suffix _ci_err
give the size of the confidence interval, and is useful for plotting error bars.
It is important to note that, in addition to the cohort
parameter, we also provided values for the parameters cohort_size
, region
, site_mask
and site_class
. These parameters are important and can make a big difference to the statistic values obtained.
To illustrate the impact of changing some of these parameters, let’s recompute the statistics but providing cohort_size
of 10 instead of 30.
stats = ag3.cohort_diversity_stats(
cohort="TZ-05_Muleba_gamb_2015",
cohort_size=10,
region="3L:15,000,000-41,000,000",
site_mask="gamb_colu_arab",
site_class="CDS_DEG_4",
)
stats
cohort TZ-05_Muleba_gamb_2015
theta_pi 0.021242
theta_pi_estimate 0.021247
theta_pi_bias -0.000004
theta_pi_std_err 0.000302
theta_pi_ci_err 0.001182
theta_pi_ci_low 0.020655
theta_pi_ci_upp 0.021838
theta_w 0.03035
theta_w_estimate 0.030352
theta_w_bias -0.000003
theta_w_std_err 0.000377
theta_w_ci_err 0.001478
theta_w_ci_low 0.029613
theta_w_ci_upp 0.031092
tajima_d -1.259672
tajima_d_estimate -1.259398
tajima_d_bias -0.000274
tajima_d_std_err 0.016672
tajima_d_ci_err 0.065353
tajima_d_ci_low -1.292075
tajima_d_ci_upp -1.226721
taxon gambiae
year 2015
month [3, 6]
country Tanzania
admin1_iso TZ-05
admin1_name Kagera
admin2_name Muleba
longitude 31.621
latitude -1.962
dtype: object
Note that the value of theta_pi
hasn’t changed much, but the values of theta_w
and tajima_d
are considerably different. This is because, for many cohorts, there is a strong dependency between these statistics and the cohort size, i.e., the number of samples used to calculate the statistics. Because of this, if we want to make a fair comparison between different cohorts, it is important to use the same cohort size in all cases. If we have a cohort with more samples than our chosen cohort size, the cohort will be randomly downsampled.
Another parameter that makes a big difference is the site_class
parameter, which defines the type of genomic site used to calculate the diversity statistics. Above we used “CDS_DEG_4” which stands for 4-fold degenerate coding sites, i.e., sites within gene coding sequences where none of the possible nucleotide changes will affect the resulting amino acid. These sites are typically under relatively weak selective constraint, because they have little to no impact on the phenotype of the organism.
For illustration, let’s contrast this with “CDS_DEG_0” which stands for non-degenerate coding sites, i.e., sites where any nucleotide change will change the amino acid.
stats = ag3.cohort_diversity_stats(
cohort="TZ-05_Muleba_gamb_2015",
cohort_size=10,
region="3L:15,000,000-41,000,000",
site_mask="gamb_colu_arab",
site_class="CDS_DEG_0",
)
stats
cohort TZ-05_Muleba_gamb_2015
theta_pi 0.002492
theta_pi_estimate 0.002492
theta_pi_bias -0.0
theta_pi_std_err 0.000091
theta_pi_ci_err 0.000356
theta_pi_ci_low 0.002314
theta_pi_ci_upp 0.00267
theta_w 0.004047
theta_w_estimate 0.004047
theta_w_bias -0.000001
theta_w_std_err 0.00013
theta_w_ci_err 0.000509
theta_w_ci_low 0.003793
theta_w_ci_upp 0.004302
tajima_d -1.612858
tajima_d_estimate -1.612861
tajima_d_bias 0.000004
tajima_d_std_err 0.026732
tajima_d_ci_err 0.104788
tajima_d_ci_low -1.665255
tajima_d_ci_upp -1.560467
taxon gambiae
year 2015
month [3, 6]
country Tanzania
admin1_iso TZ-05
admin1_name Kagera
admin2_name Muleba
longitude 31.621
latitude -1.962
dtype: object
Note that the value of theta_pi
has now reduced by approximately an order of magnitude. This is because non-degenerate coding sites are subject to much stronger purifying selection than 4-fold degenerate sites.
In this module we are most interested in using comparative analysis of diversity statistics to learn more about demographic differences and changes within our sampled mosquito populations. Hence we will prefer to use sites that are less affected by selection, and will use 4-fold degenerate sites (“CDS_DEG_4”) for the remainder of the module.
A final note, above we chose to provide the parameter region="3L:15,000,000-41,000,000"
, which defines the genome region we will use for diversity calculations. We typically prefer to use chromosome arm 3L because there are no large polymorphic inversions in any of the species we are studying. We also exclude the pericentromeric and telomeric regions because these have reduced diversity due to reduced recombination and hence selection at linked sites.
Comparing diversity between cohorts#
Now we have seen how to compute statistics for a single cohort, let’s run the same computation but for multiple cohorts, and see how to compare the results.
Example: Tanzania#
Sticking with our focus on Tanzania, let’s remind ourselves what cohorts and sample sizes are available if we group samples by the “cohort_admin2_year” field, which will group by taxon, district and year.
df_samples_tz.groupby("cohort_admin2_year").size()
cohort_admin2_year
TZ-05_Muleba_arab_2015 137
TZ-05_Muleba_gamb_2015 32
TZ-05_Muleba_gcx3_2015 1
TZ-13_Tarime_arab_2012 47
TZ-25_Muheza_arab_2013 1
TZ-25_Muheza_gamb_2013 32
TZ-25_Muheza_gcx3_2013 10
TZ-26_Moshi_arab_2012 40
dtype: int64
If we choose a cohort size of 10 for our comparative analysis, we will be able to analyse 6 cohorts.
To compute diversity for multiple cohorts at the same time, we have provided a diversity_stats()
function.
ag3.diversity_stats?
Signature:
ag3.diversity_stats(
cohorts: Union[str, Mapping[str, str]],
cohort_size: int,
region: Union[str, malariagen_data.util.Region, Mapping, List[Union[str, malariagen_data.util.Region, Mapping]], Tuple[Union[str, malariagen_data.util.Region, Mapping], ...]],
site_mask: str = 'default',
site_class: Optional[str] = None,
sample_query: Optional[str] = None,
sample_sets: Union[Sequence[str], str, NoneType] = None,
random_seed: int = 42,
n_jack: int = 200,
confidence_level: float = 0.95,
) -> pandas.core.frame.DataFrame
Docstring:
Compute genetic diversity summary statistics for multiple cohorts.
Parameters
----------
cohorts : str or Mapping[str, str]
Either a string giving the name of a predefined cohort set (e.g.,
"admin1_month") or a dict mapping custom cohort labels to sample
queries.
cohort_size : int
Randomly down-sample to this value if the number of samples in the
cohort is greater. Raise an error if the number of samples is less
than this value.
region : str or Region or Mapping or list of str or Region or Mapping or tuple of str or Region or Mapping
Region of the reference genome. Can be a contig name, region string
(formatted like "{contig}:{start}-{end}"), or identifier of a genome
feature such as a gene or transcript. Can also be a sequence (e.g.,
list) of regions.
site_mask : str, optional, default: 'default'
Which site filters mask to apply. See the `site_mask_ids` property for
available values.
site_class : str or None, optional
Select sites belonging to one of the following classes: CDS_DEG_4,
(4-fold degenerate coding sites), CDS_DEG_2_SIMPLE (2-fold simple
degenerate coding sites), CDS_DEG_0 (non-degenerate coding sites),
INTRON_SHORT (introns shorter than 100 bp), INTRON_LONG (introns
longer than 200 bp), INTRON_SPLICE_5PRIME (intron within 2 bp of 5'
splice site), INTRON_SPLICE_3PRIME (intron within 2 bp of 3' splice
site), UTR_5PRIME (5' untranslated region), UTR_3PRIME (3'
untranslated region), INTERGENIC (intergenic, more than 10 kbp from a
gene).
sample_query : str or None, optional
A pandas query string to be evaluated against the sample metadata, to
select samples to be included in the returned data.
sample_sets : sequence of str or str or None, optional
List of sample sets and/or releases. Can also be a single sample set
or release.
random_seed : int, optional, default: 42
Random seed used for reproducible down-sampling.
n_jack : int, optional, default: 200
Number of blocks to divide the data into for the block jackknife
estimation of confidence intervals. N.B., larger is not necessarily
better.
confidence_level : float, optional, default: 0.95
Confidence level to use for confidence interval calculation. E.g.,
0.95 means 95% confidence interval.
Returns
-------
DataFrame
A DataFrame where each row provides summary statistics and their
confidence intervals for a single cohort.
File: /home/conda/developer/2835ffd46e5535e81e9672d0828c143d4f1444491ac44f0aeddd90156919e446-20240426-081151-481742-64-training-nb-maintenance-mgen-9.0.0/lib/python3.10/site-packages/malariagen_data/anopheles.py
Type: method
This function has many of the same parameters we discussed above, except now we provide a sample_query
parameter to restrict the overall analysis to samples in Tanzania, and we provide a cohorts
parameter to define how these samples will be grouped into cohorts.
df_stats_tz_admin2_year = ag3.diversity_stats(
sample_query="release == '3.0' and country == 'Tanzania'",
cohorts="admin2_year",
cohort_size=10,
region="3L:15,000,000-41,000,000",
site_mask="gamb_colu_arab",
site_class="CDS_DEG_4",
)
df_stats_tz_admin2_year
Cohort (TZ-05_Muleba_gcx3_2015) has insufficient samples (1) for requested cohort size (10), dropping.
Cohort (TZ-25_Muheza_arab_2013) has insufficient samples (1) for requested cohort size (10), dropping.
cohort | theta_pi | theta_pi_estimate | theta_pi_bias | theta_pi_std_err | theta_pi_ci_err | theta_pi_ci_low | theta_pi_ci_upp | theta_w | theta_w_estimate | ... | tajima_d_ci_upp | taxon | year | month | country | admin1_iso | admin1_name | admin2_name | longitude | latitude | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | TZ-05_Muleba_arab_2015 | 0.012306 | 0.012306 | 6.185033e-07 | 0.000224 | 0.000877 | 0.011867 | 0.012744 | 0.014789 | 0.014790 | ... | -0.658653 | arabiensis | 2015 | [3, 4, 6] | Tanzania | TZ-05 | Kagera | Muleba | 31.621 | -1.962 |
1 | TZ-05_Muleba_gamb_2015 | 0.021242 | 0.021247 | -4.337053e-06 | 0.000302 | 0.001182 | 0.020655 | 0.021838 | 0.030350 | 0.030352 | ... | -1.226721 | gambiae | 2015 | [3, 6] | Tanzania | TZ-05 | Kagera | Muleba | 31.621 | -1.962 |
2 | TZ-13_Tarime_arab_2012 | 0.012435 | 0.012435 | -3.120599e-07 | 0.000220 | 0.000861 | 0.012005 | 0.012866 | 0.015058 | 0.015059 | ... | -0.681130 | arabiensis | 2012 | 8 | Tanzania | TZ-13 | Mara | Tarime | 34.199 | -1.431 |
3 | TZ-25_Muheza_gamb_2013 | 0.017705 | 0.017705 | 4.563558e-07 | 0.000290 | 0.001136 | 0.017137 | 0.018273 | 0.018939 | 0.018935 | ... | -0.218096 | gambiae | 2013 | 1 | Tanzania | TZ-25 | Tanga | Muheza | 38.948 | -4.940 |
4 | TZ-25_Muheza_gcx3_2013 | 0.013395 | 0.013398 | -2.958723e-06 | 0.000342 | 0.001340 | 0.012728 | 0.014068 | 0.012917 | 0.012918 | ... | 0.257321 | gcx3 | 2013 | 1 | Tanzania | TZ-25 | Tanga | Muheza | 38.948 | -4.940 |
5 | TZ-26_Moshi_arab_2012 | 0.012374 | 0.012373 | 1.310417e-06 | 0.000219 | 0.000857 | 0.011945 | 0.012801 | 0.014669 | 0.014667 | ... | -0.604319 | arabiensis | 2012 | 8 | Tanzania | TZ-26 | Manyara | Moshi | 37.308 | -3.482 |
6 rows × 31 columns
This function returns a pandas DataFrame, where each row contains data about a cohort, and the columns provide the different statistic values and their confidence intervals. Let’s take a look at all the available columns:
df_stats_tz_admin2_year.columns
Index(['cohort', 'theta_pi', 'theta_pi_estimate', 'theta_pi_bias',
'theta_pi_std_err', 'theta_pi_ci_err', 'theta_pi_ci_low',
'theta_pi_ci_upp', 'theta_w', 'theta_w_estimate', 'theta_w_bias',
'theta_w_std_err', 'theta_w_ci_err', 'theta_w_ci_low', 'theta_w_ci_upp',
'tajima_d', 'tajima_d_estimate', 'tajima_d_bias', 'tajima_d_std_err',
'tajima_d_ci_err', 'tajima_d_ci_low', 'tajima_d_ci_upp', 'taxon',
'year', 'month', 'country', 'admin1_iso', 'admin1_name', 'admin2_name',
'longitude', 'latitude'],
dtype='object')
A simple way to visualise these statistics and compare cohorts is to make a bar plot. Let’s do this for nucleotide diversity using plotly express.
px.bar(
data_frame=df_stats_tz_admin2_year,
x="cohort",
y="theta_pi_estimate",
error_y="theta_pi_ci_err",
color="taxon",
height=450,
width=500,
)
Above we chose to colour the bars by taxon, which helps to see some of the patterns in the data. For example, we have three An. arabiensis cohorts, all of which have nearly identical nucleotide diversity. We also have two An. gambiae cohorts, both of which have higher nucleotide diversity. There is also a significant difference between the two An. gambiae cohorts from the east and west of the country. Finally, we have one cohort representing the putative cryptic taxon which we have provisionally labelled gcx3 (see workshop 4, module 4).
We also would like to make similar plots to visualise Watterson’s estimator and Tajima’s D. Because we’ll want to make these plots often, let’s define a simple function to make all the plots we need.
def plot_diversity_stats(
df_stats,
color=None,
height=450,
template="plotly_white"
):
# set up common plotting parameters
hover_name = "cohort"
hover_data = [
"taxon",
"country",
"admin1_iso",
"admin1_name",
"admin2_name",
"year",
"month",
]
labels = {
'theta_pi_estimate': r'$\widehat{\theta}_{\pi}$',
'theta_w_estimate': r'$\widehat{\theta}_{w}$',
'tajima_d_estimate': r'$D$',
'cohort': "Cohort",
'taxon': 'Taxon',
'country': "Country",
}
category_orders = {
"taxon": ["arabiensis", "gambiae", "gcx3", "coluzzii", "gcx1", "gcx2"],
}
width = 300 + 30 * len(df_stats)
# nucleotide diversity bar plot
fig = px.bar(
data_frame=df_stats,
x="cohort",
y="theta_pi_estimate",
error_y="theta_pi_ci_err",
title="Nucleotide diversity",
color=color,
height=height,
width=width,
hover_name=hover_name,
hover_data=hover_data,
labels=labels,
template=template,
category_orders=category_orders,
)
fig.show()
# watterson estimator bar plot
fig = px.bar(
data_frame=df_stats,
x="cohort",
y="theta_w_estimate",
error_y="theta_w_ci_err",
title="Watterson estimator",
color=color,
height=height,
width=width,
hover_name=hover_name,
hover_data=hover_data,
labels=labels,
template=template,
category_orders=category_orders,
)
fig.show()
# tajima's d bar plot
fig = px.bar(
data_frame=df_stats,
x="cohort",
y="tajima_d_estimate",
error_y="tajima_d_ci_err",
title="Tajima's D",
color=color,
height=height,
width=width,
hover_name=hover_name,
hover_data=hover_data,
labels=labels,
template=template,
category_orders=category_orders,
)
fig.show()
# scatter plot comparing diversity estimators
fig = px.scatter(
data_frame=df_stats,
x="theta_pi_estimate",
y="theta_w_estimate",
error_x="theta_pi_ci_err",
error_y="theta_w_ci_err",
title="Diversity estimators",
color=color,
width=500,
height=500,
hover_name=hover_name,
hover_data=hover_data,
labels=labels,
template=template,
category_orders=category_orders,
)
fig.show()
Let’s now make these plots for our Tanzanian cohorts.
plot_diversity_stats(df_stats_tz_admin2_year, color="taxon")
From the above we can see the following patterns:
All An. arabiensis cohorts have similar diversity, regardless of sampling location.
An. gambiae cohorts have different diversity between the west and east of the country.
The gcx3 cohort is quite different from all other cohorts, and is the only cohort with a positive value for Tajima’s D.
Exercise 1 (English)
Uncomment and run the code cells below to repeat the above analysis, but using cohorts grouped by district (administrative level 2) and month.
Do the results change any of our observations about the similarities and differences between species and geographical locations?
Is there any evidence for differences in diversity between different months of sampling?
Exercice 1 (Français)
Décommentez et exécutez les cellules de code ci-dessous pour répéter l’analyse ci-dessus, mais en utilisant des cohortes regroupées par district (niveau administratif 2) et mois.
Les résultats modifient-ils certaines de nos observations sur les similitudes et les différences entre les espèces et les emplacements géographiques?
Existe-t-il des preuves de différences de diversité entre les différents mois d’échantillonnage?
# df_samples_tz.groupby("cohort_admin2_month").size()
# df_stats_tz_admin2_month = ag3.diversity_stats(
# sample_query="release == '3.0' and country == 'Tanzania'",
# cohorts="admin2_month",
# cohort_size=10,
# region="3L:15,000,000-41,000,000",
# site_mask="gamb_colu_arab",
# site_class="CDS_DEG_4",
# )
# plot_diversity_stats(df_stats_tz_admin2_month, color="taxon")
Example: East Africa#
Let’s now expand our analysis to include samples from neighbouring countries, to investigate a broader geographical range.
Firstly, let’s plot a map to remind ourselves of the available sampling locations.
ag3.plot_samples_interactive_map(sample_query="release == '3.0'", basemap='worldimagery')
From the map we can see that there are two sampling locations in Uganda and one location in Kenya. Let’s add these countries and investigate sample sizes.
ag3.count_samples(
sample_query="release == '3.0' and country in ['Tanzania', 'Kenya', 'Uganda']"
)
taxon | arabiensis | gambiae | gcx3 | unassigned | ||||
---|---|---|---|---|---|---|---|---|
country | admin1_iso | admin1_name | admin2_name | year | ||||
Kenya | KE-14 | Kilifi | Kilifi North | 2000 | 0 | 19 | 0 | 0 |
2007 | 3 | 0 | 0 | 0 | ||||
2012 | 10 | 0 | 54 | 0 | ||||
Tanzania | TZ-05 | Kagera | Muleba | 2015 | 137 | 32 | 1 | 0 |
TZ-13 | Mara | Tarime | 2012 | 47 | 0 | 0 | 0 | |
TZ-25 | Tanga | Muheza | 2013 | 1 | 32 | 10 | 0 | |
TZ-26 | Manyara | Moshi | 2012 | 40 | 0 | 0 | 0 | |
Uganda | UG-E | Eastern Region | Tororo | 2012 | 81 | 112 | 0 | 1 |
UG-W | Western Region | Kanungu | 2012 | 1 | 95 | 0 | 0 |
Let’s also confirm how many cohorts would be available if we continued to use the “cohort_admin2_year” field to group by taxon, district and year.
ag3.sample_metadata(
sample_query="release == '3.0' and country in ['Tanzania', 'Kenya', 'Uganda']"
).groupby("cohort_admin2_year").size()
cohort_admin2_year
KE-14_Kilifi-North_arab_2007 3
KE-14_Kilifi-North_arab_2012 10
KE-14_Kilifi-North_gamb_2000 19
KE-14_Kilifi-North_gcx3_2012 54
TZ-05_Muleba_arab_2015 137
TZ-05_Muleba_gamb_2015 32
TZ-05_Muleba_gcx3_2015 1
TZ-13_Tarime_arab_2012 47
TZ-25_Muheza_arab_2013 1
TZ-25_Muheza_gamb_2013 32
TZ-25_Muheza_gcx3_2013 10
TZ-26_Moshi_arab_2012 40
UG-E_Tororo_arab_2012 81
UG-E_Tororo_gamb_2012 112
UG-W_Kanungu_arab_2012 1
UG-W_Kanungu_gamb_2012 95
dtype: int64
Here, sticking with a cohort size of 10 will allow us to include 3 cohorts from Kenya and 3 cohorts from Uganda.
Let’s now compute and plot diversity stats for all cohorts.
df_stats_tz_ke_ug_admin2_year = ag3.diversity_stats(
sample_query="release == '3.0' and country in ['Tanzania', 'Kenya', 'Uganda']",
cohorts="admin2_year",
cohort_size=10,
region="3L:15,000,000-41,000,000",
site_mask="gamb_colu_arab",
site_class="CDS_DEG_4",
)
df_stats_tz_ke_ug_admin2_year
Cohort (KE-14_Kilifi-North_arab_2007) has insufficient samples (3) for requested cohort size (10), dropping.
Cohort (TZ-05_Muleba_gcx3_2015) has insufficient samples (1) for requested cohort size (10), dropping.
Cohort (TZ-25_Muheza_arab_2013) has insufficient samples (1) for requested cohort size (10), dropping.
Cohort (UG-W_Kanungu_arab_2012) has insufficient samples (1) for requested cohort size (10), dropping.
cohort | theta_pi | theta_pi_estimate | theta_pi_bias | theta_pi_std_err | theta_pi_ci_err | theta_pi_ci_low | theta_pi_ci_upp | theta_w | theta_w_estimate | ... | tajima_d_ci_upp | taxon | year | month | country | admin1_iso | admin1_name | admin2_name | longitude | latitude | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | KE-14_Kilifi-North_arab_2012 | 0.012515 | 0.012513 | 1.739748e-06 | 0.000221 | 0.000868 | 0.012080 | 0.012947 | 0.015013 | 0.015010 | ... | -0.649706 | arabiensis | 2012 | -1 | Kenya | KE-14 | Kilifi | Kilifi North | 39.909 | -3.511 |
1 | KE-14_Kilifi-North_gamb_2000 | 0.017751 | 0.017754 | -3.812392e-06 | 0.000281 | 0.001103 | 0.017203 | 0.018306 | 0.018947 | 0.018950 | ... | -0.213639 | gambiae | 2000 | -1 | Kenya | KE-14 | Kilifi | Kilifi North | 39.909 | -3.511 |
2 | KE-14_Kilifi-North_gcx3_2012 | 0.013473 | 0.013474 | -1.341458e-06 | 0.000276 | 0.001082 | 0.012933 | 0.014015 | 0.011567 | 0.011568 | ... | 0.814804 | gcx3 | 2012 | -1 | Kenya | KE-14 | Kilifi | Kilifi North | 39.909 | -3.511 |
3 | TZ-05_Muleba_arab_2015 | 0.012306 | 0.012306 | 6.185033e-07 | 0.000224 | 0.000877 | 0.011867 | 0.012744 | 0.014789 | 0.014790 | ... | -0.658653 | arabiensis | 2015 | [3, 4, 6] | Tanzania | TZ-05 | Kagera | Muleba | 31.621 | -1.962 |
4 | TZ-05_Muleba_gamb_2015 | 0.021242 | 0.021247 | -4.337053e-06 | 0.000302 | 0.001182 | 0.020655 | 0.021838 | 0.030350 | 0.030352 | ... | -1.226721 | gambiae | 2015 | [3, 6] | Tanzania | TZ-05 | Kagera | Muleba | 31.621 | -1.962 |
5 | TZ-13_Tarime_arab_2012 | 0.012435 | 0.012435 | -3.120599e-07 | 0.000220 | 0.000861 | 0.012005 | 0.012866 | 0.015058 | 0.015059 | ... | -0.681130 | arabiensis | 2012 | 8 | Tanzania | TZ-13 | Mara | Tarime | 34.199 | -1.431 |
6 | TZ-25_Muheza_gamb_2013 | 0.017705 | 0.017705 | 4.563558e-07 | 0.000290 | 0.001136 | 0.017137 | 0.018273 | 0.018939 | 0.018935 | ... | -0.218096 | gambiae | 2013 | 1 | Tanzania | TZ-25 | Tanga | Muheza | 38.948 | -4.940 |
7 | TZ-25_Muheza_gcx3_2013 | 0.013395 | 0.013398 | -2.958723e-06 | 0.000342 | 0.001340 | 0.012728 | 0.014068 | 0.012917 | 0.012918 | ... | 0.257321 | gcx3 | 2013 | 1 | Tanzania | TZ-25 | Tanga | Muheza | 38.948 | -4.940 |
8 | TZ-26_Moshi_arab_2012 | 0.012374 | 0.012373 | 1.310417e-06 | 0.000219 | 0.000857 | 0.011945 | 0.012801 | 0.014669 | 0.014667 | ... | -0.604319 | arabiensis | 2012 | 8 | Tanzania | TZ-26 | Manyara | Moshi | 37.308 | -3.482 |
9 | UG-E_Tororo_arab_2012 | 0.012560 | 0.012560 | 8.203444e-07 | 0.000224 | 0.000876 | 0.012121 | 0.012998 | 0.015198 | 0.015195 | ... | -0.679454 | arabiensis | 2012 | 10 | Uganda | UG-E | Eastern Region | Tororo | 34.026 | 0.770 |
10 | UG-E_Tororo_gamb_2012 | 0.021693 | 0.021698 | -4.572455e-06 | 0.000317 | 0.001244 | 0.021076 | 0.022320 | 0.031458 | 0.031460 | ... | -1.271359 | gambiae | 2012 | 10 | Uganda | UG-E | Eastern Region | Tororo | 34.026 | 0.770 |
11 | UG-W_Kanungu_gamb_2012 | 0.021464 | 0.021469 | -4.432877e-06 | 0.000310 | 0.001214 | 0.020862 | 0.022076 | 0.030680 | 0.030685 | ... | -1.225296 | gambiae | 2012 | [10, 11] | Uganda | UG-W | Western Region | Kanungu | 29.701 | -0.751 |
12 rows × 31 columns
plot_diversity_stats(df_stats_tz_ke_ug_admin2_year, color="taxon")
Some observations about these results:
All An. arabiensis cohorts show similar diversity. This indicates that all An. arabiensis populations from across this region share very similar demographic histories.
Diversity in An. gambiae from western Tanzania is very similar to both An. gambiae cohorts from Uganda. Conversely, diversity in An. gambiae from eastern Tanzania is very similar to An. gambiae from coastal Kenya. This supports a strong population division within An. gambiae, with different demographic histories.
Diversity is very similar between the two gcx3 cohorts from coastal Tanzania and Kenya. This indicates that the two gcx3 cohorts are sampled from populations with similar demographic histories. In particular, both cohorts have positive values for Tajima’s D, which is consistent with a reduction in population size. We cannot infer the timing or strength of this reduction from Tajima’s D alone, but it is interesting to note that these are the only two cohorts with evidence for population reduction, whereas all An. gambiae and An. arabiensis cohorts have negative Tajima’s D, which is consistent with historical population expansion. This certainly adds evidence for an important distinction between the gcx3 taxon and other mosquito taxa in the region.
Exercise 2 (English)
Rerun the diversity analysis for East Africa in release 3.0, adding Mozambique, Malawi and Mayotte. Discuss the patterns you observe with a colleague.
Exercice 2 (Français)
Réexécutez l’analyse de la diversité pour l’Afrique de l’Est dans la version 3.0, en ajoutant le Mozambique, le Malawi et Mayotte. Discutez des tendances que vous observez avec un collègue.
Example: Burkina Faso#
Finally, let’s look at an example where we have data from the same location from multiple years, to illustrate how we might use diversity to monitor populations for signs of recent demographic changes.
In Burkina Faso we have data from the Houet district from 2012 and 2014, for both An. gambiae and An. coluzzii.
ag3.count_samples(
sample_query="release == '3.0' and country == 'Burkina Faso'"
)
taxon | arabiensis | coluzzii | gambiae | ||||
---|---|---|---|---|---|---|---|
country | admin1_iso | admin1_name | admin2_name | year | |||
Burkina Faso | BF-07 | Centre-Sud | Bazega | 2004 | 0 | 0 | 13 |
BF-09 | Hauts-Bassins | Houet | 2012 | 0 | 82 | 99 | |
2014 | 3 | 53 | 46 |
Exercise 3 (English)
Analyse diversity in Burkina Faso.
Do you observe any differences between An. gambiae and An. coluzzii?
Do you observe any differences between 2012 and 2014?
Exercice 3 (Français)
Analyser la diversité au Burkina Faso.
Observez-vous des différences entre An. gambiae et An. coluzzii?
Observez-vous des différences entre 2012 et 2014 ?
Well done!#
Well done for completing this module on genetic diversity summary statistics!
Exercises#
Please now launch this notebook in colab, clear the outputs, copy it to your drive and run it from the top, complete the exercises you find along the way.
Challenge (English)#
Choose any country or geographical region and perform comparative analysis of genetic diversity. Share your findings and conclusions with colleagues.
Défi (Français)#
Choisissez n’importe quel pays ou région géographique et effectuez une analyse comparative de la diversité génétique. Partagez vos découvertes et conclusions avec vos collègues.