banner

Workshop 5 - Training course in data analysis for genomic surveillance of African malaria vectors


Module 3 - Genetic diversity summary statistics#

Theme: Analysis

This module covers how to compute genetic diversity summary statistics, which are a convenient and informative way to quantify and compare diversity in different mosquito populations.

Learning objectives#

At the end of this module you will be able to:

  • Select cohorts for comparative analysis of genetic diversity

  • Compute nucleotide diversity, Watterson’s estimator and Tajima’s D

  • Plot diversity statistics for multiple cohorts

  • Interpret the results

Lecture#

English#

Français#

Please note that the code in the cells below might differ from that shown in the video. This can happen because Python packages and their dependencies change due to updates, necessitating tweaks to the code.

Setup#

Install and import the packages we’ll need.

!pip install -q --no-warn-conflicts malariagen_data
import malariagen_data
import plotly.express as px

Here we will compute summary statistics using data from millions of SNPs. This can take some time, so we will use Google Drive to save results and avoid having to rerun computations. To do this we first need to mount Google Drive to our colab VM:

try:
    from google.colab import drive
    drive.mount("drive")
except ImportError:
    pass

Configure access to MalariaGEN data in Google Cloud.

ag3 = malariagen_data.Ag3(
    results_cache="drive/MyDrive/malariagen_data_cache"
)
ag3
MalariaGEN Ag3 API client
Please note that data are subject to terms of use, for more information see the MalariaGEN website or contact data@malariagen.net. See also the Ag3 API docs.
Storage URL gs://vo_agam_release/
Data releases available 3.0
Results cache /home/ahernank/github/anopheles-genomic-surveillance.github.io/docs/workshop-5/drive/MyDrive/malariagen_data_cache
Cohorts analysis 20230516
AIM analysis 20220528
Site filters analysis dt_20200416
Software version malariagen_data 7.11.0
Client location unknown

Selecting cohorts#

In this module we will be comparing genetic diversity between mosquitoes representing different geographical locations, time points and species. Before we begin any comparative analysis of genetic diversity, we first need to decide how to group the mosquitoes we’ve sequenced into cohorts. Each cohort we define will represent a sample of mosquitoes from a given time, place and species.

Let’s first browse the different sample locations for which we have data. In workshop 1, module 2 we looked at how to create an interactive map with sampling locations using ipyleaflet. Because this is something we want to do often, we’ve added a convenience function plot_samples_interactive_map() to do this for you. This will plot a map showing sampling locations for all available samples.

ag3.plot_samples_interactive_map()

For the purposes of this module we will begin by focusing on Tanzania. To get more information on the numbers of samples available from Tanzania, we can use pandas to create a pivot table as we also illustrated in workshop 1, module 2. Again, because we do this often, we’ve created a convenience function count_samples().

ag3.count_samples(
    sample_query="country == 'Tanzania'"
)
taxon arabiensis gambiae gcx3
country admin1_iso admin1_name admin2_name year
Tanzania TZ-05 Kagera Muleba 2015 137 32 1
TZ-13 Mara Tarime 2012 47 0 0
TZ-25 Tanga Muheza 2013 1 32 10
TZ-26 Manyara Moshi 2012 40 0 0

When grouping samples by geographical location, we sometimes can choose whether to aggregate data from nearby locations, or whether to keep them separate. In general, we want the finest spatial granularity possible, but we also need enough samples to make reasonable estimates of genetic diversity, and so we might have to make a trade-off between granularity and sample size.

Similarly, when grouping samples by date of collection, we can choose whether to aggregate data from the same month, season or year. Again, in general we would prefer the finest temporal granularity possible, but might also need sufficient sample size.

For computing diversity summary statistics, there is no hard-and-fast rule about how many samples are needed, but a sample size of 10 is probably the bare minimum we would consider, and we would prefer more if possible.

To simplify grouping of mosquitoes into cohorts, we have predefined some metadata columns which provide cohort labels at different levels of granularity. Let’s take a peek at the sample metadata:

df_samples_tz = ag3.sample_metadata(
    sample_query="country == 'Tanzania'"
)
df_samples_tz.head()
sample_id partner_sample_id contributor country location year month latitude longitude sex_call ... admin1_name admin1_iso admin2_name taxon cohort_admin1_year cohort_admin1_month cohort_admin1_quarter cohort_admin2_year cohort_admin2_month cohort_admin2_quarter
0 BL0046-C Plate_C_H6 Bilali Kabula Tanzania Muleba 2015 6 -1.962 31.621 F ... Kagera TZ-05 Muleba gcx3 TZ-05_gcx3_2015 TZ-05_gcx3_2015_06 TZ-05_gcx3_2015_Q2 TZ-05_Muleba_gcx3_2015 TZ-05_Muleba_gcx3_2015_06 TZ-05_Muleba_gcx3_2015_Q2
1 BL0047-C Plate_F_D4 Bilali Kabula Tanzania Muleba 2015 3 -1.962 31.621 F ... Kagera TZ-05 Muleba arabiensis TZ-05_arab_2015 TZ-05_arab_2015_03 TZ-05_arab_2015_Q1 TZ-05_Muleba_arab_2015 TZ-05_Muleba_arab_2015_03 TZ-05_Muleba_arab_2015_Q1
2 BL0048-C Plate_F_E4 Bilali Kabula Tanzania Muleba 2015 3 -1.962 31.621 F ... Kagera TZ-05 Muleba arabiensis TZ-05_arab_2015 TZ-05_arab_2015_03 TZ-05_arab_2015_Q1 TZ-05_Muleba_arab_2015 TZ-05_Muleba_arab_2015_03 TZ-05_Muleba_arab_2015_Q1
3 BL0049-C Plate_F_F4 Bilali Kabula Tanzania Muleba 2015 3 -1.962 31.621 F ... Kagera TZ-05 Muleba arabiensis TZ-05_arab_2015 TZ-05_arab_2015_03 TZ-05_arab_2015_Q1 TZ-05_Muleba_arab_2015 TZ-05_Muleba_arab_2015_03 TZ-05_Muleba_arab_2015_Q1
4 BL0050-C Plate_F_G4 Bilali Kabula Tanzania Muleba 2015 3 -1.962 31.621 F ... Kagera TZ-05 Muleba arabiensis TZ-05_arab_2015 TZ-05_arab_2015_03 TZ-05_arab_2015_Q1 TZ-05_Muleba_arab_2015 TZ-05_Muleba_arab_2015_03 TZ-05_Muleba_arab_2015_Q1

5 rows × 30 columns

If you scroll to the right, you’ll see columns named “cohort_admin1_year”, “cohort_admin1_month”, “cohort_admin2_year” and “cohort_admin2_month”. Each of these columns provides a different set of cohort labels. All are grouped by species (taxon).

For this analysis, let’s use the “cohort_admin2_year” column to group samples by administrative level 2 divisions (usually called districts) and year. Here are the cohort labels for the Tanzania samples and the sample sizes.

df_samples_tz.groupby("cohort_admin2_year").size()
cohort_admin2_year
TZ-05_Muleba_arab_2015    137
TZ-05_Muleba_gamb_2015     32
TZ-05_Muleba_gcx3_2015      1
TZ-13_Tarime_arab_2012     47
TZ-25_Muheza_arab_2013      1
TZ-25_Muheza_gamb_2013     32
TZ-25_Muheza_gcx3_2013     10
TZ-26_Moshi_arab_2012      40
dtype: int64

We can see there are two cohorts with only 1 sample - these will get dropped from our analysis. Otherwise, all cohorts have at least 10 samples.

Recap: genetic diversity summary statistics#

We are going to compute the following three summary statistics for each cohort:

  • Nucleotide diversity (theta_pi) - This quantifies the average number of differences (mismatches) between any pair of DNA sequences in a cohort. The higher this statistic, the more differences you will expect to see between any pair of randomly sampled DNA sequences.

  • Watterson’s estimator (theta_w) - This quantifies the number of segregating sites within a cohort, scaled by a constant. The higher this statistic, the more sites are found to be segregating (i.e., polymorphic).

  • Tajima’s D (tajima_d) - This quantifies the difference between Watterson’s estimator and nucleotide diversity, and scales the result by a constant. This statistic is useful because it focuses not on the magnitude but on the architecture of diversity in a cohort, compared to a neutrally evolving population of constant size. Negative values imply an excess of rare alleles, and positive values imply a deficit of rare alleles, relative to a population at equilibrium.

Computing diversity summary statistics#

Let’s begin by computing these summary statistics in a single cohort. For this we can use the function cohort_diversity_stats(). Let’s look at the docstring.

ag3.cohort_diversity_stats?

Let’s now compute the statistics for one of our Tanzanian cohorts, TZ-05_Muleba_gamb_2015.

stats = ag3.cohort_diversity_stats(
    cohort="TZ-05_Muleba_gamb_2015",
    cohort_size=30,
    region="3L:15,000,000-41,000,000",
    site_mask="gamb_colu_arab",
    site_class="CDS_DEG_4",
)
stats
cohort               TZ-05_Muleba_gamb_2015
theta_pi                           0.021167
theta_pi_estimate                  0.021173
theta_pi_bias                     -0.000005
theta_pi_std_err                     0.0003
theta_pi_ci_err                    0.001176
theta_pi_ci_low                    0.020585
theta_pi_ci_upp                    0.021761
theta_w                            0.038855
theta_w_estimate                   0.038854
theta_w_bias                       0.000001
theta_w_std_err                    0.000436
theta_w_ci_err                     0.001709
theta_w_ci_low                     0.037999
theta_w_ci_upp                     0.039708
tajima_d                          -1.635894
tajima_d_estimate                 -1.635436
tajima_d_bias                     -0.000459
tajima_d_std_err                   0.015294
tajima_d_ci_err                     0.05995
tajima_d_ci_low                   -1.665411
tajima_d_ci_upp                   -1.605461
taxon                               gambiae
year                                   2015
month                                [3, 6]
country                            Tanzania
admin1_iso                            TZ-05
admin1_name                          Kagera
admin2_name                          Muleba
longitude                            31.621
latitude                             -1.962
dtype: object

This function returns a pandas Series with values of the summary statistics for our requested cohort. The field theta_pi contains the value of nucleotide diversity. The field theta_w contains the value of Watterson’s estimator. The field tajima_d contains the value of Tajima’s D.

A 95% confidence interval has also been estimated for each of these statistics using a block jackknife procedure. The lower and upper ends of these confidence intervals are given by the fields with the suffixes _ci_low and _ci_upp respectively. The fields with suffix _ci_err give the size of the confidence interval, and is useful for plotting error bars.

It is important to note that, in addition to the cohort parameter, we also provided values for the parameters cohort_size, region, site_mask and site_class. These parameters are important and can make a big difference to the statistic values obtained.

To illustrate the impact of changing some of these parameters, let’s recompute the statistics but providing cohort_size of 10 instead of 30.

stats = ag3.cohort_diversity_stats(
    cohort="TZ-05_Muleba_gamb_2015",
    cohort_size=10,
    region="3L:15,000,000-41,000,000",
    site_mask="gamb_colu_arab",
    site_class="CDS_DEG_4",
)
stats
cohort               TZ-05_Muleba_gamb_2015
theta_pi                           0.021242
theta_pi_estimate                  0.021247
theta_pi_bias                     -0.000004
theta_pi_std_err                   0.000302
theta_pi_ci_err                    0.001182
theta_pi_ci_low                    0.020655
theta_pi_ci_upp                    0.021838
theta_w                             0.03035
theta_w_estimate                   0.030352
theta_w_bias                      -0.000003
theta_w_std_err                    0.000377
theta_w_ci_err                     0.001478
theta_w_ci_low                     0.029613
theta_w_ci_upp                     0.031092
tajima_d                          -1.259672
tajima_d_estimate                 -1.259398
tajima_d_bias                     -0.000274
tajima_d_std_err                   0.016672
tajima_d_ci_err                    0.065353
tajima_d_ci_low                   -1.292075
tajima_d_ci_upp                   -1.226721
taxon                               gambiae
year                                   2015
month                                [3, 6]
country                            Tanzania
admin1_iso                            TZ-05
admin1_name                          Kagera
admin2_name                          Muleba
longitude                            31.621
latitude                             -1.962
dtype: object

Note that the value of theta_pi hasn’t changed much, but the values of theta_w and tajima_d are considerably different. This is because, for many cohorts, there is a strong dependency between these statistics and the cohort size, i.e., the number of samples used to calculate the statistics. Because of this, if we want to make a fair comparison between different cohorts, it is important to use the same cohort size in all cases. If we have a cohort with more samples than our chosen cohort size, the cohort will be randomly downsampled.

Another parameter that makes a big difference is the site_class parameter, which defines the type of genomic site used to calculate the diversity statistics. Above we used “CDS_DEG_4” which stands for 4-fold degenerate coding sites, i.e., sites within gene coding sequences where none of the possible nucleotide changes will affect the resulting amino acid. These sites are typically under relatively weak selective constraint, because they have little to no impact on the phenotype of the organism.

For illustration, let’s contrast this with “CDS_DEG_0” which stands for non-degenerate coding sites, i.e., sites where any nucleotide change will change the amino acid.

stats = ag3.cohort_diversity_stats(
    cohort="TZ-05_Muleba_gamb_2015",
    cohort_size=10,
    region="3L:15,000,000-41,000,000",
    site_mask="gamb_colu_arab",
    site_class="CDS_DEG_0",
)
stats
cohort               TZ-05_Muleba_gamb_2015
theta_pi                           0.002492
theta_pi_estimate                  0.002492
theta_pi_bias                          -0.0
theta_pi_std_err                   0.000091
theta_pi_ci_err                    0.000356
theta_pi_ci_low                    0.002314
theta_pi_ci_upp                     0.00267
theta_w                            0.004047
theta_w_estimate                   0.004047
theta_w_bias                      -0.000001
theta_w_std_err                     0.00013
theta_w_ci_err                     0.000509
theta_w_ci_low                     0.003793
theta_w_ci_upp                     0.004302
tajima_d                          -1.612858
tajima_d_estimate                 -1.612861
tajima_d_bias                      0.000004
tajima_d_std_err                   0.026732
tajima_d_ci_err                    0.104788
tajima_d_ci_low                   -1.665255
tajima_d_ci_upp                   -1.560467
taxon                               gambiae
year                                   2015
month                                [3, 6]
country                            Tanzania
admin1_iso                            TZ-05
admin1_name                          Kagera
admin2_name                          Muleba
longitude                            31.621
latitude                             -1.962
dtype: object

Note that the value of theta_pi has now reduced by approximately an order of magnitude. This is because non-degenerate coding sites are subject to much stronger purifying selection than 4-fold degenerate sites.

In this module we are most interested in using comparative analysis of diversity statistics to learn more about demographic differences and changes within our sampled mosquito populations. Hence we will prefer to use sites that are less affected by selection, and will use 4-fold degenerate sites (“CDS_DEG_4”) for the remainder of the module.

A final note, above we chose to provide the parameter region="3L:15,000,000-41,000,000", which defines the genome region we will use for diversity calculations. We typically prefer to use chromosome arm 3L because there are no large polymorphic inversions in any of the species we are studying. We also exclude the pericentromeric and telomeric regions because these have reduced diversity due to reduced recombination and hence selection at linked sites.

Comparing diversity between cohorts#

Now we have seen how to compute statistics for a single cohort, let’s run the same computation but for multiple cohorts, and see how to compare the results.

Example: Tanzania#

Sticking with our focus on Tanzania, let’s remind ourselves what cohorts and sample sizes are available if we group samples by the “cohort_admin2_year” field, which will group by taxon, district and year.

df_samples_tz.groupby("cohort_admin2_year").size()
cohort_admin2_year
TZ-05_Muleba_arab_2015    137
TZ-05_Muleba_gamb_2015     32
TZ-05_Muleba_gcx3_2015      1
TZ-13_Tarime_arab_2012     47
TZ-25_Muheza_arab_2013      1
TZ-25_Muheza_gamb_2013     32
TZ-25_Muheza_gcx3_2013     10
TZ-26_Moshi_arab_2012      40
dtype: int64

If we choose a cohort size of 10 for our comparative analysis, we will be able to analyse 6 cohorts.

To compute diversity for multiple cohorts at the same time, we have provided a diversity_stats() function.

ag3.diversity_stats?

This function has many of the same parameters we discussed above, except now we provide a sample_query parameter to restrict the overall analysis to samples in Tanzania, and we provide a cohorts parameter to define how these samples will be grouped into cohorts.

df_stats_tz_admin2_year = ag3.diversity_stats(
    sample_query="country == 'Tanzania'",
    cohorts="admin2_year",
    cohort_size=10,
    region="3L:15,000,000-41,000,000",
    site_mask="gamb_colu_arab",
    site_class="CDS_DEG_4",
)
df_stats_tz_admin2_year
[INFO] cohort (TZ-05_Muleba_gcx3_2015) has insufficient samples (1) for requested cohort size (10), dropping
[INFO] cohort (TZ-25_Muheza_arab_2013) has insufficient samples (1) for requested cohort size (10), dropping
cohort theta_pi theta_pi_estimate theta_pi_bias theta_pi_std_err theta_pi_ci_err theta_pi_ci_low theta_pi_ci_upp theta_w theta_w_estimate ... tajima_d_ci_upp taxon year month country admin1_iso admin1_name admin2_name longitude latitude
0 TZ-05_Muleba_arab_2015 0.012306 0.012306 6.185033e-07 0.000224 0.000877 0.011867 0.012744 0.014789 0.014790 ... -0.658653 arabiensis 2015 [3, 4, 6] Tanzania TZ-05 Kagera Muleba 31.621 -1.962
1 TZ-05_Muleba_gamb_2015 0.021242 0.021247 -4.337053e-06 0.000302 0.001182 0.020655 0.021838 0.030350 0.030352 ... -1.226721 gambiae 2015 [3, 6] Tanzania TZ-05 Kagera Muleba 31.621 -1.962
2 TZ-13_Tarime_arab_2012 0.012435 0.012435 -3.120599e-07 0.000220 0.000861 0.012005 0.012866 0.015058 0.015059 ... -0.681130 arabiensis 2012 8 Tanzania TZ-13 Mara Tarime 34.199 -1.431
3 TZ-25_Muheza_gamb_2013 0.017705 0.017705 4.563558e-07 0.000290 0.001136 0.017137 0.018273 0.018939 0.018935 ... -0.218096 gambiae 2013 1 Tanzania TZ-25 Tanga Muheza 38.948 -4.940
4 TZ-25_Muheza_gcx3_2013 0.013395 0.013398 -2.958723e-06 0.000342 0.001340 0.012728 0.014068 0.012917 0.012918 ... 0.257321 gcx3 2013 1 Tanzania TZ-25 Tanga Muheza 38.948 -4.940
5 TZ-26_Moshi_arab_2012 0.012374 0.012373 1.310417e-06 0.000219 0.000857 0.011945 0.012801 0.014669 0.014667 ... -0.604319 arabiensis 2012 8 Tanzania TZ-26 Manyara Moshi 37.308 -3.482

6 rows × 31 columns

This function returns a pandas DataFrame, where each row contains data about a cohort, and the columns provide the different statistic values and their confidence intervals. Let’s take a look at all the available columns:

df_stats_tz_admin2_year.columns
Index(['cohort', 'theta_pi', 'theta_pi_estimate', 'theta_pi_bias',
       'theta_pi_std_err', 'theta_pi_ci_err', 'theta_pi_ci_low',
       'theta_pi_ci_upp', 'theta_w', 'theta_w_estimate', 'theta_w_bias',
       'theta_w_std_err', 'theta_w_ci_err', 'theta_w_ci_low', 'theta_w_ci_upp',
       'tajima_d', 'tajima_d_estimate', 'tajima_d_bias', 'tajima_d_std_err',
       'tajima_d_ci_err', 'tajima_d_ci_low', 'tajima_d_ci_upp', 'taxon',
       'year', 'month', 'country', 'admin1_iso', 'admin1_name', 'admin2_name',
       'longitude', 'latitude'],
      dtype='object')

A simple way to visualise these statistics and compare cohorts is to make a bar plot. Let’s do this for nucleotide diversity using plotly express.

px.bar(
    data_frame=df_stats_tz_admin2_year,
    x="cohort",
    y="theta_pi_estimate",
    error_y="theta_pi_ci_err",
    color="taxon",
    height=450,
    width=500,
)

Above we chose to colour the bars by taxon, which helps to see some of the patterns in the data. For example, we have three An. arabiensis cohorts, all of which have nearly identical nucleotide diversity. We also have two An. gambiae cohorts, both of which have higher nucleotide diversity. There is also a significant difference between the two An. gambiae cohorts from the east and west of the country. Finally, we have one cohort representing the putative cryptic taxon which we have provisionally labelled gcx3 (see workshop 4, module 4).

We also would like to make similar plots to visualise Watterson’s estimator and Tajima’s D. Because we’ll want to make these plots often, let’s define a simple function to make all the plots we need.

def plot_diversity_stats(
    df_stats, 
    color=None, 
    height=450, 
    template="plotly_white"
):

    # set up common plotting parameters
    hover_name = "cohort"
    hover_data = [
        "taxon",
        "country",
        "admin1_iso",
        "admin1_name",
        "admin2_name",
        "year",
        "month",
    ]
    labels = {
        'theta_pi_estimate': r'$\widehat{\theta}_{\pi}$',
        'theta_w_estimate': r'$\widehat{\theta}_{w}$',
        'tajima_d_estimate': r'$D$',
        'cohort': "Cohort",
        'taxon': 'Taxon',
        'country': "Country",
    }
    category_orders = {
        "taxon": ["arabiensis", "gambiae", "gcx3", "coluzzii", "gcx1", "gcx2"],
    }
    width = 300 + 30 * len(df_stats)

    # nucleotide diversity bar plot
    fig = px.bar(
        data_frame=df_stats,
        x="cohort",
        y="theta_pi_estimate",
        error_y="theta_pi_ci_err",
        title="Nucleotide diversity",
        color=color,
        height=height,
        width=width,
        hover_name=hover_name,
        hover_data=hover_data,
        labels=labels,
        template=template,
        category_orders=category_orders,
    )
    fig.show()

    # watterson estimator bar plot
    fig = px.bar(
        data_frame=df_stats,
        x="cohort",
        y="theta_w_estimate",
        error_y="theta_w_ci_err",
        title="Watterson estimator",
        color=color,
        height=height,
        width=width,
        hover_name=hover_name,
        hover_data=hover_data,
        labels=labels,
        template=template,
        category_orders=category_orders,
    )
    fig.show()

    # tajima's d bar plot
    fig = px.bar(
        data_frame=df_stats,
        x="cohort",
        y="tajima_d_estimate",
        error_y="tajima_d_ci_err",
        title="Tajima's D",
        color=color,
        height=height,
        width=width,
        hover_name=hover_name,
        hover_data=hover_data,
        labels=labels,
        template=template,
        category_orders=category_orders,
    )
    fig.show()

    # scatter plot comparing diversity estimators
    fig = px.scatter(
        data_frame=df_stats,
        x="theta_pi_estimate",
        y="theta_w_estimate",
        error_x="theta_pi_ci_err",
        error_y="theta_w_ci_err",
        title="Diversity estimators",
        color=color,
        width=500,
        height=500,
        hover_name=hover_name,
        hover_data=hover_data,
        labels=labels,
        template=template,
        category_orders=category_orders,
    )
    fig.show()

Let’s now make these plots for our Tanzanian cohorts.

plot_diversity_stats(df_stats_tz_admin2_year, color="taxon")

From the above we can see the following patterns:

  • All An. arabiensis cohorts have similar diversity, regardless of sampling location.

  • An. gambiae cohorts have different diversity between the west and east of the country.

  • The gcx3 cohort is quite different from all other cohorts, and is the only cohort with a positive value for Tajima’s D.

Exercise 1 (English)

Uncomment and run the code cells below to repeat the above analysis, but using cohorts grouped by district (administrative level 2) and month.

Do the results change any of our observations about the similarities and differences between species and geographical locations?

Is there any evidence for differences in diversity between different months of sampling?

Exercice 1 (Français)

Décommentez et exécutez les cellules de code ci-dessous pour répéter l’analyse ci-dessus, mais en utilisant des cohortes regroupées par district (niveau administratif 2) et mois.

Les résultats modifient-ils certaines de nos observations sur les similitudes et les différences entre les espèces et les emplacements géographiques?

Existe-t-il des preuves de différences de diversité entre les différents mois d’échantillonnage?

# df_samples_tz.groupby("cohort_admin2_month").size()
# df_stats_tz_admin2_month = ag3.diversity_stats(
#     sample_query="country == 'Tanzania'",
#     cohorts="admin2_month",
#     cohort_size=10,
#     region="3L:15,000,000-41,000,000",
#     site_mask="gamb_colu_arab",
#     site_class="CDS_DEG_4",
# )
# plot_diversity_stats(df_stats_tz_admin2_month, color="taxon")

Example: East Africa#

Let’s now expand our analysis to include samples from neighbouring countries, to investigate a broader geographical range.

Firstly, let’s plot a map to remind ourselves of the available sampling locations.

ag3.plot_samples_interactive_map()

From the map we can see that there are two sampling locations in Uganda and one location in Kenya. Let’s add these countries and investigate sample sizes.

ag3.count_samples(
    sample_query="country in ['Tanzania', 'Kenya', 'Uganda']"
)
taxon arabiensis gambiae gcx3 unassigned
country admin1_iso admin1_name admin2_name year
Kenya KE-14 Kilifi Kilifi North 2000 0 19 0 0
2007 3 0 0 0
2012 10 0 54 0
Tanzania TZ-05 Kagera Muleba 2015 137 32 1 0
TZ-13 Mara Tarime 2012 47 0 0 0
TZ-25 Tanga Muheza 2013 1 32 10 0
TZ-26 Manyara Moshi 2012 40 0 0 0
Uganda UG-E Eastern Region Tororo 2012 81 112 0 1
UG-W Western Region Kanungu 2012 1 95 0 0

Let’s also confirm how many cohorts would be available if we continued to use the “cohort_admin2_year” field to group by taxon, district and year.

ag3.sample_metadata(
    sample_query="country in ['Tanzania', 'Kenya', 'Uganda']"
).groupby("cohort_admin2_year").size()
cohort_admin2_year
KE-14_Kilifi-North_arab_2007      3
KE-14_Kilifi-North_arab_2012     10
KE-14_Kilifi-North_gamb_2000     19
KE-14_Kilifi-North_gcx3_2012     54
TZ-05_Muleba_arab_2015          137
TZ-05_Muleba_gamb_2015           32
TZ-05_Muleba_gcx3_2015            1
TZ-13_Tarime_arab_2012           47
TZ-25_Muheza_arab_2013            1
TZ-25_Muheza_gamb_2013           32
TZ-25_Muheza_gcx3_2013           10
TZ-26_Moshi_arab_2012            40
UG-E_Tororo_arab_2012            81
UG-E_Tororo_gamb_2012           112
UG-W_Kanungu_arab_2012            1
UG-W_Kanungu_gamb_2012           95
dtype: int64

Here, sticking with a cohort size of 10 will allow us to include 3 cohorts from Kenya and 3 cohorts from Uganda.

Let’s now compute and plot diversity stats for all cohorts.

df_stats_tz_ke_ug_admin2_year = ag3.diversity_stats(
    sample_query="country in ['Tanzania', 'Kenya', 'Uganda']",
    cohorts="admin2_year",
    cohort_size=10,
    region="3L:15,000,000-41,000,000",
    site_mask="gamb_colu_arab",
    site_class="CDS_DEG_4",
)
df_stats_tz_ke_ug_admin2_year
[INFO] cohort (KE-14_Kilifi-North_arab_2007) has insufficient samples (3) for requested cohort size (10), dropping
[INFO] cohort (TZ-05_Muleba_gcx3_2015) has insufficient samples (1) for requested cohort size (10), dropping
[INFO] cohort (TZ-25_Muheza_arab_2013) has insufficient samples (1) for requested cohort size (10), dropping
[INFO] cohort (UG-W_Kanungu_arab_2012) has insufficient samples (1) for requested cohort size (10), dropping
cohort theta_pi theta_pi_estimate theta_pi_bias theta_pi_std_err theta_pi_ci_err theta_pi_ci_low theta_pi_ci_upp theta_w theta_w_estimate ... tajima_d_ci_upp taxon year month country admin1_iso admin1_name admin2_name longitude latitude
0 KE-14_Kilifi-North_arab_2012 0.012515 0.012513 1.739748e-06 0.000221 0.000868 0.012080 0.012947 0.015013 0.015010 ... -0.649706 arabiensis 2012 -1 Kenya KE-14 Kilifi Kilifi North 39.909 -3.511
1 KE-14_Kilifi-North_gamb_2000 0.017751 0.017754 -3.812392e-06 0.000281 0.001103 0.017203 0.018306 0.018947 0.018950 ... -0.213639 gambiae 2000 -1 Kenya KE-14 Kilifi Kilifi North 39.909 -3.511
2 KE-14_Kilifi-North_gcx3_2012 0.013473 0.013474 -1.341458e-06 0.000276 0.001082 0.012933 0.014015 0.011567 0.011568 ... 0.814804 gcx3 2012 -1 Kenya KE-14 Kilifi Kilifi North 39.909 -3.511
3 TZ-05_Muleba_arab_2015 0.012306 0.012306 6.185033e-07 0.000224 0.000877 0.011867 0.012744 0.014789 0.014790 ... -0.658653 arabiensis 2015 [3, 4, 6] Tanzania TZ-05 Kagera Muleba 31.621 -1.962
4 TZ-05_Muleba_gamb_2015 0.021242 0.021247 -4.337053e-06 0.000302 0.001182 0.020655 0.021838 0.030350 0.030352 ... -1.226721 gambiae 2015 [3, 6] Tanzania TZ-05 Kagera Muleba 31.621 -1.962
5 TZ-13_Tarime_arab_2012 0.012435 0.012435 -3.120599e-07 0.000220 0.000861 0.012005 0.012866 0.015058 0.015059 ... -0.681130 arabiensis 2012 8 Tanzania TZ-13 Mara Tarime 34.199 -1.431
6 TZ-25_Muheza_gamb_2013 0.017705 0.017705 4.563558e-07 0.000290 0.001136 0.017137 0.018273 0.018939 0.018935 ... -0.218096 gambiae 2013 1 Tanzania TZ-25 Tanga Muheza 38.948 -4.940
7 TZ-25_Muheza_gcx3_2013 0.013395 0.013398 -2.958723e-06 0.000342 0.001340 0.012728 0.014068 0.012917 0.012918 ... 0.257321 gcx3 2013 1 Tanzania TZ-25 Tanga Muheza 38.948 -4.940
8 TZ-26_Moshi_arab_2012 0.012374 0.012373 1.310417e-06 0.000219 0.000857 0.011945 0.012801 0.014669 0.014667 ... -0.604319 arabiensis 2012 8 Tanzania TZ-26 Manyara Moshi 37.308 -3.482
9 UG-E_Tororo_arab_2012 0.012560 0.012560 8.203444e-07 0.000224 0.000876 0.012121 0.012998 0.015198 0.015195 ... -0.679454 arabiensis 2012 10 Uganda UG-E Eastern Region Tororo 34.026 0.770
10 UG-E_Tororo_gamb_2012 0.021693 0.021698 -4.572455e-06 0.000317 0.001244 0.021076 0.022320 0.031458 0.031460 ... -1.271359 gambiae 2012 10 Uganda UG-E Eastern Region Tororo 34.026 0.770
11 UG-W_Kanungu_gamb_2012 0.021464 0.021469 -4.432877e-06 0.000310 0.001214 0.020862 0.022076 0.030680 0.030685 ... -1.225296 gambiae 2012 [10, 11] Uganda UG-W Western Region Kanungu 29.701 -0.751

12 rows × 31 columns

plot_diversity_stats(df_stats_tz_ke_ug_admin2_year, color="taxon")

Some observations about these results:

  • All An. arabiensis cohorts show similar diversity. This indicates that all An. arabiensis populations from across this region share very similar demographic histories.

  • Diversity in An. gambiae from western Tanzania is very similar to both An. gambiae cohorts from Uganda. Conversely, diversity in An. gambiae from eastern Tanzania is very similar to An. gambiae from coastal Kenya. This supports a strong population division within An. gambiae, with different demographic histories.

  • Diversity is very similar between the two gcx3 cohorts from coastal Tanzania and Kenya. This indicates that the two gcx3 cohorts are sampled from populations with similar demographic histories. In particular, both cohorts have positive values for Tajima’s D, which is consistent with a reduction in population size. We cannot infer the timing or strength of this reduction from Tajima’s D alone, but it is interesting to note that these are the only two cohorts with evidence for population reduction, whereas all An. gambiae and An. arabiensis cohorts have negative Tajima’s D, which is consistent with historical population expansion. This certainly adds evidence for an important distinction between the gcx3 taxon and other mosquito taxa in the region.

Exercise 2 (English)

Rerun the diversity analysis for East Africa, adding Mozambique, Malawi and Mayotte. Discuss the patterns you observe with a colleague.

Exercice 2 (Français)

Réexécutez l’analyse de la diversité pour l’Afrique de l’Est, en ajoutant le Mozambique, le Malawi et Mayotte. Discutez des tendances que vous observez avec un collègue.

Example: Burkina Faso#

Finally, let’s look at an example where we have data from the same location from multiple years, to illustrate how we might use diversity to monitor populations for signs of recent demographic changes.

In Burkina Faso we have data from the Houet district from 2012 and 2014, for both An. gambiae and An. coluzzii.

ag3.count_samples(
    sample_query="country == 'Burkina Faso'"
)
taxon arabiensis coluzzii gambiae
country admin1_iso admin1_name admin2_name year
Burkina Faso BF-07 Centre-Sud Bazega 2004 0 0 13
BF-09 Hauts-Bassins Houet 2012 0 82 99
2014 3 53 46

Exercise 3 (English)

Analyse diversity in Burkina Faso.

Do you observe any differences between An. gambiae and An. coluzzii?

Do you observe any differences between 2012 and 2014?

Exercice 3 (Français)

Analyser la diversité au Burkina Faso.

Observez-vous des différences entre An. gambiae et An. coluzzii?

Observez-vous des différences entre 2012 et 2014 ?

Well done!#

Well done for completing this module on genetic diversity summary statistics!

Exercises#

Please now launch this notebook in colab, clear the outputs, copy it to your drive and run it from the top, complete the exercises you find along the way.

Challenge (English)#

Choose any country or geographical region and perform comparative analysis of genetic diversity. Share your findings and conclusions with colleagues.

Défi (Français)#

Choisissez n’importe quel pays ou région géographique et effectuez une analyse comparative de la diversité génétique. Partagez vos découvertes et conclusions avec vos collègues.