Population structure in Anopheles funestus

Population structure in Anopheles funestus#

Theme: Analysis

DISCLAIMER: This is work in progress and subject to change and updates.

A certain degree of familiarity with the content of the training course on Anopheles gambiae s.l. is expected. Many of the concepts presented in this module were introduced in the training course and we will refer to the relevant workshop and module instead of giving detailed explanations.

More specifically, this module will use neighbour-joining trees and principal component analyses, which were introduced in Workshop 3 - Module 4 and will be covered in more details in a future advanced module, to observe population structure. In order to observe diversity within cohorts, we will also use the statistics and measures of the homozigosity of a sample introduced in Workshop 5. Users are encouraged to go back to these workshops to refamiliarize themselves with their content.

Learning objectives#

After completing this module, you will be able to use the malariagen_data Python package to:

Create neighbour-joining trees to detect population structure
Use a Principal Component Analysis to study population structure
Compute diversity statistics to infer population demographics

Lecture?#

English#

Français#

Please note that the code in the cells below might differ from that shown in the video. This can happen because Python packages and their dependencies change due to updates, necessitating tweaks to the code.

Setup#

First, let’s install the python packages we will need for our analyses.

%pip install -q --no-warn-conflicts malariagen_data

Now we’ve installed malariagen_data, we can import it into our environment and set it up to access data in the cloud.

Note that authentication is required to access data through the package, please follow the instructions here.

import malariagen_data
import numpy as np
import pandas as pd
import seaborn as sns
import os
import plotly.io as pio
pio.renderers.default = "notebook+colab"

To avoid having to rerun these analyses, we’ll save the results so we can come back to them later. In Google Colab, you can save results to your Google Drive, which will mean you don’t lose results even if you leave the notebook and come back several days later.

When mounting your Google Drive, you will need to follow the authorization instructions.

With our Google Drive now mounted, we can define and make a directory where we want to save our results.

try:
    # if running on colab, mount Google Drive
    from google.colab import drive
    drive.mount('drive')
except ImportError:
    pass

results_dir = "drive/MyDrive/Colab Data/af1-structure-results"
os.makedirs(results_dir, exist_ok=True)

af1 = malariagen_data.Af1(results_cache=results_dir)
af1

MalariaGEN Af1 API client
Please note that data are subject to terms of use, for more information see the MalariaGEN website or contact support@malariagen.net. See also the Af1 API docs.
Storage URL	gs://vo_afun_release_master_us_central1
Data releases available	1.0, 1.1, 1.2, 1.3, 1.4
Results cache	/home/jonbrenas/anopheles-genomic-surveillance.github.io/docs/advanced-training-materials/drive/MyDrive/Colab Data/af1-structure-results
Cohorts analysis	20240515
Site filters analysis	dt_20200416
Software version	malariagen_data 15.2.0
Client location	Iowa, United States (Google Cloud us-central1)

Continent-wide population structure#

The An. gambiae data resource Ag3 contains several different taxa and it is an obvious source of structure within the resource. We can start by checking if the same is true of Af1.

af1.sample_metadata(sample_sets="1.0").groupby('taxon').size()

taxon
funestus    656
dtype: int64

Despite the fact that An. funestus is a complex of species, all samples available in Af1.0 are considered to belong to same taxon. We can thus assume for the rest of this notebook that the population structure we observe doesn’t rise to differences in terms of taxon classification. This may not, and indeed does not, mean that it will be true of future additions to the resource and we will have to be aware of that if additional data is used in the future.

We will now generate a neighbour-joining tree (as we did during Workshop 3) to study the population structure using all the data available. We will use a region of chromosome 2 which is free from inversions and known signals of selection.

region = "2RL:60,000,000-100,000,000"
n_snps = 100_000

af1.plot_njt(
    region=region,
    n_snps=n_snps,
    sample_sets = "1.0",
    color="country",
)

Because samples from many different countries are available in Af1, this njt is a little tricky to analyse. The most obvious conclusion is that basic geography seems to explain the majority of the structure we observe, i.e., samples from the same country to tend to group together and away from samples from other locations.

There are, however, a few exceptions and we are going to look more closely at some of them. We can see, in the top-left, a cluster containing samples from Malawi, Zambia, Tanzania and Mozambique that are quite well mixed. These countries are not very distant from each other but, looking at the map from Advanced Funestus - Exploration, we can see that the locations are still quite distant from each other.

We see another two well-mixed clusters, top-right and bottom-right, of samples from Kenya, Uganda and the Democratic Republic. On the map, these sampling sites are much closer to each other. The top-right cluster also gradually extends to the Central African Republic and then Cameroon which makes a lot of sense geographically.

In addition to the samples from Kenya and Uganda, we can see that the samples from DRC are split into another homogeneous cluster. The same is true for the samples from Mozambique. Finally the samples from Ghana are split into two clusters, one of which also includes the samples from Nigeria. Interestingly, the samples from Benin, which are between Nigeria and Ghana geographically, form their own cluster.

We will now look at some of these countries in isolation, starting with Mozambique.

af1.plot_njt(
    region=region,
    n_snps=n_snps,
    sample_query = "country == 'Mozambique'",
    sample_sets = "1.0",
    color="admin1_name",
)

The structure in Mozambique is clearly defined by geography. Let’s now look at Ghana.

af1.plot_njt(
    region=region,
    n_snps=n_snps,
    sample_query = "country == 'Ghana'",
    sample_sets = "1.0",
    color="admin1_name",
)

We can see that, apart from the sample VBS24205, the region of origin of the samples determines in which cluster they fall. The Northern cluster is less homogeneous than the Southern one.

Next, let’s look at the DRC.

af1.plot_njt(
    region=region,
    n_snps=n_snps,
    sample_query = "country == 'Democratic Republic of the Congo'",
    sample_sets = "1.0",
    color="admin1_name",
)

There is again a clear geographic explanation for the population structure. Now that we know which samples to exclude, let’s look at Uganda, Kenya, the DRC, the Central African Republic and Cameroon together.

af1.plot_njt(
    region=region,
    n_snps=n_snps,
    sample_query = "country in ['Kenya', 'Uganda', 'Central African Republic', 'Cameroon', 'Democratic Republic of the Congo'] and admin1_name != 'Kinshasa'",
    sample_sets = "1.0",
    color="country",
)

This time, the situation is a little less clear cut. The samples from the Central African Republic and Cameroon are not mixed and slightly more distant from the rest, as we observed previously, and the Kenyan samples are generally to the right but the samples from Uganda and the DRC and sprinkled among the big cluster fairly regularly. There seems to be very little population structure between these samples.

Let’s look at a PCA with these same samples.

pca_df, evr = af1.pca(
    region=region, 
    n_snps=n_snps, 
    sample_query = "country in ['Kenya', 'Uganda', 'Central African Republic', 'Cameroon', 'Democratic Republic of the Congo'] and admin1_name != 'Kinshasa'",
    sample_sets = "1.0",
)
af1.plot_pca_coords(pca_df, color="country")

This plot shows a very similar result to what was observed using the neighbour-joining tree: samples from Kenya, Uganda and the DRC form one fairly homogeneous cluster while samples from the Central African Republic and Cameroon are a little different. PC2 blows up the Cameroonian cluster, we can check using the explained variance that it is not meaningful.

af1.plot_pca_variance(evr)

As expected, the explained variance is generally low and only PC1 is really above the plateau of values.

Genetic diversity#

So far, we have looked at which populations look markedly different from their neighbours. We will now try to look at the genetic diversity within some of these populations and try to see if the differences we observed affect their diversities and their demographics. A more detailed explanation of the code used and how to interpret genetic diversity analyses can be found in Workshop 5.

We will look at the populations from Mozambique, Ghana and the DRC as our exemplar cohorts.

df_stats_admin1_year= af1.diversity_stats(
    sample_query="release == '1.0' and country in ['Democratic Republic of the Congo', 'Ghana', 'Mozambique']",
    cohorts="admin1_year",
    cohort_size=10,
    region=region,
    site_mask="funestus",
    site_class="CDS_DEG_4",
)
df_stats_admin1_year

	cohort	theta_pi	theta_pi_estimate	theta_pi_bias	theta_pi_std_err	theta_pi_ci_err	theta_pi_ci_low	theta_pi_ci_upp	theta_w	theta_w_estimate	...	tajima_d_ci_upp	taxon	year	month	country	admin1_iso	admin1_name	admin2_name	longitude	latitude
0	CD-HU_fune_2017	0.030781	0.030781	4.514322e-07	0.000605	0.002370	0.029596	0.031966	0.045245	0.045242	...	-1.314886	funestus	2017	10	Democratic Republic of the Congo	CD-HU	Upper Uele	Watsa	29.548000	3.094000
1	CD-KN_fune_2015	0.031232	0.031233	-9.453471e-07	0.000536	0.002101	0.030183	0.032283	0.040638	0.040639	...	-0.927535	funestus	2015	5	Democratic Republic of the Congo	CD-KN	Kinshasa	Kinshasa	15.313000	-4.327000
2	GH-AH_fune_2014	0.030609	0.030608	6.031017e-07	0.000611	0.002395	0.029411	0.031806	0.044442	0.044439	...	-1.279664	funestus	2014	3	Ghana	GH-AH	Ashanti Region	Adansi Akrofuom	-1.617000	5.933000
3	GH-NP_fune_2017	0.027089	0.027088	1.218321e-06	0.000613	0.002402	0.025887	0.028289	0.030716	0.030710	...	-0.447752	funestus	2017	[6, 7, 8, 10]	Ghana	GH-NP	Northern Region	[Kumbungu, Tolon, Zabzugu]	-0.968556	9.461972
4	MZ-L_fune_2016	0.025648	0.025648	1.730250e-07	0.000533	0.002088	0.024604	0.026692	0.025341	0.025342	...	0.097808	funestus	2016	4	Mozambique	MZ-L	Maputo	Manhiça	32.873000	-25.255000
5	MZ-L_fune_2018	0.026050	0.026049	1.041284e-06	0.000540	0.002117	0.024991	0.027108	0.026161	0.026162	...	0.025088	funestus	2018	[1, 2]	Mozambique	MZ-L	Maputo	Manhiça	32.877000	-25.265000
6	MZ-P_fune_2015	0.025382	0.025381	1.147190e-06	0.000510	0.001999	0.024382	0.026381	0.025676	0.025676	...	0.000476	funestus	2015	8	Mozambique	MZ-P	Cabo Delgado	Palma	40.594000	-10.851000

7 rows × 31 columns

import plotly.express as px
def plot_diversity_stats(
    df_stats,
    color=None,
    height=450,
    template="plotly_white"
):

    # set up common plotting parameters
    hover_name = "cohort"
    hover_data = [
        "taxon",
        "country",
        "admin1_iso",
        "admin1_name",
        "admin2_name",
        "year",
        "month",
    ]
    labels = {
        'theta_pi_estimate': r'$\widehat{\theta}_{\pi}$',
        'theta_w_estimate': r'$\widehat{\theta}_{w}$',
        'tajima_d_estimate': r'$D$',
        'cohort': "Cohort",
        'taxon': 'Taxon',
        'country': "Country",
    }
    category_orders = {
        "taxon": ["gambiae", "coluzzii", "funestus"],
    }
    width = 300 + 30 * len(df_stats)

    # nucleotide diversity bar plot
    fig = px.bar(
        data_frame=df_stats,
        x="cohort",
        y="theta_pi_estimate",
        error_y="theta_pi_ci_err",
        title="Nucleotide diversity",
        color=color,
        height=height,
        width=width,
        hover_name=hover_name,
        hover_data=hover_data,
        labels=labels,
        template=template,
    )
    fig.show()

    # watterson estimator bar plot
    fig = px.bar(
        data_frame=df_stats,
        x="cohort",
        y="theta_w_estimate",
        error_y="theta_w_ci_err",
        title="Watterson estimator",
        color=color,
        height=height,
        width=width,
        hover_name=hover_name,
        hover_data=hover_data,
        labels=labels,
        template=template,
        category_orders=category_orders,
    )
    fig.show()

    # tajima's d bar plot
    fig = px.bar(
        data_frame=df_stats,
        x="cohort",
        y="tajima_d_estimate",
        error_y="tajima_d_ci_err",
        title="Tajima's D",
        color=color,
        height=height,
        width=width,
        hover_name=hover_name,
        hover_data=hover_data,
        labels=labels,
        template=template,
        category_orders=category_orders,
    )
    fig.show()

    # scatter plot comparing diversity estimators
    fig = px.scatter(
        data_frame=df_stats,
        x="theta_pi_estimate",
        y="theta_w_estimate",
        error_x="theta_pi_ci_err",
        error_y="theta_w_ci_err",
        title="Diversity estimators",
        color=color,
        width=500,
        height=500,
        hover_name=hover_name,
        hover_data=hover_data,
        labels=labels,
        template=template,
        category_orders=category_orders,
    )
    fig.show()

plot_diversity_stats(df_stats_admin1_year, color="country")

We can see clear differences in terms of diversity statistics between the various populations. All populations from Mozambique seem to have similar demographics, and maybe losing diversity. On the other hand, the diversity statistics look very different in the DRC and even more in Ghana between the two different populations.

Let’s use the An. gambiae data as a comparison. It doesn’t contain any data from Mozambique so we will only use the data from the DRC and Ghana.

ag3 = malariagen_data.Ag3()
df_stats_admin1_year_gam = ag3.diversity_stats(
    sample_query="release == '3.0' and country in ['Democratic Republic of the Congo', 'Ghana']",
    cohorts="admin1_year",
    cohort_size=10,
    region="3L:15,000,000-41,000,000",
    site_mask="gamb_colu_arab",
    site_class="CDS_DEG_4",
)
df_stats_all = pd.concat([df_stats_admin1_year, df_stats_admin1_year_gam])
plot_diversity_stats(df_stats_all, color="taxon")

Cohort (GH-EP_colu_2012) has insufficient samples (1) for requested cohort size (10), dropping.
Access SNP calls: ⠦ (0:00:42.61) 

We can see that genetic diversity is much more variable in An. funestus than it is in An. gambiae. While some population seem to have similar values of $\theta_w$ (i.e., similar number of segregating sites), $\theta_\pi$ seems generally higher (i.e., more different alleles).

We can also look at the heterozygosity and the runs of homozygosity for some samples from these populations. We will look at VBS24539 from Northern Mozambique and VBS24195 from Northern Ghana.

af1.plot_heterozygosity(
    sample="VBS24539",
    region="3RL",
    site_mask="funestus",
    window_size=10_000,
);

Load genome features: ⠹ (0:00:00.16)

/home/conda/developer/35674e27b19f7c625ba32a1b88449ff45c90b40edb90a065b66c5a9a5388f41c-20250421-195247-360829-93-training-nb-maintenance-mgen-15.2.0/lib/python3.11/site-packages/malariagen_data/anopheles.py:327: BokehDeprecationWarning:

'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.

af1.plot_heterozygosity(
    sample="VBS24195",
    region="3RL",
    site_mask="funestus",
    window_size=10_000,
);

Load genome features: ⠙ (0:00:00.08)

/home/conda/developer/35674e27b19f7c625ba32a1b88449ff45c90b40edb90a065b66c5a9a5388f41c-20250421-195247-360829-93-training-nb-maintenance-mgen-15.2.0/lib/python3.11/site-packages/malariagen_data/anopheles.py:327: BokehDeprecationWarning:

'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.

It looks like there are several regions with very low heterozygosity in the genome of VBS24195. We can look at the runs of homozygosity to confirm that.

af1.plot_roh(
    sample="VBS24539",
    region="3RL",
    site_mask="funestus",
    window_size=10_000,
);

/home/conda/developer/35674e27b19f7c625ba32a1b88449ff45c90b40edb90a065b66c5a9a5388f41c-20250421-195247-360829-93-training-nb-maintenance-mgen-15.2.0/lib/python3.11/site-packages/malariagen_data/anopheles.py:327: BokehDeprecationWarning:

'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.

af1.plot_roh(
    sample="VBS24195",
    region="3RL",
    site_mask="funestus",
    window_size=10_000,
);

Load genome features: ⠙ (0:00:00.09)

/home/conda/developer/35674e27b19f7c625ba32a1b88449ff45c90b40edb90a065b66c5a9a5388f41c-20250421-195247-360829-93-training-nb-maintenance-mgen-15.2.0/lib/python3.11/site-packages/malariagen_data/anopheles.py:327: BokehDeprecationWarning:

'circle() method with size value' was deprecated in Bokeh 3.4.0 and will be removed, use 'scatter(size=...) instead' instead.

VBS24195 has, indeed, several much longer runs of homozygosity compared to VBS24539. One possible explanation is the presence of genes under selection in these regions, for example genes connected to insecticide resistance. Insecticide resistance will be analysed more thoroughly in Advanced Funestus - Insecticide resistance.

Using only a few samples is unlikely to give us a very representative picture of the situation so let’s use a more systematic approach. Beware, the following code may be very slow. We will only use 10 samples per cohort to make it slightly faster.

cohorts = {'Mozambique_north': "release == '1.0' and location == 'Motinho'",
           'Mozambique_south': "release == '1.0' and location in ['Palmeira', 'Palmeiras']",
           'Ghana_north': "release == '1.0' and admin1_name == 'Northern Region'",
           'Ghana_south': "release == '1.0' and location == 'Obuasi'"}

roh_count_list = []
froh_list = []
coh_list = []
sample_list = []

for coh_name, coh_query in cohorts.items():
    for sample in list(af1.sample_metadata().query(coh_query).sample_id)[:10]:
        roh_df = af1.roh_hmm(
            sample=sample,
            region="3RL",
            site_mask="funestus",
            window_size=10_000,
        )
        roh_df_100kb = roh_df.query("roh_length > 100_000")
        roh_count = len(roh_df_100kb)
        roh_total_length = roh_df_100kb["roh_length"].sum()
        froh = roh_total_length / len(af1.genome_sequence(region="3RL"))
        coh_list.append(coh_name)
        sample_list.append(sample)
        roh_count_list.append(roh_count)
        froh_list.append(froh)
df_roh = pd.DataFrame.from_dict({'sample_id': sample_list, 'cohort': coh_list, 'roh_count': roh_count_list, 'froh': froh_list})
df_roh

	sample_id	cohort	roh_count	froh
0	VBS24488	Mozambique_north	2	0.007288
1	VBS24489	Mozambique_north	2	0.006575
2	VBS24492	Mozambique_north	3	0.006747
3	VBS24493	Mozambique_north	1	0.003411
4	VBS24494	Mozambique_north	6	0.016241
5	VBS24497	Mozambique_north	3	0.009385
6	VBS24498	Mozambique_north	1	0.006982
7	VBS24499	Mozambique_north	2	0.007248
8	VBS24502	Mozambique_north	4	0.009050
9	VBS24505	Mozambique_north	3	0.008507
10	VBS17342	Mozambique_south	3	0.008484
11	VBS17345	Mozambique_south	7	0.014227
12	VBS17346	Mozambique_south	8	0.020137
13	VBS17347	Mozambique_south	2	0.005868
14	VBS17348	Mozambique_south	3	0.008000
15	VBS17349	Mozambique_south	3	0.008604
16	VBS17351	Mozambique_south	3	0.004690
17	VBS17353	Mozambique_south	1	0.004507
18	VBS17355	Mozambique_south	9	0.019035
19	VBS17356	Mozambique_south	3	0.006053
20	VBS24195	Ghana_north	47	0.148934
21	VBS24196	Ghana_north	52	0.166943
22	VBS24197	Ghana_north	47	0.126020
23	VBS24198	Ghana_north	43	0.152637
24	VBS24199	Ghana_north	39	0.131004
25	VBS24200	Ghana_north	48	0.205518
26	VBS24201	Ghana_north	44	0.221557
27	VBS24202	Ghana_north	40	0.417817
28	VBS24203	Ghana_north	50	0.200660
29	VBS24204	Ghana_north	36	0.247439
30	VBS17240	Ghana_south	3	0.006122
31	VBS17241	Ghana_south	3	0.007113
32	VBS17242	Ghana_south	1	0.005406
33	VBS17243	Ghana_south	4	0.008991
34	VBS17244	Ghana_south	2	0.005089
35	VBS17245	Ghana_south	3	0.006106
36	VBS17246	Ghana_south	3	0.007483
37	VBS17247	Ghana_south	1	0.003411
38	VBS17248	Ghana_south	3	0.007649
39	VBS17249	Ghana_south	2	0.006407

sns.scatterplot(data=df_roh, x="froh", y="roh_count", hue="cohort")

<Axes: xlabel='froh', ylabel='roh_count'>

../_images/a3a4017a9d1a9e7eee533ab46f347a0727d12179cece945ca97147a0aba29ee8.png

We can see that the population from Northern Ghana is quite atypical with very high numbers and frequencies of runs of homozygosity.

Congratulations on reaching the end of this notebook. You should now be able to start analysing Anopheles funestus data using Af1.0.