Workshop 3 - Training course in data analysis for genomic surveillance of African malaria vectors
Module 4 - Detecting taxonomic and geographical population structure using neighbour-joining trees (NJT)#
Theme: Analysis#
In this module we’re going to investigate genetic structure by constructing neighbour-joining trees (NJTs) using genome variation data from the MalariaGEN Vector Observatory Ag3.0 release. We will use functions in the malariagen_data python package to run and plot NJTs, then learn how interpret the results to discover both taxonomic and geographic structure in mosquito populations.
Learning Objectives#
At the end of this module you will be able to:
Construct neighbour-joining trees across different mosquito cohorts.
Plot trees and interpret results.
Use NJTs to discover taxonomic and geographic population structure.
Lecture#
English#
%%html
<iframe width="560" height="315" src="https://www.youtube.com/embed/nNHql0s5mmQ" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
French#
%%html
<iframe width="560" height="315" src="https://www.youtube.com/embed/a1On0sHZ4nc" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
What is taxonomic structure?#
In Ag3.0 there are 2,784 wild-caught mosquitoes collected from 19 countries in sub-Saharan Africa. This includes mosquitoes from three known species within the Anopheles gambiae species complex. We expect that any two mosquitoes from the same species will be more closely related than two mosquitoes from different species. This will hopefully be intuitive, because mosquitoes will prefer to mate with individuals of the same species, even if they live alongside mosquitoes from other species and can hybridise. This is also known as reproductive isolation and it may be caused by different drivers, e.g., behavioural differences leading to a lower likelihood of mating. Genomic differences accumulate between species with restricted gene flow over time with more divergence expected between species that have been reproductively isolated for longer.
What is population structure?#
We might expect that two mosquitoes, of the same species, sampled from the same location, will be more closely related than two mosquitoes sampled from distant locations. These non-random patterns of relatedness are known as population structure. This is based on the assumption that mosquitoes may travel to find a mate, but may only be willing or able to travel a certain distance. Mosquitoes may also be impeded from travelling in certain directions by natural physical barriers, such as a region of elevated terrain, or by variation in the availability of suitable habitat. These limits and barriers to movement can thus generate population structure, also known as geographic isolation.
For both taxonomic and population structure we may have some idea of what to expect based on our knowledge of previous studies, but we may be surprised.
For example, the Anopheles gambiae species complex continues to be unravelled, and there may be cryptic species that we were not previously aware of (e.g. see Crawford et al. 2016 and Tennessen et al. 2020). We will return to this topic with an entire workshop about discovering cryptic species.
Also, recent studies have suggested that some malaria mosquitoes may engage in long-distance migration (Huestis et al. 2019), challenging the previous view that mosquitoes generally don’t travel more than a few kilometres in their lifetimes (Service 1997). But we still don’t know to what extent long-distance migration occurs, or whether the rate or range of migration varies between geographical regions and/or mosquito species. We would also like to learn more about how ecological and landscape variation affects the movement and interbreeding of mosquito populations in different regions of Africa.
In short, we would like to investigate taxonomic and population structure, and we can use genetic variation data to do that. There are various methods for analysing taxonomic and population structure, but in this module we will focus on using a particular type of phylogenetic tree known as a neighbour-joining tree, then compare and contrast this with another method - principal component analysis - towards the end.
What is a phylogenetic tree?#
Phylogenetic trees are used to represent the evolutionary relationships between sampled individuals. These individuals (often called the tips or leaves of the tree) are connected together by tree branches with their length representing the amount of genetic change that has occured between individuals. For example, this might be based on the number of single nucloetide polymorphism (SNP) differences when comparing the genomic sequences of individuals (SNPs are introduced in Workshop1 Module 4). Individual branches are internally connected together by parent nodes. These nodes represent common ancestors inferred from the data. The common ancestor of all individuals in the tree is represented by its root. A tree is rooted when we know and have sampled the common ancestor of the individuals being analysed (often refered to as an outgroup) and means we can infer the order of the branching events. An unrooted tree can be constructed without this information but we cannot infer the direction of the evolutionary path without applying a timescale based on known mutation rates, i.e. a molecular clock.
What are neighbour-joining trees?#
A neighbour-joining tree is a type of phylogenetic tree based on the concept of minimum evolution. This concept assumes that the tree with the shortest sum of branch lengths is the correct tree, i.e. the principle of parsimony - the simplest solution is the most likely. The number of genomic differences, e.g. SNPs, observed between individuals can be represented by a distance matrix.
Neighbour-joining trees are derived by determining which individuals at the end of the tree branches are neighbours by clustering together the closest related individuals in a interative process.
In this simplified example below, a neighbour-joining tree is constructed from four individuals from different taxa. First, the leaves with the minimum distance between them are joined as neighbours. In the example below, these are taxa B and taxa C with a distance of 50 SNPs difference.
In an iterative process, the distance matrix is then recalculated using the distance of the remaining leaves to the B/C parent node. The next most closely related pair are joined as neighbours. In this case taxa A and taxa D have the minimum distance between them.
The distance matrix continues to be recalculated until all neighbours and subtrees are joined together by minimum distance. In our example, there was just one subtree pair left to be joined. These were the parent nodes A/D and B/C, resulting in the completed neighbour joining tree presented below.
Set up#
Now that we have covered some of the theoretical background of neighbour-joining trees and looked at a simple example with four samples, let’s install and import the Python package malariagen_data
that we can use to generate NJTs with hundreds or thousands of individuals.
%pip install -q --no-warn-conflicts malariagen_data
Note that authentication is required to access data through the package, more details can be found here.
import malariagen_data
import os
import plotly.io as pio
pio.renderers.default = "notebook+colab"
Saving NJT results#
Some NJT runs may take a while to complete, particularly if you’re running this code on a service with modest computational resources such as Google Colab, because genotype calls from tens of millions of SNPs may need to be scanned to identify and extract the data.
To avoid having to rerun these analyses, we’ll save the results so we can come back to them later. In Google Colab, you can save results to your Google Drive, which will mean you don’t lose results even if you leave the notebook and come back several days later.
When mounting your Google Drive you will need to follow the authorization instructions.
With our Google Drive now mounted, we can define and make a directory where we want to save our NJT results.
results_dir = "drive/MyDrive/Colab Data/ag3-structure-results"
os.makedirs(results_dir, exist_ok=True)
In Google Colab, we can actually see our mounted drive and NJT results directory by clicking on the file tab on the left hand side of the screen.
Next we should setup the malariagen_data
package. As we want to save our NJT results in the Google Drive folder we just set up, we’ll use the results_cache
parameter and assign our results directory to it. If we were running this notebook locally, then we could assign a local folder to this parameter and the NJT results would instead get stored on our hard drive.
ag3 = malariagen_data.Ag3(results_cache=results_dir)
ag3
MalariaGEN Ag3 API client | |
---|---|
Please note that data are subject to terms of use, for more information see the MalariaGEN website or contact support@malariagen.net. See also the Ag3 API docs. | |
Storage URL | gs://vo_agam_release_master_us_central1 |
Data releases available | 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 3.10, 3.11, 3.12, 3.13, 3.14 |
Results cache | /home/ahernank/github/anopheles-genomic-surveillance.github.io/docs/workshop-3/drive/MyDrive/Colab Data/ag3-structure-results |
Cohorts analysis | 20250131 |
AIM analysis | 20220528 |
Site filters analysis | dt_20200416 |
Software version | malariagen_data 15.0.1 |
Client location | Iowa, United States (Google Cloud us-central1) |
The output of ag3 shows us the “Client location”. This is where our Google Colab virtual machine is running on the cloud. As the data we would like to analyse is physically stored in the US, we should check that our notebook is running there too. If not, click “Runtime > Disconnect and delete runtime” from the menu, then re-run the notebook and check again.
The ag3.plot_njt() function#
We can generate and plot a neighbour-joining tree using the malariagen_data
function ag3.plot_njt(). Before we start, let’s have a quick look at the documention using “?”
ag3.plot_njt?
Signature:
ag3.plot_njt(
region: Union[str, malariagen_data.util.Region, Mapping, List[Union[str, malariagen_data.util.Region, Mapping]], Tuple[Union[str, malariagen_data.util.Region, Mapping], ...]],
n_snps: int,
color: Union[str, Mapping, NoneType] = None,
symbol: Union[str, Mapping, NoneType] = None,
algorithm: Literal['dynamic', 'rapid', 'canonical'] = 'dynamic',
metric: Literal['cityblock', 'euclidean', 'sqeuclidean'] = 'cityblock',
distance_sort: Optional[bool] = None,
count_sort: Optional[bool] = None,
center_x: int | float = 0,
center_y: int | float = 0,
arc_start: int | float = 0,
arc_stop: int | float = 6.283185307179586,
width: Optional[int] = 800,
height: Optional[int] = 600,
show: bool = True,
renderer: Optional[str] = None,
render_mode: Literal['auto', 'svg', 'webgl'] = 'svg',
title: Union[str, bool, NoneType] = True,
title_font_size: int = 14,
line_width: Union[int, float] = 0.5,
marker_size: Union[int, float] = 5,
color_discrete_sequence: Optional[List] = None,
color_discrete_map: Optional[Mapping] = None,
category_orders: Union[List, Mapping, NoneType] = None,
edge_legend: bool = False,
leaf_legend: bool = True,
legend_sizing: Literal['constant', 'trace'] = 'constant',
thin_offset: int = 0,
sample_sets: Union[str, Sequence[str], NoneType] = None,
sample_query: Optional[str] = None,
sample_query_options: Optional[dict] = None,
sample_indices: Optional[List[int]] = None,
site_mask: Optional[str] = 'default',
site_class: Optional[str] = None,
min_minor_ac: Union[int, float, NoneType] = 2,
max_missing_an: Union[int, float, NoneType] = 0,
cohort_size: Optional[int] = None,
min_cohort_size: Optional[int] = None,
max_cohort_size: Optional[int] = None,
random_seed: int = 42,
inline_array: bool = True,
chunks: Union[int, str, Tuple[Union[int, str], ...], Callable[[Tuple[int, ...]], Union[int, str, Tuple[Union[int, str], ...]]]] = 'native',
) -> Optional[plotly.graph_objs._figure.Figure]
Docstring:
Plot an unrooted neighbour-joining tree, computed from pairwise
distances between samples using biallelic SNP genotypes.
The tree is displayed as an unrooted tree using the equal angles layout.
Parameters
----------
region : str or Region or Mapping or list of str or Region or Mapping or tuple of str or Region or Mapping
Region of the reference genome. Can be a contig name, region string
(formatted like "{contig}:{start}-{end}"), or identifier of a genome
feature such as a gene or transcript. Can also be a sequence (e.g.,
list) of regions.
n_snps : int
The desired number of SNPs to use when running the analysis. SNPs will
be evenly thinned to approximately this number.
color : str or Mapping or None, optional
Name of variable to use to color the markers.
symbol : str or Mapping or None, optional
Name of the variable to use to choose marker symbols.
algorithm : {'dynamic', 'rapid', 'canonical'}, optional, default: 'dynamic'
Neighbour-joining algorithm to use. The 'dynamic' algorithm is
fastest.
metric : {'cityblock', 'euclidean', 'sqeuclidean'}, optional, default: 'cityblock'
The metric to compute distance between genotypes in two samples.
distance_sort : bool or None, optional
If True, for each node n, if True, the child with the minimum distance
between is plotted first. Note distance_sort and count_sort cannot
both be True.
count_sort : bool or None, optional
If True, for each node n, the child with the minimum number of
descendants is plotted first. Note distance_sort and count_sort cannot
both be True.
center_x : UnionType[int, float], optional, default: 0
X coordinate where plotting is centered.
center_y : UnionType[int, float], optional, default: 0
Y coordinate where plotting is centered.
arc_start : UnionType[int, float], optional, default: 0
Angle where tree layout begins.
arc_stop : UnionType[int, float], optional, default: 6.283185307179586
Angle where tree layout ends.
width : int or None, optional, default: 800
Figure width in pixels (px).
height : int or None, optional, default: 600
Figure height in pixels (px).
show : bool, optional, default: True
If true, show the plot. If False, do not show the plot, but return the
figure.
renderer : str or None, optional
The name of the renderer to use.
render_mode : {'auto', 'svg', 'webgl'}, optional, default: 'svg'
The type of rendering backend to use. See also
https://plotly.com/python/webgl-vs-svg/.
title : str or bool or None, optional, default: True
If True, attempt to use metadata from input dataset as a plot title.
Otherwise, use supplied value as a title.
title_font_size : int, optional, default: 14
Font size for the plot title.
line_width : int or float, optional, default: 0.5
Line width.
marker_size : int or float, optional, default: 5
Marker size.
color_discrete_sequence : List or None, optional
Provide a list of colours to use.
color_discrete_map : Mapping or None, optional
Provide an explicit mapping from values to colours.
category_orders : List or Mapping or None, optional
Control the order in which values appear in the legend.
edge_legend : bool, optional, default: False
Show legend entries for the different edge (line) colors.
leaf_legend : bool, optional, default: True
Show legend entries for the different leaf node (scatter) colors and
symbols.
legend_sizing : {'constant', 'trace'}, optional, default: 'constant'
Determines if the legend items symbols scale with their corresponding
"trace" attributes or remain "constant" independent of the symbol size
on the graph.
thin_offset : int, optional, default: 0
Starting index for SNP thinning. Change this to repeat the analysis
using a different set of SNPs.
sample_sets : sequence of str or str or None, optional
List of sample sets and/or releases. Can also be a single sample set
or release.
sample_query : str or None, optional
A pandas query string to be evaluated against the sample metadata, to
select samples to be included in the returned data.
sample_query_options : dict or None, optional
A dictionary of arguments that will be passed through to pandas
query() or eval(), e.g. parser, engine, local_dict, global_dict,
resolvers.
sample_indices : list of int or None, optional
Advanced usage parameter. A list of indices of samples to select,
corresponding to the order in which the samples are found within the
sample metadata. Either provide this parameter or sample_query, not
both.
site_mask : str or None, optional, default: 'default'
Which site filters mask to apply. See the `site_mask_ids` property for
available values.
site_class : str or None, optional
Select sites belonging to one of the following classes: CDS_DEG_4,
(4-fold degenerate coding sites), CDS_DEG_2_SIMPLE (2-fold simple
degenerate coding sites), CDS_DEG_0 (non-degenerate coding sites),
INTRON_SHORT (introns shorter than 100 bp), INTRON_LONG (introns
longer than 200 bp), INTRON_SPLICE_5PRIME (intron within 2 bp of 5'
splice site), INTRON_SPLICE_3PRIME (intron within 2 bp of 3' splice
site), UTR_5PRIME (5' untranslated region), UTR_3PRIME (3'
untranslated region), INTERGENIC (intergenic, more than 10 kbp from a
gene).
min_minor_ac : int or float or None, optional, default: 2
The minimum minor allele count. SNPs with a minor allele count below
this value will be excluded. Can also be a float, which will be
interpreted as a fraction.
max_missing_an : int or float or None, optional, default: 0
The maximum number of missing allele calls to accept. SNPs with more
than this value will be excluded. Set to 0 to require no missing
calls. Can also be a float, which will be interpreted as a fraction.
cohort_size : int or None, optional
Randomly down-sample to this value if the number of samples in the
cohort is greater. Raise an error if the number of samples is less
than this value.
min_cohort_size : int or None, optional
Minimum cohort size. Raise an error if the number of samples is less
than this value.
max_cohort_size : int or None, optional
Randomly down-sample to this value if the number of samples in the
cohort is greater.
random_seed : int, optional, default: 42
Random seed used for reproducible down-sampling.
inline_array : bool, optional, default: True
Passed through to dask `from_array()`.
chunks : int or str or tuple of int or str or Callable[[typing.Tuple[int, ...]], int or str or tuple of int or str], optional, default: 'native'
Define how input data being read from zarr should be divided into
chunks for a dask computation. If 'native', use underlying zarr
chunks. If a string specifying a target memory size, e.g., '300 MiB',
resize chunks in arrays with more than one dimension to match this
size. If 'auto', let dask decide chunk size. If 'ndauto', let dask
decide chunk size but only for arrays with more than one dimension. If
'ndauto0', as 'ndauto' but only vary the first chunk dimension. If
'ndauto1', as 'ndauto' but only vary the second chunk dimension. If
'ndauto01', as 'ndauto' but only vary the first and second chunk
dimensions. Also, can be a tuple of integers, or a callable which
accepts the native chunks as a single argument and returns a valid
dask chunks value.
Returns
-------
Figure or None
A plotly figure (only returned if show=False).
File: /home/conda/developer/55fe7ffdc8f19782d8fa1d5de44c1f26cc58e5a472146c0c30061f0238bc3185-20250317-170017-682536-85-training-nb-maintenance-mgen-15.0.1/lib/python3.11/site-packages/malariagen_data/anoph/distance.py
Type: method
We can see that there are two required parameters:
region
is one we are probably getting familar with by now, it defines what region of the genome we want to use to run the analysis. We could assign a whole chromosome arm, a chromosome region, with a start and stop point, or a specific gene of interest.n_snps
requires us to define how many SNPs to use from our region to build the NJT. This reduction in SNPs is called thinning. Too few SNPs and the analysis will not have enought power to accurately build trees, too many SNPs and the analysis will take a long time to run. We generally find 100,000 SNPs is a good compromise when running on Google Colab.
Here’s what the ag3.plot_njt() function is doing when we run it:
Access the SNP genotypes on which to run the analysis (these are stored in the cloud) - filter which sample_sets, which region of the genome, and apply any sample_query included in the parameters (we will look at this functionality a bit later).
SNP allele count - this is the count of the alternate alleles for each nucleotide in the selected genomic region, across our samples. We are interested in sites where there is SNP variation in our samples (segregating sites), particularly biallelic sites including the reference allele (i.e. where there is one alternate allele).
Thin the SNP alleles - using all the alleles availiable in our selected region could make our analysis very slow without adding additional power. Instead, we can run our analysis on just a subset of segregating sites. The
n_snps
parameter is used to ‘thin’ our SNPs. we find that around 100,000 SNPs is a good starting place for analysis of structure.Construct the distance matrix - based on pairwise genetic distances observed between the samples using our biallelic SNPs, we iteratively construct a distance matrix. The default parameter is
city block distance
which is the absolute difference between a pair of genotypes.Plot the neighbour-joining tree - an unrooted tree is plotted where the branch length indicates the degree of differentation between the different leaves and nodes of the tree.
Analysis example - taxonomic structure in Burkina Faso#
We are going to run the NJT on a section of the 3L chromosome arm. 3L is a good choice of chromosome when investigating genetic structure in the An. gambiae complex because it is not confounded by large inversions that could impact the results. We remove the centromere and telomere proximal regions of 3L as these are generally regions of repetitive nucleotides that are difficult to sequence accurately.
We’ll thin the SNPs on 3L down to around 100,000 as that should provide enough data to construct an accurate evolutionary tree. We’ll construct a NJT for a sample set from Burkina Faso.
region = "3L:15,000,000-41,000,000"
n_snps = 100_000
ag3.plot_njt(
region=region,
n_snps=n_snps,
sample_sets = "AG1000G-BF-B",
color="taxon",
);
By colouring our leaves by taxon (color="taxon"
) we can see three taxa each forming clusters of closely related individuals with short branch lengths including An. gambiae, An. coluzzii and An. arabiensis. The branch length connecting An. arabiensis to the other taxa is the longest. This is because it is more distantly related to An. coluzzii and An. gambiae, i.e., there were more pairwise differences and therefore a higher distance value on comparison with the other species. Note that 101,077 SNPs were used here. The number of SNPs for analysis (100,000) is not exact because the specified region for SNP discovery is evenly thinned to provide a close approximation.
Exercise 1.
Re-run the NJT for Burkina Faso, but this time construct the tree using the X chromosome. How does this change the phylogeny? What are the possible reasons for this?
Analysis example - population structure of An. gambiae in East Africa#
Next we’ll investigate population (within taxa) structure using the same region of the 3L chromosome arm as we used for the taxonomic analysis above. Let’s explore whether we observe population structure in Tanzanian An. gambiae. We will restrict our analysis to this taxa using the sample query parameter. Since we are now looking within-species we might want to colour our samples by the location from which they originate.
ag3.plot_njt(
region=region,
n_snps=n_snps,
sample_sets = "AG1000G-TZ",
sample_query="taxon == 'gambiae'",
color="location",
);
In the tree we clearly have two clusters of closely related An. gambiae , one including individuals from Muleba in the northwest and another with individuals from Muheza located on the eastern coast likely resulting from restricted gene flow between the two regions i.e., geographic isolation.
Let’s add An. gambiae data from Kenya into our analysis to see how this relates to the structure observed in Tanzania by first colouring the samples by country.
ag3.plot_njt(
region=region,
n_snps=n_snps,
sample_sets = ["AG1000G-TZ","AG1000G-KE"],
sample_query="taxon == 'gambiae'",
color="country",
);
We can see that the samples from Kenya fall into just one cluster of samples originating from Tanzania. Let’s explore the location of these samples by changing the color parameter to see how this might relate to geography.
ag3.plot_njt(
region=region,
n_snps=n_snps,
sample_sets = ["AG1000G-TZ","AG1000G-KE"],
sample_query="taxon == 'gambiae'",
color="location",
);
Now we can see that the samples from Kenya are from Kilifi on the east coast. The Tanzanian samples with which they cluster originate Muheza, which is also located on the east coast. If we consider the geography of East Africa, the Great Rift Valley and Lake Victoria are potential barriers which might restrict mosquito movement and therefore gene flow between the coast and inland East Africa.
Exercise 2.
Add in the East African population cohort of Mayotte (“AG1000G-FR”) into your NJT of East African populations. Where does this population cohort fall and how would you explain your findings considering this is an island population?
Analysis example - Population structure in West Africa#
Let’s see if we also observe population structure in An. gambiae from “Far West” Africa, including samples from Guinea Bissau, The Gambia, Guinea and Mali.
ag3.plot_njt(
region=region,
n_snps=100_000,
sample_sets=['AG1000G-GM-C', 'AG1000G-GN-B','AG1000G-GW'],
sample_query="taxon == 'gambiae'",
color="country",
);
All individuals of An gambiae from the Far West Africa form a single cluster with mixed geographical origin. There is therefore no population structure detected across the region and individuals are expected to undergo unrestricted gene flow.
Let’s compare what we see for An. gambiae to An. coluzzii from Far West Africa.
ag3.plot_njt(
region=region,
n_snps=100_000,
sample_sets=['AG1000G-GM-C', 'AG1000G-GN-B','AG1000G-GW'],
sample_query="taxon == 'coluzzii'",
color="country",
);
On first glance samples appear to form a single cluster but are clearly ordered by geographical origin, suggesting that An. coluzzii in Far West Africa are locally structured. If we zoom in on the plot, we see that there is in fact a distinction between these samples, with individuals from each country forming clusters seperated by short branches. Therefore, we see evidence for some restricted gene flow among An. coluzzii in Far West Africa. This is in constrast to what we saw from An. gambiae because we did not see any population structure across the same region. The difference we see could reflects differences in species ecology and dispersal. For example, An. gambiae oviposit in temporary water bodies and need to disperse to find these habitats. An. coluzzii however, tend to oviposit in permenant water bodies such as rice paddies and may not need to disperse as often or as far. However, it is worth noting that the situation is not always so simple since there is evidence to suggest An. coluzzii can also undergo long range airborne dispersal (Lehmann et al. 2017).
Exercise 3.
Investigate population structure in An. coluzzii from the West African countries of Burkina Faso, Cote D’Ivoire, Ghana (‘AG1000G-BF-B’, ‘AG1000G-CI’, ‘AG1000G-GH’). Do you observe geographical population structure? How does this compare to the situation in Far West Africa?
Principal Components Analysis#
Another method which is commonly used to visualise genomic variation in the data is principal components analysis (PCA). PCA is a method for reducing the dimensions of the variation found within a dataset to help make interpreting the data easier. Typically we use a region of genomic data (e.g. a chomosome arm) which might have millions of SNPs (from hundreds or thousands of samples). The PCA finds axes through the data that describe its variance, in the case of genomic data, this effectively collapses thousands or millions of dimensions (SNPs) down to a handful of principal components (PCs) that describe structure in the data. On a PCA plot, individuals which have a similar genomic composition are clustered together while those which seperate out on the plot can indicate a degree of population structure due to reproductive isolation.
Let’s take a look at the sample set we used to investigate taxonomic structure in Burkina Faso using PCA. We will use the ag3.pca
function to calculate our PCA. This function returns two objects, so we need to assign two variables to it. The first is a dataframe containing 20 PCs and the second is an array containing the percentage of variance in the data which is explained by each PC.
pca_bf_df, evr_bf = ag3.pca(
region=region,
n_snps=n_snps,
sample_sets="AG1000G-BF-B",
);
Plotting explained variance#
Let’s have a look at the explained variance output (evr_bf
). We can see that this is an array of 20 floats (numbers with decimal points). These are the percentage of variance in the dataset explained by each of our 20 principal components
evr_bf
array([0.02730522, 0.0222005 , 0.01423874, 0.0129367 , 0.01234486,
0.01193132, 0.01170422, 0.01034488, 0.01026332, 0.0102485 ,
0.010171 , 0.01016118, 0.0101288 , 0.010114 , 0.01010154,
0.01005797, 0.01003724, 0.01002416, 0.00999244, 0.0099812 ],
dtype=float32)
We can use the variance explained array to get an intuition for which PCs to examine. The easiest way to do this is to plot our evr_bf
array in the form of a bar chart. We have a handy function in malariagen_data
to do this for us. Again, this is an interactive plot and hovering the pointer over a bar will reveal the exact explained variance percentage.
ag3.plot_pca_variance(evr_bf)
In this example, the first principal component explains around 2.5% of the variance and the second around 2.25%, whereas all of the subsequent principal components explain less than 1.5% of the variance. The absolute magnitude of these values is less important as this will change depending on what dataset we analyse. What is more important, however, is the fact that PCs 1 and 2 relatively explain much more variance than the subsequent PCs. In other words, there is a big step down from PC2 to PC3, then the variance explained flattens off. This is a good indication that these first two PCs are capturing some real structure in the data, and that the rest of the PCs are potentially just noise.
There are various statistical methods for formally testing whether a principal component conveys a real signal of population structure or is just random noise (for example see Patterson et al. 2006 and Forkman et al. 2019). However, for exploratory analyses, looking at the differences in variance explained between the adjacent PCs, and ignoring the tail of PCs where the variance flattens off, is not a bad rule of thumb.
Plotting principal components#
We can visualise our analysis by making a scatter plot of the principal components for each sample, e.g., PC1 and PC2. The first two principal components are often sufficient to explore the data and are used by default when applying the function. Higher principal components can be visualised using the x=
and y=
parameters e.g. x="PC3", y="PC4"
. Here we will colour the points by taxon to allow investigation of taxonomic structure.
ag3.plot_pca_coords(
pca_bf_df,
color="taxon",
title="Taxonomic structure in Burkina Faso"
)
On the PCA plot each point represents an individual sample. Let’s plot the NJT for the Burkina Faso sample set again, so that we can compare our findings.
ag3.plot_njt(
region=region,
n_snps=n_snps,
sample_sets = "AG1000G-BF-B",
color="taxon",
);
Within both the NJT and PCA plot we see three groups seperating out the different taxon. An. arabiensis appears further away on the NJT and PCA plot than An. gambiae and An. coluzzii reflecting their evolutionary history. Since An. gambiae and An. coluzzii are more closely related, we expect them to share more SNPs. However, there is an important difference in how NJT and PCA are interpreted. The length of branches on the NJT are based on a measureable distance. That is, they are scaled to the number of genetic differences, i.e, SNPs, between individuals on the tree. In comparison, the location of individuals on the PCA plot do not represent genetic distance as an absolute measurement. For example, while we can say that An. arabiensis are more genetically different than An. gambiae and An. coluzzii, this is based on relative distance and we cannot say by how much.
Next we’ll try PCA to look at population structure, as we did for the sample set from Tanzania. Similar to the NJT, we will use a sample query to restrict our analysis to An. gambiae. This time, we will colour the PCA plot by location to investigate geographical structure across Tanzania. We will also plot the NJT we computed earlier for comparison.
pca_tz_df, evr_tz = ag3.pca(
region=region,
n_snps=n_snps,
sample_sets="AG1000G-TZ",
sample_query="taxon == 'gambiae'",
);
ag3.plot_pca_coords(
pca_tz_df,
color="location",
title="Population structure in Tanzania"
)
ag3.plot_njt(
region=region,
n_snps=n_snps,
sample_sets = "AG1000G-TZ",
sample_query="taxon == 'gambiae'",
color="location",
);
In the NJT we see two clusters seperating samples from Muleba and Muheza in Tanzania indicating restricted gene flow between inland and coastal An. gambiae. On the PCA plot, we also see An. gambiae from these two locations form their own distinct clusters. Because mosquitoes from Muleba cluster closely on the plot, we can infer that samples from Muleba are more genetically similar to one another than they are to Muheza, and vice versa.
Often, running more than one type of analysis can help with interpretation and in this case, both the results from NJT and PCA are concurrent. We can be more confident that we have found a signal of population structure seperating inland Muleba from coastal Muheza. This is an example of how performing mutliple analysis can help to strengthen the evidence for your findings.
Exercise 4.
Run a PCA to investigate geographical population structure for populations in East Africa i.e., by adding the parameter sample_sets=["AG1000G-TZ","AG1000G-KE", "AG1000G-UG"]
. How would you interpret this plot? Do results concur with findings from the NJT?
Well done!#
In this module we have run neighbour-joining trees across different mosquito cohorts and interpreted the results to discover taxonomic and geographic population structure.
Exercises (English)
Open this notebook in Google Colab and run it for yourself from top to bottom. As you run through the notebook, cell by cell, think about what each cell is doing, and try the practical exercises along the way.
Have go at the practical exercises, but please don’t worry if you don’t have time to do them all during the practical session, and please ask the teaching assistants for help if you are stuck.
Hint: To open the notebook in Google Colab, click the rocket icon at the top of the page, then select “Colab” from the drop-down menu.
Exercises (Français)
Ouvrir ce notebook dans Google Colab et l’exécuter vous-même du début à la fin. Pendant que vous exécutez le notebook, cellule par cellule, pensez à ce que chaque cellule fait et essayez de faire les exercices quand vous les rencontrez.
Essayez de faire les exercices mais ne vous inquiétez pas si vous n’avez pas le temps de tout faire pendant la séance appliquée et n’hésitez pas à demander aux enseignants assistants si vous avez besoin d’aide parce que vous êtes bloqués.
Indice: Pour ouvrir le notebook dans Google Colab, cliquer sur l’icône de fusée au sommet de cette page puis choisissez “Colab” dans le menu déroulant.