Workshop 5 - Training course in data analysis for genomic surveillance of African malaria vectors
Module 4 - Heterozygosity and runs of homozygosity#
Theme: analysis
In this module we are going to investigate heterozygosity and homozygosity using plots across the genome and runs of homozygosity (ROH). These analyses can help us understand more about the demographic history of a mosquito population.
Learning Objectives#
At the end of this module, you will be able to:
Understand and be able to calculate heterozygosity
Plot and interpret the heterozygosity across the genome for individual samples
Explain ROH analysis
Run, plot and interpret ROH
Lecture#
English#
Français#
Please note that the code in the cells below might differ from that shown in the video. This can happen because Python packages and their dependencies change due to updates, necessitating tweaks to the code.
Setup#
First, let’s begin by installing and importing some Python packages, and configuring access to Anopheles genomic data from the MalariaGEN Ag3.0 data resource.
%pip install -q --no-warn-conflicts malariagen_data
Note: you may need to restart the kernel to use updated packages.
import malariagen_data
import warnings
warnings.simplefilter(action='ignore', category=Warning)
Note that authentication is required to access data through the package, please follow the instructions here.
ag3 = malariagen_data.Ag3()
ag3
MalariaGEN Ag3 API client | |
---|---|
Please note that data are subject to terms of use, for more information see the MalariaGEN website or contact support@malariagen.net. See also the Ag3 API docs. | |
Storage URL | gs://vo_agam_release/ |
Data releases available | 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9 |
Results cache | None |
Cohorts analysis | 20240418 |
AIM analysis | 20220528 |
Site filters analysis | dt_20200416 |
Software version | malariagen_data 9.0.0 |
Client location | unknown |
Check your Client location (virtual machine or VM) is located near to the data in the US. If this isn’t the case, use Runtime - Disconnect and delete runtime
from the menu, then re-run the notebook to generate a new VM.
Heterozygosity in wild populations#
Mosquitoes are diploid, so an individual carries two copies of each autosome (non-sex chromosome), one from each parent. This means that each autosomal nucleotide position in the genome can be homozygous, where the nucleotide from each chromosome is the same (e.g., A/A, C/C, T/T or G/G), or heterozygous, where the nucleotides differ (e.g., A/T, C/G, etc.).
In the figure below, we see that at a single genomic position where both parents are heterozygous, the probability of heterozygous offspring is 0.5 (1/2). If only one of the parents in a mating was heterozygous then the probability of heterozygous offspring would decrease to 0.25 (1/4).
Analysing heterozygosity across the genome of an individual can give an indication how inbred an individual is. If heterozygosity is generally low, or there are regions with depressed heterozygosity, then the parents of the individual were likely closely related. This can happen by chance of course, but if we find that many individuals from a given population have low heterozygosity, then the population may be small or have recently been through a bottleneck (we’ll come back to that).
Let’s plot heterozygosity across the genome of some of our mosquito samples and learn how to interpret the figures.
First, we should load the sample metadata to help us pick some samples.
sample_df = ag3.sample_metadata()
sample_df.query("release == '3.0' and country == 'Burkina Faso'").head()
sample_id | partner_sample_id | contributor | country | location | year | month | latitude | longitude | sex_call | ... | admin1_name | admin1_iso | admin2_name | taxon | cohort_admin1_year | cohort_admin1_month | cohort_admin1_quarter | cohort_admin2_year | cohort_admin2_month | cohort_admin2_quarter | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
15094 | AB0085-Cx | BF2-4 | Austin Burt | Burkina Faso | Pala | 2012 | 7 | 11.151 | -4.235 | F | ... | Hauts-Bassins | BF-09 | Houet | gambiae | BF-09_gamb_2012 | BF-09_gamb_2012_07 | BF-09_gamb_2012_Q3 | BF-09_Houet_gamb_2012 | BF-09_Houet_gamb_2012_07 | BF-09_Houet_gamb_2012_Q3 |
15095 | AB0086-Cx | BF2-6 | Austin Burt | Burkina Faso | Pala | 2012 | 7 | 11.151 | -4.235 | F | ... | Hauts-Bassins | BF-09 | Houet | gambiae | BF-09_gamb_2012 | BF-09_gamb_2012_07 | BF-09_gamb_2012_Q3 | BF-09_Houet_gamb_2012 | BF-09_Houet_gamb_2012_07 | BF-09_Houet_gamb_2012_Q3 |
15096 | AB0087-C | BF3-3 | Austin Burt | Burkina Faso | Bana Village | 2012 | 7 | 11.233 | -4.472 | F | ... | Hauts-Bassins | BF-09 | Houet | coluzzii | BF-09_colu_2012 | BF-09_colu_2012_07 | BF-09_colu_2012_Q3 | BF-09_Houet_colu_2012 | BF-09_Houet_colu_2012_07 | BF-09_Houet_colu_2012_Q3 |
15097 | AB0088-C | BF3-5 | Austin Burt | Burkina Faso | Bana Village | 2012 | 7 | 11.233 | -4.472 | F | ... | Hauts-Bassins | BF-09 | Houet | coluzzii | BF-09_colu_2012 | BF-09_colu_2012_07 | BF-09_colu_2012_Q3 | BF-09_Houet_colu_2012 | BF-09_Houet_colu_2012_07 | BF-09_Houet_colu_2012_Q3 |
15098 | AB0089-Cx | BF3-8 | Austin Burt | Burkina Faso | Bana Village | 2012 | 7 | 11.233 | -4.472 | F | ... | Hauts-Bassins | BF-09 | Houet | coluzzii | BF-09_colu_2012 | BF-09_colu_2012_07 | BF-09_colu_2012_Q3 | BF-09_Houet_colu_2012 | BF-09_Houet_colu_2012_07 | BF-09_Houet_colu_2012_Q3 |
5 rows × 32 columns
For our first heterozygosity plot, we’ll pick a wild-caught Anopheles gambiae sample from Burkina Faso, “AB0085-Cx”.
We can use the plot_heterozygosity()
function to visualise heterozygosity in a single sample over a chromosome arm. Let’s have a look at what parameters are required by theplot_heterozygosity()
function.
ag3.plot_heterozygosity?
Signature:
ag3.plot_heterozygosity(
sample: Union[str, int, List[Union[str, int]], Tuple[Union[str, int], ...]],
region: Union[str, malariagen_data.util.Region, Mapping],
window_size: int = 20000,
y_max: float = 0.03,
circle_kwargs: Optional[Mapping] = None,
site_mask: str = 'default',
sample_set: Optional[str] = None,
sizing_mode: Literal['fixed', 'stretch_width', 'stretch_height', 'stretch_both', 'scale_width', 'scale_height', 'scale_both'] = 'stretch_width',
width: Optional[int] = None,
track_height: int = 170,
genes_height: int = 90,
show: bool = True,
output_backend: Literal['canvas', 'webgl', 'svg'] = 'webgl',
) -> Optional[bokeh.model.model.Model]
Docstring:
Plot windowed heterozygosity for a single sample over a genome region.
Parameters
----------
sample : str or int or list of str or int or tuple of str or int
Sample identifier or index within sample set. Multiple values can also
be provided as a list or tuple.
region : str or Region or Mapping
Region of the reference genome. Can be a contig name, region string
(formatted like "{contig}:{start}-{end}"), or identifier of a genome
feature such as a gene or transcript.
window_size : int, optional, default: 20000
Number of sites per window.
y_max : float, optional, default: 0.03
Y axis limit.
circle_kwargs : Mapping or None, optional
Passed through to bokeh circle() function.
site_mask : str, optional, default: 'default'
Which site filters mask to apply. See the `site_mask_ids` property for
available values.
sample_set : str or None, optional
Sample set identifier.
sizing_mode : {'fixed', 'stretch_width', 'stretch_height', 'stretch_both', 'scale_width', 'scale_height', 'scale_both'}, optional, default: 'stretch_width'
Bokeh plot sizing mode, see also https://docs.bokeh.org/en/latest/docs
/user_guide/basic/layouts.html#sizing-modes.
width : int or None, optional
Plot width in pixels (px).
track_height : int, optional, default: 170
Main track height in pixels (px).
genes_height : int, optional, default: 90
Genes track height in pixels (px).
show : bool, optional, default: True
If true, show the plot. If False, do not show the plot, but return the
figure.
output_backend : {'canvas', 'webgl', 'svg'}, optional, default: 'webgl'
Specify an output backend to render a plot area onto. See also
https://docs.bokeh.org/en/latest/docs/user_guide/output/webgl.html.
Returns
-------
Model or None
A bokeh figure (only returned if show=False).
File: /home/conda/developer/2835ffd46e5535e81e9672d0828c143d4f1444491ac44f0aeddd90156919e446-20240426-081151-481742-64-training-nb-maintenance-mgen-9.0.0/lib/python3.10/site-packages/malariagen_data/anopheles.py
Type: method
The Anopheles gambiae genome is relatively large, e.g., there are 53 million nucleotide positions on the 3R chromosome arm. Plotting millions of points is just not practical either computationally or when it comes to interpretation. Consequently we use a windowed approach, where we define a window size in number of sites (genomic positions), then move this window along the chromosome arm, taking the mean of our statistic (here, heterozygosity) in every window. This gives us a summary of data that is easier to interpret.
Here we’ll use a window size of 10,000 sites. As we know that this sample is Anophele gambiae (see “taxon” column in metadata), we should use the “gamb_colu” site mask as this will give us just the sites that pass site filters, i.e. sites we are confident in calling genotypes, for An. gambiae or An. coluzzii.
ag3.plot_heterozygosity(
sample="AB0085-Cx",
region="3R",
site_mask="gamb_colu",
window_size=10_000,
);
Each marker shows the heterozygosity in a single window, where heterozygosity is the fraction of sites with a heterozygous genotype call.
The heterozygosity we see for individual AB0085-Cx is typical of many Anopheles samples, with most windows having a heterozygosity of 0.015 (1.5%). We see just a few windows with very low heterozygosity along most of the contig, with a drop at the telomeric end (left) and a decline at the centromeric end (right).
Note that heterozygosity is typically lower in all samples in both telomeric and pericentromeric regions, because of reduced rates of recombination.
How does this sample’s heterozygosity, across the 3R contig, compare with one of the lab colony samples?
Heterozygosity in lab colonies#
A lab colony is usually founded by taking a small number of gravid females from the wild and rearing the offspring in cage. Genetically this differs from wild mosquito populations in two major ways.
1. Population bottleneck.#
As the lab colony is founded by only a few wild individuals, it is highly unlikely that all the genetic diversity found in the wild population will be represented in the colony. This is known as a founder effect which is a kind of population bottleneck. The term bottleneck refers to the idea that the original population is squeezed through the slim neck of a bottle, so only some individuals/genetic-variants make it through to the new smaller population.
2. Population size.#
In the wild, a mosquito population may consist of millions of individuals and even after a severe population bottleneck, like an island founding, it would be expected that the population would over time become large again. However, in a colony, the population size is kept artifically small to stop cage overcrowding and associated drops in colony fitness. A genetic side-effect of this very small population size is an increase in the power of genetic drift. Genetic drift is the random fluctuations in allele frequencies due to chance, and in small populations it has a much stronger effect than in large populations. This means that it is much more likely for alleles to become fixed (100%) in a colony than in a natural population, and so over a few generations the already low heterozygosity (due to bottleneck) is likely to drop even further.
Let’s choose a lab colony sample from the lab colony cross metadata to look at.
crosses_df = ag3.cross_metadata()
crosses_df.query("role == 'parent'")
cross | sample_id | father_id | mother_id | sex | role | |
---|---|---|---|---|---|---|
0 | 18-5 | AD0142-C | NaN | NaN | F | parent |
1 | 18-5 | AD0143-C | NaN | NaN | M | parent |
22 | 29-2 | AD0231-C | NaN | NaN | F | parent |
23 | 29-2 | AD0232-C | NaN | NaN | M | parent |
44 | 36-9 | AD0254-C | NaN | NaN | F | parent |
45 | 36-9 | AD0255-C | NaN | NaN | M | parent |
66 | 37-3 | AD0277-C | NaN | NaN | F | parent |
67 | 37-3 | AD0278-C | NaN | NaN | M | parent |
88 | 42-4 | AD0305-C | NaN | NaN | F | parent |
89 | 42-4 | AD0306-C | NaN | NaN | M | parent |
104 | 45-1 | AD0324-C | NaN | NaN | F | parent |
105 | 45-1 | AD0325-C | NaN | NaN | M | parent |
126 | 46-9 | AD0347-C | NaN | NaN | F | parent |
127 | 46-9 | AD0348-C | NaN | NaN | M | parent |
148 | 47-6 | AD0371-C | NaN | NaN | F | parent |
149 | 47-6 | AD0372-C | NaN | NaN | M | parent |
170 | 73-2 | AD0422-C | NaN | NaN | F | parent |
171 | 73-2 | AD0423-C | NaN | NaN | M | parent |
191 | 78-2 | AD0447-C | NaN | NaN | F | parent |
192 | 78-2 | AD0448-C | NaN | NaN | M | parent |
211 | 80-2 | AD0473-C | NaN | NaN | F | parent |
212 | 80-2 | AD0474-C | NaN | NaN | M | parent |
233 | B5 | AC0382-C | NaN | NaN | F | parent |
234 | B5 | AC0416-C | NaN | NaN | M | parent |
246 | K2 | AC0300-C | NaN | NaN | F | parent |
247 | K2 | AC0406-C | NaN | NaN | M | parent |
263 | K4 | AC0317-C | NaN | NaN | F | parent |
264 | K4 | AC0398-C | NaN | NaN | M | parent |
281 | K6 | AC0334-C | NaN | NaN | F | parent |
282 | K6 | AC0398-C | NaN | NaN | M | parent |
Let’s look at sample AD0305-C which was the mother in cross 42-4.
ag3.plot_heterozygosity(
sample="AD0305-C",
region="3R",
site_mask="gamb_colu",
window_size=10_000,
);
If we compare this to our wild-caught sample we can see that although some of the windows still have heterozygosity ~1.5%, there are also some large regions of the chromosome where heterozygosity is almost zero. These are known as “runs of homozygosity”.
Let’s also look at sample AD0306-C which was the father of cross 42-4.
ag3.plot_heterozygosity(
sample="AD0306-C",
region="3R",
site_mask="gamb_colu",
window_size=10_000,
);
This sample has even more runs of homozygosity. Hardly any windows remain with the normal level of heterozygosity. This is a symptom of significant inbreeding.
Let’s now compare the heterozygosity of these two parents with the heterozygosity in their progeny.
crosses_df.query("cross == '42-4' and role == 'progeny'")
cross | sample_id | father_id | mother_id | sex | role | |
---|---|---|---|---|---|---|
90 | 42-4 | AD0309-C | AD0306-C | AD0305-C | M | progeny |
91 | 42-4 | AD0310-C | AD0306-C | AD0305-C | M | progeny |
92 | 42-4 | AD0311-C | AD0306-C | AD0305-C | M | progeny |
93 | 42-4 | AD0312-C | AD0306-C | AD0305-C | M | progeny |
94 | 42-4 | AD0313-C | AD0306-C | AD0305-C | M | progeny |
95 | 42-4 | AD0314-C | AD0306-C | AD0305-C | M | progeny |
96 | 42-4 | AD0315-C | AD0306-C | AD0305-C | M | progeny |
97 | 42-4 | AD0316-C | AD0306-C | AD0305-C | F | progeny |
98 | 42-4 | AD0317-C | AD0306-C | AD0305-C | M | progeny |
99 | 42-4 | AD0318-C | AD0306-C | AD0305-C | M | progeny |
100 | 42-4 | AD0319-C | AD0306-C | AD0305-C | F | progeny |
101 | 42-4 | AD0320-C | AD0306-C | AD0305-C | F | progeny |
102 | 42-4 | AD0322-C | AD0306-C | AD0305-C | F | progeny |
103 | 42-4 | AD0323-C | AD0306-C | AD0305-C | F | progeny |
Let’s pick one of the progeny, AD0309-C.
ag3.plot_heterozygosity(
sample="AD0309-C",
region="3R",
site_mask="gamb_colu",
window_size=10_000,
);
This sample has normal levels of heterozygosity, similar to wild samples. So how did this happen if both the parents showed signs of inbreeding?
The progeny of cross 42-4 are the result of mating a female from one lab colony to a male from a different lab colony. This F1 pairing of two different bottlenecked populations should restore heterozygosity as these cross progeny are effectively outbred, i.e., their parents are not closely related.
This is roughly analogous to breeding between two different pedigree breeds, such as dog breeds for example. Each of the parents may be relatively inbred due to many generations of mating within a small population, but the two different breeds are not closely related, and so their progeny show normal levels of heterozygosity.
Heterozygosity in island populations#
As we have learnt, taking a single individual (or small number of individuals) from the wild to seed a lab colony results in a reduction in heterozygosity as it results in a bottleneck event. A similar effect can also occur naturally when an animal populates an island from the mainland.
An island population will likely be founded with only a fraction of the mainland population diversity, plus island populations would tend to be smaller than those on the mainland, increasing inbreeding and the effect of drift (though not to the levels seen in the small lab colonies).
In Ag3.0 we have samples from the oceanic island Mayotte, ~1000km from mainland Africa and a similar distance from Madagascar. Due to these distances, it is likely that the Anopheles gambiae mosquito population was founded by just a few individuals. Let’s look at heterozygosity in a Mayotte sample.
sample_df.query("release == '3.0' and country == 'Mayotte'").head()
sample_id | partner_sample_id | contributor | country | location | year | month | latitude | longitude | sex_call | ... | admin1_name | admin1_iso | admin2_name | taxon | cohort_admin1_year | cohort_admin1_month | cohort_admin1_quarter | cohort_admin2_year | cohort_admin2_month | cohort_admin2_quarter | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
16063 | AP0007-C | 60 | Igor Sharakhov | Mayotte | Mtsamboro Forest Reserve | 2011 | -1 | -12.703 | 45.081 | F | ... | Mayotte | MYT | Mtsamboro | gambiae | MYT_gamb_2011 | MYT_gamb_2011 | MYT_gamb_2011 | MYT_Mtsamboro_gamb_2011 | MYT_Mtsamboro_gamb_2011 | MYT_Mtsamboro_gamb_2011 |
16064 | AP0021-C | 92 | Igor Sharakhov | Mayotte | Karihani Lake | 2011 | -1 | -12.797 | 45.122 | F | ... | Mayotte | MYT | Tsingoni | gambiae | MYT_gamb_2011 | MYT_gamb_2011 | MYT_gamb_2011 | MYT_Tsingoni_gamb_2011 | MYT_Tsingoni_gamb_2011 | MYT_Tsingoni_gamb_2011 |
16065 | AP0019-C | 88 | Igor Sharakhov | Mayotte | Mtsanga Charifou | 2011 | -1 | -12.991 | 45.156 | M | ... | Mayotte | MYT | Kani-Kéli | gambiae | MYT_gamb_2011 | MYT_gamb_2011 | MYT_gamb_2011 | MYT_Kani-Kéli_gamb_2011 | MYT_Kani-Kéli_gamb_2011 | MYT_Kani-Kéli_gamb_2011 |
16066 | AP0020-C | 78 | Igor Sharakhov | Mayotte | Mtsanga Charifou | 2011 | -1 | -12.991 | 45.156 | M | ... | Mayotte | MYT | Kani-Kéli | gambiae | MYT_gamb_2011 | MYT_gamb_2011 | MYT_gamb_2011 | MYT_Kani-Kéli_gamb_2011 | MYT_Kani-Kéli_gamb_2011 | MYT_Kani-Kéli_gamb_2011 |
16067 | AP0009-C | 62 | Igor Sharakhov | Mayotte | Combani | 2011 | -1 | -12.779 | 45.143 | M | ... | Mayotte | MYT | Tsingoni | gambiae | MYT_gamb_2011 | MYT_gamb_2011 | MYT_gamb_2011 | MYT_Tsingoni_gamb_2011 | MYT_Tsingoni_gamb_2011 | MYT_Tsingoni_gamb_2011 |
5 rows × 32 columns
ag3.plot_heterozygosity(
sample="AP0007-C",
region="3R",
site_mask="gamb_colu",
window_size=10_000,
);
We can see from this plot, that the demography of the Mayotte population is likely different from any we have looked at so far. Though the peak heterozygosity across the 3R contig is ~1.5%, there is much more spread, with most windows falling between 1.5% and 0.5% heterozygosity. Interpretations of this result could be that the island population was founded some time ago, and heterozygosity is returning to levels like we saw on the mainland e.g. Burkina Faso, and/or this is the stable level of heterozygosity for a small population size of Anopheles mosquitoes.
Exercise 1 (English)
We have sampled another island population in Ag3.0, Bioko (part of Equatorial Guinea). Though, unlike Mayotte, Bioko is only ~30km away from mainland Africa.
Uncomment the code below and run a heterozygosity analysis for a sample from Bioko.
How does this individual compare to the sample from Mayotte we looked at?
Can you suggest why they might look different?
Exercice 1 (Français)
Nous avons des échantillons provenant d’une autre population insulaire dans Ag3.0: Bioko (qui fait partie de la Guinée-Équatoriale). Toutefois, contrairement à Mayotte, Bioko ne se trouve qu’à ~30 km du continent africain.
Décommenter le code ci-dessous et effectuer une analyse de l’hétérozygosité pour un échantillon venant de Bioko.
En quoi est-ce que cet individu est différent de celui que nous avons regardé pour Mayotte?
Pouvez-vous proposer une explication pour ces différences?
# sample_df.query("release == '3.0' and location == 'Bioko'")
# ag3.plot_heterozygosity(
# sample="???",
# region="3R",
# site_mask="gamb_colu",
# window_size=10_000,
# );
Heterozygosity in an atypical wild population#
One of the reasons for running population genetic analyses like heterozygosity, is to detect populations that don’t act as we expect them to, as this means we might be missing something about their demography which could be relevant to vector control. Samples of the putative cryptic species gcx3, collected in Kenya, have strange patterns of heterozygosity for wild samples, let’s take a look and think about what the causes could be.
sample_df.query("release == '3.0' and taxon == 'gcx3' and country == 'Kenya'").head()
sample_id | partner_sample_id | contributor | country | location | year | month | latitude | longitude | sex_call | ... | admin1_name | admin1_iso | admin2_name | taxon | cohort_admin1_year | cohort_admin1_month | cohort_admin1_quarter | cohort_admin2_year | cohort_admin2_month | cohort_admin2_quarter | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
16881 | AK0049-C | KIL38 | Janet Midega | Kenya | Kilifi | 2012 | -1 | -3.511 | 39.909 | F | ... | Kilifi | KE-14 | Kilifi North | gcx3 | KE-14_gcx3_2012 | KE-14_gcx3_2012 | KE-14_gcx3_2012 | KE-14_Kilifi-North_gcx3_2012 | KE-14_Kilifi-North_gcx3_2012 | KE-14_Kilifi-North_gcx3_2012 |
16882 | AK0050-C | KIL39 | Janet Midega | Kenya | Kilifi | 2012 | -1 | -3.511 | 39.909 | F | ... | Kilifi | KE-14 | Kilifi North | gcx3 | KE-14_gcx3_2012 | KE-14_gcx3_2012 | KE-14_gcx3_2012 | KE-14_Kilifi-North_gcx3_2012 | KE-14_Kilifi-North_gcx3_2012 | KE-14_Kilifi-North_gcx3_2012 |
16883 | AK0051-C | KIL40 | Janet Midega | Kenya | Kilifi | 2012 | -1 | -3.511 | 39.909 | F | ... | Kilifi | KE-14 | Kilifi North | gcx3 | KE-14_gcx3_2012 | KE-14_gcx3_2012 | KE-14_gcx3_2012 | KE-14_Kilifi-North_gcx3_2012 | KE-14_Kilifi-North_gcx3_2012 | KE-14_Kilifi-North_gcx3_2012 |
16884 | AK0052-C | KIL41 | Janet Midega | Kenya | Kilifi | 2012 | -1 | -3.511 | 39.909 | F | ... | Kilifi | KE-14 | Kilifi North | gcx3 | KE-14_gcx3_2012 | KE-14_gcx3_2012 | KE-14_gcx3_2012 | KE-14_Kilifi-North_gcx3_2012 | KE-14_Kilifi-North_gcx3_2012 | KE-14_Kilifi-North_gcx3_2012 |
16888 | AK0062-C | KIL47 | Janet Midega | Kenya | Kilifi | 2012 | -1 | -3.511 | 39.909 | F | ... | Kilifi | KE-14 | Kilifi North | gcx3 | KE-14_gcx3_2012 | KE-14_gcx3_2012 | KE-14_gcx3_2012 | KE-14_Kilifi-North_gcx3_2012 | KE-14_Kilifi-North_gcx3_2012 | KE-14_Kilifi-North_gcx3_2012 |
5 rows × 32 columns
ag3.plot_heterozygosity(
sample="AK0050-C",
region="3R",
site_mask="gamb_colu",
window_size=10_000,
);
So what is going on here? This wild caught sample has long runs of homozygosity - regions where heterozygosity is close to zero - more so than even some of our lab colony samples. It looks like this individual has come from a wild population that has been through a severe bottleneck recently. But what might cause that in a natural population?
One possibility is that a vector control campaign that ran in Kenya prior to the date of collection was so successful that it rapidly and drastically reduced the population size of these mosquitoes. Unfortunately, this is just conjecture as we don’t have longitudinal samples (before, during, after the control campaign) to test this hypothesis.
Fortunately many of our sample sets going forward do contain longitudinally sampled mosquitoes, some in conjunction with vector control roll-out projects. Monitoring heterozygosity with these samples could potentially indicate if a control method e.g. next-generation bednets, affects mosquito population sizes and thus provide an indication of its effectiveness.
Exercise 2 (English)
Another location where we have collected samples with atypical diversity statistics is Tanzania. Uncomment the code below and analyse the hetereozygosity of a gcx3 sample from Tanzania.
Does your Tanzanian sample’s heterozygosity differ from the sample from Kenya we looked at above? If so, how is it different?
Exercice 2 (Français)
Un autre lieu où nous avons des échantillons avec des statistiques de la diversité génétiques atypiques est la Tanzanie. Décommenter le code ci-dessous et analyser l’hétérozygotie d’un échantillon gcx3 tanzanien.
Est-ce que l’hétérozygotie de votre échantillon tanzanien est différente de celle de l’échantillon kenyan que nous avons regardé en premier? Si oui, comment?
# sample_df.query("release == '3.0' and taxon == 'gcx3' and country == 'Tanzania'")
# ag3.plot_heterozygosity(
# sample="???",
# region="3R",
# site_mask="gamb_colu",
# window_size=10_000,
# );
Runs of homozygosity (ROH)#
Plotting heterozygosity across individual genomes provides a great visual way to investigate population demography. For example, plotting heterozygosity across 3R suggests that the island population of Mayotte differs from mainland populations, however, it doesn’t tell us exactly how it differs or by how much.
One analysis that provides a way of quantifying heterozygosity, or rather the opposite side of the same coin - homozygosity, is to measure runs of homozygosity or ROH. As we saw in our Kenyan heterozygosity plot above, runs of homozygosity are where we see multiple windows in row with near-zero heterozygosity (a.k.a. homozygosity).
To infer and measure these homozygous regions we can use a hidden markov model (HMM). HMMs were explained in workshop 2, module 3, where it was used to detect copy number variation from coverage data. But briefly, HMM is an algorithm that allows large, noisy datasets (such as heterozygosity), to be quickly scanned for patterns (such as runs of homozygosity).
To infer and visualise ROH, we can use the function plot_roh()
. Let’s have a look at what parameters we need to specify for the plot_roh()
function.
ag3.plot_roh?
Signature:
ag3.plot_roh(
sample: Union[str, int],
region: Union[str, malariagen_data.util.Region, Mapping],
window_size: int = 20000,
site_mask: str = 'default',
sample_set: Optional[str] = None,
phet_roh: float = 0.001,
phet_nonroh: Tuple[float, ...] = (0.003, 0.01),
transition: float = 0.001,
y_max: float = 0.03,
sizing_mode: Literal['fixed', 'stretch_width', 'stretch_height', 'stretch_both', 'scale_width', 'scale_height', 'scale_both'] = 'stretch_width',
width: Optional[int] = None,
heterozygosity_height: int = 170,
roh_height: int = 40,
genes_height: int = 90,
circle_kwargs: Optional[Mapping] = None,
show: bool = True,
output_backend: Literal['canvas', 'webgl', 'svg'] = 'webgl',
) -> Optional[bokeh.model.model.Model]
Docstring:
Plot windowed heterozygosity and inferred runs of homozygosity for a
single sample over a genome region.
Parameters
----------
sample : str or int
Sample identifier or index within sample set.
region : str or Region or Mapping
Region of the reference genome. Can be a contig name, region string
(formatted like "{contig}:{start}-{end}"), or identifier of a genome
feature such as a gene or transcript.
window_size : int, optional, default: 20000
Number of sites per window.
site_mask : str, optional, default: 'default'
Which site filters mask to apply. See the `site_mask_ids` property for
available values.
sample_set : str or None, optional
Sample set identifier.
phet_roh : float, optional, default: 0.001
Probability of observing a heterozygote in a ROH.
phet_nonroh : tuple of float, optional, default: (0.003, 0.01)
One or more probabilities of observing a heterozygote outside a ROH.
transition : float, optional, default: 0.001
Probability of moving between states. A larger window size may call
for a larger transitional probability.
y_max : float, optional, default: 0.03
Y axis limit.
sizing_mode : {'fixed', 'stretch_width', 'stretch_height', 'stretch_both', 'scale_width', 'scale_height', 'scale_both'}, optional, default: 'stretch_width'
Bokeh plot sizing mode, see also https://docs.bokeh.org/en/latest/docs
/user_guide/basic/layouts.html#sizing-modes.
width : int or None, optional
Plot width in pixels (px).
heterozygosity_height : int, optional, default: 170
Plot height in pixels (px).
roh_height : int, optional, default: 40
Plot height in pixels (px).
genes_height : int, optional, default: 90
Genes track height in pixels (px).
circle_kwargs : Mapping or None, optional
Passed through to bokeh circle() function.
show : bool, optional, default: True
If true, show the plot. If False, do not show the plot, but return the
figure.
output_backend : {'canvas', 'webgl', 'svg'}, optional, default: 'webgl'
Specify an output backend to render a plot area onto. See also
https://docs.bokeh.org/en/latest/docs/user_guide/output/webgl.html.
Returns
-------
Model or None
A bokeh figure (only returned if show=False).
File: /home/conda/developer/2835ffd46e5535e81e9672d0828c143d4f1444491ac44f0aeddd90156919e446-20240426-081151-481742-64-training-nb-maintenance-mgen-9.0.0/lib/python3.10/site-packages/malariagen_data/anopheles.py
Type: method
Let’s now plot ROH for the Kenyan sample we looked at earlier.
ag3.plot_roh(
sample="AK0050-C",
region="3R",
site_mask="gamb_colu",
window_size=10_000,
);
The ROH analysis starts with the heterozygosity data that we plotted earlier and the top panel of the plot here is identical to the output of the plot_heterozygosity()
function. The results from the HMM ROH analysis have been plotted below this, where we can see 23 blocks of ROH have been detected.
To access a Pandas DataFrame of the underlying ROH data, we can use another function, roh_hmm
, with exactly the same parameters as before. The output DataFrame shows us the start, end and length of each ROH inferred by the HMM.
roh_df = ag3.roh_hmm(
sample="AK0050-C",
region="3R",
site_mask="gamb_colu",
window_size=10_000,
)
roh_df
sample_id | contig | roh_start | roh_stop | roh_length | roh_is_marginal | |
---|---|---|---|---|---|---|
0 | AK0050-C | 3R | 180 | 420337 | 420157 | True |
1 | AK0050-C | 3R | 432202 | 1690670 | 1258468 | False |
2 | AK0050-C | 3R | 1723873 | 2003686 | 279813 | False |
3 | AK0050-C | 3R | 6108642 | 6123679 | 15037 | False |
4 | AK0050-C | 3R | 11288732 | 12567018 | 1278286 | False |
5 | AK0050-C | 3R | 12580510 | 14932888 | 2352378 | False |
6 | AK0050-C | 3R | 17090690 | 17115221 | 24531 | False |
7 | AK0050-C | 3R | 18307285 | 18605951 | 298666 | False |
8 | AK0050-C | 3R | 19013506 | 19039488 | 25982 | False |
9 | AK0050-C | 3R | 19451384 | 20434062 | 982678 | False |
10 | AK0050-C | 3R | 20449042 | 21095747 | 646705 | False |
11 | AK0050-C | 3R | 25613447 | 28241877 | 2628430 | False |
12 | AK0050-C | 3R | 28522124 | 29437036 | 914912 | False |
13 | AK0050-C | 3R | 29969536 | 33355367 | 3385831 | False |
14 | AK0050-C | 3R | 33389150 | 38051606 | 4662456 | False |
15 | AK0050-C | 3R | 38094971 | 39265672 | 1170701 | False |
16 | AK0050-C | 3R | 39473670 | 39486954 | 13284 | False |
17 | AK0050-C | 3R | 39946985 | 40124985 | 178000 | False |
18 | AK0050-C | 3R | 40178007 | 41108589 | 930582 | False |
19 | AK0050-C | 3R | 41149111 | 45242143 | 4093032 | False |
20 | AK0050-C | 3R | 45275124 | 52790662 | 7515538 | False |
21 | AK0050-C | 3R | 52850062 | 52996177 | 146115 | False |
22 | AK0050-C | 3R | 53044868 | 53080557 | 35689 | False |
In the Ag1000G phase 1 paper we computed ROH for all samples, allowing us to plot and compare them.
In the plot, each marker represents an individual mosquito. The Y-axis is the count of runs of homozygosity and the X-axis is the fraction of the genome composed of runs of homozygosity. Because these kind of data are inherently noisy, we filtered the ROH to only include those longer than 100,000 bp (100 kbp).
Lab colony parents are shown in black. We can see how most of them have lots of runs of homozygosity which make up much of their genomes, while most wild caught cohorts have much smaller fractions of their genomes made up of ROH.
Exercise 3 (English)
What country might the grey samples come from (Hint: we’ve looked at strange patterns there today).
Why might two of the lab colony parents (labelled “G”) have both low ROH count and fraction ROH?
Exercice 3 (Français)
De quel pays peuvent venir les échantillons gris du diagramme ci-dessus? (Indice: nous avons observé des motifs étranges dans ce pays aujourd’hui)
Pourquoi est-ce que deux des parents venant de colonies (ceux étiquetés “G”) ont un compte des ROHs et une fraction de ROH si faibles?
Let’s run our Kenyan sample through the same process we used to create the plot above.
First, filter the ROH DataFrame to just include runs of homozygosity longer than 100 kbp.
roh_df_100kb = roh_df.query("roh_length > 100_000")
roh_df_100kb
sample_id | contig | roh_start | roh_stop | roh_length | roh_is_marginal | |
---|---|---|---|---|---|---|
0 | AK0050-C | 3R | 180 | 420337 | 420157 | True |
1 | AK0050-C | 3R | 432202 | 1690670 | 1258468 | False |
2 | AK0050-C | 3R | 1723873 | 2003686 | 279813 | False |
4 | AK0050-C | 3R | 11288732 | 12567018 | 1278286 | False |
5 | AK0050-C | 3R | 12580510 | 14932888 | 2352378 | False |
7 | AK0050-C | 3R | 18307285 | 18605951 | 298666 | False |
9 | AK0050-C | 3R | 19451384 | 20434062 | 982678 | False |
10 | AK0050-C | 3R | 20449042 | 21095747 | 646705 | False |
11 | AK0050-C | 3R | 25613447 | 28241877 | 2628430 | False |
12 | AK0050-C | 3R | 28522124 | 29437036 | 914912 | False |
13 | AK0050-C | 3R | 29969536 | 33355367 | 3385831 | False |
14 | AK0050-C | 3R | 33389150 | 38051606 | 4662456 | False |
15 | AK0050-C | 3R | 38094971 | 39265672 | 1170701 | False |
17 | AK0050-C | 3R | 39946985 | 40124985 | 178000 | False |
18 | AK0050-C | 3R | 40178007 | 41108589 | 930582 | False |
19 | AK0050-C | 3R | 41149111 | 45242143 | 4093032 | False |
20 | AK0050-C | 3R | 45275124 | 52790662 | 7515538 | False |
21 | AK0050-C | 3R | 52850062 | 52996177 | 146115 | False |
Now we can count the number of runs of homozygosity longer that 100 kbp…
roh_count = len(roh_df_100kb)
roh_count
18
…and, calculate the fraction ROH (fraction of the genome composed of ROH longer than 100_000 bp), by summing the “roh_length” column, then dividing this by the total length of the region.
roh_total_length = roh_df_100kb["roh_length"].sum()
roh_total_length
33142748
froh = roh_total_length / len(ag3.genome_sequence(region="3R"))
froh
0.6229759752712953
So, we can see that this individual is incredibly inbred for a wild caught mosquito, over 60% of the 3R chromosome arm is composed of long runs of homozygosity!
N.B., our ROH count will be different from the result shown for this individual in the Ag1000G phase 1 paper, because the paper combines ROH across multiple chromosomes, whereas here we have just calculated it across a single chromosome arm.
Exercise 4 (English)
Choose another sample from any mosquito population, and compute the number of ROH (roh_count
) and fraction of genome in a ROH (froh
) using only ROH larger than 100 kbp.
Exercice 4 (Français)
Choisissez un autre échantillon de n’importe quelle population de moustiques et calculez le nombre de ROH (roh_count
) et la fraction de génome dans un ROH (froh
) en utilisant uniquement des ROH supérieur à 100 kbp.
Well done!#
In this module we have learnt about using heterozygosity and runs of homozygosity as tools for understanding mosquito population demography.
Run, plotted and interpreted heterozygosity across regions of the genome.
Run, plotted and interpreted ROH analysis.
Exercises#
Open this notebook in Google Colab, copy to your drive, clear outputs and run it for yourself from top to bottom. As you run through the notebook, cell by cell, think about what each cell is doing.
Have go at the practical exercises, but please don’t worry if you don’t have time to do them all during the practical session, and please ask the teaching assistants for help if you are stuck.
Hint: To open the notebook in Google Colab, click the rocket icon at the top of the page, then select “Colab” from the drop-down menu.