Module 4 - Heterozygosity and runs of homozygosity

Module 4 - Heterozygosity and runs of homozygosity#

Theme: analysis

In this module we are going to investigate heterozygosity and homozygosity using plots across the genome and runs of homozygosity (ROH). These analyses can help us understand more about the demographic history of a mosquito population.

Learning Objectives#

At the end of this module, you will be able to:

Understand and be able to calculate heterozygosity
Plot and interpret the heterozygosity across the genome for individual samples
Explain ROH analysis
Run, plot and interpret ROH

Lecture#

English#

Français#

Please note that the code in the cells below might differ from that shown in the video. This can happen because Python packages and their dependencies change due to updates, necessitating tweaks to the code.

Setup#

First, let’s begin by installing and importing some Python packages, and configuring access to Anopheles genomic data from the MalariaGEN Ag3.0 data resource.

%pip install -q --no-warn-conflicts malariagen_data

import malariagen_data
import warnings
warnings.simplefilter(action='ignore', category=Warning)

Note that authentication is required to access data through the package, please follow the instructions here.

ag3 = malariagen_data.Ag3()
ag3

MalariaGEN Ag3 API client
Please note that data are subject to terms of use, for more information see the MalariaGEN website or contact support@malariagen.net. See also the Ag3 API docs.
Storage URL	gs://vo_agam_release_master_us_central1
Data releases available	3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 3.10, 3.11, 3.12, 3.13, 3.14
Results cache	None
Cohorts analysis	20250131
AIM analysis	20220528
Site filters analysis	dt_20200416
Software version	malariagen_data 15.2.0
Client location	Iowa, United States (Google Cloud us-central1)

Heterozygosity in wild populations#

Mosquitoes are diploid, so an individual carries two copies of each autosome (non-sex chromosome), one from each parent. This means that each autosomal nucleotide position in the genome can be homozygous, where the nucleotide from each chromosome is the same (e.g., A/A, C/C, T/T or G/G), or heterozygous, where the nucleotides differ (e.g., A/T, C/G, etc.).

In the figure below, we see that at a single genomic position where both parents are heterozygous, the probability of heterozygous offspring is 0.5 (1/2). If only one of the parents in a mating was heterozygous then the probability of heterozygous offspring would decrease to 0.25 (1/4).

Analysing heterozygosity across the genome of an individual can give an indication how inbred an individual is. If heterozygosity is generally low, or there are regions with depressed heterozygosity, then the parents of the individual were likely closely related. This can happen by chance of course, but if we find that many individuals from a given population have low heterozygosity, then the population may be small or have recently been through a bottleneck (we’ll come back to that).

Let’s plot heterozygosity across the genome of some of our mosquito samples and learn how to interpret the figures.

First, we should load the sample metadata to help us pick some samples.

sample_df = ag3.sample_metadata()
sample_df.query("release == '3.0' and country == 'Burkina Faso'").head()

	sample_id	partner_sample_id	contributor	country	location	year	month	latitude	longitude	sex_call	...	admin1_name	admin1_iso	admin2_name	taxon	cohort_admin1_year	cohort_admin1_month	cohort_admin1_quarter	cohort_admin2_year	cohort_admin2_month	cohort_admin2_quarter
19020	AB0085-Cx	BF2-4	Austin Burt	Burkina Faso	Pala	2012	7	11.151	-4.235	F	...	Hauts-Bassins	BF-09	Houet	gambiae	BF-09_gamb_2012	BF-09_gamb_2012_07	BF-09_gamb_2012_Q3	BF-09_Houet_gamb_2012	BF-09_Houet_gamb_2012_07	BF-09_Houet_gamb_2012_Q3
19021	AB0086-Cx	BF2-6	Austin Burt	Burkina Faso	Pala	2012	7	11.151	-4.235	F	...	Hauts-Bassins	BF-09	Houet	gambiae	BF-09_gamb_2012	BF-09_gamb_2012_07	BF-09_gamb_2012_Q3	BF-09_Houet_gamb_2012	BF-09_Houet_gamb_2012_07	BF-09_Houet_gamb_2012_Q3
19022	AB0087-C	BF3-3	Austin Burt	Burkina Faso	Bana Village	2012	7	11.233	-4.472	F	...	Hauts-Bassins	BF-09	Houet	coluzzii	BF-09_colu_2012	BF-09_colu_2012_07	BF-09_colu_2012_Q3	BF-09_Houet_colu_2012	BF-09_Houet_colu_2012_07	BF-09_Houet_colu_2012_Q3
19023	AB0088-C	BF3-5	Austin Burt	Burkina Faso	Bana Village	2012	7	11.233	-4.472	F	...	Hauts-Bassins	BF-09	Houet	coluzzii	BF-09_colu_2012	BF-09_colu_2012_07	BF-09_colu_2012_Q3	BF-09_Houet_colu_2012	BF-09_Houet_colu_2012_07	BF-09_Houet_colu_2012_Q3
19024	AB0089-Cx	BF3-8	Austin Burt	Burkina Faso	Bana Village	2012	7	11.233	-4.472	F	...	Hauts-Bassins	BF-09	Houet	coluzzii	BF-09_colu_2012	BF-09_colu_2012_07	BF-09_colu_2012_Q3	BF-09_Houet_colu_2012	BF-09_Houet_colu_2012_07	BF-09_Houet_colu_2012_Q3

5 rows × 57 columns

For our first heterozygosity plot, we’ll pick a wild-caught Anopheles gambiae sample from Burkina Faso, “AB0085-Cx”.

We can use the plot_heterozygosity() function to visualise heterozygosity in a single sample over a chromosome arm. Let’s have a look at what parameters are required by theplot_heterozygosity() function.

ag3.plot_heterozygosity?

Signature:
ag3.plot_heterozygosity(
    sample: Union[str, int, List[Union[str, int]], Tuple[Union[str, int], ...]],
    region: Union[str, malariagen_data.util.Region, Mapping],
    window_size: int = 20000,
    y_max: float = 0.03,
    circle_kwargs: Optional[Mapping] = None,
    site_mask: Optional[str] = 'default',
    sample_set: Optional[str] = None,
    sizing_mode: Literal['fixed', 'stretch_width', 'stretch_height', 'stretch_both', 'scale_width', 'scale_height', 'scale_both'] = 'stretch_width',
    width: Optional[int] = None,
    track_height: int = 170,
    genes_height: int = 90,
    show: bool = True,
    output_backend: Literal['canvas', 'webgl', 'svg'] = 'webgl',
    chunks: Union[int, str, Tuple[Union[int, str], ...], Callable[[Tuple[int, ...]], Union[int, str, Tuple[Union[int, str], ...]]]] = 'native',
    inline_array: bool = True,
    gene_labels: Optional[Mapping[str, str]] = None,
    gene_labelset: Optional[bokeh.models.annotations.labels.LabelSet] = None,
) -> Optional[bokeh.model.model.Model]
Docstring:
Plot windowed heterozygosity for a single sample over a genome region.

Parameters
----------
sample : str or int or list of str or int or tuple of str or int
    Sample identifier or index within sample set. Multiple values can also
    be provided as a list or tuple.
region : str or Region or Mapping
    Region of the reference genome. Can be a contig name, region string
    (formatted like "{contig}:{start}-{end}"), or identifier of a genome
    feature such as a gene or transcript.
window_size : int, optional, default: 20000
    Number of sites per window.
y_max : float, optional, default: 0.03
    Y axis limit.
circle_kwargs : Mapping or None, optional
    Passed through to bokeh scatter() function with marker = 'circle'.
site_mask : str or None, optional, default: 'default'
    Which site filters mask to apply. See the `site_mask_ids` property for
    available values.
sample_set : str or None, optional
    Sample set identifier.
sizing_mode : {'fixed', 'stretch_width', 'stretch_height', 'stretch_both', 'scale_width', 'scale_height', 'scale_both'}, optional, default: 'stretch_width'
    Bokeh plot sizing mode, see also https://docs.bokeh.org/en/latest/docs
    /user_guide/basic/layouts.html#sizing-modes.
width : int or None, optional
    Plot width in pixels (px).
track_height : int, optional, default: 170
    Main track height in pixels (px).
genes_height : int, optional, default: 90
    Genes track height in pixels (px).
show : bool, optional, default: True
    If true, show the plot. If False, do not show the plot, but return the
    figure.
output_backend : {'canvas', 'webgl', 'svg'}, optional, default: 'webgl'
    Specify an output backend to render a plot area onto. See also
    https://docs.bokeh.org/en/latest/docs/user_guide/output/webgl.html.
chunks : int or str or tuple of int or str or Callable[[typing.Tuple[int, ...]], int or str or tuple of int or str], optional, default: 'native'
    Define how input data being read from zarr should be divided into
    chunks for a dask computation. If 'native', use underlying zarr
    chunks. If a string specifying a target memory size, e.g., '300 MiB',
    resize chunks in arrays with more than one dimension to match this
    size. If 'auto', let dask decide chunk size.  If 'ndauto', let dask
    decide chunk size but only for arrays with more than one dimension. If
    'ndauto0', as 'ndauto' but only vary the first chunk dimension. If
    'ndauto1', as 'ndauto' but only vary the second chunk dimension. If
    'ndauto01', as 'ndauto' but only vary the first and second chunk
    dimensions. Also, can be a tuple of integers, or a callable which
    accepts the native chunks as a single argument and returns a valid
    dask chunks value.
inline_array : bool, optional, default: True
    Passed through to dask `from_array()`.
gene_labels : Mapping[str, str] or None, optional
    A mapping of gene identifiers to custom labels, which will appear in
    the plot.
gene_labelset : LabelSet or None, optional
    A LabelSet to use in the plot.

Returns
-------
Model or None
    A bokeh figure (only returned if show=False).
File:      /home/conda/developer/35674e27b19f7c625ba32a1b88449ff45c90b40edb90a065b66c5a9a5388f41c-20250421-195247-360829-93-training-nb-maintenance-mgen-15.2.0/lib/python3.11/site-packages/malariagen_data/anopheles.py
Type:      method

The Anopheles gambiae genome is relatively large, e.g., there are 53 million nucleotide positions on the 3R chromosome arm. Plotting millions of points is just not practical either computationally or when it comes to interpretation. Consequently we use a windowed approach, where we define a window size in number of sites (genomic positions), then move this window along the chromosome arm, taking the mean of our statistic (here, heterozygosity) in every window. This gives us a summary of data that is easier to interpret.

Here we’ll use a window size of 10,000 sites. As we know that this sample is Anophele gambiae (see “taxon” column in metadata), we should use the “gamb_colu” site mask as this will give us just the sites that pass site filters, i.e. sites we are confident in calling genotypes, for An. gambiae or An. coluzzii.

ag3.plot_heterozygosity(
    sample="AB0085-Cx", 
    region="3R", 
    site_mask="gamb_colu", 
    window_size=10_000,
);

Each marker shows the heterozygosity in a single window, where heterozygosity is the fraction of sites with a heterozygous genotype call.

The heterozygosity we see for individual AB0085-Cx is typical of many Anopheles samples, with most windows having a heterozygosity of 0.015 (1.5%). We see just a few windows with very low heterozygosity along most of the contig, with a drop at the telomeric end (left) and a decline at the centromeric end (right).

Note that heterozygosity is typically lower in all samples in both telomeric and pericentromeric regions, because of reduced rates of recombination.

How does this sample’s heterozygosity, across the 3R contig, compare with one of the lab colony samples?

Heterozygosity in lab colonies#

A lab colony is usually founded by taking a small number of gravid females from the wild and rearing the offspring in cage. Genetically this differs from wild mosquito populations in two major ways.

1. Population bottleneck.#

As the lab colony is founded by only a few wild individuals, it is highly unlikely that all the genetic diversity found in the wild population will be represented in the colony. This is known as a founder effect which is a kind of population bottleneck. The term bottleneck refers to the idea that the original population is squeezed through the slim neck of a bottle, so only some individuals/genetic-variants make it through to the new smaller population.

2. Population size.#

In the wild, a mosquito population may consist of millions of individuals and even after a severe population bottleneck, like an island founding, it would be expected that the population would over time become large again. However, in a colony, the population size is kept artifically small to stop cage overcrowding and associated drops in colony fitness. A genetic side-effect of this very small population size is an increase in the power of genetic drift. Genetic drift is the random fluctuations in allele frequencies due to chance, and in small populations it has a much stronger effect than in large populations. This means that it is much more likely for alleles to become fixed (100%) in a colony than in a natural population, and so over a few generations the already low heterozygosity (due to bottleneck) is likely to drop even further.

Let’s choose a lab colony sample from the lab colony cross metadata to look at.

crosses_df = ag3.cross_metadata()
crosses_df.query("role == 'parent'")

	cross	sample_id	father_id	mother_id	sex	role
0	18-5	AD0142-C	NaN	NaN	F	parent
1	18-5	AD0143-C	NaN	NaN	M	parent
22	29-2	AD0231-C	NaN	NaN	F	parent
23	29-2	AD0232-C	NaN	NaN	M	parent
44	36-9	AD0254-C	NaN	NaN	F	parent
45	36-9	AD0255-C	NaN	NaN	M	parent
66	37-3	AD0277-C	NaN	NaN	F	parent
67	37-3	AD0278-C	NaN	NaN	M	parent
88	42-4	AD0305-C	NaN	NaN	F	parent
89	42-4	AD0306-C	NaN	NaN	M	parent
104	45-1	AD0324-C	NaN	NaN	F	parent
105	45-1	AD0325-C	NaN	NaN	M	parent
126	46-9	AD0347-C	NaN	NaN	F	parent
127	46-9	AD0348-C	NaN	NaN	M	parent
148	47-6	AD0371-C	NaN	NaN	F	parent
149	47-6	AD0372-C	NaN	NaN	M	parent
170	73-2	AD0422-C	NaN	NaN	F	parent
171	73-2	AD0423-C	NaN	NaN	M	parent
191	78-2	AD0447-C	NaN	NaN	F	parent
192	78-2	AD0448-C	NaN	NaN	M	parent
211	80-2	AD0473-C	NaN	NaN	F	parent
212	80-2	AD0474-C	NaN	NaN	M	parent
233	B5	AC0382-C	NaN	NaN	F	parent
234	B5	AC0416-C	NaN	NaN	M	parent
246	K2	AC0300-C	NaN	NaN	F	parent
247	K2	AC0406-C	NaN	NaN	M	parent
263	K4	AC0317-C	NaN	NaN	F	parent
264	K4	AC0398-C	NaN	NaN	M	parent
281	K6	AC0334-C	NaN	NaN	F	parent
282	K6	AC0398-C	NaN	NaN	M	parent

Let’s look at sample AD0305-C which was the mother in cross 42-4.

ag3.plot_heterozygosity(
    sample="AD0305-C", 
    region="3R", 
    site_mask="gamb_colu", 
    window_size=10_000,
);

If we compare this to our wild-caught sample we can see that although some of the windows still have heterozygosity ~1.5%, there are also some large regions of the chromosome where heterozygosity is almost zero. These are known as “runs of homozygosity”.

Let’s also look at sample AD0306-C which was the father of cross 42-4.

ag3.plot_heterozygosity(
    sample="AD0306-C", 
    region="3R", 
    site_mask="gamb_colu", 
    window_size=10_000,
);

This sample has even more runs of homozygosity. Hardly any windows remain with the normal level of heterozygosity. This is a symptom of significant inbreeding.

Let’s now compare the heterozygosity of these two parents with the heterozygosity in their progeny.

crosses_df.query("cross == '42-4' and role == 'progeny'")

	cross	sample_id	father_id	mother_id	sex	role
90	42-4	AD0309-C	AD0306-C	AD0305-C	M	progeny
91	42-4	AD0310-C	AD0306-C	AD0305-C	M	progeny
92	42-4	AD0311-C	AD0306-C	AD0305-C	M	progeny
93	42-4	AD0312-C	AD0306-C	AD0305-C	M	progeny
94	42-4	AD0313-C	AD0306-C	AD0305-C	M	progeny
95	42-4	AD0314-C	AD0306-C	AD0305-C	M	progeny
96	42-4	AD0315-C	AD0306-C	AD0305-C	M	progeny
97	42-4	AD0316-C	AD0306-C	AD0305-C	F	progeny
98	42-4	AD0317-C	AD0306-C	AD0305-C	M	progeny
99	42-4	AD0318-C	AD0306-C	AD0305-C	M	progeny
100	42-4	AD0319-C	AD0306-C	AD0305-C	F	progeny
101	42-4	AD0320-C	AD0306-C	AD0305-C	F	progeny
102	42-4	AD0322-C	AD0306-C	AD0305-C	F	progeny
103	42-4	AD0323-C	AD0306-C	AD0305-C	F	progeny

Let’s pick one of the progeny, AD0309-C.

ag3.plot_heterozygosity(
    sample="AD0309-C", 
    region="3R", 
    site_mask="gamb_colu", 
    window_size=10_000,
);

This sample has normal levels of heterozygosity, similar to wild samples. So how did this happen if both the parents showed signs of inbreeding?

The progeny of cross 42-4 are the result of mating a female from one lab colony to a male from a different lab colony. This F1 pairing of two different bottlenecked populations should restore heterozygosity as these cross progeny are effectively outbred, i.e., their parents are not closely related.

This is roughly analogous to breeding between two different pedigree breeds, such as dog breeds for example. Each of the parents may be relatively inbred due to many generations of mating within a small population, but the two different breeds are not closely related, and so their progeny show normal levels of heterozygosity.

Heterozygosity in island populations#

As we have learnt, taking a single individual (or small number of individuals) from the wild to seed a lab colony results in a reduction in heterozygosity as it results in a bottleneck event. A similar effect can also occur naturally when an animal populates an island from the mainland.

An island population will likely be founded with only a fraction of the mainland population diversity, plus island populations would tend to be smaller than those on the mainland, increasing inbreeding and the effect of drift (though not to the levels seen in the small lab colonies).

In Ag3.0 we have samples from the oceanic island Mayotte, ~1000km from mainland Africa and a similar distance from Madagascar. Due to these distances, it is likely that the Anopheles gambiae mosquito population was founded by just a few individuals. Let’s look at heterozygosity in a Mayotte sample.

sample_df.query("release == '3.0' and country == 'Mayotte'").head()

	sample_id	partner_sample_id	contributor	country	location	year	month	latitude	longitude	sex_call	...	admin1_name	admin1_iso	admin2_name	taxon	cohort_admin1_year	cohort_admin1_month	cohort_admin1_quarter	cohort_admin2_year	cohort_admin2_month	cohort_admin2_quarter
19989	AP0007-C	60	Igor Sharakhov	Mayotte	Mtsamboro Forest Reserve	2011	-1	-12.703	45.081	F	...	Mayotte	MYT	Mtsamboro	gambiae	MYT_gamb_2011	MYT_gamb_2011	MYT_gamb_2011	MYT_Mtsamboro_gamb_2011	MYT_Mtsamboro_gamb_2011	MYT_Mtsamboro_gamb_2011
19990	AP0021-C	92	Igor Sharakhov	Mayotte	Karihani Lake	2011	-1	-12.797	45.122	F	...	Mayotte	MYT	Tsingoni	gambiae	MYT_gamb_2011	MYT_gamb_2011	MYT_gamb_2011	MYT_Tsingoni_gamb_2011	MYT_Tsingoni_gamb_2011	MYT_Tsingoni_gamb_2011
19991	AP0019-C	88	Igor Sharakhov	Mayotte	Mtsanga Charifou	2011	-1	-12.991	45.156	M	...	Mayotte	MYT	Kani-Kéli	gambiae	MYT_gamb_2011	MYT_gamb_2011	MYT_gamb_2011	MYT_Kani-Kéli_gamb_2011	MYT_Kani-Kéli_gamb_2011	MYT_Kani-Kéli_gamb_2011
19992	AP0020-C	78	Igor Sharakhov	Mayotte	Mtsanga Charifou	2011	-1	-12.991	45.156	M	...	Mayotte	MYT	Kani-Kéli	gambiae	MYT_gamb_2011	MYT_gamb_2011	MYT_gamb_2011	MYT_Kani-Kéli_gamb_2011	MYT_Kani-Kéli_gamb_2011	MYT_Kani-Kéli_gamb_2011
19993	AP0009-C	62	Igor Sharakhov	Mayotte	Combani	2011	-1	-12.779	45.143	M	...	Mayotte	MYT	Tsingoni	gambiae	MYT_gamb_2011	MYT_gamb_2011	MYT_gamb_2011	MYT_Tsingoni_gamb_2011	MYT_Tsingoni_gamb_2011	MYT_Tsingoni_gamb_2011

5 rows × 57 columns

ag3.plot_heterozygosity(
    sample="AP0007-C", 
    region="3R", 
    site_mask="gamb_colu", 
    window_size=10_000,
);

We can see from this plot, that the demography of the Mayotte population is likely different from any we have looked at so far. Though the peak heterozygosity across the 3R contig is ~1.5%, there is much more spread, with most windows falling between 1.5% and 0.5% heterozygosity. Interpretations of this result could be that the island population was founded some time ago, and heterozygosity is returning to levels like we saw on the mainland e.g. Burkina Faso, and/or this is the stable level of heterozygosity for a small population size of Anopheles mosquitoes.

Exercise 1 (English)

We have sampled another island population in Ag3.0, Bioko (part of Equatorial Guinea). Though, unlike Mayotte, Bioko is only ~30km away from mainland Africa.

Uncomment the code below and run a heterozygosity analysis for a sample from Bioko.
How does this individual compare to the sample from Mayotte we looked at?
Can you suggest why they might look different?

Exercice 1 (Français)

Nous avons des échantillons provenant d’une autre population insulaire dans Ag3.0: Bioko (qui fait partie de la Guinée-Équatoriale). Toutefois, contrairement à Mayotte, Bioko ne se trouve qu’à ~30 km du continent africain.

Décommenter le code ci-dessous et effectuer une analyse de l’hétérozygosité pour un échantillon venant de Bioko.
En quoi est-ce que cet individu est différent de celui que nous avons regardé pour Mayotte?
Pouvez-vous proposer une explication pour ces différences?

# sample_df.query("release == '3.0' and location == 'Bioko'")

# ag3.plot_heterozygosity(
#     sample="???", 
#     region="3R", 
#     site_mask="gamb_colu", 
#     window_size=10_000,
# );

Heterozygosity in an atypical wild population#

One of the reasons for running population genetic analyses like heterozygosity, is to detect populations that don’t act as we expect them to, as this means we might be missing something about their demography which could be relevant to vector control. Samples of the putative cryptic species gcx3, collected in Kenya, have strange patterns of heterozygosity for wild samples, let’s take a look and think about what the causes could be.

sample_df.query("release == '3.0' and taxon == 'gcx3' and country == 'Kenya'").head()

	sample_id	partner_sample_id	contributor	country	location	year	month	latitude	longitude	sex_call	...	admin1_name	admin1_iso	admin2_name	taxon	cohort_admin1_year	cohort_admin1_month	cohort_admin1_quarter	cohort_admin2_year	cohort_admin2_month	cohort_admin2_quarter
20807	AK0049-C	KIL38	Janet Midega	Kenya	Kilifi	2012	-1	-3.511	39.909	F	...	Kilifi	KE-14	Kilifi North	gcx3	KE-14_gcx3_2012	KE-14_gcx3_2012	KE-14_gcx3_2012	KE-14_Kilifi-North_gcx3_2012	KE-14_Kilifi-North_gcx3_2012	KE-14_Kilifi-North_gcx3_2012
20808	AK0050-C	KIL39	Janet Midega	Kenya	Kilifi	2012	-1	-3.511	39.909	F	...	Kilifi	KE-14	Kilifi North	gcx3	KE-14_gcx3_2012	KE-14_gcx3_2012	KE-14_gcx3_2012	KE-14_Kilifi-North_gcx3_2012	KE-14_Kilifi-North_gcx3_2012	KE-14_Kilifi-North_gcx3_2012
20809	AK0051-C	KIL40	Janet Midega	Kenya	Kilifi	2012	-1	-3.511	39.909	F	...	Kilifi	KE-14	Kilifi North	gcx3	KE-14_gcx3_2012	KE-14_gcx3_2012	KE-14_gcx3_2012	KE-14_Kilifi-North_gcx3_2012	KE-14_Kilifi-North_gcx3_2012	KE-14_Kilifi-North_gcx3_2012
20810	AK0052-C	KIL41	Janet Midega	Kenya	Kilifi	2012	-1	-3.511	39.909	F	...	Kilifi	KE-14	Kilifi North	gcx3	KE-14_gcx3_2012	KE-14_gcx3_2012	KE-14_gcx3_2012	KE-14_Kilifi-North_gcx3_2012	KE-14_Kilifi-North_gcx3_2012	KE-14_Kilifi-North_gcx3_2012
20814	AK0062-C	KIL47	Janet Midega	Kenya	Kilifi	2012	-1	-3.511	39.909	F	...	Kilifi	KE-14	Kilifi North	gcx3	KE-14_gcx3_2012	KE-14_gcx3_2012	KE-14_gcx3_2012	KE-14_Kilifi-North_gcx3_2012	KE-14_Kilifi-North_gcx3_2012	KE-14_Kilifi-North_gcx3_2012

5 rows × 57 columns

ag3.plot_heterozygosity(
    sample="AK0050-C", 
    region="3R", 
    site_mask="gamb_colu", 
    window_size=10_000,
);

So what is going on here? This wild caught sample has long runs of homozygosity - regions where heterozygosity is close to zero - more so than even some of our lab colony samples. It looks like this individual has come from a wild population that has been through a severe bottleneck recently. But what might cause that in a natural population?

One possibility is that a vector control campaign that ran in Kenya prior to the date of collection was so successful that it rapidly and drastically reduced the population size of these mosquitoes. Unfortunately, this is just conjecture as we don’t have longitudinal samples (before, during, after the control campaign) to test this hypothesis.

Fortunately many of our sample sets going forward do contain longitudinally sampled mosquitoes, some in conjunction with vector control roll-out projects. Monitoring heterozygosity with these samples could potentially indicate if a control method e.g. next-generation bednets, affects mosquito population sizes and thus provide an indication of its effectiveness.

Exercise 2 (English)

Another location where we have collected samples with atypical diversity statistics is Tanzania. Uncomment the code below and analyse the hetereozygosity of a gcx3 sample from Tanzania.

Does your Tanzanian sample’s heterozygosity differ from the sample from Kenya we looked at above? If so, how is it different?

Exercice 2 (Français)

Un autre lieu où nous avons des échantillons avec des statistiques de la diversité génétiques atypiques est la Tanzanie. Décommenter le code ci-dessous et analyser l’hétérozygotie d’un échantillon gcx3 tanzanien.

Est-ce que l’hétérozygotie de votre échantillon tanzanien est différente de celle de l’échantillon kenyan que nous avons regardé en premier? Si oui, comment?

# sample_df.query("release == '3.0' and taxon == 'gcx3' and country == 'Tanzania'")

# ag3.plot_heterozygosity(
#     sample="???", 
#     region="3R", 
#     site_mask="gamb_colu", 
#     window_size=10_000,
# );

Runs of homozygosity (ROH)#

Plotting heterozygosity across individual genomes provides a great visual way to investigate population demography. For example, plotting heterozygosity across 3R suggests that the island population of Mayotte differs from mainland populations, however, it doesn’t tell us exactly how it differs or by how much.

One analysis that provides a way of quantifying heterozygosity, or rather the opposite side of the same coin - homozygosity, is to measure runs of homozygosity or ROH. As we saw in our Kenyan heterozygosity plot above, runs of homozygosity are where we see multiple windows in row with near-zero heterozygosity (a.k.a. homozygosity).

To infer and measure these homozygous regions we can use a hidden markov model (HMM). HMMs were explained in workshop 2, module 3, where it was used to detect copy number variation from coverage data. But briefly, HMM is an algorithm that allows large, noisy datasets (such as heterozygosity), to be quickly scanned for patterns (such as runs of homozygosity).

To infer and visualise ROH, we can use the function plot_roh(). Let’s have a look at what parameters we need to specify for the plot_roh() function.

ag3.plot_roh?

Signature:
ag3.plot_roh(
    sample: Union[str, int],
    region: Union[str, malariagen_data.util.Region, Mapping],
    window_size: int = 20000,
    site_mask: Optional[str] = 'default',
    sample_set: Optional[str] = None,
    phet_roh: float = 0.001,
    phet_nonroh: Tuple[float, ...] = (0.003, 0.01),
    transition: float = 0.001,
    y_max: float = 0.03,
    sizing_mode: Literal['fixed', 'stretch_width', 'stretch_height', 'stretch_both', 'scale_width', 'scale_height', 'scale_both'] = 'stretch_width',
    width: Optional[int] = None,
    heterozygosity_height: int = 170,
    roh_height: int = 40,
    genes_height: int = 90,
    circle_kwargs: Optional[Mapping] = None,
    show: bool = True,
    output_backend: Literal['canvas', 'webgl', 'svg'] = 'webgl',
    chunks: Union[int, str, Tuple[Union[int, str], ...], Callable[[Tuple[int, ...]], Union[int, str, Tuple[Union[int, str], ...]]]] = 'native',
    inline_array: bool = True,
    gene_labels: Optional[Mapping[str, str]] = None,
    gene_labelset: Optional[bokeh.models.annotations.labels.LabelSet] = None,
) -> Optional[bokeh.model.model.Model]
Docstring:
Plot windowed heterozygosity and inferred runs of homozygosity for a
single sample over a genome region.

Parameters
----------
sample : str or int
    Sample identifier or index within sample set.
region : str or Region or Mapping
    Region of the reference genome. Can be a contig name, region string
    (formatted like "{contig}:{start}-{end}"), or identifier of a genome
    feature such as a gene or transcript.
window_size : int, optional, default: 20000
    Number of sites per window.
site_mask : str or None, optional, default: 'default'
    Which site filters mask to apply. See the `site_mask_ids` property for
    available values.
sample_set : str or None, optional
    Sample set identifier.
phet_roh : float, optional, default: 0.001
    Probability of observing a heterozygote in a ROH.
phet_nonroh : tuple of float, optional, default: (0.003, 0.01)
    One or more probabilities of observing a heterozygote outside a ROH.
transition : float, optional, default: 0.001
    Probability of moving between states. A larger window size may call
    for a larger transitional probability.
y_max : float, optional, default: 0.03
    Y axis limit.
sizing_mode : {'fixed', 'stretch_width', 'stretch_height', 'stretch_both', 'scale_width', 'scale_height', 'scale_both'}, optional, default: 'stretch_width'
    Bokeh plot sizing mode, see also https://docs.bokeh.org/en/latest/docs
    /user_guide/basic/layouts.html#sizing-modes.
width : int or None, optional
    Plot width in pixels (px).
heterozygosity_height : int, optional, default: 170
    Plot height in pixels (px).
roh_height : int, optional, default: 40
    Plot height in pixels (px).
genes_height : int, optional, default: 90
    Genes track height in pixels (px).
circle_kwargs : Mapping or None, optional
    Passed through to bokeh scatter() function with marker = 'circle'.
show : bool, optional, default: True
    If true, show the plot. If False, do not show the plot, but return the
    figure.
output_backend : {'canvas', 'webgl', 'svg'}, optional, default: 'webgl'
    Specify an output backend to render a plot area onto. See also
    https://docs.bokeh.org/en/latest/docs/user_guide/output/webgl.html.
chunks : int or str or tuple of int or str or Callable[[typing.Tuple[int, ...]], int or str or tuple of int or str], optional, default: 'native'
    Define how input data being read from zarr should be divided into
    chunks for a dask computation. If 'native', use underlying zarr
    chunks. If a string specifying a target memory size, e.g., '300 MiB',
    resize chunks in arrays with more than one dimension to match this
    size. If 'auto', let dask decide chunk size.  If 'ndauto', let dask
    decide chunk size but only for arrays with more than one dimension. If
    'ndauto0', as 'ndauto' but only vary the first chunk dimension. If
    'ndauto1', as 'ndauto' but only vary the second chunk dimension. If
    'ndauto01', as 'ndauto' but only vary the first and second chunk
    dimensions. Also, can be a tuple of integers, or a callable which
    accepts the native chunks as a single argument and returns a valid
    dask chunks value.
inline_array : bool, optional, default: True
    Passed through to dask `from_array()`.
gene_labels : Mapping[str, str] or None, optional
    A mapping of gene identifiers to custom labels, which will appear in
    the plot.
gene_labelset : LabelSet or None, optional
    A LabelSet to use in the plot.

Returns
-------
Model or None
    A bokeh figure (only returned if show=False).
File:      /home/conda/developer/35674e27b19f7c625ba32a1b88449ff45c90b40edb90a065b66c5a9a5388f41c-20250421-195247-360829-93-training-nb-maintenance-mgen-15.2.0/lib/python3.11/site-packages/malariagen_data/anopheles.py
Type:      method

Let’s now plot ROH for the Kenyan sample we looked at earlier.

ag3.plot_roh(
    sample="AK0050-C", 
    region="3R", 
    site_mask="gamb_colu", 
    window_size=10_000,
);

The ROH analysis starts with the heterozygosity data that we plotted earlier and the top panel of the plot here is identical to the output of the plot_heterozygosity() function. The results from the HMM ROH analysis have been plotted below this, where we can see 23 blocks of ROH have been detected.

To access a Pandas DataFrame of the underlying ROH data, we can use another function, roh_hmm, with exactly the same parameters as before. The output DataFrame shows us the start, end and length of each ROH inferred by the HMM.

roh_df = ag3.roh_hmm(
    sample="AK0050-C", 
    region="3R", 
    site_mask="gamb_colu", 
    window_size=10_000,
)
roh_df

	sample_id	contig	roh_start	roh_stop	roh_length	roh_is_marginal
0	AK0050-C	3R	180	420337	420157	True
1	AK0050-C	3R	432202	1690670	1258468	False
2	AK0050-C	3R	1723873	2003686	279813	False
3	AK0050-C	3R	6108642	6123679	15037	False
4	AK0050-C	3R	11288732	12567018	1278286	False
5	AK0050-C	3R	12580510	14932888	2352378	False
6	AK0050-C	3R	17090690	17115221	24531	False
7	AK0050-C	3R	18307285	18605951	298666	False
8	AK0050-C	3R	19013506	19039488	25982	False
9	AK0050-C	3R	19451384	20434062	982678	False
10	AK0050-C	3R	20449042	21095747	646705	False
11	AK0050-C	3R	25613447	28241877	2628430	False
12	AK0050-C	3R	28522124	29437036	914912	False
13	AK0050-C	3R	29969536	33355367	3385831	False
14	AK0050-C	3R	33389150	38051606	4662456	False
15	AK0050-C	3R	38094971	39265672	1170701	False
16	AK0050-C	3R	39473670	39486954	13284	False
17	AK0050-C	3R	39946985	40124985	178000	False
18	AK0050-C	3R	40178007	41108589	930582	False
19	AK0050-C	3R	41149111	45242143	4093032	False
20	AK0050-C	3R	45275124	52790662	7515538	False
21	AK0050-C	3R	52850062	52996177	146115	False
22	AK0050-C	3R	53044868	53080557	35689	False

In the Ag1000G phase 1 paper we computed ROH for all samples, allowing us to plot and compare them.

In the plot, each marker represents an individual mosquito. The Y-axis is the count of runs of homozygosity and the X-axis is the fraction of the genome composed of runs of homozygosity. Because these kind of data are inherently noisy, we filtered the ROH to only include those longer than 100,000 bp (100 kbp).

Lab colony parents are shown in black. We can see how most of them have lots of runs of homozygosity which make up much of their genomes, while most wild caught cohorts have much smaller fractions of their genomes made up of ROH.

Exercise 3 (English)

What country might the grey samples come from (Hint: we’ve looked at strange patterns there today).
Why might two of the lab colony parents (labelled “G”) have both low ROH count and fraction ROH?

Exercice 3 (Français)

De quel pays peuvent venir les échantillons gris du diagramme ci-dessus? (Indice: nous avons observé des motifs étranges dans ce pays aujourd’hui)
Pourquoi est-ce que deux des parents venant de colonies (ceux étiquetés “G”) ont un compte des ROHs et une fraction de ROH si faibles?

Let’s run our Kenyan sample through the same process we used to create the plot above.

First, filter the ROH DataFrame to just include runs of homozygosity longer than 100 kbp.

roh_df_100kb = roh_df.query("roh_length > 100_000")
roh_df_100kb

	sample_id	contig	roh_start	roh_stop	roh_length	roh_is_marginal
0	AK0050-C	3R	180	420337	420157	True
1	AK0050-C	3R	432202	1690670	1258468	False
2	AK0050-C	3R	1723873	2003686	279813	False
4	AK0050-C	3R	11288732	12567018	1278286	False
5	AK0050-C	3R	12580510	14932888	2352378	False
7	AK0050-C	3R	18307285	18605951	298666	False
9	AK0050-C	3R	19451384	20434062	982678	False
10	AK0050-C	3R	20449042	21095747	646705	False
11	AK0050-C	3R	25613447	28241877	2628430	False
12	AK0050-C	3R	28522124	29437036	914912	False
13	AK0050-C	3R	29969536	33355367	3385831	False
14	AK0050-C	3R	33389150	38051606	4662456	False
15	AK0050-C	3R	38094971	39265672	1170701	False
17	AK0050-C	3R	39946985	40124985	178000	False
18	AK0050-C	3R	40178007	41108589	930582	False
19	AK0050-C	3R	41149111	45242143	4093032	False
20	AK0050-C	3R	45275124	52790662	7515538	False
21	AK0050-C	3R	52850062	52996177	146115	False

Now we can count the number of runs of homozygosity longer that 100 kbp…

roh_count = len(roh_df_100kb)
roh_count

…and, calculate the fraction ROH (fraction of the genome composed of ROH longer than 100_000 bp), by summing the “roh_length” column, then dividing this by the total length of the region.

roh_total_length = roh_df_100kb["roh_length"].sum()
roh_total_length

33142748

froh = roh_total_length / len(ag3.genome_sequence(region="3R"))
froh

0.6229759752712953

So, we can see that this individual is incredibly inbred for a wild caught mosquito, over 60% of the 3R chromosome arm is composed of long runs of homozygosity!

N.B., our ROH count will be different from the result shown for this individual in the Ag1000G phase 1 paper, because the paper combines ROH across multiple chromosomes, whereas here we have just calculated it across a single chromosome arm.

Exercise 4 (English)

Choose another sample from any mosquito population, and compute the number of ROH (roh_count) and fraction of genome in a ROH (froh) using only ROH larger than 100 kbp.

Exercice 4 (Français)

Choisissez un autre échantillon de n’importe quelle population de moustiques et calculez le nombre de ROH (roh_count) et la fraction de génome dans un ROH (froh) en utilisant uniquement des ROH supérieur à 100 kbp.

Well done!#

In this module we have learnt about using heterozygosity and runs of homozygosity as tools for understanding mosquito population demography.
Run, plotted and interpreted heterozygosity across regions of the genome.
Run, plotted and interpreted ROH analysis.

Exercises#

Open this notebook in Google Colab, copy to your drive, clear outputs and run it for yourself from top to bottom. As you run through the notebook, cell by cell, think about what each cell is doing.

Have go at the practical exercises, but please don’t worry if you don’t have time to do them all during the practical session, and please ask the teaching assistants for help if you are stuck.

Hint: To open the notebook in Google Colab, click the rocket icon at the top of the page, then select “Colab” from the drop-down menu.

Module 4 - Heterozygosity and runs of homozygosity

Contents

Module 4 - Heterozygosity and runs of homozygosity#

Learning Objectives#

Lecture#

English#

Français#

Setup#

Heterozygosity in wild populations#

Heterozygosity in lab colonies#

1. Population bottleneck.#

2. Population size.#

Heterozygosity in island populations#

Heterozygosity in an atypical wild population#

Runs of homozygosity (ROH)#

Well done!#

Exercises#