banner

Workshop 5 - Training course in data analysis for genomic surveillance of African malaria vectors


Module 4 - Heterozygosity and runs of homozygosity#

Theme: analysis

In this module we are going to investigate heterozygosity and homozygosity using plots across the genome and runs of homozygosity (ROH). These analyses can help us understand more about the demographic history of a mosquito population.

Learning Objectives#

At the end of this module, you will be able to:

  • Understand and be able to calculate heterozygosity

  • Plot and interpret the heterozygosity across the genome for individual samples

  • Explain ROH analysis

  • Run, plot and interpret ROH

Lecture#

English#

Français#

Please note that the code in the cells below might differ from that shown in the video. This can happen because Python packages and their dependencies change due to updates, necessitating tweaks to the code.

Setup#

First, let’s begin by installing and importing some Python packages, and configuring access to Anopheles genomic data from the MalariaGEN Ag3.0 data resource.

%pip install -q --no-warn-conflicts malariagen_data
Note: you may need to restart the kernel to use updated packages.
import malariagen_data
import warnings
warnings.simplefilter(action='ignore', category=Warning)

Note that authentication is required to access data through the package, please follow the instructions here.

ag3 = malariagen_data.Ag3()
ag3
MalariaGEN Ag3 API client
Please note that data are subject to terms of use, for more information see the MalariaGEN website or contact support@malariagen.net. See also the Ag3 API docs.
Storage URL gs://vo_agam_release/
Data releases available 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9
Results cache None
Cohorts analysis 20240418
AIM analysis 20220528
Site filters analysis dt_20200416
Software version malariagen_data 9.0.0
Client location unknown

Check your Client location (virtual machine or VM) is located near to the data in the US. If this isn’t the case, use Runtime - Disconnect and delete runtime from the menu, then re-run the notebook to generate a new VM.

Heterozygosity in wild populations#

Mosquitoes are diploid, so an individual carries two copies of each autosome (non-sex chromosome), one from each parent. This means that each autosomal nucleotide position in the genome can be homozygous, where the nucleotide from each chromosome is the same (e.g., A/A, C/C, T/T or G/G), or heterozygous, where the nucleotides differ (e.g., A/T, C/G, etc.).

In the figure below, we see that at a single genomic position where both parents are heterozygous, the probability of heterozygous offspring is 0.5 (1/2). If only one of the parents in a mating was heterozygous then the probability of heterozygous offspring would decrease to 0.25 (1/4).

Analysing heterozygosity across the genome of an individual can give an indication how inbred an individual is. If heterozygosity is generally low, or there are regions with depressed heterozygosity, then the parents of the individual were likely closely related. This can happen by chance of course, but if we find that many individuals from a given population have low heterozygosity, then the population may be small or have recently been through a bottleneck (we’ll come back to that).

Let’s plot heterozygosity across the genome of some of our mosquito samples and learn how to interpret the figures.

First, we should load the sample metadata to help us pick some samples.

sample_df = ag3.sample_metadata()
sample_df.query("release == '3.0' and country == 'Burkina Faso'").head()
                                     
sample_id partner_sample_id contributor country location year month latitude longitude sex_call ... admin1_name admin1_iso admin2_name taxon cohort_admin1_year cohort_admin1_month cohort_admin1_quarter cohort_admin2_year cohort_admin2_month cohort_admin2_quarter
15094 AB0085-Cx BF2-4 Austin Burt Burkina Faso Pala 2012 7 11.151 -4.235 F ... Hauts-Bassins BF-09 Houet gambiae BF-09_gamb_2012 BF-09_gamb_2012_07 BF-09_gamb_2012_Q3 BF-09_Houet_gamb_2012 BF-09_Houet_gamb_2012_07 BF-09_Houet_gamb_2012_Q3
15095 AB0086-Cx BF2-6 Austin Burt Burkina Faso Pala 2012 7 11.151 -4.235 F ... Hauts-Bassins BF-09 Houet gambiae BF-09_gamb_2012 BF-09_gamb_2012_07 BF-09_gamb_2012_Q3 BF-09_Houet_gamb_2012 BF-09_Houet_gamb_2012_07 BF-09_Houet_gamb_2012_Q3
15096 AB0087-C BF3-3 Austin Burt Burkina Faso Bana Village 2012 7 11.233 -4.472 F ... Hauts-Bassins BF-09 Houet coluzzii BF-09_colu_2012 BF-09_colu_2012_07 BF-09_colu_2012_Q3 BF-09_Houet_colu_2012 BF-09_Houet_colu_2012_07 BF-09_Houet_colu_2012_Q3
15097 AB0088-C BF3-5 Austin Burt Burkina Faso Bana Village 2012 7 11.233 -4.472 F ... Hauts-Bassins BF-09 Houet coluzzii BF-09_colu_2012 BF-09_colu_2012_07 BF-09_colu_2012_Q3 BF-09_Houet_colu_2012 BF-09_Houet_colu_2012_07 BF-09_Houet_colu_2012_Q3
15098 AB0089-Cx BF3-8 Austin Burt Burkina Faso Bana Village 2012 7 11.233 -4.472 F ... Hauts-Bassins BF-09 Houet coluzzii BF-09_colu_2012 BF-09_colu_2012_07 BF-09_colu_2012_Q3 BF-09_Houet_colu_2012 BF-09_Houet_colu_2012_07 BF-09_Houet_colu_2012_Q3

5 rows × 32 columns

For our first heterozygosity plot, we’ll pick a wild-caught Anopheles gambiae sample from Burkina Faso, “AB0085-Cx”.

We can use the plot_heterozygosity() function to visualise heterozygosity in a single sample over a chromosome arm. Let’s have a look at what parameters are required by theplot_heterozygosity() function.

ag3.plot_heterozygosity?
Signature:
ag3.plot_heterozygosity(
    sample: Union[str, int, List[Union[str, int]], Tuple[Union[str, int], ...]],
    region: Union[str, malariagen_data.util.Region, Mapping],
    window_size: int = 20000,
    y_max: float = 0.03,
    circle_kwargs: Optional[Mapping] = None,
    site_mask: str = 'default',
    sample_set: Optional[str] = None,
    sizing_mode: Literal['fixed', 'stretch_width', 'stretch_height', 'stretch_both', 'scale_width', 'scale_height', 'scale_both'] = 'stretch_width',
    width: Optional[int] = None,
    track_height: int = 170,
    genes_height: int = 90,
    show: bool = True,
    output_backend: Literal['canvas', 'webgl', 'svg'] = 'webgl',
) -> Optional[bokeh.model.model.Model]
Docstring:
Plot windowed heterozygosity for a single sample over a genome region.

Parameters
----------
sample : str or int or list of str or int or tuple of str or int
    Sample identifier or index within sample set. Multiple values can also
    be provided as a list or tuple.
region : str or Region or Mapping
    Region of the reference genome. Can be a contig name, region string
    (formatted like "{contig}:{start}-{end}"), or identifier of a genome
    feature such as a gene or transcript.
window_size : int, optional, default: 20000
    Number of sites per window.
y_max : float, optional, default: 0.03
    Y axis limit.
circle_kwargs : Mapping or None, optional
    Passed through to bokeh circle() function.
site_mask : str, optional, default: 'default'
    Which site filters mask to apply. See the `site_mask_ids` property for
    available values.
sample_set : str or None, optional
    Sample set identifier.
sizing_mode : {'fixed', 'stretch_width', 'stretch_height', 'stretch_both', 'scale_width', 'scale_height', 'scale_both'}, optional, default: 'stretch_width'
    Bokeh plot sizing mode, see also https://docs.bokeh.org/en/latest/docs
    /user_guide/basic/layouts.html#sizing-modes.
width : int or None, optional
    Plot width in pixels (px).
track_height : int, optional, default: 170
    Main track height in pixels (px).
genes_height : int, optional, default: 90
    Genes track height in pixels (px).
show : bool, optional, default: True
    If true, show the plot. If False, do not show the plot, but return the
    figure.
output_backend : {'canvas', 'webgl', 'svg'}, optional, default: 'webgl'
    Specify an output backend to render a plot area onto. See also
    https://docs.bokeh.org/en/latest/docs/user_guide/output/webgl.html.

Returns
-------
Model or None
    A bokeh figure (only returned if show=False).
File:      /home/conda/developer/2835ffd46e5535e81e9672d0828c143d4f1444491ac44f0aeddd90156919e446-20240426-081151-481742-64-training-nb-maintenance-mgen-9.0.0/lib/python3.10/site-packages/malariagen_data/anopheles.py
Type:      method

The Anopheles gambiae genome is relatively large, e.g., there are 53 million nucleotide positions on the 3R chromosome arm. Plotting millions of points is just not practical either computationally or when it comes to interpretation. Consequently we use a windowed approach, where we define a window size in number of sites (genomic positions), then move this window along the chromosome arm, taking the mean of our statistic (here, heterozygosity) in every window. This gives us a summary of data that is easier to interpret.

Here we’ll use a window size of 10,000 sites. As we know that this sample is Anophele gambiae (see “taxon” column in metadata), we should use the “gamb_colu” site mask as this will give us just the sites that pass site filters, i.e. sites we are confident in calling genotypes, for An. gambiae or An. coluzzii.

ag3.plot_heterozygosity(
    sample="AB0085-Cx", 
    region="3R", 
    site_mask="gamb_colu", 
    window_size=10_000,
);
                                   
                                     

Each marker shows the heterozygosity in a single window, where heterozygosity is the fraction of sites with a heterozygous genotype call.

The heterozygosity we see for individual AB0085-Cx is typical of many Anopheles samples, with most windows having a heterozygosity of 0.015 (1.5%). We see just a few windows with very low heterozygosity along most of the contig, with a drop at the telomeric end (left) and a decline at the centromeric end (right).

Note that heterozygosity is typically lower in all samples in both telomeric and pericentromeric regions, because of reduced rates of recombination.

How does this sample’s heterozygosity, across the 3R contig, compare with one of the lab colony samples?

Heterozygosity in lab colonies#

A lab colony is usually founded by taking a small number of gravid females from the wild and rearing the offspring in cage. Genetically this differs from wild mosquito populations in two major ways.

1. Population bottleneck.#

As the lab colony is founded by only a few wild individuals, it is highly unlikely that all the genetic diversity found in the wild population will be represented in the colony. This is known as a founder effect which is a kind of population bottleneck. The term bottleneck refers to the idea that the original population is squeezed through the slim neck of a bottle, so only some individuals/genetic-variants make it through to the new smaller population.

2. Population size.#

In the wild, a mosquito population may consist of millions of individuals and even after a severe population bottleneck, like an island founding, it would be expected that the population would over time become large again. However, in a colony, the population size is kept artifically small to stop cage overcrowding and associated drops in colony fitness. A genetic side-effect of this very small population size is an increase in the power of genetic drift. Genetic drift is the random fluctuations in allele frequencies due to chance, and in small populations it has a much stronger effect than in large populations. This means that it is much more likely for alleles to become fixed (100%) in a colony than in a natural population, and so over a few generations the already low heterozygosity (due to bottleneck) is likely to drop even further.

Let’s choose a lab colony sample from the lab colony cross metadata to look at.

crosses_df = ag3.cross_metadata()
crosses_df.query("role == 'parent'")
cross sample_id father_id mother_id sex role
0 18-5 AD0142-C NaN NaN F parent
1 18-5 AD0143-C NaN NaN M parent
22 29-2 AD0231-C NaN NaN F parent
23 29-2 AD0232-C NaN NaN M parent
44 36-9 AD0254-C NaN NaN F parent
45 36-9 AD0255-C NaN NaN M parent
66 37-3 AD0277-C NaN NaN F parent
67 37-3 AD0278-C NaN NaN M parent
88 42-4 AD0305-C NaN NaN F parent
89 42-4 AD0306-C NaN NaN M parent
104 45-1 AD0324-C NaN NaN F parent
105 45-1 AD0325-C NaN NaN M parent
126 46-9 AD0347-C NaN NaN F parent
127 46-9 AD0348-C NaN NaN M parent
148 47-6 AD0371-C NaN NaN F parent
149 47-6 AD0372-C NaN NaN M parent
170 73-2 AD0422-C NaN NaN F parent
171 73-2 AD0423-C NaN NaN M parent
191 78-2 AD0447-C NaN NaN F parent
192 78-2 AD0448-C NaN NaN M parent
211 80-2 AD0473-C NaN NaN F parent
212 80-2 AD0474-C NaN NaN M parent
233 B5 AC0382-C NaN NaN F parent
234 B5 AC0416-C NaN NaN M parent
246 K2 AC0300-C NaN NaN F parent
247 K2 AC0406-C NaN NaN M parent
263 K4 AC0317-C NaN NaN F parent
264 K4 AC0398-C NaN NaN M parent
281 K6 AC0334-C NaN NaN F parent
282 K6 AC0398-C NaN NaN M parent

Let’s look at sample AD0305-C which was the mother in cross 42-4.

ag3.plot_heterozygosity(
    sample="AD0305-C", 
    region="3R", 
    site_mask="gamb_colu", 
    window_size=10_000,
);
                                   

                                     

If we compare this to our wild-caught sample we can see that although some of the windows still have heterozygosity ~1.5%, there are also some large regions of the chromosome where heterozygosity is almost zero. These are known as “runs of homozygosity”.

Let’s also look at sample AD0306-C which was the father of cross 42-4.

ag3.plot_heterozygosity(
    sample="AD0306-C", 
    region="3R", 
    site_mask="gamb_colu", 
    window_size=10_000,
);
                                     

This sample has even more runs of homozygosity. Hardly any windows remain with the normal level of heterozygosity. This is a symptom of significant inbreeding.

Let’s now compare the heterozygosity of these two parents with the heterozygosity in their progeny.

crosses_df.query("cross == '42-4' and role == 'progeny'")
cross sample_id father_id mother_id sex role
90 42-4 AD0309-C AD0306-C AD0305-C M progeny
91 42-4 AD0310-C AD0306-C AD0305-C M progeny
92 42-4 AD0311-C AD0306-C AD0305-C M progeny
93 42-4 AD0312-C AD0306-C AD0305-C M progeny
94 42-4 AD0313-C AD0306-C AD0305-C M progeny
95 42-4 AD0314-C AD0306-C AD0305-C M progeny
96 42-4 AD0315-C AD0306-C AD0305-C M progeny
97 42-4 AD0316-C AD0306-C AD0305-C F progeny
98 42-4 AD0317-C AD0306-C AD0305-C M progeny
99 42-4 AD0318-C AD0306-C AD0305-C M progeny
100 42-4 AD0319-C AD0306-C AD0305-C F progeny
101 42-4 AD0320-C AD0306-C AD0305-C F progeny
102 42-4 AD0322-C AD0306-C AD0305-C F progeny
103 42-4 AD0323-C AD0306-C AD0305-C F progeny

Let’s pick one of the progeny, AD0309-C.

ag3.plot_heterozygosity(
    sample="AD0309-C", 
    region="3R", 
    site_mask="gamb_colu", 
    window_size=10_000,
);
                                     

This sample has normal levels of heterozygosity, similar to wild samples. So how did this happen if both the parents showed signs of inbreeding?

The progeny of cross 42-4 are the result of mating a female from one lab colony to a male from a different lab colony. This F1 pairing of two different bottlenecked populations should restore heterozygosity as these cross progeny are effectively outbred, i.e., their parents are not closely related.

This is roughly analogous to breeding between two different pedigree breeds, such as dog breeds for example. Each of the parents may be relatively inbred due to many generations of mating within a small population, but the two different breeds are not closely related, and so their progeny show normal levels of heterozygosity.

Heterozygosity in island populations#

As we have learnt, taking a single individual (or small number of individuals) from the wild to seed a lab colony results in a reduction in heterozygosity as it results in a bottleneck event. A similar effect can also occur naturally when an animal populates an island from the mainland.

An island population will likely be founded with only a fraction of the mainland population diversity, plus island populations would tend to be smaller than those on the mainland, increasing inbreeding and the effect of drift (though not to the levels seen in the small lab colonies).

In Ag3.0 we have samples from the oceanic island Mayotte, ~1000km from mainland Africa and a similar distance from Madagascar. Due to these distances, it is likely that the Anopheles gambiae mosquito population was founded by just a few individuals. Let’s look at heterozygosity in a Mayotte sample.

sample_df.query("release == '3.0' and country == 'Mayotte'").head()
sample_id partner_sample_id contributor country location year month latitude longitude sex_call ... admin1_name admin1_iso admin2_name taxon cohort_admin1_year cohort_admin1_month cohort_admin1_quarter cohort_admin2_year cohort_admin2_month cohort_admin2_quarter
16063 AP0007-C 60 Igor Sharakhov Mayotte Mtsamboro Forest Reserve 2011 -1 -12.703 45.081 F ... Mayotte MYT Mtsamboro gambiae MYT_gamb_2011 MYT_gamb_2011 MYT_gamb_2011 MYT_Mtsamboro_gamb_2011 MYT_Mtsamboro_gamb_2011 MYT_Mtsamboro_gamb_2011
16064 AP0021-C 92 Igor Sharakhov Mayotte Karihani Lake 2011 -1 -12.797 45.122 F ... Mayotte MYT Tsingoni gambiae MYT_gamb_2011 MYT_gamb_2011 MYT_gamb_2011 MYT_Tsingoni_gamb_2011 MYT_Tsingoni_gamb_2011 MYT_Tsingoni_gamb_2011
16065 AP0019-C 88 Igor Sharakhov Mayotte Mtsanga Charifou 2011 -1 -12.991 45.156 M ... Mayotte MYT Kani-Kéli gambiae MYT_gamb_2011 MYT_gamb_2011 MYT_gamb_2011 MYT_Kani-Kéli_gamb_2011 MYT_Kani-Kéli_gamb_2011 MYT_Kani-Kéli_gamb_2011
16066 AP0020-C 78 Igor Sharakhov Mayotte Mtsanga Charifou 2011 -1 -12.991 45.156 M ... Mayotte MYT Kani-Kéli gambiae MYT_gamb_2011 MYT_gamb_2011 MYT_gamb_2011 MYT_Kani-Kéli_gamb_2011 MYT_Kani-Kéli_gamb_2011 MYT_Kani-Kéli_gamb_2011
16067 AP0009-C 62 Igor Sharakhov Mayotte Combani 2011 -1 -12.779 45.143 M ... Mayotte MYT Tsingoni gambiae MYT_gamb_2011 MYT_gamb_2011 MYT_gamb_2011 MYT_Tsingoni_gamb_2011 MYT_Tsingoni_gamb_2011 MYT_Tsingoni_gamb_2011

5 rows × 32 columns

ag3.plot_heterozygosity(
    sample="AP0007-C", 
    region="3R", 
    site_mask="gamb_colu", 
    window_size=10_000,
);
                                   
                                     

We can see from this plot, that the demography of the Mayotte population is likely different from any we have looked at so far. Though the peak heterozygosity across the 3R contig is ~1.5%, there is much more spread, with most windows falling between 1.5% and 0.5% heterozygosity. Interpretations of this result could be that the island population was founded some time ago, and heterozygosity is returning to levels like we saw on the mainland e.g. Burkina Faso, and/or this is the stable level of heterozygosity for a small population size of Anopheles mosquitoes.

Exercise 1 (English)

We have sampled another island population in Ag3.0, Bioko (part of Equatorial Guinea). Though, unlike Mayotte, Bioko is only ~30km away from mainland Africa.

  • Uncomment the code below and run a heterozygosity analysis for a sample from Bioko.

  • How does this individual compare to the sample from Mayotte we looked at?

  • Can you suggest why they might look different?

Exercice 1 (Français)

Nous avons des échantillons provenant d’une autre population insulaire dans Ag3.0: Bioko (qui fait partie de la Guinée-Équatoriale). Toutefois, contrairement à Mayotte, Bioko ne se trouve qu’à ~30 km du continent africain.

  • Décommenter le code ci-dessous et effectuer une analyse de l’hétérozygosité pour un échantillon venant de Bioko.

  • En quoi est-ce que cet individu est différent de celui que nous avons regardé pour Mayotte?

  • Pouvez-vous proposer une explication pour ces différences?

# sample_df.query("release == '3.0' and location == 'Bioko'")
# ag3.plot_heterozygosity(
#     sample="???", 
#     region="3R", 
#     site_mask="gamb_colu", 
#     window_size=10_000,
# );

Heterozygosity in an atypical wild population#

One of the reasons for running population genetic analyses like heterozygosity, is to detect populations that don’t act as we expect them to, as this means we might be missing something about their demography which could be relevant to vector control. Samples of the putative cryptic species gcx3, collected in Kenya, have strange patterns of heterozygosity for wild samples, let’s take a look and think about what the causes could be.

sample_df.query("release == '3.0' and taxon == 'gcx3' and country == 'Kenya'").head()
sample_id partner_sample_id contributor country location year month latitude longitude sex_call ... admin1_name admin1_iso admin2_name taxon cohort_admin1_year cohort_admin1_month cohort_admin1_quarter cohort_admin2_year cohort_admin2_month cohort_admin2_quarter
16881 AK0049-C KIL38 Janet Midega Kenya Kilifi 2012 -1 -3.511 39.909 F ... Kilifi KE-14 Kilifi North gcx3 KE-14_gcx3_2012 KE-14_gcx3_2012 KE-14_gcx3_2012 KE-14_Kilifi-North_gcx3_2012 KE-14_Kilifi-North_gcx3_2012 KE-14_Kilifi-North_gcx3_2012
16882 AK0050-C KIL39 Janet Midega Kenya Kilifi 2012 -1 -3.511 39.909 F ... Kilifi KE-14 Kilifi North gcx3 KE-14_gcx3_2012 KE-14_gcx3_2012 KE-14_gcx3_2012 KE-14_Kilifi-North_gcx3_2012 KE-14_Kilifi-North_gcx3_2012 KE-14_Kilifi-North_gcx3_2012
16883 AK0051-C KIL40 Janet Midega Kenya Kilifi 2012 -1 -3.511 39.909 F ... Kilifi KE-14 Kilifi North gcx3 KE-14_gcx3_2012 KE-14_gcx3_2012 KE-14_gcx3_2012 KE-14_Kilifi-North_gcx3_2012 KE-14_Kilifi-North_gcx3_2012 KE-14_Kilifi-North_gcx3_2012
16884 AK0052-C KIL41 Janet Midega Kenya Kilifi 2012 -1 -3.511 39.909 F ... Kilifi KE-14 Kilifi North gcx3 KE-14_gcx3_2012 KE-14_gcx3_2012 KE-14_gcx3_2012 KE-14_Kilifi-North_gcx3_2012 KE-14_Kilifi-North_gcx3_2012 KE-14_Kilifi-North_gcx3_2012
16888 AK0062-C KIL47 Janet Midega Kenya Kilifi 2012 -1 -3.511 39.909 F ... Kilifi KE-14 Kilifi North gcx3 KE-14_gcx3_2012 KE-14_gcx3_2012 KE-14_gcx3_2012 KE-14_Kilifi-North_gcx3_2012 KE-14_Kilifi-North_gcx3_2012 KE-14_Kilifi-North_gcx3_2012

5 rows × 32 columns

ag3.plot_heterozygosity(
    sample="AK0050-C", 
    region="3R", 
    site_mask="gamb_colu", 
    window_size=10_000,
);
                                   
                                     

So what is going on here? This wild caught sample has long runs of homozygosity - regions where heterozygosity is close to zero - more so than even some of our lab colony samples. It looks like this individual has come from a wild population that has been through a severe bottleneck recently. But what might cause that in a natural population?

One possibility is that a vector control campaign that ran in Kenya prior to the date of collection was so successful that it rapidly and drastically reduced the population size of these mosquitoes. Unfortunately, this is just conjecture as we don’t have longitudinal samples (before, during, after the control campaign) to test this hypothesis.

Fortunately many of our sample sets going forward do contain longitudinally sampled mosquitoes, some in conjunction with vector control roll-out projects. Monitoring heterozygosity with these samples could potentially indicate if a control method e.g. next-generation bednets, affects mosquito population sizes and thus provide an indication of its effectiveness.

Exercise 2 (English)

Another location where we have collected samples with atypical diversity statistics is Tanzania. Uncomment the code below and analyse the hetereozygosity of a gcx3 sample from Tanzania.

  • Does your Tanzanian sample’s heterozygosity differ from the sample from Kenya we looked at above? If so, how is it different?

Exercice 2 (Français)

Un autre lieu où nous avons des échantillons avec des statistiques de la diversité génétiques atypiques est la Tanzanie. Décommenter le code ci-dessous et analyser l’hétérozygotie d’un échantillon gcx3 tanzanien.

  • Est-ce que l’hétérozygotie de votre échantillon tanzanien est différente de celle de l’échantillon kenyan que nous avons regardé en premier? Si oui, comment?

# sample_df.query("release == '3.0' and taxon == 'gcx3' and country == 'Tanzania'")
# ag3.plot_heterozygosity(
#     sample="???", 
#     region="3R", 
#     site_mask="gamb_colu", 
#     window_size=10_000,
# );

Runs of homozygosity (ROH)#

Plotting heterozygosity across individual genomes provides a great visual way to investigate population demography. For example, plotting heterozygosity across 3R suggests that the island population of Mayotte differs from mainland populations, however, it doesn’t tell us exactly how it differs or by how much.

One analysis that provides a way of quantifying heterozygosity, or rather the opposite side of the same coin - homozygosity, is to measure runs of homozygosity or ROH. As we saw in our Kenyan heterozygosity plot above, runs of homozygosity are where we see multiple windows in row with near-zero heterozygosity (a.k.a. homozygosity).

To infer and measure these homozygous regions we can use a hidden markov model (HMM). HMMs were explained in workshop 2, module 3, where it was used to detect copy number variation from coverage data. But briefly, HMM is an algorithm that allows large, noisy datasets (such as heterozygosity), to be quickly scanned for patterns (such as runs of homozygosity).

To infer and visualise ROH, we can use the function plot_roh(). Let’s have a look at what parameters we need to specify for the plot_roh() function.

ag3.plot_roh?
Signature:
ag3.plot_roh(
    sample: Union[str, int],
    region: Union[str, malariagen_data.util.Region, Mapping],
    window_size: int = 20000,
    site_mask: str = 'default',
    sample_set: Optional[str] = None,
    phet_roh: float = 0.001,
    phet_nonroh: Tuple[float, ...] = (0.003, 0.01),
    transition: float = 0.001,
    y_max: float = 0.03,
    sizing_mode: Literal['fixed', 'stretch_width', 'stretch_height', 'stretch_both', 'scale_width', 'scale_height', 'scale_both'] = 'stretch_width',
    width: Optional[int] = None,
    heterozygosity_height: int = 170,
    roh_height: int = 40,
    genes_height: int = 90,
    circle_kwargs: Optional[Mapping] = None,
    show: bool = True,
    output_backend: Literal['canvas', 'webgl', 'svg'] = 'webgl',
) -> Optional[bokeh.model.model.Model]
Docstring:
Plot windowed heterozygosity and inferred runs of homozygosity for a
single sample over a genome region.

Parameters
----------
sample : str or int
    Sample identifier or index within sample set.
region : str or Region or Mapping
    Region of the reference genome. Can be a contig name, region string
    (formatted like "{contig}:{start}-{end}"), or identifier of a genome
    feature such as a gene or transcript.
window_size : int, optional, default: 20000
    Number of sites per window.
site_mask : str, optional, default: 'default'
    Which site filters mask to apply. See the `site_mask_ids` property for
    available values.
sample_set : str or None, optional
    Sample set identifier.
phet_roh : float, optional, default: 0.001
    Probability of observing a heterozygote in a ROH.
phet_nonroh : tuple of float, optional, default: (0.003, 0.01)
    One or more probabilities of observing a heterozygote outside a ROH.
transition : float, optional, default: 0.001
    Probability of moving between states. A larger window size may call
    for a larger transitional probability.
y_max : float, optional, default: 0.03
    Y axis limit.
sizing_mode : {'fixed', 'stretch_width', 'stretch_height', 'stretch_both', 'scale_width', 'scale_height', 'scale_both'}, optional, default: 'stretch_width'
    Bokeh plot sizing mode, see also https://docs.bokeh.org/en/latest/docs
    /user_guide/basic/layouts.html#sizing-modes.
width : int or None, optional
    Plot width in pixels (px).
heterozygosity_height : int, optional, default: 170
    Plot height in pixels (px).
roh_height : int, optional, default: 40
    Plot height in pixels (px).
genes_height : int, optional, default: 90
    Genes track height in pixels (px).
circle_kwargs : Mapping or None, optional
    Passed through to bokeh circle() function.
show : bool, optional, default: True
    If true, show the plot. If False, do not show the plot, but return the
    figure.
output_backend : {'canvas', 'webgl', 'svg'}, optional, default: 'webgl'
    Specify an output backend to render a plot area onto. See also
    https://docs.bokeh.org/en/latest/docs/user_guide/output/webgl.html.

Returns
-------
Model or None
    A bokeh figure (only returned if show=False).
File:      /home/conda/developer/2835ffd46e5535e81e9672d0828c143d4f1444491ac44f0aeddd90156919e446-20240426-081151-481742-64-training-nb-maintenance-mgen-9.0.0/lib/python3.10/site-packages/malariagen_data/anopheles.py
Type:      method

Let’s now plot ROH for the Kenyan sample we looked at earlier.

ag3.plot_roh(
    sample="AK0050-C", 
    region="3R", 
    site_mask="gamb_colu", 
    window_size=10_000,
);
                                     

The ROH analysis starts with the heterozygosity data that we plotted earlier and the top panel of the plot here is identical to the output of the plot_heterozygosity() function. The results from the HMM ROH analysis have been plotted below this, where we can see 23 blocks of ROH have been detected.

To access a Pandas DataFrame of the underlying ROH data, we can use another function, roh_hmm, with exactly the same parameters as before. The output DataFrame shows us the start, end and length of each ROH inferred by the HMM.

roh_df = ag3.roh_hmm(
    sample="AK0050-C", 
    region="3R", 
    site_mask="gamb_colu", 
    window_size=10_000,
)
roh_df
sample_id contig roh_start roh_stop roh_length roh_is_marginal
0 AK0050-C 3R 180 420337 420157 True
1 AK0050-C 3R 432202 1690670 1258468 False
2 AK0050-C 3R 1723873 2003686 279813 False
3 AK0050-C 3R 6108642 6123679 15037 False
4 AK0050-C 3R 11288732 12567018 1278286 False
5 AK0050-C 3R 12580510 14932888 2352378 False
6 AK0050-C 3R 17090690 17115221 24531 False
7 AK0050-C 3R 18307285 18605951 298666 False
8 AK0050-C 3R 19013506 19039488 25982 False
9 AK0050-C 3R 19451384 20434062 982678 False
10 AK0050-C 3R 20449042 21095747 646705 False
11 AK0050-C 3R 25613447 28241877 2628430 False
12 AK0050-C 3R 28522124 29437036 914912 False
13 AK0050-C 3R 29969536 33355367 3385831 False
14 AK0050-C 3R 33389150 38051606 4662456 False
15 AK0050-C 3R 38094971 39265672 1170701 False
16 AK0050-C 3R 39473670 39486954 13284 False
17 AK0050-C 3R 39946985 40124985 178000 False
18 AK0050-C 3R 40178007 41108589 930582 False
19 AK0050-C 3R 41149111 45242143 4093032 False
20 AK0050-C 3R 45275124 52790662 7515538 False
21 AK0050-C 3R 52850062 52996177 146115 False
22 AK0050-C 3R 53044868 53080557 35689 False

In the Ag1000G phase 1 paper we computed ROH for all samples, allowing us to plot and compare them.

In the plot, each marker represents an individual mosquito. The Y-axis is the count of runs of homozygosity and the X-axis is the fraction of the genome composed of runs of homozygosity. Because these kind of data are inherently noisy, we filtered the ROH to only include those longer than 100,000 bp (100 kbp).

Lab colony parents are shown in black. We can see how most of them have lots of runs of homozygosity which make up much of their genomes, while most wild caught cohorts have much smaller fractions of their genomes made up of ROH.

Exercise 3 (English)

  • What country might the grey samples come from (Hint: we’ve looked at strange patterns there today).

  • Why might two of the lab colony parents (labelled “G”) have both low ROH count and fraction ROH?

Exercice 3 (Français)

  • De quel pays peuvent venir les échantillons gris du diagramme ci-dessus? (Indice: nous avons observé des motifs étranges dans ce pays aujourd’hui)

  • Pourquoi est-ce que deux des parents venant de colonies (ceux étiquetés “G”) ont un compte des ROHs et une fraction de ROH si faibles?

Let’s run our Kenyan sample through the same process we used to create the plot above.

First, filter the ROH DataFrame to just include runs of homozygosity longer than 100 kbp.

roh_df_100kb = roh_df.query("roh_length > 100_000")
roh_df_100kb
sample_id contig roh_start roh_stop roh_length roh_is_marginal
0 AK0050-C 3R 180 420337 420157 True
1 AK0050-C 3R 432202 1690670 1258468 False
2 AK0050-C 3R 1723873 2003686 279813 False
4 AK0050-C 3R 11288732 12567018 1278286 False
5 AK0050-C 3R 12580510 14932888 2352378 False
7 AK0050-C 3R 18307285 18605951 298666 False
9 AK0050-C 3R 19451384 20434062 982678 False
10 AK0050-C 3R 20449042 21095747 646705 False
11 AK0050-C 3R 25613447 28241877 2628430 False
12 AK0050-C 3R 28522124 29437036 914912 False
13 AK0050-C 3R 29969536 33355367 3385831 False
14 AK0050-C 3R 33389150 38051606 4662456 False
15 AK0050-C 3R 38094971 39265672 1170701 False
17 AK0050-C 3R 39946985 40124985 178000 False
18 AK0050-C 3R 40178007 41108589 930582 False
19 AK0050-C 3R 41149111 45242143 4093032 False
20 AK0050-C 3R 45275124 52790662 7515538 False
21 AK0050-C 3R 52850062 52996177 146115 False

Now we can count the number of runs of homozygosity longer that 100 kbp…

roh_count = len(roh_df_100kb)
roh_count
18

…and, calculate the fraction ROH (fraction of the genome composed of ROH longer than 100_000 bp), by summing the “roh_length” column, then dividing this by the total length of the region.

roh_total_length = roh_df_100kb["roh_length"].sum()
roh_total_length
33142748
froh = roh_total_length / len(ag3.genome_sequence(region="3R"))
froh
0.6229759752712953

So, we can see that this individual is incredibly inbred for a wild caught mosquito, over 60% of the 3R chromosome arm is composed of long runs of homozygosity!

N.B., our ROH count will be different from the result shown for this individual in the Ag1000G phase 1 paper, because the paper combines ROH across multiple chromosomes, whereas here we have just calculated it across a single chromosome arm.

Exercise 4 (English)

Choose another sample from any mosquito population, and compute the number of ROH (roh_count) and fraction of genome in a ROH (froh) using only ROH larger than 100 kbp.

Exercice 4 (Français)

Choisissez un autre échantillon de n’importe quelle population de moustiques et calculez le nombre de ROH (roh_count) et la fraction de génome dans un ROH (froh) en utilisant uniquement des ROH supérieur à 100 kbp.

Well done!#

  • In this module we have learnt about using heterozygosity and runs of homozygosity as tools for understanding mosquito population demography.

  • Run, plotted and interpreted heterozygosity across regions of the genome.

  • Run, plotted and interpreted ROH analysis.

Exercises#

Open this notebook in Google Colab, copy to your drive, clear outputs and run it for yourself from top to bottom. As you run through the notebook, cell by cell, think about what each cell is doing.

Have go at the practical exercises, but please don’t worry if you don’t have time to do them all during the practical session, and please ask the teaching assistants for help if you are stuck.

Hint: To open the notebook in Google Colab, click the rocket icon at the top of the page, then select “Colab” from the drop-down menu.