Methods#

Data sources#

This report analyses genome variation data from the Malaria Vector Genome Observatory. See Table 1 below for a complete list of the sample sets used in the current analysis version, with information about the corresponding contributors, data releases and citations. These sample sets provide data for a total of 656 mosquitoes sampled from 13 countries.

Table 1. Data sources included in the current analysis version.
| Sample Set | Study | Contributor | Data Release | Citation |
| --- | --- | --- | --- | --- |
| 1229-VO-GH-DADZIE-VMF00095 | 1229-VO-GH-DADZIE | Samuel Dadzie | Af1.0 | Boddé, Nwezeobi et al. 2024 |
| 1230-VO-GA-CF-AYALA-VMF00045 | 1230-VO-MULTI-AYALA | Diego Ayala | Af1.0 | Boddé, Nwezeobi et al. 2024 |
| 1231-VO-MULTI-WONDJI-VMF00043 | 1231-VO-MULTI-WONDJI | Charles Wondji | Af1.0 | Boddé, Nwezeobi et al. 2024 |
| 1232-VO-KE-OCHOMO-VMF00044 | 1232-VO-KE-OCHOMO | Eric Ochomo | Af1.0 | Boddé, Nwezeobi et al. 2024 |
| 1235-VO-MZ-PAAIJMANS-VMF00094 | 1235-VO-MZ-PAAIJMANS | Krijn Paaijmans | Af1.0 | Boddé, Nwezeobi et al. 2024 |
| 1236-VO-TZ-OKUMU-VMF00090 | 1236-VO-TZ-OKUMU | Fredros Okumu | Af1.0 | Boddé, Nwezeobi et al. 2024 |
| 1240-VO-CD-KOEKEMOER-VMF00099 | 1240-VO-MULTI-KOEKEMOER | Lizette Koekemoer | Af1.0 | Boddé, Nwezeobi et al. 2024 |
| 1240-VO-MZ-KOEKEMOER-VMF00101 | 1240-VO-MULTI-KOEKEMOER | Lizette Koekemoer | Af1.0 | Boddé, Nwezeobi et al. 2024 |

Sample metadata, unphased SNP calls, and phased SNP haplotypes were retrieved from the Malaria Vector Genome Observatory cloud data repository hosted in Google Cloud Storage (GCS) via the MalariaGEN Python API version 15.0.1.

Sample inclusion and grouping into cohorts#

Samples were considered for inclusion if they met the following criteria:

  • Sex assigned as female via comparison of sequence coverage on autosomes and sex chromosomes.

  • Taxon assigned as funestus via principal component analysis of genomic data from Chromosome 3 and comparison with reference samples with known taxon assignments.

After filtering according to these inclusion criteria, samples were grouped into cohorts by taxon, location of sampling and date of sampling. Samples were grouped spatially if their collection locations were within the same level 2 administrative unit, according to geoBoundaries version 5.0.0. Samples were grouped temporally if their collection dates fell within the same quarter (three-month period) where possible; in a small number of cases the metadata recorded only the year of collection.
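The grouping rule above can be sketched in a few lines of Python. This is an illustration only, not the pipeline's actual code: the field names (`sample_id`, `taxon`, `admin2`, `year`, `month`) are hypothetical, and the sketch assumes every sample has a collection month, so it does not cover the year-only cases mentioned above.

```python
from collections import defaultdict


def quarter_of(month):
    """Return the quarter (1-4) that contains a calendar month (1-12)."""
    return (month - 1) // 3 + 1


def group_into_cohorts(samples):
    """Group sample records by (taxon, admin level 2 unit, year, quarter).

    `samples` is a list of dicts with hypothetical keys:
    sample_id, taxon, admin2, year, month.
    """
    cohorts = defaultdict(list)
    for s in samples:
        key = (s["taxon"], s["admin2"], s["year"], quarter_of(s["month"]))
        cohorts[key].append(s["sample_id"])
    return dict(cohorts)


# Toy records (made up for illustration).
samples = [
    {"sample_id": "S1", "taxon": "funestus", "admin2": "Ouidah", "year": 2014, "month": 4},
    {"sample_id": "S2", "taxon": "funestus", "admin2": "Ouidah", "year": 2014, "month": 6},
    {"sample_id": "S3", "taxon": "funestus", "admin2": "Nyando", "year": 2014, "month": 5},
]
cohorts = group_into_cohorts(samples)
```

Here S1 and S2 fall into the same cohort (same district, both in the second quarter of 2014), while S3 forms a separate cohort because its district differs.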

Cohorts were excluded from the analysis if the sample size was less than 15. Cohorts with more than 100 samples were randomly downsampled for computational efficiency. Cohorts were also excluded from the analysis if they failed H12 or G123 window size calibration (see below). After applying these filters, a total of 15 cohorts were retained for analysis (Table 2).
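As a rough illustration of the size filters, the sketch below drops cohorts with fewer than 15 samples and randomly downsamples larger ones. The text does not state the size to which large cohorts were downsampled; the sketch assumes 100, and the function name, seed and input layout are likewise hypothetical. The calibration-based exclusion is not shown here.

```python
import random

MIN_COHORT_SIZE = 15   # cohorts smaller than this are excluded
MAX_COHORT_SIZE = 100  # assumed downsampling target (not stated in the text)


def filter_and_downsample(cohorts, seed=0):
    """Apply the cohort size filters to a dict of cohort key -> sample IDs."""
    rng = random.Random(seed)  # fixed seed so the random draw is reproducible
    selected = {}
    for key, sample_ids in cohorts.items():
        if len(sample_ids) < MIN_COHORT_SIZE:
            continue  # too few samples
        if len(sample_ids) > MAX_COHORT_SIZE:
            # Random draw without replacement.
            sample_ids = rng.sample(list(sample_ids), MAX_COHORT_SIZE)
        selected[key] = sorted(sample_ids)
    return selected
```

Fixing the random seed is a design choice made here so that repeated runs select the same samples.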

Table 2. Cohorts selected for genome-wide selection scan analyses.
| Cohort | Country | Region | District | Taxon | Year | Quarter | Sample Size |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Benin / Ouidah / funestus / 2014 / Q2 | Benin | Atlantique | Ouidah | funestus | 2014 | 2 | 37 |
| Democratic Republic of the Congo / Watsa / funestus / 2017 / Q4 | Democratic Republic of the Congo | Upper Uele | Watsa | funestus | 2017 | 4 | 43 |
| Democratic Republic of the Congo / Kinshasa / funestus / 2015 / Q2 | Democratic Republic of the Congo | Kinshasa | Kinshasa | funestus | 2015 | 2 | 34 |
| Cameroon / Mayo-Banyo / funestus / 2014 / Q3 | Cameroon | Adamaoua | Mayo-Banyo | funestus | 2014 | 3 | 45 |
| Gabon / Mpassa / funestus / 2017 / Q4 | Gabon | Haut-Ogooué | Mpassa | funestus | 2017 | 4 | 26 |
| Ghana / Adansi Akrofuom / funestus / 2014 / Q1 | Ghana | Ashanti Region | Adansi Akrofuom | funestus | 2014 | 1 | 31 |
| Kenya / Mt. Elgon / funestus / 2016 / Q4 | Kenya | Bungoma | Mt. Elgon | funestus | 2016 | 4 | 21 |
| Kenya / Nyando / funestus / 2014 / Q2 | Kenya | Kisumu | Nyando | funestus | 2014 | 2 | 35 |
| Malawi / Chikwawa / funestus / 2014 / Q1 | Malawi | Southern Region | Chikwawa | funestus | 2014 | 1 | 18 |
| Mozambique / Manhiça / funestus / 2016 / Q2 | Mozambique | Maputo | Manhiça | funestus | 2016 | 2 | 22 |
| Mozambique / Manhiça / funestus / 2018 / Q1 | Mozambique | Maputo | Manhiça | funestus | 2018 | 1 | 58 |
| Mozambique / Palma / funestus / 2015 / Q3 | Mozambique | Cabo Delgado | Palma | funestus | 2015 | 3 | 40 |
| Nigeria / Remo North / funestus / 2015 / Q1 | Nigeria | Ogun | Remo North | funestus | 2015 | 1 | 41 |
| Uganda / Tororo / funestus / 2014 / Q2 | Uganda | Eastern Region | Tororo | funestus | 2014 | 2 | 49 |
| Zambia / Sinda / funestus / 2016 / Q3 | Zambia | Eastern | Sinda | funestus | 2016 | 3 | 43 |

H12 and G123 window size calibration#

Both H12 (Garud et al. 2015) and G123 (Harris et al. 2018) are statistics for performing genome-wide selection scans, and both rely on dividing data into windows along the genome. Typically the size of these windows is set to a fixed number of polymorphic sites (SNPs), i.e., all windows contain data (either phased haplotypes or unphased genotypes) for the same number of SNPs. To detect recent selective sweeps, the window size needs to be chosen so that windows are generally larger than the genetic distance over which linkage disequilibrium (LD) normally decays to background levels in the absence of recent positive selection. In windows which are unaffected by recent selective sweeps, genetic diversity will then be high and thus the values of the selection statistics will be low. Conversely, in windows affected by recent selective sweeps, LD will extend over a longer genetic distance spanning multiple windows, so that genetic diversity within those windows is low and thus the values of the selection statistics will be high.

In other words, the choice of window size affects the signal-to-noise ratio for selection scans using the H12 and G123 statistics. If windows are too small, results are dominated by background noise. If windows are too large, noise is minimal but power to detect recent selection signals is reduced.
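For reference, the sketch below computes both statistics within a single window, following their published definitions. With haplotype frequencies sorted so that p1 ≥ p2 ≥ ..., H12 = (p1 + p2)² + Σ_{i>2} p_i². G123 is the analogue over unphased multilocus genotype frequencies, pooling the three most frequent. The function names and the input representation (one hashable tuple per haplotype or genotype) are assumptions made for illustration, and this is not the code used to produce the report.

```python
from collections import Counter


def pooled_homozygosity(items, n_pooled):
    """Homozygosity with the `n_pooled` most frequent distinct items pooled.

    `items` holds one hashable entry per haplotype (or per individual's
    multilocus genotype) observed in a single window.
    """
    counts = Counter(items)
    total = sum(counts.values())
    freqs = sorted((c / total for c in counts.values()), reverse=True)
    pooled = sum(freqs[:n_pooled])
    return pooled ** 2 + sum(f ** 2 for f in freqs[n_pooled:])


def h12(haplotypes):
    """H12: pools the two most frequent haplotypes in the window."""
    return pooled_homozygosity(haplotypes, 2)


def g123(genotypes):
    """G123: pools the three most frequent multilocus genotypes."""
    return pooled_homozygosity(genotypes, 3)


# Toy window: 10 haplotypes over 3 SNPs with frequencies 0.6, 0.2 and 0.2,
# so H12 = (0.6 + 0.2)^2 + 0.2^2 = 0.68.
window = [(0, 1, 0)] * 6 + [(1, 1, 0)] * 2 + [(0, 0, 1)] * 2
h12_value = h12(window)
```

A window dominated by one or two haplotypes, as expected under a sweep, gives a value close to 1, while a window with many distinct low-frequency haplotypes gives a value close to 0.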

The decision regarding an appropriate window size needs to be made independently for each cohort of samples over which a selection scan will be performed. This is because different source populations may have different demographic histories, and this in turn may alter the genetic distance over which LD decays in the absence of positive selection. Previous studies have used various demographic inference methods to infer key demographic parameters for each cohort being analysed, then used these parameters to inform the choice of window size. In practice, this approach presents a number of challenges. Firstly, inference of demographic parameters is difficult, and even state-of-the-art inference methods may reach inaccurate conclusions. Secondly, running demographic inference methods can be computationally demanding, and this becomes impractical for large numbers of cohorts.

For these reasons we have taken an empirical approach to window size calibration for H12 and G123 scans, designed to reach a good signal-to-noise ratio.

For each cohort, we compute H12 over contig 3RL for each of the following window sizes: 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1200, 1400, 1600, 1800, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 6000, 7000, 8000, 9000 and 10000 SNPs. For each window size, we then compute the 95th percentile of H12 values over all windows. We choose the smallest window size for which the 95th percentile is below 0.08. With this choice, any window with an H12 value above 0.08 is in the top 5% of windows.

Similarly, we compute G123 over contig 3RL for each of the same window sizes, from 100 to 10000 SNPs. For each window size, we then compute the 95th percentile of G123 values over all windows, and choose the smallest window size for which the 95th percentile is below 0.08.
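The calibration rule described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's implementation: `scan` is a hypothetical callable that returns the per-window statistic values (H12 or G123) for a given window size, and the percentile is computed by linear interpolation between order statistics, which may differ from the method actually used.

```python
WINDOW_SIZES = [
    100, 200, 300, 400, 500, 600, 700, 800, 900, 1000,
    1200, 1400, 1600, 1800, 2000, 2500, 3000, 3500, 4000, 4500, 5000,
    6000, 7000, 8000, 9000, 10000,
]
THRESHOLD = 0.08


def percentile(values, q):
    """q-th percentile (0-100), linearly interpolated between sorted values."""
    xs = sorted(values)
    if not xs:
        raise ValueError("no values")
    pos = (len(xs) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def calibrate_window_size(scan, window_sizes=WINDOW_SIZES,
                          threshold=THRESHOLD, q=95):
    """Return the smallest window size whose q-th percentile is below
    `threshold`, or None if no tested size qualifies.

    `scan(size)` is a hypothetical callable giving the statistic value of
    every window along the contig for that window size.
    """
    for size in sorted(window_sizes):
        values = scan(size)
        if values and percentile(values, q) < threshold:
            return size
    return None


def toy_scan(size):
    """Made-up scan whose values shrink as windows grow (illustration only)."""
    return [20.0 / size] * 20


chosen = calibrate_window_size(toy_scan)
```

With the toy scan, sizes 100 and 200 give a 95th percentile of 0.2 and 0.1 respectively, so the first size that falls below 0.08 is 300. A return value of `None` corresponds to a cohort that fails calibration.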

Cohorts for which none of the tested window sizes met this criterion, for either H12 or G123, failed calibration and were excluded from the analysis.

H12 genome-wide selection scans#

TODO

G123 genome-wide selection scans#

TODO

IHS genome-wide selection scans#

TODO

Automated detection of selection signals#

TODO

Identification of selection alerts#

TODO

Web report generation#

TODO