Methods

Data sources

This report analyses genome variation data from the Malaria Vector Genome Observatory. See Table 1 below for a complete list of the sample sets used in the current analysis version, with information about the corresponding contributors, data releases and citations. These sample sets provide data for a total of 4,878 mosquitoes sampled from 25 countries.

Table 1. Data sources included in the current analysis version.
| Sample Set | Study | Contributor | Data Release | Citation |
|---|---|---|---|---|
| AG1000G-AO | AG1000G-AO | Joao Pinto | Ag3.0 | Anopheles gambiae 1000 Genomes Consortium 2017 |
| AG1000G-BF-A | AG1000G-BF-1 | Austin Burt | Ag3.0 | Anopheles gambiae 1000 Genomes Consortium 2017 |
| AG1000G-BF-B | AG1000G-BF-1 | Austin Burt | Ag3.0 | Anopheles gambiae 1000 Genomes Consortium 2021 |
| AG1000G-BF-C | AG1000G-BF-2 | Nora Besansky | Ag3.0 | Anopheles gambiae 1000 Genomes Consortium 2021 |
| AG1000G-CD | AG1000G-CD | David Weetman | Ag3.0 | Anopheles gambiae 1000 Genomes Consortium 2021 |
| AG1000G-CF | AG1000G-CF | Alessandra della Torre | Ag3.0 | Anopheles gambiae 1000 Genomes Consortium 2021 |
| AG1000G-CI | AG1000G-CI | David Weetman | Ag3.0 | Anopheles gambiae 1000 Genomes Consortium 2020 |
| AG1000G-CM-A | AG1000G-CM-1 | Nora Besansky | Ag3.0 | Anopheles gambiae 1000 Genomes Consortium 2017 |
| AG1000G-CM-B | AG1000G-CM-2 | Nora Besansky | Ag3.0 | Anopheles gambiae 1000 Genomes Consortium 2021 |
| AG1000G-CM-C | AG1000G-CM-3 | Brad White | Ag3.0 | Anopheles gambiae 1000 Genomes Consortium 2021 |
| AG1000G-FR | AG1000G-FR | Igor Sharakhov | Ag3.0 | Anopheles gambiae 1000 Genomes Consortium 2020 |
| AG1000G-GA-A | AG1000G-GA-1 | Joao Pinto | Ag3.0 | Anopheles gambiae 1000 Genomes Consortium 2017 |
| AG1000G-GH | AG1000G-GH | David Weetman | Ag3.0 | Anopheles gambiae 1000 Genomes Consortium 2020 |
| AG1000G-GM-A | AG1000G-GM-1 | Martin Donnelly | Ag3.0 | Anopheles gambiae 1000 Genomes Consortium 2020 |
| AG1000G-GM-B | AG1000G-GM-2 | Beniamino Caputo | Ag3.0 | Anopheles gambiae 1000 Genomes Consortium 2021 |
| AG1000G-GM-C | AG1000G-GM-3 | Charles Godfray | Ag3.0 | Anopheles gambiae 1000 Genomes Consortium 2021 |
| AG1000G-GN-A | AG1000G-GN-ML | Ken Vernick | Ag3.0 | Anopheles gambiae 1000 Genomes Consortium 2017 |
| AG1000G-GN-B | AG1000G-GN-ML | Ken Vernick | Ag3.0 | Anopheles gambiae 1000 Genomes Consortium 2021 |
| AG1000G-GQ | AG1000G-GQ | Igor Sharakhov | Ag3.0 | Anopheles gambiae 1000 Genomes Consortium 2020 |
| AG1000G-GW | AG1000G-GW | Joao Pinto | Ag3.0 | Anopheles gambiae 1000 Genomes Consortium 2017 |
| AG1000G-KE | AG1000G-KE | Janet Midega | Ag3.0 | Anopheles gambiae 1000 Genomes Consortium 2017 |
| AG1000G-ML-A | AG1000G-ML-1 | Austin Burt | Ag3.0 | Anopheles gambiae 1000 Genomes Consortium 2021 |
| AG1000G-ML-B | AG1000G-ML-2 | Nora Besansky | Ag3.0 | Anopheles gambiae 1000 Genomes Consortium 2021 |
| AG1000G-MW | AG1000G-MW | Martin Donnelly | Ag3.0 | Anopheles gambiae 1000 Genomes Consortium 2021 |
| AG1000G-MZ | AG1000G-MZ | Joao Pinto | Ag3.0 | Anopheles gambiae 1000 Genomes Consortium 2021 |
| AG1000G-TZ | AG1000G-TZ | Bilali Kabula, David Weetman | Ag3.0 | Anopheles gambiae 1000 Genomes Consortium 2021 |
| AG1000G-UG | AG1000G-UG | Martin Donnelly | Ag3.0 | Anopheles gambiae 1000 Genomes Consortium 2017 |
| 1237-VO-BJ-DJOGBENOU-VMF00050 | 1237-VO-BJ-DJOGBENOU | Luc Djogbenou | Ag3.2 | Lucas et al. 2023 |
| 1237-VO-BJ-DJOGBENOU-VMF00067 | 1237-VO-BJ-DJOGBENOU | Luc Djogbenou | Ag3.2 | Lucas et al. 2023 |
| 1244-VO-GH-YAWSON-VMF00051 | 1244-VO-GH-YAWSON | Alexander Egyir-Yawson | Ag3.2 | Lucas et al. 2023 |
| 1245-VO-CI-CONSTANT-VMF00054 | 1245-VO-CI-CONSTANT | Edi Constant | Ag3.2 | Lucas et al. 2023 |
| 1253-VO-TG-DJOGBENOU-VMF00052 | 1253-VO-TG-DJOGBENOU | Luc Djogbenou | Ag3.2 | Lucas et al. 2023 |
| 1178-VO-UG-LAWNICZAK-VMF00025 | 1178-VO-UG-LAWNICZAK | Mara Lawniczak | Ag3.3 | |
| 1244-VO-GH-YAWSON-VMF00149 | 1244-VO-GH-YAWSON | Alexander Egyir-Yawson | Ag3.4 | Nagi et al. in prep. |
| crawford-2016 | crawford-2016 | Jacob E Crawford | Ag3.7 | Crawford et al. 2016 |
| tennessen-2021 | tennessen-2021 | Jacob Tennessen | Ag3.8 | Tennessen et al. 2021 |
| bergey-2019 | bergey-2019 | Christina Bergey | Ag3.9 | Bergey et al. 2019 |
| campos-2021 | campos-2021 | Greg Lanzaro | Ag3.9 | Campos et al. 2021, Lanzaro et al. 2021 |
| fontaine-2015-rebuild | fontaine-2015-rebuild | Michael.C.Fontaine | Ag3.10 | Fontaine et al. 2014 |

Sample metadata, unphased SNP calls, and phased SNP haplotypes were retrieved from the Malaria Vector Genome Observatory cloud data repository hosted in Google Cloud Storage (GCS) via the MalariaGEN Python API version 15.0.1.

Sample inclusion and grouping into cohorts

Samples were considered for inclusion if they met the following criteria:

  • Sex assigned as female via comparison of sequence coverage on autosomes and sex chromosomes.

  • Taxon assigned as gambiae, coluzzii, arabiensis or bissau via principal components analysis of genomic data from Chromosome 3 and comparison with reference samples with known taxon assignments.

After filtering according to these inclusion criteria, samples were grouped into cohorts by taxon, location of sampling and date of sampling. Samples were grouped spatially if their collection locations were within the same level 2 administrative unit, according to geoBoundaries version 5.0.0. Samples were grouped temporally if their collection dates were within the same quarter (3-month period) where possible. In a small number of cases where only the year of collection was available in the metadata, samples were grouped by year.
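The inclusion and grouping rules above can be sketched in a few lines of Python. This is a minimal illustration only: the record fields (`sex`, `taxon`, `admin2`, `year`, `quarter`) and the example samples are hypothetical and are not the actual metadata schema, and the sex and taxon calls are assumed to have been made upstream.

```python
from collections import defaultdict

# Hypothetical sample records; field names and values are illustrative only.
samples = [
    {"id": "S1", "sex": "F", "taxon": "gambiae", "admin2": "Houet", "year": 2012, "quarter": 3},
    {"id": "S2", "sex": "F", "taxon": "gambiae", "admin2": "Houet", "year": 2012, "quarter": 3},
    {"id": "S3", "sex": "M", "taxon": "gambiae", "admin2": "Houet", "year": 2012, "quarter": 3},
    {"id": "S4", "sex": "F", "taxon": "coluzzii", "admin2": "Comoe", "year": 2011, "quarter": None},
    {"id": "S5", "sex": "F", "taxon": "unassigned", "admin2": "Comoe", "year": 2011, "quarter": None},
]

# Taxa named in the inclusion criteria.
INCLUDED_TAXA = {"gambiae", "coluzzii", "arabiensis", "bissau"}


def is_included(sample):
    """Apply the two inclusion criteria: female, and one of the listed taxa."""
    return sample["sex"] == "F" and sample["taxon"] in INCLUDED_TAXA


def cohort_key(sample):
    """Group by taxon, level 2 administrative unit, year and quarter.

    A quarter of None stands for samples where only the year is known,
    so those samples end up in a year-level cohort.
    """
    return (sample["taxon"], sample["admin2"], sample["year"], sample["quarter"])


def group_into_cohorts(records):
    """Return a dict mapping cohort key to the list of included sample ids."""
    cohorts = defaultdict(list)
    for record in records:
        if is_included(record):
            cohorts[cohort_key(record)].append(record["id"])
    return dict(cohorts)


cohorts = group_into_cohorts(samples)
for key, ids in cohorts.items():
    print(key, ids)
```

In this toy input, the male sample and the sample without an accepted taxon assignment are dropped, leaving one quarter-level cohort and one year-level cohort.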

Cohorts were excluded from the analysis if the sample size was less than 15. Cohorts with more than 100 samples were randomly downsampled for computational efficiency. Cohorts were also excluded from the analysis if they failed H12 or G123 window size calibration (see below). After applying these filters, a total of 61 cohorts were retained for analysis (Table 2).
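As a rough sketch of the size filter and downsampling step, assuming that oversized cohorts are reduced to exactly 100 samples (the text states only that cohorts with more than 100 samples were downsampled, so the target size is an assumption), and using a hypothetical helper name:

```python
import random

MIN_COHORT_SIZE = 15   # cohorts smaller than this are excluded (from the text)
MAX_COHORT_SIZE = 100  # ASSUMPTION: downsampling target; the text gives only the trigger


def select_cohort_samples(sample_ids, rng):
    """Return the sample ids to analyse for one cohort, or None if excluded.

    `rng` is a random.Random instance so that downsampling is reproducible.
    """
    if len(sample_ids) < MIN_COHORT_SIZE:
        return None  # cohort too small, excluded from the analysis
    if len(sample_ids) > MAX_COHORT_SIZE:
        # Random downsampling without replacement.
        return sorted(rng.sample(list(sample_ids), MAX_COHORT_SIZE))
    return list(sample_ids)


rng = random.Random(42)  # fixed seed, chosen arbitrarily for this example
small = [f"S{i}" for i in range(10)]
medium = [f"S{i}" for i in range(15)]
large = [f"S{i}" for i in range(163)]

print(select_cohort_samples(small, rng))        # excluded cohort
print(len(select_cohort_samples(medium, rng)))  # kept whole
print(len(select_cohort_samples(large, rng)))   # downsampled
```

Passing a seeded random number generator keeps the downsampled subset stable between runs, which matters if downstream results are to be reproduced.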

Table 2. Cohorts selected for genome-wide selection scan analyses.
| Cohort | Country | Region | District | Taxon | Year | Quarter | Sample Size |
|---|---|---|---|---|---|---|---|
| Angola / Luanda / coluzzii / 2009 / Q2 | Angola | Luanda | Luanda | coluzzii | 2009 | 2 | 77 |
| Burkina Faso / Comoe / coluzzii / 2011 | Burkina Faso | Cascades | Comoe | coluzzii | 2011 | | 18 |
| Burkina Faso / Comoe / coluzzii / 2012 | Burkina Faso | Cascades | Comoe | coluzzii | 2012 | | 63 |
| Burkina Faso / Comoe / coluzzii / 2015 | Burkina Faso | Cascades | Comoe | coluzzii | 2015 | | 33 |
| Burkina Faso / Comoe / coluzzii / 2016 | Burkina Faso | Cascades | Comoe | coluzzii | 2016 | | 53 |
| Burkina Faso / Houet / coluzzii / 2012 / Q3 | Burkina Faso | Hauts-Bassins | Houet | coluzzii | 2012 | 3 | 78 |
| Burkina Faso / Houet / coluzzii / 2014 / Q3 | Burkina Faso | Hauts-Bassins | Houet | coluzzii | 2014 | 3 | 32 |
| Burkina Faso / Houet / gambiae / 2012 / Q3 | Burkina Faso | Hauts-Bassins | Houet | gambiae | 2012 | 3 | 73 |
| Burkina Faso / Houet / gambiae / 2014 / Q3 | Burkina Faso | Hauts-Bassins | Houet | gambiae | 2014 | 3 | 41 |
| Benin / Djougou / coluzzii / 2017 / Q2 | Benin | Donga | Djougou | coluzzii | 2017 | 2 | 78 |
| Benin / Djougou / gambiae / 2017 / Q2 | Benin | Donga | Djougou | gambiae | 2017 | 2 | 30 |
| Benin / Djougou / gambiae / 2017 / Q3 | Benin | Donga | Djougou | gambiae | 2017 | 3 | 34 |
| Benin / Avrankou / coluzzii / 2017 / Q3 | Benin | Oueme | Avrankou | coluzzii | 2017 | 3 | 88 |
| Democratic Republic of the Congo / Gbadolite / gambiae / 2015 / Q3 | Democratic Republic of the Congo | Nord-Ubangi | Gbadolite | gambiae | 2015 | 3 | 44 |
| Central African Republic / Bangui / gambiae / 1994 / Q1 | Central African Republic | Bangui | Bangui | gambiae | 1994 | 1 | 53 |
| Cote d'Ivoire / Sud-Comoe / gambiae / 2017 / Q3 | Cote d'Ivoire | Comoe | Sud-Comoe | gambiae | 2017 | 3 | 37 |
| Cote d'Ivoire / Agneby-Tiassa / coluzzii / 2012 | Cote d'Ivoire | Lagunes | Agneby-Tiassa | coluzzii | 2012 | | 80 |
| Cameroon / Mayo-Kani / gambiae / 2005 | Cameroon | Far North | Mayo-Kani | gambiae | 2005 | | 18 |
| Cameroon / Haut-Nyong / gambiae / 2009 / Q3 | Cameroon | East | Haut-Nyong | gambiae | 2009 | 3 | 95 |
| Cameroon / Lom-Et-Djerem / gambiae / 2009 / Q3 | Cameroon | East | Lom-Et-Djerem | gambiae | 2009 | 3 | 163 |
| Gabon / Libreville / gambiae / 2000 / Q4 | Gabon | Estuaire | Libreville | gambiae | 2000 | 4 | 69 |
| Ghana / Ablekuma Central Municipal / coluzzii / 2018 / Q1 | Ghana | Greater Accra Region | Ablekuma Central Municipal | coluzzii | 2018 | 1 | 266 |
| Ghana / La-Nkwantanang-Madina / gambiae / 2017 / Q4 | Ghana | Greater Accra Region | La-Nkwantanang-Madina | gambiae | 2017 | 4 | 200 |
| Ghana / Adansi Akrofuom / coluzzii / 2018 / Q4 | Ghana | Ashanti Region | Adansi Akrofuom | coluzzii | 2018 | 4 | 64 |
| Ghana / Adansi South / coluzzii / 2018 / Q4 | Ghana | Ashanti Region | Adansi South | coluzzii | 2018 | 4 | 36 |
| Ghana / Adansi South / gambiae / 2018 / Q4 | Ghana | Ashanti Region | Adansi South | gambiae | 2018 | 4 | 29 |
| Ghana / Amansie Central / coluzzii / 2018 / Q4 | Ghana | Ashanti Region | Amansie Central | coluzzii | 2018 | 4 | 69 |
| Ghana / Bekwai Municipal / coluzzii / 2018 / Q4 | Ghana | Ashanti Region | Bekwai Municipal | coluzzii | 2018 | 4 | 53 |
| Ghana / Twifo Atti-Morkwa / coluzzii / 2012 / Q3 | Ghana | Central Region | Twifo Atti-Morkwa | coluzzii | 2012 | 3 | 25 |
| Ghana / Upper Denkyira East Municipal / coluzzii / 2018 / Q4 | Ghana | Central Region | Upper Denkyira East Municipal | coluzzii | 2018 | 4 | 23 |
| Ghana / Upper Denkyira West / coluzzii / 2018 / Q4 | Ghana | Central Region | Upper Denkyira West | coluzzii | 2018 | 4 | 118 |
| Ghana / New Juaben South Municipal / gambiae / 2012 / Q4 | Ghana | Eastern Region | New Juaben South Municipal | gambiae | 2012 | 4 | 23 |
| Ghana / Effia Kwesimintsim Municipal / coluzzii / 2012 / Q3 | Ghana | Western Region | Effia Kwesimintsim Municipal | coluzzii | 2012 | 3 | 24 |
| Gambia, The / Lower Fuladu West / coluzzii / 2012 / Q4 | Gambia, The | Janjanbureh | Lower Fuladu West | coluzzii | 2012 | 4 | 172 |
| Gambia, The / Central Badibu / bissau / 2011 / Q3 | Gambia, The | Kerewan | Central Badibu | bissau | 2011 | 3 | 52 |
| Guinea / Kissidougou / gambiae / 2012 / Q4 | Guinea | Faranah | Kissidougou | gambiae | 2012 | 4 | 51 |
| Guinea / Macenta / gambiae / 2012 / Q4 | Guinea | Nzerekore | Macenta | gambiae | 2012 | 4 | 51 |
| Guinea-Bissau / Setor De Safim / bissau / 2010 | Guinea-Bissau | Biombo | Setor de Safim | bissau | 2010 | | 33 |
| Guinea-Bissau / Bissau Autonomous Sector / bissau / 2010 | Guinea-Bissau | Bissau | Bissau Autonomous Sector | bissau | 2010 | | 60 |
| Mali / Kangaba / gambiae / 2004 / Q3 | Mali | Koulikouro | Kangaba | gambiae | 2004 | 3 | 23 |
| Mali / Kati / coluzzii / 2014 / Q3 | Mali | Koulikouro | Kati | coluzzii | 2014 | 3 | 27 |
| Mali / Kati / gambiae / 2014 / Q3 | Mali | Koulikouro | Kati | gambiae | 2014 | 3 | 24 |
| Mali / Yanfolila / coluzzii / 2012 / Q4 | Mali | Sikasso | Yanfolila | coluzzii | 2012 | 4 | 23 |
| Mali / Yanfolila / gambiae / 2012 / Q4 | Mali | Sikasso | Yanfolila | gambiae | 2012 | 4 | 53 |
| Mali / Bla / coluzzii / 2004 / Q3 | Mali | Segou | Bla | coluzzii | 2004 | 3 | 19 |
| Malawi / Chikwawa / arabiensis / 2015 / Q2 | Malawi | Southern Region | Chikwawa | arabiensis | 2015 | 2 | 41 |
| Mozambique / Morrumbene / gambiae / 2004 / Q1 | Mozambique | Inhambane | Morrumbene | gambiae | 2004 | 1 | 49 |
| Mozambique / Morrumbene / gambiae / 2004 / Q2 | Mozambique | Inhambane | Morrumbene | gambiae | 2004 | 2 | 22 |
| Togo / Lome Commune / gambiae / 2017 / Q4 | Togo | Maritime Region | Lome Commune | gambiae | 2017 | 4 | 179 |
| Tanzania / Muleba / arabiensis / 2015 / Q1 | Tanzania | Kagera | Muleba | arabiensis | 2015 | 1 | 39 |
| Tanzania / Muleba / arabiensis / 2015 / Q2 | Tanzania | Kagera | Muleba | arabiensis | 2015 | 2 | 98 |
| Tanzania / Muleba / gambiae / 2015 / Q2 | Tanzania | Kagera | Muleba | gambiae | 2015 | 2 | 18 |
| Tanzania / Tarime / arabiensis / 2012 / Q3 | Tanzania | Mara | Tarime | arabiensis | 2012 | 3 | 47 |
| Tanzania / Muheza / gambiae / 2013 / Q1 | Tanzania | Tanga | Muheza | gambiae | 2013 | 1 | 32 |
| Tanzania / Moshi / arabiensis / 2012 / Q3 | Tanzania | Manyara | Moshi | arabiensis | 2012 | 3 | 39 |
| Uganda / Kalangala / gambiae / 2015 / Q2 | Uganda | Central Region | Kalangala | gambiae | 2015 | 2 | 60 |
| Uganda / Busia / gambiae / 2016 / Q2 | Uganda | Eastern Region | Busia | gambiae | 2016 | 2 | 24 |
| Uganda / Mayuge / gambiae / 2017 / Q2 | Uganda | Eastern Region | Mayuge | gambiae | 2017 | 2 | 21 |
| Uganda / Tororo / arabiensis / 2012 / Q4 | Uganda | Eastern Region | Tororo | arabiensis | 2012 | 4 | 81 |
| Uganda / Tororo / gambiae / 2012 / Q4 | Uganda | Eastern Region | Tororo | gambiae | 2012 | 4 | 112 |
| Uganda / Kanungu / gambiae / 2012 / Q4 | Uganda | Western Region | Kanungu | gambiae | 2012 | 4 | 95 |

H12 and G123 window size calibration

Both H12 (Garud et al. 2015) and G123 (Harris et al. 2018) are statistics for genome-wide selection scans that rely on dividing data into windows along the genome. Typically the size of these windows is set to a fixed number of polymorphic sites (SNPs), so that every window contains data (either phased haplotypes or unphased genotypes) for the same number of SNPs. In order to detect recent selective sweeps, the window size needs to be chosen so that windows are generally larger than the genetic distance over which linkage disequilibrium (LD) normally decays to background levels in the absence of recent positive selection. In windows unaffected by recent selective sweeps, genetic diversity will then be high and the values of the selection statistics will be low. Conversely, in windows affected by recent selective sweeps, LD will extend over a longer genetic distance spanning multiple windows, so that genetic diversity within those windows is low and the values of the selection statistics are high. In other words, the choice of window size affects the signal-to-noise ratio for selection scans using the H12 and G123 statistics. If windows are too small, results are dominated by background noise. If windows are too large, noise is minimal but power to detect recent selection signals is reduced.
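To make the relationship between diversity and statistic values concrete, the sketch below computes both statistics for a single window, following our reading of the definitions in the cited papers: H12 combines the frequencies of the two most common haplotypes before summing squared frequencies, and G123 does the same for the three most common multilocus genotypes. The function names and toy data are hypothetical, and this is not the implementation used to produce the report.

```python
from collections import Counter


def combined_top_frequency_stat(items, n_top):
    """Pool the n_top most frequent distinct items, then sum squared frequencies.

    Computes (p_1 + ... + p_n_top)^2 + sum of p_i^2 over the remaining items,
    where p_i are frequencies sorted from most to least common.
    """
    counts = Counter(items)
    total = sum(counts.values())
    freqs = sorted((c / total for c in counts.values()), reverse=True)
    pooled = sum(freqs[:n_top])
    return pooled * pooled + sum(f * f for f in freqs[n_top:])


def h12(haplotypes):
    """H12 for one window; each haplotype is a tuple of alleles."""
    return combined_top_frequency_stat(haplotypes, 2)


def g123(genotypes):
    """G123 for one window; each item is a tuple of per-site genotypes."""
    return combined_top_frequency_stat(genotypes, 3)


# Toy window affected by a sweep: two haplotypes dominate.
sweep = [(0, 1, 1, 0)] * 6 + [(1, 1, 0, 0)] * 3 + [(0, 0, 0, 1)]
# Toy diverse window: ten distinct haplotypes (4-bit encodings of 0..9).
diverse = [tuple(int(bit) for bit in format(i, "04b")) for i in range(10)]

print(round(h12(sweep), 4))    # high: (0.6 + 0.3)^2 + 0.1^2
print(round(h12(diverse), 4))  # low: (0.1 + 0.1)^2 + 8 * 0.1^2
```

In the sweep-like window the two common haplotypes account for 90% of the sample, so H12 is close to 1, whereas in the diverse window every haplotype is unique and the value is much lower.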

The choice of window size needs to be made independently for each cohort of samples over which a selection scan will be performed. This is because different source populations may have different demographic histories, which in turn may alter the genetic distance over which LD decays in the absence of positive selection. Previous studies have used various demographic inference methods to infer key demographic parameters for each cohort being analysed, and then used these parameters to inform the choice of window size. In practice, this approach presents a number of challenges. Firstly, inference of demographic parameters is difficult, and even state-of-the-art inference methods may reach inaccurate conclusions. Secondly, running demographic inference methods can be computationally demanding, which becomes impractical for large numbers of cohorts.

For these reasons, we have taken an empirical approach to window size calibration for H12 and G123 scans, designed to reach a good signal-to-noise ratio.

For each cohort, we compute H12 over contig 3RL for each of the following window sizes: 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1200, 1400, 1600, 1800, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 6000, 7000, 8000, 9000 and 10000 SNPs. For each window size, we then compute the 95th percentile of H12 values over all windows, and choose the smallest window size for which the 95th percentile is below 0.08. This means that, at the chosen window size, any window with an H12 value above this threshold will be in the top 5% of windows.

Similarly, we compute G123 over contig 3RL for each of the following window sizes: 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1200, 1400, 1600, 1800, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 6000, 7000, 8000, 9000 and 10000 SNPs. For each window size, we then compute the 95th percentile of G123 values over all windows, and choose the smallest window size for which the 95th percentile is below 0.08.
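The calibration rule described above can be sketched as follows. The `scan` argument is a hypothetical stand-in for whatever function runs an H12 or G123 scan over contig 3RL at a given window size and returns one value per window. The percentile uses linear interpolation, which is an assumption because the text does not state the method. Returning `None` when no window size qualifies is also an illustrative choice.

```python
import math

# Candidate window sizes (number of SNPs per window), as listed in the text.
WINDOW_SIZES = [
    100, 200, 300, 400, 500, 600, 700, 800, 900, 1000,
    1200, 1400, 1600, 1800, 2000, 2500, 3000, 3500, 4000, 4500,
    5000, 6000, 7000, 8000, 9000, 10000,
]
THRESHOLD = 0.08  # 95th percentile must fall below this value


def percentile(values, q):
    """Percentile with linear interpolation (ASSUMPTION: method not given in the text)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no values")
    position = (len(ordered) - 1) * q / 100
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] * (1 - fraction) + ordered[upper] * fraction


def calibrate_window_size(scan, window_sizes=WINDOW_SIZES, threshold=THRESHOLD, q=95):
    """Return the smallest window size whose q-th percentile is below threshold.

    `scan(window_size)` is a hypothetical callable returning the statistic
    values for all windows. Returns None if no candidate size qualifies.
    """
    for size in sorted(window_sizes):
        values = scan(size)
        if values and percentile(values, q) < threshold:
            return size
    return None


def toy_scan(window_size):
    """Toy data: background noise shrinks as windows grow, plus one sweep window."""
    background = min(1.0, 45.0 / window_size)
    return [background] * 99 + [0.9]


print(calibrate_window_size(toy_scan))
```

With this toy scan the background value first drops below 0.08 at 600 SNPs, so 600 is selected. The single sweep window does not affect the 95th percentile, which is the property that makes the percentile a usable proxy for background noise.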

Cohorts for which calibration did not yield a window size for H12 or G123 were removed from the analysis.

H12 genome-wide selection scans

TODO

G123 genome-wide selection scans

TODO

IHS genome-wide selection scans

TODO

Automated detection of selection signals

TODO

Identification of selection alerts

TODO

Web report generation

TODO