banner

Workshop 1 - Training course in data analysis for genomic surveillance of African malaria vectors


Module 4 - Analysing SNPs in the Vgsc gene#

Theme: Analysis

In this module we’re going to perform an analysis to discover single nucleotide polymorphisms (SNPs) in the voltage-gated sodium channel gene (Vgsc), which encodes the binding target for pyrethroid insecticides.

Learning objectives#

At the end of this module you will be able to:

  • Discover mutations (SNPs) that could potentially cause pyrethroid target-site resistance.

  • Compute SNP allele frequencies, i.e., how common are they in different mosquito cohorts?

  • Perform analyses to compare SNP allele frequences between mosquitoes from different species, geographical locations and dates of collection.

Lecture#

English#

Français#

Please note that the code in the cells below might differ from that shown in the video. This can happen because Python packages and their dependencies change due to updates, necessitating tweaks to the code.

Discovering SNPs in the Vgsc gene and computing SNP allele frequencies#

First, let’s set up the malariagen_data package to access MalariaGEN data in the cloud.

%pip install -q --no-warn-conflicts malariagen_data

Note that authentication is required to access data through the package, please follow the instructions here.

import malariagen_data
import pandas as pd
ag3 = malariagen_data.Ag3()
ag3
MalariaGEN Ag3 API client
Please note that data are subject to terms of use, for more information see the MalariaGEN website or contact support@malariagen.net. See also the Ag3 API docs.
Storage URL gs://vo_agam_release/
Data releases available 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9
Results cache None
Cohorts analysis 20240418
AIM analysis 20220528
Site filters analysis dt_20200416
Software version malariagen_data 9.0.0
Client location unknown

To discover SNPs and compute allele frequencies, we’re going to use the snp_allele_frequencies() function. Let’s have a look at the documentation for this function.

ag3.snp_allele_frequencies?
Signature:
ag3.snp_allele_frequencies(
    transcript: str,
    cohorts: Union[str, Mapping[str, str]],
    sample_query: Optional[str] = None,
    min_cohort_size: int = 10,
    site_mask: Optional[str] = None,
    sample_sets: Union[Sequence[str], str, NoneType] = None,
    drop_invariant: bool = True,
    effects: bool = True,
    include_counts: bool = False,
) -> pandas.core.frame.DataFrame
Docstring:
Compute SNP allele frequencies for a gene transcript.

Parameters
----------
transcript : str
    Gene transcript identifier.
cohorts : str or Mapping[str, str]
    Either a string giving the name of a predefined cohort set (e.g.,
    "admin1_month") or a dict mapping custom cohort labels to sample
    queries.
sample_query : str or None, optional
    A pandas query string to be evaluated against the sample metadata, to
    select samples to be included in the returned data.
min_cohort_size : int, optional, default: 10
    Minimum cohort size. Raise an error if the number of samples is less
    than this value.
site_mask : str or None, optional
    Which site filters mask to apply. See the `site_mask_ids` property for
    available values.
sample_sets : sequence of str or str or None, optional
    List of sample sets and/or releases. Can also be a single sample set
    or release.
drop_invariant : bool, optional, default: True
    If True, drop variants not observed in the selected samples.
effects : bool, optional, default: True
    If True, add SNP effect annotations.
include_counts : bool, optional, default: False
    Include columns with allele counts and number of non-missing allele
    calls (nobs).

Returns
-------
DataFrame
    A dataframe of SNP allele frequencies, one row per variant allele.

Notes
-----
Cohorts with fewer samples than `min_cohort_size` will be excluded
from output data frame.
File:      ~/.local/lib/python3.10/site-packages/malariagen_data/anoph/snp_frq.py
Type:      method

To discover SNPs in the Vgsc gene, we need to define some parameters.

First, we need to decide which gene transcript to use when determining what SNP effects will be. Here we’ll use the transcript with identifier “AGAP004707-RD”.

transcript = "AGAP004707-RD"

Next, to compute allele frequencies, we need to decide how our mosquitoes will be grouped into cohorts. There are different ways you can do this, for this analysis we’ll group spatially by level 1 administrative divisions within countries, and temporally by year.

cohorts = "admin1_year"

Next, we need to choose which samples to analyse. There are a number of different sample sets in the Ag3.0 data resource that we could use for this analysis. Let’s check what’s available.

ag3.sample_sets(release="3.0")[['sample_set', 'study_id', 'sample_count']]
sample_set study_id sample_count
0 AG1000G-AO AG1000G-AO 81
1 AG1000G-BF-A AG1000G-BF-1 181
2 AG1000G-BF-B AG1000G-BF-1 102
3 AG1000G-BF-C AG1000G-BF-2 13
4 AG1000G-CD AG1000G-CD 76
5 AG1000G-CF AG1000G-CF 73
6 AG1000G-CI AG1000G-CI 80
7 AG1000G-CM-A AG1000G-CM-1 303
8 AG1000G-CM-B AG1000G-CM-2 97
9 AG1000G-CM-C AG1000G-CM-3 44
10 AG1000G-FR AG1000G-FR 23
11 AG1000G-GA-A AG1000G-GA-1 69
12 AG1000G-GH AG1000G-GH 100
13 AG1000G-GM-A AG1000G-GM-1 74
14 AG1000G-GM-B AG1000G-GM-2 31
15 AG1000G-GM-C AG1000G-GM-3 174
16 AG1000G-GN-A AG1000G-GN-ML 45
17 AG1000G-GN-B AG1000G-GN-ML 185
18 AG1000G-GQ AG1000G-GQ 10
19 AG1000G-GW AG1000G-GW 101
20 AG1000G-KE AG1000G-KE 86
21 AG1000G-ML-A AG1000G-ML-1 60
22 AG1000G-ML-B AG1000G-ML-2 71
23 AG1000G-MW AG1000G-MW 41
24 AG1000G-MZ AG1000G-MZ 74
25 AG1000G-TZ AG1000G-TZ 300
26 AG1000G-UG AG1000G-UG 290
27 AG1000G-X AG1000G-X 669

To keep things simple, for this module we’ll focus on mosquitoes from Burkina Faso. There are three sample sets in the Ag3.0 resource providing data on mosquitoes from Burkina Faso.

sample_sets = ["AG1000G-BF-A", "AG1000G-BF-B", "AG1000G-BF-C"]

OK, now we’re ready to run the analysis.

snp_allele_freqs_df = ag3.snp_allele_frequencies(
    transcript=transcript, 
    cohorts=cohorts, 
    sample_sets=sample_sets, 
    drop_invariant=False,
)
snp_allele_freqs_df
                                     
                                      
                                     
                                     
pass_gamb_colu_arab pass_gamb_colu pass_arab frq_BF-07_gamb_2004 frq_BF-09_colu_2012 frq_BF-09_colu_2014 frq_BF-09_gamb_2012 frq_BF-09_gamb_2014 max_af transcript effect impact ref_codon alt_codon aa_pos ref_aa alt_aa label
contig position ref_allele alt_allele aa_change
2L 2358158 A C M1L True True True 0.0 0.0 0.0 0.0 0.0 0.0 AGAP004707-RD START_LOST HIGH Atg Ctg 1.0 M L 2L:2,358,158 A>C (M1L)
T M1L True True True 0.0 0.0 0.0 0.0 0.0 0.0 AGAP004707-RD START_LOST HIGH Atg Ttg 1.0 M L 2L:2,358,158 A>T (M1L)
G M1V True True True 0.0 0.0 0.0 0.0 0.0 0.0 AGAP004707-RD START_LOST HIGH Atg Gtg 1.0 M V 2L:2,358,158 A>G (M1V)
2358159 T A M1K True True True 0.0 0.0 0.0 0.0 0.0 0.0 AGAP004707-RD NON_SYNONYMOUS_CODING MODERATE aTg aAg 1.0 M K 2L:2,358,159 T>A (M1K)
C M1T True True True 0.0 0.0 0.0 0.0 0.0 0.0 AGAP004707-RD NON_SYNONYMOUS_CODING MODERATE aTg aCg 1.0 M T 2L:2,358,159 T>C (M1T)
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2431616 G C *2119S True True True 0.0 0.0 0.0 0.0 0.0 0.0 AGAP004707-RD STOP_LOST HIGH tGa tCa 2119.0 * S 2L:2,431,616 G>C (*2119S)
T *2119L True True True 0.0 0.0 0.0 0.0 0.0 0.0 AGAP004707-RD STOP_LOST HIGH tGa tTa 2119.0 * L 2L:2,431,616 G>T (*2119L)
2431617 A C *2119C True True True 0.0 0.0 0.0 0.0 0.0 0.0 AGAP004707-RD STOP_LOST HIGH tgA tgC 2119.0 * C 2L:2,431,617 A>C (*2119C)
T *2119C True True True 0.0 0.0 0.0 0.0 0.0 0.0 AGAP004707-RD STOP_LOST HIGH tgA tgT 2119.0 * C 2L:2,431,617 A>T (*2119C)
G *2119W True True True 0.0 0.0 0.0 0.0 0.0 0.0 AGAP004707-RD STOP_LOST HIGH tgA tgG 2119.0 * W 2L:2,431,617 A>G (*2119W)

209001 rows × 18 columns

The output from this function is a pandas DataFrame, where each row provides information about a SNP in the Vgsc gene.

Before we go further, to improve our understanding of what these data, let’s look at some background.

Grouping samples into cohorts#

The MalariaGEN Ag3.0 data resource contain mosquito samples collected across large spatial and temporal scales, and from different mosquito species. When we want to run population genetic analyses on datasets like these, the data must be divided into biologically relevant cohorts, where a cohort is simply a group of samples we want to analyse together.

To help define cohorts and analyse these data, we have added some metadata for each sample about its time and place of collection and its species:

  • Spatially - For most analyses we use administrative divisions to group the samples into cohorts. These give two levels of spatial resolution, where admin level 1 divides each country into a few large regions, while admin level 2 provides finer scale divisions such as provinces.

  • Temporally - For each sample we provide the year, quarter and month of collection. Depending on your analysis, you can choose to group samples by year, by year and by year quarter, or by year and by month, although note that for some samples the collection month is missing.

  • Taxonomically - Ag3.0 contains samples from different species in the Anopheles gambiae complex. To help with grouping by taxon, we have included a “taxon” field in the sample metadata.

Using these three dimensions, we have pre-defined six cohort sets, each of which groups samples into cohorts at different levels of spatio-temporal resolution. Within all cohort sets, samples are further subdivided by taxon.

  • admin1_year - Cohorts obtained by grouping samples by admin level 1, collection year and taxon.

  • admin1_quarter - Cohorts obtained by grouping samples by admin level 1, collection year and quarter, and taxon.

  • admin1_month - Cohorts obtained by grouping samples by admin level 1, collection year and month, and taxon.

  • admin2_year - Cohorts obtained by grouping samples by admin level 2, collection year and taxon.

  • admin2_quarter - Cohorts obtained by grouping samples by admin level 2, collection year and quarter, and taxon.

  • admin2_month - Cohorts obtained by grouping samples by admin level 2, collection year and month, and taxon.

Remember above we chose to use the “admin1_year” cohorts for our Vgsc analysis.

cohorts
'admin1_year'

Let’s now use this to understand the frequency columns in the SNP allele frequencies DataFrame. Here are the frequency column names:

frequency_columns = [
    col for col in snp_allele_freqs_df.columns 
    if col.startswith("frq_")
]
frequency_columns
['frq_BF-07_gamb_2004',
 'frq_BF-09_colu_2012',
 'frq_BF-09_colu_2014',
 'frq_BF-09_gamb_2012',
 'frq_BF-09_gamb_2014']

Here “frq_” is used to mean that these columns contain frequencies.

The second part of the column name is either “BF-09” or “BF-07”. These are standard identifiers that refer to level 1 administrative divisions within countries, often called regions. “BF-07” is the Centre-Sud region within Burkina Faso, and “BF-09” is the Haut-Bassins region.

The third part refers to the species, and in this example is either “gamb” meaning Anopheles gambiae or “colu” meaning Anopheles coluzzii.

The final part is the year of collection, which in this example ie either 2004, 2012 or 2014.

Thus, e.g., “frq_BF-09_gamb_2012” means the frequency of the allele within the cohort of Anopheles gambiae mosquitoes from the BF-09 (Haut-Bassins) region collected in 2012.

Single nucleotide polymorphisms (SNPs)#

In the last module we learnt about reference genomes and gene annotations. Now one of the cool things we can do once we have an Anopheles gambiae reference genome, is sequence the genomes of wild mosquitoes and compare them to the reference genome. By aligning an individual’s sequencing reads to the reference, we can call its genotype.

The genotype is derived from how similar the individual is to the reference genome at each nucleotide position.

Where a nucleotide matches the reference genome, it will be called a reference allele, where “allele” is just a term for any kind of genetic variant. Where a nucleotide differs from the reference genome, it will be called an alternative allele.

Since, there are four possible nucleotides and one of these is the reference, there are always three possible alternative alleles for any given site. Another term for a genetic difference where two individuals differ at a single nucleotide position is single nucleotide polymorphism or SNP.

This SNP information is displayed in the first four columns of the snp_allele_frequencies() index in the output dataframe. We can see each alternative allele represented by it’s own row in our output. Let’s take a look at just the index.

snp_allele_freqs_df[[]]
contig position ref_allele alt_allele aa_change
2L 2358158 A C M1L
T M1L
G M1V
2358159 T A M1K
C M1T
... ... ... ...
2431616 G C *2119S
T *2119L
2431617 A C *2119C
T *2119C
G *2119W

209001 rows × 0 columns

SNP effects - some SNPs may be more interesting than others#

SNPs can have different effects depending on what the nucleotide change is and where in the genome it occurs.

In this analysis, we are interested in SNPs that affect protein structure, specifically those which will change the voltage-gated sodium-channel and could affect protein function and therefore insecticide resistance phenotype, so we need to look within the coding sequences (CDS) of the Vgsc gene.

However, not all SNPs which fall in CDSs cause protein changes, because the genetic code has some redundancy, meaning that different nucleotide sequences can encode the same amino acid. If a SNP in a CDS does change the amino acid, it is called a non-synonymous (NS) or missense SNP, and if the SNP does not change the amino acid, it is called a synonymous SNP.

Manually predicting SNP effects is quite involved, but snp_allele_frequencies() can predict them for us, let’s look at how this is represented in our output DataFrame.

snp_effects_df = snp_allele_freqs_df[["effect", "impact"]]
snp_effects_df
effect impact
contig position ref_allele alt_allele aa_change
2L 2358158 A C M1L START_LOST HIGH
T M1L START_LOST HIGH
G M1V START_LOST HIGH
2358159 T A M1K NON_SYNONYMOUS_CODING MODERATE
C M1T NON_SYNONYMOUS_CODING MODERATE
... ... ... ... ... ...
2431616 G C *2119S STOP_LOST HIGH
T *2119L STOP_LOST HIGH
2431617 A C *2119C STOP_LOST HIGH
T *2119C STOP_LOST HIGH
G *2119W STOP_LOST HIGH

209001 rows × 2 columns

Let’s look specifically at the genomic position where a SNP occurs which causes an insecticide-resistance mutation, also known as “kdr”.

snp_effects_df.loc[("2L", 2_422_652)]
effect impact
ref_allele alt_allele aa_change
A C L995F NON_SYNONYMOUS_CODING MODERATE
T L995F NON_SYNONYMOUS_CODING MODERATE
G L995L SYNONYMOUS_CODING LOW

For interest, let’s count the number of SNPs we have by their effect.

snp_allele_freqs_df.query("max_af > 0").groupby(["effect", "impact"]).size()
effect                 impact  
INTRONIC               MODIFIER    9467
NON_SYNONYMOUS_CODING  MODERATE     121
SPLICE_CORE            HIGH           8
SPLICE_REGION          MODERATE       9
STOP_GAINED            HIGH          14
SYNONYMOUS_CODING      LOW           56
dtype: int64

SNP allele frequencies#

Identifying the presence or absence of SNPs in wild caught mosquitoes is interesting, but the real value in generating SNP genotypes from large spatiotemporal collections of mosquitoes comes from the ability to see how groups of samples (cohorts) differ between geographical locations, species, and over time.

One way to compare SNP differences between cohorts is to calculate and compare SNP allele frequencies at each position in the genome by dividing the number of times each SNP allele is found in the cohort by the total number of individuals present in the cohort (multiplied by 2 because each individual mosquito is diploid and so carries two genome copies).

Let’s take another look at the allele frequencies we computed above, focusing just on the frequency columns.

snp_allele_freqs_df[frequency_columns + ['max_af']]
frq_BF-07_gamb_2004 frq_BF-09_colu_2012 frq_BF-09_colu_2014 frq_BF-09_gamb_2012 frq_BF-09_gamb_2014 max_af
contig position ref_allele alt_allele aa_change
2L 2358158 A C M1L 0.0 0.0 0.0 0.0 0.0 0.0
T M1L 0.0 0.0 0.0 0.0 0.0 0.0
G M1V 0.0 0.0 0.0 0.0 0.0 0.0
2358159 T A M1K 0.0 0.0 0.0 0.0 0.0 0.0
C M1T 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ...
2431616 G C *2119S 0.0 0.0 0.0 0.0 0.0 0.0
T *2119L 0.0 0.0 0.0 0.0 0.0 0.0
2431617 A C *2119C 0.0 0.0 0.0 0.0 0.0 0.0
T *2119C 0.0 0.0 0.0 0.0 0.0 0.0
G *2119W 0.0 0.0 0.0 0.0 0.0 0.0

209001 rows × 6 columns

And let’s inspect the frequencies for SNPs at a specific genomic position of interest.

snp_allele_freqs_df.loc[("2L", 2_422_652), frequency_columns]
frq_BF-07_gamb_2004 frq_BF-09_colu_2012 frq_BF-09_colu_2014 frq_BF-09_gamb_2012 frq_BF-09_gamb_2014
ref_allele alt_allele aa_change
A C L995F 0.000000 0.000000 0.000000 0.0 0.0
T L995F 0.076923 0.865854 0.886792 1.0 1.0
G L995L 0.000000 0.000000 0.000000 0.0 0.0

Visualising SNP allele frequencies via heatmaps#

To make our SNP allele frequencies DataFrame easier to interpret, we can filter it down to just non-synonymous SNPs that are at frequency above 5% in at least one of our cohorts.

ns_snps_df = snp_allele_freqs_df.query("effect == 'NON_SYNONYMOUS_CODING' and max_af >= 0.05")
ns_snps_df
pass_gamb_colu_arab pass_gamb_colu pass_arab frq_BF-07_gamb_2004 frq_BF-09_colu_2012 frq_BF-09_colu_2014 frq_BF-09_gamb_2012 frq_BF-09_gamb_2014 max_af transcript effect impact ref_codon alt_codon aa_pos ref_aa alt_aa label
contig position ref_allele alt_allele aa_change
2L 2391228 G C V402L True True True 0.000000 0.067073 0.028302 0.000000 0.000000 0.067073 AGAP004707-RD NON_SYNONYMOUS_CODING MODERATE Gta Cta 402.0 V L 2L:2,391,228 G>C (V402L)
T V402L True True True 0.000000 0.054878 0.084906 0.000000 0.000000 0.084906 AGAP004707-RD NON_SYNONYMOUS_CODING MODERATE Gta Tta 402.0 V L 2L:2,391,228 G>T (V402L)
2416980 C T T791M True True True 0.000000 0.018293 0.000000 0.161616 0.239130 0.239130 AGAP004707-RD NON_SYNONYMOUS_CODING MODERATE aCg aTg 791.0 T M 2L:2,416,980 C>T (T791M)
2422652 A T L995F True True True 0.076923 0.865854 0.886792 1.000000 1.000000 1.000000 AGAP004707-RD NON_SYNONYMOUS_CODING MODERATE ttA ttT 995.0 L F 2L:2,422,652 A>T (L995F)
2429617 T C I1527T True True True 0.000000 0.121951 0.113208 0.000000 0.000000 0.121951 AGAP004707-RD NON_SYNONYMOUS_CODING MODERATE aTt aCt 1527.0 I T 2L:2,429,617 T>C (I1527T)
2429745 A T N1570Y True True True 0.038462 0.250000 0.320755 0.212121 0.141304 0.320755 AGAP004707-RD NON_SYNONYMOUS_CODING MODERATE Aat Tat 1570.0 N Y 2L:2,429,745 A>T (N1570Y)
2429897 A G E1597G True True True 0.000000 0.000000 0.000000 0.065657 0.032609 0.065657 AGAP004707-RD NON_SYNONYMOUS_CODING MODERATE gAa gGa 1597.0 E G 2L:2,429,897 A>G (E1597G)
2429915 A C K1603T True True True 0.000000 0.054878 0.056604 0.000000 0.000000 0.056604 AGAP004707-RD NON_SYNONYMOUS_CODING MODERATE aAg aCg 1603.0 K T 2L:2,429,915 A>C (K1603T)
2430424 G T A1746S False True False 0.000000 0.000000 0.000000 0.151515 0.239130 0.239130 AGAP004707-RD NON_SYNONYMOUS_CODING MODERATE Gcc Tcc 1746.0 A S 2L:2,430,424 G>T (A1746S)
2430863 T C I1868T True True True 0.000000 0.000000 0.000000 0.247475 0.206522 0.247475 AGAP004707-RD NON_SYNONYMOUS_CODING MODERATE aTa aCa 1868.0 I T 2L:2,430,863 T>C (I1868T)
2430880 C T P1874S True True True 0.000000 0.213415 0.169811 0.000000 0.000000 0.213415 AGAP004707-RD NON_SYNONYMOUS_CODING MODERATE Cct Tct 1874.0 P S 2L:2,430,880 C>T (P1874S)
2430881 C T P1874L True True True 0.000000 0.073171 0.056604 0.222222 0.260870 0.260870 AGAP004707-RD NON_SYNONYMOUS_CODING MODERATE cCt cTt 1874.0 P L 2L:2,430,881 C>T (P1874L)
2431061 C T A1934V True True True 0.000000 0.121951 0.132075 0.000000 0.000000 0.132075 AGAP004707-RD NON_SYNONYMOUS_CODING MODERATE gCt gTt 1934.0 A V 2L:2,431,061 C>T (A1934V)

To make things even clearer, we have included a heatmap plotting function to style our filtered DataFrame, called plot_frequencies_heatmap().

ag3.plot_frequencies_heatmap(ns_snps_df)

Amino acid substitution frequencies#

You might have noticed that there are two rows with V402L in our previous heatmap plot. This is because in Burkina Faso, we find two different alternative alleles at the same genomic position, both causing the same amino acid substitution (valine to leucine).

If we are just interested in amino acid change frequencies, for example, when looking at potential insecticide resistance conferring mutations, we might want to combine the frequencies of the two alleles which cause V402L. In this case, we can use the aa_allele_frequencies() function in exactly the same way as we used snp_allele_frequencies().

aa_allele_freqs_df = ag3.aa_allele_frequencies(
    transcript=transcript, 
    cohorts=cohorts, 
    sample_sets=sample_sets
)
aa_allele_freqs_df
                                     
                                      
                                     
frq_BF-07_gamb_2004 frq_BF-09_colu_2012 frq_BF-09_colu_2014 frq_BF-09_gamb_2012 frq_BF-09_gamb_2014 transcript aa_pos ref_allele ref_aa alt_aa effect impact alt_allele max_af label
aa_change contig position
A32V 2L 2358252 0.000000 0.006098 0.0 0.000000 0.000000 AGAP004707-RD 32.0 C A V NON_SYNONYMOUS_CODING MODERATE T 0.006098 A32V (2L:2,358,252 C>T)
G54C 2L 2362019 0.000000 0.018293 0.0 0.000000 0.010870 AGAP004707-RD 54.0 G G C NON_SYNONYMOUS_CODING MODERATE T 0.018293 G54C (2L:2,362,019 G>T)
P55L 2L 2362023 0.000000 0.000000 0.0 0.005051 0.010870 AGAP004707-RD 55.0 C P L NON_SYNONYMOUS_CODING MODERATE T 0.010870 P55L (2L:2,362,023 C>T)
P59T 2L 2362034 0.000000 0.000000 0.0 0.000000 0.021739 AGAP004707-RD 59.0 C P T NON_SYNONYMOUS_CODING MODERATE A 0.021739 P59T (2L:2,362,034 C>A)
G73D 2L 2362077 0.000000 0.006098 0.0 0.000000 0.000000 AGAP004707-RD 73.0 G G D NON_SYNONYMOUS_CODING MODERATE A 0.006098 G73D (2L:2,362,077 G>A)
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
A2023G 2L 2431328 0.038462 0.000000 0.0 0.000000 0.000000 AGAP004707-RD 2023.0 C A G NON_SYNONYMOUS_CODING MODERATE G 0.038462 A2023G (2L:2,431,328 C>G)
S2037R 2L 2431371 0.000000 0.000000 0.0 0.005051 0.000000 AGAP004707-RD 2037.0 C S R NON_SYNONYMOUS_CODING MODERATE G 0.005051 S2037R (2L:2,431,371 C>G)
I2053V 2L 2431417 0.038462 0.000000 0.0 0.000000 0.000000 AGAP004707-RD 2053.0 A I V NON_SYNONYMOUS_CODING MODERATE G 0.038462 I2053V (2L:2,431,417 A>G)
G2055V 2L 2431424 0.000000 0.000000 0.0 0.005051 0.000000 AGAP004707-RD 2055.0 G G V NON_SYNONYMOUS_CODING MODERATE T 0.005051 G2055V (2L:2,431,424 G>T)
V2118A 2L 2431613 0.000000 0.000000 0.0 0.005051 0.000000 AGAP004707-RD 2118.0 T V A NON_SYNONYMOUS_CODING MODERATE C 0.005051 V2118A (2L:2,431,613 T>C)

134 rows × 15 columns

Let’s filter it again to just amino acid changes greater than 5% in at least one cohort. We don’t need to filter for non-synonymous mutations this time as this function has already done that for us.

aa_filt_df = aa_allele_freqs_df.query("max_af >= 0.05")
aa_filt_df
frq_BF-07_gamb_2004 frq_BF-09_colu_2012 frq_BF-09_colu_2014 frq_BF-09_gamb_2012 frq_BF-09_gamb_2014 transcript aa_pos ref_allele ref_aa alt_aa effect impact alt_allele max_af label
aa_change contig position
V402L 2L 2391228 0.000000 0.121951 0.113208 0.000000 0.000000 AGAP004707-RD 402.0 G V L NON_SYNONYMOUS_CODING MODERATE {C,T} 0.121951 V402L (2L:2,391,228 G>{C,T})
T791M 2L 2416980 0.000000 0.018293 0.000000 0.161616 0.239130 AGAP004707-RD 791.0 C T M NON_SYNONYMOUS_CODING MODERATE T 0.239130 T791M (2L:2,416,980 C>T)
L995F 2L 2422652 0.076923 0.865854 0.886792 1.000000 1.000000 AGAP004707-RD 995.0 A L F NON_SYNONYMOUS_CODING MODERATE T 1.000000 L995F (2L:2,422,652 A>T)
I1527T 2L 2429617 0.000000 0.121951 0.113208 0.000000 0.000000 AGAP004707-RD 1527.0 T I T NON_SYNONYMOUS_CODING MODERATE C 0.121951 I1527T (2L:2,429,617 T>C)
N1570Y 2L 2429745 0.038462 0.250000 0.320755 0.212121 0.141304 AGAP004707-RD 1570.0 A N Y NON_SYNONYMOUS_CODING MODERATE T 0.320755 N1570Y (2L:2,429,745 A>T)
E1597G 2L 2429897 0.000000 0.000000 0.000000 0.065657 0.032609 AGAP004707-RD 1597.0 A E G NON_SYNONYMOUS_CODING MODERATE G 0.065657 E1597G (2L:2,429,897 A>G)
K1603T 2L 2429915 0.000000 0.054878 0.056604 0.000000 0.000000 AGAP004707-RD 1603.0 A K T NON_SYNONYMOUS_CODING MODERATE C 0.056604 K1603T (2L:2,429,915 A>C)
A1746S 2L 2430424 0.000000 0.000000 0.000000 0.151515 0.239130 AGAP004707-RD 1746.0 G A S NON_SYNONYMOUS_CODING MODERATE T 0.239130 A1746S (2L:2,430,424 G>T)
I1868T 2L 2430863 0.000000 0.000000 0.000000 0.247475 0.206522 AGAP004707-RD 1868.0 T I T NON_SYNONYMOUS_CODING MODERATE C 0.247475 I1868T (2L:2,430,863 T>C)
P1874S 2L 2430880 0.000000 0.213415 0.169811 0.000000 0.000000 AGAP004707-RD 1874.0 C P S NON_SYNONYMOUS_CODING MODERATE T 0.213415 P1874S (2L:2,430,880 C>T)
P1874L 2L 2430881 0.000000 0.073171 0.056604 0.222222 0.260870 AGAP004707-RD 1874.0 C P L NON_SYNONYMOUS_CODING MODERATE T 0.260870 P1874L (2L:2,430,881 C>T)
A1934V 2L 2431061 0.000000 0.121951 0.132075 0.000000 0.000000 AGAP004707-RD 1934.0 C A V NON_SYNONYMOUS_CODING MODERATE T 0.132075 A1934V (2L:2,431,061 C>T)

Now we can visualise these frequencies the same way we did before.

ag3.plot_frequencies_heatmap(aa_filt_df)

From this new heatmap it’s clear that there are two SNPs causing a V402L substitution, and the combined frequency of these is identical to that of the I1527T substitution in the An. coluzzii cohorts. This is a strong indication that these two mutations are strongly associated and may be working together to confer a selective advantage via insecticide resistance.

Using sample queries#

In some cases you may want to analyse data across a number of different sample sets, but select only samples from a single country or species. In this case there is a useful sample_query parameter available with the functions snp_allele_frequencies() and aa_allele_frequencies(). The sample_query parameter accepts a pandas query string, which will be applied to the sample metadata to select samples to use in the analysis.

Let’s see it in action by computing amino acid allele frequencies for all An. arabiensis samples in the Ag3.0 data resource.

aa_arab_freqs_df = ag3.aa_allele_frequencies(
    transcript=transcript, 
    cohorts=cohorts, 
    sample_sets="3.0",
    sample_query="taxon == 'arabiensis'",
)
aa_arab_freqs_df = aa_arab_freqs_df.query("max_af > 0.05")
ag3.plot_frequencies_heatmap(aa_arab_freqs_df)
                                     
                                      
                                     

Plotting SNP frequency time series#

For surveillance it can be very informative to observice changes in allele frequencies over time for a given sampling location and species. Alleles under selection due to insecticide resistance will increase in frequency over time, and plotting the frequency data as a time series can help to see this. The malariagen_data package includes a function plot_frequencies_time_series() for plotting frequency time series. These are a little more advanced to use, but we’ll illustrate them below.

N.B., in the Ag3.0 data resource there is only one country where we have data from multiple time points, which is Burkina Faso. However, data in future releases will contain a number of time series, and so it’s useful to illustrate this functionality now.

First we need to compute the SNP allele frequencies in our gene of interest. To obtain the necessary data for time series plotting, we use a slightly different function for computing the frequencies. Don’t worry too much about this for the moment, we will revisit later in the course.

ds_snps_bf = ag3.aa_allele_frequencies_advanced(
    transcript=transcript,
    area_by="admin1_iso",  # group samples in space by admin level 1
    period_by="year",  # group samples in time by year
    sample_sets="3.0",
    sample_query="country == 'Burkina Faso'",
    variant_query="max_af > 0.05",  # only keep variants above 5% frequency in at least one cohort
)
ds_snps_bf
Access SNP calls: ⠋ (0:00:00.00)     
/home/jonbrenas/.local/lib/python3.10/site-packages/malariagen_data/anoph/snp_frq.py:1203: FutureWarning:

'A' is deprecated and will be removed in a future version, please use 'Y' instead.
                                 
                                     
<xarray.Dataset> Size: 4kB
Dimensions:                 (cohorts: 5, variants: 12)
Dimensions without coordinates: cohorts, variants
Data variables: (12/31)
    cohort_area             (cohorts) object 40B 'BF-09' 'BF-09' ... 'BF-09'
    cohort_label            (cohorts) object 40B 'BF-09_colu_2012' ... 'BF-09...
    cohort_lat_max          (cohorts) float64 40B 11.24 11.24 12.06 11.24 11.24
    cohort_lat_mean         (cohorts) float64 40B 11.22 11.23 12.06 11.19 11.21
    cohort_lat_min          (cohorts) float64 40B 11.15 11.23 12.06 11.15 11.15
    cohort_lon_max          (cohorts) float64 40B -4.235 -4.235 ... -4.235
    ...                      ...
    variant_label           (variants) object 96B 'V402L (2L:2,391,228 G>{C,T...
    variant_max_af          (variants) float64 96B 0.122 0.2391 ... 0.1321
    variant_position        (variants) int32 48B 2391228 2416980 ... 2431061
    variant_ref_aa          (variants) object 96B 'V' 'T' 'L' ... 'P' 'P' 'A'
    variant_ref_allele      (variants) object 96B 'G' 'C' 'A' ... 'C' 'C' 'C'
    variant_transcript      (variants) object 96B 'AGAP004707-RD' ... 'AGAP00...
Attributes:
    title:    AGAP004707-RD (Vgsc/para) SNP frequencies

Now let’s plot the time series.

ag3.plot_frequencies_time_series(ds_snps_bf)

Note that this plot is interactive, you can hover over points for more information, and zoom into regions of the plot.

In this plot we have data from two regions, BF-09 (Haut-Bassins) and BF-07 (Centre-Sud). In BF-09 we have data from 2012 and 2014, and so we have the beginning of a time series. In BF-07 we have data only from 2004, so just a single point in time.

There are 3 sub-plots, faceted by taxon and area to allow for comparison. The error bars in the plots show 95% confidence intervals for the frequency estimates. In BF-09 where we have data from 2012 and 2014, it looks like all SNP frequencies were similar between 2012 and 2014 in both An. gambiae and An. coluzzii, with some known insecticide resistance alleles such as L995F already at fixation or very high frequency.

It’s interesting to note that in BF-07 region in 2004 there were no SNPs above 8% frequency, including the L995F SNP. This suggests that L995F rose in frequency in Burkina Faso between 2004 and 2012, although we cannot draw strong conclusions here because we have data from different regions.

Plotting SNP frequency maps#

Another analysis we can perform is to make geographical comparisons, by plotting SNP frequencies on a map. We can do this via functions in the malariagen_data package. Let’s demonstrate, using all data in the Ag3.0 resource.

ds_snps = ag3.aa_allele_frequencies_advanced(
    transcript=transcript,
    area_by="admin1_iso",  # group samples in space by admin level 1
    period_by="year",  # group samples in time by year
    sample_sets="3.0",
    variant_query="max_af > 0.05",  # only keep variants above 5% frequency in at least one cohort
)
ds_snps
Load genome features: ⠋ (0:00:00.00)
/home/jonbrenas/.local/lib/python3.10/site-packages/malariagen_data/anoph/snp_frq.py:1203: FutureWarning:

'A' is deprecated and will be removed in a future version, please use 'Y' instead.
                                     
                                     
<xarray.Dataset> Size: 73kB
Dimensions:                 (cohorts: 54, variants: 30)
Dimensions without coordinates: cohorts, variants
Data variables: (12/31)
    cohort_area             (cohorts) object 432B 'KE-14' 'MW-S' ... 'TZ-25'
    cohort_label            (cohorts) object 432B 'KE-14_arab_2012' ... 'TZ-2...
    cohort_lat_max          (cohorts) float64 432B -3.511 -15.93 ... -4.94
    cohort_lat_mean         (cohorts) float64 432B -3.511 -15.93 ... -4.94
    cohort_lat_min          (cohorts) float64 432B -3.511 -15.93 ... -4.94
    cohort_lon_max          (cohorts) float64 432B 39.91 34.76 ... 39.91 38.95
    ...                      ...
    variant_label           (variants) object 240B 'R254K (2L:2,390,177 G>A)'...
    variant_max_af          (variants) float64 240B 0.5197 0.25 ... 0.07143
    variant_position        (variants) int32 120B 2390177 2391228 ... 2431417
    variant_ref_aa          (variants) object 240B 'R' 'V' 'D' ... 'I' 'T' 'I'
    variant_ref_allele      (variants) object 240B 'G' 'G' 'G' ... 'T' 'A' 'A'
    variant_transcript      (variants) object 240B 'AGAP004707-RD' ... 'AGAP0...
Attributes:
    title:    AGAP004707-RD (Vgsc/para) SNP frequencies

Take a look at the variants we’ve found.

ds_snps["variant_label"].values
array(['R254K (2L:2,390,177 G>A)', 'V402L (2L:2,391,228 G>{C,T})',
       'D466H (2L:2,399,997 G>C)', 'M490I (2L:2,400,071 G>{A,T})',
       'A656V (2L:2,407,769 C>T)', 'I693V (2L:2,407,954 A>G)',
       'E752V (2L:2,416,863 A>T)', 'M757L (2L:2,416,877 A>T)',
       'T791M (2L:2,416,980 C>T)', 'E862* (2L:2,417,305 G>T)',
       'L995S (2L:2,422,651 T>C)', 'L995F (2L:2,422,652 A>T)',
       'A1125V (2L:2,424,384 C>T)', 'V1507I (2L:2,429,556 G>A)',
       'I1527T (2L:2,429,617 T>C)', 'N1570Y (2L:2,429,745 A>T)',
       'E1597Q (2L:2,429,896 G>C)', 'E1597G (2L:2,429,897 A>G)',
       'K1603T (2L:2,429,915 A>C)', 'L1667M (2L:2,430,106 T>A)',
       'A1746S (2L:2,430,424 G>T)', 'Y1846* (2L:2,430,798 C>{A,G})',
       'V1853I (2L:2,430,817 G>A)', 'I1868T (2L:2,430,863 T>C)',
       'P1874S (2L:2,430,880 C>T)', 'P1874L (2L:2,430,881 C>T)',
       'A1934V (2L:2,431,061 C>T)', 'I1940T (2L:2,431,079 T>C)',
       'T2044A (2L:2,431,390 A>G)', 'I2053V (2L:2,431,417 A>G)'],
      dtype=object)

To explore these data a bit more interactively, there is also a function plot_frequencies_interactive_map() to create a map with some controls to select the taxon, period and variant of interest. Here’s an example of how to use it.

ag3.plot_frequencies_interactive_map(ds_snps)

N.B., you will need to run this notebook for yourself in colab in order for the interactive map to work.

Well done!#

In this module we have learnt how to analyse SNP mutations in the target of pyrethroid insecticides, the voltage-gated sodium-channel. We have calculated the allele frequencies of the SNPs in cohorts of mosquitoes and learnt how to filter and plot them for ease of interpretation.

Practical exercises#

English#

  1. Open this notebook in Google Colab and run it for yourself from top to bottom. Hint: click the rocket icon at the top of the page, then select “Colab” from the drop-down menu.

  2. Looking at the heatmap output (either amino acid or SNP), can you spot a relationship between the V402L and L995F frequencies? If so, what is it?

  3. Re-run the whole analysis but using the Ghanaian sample set. Hint: Try sample_sets = "AG1000G-GH". Or any other samples of interest.

  4. What are the cohorts for this new sample set? Hint: see frequency_columns.

  5. Above, we looked at the kdr “West” SNP position, compare and contrast this with the kdr “East” SNP position. What is the amino acid change for kdr “East”? Hint: The position is 2422651.

  6. For the Ghanaian sample set, add the a y_label which says “aa change” to the amino acid frequency heatmap. Hint: you can view all the function’s parameters with ag3.aa_allele_frequencies?.

  7. Remove the colorbar from the same heatmap. Hint: False.

  8. Is the same relationship between V402L and L995F frequencies present in Ghana? What might be an evolutionary interpretation of this relationship?

Français#

  1. Ouvrir ce notebook dans Google Colab et l’exécuter vous-même du début à la fin. Indice: cliquer sur l’icône fusée () et sélectionner “Colab” dans le menu déroulant.

  2. En regardant la heatmap (soit pour les SNPs soit pour les amino acides), pouvez vous détecter la relation entre les fréquences de V402L et L995F? Si oui, quelle est elle?

  3. Re-exécuter toute l’analyse mais en utilisant les données ghanéennes. Indice: utiliser sample_sets = "AG1000G-GH".

  4. Quelles sont les cohortes de ce nouvel ensemble de données? Indice: regarder frequency_columns.

  5. Plus haut, nous avons regardé la position SNP de kdr “West”, comparer et contraster la avec la position kdr “East”. Quelle est la différence en terme d’amino acides pour kdr “East”? Indice: position = 2422651.

  6. Pour l’ensemble de données ghanéennes, ajouter le y_label nommé “aa change” à la heatmap des fréquences des amino acides. Indice: tous les paramètres de la fonction peuvent être vus avec ag3.aa_allele_frequencies?.

  7. Supprimer la barre de couleur de cette même heatmap. Indice: quelque chose doit être défini commeFalse.

  8. Est-ce que la même relation entre les fréquences de V402L et L995F présente au Ghana? Quelle pourrait être l’interprétation en terme d’évolution de cette relation?