Workshop 1 - Training course in data analysis for genomic surveillance of African malaria vectors
Module 4 - Analysing SNPs in the Vgsc gene#
Theme: Analysis
In this module we’re going to perform an analysis to discover single nucleotide polymorphisms (SNPs) in the voltage-gated sodium channel gene (Vgsc), which encodes the binding target for pyrethroid insecticides.
Learning objectives#
At the end of this module you will be able to:
Discover mutations (SNPs) that could potentially cause pyrethroid target-site resistance.
Compute SNP allele frequencies, i.e., how common are they in different mosquito cohorts?
Perform analyses to compare SNP allele frequences between mosquitoes from different species, geographical locations and dates of collection.
Lecture#
English#
Français#
Please note that the code in the cells below might differ from that shown in the video. This can happen because Python packages and their dependencies change due to updates, necessitating tweaks to the code.
Discovering SNPs in the Vgsc gene and computing SNP allele frequencies#
First, let’s set up the malariagen_data
package to access MalariaGEN data in the cloud.
%pip install -q --no-warn-conflicts malariagen_data
Note that authentication is required to access data through the package, please follow the instructions here.
import malariagen_data
import pandas as pd
ag3 = malariagen_data.Ag3()
ag3
MalariaGEN Ag3 API client | |
---|---|
Please note that data are subject to terms of use, for more information see the MalariaGEN website or contact support@malariagen.net. See also the Ag3 API docs. | |
Storage URL | gs://vo_agam_release/ |
Data releases available | 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9 |
Results cache | None |
Cohorts analysis | 20240418 |
AIM analysis | 20220528 |
Site filters analysis | dt_20200416 |
Software version | malariagen_data 9.0.0 |
Client location | unknown |
To discover SNPs and compute allele frequencies, we’re going to use the snp_allele_frequencies()
function. Let’s have a look at the documentation for this function.
ag3.snp_allele_frequencies?
Signature:
ag3.snp_allele_frequencies(
transcript: str,
cohorts: Union[str, Mapping[str, str]],
sample_query: Optional[str] = None,
min_cohort_size: int = 10,
site_mask: Optional[str] = None,
sample_sets: Union[Sequence[str], str, NoneType] = None,
drop_invariant: bool = True,
effects: bool = True,
include_counts: bool = False,
) -> pandas.core.frame.DataFrame
Docstring:
Compute SNP allele frequencies for a gene transcript.
Parameters
----------
transcript : str
Gene transcript identifier.
cohorts : str or Mapping[str, str]
Either a string giving the name of a predefined cohort set (e.g.,
"admin1_month") or a dict mapping custom cohort labels to sample
queries.
sample_query : str or None, optional
A pandas query string to be evaluated against the sample metadata, to
select samples to be included in the returned data.
min_cohort_size : int, optional, default: 10
Minimum cohort size. Raise an error if the number of samples is less
than this value.
site_mask : str or None, optional
Which site filters mask to apply. See the `site_mask_ids` property for
available values.
sample_sets : sequence of str or str or None, optional
List of sample sets and/or releases. Can also be a single sample set
or release.
drop_invariant : bool, optional, default: True
If True, drop variants not observed in the selected samples.
effects : bool, optional, default: True
If True, add SNP effect annotations.
include_counts : bool, optional, default: False
Include columns with allele counts and number of non-missing allele
calls (nobs).
Returns
-------
DataFrame
A dataframe of SNP allele frequencies, one row per variant allele.
Notes
-----
Cohorts with fewer samples than `min_cohort_size` will be excluded
from output data frame.
File: ~/.local/lib/python3.10/site-packages/malariagen_data/anoph/snp_frq.py
Type: method
To discover SNPs in the Vgsc gene, we need to define some parameters.
First, we need to decide which gene transcript to use when determining what SNP effects will be. Here we’ll use the transcript with identifier “AGAP004707-RD”.
transcript = "AGAP004707-RD"
Next, to compute allele frequencies, we need to decide how our mosquitoes will be grouped into cohorts. There are different ways you can do this, for this analysis we’ll group spatially by level 1 administrative divisions within countries, and temporally by year.
cohorts = "admin1_year"
Next, we need to choose which samples to analyse. There are a number of different sample sets in the Ag3.0 data resource that we could use for this analysis. Let’s check what’s available.
ag3.sample_sets(release="3.0")[['sample_set', 'study_id', 'sample_count']]
sample_set | study_id | sample_count | |
---|---|---|---|
0 | AG1000G-AO | AG1000G-AO | 81 |
1 | AG1000G-BF-A | AG1000G-BF-1 | 181 |
2 | AG1000G-BF-B | AG1000G-BF-1 | 102 |
3 | AG1000G-BF-C | AG1000G-BF-2 | 13 |
4 | AG1000G-CD | AG1000G-CD | 76 |
5 | AG1000G-CF | AG1000G-CF | 73 |
6 | AG1000G-CI | AG1000G-CI | 80 |
7 | AG1000G-CM-A | AG1000G-CM-1 | 303 |
8 | AG1000G-CM-B | AG1000G-CM-2 | 97 |
9 | AG1000G-CM-C | AG1000G-CM-3 | 44 |
10 | AG1000G-FR | AG1000G-FR | 23 |
11 | AG1000G-GA-A | AG1000G-GA-1 | 69 |
12 | AG1000G-GH | AG1000G-GH | 100 |
13 | AG1000G-GM-A | AG1000G-GM-1 | 74 |
14 | AG1000G-GM-B | AG1000G-GM-2 | 31 |
15 | AG1000G-GM-C | AG1000G-GM-3 | 174 |
16 | AG1000G-GN-A | AG1000G-GN-ML | 45 |
17 | AG1000G-GN-B | AG1000G-GN-ML | 185 |
18 | AG1000G-GQ | AG1000G-GQ | 10 |
19 | AG1000G-GW | AG1000G-GW | 101 |
20 | AG1000G-KE | AG1000G-KE | 86 |
21 | AG1000G-ML-A | AG1000G-ML-1 | 60 |
22 | AG1000G-ML-B | AG1000G-ML-2 | 71 |
23 | AG1000G-MW | AG1000G-MW | 41 |
24 | AG1000G-MZ | AG1000G-MZ | 74 |
25 | AG1000G-TZ | AG1000G-TZ | 300 |
26 | AG1000G-UG | AG1000G-UG | 290 |
27 | AG1000G-X | AG1000G-X | 669 |
To keep things simple, for this module we’ll focus on mosquitoes from Burkina Faso. There are three sample sets in the Ag3.0 resource providing data on mosquitoes from Burkina Faso.
sample_sets = ["AG1000G-BF-A", "AG1000G-BF-B", "AG1000G-BF-C"]
OK, now we’re ready to run the analysis.
snp_allele_freqs_df = ag3.snp_allele_frequencies(
transcript=transcript,
cohorts=cohorts,
sample_sets=sample_sets,
drop_invariant=False,
)
snp_allele_freqs_df
pass_gamb_colu_arab | pass_gamb_colu | pass_arab | frq_BF-07_gamb_2004 | frq_BF-09_colu_2012 | frq_BF-09_colu_2014 | frq_BF-09_gamb_2012 | frq_BF-09_gamb_2014 | max_af | transcript | effect | impact | ref_codon | alt_codon | aa_pos | ref_aa | alt_aa | label | |||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
contig | position | ref_allele | alt_allele | aa_change | ||||||||||||||||||
2L | 2358158 | A | C | M1L | True | True | True | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | AGAP004707-RD | START_LOST | HIGH | Atg | Ctg | 1.0 | M | L | 2L:2,358,158 A>C (M1L) |
T | M1L | True | True | True | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | AGAP004707-RD | START_LOST | HIGH | Atg | Ttg | 1.0 | M | L | 2L:2,358,158 A>T (M1L) | |||
G | M1V | True | True | True | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | AGAP004707-RD | START_LOST | HIGH | Atg | Gtg | 1.0 | M | V | 2L:2,358,158 A>G (M1V) | |||
2358159 | T | A | M1K | True | True | True | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | AGAP004707-RD | NON_SYNONYMOUS_CODING | MODERATE | aTg | aAg | 1.0 | M | K | 2L:2,358,159 T>A (M1K) | |
C | M1T | True | True | True | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | AGAP004707-RD | NON_SYNONYMOUS_CODING | MODERATE | aTg | aCg | 1.0 | M | T | 2L:2,358,159 T>C (M1T) | |||
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
2431616 | G | C | *2119S | True | True | True | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | AGAP004707-RD | STOP_LOST | HIGH | tGa | tCa | 2119.0 | * | S | 2L:2,431,616 G>C (*2119S) | |
T | *2119L | True | True | True | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | AGAP004707-RD | STOP_LOST | HIGH | tGa | tTa | 2119.0 | * | L | 2L:2,431,616 G>T (*2119L) | |||
2431617 | A | C | *2119C | True | True | True | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | AGAP004707-RD | STOP_LOST | HIGH | tgA | tgC | 2119.0 | * | C | 2L:2,431,617 A>C (*2119C) | |
T | *2119C | True | True | True | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | AGAP004707-RD | STOP_LOST | HIGH | tgA | tgT | 2119.0 | * | C | 2L:2,431,617 A>T (*2119C) | |||
G | *2119W | True | True | True | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | AGAP004707-RD | STOP_LOST | HIGH | tgA | tgG | 2119.0 | * | W | 2L:2,431,617 A>G (*2119W) |
209001 rows × 18 columns
The output from this function is a pandas DataFrame, where each row provides information about a SNP in the Vgsc gene.
Before we go further, to improve our understanding of what these data, let’s look at some background.
Grouping samples into cohorts#
The MalariaGEN Ag3.0 data resource contain mosquito samples collected across large spatial and temporal scales, and from different mosquito species. When we want to run population genetic analyses on datasets like these, the data must be divided into biologically relevant cohorts, where a cohort is simply a group of samples we want to analyse together.
To help define cohorts and analyse these data, we have added some metadata for each sample about its time and place of collection and its species:
Spatially - For most analyses we use administrative divisions to group the samples into cohorts. These give two levels of spatial resolution, where admin level 1 divides each country into a few large regions, while admin level 2 provides finer scale divisions such as provinces.
Temporally - For each sample we provide the year, quarter and month of collection. Depending on your analysis, you can choose to group samples by year, by year and by year quarter, or by year and by month, although note that for some samples the collection month is missing.
Taxonomically - Ag3.0 contains samples from different species in the Anopheles gambiae complex. To help with grouping by taxon, we have included a “taxon” field in the sample metadata.
Using these three dimensions, we have pre-defined six cohort sets, each of which groups samples into cohorts at different levels of spatio-temporal resolution. Within all cohort sets, samples are further subdivided by taxon.
admin1_year - Cohorts obtained by grouping samples by admin level 1, collection year and taxon.
admin1_quarter - Cohorts obtained by grouping samples by admin level 1, collection year and quarter, and taxon.
admin1_month - Cohorts obtained by grouping samples by admin level 1, collection year and month, and taxon.
admin2_year - Cohorts obtained by grouping samples by admin level 2, collection year and taxon.
admin2_quarter - Cohorts obtained by grouping samples by admin level 2, collection year and quarter, and taxon.
admin2_month - Cohorts obtained by grouping samples by admin level 2, collection year and month, and taxon.
Remember above we chose to use the “admin1_year” cohorts for our Vgsc analysis.
cohorts
'admin1_year'
Let’s now use this to understand the frequency columns in the SNP allele frequencies DataFrame. Here are the frequency column names:
frequency_columns = [
col for col in snp_allele_freqs_df.columns
if col.startswith("frq_")
]
frequency_columns
['frq_BF-07_gamb_2004',
'frq_BF-09_colu_2012',
'frq_BF-09_colu_2014',
'frq_BF-09_gamb_2012',
'frq_BF-09_gamb_2014']
Here “frq_” is used to mean that these columns contain frequencies.
The second part of the column name is either “BF-09” or “BF-07”. These are standard identifiers that refer to level 1 administrative divisions within countries, often called regions. “BF-07” is the Centre-Sud region within Burkina Faso, and “BF-09” is the Haut-Bassins region.
The third part refers to the species, and in this example is either “gamb” meaning Anopheles gambiae or “colu” meaning Anopheles coluzzii.
The final part is the year of collection, which in this example ie either 2004, 2012 or 2014.
Thus, e.g., “frq_BF-09_gamb_2012” means the frequency of the allele within the cohort of Anopheles gambiae mosquitoes from the BF-09 (Haut-Bassins) region collected in 2012.
Single nucleotide polymorphisms (SNPs)#
In the last module we learnt about reference genomes and gene annotations. Now one of the cool things we can do once we have an Anopheles gambiae reference genome, is sequence the genomes of wild mosquitoes and compare them to the reference genome. By aligning an individual’s sequencing reads to the reference, we can call its genotype.
The genotype is derived from how similar the individual is to the reference genome at each nucleotide position.
Where a nucleotide matches the reference genome, it will be called a reference allele, where “allele” is just a term for any kind of genetic variant. Where a nucleotide differs from the reference genome, it will be called an alternative allele.
Since, there are four possible nucleotides and one of these is the reference, there are always three possible alternative alleles for any given site. Another term for a genetic difference where two individuals differ at a single nucleotide position is single nucleotide polymorphism or SNP.
This SNP information is displayed in the first four columns of the snp_allele_frequencies()
index in the output dataframe. We can see each alternative allele represented by it’s own row in our output. Let’s take a look at just the index.
snp_allele_freqs_df[[]]
contig | position | ref_allele | alt_allele | aa_change |
---|---|---|---|---|
2L | 2358158 | A | C | M1L |
T | M1L | |||
G | M1V | |||
2358159 | T | A | M1K | |
C | M1T | |||
... | ... | ... | ... | |
2431616 | G | C | *2119S | |
T | *2119L | |||
2431617 | A | C | *2119C | |
T | *2119C | |||
G | *2119W |
209001 rows × 0 columns
SNP effects - some SNPs may be more interesting than others#
SNPs can have different effects depending on what the nucleotide change is and where in the genome it occurs.
In this analysis, we are interested in SNPs that affect protein structure, specifically those which will change the voltage-gated sodium-channel and could affect protein function and therefore insecticide resistance phenotype, so we need to look within the coding sequences (CDS) of the Vgsc gene.
However, not all SNPs which fall in CDSs cause protein changes, because the genetic code has some redundancy, meaning that different nucleotide sequences can encode the same amino acid. If a SNP in a CDS does change the amino acid, it is called a non-synonymous (NS) or missense SNP, and if the SNP does not change the amino acid, it is called a synonymous SNP.
Manually predicting SNP effects is quite involved, but snp_allele_frequencies()
can predict them for us, let’s look at how this is represented in our output DataFrame.
snp_effects_df = snp_allele_freqs_df[["effect", "impact"]]
snp_effects_df
effect | impact | |||||
---|---|---|---|---|---|---|
contig | position | ref_allele | alt_allele | aa_change | ||
2L | 2358158 | A | C | M1L | START_LOST | HIGH |
T | M1L | START_LOST | HIGH | |||
G | M1V | START_LOST | HIGH | |||
2358159 | T | A | M1K | NON_SYNONYMOUS_CODING | MODERATE | |
C | M1T | NON_SYNONYMOUS_CODING | MODERATE | |||
... | ... | ... | ... | ... | ... | |
2431616 | G | C | *2119S | STOP_LOST | HIGH | |
T | *2119L | STOP_LOST | HIGH | |||
2431617 | A | C | *2119C | STOP_LOST | HIGH | |
T | *2119C | STOP_LOST | HIGH | |||
G | *2119W | STOP_LOST | HIGH |
209001 rows × 2 columns
Let’s look specifically at the genomic position where a SNP occurs which causes an insecticide-resistance mutation, also known as “kdr”.
snp_effects_df.loc[("2L", 2_422_652)]
effect | impact | |||
---|---|---|---|---|
ref_allele | alt_allele | aa_change | ||
A | C | L995F | NON_SYNONYMOUS_CODING | MODERATE |
T | L995F | NON_SYNONYMOUS_CODING | MODERATE | |
G | L995L | SYNONYMOUS_CODING | LOW |
For interest, let’s count the number of SNPs we have by their effect.
snp_allele_freqs_df.query("max_af > 0").groupby(["effect", "impact"]).size()
effect impact
INTRONIC MODIFIER 9467
NON_SYNONYMOUS_CODING MODERATE 121
SPLICE_CORE HIGH 8
SPLICE_REGION MODERATE 9
STOP_GAINED HIGH 14
SYNONYMOUS_CODING LOW 56
dtype: int64
SNP allele frequencies#
Identifying the presence or absence of SNPs in wild caught mosquitoes is interesting, but the real value in generating SNP genotypes from large spatiotemporal collections of mosquitoes comes from the ability to see how groups of samples (cohorts) differ between geographical locations, species, and over time.
One way to compare SNP differences between cohorts is to calculate and compare SNP allele frequencies at each position in the genome by dividing the number of times each SNP allele is found in the cohort by the total number of individuals present in the cohort (multiplied by 2 because each individual mosquito is diploid and so carries two genome copies).
Let’s take another look at the allele frequencies we computed above, focusing just on the frequency columns.
snp_allele_freqs_df[frequency_columns + ['max_af']]
frq_BF-07_gamb_2004 | frq_BF-09_colu_2012 | frq_BF-09_colu_2014 | frq_BF-09_gamb_2012 | frq_BF-09_gamb_2014 | max_af | |||||
---|---|---|---|---|---|---|---|---|---|---|
contig | position | ref_allele | alt_allele | aa_change | ||||||
2L | 2358158 | A | C | M1L | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
T | M1L | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |||
G | M1V | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |||
2358159 | T | A | M1K | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
C | M1T | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |||
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
2431616 | G | C | *2119S | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
T | *2119L | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |||
2431617 | A | C | *2119C | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |
T | *2119C | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |||
G | *2119W | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
209001 rows × 6 columns
And let’s inspect the frequencies for SNPs at a specific genomic position of interest.
snp_allele_freqs_df.loc[("2L", 2_422_652), frequency_columns]
frq_BF-07_gamb_2004 | frq_BF-09_colu_2012 | frq_BF-09_colu_2014 | frq_BF-09_gamb_2012 | frq_BF-09_gamb_2014 | |||
---|---|---|---|---|---|---|---|
ref_allele | alt_allele | aa_change | |||||
A | C | L995F | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 |
T | L995F | 0.076923 | 0.865854 | 0.886792 | 1.0 | 1.0 | |
G | L995L | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 |
Visualising SNP allele frequencies via heatmaps#
To make our SNP allele frequencies DataFrame easier to interpret, we can filter it down to just non-synonymous SNPs that are at frequency above 5% in at least one of our cohorts.
ns_snps_df = snp_allele_freqs_df.query("effect == 'NON_SYNONYMOUS_CODING' and max_af >= 0.05")
ns_snps_df
pass_gamb_colu_arab | pass_gamb_colu | pass_arab | frq_BF-07_gamb_2004 | frq_BF-09_colu_2012 | frq_BF-09_colu_2014 | frq_BF-09_gamb_2012 | frq_BF-09_gamb_2014 | max_af | transcript | effect | impact | ref_codon | alt_codon | aa_pos | ref_aa | alt_aa | label | |||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
contig | position | ref_allele | alt_allele | aa_change | ||||||||||||||||||
2L | 2391228 | G | C | V402L | True | True | True | 0.000000 | 0.067073 | 0.028302 | 0.000000 | 0.000000 | 0.067073 | AGAP004707-RD | NON_SYNONYMOUS_CODING | MODERATE | Gta | Cta | 402.0 | V | L | 2L:2,391,228 G>C (V402L) |
T | V402L | True | True | True | 0.000000 | 0.054878 | 0.084906 | 0.000000 | 0.000000 | 0.084906 | AGAP004707-RD | NON_SYNONYMOUS_CODING | MODERATE | Gta | Tta | 402.0 | V | L | 2L:2,391,228 G>T (V402L) | |||
2416980 | C | T | T791M | True | True | True | 0.000000 | 0.018293 | 0.000000 | 0.161616 | 0.239130 | 0.239130 | AGAP004707-RD | NON_SYNONYMOUS_CODING | MODERATE | aCg | aTg | 791.0 | T | M | 2L:2,416,980 C>T (T791M) | |
2422652 | A | T | L995F | True | True | True | 0.076923 | 0.865854 | 0.886792 | 1.000000 | 1.000000 | 1.000000 | AGAP004707-RD | NON_SYNONYMOUS_CODING | MODERATE | ttA | ttT | 995.0 | L | F | 2L:2,422,652 A>T (L995F) | |
2429617 | T | C | I1527T | True | True | True | 0.000000 | 0.121951 | 0.113208 | 0.000000 | 0.000000 | 0.121951 | AGAP004707-RD | NON_SYNONYMOUS_CODING | MODERATE | aTt | aCt | 1527.0 | I | T | 2L:2,429,617 T>C (I1527T) | |
2429745 | A | T | N1570Y | True | True | True | 0.038462 | 0.250000 | 0.320755 | 0.212121 | 0.141304 | 0.320755 | AGAP004707-RD | NON_SYNONYMOUS_CODING | MODERATE | Aat | Tat | 1570.0 | N | Y | 2L:2,429,745 A>T (N1570Y) | |
2429897 | A | G | E1597G | True | True | True | 0.000000 | 0.000000 | 0.000000 | 0.065657 | 0.032609 | 0.065657 | AGAP004707-RD | NON_SYNONYMOUS_CODING | MODERATE | gAa | gGa | 1597.0 | E | G | 2L:2,429,897 A>G (E1597G) | |
2429915 | A | C | K1603T | True | True | True | 0.000000 | 0.054878 | 0.056604 | 0.000000 | 0.000000 | 0.056604 | AGAP004707-RD | NON_SYNONYMOUS_CODING | MODERATE | aAg | aCg | 1603.0 | K | T | 2L:2,429,915 A>C (K1603T) | |
2430424 | G | T | A1746S | False | True | False | 0.000000 | 0.000000 | 0.000000 | 0.151515 | 0.239130 | 0.239130 | AGAP004707-RD | NON_SYNONYMOUS_CODING | MODERATE | Gcc | Tcc | 1746.0 | A | S | 2L:2,430,424 G>T (A1746S) | |
2430863 | T | C | I1868T | True | True | True | 0.000000 | 0.000000 | 0.000000 | 0.247475 | 0.206522 | 0.247475 | AGAP004707-RD | NON_SYNONYMOUS_CODING | MODERATE | aTa | aCa | 1868.0 | I | T | 2L:2,430,863 T>C (I1868T) | |
2430880 | C | T | P1874S | True | True | True | 0.000000 | 0.213415 | 0.169811 | 0.000000 | 0.000000 | 0.213415 | AGAP004707-RD | NON_SYNONYMOUS_CODING | MODERATE | Cct | Tct | 1874.0 | P | S | 2L:2,430,880 C>T (P1874S) | |
2430881 | C | T | P1874L | True | True | True | 0.000000 | 0.073171 | 0.056604 | 0.222222 | 0.260870 | 0.260870 | AGAP004707-RD | NON_SYNONYMOUS_CODING | MODERATE | cCt | cTt | 1874.0 | P | L | 2L:2,430,881 C>T (P1874L) | |
2431061 | C | T | A1934V | True | True | True | 0.000000 | 0.121951 | 0.132075 | 0.000000 | 0.000000 | 0.132075 | AGAP004707-RD | NON_SYNONYMOUS_CODING | MODERATE | gCt | gTt | 1934.0 | A | V | 2L:2,431,061 C>T (A1934V) |
To make things even clearer, we have included a heatmap plotting function to style our filtered DataFrame, called plot_frequencies_heatmap()
.
ag3.plot_frequencies_heatmap(ns_snps_df)
Amino acid substitution frequencies#
You might have noticed that there are two rows with V402L
in our previous heatmap plot. This is because in Burkina Faso, we find two different alternative alleles at the same genomic position, both causing the same amino acid substitution (valine to leucine).
If we are just interested in amino acid change frequencies, for example, when looking at potential insecticide resistance conferring mutations, we might want to combine the frequencies of the two alleles which cause V402L. In this case, we can use the aa_allele_frequencies()
function in exactly the same way as we used snp_allele_frequencies()
.
aa_allele_freqs_df = ag3.aa_allele_frequencies(
transcript=transcript,
cohorts=cohorts,
sample_sets=sample_sets
)
aa_allele_freqs_df
frq_BF-07_gamb_2004 | frq_BF-09_colu_2012 | frq_BF-09_colu_2014 | frq_BF-09_gamb_2012 | frq_BF-09_gamb_2014 | transcript | aa_pos | ref_allele | ref_aa | alt_aa | effect | impact | alt_allele | max_af | label | |||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
aa_change | contig | position | |||||||||||||||
A32V | 2L | 2358252 | 0.000000 | 0.006098 | 0.0 | 0.000000 | 0.000000 | AGAP004707-RD | 32.0 | C | A | V | NON_SYNONYMOUS_CODING | MODERATE | T | 0.006098 | A32V (2L:2,358,252 C>T) |
G54C | 2L | 2362019 | 0.000000 | 0.018293 | 0.0 | 0.000000 | 0.010870 | AGAP004707-RD | 54.0 | G | G | C | NON_SYNONYMOUS_CODING | MODERATE | T | 0.018293 | G54C (2L:2,362,019 G>T) |
P55L | 2L | 2362023 | 0.000000 | 0.000000 | 0.0 | 0.005051 | 0.010870 | AGAP004707-RD | 55.0 | C | P | L | NON_SYNONYMOUS_CODING | MODERATE | T | 0.010870 | P55L (2L:2,362,023 C>T) |
P59T | 2L | 2362034 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.021739 | AGAP004707-RD | 59.0 | C | P | T | NON_SYNONYMOUS_CODING | MODERATE | A | 0.021739 | P59T (2L:2,362,034 C>A) |
G73D | 2L | 2362077 | 0.000000 | 0.006098 | 0.0 | 0.000000 | 0.000000 | AGAP004707-RD | 73.0 | G | G | D | NON_SYNONYMOUS_CODING | MODERATE | A | 0.006098 | G73D (2L:2,362,077 G>A) |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
A2023G | 2L | 2431328 | 0.038462 | 0.000000 | 0.0 | 0.000000 | 0.000000 | AGAP004707-RD | 2023.0 | C | A | G | NON_SYNONYMOUS_CODING | MODERATE | G | 0.038462 | A2023G (2L:2,431,328 C>G) |
S2037R | 2L | 2431371 | 0.000000 | 0.000000 | 0.0 | 0.005051 | 0.000000 | AGAP004707-RD | 2037.0 | C | S | R | NON_SYNONYMOUS_CODING | MODERATE | G | 0.005051 | S2037R (2L:2,431,371 C>G) |
I2053V | 2L | 2431417 | 0.038462 | 0.000000 | 0.0 | 0.000000 | 0.000000 | AGAP004707-RD | 2053.0 | A | I | V | NON_SYNONYMOUS_CODING | MODERATE | G | 0.038462 | I2053V (2L:2,431,417 A>G) |
G2055V | 2L | 2431424 | 0.000000 | 0.000000 | 0.0 | 0.005051 | 0.000000 | AGAP004707-RD | 2055.0 | G | G | V | NON_SYNONYMOUS_CODING | MODERATE | T | 0.005051 | G2055V (2L:2,431,424 G>T) |
V2118A | 2L | 2431613 | 0.000000 | 0.000000 | 0.0 | 0.005051 | 0.000000 | AGAP004707-RD | 2118.0 | T | V | A | NON_SYNONYMOUS_CODING | MODERATE | C | 0.005051 | V2118A (2L:2,431,613 T>C) |
134 rows × 15 columns
Let’s filter it again to just amino acid changes greater than 5% in at least one cohort. We don’t need to filter for non-synonymous mutations this time as this function has already done that for us.
aa_filt_df = aa_allele_freqs_df.query("max_af >= 0.05")
aa_filt_df
frq_BF-07_gamb_2004 | frq_BF-09_colu_2012 | frq_BF-09_colu_2014 | frq_BF-09_gamb_2012 | frq_BF-09_gamb_2014 | transcript | aa_pos | ref_allele | ref_aa | alt_aa | effect | impact | alt_allele | max_af | label | |||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
aa_change | contig | position | |||||||||||||||
V402L | 2L | 2391228 | 0.000000 | 0.121951 | 0.113208 | 0.000000 | 0.000000 | AGAP004707-RD | 402.0 | G | V | L | NON_SYNONYMOUS_CODING | MODERATE | {C,T} | 0.121951 | V402L (2L:2,391,228 G>{C,T}) |
T791M | 2L | 2416980 | 0.000000 | 0.018293 | 0.000000 | 0.161616 | 0.239130 | AGAP004707-RD | 791.0 | C | T | M | NON_SYNONYMOUS_CODING | MODERATE | T | 0.239130 | T791M (2L:2,416,980 C>T) |
L995F | 2L | 2422652 | 0.076923 | 0.865854 | 0.886792 | 1.000000 | 1.000000 | AGAP004707-RD | 995.0 | A | L | F | NON_SYNONYMOUS_CODING | MODERATE | T | 1.000000 | L995F (2L:2,422,652 A>T) |
I1527T | 2L | 2429617 | 0.000000 | 0.121951 | 0.113208 | 0.000000 | 0.000000 | AGAP004707-RD | 1527.0 | T | I | T | NON_SYNONYMOUS_CODING | MODERATE | C | 0.121951 | I1527T (2L:2,429,617 T>C) |
N1570Y | 2L | 2429745 | 0.038462 | 0.250000 | 0.320755 | 0.212121 | 0.141304 | AGAP004707-RD | 1570.0 | A | N | Y | NON_SYNONYMOUS_CODING | MODERATE | T | 0.320755 | N1570Y (2L:2,429,745 A>T) |
E1597G | 2L | 2429897 | 0.000000 | 0.000000 | 0.000000 | 0.065657 | 0.032609 | AGAP004707-RD | 1597.0 | A | E | G | NON_SYNONYMOUS_CODING | MODERATE | G | 0.065657 | E1597G (2L:2,429,897 A>G) |
K1603T | 2L | 2429915 | 0.000000 | 0.054878 | 0.056604 | 0.000000 | 0.000000 | AGAP004707-RD | 1603.0 | A | K | T | NON_SYNONYMOUS_CODING | MODERATE | C | 0.056604 | K1603T (2L:2,429,915 A>C) |
A1746S | 2L | 2430424 | 0.000000 | 0.000000 | 0.000000 | 0.151515 | 0.239130 | AGAP004707-RD | 1746.0 | G | A | S | NON_SYNONYMOUS_CODING | MODERATE | T | 0.239130 | A1746S (2L:2,430,424 G>T) |
I1868T | 2L | 2430863 | 0.000000 | 0.000000 | 0.000000 | 0.247475 | 0.206522 | AGAP004707-RD | 1868.0 | T | I | T | NON_SYNONYMOUS_CODING | MODERATE | C | 0.247475 | I1868T (2L:2,430,863 T>C) |
P1874S | 2L | 2430880 | 0.000000 | 0.213415 | 0.169811 | 0.000000 | 0.000000 | AGAP004707-RD | 1874.0 | C | P | S | NON_SYNONYMOUS_CODING | MODERATE | T | 0.213415 | P1874S (2L:2,430,880 C>T) |
P1874L | 2L | 2430881 | 0.000000 | 0.073171 | 0.056604 | 0.222222 | 0.260870 | AGAP004707-RD | 1874.0 | C | P | L | NON_SYNONYMOUS_CODING | MODERATE | T | 0.260870 | P1874L (2L:2,430,881 C>T) |
A1934V | 2L | 2431061 | 0.000000 | 0.121951 | 0.132075 | 0.000000 | 0.000000 | AGAP004707-RD | 1934.0 | C | A | V | NON_SYNONYMOUS_CODING | MODERATE | T | 0.132075 | A1934V (2L:2,431,061 C>T) |
Now we can visualise these frequencies the same way we did before.
ag3.plot_frequencies_heatmap(aa_filt_df)
From this new heatmap it’s clear that there are two SNPs causing a V402L substitution, and the combined frequency of these is identical to that of the I1527T substitution in the An. coluzzii cohorts. This is a strong indication that these two mutations are strongly associated and may be working together to confer a selective advantage via insecticide resistance.
Using sample queries#
In some cases you may want to analyse data across a number of different sample sets, but select only samples from a single country or species. In this case there is a useful sample_query
parameter available with the functions snp_allele_frequencies()
and aa_allele_frequencies()
. The sample_query
parameter accepts a pandas query string, which will be applied to the sample metadata to select samples to use in the analysis.
Let’s see it in action by computing amino acid allele frequencies for all An. arabiensis samples in the Ag3.0 data resource.
aa_arab_freqs_df = ag3.aa_allele_frequencies(
transcript=transcript,
cohorts=cohorts,
sample_sets="3.0",
sample_query="taxon == 'arabiensis'",
)
aa_arab_freqs_df = aa_arab_freqs_df.query("max_af > 0.05")
ag3.plot_frequencies_heatmap(aa_arab_freqs_df)
Plotting SNP frequency time series#
For surveillance it can be very informative to observice changes in allele frequencies over time for a given sampling location and species. Alleles under selection due to insecticide resistance will increase in frequency over time, and plotting the frequency data as a time series can help to see this. The malariagen_data
package includes a function plot_frequencies_time_series()
for plotting frequency time series. These are a little more advanced to use, but we’ll illustrate them below.
N.B., in the Ag3.0 data resource there is only one country where we have data from multiple time points, which is Burkina Faso. However, data in future releases will contain a number of time series, and so it’s useful to illustrate this functionality now.
First we need to compute the SNP allele frequencies in our gene of interest. To obtain the necessary data for time series plotting, we use a slightly different function for computing the frequencies. Don’t worry too much about this for the moment, we will revisit later in the course.
ds_snps_bf = ag3.aa_allele_frequencies_advanced(
transcript=transcript,
area_by="admin1_iso", # group samples in space by admin level 1
period_by="year", # group samples in time by year
sample_sets="3.0",
sample_query="country == 'Burkina Faso'",
variant_query="max_af > 0.05", # only keep variants above 5% frequency in at least one cohort
)
ds_snps_bf
Access SNP calls: ⠋ (0:00:00.00)
/home/jonbrenas/.local/lib/python3.10/site-packages/malariagen_data/anoph/snp_frq.py:1203: FutureWarning:
'A' is deprecated and will be removed in a future version, please use 'Y' instead.
<xarray.Dataset> Size: 4kB Dimensions: (cohorts: 5, variants: 12) Dimensions without coordinates: cohorts, variants Data variables: (12/31) cohort_area (cohorts) object 40B 'BF-09' 'BF-09' ... 'BF-09' cohort_label (cohorts) object 40B 'BF-09_colu_2012' ... 'BF-09... cohort_lat_max (cohorts) float64 40B 11.24 11.24 12.06 11.24 11.24 cohort_lat_mean (cohorts) float64 40B 11.22 11.23 12.06 11.19 11.21 cohort_lat_min (cohorts) float64 40B 11.15 11.23 12.06 11.15 11.15 cohort_lon_max (cohorts) float64 40B -4.235 -4.235 ... -4.235 ... ... variant_label (variants) object 96B 'V402L (2L:2,391,228 G>{C,T... variant_max_af (variants) float64 96B 0.122 0.2391 ... 0.1321 variant_position (variants) int32 48B 2391228 2416980 ... 2431061 variant_ref_aa (variants) object 96B 'V' 'T' 'L' ... 'P' 'P' 'A' variant_ref_allele (variants) object 96B 'G' 'C' 'A' ... 'C' 'C' 'C' variant_transcript (variants) object 96B 'AGAP004707-RD' ... 'AGAP00... Attributes: title: AGAP004707-RD (Vgsc/para) SNP frequencies
Now let’s plot the time series.
ag3.plot_frequencies_time_series(ds_snps_bf)
Note that this plot is interactive, you can hover over points for more information, and zoom into regions of the plot.
In this plot we have data from two regions, BF-09 (Haut-Bassins) and BF-07 (Centre-Sud). In BF-09 we have data from 2012 and 2014, and so we have the beginning of a time series. In BF-07 we have data only from 2004, so just a single point in time.
There are 3 sub-plots, faceted by taxon and area to allow for comparison. The error bars in the plots show 95% confidence intervals for the frequency estimates. In BF-09 where we have data from 2012 and 2014, it looks like all SNP frequencies were similar between 2012 and 2014 in both An. gambiae and An. coluzzii, with some known insecticide resistance alleles such as L995F already at fixation or very high frequency.
It’s interesting to note that in BF-07 region in 2004 there were no SNPs above 8% frequency, including the L995F SNP. This suggests that L995F rose in frequency in Burkina Faso between 2004 and 2012, although we cannot draw strong conclusions here because we have data from different regions.
Plotting SNP frequency maps#
Another analysis we can perform is to make geographical comparisons, by plotting SNP frequencies on a map. We can do this via functions in the malariagen_data
package. Let’s demonstrate, using all data in the Ag3.0 resource.
ds_snps = ag3.aa_allele_frequencies_advanced(
transcript=transcript,
area_by="admin1_iso", # group samples in space by admin level 1
period_by="year", # group samples in time by year
sample_sets="3.0",
variant_query="max_af > 0.05", # only keep variants above 5% frequency in at least one cohort
)
ds_snps
Load genome features: ⠋ (0:00:00.00)
/home/jonbrenas/.local/lib/python3.10/site-packages/malariagen_data/anoph/snp_frq.py:1203: FutureWarning:
'A' is deprecated and will be removed in a future version, please use 'Y' instead.
<xarray.Dataset> Size: 73kB Dimensions: (cohorts: 54, variants: 30) Dimensions without coordinates: cohorts, variants Data variables: (12/31) cohort_area (cohorts) object 432B 'KE-14' 'MW-S' ... 'TZ-25' cohort_label (cohorts) object 432B 'KE-14_arab_2012' ... 'TZ-2... cohort_lat_max (cohorts) float64 432B -3.511 -15.93 ... -4.94 cohort_lat_mean (cohorts) float64 432B -3.511 -15.93 ... -4.94 cohort_lat_min (cohorts) float64 432B -3.511 -15.93 ... -4.94 cohort_lon_max (cohorts) float64 432B 39.91 34.76 ... 39.91 38.95 ... ... variant_label (variants) object 240B 'R254K (2L:2,390,177 G>A)'... variant_max_af (variants) float64 240B 0.5197 0.25 ... 0.07143 variant_position (variants) int32 120B 2390177 2391228 ... 2431417 variant_ref_aa (variants) object 240B 'R' 'V' 'D' ... 'I' 'T' 'I' variant_ref_allele (variants) object 240B 'G' 'G' 'G' ... 'T' 'A' 'A' variant_transcript (variants) object 240B 'AGAP004707-RD' ... 'AGAP0... Attributes: title: AGAP004707-RD (Vgsc/para) SNP frequencies
Take a look at the variants we’ve found.
ds_snps["variant_label"].values
array(['R254K (2L:2,390,177 G>A)', 'V402L (2L:2,391,228 G>{C,T})',
'D466H (2L:2,399,997 G>C)', 'M490I (2L:2,400,071 G>{A,T})',
'A656V (2L:2,407,769 C>T)', 'I693V (2L:2,407,954 A>G)',
'E752V (2L:2,416,863 A>T)', 'M757L (2L:2,416,877 A>T)',
'T791M (2L:2,416,980 C>T)', 'E862* (2L:2,417,305 G>T)',
'L995S (2L:2,422,651 T>C)', 'L995F (2L:2,422,652 A>T)',
'A1125V (2L:2,424,384 C>T)', 'V1507I (2L:2,429,556 G>A)',
'I1527T (2L:2,429,617 T>C)', 'N1570Y (2L:2,429,745 A>T)',
'E1597Q (2L:2,429,896 G>C)', 'E1597G (2L:2,429,897 A>G)',
'K1603T (2L:2,429,915 A>C)', 'L1667M (2L:2,430,106 T>A)',
'A1746S (2L:2,430,424 G>T)', 'Y1846* (2L:2,430,798 C>{A,G})',
'V1853I (2L:2,430,817 G>A)', 'I1868T (2L:2,430,863 T>C)',
'P1874S (2L:2,430,880 C>T)', 'P1874L (2L:2,430,881 C>T)',
'A1934V (2L:2,431,061 C>T)', 'I1940T (2L:2,431,079 T>C)',
'T2044A (2L:2,431,390 A>G)', 'I2053V (2L:2,431,417 A>G)'],
dtype=object)
To explore these data a bit more interactively, there is also a function plot_frequencies_interactive_map()
to create a map with some controls to select the taxon, period and variant of interest. Here’s an example of how to use it.
ag3.plot_frequencies_interactive_map(ds_snps)
N.B., you will need to run this notebook for yourself in colab in order for the interactive map to work.
Well done!#
In this module we have learnt how to analyse SNP mutations in the target of pyrethroid insecticides, the voltage-gated sodium-channel. We have calculated the allele frequencies of the SNPs in cohorts of mosquitoes and learnt how to filter and plot them for ease of interpretation.
Practical exercises#
English#
Open this notebook in Google Colab and run it for yourself from top to bottom. Hint: click the rocket icon at the top of the page, then select “Colab” from the drop-down menu.
Looking at the heatmap output (either amino acid or SNP), can you spot a relationship between the
V402L
andL995F
frequencies? If so, what is it?Re-run the whole analysis but using the Ghanaian sample set. Hint: Try
sample_sets = "AG1000G-GH"
. Or any other samples of interest.What are the cohorts for this new sample set? Hint: see
frequency_columns
.Above, we looked at the kdr “West” SNP position, compare and contrast this with the kdr “East” SNP position. What is the amino acid change for kdr “East”? Hint: The position is 2422651.
For the Ghanaian sample set, add the a y_label which says “aa change” to the amino acid frequency heatmap. Hint: you can view all the function’s parameters with
ag3.aa_allele_frequencies?
.Remove the colorbar from the same heatmap. Hint:
False
.Is the same relationship between
V402L
andL995F
frequencies present in Ghana? What might be an evolutionary interpretation of this relationship?
Français#
Ouvrir ce notebook dans Google Colab et l’exécuter vous-même du début à la fin. Indice: cliquer sur l’icône fusée () et sélectionner “Colab” dans le menu déroulant.
En regardant la heatmap (soit pour les SNPs soit pour les amino acides), pouvez vous détecter la relation entre les fréquences de
V402L
etL995F
? Si oui, quelle est elle?Re-exécuter toute l’analyse mais en utilisant les données ghanéennes. Indice: utiliser
sample_sets = "AG1000G-GH"
.Quelles sont les cohortes de ce nouvel ensemble de données? Indice: regarder
frequency_columns
.Plus haut, nous avons regardé la position SNP de kdr “West”, comparer et contraster la avec la position kdr “East”. Quelle est la différence en terme d’amino acides pour kdr “East”? Indice: position = 2422651.
Pour l’ensemble de données ghanéennes, ajouter le y_label nommé “aa change” à la heatmap des fréquences des amino acides. Indice: tous les paramètres de la fonction peuvent être vus avec
ag3.aa_allele_frequencies?
.Supprimer la barre de couleur de cette même heatmap. Indice: quelque chose doit être défini comme
False
.Est-ce que la même relation entre les fréquences de
V402L
etL995F
présente au Ghana? Quelle pourrait être l’interprétation en terme d’évolution de cette relation?