Module 4 - Analysing SNPs in the Vgsc gene

Module 4 - Analysing SNPs in the Vgsc gene#

Theme: Analysis

In this module we’re going to perform an analysis to discover single nucleotide polymorphisms (SNPs) in the voltage-gated sodium channel gene (Vgsc), which encodes the binding target for pyrethroid insecticides.

Learning objectives#

At the end of this module you will be able to:

Discover mutations (SNPs) that could potentially cause pyrethroid target-site resistance.
Compute SNP allele frequencies, i.e., how common are they in different mosquito cohorts?
Perform analyses to compare SNP allele frequences between mosquitoes from different species, geographical locations and dates of collection.

Lecture#

English#

Français#

Please note that the code in the cells below might differ from that shown in the video. This can happen because Python packages and their dependencies change due to updates, necessitating tweaks to the code.

Discovering SNPs in the Vgsc gene and computing SNP allele frequencies#

First, let’s set up the malariagen_data package to access MalariaGEN data in the cloud.

%pip install -q --no-warn-conflicts malariagen_data

Note that authentication is required to access data through the package, please follow the instructions here.

import malariagen_data

import pandas as pd

import plotly.io as pio
pio.renderers.default = "notebook+colab"

#import os
#os.environ["MGEN_SHOW_PROGRESS"] = "0"

ag3 = malariagen_data.Ag3()
ag3

MalariaGEN Ag3 API client
Please note that data are subject to terms of use, for more information see the MalariaGEN website or contact support@malariagen.net. See also the Ag3 API docs.
Storage URL	gs://vo_agam_release_master_us_central1
Data releases available	3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 3.10, 3.11, 3.12, 3.13, 3.14
Results cache	None
Cohorts analysis	20250502
AIM analysis	20220528
Site filters analysis	dt_20200416
Software version	malariagen_data 15.0.1
Client location	Iowa, United States (Google Cloud us-central1)

To discover SNPs and compute allele frequencies, we’re going to use the snp_allele_frequencies() function. Let’s have a look at the documentation for this function.

ag3.snp_allele_frequencies?

Signature:
ag3.snp_allele_frequencies(
    transcript: str,
    cohorts: Union[str, Mapping[str, str]],
    sample_query: Optional[str] = None,
    sample_query_options: Optional[dict] = None,
    min_cohort_size: int = 10,
    site_mask: Optional[str] = None,
    sample_sets: Union[str, Sequence[str], NoneType] = None,
    drop_invariant: bool = True,
    effects: bool = True,
    include_counts: bool = False,
    chunks: Union[int, str, Tuple[Union[int, str], ...], Callable[[Tuple[int, ...]], Union[int, str, Tuple[Union[int, str], ...]]]] = 'native',
    inline_array: bool = True,
) -> pandas.core.frame.DataFrame
Docstring:
Compute SNP allele frequencies for a gene transcript.

Parameters
----------
transcript : str
    Gene transcript identifier.
cohorts : str or Mapping[str, str]
    Either a string giving the name of a predefined cohort set (e.g.,
    "admin1_month") or a dict mapping custom cohort labels to sample
    queries.
sample_query : str or None, optional
    A pandas query string to be evaluated against the sample metadata, to
    select samples to be included in the returned data.
sample_query_options : dict or None, optional
    A dictionary of arguments that will be passed through to pandas
    query() or eval(), e.g. parser, engine, local_dict, global_dict,
    resolvers.
min_cohort_size : int, optional, default: 10
    Minimum cohort size. Raise an error if the number of samples is less
    than this value.
site_mask : str or None, optional
    Which site filters mask to apply. See the `site_mask_ids` property for
    available values.
sample_sets : sequence of str or str or None, optional
    List of sample sets and/or releases. Can also be a single sample set
    or release.
drop_invariant : bool, optional, default: True
    If True, drop variants not observed in the selected samples.
effects : bool, optional, default: True
    If True, add SNP effect annotations.
include_counts : bool, optional, default: False
    Include columns with allele counts and number of non-missing allele
    calls (nobs).
chunks : int or str or tuple of int or str or Callable[[typing.Tuple[int, ...]], int or str or tuple of int or str], optional, default: 'native'
    Define how input data being read from zarr should be divided into
    chunks for a dask computation. If 'native', use underlying zarr
    chunks. If a string specifying a target memory size, e.g., '300 MiB',
    resize chunks in arrays with more than one dimension to match this
    size. If 'auto', let dask decide chunk size.  If 'ndauto', let dask
    decide chunk size but only for arrays with more than one dimension. If
    'ndauto0', as 'ndauto' but only vary the first chunk dimension. If
    'ndauto1', as 'ndauto' but only vary the second chunk dimension. If
    'ndauto01', as 'ndauto' but only vary the first and second chunk
    dimensions. Also, can be a tuple of integers, or a callable which
    accepts the native chunks as a single argument and returns a valid
    dask chunks value.
inline_array : bool, optional, default: True
    Passed through to dask `from_array()`.

Returns
-------
DataFrame
    A dataframe of SNP allele frequencies, one row per variant allele.
    The variant alleles are indexed by their contig, their position, the
    reference allele, the alternate allele and the associated amino acid
    change. The columns are split into three categories: there is one
    column for each taxon filter (e.g., pass_funestus, pass_gamb_colu,
    ...) containing whether the site of the variant allele passes the
    filter; there is then 1 column for each cohort containing the
    frequency of the variant allele within the cohort, additionally
    there is a column `max_af` containing the maximum allele frequency
    of the variant allele accross all cohorts; finally, there are 9
    columns describing the variant allele: `transcript` contains the
    gene transcript used for this analysis, `effect` is the effect of
    the allele change, `impact`is the impact of the allele change,
    `ref_codon` is the reference codon, `alt_codon` is the altered codon
    with the variant allele, `aa_pos` is the position of the amino acid,
    `ref_aa` is the reference amino acid, `alt_aa` is the altered amino
    acid with the varaint allele, and `label` is the label of the
    variant allele.

Notes
-----
Cohorts with fewer samples than `min_cohort_size` will be excluded
from output data frame.
File:      /home/conda/developer/55fe7ffdc8f19782d8fa1d5de44c1f26cc58e5a472146c0c30061f0238bc3185-20250317-170017-682536-85-training-nb-maintenance-mgen-15.0.1/lib/python3.11/site-packages/malariagen_data/anoph/snp_frq.py
Type:      method

To discover SNPs in the Vgsc gene, we need to define some parameters.

First, we need to decide which gene transcript to use when determining what SNP effects will be. Here we’ll use the transcript with identifier “AGAP004707-RD”.

transcript = "AGAP004707-RD"

Next, to compute allele frequencies, we need to decide how our mosquitoes will be grouped into cohorts. There are different ways you can do this, for this analysis we’ll group spatially by level 1 administrative divisions within countries, and temporally by year.

cohorts = "admin1_year"

Next, we need to choose which samples to analyse. There are a number of different sample sets in the Ag3.0 data resource that we could use for this analysis. Let’s check what’s available.

ag3.sample_sets(release="3.0")[['sample_set', 'study_id', 'sample_count']]

	sample_set	study_id	sample_count
0	AG1000G-AO	AG1000G-AO	81
1	AG1000G-BF-A	AG1000G-BF-1	181
2	AG1000G-BF-B	AG1000G-BF-1	102
3	AG1000G-BF-C	AG1000G-BF-2	13
4	AG1000G-CD	AG1000G-CD	76
5	AG1000G-CF	AG1000G-CF	73
6	AG1000G-CI	AG1000G-CI	80
7	AG1000G-CM-A	AG1000G-CM-1	303
8	AG1000G-CM-B	AG1000G-CM-2	97
9	AG1000G-CM-C	AG1000G-CM-3	44
10	AG1000G-FR	AG1000G-FR	23
11	AG1000G-GA-A	AG1000G-GA-1	69
12	AG1000G-GH	AG1000G-GH	100
13	AG1000G-GM-A	AG1000G-GM-1	74
14	AG1000G-GM-B	AG1000G-GM-2	31
15	AG1000G-GM-C	AG1000G-GM-3	174
16	AG1000G-GN-A	AG1000G-GN-ML	45
17	AG1000G-GN-B	AG1000G-GN-ML	185
18	AG1000G-GQ	AG1000G-GQ	10
19	AG1000G-GW	AG1000G-GW	101
20	AG1000G-KE	AG1000G-KE	86
21	AG1000G-ML-A	AG1000G-ML-1	60
22	AG1000G-ML-B	AG1000G-ML-2	71
23	AG1000G-MW	AG1000G-MW	41
24	AG1000G-MZ	AG1000G-MZ	74
25	AG1000G-TZ	AG1000G-TZ	300
26	AG1000G-UG	AG1000G-UG	290
27	AG1000G-X	AG1000G-X	669

To keep things simple, for this module we’ll focus on mosquitoes from Burkina Faso. There are three sample sets in the Ag3.0 resource providing data on mosquitoes from Burkina Faso.

sample_sets = ["AG1000G-BF-A", "AG1000G-BF-B", "AG1000G-BF-C"]

OK, now we’re ready to run the analysis.

snp_allele_freqs_df = ag3.snp_allele_frequencies(
    transcript=transcript, 
    cohorts=cohorts, 
    sample_sets=sample_sets, 
    drop_invariant=False,
)
snp_allele_freqs_df

					pass_gamb_colu_arab	pass_gamb_colu	pass_arab	frq_BF-07_gamb_2004	frq_BF-09_colu_2012	frq_BF-09_colu_2014	frq_BF-09_gamb_2012	frq_BF-09_gamb_2014	max_af	transcript	effect	impact	ref_codon	alt_codon	aa_pos	ref_aa	alt_aa	label
contig	position	ref_allele	alt_allele	aa_change
2L	2358158	A	C	M1L	True	True	True	0.0	0.0	0.0	0.0	0.0	0.0	AGAP004707-RD	START_LOST	HIGH	Atg	Ctg	1.0	M	L	2L:2,358,158 A>C (M1L)
			T	M1L	True	True	True	0.0	0.0	0.0	0.0	0.0	0.0	AGAP004707-RD	START_LOST	HIGH	Atg	Ttg	1.0	M	L	2L:2,358,158 A>T (M1L)
			G	M1V	True	True	True	0.0	0.0	0.0	0.0	0.0	0.0	AGAP004707-RD	START_LOST	HIGH	Atg	Gtg	1.0	M	V	2L:2,358,158 A>G (M1V)
	2358159	T	A	M1K	True	True	True	0.0	0.0	0.0	0.0	0.0	0.0	AGAP004707-RD	NON_SYNONYMOUS_CODING	MODERATE	aTg	aAg	1.0	M	K	2L:2,358,159 T>A (M1K)
	2358159	T	C	M1T	True	True	True	0.0	0.0	0.0	0.0	0.0	0.0	AGAP004707-RD	NON_SYNONYMOUS_CODING	MODERATE	aTg	aCg	1.0	M	T	2L:2,358,159 T>C (M1T)
	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
	2431616	G	C	*2119S	True	True	True	0.0	0.0	0.0	0.0	0.0	0.0	AGAP004707-RD	STOP_LOST	HIGH	tGa	tCa	2119.0	*	S	2L:2,431,616 G>C (*2119S)
	2431616	G	T	*2119L	True	True	True	0.0	0.0	0.0	0.0	0.0	0.0	AGAP004707-RD	STOP_LOST	HIGH	tGa	tTa	2119.0	*	L	2L:2,431,616 G>T (*2119L)
	2431617	A	C	*2119C	True	True	True	0.0	0.0	0.0	0.0	0.0	0.0	AGAP004707-RD	STOP_LOST	HIGH	tgA	tgC	2119.0	*	C	2L:2,431,617 A>C (*2119C)
			T	*2119C	True	True	True	0.0	0.0	0.0	0.0	0.0	0.0	AGAP004707-RD	STOP_LOST	HIGH	tgA	tgT	2119.0	*	C	2L:2,431,617 A>T (*2119C)
			G	*2119W	True	True	True	0.0	0.0	0.0	0.0	0.0	0.0	AGAP004707-RD	STOP_LOST	HIGH	tgA	tgG	2119.0	*	W	2L:2,431,617 A>G (*2119W)

209001 rows × 18 columns

The output from this function is a pandas DataFrame, where each row provides information about a SNP in the Vgsc gene.

Before we go further, to improve our understanding of what these data, let’s look at some background.

Grouping samples into cohorts#

The MalariaGEN Ag3.0 data resource contain mosquito samples collected across large spatial and temporal scales, and from different mosquito species. When we want to run population genetic analyses on datasets like these, the data must be divided into biologically relevant cohorts, where a cohort is simply a group of samples we want to analyse together.

To help define cohorts and analyse these data, we have added some metadata for each sample about its time and place of collection and its species:

Spatially - For most analyses we use administrative divisions to group the samples into cohorts. These give two levels of spatial resolution, where admin level 1 divides each country into a few large regions, while admin level 2 provides finer scale divisions such as provinces.
Temporally - For each sample we provide the year, quarter and month of collection. Depending on your analysis, you can choose to group samples by year, by year and by year quarter, or by year and by month, although note that for some samples the collection month is missing.
Taxonomically - Ag3.0 contains samples from different species in the Anopheles gambiae complex. To help with grouping by taxon, we have included a “taxon” field in the sample metadata.

Using these three dimensions, we have pre-defined six cohort sets, each of which groups samples into cohorts at different levels of spatio-temporal resolution. Within all cohort sets, samples are further subdivided by taxon.

admin1_year - Cohorts obtained by grouping samples by admin level 1, collection year and taxon.
admin1_quarter - Cohorts obtained by grouping samples by admin level 1, collection year and quarter, and taxon.
admin1_month - Cohorts obtained by grouping samples by admin level 1, collection year and month, and taxon.
admin2_year - Cohorts obtained by grouping samples by admin level 2, collection year and taxon.
admin2_quarter - Cohorts obtained by grouping samples by admin level 2, collection year and quarter, and taxon.
admin2_month - Cohorts obtained by grouping samples by admin level 2, collection year and month, and taxon.

Remember above we chose to use the “admin1_year” cohorts for our Vgsc analysis.

cohorts

'admin1_year'

Let’s now use this to understand the frequency columns in the SNP allele frequencies DataFrame. Here are the frequency column names:

frequency_columns = [
    col for col in snp_allele_freqs_df.columns 
    if col.startswith("frq_")
]
frequency_columns

['frq_BF-07_gamb_2004',
 'frq_BF-09_colu_2012',
 'frq_BF-09_colu_2014',
 'frq_BF-09_gamb_2012',
 'frq_BF-09_gamb_2014']

Here “frq_” is used to mean that these columns contain frequencies.

The second part of the column name is either “BF-09” or “BF-07”. These are standard identifiers that refer to level 1 administrative divisions within countries, often called regions. “BF-07” is the Centre-Sud region within Burkina Faso, and “BF-09” is the Haut-Bassins region.

The third part refers to the species, and in this example is either “gamb” meaning Anopheles gambiae or “colu” meaning Anopheles coluzzii.

The final part is the year of collection, which in this example ie either 2004, 2012 or 2014.

Thus, e.g., “frq_BF-09_gamb_2012” means the frequency of the allele within the cohort of Anopheles gambiae mosquitoes from the BF-09 (Haut-Bassins) region collected in 2012.

Single nucleotide polymorphisms (SNPs)#

In the last module we learnt about reference genomes and gene annotations. Now one of the cool things we can do once we have an Anopheles gambiae reference genome, is sequence the genomes of wild mosquitoes and compare them to the reference genome. By aligning an individual’s sequencing reads to the reference, we can call its genotype.

The genotype is derived from how similar the individual is to the reference genome at each nucleotide position.

Where a nucleotide matches the reference genome, it will be called a reference allele, where “allele” is just a term for any kind of genetic variant. Where a nucleotide differs from the reference genome, it will be called an alternative allele.

Since, there are four possible nucleotides and one of these is the reference, there are always three possible alternative alleles for any given site. Another term for a genetic difference where two individuals differ at a single nucleotide position is single nucleotide polymorphism or SNP.

This SNP information is displayed in the first four columns of the snp_allele_frequencies() index in the output dataframe. We can see each alternative allele represented by it’s own row in our output. Let’s take a look at just the index.

snp_allele_freqs_df[[]]


contig	position	ref_allele	alt_allele	aa_change
2L	2358158	A	C	M1L
			T	M1L
			G	M1V
	2358159	T	A	M1K
	2358159	T	C	M1T
	...	...	...	...
	2431616	G	C	*2119S
	2431616	G	T	*2119L
	2431617	A	C	*2119C
			T	*2119C
			G	*2119W

209001 rows × 0 columns

SNP effects - some SNPs may be more interesting than others#

SNPs can have different effects depending on what the nucleotide change is and where in the genome it occurs.

In this analysis, we are interested in SNPs that affect protein structure, specifically those which will change the voltage-gated sodium-channel and could affect protein function and therefore insecticide resistance phenotype, so we need to look within the coding sequences (CDS) of the Vgsc gene.

However, not all SNPs which fall in CDSs cause protein changes, because the genetic code has some redundancy, meaning that different nucleotide sequences can encode the same amino acid. If a SNP in a CDS does change the amino acid, it is called a non-synonymous (NS) or missense SNP, and if the SNP does not change the amino acid, it is called a synonymous SNP.

Manually predicting SNP effects is quite involved, but snp_allele_frequencies() can predict them for us, let’s look at how this is represented in our output DataFrame.

snp_effects_df = snp_allele_freqs_df[["effect", "impact"]]
snp_effects_df

					effect	impact
contig	position	ref_allele	alt_allele	aa_change
2L	2358158	A	C	M1L	START_LOST	HIGH
			T	M1L	START_LOST	HIGH
			G	M1V	START_LOST	HIGH
	2358159	T	A	M1K	NON_SYNONYMOUS_CODING	MODERATE
	2358159	T	C	M1T	NON_SYNONYMOUS_CODING	MODERATE
	...	...	...	...	...	...
	2431616	G	C	*2119S	STOP_LOST	HIGH
	2431616	G	T	*2119L	STOP_LOST	HIGH
	2431617	A	C	*2119C	STOP_LOST	HIGH
			T	*2119C	STOP_LOST	HIGH
			G	*2119W	STOP_LOST	HIGH

209001 rows × 2 columns

Let’s look specifically at the genomic position where a SNP occurs which causes an insecticide-resistance mutation, also known as “kdr”.

snp_effects_df.loc[("2L", 2_422_652)]

			effect	impact
ref_allele	alt_allele	aa_change
A	C	L995F	NON_SYNONYMOUS_CODING	MODERATE
	T	L995F	NON_SYNONYMOUS_CODING	MODERATE
	G	L995L	SYNONYMOUS_CODING	LOW

For interest, let’s count the number of SNPs we have by their effect.

snp_allele_freqs_df.query("max_af > 0").groupby(["effect", "impact"]).size()

effect                 impact  
INTRONIC               MODIFIER    9467
NON_SYNONYMOUS_CODING  MODERATE     121
SPLICE_CORE            HIGH           8
SPLICE_REGION          MODERATE       9
STOP_GAINED            HIGH          14
SYNONYMOUS_CODING      LOW           56
dtype: int64

SNP allele frequencies#

Identifying the presence or absence of SNPs in wild caught mosquitoes is interesting, but the real value in generating SNP genotypes from large spatiotemporal collections of mosquitoes comes from the ability to see how groups of samples (cohorts) differ between geographical locations, species, and over time.

One way to compare SNP differences between cohorts is to calculate and compare SNP allele frequencies at each position in the genome by dividing the number of times each SNP allele is found in the cohort by the total number of individuals present in the cohort (multiplied by 2 because each individual mosquito is diploid and so carries two genome copies).

Let’s take another look at the allele frequencies we computed above, focusing just on the frequency columns.

snp_allele_freqs_df[frequency_columns + ['max_af']]

					frq_BF-07_gamb_2004	frq_BF-09_colu_2012	frq_BF-09_colu_2014	frq_BF-09_gamb_2012	frq_BF-09_gamb_2014	max_af
contig	position	ref_allele	alt_allele	aa_change
2L	2358158	A	C	M1L	0.0	0.0	0.0	0.0	0.0	0.0
			T	M1L	0.0	0.0	0.0	0.0	0.0	0.0
			G	M1V	0.0	0.0	0.0	0.0	0.0	0.0
	2358159	T	A	M1K	0.0	0.0	0.0	0.0	0.0	0.0
	2358159	T	C	M1T	0.0	0.0	0.0	0.0	0.0	0.0
	...	...	...	...	...	...	...	...	...	...
	2431616	G	C	*2119S	0.0	0.0	0.0	0.0	0.0	0.0
	2431616	G	T	*2119L	0.0	0.0	0.0	0.0	0.0	0.0
	2431617	A	C	*2119C	0.0	0.0	0.0	0.0	0.0	0.0
			T	*2119C	0.0	0.0	0.0	0.0	0.0	0.0
			G	*2119W	0.0	0.0	0.0	0.0	0.0	0.0

209001 rows × 6 columns

And let’s inspect the frequencies for SNPs at a specific genomic position of interest.

snp_allele_freqs_df.loc[("2L", 2_422_652), frequency_columns]

			frq_BF-07_gamb_2004	frq_BF-09_colu_2012	frq_BF-09_colu_2014	frq_BF-09_gamb_2012	frq_BF-09_gamb_2014
ref_allele	alt_allele	aa_change
A	C	L995F	0.000000	0.000000	0.000000	0.0	0.0
	T	L995F	0.076923	0.865854	0.886792	1.0	1.0
	G	L995L	0.000000	0.000000	0.000000	0.0	0.0

Visualising SNP allele frequencies via heatmaps#

To make our SNP allele frequencies DataFrame easier to interpret, we can filter it down to just non-synonymous SNPs that are at frequency above 5% in at least one of our cohorts.

ns_snps_df = snp_allele_freqs_df.query("effect == 'NON_SYNONYMOUS_CODING' and max_af >= 0.05")
ns_snps_df

					pass_gamb_colu_arab	pass_gamb_colu	pass_arab	frq_BF-07_gamb_2004	frq_BF-09_colu_2012	frq_BF-09_colu_2014	frq_BF-09_gamb_2012	frq_BF-09_gamb_2014	max_af	transcript	effect	impact	ref_codon	alt_codon	aa_pos	ref_aa	alt_aa	label
contig	position	ref_allele	alt_allele	aa_change
2L	2391228	G	C	V402L	True	True	True	0.000000	0.067073	0.028302	0.000000	0.000000	0.067073	AGAP004707-RD	NON_SYNONYMOUS_CODING	MODERATE	Gta	Cta	402.0	V	L	2L:2,391,228 G>C (V402L)
	2391228	G	T	V402L	True	True	True	0.000000	0.054878	0.084906	0.000000	0.000000	0.084906	AGAP004707-RD	NON_SYNONYMOUS_CODING	MODERATE	Gta	Tta	402.0	V	L	2L:2,391,228 G>T (V402L)
	2416980	C	T	T791M	True	True	True	0.000000	0.018293	0.000000	0.161616	0.239130	0.239130	AGAP004707-RD	NON_SYNONYMOUS_CODING	MODERATE	aCg	aTg	791.0	T	M	2L:2,416,980 C>T (T791M)
	2422652	A	T	L995F	True	True	True	0.076923	0.865854	0.886792	1.000000	1.000000	1.000000	AGAP004707-RD	NON_SYNONYMOUS_CODING	MODERATE	ttA	ttT	995.0	L	F	2L:2,422,652 A>T (L995F)
	2429617	T	C	I1527T	True	True	True	0.000000	0.121951	0.113208	0.000000	0.000000	0.121951	AGAP004707-RD	NON_SYNONYMOUS_CODING	MODERATE	aTt	aCt	1527.0	I	T	2L:2,429,617 T>C (I1527T)
	2429745	A	T	N1570Y	True	True	True	0.038462	0.250000	0.320755	0.212121	0.141304	0.320755	AGAP004707-RD	NON_SYNONYMOUS_CODING	MODERATE	Aat	Tat	1570.0	N	Y	2L:2,429,745 A>T (N1570Y)
	2429897	A	G	E1597G	True	True	True	0.000000	0.000000	0.000000	0.065657	0.032609	0.065657	AGAP004707-RD	NON_SYNONYMOUS_CODING	MODERATE	gAa	gGa	1597.0	E	G	2L:2,429,897 A>G (E1597G)
	2429915	A	C	K1603T	True	True	True	0.000000	0.054878	0.056604	0.000000	0.000000	0.056604	AGAP004707-RD	NON_SYNONYMOUS_CODING	MODERATE	aAg	aCg	1603.0	K	T	2L:2,429,915 A>C (K1603T)
	2430424	G	T	A1746S	False	True	False	0.000000	0.000000	0.000000	0.151515	0.239130	0.239130	AGAP004707-RD	NON_SYNONYMOUS_CODING	MODERATE	Gcc	Tcc	1746.0	A	S	2L:2,430,424 G>T (A1746S)
	2430863	T	C	I1868T	True	True	True	0.000000	0.000000	0.000000	0.247475	0.206522	0.247475	AGAP004707-RD	NON_SYNONYMOUS_CODING	MODERATE	aTa	aCa	1868.0	I	T	2L:2,430,863 T>C (I1868T)
	2430880	C	T	P1874S	True	True	True	0.000000	0.213415	0.169811	0.000000	0.000000	0.213415	AGAP004707-RD	NON_SYNONYMOUS_CODING	MODERATE	Cct	Tct	1874.0	P	S	2L:2,430,880 C>T (P1874S)
	2430881	C	T	P1874L	True	True	True	0.000000	0.073171	0.056604	0.222222	0.260870	0.260870	AGAP004707-RD	NON_SYNONYMOUS_CODING	MODERATE	cCt	cTt	1874.0	P	L	2L:2,430,881 C>T (P1874L)
	2431061	C	T	A1934V	True	True	True	0.000000	0.121951	0.132075	0.000000	0.000000	0.132075	AGAP004707-RD	NON_SYNONYMOUS_CODING	MODERATE	gCt	gTt	1934.0	A	V	2L:2,431,061 C>T (A1934V)

To make things even clearer, we have included a heatmap plotting function to style our filtered DataFrame, called plot_frequencies_heatmap().

ag3.plot_frequencies_heatmap(ns_snps_df)

Amino acid substitution frequencies#

You might have noticed that there are two rows with V402L in our previous heatmap plot. This is because in Burkina Faso, we find two different alternative alleles at the same genomic position, both causing the same amino acid substitution (valine to leucine).

If we are just interested in amino acid change frequencies, for example, when looking at potential insecticide resistance conferring mutations, we might want to combine the frequencies of the two alleles which cause V402L. In this case, we can use the aa_allele_frequencies() function in exactly the same way as we used snp_allele_frequencies().

aa_allele_freqs_df = ag3.aa_allele_frequencies(
    transcript=transcript, 
    cohorts=cohorts, 
    sample_sets=sample_sets
)
aa_allele_freqs_df

			frq_BF-07_gamb_2004	frq_BF-09_colu_2012	frq_BF-09_colu_2014	frq_BF-09_gamb_2012	frq_BF-09_gamb_2014	transcript	aa_pos	ref_allele	ref_aa	alt_aa	effect	impact	alt_allele	max_af	label
aa_change	contig	position
A32V	2L	2358252	0.000000	0.006098	0.0	0.000000	0.000000	AGAP004707-RD	32.0	C	A	V	NON_SYNONYMOUS_CODING	MODERATE	T	0.006098	A32V (2L:2,358,252 C>T)
G54C	2L	2362019	0.000000	0.018293	0.0	0.000000	0.010870	AGAP004707-RD	54.0	G	G	C	NON_SYNONYMOUS_CODING	MODERATE	T	0.018293	G54C (2L:2,362,019 G>T)
P55L	2L	2362023	0.000000	0.000000	0.0	0.005051	0.010870	AGAP004707-RD	55.0	C	P	L	NON_SYNONYMOUS_CODING	MODERATE	T	0.010870	P55L (2L:2,362,023 C>T)
P59T	2L	2362034	0.000000	0.000000	0.0	0.000000	0.021739	AGAP004707-RD	59.0	C	P	T	NON_SYNONYMOUS_CODING	MODERATE	A	0.021739	P59T (2L:2,362,034 C>A)
G73D	2L	2362077	0.000000	0.006098	0.0	0.000000	0.000000	AGAP004707-RD	73.0	G	G	D	NON_SYNONYMOUS_CODING	MODERATE	A	0.006098	G73D (2L:2,362,077 G>A)
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
A2023G	2L	2431328	0.038462	0.000000	0.0	0.000000	0.000000	AGAP004707-RD	2023.0	C	A	G	NON_SYNONYMOUS_CODING	MODERATE	G	0.038462	A2023G (2L:2,431,328 C>G)
S2037R	2L	2431371	0.000000	0.000000	0.0	0.005051	0.000000	AGAP004707-RD	2037.0	C	S	R	NON_SYNONYMOUS_CODING	MODERATE	G	0.005051	S2037R (2L:2,431,371 C>G)
I2053V	2L	2431417	0.038462	0.000000	0.0	0.000000	0.000000	AGAP004707-RD	2053.0	A	I	V	NON_SYNONYMOUS_CODING	MODERATE	G	0.038462	I2053V (2L:2,431,417 A>G)
G2055V	2L	2431424	0.000000	0.000000	0.0	0.005051	0.000000	AGAP004707-RD	2055.0	G	G	V	NON_SYNONYMOUS_CODING	MODERATE	T	0.005051	G2055V (2L:2,431,424 G>T)
V2118A	2L	2431613	0.000000	0.000000	0.0	0.005051	0.000000	AGAP004707-RD	2118.0	T	V	A	NON_SYNONYMOUS_CODING	MODERATE	C	0.005051	V2118A (2L:2,431,613 T>C)

134 rows × 15 columns

Let’s filter it again to just amino acid changes greater than 5% in at least one cohort. We don’t need to filter for non-synonymous mutations this time as this function has already done that for us.

aa_filt_df = aa_allele_freqs_df.query("max_af >= 0.05")
aa_filt_df

			frq_BF-07_gamb_2004	frq_BF-09_colu_2012	frq_BF-09_colu_2014	frq_BF-09_gamb_2012	frq_BF-09_gamb_2014	transcript	aa_pos	ref_allele	ref_aa	alt_aa	effect	impact	alt_allele	max_af	label
aa_change	contig	position
V402L	2L	2391228	0.000000	0.121951	0.113208	0.000000	0.000000	AGAP004707-RD	402.0	G	V	L	NON_SYNONYMOUS_CODING	MODERATE	{C,T}	0.121951	V402L (2L:2,391,228 G>{C,T})
T791M	2L	2416980	0.000000	0.018293	0.000000	0.161616	0.239130	AGAP004707-RD	791.0	C	T	M	NON_SYNONYMOUS_CODING	MODERATE	T	0.239130	T791M (2L:2,416,980 C>T)
L995F	2L	2422652	0.076923	0.865854	0.886792	1.000000	1.000000	AGAP004707-RD	995.0	A	L	F	NON_SYNONYMOUS_CODING	MODERATE	T	1.000000	L995F (2L:2,422,652 A>T)
I1527T	2L	2429617	0.000000	0.121951	0.113208	0.000000	0.000000	AGAP004707-RD	1527.0	T	I	T	NON_SYNONYMOUS_CODING	MODERATE	C	0.121951	I1527T (2L:2,429,617 T>C)
N1570Y	2L	2429745	0.038462	0.250000	0.320755	0.212121	0.141304	AGAP004707-RD	1570.0	A	N	Y	NON_SYNONYMOUS_CODING	MODERATE	T	0.320755	N1570Y (2L:2,429,745 A>T)
E1597G	2L	2429897	0.000000	0.000000	0.000000	0.065657	0.032609	AGAP004707-RD	1597.0	A	E	G	NON_SYNONYMOUS_CODING	MODERATE	G	0.065657	E1597G (2L:2,429,897 A>G)
K1603T	2L	2429915	0.000000	0.054878	0.056604	0.000000	0.000000	AGAP004707-RD	1603.0	A	K	T	NON_SYNONYMOUS_CODING	MODERATE	C	0.056604	K1603T (2L:2,429,915 A>C)
A1746S	2L	2430424	0.000000	0.000000	0.000000	0.151515	0.239130	AGAP004707-RD	1746.0	G	A	S	NON_SYNONYMOUS_CODING	MODERATE	T	0.239130	A1746S (2L:2,430,424 G>T)
I1868T	2L	2430863	0.000000	0.000000	0.000000	0.247475	0.206522	AGAP004707-RD	1868.0	T	I	T	NON_SYNONYMOUS_CODING	MODERATE	C	0.247475	I1868T (2L:2,430,863 T>C)
P1874S	2L	2430880	0.000000	0.213415	0.169811	0.000000	0.000000	AGAP004707-RD	1874.0	C	P	S	NON_SYNONYMOUS_CODING	MODERATE	T	0.213415	P1874S (2L:2,430,880 C>T)
P1874L	2L	2430881	0.000000	0.073171	0.056604	0.222222	0.260870	AGAP004707-RD	1874.0	C	P	L	NON_SYNONYMOUS_CODING	MODERATE	T	0.260870	P1874L (2L:2,430,881 C>T)
A1934V	2L	2431061	0.000000	0.121951	0.132075	0.000000	0.000000	AGAP004707-RD	1934.0	C	A	V	NON_SYNONYMOUS_CODING	MODERATE	T	0.132075	A1934V (2L:2,431,061 C>T)

Now we can visualise these frequencies the same way we did before.

ag3.plot_frequencies_heatmap(aa_filt_df)

From this new heatmap it’s clear that there are two SNPs causing a V402L substitution, and the combined frequency of these is identical to that of the I1527T substitution in the An. coluzzii cohorts. This is a strong indication that these two mutations are strongly associated and may be working together to confer a selective advantage via insecticide resistance.

Using sample queries#

In some cases you may want to analyse data across a number of different sample sets, but select only samples from a single country or species. In this case there is a useful sample_query parameter available with the functions snp_allele_frequencies() and aa_allele_frequencies(). The sample_query parameter accepts a pandas query string, which will be applied to the sample metadata to select samples to use in the analysis.

Let’s see it in action by computing amino acid allele frequencies for all An. arabiensis samples in the Ag3.0 data resource.

aa_arab_freqs_df = ag3.aa_allele_frequencies(
    transcript=transcript, 
    cohorts=cohorts, 
    sample_sets="3.0",
    sample_query="taxon == 'arabiensis'",
)

aa_arab_freqs_df = aa_arab_freqs_df.query("max_af > 0.05")
ag3.plot_frequencies_heatmap(aa_arab_freqs_df)

Well done!#

In this module we have learnt how to analyse SNP mutations in the target of pyrethroid insecticides, the voltage-gated sodium-channel. We have calculated the allele frequencies of the SNPs in cohorts of mosquitoes and learnt how to filter and plot them for ease of interpretation.

Practical exercises#

English#

Open this notebook in Google Colab and run it for yourself from top to bottom. Hint: click the rocket icon at the top of the page, then select “Colab” from the drop-down menu.
Looking at the heatmap output (either amino acid or SNP), can you spot a relationship between the V402L and L995F frequencies? If so, what is it?
Re-run the whole analysis but using the Ghanaian sample set. Hint: Try sample_sets = "AG1000G-GH". Or any other samples of interest.
What are the cohorts for this new sample set? Hint: see frequency_columns.
Above, we looked at the kdr “West” SNP position, compare and contrast this with the kdr “East” SNP position. What is the amino acid change for kdr “East”? Hint: The position is 2422651.
For the Ghanaian sample set, add the a y_label which says “aa change” to the amino acid frequency heatmap. Hint: you can view all the function’s parameters with ag3.plot_frequencies_heatmap?.
Remove the colorbar from the same heatmap. Hint: False.
Is the same relationship between V402L and L995F frequencies present in Ghana? What might be an evolutionary interpretation of this relationship?

Français#

Ouvrir ce notebook dans Google Colab et l’exécuter vous-même du début à la fin. Indice: cliquer sur l’icône fusée () et sélectionner “Colab” dans le menu déroulant.
En regardant la heatmap (soit pour les SNPs soit pour les amino acides), pouvez vous détecter la relation entre les fréquences de V402L et L995F? Si oui, quelle est elle?
Re-exécuter toute l’analyse mais en utilisant les données ghanéennes. Indice: utiliser sample_sets = "AG1000G-GH".
Quelles sont les cohortes de ce nouvel ensemble de données? Indice: regarder frequency_columns.
Plus haut, nous avons regardé la position SNP de kdr “West”, comparer et contraster la avec la position kdr “East”. Quelle est la différence en terme d’amino acides pour kdr “East”? Indice: position = 2422651.
Pour l’ensemble de données ghanéennes, ajouter le y_label nommé “aa change” à la heatmap des fréquences des amino acides. Indice: tous les paramètres de la fonction peuvent être vus avec ag3.plot_frequencies_heatmap?.
Supprimer la barre de couleur de cette même heatmap. Indice: quelque chose doit être défini commeFalse.
Est-ce que la même relation entre les fréquences de V402L et L995F présente au Ghana? Quelle pourrait être l’interprétation en terme d’évolution de cette relation?

Module 4 - Analysing SNPs in the Vgsc gene

Contents

Module 4 - Analysing SNPs in the Vgsc gene#

Learning objectives#

Lecture#

English#

Français#

Discovering SNPs in the Vgsc gene and computing SNP allele frequencies#

Grouping samples into cohorts#

Single nucleotide polymorphisms (SNPs)#

SNP effects - some SNPs may be more interesting than others#

SNP allele frequencies#

Visualising SNP allele frequencies via heatmaps#

Amino acid substitution frequencies#

Using sample queries#

Plotting SNP frequency time series#

Plotting SNP frequency maps#

Well done!#

Practical exercises#

English#

Français#