Workshop 6 - Training course in data analysis for genomic surveillance of African malaria vectors
Module 4 - AnoPrimer - Primer Design in Anopheles gambiae#
Theme: Tools & technology
In the previous modules of workshop 6, we learnt how to discover candidate loci for insecticide resistance, using genome-wide selection scans (GWSS). In this module, we introduce a new python package AnoPrimer, which uses the malariagen_data API and primer3-py, to allow us to design primers and probes in Anopheles gambiae s.l. We can use them to design these primers and probes to help us validate putative resistance loci. By integrating with data from MalariaGEN, AnoPrimer allows us to consider and avoid genetic variation in the Ag1000G resource during the design of our primers.
Learning objectives#
At the end of this module you will be able to:
Describe the principles of PCR, qPCR, and the role of primers.
Describe some of the applications of PCR in vector control.
Explain why it is important to consider SNPs when designing primers.
Design genomic DNA primers to target the ace1-G280S mutation.
Design RT-qPCR primers to measure gene expression of the carboxylesterase, COEAE2F.
Lecture#
English#
Français#
Please note that the code in the cells below might differ from that shown in the video. This can happen because Python packages and their dependencies change due to updates, necessitating tweaks to the code.
Introduction#
What is the polymerase chain reaction (PCR)?#
PCR is a technique used to “amplify” small segments of DNA. Using PCR it is possible to generate thousands to millions of copies of a particular section of DNA from a very small input amount. In the context of vector control, we can use PCR and qPCR to genotype vectors for specific alleles, determine species, identify bloodmeals, measure gene expression, and much more.
Ingredients
The DNA template to be copied
DNA nucleotide bases (also known as dNTPs). DNA bases (A, C, G and T) are the building blocks of DNA and are needed to construct the new strand of DNA
Taq polymerase enzyme to add in the new DNA bases
Buffer to ensure the right conditions for the reaction
Primers, short stretches of DNA that initiate the PCR reaction, designed to bind to either side of the section of DNA you want to copy. DNA polymerase enzymes which perform DNA replication are only capable of adding nucleotides to the 3’-end of an existing nucleic acid, requiring a primer be bound to the template before DNA polymerase can begin a complementary strand.
The stages of PCR:
Denaturation – Heat the reaction to ~95°C to break apart the double stranded DNA template into single strands.
Annealing – Lower the temperature to 50-56°C enable the DNA primers to attach to the single-stranded template DNA.
Extension – Raise the temperature to 72°C and the new strand of DNA is synthesised by the Taq polymerase enzyme.
Repeat entire process for 25-40 cycles. After each cycle, the number of DNA molecules will approximately double.
Standard PCR (Detection / Genotyping / Sequencing)#
In standard PCR, we amplify specific regions of the genome and utilise the PCR product at the end of the reaction. This could be an endpoint PCR, where we run the PCR product on an agarose gel to determine the size of the amplicon. Alternatively, standard PCR can be used to amplify specific regions of the genome prior to next-generation sequencing (amplicon sequencing).
Example applications in An. gambiae#
The SINE PCR species ID assay for An. gambiae s.l, which can differentiate between the members of the gambiae complex, based on a marker on the X chromosome [1]
The 2La PCR assay, which determines the karyotype of the 2la inversion. [2]
ANOSPP, an amplicon sequencing panel which determines species within the entire Anopheles genus [3]
Quantitative PCR (Detection / Genotyping / Genotyping)#
In quantitative PCR, the concept is the same as standard PCR, but we measure the amount of DNA in the reaction at each cycle. To do this, we use either a fluorescent dye or fluorescent hybridisation probes, which emit light as the DNA concentration increases. The number of cycles at which each sample passes a given threshold, is called the Cq or Ct value. By measuring fluorescense and determining Cq values, we can determine the amount of DNA template that was in the original sample.
SYBR green#
SYBR green is a fluorescent dye, which emits a fluorescent signal when bound to double stranded DNA.
Example applications in An. gambiae#
Standard RT-qPCR assays for measuring gene expression
The SINE melt curve assay species ID for An. gambiae s.l [4], a high throughput version of the SINE PCR assay.
TaqMan / LNA Probe#
Hybridisation probes are short sequences, like primers, which bind to the DNA template. These can have flourophores attached, which emit fluorescence when the probe is displaced from the DNA template. By designing multiple probes (with different fluorophores) which are specific to either the wild-type or mutant allele, we can genotype SNPs. TaqMan and Locked nucleic acid probes have modifications which increase the stability of the probe-template duplex and help probes to discriminate between SNPs.
Example applications in An. gambiae#
TaqMan genotyping assays for Ace1, KDR.
Kdr LNA assay [5], which simultaneously genotypes the vgsc-995F and vgsc-995S mutations.
What is important to consider when designing primers?#
Suitable primers are crucial to effective PCR reactions and must be designed to be robust, reliable and consistent across experimental conditions. Primers are typically designed with the following characteristics:
Size: Size of the primer.
Between 17-24 bases long
This can vary depending on the application. Often, TaqMan and LNA probes are shorter, as it helps to discriminate between SNPs.
Tm: the temperature at which the primer duplex dissociates into single strands.
A Melting temperature of 59–64°C, with an ideal temperature of 62°C, which is based on typical PCR conditions and the optimum temperature for PCR enzymes.
GC content
GC content between 35–70%, with an ideal content of ~50%.
Should not contain regions of 4 or more consecutive G residues.
Primers should also be free of strong secondary structures and self-complementarity. Primer design algorithms, such as primer3, will take these considerations into account.
What happens if there are SNPs in primer binding sites?#
Single nucleotide polymorphisms (SNPs) in primer binding sites can affect both the stability and Tm of the primer-template duplex, as well as the efficiency with which DNA polymerases can extend the primer (Figure 2). In some cases, this can completely prevent primer binding and amplification of the template DNA, often referred to as null alleles. Null alleles can become particular troublesome when performing PCR on pooled samples, where we may not observe whether all samples amplified successfully, and so we may not be sampling and observing the full range of alleles.
An equally problematic scenario may occur if primers bind but with unequal efficiency against different genetic variants. In this case, in any quantitative molecular assay such as qPCR for gene expression, SNPs could lead to biases in the estimation of gene expression between genetic variants or strains. Even single SNPs can introduce a variety of effects, ranging from minor to major impacts on Cq values [6]. The effect will depend on the type of SNP (which nucleotides are involved), and on the position of the SNP (3’ or 5’ end), as SNPs within the last 5 nucleotides at the 3’ end can disrupt the nearby polymerase active site, and so these tend to have a much greater impact [7]. SNPs at the terminal 3’ base had the strongest shift of Cq, altering Cq by as much as 5–7 cycles (Figure 2).
To maximise the accuracy of our data we should therefore aim to either design primers that avoid SNPs completely or that contain a mix of bases (degenerate) at the sites of SNPs. There is a useful article on this topic on the IDT website - Consider SNPs when designing PCR and qPCR assays.
AnoPrimer#
As we have seen in earlier workshops, the Ag1000G resource [8] has revealed extreme amounts of genetic variation in Anopheles gambiae s.l. You can find a SNP in less than every 2 bases of the accessible genome - which makes considering SNPs even more important when designing molecular assays. However, it was not previously straightforward to consider genetic variation when designing primers, and so the vast majority of primers currently in use did not consider SNP variation during their design.
Primer3 is the most widely cited program for primer design, and has been used extensively over the past two decades [9]. It is also the primer design engine behind Primer-BLAST [10], a web-server which designs primers and then blasts the primers to check for specificity. Thanks to the primer3-py python package and the malariagen_data API, it is now possible to design primers in the cloud with google colab, considering SNP variation.
Setup#
Install and import the packages we’ll need.
%pip install -q --no-warn-conflicts malariagen_data AnoPrimer primer3-py kaleido gget seaborn
Note: you may need to restart the kernel to use updated packages.
You might see a warning message looking something like:
Tue May 14 04:41:51 2024 WARNING No project ID could be determined. Consider running 'gcloud config set project' or setting the GOOGLE_CLOUD_PROJECT environment variable
This can safely be ignored.
# Import libraries
import malariagen_data
import pandas as pd
import primer3
import AnoPrimer
import gget
#configure plotting with matplotlib
%matplotlib inline
%config InlineBackend.figure_format = "retina"
Wed May 22 06:52:06 2024 WARNING No project ID could be determined. Consider running `gcloud config set project` or setting the GOOGLE_CLOUD_PROJECT environment variable
Wed May 22 06:52:07 2024 WARNING No project ID could be determined. Consider running `gcloud config set project` or setting the GOOGLE_CLOUD_PROJECT environment variable
Worked example: Designing primers for the ace1-280S mutation#
In the rest of this module, we will design primers and probes for the ace1-280S mutation (previously ace1-119S). This mutation is known to be important in resistance to organophosphates and carbamates, insecticides which are widely used in indoor residual spraying (IRS). The resistance mutation has spread throughout much of west and central Africa, including introgressing from An. gambiae to An. coluzzii, and is often found on the background of large duplications which often pair wild-type and mutant alleles [11]. We see very large signals of selection at this locus in many contemporary populations of An. gambiae.
Selecting primer parameters#
In the below cells, replace the values with those desired for your primers.
#@title **Primer parameters** { run: "auto" }
# N.B., this cell will be rendered as a form when running on colab
assay_type = "gDNA primers + probe" #@param ["gDNA primers", "gDNA primers + probe", "probe", "cDNA primers"]
assay_name = 'ace1-280s' #@param {type:"string"}
min_amplicon_size = 60 #@param {type:"integer"}
max_amplicon_size = 120 #@param {type:"integer"}
amplicon_size_range = [[min_amplicon_size, max_amplicon_size]]
n_primer_pairs = 6 #@param {type:"slider", min:1, max:20, step:1}
#@markdown
#@markdown target_loc is required for gDNA primers and probes, and transcript required for qPCR primers.
contig = "2R" #@param ['2L', '2R', '3L', '3R', 'X']
target_loc = '3492074' #@param {type:"string"}
target_loc = int(target_loc)
transcript = '' #@param {type:"string"}
if any(item in assay_type for item in ['gDNA', 'probe']):
assert target_loc > 0, "Target location must be above 0 and less than the contig length"
elif assay_type == 'cDNA primers':
assert len(transcript) > 2, "Transcript ID is not valid, should be vectorbase ID such as 'AGAP004707-RD'"
Configure access to the MalariaGEN Ag3 data resource. Note that authentication is required to access data through the package, please follow the instructions here.
ag3 = malariagen_data.Ag3()
ag3
Wed May 22 06:52:41 2024 WARNING No project ID could be determined. Consider running `gcloud config set project` or setting the GOOGLE_CLOUD_PROJECT environment variable
MalariaGEN Ag3 API client | |
---|---|
Please note that data are subject to terms of use, for more information see the MalariaGEN website or contact support@malariagen.net. See also the Ag3 API docs. | |
Storage URL | gs://vo_agam_release/ |
Data releases available | 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9 |
Results cache | None |
Cohorts analysis | 20240418 |
AIM analysis | 20220528 |
Site filters analysis | dt_20200416 |
Software version | malariagen_data 9.0.0 |
Client location | unknown |
Load sequence data for the chromosomal arm of choice, using the malariagen_data API:
genome_seq = ag3.genome_sequence(region=contig)
print(f"Our genome sequence for {contig} is {genome_seq.shape[0]} bp long")
Our genome sequence for 2R is 61545105 bp long
Now we need to extract the bit of genome sequence we need. We will use functions in the AnoPrimer package. For genomic DNA primers, this is quite simple and we can make direct use of the ag3.genome_sequence()
function, but for cDNA qPCR primers, we must only include coding sequence, and so we must concatenate exons together.
With qPCR primers for cDNA, we also must ensure that one primer spans an exon-exon junction, to ensure that any residual genomic DNA in our samples does not get amplified. We must therefore make a note of where the exon junctions are, and we store that as a list in the exon_junctions
variable.
if any(item in assay_type for item in ['gDNA', 'probe']):
# genomic DNA
target_sequence, gdna_pos, seq_parameters = AnoPrimer.prepare_gDNA_sequence(
target_loc=target_loc,
amplicon_size_range=amplicon_size_range,
genome_seq=genome_seq,
assay_name=assay_name,
assay_type=assay_type
)
elif assay_type == 'qPCR primers':
# RT-quantitative PCR, cDNA
target_sequence, exon_junctions, gdna_pos, seq_parameters = AnoPrimer.prepare_cDNA_sequence(
transcript=transcript,
gff=ag3.geneset(),
genome_seq=genome_seq,
assay_name=assay_name
)
The target sequence is 239 bases long
the target snp is 119 bp into our target sequence
Now we have our target sequence. Lets take a look…
seq_parameters['SEQUENCE_TEMPLATE']
'CGGGCGCGACCATGTGGAACCCGAACACGCCCCTGTCCGAGGACTGTCTGTACATTAACGTGGTGGCACCGCGACCCCGGCCCAAGAATGCGGCCGTCATGCTGTGGATCTTCGGCGGCGGCTTCTACTCCGGCACCGCCACCCTGGACGTGTACGACCACCGGGCGCTTGCGTCGGAGGAGAACGTGATCGTGGTGTCGCTGCAGTACCGCGTGGCCAGTCTGGGCTTCCTGTTTCTC'
seq_parameters
{'SEQUENCE_ID': 'ace1-280s',
'SEQUENCE_TEMPLATE': 'CGGGCGCGACCATGTGGAACCCGAACACGCCCCTGTCCGAGGACTGTCTGTACATTAACGTGGTGGCACCGCGACCCCGGCCCAAGAATGCGGCCGTCATGCTGTGGATCTTCGGCGGCGGCTTCTACTCCGGCACCGCCACCCTGGACGTGTACGACCACCGGGCGCTTGCGTCGGAGGAGAACGTGATCGTGGTGTCGCTGCAGTACCGCGTGGCCAGTCTGGGCTTCCTGTTTCTC',
'SEQUENCE_TARGET': [119, 10],
'GENOMIC_TARGET': 3492074,
'SEQUENCE_INTERNAL_EXCLUDED_REGION': [[1, 99], [139, 100]]}
We need to set up a second python dictionary, which will be our input to primer3. This contains our preferred primer parameters. In the below cell, you can modify or add primer3 parameters, such as optimal primer size, TM, GC content etc etc. A full list of possible parameters and their functions can be found in the primer3 2.6.1 manual.
primer_parameters = {
'PRIMER_OPT_SIZE': 20,
'PRIMER_TASK':'generic',
'PRIMER_MIN_SIZE': 17,
'PRIMER_MAX_SIZE': 24,
'PRIMER_OPT_TM': 60.0,
'PRIMER_MIN_TM': 55.0,
'PRIMER_MAX_TM': 64.0,
'PRIMER_MIN_GC': 30.0,
'PRIMER_MAX_GC': 75.0,
'PRIMER_MIN_THREE_PRIME_DISTANCE': 3, # this parameter is the minimum distance between successive pairs. If 1, it means successive primer pairs could be identical bar one base shift
'PRIMER_INTERNAL_OPT_SIZE': 16, # Probe size preferences if selected, otherwise ignored
'PRIMER_INTERNAL_MIN_SIZE': 10,
'PRIMER_INTERNAL_MAX_SIZE': 22,
'PRIMER_INTERNAL_MIN_TM': 45,
'PRIMER_INTERNAL_MAX_TM': 65, # Probe considerations are quite relaxed, assumed that LNAs / Taqman will be used later to affect TM
# Extra primer3 parameters can go here
# In the same format as above
}
# adds some necessary parameters depending on assay type
primer_parameters = AnoPrimer.primer_params(
assay_type=assay_type,
primer_parameters=primer_parameters,
n_primer_pairs=n_primer_pairs,
amplicon_size_range=amplicon_size_range,
)
Run the primer3 algorithm!#
primer_dict = primer3.designPrimers(
seq_args=seq_parameters,
global_args=primer_parameters
)
It should be fast. The output, which we store as ‘primer_dict’, is a python dictionary containing the full results from primer3. We will turn this into a pandas dataframe containing just the necessary bits of information. First, we’ll print some information from the primer3 run.
AnoPrimer.primer3_run_statistics(primer_dict, assay_type)
primer_forward_explain : considered 796, GC content failed 84, low tm 72, high tm 347, ok 293
primer_reverse_explain : considered 724, GC content failed 68, low tm 17, high tm 395, high hairpin stability 17, ok 227
primer_probe_explain : considered 524, overlap excluded region 212, GC content failed 20, low tm 100, high tm 2, high hairpin stability 1, ok 189
primer_pair_explain : considered 3107, unacceptable product size 3099, primer in pair overlaps a primer in a better pair 1279, ok 8
primer_forward_num_returned : 6
primer_reverse_num_returned : 6
primer_probe_num_returned : 6
primer_pair_num_returned : 6
primer_pair : [{'PENALTY': 0.7107151008710275, 'COMPL_ANY_TH': 0.0, 'COMPL_END_TH': 0.0, 'PRODUCT_SIZE': 95, 'PRODUCT_TM': 88.90156407263927}, {'PENALTY': 1.8268188261876048, 'COMPL_ANY_TH': 0.0, 'COMPL_END_TH': 0.0, 'PRODUCT_SIZE': 66, 'PRODUCT_TM': 86.70188305190561}, {'PENALTY': 3.1582579651147853, 'COMPL_ANY_TH': 0.0, 'COMPL_END_TH': 1.562469314775342, 'PRODUCT_SIZE': 103, 'PRODUCT_TM': 89.70217725608332}, {'PENALTY': 3.3586929277500417, 'COMPL_ANY_TH': 12.973825966488278, 'COMPL_END_TH': 17.231667178159967, 'PRODUCT_SIZE': 112, 'PRODUCT_TM': 90.19160166662422}, {'PENALTY': 3.997840573385986, 'COMPL_ANY_TH': 12.56535604041153, 'COMPL_END_TH': 5.976322718447193, 'PRODUCT_SIZE': 117, 'PRODUCT_TM': 90.31415966418224}, {'PENALTY': 4.920050397896603, 'COMPL_ANY_TH': 0.0, 'COMPL_END_TH': 6.284961991996681, 'PRODUCT_SIZE': 109, 'PRODUCT_TM': 90.03744046085752}]
primer_forward : [{'PENALTY': 0.46258069149581615, 'SEQUENCE': 'TCATGCTGTGGATCTTCGGC', 'COORDS': [96, 20], 'TM': 60.462580691495816, 'GC_PERCENT': 55.0, 'SELF_ANY_TH': 12.259940761643293, 'SELF_END_TH': 12.259940761643293, 'HAIRPIN_TH': 36.59415215911514, 'END_STABILITY': 5.54}, {'PENALTY': 0.7458448421797357, 'SEQUENCE': 'GCCGTCATGCTGTGGATCTT', 'COORDS': [92, 20], 'TM': 60.745844842179736, 'GC_PERCENT': 55.0, 'SELF_ANY_TH': 0.0, 'SELF_END_TH': 0.0, 'HAIRPIN_TH': 0.0, 'END_STABILITY': 2.4}, {'PENALTY': 1.0630056029440311, 'SEQUENCE': 'CAAGAATGCGGCCGTCATG', 'COORDS': [82, 19], 'TM': 59.93699439705597, 'GC_PERCENT': 57.89473684210526, 'SELF_ANY_TH': 27.740677587638743, 'SELF_END_TH': 17.565181116897747, 'HAIRPIN_TH': 0.0, 'END_STABILITY': 3.07}, {'PENALTY': 1.2319584737745117, 'SEQUENCE': 'ACATTAACGTGGTGGCACCG', 'COORDS': [51, 20], 'TM': 61.23195847377451, 'GC_PERCENT': 55.0, 'SELF_ANY_TH': 30.138122022371874, 'SELF_END_TH': 21.521828989913672, 'HAIRPIN_TH': 42.57679566489054, 'END_STABILITY': 4.94}, {'PENALTY': 2.1245702891837936, 'SEQUENCE': 'GGCCGTCATGCTGTGGAT', 'COORDS': [91, 18], 'TM': 60.124570289183794, 'GC_PERCENT': 61.111111111111114, 'SELF_ANY_TH': 0.0, 'SELF_END_TH': 0.0, 'HAIRPIN_TH': 0.0, 'END_STABILITY': 3.41}, {'PENALTY': 2.8924249804836677, 'SEQUENCE': 'CTGTGGATCTTCGGCGGC', 'COORDS': [101, 18], 'TM': 60.89242498048367, 'GC_PERCENT': 66.66666666666667, 'SELF_ANY_TH': 7.2145116769710285, 'SELF_END_TH': 0.0, 'HAIRPIN_TH': 0.0, 'END_STABILITY': 6.53}]
primer_reverse : [{'PENALTY': 0.24813440937521136, 'SEQUENCE': 'GATCACGTTCTCCTCCGACG', 'COORDS': [190, 20], 'TM': 60.24813440937521, 'GC_PERCENT': 60.0, 'SELF_ANY_TH': 0.0, 'SELF_END_TH': 0.0, 'HAIRPIN_TH': 42.12619709637721, 'END_STABILITY': 5.12}, {'PENALTY': 1.0809739840078691, 'SEQUENCE': 'GTCGTACACGTCCAGGGTG', 'COORDS': [157, 19], 'TM': 60.08097398400787, 'GC_PERCENT': 63.1578947368421, 'SELF_ANY_TH': 1.1902412874118795, 'SELF_END_TH': 0.0, 'HAIRPIN_TH': 36.90501393969038, 'END_STABILITY': 4.61}, {'PENALTY': 2.095252362170754, 'SEQUENCE': 'GTTCTCCTCCGACGCAAGC', 'COORDS': [184, 19], 'TM': 61.095252362170754, 'GC_PERCENT': 63.1578947368421, 'SELF_ANY_TH': 0.0, 'SELF_END_TH': 0.0, 'HAIRPIN_TH': 45.62213695404927, 'END_STABILITY': 4.01}, {'PENALTY': 2.12673445397553, 'SEQUENCE': 'CGGTGGTCGTACACGTCC', 'COORDS': [162, 18], 'TM': 60.12673445397553, 'GC_PERCENT': 66.66666666666667, 'SELF_ANY_TH': 10.508612029603285, 'SELF_END_TH': 10.508612029603285, 'HAIRPIN_TH': 45.42514212470502, 'END_STABILITY': 4.79}, {'PENALTY': 1.8732702842021922, 'SEQUENCE': 'TACTGCAGCGACACCACGAT', 'COORDS': [207, 20], 'TM': 61.87327028420219, 'GC_PERCENT': 55.0, 'SELF_ANY_TH': 15.265667197231778, 'SELF_END_TH': 0.0, 'HAIRPIN_TH': 35.08910707969903, 'END_STABILITY': 3.73}, {'PENALTY': 2.027625417412935, 'SEQUENCE': 'GGTACTGCAGCGACACCAC', 'COORDS': [209, 19], 'TM': 61.027625417412935, 'GC_PERCENT': 63.1578947368421, 'SELF_ANY_TH': 15.265667197231778, 'SELF_END_TH': 0.405162456403275, 'HAIRPIN_TH': 0.0, 'END_STABILITY': 4.16}]
primer_probe : [{'PENALTY': 4.161169114371262, 'SEQUENCE': 'GGCTTCTACTCCGGCACCGC', 'COORDS': [119, 20], 'TM': 59.83883088562874, 'GC_PERCENT': 70.0, 'SELF_ANY_TH': 0.0, 'SELF_END_TH': 0.0, 'HAIRPIN_TH': 36.28074618918981}, {'PENALTY': 3.1023652706057874, 'SEQUENCE': 'GCGGCGGCTTCTACTCCGG', 'COORDS': [114, 19], 'TM': 59.89763472939421, 'GC_PERCENT': 73.6842105263158, 'SELF_ANY_TH': 7.734280912202962, 'SELF_END_TH': 0.0, 'HAIRPIN_TH': 42.29746586682808}, {'PENALTY': 2.194535966840135, 'SEQUENCE': 'TGGATCTTCGGCGGCGGC', 'COORDS': [104, 18], 'TM': 59.805464033159865, 'GC_PERCENT': 72.22222222222223, 'SELF_ANY_TH': 28.91330394666676, 'SELF_END_TH': 21.80273259353129, 'HAIRPIN_TH': 38.904208273950076}, {'PENALTY': 2.194535966840135, 'SEQUENCE': 'TGGATCTTCGGCGGCGGC', 'COORDS': [104, 18], 'TM': 59.805464033159865, 'GC_PERCENT': 72.22222222222223, 'SELF_ANY_TH': 28.91330394666676, 'SELF_END_TH': 21.80273259353129, 'HAIRPIN_TH': 38.904208273950076}, {'PENALTY': 3.1023652706057874, 'SEQUENCE': 'GCGGCGGCTTCTACTCCGG', 'COORDS': [114, 19], 'TM': 59.89763472939421, 'GC_PERCENT': 73.6842105263158, 'SELF_ANY_TH': 7.734280912202962, 'SELF_END_TH': 0.0, 'HAIRPIN_TH': 42.29746586682808}, {'PENALTY': 4.161169114371262, 'SEQUENCE': 'GGCTTCTACTCCGGCACCGC', 'COORDS': [119, 20], 'TM': 59.83883088562874, 'GC_PERCENT': 70.0, 'SELF_ANY_TH': 0.0, 'SELF_END_TH': 0.0, 'HAIRPIN_TH': 36.28074618918981}]
Now lets convert this into an easy to read pandas dataframe.
primer_df = AnoPrimer.primer3_to_pandas(primer_dict, assay_type)
primer_df
primer_pair | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|
parameter | ||||||
primer_forward_sequence | TCATGCTGTGGATCTTCGGC | GCCGTCATGCTGTGGATCTT | CAAGAATGCGGCCGTCATG | ACATTAACGTGGTGGCACCG | GGCCGTCATGCTGTGGAT | CTGTGGATCTTCGGCGGC |
primer_reverse_sequence | GATCACGTTCTCCTCCGACG | GTCGTACACGTCCAGGGTG | GTTCTCCTCCGACGCAAGC | CGGTGGTCGTACACGTCC | TACTGCAGCGACACCACGAT | GGTACTGCAGCGACACCAC |
primer_probe_sequence | GGCTTCTACTCCGGCACCGC | GCGGCGGCTTCTACTCCGG | TGGATCTTCGGCGGCGGC | TGGATCTTCGGCGGCGGC | GCGGCGGCTTCTACTCCGG | GGCTTCTACTCCGGCACCGC |
primer_forward_tm | 60.462581 | 60.745845 | 59.936994 | 61.231958 | 60.12457 | 60.892425 |
primer_reverse_tm | 60.248134 | 60.080974 | 61.095252 | 60.126734 | 61.87327 | 61.027625 |
primer_probe_tm | 59.838831 | 59.897635 | 59.805464 | 59.805464 | 59.897635 | 59.838831 |
primer_forward_gc_percent | 55.0 | 55.0 | 57.894737 | 55.0 | 61.111111 | 66.666667 |
primer_reverse_gc_percent | 60.0 | 63.157895 | 63.157895 | 66.666667 | 55.0 | 63.157895 |
primer_probe_gc_percent | 70.0 | 73.684211 | 72.222222 | 72.222222 | 73.684211 | 70.0 |
primer_forward | [96, 20] | [92, 20] | [82, 19] | [51, 20] | [91, 18] | [101, 18] |
primer_reverse | [190, 20] | [157, 19] | [184, 19] | [162, 18] | [207, 20] | [209, 19] |
primer_probe | [119, 20] | [114, 19] | [104, 18] | [104, 18] | [114, 19] | [119, 20] |
primer_pair_product_size | 95 | 66 | 103 | 112 | 117 | 109 |
We can write this to .tsv and excel files, which can be explored in other editors.
primer_df.to_csv(f"{assay_name}.{assay_type}.primers.tsv", sep="\t")
primer_df.to_excel(f"{assay_name}.{assay_type}.primers.xlsx")
Looking for variation using the Ag1000G data resource#
As we’ve seen in earlier workshops, Ag1000G samples are organised into sample sets. Lets look at what each sample set contains, breaking it down by species, year and country.
metadata = ag3.sample_metadata(sample_sets='3.0')
pivot_country_year_taxon = (
metadata
.pivot_table(
index=["sample_set", "country", "year"],
columns=["taxon"],
values="sample_id",
aggfunc="count",
fill_value=0
)
)
pivot_country_year_taxon
taxon | arabiensis | coluzzii | gambiae | gcx1 | gcx3 | unassigned | ||
---|---|---|---|---|---|---|---|---|
sample_set | country | year | ||||||
AG1000G-AO | Angola | 2009 | 0 | 81 | 0 | 0 | 0 | 0 |
AG1000G-BF-A | Burkina Faso | 2012 | 0 | 82 | 99 | 0 | 0 | 0 |
AG1000G-BF-B | Burkina Faso | 2014 | 3 | 53 | 46 | 0 | 0 | 0 |
AG1000G-BF-C | Burkina Faso | 2004 | 0 | 0 | 13 | 0 | 0 | 0 |
AG1000G-CD | Democratic Republic of the Congo | 2015 | 0 | 0 | 76 | 0 | 0 | 0 |
AG1000G-CF | Central African Republic | 1993 | 0 | 5 | 2 | 0 | 0 | 0 |
1994 | 0 | 13 | 53 | 0 | 0 | 0 | ||
AG1000G-CI | Cote d'Ivoire | 2012 | 0 | 80 | 0 | 0 | 0 | 0 |
AG1000G-CM-A | Cameroon | 2009 | 0 | 0 | 303 | 0 | 0 | 0 |
AG1000G-CM-B | Cameroon | 2005 | 0 | 7 | 90 | 0 | 0 | 0 |
AG1000G-CM-C | Cameroon | 2013 | 2 | 19 | 23 | 0 | 0 | 0 |
AG1000G-FR | Mayotte | 2011 | 0 | 0 | 23 | 0 | 0 | 0 |
AG1000G-GA-A | Gabon | 2000 | 0 | 0 | 69 | 0 | 0 | 0 |
AG1000G-GH | Ghana | 2012 | 0 | 64 | 36 | 0 | 0 | 0 |
AG1000G-GM-A | Gambia, The | 2011 | 0 | 6 | 0 | 68 | 0 | 0 |
AG1000G-GM-B | Gambia, The | 2006 | 0 | 22 | 0 | 9 | 0 | 0 |
AG1000G-GM-C | Gambia, The | 2012 | 0 | 172 | 2 | 0 | 0 | 0 |
AG1000G-GN-A | Guinea | 2012 | 0 | 4 | 41 | 0 | 0 | 0 |
AG1000G-GN-B | Guinea | 2012 | 0 | 7 | 83 | 0 | 0 | 1 |
Mali | 2012 | 0 | 27 | 65 | 0 | 0 | 2 | |
AG1000G-GQ | Equatorial Guinea | 2002 | 0 | 0 | 10 | 0 | 0 | 0 |
AG1000G-GW | Guinea-Bissau | 2010 | 0 | 0 | 7 | 93 | 0 | 1 |
AG1000G-KE | Kenya | 2000 | 0 | 0 | 19 | 0 | 0 | 0 |
2007 | 3 | 0 | 0 | 0 | 0 | 0 | ||
2012 | 10 | 0 | 0 | 0 | 54 | 0 | ||
AG1000G-ML-A | Mali | 2014 | 0 | 27 | 33 | 0 | 0 | 0 |
AG1000G-ML-B | Mali | 2004 | 2 | 36 | 33 | 0 | 0 | 0 |
AG1000G-MW | Malawi | 2015 | 41 | 0 | 0 | 0 | 0 | 0 |
AG1000G-MZ | Mozambique | 2003 | 0 | 0 | 3 | 0 | 0 | 0 |
2004 | 0 | 0 | 71 | 0 | 0 | 0 | ||
AG1000G-TZ | Tanzania | 2012 | 87 | 0 | 0 | 0 | 0 | 0 |
2013 | 1 | 0 | 32 | 0 | 10 | 0 | ||
2015 | 137 | 0 | 32 | 0 | 1 | 0 | ||
AG1000G-UG | Uganda | 2012 | 82 | 0 | 207 | 0 | 0 | 1 |
AG1000G-X | Lab Cross | -1 | 0 | 0 | 0 | 0 | 0 | 297 |
Here, we can see the breakdown by sample set for country, species and year. For the purposes of this notebook, let’s use the Ghana sample set. If we wanted to use all sample sets, we could supply ‘3.0’ instead of a sample set, which will load all samples from the Ag3.0 release.
metadata.columns
Index(['sample_id', 'partner_sample_id', 'contributor', 'country', 'location',
'year', 'month', 'latitude', 'longitude', 'sex_call', 'sample_set',
'release', 'quarter', 'study_id', 'study_url',
'aim_species_fraction_arab', 'aim_species_fraction_colu',
'aim_species_fraction_colu_no2l', 'aim_species_gambcolu_arabiensis',
'aim_species_gambiae_coluzzii', 'aim_species', 'country_iso',
'admin1_name', 'admin1_iso', 'admin2_name', 'taxon',
'cohort_admin1_year', 'cohort_admin1_month', 'cohort_admin1_quarter',
'cohort_admin2_year', 'cohort_admin2_month', 'cohort_admin2_quarter'],
dtype='object')
sample_set = ['AG1000G-BF-A', 'AG1000G-GH', 'AG1000G-GN-A']
# here we could subset to specific values in the metadata e.g.: "taxon == 'gambiae'" , or "taxon == 'arabiensis'"
sample_query = None
Plot allele frequencies in primers locations#
Now we can plot the primers pairs, and the frequency of any alternate alleles in the Ag1000G sample set of choice. When calculating allele frequencies, we will take the sum of all alternate alleles, as we are interested here in any mutations which are different from the reference genome. We can see the frequencies of specific alleles by hovering over the points of the plot - in some cases it may be preferable to design degenerate primers rather than avoid a primer pair completely.
We will also plot the primer Tm, GC and genomic spans of each primer binding site. We can use this plot to identify primers pairs and probes which may be suitable, particularly trying to avoid SNPs in the 3’ end.
results_dict = AnoPrimer.plot_primer_snp_frequencies(
species='gambiae_sl',
primer_df=primer_df,
gdna_pos=gdna_pos,
contig=contig,
sample_sets=sample_set,
sample_query=sample_query,
assay_type=assay_type,
seq_parameters=seq_parameters,
out_dir="."
)
Now lets plot these primer pairs across the genome, highlighting where they are in relation to any nearby exons…
AnoPrimer.plot_primer_locs(
species='gambiae_sl',
primer_res_dict=results_dict,
primer_df=primer_df,
assay_type=assay_type,
contig=contig,
seq_parameters=seq_parameters,
legend_loc='lower left'
)
Checking our primers for specificity to the Anopheles gambiae genome#
We can use a cool new python package, gget, to rapidly search our primers against the AgamP3 genome, to ensure they will not amplify any other regions of the genome. gget rapidly queries large databases - in this case we can use the program BLAT to align primer sequences. Unfortunately as the lengths of primer sequences are so short, it is at the limit of BLATs sensitivity, and in some cases, matches are not returned. gget can also currently only query the older AgamP3 assembly. Therefore, it is also recommended run a more exhaustive search in Primer-BLAST.
blat_result_df = AnoPrimer.gget_blat_genome(primer_df, assay_type, assembly='anoGam3')
blat_result_df
genome | query_size | aligned_start | aligned_end | matches | mismatches | %_aligned | %_matched | chromosome | strand | start | end | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
primer | |||||||||||||
0 | forward_0 | anoGam3 | 20 | 1 | 20 | 20 | 0 | 100.0 | 100.00 | chr2R | + | 3492051 | 3492070 |
reverse_0 | anoGam3 | 20 | 1 | 20 | 20 | 0 | 100.0 | 100.00 | chr2R | - | 3492126 | 3492145 | |
probe_0 | anoGam3 | 20 | 1 | 20 | 20 | 0 | 100.0 | 100.00 | chr2R | + | 3492074 | 3492093 | |
1 | forward_1 | anoGam3 | 20 | 1 | 20 | 20 | 0 | 100.0 | 100.00 | chr2R | + | 3492047 | 3492066 |
reverse_1 | anoGam3 | 19 | 1 | 19 | 19 | 0 | 100.0 | 100.00 | chr2R | - | 3492094 | 3492112 | |
probe_1 | anoGam3 | 19 | 1 | 19 | 19 | 0 | 100.0 | 100.00 | chr2R | + | 3492069 | 3492087 | |
2 | forward_2 | anoGam3 | 19 | 1 | 19 | 18 | 1 | 100.0 | 94.74 | chrX | - | 17504190 | 17504208 |
reverse_2 | anoGam3 | 19 | 1 | 19 | 19 | 0 | 100.0 | 100.00 | chr2R | - | 3492121 | 3492139 | |
probe_2 | anoGam3 | 18 | 1 | 18 | 18 | 0 | 100.0 | 100.00 | chr2R | + | 3492059 | 3492076 | |
3 | forward_3 | anoGam3 | 20 | 1 | 20 | 20 | 0 | 100.0 | 100.00 | chr2R | + | 3492006 | 3492025 |
reverse_3 | anoGam3 | 18 | 1 | 18 | 18 | 0 | 100.0 | 100.00 | chr2R | - | 3492100 | 3492117 | |
probe_3 | anoGam3 | 18 | 1 | 18 | 18 | 0 | 100.0 | 100.00 | chr2R | + | 3492059 | 3492076 | |
4 | forward_4 | anoGam3 | 18 | 1 | 18 | 18 | 0 | 100.0 | 100.00 | chr2R | + | 3492046 | 3492063 |
reverse_4 | anoGam3 | 20 | 1 | 20 | 20 | 0 | 100.0 | 100.00 | chr2R | - | 3492143 | 3492162 | |
probe_4 | anoGam3 | 19 | 1 | 19 | 19 | 0 | 100.0 | 100.00 | chr2R | + | 3492069 | 3492087 | |
5 | forward_5 | anoGam3 | 18 | 1 | 18 | 18 | 0 | 100.0 | 100.00 | chr2R | + | 3492056 | 3492073 |
reverse_5 | anoGam3 | 19 | 1 | 19 | 19 | 0 | 100.0 | 100.00 | chr2R | - | 3492146 | 3492164 | |
probe_5 | anoGam3 | 20 | 1 | 20 | 20 | 0 | 100.0 | 100.00 | chr2R | + | 3492074 | 3492093 |
Further considerations#
We may now have designed suitable primers. However, there are some further considerations…
Primers should be run in Primer-BLAST, to ensure specificity against the host organism, and specificity for the genomic location of interest.
If in multiplexed use with other primers or probes, primers must not interact with each other. This can be investigated on a one by one basis using the IDT tool oligoanalyzer, though higher throughput algorithms may be required.
If designing TaqMan or Locked Nucleic Acid (LNA) probes for SNP detection, further modification will be required, such as adding fluorophores and ensuring the two probes (one for wild-type, one for mutant) can discriminate between SNPs. for LNA probes, you will want to play around with the placement of LNAs in the oligo sequence, which can allow short probes (~10-14 bases) to bind with high affinity and discriminate between SNPs. IDT have a tool for this which allow you to check the binding affinity between mismatches, though it requires a log in https://eu.idtdna.com/calc/analyzer/lna.
Many more considerations…. IDT - How to design primers and probes for PCR and qPCR
Worked example: Primers to avoid#
In the above ace1 example, all of the primer pairs look reasonable. Lets have a look at an example where some of the primers do not look so good. A UGT detoxification gene, AGAP006222, is overexpressed very highly in populations of An. coluzzii - lets try and design some qPCR primers to measure its expression. The below AnoPrimer.designPrimers()
function integrates the entire notebook into one function, for convenience.
primer_parameters = {
'PRIMER_OPT_SIZE': 20,
'PRIMER_TASK':'generic',
'PRIMER_MIN_SIZE': 17,
'PRIMER_MAX_SIZE': 24,
'PRIMER_OPT_TM': 60.0,
'PRIMER_MIN_TM': 55.0,
'PRIMER_MAX_TM': 64.0,
'PRIMER_MIN_GC': 30.0,
'PRIMER_MAX_GC': 70.0,
'PRIMER_MIN_THREE_PRIME_DISTANCE': 3, # this parameter is the minimum distance between successive pairs. If 1, it means successive primer pairs could be identical bar one base shift
'PRIMER_INTERNAL_OPT_SIZE': 16, # Probe size preferences if selected, otherwise ignored
'PRIMER_INTERNAL_MIN_SIZE': 10,
'PRIMER_INTERNAL_MAX_SIZE': 22,
'PRIMER_INTERNAL_MIN_TM': 45,
'PRIMER_INTERNAL_MAX_TM': 65, # Probe considerations are quite relaxed, assumed that LNAs / Taqman will be used later to affect TM
# Extra primer3 parameters can go here
# In the same format as above
}
primer_df, blat_df = AnoPrimer.designPrimers(
species='gambiae_sl',
assay_type='cDNA primers', # assay_type options are: 'cDNA primers', 'gDNA primers', 'gDNA primers + probe', 'probe'
target='AGAP006222-RA', # target should be an AGAP transcript identifier in for cDNA, otherwise should be an integer in genome
assay_name='UGT_222',
n_primer_pairs=5,
min_amplicon_size=60,
max_amplicon_size=120,
primer_parameters=primer_parameters,
sample_sets='AG1000G-GH', # sample_set = '3.0' .you can also supply lists with multiple sample sets e.g ['AG1000G-GH', 'AG1000G-CI', 'AG1000G-BF-A']
sample_query="taxon == 'coluzzii'",
out_dir="."
)
We can see that only primer pair 4 really looks suitable, as all the other primer pairs contain SNPs at their 3’ end in at least one of the forward or reverse primers.
Well done!!#
Well done on completing the notebook and designing sets of SNP-variation informed primers!
Exercises#
English#
For the exercises, you can either re-run this notebook or use the function in the cell above, which integrates the above notebook into one single function, for convenience.
Design a set of qPCR primers to measure gene expression of the carboxylesterase gene, COEAE2F (contig=’2L’, transcript=’AGAP006228-RA’), restricting the samples to just Anopheles gambiae. HINT: use ‘cDNA primers’.
Design a set of primers and a probe for the vgsc-995F kdr mutation (contig = 2L, genome bp = 2422652)
Design a set of qPCR primers to target the gene from module 3’s practical exercise.
Français#
En tant qu’exercices, vous pouvez soit exécuter à nouveau ce notebook ou utiliser la fonction de la cellule ci-dessus, qui intègre le contenu du reste du notebook dans une seule cellule, pour une plus grande aisance.
Créer un ensemble d’amorces pour qPCR afin de mesurer l’expression génique du gène carboxylesterase COEAE2F (contig=’2L’, transcript=’AGAP006228-RA’), en limitant les échantillons aux Anopheles gambiae. INDICE: utiliser ‘cDNA primers’.
Créer un ensemble d’amorces et une sonde pour la mutation vgsc-995F kdr (contig = 2L, genome bp = 2422652)
Créer un ensemble d’amorces pour qPCR ciblant le gène de l’exercice pratique du module 3
primer_parameters = {
'PRIMER_OPT_SIZE': 20,
'PRIMER_TASK':'generic',
'PRIMER_MIN_SIZE': 17,
'PRIMER_MAX_SIZE': 24,
'PRIMER_OPT_TM': 60.0,
'PRIMER_MIN_TM': 55.0,
'PRIMER_MAX_TM': 64.0,
'PRIMER_MIN_GC': 30.0,
'PRIMER_MAX_GC': 75.0,
'PRIMER_MIN_THREE_PRIME_DISTANCE': 3, # this parameter is the minimum distance between successive pairs. If 1, it means successive primer pairs could be identical bar one base shift
'PRIMER_INTERNAL_OPT_SIZE': 16, # Probe size preferences if selected, otherwise ignored
'PRIMER_INTERNAL_MIN_SIZE': 10,
'PRIMER_INTERNAL_MAX_SIZE': 22,
'PRIMER_INTERNAL_MIN_TM': 45,
'PRIMER_INTERNAL_MAX_TM': 65, # Probe considerations are quite relaxed, assumed that LNAs / Taqman will be used later to affect TM
# Extra primer3 parameters can go here
# In the same format as above
}
# primer_df, blat_df = AnoPrimer.designPrimers(
# species='gambiae_sl',
# assay_type='cDNA primers', # assay_type options are: 'cDNA primers', 'gDNA primers', 'gDNA primers + probe', 'probe'
# target='AGAP000818-RA', # target should be an AGAP transcript identifier in for cDNA primers, otherwise should be an integer in genome
# assay_name='COEAE2F',
# n_primer_pairs=8,
# min_amplicon_size=60,
# max_amplicon_size=120,
# primer_parameters=primer_parameters,
# sample_sets=['AG1000G-BF-A', 'AG1000G-GH', 'AG1000G-GN-A'],
# out_dir="."
# )
Future development#
Any contributions or suggestions on how we can improve this notebook are more than welcome. Please email or log an issue on github. This notebook and source code for AnoPrimer are located here - sanjaynagi/AnoPrimer
References#
[1] Santolamazza, F., Mancini, E., Simard, F. et al. Insertion polymorphisms of SINE200 retrotransposons within speciation islands of Anopheles gambiae molecular forms. Malar J 7, 163 (2008). https://doi.org/10.1186/1475-2875-7-163
[2] White BJ, Santolamazza F, Kamau L, Pombi M, Grushko O, Mouline K, Brengues C, Guelbeogo W, Coulibaly M, Kayondo JK, Sharakhov I, Simard F, Petrarca V, Della Torre A, Besansky NJ. Molecular karyotyping of the 2La inversion in Anopheles gambiae. Am J Trop Med Hyg. 2007 Feb;76(2):334-9. PMID: 17297045.
[3] Makunin, A., Korlević, P., Park, N., Goodwin, S., Waterhouse, R. M., von Wyschetzki, K., Jacob, C. G., Davies, R., Kwiatkowski, D., St Laurent, B., Ayala, D., & Lawniczak, M. K. N. (2022). A targeted amplicon sequencing panel to simultaneously identify mosquito species and Plasmodium presence across the entire Anopheles genus. Molecular Ecology Resources, 22, 28– 44. https://doi.org/10.1111/1755-0998.13436
[4] Chabi J, Van’t Hof A, N’dri LK, Datsomor A, Okyere D, Njoroge H, et al. (2019) Rapid high throughput SYBR green assay for identifying the malaria vectors Anopheles arabiensis, Anopheles coluzzii and Anopheles gambiae s.s. Giles. PLoS ONE 14(4)
[5] Lynd, A., Oruni, A., van’t Hof, A.E. et al. Insecticide resistance in Anopheles gambiae from the northern Democratic Republic of Congo, with extreme knockdown resistance (kdr) mutation frequencies revealed by a new diagnostic assay. Malar J 17, 412 (2018). https://doi.org/10.1186/s12936-018-2561-5
[6] Lefever S, Pattyn F, et al. (2013) Single-nucleotide polymorphisms and other mismatches reduce performance of quantitative PCR assays. Single-nucleotide polymorphisms and other mismatches reduce performance of quantitative PCR assays. Clin Chem, 59(10):1470–1480.
[7] Owczarzy R, Tataurov AV, et al. (2008) IDT SciTools: a suite for analysis and design of nucleic acid oligomers. Nucl Acids Res, 36 (suppl 2):W163–169.
[8] The Anopheles gambiae 1000 Genomes Consortium (2020). Genome variation and population structure among 1142 mosquitoes of the African malaria vector species Anopheles gambiae and Anopheles coluzzii. Genome Research, 30: 1533-1546. https://genome.cshlp.org/content/early/2020/09/25/gr.262790.120
[9] Untergasser A, Cutcutache I, Koressaar T, Ye J, Faircloth BC, Remm M and Rozen SG (2012). Primer3–new capabilities and interfaces. Nucleic Acids Research. 40(15):e115.
[10] Ye, J., Coulouris, G., Zaretskaya, I. et al. Primer-BLAST: A tool to design target-specific primers for polymerase chain reaction. BMC Bioinformatics 13, 134 (2012). https://doi.org/10.1186/1471-2105-13-134
[11] Grau-Bové X, Lucas E, Pipini D, Rippon E, van ‘t Hof AE, Constant E, et al. (2021) Resistance to pirimiphos-methyl in West African Anopheles is spreading via duplication and introgression of the Ace1 locus. PLoS Genet 17(1): e1009253. https://doi.org/10.1371/journal.pgen.1009253