banner

Workshop 3 - Training course in data analysis for genomic surveillance of African malaria vectors


Module 3 - Genome-wide SNP data#

Theme: Data

In this module, we will look at the data that we use to analyse population structure: genome-wide SNP data. We will look again at how SNPs are called from whole-genome sequence data, and how the resulting SNP data are stored and accessed from Python code. We will also introduce site filters, which are an important tool to use when analysing SNP data from across the genome.

Learning objectives#

At the end of this module, you will be able to:

  • Explain how SNPs are called from whole-genome sequence data

  • Use IGV to browse whole-genome sequence data

  • Access genome-wide SNP calls from Python code

  • Access site filters and explain why we need them

Lecture#

English#

Français#

Please note that the code in the cells below might differ from that shown in the video. This can happen because Python packages and their dependencies change due to updates, necessitating tweaks to the code.

Setup#

As usual, let’s begin by installing and importing some Python packages, and configuring access to Anopheles genomic data from the MalariaGEN Ag3.0 data resource.

%pip install -q --no-warn-conflicts malariagen_data
import warnings
warnings.simplefilter(action='ignore', category=UserWarning)

import malariagen_data
import numpy as np
ag3 = malariagen_data.Ag3()
ag3
MalariaGEN Ag3 API client
Please note that data are subject to terms of use, for more information see the MalariaGEN website or contact data@malariagen.net. See also the Ag3 API docs.
Storage URL gs://vo_agam_release/
Data releases available 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9
Results cache None
Cohorts analysis 20231215
AIM analysis 20220528
Site filters analysis dt_20200416
Software version malariagen_data 8.7.0
Client location unknown

Data locality#

Before we go any further, take a moment to check the client location field in the output of the cell above. This shows where your colab VM is running.

In this workshop we will be accessing data on millions of SNPs. Because we are accessing a larger amount of data than we have done in previous workshops, we will need to make sure we are making the best possible use of our cloud computing resources, so that computations run as fast as possible and we don’t have to wait for too long to see the results.

In particular, data locality becomes important. Roughly speaking, data locality means how physically close is the computer on which your code is running to the computer on which the data is stored.

MalariaGEN data is stored in Google Cloud in the US. Google Cloud is in fact is made up of data centres in many different locations across the US and also Europe and Asia. We chose to use the US simply because it is cheaper.

Here’s a map of the current Google Cloud locations

When you use Google Colab, Google gives you a virtual machine somewhere in Google Cloud where your code will be executed. That virtual machine will be allocated usually from Google Cloud somewhere in the US. If so, there will be good locality between your colab VM (running in Google Cloud in the US) and MalariaGEN data (stored on a multi-region Google Cloud Storage bucket in the US).

Occasionally, however, colab will allocate your VM from a data centre outside the US. If this happens, data locality is not good, because data then has to travel across the internet between continents.

So before we start accessing data, it’s a good idea to check the client location to see where your colab VM is running.

If this shows somewhere inside the US, then you’re good to go.

If it shows somewhere outside the US, performance accessing MalariaGEN data will be poor, and computations can take more than twice as long. In this case, it’s better to ask colab for a new VM, by selecting “Runtime > Disconnect and delete runtime” from the menu. Then rerun notebook from the top, and check the client location again. Colab will allocate a new VM, hopefully in the US this time.

From sequence reads to SNP calls#

Let’s look now at some Anopheles whole-genome sequence data, and how we get from sequence reads to single nucleotide polymorphism (SNP) calls.

Recall that the MalariaGEN team at the Wellcome Sanger Institute performs deep whole-genome sequencing of individual mosquitoes using Illumina paired-end sequencing technology. What this means is:

  • Specimen shipping - Individual mosquito specimens are shipped to Sanger in separate tubes or wells (it is important to keep the mosquito specimens separate, so the DNA from one specimen does not contaminate another).

  • DNA extraction - Genomic DNA is extracted from each individual specimen using nexttec kits and stored separately.

  • Library prep - DNA is then prepared for sequencing, which includes fragmenting the DNA, adding adapters (so DNA binds to the flowcell), and multiplexing (combining DNA from multiple specimens so they can be sequenced in the same run then separated afterwards).

  • Whole-genome sequencing - Sequencing is performed on an Illumina machine. Currently the Sanger Institute is using NovaSeq instruments for whole-genome sequencing of mosquitoes. Previously, HiSeq 2000 and HiSeq X instruments were used.

    • Sequencing currently generates around 50 million sequence reads per mosquito specimen. Each read is 150 bp long, and reads come in pairs, where each read in a pair originates from opposite ends of the same DNA fragment.

    • The target depth of coverage for each individual mosquito is 30x. This means we aim to generate enough sequence reads so that each base of the mosquito genome is sequenced around 30 times. In other words, for every base in the reference genome, we expect to have around 30 sequence reads aligned overlapping that position. 30x coverage is “deep sequencing”, which is good for accurate SNP calling.

  • Sequence read alignment - Sequence reads from each individual mosquito are then aligned to a reference genome. We currently use the AgamP4 (a.k.a. PEST) reference genome for any specimens which are An. gambiae, An. coluzzii or An. arabiensis.

  • SNP calling - SNPs are called by analysing the sequence read alignments for nucleotides which differ between the reads and the reference sequence. SNPs are called at all sites across the whole genome, using the Genome Analysis Toolkit (GATK).

We can view the sequence read alignments for any sample using a specialised genomics viewer called IGV, via the view_alignments() function. Let’s view alignments for a randomly chosen sample from Burkina Faso, starting our view with the whole of chromosome arm 2L…

ag3.view_alignments(sample="AB0087-C", region="2L")
                                     

Here’s some suggestions for how to interact with this viewer and start exploring the sequence read alignments

  • This initial view shows the whole of chromosome arm 2L, which is 49,364,325 base pairs (49.3 Mbp) long. This is too zoomed out to see any alignments or SNP calls. We can, however, see the gene annotations - recall we studied gene annotations in workshop 1, module 3.

  • We need to zoom in to see more data. Try zooming in to a region around 2.4 Mbp. There are various ways to zoom in, e.g.:

    • Click on the “+” button.

    • Click and select a region on the distance scale at the top.

    • Double click somewhere in the genes track.

  • Keep zooming in until you can see horizontal grey bars appear in the “Alignments” track - these are sequence reads.

  • Keep zooming in further, trying to stay centred on position 2,422,652, until you see the letters (nucleotides) of the reference genome appear.

  • At this point you should notice that some of the sequence reads have a “T” showing at this position. This “T” is showing because it is different from the reference genome. Try clicking on the gear icon for the alignments track and select “show all bases”.

  • We are usually only interested in positions where reads are different from reference sequence, hence “show all bases” is turned off by default. You may want to turn it off again.

  • Depth of coverage is shown above the sequence reads. Notice the colours at this position, showing that around half the reads have an “A” (reference) allele, and the other half have a “T” (alternate) allele. This is what a SNP looks like. We’ve studied this SNP before - it causes the Vgsc L995F substitution.

  • This particular sample has a heterozygous genotype - remember mosquitoes are diploid, and so have two genome copies, one from each parent - these may be different from each other, in which case you get roughly 50/50 sequence reads from each.

  • Click on the square at this position in the upper half of the SNPs track. This shows the SNP position (2,422,652), reference allele (A) and alternate alleles (C,T,G).

  • Click on the square at this position in the lower half of the SNPs track. This shows the genotype call “A|T” which means a heterozygous genotype with one “A” and one “T” allele.

  • Now zoom out and browse around a little, and look for some more SNPs in the neighbouring region.

Exercise 1 (English)

In the code cell below, uncomment the code and run the cell to create a new IGV browser, showing data for sample AB0096-C, centred on position 2L:2,422,652. What genotype does this sample have at this SNP?

Exercice 1 (Français)

Dans la cellule ci-dessous, supprimer le “#” et lancer la cellule pour créer un nouvelle fenêtre IGV, affichant les données pour l’échantillon AB0096-C, centré sur la position 2L:2,422,652. Quel génotype possède cet échantillon à ce SNP?

# ag3.view_alignments(sample="AB0096-C", region="2L:2,422,601-2,422,701")

Exercise 2 (English)

Create a new code cell below with an IGV browser showing sample AB0087-C centred on position 2L:2,391,228. What is the reference allele at this SNP? What are the alternate alleles? What is the genotype of this sample?

Exercice 2 (Français)

Créer une nouvelle cellule code avec une fenêtre IGV affichant l’échantillon AB0087-C centré sur la position 2L:2,391,228. Quel est l’allèle de référence à ce SNP? Quel sont les allèles alternatifs? Quel est le génotype de cet échantillon?

Accessing SNP calls#

As we’ve mentioned, for this workshop we are not interested in any particular genes or SNPs, rather we want to use SNPs from across the genome to study population structure. Let’s take a look at how the SNP calls are stored and accessed from Python code. For example, access SNP calls for chromosome arm 2L, for sample set AG1000G-BF_A …

ds_snps = ag3.snp_calls(region="2L", sample_sets="AG1000G-BF-A")
ds_snps
                                 
<xarray.Dataset> Size: 141GB
Dimensions:                             (variants: 48525747, alleles: 4,
                                         samples: 181, ploidy: 2)
Coordinates:
    variant_position                    (variants) int32 194MB dask.array<chunksize=(524288,), meta=np.ndarray>
    variant_contig                      (variants) uint8 49MB dask.array<chunksize=(524288,), meta=np.ndarray>
    sample_id                           (samples) <U24 17kB dask.array<chunksize=(181,), meta=np.ndarray>
Dimensions without coordinates: variants, alleles, samples, ploidy
Data variables:
    variant_allele                      (variants, alleles) |S1 194MB dask.array<chunksize=(524288, 1), meta=np.ndarray>
    variant_filter_pass_gamb_colu_arab  (variants) bool 49MB dask.array<chunksize=(300000,), meta=np.ndarray>
    variant_filter_pass_gamb_colu       (variants) bool 49MB dask.array<chunksize=(300000,), meta=np.ndarray>
    variant_filter_pass_arab            (variants) bool 49MB dask.array<chunksize=(300000,), meta=np.ndarray>
    call_genotype                       (variants, samples, ploidy) int8 18GB dask.array<chunksize=(300000, 50, 2), meta=np.ndarray>
    call_GQ                             (variants, samples) int16 18GB dask.array<chunksize=(300000, 50), meta=np.ndarray>
    call_MQ                             (variants, samples) int16 18GB dask.array<chunksize=(300000, 50), meta=np.ndarray>
    call_AD                             (variants, samples, alleles) int16 70GB dask.array<chunksize=(300000, 50, 4), meta=np.ndarray>
    call_genotype_mask                  (variants, samples, ploidy) bool 18GB dask.array<chunksize=(300000, 50, 2), meta=np.ndarray>
Attributes:
    contigs:  ('2R', '2L', '3R', '3L', 'X')

Notice that the dimensions shows “samples: 181”. This is the number of samples (mosquito specimens) in sample set AG1000G-BF-A.

Notice that the dimensions shows “variants: 48,525,747. What does this mean?

In MalariaGEN we perform all-sites SNP calling, which means we call and store the genotype for every sample at every position in the genome, regardless of whether there is any variation or not. So here we have SNP calls at 48,525,747 positions, or “sites”.

If you’re sharp you’ll remember that chromomsome arm 2L is 49,364,325 base pairs long, so why the discrepancy? This is because there are some gaps in the reference genome, which are positions where the reference nucleotide is unknown (“N”). We do not call SNPs at these sites. We can check this is true…

seq = ag3.genome_sequence(region="2L").compute()
np.count_nonzero((seq != b'N') & (seq != b'n'))
48525747

Let’s access the genomic positions of the SNPs we’ve called…

pos = ds_snps["variant_position"].values
pos
array([       1,        2,        3, ..., 49364323, 49364324, 49364325],
      dtype=int32)

So, the first SNP calls are made at position 1, and the last SNP calls are made at position 49,364,325.

Let’s look at the alleles…

alleles = ds_snps["variant_allele"].values
alleles
array([[b'A', b'C', b'T', b'G'],
       [b'A', b'C', b'T', b'G'],
       [b'C', b'A', b'T', b'G'],
       ...,
       [b'A', b'C', b'T', b'G'],
       [b'A', b'C', b'T', b'G'],
       [b'A', b'C', b'T', b'G']], dtype='|S1')

In this array, each row corresponds to a position in the genome where we perform SNP calling.

The first column contains the alleles which correspond to the nucleotides in the reference genome, called the “reference alleles”.

The second, third and fourth columns contain possible “alternate alleles”, which are other nucleotides that could occur at that site.

We can’t look at the genotypes for the whole genome, because the data are too big - we would run out of memory on our colab VM, and it would crash.

# Don't do this - it will crash your VM!

# gt = ds_snps["call_genotype"].values
ds_snps["call_genotype"].data
Array Chunk
Bytes 16.36 GiB 28.61 MiB
Shape (48525747, 181, 2) (300000, 50, 2)
Dask graph 648 chunks in 1 graph layer
Data type int8 numpy.ndarray
2 181 48525747

Instead, let’s look at data for a single position and sample.

snp_call = (
    ds_snps
    .set_index(variants="variant_position", samples="sample_id")
    .sel(variants=2_422_652, samples="AB0087-C")
)

Here are the alleles at this site:

snp_call["variant_allele"].values
array([b'A', b'C', b'T', b'G'], dtype='|S1')

Here is the genotype call in the selected sample:

snp_call["call_genotype"].values
array([0, 2], dtype=int8)

Note that the genotype here is encoded as numbers rather than letters. By convention, we use the number 0 to represent the reference allele (A); and the numbers 1, 2, 3 represent the first, second and third alternate alleles (C, T, G) respectively.

So the numerical genotype [0, 2] represents the nucleotide genotype A/T.

Exercise 3 (English)

Uncomment the code cell below and run it to access the SNP call for position 2,422,652 in sample AB0096-C. What is the genotype of this sample, decoded from numerical representation into nucleotide representation?

Exercice 3 (Français)

Dans la cellule ci-dessous, supprimer le # et lancer la cellule pour obtenir le SNP à la position 2’422’652 dans l’échantillon AB0096-C. Quel est le génotype de cet échantillon? Il faudra convertir la représentation numérique du SNP en représentation nucléotidique.

# snp_call = (
#     ds_snps
#     .set_index(variants="variant_position", samples="sample_id")
#     .sel(variants=2_422_652, samples="AB0096-C")
# )
# snp_call["call_genotype"].values

Exercise 4 (English)

Create a new code cell to access the SNP call for sample AB0087-C at position 2,391,228. What are the SNP alleles at this site? What is the genotype of this sample in numerical and nucleotide representations?

Exercice 4 (Français)

Créer une nouvelle cellule code et obtenir le SNP pour l’échantillon AB0087-C à la position 2’391’228. Quel sont les allèles de ce SNP? Quel est le génotype de cet échantillon en représentation numérique et nucléotidique?

Allele counts and segregating SNPs#

Although we perform SNP calling at all sites in the genome, at many sites there will be no variation in the samples we’ve genotyped. If there is no variation, we call this a “non-segregating site”. On the other hand, if there is some variation between the individuals genotyped, we call this a “segregating site”.

For a given cohort of samples, we might like to know how many segregating sites are present across the genome. We can answer that question by performing an allele count. This is a computation where we scan across all the genotypes at all sites, and for each site we count the number of times each allele is observed.

Let’s perform an allele count for chromosome arm 2L for sample set AG1000G-BF-A.

Note that this will take a minute or two, because we have to scan data for millions of SNPs.

ac = ag3.snp_allele_counts(region="2L", sample_sets="AG1000G-BF-A")
ac
array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       ...,
       [4, 0, 0, 0],
       [4, 0, 0, 0],
       [4, 0, 0, 0]], dtype=int32)

Here again each row corresponds to a site, and the columns correspond to the alleles. So the first column gives the reference allele counts, and the remaining columns give the alternate allele counts.

Let’s slice this array to see data for some other sites.

ac[2000:3000]
array([[343,   0,  19,   0],
       [362,   0,   0,   0],
       [362,   0,   0,   0],
       ...,
       [362,   0,   0,   0],
       [362,   0,   0,   0],
       [362,   0,   0,   0]], dtype=int32)

In the array above, the first site shown is segregating because we observed the reference allele 343 times and the second alternate allele 19 times, thus there is some variation.

All the other sites are non-segregating because only the reference allele is observed.

We can count the total number of segregating sites using a Python package called scikit-allel.

import allel
ac = allel.AlleleCountsArray(ac)
ac.count_segregating()
14946593

Exercise 5 (English)

Uncomment and run the code cell below to find out how many variant SNPs are present in the AG1000G-BF-A sample set on chromosome arm 3L.

Exercice 5 (Français)

Supprimer le # dans la cellule ci-dessous et lancer la cellule pour calculer le nombre de SNPs présents dans le jeu d’échantillons AG1000G-BF-A sur le bras de chromosome 3L.

# ac = ag3.snp_allele_counts(region="3L", sample_sets="AG1000G-BF-A")
# ac = allel.AlleleCountsArray(ac)
# ac.count_segregating()

Exercise 6 (English)

Create a new code cell to count the number of segregating SNPs on chromosome arm 3R in An. coluzzii samples from sample set AG1000G-GH. Hint: add a sample_query="taxon == 'coluzzii'" parameter.

Exercice 6 (Français)

Créer une nouvelle cellule code et calculer le nombre de “segregating SNPs” (SNPs variables) sur le bras de chromosome 3R dans les moustiques An. coluzzii du jeu d’échantillons AG1000G-GH. Indice: ajouner le paramètre sample_query="taxon == 'coluzzii'".

Site filters#

Why do we need site filters?#

When analysing genome-wide SNP data, there are some sites in the genome where it is difficult to reliably call SNP genotypes. This can be for a variety of reasons. For example, some genome regions are very repetitive, and short sequence reads cannot be aligned unambiguously. Another issue can be that some samples have structural variation in some regions of the genome, where different samples have different numbers of copies of a given sequence.

To get a sense of what this looks like in the data, let’s use IGV again, and let’s jump straight into the genome region we were looking at earlier.

ag3.view_alignments(sample="AB0087-C", region="2L:2,419,000-2,425,000")

For most of the region, the alignments look “tidy”, in the sense that almost all reads are coloured grey, and depth of coverage is consistent.

But on the left, notice the area where coverage drops, and there are many transparent reads - these have a mapping quality of zero, which means they could equally well have been aligned at another position in the genome.

This indicates some kind of repetitive region of the genome, and our alignments in this region are not reliable, thus we cannot expect to accurately call SNPs in that region.

By the way, if you’re wondering what all the different colours mean in the browser above, check out the IGV alignment track docs.

What site filters are available?#

To identify and avoid sites like these, we have created site filters. These are Boolean masks which identify sites where SNP genotyping is reliable, based on the fact that most sequence reads are unambiguously aligned, and there is minimal evidence for structural variation.

Because we have multiple mosquito species included in the data (An. gambiae, An. coluzzii and An. arabiensis) we have created three different site filters which are appropriate in different circumstances, depending on which samples you’re analysing. Here is a summary of the three site filters:

  • “gamb_colu_arab” - These site filters are appropriate when performing a joint analysis of samples from any of the three species. They are, however, more stringent the other filters, which we’ll explore further below.

  • “gamb_colu” - These site filters are appropriate when analysing only samples that are not An. arabiensis.

  • “arab” - These sige filters are appropriate when analysing only samples that are An. arabiensis.

For each of these filters, there is one Boolean mask per chromosome arm. The Boolean masks are True where a site passes the filter and False otherwise. E.g., here’s the gamb_colu_arab site filter for chromosome arm 2L:

loc_site_pass = ds_snps["variant_filter_pass_gamb_colu_arab"].values
loc_site_pass
array([False, False, False, ..., False, False, False])

We can count how many sites pass and fail this filter.

n_pass = np.count_nonzero(loc_site_pass)
n_pass
32529983
n_fail = np.count_nonzero(~loc_site_pass)
n_fail
15995764

We can also apply this filter to the data, via the site_mask parameter. E.g., let’s access SNP call data with the gamb_colu_arab filters applied.

ds_snps_pass = ag3.snp_calls(
    region="2L", 
    sample_sets="AG1000G-BF-A", 
    site_mask="gamb_colu_arab"
)
ds_snps_pass
                                   
<xarray.Dataset> Size: 95GB
Dimensions:                             (variants: 32529983, alleles: 4,
                                         samples: 181, ploidy: 2)
Coordinates:
    variant_position                    (variants) int32 130MB dask.array<chunksize=(67282,), meta=np.ndarray>
    variant_contig                      (variants) uint8 33MB dask.array<chunksize=(67282,), meta=np.ndarray>
    sample_id                           (samples) <U24 17kB dask.array<chunksize=(181,), meta=np.ndarray>
Dimensions without coordinates: variants, alleles, samples, ploidy
Data variables:
    variant_allele                      (variants, alleles) |S1 130MB dask.array<chunksize=(67282, 1), meta=np.ndarray>
    variant_filter_pass_gamb_colu_arab  (variants) bool 33MB dask.array<chunksize=(28623,), meta=np.ndarray>
    variant_filter_pass_gamb_colu       (variants) bool 33MB dask.array<chunksize=(28623,), meta=np.ndarray>
    variant_filter_pass_arab            (variants) bool 33MB dask.array<chunksize=(28623,), meta=np.ndarray>
    call_genotype                       (variants, samples, ploidy) int8 12GB dask.array<chunksize=(28623, 50, 2), meta=np.ndarray>
    call_GQ                             (variants, samples) int16 12GB dask.array<chunksize=(28623, 50), meta=np.ndarray>
    call_MQ                             (variants, samples) int16 12GB dask.array<chunksize=(28623, 50), meta=np.ndarray>
    call_AD                             (variants, samples, alleles) int16 47GB dask.array<chunksize=(28623, 50, 4), meta=np.ndarray>
    call_genotype_mask                  (variants, samples, ploidy) bool 12GB dask.array<chunksize=(28623, 50, 2), meta=np.ndarray>
Attributes:
    contigs:  ('2R', '2L', '3R', '3L', 'X')

Notice the size of the variants dimension is smaller, and corresponds to the number of sites passing the filter we calculated above (32,529,983).

To get some more intuition for site filters and segregating SNPs, let’s plot SNPs for a region of chromosome arm 2L spanning the Vgsc gene.

ag3.plot_snps(
    region="2L:2,350,000-2,450,000",
    sample_sets="AG1000G-BF-A",
    site_mask="gamb_colu_arab",
);
                                     

Exercise 7 (English)

Plot SNPs for the region 2L:25,350,000-25,440,000 using sample set AG1000G-CI and the gamb_colu site mask.

What gene does this region overlap?

Now zoom in on position 25,429,236. Try to answer the following questions:

  • Is it segregating?

  • Does it pass the site filters?

  • How many alleles are found? Hint: look at the “No. alleles” field in the hover text.

  • Which allele is most common? How many times was it observed?

Exercice 7 (Français)

Créer un graphe intéractif des SNPs dans la région génomique 2L:25,350,000-25,440,000, utilisant le jeu d’échantillons AG1000G-CI et le filtre de position gamb_colu.

Quel gène est présent dans cette région?

Aggrandir la position 25,429,236 et répondre aux questions suivantes:

  • Le SNP à cette position est-il variable (“segregating”).

  • La position passe-t-elle le filtre de positions?

  • Combien d’allèle retrouve-t-on à cette position? Indice: Placer la souris sur le SNP et consulter la rubrique “No. alleles”

  • Quel allèle est le plus fréquent? Combien de fois le retrouve-t-on?

Well done!#

Congratulations on making it to the end of this module!

Exercises#

English#

Open this notebook in Google Colab and run it for yourself from top to bottom. As you run through the notebook, cell by cell, think about what each cell is doing, and try the practical exercises along the way.

Have go at the practical exercises, but please don’t worry if you don’t have time to do them all during the practical session, and please ask the teaching assistants for help if you are stuck.

Hint: To open the notebook in Google Colab, click the rocket icon at the top of the page, then select “Colab” from the drop-down menu.

Français#

Ouvrir ce notebook dans Google Colab et l’exécuter vous-même du début à la fin. Pendant que vous exécutez le notebook, cellule par cellule, pensez à ce que chaque cellule fait et essayez de faire les exercices quand vous les rencontrez.

Essayez de faire les exercices mais ne vous inquiétez pas si vous n’avez pas le temps de tout faire pendant la séance appliquée et n’hésitez pas à demander aux enseignants assistants si vous avez besoin d’aide parce que vous êtes bloqués.

Indice: Pour ouvrir le notebook dans Google Colab, cliquer sur l’icône de fusée au sommet de cette page puis choisissez “Colab” dans le menu déroulant.