Workshop 4 - Training course in data analysis for genomic surveillance of African malaria vectors


Module 4 - Discovering cryptic taxa#

Theme: Analysis

In this module we’re going to detect cryptic taxa in the Ag3.0 data resource using ancestry informative markers (AIMs) and principal component analysis (PCA). We will use functions in the malariagen_data Python package to run the analyses, then learn how to interpret the results to discover cryptic taxonomic structure.

Learning objectives#

At the end of this module you will:

  • Understand MalariaGEN’s mosquito taxon analysis pipeline and the reasoning behind it.

  • Be able to run and plot AIM and PCA analyses, and interpret the results.

  • Be able to detect cryptic taxa.

Lecture#

English#

Français#

Please note that the code in the cells below might differ from that shown in the video. This can happen because Python packages and their dependencies change due to updates, necessitating tweaks to the code.

Why do we need to discover cryptic taxa?#

Generally, we need to discover cryptic taxa in our data sets because they will generate population structure. In workshop 3 we explored how genetically distinct populations need to be separated, else population analysis results will be confounded.

Specifically, we need to identify cryptic taxa for genomic surveillance and vector control. Cryptic taxa may differ from known taxa in medically important phenotypes, e.g., biting times, vector competence or insecticide resistance. Vector control methods that work for some vector taxa may fail to control others.

Setup#

Before we begin the analysis, let’s set up the Python packages we’ll need to use.

First install and import the malariagen_data package.

!pip install -q --no-warn-conflicts malariagen_data
import malariagen_data
import os

Some analyses may take a while to complete, particularly if you’re running this code on a service with modest computational resources such as Google Colab. To avoid having to rerun these analyses, we’ll save the results so we can come back to them later. In Google Colab, you can save results to your Google Drive, which will mean you don’t lose results even if you leave the notebook and come back several days later.

Mount your Google Drive - you will need to follow the authorization instructions.

try:
    # if running on colab, mount Google drive
    from google.colab import drive
    drive.mount('drive')
except ImportError:
    pass

With our Google Drive now mounted, we can define and make a directory where we want to save our results.

results_dir = 'drive/MyDrive/Colab Data/module_4_results'
os.makedirs(results_dir, exist_ok=True)

In Google Colab, we can actually see our mounted drive and results directory by clicking on the file tab on the left hand side of the screen.

Next we should set up the malariagen_data package. As we want to save our results in the Google Drive folder we just created, we’ll use the results_cache parameter and assign our results directory to it. If we were running this notebook locally, we could assign a local folder to this parameter and the results would instead be stored on our hard drive.

ag3 = malariagen_data.Ag3(results_cache=results_dir)
ag3
MalariaGEN Ag3 API client
Please note that data are subject to terms of use, for more information see the MalariaGEN website or contact data@malariagen.net. See also the Ag3 API docs.
Storage URL gs://vo_agam_release/
Data releases available 3.0
Results cache /home/ahernank/github/anopheles-genomic-surveillance.github.io/docs/workshop-4/drive/MyDrive/Colab Data/module_4_results
Cohorts analysis 20230516
AIM analysis 20220528
Site filters analysis dt_20200416
Software version malariagen_data 7.11.0
Client location unknown

Remember to check Client location in the output above - our cloud data is stored in the US, so we want our Google Colab virtual machine (VM) to be based in the US too. If your client location is somewhere else in the world, select Runtime then Disconnect and delete runtime from the toolbar at the top of the notebook, then rerun the notebook from the top. This will ensure our analyses run as fast as possible.

Step 1 - Ancestry informative marker (AIM) analysis#

Before we investigate what taxa are present in our dataset in detail, we first make provisional species calls. We could do this using the single marker molecular typing results that contributors often supply when sending samples, generated by assays such as Scott et al. (1993) and Santolamazza et al. (2008). However, as we have discussed in the previous module, these single markers are a blunt tool - the marker may indicate one species, while the rest of the genome indicates a different species entirely.

Rather than a single marker, we use multiple ancestry-informative SNP markers (AIMs) taken from across the genome. These are derived by taking a population of each species and looking for the biallelic SNPs that have different alleles fixed (or almost fixed) in the different species. Depending on how diverged the mosquito species are, this method gives us hundreds or thousands of ancestry informative marker SNPs across the genome.
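As a rough illustration of the idea, here is a minimal sketch of how candidate AIMs could be picked from allele frequencies in two reference populations. The frequencies, the function name select_aims and the min_diff threshold are all made up for illustration; this is not the procedure or the cut-off used to derive the Ag3.0 AIMs.

```python
# Hypothetical alternate-allele frequencies at five biallelic SNPs,
# estimated in a reference population of each species (made-up numbers).
freq_species_a = [0.00, 0.02, 0.50, 1.00, 0.97]
freq_species_b = [1.00, 0.99, 0.55, 0.00, 0.40]


def select_aims(freq_a, freq_b, min_diff=0.95):
    """Return indices of SNPs where the two species differ in allele
    frequency by at least min_diff, i.e. where different alleles are
    fixed or almost fixed in each species.

    min_diff is an illustrative threshold, not a documented value.
    """
    return [
        i
        for i, (fa, fb) in enumerate(zip(freq_a, freq_b))
        if abs(fa - fb) >= min_diff
    ]


aim_indices = select_aims(freq_species_a, freq_species_b)
print(aim_indices)
```

SNPs where both species carry the same allele at similar frequency (the third SNP here) tell us nothing about ancestry, so they are not selected.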

With these AIMs, we can use the percentage of AIM alleles in an individual to assign provisional species. E.g., currently we assign any sample with 85% or more An. arabiensis AIM alleles as An. arabiensis. These AIM fractions can be found in the sample metadata in the “aim_species_fraction_arab” and “aim_species_fraction_colu” columns. The provisional species calls made from these AIM data can be found in the sample metadata “aim_species” column.
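The thresholding step can be sketched as follows. Only the 85% upper cut-off for An. arabiensis comes from the text above; the function name and the symmetric 15% lower cut-off are assumptions made for illustration, and the labels mirror the values seen in the “aim_species_gambcolu_arabiensis” column.

```python
def call_gambcolu_vs_arab(fraction_arab, upper=0.85, lower=0.15):
    """Make a provisional call from the fraction of An. arabiensis AIM alleles.

    upper=0.85 follows the text; lower=0.15 is an assumed, symmetric
    cut-off used here only for illustration.
    """
    if fraction_arab >= upper:
        return "arabiensis"
    if fraction_arab <= lower:
        return "gambcolu"
    return "intermediate"


# Fractions for two samples that appear in the sample metadata in this module.
print(call_gambcolu_vs_arab(0.000958))  # AR0047-C
print(call_gambcolu_vs_arab(0.494832))  # AC0198-C
```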

Let’s remind ourselves of what the sample metadata looks like.

sample_meta_df = ag3.sample_metadata()
sample_meta_df.head()
sample_id partner_sample_id contributor country location year month latitude longitude sex_call ... admin1_name admin1_iso admin2_name taxon cohort_admin1_year cohort_admin1_month cohort_admin1_quarter cohort_admin2_year cohort_admin2_month cohort_admin2_quarter
0 AR0047-C LUA047 Joao Pinto Angola Luanda 2009 4 -8.884 13.302 F ... Luanda AO-LUA Luanda coluzzii AO-LUA_colu_2009 AO-LUA_colu_2009_04 AO-LUA_colu_2009_Q2 AO-LUA_Luanda_colu_2009 AO-LUA_Luanda_colu_2009_04 AO-LUA_Luanda_colu_2009_Q2
1 AR0049-C LUA049 Joao Pinto Angola Luanda 2009 4 -8.884 13.302 F ... Luanda AO-LUA Luanda coluzzii AO-LUA_colu_2009 AO-LUA_colu_2009_04 AO-LUA_colu_2009_Q2 AO-LUA_Luanda_colu_2009 AO-LUA_Luanda_colu_2009_04 AO-LUA_Luanda_colu_2009_Q2
2 AR0051-C LUA051 Joao Pinto Angola Luanda 2009 4 -8.884 13.302 F ... Luanda AO-LUA Luanda coluzzii AO-LUA_colu_2009 AO-LUA_colu_2009_04 AO-LUA_colu_2009_Q2 AO-LUA_Luanda_colu_2009 AO-LUA_Luanda_colu_2009_04 AO-LUA_Luanda_colu_2009_Q2
3 AR0061-C LUA061 Joao Pinto Angola Luanda 2009 4 -8.884 13.302 F ... Luanda AO-LUA Luanda coluzzii AO-LUA_colu_2009 AO-LUA_colu_2009_04 AO-LUA_colu_2009_Q2 AO-LUA_Luanda_colu_2009 AO-LUA_Luanda_colu_2009_04 AO-LUA_Luanda_colu_2009_Q2
4 AR0078-C LUA078 Joao Pinto Angola Luanda 2009 4 -8.884 13.302 F ... Luanda AO-LUA Luanda coluzzii AO-LUA_colu_2009 AO-LUA_colu_2009_04 AO-LUA_colu_2009_Q2 AO-LUA_Luanda_colu_2009 AO-LUA_Luanda_colu_2009_04 AO-LUA_Luanda_colu_2009_Q2

5 rows × 30 columns

There are several columns in the metadata which provide data on AIMs. Let’s take a look.

aim_columns = [c for c in sample_meta_df if c.startswith("aim_")]
aim_columns
['aim_species_fraction_arab',
 'aim_species_fraction_colu',
 'aim_species_fraction_colu_no2l',
 'aim_species_gambcolu_arabiensis',
 'aim_species_gambiae_coluzzii',
 'aim_species']
sample_meta_df[["sample_id"] + aim_columns].head()
sample_id aim_species_fraction_arab aim_species_fraction_colu aim_species_fraction_colu_no2l aim_species_gambcolu_arabiensis aim_species_gambiae_coluzzii aim_species
0 AR0047-C 0.000958 0.856734 0.945545 gambcolu coluzzii coluzzii
1 AR0049-C 0.000767 0.821967 0.935484 gambcolu coluzzii coluzzii
2 AR0051-C 0.000766 0.825899 0.938640 gambcolu coluzzii coluzzii
3 AR0061-C 0.001724 0.827187 0.940100 gambcolu coluzzii coluzzii
4 AR0078-C 0.000574 0.816224 0.928867 gambcolu coluzzii coluzzii

Note the “aim_species” column - this has the value of our provisional species call made from looking at AIM genotype calls. Let’s see all the values this can take:

sample_meta_df.groupby("aim_species").size()
aim_species
arabiensis                           368
coluzzii                             751
gambiae                             1613
intermediate_gambcolu_arabiensis       1
intermediate_gambiae_coluzzii        348
dtype: int64

In the previous module we looked at how we can visualise the AIMs underlying this provisional species call with an AIM plot, using the handy ag3.plot_aim_heatmap() function in the malariagen_data Python package. Let’s remind ourselves of how this function works, and what the AIMs look like.

ag3.plot_aim_heatmap?

To visualise the AIM calls as a heatmap we just need to specify which set of markers we want to use (aims), either “gambcolu_vs_arab” or “gamb_vs_colu”. There’s also a sample_sets parameter and a sample_query parameter to select which samples to look at.

Let’s look again at the “gamb_vs_colu” AIMs from the “AG1000G-BF-A” sample set of mosquitoes from Burkina Faso.

ag3.plot_aim_heatmap(
    aims="gamb_vs_colu", 
    sample_sets="AG1000G-BF-A"
)

In the plot above, the subplots represent different contigs (chromosome arms) and the rows are individual samples. These plots are interactive and hovering the mouse pointer brings up data about the AIM variant index (column), mosquito sample ID (row), and AIM genotype (here 0 means homozygous gamb/gamb, 1 means heterozygous gamb/colu and 2 means homozygous colu/colu).

In the “gamb_vs_colu” plot, blue represents homozygous An. gambiae genotypes, red represents homozygous An. coluzzii genotypes, and in yellow we see genotypes that are heterozygous for the gambiae and coluzzii alleles.

Despite the fact that many samples are affected by the known introgression event on chromosome arm 2L, in most cases we still make fairly clear provisional species assignments using the AIMs. In fact, because this introgression is so common, we actually ignore chromosome arm 2L when making provisional species calls between gambiae and coluzzii. Here are the provisional assignments for this sample set:

sample_meta_df.query("sample_set == 'AG1000G-BF-A'").groupby("aim_species").size()
aim_species
coluzzii                         82
gambiae                          98
intermediate_gambiae_coluzzii     1
dtype: int64
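To see why leaving out chromosome arm 2L matters, here is a small sketch using made-up AIM genotypes for a single sample. The genotype coding follows the heatmap (0, 1, 2 = number of coluzzii alleles); the data and the function name are hypothetical, and the exact way the pipeline computes the “aim_species_fraction_colu” and “aim_species_fraction_colu_no2l” columns is assumed rather than documented here.

```python
# Hypothetical AIM genotypes for one sample, grouped by contig.
# 0 = gamb/gamb, 1 = gamb/colu, 2 = colu/colu, None = missing call.
genotypes_by_contig = {
    "2R": [2, 2, 2, 1],
    "2L": [0, 0, 0, 0],  # gambiae ancestry introgressed on 2L
    "3R": [2, 2, None, 2],
    "3L": [2, 2, 2, 2],
    "X": [2, 2, 2, 2],
}


def colu_fraction(genotypes, exclude=()):
    """Fraction of called AIM alleles that are coluzzii alleles,
    optionally skipping the contigs named in exclude."""
    n_colu = 0
    n_called = 0
    for contig, calls in genotypes.items():
        if contig in exclude:
            continue
        for g in calls:
            if g is None:
                continue  # ignore missing genotype calls
            n_colu += g
            n_called += 2  # two alleles per diploid genotype
    return n_colu / n_called


fraction_all = colu_fraction(genotypes_by_contig)
fraction_no2l = colu_fraction(genotypes_by_contig, exclude=("2L",))
print(round(fraction_all, 3), round(fraction_no2l, 3))
```

In this toy example the introgressed 2L arm pulls the genome-wide fraction down, while the fraction computed without 2L stays close to 1, which is the same pattern we see for the Angolan samples in the metadata table earlier in this module.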

Let’s look at another AIM plot, this time using the “gambcolu_vs_arab” AIMs to look at mosquitoes from Uganda.

ag3.plot_aim_heatmap(
    aims="gambcolu_vs_arab", 
    sample_sets="AG1000G-UG"
)

In the “gambcolu_vs_arab” plot, green represents genotypes which are homozygous for the An. arabiensis allele; purple represents genotypes homozygous for the allele found in An. gambiae and An. coluzzii, and in yellow we again see genotypes that are heterozygous.

Though the AIMs are not 100% informative due to how they have been obtained (see module 3), we can see lots of samples where most of the AIMs are homozygous for one species. This is what we expected, as these samples should belong to one of these known taxa.

Sometimes, however, we see samples where the species is not clear. We can see a sample like this (AC0198-C) in the Uganda plot above. This sample has heterozygous genotypes at almost all AIMs, and so gets assigned as “intermediate_gambcolu_arabiensis” in the sample metadata “aim_species” column.

sample_meta_df.query("sample_id == 'AC0198-C'")[["sample_id"] + aim_columns]
sample_id aim_species_fraction_arab aim_species_fraction_colu aim_species_fraction_colu_no2l aim_species_gambcolu_arabiensis aim_species_gambiae_coluzzii aim_species
2682 AC0198-C 0.494832 0.211288 0.2075 intermediate NaN intermediate_gambcolu_arabiensis
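A sample that is heterozygous at almost all AIMs can be picked out by computing the fraction of heterozygous calls. The sketch below uses made-up genotype vectors and a hypothetical helper name; it is only meant to show the calculation, not how the pipeline flags such samples.

```python
def het_fraction(genotypes):
    """Fraction of called AIM genotypes that are heterozygous (coded 1).

    Missing calls are represented as None and are ignored.
    """
    called = [g for g in genotypes if g is not None]
    return sum(1 for g in called if g == 1) / len(called)


# Made-up genotypes: a sample that is heterozygous almost everywhere,
# and a sample that is mostly homozygous for one species' alleles.
mostly_het = [1, 1, 1, 1, 1, 0, 1, 1]
mostly_hom = [0, 0, 0, 0, 1, 0, 0, None]

print(het_fraction(mostly_het), het_fraction(mostly_hom))
```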

Here are counts of the different AIM species assignments in Uganda:

sample_meta_df.query("sample_set == 'AG1000G-UG'").groupby("aim_species").size()
aim_species
arabiensis                           82
gambiae                             207
intermediate_gambcolu_arabiensis      1
dtype: int64

There are other sample sets where there are multiple samples that do not have a clear species assignment, e.g., Guinea-Bissau.

ag3.plot_aim_heatmap(
    aims="gamb_vs_colu", 
    sample_sets="AG1000G-GW"
)

In sample sets like this, there may be many “intermediate” AIM species assignments:

sample_meta_df.query("sample_set == 'AG1000G-GW'").groupby("aim_species").size()
aim_species
gambiae                          28
intermediate_gambiae_coluzzii    73
dtype: int64

Sample sets with many samples assigned an “intermediate” provisional species are flagged for further investigation with principal component analysis (PCA).

Step 2 - Principal component analysis#

Recap: what is principal component analysis?#

In the previous workshop we learnt how PCA can be used to identify genetic structure in a group of samples and we learnt why being able to detect structure is useful for genomic surveillance and vector control.

Fundamentally, PCA is a method for reducing the dimensions of a dataset to make the data easier to interpret. PCA finds axes through the data that describe its variance. In the case of genomic data, this effectively collapses thousands or millions of dimensions (SNPs) down to a handful of principal components, which allow tractable investigation of structure in the data.
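To make this concrete, here is a minimal pure-Python sketch that estimates the first principal component of a tiny, made-up genotype matrix by power iteration. It is for intuition only: the data are invented, the function name is hypothetical, and this is not how malariagen_data computes its PCA.

```python
import math
import random


def first_principal_component(rows, n_iter=100, seed=0):
    """Estimate PC1 of a small samples x SNPs matrix by power iteration.

    Returns (direction, scores): a unit-length vector of SNP loadings and
    the coordinate of each sample along that direction.
    """
    n_samples = len(rows)
    n_snps = len(rows[0])

    # Centre each SNP on its mean so PC1 describes variance, not the mean.
    means = [sum(row[j] for row in rows) / n_samples for j in range(n_snps)]
    centred = [[row[j] - means[j] for j in range(n_snps)] for row in rows]

    def project(direction):
        return [sum(x * w for x, w in zip(row, direction)) for row in centred]

    rng = random.Random(seed)
    direction = [rng.uniform(-1.0, 1.0) for _ in range(n_snps)]
    for _ in range(n_iter):
        scores = project(direction)
        # Equivalent to multiplying the direction by X^T X, then normalising.
        direction = [
            sum(centred[i][j] * scores[i] for i in range(n_samples))
            for j in range(n_snps)
        ]
        norm = math.sqrt(sum(w * w for w in direction))
        direction = [w / norm for w in direction]
    return direction, project(direction)


# Made-up alternate allele counts (0, 1 or 2) for six samples at four SNPs.
# The first three samples and the last three come from different groups.
genotypes = [
    [0, 0, 2, 2],
    [0, 1, 2, 2],
    [0, 0, 2, 1],
    [2, 2, 0, 0],
    [2, 2, 0, 1],
    [2, 1, 0, 0],
]

pc1_direction, pc1_scores = first_principal_component(genotypes)
print([round(s, 2) for s in pc1_scores])
```

The two groups end up on opposite sides of zero along PC1, so a single coordinate per sample is enough to reveal the structure in all four SNPs.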

When we are trying to identify the causes of the apparent “intermediate” taxon samples from the AIM analysis, the way PCA reveals the structure of these samples can be very helpful.

Signals of hybridisation - Uganda#

In the Ugandan “gambcolu_vs_arab” AIM plot we saw an individual that appeared to be heterozygous for AIMs across its genome. Let’s run a PCA on this sample set and see how this individual appears.

Let’s briefly remind ourselves of the pca function documentation.

ag3.pca?

Let’s define some parameters to use in all the PCA computations.

region = "3L:15,000,000-41,000,000"
n_snps = 100_000

Now run a PCA with Ugandan samples.

pca_df, evr = ag3.pca(
    region=region, 
    n_snps=n_snps, 
    sample_sets="AG1000G-UG"
)

We can look at the explained variance ratio array (evr) to get a feeling for how many principal components we are interested in for this particular PCA. This array contains the proportion of the total variance in the dataset explained by each principal component. The easiest way to do this is to plot the array using the malariagen_data function plot_pca_variance().

ag3.plot_pca_variance?
ag3.plot_pca_variance(evr)

Where the variance plot flattens out is a good rule of thumb as to where principal components may be more noise than signal of structure. In this case it looks like PC1 explains much of the variance in the data relative to all other PCs.
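The sketch below shows what an explained variance ratio is and one crude way to locate the point where it flattens out. The variances, the helper names and the min_drop threshold are all invented for illustration; in practice we judge this by eye from the variance plot.

```python
def explained_variance_ratio(component_variances, total_variance):
    """Proportion of the total variance explained by each component."""
    return [v / total_variance for v in component_variances]


def n_components_before_plateau(evr, min_drop=0.01):
    """Count the leading components before the curve flattens, i.e. before
    the drop from one component to the next falls below min_drop.

    min_drop is an arbitrary, illustrative threshold.
    """
    for i in range(len(evr) - 1):
        if evr[i] - evr[i + 1] < min_drop:
            return i
    return len(evr)


# Made-up variances for the first five components of a dataset whose
# total variance is 80 (the rest is spread over later components).
toy_evr = explained_variance_ratio([40, 10, 8, 2, 2], total_variance=80)
n_keep = n_components_before_plateau(toy_evr)
print(toy_evr, n_keep)
```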

Now we can plot the PCA data using the malariagen_data package.

ag3.plot_pca_coords?

We just need to plug in our pca_df pandas DataFrame. We can also colour our points using the “gambcolu_vs_arab” AIM-derived provisional taxon data to help us interpret the plot.

ag3.plot_pca_coords(pca_df, color="aim_species")

With the points coloured by our AIM analysis, we see that principal component 1 (PC1), which describes the most variation in our data, has detected structure driven by species. The separate An. gambiae and An. arabiensis clusters demonstrate a strong degree of reproductive isolation between the species in Uganda.

Equidistant from the two species clusters, we can see a single sample - AC0198-C. The AIM analysis defined this individual as being “intermediate_gambcolu_arabiensis” because it did not carry enough An. arabiensis or An. gambiae/coluzzii AIM alleles to be classified as either. The position of this sample, on the same axis of variation (PC1) as our two species clusters, suggests that this individual is an An. gambiae X An. arabiensis hybrid. Furthermore, the fact that the sample falls at the approximate mid-point between the two main clusters suggests that this individual is an F1 (first filial generation) hybrid. A multi-generational backcrossed individual (>F1) would fall on the same axis but appear closer to one species cluster or the other, depending on which species had been backcrossed into.
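The “mid-point” reasoning can be expressed as a relative position along PC1 between the two cluster centroids. The coordinates below are invented for illustration and the helper names are hypothetical; real values would come from the PC1 column of pca_df.

```python
def centroid(values):
    """Mean coordinate of a cluster along one principal component."""
    return sum(values) / len(values)


def relative_position(sample_coord, centroid_a, centroid_b):
    """Position of a sample between two centroids on the same axis:
    0 means at centroid_a, 1 means at centroid_b, 0.5 is the mid-point."""
    return (sample_coord - centroid_a) / (centroid_b - centroid_a)


# Made-up PC1 coordinates for two species clusters and one extra sample.
pc1_gambiae = [-40.0, -42.0, -38.0]
pc1_arabiensis = [60.0, 58.0, 62.0]
pc1_sample = 11.0

position = relative_position(
    pc1_sample, centroid(pc1_gambiae), centroid(pc1_arabiensis)
)
print(round(position, 2))
```

A value near 0.5 is what we would expect for an F1 hybrid, while a backcrossed individual would give a value closer to 0 or 1.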

In the sample metadata, the “taxon” column contains the results of these PCA-based cryptic species analyses.

In this case (AG1000G-UG), we don’t need to alter our provisional AIM species assignments, so the “taxon” column matches “aim_species” for the An. gambiae and An. arabiensis samples. The single intermediate sample is recorded as “unassigned”.

sample_meta_df.query("sample_set == 'AG1000G-UG'").groupby(["aim_species", "taxon"]).size()
aim_species                       taxon     
arabiensis                        arabiensis     82
gambiae                           gambiae       207
intermediate_gambcolu_arabiensis  unassigned      1
dtype: int64

Signals of cryptic species - Tanzania#

Let’s look at a different situation, where we think there is evidence for a cryptic species that we weren’t previously aware of.

Let’s plot the “gamb_vs_colu” (An. gambiae vs An. coluzzii) AIMs for the Tanzanian sample set, excluding samples assigned as An. arabiensis.

ag3.plot_aim_heatmap(
    aims="gamb_vs_colu", 
    sample_sets="AG1000G-TZ", 
    sample_query="aim_species != 'arabiensis'"
)

We can see that a number of samples show mixed ancestry (blue, red and yellow genotypes) on chromosomes 2 and 3, but appear homozygous for An. gambiae ancestry on the X chromosome.

In the AIM analysis, these samples were provisionally labelled as being “intermediate_gambiae_coluzzii” in the “aim_species” column of the sample metadata.

This result is particularly interesting, as the An. coluzzii species range does not reach this far East. Let’s investigate this sample set using PCA.

pca_df, evr = ag3.pca(
    region=region, 
    n_snps=n_snps, 
    sample_sets="AG1000G-TZ",
)

We should plot the variance array to get an idea which principal components we should look at.

ag3.plot_pca_variance(evr)

For these samples, the explained variance doesn’t flatten out until PC4, which suggests that we should look at the first three PCs. We could make multiple 2D scatter plots to investigate these three PCs, e.g., PC1 vs. PC2 and PC2 vs. PC3. But it would be easier to interpret the results if we could make one 3D scatter plot and visualise all three PCs together. There is a function in malariagen_data that makes this very simple.

ag3.plot_pca_coords_3d?

The parameters are very similar to the 2D PCA plotting function, except here we have another axis parameter “z” that we can assign another principal component to. Let’s colour the points by aim_species as we did before.

ag3.plot_pca_coords_3d(pca_df, x="PC1", y="PC2", z="PC3", color="aim_species")

Under the hood, malariagen_data is building these plots with Plotly, so they are interactive. The plot can be rotated by clicking on the plot and moving the mouse and scrolling allows zooming in and out. Holding the mouse pointer over a point will reveal metadata about that sample. Let’s try to interpret this plot one PC at a time.

PC1#

The first principal component has separated all the “arabiensis” samples from all other “aim_species”. This is what we might expect when we sample two species, due to reproductive isolation.

PC2#

The second principal component has pulled all of the “intermediate_gambiae_coluzzii” samples, as well as some “gambiae” samples, away from all other samples. This is striking: the cluster that includes the intermediate samples is separated on its own axis of variation.

PC3#

The third principal component splits two clusters of “gambiae” samples. If we look at the metadata attached to samples in these two clusters, we can see that in one cluster, samples were collected in Muleba and in the other, the samples were collected in Muheza. This looks like a classic case of geographic isolation between the An. gambiae populations from these two sites.

Interpretation#

There is clearly a lot to unpack when it comes to the population structure we find in this sample set. The strongest signal of variance (PC1) is being driven by reproductive isolation between An. arabiensis and other species.

What is interesting here is that all the “arabiensis” samples are gathered together in a single cluster, but there are two clusters of “gambiae” (separated by PC3). Let’s have a look at some ecological information about the collection sites in Tanzania (the code for this figure can be found here).

Our “gambiae” samples come from Muheza and Muleba, in the East and West of the country respectively. We might conclude that our two clusters of “gambiae” in our PCA are being driven simply by the geographic distance separating the sample sites. However, “arabiensis” have also been collected from sites on either side of the country, and yet all these “arabiensis” samples cluster together.

There is a body of research showing An. arabiensis has a higher aridity tolerance than An. gambiae (e.g. Gray & Bradley (2005)).

One explanation of our results, consistent with these other findings, is that Muheza and Muleba are separated by a brown (more arid) region that runs approximately North to South, splitting the country in two. The two “gambiae” populations are thus separated by a barrier of unsuitable environment, enabling the evolution of structure (two clusters in our PCA). As An. arabiensis can tolerate the drier region, gene flow can occur across the country, resulting in less within-species structure (one cluster in our PCA).

But what about the other cluster on our PCA, containing intermediate samples separated from other clusters by PC2? From studies of species distributions we know that there are no An. coluzzii in Tanzania, and these samples do not look like hybrids between An. gambiae and An. arabiensis, as they are not in between the species clusters pulled apart by PC1 (like our hybrid sample in Uganda).

Perhaps these samples belong to a cryptic species. If they belong to a species we don’t have AIMs for, they could be labelled as intermediate in the AIM analysis. However, if this cluster represents a cryptic species, why are there four samples labelled as “gambiae” also present in the cluster?

  • BL0357-C

  • BL0366-C

  • BL0370-C

  • BL0384-C

To dig into this further, we can make the same PCA plot as before, but this time colour the points by “aim_species_fraction_colu”, which, as the name suggests, is the fraction of AIM alleles from the “gamb_vs_colu” analysis that indicate An. coluzzii ancestry.

Colour by AIM fraction#

ag3.plot_pca_coords_3d(
    pca_df, 
    x="PC1", 
    y="PC2", 
    z="PC3", 
    color="aim_species_fraction_colu"
)

When we colour the points by fraction coluzzii ancestry we can see that the “gambiae” samples which cluster with the “intermediate_gambiae_coluzzii” actually have relatively high “coluzzii” ancestry (lighter purple) compared with the other “gambiae” (darker purple) in Tanzania. This means that although they had been provisionally labelled as “gambiae” they were at the lower end of the AIM % cut-off.

As mentioned earlier, the AIMs are not 100% informative, and the species cut-offs we use are somewhat arbitrary. This is why we use the AIM analyses only to give provisional species calls, and then follow up with PCA to give us a more nuanced picture of both structure and taxa.

Labelling cryptic taxa#

With the evidence we have collected on this “intermediate_gambiae_coluzzii” cluster of samples in Tanzania, we would re-label these samples as members of a cryptic taxon. In the sample metadata, the taxon column for these samples has consequently been changed to “gcx3”. This stands for “gambiae complex cryptic taxon 3”, as it is actually the third cryptic taxon we have identified in the Ag3.0 samples.

We can see this re-labelling if we run the same PCA but colour our samples by “taxon”.

ag3.plot_pca_coords_3d(pca_df, x="PC1", y="PC2", z="PC3", color="taxon")

We can also see how some of the AIM species assignments are converted into a cryptic taxon assignment. I.e., there are some samples labelled as either “gambiae” or “intermediate_gambiae_coluzzii” in the aim_species column, which get assigned as “gcx3” in the taxon column.

sample_meta_df.query("sample_set == 'AG1000G-TZ'").groupby(["aim_species", "taxon"]).size()
aim_species                    taxon     
arabiensis                     arabiensis    225
gambiae                        gambiae        64
                               gcx3            4
intermediate_gambiae_coluzzii  gcx3            7
dtype: int64

The importance of identifying cryptic taxa#

Our principal component analysis has highlighted population structure that appears to be due to a previously unknown cryptic species in Tanzania. Can we use our genomic data to identify operationally important differences in this taxon?

In module 4 of the first workshop, we used the malariagen_data package to plot SNP allele frequencies in a gene by cohort, as a heatmap. Let’s do that for the Vgsc gene (target of pyrethroid insecticides) in our Tanzanian “gambiae” and “gcx3” cohorts.

ag3.aa_allele_frequencies?
aa_allele_freqs_df = ag3.aa_allele_frequencies(
    transcript="AGAP004707-RD", 
    cohorts="admin1_year", 
    sample_sets="AG1000G-TZ",
    sample_query="taxon != 'arabiensis'",
)
aa_filt_df = aa_allele_freqs_df.query("max_af > 0.05")
ag3.plot_frequencies_heatmap(aa_filt_df)

The heatmap shows that the gcx3 cohort of samples (“TZ-25-gcx3-2013”) does not carry any pyrethroid target-site resistance alleles, whereas the “gambiae” cohort from the same collection year and region (“TZ-25-gamb-2013”) carries the kdr (knock-down resistance) allele L995S at a frequency of 41%.

This may mean the gcx3 mosquitoes have very different insecticide resistance profiles to the gambiae mosquitoes from the same location. Consequently, it is important to consider them separately for further investigations into the best vector control strategies.
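For intuition, the sketch below shows the kind of calculation behind a per-cohort allele frequency table and the max_af filter used above. The cohort and variant names and all counts are invented, and the data layout is a simplification rather than the structure returned by ag3.aa_allele_frequencies().

```python
# Hypothetical allele counts: cohort -> variant -> (alt allele count, alleles called).
allele_counts = {
    "cohort_1": {"variant_A": (41, 100), "variant_B": (1, 100)},
    "cohort_2": {"variant_A": (0, 20), "variant_B": (0, 20)},
}


def frequencies_by_variant(counts):
    """Reshape the counts into variant -> cohort -> allele frequency."""
    table = {}
    for cohort, variants in counts.items():
        for variant, (n_alt, n_called) in variants.items():
            table.setdefault(variant, {})[cohort] = n_alt / n_called
    return table


def filter_by_max_af(table, min_max_af=0.05):
    """Keep variants whose highest frequency in any cohort exceeds min_max_af."""
    return {
        variant: freqs
        for variant, freqs in table.items()
        if max(freqs.values()) > min_max_af
    }


freq_table = frequencies_by_variant(allele_counts)
filtered = filter_by_max_af(freq_table)
print(filtered)
```

A variant stays in the table if it is common in at least one cohort, which is why a row can show a high frequency in one cohort and zero in another.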

Taxon analysis workflow - recap#

1. AIMs analysis#

  • First compute AIMs and use these to make provisional species calls.

  • Look for sample sets where there’s evidence of “intermediate” samples.

2. PCA analysis#

  • Run PCAs for each suspect sample set. Adding sample sets from nearby countries with clean species calls can help as a comparison.

  • Look for distinct clusters on PCA plots containing the “intermediate” samples.

3. Interpret#

  • Do these intermediate samples fall between clusters of known species?

    • If yes, they could be hybrids of the two known species.

  • Is there a PCA dimension which pulls the cluster out uniquely?

    • If yes, they could belong to a cryptic taxon
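The interpretation step above can be sketched as a small geometric check on cluster centroids in PC space. Everything here is an illustrative assumption: the function name, the max_offset and margin tolerances, and the made-up centroids. It is a way of thinking about the plots, not a replacement for inspecting them.

```python
import math


def interpret_cluster(cluster, species_a, species_b, max_offset=0.2, margin=0.1):
    """Suggest an interpretation for a cluster of intermediate samples.

    The first three arguments are centroids given as equal-length tuples of
    PC coordinates. max_offset and margin are arbitrary tolerances, both
    relative to the distance between the two species centroids.
    """
    axis = [b - a for a, b in zip(species_a, species_b)]
    rel = [c - a for a, c in zip(species_a, cluster)]
    axis_len_sq = sum(x * x for x in axis)

    # Position along the line joining the species (0 = species_a, 1 = species_b).
    t = sum(r * x for r, x in zip(rel, axis)) / axis_len_sq
    # Distance from that line, relative to the distance between the species.
    residual = [r - t * x for r, x in zip(rel, axis)]
    offset = math.sqrt(sum(x * x for x in residual) / axis_len_sq)

    if offset > max_offset:
        return "possible cryptic taxon"
    if margin < t < 1.0 - margin:
        return "possible hybrids"
    return "unclear"


# Made-up centroids on (PC1, PC2, PC3).
species_a = (-40.0, 0.0, 0.0)
species_b = (60.0, 0.0, 0.0)

print(interpret_cluster((11.0, 1.0, 0.0), species_a, species_b))
print(interpret_cluster((-30.0, 50.0, 5.0), species_a, species_b))
```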

Well done!#

  • In this module we have followed MalariaGEN’s mosquito taxon analysis pipeline and the reasoning behind it.

  • Run and plotted both AIM and PCA analyses, and learnt how to interpret the results.

  • Learnt how to detect cryptic taxa.

Exercises#

English#

Open this notebook in Google Colab and run it for yourself from top to bottom. As you run through the notebook, cell by cell, think about what each cell is doing.

Hint: To open the notebook in Google Colab, click the rocket icon at the top of the page, then select “Colab” from the drop-down menu.

Next, run a taxon analysis workflow for Guinea Bissau - “AG1000G-GW”.

  • Plot the “gamb_vs_colu” AIMs for Guinea Bissau.

    • How might we interpret this plot?

  • Run and plot a PCA for Guinea Bissau.

    • How many PCs should we investigate?

    • How might we interpret the PCA plot? Hint: try colouring markers by “aim_species_fraction_colu”

  • Add the Burkina Faso A sample set to the PCA analysis - [“AG1000G-GW”, “AG1000G-BF-A”].

    • Does this help clarify the situation, if so why?

    • Could we have a hybrid cluster or a cryptic taxon cluster?

    • Explain why you think that.

Français#

Ouvrez ce notebook dans Google Colab et exécutez-le vous-même du début à la fin. Pendant que vous exécutez le notebook, cellule par cellule, considérez ce que chaque cellule fait.

Indice: Pour ouvrir ce notebook dans Google Colab, cliquez sur l’icône “fusée” en haut de la page et sélectionnez “Colab” dans le menu déroulant.

Ensuite, exécutez l’analyse des taxons pour la Guinée-Bissau – “AG1000G-GW”.

  • Afficher le diagramme des AIMs “gamb_vs_colu” pour la Guinée-Bissau.

    • Comment interpréter ce diagramme?

  • Exécuter et afficher la PCA pour la Guinée-Bissau.

    • Combien de PCs doivent être étudiés?

    • Comment interpréter le diagramme de la PCA? Indice: essayer de colorer les marqueurs en fonction de “aim_species_fraction_colu”.

  • Ajouter l’ensemble de données Burkina Faso A à l’analyse PCA – [“AG1000G-GW”, “AG1000G-BF-A”].

    • Est-ce que cela aide à rendre la situation plus claire? Si oui, pourquoi?

    • Est-ce que vous observez un groupe d’hybrides ou un groupe d’un taxon cryptique?

    • Pourquoi pensez-vous cela?