banner

Workshop 2 - Training course in data analysis for genomic surveillance of African malaria vectors


Module 1 - Introduction to pandas DataFrames#

Theme: Tools & Technology

In this module we’re going to take a closer look at a Python package called pandas, which is an essential part of our data analysis toolkit. We use pandas whenever we need to store, manipulate or analyse tables of data, also known as DataFrames. In this module we’ll focus on using pandas to analyse genomic data, but it can be used to analyse any kind of data, so it’s a very useful tool to have at your fingertips.

Learning objectives#

After this module you will be able to:

  • Explain what a DataFrame is

  • Access columns and rows from a DataFrame

  • Use indexes and queries

  • Write a DataFrame to a file

Please note that this module will cover only a small part of the features that pandas provides - see the links at then end of this notebook if you’d like to go further.

Lecture#

English#

Français#

Please note that the code in the cells below might differ from that shown in the video. This can happen because Python packages and their dependencies change due to updates, necessitating tweaks to the code.

Setup#

Let’s install the malariagen_data package so we have easy access to some genomic data to demonstrate with. Pandas is already installed on colab.

%pip install -q --no-warn-conflicts malariagen_data

Now import the malariagen_data and pandas packages. Note that by convention we use pd as an abbreviation for pandas.

import malariagen_data
import pandas as pd

Set up access to the MalariaGEN Ag3.0 data resource.

ag3 = malariagen_data.Ag3()
ag3
MalariaGEN Ag3 API client
Please note that data are subject to terms of use, for more information see the MalariaGEN website or contact data@malariagen.net. See also the Ag3 API docs.
Storage URL gs://vo_agam_release/
Data releases available 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8
Results cache None
Cohorts analysis 20231215
AIM analysis 20220528
Site filters analysis dt_20200416
Software version malariagen_data 8.7.0
Client location unknown

pandas DataFrames#

We use the pandas DataFrame class whenever we want to store and analyse data that can be structured as a table with rows and columns. Let’s take a look at some examples of DataFrames that we met in the first workshop of this training course.

Example: sample metadata#

In workshop 1, module 2 we encountered the sample_metadata() function. This function returns a DataFrame of sample metadata. Let’s call this function and assign the result to the variable df_samples.

df_samples = ag3.sample_metadata(sample_sets="3.0")
                                     

Check what type of object we have received.

type(df_samples)
pandas.core.frame.DataFrame

Inspect the data in the DataFrame by running a code cell containing just the name of the variable.

df_samples
sample_id partner_sample_id contributor country location year month latitude longitude sex_call ... admin1_name admin1_iso admin2_name taxon cohort_admin1_year cohort_admin1_month cohort_admin1_quarter cohort_admin2_year cohort_admin2_month cohort_admin2_quarter
0 AR0047-C LUA047 Joao Pinto Angola Luanda 2009 4 -8.884 13.302 F ... Luanda AO-LUA Luanda coluzzii AO-LUA_colu_2009 AO-LUA_colu_2009_04 AO-LUA_colu_2009_Q2 AO-LUA_Luanda_colu_2009 AO-LUA_Luanda_colu_2009_04 AO-LUA_Luanda_colu_2009_Q2
1 AR0049-C LUA049 Joao Pinto Angola Luanda 2009 4 -8.884 13.302 F ... Luanda AO-LUA Luanda coluzzii AO-LUA_colu_2009 AO-LUA_colu_2009_04 AO-LUA_colu_2009_Q2 AO-LUA_Luanda_colu_2009 AO-LUA_Luanda_colu_2009_04 AO-LUA_Luanda_colu_2009_Q2
2 AR0051-C LUA051 Joao Pinto Angola Luanda 2009 4 -8.884 13.302 F ... Luanda AO-LUA Luanda coluzzii AO-LUA_colu_2009 AO-LUA_colu_2009_04 AO-LUA_colu_2009_Q2 AO-LUA_Luanda_colu_2009 AO-LUA_Luanda_colu_2009_04 AO-LUA_Luanda_colu_2009_Q2
3 AR0061-C LUA061 Joao Pinto Angola Luanda 2009 4 -8.884 13.302 F ... Luanda AO-LUA Luanda coluzzii AO-LUA_colu_2009 AO-LUA_colu_2009_04 AO-LUA_colu_2009_Q2 AO-LUA_Luanda_colu_2009 AO-LUA_Luanda_colu_2009_04 AO-LUA_Luanda_colu_2009_Q2
4 AR0078-C LUA078 Joao Pinto Angola Luanda 2009 4 -8.884 13.302 F ... Luanda AO-LUA Luanda coluzzii AO-LUA_colu_2009 AO-LUA_colu_2009_04 AO-LUA_colu_2009_Q2 AO-LUA_Luanda_colu_2009 AO-LUA_Luanda_colu_2009_04 AO-LUA_Luanda_colu_2009_Q2
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3076 AD0494-C 80-2-o-16 Martin Donnelly Lab Cross LSTM -1 -1 53.409 -2.969 F ... NaN NaN NaN unassigned NaN NaN NaN NaN NaN NaN
3077 AD0495-C 80-2-o-17 Martin Donnelly Lab Cross LSTM -1 -1 53.409 -2.969 M ... NaN NaN NaN unassigned NaN NaN NaN NaN NaN NaN
3078 AD0496-C 80-2-o-18 Martin Donnelly Lab Cross LSTM -1 -1 53.409 -2.969 M ... NaN NaN NaN unassigned NaN NaN NaN NaN NaN NaN
3079 AD0497-C 80-2-o-19 Martin Donnelly Lab Cross LSTM -1 -1 53.409 -2.969 F ... NaN NaN NaN unassigned NaN NaN NaN NaN NaN NaN
3080 AD0498-C 80-2-o-20 Martin Donnelly Lab Cross LSTM -1 -1 53.409 -2.969 M ... NaN NaN NaN unassigned NaN NaN NaN NaN NaN NaN

3081 rows × 32 columns

There are 3,081 rows and 32 columns in this DataFrame. Each row provides data about a single mosquito DNA sample that has been sequenced by MalariaGEN. The columns provide data about those mosquitoes, such as the country and year in which the mosquito was originally collected.

Note that the representation of the DataFrame above only shows some of the data - there are too many rows and columns to show all at once. The ellipsis character (”…”) shows where some data have been hidden.

Example: genome annotations#

In workshop 1, module 3 we encountered the geneset() function. This function returns a DataFrame of genome annotations. Let’s call the function and assign the result to the variable df_geneset.

df_geneset = ag3.geneset()
df_geneset
                                     
contig source type start end score strand phase ID Parent Name description
0 2L VectorBase chromosome 1 49364325 NaN NaN NaN 2L NaN NaN NaN
1 2L VectorBase gene 157348 186936 NaN - NaN AGAP004677 NaN NaN methylenetetrahydrofolate dehydrogenase(NAD ) ...
2 2L VectorBase mRNA 157348 181305 NaN - NaN AGAP004677-RA AGAP004677 NaN NaN
3 2L VectorBase three_prime_UTR 157348 157495 NaN - NaN NaN AGAP004677-RA NaN NaN
4 2L VectorBase exon 157348 157623 NaN - NaN NaN AGAP004677-RA AGAP004677-RB-E4 NaN
... ... ... ... ... ... ... ... ... ... ... ... ...
196140 Y_unplaced VectorBase five_prime_UTR 47932 48111 NaN + NaN NaN AGAP029375-RA NaN NaN
196141 Y_unplaced VectorBase exon 47932 48138 NaN + NaN NaN AGAP029375-RA AGAP029375-RA-E2 NaN
196142 Y_unplaced VectorBase CDS 48112 48138 NaN + 0.0 AGAP029375-PA AGAP029375-RA NaN NaN
196143 Y_unplaced VectorBase exon 48301 48385 NaN + NaN NaN AGAP029375-RA AGAP029375-RA-E3 NaN
196144 Y_unplaced VectorBase CDS 48301 48385 NaN + 0.0 AGAP029375-PA AGAP029375-RA NaN NaN

196145 rows × 12 columns

There are 196,145 rows and 12 columns in this DataFrame. Each row provides data about an annotated feature on the Anopheles gambiae AgamP4 reference genome, such as a gene. The columns provide data about those features, such as the contig (chromosome arm) on which the feature occurs, and the start and end coordinates.

Example: SNP allele frequencies#

In workshop 1, module 4, we encountered the snp_allele_frequencies() function. This function returns a DataFrame of single nucleotide polymorphisms (SNPs) together with their allele frequencies in different mosquito cohorts. Let’s call this function and assign the result to the variable snp_freqs_df.

snp_freqs_df = ag3.snp_allele_frequencies(
    transcript="AGAP004707-RD", 
    cohorts="admin1_year", 
    sample_sets=["AG1000G-BF-A", "AG1000G-BF-B", "AG1000G-BF-C"], 
)
snp_freqs_df
                                     
                                      
                                     
pass_gamb_colu_arab pass_gamb_colu pass_arab frq_BF-07_gamb_2004 frq_BF-09_colu_2012 frq_BF-09_colu_2014 frq_BF-09_gamb_2012 frq_BF-09_gamb_2014 max_af transcript effect impact ref_codon alt_codon aa_pos ref_aa alt_aa label
contig position ref_allele alt_allele aa_change
2L 2358252 C T A32V True True True 0.000000 0.006098 0.0 0.000000 0.0 0.006098 AGAP004707-RD NON_SYNONYMOUS_CODING MODERATE gCa gTa 32.0 A V 2L:2,358,252 C>T (A32V)
2358328 T C NaN True True True 0.000000 0.006098 0.0 0.000000 0.0 0.006098 AGAP004707-RD INTRONIC MODIFIER None None NaN None None 2L:2,358,328 T>C
2358441 A T NaN True True True 0.038462 0.000000 0.0 0.000000 0.0 0.038462 AGAP004707-RD INTRONIC MODIFIER None None NaN None None 2L:2,358,441 A>T
2358468 A C NaN True True True 0.269231 0.000000 0.0 0.000000 0.0 0.269231 AGAP004707-RD INTRONIC MODIFIER None None NaN None None 2L:2,358,468 A>C
2358581 T A NaN True True True 0.192308 0.000000 0.0 0.000000 0.0 0.192308 AGAP004707-RD INTRONIC MODIFIER None None NaN None None 2L:2,358,581 T>A
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2431424 G T G2055V True True True 0.000000 0.000000 0.0 0.005051 0.0 0.005051 AGAP004707-RD NON_SYNONYMOUS_CODING MODERATE gGc gTc 2055.0 G V 2L:2,431,424 G>T (G2055V)
2431518 C A A2086A True True True 0.038462 0.000000 0.0 0.000000 0.0 0.038462 AGAP004707-RD SYNONYMOUS_CODING LOW gcC gcA 2086.0 A A 2L:2,431,518 C>A (A2086A)
2431527 G C V2089V True True True 0.038462 0.000000 0.0 0.000000 0.0 0.038462 AGAP004707-RD SYNONYMOUS_CODING LOW gtG gtC 2089.0 V V 2L:2,431,527 G>C (V2089V)
2431569 C A V2103V True True True 0.038462 0.000000 0.0 0.000000 0.0 0.038462 AGAP004707-RD SYNONYMOUS_CODING LOW gtC gtA 2103.0 V V 2L:2,431,569 C>A (V2103V)
2431613 T C V2118A True True True 0.000000 0.000000 0.0 0.005051 0.0 0.005051 AGAP004707-RD NON_SYNONYMOUS_CODING MODERATE gTc gCc 2118.0 V A 2L:2,431,613 T>C (V2118A)

9675 rows × 18 columns

There are 9,675 rows and 18 columns in this DataFrame. Each row provides data about a SNP. The columns provide data about those SNPs, such as their effect, and the alternate allele frequency in a particular cohort of mosquitoes, e.g., the column named “frq_BF-09_colu_2012” provides the frequency in Anopheles coluzzii mosquitoes collected in the BF-09 (Hauts-Bassins) administrative region of Burkina Faso in 2012.

Aside: variable naming styles#

You may have noticed that in the examples above, whenever we assigned a DataFrame to a variable, the name of the variable either started with the prefix “df_” or ended with the suffix “_df”. Here “df” stands for DataFrame, and is used just as a helpful reminder that the value of the variable is a pandas DataFrame. Using a prefix of suffix on the variable name like this is entirely optional and is up to your own coding style. You can name these variables whatever you like. Different instructors in this course prefer different styles, and you will see different styles being used in different modules.

Example: reading from a CSV file#

You can also create a pandas DataFrame by reading data from a file, such as a comma-separated value (CSV) file, or an Excel workbook file. Let’s use the pandas read_csv() function to read data from a CSV file containing data from a US census that colab provides as example data for you to play with.

df_housing = pd.read_csv("sample_data/california_housing_train.csv")
df_housing
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
0 -114.31 34.19 15.0 5612.0 1283.0 1015.0 472.0 1.4936 66900.0
1 -114.47 34.40 19.0 7650.0 1901.0 1129.0 463.0 1.8200 80100.0
2 -114.56 33.69 17.0 720.0 174.0 333.0 117.0 1.6509 85700.0
3 -114.57 33.64 14.0 1501.0 337.0 515.0 226.0 3.1917 73400.0
4 -114.57 33.57 20.0 1454.0 326.0 624.0 262.0 1.9250 65500.0
... ... ... ... ... ... ... ... ... ...
16995 -124.26 40.58 52.0 2217.0 394.0 907.0 369.0 2.3571 111400.0
16996 -124.27 40.69 36.0 2349.0 528.0 1194.0 465.0 2.5179 79000.0
16997 -124.30 41.84 17.0 2677.0 531.0 1244.0 456.0 3.0313 103600.0
16998 -124.30 41.80 19.0 2672.0 552.0 1298.0 478.0 1.9797 85800.0
16999 -124.35 40.54 52.0 1820.0 300.0 806.0 270.0 3.0147 94600.0

17000 rows × 9 columns

This DataFrame has 17,000 rows and 9 columns. Each row provides data about a block of housing. The columns provide data about the houses within a block, such as the total number of bedrooms and the number of households.

Note that you can upload your own files to Colab if you want to read in and analyse some of your own data with pandas.

Example: creating a DataFrame from scratch#

You can also create a DataFrame by providing the data within your code. To demonstrate this, let’s create a DataFrame with data about characters in a famous science fiction book called The Hitchhiker’s Guide to the Galaxy.

df_hitchhikers = pd.DataFrame(data={
    "name": ["Arthur Dent", "Tricia McMillan", "Ford Prefect", "Zaphod Beeblebrox", "Marvin the Paranoid Android"],
    "taxon": ["human", "human", "betelgeusian", "betelgeusian", "robot"],
    "gender": ["male", "female", "male", "male", None],
    "number_of_heads": [1, 1, 1, 2, 1],
    "intelligence": [100, 180, 120, 30, 5_000_000]
})
df_hitchhikers
name taxon gender number_of_heads intelligence
0 Arthur Dent human male 1 100
1 Tricia McMillan human female 1 180
2 Ford Prefect betelgeusian male 1 120
3 Zaphod Beeblebrox betelgeusian male 2 30
4 Marvin the Paranoid Android robot None 1 5000000

This DataFrame has 5 rows and 5 columns. Each row provides data about a fictional character in the book. The columns provide data about each character, such as their name, taxon and number of heads (some alien species can have more than one).

Structure of a DataFrame#

Hopefully this will already be obvious from the examples above, but a pandas DataFrame is a data structure that is organised into rows and columns. Each DataFrame also has an index, which is a special column (or set of columns) that is shown in bold at the left of a DataFrame, and which we’ll look at in more detail later.

%%html
<img width="90%" height="90%"src="https://storage.googleapis.com/vo_agam_release/reference/training-images/workshop-2/W2M1-1.PNG"/>

Counting rows and columns#

If a DataFrame is large, you may find it useful to know how many rows and columns it contains. The representation of a DataFrame shows at the bottom how many rows and columns are present in the DataFrame, but you can also use the built-in len() function if you want to access the number of rows or columns in your code.

num_rows = len(df_hitchhikers)
num_rows
5
num_cols = len(df_hitchhikers.columns)
num_cols
5

Exercise 1 (English)

  • Count the number of rows and columns in the df_samples DataFrame.

Exercice 1 (Français)

  • Compter le nombre de lignes et de colonnes du DataFrame df_samples.

Accessing columns#

Often you will have a DataFrame with many columns, and you many only need to work with data from a smaller number of columns, or just a single column. Accessing one or more columns from a DataFrame can be done using the square bracket syntax after the name of a DataFrame variable. Let’s see how this works.

Accessing column names#

First, let’s access the names of all the columns in a DataFrame, to see what columns are available. You can do this by accessing the columns property of a DataFrame. For simplicity, let’s also convert the result to a Python list, using the built-in list() function.

col_names = list(df_hitchhikers.columns)
col_names
['name', 'taxon', 'gender', 'number_of_heads', 'intelligence']

This can be useful especially if you have a lot of columns in a DataFrame.

Exercise 2 (English)

  • Access the column names of the df_samples DataFrame.

Exercice 2 (Français)

  • Afficher les noms des colonnes du DataFrame df_samples.

Accessing a single column#

To access a single column, give the column name as a string inside square brackets. Let’s access the “name” column from the DataFrame of book characters, and assign the result to a new variable character_name.

character_name = df_hitchhikers["name"]

When you access a single column or row from a DataFrame, pandas returns a Series object to you. Pandas uses the Series class to hold data either from a single column or a single row, i.e., a one-dimensional sequence of data values.

type(character_name)
pandas.core.series.Series

We can also inspect the data in the column.

character_name
0                    Arthur Dent
1                Tricia McMillan
2                   Ford Prefect
3              Zaphod Beeblebrox
4    Marvin the Paranoid Android
Name: name, dtype: object

Investigating data in a column#

The pandas Series class has some useful functions for investigating the data in a DataFrame.

E.g., let’s access the “taxon” column and use the value_counts() functions to get a summary of the categorical data in the column.

character_taxon = df_hitchhikers["taxon"]
character_taxon.value_counts()
taxon
human           2
betelgeusian    2
robot           1
Name: count, dtype: int64

The pandas Series class also has a describe() function which can be useful for investigating numerical data. Let’s use it with the number_of_heads column.

character_heads = df_hitchhikers["number_of_heads"]
character_heads.describe()
count    5.000000
mean     1.200000
std      0.447214
min      1.000000
25%      1.000000
50%      1.000000
75%      1.000000
max      2.000000
Name: number_of_heads, dtype: float64

The characters in our DataFrame have at least (min) 1 and at most (max) 2 heads. The mean number of heads is 1.2, but of course this a bit silly, because number of heads is not a continuous variable.

Exercise 3 (English)

  • Access the “country” column from the df_samples DataFrame, and then use the value_counts() function to show the number of mosquito samples from each country.

  • Access the “type” column from the df_geneset DataFrame, and then use the value_counts() function to show the number of genome annotations of different types.

  • Access the “effect” column from the snp_freqs_df DataFrame, and then use the value_counts() function to show the number of SNPs with different effects.

Exercice 3 (Français)

  • Obtenir la colonne “country” du DataFrame df_samples et utiliser la fonction value_counts() pour afficher le nombre de moustiques capturés dans chaque pays.

  • Obtenir la colonne “type” du DataFrame df_geneset et utiliser la fonction value_counts() pour afficher le nombre d’annotations du génome de chaque type.

  • Obtenir la colonne “effect” du DataFrame snp_freqs_df et utiliser la fonction value_counts() pour afficher le nombre de SNPs pour chaque effet possible.

Accessing multiple columns#

To access multiple columns, provide the column names as a list of strings. Let’s access the “name”, “taxon” and “number_of_heads” columns from the DataFrame of book characters, and assign the resulting DataFrame to a new variable df_hh.

selected_cols = ["name", "taxon", "number_of_heads"]
df_hh = df_hitchhikers[selected_cols]
df_hh
name taxon number_of_heads
0 Arthur Dent human 1
1 Tricia McMillan human 1
2 Ford Prefect betelgeusian 1
3 Zaphod Beeblebrox betelgeusian 2
4 Marvin the Paranoid Android robot 1

You can also put the list of columns directly inside the square brackets, which would give the same result.

df_hh = df_hitchhikers[["name", "taxon", "number_of_heads"]]
df_hh
name taxon number_of_heads
0 Arthur Dent human 1
1 Tricia McMillan human 1
2 Ford Prefect betelgeusian 1
3 Zaphod Beeblebrox betelgeusian 2
4 Marvin the Paranoid Android robot 1

Exercise 4 (English)

  • Create a new DataFrame containing only the “sample_id”, “country” and “year” columns from the df_samples DataFrame.

  • Create a new DataFrame containing only the “ID”, “contig”, “start” and “end” columns from the df_geneset DataFrame.

  • Create a new DataFrame containing only the frequency columns from the snp_freqs_df DataFrame (hint: the names of these columns all begin with “frq_”).

Exercice 4 (Français)

  • Créer un nouveau DataFrame contenant seulement les colonnes “sample_id”, “country” et “year” du DataFrame df_samples.

  • Créer un nouveau DataFrame contenant seulement les colonnes “ID”, “contig”, “start” et “end” du DataFrame df_geneset.

  • Créer un nouveau DataFrame contenant seulement les colonnes contenant les fréquences du DataFrame snp_freqs_df (Indice: les noms de ces colonnes commencent tous par “frq_”).

Accessing rows#

Now we’ve seen how to access columns, let’s see how to access rows. To access rows by their position within a DataFrame, we can use the iloc property.

Accessing a single row#

To access a single row, we use the iloc property, then provide the row number as an integer inside square brackets, remembering that numbering starts from 0 in Python. For example, access the first row of our DataFrame of book characters, and assign the result to a variable first_row.

first_row = df_hitchhikers.iloc[0]

When accessing a single row, pandas returns a Series.

type(first_row)
pandas.core.series.Series

Inspect the data in the row.

first_row
name               Arthur Dent
taxon                    human
gender                    male
number_of_heads              1
intelligence               100
Name: 0, dtype: object

Access specific data values within the row.

first_row["name"]
'Arthur Dent'
first_row["number_of_heads"]
1

Exercise 5 (English)

  • Access the 81st row in the df_samples DataFrame. What country did this mosquito sample come from? (Remember: numbering starts from 0 in Python.)

  • Access the 73,287th row in the df_geneset DataFrame. What type of annotation is this? What is its Name?

Exercice 5 (Français)

  • Obtenir la 81ème ligne du DataFrame df_samples. De quel pays est issu ce moustique? (Indice: les indices commencent à 0 en Python)

  • Obtenir la 73 287ème ligne du DataFrame df_geneset. De quel type d’annotation s’agit-il? Quel est son nom?

Accessing multiple rows#

We can also use iloc to access multiple rows,from some start position to some end position. To do this we provide two integers separated by a colon (“:”) which is called a slice.

As well as remembering that numbering starts from zero, we also have to remember that slices include the start position but do not include the stop position. E.g., to access the first two rows, provide the start and stop positions 0 and 2. This returns a new DataFrame with just the selected rows within the slice.

first_two_rows = df_hitchhikers.iloc[0:2]
first_two_rows
name taxon gender number_of_heads intelligence
0 Arthur Dent human male 1 100
1 Tricia McMillan human female 1 180

Exercise 6 (English)

  • Access rows starting from the 6th row up to and including the 10th row in the df_samples DataFrame.

Exercice 6 (Français)

  • Obtenir toutes les lignes entre la 6ème et la 10ème (incluses) du DataFrame df_samples.

Indexes#

Every pandas DataFrame has an index, which is a special column (or set of columns) which can be used to access rows quickly and conveniently. In particular, indexes can be used to assign a label to every row, and then use that label to locate and access matching rows.

For example, when working with our DataFrame of book characters, we might want to access rows using the name of the character. To do this, we can set the “name” column as the index.

df_hitchhikers_indexed = df_hitchhikers.set_index("name")
df_hitchhikers_indexed
taxon gender number_of_heads intelligence
name
Arthur Dent human male 1 100
Tricia McMillan human female 1 180
Ford Prefect betelgeusian male 1 120
Zaphod Beeblebrox betelgeusian male 2 30
Marvin the Paranoid Android robot None 1 5000000

The DataFrame df_hitchhikers_indexed has the same data as before, but notice that the “name” column has moved to the left and is shown in bold. This shows us that the “name” column is now being used as the DataFrame index.

Now that we’ve set a useful index, we can use the loc property to locate rows with a particular name. E.g., locate the row where the character’s name is “Tricia McMillan”.

trillian = df_hitchhikers_indexed.loc["Tricia McMillan"]

This indexing action locates a single row, and so pandas returns a Series.

type(trillian)
pandas.core.series.Series

Inspect the row of data we located.

trillian
taxon               human
gender             female
number_of_heads         1
intelligence          180
Name: Tricia McMillan, dtype: object

It’s also useful to know that you can always reset the index back to the default, using the reset_index() function. The default is to index the DataFrame with integers.

df_hitchhikers_reset = df_hitchhikers_indexed.reset_index()
df_hitchhikers_reset
name taxon gender number_of_heads intelligence
0 Arthur Dent human male 1 100
1 Tricia McMillan human female 1 180
2 Ford Prefect betelgeusian male 1 120
3 Zaphod Beeblebrox betelgeusian male 2 30
4 Marvin the Paranoid Android robot None 1 5000000

Exercise 7 (English)

  • Set the “sample_id” column as the index on the df_samples DataFrame, then locate the row with sample identifier “BK0055-C”. Where was this mosquito collected?

  • Set the “ID” column as the index on the df_geneset DataFrame, then locate the row with identifier “AGAP000818”. What is the name of this gene?

Exercice 7 (Français)

  • Faire de la colonne “sample_id” l’index du DataFrame df_samples et ensuite trouver la ligne avec comme identificant d’échantillon “BK0055-C”. D’où est issu ce moustique?

  • Faire de la colonne “ID” l’index du DataFrame df_geneset et ensuite trouver la ligne avec comme identifiant “AGAP000818”. Quel est le nom de ce gène?

Queries#

DataFrames also provide a query() function, which can be used to select rows matching some condition or combination of conditions. The query() function accepts a string containing a query expression.

Let’s query the book characters DataFrame to select only rows where the value of the “taxon” column is “human”. To match a value we use the “==” operator.

df_human_hitchhikers = df_hitchhikers.query("taxon == 'human'")
df_human_hitchhikers
name taxon gender number_of_heads intelligence
0 Arthur Dent human male 1 100
1 Tricia McMillan human female 1 180

Let’s do the opposite, selecting only rows where the value of the “taxon” column is not “human”, using the “!=” operator.

df_non_human_hitchhikers = df_hitchhikers.query("taxon != 'human'")
df_non_human_hitchhikers
name taxon gender number_of_heads intelligence
2 Ford Prefect betelgeusian male 1 120
3 Zaphod Beeblebrox betelgeusian male 2 30
4 Marvin the Paranoid Android robot None 1 5000000

Let’s query a numerical field using the “>” greater than operator.

df_multi_headed_hitchhikers = df_hitchhikers.query("number_of_heads > 1")
df_multi_headed_hitchhikers
name taxon gender number_of_heads intelligence
3 Zaphod Beeblebrox betelgeusian male 2 30

A query expression can combine conditions on multiple columns using “and” and “or”. E.g., find rows where the value of the “taxon” column is “human” and the value of the “number_of_heads” column is 1.

df_non_human_one_headed_hitchhikers = df_hitchhikers.query(
    "taxon != 'human' and number_of_heads == 1"
)
df_non_human_one_headed_hitchhikers
name taxon gender number_of_heads intelligence
2 Ford Prefect betelgeusian male 1 120
4 Marvin the Paranoid Android robot None 1 5000000

Exercise 8 (English)

  • Query the df_samples DataFrame to select mosquitoes collected in Burkina Faso.

  • Query the df_samples DataFrame to select mosquitoes collected in Burkina Faso in 2014.

  • Query the df_samples DataFrame to select Anopheles coluzzii mosquitoes collected in Burkina Faso in 2014 (hint: taxon == 'coluzzii').

  • Query the snp_freqs_df DataFrame to select SNPs which change the protein sequence (hint: effect == 'NON_SYNONYMOUS_CODING').

  • Query the snp_freqs_df DataFrame to select SNPs which change the protein sequence and which occur above 5% frequency in at least one cohort (hint: max_af > 0.05).

Exercice 8 (Français)

  • Sélectionner parmi les moustiques du DataFrame df_samples ceux qui ont été capturés au Burkina Faso.

  • Sélectionner parmi les moustiques du DataFrame df_samples ceux qui ont été capturés au Burkina Faso en 2014.

  • Sélectionner parmi les moustiques du DataFrame df_samples les Anophèles coluzzii qui ont été capturés au Burkina Faso en 2014. (Indice: taxon == 'coluzzii').

  • Sélectionner parmi les SNPs du DataFrame snp_freqs_df ceux qui modifient la séquence d’une protéine (Indice: effect == 'NON_SYNONYMOUS_CODING').

  • Sélectionner parmi les SNPs du DataFrame snp_freqs_df ceux qui modifient la séquence d’une protéine et apparaissent avec une fréquence supérieure à 5% dans au moins une cohorte (Indice: max_af > 0.05).

Input/output#

We saw at the beginning of this module how to read in data from a file. Let’s look finally at how to write data out to a file.

To write a DataFrame out to a CSV file, use the to_csv() function. E.g., let’s write out the book characters DataFrame to a CSV file.

df_hitchhikers.to_csv("hitchhikers.csv")

You can download this file to your local computer by opening the file browser tab to the left in colab.

There are several other file formats you can write to. E.g., you could write to an Excel file. Here, we use the

df_hitchhikers.to_excel("hitchhikers.xlsx")

Any of the DataFrames we created in this module could be written out in the same way. E.g., we could write a DataFrame that we obtained by querying the book characters for only humans.

df_human_hitchhikers.to_excel("hh_human.xlsx")

Exercise 9 (English)

  • Write the df_samples DataFrame out to an Excel file.

  • Query the df_samples DataFrame to select only mosquitoes from Burkina Faso, then write the resulting DataFrame out to a CSV file.

Exercice 9 (Français)

  • Enregistrer le DataFrame df_samples sous forme de fichier Excel.

  • Sélectionner parmi les moustiques du DataFrame df_samples ceux provenant du Burkina Faso et enregistrer le DataFrame ainsi créé sous forme de fichier CSV.

Topics not covered here#

Please note that this module has only covered a few of the many features that pandas supports. Here are a few of the features that we have not covered, but which you might like to learn more about through other online tutorials.

  • Grouping and aggregation

  • Pivot tables

  • Joining and merging

  • Concatenating

  • Advanced indexing (multi-level indexes; Boolean indexing; …)

  • Modifying data

  • Apply

  • Missing data

  • Time series

  • Plotting

  • Text

Further reading#

Here are a few links to online resources providing further information about pandas for new users.