Module 1 - Introduction to pandas DataFrames

Workshop 2 - Training course in data analysis for genomic surveillance of African malaria vectors

Module 1 - Introduction to pandas DataFrames#

Theme: Tools & Technology

In this module we’re going to take a closer look at a Python package called pandas, which is an essential part of our data analysis toolkit. We use pandas whenever we need to store, manipulate or analyse tables of data, also known as DataFrames. In this module we’ll focus on using pandas to analyse genomic data, but it can be used to analyse any kind of data, so it’s a very useful tool to have at your fingertips.

Learning objectives#

After this module you will be able to:

Explain what a DataFrame is
Access columns and rows from a DataFrame
Use indexes and queries
Write a DataFrame to a file

Please note that this module will cover only a small part of the features that pandas provides - see the links at then end of this notebook if you’d like to go further.

Lecture#

English#

Français#

Please note that the code in the cells below might differ from that shown in the video. This can happen because Python packages and their dependencies change due to updates, necessitating tweaks to the code.

Setup#

Let’s install the malariagen_data package so we have easy access to some genomic data to demonstrate with. Pandas is already installed on colab.

%pip install -q --no-warn-conflicts malariagen_data

Now import the malariagen_data and pandas packages. Note that by convention we use pd as an abbreviation for pandas.

Note that authentication is required to access data through the package, please follow the instructions here.

import malariagen_data

import pandas as pd

#import os
#os.environ["MGEN_SHOW_PROGRESS"] = "0"

Set up access to the MalariaGEN Ag3.0 data resource.

ag3 = malariagen_data.Ag3()
ag3

MalariaGEN Ag3 API client
Please note that data are subject to terms of use, for more information see the MalariaGEN website or contact support@malariagen.net. See also the Ag3 API docs.
Storage URL	gs://vo_agam_release_master_us_central1
Data releases available	3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 3.10, 3.11, 3.12, 3.13, 3.14
Results cache	None
Cohorts analysis	20250502
AIM analysis	20220528
Site filters analysis	dt_20200416
Software version	malariagen_data 15.0.1
Client location	Iowa, United States (Google Cloud us-central1)

pandas DataFrames#

We use the pandas DataFrame class whenever we want to store and analyse data that can be structured as a table with rows and columns. Let’s take a look at some examples of DataFrames that we met in the first workshop of this training course.

Example: sample metadata#

In workshop 1, module 2 we encountered the sample_metadata() function. This function returns a DataFrame of sample metadata. Let’s call this function and assign the result to the variable df_samples.

df_samples = ag3.sample_metadata(sample_sets="3.0")

Check what type of object we have received.

type(df_samples)

pandas.core.frame.DataFrame

Inspect the data in the DataFrame by running a code cell containing just the name of the variable.

df_samples

	sample_id	partner_sample_id	contributor	country	location	year	month	latitude	longitude	sex_call	...	admin1_name	admin1_iso	admin2_name	taxon	cohort_admin1_year	cohort_admin1_month	cohort_admin1_quarter	cohort_admin2_year	cohort_admin2_month	cohort_admin2_quarter
0	AR0047-C	LUA047	Joao Pinto	Angola	Luanda	2009	4	-8.884	13.302	F	...	Luanda	AO-LUA	Luanda	coluzzii	AO-LUA_colu_2009	AO-LUA_colu_2009_04	AO-LUA_colu_2009_Q2	AO-LUA_Luanda_colu_2009	AO-LUA_Luanda_colu_2009_04	AO-LUA_Luanda_colu_2009_Q2
1	AR0049-C	LUA049	Joao Pinto	Angola	Luanda	2009	4	-8.884	13.302	F	...	Luanda	AO-LUA	Luanda	coluzzii	AO-LUA_colu_2009	AO-LUA_colu_2009_04	AO-LUA_colu_2009_Q2	AO-LUA_Luanda_colu_2009	AO-LUA_Luanda_colu_2009_04	AO-LUA_Luanda_colu_2009_Q2
2	AR0051-C	LUA051	Joao Pinto	Angola	Luanda	2009	4	-8.884	13.302	F	...	Luanda	AO-LUA	Luanda	coluzzii	AO-LUA_colu_2009	AO-LUA_colu_2009_04	AO-LUA_colu_2009_Q2	AO-LUA_Luanda_colu_2009	AO-LUA_Luanda_colu_2009_04	AO-LUA_Luanda_colu_2009_Q2
3	AR0061-C	LUA061	Joao Pinto	Angola	Luanda	2009	4	-8.884	13.302	F	...	Luanda	AO-LUA	Luanda	coluzzii	AO-LUA_colu_2009	AO-LUA_colu_2009_04	AO-LUA_colu_2009_Q2	AO-LUA_Luanda_colu_2009	AO-LUA_Luanda_colu_2009_04	AO-LUA_Luanda_colu_2009_Q2
4	AR0078-C	LUA078	Joao Pinto	Angola	Luanda	2009	4	-8.884	13.302	F	...	Luanda	AO-LUA	Luanda	coluzzii	AO-LUA_colu_2009	AO-LUA_colu_2009_04	AO-LUA_colu_2009_Q2	AO-LUA_Luanda_colu_2009	AO-LUA_Luanda_colu_2009_04	AO-LUA_Luanda_colu_2009_Q2
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
3076	AD0494-C	80-2-o-16	Martin Donnelly	Lab Cross	LSTM	-1	-1	53.409	-2.969	F	...	NaN	NaN	NaN	unassigned	NaN	NaN	NaN	NaN	NaN	NaN
3077	AD0495-C	80-2-o-17	Martin Donnelly	Lab Cross	LSTM	-1	-1	53.409	-2.969	M	...	NaN	NaN	NaN	unassigned	NaN	NaN	NaN	NaN	NaN	NaN
3078	AD0496-C	80-2-o-18	Martin Donnelly	Lab Cross	LSTM	-1	-1	53.409	-2.969	M	...	NaN	NaN	NaN	unassigned	NaN	NaN	NaN	NaN	NaN	NaN
3079	AD0497-C	80-2-o-19	Martin Donnelly	Lab Cross	LSTM	-1	-1	53.409	-2.969	F	...	NaN	NaN	NaN	unassigned	NaN	NaN	NaN	NaN	NaN	NaN
3080	AD0498-C	80-2-o-20	Martin Donnelly	Lab Cross	LSTM	-1	-1	53.409	-2.969	M	...	NaN	NaN	NaN	unassigned	NaN	NaN	NaN	NaN	NaN	NaN

3081 rows × 57 columns

There are 3,081 rows and 32 columns in this DataFrame. Each row provides data about a single mosquito DNA sample that has been sequenced by MalariaGEN. The columns provide data about those mosquitoes, such as the country and year in which the mosquito was originally collected.

Note that the representation of the DataFrame above only shows some of the data - there are too many rows and columns to show all at once. The ellipsis character (”…”) shows where some data have been hidden.

Example: genome annotations#

In workshop 1, module 3 we encountered the genome_features() function. This function returns a DataFrame of genome annotations. Let’s call the function and assign the result to the variable df_genome_features.

df_genome_features = ag3.genome_features()
df_genome_features

	contig	source	type	start	end	score	strand	phase	ID	Parent	Name	description
0	2L	VectorBase	chromosome	1	49364325	NaN	NaN	NaN	2L	NaN	NaN	NaN
1	2L	VectorBase	gene	157348	186936	NaN	-	NaN	AGAP004677	NaN	NaN	methylenetetrahydrofolate dehydrogenase(NAD ) ...
2	2L	VectorBase	mRNA	157348	181305	NaN	-	NaN	AGAP004677-RA	AGAP004677	NaN	NaN
3	2L	VectorBase	three_prime_UTR	157348	157495	NaN	-	NaN	NaN	AGAP004677-RA	NaN	NaN
4	2L	VectorBase	exon	157348	157623	NaN	-	NaN	NaN	AGAP004677-RA	AGAP004677-RB-E4	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...
196140	Y_unplaced	VectorBase	five_prime_UTR	47932	48111	NaN	+	NaN	NaN	AGAP029375-RA	NaN	NaN
196141	Y_unplaced	VectorBase	exon	47932	48138	NaN	+	NaN	NaN	AGAP029375-RA	AGAP029375-RA-E2	NaN
196142	Y_unplaced	VectorBase	CDS	48112	48138	NaN	+	0.0	AGAP029375-PA	AGAP029375-RA	NaN	NaN
196143	Y_unplaced	VectorBase	exon	48301	48385	NaN	+	NaN	NaN	AGAP029375-RA	AGAP029375-RA-E3	NaN
196144	Y_unplaced	VectorBase	CDS	48301	48385	NaN	+	0.0	AGAP029375-PA	AGAP029375-RA	NaN	NaN

196145 rows × 12 columns

There are 196,145 rows and 12 columns in this DataFrame. Each row provides data about an annotated feature on the Anopheles gambiae AgamP4 reference genome, such as a gene. The columns provide data about those features, such as the contig (chromosome arm) on which the feature occurs, and the start and end coordinates.

Example: SNP allele frequencies#

In workshop 1, module 4, we encountered the snp_allele_frequencies() function. This function returns a DataFrame of single nucleotide polymorphisms (SNPs) together with their allele frequencies in different mosquito cohorts. Let’s call this function and assign the result to the variable snp_freqs_df.

snp_freqs_df = ag3.snp_allele_frequencies(
    transcript="AGAP004707-RD", 
    cohorts="admin1_year", 
    sample_sets=["AG1000G-BF-A", "AG1000G-BF-B", "AG1000G-BF-C"], 
)
snp_freqs_df

					pass_gamb_colu_arab	pass_gamb_colu	pass_arab	frq_BF-07_gamb_2004	frq_BF-09_colu_2012	frq_BF-09_colu_2014	frq_BF-09_gamb_2012	frq_BF-09_gamb_2014	max_af	transcript	effect	impact	ref_codon	alt_codon	aa_pos	ref_aa	alt_aa	label
contig	position	ref_allele	alt_allele	aa_change
2L	2358252	C	T	A32V	True	True	True	0.000000	0.006098	0.0	0.000000	0.0	0.006098	AGAP004707-RD	NON_SYNONYMOUS_CODING	MODERATE	gCa	gTa	32.0	A	V	2L:2,358,252 C>T (A32V)
	2358328	T	C	NaN	True	True	True	0.000000	0.006098	0.0	0.000000	0.0	0.006098	AGAP004707-RD	INTRONIC	MODIFIER	None	None	NaN	None	None	2L:2,358,328 T>C
	2358441	A	T	NaN	True	True	True	0.038462	0.000000	0.0	0.000000	0.0	0.038462	AGAP004707-RD	INTRONIC	MODIFIER	None	None	NaN	None	None	2L:2,358,441 A>T
	2358468	A	C	NaN	True	True	True	0.269231	0.000000	0.0	0.000000	0.0	0.269231	AGAP004707-RD	INTRONIC	MODIFIER	None	None	NaN	None	None	2L:2,358,468 A>C
	2358581	T	A	NaN	True	True	True	0.192308	0.000000	0.0	0.000000	0.0	0.192308	AGAP004707-RD	INTRONIC	MODIFIER	None	None	NaN	None	None	2L:2,358,581 T>A
	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
	2431424	G	T	G2055V	True	True	True	0.000000	0.000000	0.0	0.005051	0.0	0.005051	AGAP004707-RD	NON_SYNONYMOUS_CODING	MODERATE	gGc	gTc	2055.0	G	V	2L:2,431,424 G>T (G2055V)
	2431518	C	A	A2086A	True	True	True	0.038462	0.000000	0.0	0.000000	0.0	0.038462	AGAP004707-RD	SYNONYMOUS_CODING	LOW	gcC	gcA	2086.0	A	A	2L:2,431,518 C>A (A2086A)
	2431527	G	C	V2089V	True	True	True	0.038462	0.000000	0.0	0.000000	0.0	0.038462	AGAP004707-RD	SYNONYMOUS_CODING	LOW	gtG	gtC	2089.0	V	V	2L:2,431,527 G>C (V2089V)
	2431569	C	A	V2103V	True	True	True	0.038462	0.000000	0.0	0.000000	0.0	0.038462	AGAP004707-RD	SYNONYMOUS_CODING	LOW	gtC	gtA	2103.0	V	V	2L:2,431,569 C>A (V2103V)
	2431613	T	C	V2118A	True	True	True	0.000000	0.000000	0.0	0.005051	0.0	0.005051	AGAP004707-RD	NON_SYNONYMOUS_CODING	MODERATE	gTc	gCc	2118.0	V	A	2L:2,431,613 T>C (V2118A)

9675 rows × 18 columns

There are 9,675 rows and 18 columns in this DataFrame. Each row provides data about a SNP. The columns provide data about those SNPs, such as their effect, and the alternate allele frequency in a particular cohort of mosquitoes, e.g., the column named “frq_BF-09_colu_2012” provides the frequency in Anopheles coluzzii mosquitoes collected in the BF-09 (Hauts-Bassins) administrative region of Burkina Faso in 2012.

Aside: variable naming styles#

You may have noticed that in the examples above, whenever we assigned a DataFrame to a variable, the name of the variable either started with the prefix “df_” or ended with the suffix “_df”. Here “df” stands for DataFrame, and is used just as a helpful reminder that the value of the variable is a pandas DataFrame. Using a prefix of suffix on the variable name like this is entirely optional and is up to your own coding style. You can name these variables whatever you like. Different instructors in this course prefer different styles, and you will see different styles being used in different modules.

Example: reading from a CSV file#

You can also create a pandas DataFrame by reading data from a file, such as a comma-separated value (CSV) file, or an Excel workbook file. Let’s use the pandas read_csv() function to read data from a CSV file containing data from a US census that colab provides as example data for you to play with.

df_housing = pd.read_csv("sample_data/california_housing_train.csv")
df_housing

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value
0	-114.31	34.19	15.0	5612.0	1283.0	1015.0	472.0	1.4936	66900.0
1	-114.47	34.40	19.0	7650.0	1901.0	1129.0	463.0	1.8200	80100.0
2	-114.56	33.69	17.0	720.0	174.0	333.0	117.0	1.6509	85700.0
3	-114.57	33.64	14.0	1501.0	337.0	515.0	226.0	3.1917	73400.0
4	-114.57	33.57	20.0	1454.0	326.0	624.0	262.0	1.9250	65500.0
...	...	...	...	...	...	...	...	...	...
16995	-124.26	40.58	52.0	2217.0	394.0	907.0	369.0	2.3571	111400.0
16996	-124.27	40.69	36.0	2349.0	528.0	1194.0	465.0	2.5179	79000.0
16997	-124.30	41.84	17.0	2677.0	531.0	1244.0	456.0	3.0313	103600.0
16998	-124.30	41.80	19.0	2672.0	552.0	1298.0	478.0	1.9797	85800.0
16999	-124.35	40.54	52.0	1820.0	300.0	806.0	270.0	3.0147	94600.0

17000 rows × 9 columns

This DataFrame has 17,000 rows and 9 columns. Each row provides data about a block of housing. The columns provide data about the houses within a block, such as the total number of bedrooms and the number of households.

Note that you can upload your own files to Colab if you want to read in and analyse some of your own data with pandas.

Example: creating a DataFrame from scratch#

You can also create a DataFrame by providing the data within your code. To demonstrate this, let’s create a DataFrame with data about characters in a famous science fiction book called The Hitchhiker’s Guide to the Galaxy.

df_hitchhikers = pd.DataFrame(data={
    "name": ["Arthur Dent", "Tricia McMillan", "Ford Prefect", "Zaphod Beeblebrox", "Marvin the Paranoid Android"],
    "taxon": ["human", "human", "betelgeusian", "betelgeusian", "robot"],
    "gender": ["male", "female", "male", "male", None],
    "number_of_heads": [1, 1, 1, 2, 1],
    "intelligence": [100, 180, 120, 30, 5_000_000]
})
df_hitchhikers

	name	taxon	gender	number_of_heads	intelligence
0	Arthur Dent	human	male	1	100
1	Tricia McMillan	human	female	1	180
2	Ford Prefect	betelgeusian	male	1	120
3	Zaphod Beeblebrox	betelgeusian	male	2	30
4	Marvin the Paranoid Android	robot	None	1	5000000

This DataFrame has 5 rows and 5 columns. Each row provides data about a fictional character in the book. The columns provide data about each character, such as their name, taxon and number of heads (some alien species can have more than one).

Structure of a DataFrame#

Hopefully this will already be obvious from the examples above, but a pandas DataFrame is a data structure that is organised into rows and columns. Each DataFrame also has an index, which is a special column (or set of columns) that is shown in bold at the left of a DataFrame, and which we’ll look at in more detail later.

Counting rows and columns#

If a DataFrame is large, you may find it useful to know how many rows and columns it contains. The representation of a DataFrame shows at the bottom how many rows and columns are present in the DataFrame, but you can also use the built-in len() function if you want to access the number of rows or columns in your code.

num_rows = len(df_hitchhikers)
num_rows

num_cols = len(df_hitchhikers.columns)
num_cols

Exercise 1 (English)

Count the number of rows and columns in the df_samples DataFrame.

Exercice 1 (Français)

Compter le nombre de lignes et de colonnes du DataFrame df_samples.

Accessing columns#

Often you will have a DataFrame with many columns, and you many only need to work with data from a smaller number of columns, or just a single column. Accessing one or more columns from a DataFrame can be done using the square bracket syntax after the name of a DataFrame variable. Let’s see how this works.

Accessing column names#

First, let’s access the names of all the columns in a DataFrame, to see what columns are available. You can do this by accessing the columns property of a DataFrame. For simplicity, let’s also convert the result to a Python list, using the built-in list() function.

col_names = list(df_hitchhikers.columns)
col_names

['name', 'taxon', 'gender', 'number_of_heads', 'intelligence']

This can be useful especially if you have a lot of columns in a DataFrame.

Exercise 2 (English)

Access the column names of the df_samples DataFrame.

Exercice 2 (Français)

Afficher les noms des colonnes du DataFrame df_samples.

Accessing a single column#

To access a single column, give the column name as a string inside square brackets. Let’s access the “name” column from the DataFrame of book characters, and assign the result to a new variable character_name.

character_name = df_hitchhikers["name"]

When you access a single column or row from a DataFrame, pandas returns a Series object to you. Pandas uses the Series class to hold data either from a single column or a single row, i.e., a one-dimensional sequence of data values.

type(character_name)

pandas.core.series.Series

We can also inspect the data in the column.

character_name

                  Arthur Dent
              Tricia McMillan
                 Ford Prefect
            Zaphod Beeblebrox
  Marvin the Paranoid Android
Name: name, dtype: object

Investigating data in a column#

The pandas Series class has some useful functions for investigating the data in a DataFrame.

E.g., let’s access the “taxon” column and use the value_counts() functions to get a summary of the categorical data in the column.

character_taxon = df_hitchhikers["taxon"]
character_taxon.value_counts()

taxon
human           2
betelgeusian    2
robot           1
Name: count, dtype: int64

The pandas Series class also has a describe() function which can be useful for investigating numerical data. Let’s use it with the number_of_heads column.

character_heads = df_hitchhikers["number_of_heads"]
character_heads.describe()

count    5.000000
mean     1.200000
std      0.447214
min      1.000000
25%      1.000000
50%      1.000000
75%      1.000000
max      2.000000
Name: number_of_heads, dtype: float64

The characters in our DataFrame have at least (min) 1 and at most (max) 2 heads. The mean number of heads is 1.2, but of course this a bit silly, because number of heads is not a continuous variable.

Exercise 3 (English)

Access the “country” column from the df_samples DataFrame, and then use the value_counts() function to show the number of mosquito samples from each country.
Access the “type” column from the df_genome_features DataFrame, and then use the value_counts() function to show the number of genome annotations of different types.
Access the “effect” column from the snp_freqs_df DataFrame, and then use the value_counts() function to show the number of SNPs with different effects.

Exercice 3 (Français)

Obtenir la colonne “country” du DataFrame df_samples et utiliser la fonction value_counts() pour afficher le nombre de moustiques capturés dans chaque pays.
Obtenir la colonne “type” du DataFrame df_genome_features et utiliser la fonction value_counts() pour afficher le nombre d’annotations du génome de chaque type.
Obtenir la colonne “effect” du DataFrame snp_freqs_df et utiliser la fonction value_counts() pour afficher le nombre de SNPs pour chaque effet possible.

Accessing multiple columns#

To access multiple columns, provide the column names as a list of strings. Let’s access the “name”, “taxon” and “number_of_heads” columns from the DataFrame of book characters, and assign the resulting DataFrame to a new variable df_hh.

selected_cols = ["name", "taxon", "number_of_heads"]
df_hh = df_hitchhikers[selected_cols]
df_hh

	name	taxon	number_of_heads
0	Arthur Dent	human	1
1	Tricia McMillan	human	1
2	Ford Prefect	betelgeusian	1
3	Zaphod Beeblebrox	betelgeusian	2
4	Marvin the Paranoid Android	robot	1

You can also put the list of columns directly inside the square brackets, which would give the same result.

df_hh = df_hitchhikers[["name", "taxon", "number_of_heads"]]
df_hh

	name	taxon	number_of_heads
0	Arthur Dent	human	1
1	Tricia McMillan	human	1
2	Ford Prefect	betelgeusian	1
3	Zaphod Beeblebrox	betelgeusian	2
4	Marvin the Paranoid Android	robot	1

Exercise 4 (English)

Create a new DataFrame containing only the “sample_id”, “country” and “year” columns from the df_samples DataFrame.
Create a new DataFrame containing only the “ID”, “contig”, “start” and “end” columns from the df_genome_features DataFrame.
Create a new DataFrame containing only the frequency columns from the snp_freqs_df DataFrame (hint: the names of these columns all begin with “frq_”).

Exercice 4 (Français)

Créer un nouveau DataFrame contenant seulement les colonnes “sample_id”, “country” et “year” du DataFrame df_samples.
Créer un nouveau DataFrame contenant seulement les colonnes “ID”, “contig”, “start” et “end” du DataFrame df_genome_features.
Créer un nouveau DataFrame contenant seulement les colonnes contenant les fréquences du DataFrame snp_freqs_df (Indice: les noms de ces colonnes commencent tous par “frq_”).

Accessing rows#

Now we’ve seen how to access columns, let’s see how to access rows. To access rows by their position within a DataFrame, we can use the iloc property.

Accessing a single row#

To access a single row, we use the iloc property, then provide the row number as an integer inside square brackets, remembering that numbering starts from 0 in Python. For example, access the first row of our DataFrame of book characters, and assign the result to a variable first_row.

first_row = df_hitchhikers.iloc[0]

When accessing a single row, pandas returns a Series.

type(first_row)

pandas.core.series.Series

Inspect the data in the row.

first_row

name               Arthur Dent
taxon                    human
gender                    male
number_of_heads              1
intelligence               100
Name: 0, dtype: object

Access specific data values within the row.

first_row["name"]

'Arthur Dent'

first_row["number_of_heads"]

Exercise 5 (English)

Access the 81st row in the df_samples DataFrame. What country did this mosquito sample come from? (Remember: numbering starts from 0 in Python.)
Access the 73,287th row in the df_genome_features DataFrame. What type of annotation is this? What is its Name?

Exercice 5 (Français)

Obtenir la 81ème ligne du DataFrame df_samples. De quel pays est issu ce moustique? (Indice: les indices commencent à 0 en Python)
Obtenir la 73 287ème ligne du DataFrame df_genome_features. De quel type d’annotation s’agit-il? Quel est son nom?

Accessing multiple rows#

We can also use iloc to access multiple rows,from some start position to some end position. To do this we provide two integers separated by a colon (“:”) which is called a slice.

As well as remembering that numbering starts from zero, we also have to remember that slices include the start position but do not include the stop position. E.g., to access the first two rows, provide the start and stop positions 0 and 2. This returns a new DataFrame with just the selected rows within the slice.

first_two_rows = df_hitchhikers.iloc[0:2]
first_two_rows

	name	taxon	gender	number_of_heads	intelligence
0	Arthur Dent	human	male	1	100
1	Tricia McMillan	human	female	1	180

Exercise 6 (English)

Access rows starting from the 6th row up to and including the 10th row in the df_samples DataFrame.

Exercice 6 (Français)

Obtenir toutes les lignes entre la 6ème et la 10ème (incluses) du DataFrame df_samples.

Indexes#

Every pandas DataFrame has an index, which is a special column (or set of columns) which can be used to access rows quickly and conveniently. In particular, indexes can be used to assign a label to every row, and then use that label to locate and access matching rows.

For example, when working with our DataFrame of book characters, we might want to access rows using the name of the character. To do this, we can set the “name” column as the index.

df_hitchhikers_indexed = df_hitchhikers.set_index("name")
df_hitchhikers_indexed

	taxon	gender	number_of_heads	intelligence
name
Arthur Dent	human	male	1	100
Tricia McMillan	human	female	1	180
Ford Prefect	betelgeusian	male	1	120
Zaphod Beeblebrox	betelgeusian	male	2	30
Marvin the Paranoid Android	robot	None	1	5000000

The DataFrame df_hitchhikers_indexed has the same data as before, but notice that the “name” column has moved to the left and is shown in bold. This shows us that the “name” column is now being used as the DataFrame index.

Now that we’ve set a useful index, we can use the loc property to locate rows with a particular name. E.g., locate the row where the character’s name is “Tricia McMillan”.

trillian = df_hitchhikers_indexed.loc["Tricia McMillan"]

This indexing action locates a single row, and so pandas returns a Series.

type(trillian)

pandas.core.series.Series

Inspect the row of data we located.

trillian

taxon               human
gender             female
number_of_heads         1
intelligence          180
Name: Tricia McMillan, dtype: object

It’s also useful to know that you can always reset the index back to the default, using the reset_index() function. The default is to index the DataFrame with integers.

df_hitchhikers_reset = df_hitchhikers_indexed.reset_index()
df_hitchhikers_reset

	name	taxon	gender	number_of_heads	intelligence
0	Arthur Dent	human	male	1	100
1	Tricia McMillan	human	female	1	180
2	Ford Prefect	betelgeusian	male	1	120
3	Zaphod Beeblebrox	betelgeusian	male	2	30
4	Marvin the Paranoid Android	robot	None	1	5000000

Exercise 7 (English)

Set the “sample_id” column as the index on the df_samples DataFrame, then locate the row with sample identifier “BK0055-C”. Where was this mosquito collected?
Set the “ID” column as the index on the df_genome_features DataFrame, then locate the row with identifier “AGAP000818”. What is the name of this gene?

Exercice 7 (Français)

Faire de la colonne “sample_id” l’index du DataFrame df_samples et ensuite trouver la ligne avec comme identificant d’échantillon “BK0055-C”. D’où est issu ce moustique?
Faire de la colonne “ID” l’index du DataFrame df_genome_features et ensuite trouver la ligne avec comme identifiant “AGAP000818”. Quel est le nom de ce gène?

Queries#

DataFrames also provide a query() function, which can be used to select rows matching some condition or combination of conditions. The query() function accepts a string containing a query expression.

Let’s query the book characters DataFrame to select only rows where the value of the “taxon” column is “human”. To match a value we use the “==” operator.

df_human_hitchhikers = df_hitchhikers.query("taxon == 'human'")
df_human_hitchhikers

	name	taxon	gender	number_of_heads	intelligence
0	Arthur Dent	human	male	1	100
1	Tricia McMillan	human	female	1	180

Let’s do the opposite, selecting only rows where the value of the “taxon” column is not “human”, using the “!=” operator.

df_non_human_hitchhikers = df_hitchhikers.query("taxon != 'human'")
df_non_human_hitchhikers

	name	taxon	gender	number_of_heads	intelligence
2	Ford Prefect	betelgeusian	male	1	120
3	Zaphod Beeblebrox	betelgeusian	male	2	30
4	Marvin the Paranoid Android	robot	None	1	5000000

Let’s query a numerical field using the “>” greater than operator.

df_multi_headed_hitchhikers = df_hitchhikers.query("number_of_heads > 1")
df_multi_headed_hitchhikers

	name	taxon	gender	number_of_heads	intelligence
3	Zaphod Beeblebrox	betelgeusian	male	2	30

A query expression can combine conditions on multiple columns using “and” and “or”. E.g., find rows where the value of the “taxon” column is “human” and the value of the “number_of_heads” column is 1.

df_non_human_one_headed_hitchhikers = df_hitchhikers.query(
    "taxon != 'human' and number_of_heads == 1"
)
df_non_human_one_headed_hitchhikers

	name	taxon	gender	number_of_heads	intelligence
2	Ford Prefect	betelgeusian	male	1	120
4	Marvin the Paranoid Android	robot	None	1	5000000

Exercise 8 (English)

Query the df_samples DataFrame to select mosquitoes collected in Burkina Faso.
Query the df_samples DataFrame to select mosquitoes collected in Burkina Faso in 2014.
Query the df_samples DataFrame to select Anopheles coluzzii mosquitoes collected in Burkina Faso in 2014 (hint: taxon == 'coluzzii').
Query the snp_freqs_df DataFrame to select SNPs which change the protein sequence (hint: effect == 'NON_SYNONYMOUS_CODING').
Query the snp_freqs_df DataFrame to select SNPs which change the protein sequence and which occur above 5% frequency in at least one cohort (hint: max_af > 0.05).

Exercice 8 (Français)

Sélectionner parmi les moustiques du DataFrame df_samples ceux qui ont été capturés au Burkina Faso.
Sélectionner parmi les moustiques du DataFrame df_samples ceux qui ont été capturés au Burkina Faso en 2014.
Sélectionner parmi les moustiques du DataFrame df_samples les Anophèles coluzzii qui ont été capturés au Burkina Faso en 2014. (Indice: taxon == 'coluzzii').
Sélectionner parmi les SNPs du DataFrame snp_freqs_df ceux qui modifient la séquence d’une protéine (Indice: effect == 'NON_SYNONYMOUS_CODING').
Sélectionner parmi les SNPs du DataFrame snp_freqs_df ceux qui modifient la séquence d’une protéine et apparaissent avec une fréquence supérieure à 5% dans au moins une cohorte (Indice: max_af > 0.05).

Input/output#

We saw at the beginning of this module how to read in data from a file. Let’s look finally at how to write data out to a file.

To write a DataFrame out to a CSV file, use the to_csv() function. E.g., let’s write out the book characters DataFrame to a CSV file.

df_hitchhikers.to_csv("hitchhikers.csv")

You can download this file to your local computer by opening the file browser tab to the left in colab.

There are several other file formats you can write to. E.g., you could write to an Excel file. Here, we use the

df_hitchhikers.to_excel("hitchhikers.xlsx")

Any of the DataFrames we created in this module could be written out in the same way. E.g., we could write a DataFrame that we obtained by querying the book characters for only humans.

df_human_hitchhikers.to_excel("hh_human.xlsx")

Exercise 9 (English)

Write the df_samples DataFrame out to an Excel file.
Query the df_samples DataFrame to select only mosquitoes from Burkina Faso, then write the resulting DataFrame out to a CSV file.

Exercice 9 (Français)

Enregistrer le DataFrame df_samples sous forme de fichier Excel.
Sélectionner parmi les moustiques du DataFrame df_samples ceux provenant du Burkina Faso et enregistrer le DataFrame ainsi créé sous forme de fichier CSV.

Topics not covered here#

Please note that this module has only covered a few of the many features that pandas supports. Here are a few of the features that we have not covered, but which you might like to learn more about through other online tutorials.

Grouping and aggregation
Pivot tables
Joining and merging
Concatenating
Advanced indexing (multi-level indexes; Boolean indexing; …)
Modifying data
Apply
Missing data
Time series
Plotting
Text
…