Workshop 2 - Training course in data analysis for genomic surveillance of African malaria vectors
Module 1 - Introduction to pandas DataFrames#
Theme: Tools & Technology
In this module we’re going to take a closer look at a Python package called pandas, which is an essential part of our data analysis toolkit. We use pandas whenever we need to store, manipulate or analyse tables of data, also known as DataFrames. In this module we’ll focus on using pandas to analyse genomic data, but it can be used to analyse any kind of data, so it’s a very useful tool to have at your fingertips.
Learning objectives#
After this module you will be able to:
Explain what a DataFrame is
Access columns and rows from a DataFrame
Use indexes and queries
Write a DataFrame to a file
Please note that this module will cover only a small part of the features that pandas provides - see the links at then end of this notebook if you’d like to go further.
Lecture#
English#
Français#
Please note that the code in the cells below might differ from that shown in the video. This can happen because Python packages and their dependencies change due to updates, necessitating tweaks to the code.
Setup#
Let’s install the malariagen_data
package so we have easy access to some genomic data to demonstrate with. Pandas is already installed on colab.
%pip install -q --no-warn-conflicts malariagen_data
Now import the malariagen_data
and pandas
packages. Note that by convention we use pd
as an abbreviation for pandas
.
import malariagen_data
import pandas as pd
Set up access to the MalariaGEN Ag3.0 data resource.
ag3 = malariagen_data.Ag3()
ag3
MalariaGEN Ag3 API client | |
---|---|
Please note that data are subject to terms of use, for more information see the MalariaGEN website or contact data@malariagen.net. See also the Ag3 API docs. | |
Storage URL | gs://vo_agam_release/ |
Data releases available | 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8 |
Results cache | None |
Cohorts analysis | 20231215 |
AIM analysis | 20220528 |
Site filters analysis | dt_20200416 |
Software version | malariagen_data 8.7.0 |
Client location | unknown |
pandas DataFrames#
We use the pandas DataFrame
class whenever we want to store and analyse data that can be structured as a table with rows and columns. Let’s take a look at some examples of DataFrames that we met in the first workshop of this training course.
Example: sample metadata#
In workshop 1, module 2 we encountered the sample_metadata() function. This function returns a DataFrame of sample metadata. Let’s call this function and assign the result to the variable df_samples
.
df_samples = ag3.sample_metadata(sample_sets="3.0")
Check what type of object we have received.
type(df_samples)
pandas.core.frame.DataFrame
Inspect the data in the DataFrame by running a code cell containing just the name of the variable.
df_samples
sample_id | partner_sample_id | contributor | country | location | year | month | latitude | longitude | sex_call | ... | admin1_name | admin1_iso | admin2_name | taxon | cohort_admin1_year | cohort_admin1_month | cohort_admin1_quarter | cohort_admin2_year | cohort_admin2_month | cohort_admin2_quarter | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AR0047-C | LUA047 | Joao Pinto | Angola | Luanda | 2009 | 4 | -8.884 | 13.302 | F | ... | Luanda | AO-LUA | Luanda | coluzzii | AO-LUA_colu_2009 | AO-LUA_colu_2009_04 | AO-LUA_colu_2009_Q2 | AO-LUA_Luanda_colu_2009 | AO-LUA_Luanda_colu_2009_04 | AO-LUA_Luanda_colu_2009_Q2 |
1 | AR0049-C | LUA049 | Joao Pinto | Angola | Luanda | 2009 | 4 | -8.884 | 13.302 | F | ... | Luanda | AO-LUA | Luanda | coluzzii | AO-LUA_colu_2009 | AO-LUA_colu_2009_04 | AO-LUA_colu_2009_Q2 | AO-LUA_Luanda_colu_2009 | AO-LUA_Luanda_colu_2009_04 | AO-LUA_Luanda_colu_2009_Q2 |
2 | AR0051-C | LUA051 | Joao Pinto | Angola | Luanda | 2009 | 4 | -8.884 | 13.302 | F | ... | Luanda | AO-LUA | Luanda | coluzzii | AO-LUA_colu_2009 | AO-LUA_colu_2009_04 | AO-LUA_colu_2009_Q2 | AO-LUA_Luanda_colu_2009 | AO-LUA_Luanda_colu_2009_04 | AO-LUA_Luanda_colu_2009_Q2 |
3 | AR0061-C | LUA061 | Joao Pinto | Angola | Luanda | 2009 | 4 | -8.884 | 13.302 | F | ... | Luanda | AO-LUA | Luanda | coluzzii | AO-LUA_colu_2009 | AO-LUA_colu_2009_04 | AO-LUA_colu_2009_Q2 | AO-LUA_Luanda_colu_2009 | AO-LUA_Luanda_colu_2009_04 | AO-LUA_Luanda_colu_2009_Q2 |
4 | AR0078-C | LUA078 | Joao Pinto | Angola | Luanda | 2009 | 4 | -8.884 | 13.302 | F | ... | Luanda | AO-LUA | Luanda | coluzzii | AO-LUA_colu_2009 | AO-LUA_colu_2009_04 | AO-LUA_colu_2009_Q2 | AO-LUA_Luanda_colu_2009 | AO-LUA_Luanda_colu_2009_04 | AO-LUA_Luanda_colu_2009_Q2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3076 | AD0494-C | 80-2-o-16 | Martin Donnelly | Lab Cross | LSTM | -1 | -1 | 53.409 | -2.969 | F | ... | NaN | NaN | NaN | unassigned | NaN | NaN | NaN | NaN | NaN | NaN |
3077 | AD0495-C | 80-2-o-17 | Martin Donnelly | Lab Cross | LSTM | -1 | -1 | 53.409 | -2.969 | M | ... | NaN | NaN | NaN | unassigned | NaN | NaN | NaN | NaN | NaN | NaN |
3078 | AD0496-C | 80-2-o-18 | Martin Donnelly | Lab Cross | LSTM | -1 | -1 | 53.409 | -2.969 | M | ... | NaN | NaN | NaN | unassigned | NaN | NaN | NaN | NaN | NaN | NaN |
3079 | AD0497-C | 80-2-o-19 | Martin Donnelly | Lab Cross | LSTM | -1 | -1 | 53.409 | -2.969 | F | ... | NaN | NaN | NaN | unassigned | NaN | NaN | NaN | NaN | NaN | NaN |
3080 | AD0498-C | 80-2-o-20 | Martin Donnelly | Lab Cross | LSTM | -1 | -1 | 53.409 | -2.969 | M | ... | NaN | NaN | NaN | unassigned | NaN | NaN | NaN | NaN | NaN | NaN |
3081 rows × 32 columns
There are 3,081 rows and 32 columns in this DataFrame. Each row provides data about a single mosquito DNA sample that has been sequenced by MalariaGEN. The columns provide data about those mosquitoes, such as the country and year in which the mosquito was originally collected.
Note that the representation of the DataFrame above only shows some of the data - there are too many rows and columns to show all at once. The ellipsis character (”…”) shows where some data have been hidden.
Example: genome annotations#
In workshop 1, module 3 we encountered the geneset() function. This function returns a DataFrame of genome annotations. Let’s call the function and assign the result to the variable df_geneset
.
df_geneset = ag3.geneset()
df_geneset
contig | source | type | start | end | score | strand | phase | ID | Parent | Name | description | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2L | VectorBase | chromosome | 1 | 49364325 | NaN | NaN | NaN | 2L | NaN | NaN | NaN |
1 | 2L | VectorBase | gene | 157348 | 186936 | NaN | - | NaN | AGAP004677 | NaN | NaN | methylenetetrahydrofolate dehydrogenase(NAD ) ... |
2 | 2L | VectorBase | mRNA | 157348 | 181305 | NaN | - | NaN | AGAP004677-RA | AGAP004677 | NaN | NaN |
3 | 2L | VectorBase | three_prime_UTR | 157348 | 157495 | NaN | - | NaN | NaN | AGAP004677-RA | NaN | NaN |
4 | 2L | VectorBase | exon | 157348 | 157623 | NaN | - | NaN | NaN | AGAP004677-RA | AGAP004677-RB-E4 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
196140 | Y_unplaced | VectorBase | five_prime_UTR | 47932 | 48111 | NaN | + | NaN | NaN | AGAP029375-RA | NaN | NaN |
196141 | Y_unplaced | VectorBase | exon | 47932 | 48138 | NaN | + | NaN | NaN | AGAP029375-RA | AGAP029375-RA-E2 | NaN |
196142 | Y_unplaced | VectorBase | CDS | 48112 | 48138 | NaN | + | 0.0 | AGAP029375-PA | AGAP029375-RA | NaN | NaN |
196143 | Y_unplaced | VectorBase | exon | 48301 | 48385 | NaN | + | NaN | NaN | AGAP029375-RA | AGAP029375-RA-E3 | NaN |
196144 | Y_unplaced | VectorBase | CDS | 48301 | 48385 | NaN | + | 0.0 | AGAP029375-PA | AGAP029375-RA | NaN | NaN |
196145 rows × 12 columns
There are 196,145 rows and 12 columns in this DataFrame. Each row provides data about an annotated feature on the Anopheles gambiae AgamP4 reference genome, such as a gene. The columns provide data about those features, such as the contig (chromosome arm) on which the feature occurs, and the start and end coordinates.
Example: SNP allele frequencies#
In workshop 1, module 4, we encountered the snp_allele_frequencies() function. This function returns a DataFrame of single nucleotide polymorphisms (SNPs) together with their allele frequencies in different mosquito cohorts. Let’s call this function and assign the result to the variable snp_freqs_df
.
snp_freqs_df = ag3.snp_allele_frequencies(
transcript="AGAP004707-RD",
cohorts="admin1_year",
sample_sets=["AG1000G-BF-A", "AG1000G-BF-B", "AG1000G-BF-C"],
)
snp_freqs_df
pass_gamb_colu_arab | pass_gamb_colu | pass_arab | frq_BF-07_gamb_2004 | frq_BF-09_colu_2012 | frq_BF-09_colu_2014 | frq_BF-09_gamb_2012 | frq_BF-09_gamb_2014 | max_af | transcript | effect | impact | ref_codon | alt_codon | aa_pos | ref_aa | alt_aa | label | |||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
contig | position | ref_allele | alt_allele | aa_change | ||||||||||||||||||
2L | 2358252 | C | T | A32V | True | True | True | 0.000000 | 0.006098 | 0.0 | 0.000000 | 0.0 | 0.006098 | AGAP004707-RD | NON_SYNONYMOUS_CODING | MODERATE | gCa | gTa | 32.0 | A | V | 2L:2,358,252 C>T (A32V) |
2358328 | T | C | NaN | True | True | True | 0.000000 | 0.006098 | 0.0 | 0.000000 | 0.0 | 0.006098 | AGAP004707-RD | INTRONIC | MODIFIER | None | None | NaN | None | None | 2L:2,358,328 T>C | |
2358441 | A | T | NaN | True | True | True | 0.038462 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.038462 | AGAP004707-RD | INTRONIC | MODIFIER | None | None | NaN | None | None | 2L:2,358,441 A>T | |
2358468 | A | C | NaN | True | True | True | 0.269231 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.269231 | AGAP004707-RD | INTRONIC | MODIFIER | None | None | NaN | None | None | 2L:2,358,468 A>C | |
2358581 | T | A | NaN | True | True | True | 0.192308 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.192308 | AGAP004707-RD | INTRONIC | MODIFIER | None | None | NaN | None | None | 2L:2,358,581 T>A | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
2431424 | G | T | G2055V | True | True | True | 0.000000 | 0.000000 | 0.0 | 0.005051 | 0.0 | 0.005051 | AGAP004707-RD | NON_SYNONYMOUS_CODING | MODERATE | gGc | gTc | 2055.0 | G | V | 2L:2,431,424 G>T (G2055V) | |
2431518 | C | A | A2086A | True | True | True | 0.038462 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.038462 | AGAP004707-RD | SYNONYMOUS_CODING | LOW | gcC | gcA | 2086.0 | A | A | 2L:2,431,518 C>A (A2086A) | |
2431527 | G | C | V2089V | True | True | True | 0.038462 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.038462 | AGAP004707-RD | SYNONYMOUS_CODING | LOW | gtG | gtC | 2089.0 | V | V | 2L:2,431,527 G>C (V2089V) | |
2431569 | C | A | V2103V | True | True | True | 0.038462 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.038462 | AGAP004707-RD | SYNONYMOUS_CODING | LOW | gtC | gtA | 2103.0 | V | V | 2L:2,431,569 C>A (V2103V) | |
2431613 | T | C | V2118A | True | True | True | 0.000000 | 0.000000 | 0.0 | 0.005051 | 0.0 | 0.005051 | AGAP004707-RD | NON_SYNONYMOUS_CODING | MODERATE | gTc | gCc | 2118.0 | V | A | 2L:2,431,613 T>C (V2118A) |
9675 rows × 18 columns
There are 9,675 rows and 18 columns in this DataFrame. Each row provides data about a SNP. The columns provide data about those SNPs, such as their effect, and the alternate allele frequency in a particular cohort of mosquitoes, e.g., the column named “frq_BF-09_colu_2012” provides the frequency in Anopheles coluzzii mosquitoes collected in the BF-09 (Hauts-Bassins) administrative region of Burkina Faso in 2012.
Aside: variable naming styles#
You may have noticed that in the examples above, whenever we assigned a DataFrame to a variable, the name of the variable either started with the prefix “df_
” or ended with the suffix “_df
”. Here “df” stands for DataFrame, and is used just as a helpful reminder that the value of the variable is a pandas DataFrame. Using a prefix of suffix on the variable name like this is entirely optional and is up to your own coding style. You can name these variables whatever you like. Different instructors in this course prefer different styles, and you will see different styles being used in different modules.
Example: reading from a CSV file#
You can also create a pandas DataFrame by reading data from a file, such as a comma-separated value (CSV) file, or an Excel workbook file. Let’s use the pandas read_csv() function to read data from a CSV file containing data from a US census that colab provides as example data for you to play with.
df_housing = pd.read_csv("sample_data/california_housing_train.csv")
df_housing
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | |
---|---|---|---|---|---|---|---|---|---|
0 | -114.31 | 34.19 | 15.0 | 5612.0 | 1283.0 | 1015.0 | 472.0 | 1.4936 | 66900.0 |
1 | -114.47 | 34.40 | 19.0 | 7650.0 | 1901.0 | 1129.0 | 463.0 | 1.8200 | 80100.0 |
2 | -114.56 | 33.69 | 17.0 | 720.0 | 174.0 | 333.0 | 117.0 | 1.6509 | 85700.0 |
3 | -114.57 | 33.64 | 14.0 | 1501.0 | 337.0 | 515.0 | 226.0 | 3.1917 | 73400.0 |
4 | -114.57 | 33.57 | 20.0 | 1454.0 | 326.0 | 624.0 | 262.0 | 1.9250 | 65500.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
16995 | -124.26 | 40.58 | 52.0 | 2217.0 | 394.0 | 907.0 | 369.0 | 2.3571 | 111400.0 |
16996 | -124.27 | 40.69 | 36.0 | 2349.0 | 528.0 | 1194.0 | 465.0 | 2.5179 | 79000.0 |
16997 | -124.30 | 41.84 | 17.0 | 2677.0 | 531.0 | 1244.0 | 456.0 | 3.0313 | 103600.0 |
16998 | -124.30 | 41.80 | 19.0 | 2672.0 | 552.0 | 1298.0 | 478.0 | 1.9797 | 85800.0 |
16999 | -124.35 | 40.54 | 52.0 | 1820.0 | 300.0 | 806.0 | 270.0 | 3.0147 | 94600.0 |
17000 rows × 9 columns
This DataFrame has 17,000 rows and 9 columns. Each row provides data about a block of housing. The columns provide data about the houses within a block, such as the total number of bedrooms and the number of households.
Note that you can upload your own files to Colab if you want to read in and analyse some of your own data with pandas.
Example: creating a DataFrame from scratch#
You can also create a DataFrame by providing the data within your code. To demonstrate this, let’s create a DataFrame with data about characters in a famous science fiction book called The Hitchhiker’s Guide to the Galaxy.
df_hitchhikers = pd.DataFrame(data={
"name": ["Arthur Dent", "Tricia McMillan", "Ford Prefect", "Zaphod Beeblebrox", "Marvin the Paranoid Android"],
"taxon": ["human", "human", "betelgeusian", "betelgeusian", "robot"],
"gender": ["male", "female", "male", "male", None],
"number_of_heads": [1, 1, 1, 2, 1],
"intelligence": [100, 180, 120, 30, 5_000_000]
})
df_hitchhikers
name | taxon | gender | number_of_heads | intelligence | |
---|---|---|---|---|---|
0 | Arthur Dent | human | male | 1 | 100 |
1 | Tricia McMillan | human | female | 1 | 180 |
2 | Ford Prefect | betelgeusian | male | 1 | 120 |
3 | Zaphod Beeblebrox | betelgeusian | male | 2 | 30 |
4 | Marvin the Paranoid Android | robot | None | 1 | 5000000 |
This DataFrame has 5 rows and 5 columns. Each row provides data about a fictional character in the book. The columns provide data about each character, such as their name, taxon and number of heads (some alien species can have more than one).
Structure of a DataFrame#
Hopefully this will already be obvious from the examples above, but a pandas DataFrame is a data structure that is organised into rows and columns. Each DataFrame also has an index, which is a special column (or set of columns) that is shown in bold at the left of a DataFrame, and which we’ll look at in more detail later.
%%html
<img width="90%" height="90%"src="https://storage.googleapis.com/vo_agam_release/reference/training-images/workshop-2/W2M1-1.PNG"/>
Counting rows and columns#
If a DataFrame is large, you may find it useful to know how many rows and columns it contains. The representation of a DataFrame shows at the bottom how many rows and columns are present in the DataFrame, but you can also use the built-in len()
function if you want to access the number of rows or columns in your code.
num_rows = len(df_hitchhikers)
num_rows
5
num_cols = len(df_hitchhikers.columns)
num_cols
5
Exercise 1 (English)
Count the number of rows and columns in the
df_samples
DataFrame.
Exercice 1 (Français)
Compter le nombre de lignes et de colonnes du DataFrame
df_samples
.
Accessing columns#
Often you will have a DataFrame with many columns, and you many only need to work with data from a smaller number of columns, or just a single column. Accessing one or more columns from a DataFrame can be done using the square bracket syntax after the name of a DataFrame variable. Let’s see how this works.
Accessing column names#
First, let’s access the names of all the columns in a DataFrame, to see what columns are available. You can do this by accessing the columns
property of a DataFrame. For simplicity, let’s also convert the result to a Python list, using the built-in list()
function.
col_names = list(df_hitchhikers.columns)
col_names
['name', 'taxon', 'gender', 'number_of_heads', 'intelligence']
This can be useful especially if you have a lot of columns in a DataFrame.
Exercise 2 (English)
Access the column names of the
df_samples
DataFrame.
Exercice 2 (Français)
Afficher les noms des colonnes du DataFrame
df_samples
.
Accessing a single column#
To access a single column, give the column name as a string inside square brackets. Let’s access the “name” column from the DataFrame of book characters, and assign the result to a new variable character_name
.
character_name = df_hitchhikers["name"]
When you access a single column or row from a DataFrame, pandas returns a Series object to you. Pandas uses the Series class to hold data either from a single column or a single row, i.e., a one-dimensional sequence of data values.
type(character_name)
pandas.core.series.Series
We can also inspect the data in the column.
character_name
0 Arthur Dent
1 Tricia McMillan
2 Ford Prefect
3 Zaphod Beeblebrox
4 Marvin the Paranoid Android
Name: name, dtype: object
Investigating data in a column#
The pandas Series class has some useful functions for investigating the data in a DataFrame.
E.g., let’s access the “taxon” column and use the value_counts()
functions to get a summary of the categorical data in the column.
character_taxon = df_hitchhikers["taxon"]
character_taxon.value_counts()
taxon
human 2
betelgeusian 2
robot 1
Name: count, dtype: int64
The pandas Series class also has a describe()
function which can be useful for investigating numerical data. Let’s use it with the number_of_heads
column.
character_heads = df_hitchhikers["number_of_heads"]
character_heads.describe()
count 5.000000
mean 1.200000
std 0.447214
min 1.000000
25% 1.000000
50% 1.000000
75% 1.000000
max 2.000000
Name: number_of_heads, dtype: float64
The characters in our DataFrame have at least (min) 1 and at most (max) 2 heads. The mean number of heads is 1.2, but of course this a bit silly, because number of heads is not a continuous variable.
Exercise 3 (English)
Access the “country” column from the
df_samples
DataFrame, and then use thevalue_counts()
function to show the number of mosquito samples from each country.Access the “type” column from the
df_geneset
DataFrame, and then use thevalue_counts()
function to show the number of genome annotations of different types.Access the “effect” column from the
snp_freqs_df
DataFrame, and then use thevalue_counts()
function to show the number of SNPs with different effects.
Exercice 3 (Français)
Obtenir la colonne “country” du DataFrame
df_samples
et utiliser la fonctionvalue_counts()
pour afficher le nombre de moustiques capturés dans chaque pays.Obtenir la colonne “type” du DataFrame
df_geneset
et utiliser la fonctionvalue_counts()
pour afficher le nombre d’annotations du génome de chaque type.Obtenir la colonne “effect” du DataFrame
snp_freqs_df
et utiliser la fonctionvalue_counts()
pour afficher le nombre de SNPs pour chaque effet possible.
Accessing multiple columns#
To access multiple columns, provide the column names as a list of strings. Let’s access the “name”, “taxon” and “number_of_heads” columns from the DataFrame of book characters, and assign the resulting DataFrame to a new variable df_hh
.
selected_cols = ["name", "taxon", "number_of_heads"]
df_hh = df_hitchhikers[selected_cols]
df_hh
name | taxon | number_of_heads | |
---|---|---|---|
0 | Arthur Dent | human | 1 |
1 | Tricia McMillan | human | 1 |
2 | Ford Prefect | betelgeusian | 1 |
3 | Zaphod Beeblebrox | betelgeusian | 2 |
4 | Marvin the Paranoid Android | robot | 1 |
You can also put the list of columns directly inside the square brackets, which would give the same result.
df_hh = df_hitchhikers[["name", "taxon", "number_of_heads"]]
df_hh
name | taxon | number_of_heads | |
---|---|---|---|
0 | Arthur Dent | human | 1 |
1 | Tricia McMillan | human | 1 |
2 | Ford Prefect | betelgeusian | 1 |
3 | Zaphod Beeblebrox | betelgeusian | 2 |
4 | Marvin the Paranoid Android | robot | 1 |
Exercise 4 (English)
Create a new DataFrame containing only the “sample_id”, “country” and “year” columns from the
df_samples
DataFrame.Create a new DataFrame containing only the “ID”, “contig”, “start” and “end” columns from the
df_geneset
DataFrame.Create a new DataFrame containing only the frequency columns from the
snp_freqs_df
DataFrame (hint: the names of these columns all begin with “frq_”).
Exercice 4 (Français)
Créer un nouveau DataFrame contenant seulement les colonnes “sample_id”, “country” et “year” du DataFrame
df_samples
.Créer un nouveau DataFrame contenant seulement les colonnes “ID”, “contig”, “start” et “end” du DataFrame
df_geneset
.Créer un nouveau DataFrame contenant seulement les colonnes contenant les fréquences du DataFrame
snp_freqs_df
(Indice: les noms de ces colonnes commencent tous par “frq_”).
Accessing rows#
Now we’ve seen how to access columns, let’s see how to access rows. To access rows by their position within a DataFrame, we can use the iloc
property.
Accessing a single row#
To access a single row, we use the iloc
property, then provide the row number as an integer inside square brackets, remembering that numbering starts from 0 in Python. For example, access the first row of our DataFrame of book characters, and assign the result to a variable first_row
.
first_row = df_hitchhikers.iloc[0]
When accessing a single row, pandas returns a Series.
type(first_row)
pandas.core.series.Series
Inspect the data in the row.
first_row
name Arthur Dent
taxon human
gender male
number_of_heads 1
intelligence 100
Name: 0, dtype: object
Access specific data values within the row.
first_row["name"]
'Arthur Dent'
first_row["number_of_heads"]
1
Exercise 5 (English)
Access the 81st row in the
df_samples
DataFrame. What country did this mosquito sample come from? (Remember: numbering starts from 0 in Python.)Access the 73,287th row in the
df_geneset
DataFrame. What type of annotation is this? What is its Name?
Exercice 5 (Français)
Obtenir la 81ème ligne du DataFrame
df_samples
. De quel pays est issu ce moustique? (Indice: les indices commencent à 0 en Python)Obtenir la 73 287ème ligne du DataFrame
df_geneset
. De quel type d’annotation s’agit-il? Quel est son nom?
Accessing multiple rows#
We can also use iloc
to access multiple rows,from some start position to some end position. To do this we provide two integers separated by a colon (“:”) which is called a slice.
As well as remembering that numbering starts from zero, we also have to remember that slices include the start position but do not include the stop position. E.g., to access the first two rows, provide the start and stop positions 0 and 2. This returns a new DataFrame with just the selected rows within the slice.
first_two_rows = df_hitchhikers.iloc[0:2]
first_two_rows
name | taxon | gender | number_of_heads | intelligence | |
---|---|---|---|---|---|
0 | Arthur Dent | human | male | 1 | 100 |
1 | Tricia McMillan | human | female | 1 | 180 |
Exercise 6 (English)
Access rows starting from the 6th row up to and including the 10th row in the
df_samples
DataFrame.
Exercice 6 (Français)
Obtenir toutes les lignes entre la 6ème et la 10ème (incluses) du DataFrame
df_samples
.
Indexes#
Every pandas DataFrame has an index, which is a special column (or set of columns) which can be used to access rows quickly and conveniently. In particular, indexes can be used to assign a label to every row, and then use that label to locate and access matching rows.
For example, when working with our DataFrame of book characters, we might want to access rows using the name of the character. To do this, we can set the “name” column as the index.
df_hitchhikers_indexed = df_hitchhikers.set_index("name")
df_hitchhikers_indexed
taxon | gender | number_of_heads | intelligence | |
---|---|---|---|---|
name | ||||
Arthur Dent | human | male | 1 | 100 |
Tricia McMillan | human | female | 1 | 180 |
Ford Prefect | betelgeusian | male | 1 | 120 |
Zaphod Beeblebrox | betelgeusian | male | 2 | 30 |
Marvin the Paranoid Android | robot | None | 1 | 5000000 |
The DataFrame df_hitchhikers_indexed
has the same data as before, but notice that the “name” column has moved to the left and is shown in bold. This shows us that the “name” column is now being used as the DataFrame index.
Now that we’ve set a useful index, we can use the loc
property to locate rows with a particular name. E.g., locate the row where the character’s name is “Tricia McMillan”.
trillian = df_hitchhikers_indexed.loc["Tricia McMillan"]
This indexing action locates a single row, and so pandas returns a Series.
type(trillian)
pandas.core.series.Series
Inspect the row of data we located.
trillian
taxon human
gender female
number_of_heads 1
intelligence 180
Name: Tricia McMillan, dtype: object
It’s also useful to know that you can always reset the index back to the default, using the reset_index()
function. The default is to index the DataFrame with integers.
df_hitchhikers_reset = df_hitchhikers_indexed.reset_index()
df_hitchhikers_reset
name | taxon | gender | number_of_heads | intelligence | |
---|---|---|---|---|---|
0 | Arthur Dent | human | male | 1 | 100 |
1 | Tricia McMillan | human | female | 1 | 180 |
2 | Ford Prefect | betelgeusian | male | 1 | 120 |
3 | Zaphod Beeblebrox | betelgeusian | male | 2 | 30 |
4 | Marvin the Paranoid Android | robot | None | 1 | 5000000 |
Exercise 7 (English)
Set the “sample_id” column as the index on the
df_samples
DataFrame, then locate the row with sample identifier “BK0055-C”. Where was this mosquito collected?Set the “ID” column as the index on the
df_geneset
DataFrame, then locate the row with identifier “AGAP000818”. What is the name of this gene?
Exercice 7 (Français)
Faire de la colonne “sample_id” l’index du DataFrame
df_samples
et ensuite trouver la ligne avec comme identificant d’échantillon “BK0055-C”. D’où est issu ce moustique?Faire de la colonne “ID” l’index du DataFrame
df_geneset
et ensuite trouver la ligne avec comme identifiant “AGAP000818”. Quel est le nom de ce gène?
Queries#
DataFrames also provide a query()
function, which can be used to select rows matching some condition or combination of conditions. The query()
function accepts a string containing a query expression.
Let’s query the book characters DataFrame to select only rows where the value of the “taxon” column is “human”. To match a value we use the “==” operator.
df_human_hitchhikers = df_hitchhikers.query("taxon == 'human'")
df_human_hitchhikers
name | taxon | gender | number_of_heads | intelligence | |
---|---|---|---|---|---|
0 | Arthur Dent | human | male | 1 | 100 |
1 | Tricia McMillan | human | female | 1 | 180 |
Let’s do the opposite, selecting only rows where the value of the “taxon” column is not “human”, using the “!=” operator.
df_non_human_hitchhikers = df_hitchhikers.query("taxon != 'human'")
df_non_human_hitchhikers
name | taxon | gender | number_of_heads | intelligence | |
---|---|---|---|---|---|
2 | Ford Prefect | betelgeusian | male | 1 | 120 |
3 | Zaphod Beeblebrox | betelgeusian | male | 2 | 30 |
4 | Marvin the Paranoid Android | robot | None | 1 | 5000000 |
Let’s query a numerical field using the “>” greater than operator.
df_multi_headed_hitchhikers = df_hitchhikers.query("number_of_heads > 1")
df_multi_headed_hitchhikers
name | taxon | gender | number_of_heads | intelligence | |
---|---|---|---|---|---|
3 | Zaphod Beeblebrox | betelgeusian | male | 2 | 30 |
A query expression can combine conditions on multiple columns using “and” and “or”. E.g., find rows where the value of the “taxon” column is “human” and the value of the “number_of_heads” column is 1.
df_non_human_one_headed_hitchhikers = df_hitchhikers.query(
"taxon != 'human' and number_of_heads == 1"
)
df_non_human_one_headed_hitchhikers
name | taxon | gender | number_of_heads | intelligence | |
---|---|---|---|---|---|
2 | Ford Prefect | betelgeusian | male | 1 | 120 |
4 | Marvin the Paranoid Android | robot | None | 1 | 5000000 |
Exercise 8 (English)
Query the
df_samples
DataFrame to select mosquitoes collected in Burkina Faso.Query the
df_samples
DataFrame to select mosquitoes collected in Burkina Faso in 2014.Query the
df_samples
DataFrame to select Anopheles coluzzii mosquitoes collected in Burkina Faso in 2014 (hint:taxon == 'coluzzii'
).Query the
snp_freqs_df
DataFrame to select SNPs which change the protein sequence (hint:effect == 'NON_SYNONYMOUS_CODING'
).Query the
snp_freqs_df
DataFrame to select SNPs which change the protein sequence and which occur above 5% frequency in at least one cohort (hint:max_af > 0.05
).
Exercice 8 (Français)
Sélectionner parmi les moustiques du DataFrame
df_samples
ceux qui ont été capturés au Burkina Faso.Sélectionner parmi les moustiques du DataFrame
df_samples
ceux qui ont été capturés au Burkina Faso en 2014.Sélectionner parmi les moustiques du DataFrame
df_samples
les Anophèles coluzzii qui ont été capturés au Burkina Faso en 2014. (Indice:taxon == 'coluzzii'
).Sélectionner parmi les SNPs du DataFrame
snp_freqs_df
ceux qui modifient la séquence d’une protéine (Indice:effect == 'NON_SYNONYMOUS_CODING'
).Sélectionner parmi les SNPs du DataFrame
snp_freqs_df
ceux qui modifient la séquence d’une protéine et apparaissent avec une fréquence supérieure à 5% dans au moins une cohorte (Indice:max_af > 0.05
).
Input/output#
We saw at the beginning of this module how to read in data from a file. Let’s look finally at how to write data out to a file.
To write a DataFrame out to a CSV file, use the to_csv() function. E.g., let’s write out the book characters DataFrame to a CSV file.
df_hitchhikers.to_csv("hitchhikers.csv")
You can download this file to your local computer by opening the file browser tab to the left in colab.
There are several other file formats you can write to. E.g., you could write to an Excel file. Here, we use the
df_hitchhikers.to_excel("hitchhikers.xlsx")
Any of the DataFrames we created in this module could be written out in the same way. E.g., we could write a DataFrame that we obtained by querying the book characters for only humans.
df_human_hitchhikers.to_excel("hh_human.xlsx")
Exercise 9 (English)
Write the
df_samples
DataFrame out to an Excel file.Query the
df_samples
DataFrame to select only mosquitoes from Burkina Faso, then write the resulting DataFrame out to a CSV file.
Exercice 9 (Français)
Enregistrer le DataFrame
df_samples
sous forme de fichier Excel.Sélectionner parmi les moustiques du DataFrame
df_samples
ceux provenant du Burkina Faso et enregistrer le DataFrame ainsi créé sous forme de fichier CSV.
Topics not covered here#
Please note that this module has only covered a few of the many features that pandas supports. Here are a few of the features that we have not covered, but which you might like to learn more about through other online tutorials.
Grouping and aggregation
Pivot tables
Joining and merging
Concatenating
Advanced indexing (multi-level indexes; Boolean indexing; …)
Modifying data
Apply
Missing data
Time series
Plotting
Text
…
Further reading#
Here are a few links to online resources providing further information about pandas for new users.
pandas user guide (in particular see 10 minutes to pandas)