Last updated: 2026-03-04 01:41:56
Data in Python
Introduction
In this chapter, we introduce Python functionality for working with data. This is where we deviate from the Python standard library, and start introducing third-party packages. Namely, we introduce the two most important Python packages for working with data:
- numpy, for working with arrays (Arrays with numpy)
- pandas, for working with tables (Tables with pandas)
This chapter is a brief summary of one of the topics that are prerequisites for the main contents of the book (Prerequisites). Readers who are already familiar with the material can either skip this chapter or go over it as a refresher and self-check before going further. Readers who are new to the material can use this chapter as a checklist of the topics to focus on when learning the prerequisites of the book.
For a detailed introduction to the topics presented in this chapter, see the following chapters from the Introduction to Spatial Data Programming with Python book:
Packages
import numpy as np
import pandas as pd
Arrays with numpy
What is numpy?
numpy (Harris et al. 2020) is a Python package for working with arrays—multi-dimensional ordered collections of values of the same type—typically numeric.
Creating arrays
A numpy array can be created from a list using function np.array. For example, the following two-level list named x:
x = [
[1, 0, 0],
[2, 1, 2],
[3, 1, 0],
[2, 7, 1]
]
x
[[1, 0, 0], [2, 1, 2], [3, 1, 0], [2, 7, 1]]
can be transformed into a two-dimensional numpy array named b, as follows:
b = np.array(x)
b
array([[1, 0, 0],
       [2, 1, 2],
       [3, 1, 0],
       [2, 7, 1]])
The numpy package has numerous other functions for creating arrays. For example, function np.zeros creates an array of the specified shape (more on that below), filled with zeros:
a = np.zeros(7)
a
array([0., 0., 0., 0., 0., 0., 0.])
The most important properties of an array are its number of dimensions .ndim:
a.ndim
1
b.ndim
2
its .shape, i.e., the lengths of the dimensions:
a.shape
(7,)
b.shape
(4, 3)
and its data type (.dtype):
a.dtype
dtype('float64')
b.dtype
dtype('int64')
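To illustrate how the shape relates to the other two properties, here is a minimal sketch that passes a tuple to np.zeros to create a two-dimensional array. The variable name z and the chosen shape are illustrative assumptions, not part of the running example:

```python
import numpy as np

# Passing a tuple as the shape creates a multi-dimensional array
z = np.zeros((2, 3))

print(z.ndim)   # number of dimensions
print(z.shape)  # length of each dimension
print(z.dtype)  # np.zeros uses a float data type by default
```

The shape is always a tuple, which is why the one-dimensional array a above reports (7,) rather than 7.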
Subsetting
Subsets of an array can be created using list-style indices, separately for each dimension, separated by commas. The : symbol specifies “all indices”. For example:
b
array([[1, 0, 0],
       [2, 1, 2],
       [3, 1, 0],
       [2, 7, 1]])
b[1, :]  ## 2nd row
array([2, 1, 2])
b[:, 0]  ## 1st column
array([1, 2, 3, 2])
b[1:3, 0:2]  ## rows 2-3, columns 1-2
array([[2, 1],
       [3, 1]])
Note that the number of dimensions is automatically reduced when possible. For example, the first two of the above subsets are 1-dimensional arrays, while the third subset is 2-dimensional.
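The reduction happens when a dimension is indexed with a single integer. As a small sketch (the names row_1d and row_2d are ours), indexing the same row with a slice of length one keeps the array two-dimensional:

```python
import numpy as np

b = np.array([[1, 0, 0], [2, 1, 2], [3, 1, 0], [2, 7, 1]])

row_1d = b[1, :]    # integer index: the row dimension is dropped
row_2d = b[1:2, :]  # slice of length one: the row dimension is kept

print(row_1d.shape)
print(row_2d.shape)
```

This can be useful when later code expects a two-dimensional input regardless of how many rows were selected.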
Missing data
Missing data in a numpy array are represented by the special float value np.nan. Therefore, an np.nan value can only appear in arrays of type float. To demonstrate, let’s convert array b from int to float, using method .astype:
b = b.astype(float)
b
array([[1., 0., 0.],
       [2., 1., 2.],
       [3., 1., 0.],
       [2., 7., 1.]])
Now, we can assign an np.nan into array b:
b[3, 1] = np.nan
b
array([[ 1.,  0.,  0.],
       [ 2.,  1.,  2.],
       [ 3.,  1.,  0.],
       [ 2., nan,  1.]])
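The following sketch shows why the conversion to float is needed, and how missing values can be located afterwards. The array names ints and floats are hypothetical, and np.isnan is a numpy function that this chapter does not otherwise cover:

```python
import numpy as np

ints = np.array([1, 2, 3])

# Assigning np.nan into an integer array is expected to raise an error
try:
    ints[0] = np.nan
    failed = False
except ValueError:
    failed = True

# After conversion to float, the assignment works
floats = ints.astype(float)
floats[0] = np.nan

# np.isnan returns a boolean array marking the missing values
mask = np.isnan(floats)
print(failed)
print(mask)
```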
ndarray to list
An array can be transformed to a list, using the .tolist method:
b.tolist()
[[1.0, 0.0, 0.0], [2.0, 1.0, 2.0], [3.0, 1.0, 0.0], [2.0, nan, 1.0]]
This can be thought of as the reverse of array creation from a list (Creating arrays).
Tables with pandas
What is pandas?
The pandas package (McKinney 2010) provides data structures and functions for working with tables in Python. It provides two fundamental data structures:
- Series: an indexed one-dimensional array, used to represent a table column or row
- DataFrame: an indexed two-dimensional array, used to represent a table
Series from scratch
Let’s create an example of a Series to see how this data structure behaves. We can take a list of values:
s = ['Soroka', 'Yoseftal', 'Barzilai']
s
['Soroka', 'Yoseftal', 'Barzilai']
and convert it to a Series, using function pd.Series:
s = pd.Series(s)
s
0      Soroka
1    Yoseftal
2    Barzilai
dtype: str
Note that a Series is composed of indices (in this case the default consecutive integers):
s.index
RangeIndex(start=0, stop=3, step=1)
and the values:
s.values
<ArrowStringArray>
['Soroka', 'Yoseftal', 'Barzilai']
Length: 3, dtype: str
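The indices do not have to be the default consecutive integers. As a brief sketch, assuming hypothetical labels 'a', 'b', and 'c', custom indices can be passed through the index parameter of pd.Series and then used to access values:

```python
import pandas as pd

# Hypothetical labels 'a', 'b', 'c' replace the default 0, 1, 2
s2 = pd.Series(['Soroka', 'Yoseftal', 'Barzilai'], index=['a', 'b', 'c'])

print(s2.index.to_list())
print(s2['b'])  # access a value by its index label
```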
DataFrame from scratch
A DataFrame is a collection of Series all sharing the same (row) indices, comprising columns, so that the entire data structure is a table. For demonstrating DataFrame creation, let’s create two more series, with longitude and latitude values:
lon = [34.800933, 34.940560, 34.562429]
lat = [31.258211, 29.554168, 31.663251]
The three series (s, lon, and lat) can be combined into a DataFrame, using function pd.DataFrame, as follows:
dat = pd.DataFrame({'name': s, 'lon': lon, 'lat': lat})
dat
|   | name | lon | lat |
|---|---|---|---|
| 0 | Soroka | 34.800933 | 31.258211 |
| 1 | Yoseftal | 34.940560 | 29.554168 |
| 2 | Barzilai | 34.562429 | 31.663251 |
Note that:
- We “skipped” transforming lon and lat to Series; the conversion to DataFrame accepts lists too, in which case they are automatically transformed to Series
- The above input is a dict where keys are column names, and values are the column data
Setting the index
When necessary, we can “transfer” the values of a column into the row indices, using the .set_index method:
dat.set_index('name')
| name | lon | lat |
|---|---|---|
| Soroka | 34.800933 | 31.258211 |
| Yoseftal | 34.940560 | 29.554168 |
| Barzilai | 34.562429 | 31.663251 |
The above operation also automatically sets an axis name, which is typically not useful and can be confusing. To remove the axis name, we can use the .rename_axis method as follows:
dat.set_index('name').rename_axis(None, axis=0)
|   | lon | lat |
|---|---|---|
| Soroka | 34.800933 | 31.258211 |
| Yoseftal | 34.940560 | 29.554168 |
| Barzilai | 34.562429 | 31.663251 |
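Note that the expressions above return a new DataFrame and leave dat unchanged. A minimal sketch, assuming we want to keep the re-indexed table under the hypothetical name dat2:

```python
import pandas as pd

dat = pd.DataFrame({
    'name': ['Soroka', 'Yoseftal', 'Barzilai'],
    'lon': [34.800933, 34.940560, 34.562429],
    'lat': [31.258211, 29.554168, 31.663251]
})

# The result must be assigned to be kept; dat itself is not modified
dat2 = dat.set_index('name').rename_axis(None, axis=0)

print(dat.columns.to_list())      # 'name' is still a column in dat
print(dat2.index.to_list())       # names are now the row indices in dat2
print(dat2.loc['Soroka', 'lat'])  # rows can be accessed by name
```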
DataFrame from ndarray
A two-dimensional numpy array can be transformed to a DataFrame, as follows:
b
array([[ 1.,  0.,  0.],
       [ 2.,  1.,  2.],
       [ 3.,  1.,  0.],
       [ 2., nan,  1.]])
pd.DataFrame(b)
|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 |
| 1 | 2.0 | 1.0 | 2.0 |
| 2 | 3.0 | 1.0 | 0.0 |
| 3 | 2.0 | NaN | 1.0 |
The row and column names can be set as part of the DataFrame creation, as follows:
pd.DataFrame(b, index=['a','b','c','d'], columns=['col1','col2','col3'])
|   | col1 | col2 | col3 |
|---|---|---|---|
| a | 1.0 | 0.0 | 0.0 |
| b | 2.0 | 1.0 | 2.0 |
| c | 3.0 | 1.0 | 0.0 |
| d | 2.0 | NaN | 1.0 |
Subsetting—rows and columns
DataFrame rows and columns can be selected using methods .loc (using the indices) or .iloc (using numeric numpy-style indices). In .loc the subset is inclusive of the last index, while in .iloc it is not. In both cases, two indices can be passed: rows and columns (in that order). The : symbol specifies “all” rows or columns. list-style and numpy slicing syntax can be used to specify multiple indices. For example, considering the table dat:
dat
|   | name | lon | lat |
|---|---|---|---|
| 0 | Soroka | 34.800933 | 31.258211 |
| 1 | Yoseftal | 34.940560 | 29.554168 |
| 2 | Barzilai | 34.562429 | 31.663251 |
Here is how we can select the first two rows:
dat.iloc[:2, :]
|   | name | lon | lat |
|---|---|---|---|
| 0 | Soroka | 34.800933 | 31.258211 |
| 1 | Yoseftal | 34.940560 | 29.554168 |
and here is how we can select columns lon and lat:
dat.loc[:, ['lon', 'lat']]
|   | lon | lat |
|---|---|---|
| 0 | 34.800933 | 31.258211 |
| 1 | 34.940560 | 29.554168 |
| 2 | 34.562429 | 31.663251 |
Commonly used “shortcuts” include selecting one column, as a Series, by passing just one column index:
dat['name']
0      Soroka
1    Yoseftal
2    Barzilai
Name: name, dtype: str
Or selecting one or more columns, as a DataFrame, by passing a list of column indices:
dat[['lon', 'lat']]
|   | lon | lat |
|---|---|---|
| 0 | 34.800933 | 31.258211 |
| 1 | 34.940560 | 29.554168 |
| 2 | 34.562429 | 31.663251 |
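The difference between .loc (inclusive of the last index) and .iloc (exclusive) is easy to miss when the row indices are the default integers. Here is a small sketch applying the same slice 0:1 with both methods (the names by_label and by_position are ours):

```python
import pandas as pd

dat = pd.DataFrame({
    'name': ['Soroka', 'Yoseftal', 'Barzilai'],
    'lon': [34.800933, 34.940560, 34.562429],
    'lat': [31.258211, 29.554168, 31.663251]
})

by_label = dat.loc[0:1, :]      # labels 0 through 1, both included
by_position = dat.iloc[0:1, :]  # positions 0 up to, but not including, 1

print(len(by_label))
print(len(by_position))
```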
Subsetting—rows by condition
We can subset specific rows from a DataFrame by:
- Creating a boolean Series specifying which rows to retain (typically using one or more of the existing DataFrame columns)
- Passing the Series as an index inside square brackets
For example, considering the DataFrame named dat:
dat
|   | name | lon | lat |
|---|---|---|---|
| 0 | Soroka | 34.800933 | 31.258211 |
| 1 | Yoseftal | 34.940560 | 29.554168 |
| 2 | Barzilai | 34.562429 | 31.663251 |
Here is how we can subset the rows where latitude is greater than 30:
sel = dat['lat'] > 30
sel
0     True
1    False
2     True
Name: lat, dtype: bool
dat[sel]
|   | name | lon | lat |
|---|---|---|---|
| 0 | Soroka | 34.800933 | 31.258211 |
| 2 | Barzilai | 34.562429 | 31.663251 |
or combined into one expression:
dat[dat['lat'] > 30]
|   | name | lon | lat |
|---|---|---|---|
| 0 | Soroka | 34.800933 | 31.258211 |
| 2 | Barzilai | 34.562429 | 31.663251 |
Sometimes we need to subset rows where the value belongs to a set of values, such as a list of names:
sel = ['Soroka', 'Barzilai']
In such a case, the .isin method can be used, as follows:
dat[dat['name'].isin(sel)]
|   | name | lon | lat |
|---|---|---|---|
| 0 | Soroka | 34.800933 | 31.258211 |
| 2 | Barzilai | 34.562429 | 31.663251 |
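A boolean Series can also be built from more than one column. As a hedged sketch, two conditions can be combined with the & (and) or | (or) operators, with each condition wrapped in parentheses. The threshold of 34.7 for longitude is an arbitrary value chosen for illustration:

```python
import pandas as pd

dat = pd.DataFrame({
    'name': ['Soroka', 'Yoseftal', 'Barzilai'],
    'lon': [34.800933, 34.940560, 34.562429],
    'lat': [31.258211, 29.554168, 31.663251]
})

# Rows where latitude is greater than 30 AND longitude is greater than 34.7
sel = (dat['lat'] > 30) & (dat['lon'] > 34.7)
result = dat[sel]

print(result)
```

The parentheses are needed because & and | have higher precedence than comparison operators such as >.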
Sorting rows
The rows of a DataFrame can be sorted using the .sort_values method. The by parameter specifies the column(s) to sort by, and the ascending parameter specifies the sorting order (the default is True, i.e., ascending order). For example:
dat.sort_values(by='lat')
|   | name | lon | lat |
|---|---|---|---|
| 1 | Yoseftal | 34.940560 | 29.554168 |
| 0 | Soroka | 34.800933 | 31.258211 |
| 2 | Barzilai | 34.562429 | 31.663251 |
dat.sort_values(by='lat', ascending=False)
|   | name | lon | lat |
|---|---|---|---|
| 2 | Barzilai | 34.562429 | 31.663251 |
| 0 | Soroka | 34.800933 | 31.258211 |
| 1 | Yoseftal | 34.940560 | 29.554168 |
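When sorting by more than one column, both by and ascending accept lists. The sketch below uses a small hypothetical table named tab (not part of the hospitals example), since ties in the first column are needed for the second column to matter:

```python
import pandas as pd

# Hypothetical table with repeated values in the 'group' column
tab = pd.DataFrame({
    'group': ['b', 'a', 'b', 'a'],
    'value': [1, 4, 3, 2]
})

# Sort by 'group' (ascending), then by 'value' (descending) within each group
out = tab.sort_values(by=['group', 'value'], ascending=[True, False])

print(out)
```

As in the examples above, the original row indices are retained in the sorted result.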
Missing data
Missing data in Series and DataFrames are represented using np.nan. For example:
dat.loc[1, 'name'] = np.nan
dat
|   | name | lon | lat |
|---|---|---|---|
| 0 | Soroka | 34.800933 | 31.258211 |
| 1 | NaN | 34.940560 | 29.554168 |
| 2 | Barzilai | 34.562429 | 31.663251 |
Missing values can be detected using method .isna:
dat['name'].isna()
0    False
1     True
2    False
Name: name, dtype: bool
or .notna:
dat['name'].notna()
0     True
1    False
2     True
Name: name, dtype: bool
Let’s fill the missing value before moving on:
dat.loc[1, 'name'] = 'Yoseftal'
dat
|   | name | lon | lat |
|---|---|---|---|
| 0 | Soroka | 34.800933 | 31.258211 |
| 1 | Yoseftal | 34.940560 | 29.554168 |
| 2 | Barzilai | 34.562429 | 31.663251 |
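Since .isna and .notna return boolean Series, they can be combined with subsetting rows by condition. A minimal sketch, which recreates the table with a missing name, and where the names complete and n_missing are ours:

```python
import numpy as np
import pandas as pd

dat = pd.DataFrame({
    'name': ['Soroka', 'Yoseftal', 'Barzilai'],
    'lon': [34.800933, 34.940560, 34.562429],
    'lat': [31.258211, 29.554168, 31.663251]
})
dat.loc[1, 'name'] = np.nan

# Keep only the rows where 'name' is not missing
complete = dat[dat['name'].notna()]

# Count the missing values (True is counted as 1)
n_missing = dat['name'].isna().sum()

print(complete)
print(n_missing)
```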
Series to list
A Series can be converted to a list using the .to_list (or synonym .tolist) method:
dat['lon'].to_list()
[34.800933, 34.94056, 34.562429]
This can be thought of as the opposite of Series creation from a list (Series from scratch).
DataFrame to '.csv'
A DataFrame can be exported to a '.csv' file as follows:
dat.to_csv('output/hospitals.csv', index=False)
DataFrame from '.csv'
A DataFrame can be imported from a '.csv' file using the pd.read_csv function:
pd.read_csv('output/hospitals.csv')
|   | name | lon | lat |
|---|---|---|---|
| 0 | Soroka | 34.800933 | 31.258211 |
| 1 | Yoseftal | 34.940560 | 29.554168 |
| 2 | Barzilai | 34.562429 | 31.663251 |
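The export and import steps can be tried together as a round trip. The sketch below is self-contained: it writes to a temporary directory rather than to the 'output' folder used above, which is our assumption so that the code runs anywhere:

```python
import os
import tempfile

import pandas as pd

dat = pd.DataFrame({
    'name': ['Soroka', 'Yoseftal', 'Barzilai'],
    'lon': [34.800933, 34.940560, 34.562429],
    'lat': [31.258211, 29.554168, 31.663251]
})

# Write to a temporary file, then read it back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'hospitals.csv')
    dat.to_csv(path, index=False)  # index=False: do not write the row indices
    dat2 = pd.read_csv(path)

print(dat2.shape)
print(dat2.columns.to_list())
```

With index=False, the row indices are not written to the file, so the imported table has the same three columns as the original.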
Exercises
Exercise 02-01
- Create a \(10 \times 10\) ndarray of type int representing the multiplication table
Exercise 02-02
- Calculate the bounds of the hospitals (DataFrame from '.csv'), as a list with four float values of the form [xmin, ymin, xmax, ymax]
Exercise 02-03
- The file 'output/europe_borders.csv' is a pairwise matrix of European countries, where True marks that the two countries' borders intersect (see Pairwise matrices)
- Calculate a table with two columns:
  - name: country name
  - count: number of neighbors, excluding self
- Sort the table in decreasing order according to number of neighbors, and print the first 6 rows (i.e., the 6 countries with most neighbors) (Table 19.1)
- Tip: You can use the .sum method with axis=1 to calculate row sums