Data in Python

Last updated: 2026-03-04 01:41:56

Introduction

In this chapter, we introduce Python functionality for working with data. This is where we deviate from the Python standard library, and start introducing third-party packages. Namely, we introduce the two most important Python packages for working with data:

numpy—for working with arrays (Arrays with numpy)
pandas—for working with tables (Tables with pandas)

Note

This chapter briefly summarizes one of the prerequisite topics of the book (Prerequisites). Readers who are already familiar with the material can either skip this chapter or go over it as a refresher and self-check before going further. Readers who are new to the material can use this chapter as a checklist of the topics to focus on when learning the prerequisites of the book.

Note

For a detailed introduction to the topics presented in this chapter, see the corresponding chapters of the Introduction to Spatial Data Programming with Python book.

Packages

import numpy as np
import pandas as pd

Arrays with `numpy`

What is `numpy`?

numpy (Harris et al. 2020) is a Python package for working with arrays—multi-dimensional ordered collections of values of the same type—typically numeric.

Creating arrays

A numpy array can be created from a list using function np.array. For example, the following two-level list named x:

x = [
    [1, 0, 0],
    [2, 1, 2],
    [3, 1, 0],
    [2, 7, 1]
]
x

[[1, 0, 0], [2, 1, 2], [3, 1, 0], [2, 7, 1]]

can be transformed into a two-dimensional numpy array named b, as follows:

b = np.array(x)
b

array([[1, 0, 0],
       [2, 1, 2],
       [3, 1, 0],
       [2, 7, 1]])

The numpy package has numerous other functions for creating an array. For example, function np.zeros can be used to create an array of the specified shape, filled with zeros (more on shape below):

a = np.zeros(7)
a

array([0., 0., 0., 0., 0., 0., 0.])
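Other creation functions follow the same pattern as np.zeros. The sketch below uses np.ones, np.full, and np.arange, which are not covered in this chapter; they are chosen here only as an illustration, and the shapes and fill value are arbitrary.

```python
import numpy as np

# Array of ones with 2 rows and 3 columns (the shape is passed as a tuple)
ones = np.ones((2, 3))

# Array of the given shape filled with an arbitrary constant (7 is just an example)
sevens = np.full((2, 3), 7)

# Consecutive integers 0, 1, ..., 5, rearranged into 2 rows and 3 columns
seq = np.arange(6).reshape(2, 3)

print(ones.shape)       # (2, 3)
print(sevens.tolist())  # [[7, 7, 7], [7, 7, 7]]
print(seq.tolist())     # [[0, 1, 2], [3, 4, 5]]
```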

The most important properties of an array are its number of dimensions (.ndim):

a.ndim

1

b.ndim

2

its .shape, i.e., the lengths of the dimensions:

a.shape

(7,)

b.shape

(4, 3)

and its data type (.dtype):

a.dtype

dtype('float64')

b.dtype

dtype('int64')

Subsetting

Subsets of an array can be created using list-style indices, separately for each dimension, separated by commas. The : symbol specifies “all indices”. For example:

b

array([[1, 0, 0],
       [2, 1, 2],
       [3, 1, 0],
       [2, 7, 1]])

b[1, :]  ## 2nd row

array([2, 1, 2])

b[:, 0]  ## 1st column

array([1, 2, 3, 2])

b[1:3, 0:2]  ## rows 2-3, columns 1-2

array([[2, 1],
       [3, 1]])

Note that a dimension indexed with a single integer is dropped from the result, while a dimension indexed with a slice is kept. For example, the first two of the above subsets are 1-dimensional arrays, while the third subset is 2-dimensional.
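To make the effect on dimensions concrete, here is a minimal sketch comparing an integer index with a slice of length one on the same row of the integer array b from above. The variable names row_1d and row_2d are introduced only for this illustration.

```python
import numpy as np

b = np.array([[1, 0, 0], [2, 1, 2], [3, 1, 0], [2, 7, 1]])

row_1d = b[1, :]    # integer index: the row dimension is dropped
row_2d = b[1:2, :]  # slice of length one: the row dimension is kept

print(row_1d.shape)  # (3,)
print(row_2d.shape)  # (1, 3)
```

Both results hold the same three values; only the number of dimensions differs.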

Missing data

Missing data in a numpy array are represented by the special float value np.nan. Therefore, an np.nan value can only appear in arrays of type float. To demonstrate, let’s convert array b from int to float, using method .astype:

b = b.astype(float)
b

array([[1., 0., 0.],
       [2., 1., 2.],
       [3., 1., 0.],
       [2., 7., 1.]])

Now, we can assign an np.nan into array b:

b[3, 1] = np.nan
b

array([[ 1.,  0.,  0.],
       [ 2.,  1.,  2.],
       [ 3.,  1.,  0.],
       [ 2., nan,  1.]])
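The following sketch, starting again from the integer version of b, illustrates why the conversion to float is needed and how missing values affect calculations. The functions np.isnan and np.nanmean are not covered in this chapter and are included here as an assumption about what is useful to know.

```python
import numpy as np

b = np.array([[1, 0, 0], [2, 1, 2], [3, 1, 0], [2, 7, 1]])

# Assigning np.nan into an integer array raises an error
try:
    b[3, 1] = np.nan
    int_assignment_failed = False
except ValueError:
    int_assignment_failed = True

# After conversion to float, the assignment works
b = b.astype(float)
b[3, 1] = np.nan

n_missing = np.isnan(b).sum()  # number of missing values
mean_all = b.mean()            # nan, because one value is missing
mean_valid = np.nanmean(b)     # mean of the 11 non-missing values

print(int_assignment_failed)  # True
print(n_missing)              # 1
```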

`ndarray` to `list`

An array can be transformed to a list, using the .tolist method:

b.tolist()

[[1.0, 0.0, 0.0], [2.0, 1.0, 2.0], [3.0, 1.0, 0.0], [2.0, nan, 1.0]]

This can be thought of as the reverse of array creation from a list (Creating arrays).

Tables with `pandas`

What is `pandas`?

The pandas package (McKinney 2010) provides data structures and functions for working with tables in Python. It provides two fundamental data structures:

Series: an indexed one-dimensional array, typically used to represent a table column (or a table row)
DataFrame: an indexed two-dimensional array, used to represent a table

`Series` from scratch

Let’s create an example of a Series to see how this data structure behaves. We can take a list of values:

s = ['Soroka', 'Yoseftal', 'Barzilai']
s

['Soroka', 'Yoseftal', 'Barzilai']

and convert it to a Series, using function pd.Series:

s = pd.Series(s)
s

0      Soroka
1    Yoseftal
2    Barzilai
dtype: str

Note that a Series is composed of indices (in this case the default consecutive integers):

s.index

RangeIndex(start=0, stop=3, step=1)

and the values:

s.values

<ArrowStringArray>
['Soroka', 'Yoseftal', 'Barzilai']
Length: 3, dtype: str
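The indices do not have to be the default consecutive integers. As a minimal sketch, the same values can be given custom indices through the index parameter of pd.Series; the labels 'a', 'b', and 'c' are arbitrary and used only for illustration.

```python
import pandas as pd

# Same values as above, with hypothetical index labels instead of 0, 1, 2
s = pd.Series(['Soroka', 'Yoseftal', 'Barzilai'], index=['a', 'b', 'c'])

print(s.index.to_list())  # ['a', 'b', 'c']
print(s['b'])             # Yoseftal
```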

`DataFrame` from scratch

A DataFrame is a collection of Series, all sharing the same (row) indices and making up the columns, so that the entire data structure is a table. To demonstrate DataFrame creation, let’s create two more sets of values, as lists, with longitude and latitude:

lon = [34.800933, 34.940560, 34.562429]
lat = [31.258211, 29.554168, 31.663251]

The three objects (s, lon, and lat) can be combined into a DataFrame, using function pd.DataFrame, as follows:

dat = pd.DataFrame({'name': s, 'lon': lon, 'lat': lat})
dat

	name	lon	lat
0	Soroka	34.800933	31.258211
1	Yoseftal	34.940560	29.554168
2	Barzilai	34.562429	31.663251

Note that:

We “skipped” transforming lon and lat to Series; the conversion to DataFrame accepts lists too, in which case they are automatically transformed to Series
The above input is a dict where keys are column names, and values are the column data

Setting the index

When necessary, we can “transfer” the values of a column into the row indices, using the .set_index method:

dat.set_index('name')

	lon	lat
name
Soroka	34.800933	31.258211
Yoseftal	34.940560	29.554168
Barzilai	34.562429	31.663251

The above operation also automatically sets an axis name, which is typically not useful and can be confusing. To remove the axis name, we can use the .rename_axis method as follows:

dat.set_index('name').rename_axis(None, axis=0)

	lon	lat
Soroka	34.800933	31.258211
Yoseftal	34.940560	29.554168
Barzilai	34.562429	31.663251
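Note that .set_index returns a new DataFrame rather than modifying the original one. The sketch below, which rebuilds dat so that it is self-contained, shows this and also uses .reset_index (not covered in this chapter, shown as an assumption about what is useful) to move the index back into a column.

```python
import pandas as pd

dat = pd.DataFrame({
    'name': ['Soroka', 'Yoseftal', 'Barzilai'],
    'lon': [34.800933, 34.940560, 34.562429],
    'lat': [31.258211, 29.554168, 31.663251]
})

indexed = dat.set_index('name')   # new DataFrame; dat itself is unchanged
restored = indexed.reset_index()  # moves the index back into a column

print(dat.columns.to_list())         # ['name', 'lon', 'lat']
print(indexed.columns.to_list())     # ['lon', 'lat']
print(indexed.loc['Soroka', 'lat'])  # 31.258211
```

To keep the result, assign it to a variable, as done with indexed above.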

`DataFrame` from `ndarray`

A two-dimensional numpy array can be transformed to a DataFrame, as follows:

b

array([[ 1.,  0.,  0.],
       [ 2.,  1.,  2.],
       [ 3.,  1.,  0.],
       [ 2., nan,  1.]])

pd.DataFrame(b)

	0	1	2
0	1.0	0.0	0.0
1	2.0	1.0	2.0
2	3.0	1.0	0.0
3	2.0	NaN	1.0

The row and column names can be set as part of the DataFrame creation, as follows:

pd.DataFrame(b, index=['a','b','c','d'], columns=['col1','col2','col3'])

	col1	col2	col3
a	1.0	0.0	0.0
b	2.0	1.0	2.0
c	3.0	1.0	0.0
d	2.0	NaN	1.0
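The conversion also works in the other direction. As a hedged sketch, the .to_numpy method (not covered in this chapter) returns the values of a DataFrame as a numpy array, dropping the row and column names.

```python
import numpy as np
import pandas as pd

b = np.array([[1., 0., 0.], [2., 1., 2.], [3., 1., 0.], [2., np.nan, 1.]])
df = pd.DataFrame(b, index=['a', 'b', 'c', 'd'], columns=['col1', 'col2', 'col3'])

# Back to a plain two-dimensional array, without the labels
back = df.to_numpy()

print(back.shape)  # (4, 3)
# equal_nan=True treats the np.nan values in both arrays as equal
print(np.array_equal(b, back, equal_nan=True))  # True
```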

Subsetting—rows and columns

DataFrame rows and columns can be selected using .loc (based on the index labels) or .iloc (based on numpy-style integer positions). In .loc a slice includes its last index, while in .iloc it does not. In both cases, two indices can be passed: rows and columns (in that order). The : symbol specifies “all” rows or columns. Lists and numpy-style slices can be used to specify multiple indices. For example, considering the table dat:

dat

	name	lon	lat
0	Soroka	34.800933	31.258211
1	Yoseftal	34.940560	29.554168
2	Barzilai	34.562429	31.663251

Here is how we can select the first two rows:

dat.iloc[:2, :]

	name	lon	lat
0	Soroka	34.800933	31.258211
1	Yoseftal	34.940560	29.554168

and here is how we can select columns lon and lat:

dat.loc[:, ['lon', 'lat']]

	lon	lat
0	34.800933	31.258211
1	34.940560	29.554168
2	34.562429	31.663251

Commonly used “shortcuts” include selecting one column, as a Series, by passing just one column index:

dat['name']

0      Soroka
1    Yoseftal
2    Barzilai
Name: name, dtype: str

Or selecting one or more columns, as a DataFrame, by passing a list of column indices:

dat[['lon', 'lat']]

	lon	lat
0	34.800933	31.258211
1	34.940560	29.554168
2	34.562429	31.663251
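The difference between .loc and .iloc in how a slice treats its last index is easy to overlook when the row indices are the default integers. A minimal sketch, rebuilding dat so that it is self-contained (the names by_label and by_position are introduced only for this illustration):

```python
import pandas as pd

dat = pd.DataFrame({
    'name': ['Soroka', 'Yoseftal', 'Barzilai'],
    'lon': [34.800933, 34.940560, 34.562429],
    'lat': [31.258211, 29.554168, 31.663251]
})

by_label = dat.loc[0:1, :]      # labels 0 through 1, inclusive: two rows
by_position = dat.iloc[0:1, :]  # positions 0 up to (not including) 1: one row

print(by_label['name'].to_list())     # ['Soroka', 'Yoseftal']
print(by_position['name'].to_list())  # ['Soroka']
```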

Subsetting—rows by condition

We can subset specific rows from a DataFrame by:

Creating a boolean Series specifying which rows to retain (typically using one or more of the existing DataFrame columns)
Passing the Series as an index inside square brackets

For example, considering the DataFrame named dat:

dat

	name	lon	lat
0	Soroka	34.800933	31.258211
1	Yoseftal	34.940560	29.554168
2	Barzilai	34.562429	31.663251

Here is how we can subset the rows where latitude is greater than 30:

sel = dat['lat'] > 30
sel

0     True
1    False
2     True
Name: lat, dtype: bool

dat[sel]

	name	lon	lat
0	Soroka	34.800933	31.258211
2	Barzilai	34.562429	31.663251

or combined into one expression:

dat[dat['lat'] > 30]

	name	lon	lat
0	Soroka	34.800933	31.258211
2	Barzilai	34.562429	31.663251

Sometimes we need to subset rows where the value belongs to a set of values, such as a list of names:

sel = ['Soroka', 'Barzilai']

In such a case, the .isin method can be used, as follows:

dat[dat['name'].isin(sel)]

	name	lon	lat
0	Soroka	34.800933	31.258211
2	Barzilai	34.562429	31.663251
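Conditions can also be combined. The sketch below is an illustration that goes slightly beyond the text: it uses the operators & (and), | (or), and ~ (not) on boolean Series, with each comparison wrapped in parentheses. The threshold 34.7 is arbitrary.

```python
import pandas as pd

dat = pd.DataFrame({
    'name': ['Soroka', 'Yoseftal', 'Barzilai'],
    'lon': [34.800933, 34.940560, 34.562429],
    'lat': [31.258211, 29.554168, 31.663251]
})

# Rows where both conditions hold
both = dat[(dat['lat'] > 30) & (dat['lon'] > 34.7)]

# Rows where at least one condition holds
either = dat[(dat['lat'] > 30) | (dat['lon'] > 34.7)]

# Rows where the name is NOT in the given list
others = dat[~dat['name'].isin(['Soroka', 'Barzilai'])]

print(both['name'].to_list())    # ['Soroka']
print(either['name'].to_list())  # ['Soroka', 'Yoseftal', 'Barzilai']
print(others['name'].to_list())  # ['Yoseftal']
```

The parentheses are required because & and | take precedence over comparison operators such as >.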

Sorting rows

The rows of a DataFrame can be sorted using the .sort_values method. We can specify the column(s) taken into account for sorting through by, and the sorting order through ascending (default is True). For example:

dat.sort_values(by='lat')

	name	lon	lat
1	Yoseftal	34.940560	29.554168
0	Soroka	34.800933	31.258211
2	Barzilai	34.562429	31.663251

dat.sort_values(by='lat', ascending=False)

	name	lon	lat
2	Barzilai	34.562429	31.663251
0	Soroka	34.800933	31.258211
1	Yoseftal	34.940560	29.554168
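As the outputs above show, the sorted rows keep their original indices. A minimal sketch of this behavior follows; it also uses .reset_index(drop=True), which is not covered in this chapter, as one possible way to renumber the rows after sorting.

```python
import pandas as pd

dat = pd.DataFrame({
    'name': ['Soroka', 'Yoseftal', 'Barzilai'],
    'lon': [34.800933, 34.940560, 34.562429],
    'lat': [31.258211, 29.554168, 31.663251]
})

# Sorting returns a new DataFrame; the original row indices travel with the rows
sorted_dat = dat.sort_values(by='lat', ascending=False)

# Optionally replace the indices with new consecutive integers
renumbered = sorted_dat.reset_index(drop=True)

print(sorted_dat.index.to_list())    # [2, 0, 1]
print(renumbered.index.to_list())    # [0, 1, 2]
print(renumbered['name'].to_list())  # ['Barzilai', 'Soroka', 'Yoseftal']
```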

Missing data

Missing data in Series and DataFrames are represented using np.nan. For example:

dat.loc[1, 'name'] = np.nan
dat

	name	lon	lat
0	Soroka	34.800933	31.258211
1	NaN	34.940560	29.554168
2	Barzilai	34.562429	31.663251

Missing values can be detected using method .isna:

dat['name'].isna()

0    False
1     True
2    False
Name: name, dtype: bool

or .notna:

dat['name'].notna()

0     True
1    False
2     True
Name: name, dtype: bool
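The boolean Series returned by .isna and .notna can be used for counting or for subsetting. The sketch below also shows .dropna and .fillna, which are not covered in this chapter and are included as an assumption about what is useful; the replacement value 'Unknown' is a hypothetical placeholder.

```python
import numpy as np
import pandas as pd

dat = pd.DataFrame({
    'name': ['Soroka', 'Yoseftal', 'Barzilai'],
    'lon': [34.800933, 34.940560, 34.562429],
    'lat': [31.258211, 29.554168, 31.663251]
})
dat.loc[1, 'name'] = np.nan

n_missing = dat['name'].isna().sum()  # count of missing names
complete = dat[dat['name'].notna()]   # keep rows where name is present
dropped = dat.dropna()                # drop rows with any missing value
filled = dat['name'].fillna('Unknown')  # replace missing values

print(n_missing)                 # 1
print(complete.index.to_list())  # [0, 2]
print(filled.to_list())          # ['Soroka', 'Unknown', 'Barzilai']
```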

Let’s fill the missing value before moving on:

dat.loc[1, 'name'] = 'Yoseftal'
dat

	name	lon	lat
0	Soroka	34.800933	31.258211
1	Yoseftal	34.940560	29.554168
2	Barzilai	34.562429	31.663251

`Series` to `list`

A Series can be converted to a list using the .to_list (or synonym .tolist) method:

dat['lon'].to_list()

[34.800933, 34.94056, 34.562429]

This can be thought of as the opposite of Series creation from a list (Series from scratch).

`DataFrame` to `'.csv'`

A DataFrame can be exported to a '.csv' file as follows:

dat.to_csv('output/hospitals.csv', index=False)

`DataFrame` from `'.csv'`

A DataFrame can be imported from a '.csv' file using the pd.read_csv function:

pd.read_csv('output/hospitals.csv')

	name	lon	lat
0	Soroka	34.800933	31.258211
1	Yoseftal	34.940560	29.554168
2	Barzilai	34.562429	31.663251
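The export and import steps can be checked together as a round trip. The sketch below writes to a temporary directory instead of the 'output' folder used above, so that it runs anywhere; it also illustrates what index=False is for, by comparing with an export that keeps the row indices.

```python
import tempfile
from pathlib import Path

import pandas as pd

dat = pd.DataFrame({
    'name': ['Soroka', 'Yoseftal', 'Barzilai'],
    'lon': [34.800933, 34.940560, 34.562429],
    'lat': [31.258211, 29.554168, 31.663251]
})

with tempfile.TemporaryDirectory() as tmp:
    # Export without the row indices, then read the file back
    path = Path(tmp) / 'hospitals.csv'
    dat.to_csv(path, index=False)
    restored = pd.read_csv(path)

    # Export with the row indices (the default), for comparison
    path_with_index = Path(tmp) / 'hospitals_with_index.csv'
    dat.to_csv(path_with_index)
    with_index = pd.read_csv(path_with_index)

print(restored.columns.to_list())    # ['name', 'lon', 'lat']
print(with_index.columns.to_list())  # ['Unnamed: 0', 'name', 'lon', 'lat']
```

Without index=False, the row indices are written as an extra unnamed column, which then shows up as 'Unnamed: 0' when the file is read.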

Exercises

Exercise 02-01

Create a \(10 \times 10\) ndarray of type int representing the multiplication table

Exercise 02-02

Calculate the bounds of the hospitals (DataFrame from '.csv'), as a list with four float values of the form [xmin,ymin,xmax,ymax]

Exercise 02-03

The file 'output/europe_borders.csv' is a pairwise matrix of European countries, where True marks that the borders of the two countries intersect (see Pairwise matrices)
Calculate a table with two columns:
- name—country name
- count—number of neighbors, excluding self
Sort the table in decreasing order according to number of neighbors, and print the first 6 rows (i.e., the 6 countries with most neighbors) (Table 19.1)
Tip: You can use the .sum method with axis=1 to calculate row sums
