Data in Python

Last updated: 2026-03-04 01:41:56

Introduction

In this chapter, we introduce Python functionality for working with data. This is where we depart from the Python standard library and start introducing third-party packages. Namely, we introduce the two most important Python packages for working with data: numpy, for arrays, and pandas, for tables.

Note

This chapter serves as a brief summary of one of the topics that are prerequisites for the main contents of the book (Prerequisites). Readers who are already familiar with the material can either skip this chapter or go over it as a refresher and self-check before going further. Readers who are new to the material can use this chapter as a checklist of the topics they need to focus on when learning the prerequisites of the book.

Note

For a detailed introduction to the topics presented in this chapter, see the following chapters from the Introduction to Spatial Data Programming with Python book:

Packages

import numpy as np
import pandas as pd

Arrays with numpy

What is numpy?

numpy (Harris et al. 2020) is a Python package for working with arrays—multi-dimensional ordered collections of values of the same type—typically numeric.

Creating arrays

A numpy array can be created from a list using function np.array. For example, the following two-level list named x:

x = [
    [1, 0, 0],
    [2, 1, 2],
    [3, 1, 0],
    [2, 7, 1]
]
x
[[1, 0, 0], [2, 1, 2], [3, 1, 0], [2, 7, 1]]

can be transformed into a two-dimensional numpy array named b, as follows:

b = np.array(x)
b
array([[1, 0, 0],
       [2, 1, 2],
       [3, 1, 0],
       [2, 7, 1]])

The numpy package has numerous other functions for creating an array. For example, function np.zeros can be used to create an array of the specified shape (more on that below), filled with zeros:

a = np.zeros(7)
a
array([0., 0., 0., 0., 0., 0., 0.])

The most important properties of an array are its number of dimensions (.ndim):

a.ndim
1
b.ndim
2

its .shape, i.e., the lengths of the dimensions:

a.shape
(7,)
b.shape
(4, 3)

and its data type (.dtype):

a.dtype
dtype('float64')
b.dtype
dtype('int64')
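
To tie these properties back to array creation, here is a minimal sketch (the variable names z, zi, and r are illustrative only). It assumes that the shape is passed to np.zeros as a tuple, and that the optional dtype argument is used to request a type other than the default float:

```python
import numpy as np

# Shape given as a tuple: 2 rows, 3 columns, filled with zeros
z = np.zeros((2, 3))
print(z.ndim, z.shape, z.dtype)  # 2 (2, 3) float64

# The dtype argument overrides the default float type
zi = np.zeros((2, 3), dtype=int)
print(zi.dtype)

# np.arange creates consecutive integers, here 0 through 4
r = np.arange(5)
print(r.ndim, r.shape)  # 1 (5,)
```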

Subsetting

Subsets of an array can be created using list-style indices, separately for each dimension, separated by commas. The : symbol specifies “all indices”. For example:

b
array([[1, 0, 0],
       [2, 1, 2],
       [3, 1, 0],
       [2, 7, 1]])
b[1, :]  ## 2nd row
array([2, 1, 2])
b[:, 0]  ## 1st column
array([1, 2, 3, 2])
b[1:3, 0:2]  ## rows 2-3, columns 1-2
array([[2, 1],
       [3, 1]])

Note that a dimension is dropped whenever it is indexed with a single integer rather than a slice. For example, the first two of the above subsets are 1-dimensional arrays, while the third subset, which uses slices for both dimensions, is 2-dimensional.
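
The effect of the index type on the number of dimensions can be checked directly. The following sketch uses a separate array named arr (an illustrative name) and selects the same row twice, once with an integer index and once with a slice of length one:

```python
import numpy as np

arr = np.array([[1, 0, 0], [2, 1, 2], [3, 1, 0], [2, 7, 1]])

row_1d = arr[1, :]    # integer index: the row dimension is dropped
row_2d = arr[1:2, :]  # slice of length one: both dimensions are kept

print(row_1d.shape)  # (3,)
print(row_2d.shape)  # (1, 3)
```

Using a slice is therefore one way to keep a single row or column as a two-dimensional array.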

Missing data

Missing data in a numpy array are represented by the special float value np.nan. Therefore, an np.nan value can only appear in arrays of type float. To demonstrate, let’s convert array b from int to float, using method .astype:

b = b.astype(float)
b
array([[1., 0., 0.],
       [2., 1., 2.],
       [3., 1., 0.],
       [2., 7., 1.]])

Now, we can assign an np.nan into array b:

b[3, 1] = np.nan
b
array([[ 1.,  0.,  0.],
       [ 2.,  1.,  2.],
       [ 3.,  1.,  0.],
       [ 2., nan,  1.]])
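
As a small sketch of why the conversion to float is needed, and of how missing values behave afterwards, consider the following example. The arrays ints and floats are illustrative, and np.isnan and np.nanmean are standard numpy functions that were not introduced above:

```python
import numpy as np

ints = np.array([1, 2, 3])
try:
    ints[0] = np.nan  # an integer array cannot hold np.nan
except ValueError as e:
    print('ValueError:', e)

floats = ints.astype(float)
floats[0] = np.nan

print(np.isnan(floats))    # [ True False False]
print(floats.mean())       # nan
print(np.nanmean(floats))  # 2.5
```

Note that an ordinary summary such as .mean returns np.nan when any value is missing, while np.nanmean ignores the missing values.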

ndarray to list

An array can be transformed to a list, using the .tolist method:

b.tolist()
[[1.0, 0.0, 0.0], [2.0, 1.0, 2.0], [3.0, 1.0, 0.0], [2.0, nan, 1.0]]

This can be thought of as the reverse of array creation from a list (Creating arrays).
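
A quick way to check the round trip is sketched below, using a short nested list named nested (an illustrative name):

```python
import numpy as np

nested = [[1, 0, 0], [2, 1, 2]]
round_trip = np.array(nested).tolist()

print(round_trip == nested)    # True
print(type(round_trip[0][0]))  # <class 'int'>
```

The values come back as plain Python numbers rather than numpy scalars.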

Tables with pandas

What is pandas?

The pandas package (McKinney 2010) provides data structures and functions for working with tables in Python. It provides two fundamental data structures:

  • Series: an indexed one-dimensional array, used to represent a table column or row
  • DataFrame: an indexed two-dimensional array, used to represent a table

Series from scratch

Let’s create an example of a Series to see how this data structure behaves. We can take a list of values:

s = ['Soroka', 'Yoseftal', 'Barzilai']
s
['Soroka', 'Yoseftal', 'Barzilai']

and convert it to a Series, using function pd.Series:

s = pd.Series(s)
s
0      Soroka
1    Yoseftal
2    Barzilai
dtype: str

Note that a Series is composed of indices (in this case the default consecutive integers):

s.index
RangeIndex(start=0, stop=3, step=1)

and the values:

s.values
<ArrowStringArray>
['Soroka', 'Yoseftal', 'Barzilai']
Length: 3, dtype: str
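
The indices do not have to be the default consecutive integers. As a sketch, assuming we want the hypothetical labels 'a', 'b', and 'c', they can be passed through the index argument of pd.Series, and then used to access the values:

```python
import pandas as pd

# Hypothetical labels used as the index instead of 0, 1, 2
named = pd.Series(['Soroka', 'Yoseftal', 'Barzilai'], index=['a', 'b', 'c'])

print(named.index.tolist())  # ['a', 'b', 'c']
print(named['b'])            # Yoseftal
```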

DataFrame from scratch

A DataFrame is a collection of Series all sharing the same (row) indices, comprising columns, so that the entire data structure is a table. To demonstrate DataFrame creation, let’s create two more lists, with longitude and latitude values:

lon = [34.800933, 34.940560, 34.562429]
lat = [31.258211, 29.554168, 31.663251]

The three objects (s, lon, and lat) can be combined into a DataFrame, using function pd.DataFrame, as follows:

dat = pd.DataFrame({'name': s, 'lon': lon, 'lat': lat})
dat
       name        lon        lat
0    Soroka  34.800933  31.258211
1  Yoseftal  34.940560  29.554168
2  Barzilai  34.562429  31.663251

Note that:

  • We “skipped” transforming lon and lat to Series; the conversion to DataFrame accepts lists too, in which case they are automatically transformed to Series
  • The above input is a dict where keys are column names, and values are the column data
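
The two points above can be verified with a short, self-contained sketch. Here the table is built from plain lists only and stored under the illustrative name hospitals, so that it does not interfere with dat:

```python
import pandas as pd

# dict keys become column names, dict values become column data
hospitals = pd.DataFrame({
    'name': ['Soroka', 'Yoseftal', 'Barzilai'],
    'lon': [34.800933, 34.940560, 34.562429],
    'lat': [31.258211, 29.554168, 31.663251],
})

print(hospitals.shape)                         # (3, 3)
print(hospitals.columns.tolist())              # ['name', 'lon', 'lat']
print(isinstance(hospitals['lon'], pd.Series)) # True
```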

Setting the index

When necessary, we can “transfer” the values of a column into the row indices, using the .set_index method:

dat.set_index('name')
                lon        lat
name
Soroka    34.800933  31.258211
Yoseftal  34.940560  29.554168
Barzilai  34.562429  31.663251

The above operation also automatically sets an axis name, which is typically not useful and can be confusing. To remove the axis name, we can use the .rename_axis method, as follows:

dat.set_index('name').rename_axis(None, axis=0)
                lon        lat
Soroka    34.800933  31.258211
Yoseftal  34.940560  29.554168
Barzilai  34.562429  31.663251
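
Note that .set_index returns a new DataFrame and leaves the original unchanged, which is why dat still has its name column in the examples that follow. The sketch below illustrates this with a small table named hospitals (an illustrative name), and also shows .reset_index, a standard pandas method not covered above, which moves the index back into a column:

```python
import pandas as pd

hospitals = pd.DataFrame({
    'name': ['Soroka', 'Yoseftal', 'Barzilai'],
    'lat': [31.258211, 29.554168, 31.663251],
})

indexed = hospitals.set_index('name')
print(hospitals.columns.tolist())    # ['name', 'lat'] (original unchanged)
print(indexed.columns.tolist())      # ['lat']
print(indexed.loc['Soroka', 'lat'])  # 31.258211

# Move the index back into an ordinary column
restored = indexed.reset_index()
print(restored.columns.tolist())     # ['name', 'lat']
```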

DataFrame from ndarray

A two-dimensional numpy array can be transformed to a DataFrame, as follows:

b
array([[ 1.,  0.,  0.],
       [ 2.,  1.,  2.],
       [ 3.,  1.,  0.],
       [ 2., nan,  1.]])
pd.DataFrame(b)
     0    1    2
0  1.0  0.0  0.0
1  2.0  1.0  2.0
2  3.0  1.0  0.0
3  2.0  NaN  1.0

The row and column names can be set as part of the DataFrame creation, as follows:

pd.DataFrame(b, index=['a','b','c','d'], columns=['col1','col2','col3'])
   col1  col2  col3
a   1.0   0.0   0.0
b   2.0   1.0   2.0
c   3.0   1.0   0.0
d   2.0   NaN   1.0

Subsetting—rows and columns

DataFrame rows and columns can be selected using the .loc indexer (using the index labels) or the .iloc indexer (using numeric numpy-style positions). In .loc, a slice is inclusive of the last index, while in .iloc it is not. In both cases, two indices can be passed: rows and columns (in that order). The : symbol specifies “all” rows or columns. List-style and numpy slicing syntax can be used to specify multiple indices. For example, considering the table dat:

dat
       name        lon        lat
0    Soroka  34.800933  31.258211
1  Yoseftal  34.940560  29.554168
2  Barzilai  34.562429  31.663251

Here is how we can select the first two rows:

dat.iloc[:2, :]
       name        lon        lat
0    Soroka  34.800933  31.258211
1  Yoseftal  34.940560  29.554168

and here is how we can select columns lon and lat:

dat.loc[:, ['lon', 'lat']]
         lon        lat
0  34.800933  31.258211
1  34.940560  29.554168
2  34.562429  31.663251

Commonly used “shortcuts” include selecting one column, as a Series, by passing just one column index:

dat['name']
0      Soroka
1    Yoseftal
2    Barzilai
Name: name, dtype: str

Or selecting one or more columns, as a DataFrame, by passing a list of column indices:

dat[['lon', 'lat']]
         lon        lat
0  34.800933  31.258211
1  34.940560  29.554168
2  34.562429  31.663251
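
The difference between .loc and .iloc in how the last index of a slice is treated is easy to miss, so here is a minimal sketch. It uses a separate table named hospitals (an illustrative name) and applies what looks like the same slice through both:

```python
import pandas as pd

hospitals = pd.DataFrame({
    'name': ['Soroka', 'Yoseftal', 'Barzilai'],
    'lon': [34.800933, 34.940560, 34.562429],
    'lat': [31.258211, 29.554168, 31.663251],
})

# .loc: labels, last one included (rows 0 and 1, columns 'name' and 'lon')
by_label = hospitals.loc[0:1, 'name':'lon']

# .iloc: positions, last one excluded (first row, first column only)
by_position = hospitals.iloc[0:1, 0:1]

print(by_label.shape)     # (2, 2)
print(by_position.shape)  # (1, 1)
```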

Subsetting—rows by condition

We can subset specific rows from a DataFrame by:

  1. Creating a boolean Series specifying which rows to retain (typically using one or more of the existing DataFrame columns)
  2. Passing the Series as an index inside square brackets

For example, considering the DataFrame named dat:

dat
       name        lon        lat
0    Soroka  34.800933  31.258211
1  Yoseftal  34.940560  29.554168
2  Barzilai  34.562429  31.663251

Here is how we can subset the rows where latitude is greater than 30:

sel = dat['lat'] > 30
sel
0     True
1    False
2     True
Name: lat, dtype: bool
dat[sel]
       name        lon        lat
0    Soroka  34.800933  31.258211
2  Barzilai  34.562429  31.663251

or combined into one expression:

dat[dat['lat'] > 30]
       name        lon        lat
0    Soroka  34.800933  31.258211
2  Barzilai  34.562429  31.663251

Sometimes we need to subset rows where the value belongs to a set of values, such as a list of names:

sel = ['Soroka', 'Barzilai']

In such a case, the .isin method can be used, as follows:

dat[dat['name'].isin(sel)]
       name        lon        lat
0    Soroka  34.800933  31.258211
2  Barzilai  34.562429  31.663251
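
When the condition involves more than one column, boolean Series can be combined. The following sketch assumes the standard pandas operators & (and) and ~ (not), which were not introduced above, and an illustrative threshold of 34.7 for longitude. Each comparison needs its own parentheses:

```python
import pandas as pd

hospitals = pd.DataFrame({
    'name': ['Soroka', 'Yoseftal', 'Barzilai'],
    'lon': [34.800933, 34.940560, 34.562429],
    'lat': [31.258211, 29.554168, 31.663251],
})

# Rows where both conditions hold
both = (hospitals['lat'] > 30) & (hospitals['lon'] > 34.7)
print(hospitals[both]['name'].tolist())  # ['Soroka']

# Rows whose name is NOT in the given list
others = ~hospitals['name'].isin(['Soroka', 'Barzilai'])
print(hospitals[others]['name'].tolist())  # ['Yoseftal']
```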

Sorting rows

DataFrame rows can be sorted using the .sort_values method. We can specify the column(s) taken into account for sorting through by, and the sorting order through ascending (default is True). For example:

dat.sort_values(by='lat')
       name        lon        lat
1  Yoseftal  34.940560  29.554168
0    Soroka  34.800933  31.258211
2  Barzilai  34.562429  31.663251
dat.sort_values(by='lat', ascending=False)
       name        lon        lat
2  Barzilai  34.562429  31.663251
0    Soroka  34.800933  31.258211
1  Yoseftal  34.940560  29.554168
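
As the outputs above show, the rows keep their original indices after sorting. The sketch below, using a separate table named hospitals (an illustrative name), demonstrates that .sort_values returns a new DataFrame, and that the indices can be renumbered with .reset_index(drop=True), a standard pandas method not covered above:

```python
import pandas as pd

hospitals = pd.DataFrame({
    'name': ['Soroka', 'Yoseftal', 'Barzilai'],
    'lat': [31.258211, 29.554168, 31.663251],
})

ordered = hospitals.sort_values(by='lat', ascending=False)
print(ordered.index.tolist())    # [2, 0, 1]
print(hospitals.index.tolist())  # [0, 1, 2] (original unchanged)

# Discard the old indices and number the sorted rows from 0
renumbered = ordered.reset_index(drop=True)
print(renumbered.index.tolist())    # [0, 1, 2]
print(renumbered['name'].tolist())  # ['Barzilai', 'Soroka', 'Yoseftal']
```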

Missing data

Missing data in Series and DataFrames are represented using np.nan. For example:

dat.loc[1, 'name'] = np.nan
dat
       name        lon        lat
0    Soroka  34.800933  31.258211
1       NaN  34.940560  29.554168
2  Barzilai  34.562429  31.663251

Missing values can be detected using method .isna:

dat['name'].isna()
0    False
1     True
2    False
Name: name, dtype: bool

or .notna:

dat['name'].notna()
0     True
1    False
2     True
Name: name, dtype: bool
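
Since .isna and .notna return boolean Series, they can be combined with the techniques shown earlier. As a sketch, using a separate table named hospitals (an illustrative name) with one missing name, we can count the missing values and keep only the complete rows:

```python
import numpy as np
import pandas as pd

hospitals = pd.DataFrame({
    'name': ['Soroka', np.nan, 'Barzilai'],
    'lat': [31.258211, 29.554168, 31.663251],
})

# True counts as 1, so the sum is the number of missing values
n_missing = hospitals['name'].isna().sum()
print(n_missing)  # 1

# Keep only the rows where the name is present
complete = hospitals[hospitals['name'].notna()]
print(complete.index.tolist())  # [0, 2]
```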

Let’s fill the missing value before moving on:

dat.loc[1, 'name'] = 'Yoseftal'
dat
       name        lon        lat
0    Soroka  34.800933  31.258211
1  Yoseftal  34.940560  29.554168
2  Barzilai  34.562429  31.663251

Series to list

A Series can be converted to a list using the .to_list (or synonym .tolist) method:

dat['lon'].to_list()
[34.800933, 34.94056, 34.562429]

This can be thought of as the opposite of Series creation from a list (Series from scratch).

DataFrame to '.csv'

A DataFrame can be exported to a '.csv' file as follows:

dat.to_csv('output/hospitals.csv', index=False)

DataFrame from '.csv'

A DataFrame can be imported from a '.csv' file using the pd.read_csv function:

pd.read_csv('output/hospitals.csv')
       name        lon        lat
0    Soroka  34.800933  31.258211
1  Yoseftal  34.940560  29.554168
2  Barzilai  34.562429  31.663251
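
To see what index=False does, here is a sketch that avoids touching the file system. It relies on .to_csv returning the '.csv' content as a string when no path is given, and on pd.read_csv accepting a file-like object such as io.StringIO. The table name hospitals is illustrative:

```python
import io
import pandas as pd

hospitals = pd.DataFrame({
    'name': ['Soroka', 'Yoseftal', 'Barzilai'],
    'lon': [34.800933, 34.940560, 34.562429],
    'lat': [31.258211, 29.554168, 31.663251],
})

# No path given: the '.csv' content is returned as a string
with_index = hospitals.to_csv()
without_index = hospitals.to_csv(index=False)

print(with_index.splitlines()[0])     # ,name,lon,lat
print(without_index.splitlines()[0])  # name,lon,lat

# Reading the text back gives the original three columns
restored = pd.read_csv(io.StringIO(without_index))
print(restored.shape)  # (3, 3)
```

Without index=False, the row indices are written as an additional, unnamed first column, which is rarely wanted when the indices are just the default consecutive integers.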

Exercises

Exercise 02-01

  • Create a \(10 \times 10\) ndarray of type int representing the multiplication table

Exercise 02-02

  • Calculate the bounds of the hospitals (DataFrame from '.csv'), as a list with four float values of the form [xmin,ymin,xmax,ymax]

Exercise 02-03

  • The file 'output/europe_borders.csv' is a pairwise matrix of European countries, where True marks that the borders of the two countries intersect (see Pairwise matrices)
  • Calculate a table with two columns:
    • name—country name
    • count—number of neighbors, excluding self
  • Sort the table in decreasing order according to number of neighbors, and print the first 6 rows (i.e., the 6 countries with most neighbors) (Table 19.1)
  • Tip: You can use the .sum method with axis=1 to calculate row sums