Vector layers (geopandas)#
Last updated: 2026-02-12 16:03:56
Introduction#
In the previous chapter, we worked with individual geometries using the shapely package (Geometries (shapely)). In GIS paractice, however, we usually work with vector layers, rather than individual geometries. Vector layers can be thought of as:
a collection of (one or more) geometries,
associated with (zero or more) non-spatial attributes, and
associated with a Coordinate Reference System (CRS) definition (possibly “undefined”)
In this chapter, we introduce the geopandas package for working with vector layers. First, we are going to see how a vector layer can be created from scratch (see Creating a GeoDataFrame), based on shapely geometries, to understand the relation between shapely and geopandas data structures, namely:
GeoSeries—A geometry column, i.e., a collection ofshapelygeometries with a CRS definitionGeoDataFrame—A vector layer, i.e., a table containing aGeoSeries
Then, we go on to cover some of the basic operations and workflows with vector layers using the geopandas package:
Reading vector layers—Importing a vector layer from a file, such as a Shapefile
Plotting (geopandas)—Plotting vector layers
Filtering by attributes—Selecting features by attribute
Table to point layer—Creating a point layer from a table
Points to line (geopandas)—Converting a point layer to a line layer
Writing GeoDataFrame to file—Writing a vector layer to file
In the next chapter (Geometric operations), we are going to cover spatial operations with geopandas, such as calculating new geometries (buffers, etc.) and spatial join.
What is geopandas?#
geopandas, as the name suggests, is an extension of pandas that enables working with spatial (geographic) data, specifically with vector layers. To facilitate the representation of spatial data, geopandas defines two data structures, which are extensions of the analogous data structures from pandas for non-spatial data:
GeoSeries—An extension of aSeries(see Creating a Series) for storing a sequence of geometries and a Coordinate Reference System (CRS) definition, whereas each geometry is ashapelyobject (see Geometries (shapely))GeoDataFrame—An extension of aDataFrame(see Creating a DataFrame), where one of the columns is aGeoSeries
As will be demostrated in a moment (see Creating a GeoDataFrame), a GeoDataFrame, representing a vector layer, is thus a combination (Fig. 47) of:
A
GeoSeries, which holds the geometries in a special “geometry” columnZero or more
Series, which represent the non-spatial attributes
Since geopandas is based on pandas, many of the DataFrame workflows and operators we learned about earlier (see Tables (pandas) and Table reshaping and joins) work exactly the same way with a GeoDataFrame. This is especially true for operations that involve just the non-spatial properties, such as filtering by attributes (see Filtering by attributes).
Fig. 47 GeoDataFrame structure (source: https://geopandas.org/getting_started/introduction.html)#
The geopandas package depends on several other Python packages, most importantly:
numpypandasshapely—Interface to GEOS (see Geometries (shapely))
To put it another way, geopandas may be considered a “wrapper” of pyogrio, shapely, and pyproj, giving access to all (or most) of their functionality through pandas-based data structures and simpler-to-use methods. Compared to other packages, geopandas may be considered a high-level approach for working with vector layers in Python. A lower-level approach is provided through the fiona package, for example. The osgeo package provides (through its ogr sub-package) another, more low-level, approach for vector layer access.
Since geopandas is simpler to learn and work with than its alternatives, it is the recommended approach for the purposes in this book. For instance, using fiona and osgeo we typically have to process vector features one-by-one inside a for loop, while in geopandas the entire layer is being ‘read’ and processed at once. Nevertheless, in some situations it is necessary to resort to lower-level approaches. For example, processing features one-by-one is essential when the layer is too big to fit in memory.
Vector layer from scratch#
Creating a GeoSeries#
Before starting to work with geopandas, we have to load it. We also load pandas and shapely, which we are already familiar with, since they are needed for some of the examples:
import pandas as pd
import shapely
import geopandas as gpd
As a first step to learn about geopandas, we are going to create a geometry column (GeoSeries), and then a vector layer (GeoDataFrame) (see Creating a GeoDataFrame) from sratch. This will help us get a better understanding of the structure of a GeoDataFrame (Fig. 47).
Let’s start with the geometry column, i.e., a GeoSeries. The geometric component of our layer is going to be a sequence of two 'Polygon' geometries, hereby named pol1 and pol2 (see shapely from WKT):
pol1 = shapely.from_wkt('POLYGON ((0 0, 0 -1, 7.5 -1, 7.5 0, 0 0))')
pol2 = shapely.from_wkt('POLYGON ((0 1, 1 0, 2 0.5, 3 0, 4 0, 5 0.5, 6 -0.5, 7 -0.5, 7 1, 0 1))')
To create a GeoSeries, we can pass a list of shapely geometries to the gpd.GeoSeries function. The second (optional) argument passed to gpd.GeoSeries is a crs, specifying the Coordinate Reference System (CRS). In this case, we specify the WGS84 geographical CRS, using its EPSG code 4326 (Table 23):
s = gpd.GeoSeries([pol1, pol2], crs=4326)
s
0 POLYGON ((0 0, 0 -1, 7.5 -1, 7....
1 POLYGON ((0 1, 1 0, 2 0.5, 3 0,...
dtype: geometry
Creating a GeoDataFrame#
A GeoDataFrame can be constructed from a dict representing the columns, one of them containing the geometries, and the other(s) containing non-spatial attributes (if any). The dict values may be Series, GeoSeries, or list data structure. First, let us create the dictionary:
d = {'name': ['a', 'b'], 'geometry': s}
d
{'name': ['a', 'b'],
'geometry': 0 POLYGON ((0 0, 0 -1, 7.5 -1, 7....
1 POLYGON ((0 1, 1 0, 2 0.5, 3 0,...
dtype: geometry}
In this particular case, we have:
the (mandatory) geometry column, conventionally named
'geometry', specified using aGeoSeries(see Creating a GeoSeries), andone non-spatial attribute, named
'name', specified using alist
Now, we can convert the dict to a GeoDataFrame, using the gpd.GeoDataFrame function:
gdf = gpd.GeoDataFrame(d)
gdf
| name | geometry | |
|---|---|---|
| 0 | a | POLYGON ((0 0, 0 -1, 7.5 -1, 7.... |
| 1 | b | POLYGON ((0 1, 1 0, 2 0.5, 3 0,... |
or in a single step:
gdf = gpd.GeoDataFrame({'name': ['a', 'b'], 'geometry': s})
gdf
| name | geometry | |
|---|---|---|
| 0 | a | POLYGON ((0 0, 0 -1, 7.5 -1, 7.... |
| 1 | b | POLYGON ((0 1, 1 0, 2 0.5, 3 0,... |
We can also use a list of geometries instead of a GeoSeries, in which case the crs can be specified inside gpd.GeoDataFrame:
gdf = gpd.GeoDataFrame({'name': ['a', 'b'], 'geometry': [pol1, pol2]}, crs=4326)
gdf
| name | geometry | |
|---|---|---|
| 0 | a | POLYGON ((0 0, 0 -1, 7.5 -1, 7.... |
| 1 | b | POLYGON ((0 1, 1 0, 2 0.5, 3 0,... |
Note that, the workflow to create a GeoDataFrame is very similar to the way we created a DataFrame with the pd.DataFrame function (see Creating a DataFrame). Basically, the only difference is the presence of the geometry column.
GeoDataFrame structure#
Series columns#
Let us examine the structure of the gdf object we just created to understand the structure of a GeoDataFrame.
An individual column of a GeoDataFrame can be accessed just like a column of a DataFrame (see Selecting DataFrame columns). For example, the following expression returns the 'name' column:
gdf['name']
0 a
1 b
Name: name, dtype: object
as a Series object:
type(gdf['name'])
pandas.core.series.Series
More specifically, 'name' is a Series of type object, since it contains strings.
GeoSeries (geometry) column#
The following expression returns the 'geometry' column:
gdf['geometry']
0 POLYGON ((0 0, 0 -1, 7.5 -1, 7....
1 POLYGON ((0 1, 1 0, 2 0.5, 3 0,...
Name: geometry, dtype: geometry
However, the recommended approach to access the geometry is through the .geometry property, which refers to the geometry column regardless of its name (whether the name is 'geometry' or not):
gdf.geometry
0 POLYGON ((0 0, 0 -1, 7.5 -1, 7....
1 POLYGON ((0 1, 1 0, 2 0.5, 3 0,...
Name: geometry, dtype: geometry
As mentioned above, the geometry column is a GeoSeries object:
type(gdf.geometry)
geopandas.geoseries.GeoSeries
The 'geometry' column is what differentiates a GeoDataFrame from an ordinary DataFrame.
Note
It is possible for a GeoDataFrame to contain multiple GeoSeries columns. However, only one GeoSeries column at a time can be the “active” geometry column, meaning that it becomes accessible through the .geometry property, participates in spatial calculations, displayed when plotting the layer, etc. To make a given column “active”, we use the .set_geometry method, as in dat=dat.set_geometry('column_name'), where column_name is a name of a GeoSeries column in dat.
Individual geometries#
As evident from the way we constructed the object (see Creating a GeoDataFrame), each “cell” in the geometry column is an individual shapely geometry:
gdf.geometry.iloc[0]
gdf.geometry.iloc[1]
If necessary, for instance when applying shapely functions (see Points to line (geopandas)), the entire geometry column can be extracted as a list of shapely geometries using to_list (see Series to list):
gdf.geometry.to_list()
[<POLYGON ((0 0, 0 -1, 7.5 -1, 7.5 0, 0 0))>,
<POLYGON ((0 1, 1 0, 2 0.5, 3 0, 4 0, 5 0.5, 6 -0.5, 7 -0.5, 7 1, 0 1))>]
GeoDataFrame properties#
Geometry types#
The geometry types (Fig. 37) of the geometries in a GeoDataFrame or a GeoSeries can be obtained using the .geom_type property. The result is a Series of strings, with the geometry type names. For example:
s.geom_type
0 Polygon
1 Polygon
dtype: object
gdf.geom_type
0 Polygon
1 Polygon
dtype: object
The .geom_type property of geopandas thus serves the same purpose as the .geom_type property of shapely geometries (see Geometry type).
Coordinate Reference System (CRS)#
In addition to the geometries, a GeoSeries has a property named .crs, which specifies the Coordinate Regerence System (CRS) of the geometries. For example, the geometries in the gdf layer are in the geographic WGS84 CRS, as specified when creating the layer (see Creating a GeoDataFrame):
gdf.geometry.crs
<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
The .crs property can also be reached from the GeoDataFrame itself:
gdf.crs
<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
We are going to elaborate on CRSs and modifying them later on (see CRS and reprojection).
Bounds (geopandas)#
The .total_bounds, property of a GeoDataFrame contains the bounding box coordinates of all geometries combined, in the form of a numpy array:
gdf.total_bounds
array([ 0. , -1. , 7.5, 1. ])
For more detailed information, the .bounds property returns the bounds of each geometry, separately, in a DataFrame:
gdf.bounds
| minx | miny | maxx | maxy | |
|---|---|---|---|---|
| 0 | 0.0 | -1.0 | 7.5 | 0.0 |
| 1 | 0.0 | -0.5 | 7.0 | 1.0 |
In both cases, bounds are returned in the standard form \((x_{min}\), \(y_{min}\), \(x_{max}\), \(y_{max})\), in agreement with the .bounds property of shapely (see Bounds (shapely)).
Reading vector layers#
Usually, we read a vector layer from a file, rather than creating it from scratch (see Creating a GeoDataFrame). The gpd.read_file function can read vector layer files in any of the common formats, such as:
Shapefile (
.shp)GeoJSON (
.geojson)GeoPackage (
.gpkg)File Geodatabase (
.gdb)
For example, the following expression reads a Shapefile named 'muni_il.shp' into a GeoDataFrame named towns. This is a polygonal layer of towns in Israel. The additional encoding='utf-8' argument makes the Hebrew text correctly interpreted:
towns = gpd.read_file('data/muni_il.shp', encoding='utf-8')
towns
| Muni_Heb | Muni_Eng | Sug_Muni | CR_PNIM | CR_LAMAS | ... | FIRST_Nafa | LAST_Nafa2 | Shape_Leng | Shape_Area | geometry | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | שפיר | Shafir | מועצה אזורית | 5534 | None | ... | אשקלון | None | 48125.964322 | 2.721176e+07 | POLYGON Z ((177281.694 610056.5... |
| 1 | דימונה | Dimona | עירייה | 2200 | 2200 | ... | באר שבע | None | 1102.492262 | 7.708050e+04 | POLYGON Z ((205753.745 560693.6... |
| 2 | דימונה | Dimona | עירייה | 2200 | 2200 | ... | באר שבע | None | 7071.568894 | 1.043253e+06 | POLYGON Z ((207740.171 563476.5... |
| 3 | זבולון | Zevulun | מועצה אזורית | 5512 | None | ... | חיפה | None | 1107.564213 | 4.769053e+04 | POLYGON Z ((209445.229 740507.2... |
| 4 | מגידו | Megido | מועצה אזורית | 5513 | None | ... | יזרעאל | None | 1076.261429 | 2.144091e+04 | POLYGON Z ((206769.626 727131.7... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 406 | ללא שיפוט - אזור מעיליא | No Jurisdiction - Mi'ilya Area | ללא שיפוט | 9901 | None | ... | None | None | 2617.936245 | 2.001376e+05 | POLYGON Z ((225369.655 770523.6... |
| 407 | זבולון | Zevulun | מועצה אזורית | 5512 | None | ... | חיפה | None | 2209.128245 | 2.246316e+05 | POLYGON Z ((207596.214 746263.0... |
| 408 | זבולון | Zevulun | מועצה אזורית | 5512 | None | ... | חיפה | None | 5507.014575 | 1.141930e+06 | POLYGON Z ((208622.688 746540.1... |
| 409 | רמת השרון | Ramat Hasharon | עירייה | 2650 | 2650 | ... | תל אביב - יפו | None | 937.968193 | 4.076764e+04 | POLYGON Z ((183576.756 673627.0... |
| 410 | צור הדסה | Tsur Hadasa | מועצה מקומית | 1113 | 1113 | ... | תל אביב - יפו | None | 11292.949932 | 3.961142e+06 | POLYGON Z ((209871.005 625364.9... |
411 rows × 14 columns
Let’s read another layer, 'RAIL_STRATEGIC.shp', into a GeoDataFrame named rail. This is a line layer of railway lines in Israel:
rail = gpd.read_file('data/RAIL_STRATEGIC.shp')
rail
| UPGRADE | SHAPE_LENG | SHAPE_Le_1 | SEGMENT | ISACTIVE | UPGRAD | geometry | |
|---|---|---|---|---|---|---|---|
| 0 | שדרוג 2040 | 14489.400742 | 12379.715332 | כפר יהושע - נשר_1 | פעיל | שדרוג בין 2030 ל-2040 | LINESTRING (205530.083 741562.9... |
| 1 | שדרוג 2030 | 2159.344166 | 2274.111800 | באר יעקב-ראשונים_2 | פעיל | שדרוג עד 2030 | LINESTRING (181507.598 650706.1... |
| 2 | שדרוג 2040 | 4942.306156 | 4942.306156 | נתבג - נען_3 | לא פעיל | שדרוג בין 2030 ל-2040 | LINESTRING (189180.643 645433.4... |
| 3 | שדרוג 2040 | 2829.892202 | 2793.337699 | לב המפרץ מזרח - נשר_4 | פעיל | שדרוג בין 2030 ל-2040 | LINESTRING (203482.789 744181.5... |
| 4 | שדרוג 2040 | 1976.490872 | 1960.170882 | קרית גת - להבים_5 | פעיל | שדרוג בין 2030 ל-2040 | LINESTRING (178574.101 609392.9... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 212 | חדש 2030 | 2174.401983 | 2174.401983 | ראש העין צפון - כפר סבא צפון_ | לא פעיל | פתיחה עד 2030 | LINESTRING (195887.59 675861.96... |
| 213 | חדש 2030 | 0.000000 | 974.116931 | רמלה דרום-ראשונים_218 | לא פעיל | פתיחה עד 2030 | LINESTRING (183225.361 648648.2... |
| 214 | שדרוג 2030 | 3248.159569 | 1603.616014 | השמונה - בית המכס_220 | פעיל | שדרוג עד 2030 | LINESTRING (200874.999 746209.3... |
| 215 | שדרוג 2030 | 2611.879099 | 166.180958 | לב המפרץ - בית המכס_221 | פעיל | שדרוג עד 2030 | LINESTRING (203769.786 744358.6... |
| 216 | שדרוג 2030 | 1302.971893 | 1284.983680 | 224_לוד מרכז - נתבג | פעיל | שדרוג עד 2030 | LINESTRING (190553.481 654170.3... |
217 rows × 7 columns
Note
Note that a Shapefile consists of several files, such as '.shp', '.shx', '.dbf' and '.prj', even though we specify just one the '.shp' file path when reading a Shapefile with gpd.read_file. The same applies to writing a Shapefile (see Writing GeoDataFrame to file).
Plotting (geopandas)#
Basic plots#
GeoSeries and GeoDataFrame objects have a .plot method, which can be used to visually examine the data we are working with. For example:
gdf.plot();
We can use the color and edgecolor parameters to specify the fill color and border color, respectively:
gdf.plot(color='none', edgecolor='red');
Another useful property is linewidth, to specify line width:
gdf.plot(color='none', linewidth=10);
Let’s also plot the towns and rail layers to see what they look like:
towns.plot();
rail.plot();
Setting symbology#
By default, all plotted geometries share the same appearance. However, when plotting a GeoDataFrame, we can set symbology, i.e., geometry appearance corresponding to attribute values. The attribute to use for symbology is specified with the column parameter.
For example, the following expression results in a map where town fill colors correspond to the 'Machoz' attribute values:
towns.plot(column='Machoz');
The color palette for the symbology can be changed using the cmap (“color map’) argument. Valid palette names can be found in the Choosing Colormaps in Matplotlib page of the matplotlib documentation. For example, 'Set2' is one of the “Qualitative colormaps”, which is appropriate for categorical variables:
towns.plot(column='Machoz', cmap='Set2');
As another example, the following expression sets a symbology corresponding to the continuous 'Shape_Area' (polygon area, in \(m^2\)) columns. The argument legend=True adds a legend:
towns.plot(column='Shape_Area', cmap='Reds', legend=True);
Interactive maps#
geopandas also provides the .explore method, which facilitates interactive exploration of the vector layers we are working with. The .explore method takes many of the same arguments as .plot (such as column and cmap), but also some specific ones (such as style_kwds). See the documentation for details.
For example, the following expression creates an interactive map displaying the rail layer, colored according to status (active or not). Note that hovering over a feature with the mouse shows a tooltip with the attribute values (similarly to “identify” tools in GIS software).
rail.explore(
column='ISACTIVE',
cmap='RdYlBu',
style_kwds={'weight':8, 'opacity':0.8}
)
Note
There are several Python packages for creating interactive maps displaying spatial data, either inside a Jupyter notebook or exporting to an HTML document. Most notably, these include folium, ipyleaflet, leafmap, and bokeh. All but the last one are based on the Leaflet web-mapping JavaScript package. The .explore method of geopandas, in fact, uses the folium package and requires it to be installed.
Plotting more than one layer#
It is is often helpful to display several layers together in the same plot, one on top of the other. That way, we can see the orientation and alignment of several layers at once.
To display two or more layers in the same geopandas plot, we need to do things a little differently than shown above (see Basic plots):
Assign a “base” plot, to a variable, such as
baseDisplay any additional layer(s) on top of the base plot using its
.plotmethod combined withax=base
For example, here is how we can plot both towns and rail, one on top of the other. Note that we use color='none' for transparent fill of the polygons:
base = towns.plot(color='none', edgecolor='lightgrey')
rail.plot(ax=base, color='blue');
Note
More information on plotting geopandas layers can be found in the Mapping and plotting tools tutorial in the official package documentation.
Selecting GeoDataFrame columns#
We can subset columns in a GeoDataFrame the same way as in a DataFrame, using the [[ operator (see Selecting DataFrame columns). The only thing we need to keep in mind is that we must keep the geometry column, so that the result remains a vector layer.
For example, the following expression subsets 5 out of the 14 columns in towns:
towns = towns[['Muni_Heb', 'Muni_Eng', 'Machoz', 'Shape_Area', 'geometry']]
towns
| Muni_Heb | Muni_Eng | Machoz | Shape_Area | geometry | |
|---|---|---|---|---|---|
| 0 | שפיר | Shafir | דרום | 2.721176e+07 | POLYGON Z ((177281.694 610056.5... |
| 1 | דימונה | Dimona | דרום | 7.708050e+04 | POLYGON Z ((205753.745 560693.6... |
| 2 | דימונה | Dimona | דרום | 1.043253e+06 | POLYGON Z ((207740.171 563476.5... |
| 3 | זבולון | Zevulun | חיפה | 4.769053e+04 | POLYGON Z ((209445.229 740507.2... |
| 4 | מגידו | Megido | צפון | 2.144091e+04 | POLYGON Z ((206769.626 727131.7... |
| ... | ... | ... | ... | ... | ... |
| 406 | ללא שיפוט - אזור מעיליא | No Jurisdiction - Mi'ilya Area | צפון | 2.001376e+05 | POLYGON Z ((225369.655 770523.6... |
| 407 | זבולון | Zevulun | חיפה | 2.246316e+05 | POLYGON Z ((207596.214 746263.0... |
| 408 | זבולון | Zevulun | חיפה | 1.141930e+06 | POLYGON Z ((208622.688 746540.1... |
| 409 | רמת השרון | Ramat Hasharon | תל אביב | 4.076764e+04 | POLYGON Z ((183576.756 673627.0... |
| 410 | צור הדסה | Tsur Hadasa | ירושלים | 3.961142e+06 | POLYGON Z ((209871.005 625364.9... |
411 rows × 5 columns
And here we subset three of the columns in rail:
rail = rail[['SEGMENT', 'ISACTIVE', 'geometry']]
rail
| SEGMENT | ISACTIVE | geometry | |
|---|---|---|---|
| 0 | כפר יהושע - נשר_1 | פעיל | LINESTRING (205530.083 741562.9... |
| 1 | באר יעקב-ראשונים_2 | פעיל | LINESTRING (181507.598 650706.1... |
| 2 | נתבג - נען_3 | לא פעיל | LINESTRING (189180.643 645433.4... |
| 3 | לב המפרץ מזרח - נשר_4 | פעיל | LINESTRING (203482.789 744181.5... |
| 4 | קרית גת - להבים_5 | פעיל | LINESTRING (178574.101 609392.9... |
| ... | ... | ... | ... |
| 212 | ראש העין צפון - כפר סבא צפון_ | לא פעיל | LINESTRING (195887.59 675861.96... |
| 213 | רמלה דרום-ראשונים_218 | לא פעיל | LINESTRING (183225.361 648648.2... |
| 214 | השמונה - בית המכס_220 | פעיל | LINESTRING (200874.999 746209.3... |
| 215 | לב המפרץ - בית המכס_221 | פעיל | LINESTRING (203769.786 744358.6... |
| 216 | 224_לוד מרכז - נתבג | פעיל | LINESTRING (190553.481 654170.3... |
217 rows × 3 columns
Filtering by attributes#
We can subset the rows (i.e., features) of a GeoDataFrame using a corresponding boolean Series, the same way we subset rows of a DataFrame (see DataFrame filtering). Typically, the boolean Series is the result of applying a conditional operator on one or more columns of the same GeoDataFrame.
For example, suppose that we want to keep just the active railway segments from the rail layer, namely those segments where the value in the 'ISACTIVE' column is equal to 'פעיל' (“active”). First, we need to create a boolean series where True values reflect the rows we would like to keep. Here, we assign the boolean Series to a variable named sel:
sel = rail['ISACTIVE'] == 'פעיל'
sel
0 True
1 True
2 False
3 True
4 True
...
212 False
213 False
214 True
215 True
216 True
Name: ISACTIVE, Length: 217, dtype: bool
Second, we need to pass the series as an index to the GeoDataFrame. As a result, rail now contains only the active railway lines (161 features out of 217):
rail = rail[sel]
rail
| SEGMENT | ISACTIVE | geometry | |
|---|---|---|---|
| 0 | כפר יהושע - נשר_1 | פעיל | LINESTRING (205530.083 741562.9... |
| 1 | באר יעקב-ראשונים_2 | פעיל | LINESTRING (181507.598 650706.1... |
| 3 | לב המפרץ מזרח - נשר_4 | פעיל | LINESTRING (203482.789 744181.5... |
| 4 | קרית גת - להבים_5 | פעיל | LINESTRING (178574.101 609392.9... |
| 5 | רמלה - רמלה מזרח_6 | פעיל | LINESTRING (189266.58 647211.50... |
| ... | ... | ... | ... |
| 210 | ויתקין - חדרה_215 | פעיל | LINESTRING (190758.23 704950.04... |
| 211 | בית יהושע - נתניה ספיר_216 | פעיל | LINESTRING (187526.597 687360.3... |
| 214 | השמונה - בית המכס_220 | פעיל | LINESTRING (200874.999 746209.3... |
| 215 | לב המפרץ - בית המכס_221 | פעיל | LINESTRING (203769.786 744358.6... |
| 216 | 224_לוד מרכז - נתבג | פעיל | LINESTRING (190553.481 654170.3... |
161 rows × 3 columns
Once you are more comfortable with the subsetting syntax, you may prefer to do the operation in a single step, as follows:
rail = rail[rail['ISACTIVE'] == 'פעיל']
Let’s plot the filtered rail layer, to see the effect visually:
rail.plot();
Exercise 08-a
The layer
'statisticalareas_demography2019.gdb'contains demographic data for the statistical areas of Israel, such as population size per age group, in 2019. For example, the column'age_10_14'contains population counts in the 10-14 age group, while the'Pop_Total'column contains total population counts (in all age groups). Note that'statisticalareas_demography2019.gdb'is in a format known as a File Geodatabase.Read the
'statisticalareas_demography2019.gdb'layer into aGeoDataFrame.Subset the following columns (plus the geometry!):
'YISHUV_STAT11'—Statistical area ID'SHEM_YISHUV'—Town name in Hebrew'SHEM_YISHUV_ENGLISH'—Town name in English'Pop_Total'—Total population size
What is the Coordinate Reference System of the layer? (answer:
'Israel 1993 / Israeli TM Grid', i.e., ‘ITM’)How many features does the layer contain? Write an expression that returns the result as
int. (answer:3195)The values in the
'YISHUV_STAT11'column (statistical area ID) should all be unique. How can you make sure? Either use one of the methods we learned earlier, or search online (e.g., google ‘pandas check for duplicates’).Plot the layer using the
.plotmethod (Fig. 48).Subset just the statistical areas of Beer-Sheva, using either the
'SHEM_YISHUV'or the'SHEM_YISHUV_ENGLISH'column.Plot the resulting subset, using symbology according to total population size, i.e., the
'Pop_Total'column, and using a sequentual color map such as'Reds'(Fig. 49).How many statistical areas are there in Beer-Sheva? (answer:
62)What was the total population of Beer-Sheva in 2019? (answer:
209685.0)
Fig. 48 Solution of exercise-08-a1: The statistical areas layer#
Fig. 49 Solution of exercise-08-a2: Statistical areas of Beer-Sheva, with symbology according to total population size#
Table to point layer#
Another very common workflow of creating a vector layer is creating a point layer from a table that has X and Y coordinates in two separate columns. Technically, this means that the X and Y values in each row are transformed into a 'Point' geometry. The geometries are then placed in a geometry column, so that the table becomes a vector layer.
To demonstrate, let’s import the 'stops.txt' table, which contains information about public transport stops in Israel based on GTFS data (see What is GTFS?). Note the columns 'stop_lon' and 'stop_lat', which contain the X and Y coordinates, in the WGS84 coordinate system, i.e., longitude and latitude, respectively:
stops = pd.read_csv('data/gtfs/stops.txt')
stops = stops[['stop_id', 'stop_name', 'stop_lon', 'stop_lat']]
stops
| stop_id | stop_name | stop_lon | stop_lat | |
|---|---|---|---|---|
| 0 | 1 | בי''ס בר לב/בן יהודה | 34.917554 | 32.183985 |
| 1 | 2 | הרצל/צומת בילו | 34.819541 | 31.870034 |
| 2 | 3 | הנחשול/הדייגים | 34.782828 | 31.984553 |
| 3 | 4 | משה פריד/יצחק משקה | 34.790700 | 31.888325 |
| 4 | 6 | ת. מרכזית לוד/הורדה | 34.898098 | 31.956392 |
| ... | ... | ... | ... | ... |
| 34002 | 50182 | מסוף שכונת החותרים/הורדה | 34.966691 | 32.748564 |
| 34003 | 50183 | מסוף שכונת החותרים/איסוף | 34.966844 | 32.748776 |
| 34004 | 50184 | מדבריום | 34.744373 | 31.260857 |
| 34005 | 50185 | כביש 25/מדבריום | 34.747607 | 31.263478 |
| 34006 | 50186 | כביש 25/מדבריום | 34.746768 | 31.264262 |
34007 rows × 4 columns
To transform a DataFrame with X and Y columns into a GeoDataFrame point layer, we go through two steps:
Constructing a
GeoSeriesof'Point'geometries, from twoSeriesrepresenting X and Y, usinggpd.points_from_xy. Optionally, we can specify the CRS of theGeoSeries, using thecrsparameter.Converting the
GeoSeriesto aGeoDataFrame, usinggpd.GeoDataFrame.
Let us begin with the first step—creating the GeoSeries. The X and Y columns are readily available, stops['stop_lon'], stops['stop_lat'], however specifying the crs parameter is less trivial. The CRS definition is usually not stored in the same table with the X and Y columns, since it would have to be repeated on all rows. We need to figure out which CRS the X and Y coordinates are in according to the metadata or accompanying information (such as the GTFS specification). In this particular case, the names 'stop_lon' and 'stop_lat' (where “lon”=longitude and “lat”=latitude) give a pretty good hint that the coordinates are given in WGS84, therefore we use crs=4326:
geom = gpd.points_from_xy(stops['stop_lon'], stops['stop_lat'], crs=4326)
The result is a special type of array named GeometryArray:
type(geom)
geopandas.array.GeometryArray
Though it can be used as is, we are going to convert it to a GeoSeries, which we are already familiar with, for clarity:
geom = gpd.GeoSeries(geom)
geom
0 POINT (34.91755 32.18398)
1 POINT (34.81954 31.87003)
2 POINT (34.78283 31.98455)
3 POINT (34.7907 31.88832)
4 POINT (34.8981 31.95639)
...
34002 POINT (34.96669 32.74856)
34003 POINT (34.96684 32.74878)
34004 POINT (34.74437 31.26086)
34005 POINT (34.74761 31.26348)
34006 POINT (34.74677 31.26426)
Length: 34007, dtype: geometry
Now, we can use the gpd.GeoDataFrame function to combine the DataFrame with the GeoSeries, to get a GeoDataFrame. Instead of passing a dict of columns, like we did when creating a layer manually (see Creating a GeoDataFrame), we pass a DataFrame and a GeoSeries to the data and geometry parameters, respectively:
stops = gpd.GeoDataFrame(data=stops, geometry=geom)
stops
| stop_id | stop_name | stop_lon | stop_lat | geometry | |
|---|---|---|---|---|---|
| 0 | 1 | בי''ס בר לב/בן יהודה | 34.917554 | 32.183985 | POINT (34.91755 32.18398) |
| 1 | 2 | הרצל/צומת בילו | 34.819541 | 31.870034 | POINT (34.81954 31.87003) |
| 2 | 3 | הנחשול/הדייגים | 34.782828 | 31.984553 | POINT (34.78283 31.98455) |
| 3 | 4 | משה פריד/יצחק משקה | 34.790700 | 31.888325 | POINT (34.7907 31.88832) |
| 4 | 6 | ת. מרכזית לוד/הורדה | 34.898098 | 31.956392 | POINT (34.8981 31.95639) |
| ... | ... | ... | ... | ... | ... |
| 34002 | 50182 | מסוף שכונת החותרים/הורדה | 34.966691 | 32.748564 | POINT (34.96669 32.74856) |
| 34003 | 50183 | מסוף שכונת החותרים/איסוף | 34.966844 | 32.748776 | POINT (34.96684 32.74878) |
| 34004 | 50184 | מדבריום | 34.744373 | 31.260857 | POINT (34.74437 31.26086) |
| 34005 | 50185 | כביש 25/מדבריום | 34.747607 | 31.263478 | POINT (34.74761 31.26348) |
| 34006 | 50186 | כביש 25/מדבריום | 34.746768 | 31.264262 | POINT (34.74677 31.26426) |
34007 rows × 5 columns
The 'stop_lon' and 'stop_lat' columns are no longer necessary, because the information is already in the 'geometry' columns. Therefore, they can be dropped:
stops = stops.drop(['stop_lon', 'stop_lat'], axis=1)
stops
| stop_id | stop_name | geometry | |
|---|---|---|---|
| 0 | 1 | בי''ס בר לב/בן יהודה | POINT (34.91755 32.18398) |
| 1 | 2 | הרצל/צומת בילו | POINT (34.81954 31.87003) |
| 2 | 3 | הנחשול/הדייגים | POINT (34.78283 31.98455) |
| 3 | 4 | משה פריד/יצחק משקה | POINT (34.7907 31.88832) |
| 4 | 6 | ת. מרכזית לוד/הורדה | POINT (34.8981 31.95639) |
| ... | ... | ... | ... |
| 34002 | 50182 | מסוף שכונת החותרים/הורדה | POINT (34.96669 32.74856) |
| 34003 | 50183 | מסוף שכונת החותרים/איסוף | POINT (34.96684 32.74878) |
| 34004 | 50184 | מדבריום | POINT (34.74437 31.26086) |
| 34005 | 50185 | כביש 25/מדבריום | POINT (34.74761 31.26348) |
| 34006 | 50186 | כביש 25/מדבריום | POINT (34.74677 31.26426) |
34007 rows × 3 columns
Plotting the resulting layer of public transport stations reveals the outline of Israel, with more sparse coverage in the southern part, which corresponds to population density pattern in the country. The markersize argument is used to for smaller point size to reduce overlap:
stops.plot(markersize=1);
Points to line (geopandas)#
To lines: all points combined#
A common type of geometry transformation, which we already met when learning about shapely, is the transformation of points to lines (see Points to line (shapely)). Let us see how this type of casting can be acheived when our points are in a GeoDataFrame. We begin, in this section, with transforming an entire point layer to a single continuous line. In the next section, we will learn how to transform a point layer to a layer with multiple lines by group (see To lines: by group).
To demonstrate, let’s import the 'shapes.txt' table (see What is GTFS?) from the GTFS dataset. This table contains the shapes of public transport lines. This is quite a big table, with over 7 million rows. In the table, each shape is distinguished by a unique 'shape_id', comprised of lon-lat coordinates in the 'shape_pt_lat', 'shape_pt_lon' columns:
shapes = pd.read_csv('data/gtfs/shapes.txt')
shapes
| shape_id | shape_pt_lat | shape_pt_lon | shape_pt_sequence | |
|---|---|---|---|---|
| 0 | 69895 | 32.164723 | 34.848813 | 1 |
| 1 | 69895 | 32.164738 | 34.848972 | 2 |
| 2 | 69895 | 32.164771 | 34.849177 | 3 |
| 3 | 69895 | 32.164841 | 34.849429 | 4 |
| 4 | 69895 | 32.164889 | 34.849626 | 5 |
| ... | ... | ... | ... | ... |
| 7546509 | 146260 | 32.084831 | 34.796385 | 953 |
| 7546510 | 146260 | 32.084884 | 34.796040 | 954 |
| 7546511 | 146260 | 32.084908 | 34.795895 | 955 |
| 7546512 | 146260 | 32.084039 | 34.795698 | 956 |
| 7546513 | 146260 | 32.083737 | 34.795645 | 957 |
7546514 rows × 4 columns
For the first example, we take a subset of shapes containing the shape of just one transit line. We will select the 136236 value of 'shape_id', which corresponds to bus line 24 in Beer-Sheva going from Ramot to Beer-Sheva Central Bus Station (see Solution of exercise-07-i: bus trip points). This results in a much smaller table, with 861 rows:
pnt = shapes[shapes['shape_id'] == 136236]
pnt
| shape_id | shape_pt_lat | shape_pt_lon | shape_pt_sequence | |
|---|---|---|---|---|
| 2396423 | 136236 | 31.242009 | 34.797979 | 1 |
| 2396424 | 136236 | 31.242103 | 34.797988 | 2 |
| 2396425 | 136236 | 31.242829 | 34.797982 | 3 |
| 2396426 | 136236 | 31.242980 | 34.797986 | 4 |
| 2396427 | 136236 | 31.243121 | 34.798000 | 5 |
| ... | ... | ... | ... | ... |
| 2397279 | 136236 | 31.280183 | 34.822166 | 857 |
| 2397280 | 136236 | 31.280225 | 34.822116 | 858 |
| 2397281 | 136236 | 31.280246 | 34.822038 | 859 |
| 2397282 | 136236 | 31.280250 | 34.821949 | 860 |
| 2397283 | 136236 | 31.280233 | 34.821860 | 861 |
861 rows × 4 columns
What we have is a DataFrame with x-y coordinate columns, 'shape_pt_lon' and 'shape_pt_lat'. We already know how to transform it to a GeoDataFrame with point geometries (see Table to point layer):
geom = gpd.points_from_xy(pnt['shape_pt_lon'], pnt['shape_pt_lat'], crs=4326)
pnt = gpd.GeoDataFrame(data=pnt, geometry=geom)
pnt = pnt.drop(['shape_pt_lon', 'shape_pt_lat'], axis=1)
pnt
| shape_id | shape_pt_sequence | geometry | |
|---|---|---|---|
| 2396423 | 136236 | 1 | POINT (34.79798 31.24201) |
| 2396424 | 136236 | 2 | POINT (34.79799 31.2421) |
| 2396425 | 136236 | 3 | POINT (34.79798 31.24283) |
| 2396426 | 136236 | 4 | POINT (34.79799 31.24298) |
| 2396427 | 136236 | 5 | POINT (34.798 31.24312) |
| ... | ... | ... | ... |
| 2397279 | 136236 | 857 | POINT (34.82217 31.28018) |
| 2397280 | 136236 | 858 | POINT (34.82212 31.28022) |
| 2397281 | 136236 | 859 | POINT (34.82204 31.28025) |
| 2397282 | 136236 | 860 | POINT (34.82195 31.28025) |
| 2397283 | 136236 | 861 | POINT (34.82186 31.28023) |
861 rows × 3 columns
Now, we are going use shapely to transform the point geometries into a single 'LineString' geometry. First, we create a list of shapely geometries, which requires selecting the GeoSeries (geometry) column and converting it to a list using the .to_list method (see Individual geometries):
x = pnt.geometry.to_list()
x[:5]
[<POINT (34.798 31.242)>,
<POINT (34.798 31.242)>,
<POINT (34.798 31.243)>,
<POINT (34.798 31.243)>,
<POINT (34.798 31.243)>]
Then, we use the shapely.LineString function (see Points to line (shapely)) to convert the list of 'Point' geometries to a single 'LineString' geometry, as follows:
line = shapely.LineString(x)
line
In case we need the geometry to be contained in a GeoDataFrame, rather than a standalone shapely geometry, we can construct a GeoDataFrame using the methods we learned in the beginning of this chapter (see Creating a GeoDataFrame):
line1 = gpd.GeoDataFrame({'geometry': [line]}, crs=4326)
line1
| geometry | |
|---|---|
| 0 | LINESTRING (34.79798 31.24201, ... |
Here is an interactive map of the resulting layer line1:
line1.explore(color='red', style_kwds={'weight':8, 'opacity':0.8})
To lines: by group#
In the next example, we demonstrate how the conversion of points to lines can be generalized to produce multiple lines at once, in case when there is a grouping variable specifying the division of points to groups. For example, using the GTFS data, we may want to create a layer of type 'LineString', with all different bus routes together, rather than just one specific route (see To lines: all points combined). This example is relatively complex, with many steps to get the data into the right shape.
Our first step is to read four tables from the GTFS dataset, namely agency, routes, trips, and shapes, keeping just the columns that we need:
agency = pd.read_csv('data/gtfs/agency.txt', usecols=['agency_id', 'agency_name'])
routes = pd.read_csv('data/gtfs/routes.txt', usecols=['route_id', 'agency_id', 'route_short_name', 'route_long_name'])
trips = pd.read_csv('data/gtfs/trips.txt', usecols=['trip_id', 'route_id', 'shape_id'])
shapes = pd.read_csv('data/gtfs/shapes.txt', usecols=['shape_id', 'shape_pt_sequence', 'shape_pt_lon', 'shape_pt_lat'])
To have a more manageable layer size, we will subset just one bus operator, namely 'מטרופולין'. First, we need to figure out its agency_id:
agency = agency[agency['agency_name'] == 'מטרופולין']
agency
| agency_id | agency_name | |
|---|---|---|
| 9 | 15 | מטרופולין |
Now that we know the agency_id of 'מטרופולין', we can subset the routes table to retain only the routes by this particular operator. We also drop the 'agency_id' column which is no longer necessary. We can see there are 691 routes operated by 'מטרופולין':
routes = routes[routes['agency_id'] == agency['agency_id'].iloc[0]]
routes = routes.drop('agency_id', axis=1)
routes
| route_id | route_short_name | route_long_name | |
|---|---|---|---|
| 211 | 696 | 600 | ת. מרכזית נתניה/רציפים-נתניה<->... |
| 212 | 697 | 600 | ת.מרכזית תל אביב קומה 6/רציפים-... |
| 213 | 698 | 601 | ת. רכבת נתניה-נתניה<->ת.מרכזית ... |
| 214 | 700 | 601 | ת.מרכזית תל אביב קומה 6/רציפים-... |
| 215 | 707 | 604 | ת. מרכזית נתניה/רציפים-נתניה<->... |
| ... | ... | ... | ... |
| 7960 | 36003 | 457 | יצחק שדה/האצל-דימונה<->ירמיהו/א... |
| 8054 | 36477 | 178 | מסוף עתידים-תל אביב יפו<->המרכב... |
| 8055 | 36478 | 178 | המרכבה/הבנאי-חולון<->מסוף עתידי... |
| 8062 | 36503 | 13 | בית ברבור/דרך ההגנה-תל אביב יפו... |
| 8063 | 36504 | 13 | בי''ח וולפסון-חולון<->בית ברבור... |
691 rows × 3 columns
Next, we need to join the routes table with the shapes table, where the route coordinates are. However the routes table corresponds to the shapes table indirectly, through the trips table (Fig. 28). Therefore we first need to join trips to routes, and then shapes to the result. To make things simpler, we will remove duplicates from the trips table, assuming that all trips of a particular route follow the same shape (which they typically do), thus retaining just one representative trip per route. Here is the current state of the trips table, with numerous trips per route:
trips
| route_id | trip_id | shape_id | |
|---|---|---|---|
| 0 | 60 | 3764_081023 | 126055.0 |
| 1 | 68 | 3143332_011023 | 128020.0 |
| 2 | 68 | 3143338_011023 | 128020.0 |
| 3 | 68 | 3244955_011023 | 128020.0 |
| 4 | 68 | 3244956_011023 | 128020.0 |
| ... | ... | ... | ... |
| 398086 | 43 | 2827_081023 | 127560.0 |
| 398087 | 43 | 17254928_011023 | 127560.0 |
| 398088 | 43 | 2796_011023 | 127560.0 |
| 398089 | 43 | 2806_011023 | 127560.0 |
| 398090 | 43 | 2807_011023 | 127560.0 |
398091 rows × 3 columns
To remove duplicates from a table, we can use the .drop_duplicates method. This method accepts an argument named subset, where we specify which column(s) are considered for duplicates. In our case, we remove rows with duplicated route_id in trips, thus retaining just one route_id per trip:
trips = trips.drop_duplicates(subset='route_id')
trips
| route_id | trip_id | shape_id | |
|---|---|---|---|
| 0 | 60 | 3764_081023 | 126055.0 |
| 1 | 68 | 3143332_011023 | 128020.0 |
| 10 | 69 | 42263035_081023 | 128021.0 |
| 20 | 70 | 4667527_081023 | 133676.0 |
| 28 | 81 | 17255283_081023 | 140323.0 |
| ... | ... | ... | ... |
| 394653 | 16001 | 57730596_091023 | 135279.0 |
| 394750 | 17800 | 584950698_011023 | 120468.0 |
| 395305 | 35432 | 585244118_071023 | 139876.0 |
| 395741 | 21093 | 28981390_071023 | 139766.0 |
| 397299 | 16331 | 28394213_011023 | 142903.0 |
8188 rows × 3 columns
We can also drop the trip_id column, which is no longer necessary:
trips = trips.drop('trip_id', axis=1)
trips
| route_id | shape_id | |
|---|---|---|
| 0 | 60 | 126055.0 |
| 1 | 68 | 128020.0 |
| 10 | 69 | 128021.0 |
| 20 | 70 | 133676.0 |
| 28 | 81 | 140323.0 |
| ... | ... | ... |
| 394653 | 16001 | 135279.0 |
| 394750 | 17800 | 120468.0 |
| 395305 | 35432 | 139876.0 |
| 395741 | 21093 | 139766.0 |
| 397299 | 16331 | 142903.0 |
8188 rows × 2 columns
Now we can run both joins:
first joining
tripstoroutes, to attach the shape IDs,then
shapesto the result, to attach the lon/lat point sequences
as follows:
routes = pd.merge(routes, trips, on='route_id', how='left')
routes
| route_id | route_short_name | route_long_name | shape_id | |
|---|---|---|---|---|
| 0 | 696 | 600 | ת. מרכזית נתניה/רציפים-נתניה<->... | 138815.0 |
| 1 | 697 | 600 | ת.מרכזית תל אביב קומה 6/רציפים-... | 132223.0 |
| 2 | 698 | 601 | ת. רכבת נתניה-נתניה<->ת.מרכזית ... | 138816.0 |
| 3 | 700 | 601 | ת.מרכזית תל אביב קומה 6/רציפים-... | 145630.0 |
| 4 | 707 | 604 | ת. מרכזית נתניה/רציפים-נתניה<->... | 138656.0 |
| ... | ... | ... | ... | ... |
| 686 | 36003 | 457 | יצחק שדה/האצל-דימונה<->ירמיהו/א... | 143525.0 |
| 687 | 36477 | 178 | מסוף עתידים-תל אביב יפו<->המרכב... | 145131.0 |
| 688 | 36478 | 178 | המרכבה/הבנאי-חולון<->מסוף עתידי... | 145132.0 |
| 689 | 36503 | 13 | בית ברבור/דרך ההגנה-תל אביב יפו... | 144796.0 |
| 690 | 36504 | 13 | בי''ח וולפסון-חולון<->בית ברבור... | 144797.0 |
691 rows × 4 columns
routes = pd.merge(routes, shapes, on='shape_id', how='left')
routes
| route_id | route_short_name | route_long_name | shape_id | shape_pt_lat | shape_pt_lon | shape_pt_sequence | |
|---|---|---|---|---|---|---|---|
| 0 | 696 | 600 | ת. מרכזית נתניה/רציפים-נתניה<->... | 138815.0 | 32.326945 | 34.858709 | 1 |
| 1 | 696 | 600 | ת. מרכזית נתניה/רציפים-נתניה<->... | 138815.0 | 32.326788 | 34.859270 | 2 |
| 2 | 696 | 600 | ת. מרכזית נתניה/רציפים-נתניה<->... | 138815.0 | 32.326743 | 34.859413 | 3 |
| 3 | 696 | 600 | ת. מרכזית נתניה/רציפים-נתניה<->... | 138815.0 | 32.326359 | 34.859245 | 4 |
| 4 | 696 | 600 | ת. מרכזית נתניה/רציפים-נתניה<->... | 138815.0 | 32.326587 | 34.858540 | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 670119 | 36504 | 13 | בי''ח וולפסון-חולון<->בית ברבור... | 144797.0 | 32.053639 | 34.802566 | 829 |
| 670120 | 36504 | 13 | בי''ח וולפסון-חולון<->בית ברבור... | 144797.0 | 32.053632 | 34.802569 | 830 |
| 670121 | 36504 | 13 | בי''ח וולפסון-חולון<->בית ברבור... | 144797.0 | 32.053539 | 34.802274 | 831 |
| 670122 | 36504 | 13 | בי''ח וולפסון-חולון<->בית ברבור... | 144797.0 | 32.053421 | 34.801801 | 832 |
| 670123 | 36504 | 13 | בי''ח וולפסון-חולון<->בית ברבור... | 144797.0 | 32.053396 | 34.801617 | 833 |
670124 rows × 7 columns
The resulting routes table contains, in the 'shape_pt_lon' and 'shape_pt_lat' columns, the coordinate sequence per 'route_id'. Now, we can move on to the spatial part of the conversion, where the table will be transformed to a 'LineString' layer of routes.
First, we convert the DataFrame to a GeoDataFrame of type 'Point', same as we did above for the individual line number 24 in Beer-Sheva (see To lines: all points combined):
geom = gpd.points_from_xy(routes['shape_pt_lon'], routes['shape_pt_lat'], crs=4326)
routes = gpd.GeoDataFrame(data=routes, geometry=geom)
routes = routes.drop(['shape_pt_lon', 'shape_pt_lat'], axis=1)
routes
| route_id | route_short_name | route_long_name | shape_id | shape_pt_sequence | geometry | |
|---|---|---|---|---|---|---|
| 0 | 696 | 600 | ת. מרכזית נתניה/רציפים-נתניה<->... | 138815.0 | 1 | POINT (34.85871 32.32694) |
| 1 | 696 | 600 | ת. מרכזית נתניה/רציפים-נתניה<->... | 138815.0 | 2 | POINT (34.85927 32.32679) |
| 2 | 696 | 600 | ת. מרכזית נתניה/רציפים-נתניה<->... | 138815.0 | 3 | POINT (34.85941 32.32674) |
| 3 | 696 | 600 | ת. מרכזית נתניה/רציפים-נתניה<->... | 138815.0 | 4 | POINT (34.85924 32.32636) |
| 4 | 696 | 600 | ת. מרכזית נתניה/רציפים-נתניה<->... | 138815.0 | 5 | POINT (34.85854 32.32659) |
| ... | ... | ... | ... | ... | ... | ... |
| 670119 | 36504 | 13 | בי''ח וולפסון-חולון<->בית ברבור... | 144797.0 | 829 | POINT (34.80257 32.05364) |
| 670120 | 36504 | 13 | בי''ח וולפסון-חולון<->בית ברבור... | 144797.0 | 830 | POINT (34.80257 32.05363) |
| 670121 | 36504 | 13 | בי''ח וולפסון-חולון<->בית ברבור... | 144797.0 | 831 | POINT (34.80227 32.05354) |
| 670122 | 36504 | 13 | בי''ח וולפסון-חולון<->בית ברבור... | 144797.0 | 832 | POINT (34.8018 32.05342) |
| 670123 | 36504 | 13 | בי''ח וולפסון-חולון<->בית ברבור... | 144797.0 | 833 | POINT (34.80162 32.0534) |
670124 rows × 6 columns
If we needed to convert all sequences into a single 'LineString', we could use the method shown earlier (see To lines: all points combined). However, what we need is to create a separate 'LineString' per 'route_id'. Therefore, we:
Group the layer by
'shape_id', using the.aggmethod (see Different functions (agg))Use the
'first'“function” to keep the first'route_short_name'and'route_long_name'values per route (as they are identical in each'route_id'anyway)Apply a lambda function (see Custom functions (agg)) that “summarizes” the
'Point'geometries into a'LineString'geometry
Here is the code:
routes = routes.groupby('route_id').agg({
'route_short_name': 'first',
'route_long_name': 'first',
'geometry': lambda x: shapely.LineString(x.to_list()),
}).reset_index()
routes
| route_id | route_short_name | route_long_name | geometry | |
|---|---|---|---|---|
| 0 | 696 | 600 | ת. מרכזית נתניה/רציפים-נתניה<->... | LINESTRING (34.85871 32.32694, ... |
| 1 | 697 | 600 | ת.מרכזית תל אביב קומה 6/רציפים-... | LINESTRING (34.7784 32.05604, 3... |
| 2 | 698 | 601 | ת. רכבת נתניה-נתניה<->ת.מרכזית ... | LINESTRING (34.86808 32.31997, ... |
| 3 | 700 | 601 | ת.מרכזית תל אביב קומה 6/רציפים-... | LINESTRING (34.77911 32.05585, ... |
| 4 | 707 | 604 | ת. מרכזית נתניה/רציפים-נתניה<->... | LINESTRING (34.85871 32.32694, ... |
| ... | ... | ... | ... | ... |
| 686 | 36003 | 457 | יצחק שדה/האצל-דימונה<->ירמיהו/א... | LINESTRING (35.02688 31.07337, ... |
| 687 | 36477 | 178 | מסוף עתידים-תל אביב יפו<->המרכב... | LINESTRING (34.84134 32.11417, ... |
| 688 | 36478 | 178 | המרכבה/הבנאי-חולון<->מסוף עתידי... | LINESTRING (34.8067 32.01673, 3... |
| 689 | 36503 | 13 | בית ברבור/דרך ההגנה-תל אביב יפו... | LINESTRING (34.80169 32.0534, 3... |
| 690 | 36504 | 13 | בי''ח וולפסון-חולון<->בית ברבור... | LINESTRING (34.76128 32.0366, 3... |
691 rows × 4 columns
The .groupby and .agg workflow unfortunately “loses” the spatial properties of the data structure, going back to a DataFrame:
type(routes)
pandas.core.frame.DataFrame
Therefore, we need to convert the result back to a GeoDataFrame and restore the CRS information which was lost, using gpd.GeoDataFrame:
routes = gpd.GeoDataFrame(routes, crs=4326)
routes
| route_id | route_short_name | route_long_name | geometry | |
|---|---|---|---|---|
| 0 | 696 | 600 | ת. מרכזית נתניה/רציפים-נתניה<->... | LINESTRING (34.85871 32.32694, ... |
| 1 | 697 | 600 | ת.מרכזית תל אביב קומה 6/רציפים-... | LINESTRING (34.7784 32.05604, 3... |
| 2 | 698 | 601 | ת. רכבת נתניה-נתניה<->ת.מרכזית ... | LINESTRING (34.86808 32.31997, ... |
| 3 | 700 | 601 | ת.מרכזית תל אביב קומה 6/רציפים-... | LINESTRING (34.77911 32.05585, ... |
| 4 | 707 | 604 | ת. מרכזית נתניה/רציפים-נתניה<->... | LINESTRING (34.85871 32.32694, ... |
| ... | ... | ... | ... | ... |
| 686 | 36003 | 457 | יצחק שדה/האצל-דימונה<->ירמיהו/א... | LINESTRING (35.02688 31.07337, ... |
| 687 | 36477 | 178 | מסוף עתידים-תל אביב יפו<->המרכב... | LINESTRING (34.84134 32.11417, ... |
| 688 | 36478 | 178 | המרכבה/הבנאי-חולון<->מסוף עתידי... | LINESTRING (34.8067 32.01673, 3... |
| 689 | 36503 | 13 | בית ברבור/דרך ההגנה-תל אביב יפו... | LINESTRING (34.80169 32.0534, 3... |
| 690 | 36504 | 13 | בי''ח וולפסון-חולון<->בית ברבור... | LINESTRING (34.76128 32.0366, 3... |
691 rows × 4 columns
type(routes)
geopandas.geodataframe.GeoDataFrame
Here is a map of the result. This is a line layer of all 691 routes operated by 'מטרופולין', reflecting the shape of the first representative trip per route:
routes.explore(color='red', style_kwds={'weight':8, 'opacity':0.1})