import geopandas as gpd
import numpy as np
import matplotlib.pyplot as plt
from shapely import box
Hilbert distance-based split into train and test
This notebook illustrates the core of the approach to split chips into train and test used in the CEUS paper. There it was a bit more complicated than that but the principle is illustrated here.
Get only a subset of the country for illustration - NW.
= (321566, 365379, 468106, 437198) start_x, start_y, end_x, end_y
Load the data and clip them to the box defined above. The data from https://figshare.com/ndownloader/files/38736501
= gpd.read_file(
signatures "/Users/martin/Downloads/spatial_signatures_GB_simplified.gpkg",
=(start_x, start_y, end_x, end_y)
bbox ).clip(box(start_x, start_y, end_x, end_y))
Get coordinates of chip centroids.
= np.arange(start_x, end_x, 250)
x_coords = np.arange(start_y, end_y, 250)
y_coords = np.meshgrid(x_coords, y_coords)
xv, yv = np.vstack([xv.ravel(), yv.ravel()]) combinations
Get chip geoemtry.
= gpd.GeoSeries.from_xy(x=combinations[0], y=combinations[1], crs=signatures.crs).buffer(125, cap_style=3) grid_cells
Filter only those fully within signatures.
= grid_cells.sindex.query(signatures.geometry, predicate="contains")
sig_idx, grid_idx = grid_cells.iloc[grid_idx].to_frame('geometry')
valid_grid_cells "sig_id"] = sig_idx valid_grid_cells[
Get unique signature IDs to pull from.
= valid_grid_cells.sig_id.unique()
unique 0] unique.shape[
745
valid_grid_cells.sig_id.value_counts().describe()
count 745.000000
mean 153.918121
std 1712.829432
min 1.000000
25% 1.000000
50% 5.000000
75% 15.000000
max 37716.000000
Name: count, dtype: float64
Illustrate the split using Hilbert distance. Chip groups with less than 20 chips are not split and should be allocated together either to train or test. The distance itself could be retrieved via GeoSeries.hilbert_distance()
if needed.
= valid_grid_cells[valid_grid_cells.sig_id == unique[0]]
g if g.shape[0] > 20:
= np.empty(g.shape[0], dtype=int)
split = int(np.floor(g.shape[0] * 0.8))
floor = 0
split[:floor] = 1
split[floor:] else:
= np.ones(g.shape[0])
split
= plt.subplots(1, 2, sharey=True)
f, ax "geometry").plot(split, ax=ax[0], edgecolor='w', linewidth=.1)
g.sort_values("geometry").plot(cmap="viridis", ax=ax[1], edgecolor='w', linewidth=.1)
g.sort_values(0].set_title("split")
ax[1].set_title("hilbert distance") ax[
Text(0.5, 1.0, 'hilbert distance')