What k value satisfies GDPR for geospatial datasets?

GDPR does not mandate a specific k value, but supervisory authority guidance and ENISA recommendations typically treat k ≥ 5 as a minimum and k ≥ 10 as a stronger baseline for location data. The appropriate threshold depends on the sensitivity of the spatial context and the realistic adversary model.

What is the difference between spatial uniqueness and re-identification risk?

Spatial uniqueness measures how many records share a location bin — a purely internal metric. Re-identification risk measures the probability that an adversary can correctly link a record to a real individual using auxiliary data. High uniqueness is a necessary but not sufficient condition for high re-identification risk.

How often should geospatial re-identification assessments be repeated?

At minimum, reassess quarterly and whenever: new auxiliary datasets are published, coordinate precision or sampling density changes, or data is shared with a new downstream consumer. Continuous monitoring hooks in the ETL pipeline catch regression risks automatically between scheduled assessments.

Re-identification Risk Assessment for Geospatial Datasets

Re-identification risk assessment for geospatial datasets is the structured process of measuring how likely spatial records are to be linked back to identifiable individuals — and implementing controls that reduce that probability to an acceptable threshold before data is released or shared.

The re-identification risk-assessment loop: raw coordinates enter spatial and temporal binning, equivalence classes are built, a linkage simulation tests adversarial joinability, and a threshold gate decides whether to release or iterate generalization.

When to Use This Framework vs. Alternatives

Not every location dataset needs the same assessment depth. Choosing the right approach depends on dataset sensitivity, population density, and downstream sharing risk:

Full re-identification risk assessment (this page): Use when releasing or sharing datasets externally, when records include sensitive-category locations (health facilities, places of worship, domestic addresses), or when downstream re-use is unbounded. This workflow quantifies disclosure probability with audit-grade rigor.
Grid aggregation and spatial binning: Use as a lightweight first-pass transformation when the primary concern is coordinate precision rather than adversarial linkage. Suitable for internal dashboards where linkage attack surface is low.
K-anonymity grouping for location traces: Use when you need a formal equivalence-class guarantee over trajectory data. Pairs naturally with this assessment to validate that transformations achieve the required k.
Differential privacy for location data: Use when releasing statistical aggregates (counts, heatmaps, histograms) rather than record-level data. Provides a mathematical privacy guarantee; combine with risk assessment to select the right privacy budget (ε).

Algorithmic Specification

Spatial Uniqueness

For a dataset of $n$ records, partition the space into a grid and let $C = \{c_1, c_2, \ldots, c_m\}$ be the $m$ occupied cells. Let $n_i$ be the number of records in cell $c_i$. The spatial uniqueness rate is:

$$U = \frac{|\{i : n_i < k\}|}{m}$$

where $k$ is the minimum equivalence-class size threshold. A dataset is spatially $k$-anonymous when $U = 0$, that is, when no occupied cell contains fewer than $k$ records.
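The uniqueness rate can be illustrated with a minimal sketch that works on plain cell identifiers. It assumes the rate is computed over occupied cells only; the function name `spatial_uniqueness_rate` and the sample cell labels are hypothetical.

```python
from collections import Counter
from typing import Hashable, Iterable


def spatial_uniqueness_rate(cell_ids: Iterable[Hashable], k: int) -> float:
    """Fraction of occupied cells that hold fewer than k records."""
    counts = Counter(cell_ids)          # n_i for every occupied cell c_i
    if not counts:
        return 0.0                      # no records, so no at-risk cells
    below_k = sum(1 for n in counts.values() if n < k)
    return below_k / len(counts)        # |{i : n_i < k}| / m


# Hypothetical example: cell "a" has 6 records, "b" has 2, "c" has 1.
cells = ["a"] * 6 + ["b"] * 2 + ["c"]
rate = spatial_uniqueness_rate(cells, k=5)
```

With these counts two of the three occupied cells fall below k = 5, so the dataset is not spatially 5-anonymous.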

Re-identification Probability

The prosecutor model estimates worst-case re-identification probability for a target record $r$ as:

$$\rho_r = \frac{1}{n_{c(r)}}$$

where $n_{c(r)}$ is the count of records sharing the same cell as $r$. For $k$-anonymity, $\rho_r \leq \frac{1}{k}$. The journalist model averages over the set of unique records, written $U_{\text{records}}$ (not to be confused with the uniqueness rate $U$):

$$\bar{\rho} = \frac{1}{|U_{\text{records}}|} \sum_{r \in U_{\text{records}}} \frac{1}{n_{c(r)}}$$
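As a quick worked example of the prosecutor formula, assume three hypothetical cells holding 1, 4, and 20 records. A record in each cell then carries the following risk:

```latex
\rho_r = \frac{1}{1} = 1.00, \qquad
\rho_r = \frac{1}{4} = 0.25, \qquad
\rho_r = \frac{1}{20} = 0.05
```

With $k = 5$ the bound is $\frac{1}{k} = 0.20$, so only the 20-record cell satisfies it; the other two cells would need coarsening or suppression.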

Parameter Reference

Parameter	Typical range	Meaning
$k$ (minimum equivalence size)	5–50	Records sharing a spatial bin; higher k → stronger privacy
Grid resolution (H3 level)	7–10	H3 level 8 cells average ≈ 0.74 km² (edge ≈ 460 m); level 7 cells average ≈ 5.2 km²
Linkage success rate threshold	< 1–5 %	Fraction of records uniquely resolvable via auxiliary join
$\rho_{max}$ (max re-id probability)	0.01–0.20	Regulatory or organizational ceiling on correct identification
Temporal bin width	1 h–1 day	Time window for trajectory binning; wider = less unique

Prerequisites & Data Requirements

Establish these baseline conditions before running the assessment:

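Coordinate Reference System. All spatial attributes used in distance, radius, or area calculations must share a single projected CRS. Geographic coordinates (EPSG:4326) use angular degrees and produce inaccurate distance calculations. Convert to a local metric projection, such as a UTM zone (e.g., EPSG:32633 for central Europe), using pyproj.Transformer before any spatial join or radius calculation; EPSG:3857 is acceptable only for rough web-scale approximations because its distance distortion grows with latitude. H3 indexing is the exception: it takes WGS84 latitude and longitude, so the binning steps work in EPSG:4326.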

Data inventory and schema mapping. Catalog every spatial attribute (points, linestrings, polygons, trajectories) alongside quasi-identifiers: timestamps, device or user IDs, demographic tags, and administrative boundary codes. Document data lineage so assessments are reproducible against the same snapshot.

Adversary knowledge assumption. Define which auxiliary datasets an adversary could realistically access: open street networks, census block boundaries, commercial POI databases, published mobility traces. Scope must be documented to maintain audit defensibility. See spatial linkage attack vectors for a taxonomy of realistic auxiliary sources.

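Risk threshold definition. Agree on organizational thresholds before measurement begins: k ≥ 5 for spatial k-anonymity and $\rho_{max} \leq 0.09$ are common baselines. The two are not interchangeable: under the prosecutor model, k ≥ 5 only bounds risk at 1/5 = 0.20, whereas $\rho_{max} \leq 0.09$ requires at least 12 records per cell (1/11 ≈ 0.091 still exceeds it). Align thresholds with jurisdictional obligations; GDPR and CCPA compliance mapping translates legal requirements into concrete parameter choices.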
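Because k and $\rho_{max}$ are linked through the prosecutor relation $\rho_r = 1/n_{c(r)}$, it can help to convert one into the other before fixing thresholds. The following is a minimal sketch under that assumption; the helper names `min_cell_size` and `max_risk_for_k` are hypothetical.

```python
import math


def max_risk_for_k(k: int) -> float:
    """Worst-case prosecutor risk when every cell holds at least k records."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    return 1.0 / k


def min_cell_size(rho_max: float) -> int:
    """Smallest cell population n such that 1/n <= rho_max."""
    if not 0 < rho_max <= 1:
        raise ValueError("rho_max must be in (0, 1]")
    n = math.ceil(1 / rho_max)
    # Guard against floating-point rounding in 1 / rho_max.
    while n > 1 and 1 / (n - 1) <= rho_max:
        n -= 1
    while 1 / n > rho_max:
        n += 1
    return n


required_k = min_cell_size(0.09)   # cell size implied by rho_max = 0.09
risk_at_5 = max_risk_for_k(5)      # risk ceiling implied by k = 5
```

Running the conversion in both directions makes it easy to spot a k threshold and a probability ceiling that do not agree with each other.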

Python dependencies. Install with pinned versions:

geopandas==0.14.4
shapely==2.0.4
pyproj==3.6.1
h3==3.7.7
scikit-learn==1.4.2
numpy==1.26.4
scipy==1.13.0

Step-by-Step Implementation

Step 1: Spatial Profiling and Uniqueness Baseline

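Aggregate records into H3 hexagonal bins to establish a baseline uniqueness distribution. H3 indexing takes WGS84 latitude and longitude, so the function reprojects to EPSG:4326 when needed; keep the metric CRS for distance-based steps. H3 level 8 (average cell area ≈ 0.74 km², edge length ≈ 460 m) is a reasonable starting point for urban mobility datasets; move to a lower level (coarser) for sparse rural data.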

import geopandas as gpd
import h3
import numpy as np
import pandas as pd
from pyproj import Transformer

def compute_spatial_uniqueness(
    gdf: gpd.GeoDataFrame,
    h3_resolution: int = 8,
    k_threshold: int = 5,
) -> dict:
    """
    Compute spatial uniqueness rate and per-record re-identification
    probability (prosecutor model) for a point dataset.

    Parameters
    ----------
    gdf : GeoDataFrame with geometry column in EPSG:4326 (WGS84).
    h3_resolution : H3 resolution level (7–10 typical for urban data).
    k_threshold : Minimum equivalence-class size; records below this are
                  considered at-risk disclosures.

    Returns
    -------
    dict with keys: uniqueness_rate, mean_reid_prob, at_risk_count, bin_counts.
    """
    # Ensure WGS84 for H3 indexing (H3 expects lat/lon in degrees)
    if gdf.crs and gdf.crs.to_epsg() != 4326:
        gdf = gdf.to_crs(epsg=4326)

    # Assign H3 cell index to each point
    gdf = gdf.copy()
    gdf["h3_cell"] = gdf.geometry.apply(
        lambda geom: h3.geo_to_h3(geom.y, geom.x, h3_resolution)
    )

    # Count records per bin
    bin_counts: pd.Series = gdf["h3_cell"].value_counts()
    gdf["bin_count"] = gdf["h3_cell"].map(bin_counts)

    # Uniqueness rate: fraction of occupied bins with fewer than k records
    at_risk_bins = (bin_counts < k_threshold).sum()
    uniqueness_rate = at_risk_bins / len(bin_counts)

    # Prosecutor model: re-id probability = 1 / bin_count
    gdf["reid_prob"] = 1.0 / gdf["bin_count"]
    at_risk_records = (gdf["bin_count"] < k_threshold).sum()

    return {
        "uniqueness_rate": float(uniqueness_rate),
        "mean_reid_prob": float(gdf["reid_prob"].mean()),
        "at_risk_count": int(at_risk_records),
        "bin_counts": bin_counts,
    }

The privacy implication of H3 resolution is non-linear: stepping from level 9 to level 8 increases cell area by roughly 7×, which can reduce uniqueness rate by 40–60 % on typical urban mobility datasets. Always compare at two adjacent resolutions to understand sensitivity before committing to a grid.
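The two-resolution comparison can be sketched without H3 at all. The example below swaps in a plain square grid over projected coordinates as a stand-in for H3 cells, purely to show the shape of the sensitivity check; `uniqueness_rate` and `resolution_sensitivity` are hypothetical helpers, not part of any library.

```python
from collections import Counter


def uniqueness_rate(points, cell_size, k=5):
    """Fraction of occupied square cells holding fewer than k points."""
    counts = Counter(
        (int(x // cell_size), int(y // cell_size)) for x, y in points
    )
    if not counts:
        return 0.0
    at_risk = sum(1 for n in counts.values() if n < k)
    return at_risk / len(counts)


def resolution_sensitivity(points, fine_size, coarse_size, k=5):
    """Compare uniqueness at a fine grid and the next coarser grid."""
    fine = uniqueness_rate(points, fine_size, k)
    coarse = uniqueness_rate(points, coarse_size, k)
    return {"fine": fine, "coarse": coarse, "absolute_drop": fine - coarse}


# Hypothetical data: 20 points spaced one unit apart along a line.
points = [(i, 0) for i in range(20)]
result = resolution_sensitivity(points, fine_size=1, coarse_size=5, k=5)
```

To mimic an H3 step on real data, the coarse cell side would be roughly √7 times the fine one, so that cell area grows by about 7×.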

Step 2: Adversarial Linkage Simulation

Uniqueness alone does not equal identification; linkage with auxiliary data does. This step simulates how many records become uniquely resolvable when an adversary joins against a realistic auxiliary source. The combination of home cell, work cell, and visit frequency is the classic quasi-identifier triple for mobility data; refine to match your threat model from spatial linkage attack vectors and mitigation.

def simulate_linkage_attack(
    target_gdf: gpd.GeoDataFrame,
    auxiliary_gdf: gpd.GeoDataFrame,
    quasi_identifiers: list[str],
    h3_resolution: int = 8,
) -> dict:
    """
    Simulate a record linkage attack by joining target data against an
    auxiliary dataset on quasi-identifier combinations.

    Parameters
    ----------
    target_gdf : GeoDataFrame of records to assess (EPSG:4326).
    auxiliary_gdf : GeoDataFrame representing attacker's auxiliary data.
    quasi_identifiers : Column names used as linkage keys (must exist in both).
    h3_resolution : H3 resolution for the spatial quasi-identifier, which is
                    added to the linkage keys as an extra column.

    Returns
    -------
    dict with linkage_success_rate and n_uniquely_linked.
    """
    for gdf in [target_gdf, auxiliary_gdf]:
        if gdf.crs and gdf.crs.to_epsg() != 4326:
            raise ValueError("Both GeoDataFrames must be in EPSG:4326")

    # Work on copies so the caller's GeoDataFrames are not mutated
    target = target_gdf.copy()
    aux = auxiliary_gdf.copy()

    # Build spatial quasi-identifier: H3 cell as a proxy for home/work zone
    for gdf in (target, aux):
        gdf["h3_cell"] = gdf.geometry.apply(
            lambda g: h3.geo_to_h3(g.y, g.x, h3_resolution)
        )
    keys = quasi_identifiers + ["h3_cell"]

    # Count auxiliary records per quasi-identifier combination
    aux_counts = (
        aux[keys].groupby(keys).size().rename("aux_matches").reset_index()
    )

    # Attach the match count to each target record (0 when nothing matches)
    merged = target[keys].merge(aux_counts, on=keys, how="left")
    match_counts = merged["aux_matches"].fillna(0)

    # A record is "uniquely linked" if exactly one auxiliary record matches
    uniquely_linked = int((match_counts == 1).sum())
    linkage_success_rate = uniquely_linked / len(target) if len(target) else 0.0

    return {
        "linkage_success_rate": float(linkage_success_rate),
        "n_uniquely_linked": uniquely_linked,
    }

Document which auxiliary sources were used, the date they were accessed, and the computational complexity required for each simulated attack. This forms a reproducible adversary knowledge record for regulators.
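One way to keep that adversary knowledge record reproducible is to serialize it next to the assessment output. The sketch below is only an illustration: the `AuxiliarySourceRecord` class, its field names, and the sample values are all hypothetical and should be adapted to your own audit schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AuxiliarySourceRecord:
    """One auxiliary dataset used in a simulated linkage attack."""
    name: str                           # label for the auxiliary source
    accessed_on: str                    # ISO 8601 date the source was accessed
    quasi_identifiers: tuple[str, ...]  # columns used as linkage keys
    join_complexity: str                # free-text note on computational cost
    linkage_success_rate: float         # result of the simulation


def to_audit_json(records: list[AuxiliarySourceRecord]) -> str:
    """Serialize records deterministically so diffs between runs are readable."""
    return json.dumps([asdict(r) for r in records], indent=2, sort_keys=True)


# Hypothetical entry for illustration only.
record = AuxiliarySourceRecord(
    name="census_blocks",
    accessed_on="2024-01-15",
    quasi_identifiers=("home_cell", "work_cell"),
    join_complexity="single hash join on two keys",
    linkage_success_rate=0.03,
)
audit_json = to_audit_json([record])
```

Sorting keys and storing dates as ISO strings keeps the output stable across runs, which makes it easier to version alongside the dataset snapshot.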

Step 3: Risk Quantification and Threshold Validation

Translate the raw metrics into the three standard statistical disclosure control (SDC) risk models:

from dataclasses import dataclass
from typing import Literal

import pandas as pd

@dataclass
class RiskReport:
    model: Literal["prosecutor", "journalist", "marketer"]
    mean_risk: float
    max_risk: float
    records_above_threshold: int
    threshold: float
    passes: bool

def compute_sdc_risk_models(
    bin_counts: pd.Series,
    rho_max: float = 0.09,
) -> list[RiskReport]:
    """
    Compute prosecutor, journalist, and marketer re-identification risk models
    per NIST SP 800-188 statistical disclosure control methodology.

    Parameters
    ----------
    bin_counts : Series indexed by bin ID whose values are the number of
                 records in that bin (as returned by compute_spatial_uniqueness).
    rho_max : Maximum acceptable re-identification probability.
    """
    if bin_counts.empty:
        raise ValueError("bin_counts is empty; nothing to assess")

    reports = []
    n_records = int(bin_counts.sum())

    # Prosecutor model: worst-case for a targeted individual.
    # Every record in a bin of size n carries risk 1/n, so the mean and the
    # at-risk count are weighted by bin size to stay record-level.
    prosecutor_risks = 1.0 / bin_counts
    reports.append(RiskReport(
        model="prosecutor",
        mean_risk=float((prosecutor_risks * bin_counts).sum() / n_records),
        max_risk=float(prosecutor_risks.max()),
        records_above_threshold=int(bin_counts[prosecutor_risks > rho_max].sum()),
        threshold=rho_max,
        passes=bool(prosecutor_risks.max() <= rho_max),
    ))

    # Journalist model (proxy): share of occupied bins holding exactly one
    # record. records_above_threshold is a dataset-level 0/1 flag here.
    unique_bins = bin_counts[bin_counts == 1]
    journalist_risk = len(unique_bins) / len(bin_counts)
    reports.append(RiskReport(
        model="journalist",
        mean_risk=journalist_risk,
        max_risk=journalist_risk,
        records_above_threshold=int(journalist_risk > rho_max),
        threshold=rho_max,
        passes=journalist_risk <= rho_max,
    ))

    # Marketer model: fraction of records an adversary could correctly identify
    # in an untargeted sweep (expected correct identifications / n).
    # A bin of size n contributes n * (1/n) = 1 expected correct match.
    correct_ids = len(bin_counts)
    marketer_risk = correct_ids / n_records
    reports.append(RiskReport(
        model="marketer",
        mean_risk=marketer_risk,
        max_risk=marketer_risk,
        records_above_threshold=int(marketer_risk > rho_max),
        threshold=rho_max,
        passes=marketer_risk <= rho_max,
    ))

    return reports

If any model reports passes=False, the dataset requires remediation before release. Flag the dataset for coarsening, suppression, or noise injection and re-run from Step 1.
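The threshold gate described here can be sketched as a small decision function. It assumes only that each report exposes `model` and `passes` attributes, as RiskReport does; the function name `release_gate` and the returned keys are hypothetical, and `SimpleNamespace` objects stand in for real reports so the example is self-contained.

```python
from types import SimpleNamespace


def release_gate(reports) -> dict:
    """Decide whether a dataset may be released based on per-model results."""
    if not reports:
        # No evidence is treated as a failure, not as a pass.
        return {"release": False, "failing_models": [], "action": "run assessment"}
    failing = [r.model for r in reports if not r.passes]
    if failing:
        return {
            "release": False,
            "failing_models": failing,
            "action": "coarsen, suppress, or add noise, then re-run from Step 1",
        }
    return {"release": True, "failing_models": [], "action": "release"}


# Stand-in reports for illustration; real code would pass RiskReport objects.
reports = [
    SimpleNamespace(model="prosecutor", passes=False),
    SimpleNamespace(model="journalist", passes=True),
    SimpleNamespace(model="marketer", passes=True),
]
decision = release_gate(reports)
```

Treating an empty report list as a failure is a deliberate design choice in this sketch: a dataset that was never assessed should not slip through the gate.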

Step 4: Anonymization Control Implementation

Once risk is measured, apply privacy-preserving spatial transformations. Validate post-transformation datasets by re-running the uniqueness and linkage simulations — controls are only considered effective when risk metrics consistently fall below thresholds without destroying analytical utility.

Spatial generalization (coarsening the H3 grid):

from shapely.geometry import Point

def coarsen_to_k_anonymous(
    gdf: gpd.GeoDataFrame,
    start_resolution: int,
    k_threshold: int = 5,
    min_resolution: int = 5,
) -> tuple[gpd.GeoDataFrame, int]:
    """
    Iteratively coarsen the H3 grid until all occupied bins contain at
    least k records. Returns the anonymized GeoDataFrame and the
    resolution used.

    Stops at min_resolution to prevent total geographic destruction.
    """
    gdf = gdf.copy()
    if gdf.crs and gdf.crs.to_epsg() != 4326:
        gdf = gdf.to_crs(epsg=4326)

    def cell_centroid(cell: str) -> Point:
        # h3_to_geo returns (lat, lng); shapely points take (x=lng, y=lat)
        lat, lng = h3.h3_to_geo(cell)
        return Point(lng, lat)

    for resolution in range(start_resolution, min_resolution - 1, -1):
        gdf["h3_cell"] = gdf.geometry.apply(
            lambda g: h3.geo_to_h3(g.y, g.x, resolution)
        )
        bin_counts = gdf["h3_cell"].value_counts()
        if bin_counts.min() >= k_threshold:
            # Replace geometry with H3 cell centroid (snapping to grid)
            gdf["geometry"] = gpd.GeoSeries(
                gdf["h3_cell"].apply(cell_centroid), crs=gdf.crs
            )
            return gdf, resolution

    raise ValueError(
        f"Cannot achieve k={k_threshold} even at H3 resolution {min_resolution}. "
        "Consider suppressing sparse regions or applying differential privacy noise."
    )

Coordinate suppression for sparse regions removes records in cells that cannot be generalized to k — the approach endorsed by NIST SP 800-188 when utility collapse from coarsening is unacceptable. Track suppression counts as a data quality metric and report them alongside the risk assessment output.
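A minimal sketch of suppression with a suppression log might look like the following. It operates on plain dictionaries rather than a GeoDataFrame to stay self-contained; the function name `suppress_sparse_cells`, the `cell_key` parameter, and the log keys are hypothetical.

```python
from collections import Counter


def suppress_sparse_cells(records, k, cell_key="h3_cell"):
    """Drop records in cells with fewer than k records and report what was removed."""
    counts = Counter(r[cell_key] for r in records)
    kept = [r for r in records if counts[r[cell_key]] >= k]
    sparse_cells = [cell for cell, n in counts.items() if n < k]
    suppressed = len(records) - len(kept)
    log = {
        "suppressed_records": suppressed,
        "suppressed_cells": len(sparse_cells),
        # Share of the input lost to suppression, reported as a quality metric.
        "suppression_rate": suppressed / len(records) if records else 0.0,
    }
    return kept, log


# Hypothetical input: five records in cell "a", two in cell "b".
records = [{"h3_cell": "a"}] * 5 + [{"h3_cell": "b"}] * 2
kept, log = suppress_sparse_cells(records, k=5)
```

Reporting the suppression rate alongside the risk metrics makes the utility cost of the control visible to reviewers.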

Differential privacy noise injection adds calibrated Laplace or Gaussian noise to spatial counts or trajectory densities, providing a mathematical privacy bound. For trajectory-level datasets, consult how to calculate re-identification risk for GPS logs to apply temporal-spatial entropy metrics before choosing between generalization and noise-based controls.
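For the count-based case, the Laplace mechanism can be sketched with the standard library alone. This is an illustration rather than a production mechanism: it assumes each individual contributes at most one record to the counts (sensitivity 1), and the names `laplace_noise` and `noisy_counts` are hypothetical.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_counts(cell_counts, epsilon, sensitivity=1.0, seed=None):
    """Add Laplace noise with scale sensitivity / epsilon to each cell count."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    return {
        cell: count + laplace_noise(scale, rng)
        for cell, count in cell_counts.items()
    }


# Hypothetical per-cell counts; a smaller epsilon means more noise.
released = noisy_counts({"cell_a": 120, "cell_b": 3}, epsilon=0.5, seed=7)
```

The seed is exposed only to make tests repeatable; a real release would need a cryptographically secure noise source and privacy-budget accounting across every query on the same data.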

Validation and Re-identification Testing

After applying controls, run the full validation suite before sign-off:

import pytest

def test_k_anonymity_holds(gdf: gpd.GeoDataFrame, k: int = 5) -> None:
    """Assert that no H3 bin in the output contains fewer than k records."""
    bin_counts = gdf["h3_cell"].value_counts()
    violations = bin_counts[bin_counts < k]
    assert violations.empty, (
        f"k-anonymity violation: {len(violations)} bins have fewer than {k} records. "
        f"Smallest bin: {violations.min()} records."
    )

def test_crs_is_metric(gdf: gpd.GeoDataFrame, expected_epsg: int = 32633) -> None:
    """Assert the GeoDataFrame uses the expected metric CRS."""
    assert gdf.crs is not None, "GeoDataFrame has no CRS set"
    assert gdf.crs.to_epsg() == expected_epsg, (
        f"CRS mismatch: expected EPSG:{expected_epsg}, got {gdf.crs.to_epsg()}"
    )

def test_no_invalid_geometries(gdf: gpd.GeoDataFrame) -> None:
    """Assert all geometries are topologically valid."""
    invalid = gdf[~gdf.geometry.is_valid]
    assert invalid.empty, f"{len(invalid)} invalid geometries detected"

Run these assertions as pytest tests in the ETL pipeline, supplying the output GeoDataFrame through a fixture, so that any schema or binning change that regresses privacy controls fails the deployment gate automatically. Track uniqueness histograms and linkage matrices alongside dataset versions in a metadata catalog; this creates a versioned audit trail that supports accountability reviews by data protection authorities.

Common Failure Modes and Gotchas

Projection drift. Running distance-based risk calculations in EPSG:4326 produces angular measurements, not metres. A 0.001° longitude offset is ≈111 m at the equator but only ≈55 m at 60° latitude. Always project to UTM or another suitable local metric CRS before any neighbourhood or radius calculation, and assert the CRS in every pipeline stage.
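The latitude dependence is easy to check numerically. The sketch below uses a spherical approximation of about 111.32 km per degree of longitude at the equator, which is an assumption good enough for illustration but not for production distance work; the helper name `lon_offset_metres` is hypothetical.

```python
import math

# Approximate length of one degree of longitude at the equator, in metres.
METRES_PER_DEGREE_AT_EQUATOR = 111_320.0


def lon_offset_metres(delta_deg: float, latitude_deg: float) -> float:
    """Ground distance covered by a longitude offset at a given latitude."""
    return (
        delta_deg
        * METRES_PER_DEGREE_AT_EQUATOR
        * math.cos(math.radians(latitude_deg))
    )


at_equator = lon_offset_metres(0.001, 0.0)    # same angular offset...
at_60_north = lon_offset_metres(0.001, 60.0)  # ...about half the distance
```

The same angular tolerance therefore covers roughly half the ground distance at 60° latitude, which is why radius-based checks belong in a metric CRS.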

Boundary-crossing artifacts. H3 and square-grid cells do not respect administrative boundaries. Records near a grid boundary may fall into bins that span two neighbourhoods with radically different population densities, causing k-anonymity to appear satisfied when the effective population in the relevant zone is much smaller. Validate bin counts against census boundaries as a secondary check in high-stakes assessments.

Sparse-data edge cases. Rural or industrial datasets often have large spatial extents with very few records. Coarsening to k-anonymity may require resolutions so coarse that the data loses analytical value entirely — utility collapse. In these cases, differential privacy noise on aggregated counts is a better control than generalization. Document the utility threshold (minimum acceptable spatial resolution) before beginning so that suppression and noise decisions are defensible.

Temporal re-aggregation. Removing high-precision timestamps from the schema is not sufficient if the dataset retains daily or weekly visit counts per location. Visit frequency combined with a coarsened spatial bin can still uniquely identify individuals in datasets where routine is the discriminator — a pattern that linkage simulation (Step 2) must explicitly test for.
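A simple way to test for this pattern is to compare how many records are unique on the spatial bin alone versus the bin combined with visit frequency. The sketch below assumes records are plain dictionaries; the helper `unique_fraction` and the field names `cell` and `weekly_visits` are hypothetical.

```python
from collections import Counter


def unique_fraction(records, keys):
    """Fraction of records whose combination of the given keys is unique."""
    if not records:
        return 0.0
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)


# Hypothetical records: one shared coarse cell, differing visit frequencies.
records = [
    {"cell": "a", "weekly_visits": 1},
    {"cell": "a", "weekly_visits": 2},
    {"cell": "a", "weekly_visits": 3},
    {"cell": "a", "weekly_visits": 3},
]
spatial_only = unique_fraction(records, ["cell"])
with_frequency = unique_fraction(records, ["cell", "weekly_visits"])
```

In this toy data nobody is unique on the cell alone, yet half the records become unique once visit frequency is added as a quasi-identifier.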

Silent geometry errors. Invalid topologies (is_valid == False) produce silent failures during geopandas.sjoin() that corrupt risk counts without raising exceptions. Always run shapely.make_valid() and filter self-intersecting polygons before the assessment pipeline begins.

Overlooking trajectory quasi-identifiers. A privacy risk scoring framework for GIS that treats location points as independent records will underestimate risk for trajectory datasets, where the sequence of visits is itself a powerful quasi-identifier even when individual points are generalized. Apply temporal-sequence entropy checks alongside spatial uniqueness.

Compliance Alignment

Control	Satisfied by	Documentation requirement
GDPR Art. 25 (Data Protection by Design)	Steps 1–4 embedded in ETL pipeline	Privacy impact assessment record + threshold evidence
GDPR Art. 5(1)(e) (Storage Limitation)	Suppression of sparse-region records	Suppression log with counts and date
GDPR Recital 26 (Anonymisation standard)	Prosecutor model $\rho \leq 0.09$ + linkage test	SDC risk report signed by DPO or data steward
CCPA § 1798.140(m) (definition of "deidentified", as amended by the CPRA)	All three SDC models passing thresholds	Technical safeguards documentation + contractual prohibitions on re-identification
NIST SP 800-188 (De-Identifying Government Datasets)	Prosecutor, journalist, and marketer risk models	Formal risk report with methodology, parameters, and threshold rationale
ISO/IEC 29101 (Privacy architecture framework)	Versioned risk artifacts in metadata catalog	Audit trail with dataset version, assessment date, and assessor identity

Re-assessments are required at minimum quarterly and whenever new auxiliary datasets emerge that expand an adversary’s linkage surface. Integrate continuous monitoring by hooking the compute_spatial_uniqueness function into data access gateways — datasets that exceed thresholds are automatically routed to a privacy review queue rather than released.
