Preventing Spatial Linkage Attacks in Public Transit Data

Preventing spatial linkage attacks in public transit data requires replacing raw WGS84 coordinates with population-density-aware H3 hexagonal cells, applying uniform temporal jitter of ±3 minutes, and enforcing a minimum cell count of $k \geq 5$ before any record leaves the secure environment.

Core Calculation and Parameter Table

Transit records become linkable because a trip sequence — stop A at 07:42, stop B at 08:01, stop C at 08:19 — is nearly unique across a metropolitan area. The re-identification probability for a single record in a cell containing $n$ records is:

$$P_{\text{re-id}} = \frac{1}{n}$$

To hold $P_{\text{re-id}} \leq 0.2$ (a practical threshold for public-sector releases), you need $n \geq 5$, which is the standard $k = 5$ baseline. For off-peak windows where ridership drops, requiring $k \geq 10$ bounds the risk at or below $0.1$.
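The link between a target risk bound and the required cell count can be checked in a few lines. This is a minimal sketch that assumes the uniform-guess model above ($P_{\text{re-id}} = 1/n$); the function name `min_k_for_risk` is illustrative and not part of any library.

```python
import math


def min_k_for_risk(p_max: float) -> int:
    """Smallest cell count n such that 1/n <= p_max under the 1/n model."""
    if not 0 < p_max <= 1:
        raise ValueError("p_max must be in (0, 1]")
    # Subtract a tiny tolerance so floating-point noise in 1/p_max
    # cannot push an exact result up by one.
    return max(1, math.ceil(1 / p_max - 1e-9))


print(min_k_for_risk(0.2))  # 5
print(min_k_for_risk(0.1))  # 10
```

Under this model, tightening the risk bound from 0.2 to 0.1 doubles the minimum cell count, which is why sparse off-peak windows lose cells quickly.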

The temporal jitter half-width $\Delta t$ determines how precisely an attacker can match records to scheduled service:

$$t_{\text{released}} = t_{\text{raw}} + U(-\Delta t,\; \Delta t)$$

where $U$ denotes a uniform distribution. Choosing $\Delta t = 180$ s (3 min) keeps released timestamps within one schedule headway interval for frequent services while preventing exact-second matching across auxiliary datasets.
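For a single record, the jitter rule can be sketched with the standard library alone. The helper name `jitter_timestamp` and the sample date are illustrative assumptions; the 07:42 tap time echoes the trip sequence above.

```python
import random
from datetime import datetime, timedelta, timezone


def jitter_timestamp(t_raw: datetime, delta_t: float = 180.0, rng=None) -> datetime:
    """Return t_raw shifted by one draw from U(-delta_t, +delta_t) seconds."""
    # SystemRandom draws from OS entropy, so offsets are not reproducible.
    rng = rng or random.SystemRandom()
    offset = rng.uniform(-delta_t, delta_t)
    return t_raw + timedelta(seconds=offset)


raw = datetime(2024, 3, 4, 7, 42, 0, tzinfo=timezone.utc)  # illustrative tap time
released = jitter_timestamp(raw, rng=random.Random(0))  # seeded only for a repeatable demo
assert abs((released - raw).total_seconds()) <= 180
```

The released timestamp always stays within $\Delta t$ of the raw one, so the half-width directly bounds the timing error a downstream model has to tolerate.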

Parameter	Symbol	Recommended value	Notes
H3 resolution (urban)	$r$	8	Cell area ≈ 0.74 km²; covers most stop clusters
H3 resolution (suburban/rural)	$r$	7	Cell area ≈ 5.16 km²; use when ridership density is low
K-anonymity threshold (baseline)	$k$	5	Matches supervisory authority guidance
K-anonymity threshold (off-peak)	$k$	10	Sparse corridors where cell counts fall quickly
Temporal jitter half-width	$\Delta t$	180 s	Stays within one service headway for most urban lines

Worked numeric example. A morning peak export contains 12,400 tap-in events. After assigning H3 resolution-8 cells (WGS84 input, no reprojection needed, because H3 operates directly on geographic coordinates), 97 % of records fall in cells containing ≥ 5 records. The remaining 3 % (372 records, spread over at least 93 cells because each suppressed cell holds at most 4 records) are suppressed. Temporal jitter is drawn per-record from $U(-180, 180)$ seconds. The resulting release covers 12,028 anonymized events, a 3.0 % suppression rate that is well within the ±5 % utility tolerance most capacity-planning models accept.
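The arithmetic in the example can be reproduced directly from the two counts it quotes. The variable names below are illustrative.

```python
total_events = 12_400      # tap-in events in the morning peak export
suppressed_events = 372    # records in cells that fall below k

released_events = total_events - suppressed_events
suppression_rate = suppressed_events / total_events

print(released_events)            # 12028
print(f"{suppression_rate:.1%}")  # 3.0%
```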

Three-Stage Anonymization Pipeline

The diagram below shows the fixed order in which spatial binning, temporal jitter, and k-anonymity suppression must run. Binning comes first, jitter second, and suppression last; applying suppression before jitter would allow differential attacks on timing distributions within marginal cells.

Deterministic three-stage pipeline: spatial binning must precede jitter, and k-anonymity suppression is always the final gate before release.

Python Implementation

The function below is a production-ready implementation of the three-stage pipeline. H3 operates directly on WGS84 decimal degrees — no prior reprojection is needed. The seed parameter must be omitted in production deployments to avoid reproducible jitter sequences that could enable differential attacks across releases.

import pandas as pd
import numpy as np
import h3

def anonymize_transit_records(
    df: pd.DataFrame,
    resolution: int = 8,
    k: int = 5,
    jitter_seconds: int = 180,
    seed: int | None = None,
) -> pd.DataFrame:
    """
    Anonymize transit tap records via H3 spatial binning, temporal jitter,
    and k-anonymity suppression.

    Args:
        df: DataFrame with columns:
              'lat'       float   WGS84 latitude
              'lon'       float   WGS84 longitude
              'timestamp' datetime UTC tap time
              'trip_id'   str     deduplication key (may be a hashed card ID)
        resolution: H3 resolution (0=coarsest … 15=finest).
                    Resolution 8 → cell area ≈ 0.74 km² (urban default).
                    Resolution 7 → cell area ≈ 5.16 km² (suburban/rural).
        k:          Minimum trip count per H3 cell before the cell is released.
        jitter_seconds: Half-width of the uniform temporal noise window (±s).
        seed:       Fix only in test/audit environments; omit in production.

    Returns:
        Anonymized DataFrame containing 'h3_cell' and 'timestamp_jittered';
        raw lat, lon, and original timestamp columns are removed.
    """
    df = df.copy()
    rng = np.random.default_rng(seed)

    # Stage 1 — H3 spatial binning (h3-py v4 API; check h3.__version__ before deploy)
    # Input CRS: WGS84 (EPSG:4326) — h3.latlng_to_cell expects decimal degrees
    df["h3_cell"] = [
        h3.latlng_to_cell(lat, lon, resolution)
        for lat, lon in zip(df["lat"], df["lon"])
    ]

    # Stage 2 — Temporal jitter: uniform noise over ±jitter_seconds
    # Applied per-record BEFORE suppression to avoid timing leakage in marginal cells
    df["timestamp"] = pd.to_datetime(df["timestamp"])
    offsets = rng.uniform(-jitter_seconds, jitter_seconds, size=len(df))
    df["timestamp_jittered"] = df["timestamp"] + pd.to_timedelta(offsets, unit="s")

    # Stage 3 — K-anonymity suppression: keep cells with >= k distinct trip records
    # "count" tallies non-null trip_id rows, so every row is treated as a distinct
    # subject. If trip_id can repeat within a cell (for example a hashed card ID),
    # switch to transform("nunique") so that one rider cannot satisfy k alone.
    cell_counts = df.groupby("h3_cell")["trip_id"].transform("count")
    df_anon = df.loc[cell_counts >= k].copy()

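    # Remove raw coordinates and the unjittered timestamp before returning.
    # trip_id is retained for the verification step; if it is a persistent
    # identifier (such as a hashed card ID), drop or re-key it before release.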
    df_anon = df_anon.drop(columns=["lat", "lon", "timestamp"])
    return df_anon

Verification Snippet

After running the pipeline, validate that no raw quasi-identifiers remain and that every released cell genuinely meets the k threshold:

def verify_anonymization(df_raw: pd.DataFrame, df_anon: pd.DataFrame, k: int = 5) -> dict:
    """
    Confirm that the anonymized output meets k-anonymity and strips raw identifiers.

    Returns a dict with pass/fail flags and numeric diagnostics.
    """
    results: dict = {}

    # 1. Raw coordinate columns must be absent
    results["no_lat_lon"] = "lat" not in df_anon.columns and "lon" not in df_anon.columns

    # 2. Original (unjittered) timestamp must be absent
    results["no_raw_timestamp"] = "timestamp" not in df_anon.columns

    # 3. Every H3 cell in the release must have >= k records
    cell_counts = df_anon.groupby("h3_cell")["trip_id"].count()
    results["k_satisfied"] = bool((cell_counts >= k).all())
    results["min_cell_count"] = int(cell_counts.min()) if len(cell_counts) else 0

    # 4. Suppression rate must stay below 10 % (tune to your utility tolerance)
    suppressed = len(df_raw) - len(df_anon)
    # Guard against an empty input so the rate never divides by zero
    results["suppression_rate"] = round(suppressed / len(df_raw), 4) if len(df_raw) else 0.0
    results["suppression_ok"] = results["suppression_rate"] <= 0.10

    return results

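# Example usage (raw_transit_df is your input DataFrame with the columns listed above)
df_anon = anonymize_transit_records(raw_transit_df, resolution=8, k=5)
report = verify_anonymization(raw_transit_df, df_anon, k=5)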
assert report["no_lat_lon"],        "Raw coordinates still present"
assert report["no_raw_timestamp"],  "Original timestamps still present"
assert report["k_satisfied"],       f"K-anonymity violated: min cell={report['min_cell_count']}"
assert report["suppression_ok"],    f"Suppression rate {report['suppression_rate']:.1%} exceeds 10 %"

Edge Cases and Adjustments

Sparse suburban and paratransit routes. Low-ridership corridors regularly produce cells with fewer than $k$ records even at resolution 7. The correct response is to coarsen to resolution 6 (cell area ≈ 36 km²) and re-evaluate, not to lower $k$. Lowering $k$ below 5 on sparse routes is a common compliance failure in transit data releases.
Non-uniform density zones. Urban cores and commuter-rail termini may need resolution 9 (≈ 0.1 km²) to preserve route-level utility, while outer suburbs require resolution 7. Use a resolution lookup table keyed to each H3 resolution-5 "macro-cell" to apply adaptive binning within a single pipeline run.
Temporal windowing at service boundaries. The first and last trips of the day are low-count by definition. Apply a larger $\Delta t$ (360–600 s) to early-morning and late-night windows, or suppress them entirely if the k threshold cannot be met even at coarser resolutions. Off-peak suppression rates of 8–15 % are normal and should be documented in the DPIA.
H3 library version. The function above uses the h3-py v4 API (h3.latlng_to_cell). Version 3 exposed the same operation as h3.geo_to_h3(lat, lon, resolution), and the v4 name does not exist there, so running this code under v3 raises an AttributeError instead of returning cell IDs. Verify with import h3; print(h3.__version__) before deployment.
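One way to make the version dependency explicit is a small guard that resolves whichever function name the installed h3-py exposes. This is a sketch, not part of h3-py: `resolve_cell_function` is a hypothetical helper, and the stand-in module below exists only so the example runs without h3 installed.

```python
from types import SimpleNamespace


def resolve_cell_function(h3_module):
    """Return the (lat, lng, resolution) -> cell function of the given h3 module."""
    for name in ("latlng_to_cell", "geo_to_h3"):  # v4 name first, then the v3 name
        fn = getattr(h3_module, name, None)
        if callable(fn):
            return fn
    raise RuntimeError("h3 module exposes neither latlng_to_cell nor geo_to_h3")


# Stand-in for a v3-style module; the returned string is a placeholder, not a real cell ID.
fake_v3 = SimpleNamespace(geo_to_h3=lambda lat, lng, res: f"cell({lat},{lng},{res})")
to_cell = resolve_cell_function(fake_v3)
print(to_cell(52.52, 13.40, 8))
```

In a real pipeline the call would be `to_cell = resolve_cell_function(h3)` once at import time, so an unsupported installation fails loudly before any record is processed.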

Frequently Asked Questions

What $k$ value satisfies GDPR for transit tap data?

No regulation mandates a specific $k$, but EU supervisory authority guidance is commonly read as treating $k \geq 5$ as the practical minimum for pseudonymous datasets in the transport sector. For data shared publicly via open-data portals, $k \geq 10$ is the safer default because the auxiliary data available to attackers (census demographics, public GTFS schedules, weather logs) is far richer than for controlled research releases. Document the chosen value and the rationale in the data protection impact assessment.

Which H3 resolution is appropriate for urban transit stop data?

Resolution 8 (cell area ≈ 0.74 km²) is the standard urban starting point; it groups several adjacent stops into one cell without collapsing entire districts. Resolution 9 (≈ 0.1 km²) is too granular for most transit contexts because individual stops in low-traffic hours will fall below $k$. Check the cell-count distribution at your chosen resolution before committing to a value; if more than 5 % of cells are below $k$ at resolution 8, fall back to resolution 7.
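The 5 % fallback rule can be sketched as follows. The helper names, the toy cell labels, and the structure of `cells_by_resolution` are assumptions for illustration; in practice the cell IDs would come from the binning stage at each candidate resolution.

```python
from collections import Counter
from typing import Dict, Iterable, List, Sequence


def share_of_cells_below_k(cell_ids: Iterable[str], k: int = 5) -> float:
    """Fraction of distinct cells whose record count is below k."""
    counts = Counter(cell_ids)
    if not counts:
        return 0.0
    below = sum(1 for n in counts.values() if n < k)
    return below / len(counts)


def pick_resolution(
    cells_by_resolution: Dict[int, List[str]],
    k: int = 5,
    max_share: float = 0.05,
    order: Sequence[int] = (8, 7),
) -> int:
    """Return the first resolution in `order` that meets the threshold, else the coarsest."""
    for res in order:
        if share_of_cells_below_k(cells_by_resolution[res], k) <= max_share:
            return res
    return order[-1]


# Toy data: at resolution 8 one of two cells is below k; at resolution 7 they merge.
toy = {8: ["a"] * 6 + ["b"] * 2, 7: ["A"] * 8}
print(pick_resolution(toy))  # 7
```

If even the coarsest listed resolution misses the threshold, the function still returns it, so the caller should re-check the share and fall back to suppression rather than assume the result is safe.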

Does temporal jitter alone prevent linkage attacks?

No. Temporal jitter blocks exact-timestamp joins against auxiliary datasets, but an attacker who observes approximate departure times can still narrow the candidate set to a handful of riders on a given route. Jitter must be layered with spatial cloaking and k-anonymity suppression. Any single control in isolation is insufficient, as documented in re-identification risk assessment for geospatial datasets.

Should the RNG seed be fixed in production?

Fix the seed only in test and audit environments where you need reproducible outputs for peer review. In production, omit the seed entirely so each release draws fresh randomness. A fixed seed across sequential releases allows differential attacks: an adversary who learns or guesses the seed can regenerate the jitter offsets and subtract them from the released values to recover the original timestamps.
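A minimal sketch of that policy, using the standard library rather than NumPy: the function name `make_jitter_rng` and the environment labels are hypothetical, and a real deployment would read the environment from its own configuration.

```python
import random
from typing import Optional


def make_jitter_rng(environment: str, test_seed: Optional[int] = None) -> random.Random:
    """Return a seeded RNG for test/audit runs and an unseedable one for production."""
    if environment == "production":
        if test_seed is not None:
            raise ValueError("Do not pass a seed in production")
        return random.SystemRandom()  # draws from OS entropy; cannot be seeded
    return random.Random(test_seed)


# Two audit runs with the same seed reproduce the same offset.
a = make_jitter_rng("test", 42).uniform(-180, 180)
b = make_jitter_rng("test", 42).uniform(-180, 180)
assert a == b
```

Rejecting a seed outright in production turns a configuration mistake into an immediate error instead of a silently reproducible release.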

Related

Spatial Linkage Attack Vectors & Mitigation: parent page covering the full taxonomy of linkage attack vectors in geospatial datasets
K-anonymity grouping for location traces — in-depth coverage of k-anonymity thresholds, grouping algorithms, and suppression strategies
Grid aggregation and spatial binning strategies — covers H3, S2, and rectangular grid choices with utility trade-off analysis
Re-identification risk assessment for geospatial datasets — how to quantify residual risk after anonymization and document it for compliance audits
Coordinate jittering and noise injection methods — detailed parameter guidance for temporal and spatial noise injection
