What k value is sufficient to satisfy GDPR for location data?

GDPR does not mandate a numeric k threshold, and the Article 29 Working Party's Opinion 05/2014 on anonymisation techniques discusses k-anonymity without fixing one. In practice k ≥ 5 is commonly treated as a minimal baseline; values of k ≥ 10–15 are recommended for high-sensitivity mobility data such as home/work commute traces.

Can k-anonymity alone prevent re-identification of location traces?

No. K-anonymity suppresses exact coordinates but is vulnerable to background-knowledge attacks and temporal intersection attacks across multiple releases. Pair it with differential privacy noise, query-rate limits, or temporal suppression for defence-in-depth.

Should I use MBR or convex hull for cloak geometry?

MBR is faster and simpler to implement but overestimates area along diagonal corridors; convex hull is tighter but exposes trajectory shape. For most urban datasets, MBR is the safer default because it reveals less directional information.

What projection should I use when computing cloak areas and distances?

Reproject from WGS84 (EPSG:4326) to a metric CRS before computing areas or applying distance thresholds: an equal-area CRS such as World Mollweide (ESRI:54009) for areas, or the regional UTM zone (conformal, with low distortion inside the zone) for distances. Distance calculations in degree-space introduce up to 40 % error at mid-latitudes.

K-Anonymity Grouping for Location Traces

K-anonymity grouping for location traces transforms raw GPS trajectory records into spatiotemporal envelopes where every published point is indistinguishable from at least k−1 other users, preventing individual re-identification from released mobility data.

When to Use This Technique

Not every privacy requirement calls for spatial k-anonymity; the right protection strategy depends on the mobility-data scenario.

Situations where k-anonymity grouping is the right choice:

You must release trajectory-level rows (not aggregates) but cannot accept exact coordinates.
The downstream consumer needs a spatial boundary rather than a perturbed point — e.g. a routing or density model that works with polygon inputs.
A formal ε bound is not contractually required, and group-size indistinguishability is sufficient for your regulatory context.
The dataset is dense enough that grouping k users within an acceptable spatial radius remains analytically useful.

When individual coordinate precision is essential, coordinate jittering and noise injection retains point geometry while adding calibrated displacement. When the analytical output is a heatmap or density surface rather than individual rows, grid aggregation and spatial binning is more efficient. When a mathematically provable privacy guarantee is contractually required, Laplace and Gaussian noise for coordinate data provides ε-differential privacy.

Algorithmic Specification

Formal definition

For a dataset $$D$$ of location records, a k-anonymous release $$D^*$$ satisfies:

$$\forall r \in D^*,\ \bigl|\{r' \in D^* \mid \text{QI}(r') = \text{QI}(r)\}\bigr| \geq k$$

where QI(r) is the quasi-identifier tuple. For location traces the quasi-identifier is the spatiotemporal window: $$\text{QI}(r) = (W_t, G)$$, where $$W_t$$ is the temporal window and $$G$$ is the published spatial envelope (MBR or convex hull). Two records share the same quasi-identifier when both fall within the same window $$W_t$$ and the same envelope $$G$$.
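
The definition can be checked mechanically by counting released records per quasi-identifier. The following is a minimal sketch, assuming each released record is reduced to a hashable `(window_start, cloak_id)` pair standing in for $$(W_t, G)$$; the function name `violating_groups` and the sample values are illustrative and not part of the pipeline described later.

```python
from collections import Counter

def violating_groups(records, k):
    """Return quasi-identifiers shared by fewer than k released records."""
    counts = Counter(records)  # each record is a hashable QI tuple
    return {qi: n for qi, n in counts.items() if n < k}

# Hypothetical release: QI = (window_start, cloak_id)
release = [("09:00", "A")] * 5 + [("09:00", "B")] * 3
print(violating_groups(release, k=5))
```

Note that this counts records, as the formula does; the cloaking procedure below applies the stricter test of k distinct user_id values.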

The cloaking radius $$r$$ expands from an initial seed coordinate until the enclosed set contains at least k distinct user_id values:

$$r^* = \min\bigl\{r \geq 0 : \bigl|\text{distinct\_users}(B(p, r) \cap W_t)\bigr| \geq k\bigr\}$$

where $$B(p, r)$$ is a disk of radius r centred on seed point p, with distances measured in a projected, metre-based CRS.
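
As a worked illustration of $$r^*$$, the sketch below computes the minimal radius for one seed by taking each user's closest record and reading off the k-th smallest distance. It assumes points are already in a metric CRS and belong to a single window; `min_cloak_radius` and the sample coordinates are hypothetical, and no radius ceiling is applied here.

```python
import math

def min_cloak_radius(seed, points, k):
    """Smallest r such that the disk B(seed, r) holds >= k distinct users.

    points: iterable of (user_id, x, y) in a metric CRS, all in one window.
    Returns None when the window holds fewer than k distinct users.
    """
    nearest = {}  # user_id -> distance of that user's closest record to the seed
    for user_id, x, y in points:
        d = math.hypot(x - seed[0], y - seed[1])
        if d < nearest.get(user_id, math.inf):
            nearest[user_id] = d
    if len(nearest) < k:
        return None
    # The k-th smallest per-user distance is the minimal enclosing radius.
    return sorted(nearest.values())[k - 1]

window = [("u1", 0, 0), ("u1", 10, 0), ("u2", 30, 40), ("u3", 0, 100), ("u4", 300, 400)]
r_star = min_cloak_radius((0, 0), window, k=3)  # per-user distances: 0, 50, 100, 500
```

With k = 3 the radius is 100 m: the second record of u1 does not help, because only distinct users count.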

Parameter reference

Parameter	Typical range	Privacy impact	Utility impact
`k`	5 – 25	Higher k → stronger anonymity	Higher k → larger cloaks, less spatial precision
`window_minutes` ($$\Delta t$$)	1 – 15 min	Shorter windows group fewer users (weaker)	Shorter windows → more temporal fidelity
`max_radius_km`	0.1 – 5 km	Larger ceiling → k satisfied in more neighbourhoods (never guaranteed in very sparse zones)	Larger radius → coarser output; may be analytically useless
`max_cloak_area_km²`	0.5 – 10 km²	Lower ceiling → privacy guarantee may fail in sparse zones	Lower ceiling → tighter, more precise envelopes
Geometry type	MBR / convex hull / alpha-shape	MBR reveals least shape information	Alpha-shape fits tightest but exposes trajectory shape

For detailed guidance on calibrating k against re-identification risk and dataset density, see calculating k-anonymity thresholds for mobile tracking.

Prerequisites & Data Requirements

Component	Requirement
Python	3.9+ with virtual environment
Core libraries	`pandas`, `geopandas`, `shapely`, `numpy`, `scipy`
Spatial indexing	`scipy.spatial.cKDTree` (bundled with scipy)
Equal-area projection	`pyproj` for CRS reprojection during distance/area checks
Input CRS	WGS84 (EPSG:4326) — reproject to UTM or EPSG:6933 for metric operations
Input schema	`user_id`, `timestamp` (UTC datetime), `geometry` (shapely Point)
Temporal consistency	Uniform or near-uniform sampling interval; gaps filled or flagged before windowing
Minimum dataset size	At least 10k records per geographic area to achieve k ≥ 5 with reasonable cloak areas
Governance artefact	Documented k threshold and rationale, signed off by the privacy officer

Install the required stack:

pip install pandas geopandas shapely numpy scipy pyproj

Normalise all timestamps to UTC before processing. Traces spanning daylight-saving transitions will produce asymmetric window counts if normalisation is skipped. Input geometries must be cleaned of null coordinates and deduplicated on (user_id, timestamp) to prevent a single user from artificially inflating group counts.
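
The cleaning rules above can be sketched without any geospatial dependency. This is an illustrative example only: `normalise_and_dedupe` is a hypothetical helper, rows are plain `(user_id, timestamp, lon, lat)` tuples with timezone-aware timestamps, and a fixed UTC+2 offset stands in for a local time zone.

```python
from datetime import datetime, timedelta, timezone

def normalise_and_dedupe(rows):
    """Convert timestamps to UTC, drop null coordinates, keep one row per (user_id, timestamp)."""
    seen = set()
    cleaned = []
    for user_id, ts, lon, lat in rows:
        if lon is None or lat is None:
            continue  # null coordinates cannot be cloaked
        ts_utc = ts.astimezone(timezone.utc)
        key = (user_id, ts_utc)
        if key in seen:
            continue  # duplicate fix for the same user and instant
        seen.add(key)
        cleaned.append((user_id, ts_utc, lon, lat))
    return cleaned

local = timezone(timedelta(hours=2))  # fixed offset standing in for a local zone
rows = [
    ("u1", datetime(2024, 6, 1, 10, 0, tzinfo=local), 13.40, 52.52),
    ("u1", datetime(2024, 6, 1, 8, 0, tzinfo=timezone.utc), 13.40, 52.52),  # same instant
    ("u2", datetime(2024, 6, 1, 8, 0, tzinfo=timezone.utc), None, 52.50),   # null longitude
]
cleaned = normalise_and_dedupe(rows)
```

Only one row survives: the two u1 rows describe the same instant once both are expressed in UTC, and the u2 row has no usable coordinate.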

Step-by-Step Implementation

Step 1 — Temporal windowing and trace partitioning

Segment the full trace dataset into fixed-duration slices. Each slice becomes an independent anonymisation unit; records outside a window boundary are excluded from that grouping cycle. Use pd.Grouper with a fixed frequency aligned to the epoch so windows are reproducible across pipeline runs.

import pandas as pd
import geopandas as gpd

def load_and_window(path: str, window_minutes: int = 5) -> dict[pd.Timestamp, pd.DataFrame]:
    """
    Load a GeoParquet or CSV trace file and partition into temporal windows.

    Args:
        path: Path to GeoParquet (preferred) or CSV with lat/lon columns.
        window_minutes: Window width in minutes.

    Returns:
        Dict mapping window-start Timestamps to DataFrames in WGS84.
    """
    gdf = gpd.read_parquet(path) if path.endswith(".parquet") else _csv_to_gdf(path)
    gdf["timestamp"] = pd.to_datetime(gdf["timestamp"], utc=True)
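    gdf = gdf.dropna(subset=["geometry", "user_id"])
    # Deduplicate so one user cannot inflate group counts with repeated fixes.
    gdf = gdf.drop_duplicates(subset=["user_id", "timestamp"]).sort_values("timestamp")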
    gdf = gdf.to_crs("EPSG:4326")

    windows = {}
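    # origin="epoch" pins window edges to the Unix epoch, independent of the first record.
    for key, group in gdf.groupby(
        pd.Grouper(key="timestamp", freq=f"{window_minutes}min", origin="epoch")
    ):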
        if not group.empty:
            windows[key] = group.reset_index(drop=True)
    return windows


def _csv_to_gdf(path: str) -> gpd.GeoDataFrame:
    from shapely.geometry import Point
    df = pd.read_csv(path)
    geometry = [Point(lon, lat) for lon, lat in zip(df["lon"], df["lat"])]
    return gpd.GeoDataFrame(df, geometry=geometry, crs="EPSG:4326")

Privacy implication: Shorter windows expose finer temporal patterns. A 1-minute window reveals commute timing precisely; a 15-minute window sacrifices temporal utility but groups more users per slice, making k-satisfaction easier in sparse areas.
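
To see the trade-off concretely, the sketch below counts distinct users per epoch-aligned window for several window widths. It is a simplified stand-in for the pandas grouping above: `users_per_window` is a hypothetical helper and the pings are invented `(user_id, seconds since epoch)` pairs.

```python
from collections import defaultdict

def users_per_window(records, window_minutes):
    """Map each epoch-aligned window start (in seconds) to its distinct-user count."""
    width = window_minutes * 60
    users = defaultdict(set)
    for user_id, epoch_seconds in records:
        window_start = (epoch_seconds // width) * width  # floor onto the epoch grid
        users[window_start].add(user_id)
    return {start: len(ids) for start, ids in sorted(users.items())}

# Hypothetical pings: (user_id, seconds since epoch)
pings = [("u1", 0), ("u2", 70), ("u3", 130), ("u1", 200), ("u4", 610), ("u5", 880)]

for minutes in (1, 5, 15):
    print(minutes, users_per_window(pings, minutes))
```

On this toy input every 1-minute window holds a single user, 5-minute windows hold two or three, and one 15-minute window holds all five, which is the pattern described above.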

Step 2 — Spatial candidate generation with cKDTree

Within each window, build a spatial index over the active coordinates, then expand outward from each seed point until k distinct user IDs are enclosed. The search operates in a projected (metric) CRS to ensure distance thresholds are in metres, not degrees.

import numpy as np
from scipy.spatial import cKDTree
from pyproj import Transformer

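# Project WGS84 → World Mercator (EPSG:3395) so coordinates are in metres.
# Caveat: Mercator stretches distances by roughly 1/cos(latitude), so thresholds
# are only approximate away from the equator. For regional datasets prefer the
# local UTM zone.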
_WGS84_TO_METRES = Transformer.from_crs("EPSG:4326", "EPSG:3395", always_xy=True)

def find_k_candidates(
    window_gdf: gpd.GeoDataFrame,
    k: int,
    max_radius_m: float = 2000.0,
) -> list[list[int]]:
    """
    Return index groups where each group contains the row indices that form
    a valid k-anonymous cloak (≥ k distinct user_ids, within max_radius_m).

    Args:
        window_gdf: GeoDataFrame for a single temporal window, EPSG:4326.
        k: Minimum distinct users per cloak.
        max_radius_m: Hard ceiling on cloaking radius (metres in EPSG:3395).

    Returns:
        List of index lists; each inner list is one candidate group.
    """
    xs = window_gdf.geometry.x.values
    ys = window_gdf.geometry.y.values
    mx, my = _WGS84_TO_METRES.transform(xs, ys)
    coords_m = np.column_stack([mx, my])
    tree = cKDTree(coords_m)

    processed: set[int] = set()
    groups: list[list[int]] = []

    for seed in range(len(window_gdf)):
        if seed in processed:
            continue

        # Query all neighbours sorted by distance up to the hard radius ceiling
        indices = tree.query_ball_point(coords_m[seed], r=max_radius_m)
        indices.sort(key=lambda i: np.linalg.norm(coords_m[i] - coords_m[seed]))

        seen_users: set = set()
        candidate_idx: list[int] = []

        for idx in indices:
            if idx in processed:
                continue
            uid = window_gdf.iloc[idx]["user_id"]
            if uid not in seen_users:
                seen_users.add(uid)
                candidate_idx.append(idx)
                if len(seen_users) >= k:
                    break

        if len(seen_users) >= k:
            groups.append(candidate_idx)
            processed.update(candidate_idx)

    return groups

Privacy implication: If max_radius_m is reached before k users are found, the window produces no cloak for that neighbourhood. These unresolved seeds must be logged and escalated — never silently dropped — because gaps in coverage can themselves reveal sensitive locations (e.g. a sparsely visited medical facility).
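
One way to make unresolved seeds visible is to summarise the rows that ended up in no group. The sketch below is an assumption-laden illustration, not part of the pipeline: `unresolved_seed_report` is a hypothetical helper that takes the row count, a list of index groups shaped like the output of find_k_candidates, and the metric coordinates, and reports a count plus a centroid without user IDs.

```python
def unresolved_seed_report(n_rows, groups, coords_m):
    """Count rows left outside every group and give their centroid (no user IDs).

    n_rows: number of rows in the window.
    groups: list of index lists, shaped like the output of find_k_candidates.
    coords_m: sequence of (x, y) pairs in the metric CRS, one per row.
    """
    covered = {idx for group in groups for idx in group}
    unresolved = [i for i in range(n_rows) if i not in covered]
    if not unresolved:
        return {"unresolved_count": 0, "centroid_m": None}
    cx = sum(coords_m[i][0] for i in unresolved) / len(unresolved)
    cy = sum(coords_m[i][1] for i in unresolved) / len(unresolved)
    return {"unresolved_count": len(unresolved), "centroid_m": (cx, cy)}

report = unresolved_seed_report(
    5, [[0, 1, 2]], [(0, 0), (1, 1), (2, 2), (10, 20), (30, 40)]
)
```

Because the grouping keeps one row per distinct user, this count also includes repeat rows of users who are already represented in a cloak; treat it as an upper bound on truly unresolved seeds.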

Step 3 — Cloak formation

Convert each candidate group into a spatial envelope. The minimum bounding rectangle (MBR) is the safest default: it does not expose trajectory shape and is computationally trivial.

from shapely.geometry import box, MultiPoint

def form_cloak(
    window_gdf: gpd.GeoDataFrame,
    candidate_indices: list[int],
    geometry_type: str = "mbr",
) -> dict:
    """
    Build the published cloak geometry for a validated candidate group.

    Args:
        window_gdf: Source GeoDataFrame (WGS84).
        candidate_indices: Row indices forming the group (output of find_k_candidates).
        geometry_type: 'mbr' (minimum bounding rectangle) or 'hull' (convex hull).

    Returns:
        Dict with 'geometry' (shapely), 'user_count', and 'user_ids' (audit only).
    """
    subset = window_gdf.iloc[candidate_indices]
    user_ids = list(subset["user_id"].unique())

    if geometry_type == "hull":
        geom = MultiPoint(list(subset.geometry)).convex_hull
    else:
        minx, miny, maxx, maxy = subset.geometry.total_bounds
        geom = box(minx, miny, maxx, maxy)

    return {
        "geometry": geom,
        "user_count": len(user_ids),
        "user_ids": user_ids,  # Remove or hash before external release
    }
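
The area difference between the two geometry types is easiest to see on a diagonal corridor. The following dependency-free sketch compares MBR area with convex-hull area for six invented points in metres; the helpers `mbr_area`, `convex_hull` (Andrew's monotone chain) and `polygon_area` (shoelace formula) are illustrative and independent of the shapely-based form_cloak above.

```python
def mbr_area(points):
    """Area of the axis-aligned minimum bounding rectangle."""
    xs, ys = zip(*points)
    return (max(xs) - min(xs)) * (max(ys) - min(ys))

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def polygon_area(vertices):
    """Shoelace formula for a simple polygon."""
    n = len(vertices)
    twice_area = sum(
        vertices[i][0] * vertices[(i + 1) % n][1]
        - vertices[(i + 1) % n][0] * vertices[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2

# Hypothetical corridor running south-west to north-east (metres)
corridor = [(0, 0), (100, 0), (1000, 900), (1000, 1000), (900, 1000), (0, 100)]
print(mbr_area(corridor), polygon_area(convex_hull(corridor)))
```

Here the MBR covers 1,000,000 m² while the hull covers 190,000 m²: the rectangle is about five times coarser, but the hull's elongated outline reveals the direction of travel.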

Step 4 — Full pipeline assembly

def generate_k_anonymous_cloaks(
    path: str,
    k: int = 5,
    window_minutes: int = 5,
    max_radius_m: float = 2000.0,
    geometry_type: str = "mbr",
) -> gpd.GeoDataFrame:
    """
    End-to-end pipeline: load traces → window → group → cloak → return GeoDataFrame.

    Args:
        path: Path to GeoParquet or CSV trace file.
        k: Minimum distinct users per published cloak.
        window_minutes: Temporal window width.
        max_radius_m: Maximum cloaking radius in metres (EPSG:3395).
        geometry_type: 'mbr' or 'hull'.

    Returns:
        GeoDataFrame with cloak polygons (EPSG:4326) and metadata.
        Strip 'user_ids' before any external release.
    """
    windows = load_and_window(path, window_minutes)
    records = []

    for window_key, window_gdf in windows.items():
        groups = find_k_candidates(window_gdf, k, max_radius_m)
        for group_indices in groups:
            cloak = form_cloak(window_gdf, group_indices, geometry_type)
            cloak["window_start"] = window_key
            records.append(cloak)

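    if not records:
        # Keep the geometry column and CRS so downstream code sees a consistent schema.
        return gpd.GeoDataFrame(
            columns=["window_start", "user_count", "geometry", "user_ids"],
            geometry="geometry",
            crs="EPSG:4326",
        )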

    return gpd.GeoDataFrame(records, geometry="geometry", crs="EPSG:4326")

Validation & Re-identification Testing

Publishing without validation is a compliance failure. Three verification layers are required before release.

1 — Neighbour-count audit

Confirm that every row in the output carries user_count >= k:

def audit_k_guarantee(cloaks: gpd.GeoDataFrame, k: int) -> dict:
    """
    Verify the k-anonymity guarantee across all published cloaks.

    Returns a summary dict with pass/fail status and violating row count.
    """
    violations = cloaks[cloaks["user_count"] < k]
    return {
        "total_cloaks": len(cloaks),
        "violations": len(violations),
        "passed": len(violations) == 0,
        "min_user_count": int(cloaks["user_count"].min()) if len(cloaks) else None,
    }

2 — Intersection attack simulation

A user appearing in consecutive windows may be isolatable by intersecting their successive cloak polygons. Compute the intersection-area ratio across adjacent windows for any user_id present in more than one cloak:

def intersection_attack_score(cloaks: gpd.GeoDataFrame) -> float:
    """
    Estimate worst-case intersection shrinkage across consecutive windows.

    Returns the minimum ratio of intersection-area / smaller-cloak-area across
    all consecutive-window pairs that share at least one user_id.
    A ratio close to 1.0 means cloaks are stable; close to 0.0 means intersection
    is trivial and the user's path is nearly revealed.
    """
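    cloaks = cloaks.sort_values("window_start").reset_index(drop=True)
    window_keys = sorted(cloaks["window_start"].unique())
    ratios = []

    # Compare every cloak in one window with every cloak in the next window;
    # checking adjacent rows alone would miss most cross-window pairs.
    for prev_key, next_key in zip(window_keys, window_keys[1:]):
        prev_cloaks = cloaks[cloaks["window_start"] == prev_key]
        next_cloaks = cloaks[cloaks["window_start"] == next_key]
        for _, a in prev_cloaks.iterrows():
            for _, b in next_cloaks.iterrows():
                if not set(a["user_ids"]) & set(b["user_ids"]):
                    continue
                smaller_area = min(a["geometry"].area, b["geometry"].area)
                if smaller_area > 0:
                    intersection = a["geometry"].intersection(b["geometry"])
                    ratios.append(intersection.area / smaller_area)

    return min(ratios) if ratios else 1.0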

Ratios below 0.3 indicate that consecutive cloaks narrow the user’s position to less than 30 % of the original envelope — a strong re-identification signal. Increase window_minutes or max_radius_m until the minimum ratio exceeds 0.5.
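
A small worked example makes the thresholds tangible. The sketch below computes the same ratio for axis-aligned boxes given as `(minx, miny, maxx, maxy)` in metres; `bbox_intersection_ratio` and the three boxes are hypothetical and exist only to illustrate the arithmetic.

```python
def bbox_intersection_ratio(a, b):
    """intersection-area / smaller-box-area for two (minx, miny, maxx, maxy) boxes.

    Returns None when the smaller box has zero area.
    """
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    intersection = max(width, 0) * max(height, 0)
    smaller = min((a[2] - a[0]) * (a[3] - a[1]), (b[2] - b[0]) * (b[3] - b[1]))
    return intersection / smaller if smaller > 0 else None

cloak_t0 = (0, 0, 1000, 1000)
cloak_t1_far = (800, 0, 1800, 1000)   # shifted 800 m east: overlap strip is 200 m wide
cloak_t1_near = (400, 0, 1400, 1000)  # shifted 400 m east: overlap strip is 600 m wide

print(bbox_intersection_ratio(cloak_t0, cloak_t1_far))
print(bbox_intersection_ratio(cloak_t0, cloak_t1_near))
```

The 800 m shift gives a ratio of 0.2, below the 0.3 warning level, while the 400 m shift gives 0.6, above the 0.5 target.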

3 — Auxiliary-join simulation

Obtain a public points-of-interest dataset (e.g. OpenStreetMap amenities) and count how many cloaks contain only a single sensitive POI (clinic, shelter, addiction service). Any such cloak leaks the user’s association with that facility even without a name:

def sensitive_poi_exposure(
    cloaks: gpd.GeoDataFrame,
    sensitive_pois: gpd.GeoDataFrame,
) -> gpd.GeoDataFrame:
    """
    Flag cloaks that contain exactly one sensitive POI — the privacy-equivalent
    of a direct disclosure.

    Args:
        cloaks: Output of generate_k_anonymous_cloaks (EPSG:4326).
        sensitive_pois: GeoDataFrame of sensitive-category POIs (EPSG:4326).

    Returns:
        Subset of cloaks with exactly one enclosed sensitive POI.
    """
    # Inner join: POIs outside every cloak are dropped, so index_right has no NaNs.
    joined = gpd.sjoin(sensitive_pois, cloaks, how="inner", predicate="within")
    poi_per_cloak = joined.groupby("index_right").size()
    exposed_indices = poi_per_cloak[poi_per_cloak == 1].index
    return cloaks.loc[exposed_indices]

This auxiliary-join pattern mirrors the spatial linkage attack vectors described in the threat-modeling reference — the same geographic containment logic that powers routing and proximity queries can be weaponised to re-identify.

Common Failure Modes & Gotchas

Projection errors: distances in degrees instead of metres. A cKDTree built on raw WGS84 coordinates treats degrees as Euclidean units. One degree of longitude at 45° latitude is ~79 km, not the ~111 km it spans at the equator. Compute all distances in a projected CRS (the local UTM zone for regional datasets; EPSG:3395 only as a global fallback, since Mercator inflates distances by roughly 1/cos(latitude)), and convert thresholds to degrees only if the tree must remain in WGS84.
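
The size of the distortion follows directly from the cosine of the latitude. The sketch below uses a spherical Earth approximation (an assumption; the constant and both function names are illustrative) to show the length of one degree of longitude and the local stretch factor of a Mercator projection.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; spherical approximation

def km_per_degree_lon(lat_deg):
    """Ground length of one degree of longitude at the given latitude."""
    return math.radians(1.0) * EARTH_RADIUS_KM * math.cos(math.radians(lat_deg))

def mercator_scale_factor(lat_deg):
    """Factor by which a spherical Mercator projection stretches local distances."""
    return 1.0 / math.cos(math.radians(lat_deg))

for lat in (0, 45, 60):
    print(lat, round(km_per_degree_lon(lat), 1), round(mercator_scale_factor(lat), 2))
```

At 45° a degree of longitude shrinks to roughly 79 km, and a 2,000 m threshold applied in Mercator units corresponds to only about 1,400 m on the ground.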

Boundary-crossing artifacts. A user crossing an administrative boundary mid-trace (e.g. a city border or time-zone edge) may appear in two separate geographic cloaks during the same temporal window. Validate that cloak geometries do not straddle jurisdictional polygons where different data-sharing agreements apply.

Sparse-data utility collapse. In rural or low-density areas, achieving k = 10 may require a cloaking radius of 20 km — rendering the output useless for micro-mobility analysis. Implement a tiered k strategy: use k = 5 in sparse regions and k = 15 in dense urban cores, documenting the spatial boundary of each tier in your data-release notes.
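
A rough way to decide which tier applies is to estimate the radius needed for k users from local density. The sketch below assumes users are spread uniformly, so that $$\pi r^2 \rho = k$$; both helper names and the density figures are hypothetical, and real traces cluster far more than this model suggests.

```python
import math

def expected_radius_km(k, users_per_km2):
    """Radius of the disk expected to hold k users at uniform density."""
    return math.sqrt(k / (math.pi * users_per_km2))

def tiered_k(users_per_km2, max_radius_km, dense_k=15, sparse_k=5):
    """Use dense_k where it fits under the radius ceiling, otherwise sparse_k."""
    if expected_radius_km(dense_k, users_per_km2) <= max_radius_km:
        return dense_k
    return sparse_k

print(tiered_k(users_per_km2=100.0, max_radius_km=2.0))  # dense urban core
print(tiered_k(users_per_km2=0.5, max_radius_km=2.0))    # sparse rural area
```

Under this model, a 20 km radius for k = 10 corresponds to a density of about 0.008 active users per km² in the window.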

Single-user temporal windows. A window containing only one user produces no valid cloak and should log a warning. If this happens for more than 5 % of windows, the temporal resolution is too fine for the dataset density — increase window_minutes or apply grid aggregation as a fallback.
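
The 5 % rule can be expressed as a small check over per-window user counts. This is a sketch under the assumption that distinct-user counts per window are already available as a list; the helper names are hypothetical.

```python
def single_user_window_rate(user_counts):
    """Fraction of windows that contain exactly one distinct user."""
    if not user_counts:
        return 0.0
    return sum(1 for n in user_counts if n == 1) / len(user_counts)

def window_too_fine(user_counts, threshold=0.05):
    """True when single-user windows exceed the threshold share of all windows."""
    return single_user_window_rate(user_counts) > threshold

counts = [1, 4, 6, 1, 8, 9, 7, 5, 3, 2]  # hypothetical distinct users per window
print(single_user_window_rate(counts), window_too_fine(counts))
```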

Repeated releases: temporal intersection attack. Publishing k-anonymous snapshots every 5 minutes for the same area allows an adversary to intersect consecutive outputs and progressively narrow a user’s position. Apply query-rate limits between consecutive releases of the same geographic zone, or add a differential privacy noise layer as a complementary defence.

Non-deterministic clustering breaking audit trails. Any algorithm that selects seed points randomly will produce different cloaks on each run, so an auditor cannot reproduce a release. The cKDTree implementation above iterates seeds in row order, so it is deterministic for a fixed row order: always sort by timestamp (with a stable tie-breaker such as user_id) and reset the index before processing.

user_ids leaked in output. The pipeline retains user_ids in the user_ids column for audit purposes only. Before any external release, replace them with a keyed hash or drop the column entirely:

import hashlib

def hash_user_ids(cloaks: gpd.GeoDataFrame, salt: bytes) -> gpd.GeoDataFrame:
    """Replace raw user_id lists with salted SHA-256 hashes for auditability."""
    cloaks = cloaks.copy()
    cloaks["user_ids"] = cloaks["user_ids"].apply(
        lambda ids: [
            hashlib.sha256(salt + str(uid).encode()).hexdigest()[:16]
            for uid in ids
        ]
    )
    return cloaks
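
If a keyed hash in the strict sense is wanted, HMAC-SHA256 from the standard library is one option; note this swaps in HMAC rather than the salt-prefix construction shown above. A minimal sketch, where `keyed_user_token` and the demo keys are hypothetical and a real key would come from a secrets manager:

```python
import hashlib
import hmac

def keyed_user_token(user_id, key: bytes, length: int = 16) -> str:
    """HMAC-SHA256 of the user_id, truncated to `length` hex characters."""
    digest = hmac.new(key, str(user_id).encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]

demo_key = b"demo-key-do-not-use"  # placeholder only
token_a = keyed_user_token("user-42", demo_key)
token_b = keyed_user_token("user-42", demo_key)
token_other_key = keyed_user_token("user-42", b"another-demo-key")
```

The same key always yields the same token, so audit joins still work, while a party without the key cannot recompute tokens from guessed user IDs.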

Compliance Alignment

K-anonymity grouping for location traces directly addresses the following regulatory and technical controls:

Control	Clause / Standard	How this technique satisfies it
Data minimisation	GDPR Art. 5(1)(c)	Publishes spatial envelopes rather than raw coordinates; retains only the data necessary for the analytical purpose
Privacy by design	GDPR Art. 25	Anonymisation is built into the release pipeline rather than applied as an afterthought
De-identification	NIST SP 800-188 §4	Group-size indistinguishability satisfies the “k-anonymity” de-identification model documented in the standard
Re-identification risk assessment	ISO/IEC 29101 §6.4	The intersection-attack and auxiliary-join validation steps constitute a documented risk assessment
Purpose limitation	GDPR Art. 5(1)(b)	Cloak geometry is calibrated to the specific analytical purpose (e.g. routing vs. density); parameters are version-controlled
Audit readiness	GDPR Art. 30	Configuration files (k, Δt, max_radius_m, projection) stored with each release form a versioned processing record

For regulatory mapping across GDPR and CCPA, see the compliance mapping for GDPR and CCPA location data reference. The privacy risk scoring frameworks for GIS article provides a quantitative basis for selecting k relative to your organisation’s risk appetite.

Documentation requirements for audit readiness:

Version-controlled YAML or JSON config per release: k, window_minutes, max_radius_m, geometry_type, projection EPSG codes, and the git commit hash of the pipeline code.
Log of windows that failed k-satisfaction, with counts and geographic centroids (not user IDs).
Intersection-attack score for each release, stored alongside the output.
Sign-off from the data protection officer confirming the k threshold is consistent with the DPIA.
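
The per-release configuration record could be produced along these lines. This is a sketch only: `build_release_record`, the field names and the sample values are assumptions, and the commit hash is a placeholder string rather than a real revision.

```python
import hashlib
import json

def build_release_record(k, window_minutes, max_radius_m, geometry_type,
                         source_crs, metric_crs, git_commit):
    """Bundle release parameters with a checksum of their canonical JSON form."""
    config = {
        "k": k,
        "window_minutes": window_minutes,
        "max_radius_m": max_radius_m,
        "geometry_type": geometry_type,
        "source_crs": source_crs,
        "metric_crs": metric_crs,
        "git_commit": git_commit,
    }
    payload = json.dumps(config, sort_keys=True)  # stable key order for hashing
    checksum = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return {"config": config, "config_sha256": checksum}

record = build_release_record(
    k=5, window_minutes=5, max_radius_m=2000.0, geometry_type="mbr",
    source_crs="EPSG:4326", metric_crs="EPSG:3395", git_commit="<commit-hash>",
)
print(json.dumps(record, indent=2))
```

Storing the checksum next to the output makes it easy to confirm later that a published release was produced with the recorded parameters.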
