How to Calculate Re-identification Risk for GPS Logs

To quantify re-identification risk in GPS logs, discretize continuous spatiotemporal coordinates (WGS84) into equivalence classes, measure the size distribution of those classes, and express risk as $\frac{1}{k_{\min}}$, the inverse of the smallest class size. A uniqueness rate and a Monte Carlo adversarial simulation complement this baseline to produce the full risk picture required for re-identification risk assessment for geospatial datasets.

Core Calculation: Formulas, Parameters, and a Worked Example

GPS logs are high-dimensional and temporally continuous. Exact coordinate matching is impractical for risk modeling, so the standard approach converts raw tuples into equivalence classes and measures their size distribution.

Formulas

k-anonymity risk (worst-case):

$$R_k = \frac{1}{\min_{c \in \mathcal{C}} |c|}$$

Uniqueness rate (proportion of trajectories that are singletons, where $N = \sum_{c \in \mathcal{C}} |c|$ is the total number of trajectories):

$$U = \frac{|\{c \in \mathcal{C} : |c| = 1\}|}{N}$$

Adversarial match probability (Monte Carlo, given $m$ observed points per trajectory):

$$P_{\text{match}}(m) = \frac{1}{N_{\text{iter}}} \sum_{i=1}^{N_{\text{iter}}} \mathbf{1}[\text{query}(S_i^{(m)}) \text{ returns exactly 1 user}]$$

where $\mathcal{C}$ is the set of equivalence classes, $|c|$ is the number of users in class $c$, and $S_i^{(m)}$ is the random $m$-point sample drawn in iteration $i$ from a randomly chosen trajectory.

Parameter Table

Parameter	Symbol	Typical Range	Privacy Implication
Spatial bin width	$\Delta s$	100 m – 1 km	Smaller bins → more unique sequences → higher risk
Temporal bin width	$\Delta t$	15 min – 24 h	Finer windows increase trajectory distinctiveness
Minimum class size	$k_{\min}$	≥ 5 (internal), ≥ 10 (public)	Determines worst-case re-identification probability
Adversary sample points	$m$	2 – 4	More anchor points → higher simulated match rate
Monte Carlo iterations	$N_{\text{iter}}$	10,000 – 100,000	Higher count → tighter confidence interval on $P_{\text{match}}$

Worked Numeric Example

A mobility dataset contains 10,000 user trajectories collected over 7 days in WGS84 (EPSG:4326). After spatial binning at $\Delta s = 500\,\text{m}$ and temporal binning at $\Delta t = 1\,\text{h}$:

8,200 trajectories fall into equivalence classes of size ≥ 5.
1,600 fall into classes of size 2–4.
200 are singletons ($|c| = 1$).

Results:

$$R_k = \frac{1}{1} = 1.0 \quad \text{(100\% risk for singletons)}$$

$$U = \frac{200}{10{,}000} = 0.02 \quad \text{(2\% uniqueness rate)}$$

Even with only 2% unique trajectories, the 200 singleton users are fully re-identifiable. Suppress them or coarsen $\Delta s$ to 1 km before release. After suppressing the singletons, the smallest remaining classes have size 2:

$$R_k = \frac{1}{2} = 0.50 \quad \text{(still fails } k \geq 5 \text{ threshold)}$$

Coarsen to $\Delta s = 1\,\text{km}$, $\Delta t = 2\,\text{h}$, then re-run the check and repeat until $k_{\min} \geq 5$.
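The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch, not part of the original calculation: the example gives only the totals per size band, so the individual class sizes below (every class has size 10, 2, or 1) are assumptions chosen to reproduce those totals, and `risk_metrics` is a hypothetical helper name. $U$ is computed as in the example, singleton trajectories over all trajectories.

```python
# Assumed class sizes (hypothetical): only the band totals come from the example.
class_sizes = [10] * 820 + [2] * 800 + [1] * 200   # 8,200 + 1,600 + 200 trajectories


def risk_metrics(sizes):
    """Return (R_k, U): worst-case risk and share of trajectories that are singletons."""
    total = sum(sizes)
    singletons = sum(1 for s in sizes if s == 1)
    return 1.0 / min(sizes), singletons / total


r_k, u = risk_metrics(class_sizes)
print(f"before suppression: R_k={r_k:.2f}, U={u:.2%}")

# Suppress singletons only: the smallest remaining class then has size 2.
no_singletons = [s for s in class_sizes if s > 1]
r_k_after, u_after = risk_metrics(no_singletons)
print(f"after suppression:  R_k={r_k_after:.2f}, U={u_after:.2%}")
```

With real data the class sizes would come from the binned trajectories rather than from a hand-built list.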

Inline Diagram: Equivalence Class Formation

The following diagram shows how raw GPS points are discretized into spatial-temporal bins and grouped into equivalence classes, revealing which trajectories are indistinguishable.
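The same grouping can be traced in code. The sketch below bins a handful of made-up GPS fixes and groups users by their binned sequence. The coordinates, the user labels, the reference latitude, and the helpers `to_bin` and `t` are all hypothetical; the degree-based bin width uses the 111.32 km-per-degree approximation with a cosine correction for longitude.

```python
import math
from collections import defaultdict
from datetime import datetime, timezone

SPATIAL_KM = 0.5   # spatial bin width (Delta s)
TEMPORAL_H = 1     # temporal bin width in hours (Delta t)
REF_LAT = 52.52    # assumed reference latitude for the cosine correction


def to_bin(lat, lon, ts):
    """Map one GPS fix to a (lat_bin, lon_bin, time_bin) cell."""
    lat_deg = SPATIAL_KM / 111.32
    lon_deg = SPATIAL_KM / (111.32 * math.cos(math.radians(REF_LAT)))
    time_bin = int(ts.timestamp() // (TEMPORAL_H * 3600))
    return (math.floor(lat / lat_deg), math.floor(lon / lon_deg), time_bin)


def t(hour, minute):
    """Hypothetical helper: a UTC timestamp on an arbitrary example day."""
    return datetime(2024, 1, 1, hour, minute, tzinfo=timezone.utc)


# Made-up fixes: (user, lat, lon, timestamp)
fixes = [
    ("A", 52.5220, 13.4090, t(8, 5)),  ("A", 52.5310, 13.4240, t(9, 10)),
    ("B", 52.5222, 13.4092, t(8, 40)), ("B", 52.5312, 13.4242, t(9, 50)),
    ("C", 52.5220, 13.4090, t(8, 20)), ("C", 52.5400, 13.4500, t(9, 30)),
]

# Build each user's chronological sequence of cells.
sequences = defaultdict(list)
for user, lat, lon, ts in sorted(fixes, key=lambda f: (f[0], f[3])):
    sequences[user].append(to_bin(lat, lon, ts))

# Users with identical sequences form one equivalence class.
classes = defaultdict(list)
for user, seq in sequences.items():
    classes[tuple(seq)].append(user)

for members in classes.values():
    print(f"class of size {len(members)}: {members}")
```

Users A and B visit the same cells in the same hours and are therefore indistinguishable after binning, while C diverges in the second hour and ends up in a class of its own.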

Python Implementation

The function below calculates baseline re-identification risk using spatial-temporal binning and equivalence class analysis. It operates on WGS84 coordinates (EPSG:4326) and approximates bin sizes in degrees using a cosine correction for longitude — sufficient for regional datasets. For sub-100 m precision or global coverage, replace the manual binning with grid aggregation via H3 or PostGIS.

import pandas as pd
import numpy as np
from collections import Counter
from typing import TypedDict

class RiskReport(TypedDict):
    k_anonymity_risk: float
    uniqueness_rate: float
    min_class_size: int
    total_trajectories: int
    singleton_count: int
    class_size_distribution: dict[tuple, int]


def calculate_gps_reidentification_risk(
    df: pd.DataFrame,
    spatial_res_km: float = 0.5,
    temporal_res_h: float = 1.0,
) -> RiskReport:
    """
    Quantify re-identification risk for WGS84 GPS logs via equivalence classes.

    Coordinates are expected in EPSG:4326 (WGS84 decimal degrees).
    For datasets spanning > 500 km or requiring auditable bin-size accuracy,
    reproject to UTM and bin the projected metre coordinates directly; this
    function's degree-based approximation does not accept projected input.

    Args:
        df: DataFrame with columns 'user_id', 'lat', 'lon', 'timestamp'.
        spatial_res_km: Spatial bin width in kilometres (WGS84 approximation).
        temporal_res_h: Temporal bin width in hours.

    Returns:
        RiskReport with k_anonymity_risk, uniqueness_rate, min_class_size,
        total_trajectories, singleton_count, and class_size_distribution.

    Raises:
        ValueError: if required columns are missing or the DataFrame is empty.
    """
    required = {"user_id", "lat", "lon", "timestamp"}
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    if df.empty:
        raise ValueError("DataFrame is empty; cannot compute risk metrics.")

    df = df.copy()
    df["timestamp"] = pd.to_datetime(df["timestamp"])

    # --- Step 1: Spatial binning in WGS84 (EPSG:4326) ---
    # 1° latitude ≈ 111.32 km; longitude degrees shrink with cos(lat).
    # Use mean latitude as a dataset-level correction — acceptable for
    # regional datasets < 500 km extent.  For global data, project to UTM.
    lat_deg = spatial_res_km / 111.32
    lon_deg = spatial_res_km / (111.32 * np.cos(np.radians(df["lat"].mean())))

    df["lat_bin"] = np.floor(df["lat"] / lat_deg).astype(int)
    df["lon_bin"] = np.floor(df["lon"] / lon_deg).astype(int)

    # --- Step 2: Temporal binning ---
    freq = f"{max(1, int(temporal_res_h * 60))}min"
    df["time_bin"] = df["timestamp"].dt.floor(freq)

    # --- Step 3: Build per-user trajectory sequence (chronological) ---
    df = df.sort_values(["user_id", "timestamp"])
    trajectory_seq: pd.Series = (
        df.groupby("user_id")[["lat_bin", "lon_bin", "time_bin"]]
        .apply(lambda g: tuple(map(tuple, g.values.tolist())))
    )

    # --- Step 4: Equivalence class size distribution ---
    class_counts: Counter = Counter(trajectory_seq)
    sizes = np.array(list(class_counts.values()), dtype=int)

    if len(sizes) == 0:
        raise ValueError("No trajectories found after grouping.")

    min_k = int(sizes.min())
    singleton_count = int(np.sum(sizes == 1))

    # k-anonymity risk: worst-case per-user re-identification probability
    k_anonymity_risk = round(1.0 / min_k, 4)
    # Uniqueness rate: fraction of trajectories that sit in singleton classes
    uniqueness_rate = round(singleton_count / int(sizes.sum()), 4)

    return RiskReport(
        k_anonymity_risk=k_anonymity_risk,
        uniqueness_rate=uniqueness_rate,
        min_class_size=min_k,
        total_trajectories=int(sizes.sum()),  # trajectories, not classes
        singleton_count=singleton_count,
        class_size_distribution=dict(class_counts),
    )

Implementation notes:

Hash or otherwise pseudonymise user_id before passing the DataFrame to this function (the column itself must stay, because the function groups on it); keep a separate mapping only in the secure audit log.
The cosine correction is a dataset-level mean. For precision work, reproject to EPSG:32632 (UTM zone 32N) or the locally relevant UTM zone before binning.
1/k represents worst-case risk. A dataset with 10,000 trajectories where $k_{\min} = 4$ implies a 25% re-identification probability for the most exposed users.
For k-anonymity grouping applied to mobile location traces, this function provides the pre-suppression diagnostic step.
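For the first note, one possible way to hash user_id is a keyed digest from the Python standard library. This is a sketch under assumptions: the function name `pseudonymise_user_id`, the example identifiers, and the hard-coded key are hypothetical, and the choice of HMAC-SHA256 is an illustration, not something the function above requires.

```python
import hashlib
import hmac


def pseudonymise_user_id(user_id: str, secret_key: bytes) -> str:
    """Return a keyed SHA-256 digest (hex) of user_id."""
    return hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical key: in practice load it from a secrets store, never hard-code it.
SECRET_KEY = b"replace-with-a-randomly-generated-key"

raw_ids = ["alice@example.com", "bob@example.com", "alice@example.com"]
pseudonyms = [pseudonymise_user_id(u, SECRET_KEY) for u in raw_ids]

# The raw-to-pseudonym mapping belongs only in the secure audit log.
audit_mapping = dict(zip(raw_ids, pseudonyms))
```

The same raw identifier always maps to the same pseudonym, so per-user grouping still works. On a DataFrame, the helper could be applied with `df["user_id"].map(...)` before calling the risk function.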

Verification Snippet

After running the main function, confirm that the dataset meets the target thresholds before any data release:

def verify_risk_threshold(
    report: RiskReport,
    target_k: int = 5,
    max_uniqueness: float = 0.01,
) -> bool:
    """
    Return True only if the dataset meets both the k-anonymity and
    uniqueness-rate targets.  Log failures with actionable remediation notes.
    """
    passes = True

    if report["min_class_size"] < target_k:
        print(
            f"FAIL: min_class_size={report['min_class_size']} < target k={target_k}. "
            f"Suppress {report['singleton_count']} singleton trajectories "
            f"or coarsen spatial_res_km / temporal_res_h."
        )
        passes = False

    if report["uniqueness_rate"] > max_uniqueness:
        print(
            f"FAIL: uniqueness_rate={report['uniqueness_rate']:.2%} "
            f"> max allowed {max_uniqueness:.2%}. "
            f"Found {report['singleton_count']} singleton classes."
        )
        passes = False

    if passes:
        print(
            f"PASS: min_class_size={report['min_class_size']}, "
            f"uniqueness_rate={report['uniqueness_rate']:.2%}"
        )
    return passes


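# Example usage. `df` is assumed to be a DataFrame with the columns
# 'user_id', 'lat', 'lon', 'timestamp' described above.
TARGET_K, RES_KM, RES_H = 5, 0.5, 1.0
report = calculate_gps_reidentification_risk(df, spatial_res_km=RES_KM, temporal_res_h=RES_H)
if not verify_risk_threshold(report, target_k=TARGET_K):
    # trajectory_seq and class_counts are local to the risk function, so rebuild
    # the user -> binned-sequence mapping here with the same binning rules.
    b = df.copy()
    b["timestamp"] = pd.to_datetime(b["timestamp"])
    lat_deg = RES_KM / 111.32
    lon_deg = RES_KM / (111.32 * np.cos(np.radians(b["lat"].mean())))
    b["lat_bin"] = np.floor(b["lat"] / lat_deg).astype(int)
    b["lon_bin"] = np.floor(b["lon"] / lon_deg).astype(int)
    b["time_bin"] = b["timestamp"].dt.floor(f"{max(1, int(RES_H * 60))}min")
    b = b.sort_values(["user_id", "timestamp"])
    trajectory_seq = b.groupby("user_id")[["lat_bin", "lon_bin", "time_bin"]].apply(
        lambda g: tuple(map(tuple, g.values.tolist()))
    )
    class_counts = report["class_size_distribution"]

    # Suppress every trajectory whose equivalence class is smaller than the target k
    keep_ids = [uid for uid, seq in trajectory_seq.items()
                if class_counts.get(seq, 0) >= TARGET_K]
    safe_df = df[df["user_id"].isin(keep_ids)]
    # Note: the call below raises ValueError if suppression removed every trajectory.
    report = calculate_gps_reidentification_risk(safe_df, spatial_res_km=RES_KM, temporal_res_h=RES_H)
    verify_risk_threshold(report, target_k=TARGET_K)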

Edge Cases and Adjustments

Sparse rural data. Low population density creates naturally unique trajectories even at coarse resolutions. In sparse regions, coarsening $\Delta s$ from 500 m to 2 km may be necessary; monitor utility loss via a privacy-utility trade-off analysis before releasing.
Non-uniform urban density. Dense city centres generate highly overlapping trajectories (large $k$) while peripheral areas have few users and small equivalence classes. Report risk per zone: a dataset-wide minimum $k$ may look acceptable while outlying zones are fully exposed. Segment the DataFrame by spatial zone and run the function per segment.
Temporal windowing. A trajectory that spans midnight is split across bins whenever a temporal_res_h boundary falls during the night. This can artificially fragment a sequence and shrink equivalence class sizes. Use calendar-day flooring (dt.normalize()) for datasets where overnight continuity is not analytically required, or widen $\Delta t$ to 2–6 hours.
CRS precision loss. Binning in WGS84 decimal degrees with the cosine correction introduces up to 3% bin-width error at latitudes above 60°N. For Arctic or sub-Antarctic datasets, reproject to a local UTM CRS (e.g., EPSG:32633) before computing lat_bin / lon_bin to avoid inflating apparent equivalence class sizes.
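The size of the cosine-correction error follows from the binning formula: a longitude bin sized at the dataset's mean latitude $\varphi_0$ has a true east-west ground width of $\Delta s \cdot \cos\varphi / \cos\varphi_0$ at latitude $\varphi$. The sketch below evaluates that relative error. The function name `lon_bin_width_error` and the 1-degree latitude extent are assumptions for illustration.

```python
import math


def lon_bin_width_error(mean_lat_deg: float, lat_deg: float) -> float:
    """Relative east-west bin-width error at lat_deg for bins sized at mean_lat_deg."""
    return math.cos(math.radians(lat_deg)) / math.cos(math.radians(mean_lat_deg)) - 1.0


for mean_lat in (0.0, 45.0, 60.0):
    # Assumed extent: data reaching 1 degree (about 111 km) north of the mean latitude.
    err = lon_bin_width_error(mean_lat, mean_lat + 1.0)
    print(f"mean lat {mean_lat:4.1f}: width error at +1 deg = {err:+.2%}")
```

For the same latitude extent the error grows with the mean latitude, which is why a dataset-level correction that is harmless near the equator becomes noticeable at high latitudes.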

Frequently Asked Questions

What k value is required to satisfy GDPR for GPS trajectory data?

GDPR does not specify a numeric threshold, but the Article 29 Working Party guidance treats $k < 5$ as insufficient for public release of location records. Internal analytics without onward sharing may use $k \geq 3$ with documented risk acceptance, while public or cross-organisation releases should target $k \geq 10$ and apply suppression on any remaining sparse bins. The compliance mapping for GDPR and CCPA location data page details the relevant articles.

How does spatial resolution affect re-identification risk?

Finer spatial bins (below 100 m) produce more unique trajectory sequences and therefore smaller equivalence classes, pushing $R_k$ toward 1.0 for singleton trajectories. Coarser bins (500 m – 1 km) increase class sizes but reduce spatial utility. The optimal $\Delta s$ depends on use case: 100–250 m suits urban mobility analytics where street-level precision is needed; 500 m–1 km is common for regulatory compliance releases.

When is Monte Carlo adversarial simulation necessary vs. baseline k-anonymity?

Baseline $1/k$ assumes an adversary knows the full trajectory sequence. When attackers are likely to have only partial knowledge — two to three anchor points from social media check-ins or public transit records — Monte Carlo simulation with $m$ -point sampling produces a more realistic estimate, typically 20–60% higher than the baseline value for urban datasets. Implement the simulation by sampling $m$ random spatiotemporal bins from each trajectory, querying the binned dataset for matching records, and counting singleton matches across 10,000+ iterations. The spatial linkage attack vectors and mitigation page documents the auxiliary-data attack model in detail.
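The simulation steps described in this answer can be sketched with the standard library alone. Everything about the data model here is an assumption: trajectories are represented as a dict mapping each user to the set of bins they visited, the target user is drawn uniformly at random in each iteration, and `simulate_match_probability` is a hypothetical name. The toy bins are placeholders, not real data.

```python
import random


def simulate_match_probability(trajectories, m=2, n_iter=10_000, seed=0):
    """Estimate the chance that m known bins of a random user match exactly one user."""
    rng = random.Random(seed)
    users = list(trajectories)
    unique_hits = 0
    for _ in range(n_iter):
        target = rng.choice(users)
        cells = sorted(trajectories[target])
        # The adversary knows m bins of the target (or all of them if fewer exist).
        known = set(rng.sample(cells, min(m, len(cells))))
        # Query: how many users' trajectories contain every known bin?
        matches = sum(1 for visited in trajectories.values() if known <= visited)
        if matches == 1:
            unique_hits += 1
    return unique_hits / n_iter


# Toy binned trajectories (placeholder bins): u1 and u2 are indistinguishable.
toy = {
    "u1": {("a", 8), ("b", 9), ("c", 10)},
    "u2": {("a", 8), ("b", 9), ("c", 10)},
    "u3": {("a", 8), ("d", 9), ("e", 10)},
}

p_m1 = simulate_match_probability(toy, m=1)
p_m2 = simulate_match_probability(toy, m=2)
print(f"P_match(m=1) ~ {p_m1:.3f}, P_match(m=2) ~ {p_m2:.3f}")
```

In this toy set only u3 can ever be singled out, and knowing two bins instead of one raises the estimate, which mirrors the parameter-table note that more anchor points increase the simulated match rate.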

Should I use WGS84 or a projected CRS for binning?

Bin in a projected CRS when your dataset spans a small geographic area and bins of near-uniform ground size are required for audit accuracy. The local UTM zone is the safer choice: EPSG:3857 (Web Mercator) stretches ground distances by roughly 1/cos(latitude), so its grid cells need the same latitude correction as degree-based bins. For global datasets or when using H3 hexagonal indexing, WGS84 degree-based binning with a cosine correction is acceptable as a first pass. UTM avoids longitude-shrinkage distortion and is preferable for any compliance calculation where bin-size accuracy is formally audited.
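Picking the "locally relevant" UTM zone can be automated from a dataset's centroid. The sketch below uses the standard 6-degree zone rule and the EPSG numbering for WGS84 / UTM (326xx north, 327xx south). The function name `utm_epsg_for` is hypothetical, and the special zone boundaries around Norway and Svalbard are deliberately ignored.

```python
def utm_epsg_for(lat: float, lon: float) -> int:
    """EPSG code of the standard WGS84 / UTM zone containing (lat, lon)."""
    if not -80.0 <= lat <= 84.0:
        raise ValueError("UTM is defined only between 80 deg S and 84 deg N")
    # Zones are 6 degrees wide, numbered 1..60 eastward from 180 deg W.
    zone = min(int((lon + 180.0) // 6) + 1, 60)   # lon = 180 folds into zone 60
    return (32600 if lat >= 0 else 32700) + zone


# Example centroids (illustrative city coordinates)
print(utm_epsg_for(45.46, 9.19))     # northern Italy
print(utm_epsg_for(52.52, 13.405))   # Berlin
print(utm_epsg_for(-33.87, 151.21))  # Sydney
```

A dataset that straddles a zone boundary still needs a judgment call about which zone to use; this helper only reports the zone of a single point.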
