
Where the Vector Breaks: Evaluating Our Duplicate Detection

Pressure-testing the 9D similarity vector against 462k real pairs from the live database — and discovering the current threshold catches only 20% of realistic relists.

Tags: software engineering · machine learning · postgres · pgvector · evaluation

The previous post described the 9-dimensional vector that powers Nipponhomes' duplicate detection and recommendations. A Postgres trigger computes nine hand-crafted numbers for every listing, and a threshold of score >= 98 decides whether two listings are "the same property."

That design choice was justified qualitatively: simple, interpretable, fast. What it wasn't was measured. This post is about the measurement — and what we found when we finally did it.

The short version:

  1. A ±1 year_built drift alone pushes two otherwise-identical listings below the 98 threshold. That's a structural blind spot, not a bug.
  2. A full scan of 462,400 neighbor pairs from the live database confirms this signature dominates the 92-98 score band.
  3. Learned logistic weights recover a clean story: price and age carry the signal; the three binary dimensions are dead weight.
  4. The current threshold catches only about 20% of realistic relists. Dropping it from 98 to 94 is a one-line change that recovers ~4× more duplicates at a modest precision cost.

Everything below is reproducible from the scripts in random-one-off-scripts/synthetic_dedup/.


1. The structural hole

Start with arithmetic. Dim 3 of the vector is (reference_year - year_built - 25) / 15. If two listings differ by exactly one calendar year in year_built, their squared difference on that dimension is (1/15)² ≈ 0.00444. The Euclidean distance d_L2 increases by 1/15 ≈ 0.0667 from that dimension alone, and the similarity score becomes:

score = 100 · e^(−0.0667 / 2) ≈ 96.72

That's below 98. So the current vector structurally cannot merge any pair of listings that differ only by one calendar year of year_built — a pattern Suumo and Athome routinely exhibit on the same building because they report different "construction completed" years.
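The arithmetic is small enough to check directly. A minimal sketch, assuming the score formula score = 100 · e^(−d/2) described above (the function name here is illustrative, not from the codebase):

```python
import math

def score_for_year_diff(year_diff: int, scale: float = 15.0) -> float:
    """Similarity score when two listings differ only on dim 3 (age).

    The per-dimension diff is |year_diff| / scale, which is also the
    full L2 distance when every other dimension matches exactly.
    """
    d = abs(year_diff) / scale
    return 100.0 * math.exp(-d / 2.0)

# Identical listings score a perfect 100; a single-year drift lands
# at ~96.72, below the current 98 threshold but above the proposed 94.
```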

[Figure: similarity score vs. |year_built difference| (dim 3, scale = 15), with the current threshold (98) and recommended threshold (94) marked; the ±1 year point sits at score 96.72.]
With today's dim-3 scale, a single-year shift alone drops the score to 96.72. Every ±1 year relist is structurally invisible to the current 98 threshold — but sits comfortably above 94.

The fix isn't subtle: either widen dim 3's scale, or drop the threshold. Dropping the threshold is a one-line edit and doesn't require rebuilding 400k vectors.
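For completeness, the "widen dim 3's scale" option can be quantified with the same formula. A back-of-envelope check (my derivation, not a number from the repo): the smallest scale at which a ±1 year pair still clears 98.

```python
import math

# Solve 100 * exp(-(1/s) / 2) >= 98 for the dim-3 scale s:
#   (1/s) / 2 <= -ln(0.98)  =>  s >= 1 / (2 * -ln(0.98))
min_scale = 1.0 / (2.0 * -math.log(0.98))
# min_scale comes out around 24.7, i.e. the current scale of 15
# would need to grow by roughly 65% to absorb a one-year drift.
```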


2. Do real pairs look like the toy example?

Before modeling anything, we ran a direct question against the live database: what do near-duplicate pairs actually look like?

The query is a self-join through pgvector's HNSW index. For every listing, fetch the top-3 nearest neighbors with score ≥ 85, join back to the listings table, compute per-field diffs, and bucket by similarity. The full scan returned 462,400 pairs.

**462,400 neighbor pairs from the live database.** Pulled via pgvector's HNSW index: every listing's top-3 nearest neighbors with score ≥ 85, joined back to listings for per-field diffs.

| bucket | pairs | median \|Δyear\| | median \|Δsize\| | median \|Δprice\| |
|---|---|---|---|---|
| ≥ 98 | 245.4k | 0 | 0.07% | 0.00% |
| 95 – 98 | 114.3k | 1 | 0.86% | 1.82% |
| 92 – 95 | 55.0k | 1 | 1.73% | 4.01% |
| 90 – 92 | 19.9k | 1 | 2.34% | 5.72% |
| 85 – 90 | 27.9k | 2 | 2.92% | 7.38% |

**Year-built difference within the ≥ 95 bucket.** 113,186 pairs with both years known. A single-year drift alone accounts for 40.3% of this bucket.

| Δyear | share |
|---|---|
| −1 year | 19.0% |
| 0 years | 59.7% |
| +1 year | 21.3% |
| ±2 or more | 0.02% |
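The bucketing and per-field diffs behind these tables can be sketched in a few lines. This is a sketch only; the actual scan lives in the repo scripts, and the field names below are made up for illustration:

```python
def pair_stats(a: dict, b: dict, score: float) -> dict:
    """Per-field diffs for one neighbor pair (hypothetical field names)."""
    def rel_diff(x, y):
        # relative difference against the larger value, as a fraction
        return abs(x - y) / max(abs(x), abs(y))

    buckets = [(98, ">=98"), (95, "95-98"), (92, "92-95"),
               (90, "90-92"), (85, "85-90")]
    return {
        "bucket": next(label for lo, label in buckets if score >= lo),
        "year_diff": abs(a["year_built"] - b["year_built"]),
        "size_diff_pct": 100 * rel_diff(a["size_sqm"], b["size_sqm"]),
        "price_diff_pct": 100 * rel_diff(a["price"], b["price"]),
    }
```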

Two findings stand out:

The ±1 year signature replicates across the entire database. In the ≥ 95 bucket, 40.3% of pairs with known year values differ by exactly one year, and essentially none differ by more than two. This isn't a quirk of our synthetic generator — it's what production near-duplicates actually look like.

The real measurement noise floor is small. In the 95-98 bucket, median |size_diff| is 0.86% and median |price_diff| is 1.82%. That's the amount of disagreement we see between pairs the system already considers "very similar." Any evaluation we run has to match this noise level to be honest.

(There's a third, quieter finding: 250,399 of the ≥ 95 pairs are more than 50 km apart. Those are what the character-bigram Jaccard location filter from part 1 catches in production — confirmation that the geography-free vector leans heavily on the filter as a safety net.)
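As a refresher on that filter, a character-bigram Jaccard overlap takes only a few lines. This is a sketch of the idea, not the production implementation; the normalization in the real filter may differ:

```python
def bigrams(s: str) -> set:
    s = "".join(s.lower().split())  # normalize: lowercase, drop whitespace
    return {s[i:i + 2] for i in range(len(s) - 1)}

def location_overlap(a: str, b: str) -> float:
    """Jaccard similarity over character bigrams of two location strings."""
    x, y = bigrams(a), bigrams(b)
    return len(x & y) / len(x | y) if x | y else 0.0

# In production, a pair is rejected when the overlap falls below 0.15.
```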


3. Learning what the vector should have weighted

With noise calibrated to 1%, we built a labeled synthetic dataset: 12,159 pairs across five categories — synthetic relists, hard negatives (sibling units), cross-region, same-region-type, and cross-type. Then we trained three scorers on squared per-dimension diffs and compared them with 5-fold CV:

| model | AUC | AP | P@Recall=0.95 |
|---|---|---|---|
| baseline (current L2) | 0.962 ± 0.003 | 0.934 ± 0.008 | 0.804 ± 0.012 |
| logistic regression | 0.971 ± 0.002 | 0.945 ± 0.007 | 0.872 ± 0.013 |
| LightGBM | 0.989 ± 0.001 | 0.978 ± 0.003 | 0.947 ± 0.005 |

Logistic regression delivers a 6.8 percentage point lift in P@Recall=0.95 over the hand-tuned baseline. While LightGBM's raw numbers look superior, our robustness tests revealed a classic "shortcut learning" trap.
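Since P@Recall=0.95 drives the comparison, it's worth pinning down exactly what that metric computes. A minimal reference implementation (a sketch, not the repo's evaluation code; ties and interpolation are handled naively):

```python
def precision_at_recall(scores, labels, target_recall=0.95):
    """Best precision over operating points whose recall >= target_recall.

    scores: higher means more likely duplicate; labels: 1 = duplicate.
    """
    pairs = sorted(zip(scores, labels), reverse=True)
    total_pos = sum(labels)
    tp = fp = 0
    best = 0.0
    for _, y in pairs:
        tp += y
        fp += 1 - y
        if tp / total_pos >= target_recall:
            best = max(best, tp / (tp + fp))
    return best
```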

The Shortcut Discovery

In early benchmarks, LightGBM achieved a perfect 1.000 AUC. When we interrogated the model's feature importance (gain), we found it had built a single split: if size_diff == 0 then duplicate.

In our initial synthetic generator, relists were produced with perfect size consistency. In the real world, Suumo and Athome routinely disagree on square meters by ~1% due to rounding or balcony rules. Once we injected the 1% measurement noise we observed in the live database, LightGBM's "perfect" splits collapsed. The logistic model—which weights the underlying concepts—proved far more stable across unseen perturbation types.


4. Why the coefficients matter

Translating the learned logistic coefficients back into our trigger scales gives us a clear ranking of which features actually separate "relists" from "sibling units" (different units in the same building).

Logistic regression coefficient magnitudes per dimension (5-fold CV, trained on calibrated synthetic pairs):

| dimension | \|coef\| | group |
|---|---|---|
| dim 1 · log price | 15.49 | dominant signal |
| dim 3 · age | 11.94 | dominant (fixes the ±1 year hole) |
| dim 7 · sqm / room | 8.96 | meaningful |
| dim 0 · size | 8.45 | meaningful |
| dim 2 · price / m² | 4.05 | meaningful |
| dim 4 · room count | 3.63 | meaningful |
| dim 8 · storage | 0.70 | dead weight |
| dim 6 · type | 0.55 | dead weight |
| dim 5 · LDK | 0.22 | dead weight |
Price and age carry the signal. The three binary dimensions — LDK, type, storage — have |coef| < 1 at every noise level tested, and could be dropped from the trigger without measurable loss on hard negatives.

The coefficients fell into three distinct groups:

  • The Dominants: Log Price (-15.5) and Age (-11.9) do the heavy lifting. They are the highest-resolution signals we have.
  • The Meaningful: Size (-8.5) and sqm/room (-9.0) are strong secondary signals. Size-per-room is particularly effective at encoding layout efficiency.
  • The Dead Weight: The binary dimensions (LDK flag, house-vs-condo, storage) all have |coef| < 1. They help separate trivially different listings, but contribute almost nothing to the hard task of identifying relists.
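One way to read the learned magnitudes back into the scoring function is a weighted L2 distance, with the dead-weight dims contributing almost nothing. To be clear, this is a hypothetical re-weighting for intuition, not what ships today:

```python
import math

# |coef| magnitudes from the 5-fold CV run, indexed dim 0..8
COEF = [8.45, 15.49, 4.05, 11.94, 3.63, 0.22, 0.55, 8.96, 0.70]

def weighted_score(diffs):
    """Hypothetical re-weighted similarity score.

    Normalizes the weights to mean 1 so the 100 * exp(-d/2) scale
    stays comparable to the current trigger's unweighted score.
    """
    w = [c * len(COEF) / sum(COEF) for c in COEF]
    d = math.sqrt(sum(wi * di * di for wi, di in zip(w, diffs)))
    return 100.0 * math.exp(-d / 2.0)
```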

5. The threshold is way too tight

The last experiment measured precision and recall for every threshold between 90 and 98. The results suggest we've been leaving a massive amount of recall on the table:

| threshold | recall | precision | F1 |
|---|---|---|---|
| 98 (current) | 20.5% | 95.9% | 0.337 |
| 96 | 58.3% | 94.8% | 0.722 |
| 94 | 82.3% | 92.3% | 0.870 |
| 92 | 90.4% | 87.9% | 0.892 |
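The sweep itself is straightforward once labeled pairs exist. A minimal sketch of how such a table can be produced (illustrative only; the repo's scripts are the source of truth):

```python
def sweep(scores, labels, thresholds=(98, 96, 94, 92)):
    """Precision / recall / F1 at each candidate threshold."""
    out = {}
    total_pos = sum(labels)
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / total_pos if total_pos else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        out[t] = (p, r, f1)
    return out
```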

These numbers don't yet include the Jaccard location filter, which runs after the vector match and rejects any pair with less than 0.15 location-string overlap. In production, precision at any threshold will be higher than the table shows, because the location filter catches the cross-prefecture false positives (the 250k pairs >50 km apart mentioned earlier). So the recommended operating point of 94 is conservative.


6. What ships

The beauty of this whole exercise is that the highest-leverage fix is also the cheapest. A one-line change to scraper/config/prod.py:

```python
class SuumoConfig:
    VECTOR_SIMILARITY_THRESHOLD = 94.0  # was 98.0
```

No vector rebuild. No HNSW reindex. No migration. Instantly reversible if the real-world precision turns out worse than the synthetic benchmark predicts. The plan is to ship this first, monitor for a week against find_recent_suumo_duplicates.py in dry-run mode, and watch for Jaccard-filter escapes before committing fully.

The trigger changes — widening scales on the meaningful dims, dropping the binary dims — are strictly more expensive (every vector needs recomputing, then the HNSW index needs rebuilding on 400k rows) and deserve a hand-labeled oracle set before shipping. That's what the next iteration of this evaluation will be built on.


What I'd do differently

Two things.

Build the evaluation loop earlier. The original vector shipped with essentially no quantitative validation. It worked, and the threshold was "conservative enough," but "conservative enough" turned out to mean "catching 20% of what it could." An offline benchmark that reproduces the trigger byte-for-byte is not hard to write, and once it exists, every change to the trigger becomes a measurable experiment instead of a guess.

Calibrate synthetic data against real data before trusting it. Our first pass used 3% measurement noise, eyeballed. It was 3× too aggressive, and it produced misleading coefficients on two dimensions (size and sqm/room both looked collapsed when they weren't). Pulling the real-pair noise floor out of the live database took an afternoon and corrected the story.

Measurement is almost always the lever you wish you'd pulled first.


Scripts and raw results live in random-one-off-scripts/synthetic_dedup/FINDINGS.md in the repo. The next post in this series covers the recommendation side of the same vector.