Where the Vector Breaks: Evaluating Our Duplicate Detection
Pressure-testing the 9D similarity vector against 462k real pairs from the live database — and discovering the current threshold catches only 20% of realistic relists.
The previous post described the 9-dimensional vector that powers Nipponhomes' duplicate detection and recommendations. A Postgres trigger computes nine hand-crafted numbers for every listing, and a threshold of score >= 98 decides whether two listings are "the same property."
That design choice was justified qualitatively: simple, interpretable, fast. What it wasn't was measured. This post is about the measurement — and what we found when we finally did it.
The short version:
- A ±1 year_built drift alone pushes two otherwise-identical listings below the 98 threshold. That's a structural blind spot, not a bug.
- A full scan of 462,400 neighbor pairs from the live database confirms this signature dominates the 92-98 score band.
- Learned logistic weights recover a clean story: price and age carry the signal; the three binary dimensions are dead weight.
- The current threshold catches only about 20% of realistic relists. Dropping it from 98 to 94 is a one-line change that recovers ~4× more duplicates at a modest precision cost.
Everything below is reproducible from the scripts in random-one-off-scripts/synthetic_dedup/.
1. The structural hole
Start with arithmetic. Dim 3 of the vector is (reference_year - year_built - 25) / 15. If two listings differ by exactly one calendar year in year_built, their squared difference on that dimension is (1/15)² ≈ 0.00444. The Euclidean distance increases by √0.00444 = 1/15 ≈ 0.067 from that dimension alone, and the similarity score lands in the low-to-mid 90s.
That's below 98. So the current vector structurally cannot merge any pair of listings that differ only by one calendar year of year_built — a pattern Suumo and Athome routinely exhibit on the same building because they report different "construction completed" years.
The fix isn't subtle: either widen dim 3's scale, or drop the threshold. Dropping the threshold is a one-line edit and doesn't require rebuilding 400k vectors.
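The arithmetic above is easy to verify directly. A few lines of Python reproduce the dim-3 numbers (the reference year of 2025 is an assumption; the diff doesn't depend on it):

```python
import math

def dim3(year_built, reference_year=2025):
    # Dim 3 of the trigger's vector: (reference_year - year_built - 25) / 15
    # reference_year = 2025 is an assumption; the diff is independent of it.
    return (reference_year - year_built - 25) / 15

# Two otherwise-identical listings whose year_built differs by one year
sq_diff = (dim3(2000) - dim3(2001)) ** 2
print(round(sq_diff, 5))             # 0.00444
print(round(math.sqrt(sq_diff), 3))  # 0.067 added to the Euclidean distance
```

Any pair of years one apart gives the same result, which is what makes this a structural property rather than an edge case.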
2. Do real pairs look like the toy example?
Before modeling anything, we ran a direct question against the live database: what do near-duplicate pairs actually look like?
The query is a self-join through pgvector's HNSW index. For every listing, fetch the top-3 nearest neighbors with score ≥ 85, join back to the listings table, compute per-field diffs, and bucket by similarity. The full scan returned 462,400 pairs.
Two findings stand out:
The ±1 year signature replicates across the entire database. In the ≥ 95 bucket, 40.3% of pairs with known year values differ by exactly one year, and essentially none differ by more than two. This isn't a quirk of our synthetic generator — it's what production near-duplicates actually look like.
The real measurement noise floor is small. In the 95-98 bucket, median |size_diff| is 0.86% and median |price_diff| is 1.82%. That's the amount of disagreement we see between pairs the system already considers "very similar." Any evaluation we run has to match this noise level to be honest.
(There's a third, quieter finding: 250,399 of the ≥ 95 pairs are more than 50 km apart. Those are what the character-bigram Jaccard location filter from part 1 catches in production — confirmation that the geography-free vector leans heavily on the filter as a safety net.)
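For reference, a character-bigram Jaccard check has roughly this shape — a minimal sketch with illustrative addresses; the production filter's exact normalization may differ:

```python
def bigrams(s: str) -> set:
    # Overlapping character pairs, e.g. "東京都" -> {"東京", "京都"}
    return {s[i:i + 2] for i in range(len(s) - 1)}

def location_jaccard(a: str, b: str) -> float:
    A, B = bigrams(a), bigrams(b)
    return len(A & B) / max(len(A | B), 1)

# Same address, different formatting: comfortably above the 0.15 cutoff
print(location_jaccard("東京都世田谷区経堂1丁目", "東京都世田谷区経堂1-2-3"))  # 0.6
# Different prefectures: zero overlap, pair rejected
print(location_jaccard("東京都世田谷区", "大阪府北区"))  # 0.0
```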
3. Learning what the vector should have weighted
With noise calibrated to 1%, we built a labeled synthetic dataset: 12,159 pairs across five categories — synthetic relists, hard negatives (sibling units), cross-region, same-region-type, and cross-type. Then we trained three scorers on squared per-dimension diffs and compared them with 5-fold CV:
| model | AUC | AP | P@Recall=0.95 |
|---|---|---|---|
| baseline (current L2) | 0.962 ± 0.003 | 0.934 ± 0.008 | 0.804 ± 0.012 |
| logistic regression | 0.971 ± 0.002 | 0.945 ± 0.007 | 0.872 ± 0.013 |
| LightGBM | 0.989 ± 0.001 | 0.978 ± 0.003 | 0.947 ± 0.005 |
Logistic regression delivers a 6.8 percentage point lift in P@Recall=0.95 over the hand-tuned baseline. While LightGBM's raw numbers look superior, our robustness tests revealed a classic "shortcut learning" trap.
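None of these scorers is exotic. Here is a numpy-only stand-in for the logistic setup — synthetic stand-in data and hand-rolled gradient descent rather than the actual training scripts, so treat it as a sketch of the shape, not the pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: squared per-dimension diffs on 9 dims.
# Duplicates (y=1) have small diffs, non-duplicates larger ones.
n = 2000
y = rng.integers(0, 2, n)
spread = np.where(y[:, None] == 1, 0.05, 0.3)
X = rng.normal(0.0, spread, (n, 9)) ** 2

# Plain logistic regression by gradient descent on the squared diffs
w, b = np.zeros(9), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-np.clip(X @ w + b, -30, 30)))
    w -= 5.0 * (X.T @ (p - y) / n)
    b -= 5.0 * np.mean(p - y)

# ROC AUC = P(dup outranks non-dup), computed from score ranks
scores = X @ w + b
ranks = scores.argsort().argsort()
pos = ranks[y == 1]
auc = (pos.sum() - len(pos) * (len(pos) - 1) / 2) / (len(pos) * (n - len(pos)))
print(f"AUC: {auc:.3f}")
```

The real benchmark swaps in the 12,159 labeled pairs and 5-fold CV; the point is that the whole model is nine weights and an intercept.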
The Shortcut Discovery
In early benchmarks, LightGBM achieved a perfect 1.000 AUC. When we interrogated the model's feature importance (gain), we found it had built a single split: if size_diff == 0 then duplicate.
In our initial synthetic generator, relists were produced with perfect size consistency. In the real world, Suumo and Athome routinely disagree on square meters by ~1% due to rounding or balcony rules. Once we injected the 1% measurement noise we observed in the live database, LightGBM's "perfect" splits collapsed. The logistic model—which weights the underlying concepts—proved far more stable across unseen perturbation types.
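The noise injection itself is a one-line perturbation. A sketch — the generator internals are simplified here; only the 1% figure comes from the live-pair measurement above:

```python
import numpy as np

rng = np.random.default_rng(42)

true_size = np.full(1000, 70.0)  # 70 sqm, perfectly consistent relists
# ~1% multiplicative measurement noise, matching the observed noise floor
listed_size = true_size * rng.normal(1.0, 0.01, 1000)

exact_matches = np.mean(listed_size == true_size)
print(exact_matches)  # 0.0 -- the size_diff == 0 shortcut has nothing to split on
```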
4. Why the coefficients matter
Translating the learned logistic coefficients back into our trigger scales gives us a clear ranking of which features actually separate "relists" from "sibling units" (different units in the same building).
The coefficients fell into three distinct groups:
- The Dominants: Log Price (-15.5) and Age (-11.9) do the heavy lifting. They are the highest-resolution signals we have.
- The Meaningful: Size (-8.5) and sqm/room (-9.0) are strong secondary signals. Size-per-room is particularly effective at encoding layout efficiency.
- The Dead Weight: The binary dimensions (LDK flag, house-vs-condo, storage) all have coefficients near zero. They help separate trivially different listings, but contribute almost nothing to the hard task of identifying relists.
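To make the ranking concrete, here is a hedged sketch of the learned scorer using the coefficients above — the intercept and the example diff magnitudes are illustrative assumptions, not values from the trained model:

```python
import math

# Learned coefficients on squared per-dimension diffs (from the groups above);
# near-zero binary dims omitted. The intercept is an assumption for illustration.
COEFS = {"log_price": -15.5, "age": -11.9, "size": -8.5, "sqm_per_room": -9.0}
INTERCEPT = 4.0  # assumption, not the trained value

def p_duplicate(sq_diffs):
    z = INTERCEPT + sum(c * sq_diffs.get(k, 0.0) for k, c in COEFS.items())
    return 1 / (1 + math.exp(-z))

# A ±1-year relist: near-zero diffs everywhere, (1/15)^2 on the age dim
relist = {"log_price": 0.0003, "age": (1 / 15) ** 2,
          "size": 0.0001, "sqm_per_room": 0.0001}
print(round(p_duplicate(relist), 2))  # 0.98 -- the relist survives easily
```

Unlike the unweighted L2 score, a one-year age drift barely dents the probability, because the age coefficient is scaled against the other signals instead of counting equally.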
5. The threshold is way too tight
The last experiment measured precision and recall for every threshold between 90 and 98. The results suggest we've been leaving a massive amount of recall on the table:
| threshold | recall | precision | F1 |
|---|---|---|---|
| 98 (current) | 20.5% | 95.9% | 0.337 |
| 96 | 58.3% | 94.8% | 0.722 |
| 94 | 82.3% | 92.3% | 0.870 |
| 92 | 90.4% | 87.9% | 0.892 |
These numbers don't yet include the Jaccard location filter, which runs after the vector match and rejects any pair with less than 0.15 location-string overlap. In production, precision at any threshold will be higher than the table shows, because the location filter catches the cross-prefecture false positives (the 250k pairs >50 km apart mentioned earlier). So the recommended operating point of 94 is conservative.
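The sweep itself is a few lines. A toy version with made-up scores (the real one runs over the 12k labeled synthetic pairs):

```python
import numpy as np

def sweep(scores, labels, thresholds):
    """Recall/precision/F1 for 'duplicate' = score >= threshold (label 1 = relist)."""
    out = []
    n_pos = max(int((labels == 1).sum()), 1)
    for t in thresholds:
        pred = scores >= t
        tp = int(np.sum(pred & (labels == 1)))
        precision = tp / max(int(pred.sum()), 1)
        recall = tp / n_pos
        f1 = 2 * precision * recall / max(precision + recall, 1e-9)
        out.append((t, recall, precision, f1))
    return out

# Toy scored pairs, purely illustrative
scores = np.array([99, 97, 95, 93, 91, 99, 90, 96])
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])
for t, r, p, f in sweep(scores, labels, [98, 96, 94, 92]):
    print(f"threshold {t}: recall {r:.2f}, precision {p:.2f}, F1 {f:.2f}")
```

Even on toy data the pattern matches the table: recall climbs steeply as the threshold drops, while precision erodes slowly.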
6. What ships
The beauty of this whole exercise is that the highest-leverage fix is also the cheapest. A one-line change to scraper/config/prod.py:
```python
class SuumoConfig:
    VECTOR_SIMILARITY_THRESHOLD = 94.0  # was 98.0
```
No vector rebuild. No HNSW reindex. No migration. Instantly reversible if the real-world precision turns out worse than the synthetic benchmark predicts. The plan is to ship this first, monitor for a week against find_recent_suumo_duplicates.py in dry-run mode, and watch for Jaccard-filter escapes before committing fully.
The trigger changes — widening scales on the meaningful dims, dropping the binary dims — are strictly more expensive (every vector needs recomputing, then the HNSW index needs rebuilding on 400k rows) and deserve a hand-labeled oracle set before shipping. That's what the next iteration of this evaluation will be built on.
What I'd do differently
Two things.
Build the evaluation loop earlier. The original vector shipped with essentially no quantitative validation. It worked, and the threshold was "conservative enough," but "conservative enough" turned out to mean "catching 20% of what it could." An offline benchmark that reproduces the trigger byte-for-byte is not hard to write, and once it exists, every change to the trigger becomes a measurable experiment instead of a guess.
Calibrate synthetic data against real data before trusting it. Our first pass used 3% measurement noise, eyeballed. It was 3× too aggressive, and it produced misleading coefficients on two dimensions (size and sqm/room both looked collapsed when they weren't). Pulling the real-pair noise floor out of the live database took an afternoon and corrected the story.
Measurement is almost always the lever you wish you'd pulled first.
Scripts and raw results live in random-one-off-scripts/synthetic_dedup/FINDINGS.md in the repo. The next post in this series covers the recommendation side of the same vector.