Properties as Vectors
How Nippon Homes maps listings into a 9D vector space for duplicate detection and fast, interpretable recommendations with PostgreSQL + pgvector.
This is the first post in a short series about the ML infrastructure behind Nippon Homes. The platform currently hosts around 400k active Japanese real estate listings and serves roughly 20k monthly users.
Each listing is mapped into a 9-dimensional vector, and that single representation powers two production jobs:
- Duplicate detection: catch relists of the same property under new URLs.
- Recommendations: surface listings similar to what a user has browsed.
This post focuses on the vector itself: how we build it, why we scale it the way we do, and why this design works well in production. A follow-up post covers recommendation ranking.
Before we get into dimensions, here's what this post is not:
- No image embeddings
- No transformer encoders
- No learned latent representations
The vector is nine hand-crafted numbers from structured fields the scraper already extracts. That choice is deliberate. A 9D feature vector can be computed in a PostgreSQL trigger, indexed with pgvector HNSW, and inspected when something looks wrong. If a match is bad, we can see which dimensions drove it.
I have experimented with heavier models too. At Liquid AI's hackathon, I built an image + text embedding system on top of LFM2-VL and placed second. It is powerful, but much more expensive operationally: every new listing needs model inference, and model changes imply a full re-embedding pass. For our main jobs today (dedupe and nearest-neighbor retrieval in filtered candidate sets), the 9D vector is already strong enough.
The Core Idea
Every property is a point in 9-dimensional space:

v = [size, log_price, price_per_m2, age, room_count, has_ldk, listing_type, sqm_per_room, has_storage]

Each component is defined precisely by the trigger function shown below.
Similar properties become neighbors. Small Tokyo condos cluster in one region; larger countryside homes cluster elsewhere.
Seeing It in Action
Below are 6 Tokyo condos and 6 Hokkaido houses from the database, all priced between ¥22M and ¥40M. Price overlaps, but the properties still separate cleanly in feature space.
Since 9D cannot be visualized directly, I project them into 3D with PCA:
Blue points (Tokyo condos) and orange points (Hokkaido houses) form different groups because the underlying feature patterns differ: size, room count, and price-per-sqm.
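The projection itself is a few lines of linear algebra. Here is a minimal PCA sketch using NumPy's SVD, with synthetic stand-in vectors (not the actual listings) whose cluster means mimic the condo/house pattern described above:

```python
import numpy as np

def pca_project(X: np.ndarray, k: int = 3) -> np.ndarray:
    """Project rows of X onto the top-k principal components."""
    Xc = X - X.mean(axis=0)                       # center each dimension
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                          # coordinates in the top-k PC basis

rng = np.random.default_rng(0)
# Synthetic stand-ins: small-condo cluster vs large-house cluster in 9D
condos = rng.normal([-0.7, 0.5, 1.5, 0.0, -0.3, 1, -1, -0.3, -1], 0.2, (6, 9))
houses = rng.normal([0.8, 0.2, -1.0, 0.3, 0.8, -1, 1, 0.5, -1], 0.2, (6, 9))
points_3d = pca_project(np.vstack([condos, houses]), k=3)
print(points_3d.shape)  # (12, 3)
```

Because the two cluster means differ on several dimensions at once, the first principal component alone separates the groups cleanly.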
The 9 Dimensions We Track
This is the PostgreSQL trigger function update_listing_vector that computes the vector on insert/update:
CREATE OR REPLACE FUNCTION update_listing_vector() RETURNS trigger AS $$
DECLARE
  vector_data vector(9);
  normalized_rooms TEXT;
  room_count INTEGER;
  is_land BOOLEAN;
  reference_year INTEGER := 2025;
BEGIN
  -- Land listings have no building: no rooms, no age, no layout.
  -- We zero out every dimension that only makes sense for a structure
  -- and let size + price carry the similarity signal.
  is_land := (NEW.listing_type = '土地');

  IF is_land THEN
    normalized_rooms := '';
    room_count := 0;
  ELSE
    normalized_rooms := normalize_rooms(NEW.rooms);
    room_count := COALESCE(
      (regexp_match(normalized_rooms, '(\d+)'))[1]::integer,
      3
    );
  END IF;

  -- Scaling strategy: heuristic z-score normalization
  -- Centers ≈ dataset mean (e.g., 100m² actual mean: 100.7m²)
  -- Scales ≈ "practical range" (~0.5-1σ), NOT true standard deviations
  -- This keeps most values in [-2, +2] and prevents high-variance
  -- features from dominating distance calculations.
  -- Binary features use ±1 for equal "vote weight" regardless of
  -- class imbalance.
  vector_data := ARRAY[
    -- Dim 0: Size — center 100m² ≈ mean, scale 50 (~0.6σ, actual σ≈85)
    (COALESCE(NEW.size_sqm, 100) - 100.0) / 50.0,
    -- Dim 1: Price — log-scaled, center 17 ≈ ln(¥24M) ≈ actual mean 17.03
    (LN(GREATEST(COALESCE(NEW.price, 25000000), 1000000)) - 17.0) / 1.0,
    -- Dim 2: Price/m² — center 300k (actual mean ~454k, Tokyo skews high)
    -- Scale 200k (~0.3σ) compresses outliers intentionally
    CASE
      WHEN COALESCE(NEW.size_sqm, 0) > 10
        THEN (COALESCE(NEW.price, 0) / NEW.size_sqm - 300000.0) / 200000.0
      ELSE 0
    END,
    -- Dim 3: Age — center 25 years, scale 15 (zeroed for land)
    CASE WHEN is_land THEN 0.0
         ELSE (reference_year - COALESCE(NEW.year_built, 1995) - 25.0) / 15.0 END,
    -- Dim 4: Room count — center 3.5, scale 1.5 (zeroed for land)
    CASE WHEN is_land THEN 0.0 ELSE (room_count - 3.5) / 1.5 END,
    -- Dims 5-8: Binary features use ±1 (not normalized to class imbalance).
    -- This gives equal weight in distance calculations. Land listings
    -- sit at 0.0 so they don't vote "yes" or "no" on features they
    -- don't have.
    -- Dim 5: LDK flag (modern layout)
    CASE WHEN is_land THEN 0.0
         WHEN normalized_rooms LIKE '%LDK%' THEN 1.0 ELSE -1.0 END,
    -- Dim 6: Listing type (house vs condo, land sits at 0)
    CASE WHEN NEW.listing_type IN ('中古一戸建て', '新築一戸建て') THEN 1.0
         WHEN is_land THEN 0.0 ELSE -1.0 END,
    -- Dim 7: Size efficiency (sqm per room) — center 25, scale 10
    CASE WHEN is_land THEN 0.0
         ELSE (COALESCE(NEW.size_sqm, 100) / GREATEST(room_count, 1) - 25.0) / 10.0 END,
    -- Dim 8: Has storage room (+S or SLDK pattern)
    CASE WHEN is_land THEN 0.0
         WHEN normalized_rooms ~ '\+S|S[LDK]' THEN 1.0 ELSE -1.0 END
  ]::vector(9);

  INSERT INTO vecs.listing_vecs (id, vec, listing_id)
  VALUES (NEW.id, vector_data, NEW.listing_id)
  ON CONFLICT (id)
  DO UPDATE SET
    vec = EXCLUDED.vec,
    listing_id = EXCLUDED.listing_id;

  RETURN NEW;
END;
$$ LANGUAGE plpgsql;
What We Intentionally Leave Out
There is no explicit location dimension.
A 65m² 3LDK in Shibuya and an otherwise identical one in Kushiro can map to the same vector point. This is intentional: geography is handled as a hard filter, not a fuzzy similarity signal.
- The recommender pre-filters with SQL bounding boxes on lat/lng.
- Duplicate detection adds a character-bigram Jaccard check on location strings.
Keeping geography outside the vector keeps the representation compact and interpretable.
Why Per-Dimension Scaling Matters
Raw dimensions live on very different scales:
- price: millions of yen
- age: years
- room count: small integers
Without scaling, price would dominate distance. We center each dimension around a typical value and divide by a practical spread, so most values sit around [-2, +2]. This makes dimensions comparable and prevents one feature from drowning out the rest.
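A toy comparison makes the point concrete. The listing values below are hypothetical, but the scaling constants are the trigger's own:

```python
import math

# Two hypothetical listings: (price in yen, age in years, rooms)
a = (24_000_000, 10, 3)
b = (43_000_000, 30, 4)

# Raw Euclidean distance: the price gap swamps everything else
raw = math.dist(a, b)

# Scaled per the trigger's constants: log-price / 1.0, (age-25)/15, (rooms-3.5)/1.5
def scale(price, age, rooms):
    return (math.log(price) - 17.0, (age - 25) / 15, (rooms - 3.5) / 1.5)

scaled = math.dist(scale(*a), scale(*b))

print(f"raw distance    ≈ {raw:,.0f}")    # ~19 million, essentially all price
print(f"scaled distance ≈ {scaled:.2f}")  # every dimension now contributes
```

In the raw version the 20-year age gap and the extra room are invisible next to ¥19M; after scaling, all three differences land on the same order of magnitude.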
The constants in the trigger (100m², 17.0, 300k, 25 years, etc.) were derived from full-dataset statistics during initial setup and then frozen for stability.
For binary dimensions, we use +1/-1 (and 0 for land where the feature is not applicable) so each yes/no feature contributes a balanced vote.
To illustrate the impact with real numbers: between Property A and Property C in the duplicate-detection example below, the raw ¥3M price gap would account for essentially all of the distance. After scaling, the price gap contributes about 0.07, while age contributes 0.13, size 0.06, and price-per-m² 0.08, all on the same order.
Use Case 1: Duplicate Detection
Relists are common on Suumo and Athome: same property, new URL, slight price change, sometimes new photos. String matching alone misses many of these.
In vector space, relists usually move very little.
Consider three listings:
- A = reference listing
- B = relist of A one week later
- C = genuinely different nearby unit
| | Property A (reference) | Property B (relisted a week later) | Property C (different unit) |
|---|---|---|---|
| Size | 65m² | 65m² | 68m² |
| Price | ¥40.0M | ¥39.8M | ¥43.0M |
| Year built | 2018 | 2018 | 2016 |
| Rooms | 3LDK | 3LDK | 3LDK |
| Storage room | no | no | no |
| Type | condo | condo | condo |
Running update_listing_vector:
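The vectors can be re-derived in Python by transcribing the trigger's arithmetic (same centers and scales; this is a re-derivation from the formulas above, not output from the production database):

```python
import math

def listing_vector(size_sqm, price, year_built, rooms, has_ldk, is_house, has_storage):
    """Mirror of the update_listing_vector trigger for non-land listings."""
    return [
        (size_sqm - 100.0) / 50.0,               # Dim 0: size
        math.log(price) - 17.0,                  # Dim 1: log price
        (price / size_sqm - 300_000) / 200_000,  # Dim 2: price per m²
        (2025 - year_built - 25.0) / 15.0,       # Dim 3: age
        (rooms - 3.5) / 1.5,                     # Dim 4: room count
        1.0 if has_ldk else -1.0,                # Dim 5: LDK flag
        1.0 if is_house else -1.0,               # Dim 6: house vs condo
        (size_sqm / rooms - 25.0) / 10.0,        # Dim 7: m² per room
        1.0 if has_storage else -1.0,            # Dim 8: storage room
    ]

A = listing_vector(65, 40_000_000, 2018, 3, True, False, False)
B = listing_vector(65, 39_800_000, 2018, 3, True, False, False)
C = listing_vector(68, 43_000_000, 2016, 3, True, False, False)

print([round(x, 2) for x in A])
print(f"A-B distance ≈ {math.dist(A, B):.3f}")  # tiny: only price moved
print(f"A-C distance ≈ {math.dist(A, C):.3f}")  # spread across several dims
```

A and B end up about 0.016 apart, while A and C are about 0.21 apart, an order of magnitude difference.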
A and B differ mostly in tiny price movement. A and C differ across multiple dimensions.
Distance and Similarity Score
Euclidean distance:

d(u, v) = sqrt( Σᵢ (uᵢ − vᵢ)² )
Similarity score used by the scraper:
For the example:
- A↔B: distance ≈ 0.016, deep inside the duplicate radius
- A↔C: distance ≈ 0.21, roughly five times outside it
Production Threshold
# scraper/config/prod.py
class SuumoConfig:
VECTOR_SIMILARITY_THRESHOLD = 98.0
score >= 98 implies an L2 radius of about 0.04, which is intentionally strict to keep false merges low.
Location Safety Net
Vector similarity is necessary, but not sufficient.
Before merging, we compare location strings extracted by the scraper:
location = unit.css('dt:contains("所在地") + dd::text').get()
We compute character-bigram Jaccard similarity over the two strings. If it falls below 0.15, we reject the merge even when the vector score is above the threshold. This catches same-shaped listings in different areas.
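Character-bigram Jaccard is only a few lines. A sketch of the check, using made-up location strings for illustration:

```python
def bigrams(s: str) -> set[str]:
    """All overlapping two-character substrings of s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard(a: str, b: str) -> float:
    """Character-bigram Jaccard similarity of two location strings."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba or not bb:
        return 0.0
    return len(ba & bb) / len(ba | bb)

# Same Shibuya address with full-width vs half-width digit: clears 0.15
print(jaccard("東京都渋谷区神南1丁目", "東京都渋谷区神南１丁目"))
# Different prefecture entirely: fails 0.15, merge rejected
print(jaccard("東京都渋谷区神南1丁目", "北海道釧路市末広町3丁目"))
```

Bigrams tolerate minor formatting differences (full-width digits, dropped suffixes) while still scoring unrelated addresses near zero, which is exactly the behavior a safety net needs.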
Retrieval Query
WITH query_vec AS (
  -- Example CTE supplying the probe vector (the original post elides this part)
  SELECT vec AS query_vector
  FROM vecs.listing_vecs
  WHERE listing_id = $1
)
SELECT v.listing_id,
       v.vec <-> qv.query_vector AS l2_dist
FROM vecs.listing_vecs v, query_vec qv
ORDER BY v.vec <-> qv.query_vector
LIMIT 10;
<-> is pgvector's L2 operator. With HNSW indexing on vecs.listing_vecs.vec, nearest-neighbor lookup stays fast even across 400k listings.
Use Case 2: Recommendations
The same vector also powers recommendations, but with a different objective:
- Duplicate detection: threshold classification (score >= 98)
- Recommendations: top-k ranking by proximity
Same vector. Same distance metric. Different decision rule.
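Both rules can be sketched over the same distance list. The helper names and the example distances here are hypothetical; the 0.04 radius is the distance equivalent of the production score >= 98 threshold:

```python
DUP_RADIUS = 0.04  # distance equivalent of score >= 98

def find_duplicates(dists: dict[str, float]) -> list[str]:
    """Threshold classification: anything inside the tiny radius is a relist."""
    return [lid for lid, d in dists.items() if d <= DUP_RADIUS]

def recommend(dists: dict[str, float], k: int = 3) -> list[str]:
    """Top-k ranking: the nearest neighbors, however far out they sit."""
    return sorted(dists, key=dists.get)[:k]

dists = {"B": 0.016, "C": 0.21, "D": 0.95, "E": 2.4}
print(find_duplicates(dists))  # only B is close enough to merge
print(recommend(dists))        # B, C, D by proximity
```

In production the recommender would also drop anything already flagged as a duplicate, but the core contrast is visible here: one function asks "is this the same listing?", the other asks "what is nearest?".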
Part 2 covers user preference vectors, ranking, and diversity mixing to reduce filter bubbles.
Putting It Together
One representation, two radii, two behaviors:
- tiny radius for "same listing"
- wider ring for "similar enough to recommend"
Here is a 2D slice of that idea centered on Property A:
Property B sits inside the inner duplicate zone. Property C sits outside duplicate range but inside recommendation range. Everything farther out is usually too different to surface.
Built with PostgreSQL, pgvector, and a healthy appreciation for linear algebra.