
Properties as Vectors

How Nippon Homes maps listings into a 9D vector space for duplicate detection and fast, interpretable recommendations with PostgreSQL + pgvector.

software engineering · machine learning · postgres · pgvector

This is the first post in a short series about the ML infrastructure behind Nippon Homes. The platform currently hosts around 400k active Japanese real estate listings and serves roughly 20k monthly users.

Each listing is mapped into a 9-dimensional vector, and that single representation powers two production jobs:

  1. Duplicate detection: catch relists of the same property under new URLs.
  2. Recommendations: surface listings similar to what a user has browsed.

This post focuses on the vector itself: how we build it, why we scale it the way we do, and why this design works well in production. A follow-up post covers recommendation ranking.


Before we get into dimensions, here's what this post is not:

  • No image embeddings
  • No transformer encoders
  • No learned latent representations

The vector is nine hand-crafted numbers from structured fields the scraper already extracts. That choice is deliberate. A 9D feature vector can be computed in a PostgreSQL trigger, indexed with pgvector HNSW, and inspected when something looks wrong. If a match is bad, we can see which dimensions drove it.

I have experimented with heavier models too. At Liquid AI's hackathon, I built an image + text embedding system on top of LFM2-VL and placed second. It is powerful, but much more expensive operationally: every new listing needs model inference, and model changes imply a full re-embedding pass. For our main jobs today (dedupe and nearest-neighbor retrieval in filtered candidate sets), the 9D vector is already strong enough.


The Core Idea

Every property is a point in 9-dimensional space:

\vec{v}_{\text{listing}} = [d_0, d_1, d_2, \ldots, d_8] \in \mathbb{R}^9

Similar properties become neighbors. Small Tokyo condos cluster in one region; larger countryside homes cluster elsewhere.

Seeing It in Action

Below are 6 Tokyo condos and 6 Hokkaido houses from the database, all priced between ¥22M and ¥40M. The price ranges overlap, but the properties still separate cleanly in feature space.

Since 9D cannot be visualized directly, I project them into 3D with PCA:
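The projection itself is only a few lines. Here is a minimal sketch with NumPy, assuming the 9D vectors have already been fetched into an array (the function name is illustrative):

```python
import numpy as np

def project_to_3d(vectors):
    """Project N x 9 listing vectors down to 3D with PCA (via SVD)."""
    X = np.asarray(vectors, dtype=float)
    X = X - X.mean(axis=0)          # PCA requires centered data
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:3].T             # coordinates along the top 3 principal axes
```

Because PCA keeps the directions of greatest variance, two clusters that are separated in 9D stay separated in the 3D picture.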

[Interactive 3D plot: Tokyo condos vs. Hokkaido houses; drag to rotate]

Blue points (Tokyo condos) and orange points (Hokkaido houses) form different groups because the underlying feature patterns differ: size, room count, and price-per-sqm.


The 9 Dimensions We Track

This is the PostgreSQL trigger function update_listing_vector that computes the vector on insert/update:

CREATE OR REPLACE FUNCTION update_listing_vector()
RETURNS TRIGGER AS $$
DECLARE
    vector_data vector(9);
    normalized_rooms TEXT;
    room_count INTEGER;
    is_land BOOLEAN;
    reference_year INTEGER := 2025;
BEGIN
    -- Land listings have no building: no rooms, no age, no layout.
    -- We zero out every dimension that only makes sense for a structure
    -- and let size + price carry the similarity signal.
    is_land := (NEW.listing_type = '土地');

    IF is_land THEN
        normalized_rooms := '';
        room_count := 0;
    ELSE
        normalized_rooms := normalize_rooms(NEW.rooms);
        room_count := COALESCE(
            (regexp_match(normalized_rooms, '(\d+)'))[1]::integer,
            3
        );
    END IF;

    -- Scaling strategy: heuristic z-score normalization
    -- Centers ≈ dataset mean (e.g., 100m² actual mean: 100.7m²)
    -- Scales ≈ "practical range" (~0.5-1σ), NOT true standard deviations
    -- This keeps most values in [-2, +2] and prevents high-variance
    -- features from dominating distance calculations.
    -- Binary features use ±1 for equal "vote weight" regardless of
    -- class imbalance.

    vector_data := ARRAY[
        -- Dim 0: Size — center 100m² ≈ mean, scale 50 (~0.6σ, actual σ≈85)
        (COALESCE(NEW.size_sqm, 100) - 100.0) / 50.0,

        -- Dim 1: Price — log-scaled, center 17 ≈ ln(¥24M) ≈ actual mean 17.03
        (LN(GREATEST(COALESCE(NEW.price, 25000000), 1000000)) - 17.0) / 1.0,

        -- Dim 2: Price/m² — center 300k (actual mean ~454k, Tokyo skews high)
        -- Scale 200k (~0.3σ) compresses outliers intentionally
        CASE
            WHEN COALESCE(NEW.size_sqm, 0) > 10
            THEN (COALESCE(NEW.price, 0) / NEW.size_sqm - 300000.0) / 200000.0
            ELSE 0
        END,

        -- Dim 3: Age — center 25 years, scale 15 (zeroed for land)
        CASE WHEN is_land THEN 0.0
             ELSE (reference_year - COALESCE(NEW.year_built, 1995) - 25.0) / 15.0 END,

        -- Dim 4: Room count — center 3.5, scale 1.5 (zeroed for land)
        CASE WHEN is_land THEN 0.0 ELSE (room_count - 3.5) / 1.5 END,

        -- Dim 5-8: Binary features use ±1 (not normalized to class imbalance)
        -- This gives equal weight in distance calculations. Land listings
        -- sit at 0.0 so they don't vote "yes" or "no" on features they
        -- don't have.

        -- Dim 5: LDK flag (modern layout)
        CASE WHEN is_land THEN 0.0
             WHEN normalized_rooms LIKE '%LDK%' THEN 1.0 ELSE -1.0 END,

        -- Dim 6: Listing type (house vs condo, land sits at 0)
        CASE WHEN NEW.listing_type IN ('中古一戸建て', '新築一戸建て') THEN 1.0
             WHEN is_land THEN 0.0 ELSE -1.0 END,

        -- Dim 7: Size efficiency (sqm per room) — center 25, scale 10
        CASE WHEN is_land THEN 0.0
             ELSE (COALESCE(NEW.size_sqm, 100) / GREATEST(room_count, 1) - 25.0) / 10.0 END,

        -- Dim 8: Has storage room (+S or SLDK pattern)
        CASE WHEN is_land THEN 0.0
             WHEN normalized_rooms ~ '\+S|S[LDK]' THEN 1.0 ELSE -1.0 END

    ]::vector(9);

    INSERT INTO vecs.listing_vecs (id, vec, listing_id)
    VALUES (NEW.id, vector_data, NEW.listing_id)
    ON CONFLICT (id)
    DO UPDATE SET
        vec = EXCLUDED.vec,
        listing_id = EXCLUDED.listing_id;

    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

What We Intentionally Leave Out

There is no explicit location dimension.

A 65m² 3LDK in Shibuya and an otherwise identical one in Kushiro can map to the same vector point. This is intentional: geography is handled as a hard filter, not a fuzzy similarity signal.

  • The recommender pre-filters with SQL bounding boxes on lat/lng.
  • Duplicate detection adds a character-bigram Jaccard check on location strings.

Keeping geography outside the vector keeps the representation compact and interpretable.


Why Per-Dimension Scaling Matters

Raw dimensions live on very different scales:

  • price: millions of yen
  • age: years
  • room count: small integers

Without scaling, price would dominate distance. We center each dimension around a typical value and divide by a practical spread, so most values sit around [-2, +2]. This makes dimensions comparable and prevents one feature from drowning out the rest.

The constants in the trigger (100m², 17.0, 300k, 25 years, etc.) were derived from full-dataset statistics during initial setup and then frozen for stability.

For binary dimensions, we use +1/-1 (and 0 for land where the feature is not applicable) so each yes/no feature contributes a balanced vote.
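To make those constants concrete, here is a hedged Python mirror of the five continuous dimensions (land handling and the normalize_rooms parsing are omitted; the trigger above remains the source of truth):

```python
import math

def scale_continuous(size_sqm, price, year_built, room_count, reference_year=2025):
    """Dims 0-4 of the listing vector, mirroring the trigger's constants."""
    price_per_sqm = price / size_sqm if size_sqm > 10 else None
    return [
        (size_sqm - 100.0) / 50.0,                       # dim 0: size
        (math.log(max(price, 1_000_000)) - 17.0) / 1.0,  # dim 1: log price
        ((price_per_sqm - 300_000.0) / 200_000.0)        # dim 2: price per m²
            if price_per_sqm is not None else 0.0,
        (reference_year - year_built - 25.0) / 15.0,     # dim 3: building age
        (room_count - 3.5) / 1.5,                        # dim 4: room count
    ]
```

For a 65m², ¥40M, 3-room condo built in 2018, the first three dimensions come out to about -0.70, 0.504, and 1.577.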

To illustrate the impact, here is each dimension's distance contribution between Property A and Property C before and after scaling:

Property A vs Property C: where does each dimension's squared difference go? Same two listings, same nine features. On the left, nothing is rescaled; on the right, each dimension is centered and compressed first.

Before scaling: L2(A, C) ≈ 3,000,048 (units are basically "yen," since price dominates)

  • Size (m²): 1.0e-10%
  • Price (¥): 100.00%
  • Price / m²: 3.2e-3%
  • Age (years): 4.4e-11%
  • Room count: 0%
  • LDK flag: 0%
  • House vs condo: 0%
  • Size / room: 1.1e-11%
  • Storage room: 0%

Price alone contributes essentially 100% of the distance. Every other feature is a rounding error, so the two listings look "¥3M apart" and nothing else registers.

After scaling: L2(A, C) ≈ 0.209 (unitless; every dimension gets a fair share)

  • Size (m²): 8.21%
  • Price (¥): 12.15%
  • Price / m²: 16.48%
  • Age (years): 40.35%
  • Room count: 0%
  • LDK flag: 0%
  • House vs condo: 0%
  • Size / room: 22.81%
  • Storage room: 0%

Age is now the biggest single contributor, followed by size-per-room and price/m². Price still matters, but it no longer drowns the signal.

Same pair of homes, two completely different stories. The raw distance is dominated by whichever feature happens to have the biggest units. Scaling lets each dimension have a fair say, so "close" and "far" actually track what matters to a buyer.
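The percentages above are nothing exotic: each dimension's squared difference divided by the total. A quick sketch:

```python
def contribution_shares(u, v):
    """Fraction of the squared L2 distance contributed by each dimension."""
    sq = [(a - b) ** 2 for a, b in zip(u, v)]
    total = sum(sq)
    return [s / total for s in sq]
```

Feeding it the scaled vectors for Properties A and C reproduces the right-hand breakdown, with age contributing roughly 40% of the squared distance.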

Use Case 1: Duplicate Detection

Relists are common on Suumo and Athome: same property, new URL, slight price change, sometimes new photos. String matching alone misses many of these.

In vector space, relists usually move very little.

Consider three listings:

  • A = reference listing
  • B = relist of A one week later
  • C = genuinely different nearby unit
                  Property A    Property B        Property C
                  (reference)   (relist, +1 week) (different unit)
  Size            65m²          65m²              68m²
  Price           ¥40.0M        ¥39.8M            ¥43.0M
  Year built      2018          2018              2016
  Rooms           3LDK          3LDK              3LDK
  Storage room    no            no                no
  Type            condo         condo             condo

Running update_listing_vector:

\vec{v}_A = [-0.70,\; 0.504,\; 1.577,\; -1.133,\; -0.333,\; 1,\; -1,\; -0.333,\; -1]
\vec{v}_B = [-0.70,\; 0.500,\; 1.562,\; -1.133,\; -0.333,\; 1,\; -1,\; -0.333,\; -1]
\vec{v}_C = [-0.64,\; 0.577,\; 1.662,\; -1.000,\; -0.333,\; 1,\; -1,\; -0.233,\; -1]

A and B differ mostly in tiny price movement. A and C differ across multiple dimensions.

Distance and Similarity Score

Euclidean distance:

d_{L2}(\vec{u}, \vec{v}) = \sqrt{\sum_{i=0}^{8} (u_i - v_i)^2}

Similarity score used by the scraper:

\text{score}(\vec{u}, \vec{v}) = 100 \cdot e^{-d_{L2}(\vec{u}, \vec{v}) / 2}

For the example:

  • d(A, B) ≈ 0.016 ⇒ score ≈ 99.2
  • d(A, C) ≈ 0.209 ⇒ score ≈ 90.1
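These numbers are easy to check. A quick verification in Python, using the vectors printed above:

```python
import math

def l2(u, v):
    """Euclidean distance between two 9D listing vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def score(u, v):
    """Similarity score on a 0-100 scale, as used by the scraper."""
    return 100 * math.exp(-l2(u, v) / 2)

A = [-0.70, 0.504, 1.577, -1.133, -0.333, 1, -1, -0.333, -1]
B = [-0.70, 0.500, 1.562, -1.133, -0.333, 1, -1, -0.333, -1]
C = [-0.64, 0.577, 1.662, -1.000, -0.333, 1, -1, -0.233, -1]
```

score(A, B) comes out to about 99.2 and score(A, C) to about 90.1, matching the values above.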

Production Threshold

# scraper/config/prod.py
class SuumoConfig:
    VECTOR_SIMILARITY_THRESHOLD = 98.0

score >= 98 implies an L2 radius of about 0.04, which is intentionally strict to keep false merges low.
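That radius follows directly from inverting the score formula:

```python
import math

# score = 100 * exp(-d/2) >= 98  is equivalent to  d <= -2 * ln(0.98)
duplicate_radius = -2 * math.log(98.0 / 100.0)
```

which evaluates to about 0.0404.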

Location Safety Net

Vector similarity is necessary, but not sufficient.

Before merging, we compare location strings extracted by the scraper:

location = unit.css('dt:contains("所在地") + dd::text').get()

We run character-bigram Jaccard similarity. If it is below 0.15, we reject the duplicate even if vector score is above threshold. This catches same-shaped listings in different areas.
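A minimal sketch of the character-bigram Jaccard check (the production normalization of location strings may differ):

```python
def char_bigrams(s: str) -> set[str]:
    """All overlapping two-character substrings of s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two strings' bigram sets."""
    A, B = char_bigrams(a), char_bigrams(b)
    if not A and not B:
        return 1.0
    return len(A & B) / len(A | B)
```

Two addresses in the same ward share most of their bigrams, while a Tokyo address and a Hokkaido address share almost none and fall under the 0.15 floor.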

Retrieval Query

-- query_vec: a one-row CTE holding the probe vector
SELECT v.listing_id,
       v.vec <-> qv.query_vector AS l2_dist
FROM vecs.listing_vecs v, query_vec qv
ORDER BY v.vec <-> qv.query_vector
LIMIT 10;

<-> is pgvector's L2 operator. With HNSW indexing on vecs.listing_vecs.vec, nearest-neighbor lookup stays fast even across 400k listings.


Use Case 2: Recommendations

The same vector also powers recommendations, but with a different objective:

  • Duplicate detection: threshold classification (score >= 98)
  • Recommendations: top-k ranking by proximity

Same vector. Same distance metric. Different decision rule.
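In code, the difference is just the decision rule applied to the same distances (a sketch; the function names are illustrative):

```python
import math

def is_duplicate(l2_dist: float, threshold: float = 98.0) -> bool:
    """Dedupe: threshold classification on the similarity score."""
    return 100 * math.exp(-l2_dist / 2) >= threshold

def recommend(candidates: list[tuple[str, float]], k: int = 10) -> list[str]:
    """Recommendations: top-k ranking of (listing_id, l2_dist) by proximity."""
    return [lid for lid, _ in sorted(candidates, key=lambda c: c[1])[:k]]
```

With the distances from the earlier example, B (d ≈ 0.016) classifies as a duplicate while C (d ≈ 0.209) does not, yet C still ranks near the top of a recommendation list.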

Part 2 covers user preference vectors, ranking, and diversity mixing to reduce filter bubbles.


Putting It Together

One representation, two radii, two behaviors:

  • tiny radius for "same listing"
  • wider ring for "similar enough to recommend"

Here is a 2D slice of that idea centered on Property A:

[Diagram: 2D slice through the 9-dimensional feature space, centered on A. Inner duplicate zone: L2 ≤ 0.04. Recommendation ring: L2 ≈ 0.04 to 0.30. B lands in the duplicate zone (score 99.2); C is similar but not a duplicate; other listings lie farther out. Hover points for details.]

Property B sits inside the inner duplicate zone. Property C sits outside duplicate range but inside recommendation range. Everything farther out is usually too different to surface.


Built with PostgreSQL, pgvector, and a healthy appreciation for linear algebra.