
Properties as Vectors

software engineering · machine learning · postgres · pgvector

This is the first post on Nippon Homes' ML infrastructure. The platform hosts around 400k active Japanese real-estate listings and serves ~20k monthly users. Each property listing gets mapped to a 9-dimensional vector, and that single representation powers two jobs at once:

  1. Duplicate detection: catching relists of the same property under a new URL.
  2. Recommendations: surfacing listings similar to what you've browsed.

Part 1 is about the vector itself: how we build it, why we scale it the way we do, and what it buys us. Part 2 digs into the recommender.

(Figure: recommended listings)

Before we get into the dimensions, a note on what this post is not. There are no image embeddings here, no learned representations, no transformer encoders staring at listing photos. The entire vector is nine hand-picked numbers computed from structured fields the scraper already has. That's deliberate: we strive for simplicity, and on this dataset simplicity wins. A 9d feature vector fits in a Postgres trigger, updates in microseconds on insert, indexes cleanly with pgvector's HNSW, and produces duplicate detection that's accurate enough to run in production with a conservative threshold and a Jaccard location filter as the safety net. Every dimension is something you could explain to a realtor over coffee. When a result is wrong, you can look at the nine numbers and see why.

I've played with the alternative. At Liquid AI's hackathon I built an image+text embedding model on top of LFM2-VL that took 2nd place, and it's a genuinely interesting direction: feed the listing photos through a vision encoder, fuse them with a text encoder, and get an embedding that captures things a feature vector never will (does the kitchen look renovated? is the living room bright?). But it's orders of magnitude more computationally expensive than a Postgres trigger that does nine arithmetic ops — every new listing would need a GPU forward pass over its photos, and every model update would mean re-embedding all 400k listings. For the jobs we actually have to do (catch relists, rank candidates inside an already-filtered set), the 9d hand-crafted vector is already good enough.


The Core Idea: Properties as Vectors

Every property is a point in 9-dimensional space. Similar properties cluster together. Small Tokyo condos sit in one region, spacious countryside houses in another.

Each listing becomes a 9d feature vector:

$$\vec{v}_{\text{listing}} = [d_0, d_1, d_2, \ldots, d_8] \in \mathbb{R}^9$$

Each dimension captures something about the property.

Seeing It In Action

Here are 6 Tokyo condos and 6 Hokkaido houses from the database, all priced between ¥22M and ¥40M. Even though the prices overlap, they sit in very different regions of our 9-dimensional space. Since we can't visualize 9 dimensions, I projected them into 3D with Principal Component Analysis (PCA):

(Interactive figure: Tokyo condos vs Hokkaido houses projected into 3D — drag to rotate)
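The projection step is small enough to sketch directly. Here's a minimal NumPy version of the same idea — note the two clusters below are synthetic stand-ins for the twelve real listings, not data from the database:

```python
import numpy as np

def pca_project(vectors: np.ndarray, n_components: int = 3) -> np.ndarray:
    """Project rows of `vectors` onto their top principal components."""
    centered = vectors - vectors.mean(axis=0)
    # SVD of the centered data; rows of vt are the principal directions
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Two synthetic 9-d clusters standing in for Tokyo condos / Hokkaido houses
rng = np.random.default_rng(0)
tokyo = rng.normal(loc=[-0.5, 1.0, 1.5, 0.0, -0.3, 1, -1, -0.3, -1],
                   scale=0.1, size=(6, 9))
hokkaido = rng.normal(loc=[1.5, -0.2, -1.0, 0.5, 1.0, 1, 1, 0.5, -1],
                      scale=0.1, size=(6, 9))

points_3d = pca_project(np.vstack([tokyo, hokkaido]))
print(points_3d.shape)  # (12, 3)
```

Because the between-cluster gap dwarfs the within-cluster noise, the first principal component ends up separating the two groups, which is exactly what the plot above shows.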

The blue points (Tokyo condos) cluster in one region and the orange points (Hokkaido houses) form their own group. Tokyo condos are smaller (48 to 64m²) with a higher price per m², while Hokkaido houses are larger (74 to 251m²) with more rooms.

Similar properties end up as neighbors in this space, and that neighborhood structure is what makes the two use cases below possible.


The 9 Dimensions We Track

Here's the PostgreSQL trigger function update_listing_vector that computes the feature vector whenever a listing is inserted or updated:

DECLARE
    vector_data vector(9);
    normalized_rooms TEXT;
    room_count INTEGER;
    is_land BOOLEAN;
    reference_year INTEGER := 2025;
BEGIN
    -- Land listings have no building: no rooms, no age, no layout.
    -- We zero out every dimension that only makes sense for a structure
    -- and let size + price carry the similarity signal.
    is_land := (NEW.listing_type = '土地');

    IF is_land THEN
        normalized_rooms := '';
        room_count := 0;
    ELSE
        normalized_rooms := normalize_rooms(NEW.rooms);
        room_count := COALESCE(
            (regexp_match(normalized_rooms, '(\d+)'))[1]::integer,
            3
        );
    END IF;

    -- Scaling strategy: heuristic z-score normalization
    -- Centers ≈ dataset mean (e.g., 100m² actual mean: 100.7m²)
    -- Scales ≈ "practical range" (~0.5-1σ), NOT true standard deviations
    -- This keeps most values in [-2, +2] and prevents high-variance
    -- features from dominating distance calculations.
    -- Binary features use ±1 for equal "vote weight" regardless of
    -- class imbalance.

    vector_data := ARRAY[
        -- Dim 0: Size — center 100m² ≈ mean, scale 50 (~0.6σ, actual σ≈85)
        (COALESCE(NEW.size_sqm, 100) - 100.0) / 50.0,

        -- Dim 1: Price — log-scaled, center 17 ≈ ln(¥24M) ≈ actual mean 17.03
        (LN(GREATEST(COALESCE(NEW.price, 25000000), 1000000)) - 17.0) / 1.0,

        -- Dim 2: Price/m² — center 300k (actual mean ~454k, Tokyo skews high)
        -- Scale 200k (~0.3σ) compresses outliers intentionally
        CASE
            WHEN COALESCE(NEW.size_sqm, 0) > 10
            THEN (COALESCE(NEW.price, 0) / NEW.size_sqm - 300000.0) / 200000.0
            ELSE 0
        END,

        -- Dim 3: Age — center 25 years, scale 15 (zeroed for land)
        CASE WHEN is_land THEN 0.0
             ELSE (reference_year - COALESCE(NEW.year_built, 1995) - 25.0) / 15.0 END,

        -- Dim 4: Room count — center 3.5, scale 1.5 (zeroed for land)
        CASE WHEN is_land THEN 0.0 ELSE (room_count - 3.5) / 1.5 END,

        -- Dim 5-8: Binary features use ±1 (not normalized to class imbalance)
        -- This gives equal weight in distance calculations. Land listings
        -- sit at 0.0 so they don't vote "yes" or "no" on features they
        -- don't have.

        -- Dim 5: LDK flag (modern layout)
        CASE WHEN is_land THEN 0.0
             WHEN normalized_rooms LIKE '%LDK%' THEN 1.0 ELSE -1.0 END,

        -- Dim 6: Listing type (house vs condo, land sits at 0)
        CASE WHEN NEW.listing_type IN ('中古一戸建て', '新築一戸建て') THEN 1.0
             WHEN is_land THEN 0.0 ELSE -1.0 END,

        -- Dim 7: Size efficiency (sqm per room) — center 25, scale 10
        CASE WHEN is_land THEN 0.0
             ELSE (COALESCE(NEW.size_sqm, 100) / GREATEST(room_count, 1) - 25.0) / 10.0 END,

        -- Dim 8: Has storage room (+S or SLDK pattern)
        CASE WHEN is_land THEN 0.0
             WHEN normalized_rooms ~ '\+S|S[LDK]' THEN 1.0 ELSE -1.0 END

    ]::vector(9);

    INSERT INTO vecs.listing_vecs (id, vec, listing_id)
    VALUES (NEW.id, vector_data, NEW.listing_id)
    ON CONFLICT (id)
    DO UPDATE SET
        vec = EXCLUDED.vec,
        listing_id = EXCLUDED.listing_id;

    RETURN NEW;
END;

What's Not In The Vector

You'll notice something missing: there's no dimension for where the property is. A 65m² 3LDK in Shibuya and an identical one in Kushiro land on the exact same point in 9d space. That's intentional, not an oversight. Geography is a hard constraint, not a fuzzy signal, and mixing it into L2 distance would always be a lossy approximation. Instead, the recommender pre-filters candidates with a SQL bounding box on lat/lng before the vector ever gets consulted, and the duplicate detector runs a character-bigram Jaccard check on the location strings as a safety net (more on that later). Geography lives at a different layer, and keeping it out of the vector is what lets the vector stay nine clean numbers.


The Need to Scale Each Dimension Independently

Each dimension measures something on a different scale: price is in millions of yen, age in years, room count a single digit. Thrown raw into a distance calculation, price would swamp everything — a ¥5M gap would dwarf a 20-year age gap, even though the age matters far more to a buyer.

So we shrink every dimension to a common range: recenter so a typical value is 0, then divide by roughly how much that feature varies. Almost every listing ends up with numbers between about -2 and +2, and no single feature can dominate the distance.

The centers and scales hardcoded in the trigger (100m², 17.0, 300k, 25 years) were computed once from the initial scrape — means and standard deviations over every listing, then rounded for readability. Freezing them as constants keeps the trigger a pure function of the row, at the cost of needing a manual recalculation if the market ever meaningfully shifts. So far it hasn't.

The yes/no features get +1 / -1 instead of being weighted by class frequency. If only 30% of listings have a storage room, weighting would make "has storage" count louder just because it's rarer — but we want both answers to pull with equal strength when we're asking whether two homes are the same kind of place.
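For offline experiments it helps to reproduce the trigger's arithmetic outside the database. This is a minimal sketch, not production code: `featurize` is a hypothetical helper, the constants are copied from the trigger, and `normalize_rooms` is simplified down to plain string checks:

```python
import math

REF_YEAR = 2025  # matches reference_year in the trigger

def featurize(size_sqm=None, price=None, year_built=None,
              rooms="3LDK", listing_type="中古マンション"):
    """9-d feature vector mirroring update_listing_vector (non-land listings)."""
    size = size_sqm if size_sqm is not None else 100.0
    price = price if price is not None else 25_000_000
    year = year_built if year_built is not None else 1995
    # First digit in the layout string is the room count, default 3
    digits = [c for c in rooms if c.isdigit()]
    room_count = int(digits[0]) if digits else 3
    return [
        (size - 100.0) / 50.0,                                    # dim 0: size
        (math.log(max(price, 1_000_000)) - 17.0) / 1.0,           # dim 1: log price
        ((price / size - 300_000.0) / 200_000.0) if size > 10 else 0.0,  # dim 2
        (REF_YEAR - year - 25.0) / 15.0,                          # dim 3: age
        (room_count - 3.5) / 1.5,                                 # dim 4: rooms
        1.0 if "LDK" in rooms else -1.0,                          # dim 5: LDK flag
        1.0 if listing_type in ("中古一戸建て", "新築一戸建て") else -1.0,  # dim 6
        (size / max(room_count, 1) - 25.0) / 10.0,                # dim 7: m²/room
        1.0 if ("+S" in rooms or "SLDK" in rooms) else -1.0,      # dim 8: storage
    ]

v = featurize(size_sqm=65, price=40_000_000, year_built=2000, rooms="3LDK")
print([round(x, 3) for x in v])
# [-0.7, 0.504, 1.577, 0.0, -0.333, 1.0, -1.0, -0.333, -1.0]
```

Every value lands in roughly [-2, +2], which is the whole point: no dimension can out-shout the others in an L2 distance.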

To see why the scaling matters, here's what each dimension contributes to the distance between Property A and Property C (a similar but genuinely different unit, introduced in the next section), before and after scaling:

Property A vs Property C: where does each dimension's squared difference go? Same two listings, same nine features — first with nothing rescaled, then with each dimension centered and compressed.

Before scaling — L2(A, C) ≈ 3,000,048 (units are basically "yen," since price dominates):

    Size (m²)        1.0e-10%
    Price (¥)        100.00%
    Price / m²       3.2e-3%
    Age (years)      4.4e-11%
    Room count       0%
    LDK flag         0%
    House vs condo   0%
    Size / room      1.1e-11%
    Storage room     0%

Price alone contributes essentially 100% of the distance. Every other feature is a rounding error, so the two listings look "¥3M apart" and nothing else registers.

After scaling — L2(A, C) ≈ 0.209 (unitless, every dimension has a fair share):

    Size (m²)        8.21%
    Price (¥)        12.15%
    Price / m²       16.48%
    Age (years)      40.35%
    Room count       0%
    LDK flag         0%
    House vs condo   0%
    Size / room      22.81%
    Storage room     0%

Age is now the biggest single contributor, followed by size-per-room and price/m². Price still matters, but it no longer drowns the signal.

Same pair of homes, two completely different stories. The raw distance is dominated by whichever feature happens to have the biggest units. Scaling lets each dimension have a fair say, so "close" and "far" actually track what matters to a buyer.

Use Case 1: Finding Duplicates

Suumo and Athome relist properties constantly. Same house, a slightly different price after two weeks on the market, sometimes a different agent, sometimes freshly uploaded photos. The URL changes, the title is tweaked, and naïve string-matching misses the overlap. In 9-dimensional feature space, though, a relisted property barely moves.

Consider three listings: one reference and two candidates.

                   Property A    Property B                Property C
                   (reference)   (relisted a week later)   (different unit)
    Size           65m²          65m²                      68m²
    Price          ¥40.0M        ¥39.8M                    ¥43.0M
    Year built     2018          2018                      2016
    Rooms          3LDK          3LDK                      3LDK
    Storage room   no            no                        no
    Type           condo         condo                     condo

Property B is Property A relisted after the seller dropped the price by ¥200k (0.5%). Property C is a genuinely different unit in a similar building: a bit larger, a bit more expensive, two years older.

Running each through update_listing_vector:

$$\vec{v}_A = [-0.70,\; 0.504,\; 1.577,\; -1.133,\; -0.333,\; 1,\; -1,\; -0.333,\; -1]$$
$$\vec{v}_B = [-0.70,\; 0.500,\; 1.562,\; -1.133,\; -0.333,\; 1,\; -1,\; -0.333,\; -1]$$
$$\vec{v}_C = [-0.64,\; 0.577,\; 1.662,\; -1.000,\; -0.333,\; 1,\; -1,\; -0.233,\; -1]$$

A and B differ only in the two price-related dimensions, and the log-scaled price barely moves. A and C differ across five dimensions at once: size, price, price/m², age, and size efficiency.

Turning Distance Into A Score

We compute Euclidean (L2) distance between two vectors:

$$d_{L2}(\vec{u}, \vec{v}) = \sqrt{\sum_{i=0}^{8} (u_i - v_i)^2}$$

then convert to a 0 to 100 similarity score using the exact formula the scraper uses:

$$\text{score}(\vec{u}, \vec{v}) = 100 \cdot e^{-d_{L2}(\vec{u}, \vec{v})/2}$$

Running it on our three listings:

  • $d(A, B) \approx 0.016$, which gives a score of 99.2
  • $d(A, C) \approx 0.209$, which gives a score of 90.1
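The whole pipeline from L2 distance to similarity score fits in a few lines. Plugging in the three vectors from above reproduces both numbers:

```python
import math

def l2(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def score(u, v):
    """Convert L2 distance to a 0-100 similarity score: 100 * e^(-d/2)."""
    return 100 * math.exp(-l2(u, v) / 2)

v_a = [-0.70, 0.504, 1.577, -1.133, -0.333, 1, -1, -0.333, -1]
v_b = [-0.70, 0.500, 1.562, -1.133, -0.333, 1, -1, -0.333, -1]
v_c = [-0.64, 0.577, 1.662, -1.000, -0.333, 1, -1, -0.233, -1]

print(round(l2(v_a, v_b), 3), round(score(v_a, v_b), 1))  # 0.016 99.2
print(round(l2(v_a, v_c), 3), round(score(v_a, v_c), 1))  # 0.209 90.1
```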

The Threshold

# scraper/config/prod.py
class SuumoConfig:
    VECTOR_SIMILARITY_THRESHOLD = 98.0

Anything at or above 98 gets flagged as a duplicate. Working backwards from $100 \cdot e^{-d/2} = 98$ gives $d \approx 0.04$, a very tight radius in 9-dimensional space. At that distance, essentially every one of the 9 features has to agree to within a fraction of its scaled range. It's a deliberately conservative threshold, tuned to keep false-positive merges near zero.
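That back-of-the-envelope inversion is easy to check — solving the score formula for distance at the configured threshold:

```python
import math

THRESHOLD = 98.0  # VECTOR_SIMILARITY_THRESHOLD from the scraper config
# Invert score = 100 * exp(-d / 2) to recover the distance radius
radius = -2 * math.log(THRESHOLD / 100)
print(round(radius, 4))  # 0.0404
```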

Property B (score 99.2) clears it easily. Property C (score 90.1) is clearly similar to A (same layout, same city, same era) but falls well short of the threshold, which is correct: it really is a different unit.

The Location Gotcha

Vector similarity alone isn't enough. Two listings with identical features in completely different cities are still different properties, and there's no dimension in our vector for geography. That's by design, since location clustering would fight the "similar kind of home, wherever it is" goal of the recommender.

So before we merge, we run a sanity check on the location strings the scraper extracted back in listings_suumo.py:

location = unit.css('dt:contains("所在地") + dd::text').get()

The two location strings are compared using character-bigram Jaccard similarity. If the score comes back below 0.15, we reject the duplicate flag even when the vector score is above 98. Bigrams are forgiving enough to survive small formatting differences ("東京都新宿区" vs "東京都 新宿区") but still reject two genuinely different neighborhoods. It's the kind of cheap post-filter that exists precisely because vectors are necessary but not sufficient. The moment a feature you didn't encode starts to matter, you need a safety net.
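The Jaccard helper itself isn't shown in the post, but character-bigram Jaccard is only a few lines. A minimal sketch, assuming whitespace is kept as part of the string (which is exactly what makes bigrams forgiving of formatting noise):

```python
def bigrams(s: str) -> set:
    """All overlapping two-character substrings of s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two strings' bigram sets: |A∩B| / |A∪B|."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba and not bb:
        return 1.0
    return len(ba & bb) / len(ba | bb)

# Formatting noise survives: only the bigrams touching the space differ
print(jaccard("東京都新宿区", "東京都 新宿区"))  # 4/7 ≈ 0.571, well above 0.15
# Genuinely different places share no bigrams at all
print(jaccard("東京都新宿区", "北海道釧路市"))  # 0.0, rejected
```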

The Actual Query

The scraper asks Postgres for candidates using pgvector's L2 operator:

SELECT v.listing_id,
       v.vec <-> qv.query_vector AS l2_dist
FROM vecs.listing_vecs v, query_vec qv
ORDER BY v.vec <-> qv.query_vector
LIMIT 10;

<-> is pgvector's L2 distance operator. An HNSW index on vecs.listing_vecs.vec turns this from an O(n) sequential scan into an approximate nearest-neighbor lookup with roughly logarithmic search time, so it finishes in milliseconds even across all 400k listings. The spider then checks each candidate's score against the 98 threshold and runs the Jaccard location filter before deciding whether to merge.


Use Case 2: Recommending Similar Listings

The same 9-dimensional vectors also power property recommendations, but with a much looser bar. "Close enough to be the same listing" is a threshold problem (score ≥ 98). "Close enough to be interesting to the same user" is a ranking problem: pick the top-k neighbors of a user's browsing history and show them in order, regardless of absolute distance.

That's Part 2 of this series. It covers how we turn a user's browsing history into a preference vector, how we use the same HNSW index for fast nearest-neighbor search, and how we mix in random candidates to avoid trapping users in a filter bubble.


Putting It All Together

One representation, two thresholds, two very different use cases. It helps to see them together. Here's a 2D slice through the 9-dimensional space around Property A, with every distance drawn to scale:

(Figure: 2D slice through 9-dimensional feature space around Property A — inner duplicate zone L2 ≤ 0.04; recommendation ring 0.04 < L2 < 0.30; A — reference, B — duplicate (score 99.2) inside the duplicate zone, C — similar but not duplicate in the ring; other listings scattered beyond)

Two circles centered on Property A. The tight inner blue circle is the duplicate zone: anything that lands inside has an L2 distance below 0.04 and a similarity score above 98, meaning the vector says it's effectively the same listing. Property B — the relist one week later with a ¥200k price drop — sits right on top of A, well inside. The scraper merges it without a second thought.

The outer purple circle marks the outer edge of the recommendation zone, and the zone itself isn't the whole disk: it's the ring between the two circles, roughly 0.04 < L2 < 0.30. That ring is where "close enough to be interesting" lives — close enough that a user who liked A would probably also like what's in there, but far enough that it's obviously a different listing. Property C falls inside the ring, along with a handful of other listings from the catalogue. Everything outside the purple circle is the rest of the 400k catalogue: too different to surface.

Duplicate detection asks "is this inside the blue circle?" Recommendations ask "what's in the ring?" Same vector, same pgvector index, same L2 distance — just different radii.
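Put in code, that picture is a single function. The duplicate radius follows from the 98-score threshold; the 0.30 outer edge of the ring is taken from the figure and is illustrative rather than a hard config value:

```python
import math

DUP_RADIUS = -2 * math.log(0.98)   # ≈ 0.04, inverted from the 98-score threshold
REC_RADIUS = 0.30                  # illustrative outer edge of the ring

def classify(l2_dist: float) -> str:
    """Sort a neighbor into one of the three zones around a reference listing."""
    if l2_dist <= DUP_RADIUS:
        return "duplicate"         # merge (after the Jaccard location check)
    if l2_dist <= REC_RADIUS:
        return "recommend"         # close enough to be interesting
    return "ignore"                # the rest of the catalogue

print(classify(0.016))  # duplicate  (Property B)
print(classify(0.209))  # recommend  (Property C)
print(classify(1.5))    # ignore
```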


Built with PostgreSQL, pgvector, and a healthy appreciation for linear algebra.