Case study

eBay Built Search for What You Can't Describe

By The SDL team · 4 min read · Updated Jun 10, 2026

Visual search isn’t a special case. It’s the same embed-and-find-neighbors skeleton as semantic search — with a CNN swapped in for the text encoder.

You’re window-shopping. You see a chair you love. Now try to type what makes it that chair — the exact shade, the leg taper, the era. You can’t. eBay built search for precisely the thing words fail at.

Text search assumes you can describe what you want. For visual, taste-driven discovery — the digital equivalent of wandering a store and pointing — that assumption collapses. A look, a style, an aesthetic doesn't reduce cleanly to keywords. eBay's answer was to let the image itself be the query.

Plain English

If semantic text search matches meaning instead of words, visual search goes a step further and removes words entirely. You give it a picture, and it finds visually similar items. No description, no keywords — just “more things that look like this.”

The mechanism is the same trick as semantic search, applied to pixels. Run each image through a model that turns it into an embedding — a vector capturing its visual characteristics — positioned so that images that look alike sit close together. Searching becomes finding the listing-image vectors nearest your photo's vector.
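
To make "nearest" concrete, here is a minimal sketch of the lookup step on its own. The three-dimensional vectors and listing names are invented for illustration (real image embeddings have hundreds of dimensions and come from a trained model), and cosine similarity is one common choice of closeness measure, not necessarily the one eBay uses.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_listings(query_vec, listing_vecs, k=2):
    """Exact (brute-force) search: score every listing, keep the top k."""
    scored = [
        (cosine_similarity(query_vec, vec), listing_id)
        for listing_id, vec in listing_vecs.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [listing_id for _, listing_id in scored[:k]]

# Hypothetical embeddings: the two armchairs point in similar directions.
listing_vecs = {
    "mustard_armchair": [0.9, 0.1, 0.3],
    "walnut_armchair": [0.8, 0.2, 0.4],
    "steel_desk_lamp": [0.1, 0.9, 0.2],
}
query_vec = [0.9, 0.1, 0.35]  # embedding of the shopper's photo (made up)

print(nearest_listings(query_vec, listing_vecs))
```

Scoring every listing like this is exact but scales linearly with the catalog, which is why large systems replace the full scan with an approximate index.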

The same skeleton, a different sense

eBay integrated image-embedding vector search into its ranking stack. A convolutional neural network encodes each listing photo into a vector offline; a query image is encoded the same way at search time; an approximate-nearest-neighbor (ANN) lookup returns the visually closest listings, which then feed the ranking pipeline. If you read the Spotify semantic-search teardown in this series, this will feel familiar, because it's the identical pattern (embed, index, ANN) with images swapped in for text.
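
The offline/online split can be sketched in a few lines. Everything here is an assumption made for illustration: `toy_encode` is a stand-in that averages pixel colors (not a CNN), `HyperplaneLSH` is a toy random-hyperplane hashing index standing in for whatever ANN index eBay actually runs, and the listing names and pixel values are invented.

```python
import math
import random

def toy_encode(pixels):
    """Stand-in for the CNN: mean RGB of the image, scaled to unit length."""
    n = len(pixels)
    mean = [sum(p[c] for p in pixels) / n for c in range(3)]
    norm = math.sqrt(sum(x * x for x in mean)) or 1.0
    return [x / norm for x in mean]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

class HyperplaneLSH:
    """Toy ANN index: vectors on the same side of every random
    hyperplane share a bucket, and only that bucket is scanned."""

    def __init__(self, dim, num_planes=4, seed=0):
        rng = random.Random(seed)
        self.planes = [
            [rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)
        ]
        self.buckets = {}

    def _bucket_key(self, vec):
        return tuple(dot(plane, vec) >= 0 for plane in self.planes)

    def add(self, listing_id, vec):
        self.buckets.setdefault(self._bucket_key(vec), []).append((listing_id, vec))

    def query(self, vec, k=3):
        candidates = self.buckets.get(self._bucket_key(vec), [])
        ranked = sorted(candidates, key=lambda item: dot(vec, item[1]), reverse=True)
        return [listing_id for listing_id, _ in ranked[:k]]

# Offline: encode every listing photo once and index the vectors.
listing_photos = {
    "mustard_armchair": [(200, 160, 40), (210, 170, 50)],
    "walnut_armchair": [(120, 80, 50), (130, 90, 60)],
    "steel_desk_lamp": [(120, 120, 130), (110, 115, 125)],
}
index = HyperplaneLSH(dim=3)
for listing_id, pixels in listing_photos.items():
    index.add(listing_id, toy_encode(pixels))

# Search time: encode the query photo the same way, then look up neighbors.
# Same pixels as the mustard armchair in a different order, so the toy
# encoder produces the same vector and the same bucket.
query_photo = [(210, 170, 50), (200, 160, 40)]
candidates = index.query(toy_encode(query_photo))
print(candidates)  # candidate listings that would feed the ranking pipeline
```

The approximation shows up in `query`: a similar listing that falls on the other side of any one hyperplane lands in a different bucket and is never scored. That lost recall is the price of not scanning the whole catalog, and production indexes use more elaborate structures to keep it small.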

[Figure: “Some things you can’t put into words, so don’t search with words.” Text search struggles to describe a look (“that vintage mid-century armchair with the tapered wooden legs and mustard…”). Instead: photo of the chair → CNN encoder (image → vector) → ANN over listing images → visually nearest listings, ranked by image-embedding similarity, not by text overlap.]
The image is the query. Encode listing photos into vectors offline, encode the query photo the same way, and return the visually nearest listings via ANN: the same embed-index-lookup skeleton as text search.

Now the engineering

That repetition is the real lesson. Once you understand embeddings + approximate nearest-neighbor search, you hold a master key that opens text search, image search, recommendations, anomaly detection, and the retrieval layer of modern AI. The encoder changes per modality — a sentence model for text, a CNN (or vision transformer) for images — but the architecture downstream is the same: turn things into vectors, index them for fast ANN, look up neighbors. eBay's visual search and Spotify's semantic search are two instances of one idea.
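
One way to see the shared skeleton is to write the index once and pass the encoder in as a parameter. This is a sketch under stated assumptions: `SimilarityIndex` is a hypothetical class, it does an exact scan where a real system would use an ANN index, and both encoders are toys (word counts over a tiny vocabulary instead of a sentence model, mean color instead of a CNN).

```python
import math
from typing import Callable, Dict, Generic, List, TypeVar

T = TypeVar("T")
Vector = List[float]

def normalize(vec: Vector) -> Vector:
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))

class SimilarityIndex(Generic[T]):
    """Modality-agnostic: only the encoder knows what the raw input is."""

    def __init__(self, encode: Callable[[T], Vector]):
        self.encode = encode
        self.vectors: Dict[str, Vector] = {}

    def add(self, item_id: str, raw: T) -> None:
        self.vectors[item_id] = normalize(self.encode(raw))

    def search(self, raw_query: T, k: int = 1) -> List[str]:
        q = normalize(self.encode(raw_query))
        ranked = sorted(self.vectors, key=lambda i: dot(q, self.vectors[i]), reverse=True)
        return ranked[:k]

VOCAB = ["jazz", "piano", "rock", "guitar"]

def toy_text_encoder(text: str) -> Vector:
    """Stand-in for a sentence model: counts of a few known words."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def toy_image_encoder(pixels) -> Vector:
    """Stand-in for a CNN: mean RGB over the image's pixels."""
    n = len(pixels)
    return [sum(p[c] for p in pixels) / n for c in range(3)]

# Same class, two modalities: only the encoder argument differs.
text_index = SimilarityIndex(toy_text_encoder)
text_index.add("track_a", "slow jazz piano")
text_index.add("track_b", "loud rock guitar")

image_index = SimilarityIndex(toy_image_encoder)
image_index.add("mustard_armchair", [(200, 160, 40), (210, 170, 50)])
image_index.add("steel_desk_lamp", [(120, 120, 130), (110, 115, 125)])

print(text_index.search("piano jazz trio"))
print(image_index.search([(205, 165, 45)]))
```

Nothing in `SimilarityIndex` mentions text or pixels, which is the point: swapping modalities means swapping the function passed to the constructor, not redesigning the retrieval layer.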

Worth knowing

This is why “vector search” became infrastructure rather than a feature. The moment you can encode anything (text, images, audio, user behavior) as vectors in an embedding space, “find similar things” becomes a single, reusable capability across wildly different products. Recognizing that one pattern underlies many features is exactly the kind of abstraction that compounds an engineer's leverage.

The gap it reveals

The surface lesson is “eBay does visual search.” The deeper one — the one worth carrying — is that visual search, semantic search, and recommendations are the same architecture with different encoders. Engineers who see each as a separate special-case feature will rebuild the wheel repeatedly; those who see the shared embedding-plus-ANN skeleton design one capability and reuse it everywhere.

In the interview room

If a prompt involves images, similarity, or “find things like this,” resist treating it as exotic. “I'd encode images into embeddings with a CNN and serve nearest-neighbor lookups via ANN — same shape as semantic text search” shows you recognize the general pattern. Interviewers value candidates who compress many problems into one framework over those who memorize a separate recipe for each.

The reframe

The best engineering insight isn't a new trick — it's noticing that two things you thought were different are secretly the same. Visual search looks like a fundamentally separate problem from text search, until you see that both are “embed it, then find the nearest neighbors.” Master the pattern once and a dozen “different” features collapse into variations on a theme.

It's not image search and text search. It's nearest-neighbor search, twice.

Primary source: eBay Tech Blog, “How eBay’s New Search Feature Was Inspired by Window Shopping”
