You’re window-shopping. You see a chair you love. Now try to type what makes it that chair — the exact shade, the leg taper, the era. You can’t. eBay built search for precisely the thing words fail at.
Text search assumes you can describe what you want. For visual, taste-driven discovery — the digital equivalent of wandering a store and pointing — that assumption collapses. A look, a style, an aesthetic doesn't reduce cleanly to keywords. eBay's answer was to let the image itself be the query.
If semantic text search matches meaning instead of words, visual search goes a step further and removes words entirely. You give it a picture, and it finds visually similar items. No description, no keywords — just “more things that look like this.”
The mechanism is the same trick as semantic search, applied to pixels. Run each image through a model that turns it into an embedding — a vector capturing its visual characteristics — positioned so that images that look alike sit close together. Searching becomes finding the listing-image vectors nearest your photo's vector.
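To make "nearest your photo's vector" concrete, here is a minimal sketch of exact nearest-neighbor lookup over embeddings. The three-dimensional vectors, the listing names, and the `nearest_listings` helper are made up for illustration; real image embeddings come from a trained model and have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_listings(query_vec, listing_vecs, k=2):
    """Return the k listing ids whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), listing_id)
              for listing_id, vec in listing_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [listing_id for _, listing_id in scored[:k]]

# Hypothetical toy embeddings (assumption: not real model output).
listing_vecs = {
    "oak_chair": [0.9, 0.1, 0.0],
    "walnut_chair": [0.6, 0.4, 0.2],
    "floor_lamp": [0.0, 0.1, 0.9],
}
photo_vec = [0.85, 0.15, 0.05]  # stand-in for the embedding of your photo

print(nearest_listings(photo_vec, listing_vecs))
```

This version compares the query against every listing, which is fine for three items and far too slow for a marketplace-sized catalog. That cost is what approximate nearest-neighbor indexes exist to avoid.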
The same skeleton, a different sense
eBay integrated image-embedding vector search into its ranking stack. A convolutional neural network encodes each listing photo into a vector offline; a query image is encoded the same way at search time; an approximate-nearest-neighbor (ANN) lookup returns the visually closest listings, which then feed the ranking pipeline. If you read the Spotify semantic-search teardown in this series, this will feel familiar, because it is the identical pattern (embed, index, ANN) with images swapped in for text.
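The offline-index, query-time-lookup split can be sketched with one simple ANN technique, random-hyperplane locality-sensitive hashing. This is an illustrative choice, not a claim about which index eBay runs. The random vectors below stand in for CNN embeddings, and every name (`build_index`, `ann_search`, `listing_42`) is hypothetical.

```python
import math
import random

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def make_hyperplanes(dim, num_bits, seed=0):
    """Random hyperplanes through the origin; each contributes one hash bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def lsh_key(vec, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(sum(v * h for v, h in zip(vec, plane)) >= 0 for plane in hyperplanes)

def build_index(listing_vecs, hyperplanes):
    """Offline step: bucket every listing vector by its hash key."""
    buckets = {}
    for listing_id, vec in listing_vecs.items():
        buckets.setdefault(lsh_key(vec, hyperplanes), []).append(listing_id)
    return buckets

def ann_search(query_vec, buckets, listing_vecs, hyperplanes, k=3):
    """Query step: score only the listings that share the query's bucket."""
    candidates = buckets.get(lsh_key(query_vec, hyperplanes), [])
    ranked = sorted(candidates,
                    key=lambda lid: cosine(query_vec, listing_vecs[lid]),
                    reverse=True)
    return ranked[:k]

# Assumption: random 16-dimensional vectors stand in for listing-photo embeddings.
rng = random.Random(7)
listing_vecs = {f"listing_{i}": [rng.gauss(0.0, 1.0) for _ in range(16)]
                for i in range(1000)}

hyperplanes = make_hyperplanes(dim=16, num_bits=8)
buckets = build_index(listing_vecs, hyperplanes)   # done once, offline

query_vec = list(listing_vecs["listing_42"])       # stand-in for an encoded query photo
print(ann_search(query_vec, buckets, listing_vecs, hyperplanes))
```

The trade-off is visible in `ann_search`: it scores only one bucket instead of the whole catalog, so it is fast, but a true neighbor that hashed into a different bucket is missed. Production indexes reduce that miss rate with multiple hash tables or graph-based structures.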
That repetition is the real lesson. Once you understand embeddings + approximate nearest-neighbor search, you hold a master key that opens text search, image search, recommendations, anomaly detection, and the retrieval layer of modern AI. The encoder changes per modality — a sentence model for text, a CNN (or vision transformer) for images — but the architecture downstream is the same: turn things into vectors, index them for fast ANN, look up neighbors. eBay's visual search and Spotify's semantic search are two instances of one idea.
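One way to see the "encoder changes, everything downstream stays" claim is an index class that only ever touches vectors. The two encoders below are deliberately crude, hypothetical stand-ins (letter counts for a sentence model, mean color for a CNN or vision transformer), and `VectorIndex` uses brute-force search where a real system would use ANN.

```python
import math

def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

class VectorIndex:
    """Modality-agnostic: it stores and compares vectors, nothing else."""

    def __init__(self, encoder):
        self.encoder = encoder  # the only modality-specific part
        self.ids = []
        self.vectors = []

    def add(self, item_id, item):
        self.ids.append(item_id)
        self.vectors.append(normalize(self.encoder(item)))

    def search(self, query, k=1):
        q = normalize(self.encoder(query))
        scored = sorted(
            ((sum(a * b for a, b in zip(q, v)), item_id)
             for item_id, v in zip(self.ids, self.vectors)),
            reverse=True,
        )
        return [item_id for _, item_id in scored[:k]]

def toy_text_encoder(text):
    """Letter-frequency vector; a stand-in for a sentence-embedding model."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts

def toy_image_encoder(pixels):
    """Mean RGB color of a list of (r, g, b) pixels; a stand-in for a CNN."""
    n = len(pixels)
    return [sum(p[c] for p in pixels) / n for c in range(3)]

text_index = VectorIndex(toy_text_encoder)
text_index.add("armchair", "red velvet armchair")
text_index.add("lamp", "brass floor lamp")

image_index = VectorIndex(toy_image_encoder)
image_index.add("red_sofa", [(200, 30, 30), (220, 40, 35)])
image_index.add("blue_rug", [(20, 40, 210), (30, 50, 190)])

print(text_index.search("velvet chair"))
print(image_index.search([(210, 35, 40)]))
```

Swapping modalities means swapping the argument to `VectorIndex`; `add` and `search` do not change.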
Worth knowing
This is why “vector search” became infrastructure rather than a feature. The moment you can encode anything — text, images, audio, user behavior — into a shared vector space, “find similar things” becomes a single, reusable capability across wildly different products. Recognizing that one pattern underlies many features is exactly the kind of abstraction that compounds an engineer's leverage.
The gap it reveals
The surface lesson is “eBay does visual search.” The deeper one — the one worth carrying — is that visual search, semantic search, and recommendations are the same architecture with different encoders. Engineers who see each as a separate special-case feature will rebuild the wheel repeatedly; those who see the shared embedding-plus-ANN skeleton design one capability and reuse it everywhere.
In the interview room
If a prompt involves images, similarity, or “find things like this,” resist treating it as exotic. “I'd encode images into embeddings with a CNN and serve nearest-neighbor lookups via ANN — same shape as semantic text search” shows you recognize the general pattern. Interviewers value candidates who compress many problems into one framework over those who memorize a separate recipe for each.
The reframe
The best engineering insight isn't a new trick — it's noticing that two things you thought were different are secretly the same. Visual search looks like a fundamentally separate problem from text search, until you see that both are “embed it, then find the nearest neighbors.” Master the pattern once and a dozen “different” features collapse into variations on a theme.
It's not image search and text search. It's nearest-neighbor search, twice.
Primary source: eBay Tech Blog, “How eBay’s New Search Feature Was Inspired by Window Shopping”