From Keywords to Meaning

A Practical Experiment with Vector Search

Brian Bouquet
5 min read

Search has trained us to think in keywords.

You type a few words into a box. The system looks for pages that contain those words. The ones with the most overlap float to the top.

This works surprisingly well, until it doesn’t.

The moment intent becomes fuzzy, or phrasing changes, or the “right” result does not use the same words you typed, keyword search starts to fall apart. Humans are good at recognizing meaning even when the words differ. Traditional search systems are not.

This project explored a different approach: searching by meaning instead of matching words.

The Core Idea

Rather than asking:

  • Which pages contain the same words as this query?

We ask:

  • Which pages are similar in meaning to this query?

That shift sounds subtle. But it's important.

It requires a completely different way of representing language.

Step 1: Turning Language into Vectors

To do this, both the search query and the content itself need to be represented in the same format.

That format is a vector embedding.

A vector embedding is a numerical representation of meaning. Instead of storing text as characters or tokens, I convert it into a list of numbers that capture semantic relationships learned from large amounts of language.

In simple terms:

  • Similar meanings produce vectors that point in similar directions

  • Different meanings produce vectors that diverge

For this project:

  • Every content URL was processed and converted into a vector embedding

  • The embeddings were stored alongside the URL in a database

  • The same embedding model was used for both content and queries

This matters because vectors only make sense when they live in the same space.
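The article doesn't show the storage code, so here is a minimal sketch under stated assumptions: `embed` is a toy stand-in (letter frequencies) for whatever real embedding model the project used, and an in-memory SQLite table stands in for the unnamed database. The shape of the pipeline, not the model, is the point.

```python
import json
import sqlite3

def embed(text: str) -> list[float]:
    # Stand-in for a real embedding model (the article does not name one).
    # A real model maps text to hundreds of dimensions that encode meaning;
    # this toy version just counts letter frequencies so the sketch runs.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

# Store each URL's embedding alongside the URL, as described in Step 1.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE content (url TEXT PRIMARY KEY, embedding TEXT)")

pages = {
    "https://example.com/wine-tasting": "a guide to tasting red wine",
    "https://example.com/hiking": "trail maps for weekend hiking trips",
}
for url, text in pages.items():
    db.execute(
        "INSERT INTO content VALUES (?, ?)",
        (url, json.dumps(embed(text))),  # vectors serialized as JSON
    )
db.commit()
```

The key detail the section emphasizes survives even in this toy: the same `embed` function must be used for both content and, later, queries, so everything lives in one space.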

Step 2: Embedding the Search Query

When you submit a search query, the query itself is also converted into a vector using the same model.

At this point, the system is no longer comparing words.

It is comparing positions in a semantic space.

A query about “wine enthusiasts” does not need to match those exact words in the content. It only needs to land near content that talks about wine, drinks and dining, etc.

Step 3: Measuring Similarity with Cosine Similarity

Once both the query and the content are represented as vectors, the next question is simple:

How close are they?

For this project, I used cosine similarity.

Cosine similarity measures how aligned two vectors are, regardless of their length. It produces a score between -1 and 1, where:

  • 1 means the vectors point in the same direction: the content and the query are highly similar (in this case, both enthusiastic about wine)

  • 0 means they are unrelated

  • Negative values mean they point in opposite directions, i.e. opposite meaning (e.g. "I despise wine")

Said simply, higher cosine similarity means closer semantic meaning.
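In code, cosine similarity is just the dot product of the two vectors divided by the product of their lengths. A minimal stdlib version (the project's actual implementation isn't shown):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # cos(theta) = (a · b) / (|a| * |b|): measures alignment,
    # independent of vector length.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # 1.0: same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0: unrelated
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0: opposite
```

Note that the second vector in the first call is twice as long as the first, yet the score is still 1.0: only direction matters, which is exactly why cosine similarity suits embeddings.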

For each query:

  • I computed cosine similarity between the query vector and every content vector

  • I sorted the results from highest to lowest similarity

  • I applied a similarity threshold of 0.8

  • Only URLs above that threshold were returned

The output was a ranked list of URLs ordered by semantic relevance.
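The four steps above can be sketched as a single ranking function. This is an illustration, not the project's code: the 2-d vectors are toy stand-ins for real embeddings, while the 0.8 threshold matches the one described.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def search(query_vec, content_vecs, threshold=0.8):
    # Score every stored vector against the query, keep those above
    # the threshold, and return URLs ranked highest to lowest.
    scored = [
        (url, cosine_similarity(query_vec, vec))
        for url, vec in content_vecs.items()
    ]
    return sorted(
        ((url, score) for url, score in scored if score >= threshold),
        key=lambda pair: pair[1],
        reverse=True,
    )

# Toy 2-d vectors standing in for real embeddings:
content = {
    "/wine-pairings": [0.9, 0.1],
    "/wine-club":     [0.8, 0.3],
    "/lawn-care":     [0.1, 0.9],
}
query = [1.0, 0.2]
print(search(query, content))  # lawn-care falls below the threshold
```

In this toy space, both wine pages score well above 0.8 and come back ranked, while the lawn-care page is filtered out entirely, mirroring the behavior the section describes.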

What the Results Looked Like

The results were very strong.

Instead of returning pages that happened to share words with the query, the system returned pages that shared intent.

Content that used different phrasing but covered the same ideas ranked highly. Content that mentioned the same terms in unrelated contexts fell below the threshold. Content that contained the keywords but expressed the opposite intent was excluded.

The ranking also felt more intuitive. The top results were not just technically relevant. They were meaningfully relevant.

This is exactly what you want from a relevance search.

[Screenshot: search results for "wine enthusiasts" with various links and descriptions]
Using Vector Search to Return Semantically Relevant Results

Why This Matters

People don't naturally write in a way that optimizes for search queries. Good writing uses variation, tone, and narrative. That makes it more engaging for humans, but harder for keyword-based systems.

Vector search flips that tradeoff.

Because meaning is captured in the embedding:

  • Writers do not need to optimize phrasing for machines (keyword stuffing)

  • Related content clusters naturally

  • Long-form articles are represented holistically, not as bags of words

For content creators, this opens up new possibilities:

  • Smarter internal search

  • Content discovery based on theme, not tags or audience identities

  • Better matching between content and downstream systems like advertising

How This Connects to Contextual Advertising

This project pairs naturally with contextual signal extraction.

Once content is vectorized:

  • It can be searched by meaning

  • It can be clustered by theme

  • It can be matched to advertiser intent vectors

  • It can support contextual targeting without relying on consumer data

Instead of asking “does this page contain keyword X,” the system can ask “how close is this content to what the advertiser is trying to reach.”
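That question reuses the same similarity machinery. As a hedged illustration (the vectors below are hypothetical, imagined as the embedding of a campaign brief and of two pages), matching an advertiser's intent to content is just a nearest-vector lookup:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

# Hypothetical advertiser "intent vector", e.g. the embedding of a brief
# like "reach home cooks interested in wine pairings".
intent = [0.9, 0.1]

content = {
    "/wine-pairings": [1.0, 0.2],
    "/lawn-care":     [0.1, 0.9],
}

# Pick the page whose meaning sits closest to the advertiser's intent.
best = max(content, key=lambda url: cosine_similarity(intent, content[url]))
print(best)  # the wine page wins
```

No consumer data appears anywhere in this match: only the page's meaning and the advertiser's intent.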

That is a fundamentally stronger foundation for contextual advertising.

A Grounded Takeaway

Keyword search asks whether words match. Vector search asks whether meaning aligns.

Vector search fundamentally shifts the focus from mere word matching to understanding contextual relevance.

While keyword search remains valuable for specific tasks, such as retrieving a customer record by last name, vector search excels where meaning matters most. By embedding both queries and content in the same semantic space and using cosine similarity to rank results, this project illustrates a simple but powerful idea: similarity search works best when it prioritizes meaning over text.