Why Hybrid Search Was Returning Garbage
Search “Sam Altman on AGI” on Audioscrape. You’d expect clips of Sam Altman discussing artificial general intelligence. Instead, the top results were:
- “Here’s Sam Altman.” (18 characters)
- “Including Sam Altman.” (21 characters)
- “Here’s Sam Altman also.” (23 characters)
Filler. Introductions. The spoken equivalent of nothing. These short segments were outranking 300-character clips where Altman actually discusses AGI timelines.
I spent a day tracing the problem through three layers of the search stack. This is what I found.
The Setup
Audioscrape uses OpenSearch to search across 13+ million podcast transcript segments in production. I debugged this on our staging index (5.2 million segments), which mirrors the same data distribution. Each segment is a timestamped chunk of speech — typically 5 to 30 seconds of audio.
Search is hybrid: it combines BM25 (keyword matching) with kNN (semantic similarity via embeddings). OpenSearch normalizes both scores to [0, 1] and blends them 80/20 text/semantic. In theory, you get keyword precision plus conceptual understanding. In practice, both sides were independently broken for this query.
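The blend itself is simple. Here is a toy mirror of what the normalization processor in an OpenSearch hybrid-search pipeline does (min-max normalize each score set, then take a weighted sum); the weights match the 80/20 split above, but the document IDs and raw scores are illustrative:

```python
# Toy sketch of hybrid score blending: min-max normalization per signal,
# then a weighted arithmetic mean. Scores and doc IDs are made up.

def min_max(scores):
    """Normalize a {doc_id: score} map into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid divide-by-zero when all scores tie
    return {d: (s - lo) / span for d, s in scores.items()}

def blend(bm25, knn, w_text=0.8, w_sem=0.2):
    """Combine normalized BM25 and kNN scores; docs missing a signal score 0."""
    nb, nk = min_max(bm25), min_max(knn)
    docs = set(nb) | set(nk)
    return {d: w_text * nb.get(d, 0.0) + w_sem * nk.get(d, 0.0) for d in docs}

bm25 = {"seg_short": 9.1, "seg_long": 6.2}
knn = {"seg_short": 0.92, "seg_long": 0.71}
ranked = sorted(blend(bm25, knn).items(), key=lambda kv: -kv[1])
# Both signals favor the short segment, so blending cannot rescue the long one.
```

When both inputs carry the same bias, the weighted sum preserves it, which is exactly the failure mode described next.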
Problem 1: BM25 Loves Short Documents
BM25 normalizes by document length. A word in a short document is scored as more significant than the same word in a long one. For web pages, this is reasonable. For transcript segments, it’s pathological.
“Here’s Sam Altman.” has maximum term density for “Sam” and “Altman” — there’s nothing else to dilute the score. A 300-character segment where Altman discusses AGI has those terms spread among other words, so it scores lower per-term.
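The effect falls straight out of the BM25 formula. A minimal sketch of the single-term score, using Lucene's default k1 and b and made-up idf and average-length values, shows the same one-occurrence term scoring higher in a 3-token segment than in a 55-token one:

```python
import math

# Okapi BM25 score for one query term in one document, to isolate the
# length-normalization effect. k1/b are Lucene defaults; idf and avgdl
# (average doc length in tokens) are hypothetical corpus stats.

def bm25_term(tf, doc_len, avgdl, idf, k1=1.2, b=0.75):
    norm = k1 * (1 - b + b * doc_len / avgdl)
    return idf * (tf * (k1 + 1)) / (tf + norm)

idf, avgdl = 5.0, 120
short = bm25_term(tf=1, doc_len=3, avgdl=avgdl, idf=idf)   # "Here's Sam Altman."
long_ = bm25_term(tf=1, doc_len=55, avgdl=avgdl, idf=idf)  # substantive segment
assert short > long_  # same term frequency, but the short doc wins
```

Same term, same frequency; the only difference is the `doc_len / avgdl` penalty in the denominator.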
Problem 2: kNN Loves Short Documents Too
I tested pure semantic search. Every single result was a short segment:
| Rank | Score | Text |
|---|---|---|
| 1 | 0.917 | “Here’s Sam Altman.” |
| 2 | 0.889 | “Including Sam Altman.” |
| 3 | 0.880 | “And now, dear friends, here’s Sam Altman.” |
The embedding for “Here’s Sam Altman.” is almost entirely the concept “Sam Altman” — no surrounding context to add noise. When the query embedding is also dominated by “Sam Altman,” cosine similarity is near-perfect.
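A toy illustration with hand-made 3-dimensional vectors (not real model embeddings): treat the first two axes as the "Sam Altman" directions and the third as everything else. A filler segment that is pure entity lands almost on top of an entity-dominated query, while surrounding context pulls a substantive segment away:

```python
import math

# Hand-made 3-dim "embeddings" for illustration only.
# Axes 0-1 ~ entity ("Sam Altman"), axis 2 ~ topical context.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

query = [0.9, 0.4, 0.15]       # "Sam Altman on AGI": entity dominates
filler = [0.88, 0.42, 0.05]    # "Here's Sam Altman.": pure entity
substantive = [0.5, 0.2, 0.8]  # long segment: entity diluted by context

assert cosine(query, filler) > cosine(query, substantive)
```

The filler's cosine with the query comes out near 1, the substantive segment's well below it, mirroring the rankings in the table above.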
Both BM25 and kNN were biased toward short segments. Hybrid didn’t average the problem out — it amplified it.
Fix 1: Kill the Filler
I added a text_length integer field to the OpenSearch segment mapping, backfilled it across all 5.2 million staging documents, and added a post-filter:
```json
{
  "bool": {
    "should": [
      {"range": {"text_length": {"gte": 30}}},
      {"bool": {"must_not": [{"exists": {"field": "text_length"}}]}}
    ],
    "minimum_should_match": 1
  }
}
```
The backward-compatible should clause lets pre-migration documents without the field still appear. 12% of all segments — 630,000 documents — turned out to be under 30 characters. “Right.” “Thanks.” “Here’s Sam Altman.” None of these are useful search results.
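The filter's logic, mirrored in Python over a few made-up documents, shows what survives: anything 30+ characters, plus legacy documents where the field doesn't exist yet:

```python
# Python mirror of the backward-compatible post-filter: a segment passes if
# text_length >= 30 OR the field is missing (pre-migration documents).
# The sample documents are illustrative.

MIN_LEN = 30

def passes_filter(doc):
    length = doc.get("text_length")
    return length is None or length >= MIN_LEN

docs = [
    {"text": "Here's Sam Altman.", "text_length": 18},
    {"text": "Altman lays out his AGI timeline in detail...", "text_length": 285},
    {"text": "A pre-migration segment with no length field."},
]
kept = [d for d in docs if passes_filter(d)]
assert len(kept) == 2  # filler dropped; long and legacy docs kept
```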
Fix 2: Boost Substance
Instead of just filtering, I wrapped the text query in function_score with a field_value_factor on text_length:
```json
{
  "function_score": {
    "query": "<text_query>",
    "functions": [
      {
        "field_value_factor": {
          "field": "text_length",
          "modifier": "log1p",
          "factor": 0.1,
          "missing": 100
        }
      }
    ],
    "boost_mode": "multiply"
  }
}
```
log1p gives diminishing returns: the multiplier roughly doubles going from 30 to 200 characters, then grows by only about ln(2) for each further doubling of length. This prefers substantive segments without letting very long ones run away with the score. missing: 100 provides a neutral default for documents not yet backfilled.
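With `modifier: log1p` and `factor: 0.1`, the multiplier is just ln(1 + 0.1 × text_length), so the curve is easy to inspect:

```python
import math

# field_value_factor with modifier log1p and factor 0.1 multiplies the text
# score by ln(1 + 0.1 * text_length). For large lengths this is ~ln(0.1 * n),
# so each doubling of length adds only about ln(2) ~ 0.69 to the multiplier.

def length_boost(text_length, factor=0.1):
    return math.log1p(factor * text_length)

for n in (30, 200, 2000):
    print(n, round(length_boost(n), 2))
# 30 chars   -> ~1.39x
# 200 chars  -> ~3.04x
# 2000 chars -> ~5.30x
```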
Result: all filler gone from top results. But only 2-3 out of 10 results mentioned both “Sam Altman” and “AGI.”
The Wrong Turn
Our query intent classifier detected “Sam Altman on AGI” as a keyword query. I tried auto-switching from hybrid to text-only for keyword queries, reasoning that kNN’s short-segment bias was the problem.
This made things worse. Text-only search returned both terms in only 2 of 10 results, and in only 1 of the top 4. Removing semantic search didn't help; it just eliminated the few good results kNN was pulling in. I reverted this.
The Real Problem
The text query used multi_match with the default OR operator. For “Sam Altman on AGI,” this means: match any segment containing “Sam” OR “Altman” OR “on” OR “AGI.” Since “Sam Altman” appears in thousands of segments across the corpus — about politics, eye scanning, corporate drama — those segments easily outscored the handful mentioning AGI.
Proof: searching "Sam Altman" AGI with explicit phrase syntax returned 10/10 results with both terms. The content existed. The query just wasn’t requiring it.
The Embedding Model Can’t Help
I tested whether the embedding model was the bottleneck. We run paraphrase-multilingual-MiniLM-L12-v2 (384 dimensions). I computed cosine similarities between the query and various targets:
"Sam Altman on AGI" vs "Sam Altman" → 0.87
"Sam Altman on AGI" vs "AGI" → 0.63
"Sam Altman on AGI" vs "Sam Altman discusses AGI timelines" → 0.64
The query vector is 87% “Sam Altman.” The compositional meaning — “this person’s views on this topic” — barely registers. I tested two other 384-dim models (all-MiniLM-L6-v2 and BAAI/bge-small-en-v1.5). Same pattern. At this dimensionality, the entity name consumes most of the vector’s capacity.
Google solves this with much larger retrieval models, cross-encoders, and learned ranking from click data. I don’t have any of those. The embedding model is good for conceptual queries like “how to sleep better” where exact terms don’t matter. For entity+topic queries, BM25 has to do the work.
The Actual Fix
I replaced the OR text query with minimum_should_match: "75%":
```json
{
  "multi_match": {
    "query": "Sam Altman on AGI",
    "fields": ["text^2"],
    "fuzziness": "AUTO",
    "minimum_should_match": "75%"
  }
}
```
For a 4-word query, 75% rounds down to 3 of the 4 terms. Since “on” is a stop word, this effectively requires “Sam” + “Altman” + “AGI.” (Note that Lucene rounds the percentage down, so a 2-word query only requires 1 term, and a single-word query still has to match its one term.) It scales naturally with query length, with no hardcoded boosts to tune per query shape.
Results
I tested the full change set (length filter + length boost + minimum_should_match) against the old hybrid search:
| | Before | After |
|---|---|---|
| Both “Altman” + “AGI” in top 10 | 2/10 | 3/10 |
| Short filler in top 10 | 5/10 | 0/10 |
| Shortest result | 18 chars | 285 chars |
I tested across several query types, including entity+topic (“Sam Altman on AGI”), natural language (“how to sleep better”), single-word (“consciousness”), multi-entity (“Elon Musk Mars”), and questions (“How does AI affect society?”). No regressions on any of them.
The remaining Altman-only results in positions 5-10 come from the kNN side. The embedding model sees “Sam Altman on AGI” as “Sam Altman” and retrieves accordingly. That’s a known constraint — not fixable without a model change.
What I Shipped
Four changes total:
- text_length field in the segment mapping, computed at index time
- Hard floor at 30 characters via post-filter (removes degenerate filler)
- Soft length boost via function_score with log1p (ranks substantive content higher)
- minimum_should_match: "75%" on the text query (requires most terms to be present)
What’s Next
The next lever is the embedding model. A larger retrieval-tuned model (768+ dimensions, trained for query-document matching rather than paraphrase detection) would improve semantic precision for compositional queries like “Sam Altman on AGI.” That’s a bigger infrastructure change — reindexing 13+ million production documents, more GPU memory, different Cog container — but it’s where the remaining quality gap lives.
For now, the combination of smarter BM25 matching, length-based scoring, and the existing semantic signal is a substantial improvement. The filler is gone, and the content that matters rises to the top.
Audioscrape — search engine for podcasts.