Why Hybrid Search Was Returning Garbage
Search “Sam Altman on AGI” on Audioscrape. You’d expect clips of Sam Altman discussing artificial general intelligence. Instead, the top results were:
- “Here’s Sam Altman.” (18 characters)
- “Including Sam Altman.” (21 characters)
- “Here’s Sam Altman also.” (23 characters)
Filler. Introductions. The spoken equivalent of nothing. These short segments were outranking 300-character clips where Altman actually discusses AGI timelines.
I spent a day tracing the problem through three layers of the search stack. This is what I found.
The Setup
Audioscrape uses OpenSearch to search across 13+ million podcast transcript segments in production. I debugged this on our staging index (5.2 million segments), which mirrors the same data distribution. Each segment is a timestamped chunk of speech — typically 5 to 30 seconds of audio.
Search is hybrid: it combines BM25 (keyword matching) with kNN (semantic similarity via embeddings). OpenSearch normalizes both scores to [0, 1] and blends them 80/20 text/semantic. In theory, you get keyword precision plus conceptual understanding. In practice, both sides were independently broken for this query.
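The blend itself is simple. Here is a toy mirror of what the normalization processor in an OpenSearch hybrid-search pipeline does (min-max normalize each score set, then take a weighted sum); the weights match the 80/20 split above, but the document IDs and raw scores are illustrative:

```python
# Toy sketch of hybrid score blending: min-max normalization per signal,
# then a weighted arithmetic mean. Scores and doc IDs are made up.

def min_max(scores):
    """Normalize a {doc_id: score} map into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid divide-by-zero when all scores tie
    return {d: (s - lo) / span for d, s in scores.items()}

def blend(bm25, knn, w_text=0.8, w_sem=0.2):
    """Combine normalized BM25 and kNN scores; docs missing a signal score 0."""
    nb, nk = min_max(bm25), min_max(knn)
    docs = set(nb) | set(nk)
    return {d: w_text * nb.get(d, 0.0) + w_sem * nk.get(d, 0.0) for d in docs}

bm25 = {"seg_short": 9.1, "seg_long": 6.2}
knn = {"seg_short": 0.92, "seg_long": 0.71}
ranked = sorted(blend(bm25, knn).items(), key=lambda kv: -kv[1])
# Both signals favor the short segment, so blending cannot rescue the long one.
```

When both inputs carry the same bias, the weighted sum preserves it, which is exactly the failure mode described next.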
Problem 1: BM25 Loves Short Documents
BM25 normalizes by document length. A word in a short document is scored as more significant than the same word in a long one. For web pages, this is reasonable. For transcript segments, it’s pathological.
“Here’s Sam Altman.” has maximum term density for “Sam” and “Altman” — there’s nothing else to dilute the score. A 300-character segment where Altman discusses AGI has those terms spread among other words, so it scores lower per-term.
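The effect falls straight out of the BM25 formula. A minimal sketch of the single-term score, using Lucene's default k1 and b and made-up idf and average-length values, shows the same one-occurrence term scoring higher in a 3-token segment than in a 55-token one:

```python
import math

# Okapi BM25 score for one query term in one document, to isolate the
# length-normalization effect. k1/b are Lucene defaults; idf and avgdl
# (average doc length in tokens) are hypothetical corpus stats.

def bm25_term(tf, doc_len, avgdl, idf, k1=1.2, b=0.75):
    norm = k1 * (1 - b + b * doc_len / avgdl)
    return idf * (tf * (k1 + 1)) / (tf + norm)

idf, avgdl = 5.0, 120
short = bm25_term(tf=1, doc_len=3, avgdl=avgdl, idf=idf)   # "Here's Sam Altman."
long_ = bm25_term(tf=1, doc_len=55, avgdl=avgdl, idf=idf)  # substantive segment
assert short > long_  # same term frequency, but the short doc wins
```

Same term, same frequency; the only difference is the `doc_len / avgdl` penalty in the denominator.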
Problem 2: kNN Loves Short Documents Too
I tested pure semantic search. Every single result was a short segment:
| Rank | Score | Text |
|---|---|---|
| 1 | 0.917 | “Here’s Sam Altman.” |
| 2 | 0.889 | “Including Sam Altman.” |
| 3 | 0.880 | “And now, dear friends, here’s Sam Altman.” |
The embedding for “Here’s Sam Altman.” is almost entirely the concept “Sam Altman” — no surrounding context to add noise. When the query embedding is also dominated by “Sam Altman,” cosine similarity is near-perfect.
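A toy illustration with hand-made 3-dimensional vectors (not real model embeddings): treat the first two axes as the "Sam Altman" directions and the third as everything else. A filler segment that is pure entity lands almost on top of an entity-dominated query, while surrounding context pulls a substantive segment away:

```python
import math

# Hand-made 3-dim "embeddings" for illustration only.
# Axes 0-1 ~ entity ("Sam Altman"), axis 2 ~ topical context.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

query = [0.9, 0.4, 0.15]       # "Sam Altman on AGI": entity dominates
filler = [0.88, 0.42, 0.05]    # "Here's Sam Altman.": pure entity
substantive = [0.5, 0.2, 0.8]  # long segment: entity diluted by context

assert cosine(query, filler) > cosine(query, substantive)
```

The filler's cosine with the query comes out near 1, the substantive segment's well below it, mirroring the rankings in the table above.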
Both BM25 and kNN were biased toward short segments. Hybrid didn’t average the problem out — it amplified it.
Fix 1: Kill the Filler
I added a text_length integer field to the OpenSearch segment mapping, backfilled it across all 5.2 million staging documents, and added a post-filter:
```json
{
  "bool": {
    "should": [
      {"range": {"text_length": {"gte": 30}}},
      {"bool": {"must_not": [{"exists": {"field": "text_length"}}]}}
    ],
    "minimum_should_match": 1
  }
}
```
The backward-compatible should clause lets pre-migration documents without the field still appear. 12% of all segments — 630,000 documents — turned out to be under 30 characters. “Right.” “Thanks.” “Here’s Sam Altman.” None of these are useful search results.
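The filter's logic, mirrored in Python over a few made-up documents, shows what survives: anything 30+ characters, plus legacy documents where the field doesn't exist yet:

```python
# Python mirror of the backward-compatible post-filter: a segment passes if
# text_length >= 30 OR the field is missing (pre-migration documents).
# The sample documents are illustrative.

MIN_LEN = 30

def passes_filter(doc):
    length = doc.get("text_length")
    return length is None or length >= MIN_LEN

docs = [
    {"text": "Here's Sam Altman.", "text_length": 18},
    {"text": "Altman lays out his AGI timeline in detail...", "text_length": 285},
    {"text": "A pre-migration segment with no length field."},
]
kept = [d for d in docs if passes_filter(d)]
assert len(kept) == 2  # filler dropped; long and legacy docs kept
```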
Fix 2: Boost Substance
Instead of just filtering, I wrapped the text query in function_score with a field_value_factor on text_length:
```json
{
  "function_score": {
    "query": "<text_query>",
    "functions": [
      {
        "field_value_factor": {
          "field": "text_length",
          "modifier": "log1p",
          "factor": 0.1,
          "missing": 100
        }
      }
    ],
    "boost_mode": "multiply"
  }
}
```
log1p gives diminishing returns: the multiplier roughly doubles going from 30 to 200 characters, then grows by only about ln(2) for each further doubling of length. This prefers substantive segments without letting very long ones run away with the score. missing: 100 provides a neutral default for documents not yet backfilled.
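With `modifier: log1p` and `factor: 0.1`, the multiplier is just ln(1 + 0.1 × text_length), so the curve is easy to inspect:

```python
import math

# field_value_factor with modifier log1p and factor 0.1 multiplies the text
# score by ln(1 + 0.1 * text_length). For large lengths this is ~ln(0.1 * n),
# so each doubling of length adds only about ln(2) ~ 0.69 to the multiplier.

def length_boost(text_length, factor=0.1):
    return math.log1p(factor * text_length)

for n in (30, 200, 2000):
    print(n, round(length_boost(n), 2))
# 30 chars   -> ~1.39x
# 200 chars  -> ~3.04x
# 2000 chars -> ~5.30x
```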
Result: all filler gone from top results. But only 2-3 out of 10 results mentioned both “Sam Altman” and “AGI.”
The Wrong Turn
Our query intent classifier detected “Sam Altman on AGI” as a keyword query. I tried auto-switching from hybrid to text-only for keyword queries, reasoning that kNN’s short-segment bias was the problem.
This made things worse. Text-only search returned both terms in only 2 of 10 results, and in only 1 of the top 4. Removing semantic search didn't help; it just eliminated the few good results kNN was pulling in. I reverted this.
The Real Problem
The text query used multi_match with the default OR operator. For “Sam Altman on AGI,” this means: match any segment containing “Sam” OR “Altman” OR “on” OR “AGI.” Since “Sam Altman” appears in thousands of segments across the corpus — about politics, eye scanning, corporate drama — those segments easily outscored the handful mentioning AGI.
Proof: searching "Sam Altman" AGI with explicit phrase syntax returned 10/10 results with both terms. The content existed. The query just wasn’t requiring it.
The Embedding Model Can’t Help
I tested whether the embedding model was the bottleneck. We run paraphrase-multilingual-MiniLM-L12-v2 (384 dimensions). I computed cosine similarities between the query and various targets:
"Sam Altman on AGI" vs "Sam Altman" → 0.87
"Sam Altman on AGI" vs "AGI" → 0.63
"Sam Altman on AGI" vs "Sam Altman discusses AGI timelines" → 0.64
The query vector is 87% “Sam Altman.” The compositional meaning — “this person’s views on this topic” — barely registers. I tested two other 384-dim models (all-MiniLM-L6-v2 and BAAI/bge-small-en-v1.5). Same pattern. At this dimensionality, the entity name consumes most of the vector’s capacity.
Google solves this with much larger retrieval models, cross-encoders, and learned ranking from click data. I don’t have any of those. The embedding model is good for conceptual queries like “how to sleep better” where exact terms don’t matter. For entity+topic queries, BM25 has to do the work.
The Actual Fix
I replaced the OR text query with minimum_should_match: "75%":
```json
{
  "multi_match": {
    "query": "Sam Altman on AGI",
    "fields": ["text^2"],
    "fuzziness": "AUTO",
    "minimum_should_match": "75%"
  }
}
```
For a 4-word query, 75% rounds down to 3 of the 4 terms. Since “on” is a stop word, this effectively requires “Sam” + “Altman” + “AGI.” (Note that Lucene rounds the percentage down, so a 2-word query only requires 1 term, and a single-word query still has to match its one term.) It scales naturally with query length, with no hardcoded boosts to tune per query shape.
Results
I tested the full change set (length filter + length boost + minimum_should_match) against the old hybrid search:
| | Before | After |
|---|---|---|
| Both “Altman” + “AGI” in top 10 | 2/10 | 3/10 |
| Short filler in top 10 | 5/10 | 0/10 |
| Shortest result | 18 chars | 285 chars |
I tested across several query types, including entity+topic (“Sam Altman on AGI”), natural language (“how to sleep better”), single-word (“consciousness”), multi-entity (“Elon Musk Mars”), and questions (“How does AI affect society?”). No regressions on any of them.
The remaining Altman-only results in positions 5-10 come from the kNN side. The embedding model sees “Sam Altman on AGI” as “Sam Altman” and retrieves accordingly. That’s a known constraint — not fixable without a model change.
What I Shipped
Four changes total:
- text_length field in the segment mapping, computed at index time
- Hard floor at 30 characters via post-filter (removes degenerate filler)
- Soft length boost via function_score with log1p (ranks substantive content higher)
- minimum_should_match: "75%" on the text query (requires most terms to be present)
What’s Next
The next lever is the embedding model. A larger retrieval-tuned model (768+ dimensions, trained for query-document matching rather than paraphrase detection) would improve semantic precision for compositional queries like “Sam Altman on AGI.” That’s a bigger infrastructure change — reindexing 13+ million production documents, more GPU memory, different Cog container — but it’s where the remaining quality gap lives.
For now, the combination of smarter BM25 matching, length-based scoring, and the existing semantic signal is a substantial improvement. The filler is gone, and the content that matters rises to the top.
Audioscrape — search engine for podcasts.