nikGo

Engineering, AI, & Cognition

Domain-Specific RAG with Gemini 3 Flash Beats PRO with Web Search Grounding

Why a smaller model + an academic corpus can outperform a larger model grounded with search

Executive summary

My previous white paper showed that Gemini 3 Flash + domain RAG (academic psych/cog-sci corpus) significantly outperforms (1) Gemini 3 Flash without retrieval and (2) Gemini 3 Pro without retrieval on 4/5 judged dimensions, with coherence the only non-significant metric. (nikgo)

This new experiment tightens the baseline: Flash + domain RAG is compared against Gemini 3 Pro grounded with search (a “web RAG” style system). The headline result holds: domain RAG still wins, with statistically significant improvements in overall_score and 4/5 dimensions (factual correctness, completeness, hallucination risk, academic response), while coherence remains non-significant. The effect sizes and win rate are weaker than when Pro had no retrieval, which is exactly what you’d predict once the larger model gains grounding.


Abstract

Retrieval-Augmented Generation (RAG) is often treated as interchangeable with “grounding via web search,” but retrieval quality and corpus alignment should matter, especially for research-grade synthesis. We extend prior controlled experiments by comparing a smaller model (Gemini 3 Flash) using a domain-specific academic RAG corpus against a larger model (Gemini 3 Pro) grounded with search. Using blinded pairwise LLM-as-judge evaluation (ChatGPT 5.2 Thinking), randomized A/B ordering, and Wilcoxon signed-rank tests across five dimensions plus an overall score, we find statistically significant gains for domain RAG on overall_score and 4/5 dimensions (all except coherence). Relative to the earlier “Pro without retrieval” baseline, the win rate drops to 86.7% wins / 6.7% ties / 6.7% harms (n=15), indicating that search grounding narrows, but does not close, the gap. These findings support a stronger systems claim: RAG is not one thing; for research tasks, domain-aligned corpora can beat general web grounding even on smaller models.


1. What’s new compared to the previous white paper

The previous paper established two results:

  1. Same model, retrieval matters: Flash + RAG > Flash without RAG. (nikgo)
  2. RAG can substitute for scale: Flash + RAG > Pro without RAG. (nikgo)

This third experiment asks the harder question:

If you give the larger model its own retrieval (search grounding), does the specialized academic corpus still win?

This is the important “real world” comparison, because many teams assume “Pro + web search” is effectively “best possible RAG.”


2. Background: retrieval helps, but not automatically

Recent work consistently finds that retrieval grounding can improve factual accuracy and reduce hallucinations, particularly for domain-specific or time-sensitive queries. (arXiv)

But the literature is equally clear on a second point: retrieval is not automatically beneficial. Whether retrieval helps depends on when you retrieve, what you retrieve, and how you integrate it. (aclanthology.org) This is where “search grounding” and “domain RAG” can diverge:

Methods like evidence refinement (condensing retrieval into key supporting evidence) improve consistency and answer quality, underscoring that retrieval quality/integration are first-class system components—not add-ons. (aclanthology.org) And large best-practice benchmarks show RAG outcomes are sensitive to configuration choices (prompting, chunk size, KB size, retrieval strategies). (aclanthology.org)
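The evidence-refinement idea can be sketched as a condensing step between retrieval and generation. The function name and the word-overlap heuristic below are illustrative assumptions, far simpler than the cited methods:

```python
# Toy evidence refinement: condense retrieved chunks into the few
# passages that best support the query, before handing them to the LLM.
def refine_evidence(query, chunks, k=2):
    """Keep the k chunks with the highest word overlap with the query."""
    q_terms = set(query.lower().split())

    def overlap(chunk):
        return len(q_terms & set(chunk.lower().split()))

    # Sort by overlap, best first, and keep the top k as "key evidence".
    return sorted(chunks, key=overlap, reverse=True)[:k]

chunks = [
    "Working memory capacity predicts reasoning performance.",
    "The stadium opened in 1998 and seats 40,000 people.",
    "Cognitive load theory links memory limits to instruction design.",
]
evidence = refine_evidence("How does working memory relate to cognitive load?", chunks)
```

In a real system the overlap score would be replaced by an embedding or reranker score, but the structural point is the same: refinement filters retrieval noise before synthesis.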


3. Experimental design

3.1 Tasks and corpus

Research-grade synthesis prompts from the same setup as the prior experiments, answered against the academic psychology / cognitive-science corpus. (nikgo)

3.2 Model conditions

Condition A: Gemini 3 Flash with domain-specific RAG over the academic corpus. Condition B: Gemini 3 Pro grounded with search (“web RAG”).

3.3 Evaluation protocol

Blinded pairwise LLM-as-judge evaluation (ChatGPT 5.2 Thinking) with randomized A/B ordering, scored on five dimensions (factual correctness, completeness, hallucination risk, academic response, coherence) plus an overall score; paired differences tested with Wilcoxon signed-rank, n=15.
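The randomized, blinded A/B ordering described in the evaluation protocol can be sketched as follows; `blind_pair` and the label scheme are illustrative assumptions, not the study’s actual harness:

```python
import random

# Sketch of the blinded pairwise protocol: each system's answer is
# randomly assigned to the anonymous label "Answer X" or "Answer Y"
# before the pair is shown to the judge, so the judge cannot learn
# which position a given system occupies.
def blind_pair(rag_answer, search_answer, rng):
    """Return (answer_x, answer_y, key); key records the assignment
    so results can be un-blinded after judging."""
    if rng.random() < 0.5:
        return rag_answer, search_answer, "rag_is_x"
    return search_answer, rag_answer, "search_is_x"

rng = random.Random(42)  # fixed seed only for a reproducible demo
pairs = [blind_pair(f"rag-{i}", f"search-{i}", rng) for i in range(15)]
```

Keeping the key alongside each pair is the design choice that matters: scores are collected under anonymous labels and mapped back to systems only at analysis time.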


4. Results

4.1 Headline: domain RAG still wins vs Pro + search grounding

Across all 15 evaluations, domain RAG won 13 (86.7%), tied 1 (6.7%), and lost 1 (6.7%).

This is weaker than the earlier comparison against Pro without retrieval (expected), but still strongly favorable overall.
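The headline rates follow directly from per-prompt verdicts. The verdict list below is a reconstruction matching the reported 13 wins, 1 tie, 1 loss, not the raw study log:

```python
from collections import Counter

# Per-prompt judge verdicts consistent with the reported tallies:
# 13 wins for domain RAG, 1 tie, 1 loss (harm), n = 15.
verdicts = ["win"] * 13 + ["tie"] + ["harm"]

counts = Counter(verdicts)
rates = {k: round(100 * v / len(verdicts), 1) for k, v in counts.items()}
# rates == {"win": 86.7, "tie": 6.7, "harm": 6.7}
```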

4.2 Overall metrics and significance (Wilcoxon)

Mean paired differences (Answer X − Answer Y), n=15, are shown in Figure 1.

Figure 1: Gemini 3 Flash RAG vs. Gemini 3 Pro + Search. Overall results bar chart: paired score deltas for overall_score plus the five judged dimensions, with significance markers.
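For reference, here is a self-contained sketch of the significance test used throughout: a two-sided Wilcoxon signed-rank test via the normal approximation. `scipy.stats.wilcoxon` is the usual production choice, and the input deltas below are illustrative, not the study’s data:

```python
import math

def wilcoxon_signed_rank(diffs):
    """Two-sided Wilcoxon signed-rank test (normal approximation).
    A sketch; scipy.stats.wilcoxon handles small-n exact p-values."""
    d = [x for x in diffs if x != 0]           # drop zero differences
    n = len(d)
    # Rank |d|, assigning average ranks to ties.
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg = (i + j + 2) / 2                  # mean of 1-based ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mu) / sigma
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return w_plus, p

# Illustrative paired deltas (Answer X - Answer Y), NOT the study's data:
w, p = wilcoxon_signed_rank([1, 2, 3, 4, 5, -1])
```

The test is paired and rank-based, so it needs no normality assumption on the score deltas, which is why it suits small-n judged comparisons like this one.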

Interpretation: compared to “Pro without retrieval,” grounding Pro with search improves the baseline, shrinking the gap—but domain RAG still wins on the dimensions that matter most for research outputs: factuality, completeness, academic quality, and hallucination risk.

4.3 Coherence again: still not significant

Coherence remains non-significant, matching the prior paper’s result. (nikgo) This aligns with broader findings that retrieval can help or hurt depending on conditions and that integration is a system-design problem (selection, reranking, synthesis constraints), not a guaranteed byproduct of “having sources.” (aclanthology.org)


5. NLP analysis: domain RAG outputs are longer and more academic (and harder to read)

Compared to Pro + search grounding, domain RAG outputs are longer, more academic in register, and lower in readability (see Figure 2).

Figure 2: Gemini 3 Flash RAG vs. Gemini 3 Pro + Search. Readability and length comparison panel (Answer X vs. Answer Y).

This trade-off is consistent with what you’d expect: when you retrieve from an academic corpus, the model tends to produce more qualified, terminology-rich synthesis. Similar effects appear in applied retrieval-grounded systems that prioritize evidence-driven reliability over readability. (jmir.org)
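To make the readability trade-off concrete, here is a minimal sketch of the kind of metric such comparisons use: Flesch Reading Ease with a crude vowel-group syllable heuristic. The exact metric behind Figure 2 is not specified here, and libraries such as textstat are more careful:

```python
import re

def flesch_reading_ease(text):
    """Flesch Reading Ease with a rough syllable heuristic.
    Higher scores mean easier text."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)

    def syllables(word):
        # Count groups of consecutive vowels as syllables (approximation).
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    n_words = max(1, len(words))
    n_syll = sum(syllables(w) for w in words)
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (n_syll / n_words)

easy = flesch_reading_ease("The cat sat on the mat. It was warm.")
hard = flesch_reading_ease(
    "Retrieval-augmented generation prioritizes terminological precision "
    "over colloquial accessibility in academically oriented synthesis."
)
```

Academic-register synthesis scores far lower on such metrics simply because of longer sentences and polysyllabic terminology, which is the pattern the figure reports.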


6. Why domain RAG can beat “search grounding”

This experiment is basically a referendum on corpus alignment.

Search grounding is retrieval, but it is retrieval from an open, heterogeneous corpus: source quality varies widely, terminology is inconsistent, and ranking is tuned for general relevance rather than scholarly rigor.

A domain academic corpus is curated, terminologically consistent, and aligned with the register and evidentiary standards the task demands.

The research community’s “retrieval helps or hurts” results predict exactly this: if retrieval quality is noisy or mismatched, you don’t automatically get better synthesis—sometimes you just get more clutter. (aclanthology.org) And the best-practices / refinement line of work essentially says: you win when your retrieval system produces the right evidence and you integrate it cleanly. (aclanthology.org)


7. Practical implications for AI teams

7.1 Don’t treat “RAG” as a checkbox

This study supports a stronger framing: RAG is an architecture class, and corpus choice is a primary driver of outcomes. (aclanthology.org)

7.2 When Pro + search is “good enough”

If your task is time-sensitive, general-audience, or breadth-oriented (news, consumer questions, quick overviews), Pro + search is a reasonable default: the open web is the right corpus for it.

7.3 When domain RAG wins (even on smaller models)

If your task is research-grade synthesis (citation-heavy, terminology-sensitive, grounded in a coherent body of literature), a domain-aligned corpus can beat generic web grounding even on a smaller model, as this experiment shows.

This mirrors the direction of recent refinement/best-practices research. (aclanthology.org)


8. Limitations

The sample is small (n=15), judgments come from a single LLM-as-judge (ChatGPT 5.2 Thinking), and the comparison covers one domain corpus and one retrieval configuration per condition; results may not generalize to other domains, judges, or retrieval setups.


9. Conclusion

Adding search grounding to Gemini 3 Pro improves the baseline and reduces the gap versus domain RAG, but a smaller model with a domain-aligned academic corpus still wins, with statistically significant improvements in overall_score and 4/5 judged dimensions. The result generalizes the core thesis of the prior paper (retrieval can substitute for scale) and adds a sharper point: retrieval quality and corpus alignment can beat generic web grounding for research tasks. (nikgo)


References