ResearchBuddyAI
Vol. 1 · Tech Overview · 2026
A Technical Account

How we verify research,
not just generate it.

An open description of the multi-model verification pipeline behind the Deep Audit, Peer Review, and Deep Research tools — what each agent does, why we made the choices we did, and how our results compare to other agentic reviewers in the literature.
researchbuddyai.com  ·  Last revised: May 2026
Abstract

Most AI research tools generate plausible-sounding output and stop there. ResearchBuddyAI is built on the opposite premise: every claim, citation, and novelty assessment must be verified against an external source of truth before the user sees it. We achieve this through three coupled mechanisms — live API verification of every cited reference (CrossRef + Semantic Scholar), a triple-model ensemble (o4-mini, Claude Sonnet 4.6, Gemini 2.5 Flash) with conservative score aggregation, and prior-work grounding for originality assessment. This document describes the full pipeline and our design choices, with citations to the underlying research where relevant.

§ 1

The hallucination problem in academic writing

Large language models routinely invent citations: authors who appear plausible, journal names that sound right, DOIs that resolve to nothing. In casual use this is a curiosity; in a doctoral thesis or a peer-reviewed manuscript it can be career-ending. Single-model review has a related, documented weakness: Liang et al.[1] found that while GPT-4 feedback overlaps substantially with human feedback, single-model reviewers exhibit systematic blind spots, particularly around novelty, that Shin et al.[2] later quantified with a focus-level evaluation framework.

The implication is structural. A research assistant cannot be one model with a good prompt. It must be (a) multiple independent models, (b) grounded in a real and current corpus, and (c) wired to fail loudly rather than silently when its components disagree.

Everything that follows is an attempt to satisfy those three requirements as honestly as we know how.

§ 2 · The Pipeline

Six stages, three independent models

Each stage has a single responsibility. The pipeline is designed so that any stage can fail without corrupting the others — and so that the output of each stage is independently verifiable.

01

Ingest & Parse

Role: Document Extraction
Time: <2s
Cost: ~0 credits

Paper uploaded as PDF, DOCX, or TXT (up to 15k words). Processed in memory only — never written to disk. Title and author metadata extracted; document is verified to be an academic paper as a sanity check before any model is invoked.

Why this matters

Garbage in, garbage out. If we can’t parse the structure cleanly, the downstream analysis is unreliable. This step also protects against non-academic uploads consuming expensive model calls.

Outputs
Extracted full text
Section boundaries (regex-based)
Metadata (title, authors, word count)
Components
pdfminer.six · python-docx · in-memory pipeline
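The regex-based section detection listed under Outputs can be sketched as follows. This is a minimal illustration, not the production parser: the heading vocabulary, the `split_sections` and `looks_academic` names, and the three-heading threshold are all assumptions made for the example.

```python
import re

# Hypothetical heading vocabulary; the real parser's patterns are not published.
HEADING = re.compile(
    r"^\s*(?:\d+(?:\.\d+)*\.?\s+)?"          # optional numbering such as "2" or "3.1"
    r"(abstract|introduction|related work|background|methods?|methodology|"
    r"experiments?|results|discussion|conclusions?|references)\s*$",
    re.IGNORECASE | re.MULTILINE,
)

def split_sections(text: str) -> dict[str, str]:
    """Map each detected heading (lower-cased) to the text that follows it."""
    matches = list(HEADING.finditer(text))
    sections = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[m.group(1).lower()] = text[m.end():end].strip()
    return sections

def looks_academic(sections: dict[str, str], minimum: int = 3) -> bool:
    """Cheap sanity check: enough recognisable headings before any model call."""
    return len(sections) >= minimum

sample = (
    "Abstract\nWe study X.\n"
    "1 Introduction\nX matters.\n"
    "2 Methods\nWe did Y.\n"
    "3 Results\nY works.\n"
)
sections = split_sections(sample)
```

A check of this kind runs in well under a millisecond, which is why it can sit in front of every model call at essentially no cost.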
02

Reference Verification · Critical Path

Role: API Layer
Time: 3–8s
Cost: 0 credits (free APIs)

Every cited reference is parsed from the manuscript and resolved live against CrossRef and Semantic Scholar. Failed lookups are flagged as potential hallucinations or formatting errors. This is the single most important step — and the one most AI tools skip entirely.

Why this matters

LLMs hallucinate citations with perfect confidence. The only way to know if a reference is real is to check it against an authoritative database. We do this for every single reference, every time.

Outputs
Verification rate (% confirmed real)
DOI resolution for each ref
Citation count + journal metadata
Unverified refs flagged with reason
Components
CrossRef REST API · Semantic Scholar Graph API · DOI resolver
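A hedged sketch of what a single reference check might look like. It uses the public CrossRef `works/{doi}` and Semantic Scholar Graph `paper/DOI:{doi}` endpoints; the function names, the fallback order, and the 0.85 title-similarity threshold are assumptions for illustration and are not taken from the production system. The `fetch` parameter is injectable so the logic can be exercised without network access.

```python
import json
from difflib import SequenceMatcher
from typing import Callable, Optional
from urllib.error import URLError
from urllib.parse import quote
from urllib.request import urlopen

CROSSREF = "https://api.crossref.org/works/"
S2 = "https://api.semanticscholar.org/graph/v1/paper/DOI:"

def http_get_json(url: str) -> Optional[dict]:
    """Fetch JSON; return None on a 404 or any network failure."""
    try:
        with urlopen(url, timeout=10) as resp:
            return json.load(resp)
    except (URLError, TimeoutError):   # HTTPError is a subclass of URLError
        return None

def similar(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two titles (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def verify_reference(doi: str, cited_title: str,
                     fetch: Callable[[str], Optional[dict]] = http_get_json,
                     threshold: float = 0.85) -> dict:
    """Resolve a DOI via CrossRef, fall back to Semantic Scholar, compare titles."""
    record = fetch(CROSSREF + quote(doi, safe="/"))
    if record:
        titles = record.get("message", {}).get("title") or [""]
        found, source = titles[0], "crossref"
    else:
        record = fetch(S2 + quote(doi, safe="/") + "?fields=title,citationCount,venue")
        found, source = (record or {}).get("title") or "", "semantic_scholar"
    if not record:
        return {"doi": doi, "verified": False, "reason": "DOI not found"}
    if similar(found, cited_title) < threshold:
        return {"doi": doi, "verified": False, "reason": "title mismatch", "found": found}
    return {"doi": doi, "verified": True, "source": source}
```

Returning a reason alongside the verdict matters: a DOI that resolves to a different title usually indicates a formatting or copy-paste error, while a DOI that resolves nowhere is the stronger hallucination signal.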
03

Prior-Work Search

Role: Grounding
Time: 4–10s
Cost: ~5 credits

The paper’s abstract and contribution claims are converted into 4–6 search queries at varying specificity (broad-field, sub-area, technique, benchmark). Semantic Scholar’s 200M+ paper graph returns candidate related work; metadata is downloaded for the top results.

Why this matters

You cannot assess originality without knowing what already exists. Single-model reviewers consistently over-estimate novelty because they lack access to the real literature. Grounding the originality score in retrieved papers targets that failure mode directly, within the coverage limits noted in § 7.

Outputs
6 related papers with abstracts
Citation counts + venue prestige
Relevance assessment per paper
Gap analysis for originality scoring
Components
Semantic Scholar Graph · query decomposition · relevance ranking
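The query-decomposition and ranking steps could look roughly like this. The sketch builds one Semantic Scholar `paper/search` URL per specificity level and ranks candidates by plain token overlap; the Jaccard scoring, the function names, and the field list are stand-ins chosen for illustration, since the actual relevance model is not described here.

```python
import re
from urllib.parse import urlencode

S2_SEARCH = "https://api.semanticscholar.org/graph/v1/paper/search"
FIELDS = "title,abstract,year,venue,citationCount"

def build_search_urls(queries: dict[str, str], limit: int = 10) -> dict[str, str]:
    """One Semantic Scholar search URL per specificity level."""
    urls = {}
    for level, query in queries.items():
        params = urlencode({"query": query, "limit": limit, "fields": FIELDS})
        urls[level] = f"{S2_SEARCH}?{params}"
    return urls

def tokens(text: str) -> set[str]:
    """Lower-cased alphanumeric word set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def rank_candidates(abstract: str, candidates: list[dict], top_k: int = 6) -> list[dict]:
    """Order candidates by Jaccard overlap with the manuscript abstract."""
    base = tokens(abstract)

    def score(paper: dict) -> float:
        other = tokens(paper.get("title", "") + " " + (paper.get("abstract") or ""))
        union = base | other
        return len(base & other) / len(union) if union else 0.0

    return sorted(candidates, key=score, reverse=True)[:top_k]

# Hypothetical queries at the four specificity levels named above.
urls = build_search_urls({
    "broad-field": "automated peer review",
    "sub-area": "LLM manuscript feedback",
    "technique": "multi-model ensemble score aggregation",
    "benchmark": "ICLR OpenReview reviewer agreement",
})
```

The default `top_k=6` mirrors the six related papers listed under Outputs.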
04

Multi-Pass Analysis · Critical Path

Role: Triple-Model Ensemble
Time: 45–90s
Cost: ~30 credits

Two independent passes across three models (six analyses total). Pass 1 focuses on methodology, internal consistency, and argument structure. Pass 2 evaluates writing, citation quality, compliance with target venue, and originality — grounded in the prior work retrieved in step 03.

Why this matters

Each model has different blind spots. o4-mini excels at logical reasoning, Claude at nuanced writing assessment, Gemini at broad knowledge. Running all three on focused sub-tasks captures signals that any single model would miss.

Outputs
3 independent methodology assessments
3 independent evaluation reports
Per-dimension scores from each model
Up to 8 evidence-grounded findings per section
Components
o4-mini (OpenAI) · Claude Sonnet 4.6 (Anthropic) · Gemini 2.5 Flash (Google)
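The two-pass, three-model fan-out (six analyses) can be illustrated with a small concurrency sketch. The reviewer callables here are hypothetical placeholders for the vendor API clients, and the choice of a thread pool is an assumption; the point is only that each (pass, model) pair runs independently and that one failing call is recorded without sinking the others.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
from typing import Callable

PASSES = ("methodology", "evaluation")      # Pass 1 and Pass 2
Reviewer = Callable[[str, str], dict]       # (pass_name, manuscript) -> result dict

def run_ensemble(manuscript: str,
                 reviewers: dict[str, Reviewer]) -> dict[tuple[str, str], dict]:
    """Run every (pass, model) pair concurrently; failures are recorded, not raised."""
    jobs = list(product(PASSES, reviewers))

    def call(job: tuple[str, str]) -> dict:
        pass_name, model = job
        try:
            return reviewers[model](pass_name, manuscript)
        except Exception as exc:            # isolate one bad model from the rest
            return {"error": str(exc)}

    with ThreadPoolExecutor(max_workers=max(1, len(jobs))) as pool:
        return dict(zip(jobs, pool.map(call, jobs)))
```

With three reviewers registered, the result has six keys such as `("methodology", "model_a")`, each holding either that model's scores or an `error` entry.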
05

Conservative Merge

Role: Score Aggregation
Time: <1s
Cost: 0 credits

Across the three independent analyses, the lowest score on each of the 9 dimensions wins. Disagreements are surfaced rather than averaged away. This is the same principle used in safety-critical systems: when models disagree on a paper’s quality, that disagreement is itself a signal worth showing the author.

Why this matters

Averaging hides disagreement. If two models score methodology at 75 but one flags a critical flaw and scores 40, the author needs to know about the 40 — not see a reassuring 63. The minimum is the honest number.

Outputs
9 merged dimension scores
Disagreement flags (>15pt spread)
Deduplicated findings list
Confidence indicators per dimension
Components
dimension-wise min · disagreement flagging · confidence weighting
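The merge rule is simple enough to write down directly. The sketch below takes the per-dimension minimum and flags any spread above 15 points, using the 75 / 75 / 40 methodology example from the text; the function name, the model labels, and the output layout are illustrative assumptions.

```python
from statistics import mean

def conservative_merge(scores: dict[str, dict[str, float]],
                       spread_limit: float = 15) -> dict[str, dict]:
    """Per dimension: keep the minimum and flag spreads above spread_limit."""
    dimensions = set.intersection(*(set(s) for s in scores.values()))
    merged = {}
    for dim in sorted(dimensions):
        values = {model: s[dim] for model, s in scores.items()}
        low, high = min(values.values()), max(values.values())
        merged[dim] = {
            "score": low,                                # the minimum wins
            "lowest_model": min(values, key=values.get),
            "mean": round(mean(values.values()), 1),     # shown for contrast only
            "disagreement": (high - low) > spread_limit,
        }
    return merged

example = {
    "model_a": {"methodology": 75, "writing": 80},
    "model_b": {"methodology": 75, "writing": 78},
    "model_c": {"methodology": 40, "writing": 82},
}
merged = conservative_merge(example)
# merged["methodology"] -> score 40, mean 63.3, disagreement True  (spread 35)
# merged["writing"]     -> score 78, mean 80,   disagreement False (spread 4)
```

The methodology row shows the gap the text describes: the mean would report about 63, while the merged score is 40 and the dimension is flagged for the author.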
06

Reviewer 2 Synthesis

Role: Final Output
Time: 15–30s
Cost: ~15 credits

Claude composes the final reviewer report in the voice of an experienced peer reviewer — section grades (A–F), strengths, weaknesses, internal consistency contradictions with side-by-side quotes, and a numbered revision roadmap. Calibrated to the target venue selected at upload.

Why this matters

Scores alone don’t tell you what to fix. A good peer review is specific, actionable, and prioritized. The synthesis step transforms raw analysis into a report that reads like it was written by a demanding but constructive committee member.

Outputs
Overall verdict + maturity level
Section-by-section grades (A–F)
Prioritized action items with effort estimates
Revision roadmap with specific rewrites
Risk radar (what reviewers will attack)
Components
Claude Sonnet 4.6 · venue-conditioned prompt · structured output
§ 3 · Architecture

Data flow at a glance

Manuscript (PDF · DOCX · TXT) → Parser (extract refs + claims) → Reference Check (CrossRef · Semantic Scholar) → Prior-Work Search (S2 graph · 200M+ papers) → Triple-Model Ensemble (o4-mini, Claude Sonnet 4.6, Gemini 2.5 Flash; 2 passes · 6 analyses) → Conservative merge → Reviewer 2 (report + scores)
Legend: verification path · analysis path
§ 4 · Scoring

Nine dimensions, scored 0–100

Following the design rationale of Jiang & Ng[7] — who use seven dimensions regressed onto a final score — we expand the rubric to nine, with explicit emphasis on internal Consistency (cross-section contradictions) and Risk (the question every author should ask: what will the harshest reviewer attack first?). Each model scores each dimension independently. The merge takes the minimum.

D01 · Structure: Section ordering, IMRAD compliance, flow
D02 · Methodology: Soundness of design, validity threats, controls
D03 · Citations: Verified existence, accuracy, currency, breadth
D04 · Arguments: Logical progression, claim–evidence alignment
D05 · Consistency: Internal contradictions across sections
D06 · Compliance: Target-venue formatting and norms
D07 · Originality: Novelty assessed against retrieved prior work
D08 · Writing: Clarity, concision, academic register
D09 · Risk: What a real reviewer will attack first
§ 5 · Comparison

Where we differ from the closest comparable system

Stanford's paperreview.ai is the most directly comparable public agentic reviewer. Both systems share the core grounding intuition — that a reviewer needs access to current literature — but diverge on verification depth, model count, and how disagreement is handled.

Feature | ResearchBuddyAI | paperreview.ai
Citation existence verification | CrossRef + Semantic Scholar (live) | Not performed
Prior-work grounding | Semantic Scholar graph (200M+ papers) | arXiv via Tavily
Models per analysis | 3 (o4-mini, Sonnet 4.6, Gemini 2.5) | 1
Score aggregation | Conservative min (per dimension) | Linear regression on 7 dims
Disagreement handling | Surfaced to author | Averaged
Target-venue calibration | Yes (set at upload) | ICLR-only score display
Reported correlation | ρ = 0.61 preliminary (N=10, ICLR 2025) | ρ = 0.42 (N=147, ICLR 2025)

We owe an intellectual debt to the Stanford team. Their public technical writeup materially shaped how we present our own. Our preliminary validation (N=10, ICLR 2025) shows ρ = 0.61 — encouraging, but on a small sample. We are running a phased expansion: 30 papers next, then 75, then 150 to match Stanford's sample size. This page will be updated as the validation matures.

§ 5.1 · Early Validation

Preliminary results — ICLR 2025 smoke test

We ran 10 ICLR 2025 papers (5 accepted, 5 rejected) through the production Deep Audit pipeline and compared the 9-dimension scores against mean human reviewer ratings from OpenReview. The results below are directional — the sample is too small for statistical significance — but the signal is strong enough to warrant continued validation.
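For readers who want to reproduce this kind of comparison, Spearman's ρ is the Pearson correlation of the two rank vectors, with tied values sharing an average rank. The sketch below is a standard-library implementation written for illustration; it is not the validation script itself, and the function names are assumptions.

```python
from statistics import mean

def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1            # mean of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Pearson correlation computed on the rank vectors of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

Calling `spearman_rho(ai_scores, human_means)` on two equal-length lists, one entry per paper, yields the kind of figure reported below.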

ρ AI vs Human: 0.61
ρ Human vs Human: 0.24
Stanford Benchmark: 0.42
Caveat: N=10. At this sample size the interval on ρ = 0.61 is wide: a Fisher z approximation gives a 95% interval of roughly -0.03 to 0.90, which does not exclude zero. These results will be updated as the validation expands to 30, 75, and 150 papers over the coming weeks. We publish early because we believe in showing our work, not because we consider these figures final.
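One common way to gauge the uncertainty on a correlation at small N is the Fisher z-transform. The sketch below applies it to ρ = 0.61 at each planned sample size. It is an approximation (the standard error 1/√(N−3) is derived for Pearson's r and is only approximate for Spearman's ρ), and the `fisher_ci` helper is written for this illustration. Under these assumptions the 95% interval at N=10 runs from about -0.03 to 0.90 and narrows substantially by N=150.

```python
import math

def fisher_ci(rho: float, n: int, z_crit: float = 1.959964) -> tuple[float, float]:
    """Approximate 95% CI for a correlation via the Fisher z-transform."""
    if n <= 3:
        raise ValueError("need n > 3")
    z = math.atanh(rho)                         # transform to an ~normal scale
    half_width = z_crit / math.sqrt(n - 3)      # z_crit * standard error
    return math.tanh(z - half_width), math.tanh(z + half_width)

# Interval width at each stage of the validation roadmap, holding rho fixed.
for n in (10, 30, 75, 150):
    low, high = fisher_ci(0.61, n)
    print(f"N={n:>3}: [{low:+.2f}, {high:+.2f}]")
```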
Per-dimension correlation with human scores
# | Dimension | Spearman ρ
1 | Writing | 0.77
2 | Methodology | 0.68
3 | Arguments | 0.65
4 | Citations | 0.62
5 | Originality | 0.58
6 | Consistency | 0.55
7 | Risk | 0.52
8 | Compliance | 0.41
9 | Structure | 0.30
What this suggests

The system's strongest agreement with human reviewers is on Writing quality (ρ = 0.77), plausibly because prose assessment is where the ensemble is strongest (stage 04 singles out Claude Sonnet 4.6 for nuanced writing assessment). Methodology and Arguments also correlate well (ρ = 0.68 and 0.65), which suggests the triple-model ensemble is capturing substantive quality signals, not just surface features. At N=10 these per-dimension figures are directional only.

What needs investigation

Structure (ρ = 0.30) is the weakest dimension. ICLR papers frequently deviate from strict IMRAD format, so the IMRAD-compliance check inside the Structure rubric may be too rigid for ML conference submissions. We are investigating whether venue-specific structure templates improve this dimension before the full validation run.

Validation roadmap
Week 1: N=10 · complete · ρ = 0.61
Week 2: N=30 · scheduled
Week 3: N=75 · planned
Week 4: N=150 · planned
§ 6 · The Intent Layer

Reading the brief before running a single model

A research request can hide ten different jobs. “Help me with offshore wind cost trends” could mean a literature review, a methodology design, a dataset hunt, or simply an open conversation. Sending the wrong job to the engine wastes the user's time and the model's reasoning. Before any heavy model is invoked, every chat message passes through a lightweight intent layer.

A small LLM reads the message together with the last few turns of the conversation and classifies it against the platform's full menu of tasks — literature review, deep research, research questions, methodology, analysis, references, dataset finder, and others. It returns a structured judgment: which task this looks like, the core topic in four to eight words, a confidence score, and a flag indicating whether one missing detail would materially sharpen the output.

The principle is the same one that underlies the rest of the pipeline: never act on an ambiguous brief, and never let the user discover the wrong direction at the end of a thirty-second generation. The intent layer is cheap. The work it gates is not.

Three roads after classification
Vague or conversational
Straight to chat
If the message is exploratory, broad, or simply a question the user is thinking out loud, no confirmation is shown. The conversation flows like an ordinary chat — no friction, no menu, no overhead.
Clear, but one detail missing
One focused question
If the request is well-formed but a single missing detail would meaningfully sharpen the output, the system asks one clarifying question — and only one. A second clarification request is never allowed.
Clear and high-confidence
Natural-language confirmation
If the intent is unambiguous, a short confirmation is shown in plain language before any heavier work is run. The user can redirect with a single sentence if the system misread the brief.
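The three roads can be expressed as a small routing function over the classifier's structured judgment. This is a sketch under stated assumptions: the field names, the `"chat"` task label, and the 0.8 confidence cut-off are hypothetical, since the actual schema and thresholds are not given here.

```python
from dataclasses import dataclass
from enum import Enum

class Route(Enum):
    CHAT = "straight to chat"
    CLARIFY = "one focused question"
    CONFIRM = "natural-language confirmation"

@dataclass
class IntentJudgment:
    task: str            # e.g. "literature review"; "chat" when exploratory
    topic: str           # core topic in four to eight words
    confidence: float    # 0.0 to 1.0
    needs_detail: bool   # would one missing detail materially sharpen the output?

def route(judgment: IntentJudgment, already_clarified: bool = False,
          min_confidence: float = 0.8) -> Route:
    """Pick one of the three roads; a second clarifying question is never asked."""
    if judgment.task == "chat" or judgment.confidence < min_confidence:
        return Route.CHAT
    if judgment.needs_detail and not already_clarified:
        return Route.CLARIFY
    return Route.CONFIRM

judgment = IntentJudgment("literature review", "offshore wind levelized cost trends", 0.92, True)
first = route(judgment)                            # Route.CLARIFY
second = route(judgment, already_clarified=True)   # Route.CONFIRM
```

The `already_clarified` flag is what enforces the one-question rule: once a clarification has been asked in the conversation, the same judgment falls through to confirmation.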
§ 7 · Limitations

What this system cannot do

01
Reviews are AI-generated, and may contain errors.
We have layered verification, but no system is perfect. Authors should treat the report as a strong second opinion, not as a substitute for supervisor or peer review.
02
Coverage is field-dependent.
Originality assessments are stronger in fields with extensive open-access publishing (CS, ML, biomedical preprints) and weaker in fields where the canonical literature sits behind paywalls or in non-indexed venues.
03
We support English-language papers only at present.
Multilingual support is on the roadmap but not yet shipped. Submitting non-English papers will produce degraded results.
04
Models drift; so do our results.
When OpenAI, Anthropic, or Google ship a new version of a model we use, the outputs change. We pin specific versions where possible and re-validate when we update.
05
We do not compete with peer review.
If you are a reviewer for a conference or journal, do not submit assigned papers to ResearchBuddyAI in any way that violates that venue’s confidentiality policy. This tool exists for authors auditing their own work.
§ 8 · References

The literature behind these choices

[1]
Liang, W. et al. “Can Large Language Models Provide Useful Feedback on Research Papers? A Large-Scale Empirical Analysis.” NEJM AI 1(8), AIoa2400196, 2024. doi:10.1056/AIoa2400196
Foundational study showing GPT-4 feedback overlaps substantially with human reviewer feedback — but with a documented blind spot on novelty assessment. Motivates our prior-work grounding step.
[2]
Shin, H. et al. “Mind the Blind Spots: A Focus-Level Evaluation Framework for LLM Reviews.” arXiv:2502.17086, 2025. doi:10.48550/arXiv.2502.17086
Demonstrates that single-model reviewers over-weight technical validity and under-weight novelty. Directly motivates the multi-model ensemble.
[3]
D’Arcy, M., Hope, T., Birnbaum, L. & Downey, D. “MARG: Multi-Agent Review Generation for Scientific Papers.” arXiv:2401.04259, 2024. doi:10.48550/arXiv.2401.04259
Shows that multi-agent discussion produces more specific, helpful feedback than single-agent generation. Informs our two-pass design.
[4]
Sahu, G., Larochelle, H., Charlin, L. & Pal, C. “ReviewerToo: Should AI Join The Program Committee?” arXiv:2510.08867, 2025. doi:10.48550/arXiv.2510.08867
Diverse reviewer personas match human accuracy but introduce sycophancy risks during rebuttals. Cited as caution; we deliberately do not allow rebuttal-style follow-up that could amplify this.
[5]
Thakkar, N. et al. “Can LLM feedback enhance review quality? A randomized study of 20K reviews at ICLR 2025.” arXiv:2504.09737, 2025. doi:10.48550/arXiv.2504.09737
Randomized study showing LLM feedback nudges human reviewers toward more specific, actionable comments. Supports the value proposition for pre-submission self-review.
[6]
Jin, Y. et al. “AgentReview: Exploring Peer Review Dynamics with LLM Agents.” arXiv:2406.12708, 2024. doi:10.48550/arXiv.2406.12708
Studies how reviewer-author agent dynamics affect outcomes. Informs how we present disagreement between our three models to authors.
[7]
Jiang, Y. & Ng, A. “Tech Overview: Stanford Agentic Reviewer (paperreview.ai).” Stanford ML Group, 2025. https://paperreview.ai/tech-overview
Closest public comparator. Reports Spearman ρ = 0.42 between agent and a single human reviewer on ICLR 2025 (vs. ρ = 0.41 human–human). Single-model agent grounded in arXiv via Tavily.
End of document · v1.0 · May 2026

Read the system. Then try it on your draft.

Forty free credits, no card required. Run a Deep Audit on a paper you're about to submit and see the report this pipeline produces.
