ResearchBuddyAI

Vol. 1 · Tech Overview · 2026

A Technical Account

How we verify research,
not just generate it.

An open description of the multi-model verification pipeline behind the Deep Audit, Peer Review, and Deep Research tools — what each agent does, why we made the choices we did, and how our results compare to other agentic reviewers in the literature.

researchbuddyai.com  ·  Last revised: May 2026

Abstract

Most AI research tools generate plausible-sounding output and stop there. ResearchBuddyAI is built on the opposite premise: every claim, citation, and novelty assessment must be verified against an external source of truth before the user sees it. We achieve this through three coupled mechanisms — live API verification of every cited reference (CrossRef + Semantic Scholar), a triple-model ensemble (o4-mini, Claude Sonnet 4.6, Gemini 2.5 Flash) with conservative score aggregation, and prior-work grounding for originality assessment. This document describes the full pipeline and our design choices, with citations to the underlying research where relevant.

§ 1

The hallucination problem in academic writing

Large language models routinely invent citations: authors who appear plausible, journal names that sound right, DOIs that resolve to nothing. In casual use this is a curiosity; in a doctoral thesis or a peer-reviewed manuscript it can be career-ending. The empirical literature points the same way: Liang et al.^[1] found that while GPT-4 feedback overlaps substantially with human feedback, single-model reviewers exhibit systematic blind spots, particularly around novelty, that Shin et al.^[2] later quantified with a focus-level evaluation framework.

The implication is structural. A research assistant cannot be one model with a good prompt. It must be (a) multiple independent models, (b) grounded in a real and current corpus, and (c) wired to fail loudly rather than silently when its components disagree.

Everything that follows is an attempt to satisfy those three requirements as honestly as we know how.

§ 2 · The Pipeline

Six stages, three independent models

Each stage has a single responsibility. The pipeline is designed so that any stage can fail without corrupting the others — and so that the output of each stage is independently verifiable.

Ingest & Parse

Role: Document Extraction · Time: <2s · Cost: ~0 credits

Paper uploaded as PDF, DOCX, or TXT (up to 15k words). Processed in memory only — never written to disk. Title and author metadata extracted; document is verified to be an academic paper as a sanity check before any model is invoked.

Why this matters

Garbage in, garbage out. If we can’t parse the structure cleanly, the downstream analysis is unreliable. This step also protects against non-academic uploads consuming expensive model calls.

Outputs

→Extracted full text

→Section boundaries (regex-based)

→Metadata (title, authors, word count)

Components

pdfminer.six · python-docx · in-memory pipeline
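The regex-based section detection and the word limit mentioned above can be sketched roughly as follows. This is a minimal illustration, not the production parser: the heading pattern, the list of section names, and the function names are all assumptions made for the example.

```python
import re

# Hypothetical heading pattern: optional numbering ("1.", "2.3") followed by
# a common section name alone on its line. The real rubric is not published.
HEADING_RE = re.compile(
    r"^[ \t]*(?:\d+(?:\.\d+)*\.?[ \t]+)?"
    r"(abstract|introduction|related work|background|methods?|methodology|"
    r"experiments?|results|discussion|conclusions?|references)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)

MAX_WORDS = 15_000  # upload limit stated in the stage description


def word_count(text: str) -> int:
    """Count whitespace-separated tokens."""
    return len(text.split())


def section_boundaries(text: str) -> list:
    """Return (section name, start offset, end offset) for each detected heading."""
    matches = list(HEADING_RE.finditer(text))
    sections = []
    for i, m in enumerate(matches):
        # A section runs until the next heading, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append((m.group(1).lower(), m.start(), end))
    return sections
```

A document that yields no recognizable headings, or exceeds `MAX_WORDS`, could be rejected at this point before any model call is made.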

Reference Verification · Critical Path

Role: API Layer · Time: 3–8s · Cost: 0 credits (free APIs)

Every cited reference is parsed from the manuscript and resolved live against CrossRef and Semantic Scholar. Failed lookups are flagged as potential hallucinations or formatting errors. This is the single most important step — and the one most AI tools skip entirely.

Why this matters

LLMs hallucinate citations with perfect confidence. The only way to know if a reference is real is to check it against an authoritative database. We do this for every single reference, every time.

Outputs

→Verification rate (% confirmed real)

→DOI resolution for each ref

→Citation count + journal metadata

→Unverified refs flagged with reason

Components

CrossRef REST API · Semantic Scholar Graph API · DOI resolver
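As a rough sketch of how a reference might be turned into lookup requests, the snippet below extracts a DOI when one is present and builds request URLs for the public CrossRef and Semantic Scholar endpoints. It deliberately stops short of making network calls. The DOI pattern, the helper names, and the fallback to a bibliographic query are assumptions for illustration, not a description of the production code.

```python
import re
from typing import Optional
from urllib.parse import quote, urlencode

# Loose DOI pattern (assumption): "10.", a registrant code, "/", then a suffix.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+", re.IGNORECASE)


def extract_doi(reference: str) -> Optional[str]:
    """Return the first DOI-like string in a reference, without trailing punctuation."""
    m = DOI_RE.search(reference)
    return m.group(0).rstrip(".,;)") if m else None


def crossref_url(reference: str) -> str:
    """Direct DOI lookup when possible, otherwise a bibliographic search."""
    doi = extract_doi(reference)
    if doi:
        return "https://api.crossref.org/works/" + quote(doi, safe="/")
    query = urlencode({"query.bibliographic": reference, "rows": 1})
    return "https://api.crossref.org/works?" + query


def semantic_scholar_url(doi: str) -> str:
    """Graph API lookup by DOI, requesting a few metadata fields."""
    base = "https://api.semanticscholar.org/graph/v1/paper/"
    return f"{base}DOI:{quote(doi, safe='/')}?fields=title,citationCount,venue"


def verification_rate(confirmed: dict) -> float:
    """Percentage of references confirmed real; `confirmed` maps reference -> bool."""
    if not confirmed:
        return 0.0
    return 100.0 * sum(confirmed.values()) / len(confirmed)
```

A reference whose lookup returns nothing would be recorded as `False` and flagged with a reason, rather than silently dropped.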

Prior-Work Search

Role: Grounding · Time: 4–10s · Cost: ~5 credits

The paper’s abstract and contribution claims are converted into 4–6 search queries at varying specificity (broad-field, sub-area, technique, benchmark). Semantic Scholar’s 200M+ paper graph returns candidate related work; metadata is downloaded for the top results.

Why this matters

You cannot assess originality without knowing what already exists. Single-model reviewers consistently over-estimate novelty because they lack access to the real literature. This step eliminates that failure mode.

Outputs

→6 related papers with abstracts

→Citation counts + venue prestige

→Relevance assessment per paper

→Gap analysis for originality scoring

Components

Semantic Scholar Graph · query decomposition · relevance ranking
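One way to picture the search and ranking step is the sketch below. It builds a Semantic Scholar paper-search URL for each query, then deduplicates the pooled candidates and keeps the six closest to the abstract. The word-overlap score is only a crude stand-in chosen for the example; how relevance is actually ranked is not specified here, and the function names are hypothetical.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://api.semanticscholar.org/graph/v1/paper/search"


def search_url(query: str, limit: int = 10) -> str:
    """Build a paper-search request for one decomposed query."""
    params = {"query": query, "limit": limit,
              "fields": "title,abstract,citationCount,venue"}
    return SEARCH_ENDPOINT + "?" + urlencode(params)


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets (illustrative relevance proxy)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def top_related(abstract: str, candidates: list, k: int = 6) -> list:
    """Deduplicate by paperId, then keep the k candidates most similar to the abstract."""
    unique = {c["paperId"]: c for c in candidates}
    ranked = sorted(
        unique.values(),
        key=lambda c: token_overlap(abstract, c.get("abstract") or ""),
        reverse=True,
    )
    return ranked[:k]
```

Pooling results across broad and narrow queries before ranking means a paper found by several queries is counted once, not several times.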

Multi-Pass Analysis · Critical Path

Role: Triple-Model Ensemble · Time: 45–90s · Cost: ~30 credits

Two independent passes across three models (six analyses total). Pass 1 focuses on methodology, internal consistency, and argument structure. Pass 2 evaluates writing, citation quality, compliance with target venue, and originality — grounded in the prior work retrieved in step 03.

Why this matters

Each model has different blind spots. o4-mini excels at logical reasoning, Claude at nuanced writing assessment, Gemini at broad knowledge. Running all three on focused sub-tasks captures signals that any single model would miss.

Outputs

→3 independent methodology assessments

→3 independent evaluation reports

→Per-dimension scores from each model

→Up to 8 evidence-grounded findings per section

Components

o4-mini (OpenAI) · Claude Sonnet 4.6 (Anthropic) · Gemini 2.5 Flash (Google)
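The two-pass, three-model fan-out described above amounts to six independent jobs. A minimal sketch of that shape, using a thread pool and a stub in place of real provider calls, might look like this. The model labels, the `call_model` signature, and the stub are hypothetical; the pass focus areas are taken from the stage description.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Display labels only (assumption); these are not provider API model IDs.
MODELS = ["o4-mini", "claude-sonnet-4.6", "gemini-2.5-flash"]
PASSES = {
    1: ["methodology", "consistency", "arguments"],
    2: ["writing", "citations", "compliance", "originality"],
}


def run_ensemble(paper: str, call_model: Callable[[str, int, str], dict]) -> list:
    """Run every (model, pass) pair concurrently and collect the six analyses."""
    jobs = [(model, pass_no) for pass_no in PASSES for model in MODELS]
    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        futures = [pool.submit(call_model, model, pass_no, paper)
                   for model, pass_no in jobs]
        return [f.result() for f in futures]


def fake_call(model: str, pass_no: int, paper: str) -> dict:
    """Stub standing in for a real model call; returns a fixed score per focus area."""
    return {"model": model, "pass": pass_no,
            "scores": {dim: 70 for dim in PASSES[pass_no]}}


analyses = run_ensemble("full paper text", fake_call)
```

Because no job reads another job's output, a failure in one model call does not contaminate the remaining five analyses.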

Conservative Merge

Role: Score Aggregation · Time: <1s · Cost: 0 credits

For each of the 9 dimensions, the lowest score among the three models wins. Disagreements are surfaced rather than averaged away. The same conservative principle is common in safety-critical systems: when models disagree on a paper’s quality, that disagreement is itself a signal worth showing the author.

Why this matters

Averaging hides disagreement. If two models score methodology at 75 but one flags a critical flaw and scores 40, the author needs to know about the 40 — not see a reassuring 63. The minimum is the honest number.

Outputs

→9 merged dimension scores

→Disagreement flags (>15pt spread)

→Deduplicated findings list

→Confidence indicators per dimension

Components

dimension-wise min · disagreement flagging · confidence weighting
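The dimension-wise minimum and the 15-point disagreement flag can be written in a few lines. The sketch below reuses the worked example from this stage (two models at 75, one at 40); the function name and the output layout are assumptions, and confidence weighting is left out because its details are not given.

```python
from statistics import mean

DISAGREEMENT_THRESHOLD = 15  # points; a spread above this is flagged


def conservative_merge(scores_by_model: dict) -> dict:
    """Per dimension: keep the minimum, and flag large spreads between models."""
    dimensions = set().union(*(s.keys() for s in scores_by_model.values()))
    merged = {}
    for dim in sorted(dimensions):
        values = [s[dim] for s in scores_by_model.values() if dim in s]
        spread = max(values) - min(values)
        merged[dim] = {
            "score": min(values),             # the conservative number shown to the author
            "mean": round(mean(values), 1),   # kept only to show what averaging would hide
            "spread": spread,
            "disagreement": spread > DISAGREEMENT_THRESHOLD,
        }
    return merged


scores = {
    "o4-mini": {"methodology": 75, "writing": 80},
    "claude": {"methodology": 75, "writing": 78},
    "gemini": {"methodology": 40, "writing": 82},
}
merged = conservative_merge(scores)
# methodology: score 40, mean 63.3, spread 35, flagged as a disagreement
# writing: score 78, spread 4, not flagged
```

The mean is computed here purely for contrast: in the methodology example it lands near 63, while the merged score reported to the author is 40.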

Reviewer 2 Synthesis

Role: Final Output · Time: 15–30s · Cost: ~15 credits

Claude composes the final reviewer report in the voice of an experienced peer reviewer — section grades (A–F), strengths, weaknesses, internal consistency contradictions with side-by-side quotes, and a numbered revision roadmap. Calibrated to the target venue selected at upload.

Why this matters

Scores alone don’t tell you what to fix. A good peer review is specific, actionable, and prioritized. The synthesis step transforms raw analysis into a report that reads like it was written by a demanding but constructive committee member.

Outputs

→Overall verdict + maturity level

→Section-by-section grades (A–F)

→Prioritized action items with effort estimates

→Revision roadmap with specific rewrites

→Risk radar (what reviewers will attack)

Components

Claude Sonnet 4.6 · venue-conditioned prompt · structured output

§ 3 · Architecture

Data flow at a glance

§ 4 · Scoring

Nine dimensions, scored 0–100

Following the design rationale of Jiang & Ng^[7] — who use seven dimensions regressed onto a final score — we expand the rubric to nine, with explicit emphasis on internal Consistency (cross-section contradictions) and Risk (the question every author should ask: what will the harshest reviewer attack first?). Each model scores each dimension independently. The merge takes the minimum.

D01 · Structure: Section ordering, IMRAD compliance, flow

D02 · Methodology: Soundness of design, validity threats, controls

D03 · Citations: Verified existence, accuracy, currency, breadth

D04 · Arguments: Logical progression, claim–evidence alignment

D05 · Consistency: Internal contradictions across sections

D06 · Compliance: Target-venue formatting and norms

D07 · Originality: Novelty assessed against retrieved prior work

D08 · Writing: Clarity, concision, academic register

D09 · Risk: What a real reviewer will attack first

§ 5 · Comparison

Where we differ from the closest comparable system

Stanford's paperreview.ai is the most directly comparable public agentic reviewer. Both systems share the core grounding intuition — that a reviewer needs access to current literature — but diverge on verification depth, model count, and how disagreement is handled.

| Feature | ResearchBuddyAI | paperreview.ai |
| --- | --- | --- |
| Citation existence verification | CrossRef + Semantic Scholar (live) | Not performed |
| Prior-work grounding | Semantic Scholar graph (200M+ papers) | arXiv via Tavily |
| Models per analysis | 3 (o4-mini, Sonnet 4.6, Gemini 2.5) | 1 |
| Score aggregation | Conservative min (per dimension) | Linear regression on 7 dims |
| Disagreement handling | Surfaced to author | Averaged |
| Target-venue calibration | Yes (set at upload) | ICLR-only score display |
| Reported correlation | ρ = 0.61 preliminary (N=10, ICLR 2025) | ρ = 0.42 (N=147, ICLR 2025) |

We owe an intellectual debt to the Stanford team. Their public technical writeup materially shaped how we present our own. Our preliminary validation (N=10, ICLR 2025) shows ρ = 0.61 — encouraging, but on a small sample. We are running a phased expansion: 30 papers next, then 75, then 150 to match Stanford's sample size. This page will be updated as the validation matures.

§ 5.1 · Early Validation

Preliminary results — ICLR 2025 smoke test

We ran 10 ICLR 2025 papers (5 accepted, 5 rejected) through the production Deep Audit pipeline and compared the 9-dimension scores against mean human reviewer ratings from OpenReview. The results below are directional — the sample is too small for statistical significance — but the signal is strong enough to warrant continued validation.

ρ AI vs Human: 0.61

ρ Human vs Human: 0.24

Stanford Benchmark: 0.42

Caveat: N=10. The confidence interval on ρ = 0.61 is approximately ±0.20. These results will be updated as the validation expands to 30, 75, and 150 papers over the coming weeks. We publish early because we believe in showing our work — not because we consider these figures final.
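For readers who want to reproduce this kind of comparison on their own data, Spearman's ρ is the Pearson correlation of the two rank vectors, with tied values sharing their average rank. The sketch below is a generic pure-Python version; the input numbers are made up for illustration and are not the validation data.

```python
from math import sqrt


def average_ranks(values: list) -> list:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x: list, y: list) -> float:
    """Pearson correlation computed on the rank vectors of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Illustrative inputs only: five hypothetical AI scores and human mean ratings.
ai_scores = [55, 62, 70, 78, 90]
human_ratings = [4.0, 3.5, 6.0, 5.5, 8.0]
rho = spearman_rho(ai_scores, human_ratings)
```

With only a handful of papers, swapping the ranks of a single pair moves ρ noticeably, which is one reason the figures in this section are presented as directional. The function is undefined when either input is constant, since the rank variance is then zero.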

Per-dimension correlation with human scores

| # | Dimension | Spearman ρ |
| --- | --- | --- |
| 1 | Writing | 0.77 |
| 2 | Methodology | 0.68 |
| 3 | Arguments | 0.65 |
| 4 | Citations | 0.62 |
| 5 | Originality | 0.58 |
| 6 | Consistency | 0.55 |
| 7 | Risk | 0.52 |
| 8 | Compliance | 0.41 |
| 9 | Structure | 0.30 |

What this suggests

The system's strongest agreement with human reviewers is on Writing quality (ρ = 0.77) — likely because the Reviewer 2 synthesis step (Claude Sonnet 4.6) excels at prose assessment. Methodology and Arguments also show strong correlation, suggesting the triple-model ensemble is capturing substantive quality signals, not just surface features.

What needs investigation

Structure (ρ = 0.30) is the weakest dimension. ICLR papers frequently deviate from strict IMRAD format, so the IMRAD-compliance check in the Structure rubric may be too rigid for ML conference submissions. We are investigating whether venue-specific structure templates improve this dimension before the full validation run.

Validation roadmap

Week 1 · ρ = 0.61 · N=10 · complete

Week 2 · ρ pending · N=30 · scheduled

Week 3 · ρ pending · N=75 · planned

Week 4 · ρ pending · N=150 · planned

§ 6 · The Intent Layer

Reading the brief before running a single model

A research request can hide ten different jobs. “Help me with offshore wind cost trends” could mean a literature review, a methodology design, a dataset hunt, or simply an open conversation. Sending the wrong job to the engine wastes the user's time and the model's reasoning. Before any heavy model is invoked, every chat message passes through a lightweight intent layer.

A small LLM reads the message together with the last few turns of the conversation and classifies it against the platform's full menu of tasks — literature review, deep research, research questions, methodology, analysis, references, dataset finder, and others. It returns a structured judgment: which task this looks like, the core topic in four to eight words, a confidence score, and a flag indicating whether one missing detail would materially sharpen the output.

The principle is the same one that underlies the rest of the pipeline: never act on an ambiguous brief, and never let the user discover the wrong direction at the end of a thirty-second generation. The intent layer is cheap. The work it gates is not.

Three roads after classification

Vague or conversational

Straight to chat

If the message is exploratory, broad, or simply a question the user is thinking out loud, no confirmation is shown. The conversation flows like an ordinary chat — no friction, no menu, no overhead.

Clear, but one detail missing

One focused question

If the request is well-formed but a single missing detail would meaningfully sharpen the output, the system asks one clarifying question — and only one. A second clarification request is never allowed.

Clear and high-confidence

Natural-language confirmation

If the intent is unambiguous, a short confirmation is shown in plain language before any heavier work is run. The user can redirect with a single sentence if the system misread the brief.
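The structured judgment and the three routes can be sketched as a small data type plus a routing function. Field names, the `"chat"` task label, and the 0.7 confidence cut-off are hypothetical; the source does not state the actual schema or threshold.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    CHAT = "straight to chat"
    CLARIFY = "one focused question"
    CONFIRM = "natural-language confirmation"


@dataclass
class IntentJudgment:
    task: str             # e.g. "literature_review"; "chat" for conversational messages
    topic: str            # core topic in four to eight words
    confidence: float     # 0.0 to 1.0
    missing_detail: bool  # would one extra detail materially sharpen the output?


CONFIDENCE_THRESHOLD = 0.7  # hypothetical cut-off, not a documented value


def route(judgment: IntentJudgment, already_clarified: bool = False) -> Route:
    """Pick one of the three roads; never ask a second clarifying question."""
    if judgment.task == "chat" or judgment.confidence < CONFIDENCE_THRESHOLD:
        return Route.CHAT
    if judgment.missing_detail and not already_clarified:
        return Route.CLARIFY
    return Route.CONFIRM


example = IntentJudgment(
    task="literature_review",
    topic="offshore wind levelized cost trends",
    confidence=0.9,
    missing_detail=True,
)
first = route(example)                           # Route.CLARIFY
second = route(example, already_clarified=True)  # Route.CONFIRM
```

Passing `already_clarified` as an explicit argument is one simple way to enforce the single-question rule: once a clarification has been asked, the same judgment falls through to confirmation.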

§ 7 · Limitations

What this system cannot do

Reviews are AI-generated, and may contain errors.

We have layered verification, but no system is perfect. Authors should treat the report as a strong second opinion, not as a substitute for supervisor or peer review.

Coverage is field-dependent.

Originality assessments are stronger in fields with extensive open-access publishing (CS, ML, biomedical preprints) and weaker in fields where the canonical literature sits behind paywalls or in non-indexed venues.

We support English-language papers only at present.

Multilingual support is on the roadmap but not yet shipped. Submitting non-English papers will produce degraded results.

Models drift; so do our results.

When OpenAI, Anthropic, or Google ship a new version of a model we use, the outputs change. We pin specific versions where possible and re-validate when we update.

We do not compete with peer review.

If you are a reviewer for a conference or journal, do not submit assigned papers to ResearchBuddyAI in any way that violates that venue’s confidentiality policy. This tool exists for authors auditing their own work.

§ 8 · References

The literature behind these choices

[1] Liang, W. et al. “Can Large Language Models Provide Useful Feedback on Research Papers? A Large-Scale Empirical Analysis.” NEJM AI 1(8), AIoa2400196, 2024. doi:10.1056/AIoa2400196

Foundational study showing GPT-4 feedback overlaps substantially with human reviewer feedback — but with a documented blind spot on novelty assessment. Motivates our prior-work grounding step.

[2] Shin, H. et al. “Mind the Blind Spots: A Focus-Level Evaluation Framework for LLM Reviews.” arXiv:2502.17086, 2025. doi:10.48550/arXiv.2502.17086

Demonstrates that single-model reviewers over-weight technical validity and under-weight novelty. Directly motivates the multi-model ensemble.

[3] D’Arcy, M., Hope, T., Birnbaum, L. & Downey, D. “MARG: Multi-Agent Review Generation for Scientific Papers.” arXiv:2401.04259, 2024. doi:10.48550/arXiv.2401.04259

Shows that multi-agent discussion produces more specific, helpful feedback than single-agent generation. Informs our two-pass design.

[4] Sahu, G., Larochelle, H., Charlin, L. & Pal, C. “ReviewerToo: Should AI Join The Program Committee?” arXiv:2510.08867, 2025. doi:10.48550/arXiv.2510.08867

Diverse reviewer personas match human accuracy but introduce sycophancy risks during rebuttals. Cited as caution; we deliberately do not allow rebuttal-style follow-up that could amplify this.

[5] Thakkar, N. et al. “Can LLM feedback enhance review quality? A randomized study of 20K reviews at ICLR 2025.” arXiv:2504.09737, 2025. doi:10.48550/arXiv.2504.09737

Pilot showing LLM feedback nudges human reviewers toward more specific, actionable comments. Validates the value proposition for pre-submission self-review.

[6] Jin, Y. et al. “AgentReview: Exploring Peer Review Dynamics with LLM Agents.” arXiv:2406.12708, 2024. doi:10.48550/arXiv.2406.12708

Studies how reviewer-author agent dynamics affect outcomes. Informs how we present disagreement between our three models to authors.

[7] Jiang, Y. & Ng, A. “Tech Overview — Stanford Agentic Reviewer (paperreview.ai).” Stanford ML Group, 2025. https://paperreview.ai/tech-overview

Closest public comparator. Reports Spearman ρ = 0.42 between agent and a single human reviewer on ICLR 2025 (vs. ρ = 0.41 human–human). Single-model agent grounded in arXiv via Tavily.

End of document · v1.0 · May 2026

Read the system. Then try it on your draft.

Forty free credits, no card required. Run a Deep Audit on a paper you're about to submit and see the report this pipeline produces.

researchbuddyai.com →
