🔬 ScholarEnv
The first RL environment for AI-assisted peer review and scholarly integrity verification
An AI agent that investigates papers, not one that produces them.
API Reference · Quick Start · Research
Nensi Pansuriya · Krushna Parmar · Ishita Bhojani
Meta × PyTorch OpenEnv Hackathon · Round 1 · April 2026
Why This Exists
~10,000 papers are retracted every year. Every major journal (Nature, Science, IEEE, ACM) screens for integrity manually, a bottleneck that does not scale. StatCheck found statistical reporting errors in ~50% of psychology papers in top journals.
The key insight: LLMs are already good at formatting. They fail at auditing.
Ask GPT-4o to format a manuscript and it scores ~0.92 with no training. Ask it to find numerical claim mismatches in a paper and it scores 0.20–0.45.
That gap is exactly where RL adds value. The agent must discover a document-traversal strategy (which sections to read first, which tables to cross-reference) that varies with paper structure and cannot be reduced to a fixed prompt. RL finds this strategy; prompting cannot.
Four Tasks
Formatting (Easy) → Consistency (Medium) → Claim Audit (Hard) → Citation Check (Medium)

| Task | What the agent does | Frontier baseline | RL target |
|---|---|---|---|
| `formatting_compliance` | Fix IEEE formatting violations | 0.80–0.95 | 0.95+ |
| `internal_consistency` | Find where the paper contradicts itself | 0.40–0.65 | 0.65–0.80 |
| `claim_evidence_audit` | Find where text claims ≠ table values | 0.20–0.45 | 0.55–0.75 |
| `citation_verification` | Identify ghost and misattributed references | 0.35–0.60 | 0.65–0.80 |
Task 3's low baseline is the core RL contribution: it proves genuine training headroom exists.
Reward Design
Task 1 · Progressive Reward Shaping (PRS)
Three stages unlock sequentially: Stage N contributes only when Stage N-1 scores ≥ Stage N's unlock threshold. This prevents GRPO gradient collapse. A minimal sketch of the gating logic follows the table.

| Stage | Weight | Unlock threshold | Checks |
|---|---|---|---|
| 1 | 0.40 | 0.00 | Title, abstract, section headings |
| 2 | 0.35 | 0.60 | Section order, word limits, captions |
| 3 | 0.25 | 0.70 | IEEE citations, author block, keywords |

Based on: arXiv 2512.07478 · PRS for Agentic RL
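A minimal sketch of the stage-gating computation, assuming per-stage scores in [0, 1] (function and argument names are illustrative, not the repo's actual API):

```python
def prs_reward(stage_scores, weights=(0.40, 0.35, 0.25),
               thresholds=(0.00, 0.60, 0.70)):
    """Progressive reward shaping: Stage N contributes only while
    Stage N-1 scored at or above Stage N's unlock threshold."""
    total = 0.0
    for i, (score, w, t) in enumerate(zip(stage_scores, weights, thresholds)):
        if i > 0 and stage_scores[i - 1] < t:
            break  # this stage, and every later one, stays locked
        total += w * score
    return total

# Strong Stage 1 unlocks Stage 2; weak Stage 2 keeps Stage 3 locked:
# prs_reward([0.9, 0.5, 0.8]) == 0.40*0.9 + 0.35*0.5 == 0.535
```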
Tasks 2 & 3 · F-beta + Potential-Based Reward Shaping
F-beta (β=0.5) weights precision 4× over recall, which prevents hallucination gaming:
F_β(precision=1.0, recall=0.5) = 0.833 → correct and precise
F_β(precision=0.2, recall=1.0) ≈ 0.238 → spamming guesses
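For reference, the standard formula is F_β = (1 + β²)·P·R / (β²·P + R). A quick sanity check of the two numbers above (a sketch, not the grader's actual code):

```python
def f_beta(precision, recall, beta=0.5):
    """Standard F-beta; beta < 1 weights precision over recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(round(f_beta(1.0, 0.5), 3))  # 0.833 -> correct and precise
print(round(f_beta(0.2, 1.0), 3))  # 0.238 -> spamming guesses is punished
```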
PBRS (Ng et al., ICML 1999) gives dense intermediate rewards on every navigation step:
Φ(s) = 0.30 × sections_read/total + 0.30 × tables_checked/total + 0.40 × claims_extracted/est
F(s, s') = γ·Φ(s') − Φ(s) → policy-invariant, with Ng et al.'s theoretical guarantee
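A minimal sketch of the potential function and shaping term; γ = 0.99 and the clamping details are our assumptions, not the repo's exact values:

```python
GAMMA = 0.99  # assumed discount factor; the repo's value may differ

def potential(sections_read, total_sections, tables_checked,
              total_tables, claims_extracted, est_claims):
    """Phi(s): weighted navigation progress, kept in [0, 1]."""
    return (0.30 * sections_read / max(total_sections, 1)
            + 0.30 * tables_checked / max(total_tables, 1)
            + 0.40 * min(claims_extracted / max(est_claims, 1), 1.0))

def shaping_bonus(phi_s, phi_next):
    """F(s, s') = gamma * Phi(s') - Phi(s): added to the sparse task
    reward without changing the optimal policy (Ng et al., 1999)."""
    return GAMMA * phi_next - phi_s
```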
Curriculum · AdaRFT + UCB1
AdaRFT keeps the agent in the productive zone (avg score 0.40–0.70); UCB1 maximises the learning gradient (reward variance) rather than mean reward. A sketch follows below.
avg > 0.70 → select harder papers
avg < 0.40 → select easier papers
Based on: arXiv 2504.05520 · AdaRFT Adaptive Data Selection
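A sketch of UCB1 with reward variance as the bandit payoff, selecting over difficulty-binned papers (class and parameter names are illustrative; server/bandit.py holds the real implementation):

```python
import math

class VarianceUCB1:
    """UCB1 over paper-difficulty bins, using reward *variance* as the
    payoff: high-variance bins are where the learning gradient lives."""
    def __init__(self, n_arms, c=1.4):
        self.c = c
        self.counts = [0] * n_arms
        self.rewards = [[] for _ in range(n_arms)]

    def select(self):
        total = sum(self.counts)
        best_arm, best_score = 0, float("-inf")
        for a in range(len(self.counts)):
            if self.counts[a] == 0:
                return a  # try every arm once before scoring
            mean = sum(self.rewards[a]) / self.counts[a]
            var = sum((r - mean) ** 2 for r in self.rewards[a]) / self.counts[a]
            score = var + self.c * math.sqrt(math.log(total) / self.counts[a])
            if score > best_score:
                best_arm, best_score = a, score
        return best_arm

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.rewards[arm].append(reward)
```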
Quick Start
Install
git clone https://github.com/Nensi1311/research-paper-formatter-agent
cd research-paper-formatter-agent
pip install -r requirements.txt
Generate corpus
python scripts/generate_corpus.py
Run tests
python tests/test_all.py
# ✅ ALL TESTS PASSED (63/63)
Start server
uvicorn server.app:app --host 0.0.0.0 --port 7860
Test all 4 tasks · Linux/macOS
for task in formatting_compliance internal_consistency claim_evidence_audit citation_verification; do
curl -s -X POST localhost:7860/reset \
-H "Content-Type: application/json" \
-d "{\"task_id\":\"$task\"}" | python3 -c \
"import sys,json; d=json.load(sys.stdin); print('$task: OK' if 'observation' in d else '$task: FAIL')"
done
Test all 4 tasks · Windows PowerShell
foreach ($task in @("formatting_compliance","internal_consistency","claim_evidence_audit","citation_verification")) {
$body = '{"task_id":"' + $task + '"}'
$r = Invoke-RestMethod -Uri "http://localhost:7860/reset" -Method POST -ContentType "application/json" -Body $body
if ($r.observation) { Write-Host "$task : OK" } else { Write-Host "$task : FAIL" }
}
Docker
docker build -t scholar-env .
docker run -p 7860:7860 scholar-env
curl http://localhost:7860/health
Run baseline agent
export API_BASE_URL="https://api-inference.huggingface.co/v1"
export MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
export HF_TOKEN="hf_your_token"
export HF_SPACE_URL="https://flyingmaverick-scholar-env.hf.space"
python inference.py
# Writes: baseline_scores.json
API Reference
POST /reset
{"task_id": "formatting_compliance"}
Returns observation with manuscript_text, style_guide, step_count, max_steps, hint.
POST /step
Task 1 · submit formatted manuscript:
{"task": "formatting_compliance", "formatted_text": "...full reformatted manuscript..."}
Tasks 2/3 · navigate:
{"task": "claim_evidence_audit", "action_type": "query_section", "section_name": "results"}
{"task": "claim_evidence_audit", "action_type": "check_table", "table_id": "Table 1"}
{"task": "claim_evidence_audit", "action_type": "extract_claims", "section_name": "results"}
Tasks 2/3 · submit findings:
{
"task": "claim_evidence_audit",
"action_type": "submit_findings",
"findings": [
{
"type": "table_text_mismatch",
"location": "abstract",
"claim": "Table 2 shows 87% accuracy",
"contradicts": "Table 2 value is 79%",
"table_id": "Table 2",
"table_value": "79%"
}
]
}
Task 4 · check citation:
{"task": "citation_verification", "action_type": "check_citation", "citation_id": "ref_3"}
Task 4 · submit verdicts:
{
"task": "citation_verification",
"action_type": "submit_verdicts",
"verdicts": [
{"citation_id": "ref_3", "status": "ghost", "issue": "Implausible title claim", "confidence": 0.9}
]
}
Step response:
{
"observation": {...},
"reward": 0.7341,
"done": false,
"info": {"f_beta": 0.73, "precision": 0.8, "recall": 0.67}
}
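A minimal Python client for one claim-audit episode, stitched from the request shapes above (assumes a local server; the empty findings list is just a placeholder):

```python
import requests

BASE = "http://localhost:7860"

# Start a claim-audit episode
obs = requests.post(f"{BASE}/reset",
                    json={"task_id": "claim_evidence_audit"}).json()

# Navigate: read the results section, then check Table 1
for action in [
    {"task": "claim_evidence_audit", "action_type": "query_section",
     "section_name": "results"},
    {"task": "claim_evidence_audit", "action_type": "check_table",
     "table_id": "Table 1"},
]:
    step = requests.post(f"{BASE}/step", json=action).json()
    print(step["reward"], step["done"])  # PBRS pays these steps densely

# Submit findings to end the episode
final = requests.post(f"{BASE}/step", json={
    "task": "claim_evidence_audit",
    "action_type": "submit_findings",
    "findings": [],  # a real agent fills this with the mismatches it found
}).json()
print(final["info"])  # f_beta, precision, recall
```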
Other endpoints
| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | `{"status":"ok","version":"0.4.0"}` |
| `/state` | GET | Episode state, curriculum summary |
| `/tasks` | GET | All 4 task descriptions |
| `/action_space` | GET | Full action schema |
Project Structure
├── inference.py · Baseline agent (root; required by spec)
├── models.py · FormattingAction, ScholarAction, CitationAction
├── corpus.py · PaperCorpus loader
├── openenv.yaml · 4 tasks, endpoints, authors, baseline_script
├── Dockerfile
├── requirements.txt
│
├── data/
│   ├── papers/
│   │   ├── paper_001.json · NLP benchmark (easy)
│   │   ├── paper_002.json · CV survey (medium)
│   │   └── paper_003.json · MTL paper (hard)
│   └── styles/ieee.yaml
│
├── server/
│   ├── app.py · FastAPI endpoints
│   ├── environment.py · 4-task state machine
│   ├── reward_shaper.py · PBRS (Ng et al. 1999)
│   ├── curriculum.py · AdaRFT + UCB1
│   ├── bandit.py · Learning-gradient UCB1
│   ├── citation_verifier.py · Citation parser + SQLite cache
│   └── graders/
│       ├── formatting_grader.py · PRS 3-stage (Task 1)
│       ├── consistency_grader.py · F-beta (Task 2)
│       └── audit_grader.py · F-beta + PBRS (Task 3)
│
├── scripts/generate_corpus.py
└── tests/test_all.py · 63 assertions
Testing
[Corpus] 8/8 ✓
[FormattingGrader] 8/8 ✓ PRS stage locking
[ConsistencyGrader] 9/9 ✓ F-beta, hallucination penalty
[AuditGrader] 6/6 ✓ Evidence specificity, coverage bonus
[PBRS] 6/6 ✓ Potential monotonicity, bonus bounds
[UCB1 Bandit] 3/3 ✓ Learning gradient maximisation
[Curriculum] 4/4 ✓ AdaRFT productive-zone targeting
[ScholarEnvironment] 19/19 ✓ Full episode loops, all 4 tasks
Results: 63/63 passed ✅ ALL TESTS PASSED
Research Foundation
| Paper | What it justifies |
|---|---|
| PRS · arXiv 2512.07478 | Task 1 progressive staging prevents GRPO gradient collapse |
| PBRS · Ng, Harada & Russell, ICML 1999 | Policy-invariant dense intermediate rewards |
| AdaRFT · arXiv 2504.05520 | Curriculum targeting the [0.40, 0.70] productive zone |
| RLVE · arXiv 2511.07317 | Adaptive difficulty; UCB1 maximises variance |
| Veri-R1 · arXiv 2510.01932 | Online RL for claim verification is current SOTA |
| LaMer · arXiv 2512.16848 | Structured feedback improves agents by 11–19% |
| StatCheck · Epskamp 2016 | ~50% of papers have errors; motivates the need for scale |
| GROBID · Lopez 2008–2025 | Prior art; CitationVerifier is our RL-native alternative |
Authors
Nensi Pansuriya · Krushna Parmar · Ishita Bhojani
Meta × PyTorch OpenEnv Hackathon · Round 1 · April 2026