🔬 ScholarEnv

The first RL environment for AI-assisted peer review and scholarly integrity verification


An AI agent that investigates papers — not one that produces them.

API Reference · Quick Start · Research


Nensi Pansuriya · Krushna Parmar · Ishita Bhojani

Meta × PyTorch OpenEnv Hackathon · Round 1 · April 2026


Why This Exists

~10,000 papers are retracted every year. Every major journal — Nature, Science, IEEE, ACM — faces a manual integrity-screening bottleneck at scale. StatCheck found statistical reporting errors in ~50% of psychology papers in top journals.

The key insight: LLMs are already good at formatting. They fail at auditing.

Ask GPT-4o to format a manuscript → scores ~0.92 with no training. Ask GPT-4o to find numerical claim mismatches in a paper → scores 0.20–0.45.

That gap is exactly where RL adds value. The agent must discover a document traversal strategy — which sections to read first, which tables to cross-reference — that varies by paper structure and cannot be reduced to a fixed prompt. RL finds this strategy. Prompting cannot.


Four Tasks

Formatting → Consistency → Claim Audit → Citation Check
   Easy          Medium         Hard          Medium
Task                    │ What the agent does                         │ Frontier baseline │ RL target
formatting_compliance   │ Fix IEEE formatting violations              │ 0.80–0.95         │ 0.95+
internal_consistency    │ Find where the paper contradicts itself     │ 0.40–0.65         │ 0.65–0.80
claim_evidence_audit    │ Find where text claims ≠ table values       │ 0.20–0.45         │ 0.55–0.75
citation_verification   │ Identify ghost and misattributed references │ 0.35–0.60         │ 0.65–0.80

Task 3's low baseline is the core RL contribution — it proves genuine training headroom exists.


Reward Design

Task 1 — Progressive Reward Shaping (PRS)

Three stages unlock sequentially: Stage N contributes only when Stage N−1 scores ≥ its threshold. This prevents GRPO gradient collapse.

Stage 1 │ weight 0.40 │ threshold 0.00 │ Title, abstract, section headings
Stage 2 │ weight 0.35 │ threshold 0.60 │ Section order, word limits, captions
Stage 3 │ weight 0.25 │ threshold 0.70 │ IEEE citations, author block, keywords

Based on: arXiv 2512.07478 — PRS for Agentic RL
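The stage-gating rule above can be sketched in a few lines of Python. The weights and thresholds mirror the table; everything else (the function name, how per-stage scores are produced) is illustrative, not the project's actual grader:

```python
# Illustrative sketch of PRS stage gating: Stage N's weighted score
# counts only once Stage N-1 has reached Stage N's unlock threshold.
STAGES = [
    {"weight": 0.40, "threshold": 0.00},  # title, abstract, headings
    {"weight": 0.35, "threshold": 0.60},  # section order, word limits, captions
    {"weight": 0.25, "threshold": 0.70},  # citations, author block, keywords
]

def prs_reward(stage_scores):
    """Combine per-stage scores in [0, 1] with sequential unlocking."""
    total = 0.0
    # prev is the previous stage's score (1.0 for the always-open first stage)
    for prev, stage, score in zip([1.0] + stage_scores, STAGES, stage_scores):
        if prev < stage["threshold"]:  # earlier stage too weak: later stages stay locked
            break
        total += stage["weight"] * score
    return total
```

With stage scores [0.9, 0.5, 0.8], Stage 3 stays locked because Stage 2 scored below its 0.70 unlock threshold, so the reward is 0.40·0.9 + 0.35·0.5 = 0.535.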

Tasks 2 & 3 — F-beta + Potential-Based Reward Shaping

F-beta (β=0.5) weights precision 4× over recall — prevents hallucination gaming:

F_β(precision=1.0, recall=0.5) = 0.833   ✓ correct and precise
F_β(precision=0.2, recall=1.0) = 0.238   ✗ spamming guesses
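Those values follow from the standard F-beta definition; a quick sketch (assumed to match the grader's core formula, before any additional penalties):

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta. beta < 1 weights precision more heavily:
    with beta = 0.5, precision counts 1/beta^2 = 4x as much as recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Precise agent vs. guess-spamming agent:
print(round(f_beta(1.0, 0.5), 3))  # 0.833
print(round(f_beta(0.2, 1.0), 3))  # 0.238
```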

PBRS (Ng et al., ICML 1999) gives dense intermediate rewards on every navigation step:

Φ(s) = 0.30 × sections_read/total + 0.30 × tables_checked/total + 0.40 × claims_extracted/est
F(s,s') = γ·Φ(s') − Φ(s)     ← policy-invariant, theoretically guaranteed
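A minimal sketch of the potential and shaping term above; the 0.99 discount is an assumption (the value of γ is not stated here):

```python
GAMMA = 0.99  # discount factor -- assumed, not specified in this README

def potential(sections_read, total_sections,
              tables_checked, total_tables,
              claims_extracted, est_claims):
    """Phi(s): exploration progress in [0, 1], weighted as in the formula above."""
    return (0.30 * sections_read / total_sections
            + 0.30 * tables_checked / total_tables
            + 0.40 * claims_extracted / est_claims)

def shaping(phi_s, phi_s_next):
    """F(s, s') = gamma * Phi(s') - Phi(s): policy-invariant (Ng et al., 1999)."""
    return GAMMA * phi_s_next - phi_s
```

Each navigation step that reads a new section or checks a new table raises Φ, yielding a small dense reward without changing the optimal policy.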

Curriculum — AdaRFT + UCB1

Keeps the agent in the productive zone (average score 0.40–0.70). UCB1 maximises the learning gradient (reward variance), not the mean reward.

avg > 0.70  →  select harder papers
avg < 0.40  →  select easier papers

Based on: arXiv 2504.05520 — AdaRFT Adaptive Data Selection
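A variance-seeking UCB1 over difficulty buckets can be sketched as follows; the bucket names and exploration constant are illustrative, not the project's `bandit.py`:

```python
import math

class VarianceUCB1:
    """UCB1 arm selection where the 'payoff' is reward variance, so the
    bandit prefers papers the agent is still actively learning on."""

    def __init__(self, arms):
        self.scores = {a: [] for a in arms}

    def select(self):
        total = sum(len(v) for v in self.scores.values()) or 1

        def index(arm):
            xs = self.scores[arm]
            if not xs:
                return float("inf")  # try each arm at least once
            mean = sum(xs) / len(xs)
            var = sum((x - mean) ** 2 for x in xs) / len(xs)
            # variance (learning gradient) + standard UCB1 exploration bonus
            return var + math.sqrt(2 * math.log(total) / len(xs))

        return max(self.scores, key=index)

    def update(self, arm, reward):
        self.scores[arm].append(reward)
```

An arm the agent has mastered (constant high scores) or cannot touch (constant low scores) has zero variance, so the bandit drifts toward the bucket where outcomes still fluctuate; AdaRFT's avg-score thresholds then steer difficulty on top of this.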


Quick Start

Install

git clone https://github.com/Nensi1311/research-paper-formatter-agent
cd research-paper-formatter-agent
pip install -r requirements.txt

Generate corpus

python scripts/generate_corpus.py

Run tests

python tests/test_all.py
# → ALL TESTS PASSED (63/63)

Start server

uvicorn server.app:app --host 0.0.0.0 --port 7860

Test all 4 tasks — Linux/macOS

for task in formatting_compliance internal_consistency claim_evidence_audit citation_verification; do
  curl -s -X POST localhost:7860/reset \
    -H "Content-Type: application/json" \
    -d "{\"task_id\":\"$task\"}" | python3 -c \
    "import sys,json; d=json.load(sys.stdin); print('$task: OK' if 'observation' in d else '$task: FAIL')"
done

Test all 4 tasks — Windows PowerShell

foreach ($task in @("formatting_compliance","internal_consistency","claim_evidence_audit","citation_verification")) {
    $body = '{"task_id":"' + $task + '"}'
    $r = Invoke-RestMethod -Uri "http://localhost:7860/reset" -Method POST -ContentType "application/json" -Body $body
    if ($r.observation) { Write-Host "$task : OK" } else { Write-Host "$task : FAIL" }
}

Docker

docker build -t scholar-env .
docker run -p 7860:7860 scholar-env
curl http://localhost:7860/health

Run baseline agent

export API_BASE_URL="https://api-inference.huggingface.co/v1"
export MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
export HF_TOKEN="hf_your_token"
export HF_SPACE_URL="https://flyingmaverick-scholar-env.hf.space"

python inference.py
# Writes: baseline_scores.json

API Reference

POST /reset

{"task_id": "formatting_compliance"}

Returns observation with manuscript_text, style_guide, step_count, max_steps, hint.

POST /step

Task 1 — submit formatted manuscript:

{"task": "formatting_compliance", "formatted_text": "...full reformatted manuscript..."}

Tasks 2/3 — navigate:

{"task": "claim_evidence_audit", "action_type": "query_section", "section_name": "results"}
{"task": "claim_evidence_audit", "action_type": "check_table", "table_id": "Table 1"}
{"task": "claim_evidence_audit", "action_type": "extract_claims", "section_name": "results"}

Tasks 2/3 — submit findings:

{
  "task": "claim_evidence_audit",
  "action_type": "submit_findings",
  "findings": [
    {
      "type": "table_text_mismatch",
      "location": "abstract",
      "claim": "Table 2 shows 87% accuracy",
      "contradicts": "Table 2 value is 79%",
      "table_id": "Table 2",
      "table_value": "79%"
    }
  ]
}

Task 4 — check citation:

{"task": "citation_verification", "action_type": "check_citation", "citation_id": "ref_3"}

Task 4 — submit verdicts:

{
  "task": "citation_verification",
  "action_type": "submit_verdicts",
  "verdicts": [
    {"citation_id": "ref_3", "status": "ghost", "issue": "Implausible title claim", "confidence": 0.9}
  ]
}

Step response:

{
  "observation": {...},
  "reward": 0.7341,
  "done": false,
  "info": {"f_beta": 0.73, "precision": 0.8, "recall": 0.67}
}
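Putting the endpoints together, a minimal Python client episode might look like this. The step budget and the fixed "query the results section" policy are purely illustrative:

```python
import json
import urllib.request

BASE = "http://localhost:7860"  # local server from the Quick Start

def post(path: str, payload: dict) -> dict:
    """POST a JSON payload to the environment and decode the JSON reply."""
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as r:
        return json.load(r)

def run_episode(task_id: str, max_steps: int = 20) -> float:
    """Reset one task, take navigation steps, return accumulated reward."""
    post("/reset", {"task_id": task_id})
    total = 0.0
    for _ in range(max_steps):
        # Trivial fixed policy for illustration; a real agent would pick
        # actions from the observation and eventually submit findings.
        resp = post("/step", {"task": task_id, "action_type": "query_section",
                              "section_name": "results"})
        total += resp["reward"]
        if resp["done"]:
            break
    return total

# total = run_episode("claim_evidence_audit")
```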

Other endpoints

Endpoint      │ Method │ Description
/health       │ GET    │ {"status":"ok","version":"0.4.0"}
/state        │ GET    │ Episode state, curriculum summary
/tasks        │ GET    │ All 4 task descriptions
/action_space │ GET    │ Full action schema

Project Structure

├── inference.py                 ← Baseline agent (root — required by spec)
├── models.py                    ← FormattingAction, ScholarAction, CitationAction
├── corpus.py                    ← PaperCorpus loader
├── openenv.yaml                 ← 4 tasks, endpoints, authors, baseline_script
├── Dockerfile
├── requirements.txt
│
├── data/
│   ├── papers/
│   │   ├── paper_001.json       ← NLP benchmark (easy)
│   │   ├── paper_002.json       ← CV survey (medium)
│   │   └── paper_003.json       ← MTL paper (hard)
│   └── styles/ieee.yaml
│
├── server/
│   ├── app.py                   ← FastAPI endpoints
│   ├── environment.py           ← 4-task state machine
│   ├── reward_shaper.py         ← PBRS (Ng et al. 1999)
│   ├── curriculum.py            ← AdaRFT + UCB1
│   ├── bandit.py                ← Learning-gradient UCB1
│   ├── citation_verifier.py     ← Citation parser + SQLite cache
│   └── graders/
│       ├── formatting_grader.py  ← PRS 3-stage (Task 1)
│       ├── consistency_grader.py ← F-beta (Task 2)
│       └── audit_grader.py       ← F-beta + PBRS (Task 3)
│
├── scripts/generate_corpus.py
└── tests/test_all.py            ← 63 assertions

Testing

[Corpus]              8/8  ✓
[FormattingGrader]    8/8  ✓  PRS stage locking
[ConsistencyGrader]   9/9  ✓  F-beta, hallucination penalty
[AuditGrader]         6/6  ✓  Evidence specificity, coverage bonus
[PBRS]                6/6  ✓  Potential monotonicity, bonus bounds
[UCB1 Bandit]         3/3  ✓  Learning gradient maximisation
[Curriculum]          4/4  ✓  AdaRFT productive-zone targeting
[ScholarEnvironment] 19/19 ✓  Full episode loops, all 4 tasks

Results: 63/63 passed — ALL TESTS PASSED

Research Foundation

Paper                                  │ What it justifies
PRS · arXiv 2512.07478                 │ Task 1 progressive staging prevents GRPO gradient collapse
PBRS · Ng, Harada & Russell, ICML 1999 │ Policy-invariant dense intermediate rewards
AdaRFT · arXiv 2504.05520              │ Curriculum targeting [0.40, 0.70] productive zone
RLVE · arXiv 2511.07317                │ Adaptive difficulty, UCB1 maximises variance
Veri-R1 · arXiv 2510.01932             │ Online RL for claim verification is current SOTA
LaMer · arXiv 2512.16848               │ Structured feedback improves agent 11–19%
StatCheck · Epskamp 2016               │ ~50% of papers have errors — scale motivation
GROBID · Lopez 2008–2025               │ Prior art; CitationVerifier is our RL-native alternative

Authors

Nensi Pansuriya · Krushna Parmar · Ishita Bhojani

Meta × PyTorch OpenEnv Hackathon · Round 1 · April 2026


License

Apache 2.0


The future of AI isn't just models that generate — it's models that verify.

GitHub
