Training data
AutoMathText: Autonomous Data Selection with Language Models for Mathematical Texts
Paper • 2402.07625
Rethinking Data Selection for Supervised Fine-Tuning
Paper • 2402.06094
Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models
Paper • 2402.13064
TnT-LLM: Text Mining at Scale with Large Language Models
Paper • 2403.12173
DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows
Paper • 2402.10379
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset
Paper • 2402.10176
CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-training
Paper • 2406.10670
Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models
Paper • 2408.02085
PopAlign: Diversifying Contrasting Patterns for a More Comprehensive Alignment
Paper • 2410.13785
Evaluating Sample Utility for Data Selection by Mimicking Model Weights
Paper • 2501.06708