# DATALORA: Dynamic Adaptive Token-Level Optimization with Rank-Adaptive LoRA
DATALORA is a novel training optimization method that combines three key innovations into a unified framework for efficient language model fine-tuning:
- Unified Saliency Network (USN): Joint token importance scoring and expert routing
- Mixture of LoRA Experts (MoLoRA): 8 specialized LoRA adapters with dynamic routing
- Dynamic Token Pruning: Curriculum-based token retention scheduling
## Model Details
| Property | Value |
|---|---|
| Base Model | mistralai/Mistral-7B-v0.3 |
| Method | DATALORA (USN + MoLoRA + Token Pruning) |
| Training Data | Open-Orca/OpenOrca (5K samples) |
| LoRA Experts | 8 |
| LoRA Rank | 16 |
| Target Retention | 50% |
| Quantization | 4-bit (NF4) |
| Training Epochs | 3 |
| Curriculum | Warmup → Sparsification → Hardening |
## Architecture

```
         Input Tokens
               │
               ▼
┌─────────────────────────────┐
│  Unified Saliency Network   │
│      (Shared backbone)      │
├─────────┬─────────┬─────────┤
│  Token  │ Router  │  Rank   │
│ Scores  │ Logits  │  Scale  │
└────┬────┴────┬────┴────┬────┘
     │         │         │
     ▼         ▼         ▼
   Token    Expert    Dynamic
  Pruning  Selection  Rank Scaling
     │         │         │
     ▼         ▼         ▼
┌─────────────────────────────┐
│   Mixture of LoRA Experts   │
│  (8 specialized adapters)   │
└─────────────────────────────┘
```
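To make the expert layer concrete, here is a minimal sketch of a frozen linear layer augmented with a mixture of LoRA experts under top-k routing. The class name, initialization, and routing details are illustrative assumptions, not the released code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoLoRALinear(nn.Module):
    """Frozen base linear layer plus a mixture of LoRA experts,
    combined via top-k routing weights (illustrative sketch)."""

    def __init__(self, in_f, out_f, num_experts=8, rank=16, alpha=32, top_k=2):
        super().__init__()
        self.base = nn.Linear(in_f, out_f, bias=False)
        self.base.weight.requires_grad_(False)  # base weights stay frozen
        # One (A, B) low-rank pair per expert; B starts at zero so the
        # adapter initially contributes nothing, as in standard LoRA.
        self.A = nn.Parameter(torch.randn(num_experts, in_f, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(num_experts, rank, out_f))
        self.scaling = alpha / rank
        self.top_k = top_k

    def forward(self, x, router_logits):
        # x: (batch, seq, in_f); router_logits: (batch, seq, num_experts)
        weights, idx = router_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)  # renormalize over the top-k
        out = self.base(x)
        for k in range(self.top_k):
            A = self.A[idx[..., k]]           # (batch, seq, in_f, rank)
            B = self.B[idx[..., k]]           # (batch, seq, rank, out_f)
            delta = torch.einsum("bsi,bsir,bsro->bso", x, A, B)
            out = out + self.scaling * weights[..., k : k + 1] * delta
        return out
```

Gathering per-token expert weights keeps the routing fully per-token at the cost of extra memory; a production implementation would batch tokens by expert instead.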
## Key Insight
Token importance and task complexity are correlated signals. Complex inputs need more tokens AND stronger adaptation. The USN learns this correlation jointly, enabling:
- Efficient computation via token pruning (skip unimportant tokens)
- Specialized adaptation via expert routing (different experts for different input types)
- Adaptive capacity via rank scaling (more capacity for harder inputs)
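A minimal sketch of such a shared-backbone network with three light heads follows; the layer sizes and head designs are illustrative assumptions, not the released implementation:

```python
import torch
import torch.nn as nn

class UnifiedSaliencyNetwork(nn.Module):
    """Shared backbone with three heads: token saliency scores,
    expert-router logits, and a per-token rank scale (sketch)."""

    def __init__(self, hidden_size: int, num_experts: int = 8):
        super().__init__()
        # One shared projection feeds all three heads, so the correlated
        # signals (importance, routing, capacity) are learned jointly.
        self.backbone = nn.Sequential(
            nn.Linear(hidden_size, hidden_size // 4),
            nn.GELU(),
        )
        self.token_head = nn.Linear(hidden_size // 4, 1)             # importance
        self.router_head = nn.Linear(hidden_size // 4, num_experts)  # routing
        self.rank_head = nn.Linear(hidden_size // 4, 1)              # capacity

    def forward(self, hidden_states):
        h = self.backbone(hidden_states)                  # (batch, seq, d // 4)
        token_scores = self.token_head(h).squeeze(-1)     # (batch, seq)
        router_logits = self.router_head(h)               # (batch, seq, experts)
        rank_scale = torch.sigmoid(self.rank_head(h)).squeeze(-1)  # in (0, 1)
        return token_scores, router_logits, rank_scale

usn = UnifiedSaliencyNetwork(hidden_size=64)
scores, logits, rank = usn(torch.randn(2, 10, 64))
```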
## Training Details

### Curriculum Learning (3 Phases)
- Warmup (Epoch 1): Dense training, high temperature (soft decisions), all tokens retained
- Sparsification (Epoch 2): Gradually increase token pruning, temperature annealing
- Hardening (Epoch 3): Low temperature (discrete decisions), target 50% retention
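The three phases above can be sketched as a simple schedule function; the linear annealing and hard phase boundaries are illustrative assumptions:

```python
def curriculum_schedule(epoch: float,
                        temp_init: float = 5.0,
                        temp_final: float = 0.5,
                        target_retention: float = 0.5):
    """Illustrative 3-phase schedule: warmup keeps all tokens at high
    temperature, sparsification anneals both linearly, hardening holds
    the final targets (phase boundaries assumed at epochs 1 and 2)."""
    if epoch < 1.0:
        # Warmup: dense training, soft (high-temperature) decisions
        retention, temperature = 1.0, temp_init
    elif epoch < 2.0:
        # Sparsification: linearly anneal retention and temperature
        t = epoch - 1.0
        retention = 1.0 + t * (target_retention - 1.0)
        temperature = temp_init + t * (temp_final - temp_init)
    else:
        # Hardening: near-discrete decisions at the target retention
        retention, temperature = target_retention, temp_final
    return retention, temperature
```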
### Loss Function

```
L_total = L_LM + λ_prune * L_retention + α_balance * L_balance
```

- `L_LM`: Standard language modeling loss
- `L_retention`: Encourages target token retention ratio
- `L_balance`: Encourages uniform expert utilization
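One way the three terms could be combined is sketched below; the mean-squared forms of the retention and balance penalties, and the default weights, are plausible assumptions rather than the exact formulation:

```python
import torch

def datalora_loss(lm_loss, token_keep_probs, router_probs,
                  target_retention=0.5, lambda_prune=0.1, alpha_balance=0.01):
    """Combine the three loss terms from the card (sketch; weights assumed).
    token_keep_probs: (batch, seq) soft keep decisions in [0, 1]
    router_probs: (batch, seq, num_experts) softmaxed router outputs
    """
    # L_retention: penalize deviation of the mean keep ratio from the target
    retention = token_keep_probs.mean()
    l_retention = (retention - target_retention) ** 2
    # L_balance: penalize deviation of mean expert usage from uniform
    num_experts = router_probs.shape[-1]
    usage = router_probs.mean(dim=(0, 1))  # (num_experts,)
    l_balance = ((usage - 1.0 / num_experts) ** 2).sum()
    return lm_loss + lambda_prune * l_retention + alpha_balance * l_balance
```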
## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the fine-tuned model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("tugrulkaya/datalora-mistral-7b")
model = AutoModelForCausalLM.from_pretrained(
    "tugrulkaya/datalora-mistral-7b",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Generate
prompt = "What is machine learning?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## DATALORA Framework
For training your own DATALORA models, see the full framework:
### Components

- `UnifiedSaliencyNetwork`: Joint token scoring + expert routing + rank scaling
- `SimpleMixtureOfExperts`: 8 LoRA experts with weighted combination
- `CurriculumScheduler`: 3-phase training with temperature annealing
- `DATALORATrainer`: Extended HuggingFace Trainer with curriculum integration
### Training Configuration

```python
from models import DATALORAConfig

config = DATALORAConfig(
    base_model="mistralai/Mistral-7B-v0.3",
    num_lora_experts=8,      # Number of LoRA experts
    lora_rank=16,            # Base LoRA rank
    lora_alpha=32,           # LoRA scaling factor
    target_retention=0.5,    # Target 50% token retention
    num_active_experts=2,    # Top-K expert selection
    temperature_init=5.0,    # Initial Gumbel-softmax temperature
    temperature_final=0.5,   # Final temperature
)
```
## Citation

```bibtex
@misc{kaya2025datalora,
  title={DATALORA: Dynamic Adaptive Token-Level Optimization with Rank-Adaptive LoRA},
  author={Kaya, Tuğrul},
  year={2025},
  url={https://huggingface.co/tugrulkaya/datalora-mistral-7b}
}
```
## License
Apache 2.0 (inherited from Mistral-7B-v0.3)