# Long-Context Fake News Classifier (Longformer, ISOT)

A binary text-classification model that fine-tunes `allenai/longformer-base-4096` to classify long-form news articles as REAL or FAKE, trained on a subsampled ISOT Fake News Dataset.
## Model

- Base model: `allenai/longformer-base-4096`
- Task: Binary text classification
- Labels: `0` = REAL, `1` = FAKE
- Max sequence length used: 1024 tokens
- Parameters: same as `longformer-base-4096`, with a newly initialized 2-class classifier head
- Framework: Hugging Face `transformers` (Trainer API)
## Data

- Dataset: ISOT Fake News Dataset
- Files: `True.csv` (REAL), `Fake.csv` (FAKE)
- Language: English
- Preprocessing:
  - Added a `label` column: 0 for REAL (`True.csv`), 1 for FAKE (`Fake.csv`)
  - Concatenated `title` and `text` into `full_text`
  - Shuffled the combined data with `random_state=42`
  - Subsampled to 10,024 examples (`df_small`)
  - Train/test split: 80% / 20% (8,019 train, 2,005 test), stratified by `label`
- Label distribution in the subsample:
  - Overall: 5,241 FAKE, 4,783 REAL
  - Train: 4,193 FAKE, 3,826 REAL
  - Test: 1,048 FAKE, 957 REAL
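The preprocessing steps above can be sketched as follows. This is a reconstruction, not the original script: the card does not show its code, so the pandas/scikit-learn calls, the `build_splits` function name, and the whitespace join of `title` and `text` are assumptions; the labels, shuffle seed, subsample size, and split ratio follow the card.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def build_splits(real_df: pd.DataFrame, fake_df: pd.DataFrame,
                 n_subsample: int = 10_024, seed: int = 42):
    # Label the two source files: 0 = REAL (True.csv), 1 = FAKE (Fake.csv).
    real_df = real_df.assign(label=0)
    fake_df = fake_df.assign(label=1)
    df = pd.concat([real_df, fake_df], ignore_index=True)
    # Concatenate title and text into the model input column.
    df["full_text"] = df["title"].fillna("") + " " + df["text"].fillna("")
    # Shuffle with a fixed seed, then subsample to df_small.
    df_small = df.sample(frac=1, random_state=seed).head(n_subsample)
    # 80/20 split, stratified by label.
    return train_test_split(
        df_small, test_size=0.2, stratify=df_small["label"], random_state=seed
    )
```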
## Tokenization

- Tokenizer: `AutoTokenizer.from_pretrained("allenai/longformer-base-4096")`
- Settings: `padding="max_length"`, `truncation=True`, `max_length=1024`
- Global attention (training code):
  - Created `global_attention_mask` as a Python list of length `len(inputs["input_ids"])` with the first element set to 1 and the rest 0, then attached it as `inputs["global_attention_mask"]`
  - Note: this differs from the standard `(batch_size, seq_len)` tensor mask used at inference time
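The training-time mask construction described above can be isolated in a small helper (the function name is illustrative, not from the original script; the tokenizer call it pairs with uses the settings listed above):

```python
def attach_global_attention(inputs: dict) -> dict:
    """Attach the training-style global attention mask.

    As in the training code, this is a flat Python list of length
    len(inputs["input_ids"]) with the first element 1 and the rest 0 --
    not the (batch_size, seq_len) tensor mask used at inference time.
    """
    mask = [0] * len(inputs["input_ids"])
    if mask:
        mask[0] = 1  # global attention only on the first position
    inputs["global_attention_mask"] = mask
    return inputs

# Intended use with the tokenizer (requires `transformers`):
#   tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
#   inputs = tokenizer(text, padding="max_length", truncation=True, max_length=1024)
#   inputs = attach_global_attention(inputs)
```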
## Training setup

### Model init

```python
model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096",
    num_labels=2,
)
```

### TrainingArguments

```python
training_args = TrainingArguments(
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    weight_decay=0.01,
    fp16=True,
    gradient_checkpointing=True,
    load_best_model_at_end=True,
    push_to_hub=False,
    report_to="none",
)
```
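A quick sanity check on these settings (an addition, not from the card's script): a per-device batch size of 1 with `gradient_accumulation_steps=4` gives an effective batch size of 4, so one epoch over 8,019 training examples yields 8019 // 4 optimizer updates if the partial final accumulation window is floored, consistent with the 2,004 global steps reported below.

```python
# Effective batch size and expected optimizer steps per epoch,
# assuming the trailing partial accumulation window is dropped.
effective_batch_size = 1 * 4       # per_device_train_batch_size * gradient_accumulation_steps
steps_per_epoch = 8019 // effective_batch_size
```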
### Trainer

```python
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer,
)
```
## Training and evaluation

- Epochs: 1
- Global steps: 2,004
- Training runtime: 2,065.12 seconds
- Train samples per second: 3.883
- Train steps per second: 0.97
- Total FLOPs (reported): 5,265,322,518,970,368
- Losses:
  - Epoch 0 training loss: 0.005100
  - Epoch 0 validation loss: 0.00013
  - Final `TrainOutput.training_loss`: 0.017658273408750813

No accuracy, precision, recall, or F1 metrics were computed in the training script; evaluation is currently reported only via loss on the held-out test split.
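A `compute_metrics` function like the sketch below, passed to the `Trainer`, would add those metrics to each evaluation. This is a suggested addition using scikit-learn, not part of the original training script:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) pair as supplied by the Trainer.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # average="binary" treats label 1 (FAKE) as the positive class.
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary"
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Passing `compute_metrics=compute_metrics` to the `Trainer` constructor would then report these values alongside the loss at every `evaluation_strategy="epoch"` checkpoint.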
## Inference

Minimal example for using the model from the Hub:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "PushkarKumar/veritas_ai_new"  # https://huggingface.co/PushkarKumar/veritas_ai_new
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

def classify(text: str):
    inputs = tokenizer(
        text,
        padding="max_length",
        truncation=True,
        max_length=1024,
        return_tensors="pt",
    )
    # Standard (batch_size, seq_len) mask: global attention on the
    # first token only.
    global_attention_mask = torch.zeros(
        inputs["input_ids"].shape,
        dtype=torch.long,
    )
    global_attention_mask[:, 0] = 1
    inputs["global_attention_mask"] = global_attention_mask
    with torch.no_grad():
        outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=1)  # shape (1, 2)
    label_id = int(torch.argmax(probs, dim=1))
    labels = {0: "REAL", 1: "FAKE"}
    # Index into the first (and only) row of the batch.
    return labels[label_id], float(probs[0, label_id])
```
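The final softmax → argmax → label-lookup step of `classify` can be exercised in isolation with made-up logits (no model download needed); note the confidence is read from row 0 of the `(1, 2)` probability tensor:

```python
import torch

logits = torch.tensor([[3.2, -1.5]])        # stand-in for outputs.logits, batch of 1
probs = torch.softmax(logits, dim=1)        # shape (1, 2)
label_id = int(torch.argmax(probs, dim=1))  # 0 or 1
labels = {0: "REAL", 1: "FAKE"}
label, confidence = labels[label_id], float(probs[0, label_id])
```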
## Limitations and bias
- Trained on a single English fake-news dataset (ISOT), with domain focus on politics and world news; performance outside this distribution is uncertain.
- Labels are based on data source heuristics (e.g., Reuters vs. unreliable sites), not article-level fact checking, and may encode source or political bias.
- The model should not be used as an automated fact-checker or for high-stakes decisions.
## Author

- Author: Pushkar Kumar