# Long-Context Fake News Classifier (Longformer, ISOT)

A binary text-classification model that fine-tunes `allenai/longformer-base-4096` to classify long-form news articles as REAL or FAKE, trained on a subsampled ISOT Fake News Dataset.
## Model

- Base model: `allenai/longformer-base-4096`
- Task: Binary text classification
- Labels: `0` = REAL, `1` = FAKE
- Max sequence length used: 1024 tokens
- Parameters: same as `longformer-base-4096`, with a newly initialized 2-class classifier head
- Framework: Hugging Face `transformers` (Trainer API)
## Data

- Dataset: ISOT Fake News Dataset
- Files: `True.csv` (REAL), `Fake.csv` (FAKE)
- Language: English
- Preprocessing:
  - Added a `label` column: 0 for REAL (`True.csv`), 1 for FAKE (`Fake.csv`)
  - Concatenated `title` and `text` into `full_text`
  - Shuffled the combined data with `random_state=42`
  - Subsampled to 10,024 examples (`df_small`)
  - Train/test split: 80% / 20% (8,019 train, 2,005 test), stratified by `label`
- Label distribution in the subsample:
  - Overall: 5,241 FAKE, 4,783 REAL
  - Train: 4,193 FAKE, 3,826 REAL
  - Test: 1,048 FAKE, 957 REAL
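The preprocessing steps above can be sketched as follows. This is a reconstruction, not the original script: the card does not show its code, so the pandas/scikit-learn calls, the `build_splits` function name, and the whitespace join of `title` and `text` are assumptions; the labels, shuffle seed, subsample size, and split ratio follow the card.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def build_splits(real_df: pd.DataFrame, fake_df: pd.DataFrame,
                 n_subsample: int = 10_024, seed: int = 42):
    # Label the two source files: 0 = REAL (True.csv), 1 = FAKE (Fake.csv).
    real_df = real_df.assign(label=0)
    fake_df = fake_df.assign(label=1)
    df = pd.concat([real_df, fake_df], ignore_index=True)
    # Concatenate title and text into the model input column.
    df["full_text"] = df["title"].fillna("") + " " + df["text"].fillna("")
    # Shuffle with a fixed seed, then subsample to df_small.
    df_small = df.sample(frac=1, random_state=seed).head(n_subsample)
    # 80/20 split, stratified by label.
    return train_test_split(
        df_small, test_size=0.2, stratify=df_small["label"], random_state=seed
    )
```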
## Tokenization

- Tokenizer: `AutoTokenizer.from_pretrained("allenai/longformer-base-4096")`
- Settings: `padding="max_length"`, `truncation=True`, `max_length=1024`
- Global attention (training code):
  - Created `global_attention_mask` as a Python list of length `len(inputs["input_ids"])` with the first element set to 1 and the rest 0, then attached it as `inputs["global_attention_mask"]`
  - Note: this differs from the standard `(batch_size, seq_len)` tensor mask used at inference time
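The training-time mask construction described above can be isolated in a small helper (the function name is illustrative, not from the original script; the tokenizer call it pairs with uses the settings listed above):

```python
def attach_global_attention(inputs: dict) -> dict:
    """Attach the training-style global attention mask.

    As in the training code, this is a flat Python list of length
    len(inputs["input_ids"]) with the first element 1 and the rest 0 --
    not the (batch_size, seq_len) tensor mask used at inference time.
    """
    mask = [0] * len(inputs["input_ids"])
    if mask:
        mask[0] = 1  # global attention only on the first position
    inputs["global_attention_mask"] = mask
    return inputs

# Intended use with the tokenizer (requires `transformers`):
#   tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
#   inputs = tokenizer(text, padding="max_length", truncation=True, max_length=1024)
#   inputs = attach_global_attention(inputs)
```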
## Training setup

### Model init

```python
model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096",
    num_labels=2,
)
```

### TrainingArguments

```python
training_args = TrainingArguments(
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    weight_decay=0.01,
    fp16=True,
    gradient_checkpointing=True,
    load_best_model_at_end=True,
    push_to_hub=False,
    report_to="none",
)
```
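A quick sanity check on these settings (an addition, not from the card's script): a per-device batch size of 1 with `gradient_accumulation_steps=4` gives an effective batch size of 4, so one epoch over 8,019 training examples yields 8019 // 4 optimizer updates if the partial final accumulation window is floored, consistent with the 2,004 global steps reported below.

```python
# Effective batch size and expected optimizer steps per epoch,
# assuming the trailing partial accumulation window is dropped.
effective_batch_size = 1 * 4       # per_device_train_batch_size * gradient_accumulation_steps
steps_per_epoch = 8019 // effective_batch_size
```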
### Trainer

```python
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer,
)
```
## Training and evaluation

- Epochs: 1
- Global steps: 2,004
- Training runtime: 2,065.12 seconds
- Train samples per second: 3.883
- Train steps per second: 0.97
- Total FLOPs (reported): 5,265,322,518,970,368
- Losses:
  - Epoch 0 training loss: 0.005100
  - Epoch 0 validation loss: 0.00013
  - Final `TrainOutput.training_loss`: 0.017658273408750813

No accuracy, precision, recall, or F1 metrics were computed in the training script; evaluation is currently reported only via loss on the held-out test split.
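A `compute_metrics` function like the sketch below, passed to the `Trainer`, would add those metrics to each evaluation. This is a suggested addition using scikit-learn, not part of the original training script:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) pair as supplied by the Trainer.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # average="binary" treats label 1 (FAKE) as the positive class.
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary"
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Passing `compute_metrics=compute_metrics` to the `Trainer` constructor would then report these values alongside the loss at every `evaluation_strategy="epoch"` checkpoint.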
## Inference

Minimal example for using the model from the Hub:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "PushkarKumar/veritas_ai_new"  # https://huggingface.co/PushkarKumar/veritas_ai_new
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

def classify(text: str):
    inputs = tokenizer(
        text,
        padding="max_length",
        truncation=True,
        max_length=1024,
        return_tensors="pt",
    )
    # Standard (batch_size, seq_len) mask: global attention on the
    # first token only.
    global_attention_mask = torch.zeros(
        inputs["input_ids"].shape,
        dtype=torch.long,
    )
    global_attention_mask[:, 0] = 1
    inputs["global_attention_mask"] = global_attention_mask
    with torch.no_grad():
        outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=1)  # shape (1, 2)
    label_id = int(torch.argmax(probs, dim=1))
    labels = {0: "REAL", 1: "FAKE"}
    # Index into the first (and only) row of the batch.
    return labels[label_id], float(probs[0, label_id])
```
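The final softmax → argmax → label-lookup step of `classify` can be exercised in isolation with made-up logits (no model download needed); note the confidence is read from row 0 of the `(1, 2)` probability tensor:

```python
import torch

logits = torch.tensor([[3.2, -1.5]])        # stand-in for outputs.logits, batch of 1
probs = torch.softmax(logits, dim=1)        # shape (1, 2)
label_id = int(torch.argmax(probs, dim=1))  # 0 or 1
labels = {0: "REAL", 1: "FAKE"}
label, confidence = labels[label_id], float(probs[0, label_id])
```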
## Limitations and bias
- Trained on a single English fake-news dataset (ISOT), with domain focus on politics and world news; performance outside this distribution is uncertain.
- Labels are based on data source heuristics (e.g., Reuters vs. unreliable sites), not article-level fact checking, and may encode source or political bias.
- The model should not be used as an automated fact-checker or for high-stakes decisions.
## Author

- Author: Pushkar Kumar