Long-Context Fake News Classifier (Longformer, ISOT)

A binary text-classification model that fine-tunes allenai/longformer-base-4096 to classify long-form news articles as REAL or FAKE, trained on a subsampled ISOT Fake News Dataset.

Model

  • Base model: allenai/longformer-base-4096
  • Task: Binary text classification
  • Labels: 0 = REAL, 1 = FAKE
  • Max sequence length used: 1024 tokens
  • Parameters: same as longformer-base-4096 with a newly initialized 2-class classifier head
  • Framework: Hugging Face transformers (Trainer API)

Data

  • Dataset: ISOT Fake News Dataset
  • Files: True.csv (REAL), Fake.csv (FAKE)
  • Language: English
  • Preprocessing:
    • Added label column: 0 for REAL (True.csv), 1 for FAKE (Fake.csv)
    • Concatenated title and text into full_text
    • Shuffled combined data with random_state=42
    • Subsampled to 10,024 examples (df_small)
    • Train/test split: 80% / 20% (8,019 train, 2,005 test), stratified by label
    • Label distribution in subsample:
      • Overall: 5,241 FAKE, 4,783 REAL
      • Train: 4,193 FAKE, 3,826 REAL
      • Test: 1,048 FAKE, 957 REAL
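The preprocessing steps above can be sketched as follows (a minimal sketch: the original script reads True.csv and Fake.csv directly, so the tiny in-memory stand-in frames, their row counts, and the `head()` subsample size here are illustrative assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-ins for True.csv (REAL) and Fake.csv (FAKE) so the sketch is self-contained
real_df = pd.DataFrame({"title": ["A"] * 50, "text": ["real article"] * 50})
fake_df = pd.DataFrame({"title": ["B"] * 50, "text": ["fake article"] * 50})

real_df["label"] = 0  # REAL
fake_df["label"] = 1  # FAKE

df = pd.concat([real_df, fake_df], ignore_index=True)
df["full_text"] = df["title"] + " " + df["text"]

# Shuffle with a fixed seed, subsample, then split 80/20 stratified by label
df = df.sample(frac=1.0, random_state=42).reset_index(drop=True)
df_small = df.head(80)  # the actual script keeps 10,024 rows
train_df, test_df = train_test_split(
    df_small, test_size=0.2, random_state=42, stratify=df_small["label"]
)
```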

Tokenization

  • Tokenizer: AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
  • Settings:
    • padding="max_length"
    • truncation=True
    • max_length=1024
  • Global attention (training code):
    • Created global_attention_mask as a Python list of length len(inputs["input_ids"]) with the first element set to 1 and the rest 0, then attached as inputs["global_attention_mask"]
    • Note: this differs from the standard (batch_size, seq_len) tensor mask used at inference time
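The list-style mask from the training code can be illustrated without any framework (a minimal sketch; `seq_len = 16` is an arbitrary stand-in for the 1024-token inputs):

```python
def make_global_attention_mask(seq_len: int) -> list:
    """Mimic the training script: global attention only on the first
    (<s>/CLS) token, local sliding-window attention everywhere else."""
    mask = [0] * seq_len
    mask[0] = 1
    return mask

seq_len = 16  # stand-in for max_length=1024
mask = make_global_attention_mask(seq_len)
# At inference the standard form is instead a (batch_size, seq_len) tensor,
# e.g. torch.zeros((1, seq_len), dtype=torch.long) with [:, 0] = 1.
```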

Training setup

Model init

model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096",
    num_labels=2,
)

TrainingArguments

  • evaluation_strategy = "epoch"
  • save_strategy = "epoch"
  • learning_rate = 2e-5
  • per_device_train_batch_size = 1
  • per_device_eval_batch_size = 1
  • gradient_accumulation_steps = 4
  • num_train_epochs = 1
  • weight_decay = 0.01
  • fp16 = True
  • gradient_checkpointing = True
  • load_best_model_at_end = True
  • push_to_hub = False
  • report_to = "none"
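Reconstructed as code, the settings above correspond to (a sketch; `output_dir` is an assumption, as it is not stated in the original):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",          # assumed; not stated in the original
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,   # effective train batch size of 4
    num_train_epochs=1,
    weight_decay=0.01,
    fp16=True,
    gradient_checkpointing=True,     # trades compute for memory on 1024-token inputs
    load_best_model_at_end=True,
    push_to_hub=False,
    report_to="none",
)
```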

Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer,
)

Training and evaluation

  • Epochs: 1
  • Global steps: 2004
  • Training runtime: 2065.12 seconds
  • Train samples per second: 3.883
  • Train steps per second: 0.97
  • Total FLOPs (reported): 5,265,322,518,970,368.0
  • Losses:
    • Epoch 0 training loss: 0.005100
    • Epoch 0 validation loss: 0.00013
    • Final TrainOutput.training_loss: 0.017658273408750813

No accuracy, precision, recall, or F1 metrics were computed in the training script; evaluation is currently reported only via loss on the held-out test split.
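If classification metrics are wanted, a `compute_metrics` function could be passed to the `Trainer` (a pure-NumPy sketch; this is a suggested addition, not part of the original training script):

```python
import numpy as np

def compute_metrics(eval_pred):
    """Accuracy, plus precision/recall/F1 for the positive (FAKE = 1) class."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)

    tp = np.sum((preds == 1) & (labels == 1))
    fp = np.sum((preds == 1) & (labels == 0))
    fn = np.sum((preds == 0) & (labels == 1))

    accuracy = float(np.mean(preds == labels))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

Passing `compute_metrics=compute_metrics` to the `Trainer` constructor would report these alongside the loss at each evaluation.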

Inference

Minimal example for using the model from the Hub:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "PushkarKumar/veritas_ai_new"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

def classify(text: str):
    inputs = tokenizer(
        text,
        padding="max_length",
        truncation=True,
        max_length=1024,
        return_tensors="pt",
    )
    global_attention_mask = torch.zeros(
        inputs["input_ids"].shape,
        dtype=torch.long,
    )
    global_attention_mask[:, 0] = 1
    inputs["global_attention_mask"] = global_attention_mask

    with torch.no_grad():
        outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)
    label_id = int(torch.argmax(probs, dim=-1))
    labels = {0: "REAL", 1: "FAKE"}
    return labels[label_id], float(probs[0, label_id])

Limitations and bias

  • Trained on a single English fake-news dataset (ISOT), with domain focus on politics and world news; performance outside this distribution is uncertain.
  • Labels are based on data source heuristics (e.g., Reuters vs. unreliable sites), not article-level fact checking, and may encode source or political bias.
  • The model should not be used as an automated fact-checker or for high-stakes decisions.

Author

  • Author: Pushkar Kumar