WhatsApp Chat Training Data

Private dataset of personal WhatsApp conversations converted to LLM fine-tuning formats.

Dataset Details

Dataset Description

This dataset contains personal WhatsApp chat exports that have been processed into structured training data suitable for fine-tuning large language models. The data covers casual conversations between the dataset owner ("Alexander") and various friends, family, and colleagues.

Conversations are filtered to remove:

  • URLs and links
  • Media omissions
  • System messages (encryption notices, group creation, etc.)
  • Messages below a minimum length threshold
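
The filtering rules above can be sketched as a single predicate. The marker strings and the length threshold here are illustrative assumptions; the actual export wording and cutoff used for this dataset may differ.

```python
import re

URL_RE = re.compile(r"https?://\S+")

# Hypothetical system/media marker strings; real WhatsApp export wording may differ
SKIP_MARKERS = (
    "Messages and calls are end-to-end encrypted",
    "created group",
    "<Media omitted>",
)
MIN_LENGTH = 3  # assumed minimum character threshold

def keep_message(text: str) -> bool:
    """Apply the URL, media, system-message, and length filters to one message."""
    stripped = text.strip()
    if URL_RE.search(stripped):
        return False
    if any(marker in stripped for marker in SKIP_MARKERS):
        return False
    return len(stripped) >= MIN_LENGTH
```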

Supported Tasks

  • Text generation / chat completion
  • Instruction tuning (Alpaca format)
  • Conversational fine-tuning (ChatML / ShareGPT formats)

Languages

English (informal/conversational)

Dataset Structure

Data Instances

Each instance in the JSONL file is a conversation in OpenAI ChatML format:

{"messages": [
  {"role": "system", "content": "You are Alexander. Respond naturally to messages from friends and contacts."},
  {"role": "user", "content": "[Friend Name]: Hey, how's it going?"},
  {"role": "assistant", "content": "Good man! Just busy with work. You?"},
  {"role": "user", "content": "[Friend Name]: Same old. Want to grab a drink?"},
  {"role": "assistant", "content": "Always. When and where?"}
]}

Data Fields

  • messages: A list of message objects with role and content fields
    • role: One of "system", "user", or "assistant"
    • content: The message text. User messages are prefixed with the sender's name in brackets, e.g. "[Friend Name]: message"
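
If you need to recover the sender name from a user message, the bracketed prefix can be split off with a small helper (a sketch; the function name is our own):

```python
import re

# "[Friend Name]: message" -> ("Friend Name", "message")
PREFIX_RE = re.compile(r"^\[(?P<sender>[^\]]+)\]:\s*(?P<text>.*)$", re.DOTALL)

def split_sender(content: str):
    """Return (sender, text); sender is None when no bracketed prefix is present."""
    match = PREFIX_RE.match(content)
    if match:
        return match.group("sender"), match.group("text")
    return None, content
```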

Data Splits

The dataset ships as a single file with no predefined splits. Create train/validation/test splits with your framework's utilities (e.g. `train_test_split` in Hugging Face `datasets`).
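
A minimal, framework-agnostic way to carve off a validation set (the 5% fraction and seed are arbitrary choices, not part of the dataset):

```python
import random

def split_conversations(conversations, val_fraction=0.05, seed=42):
    """Shuffle conversations deterministically and split into (train, validation)."""
    rng = random.Random(seed)
    pool = list(conversations)
    rng.shuffle(pool)
    n_val = max(1, int(len(pool) * val_fraction))
    return pool[n_val:], pool[:n_val]
```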

Dataset Size

  • Conversations: ~7,100
  • File: chatml_conversations.jsonl (JSONL format)
  • Encoding: UTF-8

Usage

With HuggingFace datasets

from datasets import load_dataset

dataset = load_dataset("n00b001/whatsapp-chat-training-data", split="train")

# Access conversations
for example in dataset:
    messages = example["messages"]

With Unsloth

from unsloth import FastLanguageModel
from datasets import load_dataset

dataset = load_dataset("n00b001/whatsapp-chat-training-data", split="train")

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)
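
If you pre-format conversations into plain training strings yourself rather than relying on the tokenizer's chat template, a ChatML rendering can be sketched as below. This assumes the standard `<|im_start|>`/`<|im_end|>` delimiters; in practice, prefer `tokenizer.apply_chat_template` so the markup matches your model.

```python
def to_chatml_text(messages):
    """Render one conversation as a ChatML-delimited training string."""
    return "\n".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>"
        for m in messages
    )
```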

With OpenAI SDK format

The JSONL file matches the messages schema expected by OpenAI's chat fine-tuning endpoint. With the OpenAI Python SDK (v1+), upload the file and start a job:

from openai import OpenAI

client = OpenAI()

# Upload the training file, then start a fine-tuning job
training_file = client.files.create(
    file=open("chatml_conversations.jsonl", "rb"),
    purpose="fine-tune",
)
client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)

Dataset Creation

Source Data

Personal WhatsApp chat exports from the "Export Chat" feature. The data represents informal, casual English-language conversations.

Processing Steps

  1. Parse WhatsApp .txt exports (supports both 12-hour and 24-hour timestamp formats)
  2. Filter out URLs, media omissions, system messages, and short messages
  3. Group messages into conversations based on time gaps (default: ~6 hours)
  4. Format as ChatML JSONL with system prompt, sender-prefixed user messages, and assistant responses
  5. Filter to conversations with at least 2 assistant turns
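
Step 3 (grouping by time gap) can be sketched as follows. The `(timestamp, sender, text)` tuple shape is an assumption about the intermediate representation, not part of the published data.

```python
from datetime import datetime, timedelta

GAP = timedelta(hours=6)  # default conversation gap from step 3

def group_by_gap(messages, gap=GAP):
    """Split a chronological (timestamp, sender, text) stream into conversations
    wherever consecutive messages are more than `gap` apart."""
    conversations = []
    current = []
    last_ts = None
    for ts, sender, text in messages:
        if last_ts is not None and ts - last_ts > gap:
            conversations.append(current)
            current = []
        current.append((ts, sender, text))
        last_ts = ts
    if current:
        conversations.append(current)
    return conversations
```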

Personal and Sensitive Information

This dataset contains personal conversations. It is hosted as a private dataset and should not be redistributed. Names of conversation partners are included as-is and have not been anonymized.

Considerations for Using the Data

Social Impact of Dataset

This dataset is personal in nature and intended for personal fine-tuning to create a chatbot that mimics the dataset owner's conversational style. It should not be used to impersonate or deceive others.

Discussion of Biases

  • The dataset reflects the communication style, vocabulary, and interests of the dataset owner
  • Conversations span multiple years and may reflect changing perspectives over time
  • Topics are limited to the dataset owner's social circles and interests
  • Primarily British English colloquialisms and informal language

Other Known Limitations

  • No topic categorization or labeling
  • No quality filtering beyond basic length and pattern rules
  • Multi-line messages are preserved but may affect context length
  • Timestamps are stripped from the output data