WhatsApp Chat Training Data

Private dataset of personal WhatsApp conversations converted to LLM fine-tuning formats.

Dataset Details

Dataset Description

This dataset contains personal WhatsApp chat exports that have been processed into structured training data suitable for fine-tuning large language models. The data covers casual conversations between the dataset owner ("Alexander") and various friends, family, and colleagues.

Conversations are filtered to remove:

  • URLs and links
  • Media omissions
  • System messages (encryption notices, group creation, etc.)
  • Messages below a minimum length threshold
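
The filtering rules above can be sketched as a single predicate. The marker strings and the length threshold here are illustrative assumptions; the actual export wording and cutoff used for this dataset may differ.

```python
import re

URL_RE = re.compile(r"https?://\S+")

# Hypothetical system/media marker strings; real WhatsApp export wording may differ
SKIP_MARKERS = (
    "Messages and calls are end-to-end encrypted",
    "created group",
    "<Media omitted>",
)
MIN_LENGTH = 3  # assumed minimum character threshold

def keep_message(text: str) -> bool:
    """Apply the URL, media, system-message, and length filters to one message."""
    stripped = text.strip()
    if URL_RE.search(stripped):
        return False
    if any(marker in stripped for marker in SKIP_MARKERS):
        return False
    return len(stripped) >= MIN_LENGTH
```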

Supported Tasks

  • Text generation / chat completion
  • Instruction tuning (Alpaca format)
  • Conversational fine-tuning (ChatML / ShareGPT formats)

Languages

English (informal/conversational)

Dataset Structure

Data Instances

Each instance in the JSONL file is a conversation in OpenAI ChatML format:

{"messages": [
  {"role": "system", "content": "You are Alexander. Respond naturally to messages from friends and contacts."},
  {"role": "user", "content": "[Friend Name]: Hey, how's it going?"},
  {"role": "assistant", "content": "Good man! Just busy with work. You?"},
  {"role": "user", "content": "[Friend Name]: Same old. Want to grab a drink?"},
  {"role": "assistant", "content": "Always. When and where?"}
]}

Data Fields

  • messages: A list of message objects with role and content fields
    • role: One of "system", "user", or "assistant"
    • content: The message text. User messages are prefixed with the sender's name in brackets, e.g. "[Friend Name]: message"
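
If you need to recover the sender name from a user message, the bracketed prefix can be split off with a small helper (a sketch; the function name is our own):

```python
import re

# "[Friend Name]: message" -> ("Friend Name", "message")
PREFIX_RE = re.compile(r"^\[(?P<sender>[^\]]+)\]:\s*(?P<text>.*)$", re.DOTALL)

def split_sender(content: str):
    """Return (sender, text); sender is None when no bracketed prefix is present."""
    match = PREFIX_RE.match(content)
    if match:
        return match.group("sender"), match.group("text")
    return None, content
```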

Data Splits

The dataset ships as a single file with no predefined splits. Create train/validation/test splits with your framework's utilities (e.g. `train_test_split` in Hugging Face `datasets`).
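
A minimal, framework-agnostic way to carve off a validation set (the 5% fraction and seed are arbitrary choices, not part of the dataset):

```python
import random

def split_conversations(conversations, val_fraction=0.05, seed=42):
    """Shuffle conversations deterministically and split into (train, validation)."""
    rng = random.Random(seed)
    pool = list(conversations)
    rng.shuffle(pool)
    n_val = max(1, int(len(pool) * val_fraction))
    return pool[n_val:], pool[:n_val]
```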

Dataset Size

  • Conversations: ~7,100
  • File: chatml_conversations.jsonl (JSONL format)
  • Encoding: UTF-8

Usage

With HuggingFace datasets

from datasets import load_dataset

dataset = load_dataset("n00b001/whatsapp-chat-training-data", split="train")

# Access conversations
for example in dataset:
    messages = example["messages"]

With Unsloth

from unsloth import FastLanguageModel
from datasets import load_dataset

dataset = load_dataset("n00b001/whatsapp-chat-training-data", split="train")

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)
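
If you pre-format conversations into plain training strings yourself rather than relying on the tokenizer's chat template, a ChatML rendering can be sketched as below. This assumes the standard `<|im_start|>`/`<|im_end|>` delimiters; in practice, prefer `tokenizer.apply_chat_template` so the markup matches your model.

```python
def to_chatml_text(messages):
    """Render one conversation as a ChatML-delimited training string."""
    return "\n".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>"
        for m in messages
    )
```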

With OpenAI SDK format

The JSONL file matches the messages schema expected by OpenAI's chat fine-tuning endpoint. With the OpenAI Python SDK (v1+), upload the file and start a job:

from openai import OpenAI

client = OpenAI()

# Upload the training file, then start a fine-tuning job
training_file = client.files.create(
    file=open("chatml_conversations.jsonl", "rb"),
    purpose="fine-tune",
)
client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)

Dataset Creation

Source Data

Personal WhatsApp chat exports from the "Export Chat" feature. The data represents informal, casual English-language conversations.

Processing Steps

  1. Parse WhatsApp .txt exports (supports both 12-hour and 24-hour timestamp formats)
  2. Filter out URLs, media omissions, system messages, and short messages
  3. Group messages into conversations based on time gaps (default: ~6 hours)
  4. Format as ChatML JSONL with system prompt, sender-prefixed user messages, and assistant responses
  5. Filter to conversations with at least 2 assistant turns
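
Step 3 (grouping by time gap) can be sketched as follows. The `(timestamp, sender, text)` tuple shape is an assumption about the intermediate representation, not part of the published data.

```python
from datetime import datetime, timedelta

GAP = timedelta(hours=6)  # default conversation gap from step 3

def group_by_gap(messages, gap=GAP):
    """Split a chronological (timestamp, sender, text) stream into conversations
    wherever consecutive messages are more than `gap` apart."""
    conversations = []
    current = []
    last_ts = None
    for ts, sender, text in messages:
        if last_ts is not None and ts - last_ts > gap:
            conversations.append(current)
            current = []
        current.append((ts, sender, text))
        last_ts = ts
    if current:
        conversations.append(current)
    return conversations
```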

Personal and Sensitive Information

This dataset contains personal conversations. It is hosted as a private dataset and should not be redistributed. Names of conversation partners are included as-is and have not been anonymized.

Considerations for Using the Data

Social Impact of Dataset

This dataset is personal in nature and intended for personal fine-tuning to create a chatbot that mimics the dataset owner's conversational style. It should not be used to impersonate or deceive others.

Discussion of Biases

  • The dataset reflects the communication style, vocabulary, and interests of the dataset owner
  • Conversations span multiple years and may reflect changing perspectives over time
  • Topics are limited to the dataset owner's social circles and interests
  • Primarily British English colloquialisms and informal language

Other Known Limitations

  • No topic categorization or labeling
  • No quality filtering beyond basic length and pattern rules
  • Multi-line messages are preserved but may affect context length
  • Timestamps are stripped from the output data