# WhatsApp Chat Training Data

Private dataset of personal WhatsApp conversations converted to LLM fine-tuning formats.

## Dataset Details

### Dataset Description
This dataset contains personal WhatsApp chat exports that have been processed into structured training data suitable for fine-tuning large language models. The data covers casual conversations between the dataset owner ("Alexander") and various friends, family, and colleagues.
Conversations are filtered to remove:
- URLs and links
- Media omissions
- System messages (encryption notices, group creation, etc.)
- Messages below a minimum length threshold
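The filters above can be sketched as a single keep/drop predicate. The marker strings and the length threshold below are assumptions for illustration; the card does not state the exact values used:

```python
import re

URL_RE = re.compile(r"https?://\S+")
# Assumed media markers; WhatsApp's wording varies by platform and export locale.
MEDIA_MARKERS = {"<Media omitted>", "image omitted", "video omitted"}
# Assumed system-message fragments (encryption notice, group creation, etc.).
SYSTEM_PATTERNS = [
    "Messages and calls are end-to-end encrypted",
    "created group",
]
MIN_LEN = 3  # assumed minimum length threshold


def keep_message(text: str) -> bool:
    """Return True if a message survives the filters described above."""
    text = text.strip()
    if len(text) < MIN_LEN:
        return False
    if URL_RE.search(text):
        return False
    if text in MEDIA_MARKERS:
        return False
    if any(p in text for p in SYSTEM_PATTERNS):
        return False
    return True
```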
### Supported Tasks
- Text generation / chat completion
- Instruction tuning (Alpaca format)
- Conversational fine-tuning (ChatML / ShareGPT formats)
### Languages
English (informal/conversational)
## Dataset Structure

### Data Instances

Each instance in the JSONL file is a conversation in OpenAI ChatML format:
```json
{"messages": [
  {"role": "system", "content": "You are Alexander. Respond naturally to messages from friends and contacts."},
  {"role": "user", "content": "[Friend Name]: Hey, how's it going?"},
  {"role": "assistant", "content": "Good man! Just busy with work. You?"},
  {"role": "user", "content": "[Friend Name]: Same old. Want to grab a drink?"},
  {"role": "assistant", "content": "Always. When and where?"}
]}
```
### Data Fields

- `messages`: A list of message objects with `role` and `content` fields
  - `role`: One of `"system"`, `"user"`, or `"assistant"`
  - `content`: The message text. User messages are prefixed with the sender's name in brackets, e.g. `"[Friend Name]: message"`
### Data Splits

The dataset ships as a single file with no predefined splits. Create your own train/validation/test split with your framework's utilities.
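As one sketch of a deterministic split over the raw JSONL lines (the fractions and seed here are arbitrary choices, not part of the dataset; if you load through the `datasets` library, its built-in `train_test_split` method does the same in one call):

```python
import random


def split_jsonl(lines, val_frac=0.05, test_frac=0.05, seed=42):
    """Deterministically shuffle and split JSONL lines into
    train/validation/test buckets. Fractions and seed are arbitrary
    example values."""
    lines = list(lines)
    random.Random(seed).shuffle(lines)
    n = len(lines)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    return {
        "test": lines[:n_test],
        "validation": lines[n_test:n_test + n_val],
        "train": lines[n_test + n_val:],
    }
```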
### Dataset Size

- Conversations: ~7,100
- File: `chatml_conversations.jsonl` (JSONL format)
- Encoding: UTF-8
## Usage

### With HuggingFace datasets

```python
from datasets import load_dataset

dataset = load_dataset("n00b001/whatsapp-chat-training-data", split="train")

# Access conversations
for example in dataset:
    messages = example["messages"]
```
### With Unsloth

```python
from datasets import load_dataset
from unsloth import FastLanguageModel

dataset = load_dataset("n00b001/whatsapp-chat-training-data", split="train")

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    dtype=None,  # auto-detect
    load_in_4bit=True,
)
```
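For supervised fine-tuning, each `messages` list must be rendered into a single training string; in practice you would use the tokenizer's own `apply_chat_template` so the output matches the model's native template. Purely as an illustration of what such rendering produces, here is a hand-rolled ChatML-style renderer (the special tokens are ChatML's, not Llama 3's):

```python
def render_chatml(messages):
    """Render a messages list into ChatML-style text.

    Illustrative only: real training code should call
    tokenizer.apply_chat_template, which emits the template the
    chosen base model was actually trained with.
    """
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    return "".join(parts)
```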
### With OpenAI SDK format

The JSONL file follows the chat message format expected by OpenAI's fine-tuning API. Upload the file, then create a fine-tuning job (the legacy `fine_tunes` endpoint is deprecated):

```shell
openai api files.create -f chatml_conversations.jsonl -p fine-tune
openai api fine_tuning.jobs.create -t <file_id> -m gpt-3.5-turbo
```
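Before uploading, each JSONL line can be checked against the expected schema. A minimal validator sketch, with the role set and required fields taken from the ChatML example above:

```python
import json

VALID_ROLES = {"system", "user", "assistant"}


def validate_line(line: str) -> bool:
    """Check one JSONL line: a top-level "messages" list whose entries
    each carry a valid role and string content."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return False
    return all(
        isinstance(m, dict)
        and m.get("role") in VALID_ROLES
        and isinstance(m.get("content"), str)
        for m in messages
    )
```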
## Dataset Creation

### Source Data
Personal WhatsApp chat exports from the "Export Chat" feature. The data represents informal, casual English-language conversations.
### Processing Steps

- Parse WhatsApp `.txt` exports (supports both 12-hour and 24-hour timestamp formats)
- Filter out URLs, media omissions, system messages, and short messages
- Group messages into conversations based on time gaps (default: ~6 hours)
- Format as ChatML JSONL with a system prompt, sender-prefixed user messages, and assistant responses
- Filter to conversations with at least 2 assistant turns
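The time-gap grouping step above can be sketched as follows; the `(timestamp, sender, text)` tuple layout is an assumption about the parser's intermediate representation, not the actual pipeline code:

```python
from datetime import datetime, timedelta

GAP = timedelta(hours=6)  # default gap from the processing steps


def group_conversations(messages):
    """Split a chronologically ordered list of (timestamp, sender, text)
    tuples into conversations wherever the gap between consecutive
    messages exceeds GAP."""
    conversations = []
    current = []
    prev_ts = None
    for ts, sender, text in messages:
        if prev_ts is not None and ts - prev_ts > GAP:
            conversations.append(current)
            current = []
        current.append((ts, sender, text))
        prev_ts = ts
    if current:
        conversations.append(current)
    return conversations
```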
### Personal and Sensitive Information
This dataset contains personal conversations. It is hosted as a private dataset and should not be redistributed. Names of conversation partners are included as-is and have not been anonymized.
## Considerations for Using the Data

### Social Impact of Dataset
This dataset is personal in nature and intended for personal fine-tuning to create a chatbot that mimics the dataset owner's conversational style. It should not be used to impersonate or deceive others.
### Discussion of Biases
- The dataset reflects the communication style, vocabulary, and interests of the dataset owner
- Conversations span multiple years and may reflect changing perspectives over time
- Topics are limited to the dataset owner's social circles and interests
- Primarily British English colloquialisms and informal language
### Other Known Limitations
- No topic categorization or labeling
- No quality filtering beyond basic length and pattern rules
- Multi-line messages are preserved but may affect context length
- Timestamps are stripped from the output data