# Pre-train LLaMA on RedPajama

This howto will walk you through setting up the RedPajama dataset and launching the pre-training script.

## What's RedPajama

[RedPajama](https://github.com/togethercomputer/RedPajama-Data) is an open-source reproduction of the original LLaMA training dataset.
It contains a total of 1.2 trillion tokens, divided into

```text
Commoncrawl    878B
C4             175B
GitHub          59B
Books           26B
ArXiv           28B
Wikipedia       24B
StackExchange   20B
```

The [RedPajama repo](https://github.com/togethercomputer/RedPajama-Data) contains the source code for collecting and preparing
the dataset, and it is Apache 2.0 licensed.
The data itself is licensed according to the original licenses with which its individual parts were released.
The GitHub data is limited to repositories released under the MIT, BSD, or Apache 2.0 licenses.
Along with the full [RedPajama-1T dataset](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T),
the [RedPajama-1T-Sample](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T-Sample) 1B-token sample dataset
is also available for development.
You can download the data using git lfs:

```bash
# Make sure you have git-lfs installed (https://git-lfs.com): git lfs install
git clone https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T data/RedPajama-Data-1T
```

```bash
# Make sure you have git-lfs installed (https://git-lfs.com): git lfs install
git clone https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T-Sample data/RedPajama-Data-1T-Sample
```
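If you prefer not to use git lfs, the same Hugging Face dataset repositories can also be fetched with the `huggingface_hub` Python client. This is not part of the official instructions above, just a minimal sketch assuming you have `pip install huggingface_hub`:

```python
# Optional alternative to git lfs: snapshot the dataset repo with huggingface_hub.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="togethercomputer/RedPajama-Data-1T-Sample",
    repo_type="dataset",
    local_dir="data/RedPajama-Data-1T-Sample",
)
```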

## Prepare RedPajama for training

The dataset consists of 2084 `jsonl` files (the sample dataset contains 11). In order to start pre-training lit-llama
on it, you need to read, tokenize, and write the data in binary chunks. This will leverage the `PackedDataset`
streaming dataset that comes with lit-llama.

To do so, run
```bash
python scripts/prepare_redpajama.py --source_path data/RedPajama-Data-1T --tokenizer_path checkpoints/lit-llama/tokenizer.model --destination_path data/lit-redpajama
```

or

```bash
python scripts/prepare_redpajama.py --source_path data/RedPajama-Data-1T-Sample --tokenizer_path checkpoints/lit-llama/tokenizer.model --destination_path data/lit-redpajama-sample --sample True
```

for the sample dataset.
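Conceptually, the preparation step reads each `jsonl` line, tokenizes the text, and appends the token IDs to fixed-size binary chunk files that `PackedDataset` can later stream. The following is only a simplified sketch of that idea, not the actual `scripts/prepare_redpajama.py`; the file name, chunk size, and output directory are placeholders, and it assumes each `jsonl` line carries a `text` field:

```python
# Illustrative sketch: tokenize jsonl text and pack it into binary chunks.
import json
from pathlib import Path

import numpy as np
from sentencepiece import SentencePieceProcessor

tokenizer = SentencePieceProcessor(model_file="checkpoints/lit-llama/tokenizer.model")
chunk_size = 1024 * 1024                 # tokens per chunk; illustrative, not the script's value
out_dir = Path("data/packed-example")    # illustrative output location
out_dir.mkdir(parents=True, exist_ok=True)

buffer, chunk_idx = [], 0
# "example.jsonl" is a placeholder name; the real dataset ships many jsonl files.
for line in open("data/RedPajama-Data-1T-Sample/example.jsonl"):
    text = json.loads(line)["text"]
    buffer.extend(tokenizer.encode(text) + [tokenizer.eos_id()])
    while len(buffer) >= chunk_size:
        # 16-bit IDs are enough for a 32000-entry vocabulary.
        np.array(buffer[:chunk_size], dtype=np.uint16).tofile(out_dir / f"chunk_{chunk_idx:05d}.bin")
        buffer, chunk_idx = buffer[chunk_size:], chunk_idx + 1
```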
The above assumes you are using the same tokenizer as LLaMA, but any trained [SentencePiece](https://github.com/google/sentencepiece) tokenizer with a 32000 vocabulary size will work here.
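If you bring your own tokenizer, a quick way to check its vocabulary size with the `sentencepiece` package (shown for the default LLaMA tokenizer path; substitute your own model file):

```python
# Verify that a SentencePiece model has the expected 32000-entry vocabulary.
from sentencepiece import SentencePieceProcessor

sp = SentencePieceProcessor(model_file="checkpoints/lit-llama/tokenizer.model")
print(sp.vocab_size())  # should print 32000
```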
The script will take a while to run, so time for :tea:

## Pre-training

Running the pre-training script requires at least 4 GPUs with 40GB+ each (A100).

```bash
python train_redpajama.py --devices 4 --train_data_dir data/lit-redpajama
```

For running on the sample dataset:

```bash
python train_redpajama.py --devices 4 --train_data_dir data/lit-redpajama-sample
```

The script will save checkpoints periodically to the folder `out/`.

The `train_redpajama.py` script will pre-train the LLaMA 7B model with FSDP in
`bfloat16` precision and gradient accumulation.
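As a rough illustration of that setup, here is a minimal Fabric sketch combining FSDP sharding, `bfloat16` mixed precision, and gradient accumulation. The tiny linear model, random data, and loop are placeholders for illustration only, not the actual code in `train_redpajama.py`:

```python
# Minimal sketch: FSDP + bf16 + gradient accumulation with Lightning Fabric.
import torch
import lightning as L
from lightning.fabric.strategies import FSDPStrategy

batch_size, micro_batch_size = 125, 5
grad_accum_iters = batch_size // micro_batch_size  # 25 micro-batches per optimizer step

fabric = L.Fabric(accelerator="cuda", devices=4, precision="bf16-mixed", strategy=FSDPStrategy())
fabric.launch()

model = fabric.setup_module(torch.nn.Linear(512, 512))  # placeholder for the LLaMA model
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4, weight_decay=1e-1, betas=(0.9, 0.95))
optimizer = fabric.setup_optimizers(optimizer)

for step in range(100):
    x = torch.randn(micro_batch_size, 512, device=fabric.device)  # stand-in for a real batch
    is_accumulating = (step + 1) % grad_accum_iters != 0
    # Skip the gradient all-reduce while still accumulating micro-batch gradients.
    with fabric.no_backward_sync(model, enabled=is_accumulating):
        loss = model(x).pow(2).mean()  # dummy loss
        fabric.backward(loss / grad_accum_iters)
    if not is_accumulating:
        optimizer.step()
        optimizer.zero_grad()
```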
You can easily change the size of the model by passing a different string to

```python
config = LLaMAConfig.from_name("7B")
```

in the `main` function.
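For example, assuming lit-llama also defines configurations named after the other LLaMA sizes (not something this howto confirms), switching to a larger model would just be a different name:

```python
# Hypothetical: select a larger config, e.g. "13B", if such a config is defined.
config = LLaMAConfig.from_name("13B")
```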
Keep in mind that the original LLaMA training for the 7B model required 83k A100 80GB
hours, so you'll need access to a cluster.

Once you're in a cluster, you can follow [these instructions](https://lightning.ai/docs/fabric/stable/guide/multi_node/other.html)
to launch the script across machines:

- [SLURM cluster](https://lightning.ai/docs/fabric/stable/guide/multi_node/slurm.html)
- [Barebones cluster](https://lightning.ai/docs/fabric/stable/guide/multi_node/barebones.html)
- [MPI](https://lightning.ai/docs/fabric/stable/guide/multi_node/other.html)
The script contains several configurations and hyperparameters you can tweak:

```python
out_dir = "out/training"
save_interval = 1000
eval_interval = 1000
eval_iters = 100
log_interval = 1

# Hyperparameters
learning_rate = 6e-4
batch_size = 125
micro_batch_size = 5
max_iters = 600000  # num_epochs * epoch_size // devices
weight_decay = 1e-1
beta1 = 0.9
beta2 = 0.95
grad_clip = 1.0
decay_lr = True
warmup_iters = 2000
lr_decay_iters = max_iters
min_lr = 6e-5
```
In particular, `micro_batch_size` should be adjusted so the process makes the best use of the available
GPU memory.
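For reference, `batch_size` is reached by accumulating `batch_size // micro_batch_size` (here 25) micro-batches per optimizer step, and the `decay_lr`, `warmup_iters`, `lr_decay_iters`, and `min_lr` settings describe a linear-warmup-then-cosine-decay learning-rate schedule. Below is a minimal sketch of such a schedule using the values from the block above; the script's own implementation may differ in detail:

```python
# Sketch of a linear-warmup + cosine-decay schedule matching the settings above.
import math

learning_rate, min_lr = 6e-4, 6e-5
warmup_iters, lr_decay_iters = 2000, 600000

def get_lr(it: int) -> float:
    # 1) linear warmup for the first warmup_iters steps
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    # 2) past the decay horizon, hold at the floor
    if it > lr_decay_iters:
        return min_lr
    # 3) cosine decay from learning_rate down to min_lr in between
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (learning_rate - min_lr)
```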
Last, logging is kept minimal in the script. In order to use a particular logger,
please refer to <https://lightning.ai/docs/fabric/stable/api/loggers.html> or
call a logging client library like `wandb` directly.