Instructions to use efederici/ipt-350m with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use efederici/ipt-350m with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="efederici/ipt-350m", trust_remote_code=True)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("efederici/ipt-350m", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("efederici/ipt-350m", trust_remote_code=True)
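For example, a pipeline created this way can be called directly to generate text. This is a minimal sketch; the Italian prompt and generation settings are illustrative, not prescribed by the model card:

from transformers import pipeline

pipe = pipeline("text-generation", model="efederici/ipt-350m", trust_remote_code=True)
# Prompt and generation parameters are illustrative only
out = pipe("C'era una volta", max_new_tokens=50, do_sample=True, temperature=0.7)
print(out[0]["generated_text"])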
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use efederici/ipt-350m with vLLM:
Install from pip and serve the model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "efederici/ipt-350m"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "efederici/ipt-350m",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
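Because the server exposes an OpenAI-compatible API, it can also be called from Python. The sketch below assumes the openai client package is installed and points it at the local server; the base_url and parameters are illustrative:

from openai import OpenAI

# Point the OpenAI client at the local vLLM server started above
# (the API key value is arbitrary for a local server).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

completion = client.completions.create(
    model="efederici/ipt-350m",
    prompt="C'era una volta",
    max_tokens=64,
    temperature=0.5,
)
print(completion.choices[0].text)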
- SGLang
How to use efederici/ipt-350m with SGLang:
Install from pip and serve the model
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "efederici/ipt-350m" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "efederici/ipt-350m",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'

Use Docker images
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "efederici/ipt-350m" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "efederici/ipt-350m",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'

- Docker Model Runner
How to use efederici/ipt-350m with Docker Model Runner:
docker model run hf.co/efederici/ipt-350m
ipt-350m
ipt-350m is a decoder-style transformer pretrained from scratch on ~13B tokens of Italian text (work in progress: currently trained on unfiltered OSCAR).
It uses a modified transformer architecture optimized for efficient training and inference. Positional embeddings are replaced with Attention with Linear Biases (ALiBi).
ipt-350m is:
- Licensed for the possibility of commercial use
- Prepared to handle extremely long inputs thanks to ALiBi
- Capable of fast training and inference (via FlashAttention and FasterTransformer)
- Equipped with highly efficient open-source training code via the llm-foundry repository
If you find this project useful, consider supporting its development.
How to Use
import transformers
model = transformers.AutoModelForCausalLM.from_pretrained(
'efederici/ipt-350m',
trust_remote_code=True
)
Note: This model requires that trust_remote_code=True be passed to the from_pretrained method.
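As in the Transformers snippet earlier, the tokenizer can be loaded from the same repository, after which text generation is a call to generate. A minimal sketch; the Italian prompt and generation settings are illustrative:

import transformers

name = 'efederici/ipt-350m'
tokenizer = transformers.AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = transformers.AutoModelForCausalLM.from_pretrained(name, trust_remote_code=True)

# Illustrative prompt and generation parameters (not prescribed by the model card)
inputs = tokenizer("C'era una volta", return_tensors='pt')
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))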
To use the optimized triton implementation of FlashAttention, you can load the model on GPU (cuda:0) with attn_impl='triton' and with bfloat16 precision:
import torch
import transformers
name = 'efederici/ipt-350m'
config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.attn_config['attn_impl'] = 'triton'
config.init_device = 'cuda:0'
model = transformers.AutoModelForCausalLM.from_pretrained(
name,
config=config,
torch_dtype=torch.bfloat16,
trust_remote_code=True
)
Although the model was trained with a sequence length of 2048, ALiBi allows users to increase the maximum sequence length during finetuning and/or inference. For example:
import transformers
name = 'efederici/ipt-350m'
config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.max_seq_len = 4096 # (input + output) tokens can now be up to 4096
model = transformers.AutoModelForCausalLM.from_pretrained(
name,
config=config,
trust_remote_code=True
)
Model Description
The architecture is a modified decoder-only transformer. It departs from a standard transformer in the following ways:
- It uses FlashAttention
- It uses ALiBi (Attention with Linear Biases) and does not use positional embeddings (a short sketch of the bias follows this list)
- It does not use biases
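As a rough illustration of how ALiBi replaces positional embeddings, the sketch below builds the per-head linear bias that is added to the attention logits. This follows the idea from the ALiBi paper, not this repository's exact implementation, and the helper name is hypothetical:

import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    # Hypothetical helper illustrating ALiBi; not taken from this repository.
    # Head-specific slopes form a geometric sequence (assumes n_heads is a
    # power of two, which holds for the 16 heads in the table below).
    slopes = torch.tensor([2.0 ** (-8.0 * (i + 1) / n_heads) for i in range(n_heads)])
    # distance[i, j] = j - i: zero on the diagonal, negative for past positions.
    positions = torch.arange(seq_len)
    distance = positions.view(1, -1) - positions.view(-1, 1)
    # Bias added to the attention logits before softmax: the further a key lies
    # in the past, the more negative its bias, with one slope per head.
    # (Future positions get a positive value here but are removed by the causal mask.)
    return slopes.view(n_heads, 1, 1) * distance.view(1, seq_len, seq_len)

# For example, alibi_bias(16, 2048) has shape (16, 2048, 2048).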
| Hyperparameter | Value |
|---|---|
| n_parameters | 350M |
| n_layers | 24 |
| n_heads | 16 |
| d_model | 1024 |
| vocab size | 50432 |
| sequence length | 2048 |
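As a sanity check on the table, a rough parameter count from these hyperparameters comes out close to 350M. The sketch assumes a standard 4x MLP expansion and no bias terms (as stated above); ALiBi means there is no positional-embedding matrix to count, and LayerNorm parameters are omitted as negligible:

# Rough parameter estimate from the table above (assumptions noted in the text)
d_model, n_layers, vocab = 1024, 24, 50432

attn = 4 * d_model * d_model          # Wq, Wk, Wv, Wo
mlp = 8 * d_model * d_model           # up- and down-projection with 4x expansion
per_layer = attn + mlp                # ~12.6M per layer
embeddings = vocab * d_model          # ~51.6M

total = n_layers * per_layer + embeddings
print(f"{total / 1e6:.0f}M parameters")  # ~354M, consistent with the 350M figure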
Dataset
The model was trained for ~13B tokens (with batch size 64 and sequence length 2048) on OSCAR-2301. Each example was constructed from as many sequences from that dataset as were necessary to fill the 2048 sequence length.
The vocabulary size of 50432 was set to be a multiple of 128, as suggested in MEGATRON-LM; this increased model FLOP utilization (MFU) by up to four percentage points.
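The packing described above (concatenating documents until each training example reaches 2048 tokens) can be sketched as follows. This illustrates the idea only; it is not the llm-foundry preprocessing code, and the function name is hypothetical:

def pack_examples(token_streams, seq_len=2048):
    # Hypothetical sketch of sequence packing, not the llm-foundry implementation.
    # Concatenate tokenized documents and cut the stream into fixed-length examples.
    buffer = []
    for tokens in token_streams:
        buffer.extend(tokens)
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]

# Each yielded example is exactly 2048 tokens, built from as many OSCAR
# documents as are needed to fill it.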