Instructions to use efederici/ipt-350m with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use efederici/ipt-350m with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="efederici/ipt-350m", trust_remote_code=True)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("efederici/ipt-350m", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("efederici/ipt-350m", trust_remote_code=True)
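For example, a pipeline created this way can be called directly to generate text. This is a minimal sketch; the Italian prompt and generation settings are illustrative, not prescribed by the model card:

from transformers import pipeline

pipe = pipeline("text-generation", model="efederici/ipt-350m", trust_remote_code=True)
# Prompt and generation parameters are illustrative only
out = pipe("C'era una volta", max_new_tokens=50, do_sample=True, temperature=0.7)
print(out[0]["generated_text"])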
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use efederici/ipt-350m with vLLM:
Install from pip and serve the model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "efederici/ipt-350m"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "efederici/ipt-350m",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
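Because the server exposes an OpenAI-compatible API, it can also be called from Python. The sketch below assumes the openai client package is installed and points it at the local server; the base_url and parameters are illustrative:

from openai import OpenAI

# Point the OpenAI client at the local vLLM server started above
# (the API key value is arbitrary for a local server).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

completion = client.completions.create(
    model="efederici/ipt-350m",
    prompt="C'era una volta",
    max_tokens=64,
    temperature=0.5,
)
print(completion.choices[0].text)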
- SGLang
How to use efederici/ipt-350m with SGLang:
Install from pip and serve the model
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "efederici/ipt-350m" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "efederici/ipt-350m",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'

Use Docker images
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "efederici/ipt-350m" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "efederici/ipt-350m",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'

- Docker Model Runner
How to use efederici/ipt-350m with Docker Model Runner:
docker model run hf.co/efederici/ipt-350m
ipt-350m
ipt-350m is a decoder-style transformer pretrained from scratch on ~13B tokens of Italian text (work in progress: currently trained on unfiltered OSCAR).
It uses a modified transformer architecture optimized for efficient training and inference. Positional embeddings are replaced with Attention with Linear Biases (ALiBi).
ipt-350m is:
- Licensed for the possibility of commercial use
- Prepared to handle extremely long inputs thanks to ALiBi
- Capable of fast training and inference (via FlashAttention and FasterTransformer)
- Equipped with highly efficient open-source training code via the llm-foundry repository
If you find this project useful, consider supporting its development.
How to Use
import transformers
model = transformers.AutoModelForCausalLM.from_pretrained(
'efederici/ipt-350m',
trust_remote_code=True
)
Note: This model requires that trust_remote_code=True be passed to the from_pretrained method.
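As in the Transformers snippet earlier, the tokenizer can be loaded from the same repository, after which text generation is a call to generate. A minimal sketch; the Italian prompt and generation settings are illustrative:

import transformers

name = 'efederici/ipt-350m'
tokenizer = transformers.AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = transformers.AutoModelForCausalLM.from_pretrained(name, trust_remote_code=True)

# Illustrative prompt and generation parameters (not prescribed by the model card)
inputs = tokenizer("C'era una volta", return_tensors='pt')
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))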
To use the optimized triton implementation of FlashAttention, you can load the model on GPU (cuda:0) with attn_impl='triton' and with bfloat16 precision:
import torch
import transformers
name = 'efederici/ipt-350m'
config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.attn_config['attn_impl'] = 'triton'
config.init_device = 'cuda:0'
model = transformers.AutoModelForCausalLM.from_pretrained(
name,
config=config,
torch_dtype=torch.bfloat16,
trust_remote_code=True
)
Although the model was trained with a sequence length of 2048, ALiBi allows users to increase the maximum sequence length during finetuning and/or inference. For example:
import transformers
name = 'efederici/ipt-350m'
config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.max_seq_len = 4096 # (input + output) tokens can now be up to 4096
model = transformers.AutoModelForCausalLM.from_pretrained(
name,
config=config,
trust_remote_code=True
)
Model Description
The architecture is a modified decoder-only transformer. It departs from a standard transformer in the following ways:
- It uses FlashAttention
- It uses ALiBi (Attention with Linear Biases) and does not use positional embeddings (a short sketch of the bias follows this list)
- It does not use biases
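As a rough illustration of how ALiBi replaces positional embeddings, the sketch below builds the per-head linear bias that is added to the attention logits. This follows the idea from the ALiBi paper, not this repository's exact implementation, and the helper name is hypothetical:

import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    # Hypothetical helper illustrating ALiBi; not taken from this repository.
    # Head-specific slopes form a geometric sequence (assumes n_heads is a
    # power of two, which holds for the 16 heads in the table below).
    slopes = torch.tensor([2.0 ** (-8.0 * (i + 1) / n_heads) for i in range(n_heads)])
    # distance[i, j] = j - i: zero on the diagonal, negative for past positions.
    positions = torch.arange(seq_len)
    distance = positions.view(1, -1) - positions.view(-1, 1)
    # Bias added to the attention logits before softmax: the further a key lies
    # in the past, the more negative its bias, with one slope per head.
    # (Future positions get a positive value here but are removed by the causal mask.)
    return slopes.view(n_heads, 1, 1) * distance.view(1, seq_len, seq_len)

# For example, alibi_bias(16, 2048) has shape (16, 2048, 2048).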
| Hyperparameter | Value |
|---|---|
| n_parameters | 350M |
| n_layers | 24 |
| n_heads | 16 |
| d_model | 1024 |
| vocab size | 50432 |
| sequence length | 2048 |
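As a sanity check on the table, a rough parameter count from these hyperparameters comes out close to 350M. The sketch assumes a standard 4x MLP expansion and no bias terms (as stated above); ALiBi means there is no positional-embedding matrix to count, and LayerNorm parameters are omitted as negligible:

# Rough parameter estimate from the table above (assumptions noted in the text)
d_model, n_layers, vocab = 1024, 24, 50432

attn = 4 * d_model * d_model          # Wq, Wk, Wv, Wo
mlp = 8 * d_model * d_model           # up- and down-projection with 4x expansion
per_layer = attn + mlp                # ~12.6M per layer
embeddings = vocab * d_model          # ~51.6M

total = n_layers * per_layer + embeddings
print(f"{total / 1e6:.0f}M parameters")  # ~354M, consistent with the 350M figure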
Dataset
The model was trained for ~13B tokens (with batch size 64 and sequence length 2048) on OSCAR-2301. Each example was constructed from as many sequences from that dataset as were necessary to fill the 2048 sequence length.
The vocabulary size of 50432 was set to be a multiple of 128, as suggested in MEGATRON-LM; this increased model FLOP utilization (MFU) by up to four percentage points.
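The packing described above (concatenating documents until each training example reaches 2048 tokens) can be sketched as follows. This illustrates the idea only; it is not the llm-foundry preprocessing code, and the function name is hypothetical:

def pack_examples(token_streams, seq_len=2048):
    # Hypothetical sketch of sequence packing, not the llm-foundry implementation.
    # Concatenate tokenized documents and cut the stream into fixed-length examples.
    buffer = []
    for tokens in token_streams:
        buffer.extend(tokens)
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]

# Each yielded example is exactly 2048 tokens, built from as many OSCAR
# documents as are needed to fill it.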