# G2P Multilingual ByT5 Tiny (8 layers) - IPA CHILDES
This is a sequence-to-sequence model based on Google's ByT5, fine-tuned on the IPA CHILDES (split) dataset to convert graphemes to phonemes over a context of 512 tokens for 31 languages.
ByT5 is a tokenizer-free version of Google's T5 and generally follows the architecture of MT5.
ByT5 was pre-trained only on mC4, with no supervised training, using an average span mask of 20 UTF-8 characters. The base model therefore has to be fine-tuned before it is usable on a downstream task.
## Language tags
The following language tags can be used for prefixing the model input:
| Tag | Language |
|---|---|
| ca | Catalan |
| cy | Welsh |
| da | Danish |
| de | German |
| en-na | English (North America) |
| en-uk | English (United Kingdom) |
| es | Spanish |
| et | Estonian |
| eu | Basque |
| fa | Persian |
| fr | French |
| ga | Irish |
| hr | Croatian |
| hu | Hungarian |
| id | Indonesian |
| is | Icelandic |
| it | Italian |
| ja | Japanese |
| ko | Korean |
| nl | Dutch |
| no | Norwegian |
| pl | Polish |
| pt | Portuguese |
| pt-br | Portuguese (Brazil) |
| qu | Quechua |
| ro | Romanian |
| sr | Serbian |
| sv | Swedish |
| tr | Turkish |
| zh | Chinese |
| zh-yue | Cantonese |
The tag must be prepended to the prompt as a prefix in the format `<{tag}>: ` (e.g., `<pt-br>: `).
Note: the space between the prefix colon (`:`) and the beginning of the text is mandatory.
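The prefix format can be sketched as a small helper (`build_prompt` is a hypothetical name, not part of the model's API; it simply illustrates the required `<{tag}>: ` pattern):

```python
def build_prompt(tag: str, text: str) -> str:
    """Prepend the language tag prefix expected by the model.

    The space after the colon is mandatory.
    """
    return f"<{tag}>: {text}"

print(build_prompt("pt-br", "Bom dia"))
# '<pt-br>: Bom dia'
```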
## Example 1: inference with tokenizer
For batched inference and training, it is recommended to use a tokenizer class to handle padding, truncation and additional tokens:

```python
from transformers import T5ForConditionalGeneration, AutoTokenizer

model = T5ForConditionalGeneration.from_pretrained('fdemelo/g2p-multilingual-byt5-tiny-8l-ipa-childes')
tokenizer = AutoTokenizer.from_pretrained('fdemelo/g2p-multilingual-byt5-tiny-8l-ipa-childes')

model_inputs = tokenizer(
    ["<en-na>: Life is like a box of chocolates."],
    max_length=512,
    padding=True,
    truncation=True,
    add_special_tokens=False,
    return_tensors="pt",
)
# Greedy decoding is enough; we did not find beam search helpful.
preds = model.generate(**model_inputs, num_beams=1, max_length=512)
phones = tokenizer.batch_decode(preds.tolist(), skip_special_tokens=True)
print(phones)
# ['laɪf ɪz laɪk ʌ bɑks ʌv t̠ʃɑkləts']
```
## Example 2: inference without tokenizer
For standalone inference, decoding without the tokenizer reads as:

```python
import torch
import json
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained('fdemelo/g2p-multilingual-byt5-tiny-8l-ipa-childes')

# Add a shift of 3 to account for the special tokens <pad>, </s>, <unk>
input_ids = torch.tensor([list("<en-na>: Life is like a box of chocolates.".encode("utf-8"))]) + 3
preds = model.generate(input_ids=input_ids, num_beams=1, max_length=512)

# Simplified version of the decoding process (discarding special/added tokens);
# tokenizer_config.json ships with the model repository
with open("tokenizer_config.json", "r") as f:
    added_tokens = json.load(f).get("added_tokens_decoder", {})
phone_bytes = [
    bytes([token - 3]) for token in preds[0].tolist() if str(token) not in added_tokens
]
phones = b''.join(phone_bytes).decode("utf-8", errors="ignore")
print(phones)
# 'laɪf ɪz laɪk ʌ bɑks ʌv t̠ʃɑkləts'
```
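The +3 shift mirrors ByT5's token layout, where IDs 0-2 are reserved for `<pad>`, `</s>` and `<unk>` and raw bytes start at ID 3. A minimal standalone sketch of this encode/decode round-trip (no model involved; the helper names are illustrative, not part of the library):

```python
def bytes_to_ids(text: str) -> list[int]:
    # Raw UTF-8 bytes shifted by 3 to skip <pad>=0, </s>=1, <unk>=2
    return [b + 3 for b in text.encode("utf-8")]

def ids_to_text(ids: list[int]) -> str:
    # Drop special-token IDs (< 3) and undo the shift before decoding
    return bytes(i - 3 for i in ids if i >= 3).decode("utf-8")

ids = bytes_to_ids("laɪf")       # multi-byte UTF-8 characters become several IDs
print(ids_to_text(ids + [1]))    # a trailing </s> (ID 1) is silently discarded
# 'laɪf'
```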