Quran Phoneme-Level Tokenizer
This is a custom tokenizer for phoneme-level transcription of Quranic Arabic, including diacritics (harakat).
It is designed for phoneme-level fine-tuning of Whisper or other speech-to-text models.
Features
- Handles Quran transliteration in Buckwalter format (e.g., bi,{ll~ahi,r~aHiym).
- Preserves diacritics (fatha, kasra, damma, shadda, sukun).
- Outputs phoneme-level tokens suitable for speech recognition fine-tuning.
- Includes special tokens: <pad>, <s>, </s>, <unk>.
How to use
from transformers import PreTrainedTokenizerFast
# Load the tokenizer from the Hugging Face Hub
tokenizer = PreTrainedTokenizerFast.from_pretrained("mrmuminov/quranic-phoneme-tokenizer")
# Encode phoneme text (space-separated phoneme units)
phoneme_text = "b_i s_sukun m_i"
inputs = tokenizer(phoneme_text)
# Decode the token IDs back to text
decoded = tokenizer.decode(inputs["input_ids"])
print(decoded)
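The usage example feeds the tokenizer space-separated units such as "b_i s_sukun m_i", while the feature list mentions Buckwalter transliteration. The model card does not document how one is derived from the other, so the following is only a rough sketch of one possible conversion, assuming each unit is a base letter joined by underscores to the diacritics that follow it. The function name buckwalter_to_units and the "shadda" label are hypothetical; only the "sukun" label and the bare vowel letters are attested by the example above. Tanwin, punctuation, and word boundaries are not handled.

```python
# Hypothetical mapping from Buckwalter diacritic symbols to unit labels.
# Assumption: labels follow the pattern seen in "b_i s_sukun m_i".
DIACRITICS = {
    "a": "a",       # fatha
    "u": "u",       # damma
    "i": "i",       # kasra
    "o": "sukun",   # sukun
    "~": "shadda",  # shadda (label is an assumption, not from the model card)
}


def buckwalter_to_units(text):
    """Group each base letter with the diacritics that directly follow it."""
    units = []
    for ch in text:
        if ch in DIACRITICS:
            # Attach the diacritic to the most recent base letter, if any;
            # a diacritic with no preceding letter is dropped.
            if units:
                units[-1].append(DIACRITICS[ch])
        elif not ch.isspace():
            # Any other non-space character starts a new unit.
            units.append([ch])
    return " ".join("_".join(unit) for unit in units)


print(buckwalter_to_units("bisomi"))  # b_i s_sukun m_i
```

Whitespace is discarded here, so word boundaries are lost; whether the real preprocessing keeps them is not stated in the source.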