Size of dataset
Hi, I would like to fine-tune bge-m3 for Spanish in a legal context.
What size of dataset do you recommend? Also, do you have the notebook you used to fine-tune this model?
Thank you very much!
Hi! I’m also very interested in this topic
Following up on that comment, could you please share a bit more about your setup?
In addition to the size of the dataset, could you share any of your data sources? Did you train with pairs (query–positive) or triplets (query–positive–negative), and if so, how did you generate or mine hard negatives?
Thanks!
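For context on the last question: a common approach is to embed the corpus with the current model, then keep the highest-scoring documents that are *not* marked relevant as hard negatives. A minimal sketch in plain Python (toy 2-d embeddings standing in for real encoder output, function names are illustrative):

```python
import math

def cosine(a, b):
    # cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def mine_hard_negatives(query_emb, corpus_embs, positive_ids, k=2):
    # score every corpus doc against the query, drop the known positives,
    # and keep the k highest-scoring remainders: documents the model finds
    # similar to the query but that are not actually relevant
    scored = [
        (doc_id, cosine(query_emb, emb))
        for doc_id, emb in corpus_embs.items()
        if doc_id not in positive_ids
    ]
    scored.sort(key=lambda t: t[1], reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

# toy corpus: d1 is close to the query but not a positive -> hard negative
corpus = {
    "d1": [1.0, 0.0],
    "d2": [0.9, 0.1],   # the known positive
    "d3": [0.0, 1.0],   # easy negative, far from the query
}
print(mine_hard_negatives([1.0, 0.05], corpus, positive_ids={"d2"}, k=1))
# -> ['d1']
```

In practice you would do this with a real encoder over the full corpus (often in batches with a vector index), but the selection logic is the same.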
https://huggingface.co/datasets/manu/embedding_data_v2_100k
I think this was my final dataset, 100k samples. More is better, but quality is key. I probably fine-tuned it with Sentence Transformers at the time.
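For anyone unsure what such training data looks like on disk: this is not the exact schema of the linked dataset, but triplet-style records are commonly stored as JSONL, one object per line. A sketch with illustrative field names and made-up Spanish legal examples:

```python
import json

# illustrative triplet records; the field names and texts are assumptions,
# not the actual schema of the dataset linked above
triplets = [
    {
        "query": "¿Qué plazo tiene el recurso de apelación?",
        "positive": "El recurso de apelación debe interponerse en un plazo de veinte días.",
        "negative": "El contrato de arrendamiento se regula en una ley distinta.",
    },
]

# write one JSON object per line (JSONL)
with open("triplets.jsonl", "w", encoding="utf-8") as f:
    for t in triplets:
        f.write(json.dumps(t, ensure_ascii=False) + "\n")

# read it back to confirm the round trip
with open("triplets.jsonl", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print(loaded[0]["query"])
```

A pairs-only dataset is the same thing without the `negative` field; most embedding-training tooling accepts either.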
All of what you mention will help performance, but I recommend:
- setting up a good eval you trust, plus some baselines
- starting naive
- iterating with the tricks
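On the eval point: even something as simple as recall@k over a held-out set of queries with known relevant documents gives you a trustworthy number to iterate against. A minimal sketch (metric choice and toy data are mine, not from the thread):

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    # fraction of the relevant documents that appear in the top-k results
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

# one query with two relevant docs; the retriever returned this ranking
ranking = ["d7", "d2", "d9", "d4"]
relevant = {"d2", "d4"}
print(recall_at_k(ranking, relevant, k=2))  # 0.5: only d2 is in the top 2
print(recall_at_k(ranking, relevant, k=4))  # 1.0: both found by k=4
```

Average that over all eval queries, compute it for the base model first, and you have your baseline.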
Good luck