Size of dataset
Hi, I would like to fine-tune bge-m3 for Spanish in a legal context.
What size of dataset do you recommend? Also, do you have the notebook you used to fine-tune this model?
Thank you very much!
Hi! I’m also very interested in this topic
Following up on that comment, could you please share a bit more about your setup?
In addition to the size of the dataset, could you share any of your data sources? Did you train with pairs (query–positive) or triplets (query–positive–negative), and if so, how did you generate or mine hard negatives?
Thanks!
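For context on the last question: a common approach is to embed the corpus with the current model, then keep the highest-scoring documents that are *not* marked relevant as hard negatives. A minimal sketch in plain Python (toy 2-d embeddings standing in for real encoder output, function names are illustrative):

```python
import math

def cosine(a, b):
    # cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def mine_hard_negatives(query_emb, corpus_embs, positive_ids, k=2):
    # score every corpus doc against the query, drop the known positives,
    # and keep the k highest-scoring remainders: documents the model finds
    # similar to the query but that are not actually relevant
    scored = [
        (doc_id, cosine(query_emb, emb))
        for doc_id, emb in corpus_embs.items()
        if doc_id not in positive_ids
    ]
    scored.sort(key=lambda t: t[1], reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

# toy corpus: d1 is close to the query but not a positive -> hard negative
corpus = {
    "d1": [1.0, 0.0],
    "d2": [0.9, 0.1],   # the known positive
    "d3": [0.0, 1.0],   # easy negative, far from the query
}
print(mine_hard_negatives([1.0, 0.05], corpus, positive_ids={"d2"}, k=1))
# -> ['d1']
```

In practice you would do this with a real encoder over the full corpus (often in batches with a vector index), but the selection logic is the same.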
https://huggingface.co/datasets/manu/embedding_data_v2_100k
I think this was my final dataset, 100k samples. More is better, but quality is key. I probably fine-tuned it with Sentence Transformers at the time.
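For anyone unsure what such training data looks like on disk: this is not the exact schema of the linked dataset, but triplet-style records are commonly stored as JSONL, one object per line. A sketch with illustrative field names and made-up Spanish legal examples:

```python
import json

# illustrative triplet records; the field names and texts are assumptions,
# not the actual schema of the dataset linked above
triplets = [
    {
        "query": "¿Qué plazo tiene el recurso de apelación?",
        "positive": "El recurso de apelación debe interponerse en un plazo de veinte días.",
        "negative": "El contrato de arrendamiento se regula en una ley distinta.",
    },
]

# write one JSON object per line (JSONL)
with open("triplets.jsonl", "w", encoding="utf-8") as f:
    for t in triplets:
        f.write(json.dumps(t, ensure_ascii=False) + "\n")

# read it back to confirm the round trip
with open("triplets.jsonl", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print(loaded[0]["query"])
```

A pairs-only dataset is the same thing without the `negative` field; most embedding-training tooling accepts either.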
All of what you mention will help performance, but I recommend:
- setting up a good eval you trust, plus some baselines
- starting naive
- iterating with the tricks
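On the eval point: even something as simple as recall@k over a held-out set of queries with known relevant documents gives you a trustworthy number to iterate against. A minimal sketch (metric choice and toy data are mine, not from the thread):

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    # fraction of the relevant documents that appear in the top-k results
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

# one query with two relevant docs; the retriever returned this ranking
ranking = ["d7", "d2", "d9", "d4"]
relevant = {"d2", "d4"}
print(recall_at_k(ranking, relevant, k=2))  # 0.5: only d2 is in the top 2
print(recall_at_k(ranking, relevant, k=4))  # 1.0: both found by k=4
```

Average that over all eval queries, compute it for the base model first, and you have your baseline.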
Good luck