Paper: From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective (arXiv:2205.04733)
This is a SPLADE Sparse Encoder model fine-tuned from naver/splade-v3 using the sentence-transformers library. It maps sentences and paragraphs to a 30522-dimensional sparse vector space and can be used for semantic search and sparse retrieval.
This model was built for a project in an Advanced Information Retrieval course. The project recommends games based on their descriptions. See also our GitHub repository: https://github.com/Spivonxe/There-are-no-games
SparseEncoder(
  (0): MLMTransformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'BertForMaskedLM'})
  (1): SpladePooling({'pooling_strategy': 'max', 'activation_function': 'relu', 'word_embedding_dimension': 30522})
)
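To make the SpladePooling step more concrete, here is a minimal pure-Python sketch of max pooling over token positions. It assumes the standard SPLADE formulation, in which each masked-language-model logit passes through log(1 + ReLU(x)) before the maximum is taken over the sequence. The function name `splade_max_pool` and the tiny 3-token, 4-term logits matrix are made up for illustration; the real model works on batched 30522-dimensional tensors.

```python
import math


def splade_max_pool(mlm_logits, attention_mask):
    """Collapse per-token vocabulary logits into one sparse vector.

    mlm_logits: list of rows, one per token, each holding one logit per vocab term.
    attention_mask: list of 0/1 flags, where 0 marks a padding token to ignore.
    """
    vocab_size = len(mlm_logits[0])
    pooled = [0.0] * vocab_size
    for row, keep in zip(mlm_logits, attention_mask):
        if not keep:
            continue  # padding tokens contribute nothing
        for term, logit in enumerate(row):
            # Assumed activation: log(1 + ReLU(logit)), as in the SPLADE papers.
            weight = math.log1p(max(logit, 0.0))
            pooled[term] = max(pooled[term], weight)
    return pooled


# Hypothetical logits for a 3-token input over a 4-term vocabulary.
logits = [
    [2.0, -1.0, 0.0, 0.5],
    [0.5, -3.0, 0.0, 1.5],
    [9.0, 9.0, 9.0, 9.0],  # padding position, masked out below
]
vector = splade_max_pool(logits, attention_mask=[1, 1, 0])
print([round(w, 4) for w in vector])
```

Terms whose logits are never positive stay at exactly zero, which is where the sparsity of the output vector comes from.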
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SparseEncoder
# Download from the 🤗 Hub
model = SparseEncoder("mazombieme/There-Are-No-Games")
# Run inference
queries = [
    "artifacts powerful",
]
documents = [
    'Dreamscaper By night, delve deep into your subconscious and discover powerful artifacts to conquer your nightmares. By day, explore the city of Redhaven and build relationships to unlock the power of your dreams. DREAM. DIE. WAKE. REPEAT.',
    "The Mystery of Devils House The Mystery of Devil's House is a 2D Horror Action Platformer based on the story of the Haunted house called William's House located in Massachusetts investigated by The Ghostbuster Crew: International.",
    'KoroNeko Roll your way through a cozy, kawaii world full of charming characters and challenging puzzles to save your siblings from Strawberry the Witch!',
]
query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# [1, 30522] [3, 30522]
# Get the similarity scores for the embeddings
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[12.4569, 0.0118, 0.0000]])
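The similarity scores above appear to be dot products between sparse vectors (the loss configuration in this card lists similarity_fct='dot_score'), so only vocabulary terms that are active in both the query and the document contribute to a score. A small sketch of that idea follows, with hypothetical term weights stored as dicts rather than 30522-dimensional tensors; `sparse_dot` is an illustrative helper, not a library function.

```python
def sparse_dot(query_vec, doc_vec):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    # Iterate over the smaller dict; terms missing from the other side add zero.
    small, large = sorted((query_vec, doc_vec), key=len)
    return sum(weight * large.get(term, 0.0) for term, weight in small.items())


# Hypothetical activations; real weights come from model.encode_query / model.encode_document.
query = {"artifacts": 2.0, "powerful": 1.5}
docs = {
    "Dreamscaper": {"artifacts": 1.8, "powerful": 1.2, "nightmares": 2.1},
    "KoroNeko": {"puzzles": 1.9, "kawaii": 2.4},
}
scores = {name: sparse_dot(query, vec) for name, vec in docs.items()}
print(scores)
```

A document that shares no active terms with the query scores exactly zero, which matches the 0.0000 entry in the output shown above.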
Columns: query and document

| | query | document |
|---|---|---|
| type | string | string |
| query | document |
|---|---|
| control difficulties | Manabi SandBox This is an action game in which you control a knight and explore unknown ruins. The goal is to collect all the fruits on the map while exterminating monsters. Overcome various difficulties with simple actions and ingenuity. The Knights never die, so don't give up and solve all the riddles. |
| atmospheric every mystery narrativedriven mansion tension | Crimson Mansion Uncover the secrets of a forgotten mansion where every room raises new questions. A narrative-driven exploration game full of mystery and atmospheric tension. |
| visually games edutainment pointandclick puzzle adventure | Endacopia Endacopia is a Point-and-Click Puzzle Adventure that is visually reminiscent of early computer edutainment games... with underlying horrors. |
SpladeLoss with these parameters:
{
    "loss": "SparseMultipleNegativesRankingLoss(scale=1.0, similarity_fct='dot_score', gather_across_devices=False)",
    "document_regularizer_weight": 0.003,
    "query_regularizer_weight": 0
}
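As a rough illustration of this configuration, the sketch below computes an in-batch-negatives ranking loss (cross-entropy over dot-product scores with scale 1.0) and adds a document-side sparsity penalty weighted by 0.003, with no query-side penalty. It assumes the regularizer is the FLOPS penalty from the SPLADE papers, that is, the squared mean activation per vocabulary term, summed over terms. The toy 3-term vectors and all helper names are hypothetical, and the library's own implementation operates on batched tensors.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def ranking_loss(queries, documents, scale=1.0):
    """Cross-entropy where document i is the positive for query i and
    every other document in the batch serves as a negative."""
    total = 0.0
    for i, q in enumerate(queries):
        scores = [scale * dot(q, d) for d in documents]
        peak = max(scores)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[i]
    return total / len(queries)


def flops_penalty(vectors):
    """Assumed FLOPS regularizer: sum over terms of (mean activation) squared."""
    n = len(vectors)
    return sum((sum(column) / n) ** 2 for column in zip(*vectors))


def splade_loss(queries, documents, document_weight=0.003, query_weight=0.0):
    return (
        ranking_loss(queries, documents)
        + document_weight * flops_penalty(documents)
        + query_weight * flops_penalty(queries)
    )


# Toy batch of two (query, document) pairs over a 3-term vocabulary.
queries = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
documents = [[2.0, 0.0, 1.0], [0.0, 2.0, 1.0]]
print(round(ranking_loss(queries, documents), 4))
print(round(flops_penalty(documents), 4))
print(round(splade_loss(queries, documents), 4))
```

With a query weight of 0, only the document vectors are pushed toward sparsity, so query expansion is left unconstrained.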
Columns: query and document

| | query | document |
|---|---|---|
| type | string | string |
| query | document |
|---|---|
| cones playground park explore | Pipo Park Pipo Park is a messy and happy digital playground. Explore a small park, play football, throw traffic cones, balls, dogs, and more... Have a nice time! |
| sentient united employee dark | Quantum Eye Set in a dark, futuristic united states, all citizens are watched under a highly-advanced supercomputer known as the "Quantum Eye" Machine. You are an employee at the Quantum Eye facility, but quantum eye goes sentient. Your goal is simple, SHUT IT DOWN... |
| five solve in twisted 2 powerful | Hanaja's Body 2 in One Indulge in this twisted tale of transformations as five characters struggle for control of one body: yours! Mix and match between five different, uniquely powerful forms to defeat dangerous foes and solve puzzles to escape the tome you've been trapped in before it's too late. |
SpladeLoss with these parameters:
{
    "loss": "SparseMultipleNegativesRankingLoss(scale=1.0, similarity_fct='dot_score', gather_across_devices=False)",
    "document_regularizer_weight": 0.003,
    "query_regularizer_weight": 0
}
Non-default hyperparameters:
eval_strategy: steps
per_device_train_batch_size: 16
per_device_eval_batch_size: 16
learning_rate: 2e-05
warmup_ratio: 0.1
fp16: True
batch_sampler: no_duplicates
router_mapping: {'query': 'query', 'answer': 'document'}

All hyperparameters:
overwrite_output_dir: False
do_predict: False
eval_strategy: steps
prediction_loss_only: True
per_device_train_batch_size: 16
per_device_eval_batch_size: 16
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 2e-05
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1.0
num_train_epochs: 3
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.1
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
bf16: False
fp16: True
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: False
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
parallelism_config: None
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch_fused
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
project: huggingface
trackio_space_id: trackio
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: None
hub_always_push: False
hub_revision: None
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
include_for_metrics: []
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
include_tokens_per_second: False
include_num_input_tokens_seen: no
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
use_liger_kernel: False
liger_kernel_config: None
eval_use_gather_object: False
average_tokens_across_devices: True
prompts: None
batch_sampler: no_duplicates
multi_dataset_batch_sampler: proportional
router_mapping: {'query': 'query', 'answer': 'document'}
learning_rate_mapping: {}

| Epoch | Step | Training Loss | Validation Loss |
|---|---|---|---|
| 0.0812 | 200 | 0.0307 | - |
| 0.1625 | 400 | 0.0116 | - |
| 0.2437 | 600 | 0.0162 | - |
| 0.3249 | 800 | 0.0086 | - |
| 0.4062 | 1000 | 0.0095 | 0.0134 |
| 0.4874 | 1200 | 0.0114 | - |
| 0.5686 | 1400 | 0.0113 | - |
| 0.6499 | 1600 | 0.0124 | - |
| 0.7311 | 1800 | 0.0145 | - |
| 0.8123 | 2000 | 0.0151 | 0.0167 |
| 0.8936 | 2200 | 0.0162 | - |
| 0.9748 | 2400 | 0.0163 | - |
| 1.0561 | 2600 | 0.0169 | - |
| 1.1373 | 2800 | 0.017 | - |
| 1.2185 | 3000 | 0.0177 | 0.0208 |
| 1.2998 | 3200 | 0.0176 | - |
| 1.3810 | 3400 | 0.0255 | - |
| 1.4622 | 3600 | 0.0143 | - |
| 1.5435 | 3800 | 0.0168 | - |
| 1.6247 | 4000 | 0.018 | 0.0191 |
| 1.7059 | 4200 | 0.0177 | - |
| 1.7872 | 4400 | 0.0148 | - |
| 1.8684 | 4600 | 0.0139 | - |
| 1.9496 | 4800 | 0.0156 | - |
| 2.0309 | 5000 | 0.0125 | 0.0173 |
| 2.1121 | 5200 | 0.0139 | - |
| 2.1933 | 5400 | 0.0113 | - |
| 2.2746 | 5600 | 0.0124 | - |
| 2.3558 | 5800 | 0.0118 | - |
| 2.4370 | 6000 | 0.0115 | 0.0177 |
| 2.5183 | 6200 | 0.011 | - |
| 2.5995 | 6400 | 0.0117 | - |
| 2.6807 | 6600 | 0.0108 | - |
| 2.7620 | 6800 | 0.0127 | - |
| 2.8432 | 7000 | 0.0128 | 0.0165 |
| 2.9245 | 7200 | 0.0107 | - |
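
The hyperparameters listed earlier specify a linear scheduler with warmup_ratio 0.1, a peak learning rate of 2e-05, and 3 training epochs. The sketch below shows how such a schedule would evolve over the run, assuming linear warmup from zero followed by linear decay to zero. The total step count is not stated in this card; the value used here is a rough estimate from the log above (step 200 at epoch 0.0812 implies about 2463 steps per epoch), so treat `TOTAL_STEPS` and the helper name as hypothetical.

```python
import math


def linear_schedule_lr(step, total_steps, peak_lr=2e-05, warmup_ratio=0.1):
    """Learning rate at `step` under linear warmup, then linear decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Estimated, not reported: about 2463 steps per epoch times 3 epochs.
TOTAL_STEPS = 7389

for step in (0, 200, 739, 4000, 7200):
    print(step, f"{linear_schedule_lr(step, TOTAL_STEPS):.2e}")
```

Under these assumptions the learning rate peaks around step 739 and is close to zero by the last logged step, which may help when reading the training-loss column.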
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
@misc{formal2022distillationhardnegativesampling,
title={From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective},
author={Thibault Formal and Carlos Lassance and Benjamin Piwowarski and Stéphane Clinchant},
year={2022},
eprint={2205.04733},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2205.04733},
}
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Base model: naver/splade-v3