SPLADE Sparse Encoder

This is a SPLADE sparse encoder model fine-tuned from naver/splade-v3 using the sentence-transformers library. It maps sentences and paragraphs to a 30522-dimensional sparse vector space and can be used for semantic search and sparse retrieval.

Model Details

This model was built for our project in an Advanced Information Retrieval course; the project recommends games based on their descriptions. See also our GitHub repository: https://github.com/Spivonxe/There-are-no-games

Model Description

  • Model Type: SPLADE Sparse Encoder
  • Base model: naver/splade-v3
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 30522 dimensions
  • Similarity Function: Dot Product

Full Model Architecture

SparseEncoder(
  (0): MLMTransformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'BertForMaskedLM'})
  (1): SpladePooling({'pooling_strategy': 'max', 'activation_function': 'relu', 'word_embedding_dimension': 30522})
)
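
The SpladePooling step above can be sketched in plain NumPy: SPLADE takes the MLM-head logits at every token position, applies a log(1 + ReLU(x)) saturation, and max-pools over the sequence, yielding one non-negative 30522-dimensional vector. This is only a conceptual sketch of the pooling stage, not the library's actual implementation.

```python
import numpy as np

def splade_pool(mlm_logits: np.ndarray) -> np.ndarray:
    """Max-pool log-saturated MLM logits into one sparse vector.

    mlm_logits: (seq_len, vocab_size) logits from the MLM head.
    Returns a (vocab_size,) non-negative sparse representation.
    """
    # ReLU keeps only positive evidence; log1p dampens large logits
    activated = np.log1p(np.maximum(mlm_logits, 0.0))
    # Max over token positions ('pooling_strategy': 'max')
    return activated.max(axis=0)

# Toy example: 4 token positions over a 10-word vocabulary
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))
vec = splade_pool(logits)
print(vec.shape)         # (10,)
print((vec >= 0).all())  # True: the representation is non-negative
```

Because of the ReLU, most dimensions are exactly zero for real inputs, which is what makes the output usable with an inverted index.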

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SparseEncoder

# Download from the 🤗 Hub
model = SparseEncoder("mazombieme/There-Are-No-Games")
# Run inference
queries = [
    "artifacts powerful",
]
documents = [
    'Dreamscaper By night, delve deep into your subconscious and discover powerful artifacts to conquer your nightmares. By day, explore the city of Redhaven and build relationships to unlock the power of your dreams. DREAM. DIE. WAKE. REPEAT.',
    "The Mystery of Devils House The Mystery of Devil's House is a 2D Horror Action Platformer based on the story of the Haunted house called William's House located in Massachusetts investigated by The Ghostbuster Crew: International.",
    'KoroNeko Roll your way through a cozy, kawaii world full of charming characters and challenging puzzles to save your siblings from Strawberry the Witch!',
]
query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# [1, 30522] [3, 30522]

# Get the similarity scores for the embeddings
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[12.4569,  0.0118,  0.0000]])
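
Each dimension of these embeddings corresponds to one BERT vocabulary token, so the vectors are directly interpretable: the non-zero entries tell you which terms the model activated for a text. Recent sentence-transformers versions ship a decode helper for this; the idea can also be sketched without the library, using a toy vocabulary in place of the real 30522-token one:

```python
import numpy as np

def top_terms(embedding: np.ndarray, vocab: list[str], k: int = 3) -> list[tuple[str, float]]:
    """Return the k highest-weighted vocabulary terms of a sparse embedding."""
    idx = np.argsort(embedding)[::-1][:k]  # indices of the largest weights
    return [(vocab[i], float(embedding[i])) for i in idx if embedding[i] > 0]

# Hypothetical 6-token "vocabulary" standing in for the BERT vocab
vocab = ["game", "artifact", "dream", "cat", "house", "puzzle"]
emb = np.array([0.0, 2.1, 1.4, 0.0, 0.0, 0.3])
print(top_terms(emb, vocab))  # [('artifact', 2.1), ('dream', 1.4), ('puzzle', 0.3)]
```

Applied to the query above, this kind of inspection shows why "artifacts powerful" matches the Dreamscaper description so strongly: both activate the same expansion terms.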

Training Details

Training Dataset

Unnamed Dataset

  • Size: 39,381 training samples
  • Columns: query and document
  • Approximate statistics based on the first 1000 samples:
               query          document
      type     string         string
      min      4 tokens       10 tokens
      mean     7.17 tokens    50.35 tokens
      max      22 tokens      103 tokens
  • Samples:
    query: control difficulties
    document: Manabi SandBox This is an action game in which you control a knight and explore unknown ruins. The goal is to collect all the fruits on the map while exterminating monsters. Overcome various difficulties with simple actions and ingenuity. The Knights never die, so don't give up and solve all the riddles.

    query: atmospheric every mystery narrativedriven mansion tension
    document: Crimson Mansion Uncover the secrets of a forgotten mansion where every room raises new questions. A narrative-driven exploration game full of mystery and atmospheric tension.

    query: visually games edutainment pointandclick puzzle adventure
    document: Endacopia Endacopia is a Point-and-Click Puzzle Adventure that is visually reminiscent of early computer edutainment games... with underlying horrors.
  • Loss: SpladeLoss with these parameters:
    {
        "loss": "SparseMultipleNegativesRankingLoss(scale=1.0, similarity_fct='dot_score', gather_across_devices=False)",
        "document_regularizer_weight": 0.003,
        "query_regularizer_weight": 0
    }
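
The regularizer weights above control SPLADE's FLOPS penalty, which is added to the ranking loss to keep document embeddings sparse (queries are left unregularized here, weight 0). Assuming the standard formulation from the SPLADE papers, the penalty is the sum over vocabulary dimensions of the squared mean activation across the batch; a minimal NumPy sketch:

```python
import numpy as np

def flops_regularizer(batch_embeddings: np.ndarray) -> float:
    """FLOPS sparsity penalty: sum over vocab dims of (mean activation)^2.

    batch_embeddings: (batch_size, vocab_size) non-negative sparse vectors.
    Dimensions that fire across many documents are penalized quadratically,
    which shortens posting lists in the resulting inverted index.
    """
    mean_per_dim = np.abs(batch_embeddings).mean(axis=0)  # (vocab_size,)
    return float((mean_per_dim ** 2).sum())

# Same total mass, different distribution across an 8-dim toy vocab:
shared = np.zeros((4, 8))
shared[:, 0] = 1.0        # all 4 documents activate dimension 0
spread = np.eye(4, 8)     # each document activates its own dimension
print(flops_regularizer(shared))  # 1.0
print(flops_regularizer(spread))  # 0.25 -> cheaper: activations are spread out
```

During training, this term (scaled by document_regularizer_weight = 0.003) is summed with the SparseMultipleNegativesRankingLoss shown above.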
    

Evaluation Dataset

Unnamed Dataset

  • Size: 4,913 evaluation samples
  • Columns: query and document
  • Approximate statistics based on the first 1000 samples:
               query          document
      type     string         string
      min      4 tokens       6 tokens
      mean     6.97 tokens    49.39 tokens
      max      16 tokens      118 tokens
  • Samples:
    query: cones playground park explore
    document: Pipo Park Pipo Park is a messy and happy digital playground. Explore a small park, play football, throw traffic cones, balls, dogs, and more... Have a nice time!

    query: sentient united employee dark
    document: Quantum Eye Set in a dark, futuristic united states, all citizens are watched under a highly-advanced supercomputer known as the "Quantum Eye" Machine. You are an employee at the Quantum Eye facility, but quantum eye goes sentient. Your goal is simple, SHUT IT DOWN...

    query: five solve in twisted 2 powerful
    document: Hanaja's Body 2 in One Indulge in this twisted tale of transformations as five characters struggle for control of one body: yours! Mix and match between five different, uniquely powerful forms to defeat dangerous foes and solve puzzles to escape the tome you've been trapped in before it's too late.
  • Loss: SpladeLoss with these parameters:
    {
        "loss": "SparseMultipleNegativesRankingLoss(scale=1.0, similarity_fct='dot_score', gather_across_devices=False)",
        "document_regularizer_weight": 0.003,
        "query_regularizer_weight": 0
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • learning_rate: 2e-05
  • warmup_ratio: 0.1
  • fp16: True
  • batch_sampler: no_duplicates
  • router_mapping: {'query': 'query', 'answer': 'document'}

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 2e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 3
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • parallelism_config: None
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch_fused
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • project: huggingface
  • trackio_space_id: trackio
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • hub_revision: None
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: no
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • liger_kernel_config: None
  • eval_use_gather_object: False
  • average_tokens_across_devices: True
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional
  • router_mapping: {'query': 'query', 'answer': 'document'}
  • learning_rate_mapping: {}

Training Logs

Epoch Step Training Loss Validation Loss
0.0812 200 0.0307 -
0.1625 400 0.0116 -
0.2437 600 0.0162 -
0.3249 800 0.0086 -
0.4062 1000 0.0095 0.0134
0.4874 1200 0.0114 -
0.5686 1400 0.0113 -
0.6499 1600 0.0124 -
0.7311 1800 0.0145 -
0.8123 2000 0.0151 0.0167
0.8936 2200 0.0162 -
0.9748 2400 0.0163 -
1.0561 2600 0.0169 -
1.1373 2800 0.017 -
1.2185 3000 0.0177 0.0208
1.2998 3200 0.0176 -
1.3810 3400 0.0255 -
1.4622 3600 0.0143 -
1.5435 3800 0.0168 -
1.6247 4000 0.018 0.0191
1.7059 4200 0.0177 -
1.7872 4400 0.0148 -
1.8684 4600 0.0139 -
1.9496 4800 0.0156 -
2.0309 5000 0.0125 0.0173
2.1121 5200 0.0139 -
2.1933 5400 0.0113 -
2.2746 5600 0.0124 -
2.3558 5800 0.0118 -
2.4370 6000 0.0115 0.0177
2.5183 6200 0.011 -
2.5995 6400 0.0117 -
2.6807 6600 0.0108 -
2.7620 6800 0.0127 -
2.8432 7000 0.0128 0.0165
2.9245 7200 0.0107 -

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 5.2.0
  • Transformers: 4.57.1
  • PyTorch: 2.9.0+cu128
  • Accelerate: 1.12.0
  • Datasets: 4.4.2
  • Tokenizers: 0.22.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

SpladeLoss

@misc{formal2022distillationhardnegativesampling,
      title={From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective},
      author={Thibault Formal and Carlos Lassance and Benjamin Piwowarski and Stéphane Clinchant},
      year={2022},
      eprint={2205.04733},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2205.04733},
}

SparseMultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
Model Size

  • Parameters: 0.1B
  • Tensor type: F32
  • Format: Safetensors