FAISS Index for Patent Retrieval

This repository contains FAISS index files created with the following parameters:

Model: SPECTER2 (allenai/specter2_base)
Index types: IVF100,PQ16 and IVF100,PQ64
Distance metric: L2
Embedding dimension: 768
Corpus: USPTO Patents
Product quantization: PQ16 by default; a PQ64 variant is also provided for higher precision
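
As a rough illustration of the PQ16 versus PQ64 trade-off listed above, the sketch below works out how a 768-dimensional vector is split and how many bytes each compressed code takes. It assumes 8 bits per sub-quantizer, which is an assumption about how these indexes were built, and it counts only the PQ code, not the IDs or inverted-list overhead. The helper name pq_layout is hypothetical.

```python
def pq_layout(dim, m, nbits=8):
    """Return (sub-vector dimension, bytes per PQ code) for a product quantizer.

    Assumes `nbits` bits per sub-quantizer (8 is assumed here, not confirmed
    by the repository).
    """
    if dim % m != 0:
        raise ValueError("dim must be divisible by the number of sub-quantizers")
    return dim // m, (m * nbits + 7) // 8


DIM = 768            # SPECTER2 embedding dimension
raw_bytes = DIM * 4  # one uncompressed float32 vector

for m in (16, 64):
    sub_dim, code_bytes = pq_layout(DIM, m)
    print(
        f"PQ{m}: {m} sub-vectors of {sub_dim} dims, "
        f"{code_bytes} bytes per vector "
        f"({raw_bytes // code_bytes}x smaller than float32)"
    )
```

Under these assumptions PQ64 quantizes shorter sub-vectors, which is where the extra precision comes from, at the cost of a code four times larger than PQ16.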

Files

specter2_IVF100,PQ16.index: FAISS index file (IVF100, PQ16)
specter2_IVF100,PQ64.index: FAISS index file (IVF100, PQ64)
emb_specter2.memmap: Embedding memmap file (float32, 768 values per row)
patents_all.parquet: Corpus parquet file

Usage

To use these files, download them and load with FAISS:

import faiss
import numpy as np
from huggingface_hub import hf_hub_download

# Download and load index
index_path = hf_hub_download(repo_id="ErzhuoShao/USPTO-Specter2-faiss", filename="specter2_IVF100,PQ16.index")
index = faiss.read_index(index_path)

# Optionally download and load embeddings if needed
emb_path = hf_hub_download(repo_id="ErzhuoShao/USPTO-Specter2-faiss", filename="emb_specter2.memmap")
embeddings = np.memmap(
    emb_path,
    mode="r",
    dtype=np.float32
).reshape(-1, 768)  # Adjust shape as needed

# Download and load the corpus
import pandas as pd
corpus_path = hf_hub_download(repo_id="ErzhuoShao/USPTO-Specter2-faiss", filename="patents_all.parquet")
corpus = pd.read_parquet(corpus_path)

# Example query
from transformers import AutoTokenizer, AutoModel
import torch

# Load the same model used to build the index
tokenizer = AutoTokenizer.from_pretrained("allenai/specter2_base")
model = AutoModel.from_pretrained("allenai/specter2_base")

# Encode a query
query = "Machine learning techniques for computer vision"
inputs = tokenizer(query, return_tensors="pt", max_length=512, padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)
query_vector = outputs.last_hidden_state[:, 0].numpy().astype('float32')

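# Search the index; distances and indices both have shape (n_queries, k)
# IVF indexes probe a single cluster by default; raise index.nprobe for better recall
distances, indices = index.search(query_vector, k=5)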
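
Because the PQ index returns approximate distances, one possible use of the optional embedding memmap is to re-score the returned candidates exactly. The sketch below is a minimal version of that idea, not the repository's own method. It assumes the IDs returned by the index are row positions in emb_specter2.memmap, which is not confirmed here. The function rerank_exact and the toy arrays are hypothetical. It also drops the -1 entries FAISS uses to pad results when fewer than k neighbors are found.

```python
import numpy as np


def rerank_exact(query_vector, candidate_ids, embeddings):
    """Re-score candidates with exact squared L2 distances, closest first.

    query_vector:  array of shape (1, d) or (d,)
    candidate_ids: 1-D array of row ids, e.g. indices[0] from index.search
    embeddings:    array-like of shape (n, d), e.g. the np.memmap loaded above
    """
    query = np.asarray(query_vector, dtype=np.float32).reshape(-1)
    ids = np.asarray(candidate_ids, dtype=np.int64)
    ids = ids[ids >= 0]  # FAISS pads missing results with -1
    vectors = np.asarray(embeddings[ids], dtype=np.float32)
    exact = ((vectors - query) ** 2).sum(axis=1)  # squared L2, as FAISS reports
    order = np.argsort(exact, kind="stable")
    return ids[order], exact[order]


# Toy data standing in for the real memmap and search output (made-up values)
toy_embeddings = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]], dtype=np.float32)
toy_query = np.array([[1.0, 0.0]], dtype=np.float32)
toy_candidates = np.array([2, 0, 1, -1])

ids, dists = rerank_exact(toy_query, toy_candidates, toy_embeddings)
print(ids, dists)
```

With the real data the call would look like rerank_exact(query_vector, indices[0], embeddings). If the same row-order assumption holds for the parquet file, corpus.iloc[ids] would then return the matching patent records.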

For more details, refer to the original repository.
