Instructions for using efficiencyx/Jun-Lora-v2-GGUF with libraries and local apps.

Libraries

How to use efficiencyx/Jun-Lora-v2-GGUF with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="efficiencyx/Jun-Lora-v2-GGUF")
# The text-generation pipeline takes chat messages with plain-string content
messages = [
    {"role": "user", "content": "What is the capital of France?"},
]
pipe(messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("efficiencyx/Jun-Lora-v2-GGUF")
model = AutoModelForMultimodalLM.from_pretrained("efficiencyx/Jun-Lora-v2-GGUF")

llama-cpp-python

How to use efficiencyx/Jun-Lora-v2-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="efficiencyx/Jun-Lora-v2-GGUF",
	filename="Jun-Lora-v2-SAFETENSOR.Q4_K_M.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

Local Apps

llama.cpp

How to use efficiencyx/Jun-Lora-v2-GGUF with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf efficiencyx/Jun-Lora-v2-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf efficiencyx/Jun-Lora-v2-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf efficiencyx/Jun-Lora-v2-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf efficiencyx/Jun-Lora-v2-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf efficiencyx/Jun-Lora-v2-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf efficiencyx/Jun-Lora-v2-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf efficiencyx/Jun-Lora-v2-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf efficiencyx/Jun-Lora-v2-GGUF:Q4_K_M

Use Docker

docker model run hf.co/efficiencyx/Jun-Lora-v2-GGUF:Q4_K_M

vLLM

How to use efficiencyx/Jun-Lora-v2-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "efficiencyx/Jun-Lora-v2-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "efficiencyx/Jun-Lora-v2-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/efficiencyx/Jun-Lora-v2-GGUF:Q4_K_M

SGLang

How to use efficiencyx/Jun-Lora-v2-GGUF with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "efficiencyx/Jun-Lora-v2-GGUF" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "efficiencyx/Jun-Lora-v2-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "efficiencyx/Jun-Lora-v2-GGUF" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "efficiencyx/Jun-Lora-v2-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Ollama

How to use efficiencyx/Jun-Lora-v2-GGUF with Ollama:

```
ollama run hf.co/efficiencyx/Jun-Lora-v2-GGUF:Q4_K_M
```

Unsloth Studio

How to use efficiencyx/Jun-Lora-v2-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for efficiencyx/Jun-Lora-v2-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for efficiencyx/Jun-Lora-v2-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for efficiencyx/Jun-Lora-v2-GGUF to start chatting

Pi

How to use efficiencyx/Jun-Lora-v2-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf efficiencyx/Jun-Lora-v2-GGUF:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "efficiencyx/Jun-Lora-v2-GGUF:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use efficiencyx/Jun-Lora-v2-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf efficiencyx/Jun-Lora-v2-GGUF:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default efficiencyx/Jun-Lora-v2-GGUF:Q4_K_M

Run Hermes

hermes

Docker Model Runner

How to use efficiencyx/Jun-Lora-v2-GGUF with Docker Model Runner:

```
docker model run hf.co/efficiencyx/Jun-Lora-v2-GGUF:Q4_K_M
```

Lemonade

How to use efficiencyx/Jun-Lora-v2-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull efficiencyx/Jun-Lora-v2-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.Jun-Lora-v2-GGUF-Q4_K_M

List all available models

lemonade list

Jun-Lora-v2

A LoRA fine-tune of Gemma 4 12B trained on synthetic multi-turn conversational data based on the visual novel My Dystopian Robot Girlfriend. The model captures the personality, speech patterns, and emotional nuance of the character Jun while preserving the base model's general reasoning and instruction-following capabilities.

Model Variants & Repositories

Repository	Format	Description
`efficiencyx/Jun-Lora-v2-SAFETENSOR`	SafeTensors FP16	Full-precision merged model
`efficiencyx/Jun-Lora-v2-GGUF`	GGUF Q8_0 / Q6_K / Q4_K_M	Quantized versions for local inference
`efficiencyx/Jun-Lora-v2`	LoRA Adapter	Raw adapters at checkpoints 138, 120, 90

Quantization Guide

Quant	Size (approx.)	Use Case
Q8_0	~12.8 GB	Best quality, suggested ~16 GB VRAM
Q6_K	~10.4 GB	Recommended balance of quality and performance
Q4_K_M	~7.6 GB	Fits on 8 GB VRAM GPUs with acceptable quality loss
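
The file sizes in the table can be sanity-checked by converting them to an average number of bits per parameter. The sketch below is only a rough estimate: it assumes exactly 12e9 parameters and 1 GB = 1e9 bytes, neither of which is stated in the card.

```python
# Approximate file sizes (GB) copied from the Quantization Guide table.
QUANT_SIZES_GB = {"Q8_0": 12.8, "Q6_K": 10.4, "Q4_K_M": 7.6}

# Assumption: exactly 12e9 parameters and 1 GB = 1e9 bytes.
N_PARAMS = 12e9


def bits_per_weight(size_gb: float, n_params: float = N_PARAMS) -> float:
    """Return the average number of bits stored per parameter."""
    return size_gb * 1e9 * 8 / n_params


for name, size_gb in QUANT_SIZES_GB.items():
    print(f"{name}: {bits_per_weight(size_gb):.2f} bits/weight")
```

The estimates land somewhat above the nominal 8, 6, and 4 bits, which is expected because quantized blocks also store scale factors; the exact figures depend on the true parameter count. Runtime memory is also higher than file size once the KV cache is allocated, which is likely why more VRAM than the file size is suggested.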

Intended Use

This model is designed as the conversational backend for Jun OS, an AI companion webapp. It is intended for:

Character-consistent multi-turn conversation in ChatML format
AI companion / interactive fiction applications
Research into character-faithful fine-tuning on small, high-quality datasets
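
As an illustration of the ChatML format mentioned above, here is a minimal sketch of rendering a message list into a prompt string. It assumes the conventional ChatML delimiters (`<|im_start|>` and `<|im_end|>`); the chat template actually embedded in the GGUF files may differ. The helper name `to_chatml` and the sample system prompt are hypothetical.

```python
from typing import Dict, List


def to_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render role/content messages as a ChatML prompt string."""
    parts = []
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


# Hypothetical conversation; the system prompt is illustrative only.
conversation = [
    {"role": "system", "content": "You are Jun."},
    {"role": "user", "content": "Good morning!"},
]
prompt = to_chatml(conversation)
print(prompt)
```

Manual formatting like this is mainly useful for raw completion endpoints. Chat-style APIs such as `create_chat_completion` in llama-cpp-python usually apply the model's own template for you.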

Limitations

The model is specialized for a single character persona; it is not a general-purpose assistant.
Outputs may reflect fictional narrative tropes and should not be treated as factual information or advice.
Performance degrades on tasks far outside the training distribution (e.g. code generation, structured data extraction).
The model inherits any biases present in the Gemma 4 12B base weights.

Training Details

Dataset

Property	Value
Source	My Dystopian Robot Girlfriend (visual novel dialogue)
Composition	~1:1 replica of original game tone and cadence
Size	2,302 multi-turn conversations
Format	ChatML

The dataset was constructed to preserve the character's tone, vocabulary, emotional range, and conversational patterns across a variety of in-game scenarios. Multi-turn structure ensures the model learns contextual consistency over extended exchanges.

Hyperparameters

Parameter	Value
Base model	`google/gemma-4-12b-it`
Method	LoRA
LoRA rank	64
LoRA alpha	128
Learning rate	2e-5
Batch size	8
Gradient accumulation steps	4
Effective batch size	32
Epochs	2
Total steps	138
Checkpoint interval	Every 30 steps
Optimizer	AdamW (8-bit)
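
The figures in this table can be cross-checked with a little arithmetic. The sketch below derives the effective batch size, the LoRA scaling factor (assuming the standard alpha / rank scaling), and the checkpoint steps. It also shows that 138 total steps is consistent with roughly 5% of the conversations being held out for evaluation. That split is an assumption, not something the card states, and exact step rounding depends on the trainer.

```python
import math

# Values copied from the Dataset and Hyperparameters tables.
NUM_CONVERSATIONS = 2302
BATCH_SIZE = 8
GRAD_ACCUM_STEPS = 4
EPOCHS = 2
LORA_RANK = 64
LORA_ALPHA = 128
TOTAL_STEPS = 138
CHECKPOINT_INTERVAL = 30


def optimizer_steps(num_examples: int, effective_batch: int, epochs: int) -> int:
    """Optimizer steps when the last partial batch of each epoch still counts."""
    return math.ceil(num_examples / effective_batch) * epochs


effective_batch = BATCH_SIZE * GRAD_ACCUM_STEPS  # 8 * 4 = 32
lora_scale = LORA_ALPHA / LORA_RANK              # assumes standard alpha / rank scaling

# Training on every conversation would give more steps than the reported 138.
steps_full = optimizer_steps(NUM_CONVERSATIONS, effective_batch, EPOCHS)

# Assumption (not stated in the card): about 5% of the data is held out for eval.
train_examples = round(NUM_CONVERSATIONS * 0.95)
steps_with_holdout = optimizer_steps(train_examples, effective_batch, EPOCHS)

# Checkpoints every 30 steps, plus the final step.
checkpoints = list(range(CHECKPOINT_INTERVAL, TOTAL_STEPS + 1, CHECKPOINT_INTERVAL)) + [TOTAL_STEPS]

print(effective_batch, lora_scale, steps_full, steps_with_holdout, checkpoints)
```

Under these assumptions the saved checkpoints fall at steps 30, 60, 90, 120, and 138, of which the last three are the ones published in the adapter repository.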

Infrastructure

Component	Detail
Training GPU	NVIDIA A100 80GB SXM4
Fine-tuning framework	Unsloth
GGUF export pipeline	llama.cpp

Evaluation

Quantitative

Metric	Value
Final training loss	~1.21
Final eval loss	~1.24

The gap between training and eval loss is narrow (about 0.03), which suggests the model generalizes to held-out conversations without significant overfitting, despite the relatively small dataset.
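
To put the two loss values in perspective, the following sketch computes the absolute and relative gap and converts each loss to perplexity. It assumes the reported losses are mean per-token cross-entropy in nats, which the card does not state explicitly.

```python
import math

# Approximate values copied from the Quantitative table.
train_loss = 1.21
eval_loss = 1.24

gap = eval_loss - train_loss      # absolute gap in nats
relative_gap = gap / train_loss   # gap as a fraction of the training loss

# Assumption: losses are mean per-token cross-entropy in nats,
# so perplexity is exp(loss).
train_ppl = math.exp(train_loss)
eval_ppl = math.exp(eval_loss)

print(f"gap={gap:.2f} ({relative_gap:.1%}), "
      f"train ppl={train_ppl:.2f}, eval ppl={eval_ppl:.2f}")
```

A relative gap of a few percent is small, though with rounded loss values this remains a coarse check rather than a measure of character fidelity.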

Qualitative

Character consistency: The model maintains Jun's personality, speech patterns, and emotional responses across varied conversational contexts.
Reasoning preservation: General reasoning capabilities from the Gemma 4 12B base remain intact; the model can engage in logical discussion while staying in character.
Generalization: The model handles novel conversational scenarios not present in the training set while preserving character-faithful responses.

Checkpoint Selection

Multiple adapter checkpoints are provided (steps 90, 120, 138) to allow users to select the best trade-off between character adherence and generalization for their use case. Earlier checkpoints may exhibit slightly more creative freedom, while the final checkpoint (138) has the strongest character lock-in.

Acknowledgments

Incontinent Cell for My Dystopian Robot Girlfriend and the character Jun
Google for the Gemma 4 model family
Google Colaboratory for easy, affordable access to powerful GPUs
Unsloth for the efficient fine-tuning framework

GGUF

Property	Value
Model size	12B params
Architecture	gemma4
Hardware compatibility	4-bit, 6-bit, 8-bit