|
|
--- |
|
|
license: gemma |
|
|
language: |
|
|
- en |
|
|
pipeline_tag: text-generation |
|
|
tags: |
|
|
- litert |
|
|
- litert-lm |
|
|
- gemma |
|
|
- agent |
|
|
- tool-calling |
|
|
- function-calling |
|
|
- multimodal |
|
|
- on-device |
|
|
library_name: litert-lm |
|
|
--- |
|
|
|
|
|
# Agent Gemma 3n E2B - Tool Calling Edition |
|
|
|
|
|
A specialized version of **Gemma 3n E2B** optimized for **on-device tool/function calling** with LiteRT-LM. While Google's standard LiteRT-LM models focus on general text generation, this model is specifically designed for agentic workflows with advanced tool calling capabilities. |
|
|
|
|
|
## Why This Model? |
|
|
|
|
|
Google's official LiteRT-LM releases provide excellent on-device inference but don't include built-in tool calling support. This model bridges that gap with:
|
|
|
|
|
- ✅ **Native tool/function calling** via Jinja templates
- ✅ **Multimodal support** (text, vision, audio)
- ✅ **On-device optimized** - No cloud API required
- ✅ **INT4 quantized** - Efficient memory usage
- ✅ **Production ready** - Tested and validated
|
|
|
|
|
Perfect for building AI agents that need to interact with external tools, APIs, or functions while running completely on-device. |
|
|
|
|
|
## Model Details |
|
|
|
|
|
- **Base Model**: Gemma 3n E2B |
|
|
- **Format**: LiteRT-LM v1.4.0 |
|
|
- **Quantization**: INT4 |
|
|
- **Size**: ~3.2GB |
|
|
- **Tokenizer**: SentencePiece |
|
|
- **Capabilities**: |
|
|
- Advanced tool/function calling |
|
|
- Multi-turn conversations with tool interactions |
|
|
- Vision processing (images) |
|
|
- Audio processing |
|
|
- Streaming responses |
|
|
|
|
|
## Tool Calling Example |
|
|
|
|
|
The model uses a sophisticated Jinja template that supports OpenAI-style function calling: |
|
|
|
|
|
```python |
|
|
from litert_lm import Engine, Conversation |
|
|
|
|
|
# Load the model |
|
|
engine = Engine.create("gemma-3n-E2B-it-agent-fixed.litertlm", backend="cpu") |
|
|
conversation = Conversation.create(engine) |
|
|
|
|
|
# Define tools the model can use |
|
|
tools = [ |
|
|
{ |
|
|
"name": "get_weather", |
|
|
"description": "Get current weather for a location", |
|
|
"parameters": { |
|
|
"type": "object", |
|
|
"properties": { |
|
|
"location": {"type": "string", "description": "City name"}, |
|
|
"unit": {"type": "string", "enum": ["celsius", "fahrenheit"]} |
|
|
}, |
|
|
"required": ["location"] |
|
|
} |
|
|
}, |
|
|
{ |
|
|
"name": "search_web", |
|
|
"description": "Search the internet for information", |
|
|
"parameters": { |
|
|
"type": "object", |
|
|
"properties": { |
|
|
"query": {"type": "string", "description": "Search query"} |
|
|
}, |
|
|
"required": ["query"] |
|
|
} |
|
|
} |
|
|
] |
|
|
|
|
|
# Have a conversation with tool calling |
|
|
message = { |
|
|
"role": "user", |
|
|
"content": "What's the weather in San Francisco and latest news about AI?" |
|
|
} |
|
|
|
|
|
response = conversation.send_message(message, tools=tools) |
|
|
print(response) |
|
|
``` |
|
|
|
|
|
### Example Output |
|
|
|
|
|
The model will generate structured tool calls, then open a function response turn and wait for your results:
|
|
|
|
|
``` |
|
|
<start_function_call>call:get_weather{location:San Francisco,unit:celsius}<end_function_call> |
|
|
<start_function_call>call:search_web{query:latest AI news}<end_function_call> |
|
|
<start_function_response> |
|
|
``` |
|
|
|
|
|
You then execute the functions and send back results: |
|
|
|
|
|
```python |
|
|
# Execute tools (your implementation) |
|
|
weather = get_weather("San Francisco", "celsius") |
|
|
news = search_web("latest AI news") |
|
|
|
|
|
# Send tool responses back |
|
|
tool_response = { |
|
|
"role": "tool", |
|
|
"content": [ |
|
|
{ |
|
|
"name": "get_weather", |
|
|
"response": {"temperature": 18, "condition": "partly cloudy"} |
|
|
}, |
|
|
{ |
|
|
"name": "search_web", |
|
|
"response": {"results": ["OpenAI releases GPT-5...", "..."]} |
|
|
} |
|
|
] |
|
|
} |
|
|
|
|
|
final_response = conversation.send_message(tool_response) |
|
|
print(final_response) |
|
|
# "The weather in San Francisco is 18Β°C and partly cloudy. |
|
|
# In AI news, OpenAI has released GPT-5..." |
|
|
``` |
|
|
|
|
|
## Advanced Features |
|
|
|
|
|
### Multi-Modal Tool Calling |
|
|
|
|
|
Combine vision, audio, and tool calling: |
|
|
|
|
|
```python |
|
|
message = { |
|
|
"role": "user", |
|
|
"content": [ |
|
|
{"type": "image", "data": image_bytes}, |
|
|
{"type": "text", "text": "What's in this image? Search for more info about it."} |
|
|
] |
|
|
} |
|
|
|
|
|
response = conversation.send_message(message, tools=[search_tool]) |
|
|
# Model can see the image AND call search functions |
|
|
``` |
|
|
|
|
|
### Streaming Tool Calls |
|
|
|
|
|
Get tool calls as they're generated: |
|
|
|
|
|
```python |
|
|
def on_token(token):
    # Note: a marker can be split across tokens; see Limitations for a buffering sketch
    if "<start_function_call>" in token:
|
|
print("Tool being called...") |
|
|
print(token, end="", flush=True) |
|
|
|
|
|
conversation.send_message_async(message, tools=tools, callback=on_token) |
|
|
``` |
|
|
|
|
|
### Nested Tool Execution |
|
|
|
|
|
The model can chain tool calls: |
|
|
|
|
|
```python |
|
|
# User: "Book me a flight to Tokyo and reserve a hotel" |
|
|
# Model: calls check_flights() β calls book_hotel() β confirms both |
|
|
``` |
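
A minimal sketch of such a chain, reusing `conversation` and `tools` from the tool calling example above; `has_tool_calls`, `extract_calls`, and `execute_tools` are placeholder helpers you implement yourself (the same ones used in the Python API example below):

```python
# Keep resolving tool calls until the model returns a plain-text answer.
response = conversation.send_message(
    {"role": "user", "content": "Book me a flight to Tokyo and reserve a hotel"},
    tools=tools,
)

while has_tool_calls(response):
    # e.g. the first iteration runs check_flights(), the next books the hotel
    results = execute_tools(extract_calls(response))
    response = conversation.send_message({"role": "tool", "content": results})

print(response)  # confirmation of both bookings
```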
|
|
|
|
|
## Performance |
|
|
|
|
|
Benchmarked on CPU (no GPU acceleration): |
|
|
|
|
|
- **Prefill Speed**: 21.20 tokens/sec |
|
|
- **Decode Speed**: 11.44 tokens/sec |
|
|
- **Time to First Token**: ~1.6s |
|
|
- **Cold Start**: ~4.7s |
|
|
- **Tool Call Latency**: ~100-200ms additional |
|
|
|
|
|
GPU acceleration provides 3-5x speedup on supported hardware. |
|
|
|
|
|
## Installation & Usage |
|
|
|
|
|
### Requirements |
|
|
|
|
|
1. **LiteRT-LM Runtime** - Build from source: |
|
|
```bash |
|
|
git clone https://github.com/google-ai-edge/LiteRT.git |
|
|
cd LiteRT/LiteRT-LM |
|
|
bazel build -c opt //runtime/engine:litert_lm_main |
|
|
``` |
|
|
|
|
|
2. **Supported Platforms**: Linux (clang), macOS, Android |
|
|
|
|
|
### Quick Start |
|
|
|
|
|
```bash |
|
|
# Download model |
|
|
wget https://huggingface.co/kontextdev/agent-gemma/resolve/main/gemma-3n-E2B-it-agent-fixed.litertlm |
|
|
|
|
|
# Run with simple prompt |
|
|
./bazel-bin/runtime/engine/litert_lm_main \ |
|
|
--model_path=gemma-3n-E2B-it-agent-fixed.litertlm \ |
|
|
--backend=cpu \ |
|
|
--input_prompt="Hello, I need help with some tasks" |
|
|
|
|
|
# Run with GPU (if available) |
|
|
./bazel-bin/runtime/engine/litert_lm_main \ |
|
|
--model_path=gemma-3n-E2B-it-agent-fixed.litertlm \ |
|
|
--backend=gpu \ |
|
|
--input_prompt="What can you help me with?" |
|
|
``` |
|
|
|
|
|
### Python API (Recommended) |
|
|
|
|
|
```python |
|
|
from litert_lm import Engine, Conversation, SessionConfig |
|
|
|
|
|
# Initialize |
|
|
engine = Engine.create("gemma-3n-E2B-it-agent-fixed.litertlm", backend="gpu") |
|
|
|
|
|
# Configure session |
|
|
config = SessionConfig( |
|
|
max_tokens=2048, |
|
|
temperature=0.7, |
|
|
top_p=0.9 |
|
|
) |
|
|
|
|
|
# Start conversation |
|
|
conversation = Conversation.create(engine, config) |
|
|
|
|
|
# Define your tools |
|
|
tools = [...] # Your function definitions |
|
|
|
|
|
# Chat with tool calling |
|
|
while True: |
|
|
user_input = input("You: ") |
|
|
response = conversation.send_message( |
|
|
{"role": "user", "content": user_input}, |
|
|
tools=tools |
|
|
) |
|
|
|
|
|
    # Handle tool calls if present (has_tool_calls, extract_calls, and execute_tools are your own helpers)
|
|
if has_tool_calls(response): |
|
|
results = execute_tools(extract_calls(response)) |
|
|
response = conversation.send_message({ |
|
|
"role": "tool", |
|
|
"content": results |
|
|
}) |
|
|
|
|
|
print(f"Agent: {response['content']}") |
|
|
``` |
|
|
|
|
|
## Tool Call Format |
|
|
|
|
|
The model uses this format for tool interactions: |
|
|
|
|
|
**Function Declaration** (system/developer role): |
|
|
``` |
|
|
<start_of_turn>developer |
|
|
<start_function_declaration> |
|
|
{ |
|
|
"name": "function_name", |
|
|
"description": "What it does", |
|
|
"parameters": {...} |
|
|
} |
|
|
<end_function_declaration> |
|
|
<end_of_turn> |
|
|
``` |
|
|
|
|
|
**Function Call** (assistant): |
|
|
``` |
|
|
<start_function_call>call:function_name{arg1:value1,arg2:value2}<end_function_call> |
|
|
``` |
|
|
|
|
|
**Function Response** (tool role): |
|
|
``` |
|
|
<start_function_response>response:function_name{result:value}<end_function_response> |
|
|
``` |
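
If you consume the raw model output yourself, a small regex extractor is enough for this flat `name{key:value,...}` syntax. A minimal sketch, assuming argument values contain no commas or nested braces (`parse_tool_calls` is a hypothetical helper, not part of LiteRT-LM):

```python
import re

# Matches <start_function_call>call:name{args}<end_function_call>
CALL_RE = re.compile(r"<start_function_call>call:(\w+)\{(.*?)\}<end_function_call>")

def parse_tool_calls(text: str) -> list[dict]:
    calls = []
    for name, arg_str in CALL_RE.findall(text):
        args = {}
        for pair in filter(None, arg_str.split(",")):
            key, _, value = pair.partition(":")
            args[key.strip()] = value.strip()
        calls.append({"name": name, "arguments": args})
    return calls

# With the example output shown earlier:
# parse_tool_calls("<start_function_call>call:get_weather{location:San Francisco,unit:celsius}<end_function_call>")
# -> [{"name": "get_weather", "arguments": {"location": "San Francisco", "unit": "celsius"}}]
```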
|
|
|
|
|
## Use Cases |
|
|
|
|
|
### Personal AI Assistant |
|
|
- Calendar management |
|
|
- Email sending |
|
|
- Web searching |
|
|
- File operations |
|
|
|
|
|
### IoT & Smart Home |
|
|
- Device control |
|
|
- Sensor monitoring |
|
|
- Automation workflows |
|
|
- Voice commands |
|
|
|
|
|
### Development Tools |
|
|
- Code generation with API calls |
|
|
- Database queries |
|
|
- Deployment automation |
|
|
- Testing & debugging |
|
|
|
|
|
### Business Applications |
|
|
- CRM integration |
|
|
- Data analysis |
|
|
- Report generation |
|
|
- Customer support |
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
Built on Gemma 3n E2B with 9 optimized components: |
|
|
|
|
|
``` |
|
|
Section 0: LlmMetadata (Agent Jinja template) |
|
|
Section 1: SentencePiece Tokenizer |
|
|
Section 2: TFLite Embedder |
|
|
Section 3: TFLite Per-Layer Embedder |
|
|
Section 4: TFLite Audio Encoder (HW accelerated) |
|
|
Section 5: TFLite End-of-Audio Detector |
|
|
Section 6: TFLite Vision Adapter |
|
|
Section 7: TFLite Vision Encoder |
|
|
Section 8: TFLite Prefill/Decode (INT4) |
|
|
``` |
|
|
|
|
|
All components are optimized for on-device inference with hardware acceleration support. |
|
|
|
|
|
## Comparison |
|
|
|
|
|
| Feature | Standard Gemma LiteRT-LM | This Model |
|---------|-------------------------|------------|
| Text Generation | ✅ | ✅ |
| Tool Calling | ❌ | ✅ |
| Multimodal | ✅ | ✅ |
| Streaming | ✅ | ✅ |
| On-Device | ✅ | ✅ |
| Jinja Templates | Basic | Advanced Agent Template |
| INT4 Quantization | ✅ | ✅ |
|
|
|
|
|
## Limitations |
|
|
|
|
|
- **Tool Execution**: The model generates tool calls but does not execute them; you must implement the functions yourself
|
|
- **Context Window**: Limited to 4096 tokens (configurable) |
|
|
- **Streaming Tool Calls**: Partial tool calls may need buffering (see the sketch after this list)
|
|
- **Hardware Requirements**: Minimum 4GB RAM recommended |
|
|
- **GPU Acceleration**: Requires supported hardware; systems without it fall back to CPU inference
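
For the streaming limitation above, one approach is to accumulate tokens until a complete marker pair has arrived. A minimal sketch (`ToolCallBuffer` is a hypothetical helper, not part of LiteRT-LM):

```python
class ToolCallBuffer:
    """Accumulates streamed tokens and yields complete tool-call spans."""

    START = "<start_function_call>"
    END = "<end_function_call>"

    def __init__(self):
        self._buffer = ""

    def feed(self, token: str) -> list[str]:
        self._buffer += token
        completed = []
        while True:
            start = self._buffer.find(self.START)
            if start == -1:
                break
            end = self._buffer.find(self.END, start)
            if end == -1:
                break
            end += len(self.END)
            completed.append(self._buffer[start:end])
            self._buffer = self._buffer[end:]
        return completed
```

Call `feed(token)` inside the streaming callback and execute any complete spans it returns.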
|
|
|
|
|
## Tips for Best Results |
|
|
|
|
|
1. **Clear Tool Descriptions**: Provide detailed function descriptions |
|
|
2. **Schema Validation**: Validate tool call arguments against their JSON Schema before execution (a sketch follows this list)
|
|
3. **Error Handling**: Handle malformed tool calls gracefully |
|
|
4. **Context Management**: Keep conversation history concise |
|
|
5. **Temperature**: Use 0.7-0.9 for creative tasks, 0.3-0.5 for precise tool calls |
|
|
6. **Batching**: Process multiple tool calls in parallel when possible |
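
A minimal sketch of tip 2, using the third-party `jsonschema` package (an assumption, not a LiteRT-LM dependency) together with the tool definitions and `parse_tool_calls` output from the examples above:

```python
from jsonschema import ValidationError, validate  # pip install jsonschema

def validate_call(call: dict, tools: list[dict]) -> bool:
    """Check a parsed tool call against its declared parameter schema."""
    schema = next(
        (t["parameters"] for t in tools if t["name"] == call["name"]), None
    )
    if schema is None:
        print(f"Unknown tool: {call['name']}")
        return False
    try:
        validate(instance=call["arguments"], schema=schema)
        return True
    except ValidationError as err:
        print(f"Rejected {call['name']}: {err.message}")
        return False
```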
|
|
|
|
|
## License |
|
|
|
|
|
This model inherits the [Gemma license](https://ai.google.dev/gemma/terms) from the base model. |
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex |
|
|
@misc{agent-gemma-litertlm, |
|
|
title={Agent Gemma 3n E2B - Tool Calling Edition}, |
|
|
author={kontextdev}, |
|
|
year={2025}, |
|
|
publisher={HuggingFace}, |
|
|
howpublished={\url{https://huggingface.co/kontextdev/agent-gemma}} |
|
|
} |
|
|
``` |
|
|
|
|
|
## Links |
|
|
|
|
|
- [LiteRT-LM GitHub](https://github.com/google-ai-edge/LiteRT/tree/main/LiteRT-LM) |
|
|
- [Gemma Model Family](https://ai.google.dev/gemma) |
|
|
- [LiteRT Documentation](https://ai.google.dev/edge/litert) |
|
|
- [Tool Calling Guide](https://ai.google.dev/gemma/docs/function-calling) |
|
|
|
|
|
## Support |
|
|
|
|
|
For issues or questions: |
|
|
- Open an issue on [GitHub](https://github.com/google-ai-edge/LiteRT/issues) |
|
|
- Check the [LiteRT-LM docs](https://ai.google.dev/edge/litert/inference) |
|
|
- Community forum: [Google AI Edge](https://discuss.ai.google.dev/) |
|
|
|
|
|
--- |
|
|
|
|
|
Built with ❤️ for the on-device AI community
|
|
|