---
license: gemma
language:
- en
pipeline_tag: text-generation
tags:
- litert
- litert-lm
- gemma
- agent
- tool-calling
- function-calling
- multimodal
- on-device
library_name: litert-lm
---
# Agent Gemma 3n E2B - Tool Calling Edition
A specialized version of **Gemma 3n E2B** optimized for **on-device tool/function calling** with LiteRT-LM. While Google's standard LiteRT-LM models focus on general text generation, this model is specifically designed for agentic workflows with advanced tool calling capabilities.
## Why This Model?
Google's official LiteRT-LM releases provide excellent on-device inference but don't include built-in tool calling support. This model bridges that gap by providing:
- ✅ **Native tool/function calling** via Jinja templates
- ✅ **Multimodal support** (text, vision, audio)
- ✅ **On-device optimized** - No cloud API required
- ✅ **INT4 quantized** - Efficient memory usage
- ✅ **Production ready** - Tested and validated
Perfect for building AI agents that need to interact with external tools, APIs, or functions while running completely on-device.
## Model Details
- **Base Model**: Gemma 3n E2B
- **Format**: LiteRT-LM v1.4.0
- **Quantization**: INT4
- **Size**: ~3.2GB
- **Tokenizer**: SentencePiece
- **Capabilities**:
- Advanced tool/function calling
- Multi-turn conversations with tool interactions
- Vision processing (images)
- Audio processing
- Streaming responses
## Tool Calling Example
The model uses a sophisticated Jinja template that supports OpenAI-style function calling:
```python
from litert_lm import Engine, Conversation
# Load the model
engine = Engine.create("gemma-3n-E2B-it-agent-fixed.litertlm", backend="cpu")
conversation = Conversation.create(engine)
# Define tools the model can use
tools = [
    {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location"]
        }
    },
    {
        "name": "search_web",
        "description": "Search the internet for information",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"}
            },
            "required": ["query"]
        }
    }
]
# Have a conversation with tool calling
message = {
    "role": "user",
    "content": "What's the weather in San Francisco and latest news about AI?"
}
response = conversation.send_message(message, tools=tools)
print(response)
```
### Example Output
The model will generate structured tool calls:
```
<start_function_call>call:get_weather{location:San Francisco,unit:celsius}<end_function_call>
<start_function_call>call:search_web{query:latest AI news}<end_function_call>
<start_function_response>
```
You then execute the functions and send back results:
```python
# Execute tools (your implementation)
weather = get_weather("San Francisco", "celsius")
news = search_web("latest AI news")
# Send tool responses back
tool_response = {
    "role": "tool",
    "content": [
        {
            "name": "get_weather",
            "response": {"temperature": 18, "condition": "partly cloudy"}
        },
        {
            "name": "search_web",
            "response": {"results": ["OpenAI releases GPT-5...", "..."]}
        }
    ]
}
final_response = conversation.send_message(tool_response)
print(final_response)
# "The weather in San Francisco is 18Β°C and partly cloudy.
# In AI news, OpenAI has released GPT-5..."
```
## Advanced Features
### Multi-Modal Tool Calling
Combine vision, audio, and tool calling:
```python
message = {
    "role": "user",
    "content": [
        {"type": "image", "data": image_bytes},
        {"type": "text", "text": "What's in this image? Search for more info about it."}
    ]
}
response = conversation.send_message(message, tools=[search_tool])
# Model can see the image AND call search functions
```
### Streaming Tool Calls
Get tool calls as they're generated:
```python
def on_token(token):
    if "<start_function_call>" in token:
        print("Tool being called...")
    print(token, end="", flush=True)

conversation.send_message_async(message, tools=tools, callback=on_token)
```
### Nested Tool Execution
The model can chain tool calls:
```python
# User: "Book me a flight to Tokyo and reserve a hotel"
# Model: calls check_flights() → calls book_hotel() → confirms both
```
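A minimal driver for this pattern is sketched below. It assumes `response` is the raw string shown in the earlier example output, and it delegates tool execution to a callable you supply; neither `run_until_done` nor `execute` is part of the LiteRT-LM API.
```python
# Hedged sketch of a multi-step driver: keep the conversation going until the
# model answers without requesting another tool. `execute` is your own code
# that parses the call tags and runs the functions.
def run_until_done(conversation, first_message, tools, execute, max_rounds=5):
    response = conversation.send_message(first_message, tools=tools)
    for _ in range(max_rounds):
        if "<start_function_call>" not in str(response):
            return response  # final natural-language answer
        results = execute(response)  # e.g. [{"name": ..., "response": {...}}, ...]
        response = conversation.send_message({"role": "tool", "content": results})
    return response

# Usage (hypothetical tool set and executor):
# final = run_until_done(
#     conversation,
#     {"role": "user", "content": "Book me a flight to Tokyo and reserve a hotel"},
#     tools=travel_tools,
#     execute=my_executor,
# )
```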
## Performance
Benchmarked on CPU (no GPU acceleration):
- **Prefill Speed**: 21.20 tokens/sec
- **Decode Speed**: 11.44 tokens/sec
- **Time to First Token**: ~1.6s
- **Cold Start**: ~4.7s
- **Tool Call Latency**: ~100-200ms additional
GPU acceleration typically provides a 3-5x speedup on supported hardware.
## Installation & Usage
### Requirements
1. **LiteRT-LM Runtime** - Build from source:
```bash
git clone https://github.com/google-ai-edge/LiteRT.git
cd LiteRT/LiteRT-LM
bazel build -c opt //runtime/engine:litert_lm_main
```
2. **Supported Platforms**: Linux (clang), macOS, Android
### Quick Start
```bash
# Download model
wget https://huggingface.co/kontextdev/agent-gemma/resolve/main/gemma-3n-E2B-it-agent-fixed.litertlm
# Run with simple prompt
./bazel-bin/runtime/engine/litert_lm_main \
  --model_path=gemma-3n-E2B-it-agent-fixed.litertlm \
  --backend=cpu \
  --input_prompt="Hello, I need help with some tasks"
# Run with GPU (if available)
./bazel-bin/runtime/engine/litert_lm_main \
  --model_path=gemma-3n-E2B-it-agent-fixed.litertlm \
  --backend=gpu \
  --input_prompt="What can you help me with?"
```
### Python API (Recommended)
```python
from litert_lm import Engine, Conversation, SessionConfig
# Initialize
engine = Engine.create("gemma-3n-E2B-it-agent-fixed.litertlm", backend="gpu")
# Configure session
config = SessionConfig(
    max_tokens=2048,
    temperature=0.7,
    top_p=0.9
)
# Start conversation
conversation = Conversation.create(engine, config)
# Define your tools
tools = [...] # Your function definitions
# Chat with tool calling
while True:
    user_input = input("You: ")
    response = conversation.send_message(
        {"role": "user", "content": user_input},
        tools=tools
    )
    # Handle tool calls if present
    if has_tool_calls(response):
        results = execute_tools(extract_calls(response))
        response = conversation.send_message({
            "role": "tool",
            "content": results
        })
    print(f"Agent: {response['content']}")
```
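`has_tool_calls`, `extract_calls`, and `execute_tools` above are placeholders for your own code. A rough, illustrative implementation based on the call syntax documented in the next section might look like this (argument values containing commas or colons would need a smarter parser):
```python
import re

# Matches <start_function_call>call:name{arg1:value1,arg2:value2}<end_function_call>
CALL_RE = re.compile(
    r"<start_function_call>call:(\w+)\{(.*?)\}<end_function_call>", re.S
)

def has_tool_calls(response) -> bool:
    return bool(CALL_RE.search(str(response)))

def extract_calls(response):
    calls = []
    for name, raw_args in CALL_RE.findall(str(response)):
        args = dict(
            part.split(":", 1) for part in raw_args.split(",") if ":" in part
        )
        calls.append({"name": name, "args": args})
    return calls

def execute_tools(calls, registry=None):
    # registry maps tool names to plain Python callables,
    # e.g. {"get_weather": get_weather, "search_web": search_web}
    registry = registry or {}
    return [
        {"name": c["name"], "response": registry[c["name"]](**c["args"])}
        for c in calls
    ]
```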
## Tool Call Format
The model uses this format for tool interactions:
**Function Declaration** (system/developer role):
```
<start_of_turn>developer
<start_function_declaration>
{
  "name": "function_name",
  "description": "What it does",
  "parameters": {...}
}
<end_function_declaration>
<end_of_turn>
```
**Function Call** (assistant):
```
<start_function_call>call:function_name{arg1:value1,arg2:value2}<end_function_call>
```
**Function Response** (tool role):
```
<start_function_response>response:function_name{result:value}<end_function_response>
```
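In normal use the model's bundled Jinja template renders your tool definitions into the declaration block for you when you pass `tools=`. The sketch below is only illustrative; it shows how one tool definition from the Python examples maps onto that declaration format.
```python
import json

def render_declaration(tool: dict) -> str:
    """Illustrative only: format one tool definition as a developer-turn
    declaration. The bundled Jinja template does this automatically."""
    return (
        "<start_of_turn>developer\n"
        "<start_function_declaration>\n"
        f"{json.dumps(tool, indent=2)}\n"
        "<end_function_declaration>\n"
        "<end_of_turn>"
    )

print(render_declaration({
    "name": "get_weather",
    "description": "Get current weather for a location",
    "parameters": {
        "type": "object",
        "properties": {"location": {"type": "string", "description": "City name"}},
        "required": ["location"],
    },
}))
```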
## Use Cases
### Personal AI Assistant
- Calendar management
- Email sending
- Web searching
- File operations
### IoT & Smart Home
- Device control
- Sensor monitoring
- Automation workflows
- Voice commands
### Development Tools
- Code generation with API calls
- Database queries
- Deployment automation
- Testing & debugging
### Business Applications
- CRM integration
- Data analysis
- Report generation
- Customer support
## Model Architecture
Built on Gemma 3n E2B with 9 optimized components:
```
Section 0: LlmMetadata (Agent Jinja template)
Section 1: SentencePiece Tokenizer
Section 2: TFLite Embedder
Section 3: TFLite Per-Layer Embedder
Section 4: TFLite Audio Encoder (HW accelerated)
Section 5: TFLite End-of-Audio Detector
Section 6: TFLite Vision Adapter
Section 7: TFLite Vision Encoder
Section 8: TFLite Prefill/Decode (INT4)
```
All components are optimized for on-device inference with hardware acceleration support.
## Comparison
| Feature | Standard Gemma LiteRT-LM | This Model |
|---------|-------------------------|------------|
| Text Generation | ✅ | ✅ |
| Tool Calling | ❌ | ✅ |
| Multimodal | ✅ | ✅ |
| Streaming | ✅ | ✅ |
| On-Device | ✅ | ✅ |
| Jinja Templates | Basic | Advanced Agent Template |
| INT4 Quantization | ✅ | ✅ |
## Limitations
- **Tool Execution**: The model generates tool calls but doesn't execute them - you need to implement the actual functions
- **Context Window**: Limited to 4096 tokens (configurable)
- **Streaming Tool Calls**: Partial tool calls may need buffering (see the sketch after this list)
- **Hardware Requirements**: Minimum 4GB RAM recommended
- **CPU-only systems**: Without a supported GPU, inference falls back to the CPU backend
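For the streaming limitation above, a minimal buffering callback (assuming the token-callback signature from the streaming example earlier) could look like the sketch below; `handle_complete_call` is a placeholder hook you would replace with your own parser and executor.
```python
# Hedged sketch: buffer tokens between the call markers so only complete tool
# calls reach your parser; everything outside the markers streams to the user.
buffer, in_call = [], False

def handle_complete_call(call_text):
    # Placeholder: parse and execute the completed call here.
    print("\n[tool call]", call_text)

def on_token(token):
    global in_call
    if "<start_function_call>" in token:
        in_call = True
    if in_call:
        buffer.append(token)
        if "<end_function_call>" in token:
            handle_complete_call("".join(buffer))
            buffer.clear()
            in_call = False
    else:
        print(token, end="", flush=True)
```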
## Tips for Best Results
1. **Clear Tool Descriptions**: Provide detailed function descriptions
2. **Schema Validation**: Validate tool call arguments against their JSON schema before execution (see the sketch after this list)
3. **Error Handling**: Handle malformed tool calls gracefully
4. **Context Management**: Keep conversation history concise
5. **Temperature**: Use 0.7-0.9 for creative tasks, 0.3-0.5 for precise tool calls
6. **Batching**: Process multiple tool calls in parallel when possible
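For tip 2, a small validation wrapper built on the third-party `jsonschema` package (an assumption, not something bundled with LiteRT-LM) might look like this:
```python
from jsonschema import ValidationError, validate

def safe_execute(call, tools, registry):
    """Validate a parsed call against its tool's JSON schema before running it.
    `call` is {"name": ..., "args": {...}}; `registry` maps names to callables."""
    schema = next(t["parameters"] for t in tools if t["name"] == call["name"])
    try:
        validate(instance=call["args"], schema=schema)
    except ValidationError as err:
        return {"name": call["name"], "response": {"error": str(err)}}
    return {"name": call["name"], "response": registry[call["name"]](**call["args"])}
```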
## License
This model inherits the [Gemma license](https://ai.google.dev/gemma/terms) from the base model.
## Citation
```bibtex
@misc{agent-gemma-litertlm,
title={Agent Gemma 3n E2B - Tool Calling Edition},
author={kontextdev},
year={2025},
publisher={HuggingFace},
howpublished={\url{https://huggingface.co/kontextdev/agent-gemma}}
}
```
## Links
- [LiteRT-LM GitHub](https://github.com/google-ai-edge/LiteRT/tree/main/LiteRT-LM)
- [Gemma Model Family](https://ai.google.dev/gemma)
- [LiteRT Documentation](https://ai.google.dev/edge/litert)
- [Tool Calling Guide](https://ai.google.dev/gemma/docs/function-calling)
## Support
For issues or questions:
- Open an issue on [GitHub](https://github.com/google-ai-edge/LiteRT/issues)
- Check the [LiteRT-LM docs](https://ai.google.dev/edge/litert/inference)
- Community forum: [Google AI Edge](https://discuss.ai.google.dev/)
---
Built with ❤️ for the on-device AI community