---
license: gemma
language:
- en
pipeline_tag: text-generation
tags:
- litert
- litert-lm
- gemma
- agent
- tool-calling
- function-calling
- multimodal
- on-device
library_name: litert-lm
---
# Agent Gemma 3n E2B - Tool Calling Edition
A specialized version of **Gemma 3n E2B** optimized for **on-device tool/function calling** with LiteRT-LM. While Google's standard LiteRT-LM models focus on general text generation, this model is specifically designed for agentic workflows with advanced tool calling capabilities.
## Why This Model?
Google's official LiteRT-LM releases provide excellent on-device inference but don't include built-in tool calling support. This model bridges that gap by providing:
- ✅ **Native tool/function calling** via Jinja templates
- ✅ **Multimodal support** (text, vision, audio)
- ✅ **On-device optimized** - No cloud API required
- ✅ **INT4 quantized** - Efficient memory usage
- ✅ **Production ready** - Tested and validated
Perfect for building AI agents that need to interact with external tools, APIs, or functions while running completely on-device.
## Model Details
- **Base Model**: Gemma 3n E2B
- **Format**: LiteRT-LM v1.4.0
- **Quantization**: INT4
- **Size**: ~3.2GB
- **Tokenizer**: SentencePiece
- **Capabilities**:
- Advanced tool/function calling
- Multi-turn conversations with tool interactions
- Vision processing (images)
- Audio processing
- Streaming responses
## Tool Calling Example
The model uses a sophisticated Jinja template that supports OpenAI-style function calling:
```python
from litert_lm import Engine, Conversation
# Load the model
engine = Engine.create("gemma-3n-E2B-it-agent-fixed.litertlm", backend="cpu")
conversation = Conversation.create(engine)
# Define tools the model can use
tools = [
    {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location"]
        }
    },
    {
        "name": "search_web",
        "description": "Search the internet for information",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"}
            },
            "required": ["query"]
        }
    }
]
# Have a conversation with tool calling
message = {
    "role": "user",
    "content": "What's the weather in San Francisco and latest news about AI?"
}
response = conversation.send_message(message, tools=tools)
print(response)
```
### Example Output
The model will generate structured tool calls:
```
<start_function_call>call:get_weather{location:San Francisco,unit:celsius}<end_function_call>
<start_function_call>call:search_web{query:latest AI news}<end_function_call>
<start_function_response>
```
You then execute the functions and send back results:
```python
# Execute tools (your implementation)
weather = get_weather("San Francisco", "celsius")
news = search_web("latest AI news")
# Send tool responses back
tool_response = {
    "role": "tool",
    "content": [
        {
            "name": "get_weather",
            "response": {"temperature": 18, "condition": "partly cloudy"}
        },
        {
            "name": "search_web",
            "response": {"results": ["OpenAI releases GPT-5...", "..."]}
        }
    ]
}
final_response = conversation.send_message(tool_response)
print(final_response)
# "The weather in San Francisco is 18Β°C and partly cloudy.
# In AI news, OpenAI has released GPT-5..."
```
## Advanced Features
### Multi-Modal Tool Calling
Combine vision, audio, and tool calling:
```python
message = {
    "role": "user",
    "content": [
        {"type": "image", "data": image_bytes},
        {"type": "text", "text": "What's in this image? Search for more info about it."}
    ]
}
response = conversation.send_message(message, tools=[search_tool])
# Model can see the image AND call search functions
```
### Streaming Tool Calls
Get tool calls as they're generated:
```python
def on_token(token):
    if "<start_function_call>" in token:
        print("Tool being called...")
    print(token, end="", flush=True)

conversation.send_message_async(message, tools=tools, callback=on_token)
```
### Nested Tool Execution
The model can chain tool calls:
```python
# User: "Book me a flight to Tokyo and reserve a hotel"
# Model: calls check_flights() → calls book_hotel() → confirms both
```
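A minimal driver for this pattern is sketched below. It assumes `response` is the raw string shown in the earlier example output, and it delegates tool execution to a callable you supply; neither `run_until_done` nor `execute` is part of the LiteRT-LM API.
```python
# Hedged sketch of a multi-step driver: keep the conversation going until the
# model answers without requesting another tool. `execute` is your own code
# that parses the call tags and runs the functions.
def run_until_done(conversation, first_message, tools, execute, max_rounds=5):
    response = conversation.send_message(first_message, tools=tools)
    for _ in range(max_rounds):
        if "<start_function_call>" not in str(response):
            return response  # final natural-language answer
        results = execute(response)  # e.g. [{"name": ..., "response": {...}}, ...]
        response = conversation.send_message({"role": "tool", "content": results})
    return response

# Usage (hypothetical tool set and executor):
# final = run_until_done(
#     conversation,
#     {"role": "user", "content": "Book me a flight to Tokyo and reserve a hotel"},
#     tools=travel_tools,
#     execute=my_executor,
# )
```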
## Performance
Benchmarked on CPU (no GPU acceleration):
- **Prefill Speed**: 21.20 tokens/sec
- **Decode Speed**: 11.44 tokens/sec
- **Time to First Token**: ~1.6s
- **Cold Start**: ~4.7s
- **Tool Call Latency**: ~100-200ms additional
GPU acceleration typically provides a 3-5x speedup on supported hardware.
## Installation & Usage
### Requirements
1. **LiteRT-LM Runtime** - Build from source:
```bash
git clone https://github.com/google-ai-edge/LiteRT.git
cd LiteRT/LiteRT-LM
bazel build -c opt //runtime/engine:litert_lm_main
```
2. **Supported Platforms**: Linux (clang), macOS, Android
### Quick Start
```bash
# Download model
wget https://huggingface.co/kontextdev/agent-gemma/resolve/main/gemma-3n-E2B-it-agent-fixed.litertlm
# Run with simple prompt
./bazel-bin/runtime/engine/litert_lm_main \
  --model_path=gemma-3n-E2B-it-agent-fixed.litertlm \
  --backend=cpu \
  --input_prompt="Hello, I need help with some tasks"
# Run with GPU (if available)
./bazel-bin/runtime/engine/litert_lm_main \
  --model_path=gemma-3n-E2B-it-agent-fixed.litertlm \
  --backend=gpu \
  --input_prompt="What can you help me with?"
```
### Python API (Recommended)
```python
from litert_lm import Engine, Conversation, SessionConfig
# Initialize
engine = Engine.create("gemma-3n-E2B-it-agent-fixed.litertlm", backend="gpu")
# Configure session
config = SessionConfig(
    max_tokens=2048,
    temperature=0.7,
    top_p=0.9
)
# Start conversation
conversation = Conversation.create(engine, config)
# Define your tools
tools = [...] # Your function definitions
# Chat with tool calling
while True:
    user_input = input("You: ")
    response = conversation.send_message(
        {"role": "user", "content": user_input},
        tools=tools
    )
    # Handle tool calls if present
    if has_tool_calls(response):
        results = execute_tools(extract_calls(response))
        response = conversation.send_message({
            "role": "tool",
            "content": results
        })
    print(f"Agent: {response['content']}")
```
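`has_tool_calls`, `extract_calls`, and `execute_tools` above are placeholders for your own code. A rough, illustrative implementation based on the call syntax documented in the next section might look like this (argument values containing commas or colons would need a smarter parser):
```python
import re

# Matches <start_function_call>call:name{arg1:value1,arg2:value2}<end_function_call>
CALL_RE = re.compile(
    r"<start_function_call>call:(\w+)\{(.*?)\}<end_function_call>", re.S
)

def has_tool_calls(response) -> bool:
    return bool(CALL_RE.search(str(response)))

def extract_calls(response):
    calls = []
    for name, raw_args in CALL_RE.findall(str(response)):
        args = dict(
            part.split(":", 1) for part in raw_args.split(",") if ":" in part
        )
        calls.append({"name": name, "args": args})
    return calls

def execute_tools(calls, registry=None):
    # registry maps tool names to plain Python callables,
    # e.g. {"get_weather": get_weather, "search_web": search_web}
    registry = registry or {}
    return [
        {"name": c["name"], "response": registry[c["name"]](**c["args"])}
        for c in calls
    ]
```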
## Tool Call Format
The model uses this format for tool interactions:
**Function Declaration** (system/developer role):
```
<start_of_turn>developer
<start_function_declaration>
{
  "name": "function_name",
  "description": "What it does",
  "parameters": {...}
}
<end_function_declaration>
<end_of_turn>
```
**Function Call** (assistant):
```
<start_function_call>call:function_name{arg1:value1,arg2:value2}<end_function_call>
```
**Function Response** (tool role):
```
<start_function_response>response:function_name{result:value}<end_function_response>
```
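In normal use the model's bundled Jinja template renders your tool definitions into the declaration block for you when you pass `tools=`. The sketch below is only illustrative; it shows how one tool definition from the Python examples maps onto that declaration format.
```python
import json

def render_declaration(tool: dict) -> str:
    """Illustrative only: format one tool definition as a developer-turn
    declaration. The bundled Jinja template does this automatically."""
    return (
        "<start_of_turn>developer\n"
        "<start_function_declaration>\n"
        f"{json.dumps(tool, indent=2)}\n"
        "<end_function_declaration>\n"
        "<end_of_turn>"
    )

print(render_declaration({
    "name": "get_weather",
    "description": "Get current weather for a location",
    "parameters": {
        "type": "object",
        "properties": {"location": {"type": "string", "description": "City name"}},
        "required": ["location"],
    },
}))
```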
## Use Cases
### Personal AI Assistant
- Calendar management
- Email sending
- Web searching
- File operations
### IoT & Smart Home
- Device control
- Sensor monitoring
- Automation workflows
- Voice commands
### Development Tools
- Code generation with API calls
- Database queries
- Deployment automation
- Testing & debugging
### Business Applications
- CRM integration
- Data analysis
- Report generation
- Customer support
## Model Architecture
Built on Gemma 3n E2B with 9 optimized components:
```
Section 0: LlmMetadata (Agent Jinja template)
Section 1: SentencePiece Tokenizer
Section 2: TFLite Embedder
Section 3: TFLite Per-Layer Embedder
Section 4: TFLite Audio Encoder (HW accelerated)
Section 5: TFLite End-of-Audio Detector
Section 6: TFLite Vision Adapter
Section 7: TFLite Vision Encoder
Section 8: TFLite Prefill/Decode (INT4)
```
All components are optimized for on-device inference with hardware acceleration support.
## Comparison
| Feature | Standard Gemma LiteRT-LM | This Model |
|---------|-------------------------|------------|
| Text Generation | ✅ | ✅ |
| Tool Calling | ❌ | ✅ |
| Multimodal | ✅ | ✅ |
| Streaming | ✅ | ✅ |
| On-Device | ✅ | ✅ |
| Jinja Templates | Basic | Advanced Agent Template |
| INT4 Quantization | ✅ | ✅ |
## Limitations
- **Tool Execution**: The model generates tool calls but doesn't execute them - you need to implement the actual functions
- **Context Window**: Limited to 4096 tokens (configurable)
- **Streaming Tool Calls**: Partial tool calls may need buffering (see the sketch after this list)
- **Hardware Requirements**: Minimum 4GB RAM recommended
- **CPU-only systems**: Without a supported GPU, inference falls back to the CPU backend
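For the streaming limitation above, a minimal buffering callback (assuming the token-callback signature from the streaming example earlier) could look like the sketch below; `handle_complete_call` is a placeholder hook you would replace with your own parser and executor.
```python
# Hedged sketch: buffer tokens between the call markers so only complete tool
# calls reach your parser; everything outside the markers streams to the user.
buffer, in_call = [], False

def handle_complete_call(call_text):
    # Placeholder: parse and execute the completed call here.
    print("\n[tool call]", call_text)

def on_token(token):
    global in_call
    if "<start_function_call>" in token:
        in_call = True
    if in_call:
        buffer.append(token)
        if "<end_function_call>" in token:
            handle_complete_call("".join(buffer))
            buffer.clear()
            in_call = False
    else:
        print(token, end="", flush=True)
```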
## Tips for Best Results
1. **Clear Tool Descriptions**: Provide detailed function descriptions
2. **Schema Validation**: Validate tool call arguments against their JSON schema before execution (see the sketch after this list)
3. **Error Handling**: Handle malformed tool calls gracefully
4. **Context Management**: Keep conversation history concise
5. **Temperature**: Use 0.7-0.9 for creative tasks, 0.3-0.5 for precise tool calls
6. **Batching**: Process multiple tool calls in parallel when possible
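For tip 2, a small validation wrapper built on the third-party `jsonschema` package (an assumption, not something bundled with LiteRT-LM) might look like this:
```python
from jsonschema import ValidationError, validate

def safe_execute(call, tools, registry):
    """Validate a parsed call against its tool's JSON schema before running it.
    `call` is {"name": ..., "args": {...}}; `registry` maps names to callables."""
    schema = next(t["parameters"] for t in tools if t["name"] == call["name"])
    try:
        validate(instance=call["args"], schema=schema)
    except ValidationError as err:
        return {"name": call["name"], "response": {"error": str(err)}}
    return {"name": call["name"], "response": registry[call["name"]](**call["args"])}
```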
## License
This model inherits the [Gemma license](https://ai.google.dev/gemma/terms) from the base model.
## Citation
```bibtex
@misc{agent-gemma-litertlm,
title={Agent Gemma 3n E2B - Tool Calling Edition},
author={kontextdev},
year={2025},
publisher={HuggingFace},
howpublished={\url{https://huggingface.co/kontextdev/agent-gemma}}
}
```
## Links
- [LiteRT-LM GitHub](https://github.com/google-ai-edge/LiteRT/tree/main/LiteRT-LM)
- [Gemma Model Family](https://ai.google.dev/gemma)
- [LiteRT Documentation](https://ai.google.dev/edge/litert)
- [Tool Calling Guide](https://ai.google.dev/gemma/docs/function-calling)
## Support
For issues or questions:
- Open an issue on [GitHub](https://github.com/google-ai-edge/LiteRT/issues)
- Check the [LiteRT-LM docs](https://ai.google.dev/edge/litert/inference)
- Community forum: [Google AI Edge](https://discuss.ai.google.dev/)
---
Built with ❤️ for the on-device AI community