fix(chat_template): Emit multimodal placeholders in tool response content-parts

#94

What does this PR do?

→ When a tool message contains multimodal content parts (e.g. [{"type": "text", ...}, {"type": "image"}]), the template extracts only the text parts and silently drops the image/audio/video placeholders. This causes downstream multimodal processors (e.g. vLLM) to fail with:

Failed to apply prompt replacement for mm_items['image'][0]

→ Images in user messages work fine because the captured_content block properly handles all content types. The tool message branch was missing the same handling 🤗
→ Bug reported in: https://github.com/vllm-project/vllm/issues/41452
→ vLLM PR: https://github.com/vllm-project/vllm/pull/41459

Fixed by?

→ After rendering the tool response text block, emit <|image|>, <|audio|>, and <|video|> placeholders for any multimodal parts in the content array. This matches the pattern already used for regular message content in the captured_content block later in the template.
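
For reviewers, the intended behavior can be sketched in Python. This is not the template code itself (the actual change lives in the Jinja chat template); the function name render_tool_content and the PLACEHOLDERS mapping are hypothetical, and joining text parts without a separator is an assumption made only for illustration.

```python
# Hypothetical mapping from content-part type to its placeholder token.
PLACEHOLDERS = {
    "image": "<|image|>",
    "audio": "<|audio|>",
    "video": "<|video|>",
}


def render_tool_content(content):
    """Sketch of the fixed tool-message branch: text first, then placeholders."""
    # Plain-string content has no parts to inspect.
    if isinstance(content, str):
        return content

    # Text parts, as the template extracted them before the fix
    # (assumed here to be concatenated without a separator).
    text = "".join(
        part.get("text", "") for part in content if part.get("type") == "text"
    )

    # New behavior: one placeholder per multimodal part, in the order
    # the parts appear, emitted after the text block.
    placeholders = "".join(
        PLACEHOLDERS[part["type"]]
        for part in content
        if part.get("type") in PLACEHOLDERS
    )
    return text + placeholders


if __name__ == "__main__":
    parts = [{"type": "text", "text": "screenshot attached"}, {"type": "image"}]
    print(render_tool_content(parts))
```

Before the fix, the equivalent of this function returned only the text, so the rendered prompt carried no <|image|> token for the processor to replace.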
