fix(chat_template): Emit multimodal placeholders in tool response content-parts
What does this PR do?
- When a tool message contains multimodal content parts (e.g. [{"type": "text", ...}, {"type": "image"}]), the template extracts only the text parts and silently drops the image/audio/video placeholders. This causes downstream multimodal processors (e.g. vLLM) to fail with:
Failed to apply prompt replacement for mm_items['image'][0]
- Images in user messages work fine because the captured_content block handles all content types. The tool message branch was missing the same handling.
- Bug reported in: https://github.com/vllm-project/vllm/issues/41452
- vLLM PR: https://github.com/vllm-project/vllm/pull/41459
Fixed by?
- After rendering the tool response text block, emit <|image|>, <|audio|>, and <|video|> placeholders for any multimodal parts in the content array. This matches the pattern already used for regular message content in the captured_content block later in the template.
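
For reviewers who want the intended behavior in isolation, here is a minimal Python sketch of the logic described above. It is not the template code itself: `render_tool_content` and `PLACEHOLDERS` are hypothetical names used only for illustration, and the exact whitespace and ordering produced by the real Jinja template may differ.

```python
# Placeholder tokens as named in this PR description.
PLACEHOLDERS = {"image": "<|image|>", "audio": "<|audio|>", "video": "<|video|>"}


def render_tool_content(content):
    """Hypothetical helper: render tool message content as text plus placeholders.

    Assumes content is either a plain string or a list of content-part dicts
    with a "type" key, as in the example from the description.
    """
    if isinstance(content, str):
        return content

    # Step 1: render the text block from the text parts.
    text = "".join(
        part.get("text", "") for part in content if part.get("type") == "text"
    )

    # Step 2: after the text block, emit one placeholder per multimodal part.
    # Before the fix, this step was missing from the tool message branch.
    placeholders = "".join(
        PLACEHOLDERS[part["type"]]
        for part in content
        if part.get("type") in PLACEHOLDERS
    )
    return text + placeholders


parts = [{"type": "text", "text": "screenshot captured"}, {"type": "image"}]
print(render_tool_content(parts))  # prints: screenshot captured<|image|>
```

With the placeholder present in the rendered prompt, a downstream processor has a position to bind the corresponding multimodal item to, which is what the prompt replacement error above indicates was missing.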