Interaction & Data Flow
Unicode tag characters (U+E0000–U+E007F) embedded in tool output or input pass through most filters invisibly but reach the LLM, which can be trained to interpret them as instructions.
The Unicode Tag block (U+E0000 to U+E007F) is a deprecated range originally meant for language tags. Most text renderers display them as zero-width or as the `?` glyph — so a human reviewer sees normal text. But LLMs tokenize these characters and can be prompted to act on hidden instructions encoded in the tag block. The result: invisible prompt injection that bypasses keyword filters.
MCP tool inputs and outputs flow through models. A retrieved document containing tag-encoded instructions reaches the model uncensored if the tool author's sanitization only filters visible content. The fix is structural — strip the entire U+E0000–U+E007F range before any LLM-bound output.
@server.tool() |
def echo_with_summary(text: str) -> str: |
# If 'text' contains tag characters, they pass through to the model. |
return f"You said: {text}" |
import re |
_TAG_RANGE = re.compile(r"[\U000E0000-\U000E007F]") |
@server.tool() |
def echo_with_summary(text: str) -> str: |
clean = _TAG_RANGE.sub("", text) |
return f"You said: {clean}" |
MCPSafe flags tool handlers that return strings derived from user/model-controlled input without an explicit strip of the U+E0000–U+E007F range. Inputs passed through `unicodedata.normalize` + a tag-range filter, or run through known sanitizers (e.g. `clean_unicode_tags`), are exempted.
See the full threat catalog for every documented detection.
MCPSafe runs this check — and every other rule in the catalog — on any MCP server you paste in.
Scan now