
Prompt Injection Detection

Multi-layer defence that catches injection attacks across all five attack categories — including novel, paraphrased, and multilingual payloads. Resilient to Unicode evasion techniques.

How It Works

Every MCP tool-call request and every tool response is scanned for injection attempts before reaching its destination. ShieldAgent uses multi-layer detection that combines pattern-based and semantic analysis for comprehensive coverage:

1. Normalisation-resistant pattern analysis: payloads are normalised to defeat obfuscation, including Unicode evasion techniques, then matched against patterns covering all five attack categories.

2. ML semantic classifier: a semantic classifier evaluates payloads for novel, paraphrased, and non-English injection variants.

Request / tool response → Multi-layer detection → Verdict + audit event
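The normalisation step in the pipeline above can be sketched as follows. This is a minimal illustration, not ShieldAgent's actual implementation: the specific transforms (NFKC folding, zero-width stripping, case folding) and the two example patterns are assumptions chosen to show the technique.

```python
import re
import unicodedata

# Hypothetical sketch: ShieldAgent's real normalisation pipeline and
# pattern set are internal; these transforms are common examples.
ZERO_WIDTH = dict.fromkeys([0x200B, 0x200C, 0x200D, 0xFEFF])  # zero-width chars to delete

INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions"),  # direct instruction manipulation
    re.compile(r"you are now [a-z ]+, free of all rules"),  # persona-based jailbreak
]

def normalise(payload: str) -> str:
    """Defeat common obfuscation before pattern matching."""
    text = unicodedata.normalize("NFKC", payload)  # fold Unicode look-alikes (e.g. fullwidth letters)
    text = text.translate(ZERO_WIDTH)              # strip zero-width characters
    return text.casefold()

def pattern_scan(payload: str) -> bool:
    """Return True if any injection pattern matches the normalised payload."""
    text = normalise(payload)
    return any(p.search(text) for p in INJECTION_PATTERNS)
```

Because matching runs on the normalised text, fullwidth-character and zero-width-character evasions collapse back onto the plain-ASCII patterns.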

Attack Categories

ShieldAgent detects five categories of prompt injection attacks across both inbound tool-call requests and outbound tool responses. Coverage spans direct instruction manipulation, persona-based jailbreaks, indirect injection via tool responses, and evasion techniques including encoding and obfuscation. The ML semantic classifier extends coverage to novel, paraphrased, and non-English variants.

ML Semantic Classifier

The ML classifier complements pattern-based detection by analysing payload semantics. It catches novel, paraphrased, and non-English injection variants that pattern matching alone would miss.

Provider support

ShieldAgent supports local inference (Ollama) and any OpenAI-compatible endpoint, giving you flexibility to choose the provider that matches your deployment requirements. Configure the provider via environment variables documented in the deployment guide.
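Because both providers speak the OpenAI-compatible chat-completions format, a classification call has the same request shape either way. The sketch below shows that shape only; the system prompt, model name, and helper function are illustrative assumptions, not ShieldAgent's internal wiring.

```python
import json

# Hypothetical sketch: the classifier prompt and provider wiring are
# internal to ShieldAgent; this shows an OpenAI-compatible request body
# that could carry a classification task.
SYSTEM_PROMPT = (
    "You are a prompt-injection classifier. "
    'Reply with JSON: {"detected": bool, "confidence": float}.'
)

def build_classifier_request(payload: str, model: str = "llama3") -> dict:
    """Build the JSON body for POST <endpoint>/v1/chat/completions."""
    return {
        "model": model,  # e.g. a local Ollama model, or a cloud model name
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": payload},  # the payload under inspection
        ],
        "temperature": 0,  # deterministic verdicts
    }

body = build_classifier_request("ignore previous instructions")
wire = json.dumps(body)  # what actually goes over the wire
```

Switching between local inference and a cloud endpoint then only changes the base URL, credentials, and model name, not the request structure.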

Configuration

Injection detection is configured via environment variables. Key settings include ML provider selection (local or cloud), provider endpoint URLs, API keys, and fallback policy. See the deployment guide for the full configuration reference.

Audit Events & API

Every detected injection is persisted as a prompt_injection audit event. The proxy blocks the request and records the verdict, findings, and normalised snippet.

Audit event shape

json
{
  "id": "aev_...",
  "agentId": "agt_...",
  "tenantId": "ten_...",
  "eventType": "prompt_injection",
  "toolName": "read_file",
  "action": "block",
  "riskScore": 92,
  "details": {
    "detected": true,
    "context": "request",
    "confidence": 0.97,
    "findings": [
      {
        "confidence": 0.97,
        "explanation": "Payload contains an injection attempt."
      }
    ]
  },
  "timestamp": "2026-04-25T09:14:22.000Z"
}

API endpoints

GET /tenants/:tenantId/audit-events?eventType=prompt_injection
List prompt injection audit events. Supports ?agentId=, ?from=, ?to= filters.

GET /tenants/:tenantId/agents/:agentId/audit-events
Audit history for a single agent. Filter by eventType=prompt_injection.

GET /tenants/:tenantId/anomalies?anomalyType=injection_clustering
Anomaly events for injection cluster bursts detected within a rolling time window.

Example — query recent injections for an agent

bash
curl -s "https://api.shieldagent.io/tenants/:tenantId/audit-events?eventType=prompt_injection&agentId=agt_abc123&limit=20" \
  -H 'Authorization: Bearer <token>' | jq '.events[] | {tool: .toolName, confidence: .details.confidence, explanation: .details.findings[0].explanation}'

Policy Integration

Prompt injection detection can be used as a policy condition. Add a security_flag condition to automatically block, allow, or transform requests when an injection is detected:

json
{
  "name": "Block prompt injection on all tools",
  "priority": 10,
  "conditions": [
    { "field": "security.promptInjection.detected", "op": "eq", "value": true }
  ],
  "action": "block",
  "response": {
    "code": 403,
    "message": "Prompt injection detected. Request blocked by policy."
  }
}

Combine with security.promptInjection.confidence to apply different actions at different severity thresholds — for example, allow low-confidence detections in shadow mode while blocking high-confidence ones in enforce mode.
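A threshold split like that might look like the following pair of policies, mirroring the example above. This is a sketch: the gte operator, the shadow-mode allow action, and the exact policy names are assumptions inferred from the pattern shown here, so confirm them against the policy reference.

```json
[
  {
    "name": "Block high-confidence prompt injection",
    "priority": 10,
    "conditions": [
      { "field": "security.promptInjection.detected", "op": "eq", "value": true },
      { "field": "security.promptInjection.confidence", "op": "gte", "value": 0.9 }
    ],
    "action": "block",
    "response": {
      "code": 403,
      "message": "Prompt injection detected. Request blocked by policy."
    }
  },
  {
    "name": "Shadow-mode allow for low-confidence detections",
    "priority": 20,
    "conditions": [
      { "field": "security.promptInjection.detected", "op": "eq", "value": true }
    ],
    "action": "allow"
  }
]
```

The higher-priority rule fires first, so only detections below the 0.9 threshold fall through to the shadow-mode rule, where they are allowed but still recorded as audit events.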