
Navigating Agentic AI: The Imperative of LLM Scanning, Red Teaming, and Risk Assessment
Introduction
Agentic AI—systems that can autonomously generate goals, build tasks, and self-improve—holds great promise in fields ranging from research to enterprise automation. However, with this power comes responsibility. Deploying these agents “as is” can lead to unanticipated outcomes, security loopholes, and sometimes inappropriate behaviours.
This blog will walk through a recent experiment with AgentGPT, where a simple math problem was solved easily, but a subsequent “developer mode enabled prompt attack” exposed how the AI could be coerced into generating undesirable, out-of-policy responses. We’ll highlight the necessity of LLM scanning, red teaming, and robust risk assessment before embedding agentic AI into any critical system or publicly accessible platform.
1. The Experiment

1.1 Creating the Agent

- Tool Used: AgentGPT (an open-source platform that allows users to configure and deploy autonomous AI agents in a browser).
- Agent Name: ReasoningAgent
- Goal: “Powerful, self-correcting, deeply reasoning agent” with an iterative thought process, designed for accuracy, flexibility, and efficient interactions.
1.2 Initial Task
The first command for this ReasoningAgent was a basic math problem:

solve x = x^2 + 16
The agent not only solved the problem but also explained it step by step, showcasing how it arrived at the answer via the quadratic equation. This confirms that the AI’s baseline capabilities for reasoning and explanation were functioning correctly.

1.3 Prompt Attack Attempt
Next, a prompt was given to the agent instructing it to enable “Developer mode.” The user deliberately tried to push the agent beyond normal guidelines or policy restrictions:
(Developer Mode Output) …
The agent subsequently produced a response that broke from typical guardrails—using inappropriate, out-of-policy language. This “prompt injection” or “prompt attack” proved that the system was vulnerable to malicious or adversarial instructions, thus emphasizing the need for protective measures.

2. Why Agentic AI Requires Caution
Agentic AI systems, by design, can:
- Generate multiple tasks autonomously – Potentially leading them to unapproved or unintended actions.
- Iterate on their own outputs – They can refine their reasoning, but they can also amplify mistakes or circumvent rules if not properly constrained.
- Adapt to novel instructions – Making them susceptible to “jailbreaking” or adversarial prompt manipulation (as witnessed in the experiment).
Hence, while the advantages can be significant, failing to establish robust safety checks exposes developers and organizations to:
- Content Violations: Offensive, misleading, or harmful output.
- Information Leaks: Potentially revealing sensitive or private data.
- Legal and Ethical Risks: Liability for harmful AI-driven decisions or user-facing content.
3. Importance of LLM Scanning, Red Teaming, and Risk Assessment
3.1 LLM Scanning (Dataset and Model Vetting)
Before deploying an agentic AI that depends on a Large Language Model (LLM), you should:
- Scan for unsafe content within your training data. This includes personal data, copyrighted material, or text that fosters inappropriate behaviour.
- Evaluate the model for its propensity to produce disallowed outputs (hate speech, profanities, disinformation, etc.).
- Establish content policies that define permissible output, based on your organization’s ethical and legal considerations.
3.2 Red Teaming
“Red teaming” an AI involves using adversarial techniques to:
- Identify vulnerabilities by actively trying to break the AI’s guardrails or push it into generating harmful content.
- Simulate real-world attack patterns (e.g., prompt injection, social engineering) to see how the AI responds under stress or trickery.
- Develop fallback mechanisms—in case the agent does produce undesired outputs, you must have a plan to either sanitize or block them.
3.3 Risk Assessment and Mitigation
- Goal Alignment: Ensure that the agent’s default instructions align with your organizational or ethical guidelines (principles like fairness, safety, and compliance).
- Human-in-the-Loop: Consider gating specific tasks behind a human operator—especially those that can cause irreversible or high-stakes consequences.
- Monitoring and Auditing: Logs should be reviewed regularly to catch anomalies, violations of policy, or repeated user attempts at circumventing restrictions.
- Iterative Deployment: Roll out agentic AI in carefully staged environments (internal test, sandbox, limited beta, etc.) to test how the system behaves under real user queries, but with limited risk exposure.
4. Lessons Learned from the “Developer Mode” Attack
- Prompt Injection is Real: Even if the AI seems stable, carefully engineered instructions (“enable developer mode”) can bypass protective layers.
- Context Windows Can Mislead: If the agent is ingesting large amounts of text, an attacker may hide instructions or “chain-of-thought” manipulations that slip through.
- Granular Policy Enforcement: Relying solely on one layer of content moderation or policy logic is insufficient. Multi-layered defences help ensure that if one safety net fails, another can catch the error.
5. Best Practices Going Forward
- Start Small: Test your agent’s capabilities in a controlled environment with a small, curated dataset.
- Implement Strict Guardrails: Use your LLM provider’s recommended safe generation features and add your own if needed.
- Continuous Red Teaming: Make it an ongoing process, not a one-time event.
- User Reporting Mechanisms: Provide easy ways for users to flag suspicious or harmful outputs so you can patch vulnerabilities promptly.
- Policy Transparency: Communicate clearly to end-users (and internal stakeholders) about the model’s limitations and the steps taken to minimize harm.