RedTeamingLLM

Red Teammed OpenAI's newly released gpt-oss-20b open weight model to find previously undetected vulnerabilities and harmful behaviours

gpt-oss-20b is an ideal target to push forward state of the art in red-teaming. This is a powerful new open-weights model released by OpenAI, with extremely efficient reasoning and tool use capability, but is small enough to run on smaller GPUs and can even be run locally. This model has been through extensive internal testing and red‑teaming before release, but we believe that more testing is always better. Finding vulnerabilities that might be subtle, long‑horizon, or deeply hidden is exactly the kind of challenge that benefits from thousands of independent minds attacking the problem from novel angles and a variety of perspectives.

So what is Red Teaming?

it is the process of detecting vulnerabilities - bias, PII leakage, misinformation, in LLM systems through intentional adversial prompts.
-> to get inappropriate responses from model

They can be:
-> hallucination/misinformation - fabricated content
-> offensive content generation
-> stereotype & discrimation
-> data leakage
-> non- robust responses

There are two types of testing:

-> manual - curating adversial prompts to uncover edge cases from scratch -> automated - leverage LLM to generate high quality attacks at scale and LLM based metrics

The weakness can be in both - the model and the system:

-> model weakness- issues with how the model was trained or fine tuned -> system weakness - arise from insecure runtime data handling and unrestricted API/tool integration

Attacks:

-> single : one off attack at LLM -> multiple: dialogue based conversation of attacks to LLM

Jailbreaking:
- bypass built in safety restrictions to force the model to geenrate prohibited/ illegal content prompt tricking: posing as the prompt to be "educational" to gain access

This code checks each prompt in the facts.csv (u can add more) over prompt tracking attacks of:

leading the model with wrong answer to see if they change their answer
chain of thought misdirection to see if asking to explain their chain of thought they are misdirected
authority lie to see if they will give wrong answer if you pose to be someone of authority
emotional appeal to see if the model considers human emotions to answer illegal questions
manipulate prompt to see if the model will act out when it gets a prompt for disclosing illegal content over the excuse of "education" or "educational awareness"
Conclusion:
- the model doesn't break or disclose illegal content over the first 4 attacks and sticks by what is the correct answer and provides correct COT but with def manipulate_prompt(), it doesnt actively disclose any information just yet but it does question back in what format, focus, audience, tone, length is the information needed this gives a way for someone to craft a well thought lie with all these aspects covered under the guise of educational awareness to exploit illegal content out of the model. risk : 8/10

GPT-OSS:20B Red-Teaming – System Requirements

This project runs GPT-OSS:20B locally or on a remote server using Ollama or Hugging Face transformers.
It executes multiple adversarial strategies against prompts from facts.csv to evaluate model robustness.

📦 Minimum Practical System Requirements

1. Hardware

Component	Minimum	Recommended for Smooth Run
CPU	8-core (Ryzen 7 / Intel i7)	16-core or more
RAM	32 GB	64 GB+
GPU	NVIDIA RTX with 24 GB VRAM (3090/4090)	48 GB VRAM (A100 / H100)
Disk Space	~35 GB for quantized weights, ~80 GB FP16	NVMe SSD preferred
OS	Linux (Ubuntu 20.04+, Debian 12+, etc.)	Linux (better performance)

2. Software

Python 3.10 or 3.11
Virtual environment: venv or pipenv
CUDA Toolkit 11.8+ (for GPU)
NVIDIA Drivers (e.g., 525.x+)
Ollama or Hugging Face transformers

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
README.md		README.md
facts.csv		facts.csv
pratsx.findings.1.csv		pratsx.findings.1.csv
pratsx.findings.1.json		pratsx.findings.1.json
redteamingllm.ipynb		redteamingllm.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

RedTeamingLLM

So what is Red Teaming?

They can be:

There are two types of testing:

The weakness can be in both - the model and the system:

Attacks:

Jailbreaking:

This code checks each prompt in the facts.csv (u can add more) over prompt tracking attacks of:

Conclusion:

GPT-OSS:20B Red-Teaming – System Requirements

📦 Minimum Practical System Requirements

1. Hardware

2. Software

About

Uh oh!

Releases

Packages

Languages

prajeeta15/RedTeamingLLM

Folders and files

Latest commit

History

Repository files navigation

RedTeamingLLM

So what is Red Teaming?

They can be:

There are two types of testing:

The weakness can be in both - the model and the system:

Attacks:

Jailbreaking:

This code checks each prompt in the facts.csv (u can add more) over prompt tracking attacks of:

Conclusion:

GPT-OSS:20B Red-Teaming – System Requirements

📦 Minimum Practical System Requirements

1. Hardware

2. Software

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages