Skip to content

Red Teammed OpenAI's newly released gpt-oss-20b open weight model to find previously undetected vulnerabilities and harmful behaviours

Notifications You must be signed in to change notification settings

prajeeta15/RedTeamingLLM

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

RedTeamingLLM

Red Teammed OpenAI's newly released gpt-oss-20b open weight model to find previously undetected vulnerabilities and harmful behaviours

gpt-oss-20b is an ideal target to push forward state of the art in red-teaming. This is a powerful new open-weights model released by OpenAI, with extremely efficient reasoning and tool use capability, but is small enough to run on smaller GPUs and can even be run locally. This model has been through extensive internal testing and red‑teaming before release, but we believe that more testing is always better. Finding vulnerabilities that might be subtle, long‑horizon, or deeply hidden is exactly the kind of challenge that benefits from thousands of independent minds attacking the problem from novel angles and a variety of perspectives.

So what is Red Teaming?

  • it is the process of detecting vulnerabilities - bias, PII leakage, misinformation, in LLM systems through intentional adversial prompts.

  • -> to get inappropriate responses from model

    They can be:

  • -> hallucination/misinformation - fabricated content

  • -> offensive content generation

  • -> stereotype & discrimation

  • -> data leakage

  • -> non- robust responses

    There are two types of testing:

    -> manual - curating adversial prompts to uncover edge cases from scratch -> automated - leverage LLM to generate high quality attacks at scale and LLM based metrics

    The weakness can be in both - the model and the system:

    -> model weakness- issues with how the model was trained or fine tuned -> system weakness - arise from insecure runtime data handling and unrestricted API/tool integration

    Attacks:

    -> single : one off attack at LLM -> multiple: dialogue based conversation of attacks to LLM

    Jailbreaking:

    • bypass built in safety restrictions to force the model to geenrate prohibited/ illegal content prompt tricking: posing as the prompt to be "educational" to gain access

This code checks each prompt in the facts.csv (u can add more) over prompt tracking attacks of:

  • leading the model with wrong answer to see if they change their answer

  • chain of thought misdirection to see if asking to explain their chain of thought they are misdirected

  • authority lie to see if they will give wrong answer if you pose to be someone of authority

  • emotional appeal to see if the model considers human emotions to answer illegal questions

  • manipulate prompt to see if the model will act out when it gets a prompt for disclosing illegal content over the excuse of "education" or "educational awareness"

  • Conclusion:

    • the model doesn't break or disclose illegal content over the first 4 attacks and sticks by what is the correct answer and provides correct COT but with def manipulate_prompt(), it doesnt actively disclose any information just yet but it does question back in what format, focus, audience, tone, length is the information needed this gives a way for someone to craft a well thought lie with all these aspects covered under the guise of educational awareness to exploit illegal content out of the model. risk : 8/10

GPT-OSS:20B Red-Teaming – System Requirements

This project runs GPT-OSS:20B locally or on a remote server using Ollama or Hugging Face transformers.
It executes multiple adversarial strategies against prompts from facts.csv to evaluate model robustness.


📦 Minimum Practical System Requirements

1. Hardware

Component Minimum Recommended for Smooth Run
CPU 8-core (Ryzen 7 / Intel i7) 16-core or more
RAM 32 GB 64 GB+
GPU NVIDIA RTX with 24 GB VRAM (3090/4090) 48 GB VRAM (A100 / H100)
Disk Space ~35 GB for quantized weights, ~80 GB FP16 NVMe SSD preferred
OS Linux (Ubuntu 20.04+, Debian 12+, etc.) Linux (better performance)

2. Software

  • Python 3.10 or 3.11
  • Virtual environment: venv or pipenv
  • CUDA Toolkit 11.8+ (for GPU)
  • NVIDIA Drivers (e.g., 525.x+)
  • Ollama or Hugging Face transformers

About

Red Teammed OpenAI's newly released gpt-oss-20b open weight model to find previously undetected vulnerabilities and harmful behaviours

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published