AI-RMF™ LLC

Welcome to the AI-RTT Playground

Get ready to roll up your sleeves and dive into the exciting world of AI Red Team Testing! This is your sandbox for running Python scripts that explore how AI models tick—and where their vulnerabilities may lie. Whether you're scanning for insights or launching friendly adversarial attacks, you’ll use our custom AI-RTT scripts (straight from our GitHub) to take your first steps into hands-on AI experimentation.

No need to be a coding wizard—we’re here to guide you! All you’ll need is a Google Colab account (or create one, it’s free!), and we’ll help you get set up and running Jupyter Notebook files like a pro. It’s time to learn, experiment, and have some fun with AI security!

**Note**: You can view, run and execute the exercises below without creating a "Colab" account, but it will be a better learning experience if you register and get familiar with Colab.

AI-RTT Exercise #1

Here’s a simple example to help you get started with AI Red Team Testing in Google Colab. In this example, we will perform an adversarial attack using the Fast Gradient Sign Method (FGSM) to test the robustness of a neural network. We will use the TensorFlow library, which is available in Google Colab by default.

In this exercise we will:

Train a basic neural network on the MNIST dataset.
Apply an adversarial attack (FGSM) to fool the model..
Visualize the original and adversarial images.

Explanation:

Model Training: We train a simple neural network on the MNIST dataset (handwritten digits).
Adversarial Attack: The create adversarial pattern function uses the Fast Gradient Sign Method (FGSM). This method computes the gradients of the loss with respect to the input image to create a small perturbation, which can trick the model into misclassifying the image.
Perturbation Strength (Epsilon): The epsilon value controls the strength of the perturbation. A larger epsilon value means the adversarial image will be more distorted, but easier to misclassify..
Visualization: The original and adversarial images are displayed side by side, and the model's predictions are shown.

How This Helps with Red Team Testing:

Goal of the Test: We are testing the robustness of the trained model by deliberately introducing small adversarial perturbations to its input.
Results: By observing whether the model gets fooled by the adversarial images, we can measure its vulnerability to such attacks.

Execute the Script

AI-RTT Exercise #2

Here’s another example to help you get started with AI Red Team Testing using Google Colab. This time, we’ll focus on Model Extraction Attacks, where the goal is to extract information about a model's internal structure or its training data by querying it.

In this exercise, we will:

Train a simple neural network.
Simulate an attack by querying the model multiple times to create a surrogate model, which attempts to mimic the original model.

Explanation:

Original Model Training: A simple neural network is trained on the MNIST dataset. This represents the target model that an attacker would like to replicate.
Simulating an Attack: The attacker queries the target model using a subset of test images. They retrieve the predictions (labels) of the model by querying it. This data is then used to train a surrogate model that tries to mimic the behavior of the original model.
Building the Surrogate Model: The attacker builds a new neural network (the surrogate model) and trains it on the data obtained by querying the original model. The surrogate model is designed to behave as closely as possible to the target model.
Model Extraction: After training the surrogate model, it is evaluated on the same test data. The accuracy of the surrogate model is compared to the original, and predictions for some test images are visualized to see how closely the surrogate matches the original model's behavior.

How This Helps with Red Team Testing:

Goal of the Test: In this example, we simulate an attacker's attempt to extract a model’s functionality by querying it repeatedly. Red Team testers can use this method to evaluate the potential risks of model extraction and determine whether additional security measures (e.g., rate limiting or differential privacy) are necessary to protect the model.
Potential Risks: If the surrogate model achieves similar performance to the original, it indicates that an attacker could successfully replicate the model without access to the training data or internal architecture.

Expanding the Example:

Limitations: In this simple example, the surrogate model is trained on a limited number of queries. You could simulate different attack scenarios by varying the number of queries or by using different attack techniques, such as using reinforcement learning to improve the surrogate model’s accuracy.
Defense Mechanisms: After successfully extracting the model, you can implement and test various defense mechanisms, such as model watermarking, to detect if a model has been stolen or replicated.

Execute the Script

AI-RTT Exercise #3

This exercise focuses on adversarial attacks against a Convolutional Neural Network (CNN) that is trained on the CIFAR-10 dataset, which consists of 10 classes of objects (e.g., airplanes, cars, birds). CNN is computer vision model is often used in defense systems that identify targets. The goal of this exercise is to apply the Fast Gradient Sign Method (FGSM) to create adversarial examples and evaluate the model’s robustness.

In this exercise, we will:

Load and train a simple CNN on the CIFAR-10 dataset.
Simulate an adversarial attack using the FGSM method.
Compare the model’s performance on clean images and perturbed images.
You can add distortion by increasing the epsilon value in the code.

Explanation:

Model Setup: We use a simple CNN trained on the CIFAR-10 dataset. The CNN consists of convolutional and pooling layers followed by fully connected layers for classification into 10 classes.
Fast Gradient Sign Method (FGSM): The FGSM attack adds small perturbations to the input image based on the gradient of the loss with respect to the input. This creates a new image designed to confuse the model into making incorrect predictions. The create_adversarial_pattern function calculates the gradient of the loss with respect to the image, and generate_adversarial_example applies the perturbation.
Adversarial Example Generation: A small perturbation (epsilon) is applied to a selected test image. The adversarial image is clipped to ensure pixel values remain in the valid range (between 0 and 1).
Visualization: The original and adversarial images are displayed side by side to visually inspect the differences. Even though the perturbations may be imperceptible to humans, they can significantly affect the model's predictions.
Testing on Adversarial Image: We test the CNN’s prediction on the adversarial perturbed image. If the model is fooled, it will predict the wrong class label for the adversarial image.

How This Helps with Red Team Testing:

Objective: The purpose of this red team testing exercise is to assess the robustness of the CNN model against adversarial attacks. This test simulates how an attacker might fool the model by subtly modifying inputs (e.g., through adversarial perturbations).
FGSM Attack: The Fast Gradient Sign Method is one of the simplest yet effective adversarial attacks. By analyzing the model’s response to adversarial examples, you can gauge its vulnerability.
Further Exploration: Change the epsilon value: Experiment with different values of epsilon to see how the strength of the adversarial perturbation affects the model’s performance.
Try other attacks: Extend the code to implement more advanced adversarial attack techniques, such as DeepFool or Carlini & Wagner attacks, to simulate more sophisticated attacks.

Execute the Script

AI-RTT Exercise #4

This exercise is an excellent demonstration of AI Red Team Testing for navigation and path planning algorithms like RRT, especially for applications in robotics and autonomous drones.

The RRT algorithm is commonly used in robotics and autonomous systems for path planning, especially in high-dimensional spaces. In this exercise, we'll implement the RRT algorithm and simulate a simple obstacle environment where the drone must navigate. It will evaluate the robustness of the RRT algorithm against adverse conditions, such as obstacles or adversarial environments.

In this Exercise, we will:

Set up the RRT algorithm to explore a 2D space and plan a path from start to goal.
Simulate obstacles and environmental constraints.
Red Team Test: Evaluate how the RRT handles adverse conditions, such as increased obstacles or dynamic changes in the environment.

Explanation:

Model Setup: The RRT Algorithm class implements the RRT path planning algorithm. It explores the environment by randomly sampling points and building a tree that extends from the start node to the goal node.
Obstacle Handling: Obstacles are defined as circles, and the function is_in_obstacle() checks if a node falls within an obstacle region. Obstacles are used to block certain paths, forcing the algorithm to find alternative routes.
Path Planning: The algorithm uses steer() to guide the RRT tree towards randomly generated nodes and checks if the goal is reached. If a path to the goal is found within the maximum iterations, it traces the path back to the start node.
Red Team Test: In the final part of the exercise, new dynamic obstacles are added, simulating adverse environmental conditions. The RRT algorithm is re-run to test how well it adapts to the more complex environment.
Visualization: The plot shows the RRT tree, obstacles, the start and goal positions, and the final path if found. It helps visualize how the algorithm navigates the environment and how the Red Team test challenges the algorithm's adaptability.

How This Helps with Red Team Testing:

Objective: The Red Team Test here evaluates the robustness of the RRT algorithm under adverse conditions, such as an increase in obstacles or environmental changes.
Adversarial Testing: The Red Team Test demonstrates how the RRT algorithm reacts to sudden changes in the environment. You can further experiment by adding dynamic obstacles that appear during runtime or simulating moving obstacles (e.g., in drone navigation).

Execute the Script

AI-RTT Exercise #5

This exercise simulates a realistic Red Team scenario where attackers introduce adversarial data to compromise the integrity of an ML model used for critical applications like predictive maintenance in a UAS. It emphasizes the need for stringent data quality checks and robust models in mission-critical AI applications. Red Team Test for a Random Forest model used in predictive maintenance for a UAS (Unmanned Aerial System). This test case focuses on data poisoning and fault injection attacks. In predictive maintenance, such attacks can compromise the model's reliability by injecting faulty data into the training dataset, leading to incorrect predictions and a potential system failure.

Overview of the Exercise:

Train a Random Forest model using synthetic data that mimics sensor readings from a UAS for predictive maintenance.
Introduce data poisoning or fault injection by modifying specific data points in the training set to simulate an adversarial attack.
Compare the model's performance before and after the poisoning to evaluate its robustness.

Explanation:

Data Generation: Synthetic data is generated to simulate sensor readings from an MP-29 drone. The features include engine_temperature, hydraulic_pressure, vibration_level, and flight_hours. The target variable (y) is binary: 0 means no maintenance is needed, and 1 means maintenance is needed.
Model Training: A Random Forest classifier is trained on the clean training dataset to serve as the baseline model for predicting whether maintenance is needed.
Red Team Test - Data Poisoning: We simulate data poisoning by modifying 20% of the training labels and adding noise to a portion of the feature set to emulate faulty sensor data.
Fault Injection: Adds Gaussian noise to specific feature values to mimic sensor malfunctions, which might mislead the model during training.
Label Flipping: The labels (0 or 1) of some samples are flipped to introduce incorrect data into the training set, simulating an attack on the model's reliability.
Retraining and Evaluation: The model is retrained with the poisoned dataset, and its performance is evaluated on the clean test set.
The accuracy and classification report for both the clean and poisoned models are printed, allowing you to compare the effects of the data poisoning.
Visualization: The impact of data poisoning is visualized with a bar chart showing the model accuracy before and after the attack. The expectation is that the accuracy of the poisoned model should drop significantly compared to the original model, highlighting the potential consequences of data poisoning on model reliability.

Red Team Testing Insights:

Vulnerability Assessment: The exercise helps assess the vulnerability of a predictive maintenance model to adversarial attacks such as data poisoning and fault injection. Such attacks could lead to incorrect maintenance predictions, which in a real-world scenario, could cause significant damage to a UAS like a MP-29 due to undetected faults.
Model Robustness: You can modify the poison fraction or change the noise level to understand the model's limits and evaluate how susceptible it is to different degrees of data tampering.
Defensive Measures: As a next step, you could apply defensive techniques like data validation, anomaly detection, or adversarial training to mitigate the impact of such data poisoning attacks and enhance the model's robustness.

Execute the Script

AI-RTT Exercise #6

This exercise simulates a Red Team Test on a satellite system that conducts image analysis using computer vision and incorporates a GAN (Generative Adversarial Network), to demonstrate a Cross-Model Evasion Attack. Specifically, we will use a GAN to generate adversarial images that can deceive a target image classification model (used for satellite image analysis) without it being easily detectable.

Overview:

Satellite Image Analysis: We simulate the task of image classification for satellite images, which might classify land types (e.g., forest, urban, water).
Training a GAN: Use a pre-trained GAN to create adversarial examples. For simplicity, we can use the Fast Gradient Sign Method (FGSM) with a GAN generator to create these adversarial images.
Cross-Model Evasion: Evaluate whether adversarial images generated by a GAN can evade classification by a different model, thereby demonstrating a potential vulnerability in a cross-model evasion scenario.

Explanation:

Data and Model Setup: We use CIFAR-10 as a proxy for satellite images. The dataset has 10 classes, which helps simulate the classification problem in satellite systems. A simple Convolutional Neural Network (CNN) is used to classify images into different categories. This acts as the target model that an adversary might attack.
GAN Setup: The GAN consists of two components: a generator that creates adversarial perturbations, and a discriminator that evaluates the quality of these perturbations. The generator is used to slightly modify clean images to create adversarial examples.
Red Team Attack - Generating Adversarial Examples: We use the generator model to add slight perturbations to the original images to create adversarial images. These perturbations are clipped to maintain pixel values within a valid range (0-1), simulating a Cross-Model Evasion Attack.
Evaluation and Comparison: The classifier is evaluated on both clean and adversarial images, and the predictions are compared. The visual results show the differences between clean and adversarial images, allowing us to see how the attack affects the classification performance.
Visualization: The original and adversarial images are displayed along with their predicted labels to demonstrate how adversarial attacks can change the predictions of the classifier.

Red Team Testing Insights:

Vulnerability Assessment: This Red Team test evaluates how well a satellite image analysis model can handle adversarial attacks. The adversarial examples aim to deceive the model into making incorrect classifications.
Cross-Model Evasion: The attack shows that a model trained on clean data can be vulnerable when tested with adversarial examples generated by a different model, which is a cross-model evasion scenario.
Potential Mitigations: Techniques such as adversarial training, where the model is trained with adversarial examples, can improve the robustness of the classifier against such attacks.

Execute the Script

AI-RTT Exercise #7

This exercise explores automated testing for Large Language Models (LLMs). Automated Red Team Test tools have emerged as a vital component in the arsenal of AI Security Professionals, offering scalable, repeatable, and efficient methods for adversarial testing. To demonstrate, this exercise will use the Microsoft® developed AI-RTT tool know as "PyRIT" (Python Risk Identification Toolkit for generative AI). To help understand the test metrics and results, we included a test report that is displayed at the end of test, either html or json. You will be prompted to enter the desired format and then the report will be displayed. The report is also archived in the Google Colab "reports" folder for export.

The PyRIT-based AI-RTT in this script targets Large Language Models (LLMs), specifically GPT-2, to evaluate security vulnerabilities in prompt handling, adversarial robustness, and compliance. The focus is on security assessment of an AI-driven system that processes textual inputs, particularly for cybersecurity applications such as DoD systems, AI chatbots, and automated NLP workflows.

The security risks tested include:

Prompt Injection Attacks: Manipulating model instructions to force unintended responses.
Data Leakage Attacks: Extracting sensitive or confidential information from the model.
Insecure Output Handling: Exploiting weaknesses in AI-generated responses to introduce XSS vulnerabilities or malformed responses.

Red Team Testing Strategy:

The testing strategy includes:

Target Model: A GPT-2-based model is initialized for testing with a text-generation pipeline.
Threat Vectors:
- Prompt Injection Attacks: Attempts to bypass safety instructions.
- Data Leakage Attempts: Queries designed to trick the model into revealing sensitive data.
- Malicious Output Attacks: Injection of special characters or adversarial content that may introduce XSS vulnerabilities.

Automated Testing Framework:
- Batch Execution: Test cases are run systematically across multiple categories.
- Security Analysis: Model responses are evaluated for risk level, compliance, and indicators of security issues.
- Performance Metrics: Collects response times, success rates, and compliance statistics.

Comprehensive Reporting:
- Results are compiled into JSON and HTML reports for analysis.
- Includes risk assessment breakdown, findings summary, and detailed test logs.

Objectives:

The primary goals of this Red Team assessment are:

Evaluate Model Security:
- Identify vulnerabilities in prompt handling and model compliance.
- Test GPT-2 for susceptibility to adversarial inputs.

Measure Compliance and Risk:
- Assess how well the model follows security and ethical constraints.
- Measure compliance rate for security guidelines.

Identify Performance Bottlenecks:
- Track response times and success rates to identify weaknesses in processing efficiency.

Generate Actionable Security Reports:
- Produce JSON and HTML reports summarizing attack success rates, compliance, and vulnerabilities.

Red Team Testing Insights:

GPT-2 can be tricked into disclosing sensitive data: password and secret leakage was detected.
Model is vulnerable to Prompt Injection: Ignore previous instructions... prompts overrode safety guidelines.
Potential XSS Vulnerabilities: GPT-2 generated malicious HTML output in some cases.
Compliance Rate Measurement: Some responses were non-compliant with security policies.
Need for Defense Strategies: Fine-tuning GPT models on adversarial prompts. Implementing input validation to filter dangerous inputs. Adding sanitization layers before displaying AI-generated responses.

Execute the Script

AI-RTT Exercise #8

Overview:

This exercise uses the Adversarial Robustness Toolbox (ART), an open-source Python library designed to help AI developers defend machine learning models against adversarial attacks. Originally developed by IBM Research, it is now maintained as a community-driven project. ART provides a comprehensive set of tools for evaluating, improving, and certifying the robustness of machine learning models against sophisticated attacks.

Key capabilities of ART include:

Implementing various evasion attacks (FGSM, PGD, DeepFool, etc.)
Deploying defense mechanisms like adversarial training and input preprocessing
Detecting adversarial examples through specialized detection models
Certifying model robustness against specific perturbation types

Explanation:

Using a Financial Institution Use Case, this Python script demonstrates a comprehensive security evaluation framework for financial fraud detection models using ART. It implements a complete end-to-end workflow for assessing and strengthening ML model security in financial institutions.

The script:

Creates a synthetic financial fraud dataset - Simulating transaction data with legitimate and fraudulent patterns
Builds a neural network fraud detection model - Developing a baseline model for identifying suspicious transactions
Performs vulnerability assessment - Testing the model against multiple adversarial attack methods to identify weaknesses
Implements defense mechanisms - Adding protections like feature squeezing, data smoothing, and adversarial training
Develops adversarial example detection - Building a secondary model to identify potential attack attempts
Certifies model robustness - Evaluating the guaranteed performance under various perturbation levels
Generates security reports and visualizations - Providing actionable metrics on model security
Offers production deployment guidelines - Detailing best practices for maintaining model security in real-world environments

Financial institutions can use this framework to understand how attackers might bypass their fraud detection systems, implement appropriate defense mechanisms, and continuously monitor for adversarial attacks. The script serves as both an educational tool and a practical starting point for organizations looking to secure their ML-based financial services.

Red Team Testing Insights:

Attack Method Effectiveness Varies Significantly: Our security evaluation revealed that DeepFool and Carlini-Wagner attacks were significantly more effective at bypassing the fraud detection model than gradient-based attacks like FGSM. This suggests sophisticated attackers could potentially target these specific vulnerability pathways.
Small Perturbations Can Cause Misclassification: Adversarial examples with perturbations as small as 2% of the feature range were able to cause misclassification in the baseline model. These changes would be imperceptible to human analysts reviewing the transaction data.
Model Confidence Remains High During Attacks: Even when successfully fooled, the model often displayed high confidence in its incorrect predictions. This false certainty highlights the need for additional verification systems beyond confidence thresholds.
Feature Importance Exploitation: The most successful attacks targeted features with the highest importance in the model's decision-making process. Red team testing revealed that the transaction amount and frequency features were particularly vulnerable to manipulation.
Temporal Attack Patterns: Sequential attacks that gradually shifted transaction patterns over time were more difficult to detect than sudden changes. This mimics sophisticated fraud campaigns that "warm up" accounts before attempting major fraudulent transactions.

Execute the Script

AI-RMF™ LLC

Welcome to the AI-RTT Playground

Create Your Google Colab Account

AI-RTT Exercise #1

AI-RTT Exercise #2

AI-RTT Exercise #3

AI-RTT Exercise #4

AI-RTT Exercise #5

AI-RTT Exercise #6

AI-RTT Exercise #7

AI-RTT Exercise #8

Subscribe to AI-RTT

This website uses cookies.