Configure Guardrails and Security

    The Capella Model Service provides guardrails and security features to help you manage and secure your Large Language Models (LLMs).

    The following security features are available when deploying LLMs:

    Guardrails

    Guardrails are safety mechanisms designed to help moderate and filter content generated by Large Language Models (LLMs). They help make sure that the outputs of your LLMs align with your organization’s content policies and ethical standards. By implementing guardrails, you can reduce the risk of generating harmful, inappropriate, or biased content.

    Guardrails in the Capella Model Service use the Llama 3.1 NemoGuard 8B safety model to moderate content in user prompts and LLM responses, and classify them as safe or unsafe. The model identifies which content category is violated and blocks unsafe content from being returned to the user.

    When configuring guardrails for your LLM, you can customize which content categories to moderate, such as hate speech, violence, adult content, and more. You also have the option to provide prompts to help define unsafe categories.

    For a full list of content categories and examples of moderation, see Llama 3.1 NemoGuard 8B ContentSafety.
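
    For illustration, the following sketch shows how an application might handle a guardrail violation when calling an LLM deployed through the Capella Model Service. The endpoint URL, deployment name, and response fields are placeholders assumed for this example, not the actual Model Service API; they stand in for whatever your deployment exposes.

    ```python
    # Hypothetical sketch: calling a guarded LLM endpoint and handling a
    # blocked response. The URL, credential, and response fields below are
    # placeholders, not the actual Capella Model Service API.
    import requests

    ENDPOINT = "https://example-model-endpoint/v1/chat/completions"  # placeholder URL
    API_KEY = "YOUR_API_KEY"                                         # placeholder credential

    def ask(prompt: str) -> str:
        response = requests.post(
            ENDPOINT,
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={
                "model": "my-deployed-llm",  # placeholder deployment name
                "messages": [{"role": "user", "content": prompt}],
            },
            timeout=30,
        )
        body = response.json()

        # Assumption for this sketch: when guardrails classify the prompt or the
        # completion as unsafe, the service returns an error-style payload
        # instead of generated content.
        if "error" in body:
            return f"Blocked by guardrails: {body['error'].get('message', 'unsafe content')}"

        return body["choices"][0]["message"]["content"]

    print(ask("Tell me about Couchbase Capella."))
    ```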

    Considerations

    When configuring guardrails for your LLM, consider the following factors:

    Content Policies

    Make sure that the selected content categories align with your organization’s content policies and ethical standards. Regularly review and update your guardrail settings to keep pace with evolving content standards.

    User Experience

    Balance content safety and user experience by choosing appropriate sensitivity levels. Excessively high sensitivity may lead to over-blocking, while low sensitivity may allow inappropriate content to pass through. Both lead to a poor user experience.

    Jailbreak Protection

    Jailbreaks are malicious instructions designed to override the safety and security features built into a model. The jailbreak protection feature in the Capella Model Service uses the NVIDIA NemoGuard Jailbreak Detect model to help safeguard your LLM from these types of attacks. When you enable jailbreak protection, the jailbreak detection model monitors user prompts and LLM responses for potential jailbreak attempts.

    The jailbreak score threshold allows you to adjust the sensitivity of jailbreak detection. You can choose a value between -1 and 1, where positive values indicate requests classified as jailbreaks and negative values indicate safe requests.

    The jailbreak detection model uses a default score threshold of 0.75.
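
    As a simple illustration, the sketch below shows how a score threshold of this kind works: a detection score above the threshold is treated as a jailbreak attempt, while lower scores pass through. The scoring function itself is a stand-in for this example; in the Capella Model Service the score is produced by the NemoGuard Jailbreak Detect model.

    ```python
    # Minimal sketch of threshold-based jailbreak filtering. The scores passed in
    # are placeholders; in practice they come from the jailbreak detection model
    # and fall between -1 (clearly safe) and 1 (clearly a jailbreak attempt).
    DEFAULT_THRESHOLD = 0.75

    def is_jailbreak(score: float, threshold: float = DEFAULT_THRESHOLD) -> bool:
        """Classify a request as a jailbreak attempt when its score exceeds the threshold."""
        return score > threshold

    # Lowering the threshold makes detection more aggressive (suited to high-risk
    # applications); raising it reduces false positives (suited to applications
    # that prioritize user experience).
    print(is_jailbreak(0.80))                 # True with the default threshold of 0.75
    print(is_jailbreak(0.80, threshold=0.9))  # False with a more lenient threshold
    print(is_jailbreak(-0.4))                 # False: negative scores indicate safe requests
    ```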

    For examples of jailbreak detection using the NemoGuard Jailbreak Detect model, see NVIDIA nemoguard-jailbreak-detect.

    Considerations

    When configuring jailbreak protection for your LLM, consider the following factors:

    Application Security

    If your application handles sensitive data or operates in a high-risk environment, consider using a lower jailbreak score threshold to enhance security. A lower threshold causes the model to flag potential jailbreak attempts more aggressively.

    User Experience

    If your application prioritizes user experience and requires fewer interruptions, consider using a higher jailbreak score threshold. This allows for more leniency in detecting jailbreak attempts, reducing the chances of false positives that may disrupt user interactions.