Deploy a Large Language Model (LLM)

    The Capella Model Service can deploy Large Language Models (LLMs) close to your data in Capella to power your AI applications.

    The Capella Model Service offers endpoints for popular LLMs and supports features such as keyword filtering, caching, and guardrails.

    Prerequisites

    Procedure

    1. From your organization, go to AI Services > Models.

    2. Click Deploy New Model.

    3. Choose an LLM to deploy:

      1. Click View All Models.

      2. Click the Type: All filter and deselect the Text to Embedding option, or use the search bar to find a specific LLM.

      3. Click the model you want to deploy.

      4. Click Use Selected Model.

    4. (Optional) Change the autogenerated name for the LLM that you’re deploying.

    5. Choose the AWS region where you want to deploy the model.

    6. Choose the compute and GPU size configuration to run the model.

      The default is the minimum supported compute size available for the model in your chosen region.

    7. (Optional) Apply advanced configuration options:

      If you change or enable any advanced configurations, such as value adds or security features, after deployment, your existing Model Service API keys will stop working, and you must create a new API key. For more information, see Value Adds and Security Features.

      Quantization

      Available for select models. Apply a quantization level to reduce the model size and improve inference speed. Quantization can slightly reduce model accuracy, so test that it works well for your application.

      For more information, see Configure LLM Performance.
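      Conceptually, quantization trades numeric precision for a smaller memory footprint. The toy sketch below is illustrative only, not how the Model Service implements it: it shows symmetric int8 quantization of a float32 weight vector, including the rounding error that explains the slight accuracy cost.

      ```python
      # Toy illustration of quantization, not the Model Service implementation:
      # storing float32 weights as int8 shrinks them 4x but adds rounding error,
      # which is why quantization can slightly reduce accuracy.
      import numpy as np

      weights = np.random.randn(1024).astype(np.float32)

      # Symmetric int8 quantization: scale by the largest magnitude.
      scale = np.abs(weights).max() / 127.0
      quantized = np.round(weights / scale).astype(np.int8)

      # Dequantize to measure the error that quantization introduced.
      restored = quantized.astype(np.float32) * scale
      print(f"size: {weights.nbytes} -> {quantized.nbytes} bytes")  # 4096 -> 1024
      print(f"max rounding error: {np.abs(weights - restored).max():.6f}")
      ```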

      Optimization

      Available for select models. Use optimization techniques to improve model performance and reduce latency. Choose from predefined profiles tailored to requirements such as low latency or high throughput.

      For more information, see Configure LLM Performance.

      Guardrails

      Use the guardrails configuration options to set up content moderation and safety features for your LLM.

      For more information, see Configure Guardrails.

      Jailbreak

      Detect and block jailbreak attempts to safeguard your model.

      For more information, see Configure Jailbreak Protection.

      Caching

      The conversation cache stores past conversations and improves the conversational experience by serving responses from the cache when a prompt matches a past or semantically similar conversation.

      For more information, see Configure Caching.
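      As a rough mental model (illustrative only; the Model Service manages its conversation cache for you), a semantic cache compares an embedding of the incoming prompt against embeddings of cached prompts and returns the stored response on a close match. The embeddings below are short lists of floats standing in for real embedding vectors.

      ```python
      # Toy sketch of a semantic cache; the vectors here are stand-ins
      # for real prompt embeddings.
      import math

      cache = []  # list of (embedding, response) pairs

      def cosine(a, b):
          """Cosine similarity between two vectors."""
          dot = sum(x * y for x, y in zip(a, b))
          norm = math.hypot(*a) * math.hypot(*b)
          return dot / norm if norm else 0.0

      def lookup(embedding, threshold=0.9):
          """Return a cached response for a semantically similar prompt, if any."""
          for cached_embedding, response in cache:
              if cosine(embedding, cached_embedding) >= threshold:
                  return response
          return None

      cache.append(([1.0, 0.0], "cached answer"))
      print(lookup([0.98, 0.05]))  # similar enough -> "cached answer"
      ```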

      Async Processing

      Increase throughput by processing jobs asynchronously as system capacity becomes available. Tasks are queued and handled as resources permit, which improves overall efficiency.

      For more information, see Configure Async Processing.
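      As an illustrative sketch of the queuing idea (not the service's actual implementation), submissions are enqueued immediately and a worker pool drains the queue as capacity allows, so callers do not block waiting for a free slot:

      ```python
      # Toy sketch of async task queuing: submissions return immediately and
      # a fixed pool of workers (standing in for system capacity) drains the queue.
      import queue
      import threading

      jobs = queue.Queue()

      def worker():
          while True:
              prompt = jobs.get()
              print(f"processed: {prompt}")  # placeholder for the inference call
              jobs.task_done()

      for _ in range(2):  # two workers stand in for available capacity
          threading.Thread(target=worker, daemon=True).start()

      for i in range(10):
          jobs.put(f"prompt {i}")  # enqueue without blocking the caller

      jobs.join()  # wait until every queued job has been handled
      ```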

      Keyword Filtering

      Add up to 10 comma-separated keywords to filter out of user prompts and responses.

      For more information, see Configure Keyword Filtering.
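      The sketch below shows the idea in miniature (illustrative only, not the service's implementation): a comma-separated keyword list is compiled into a pattern and redacted from prompt and response text.

      ```python
      # Toy sketch of keyword filtering: redact a small comma-separated
      # keyword list from prompts and responses, case-insensitively.
      import re

      keywords = "password, secret, internal".split(", ")  # up to 10 keywords
      pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)

      def filter_text(text: str) -> str:
          """Replace any configured keyword with a redaction marker."""
          return pattern.sub("[filtered]", text)

      print(filter_text("My PASSWORD is hunter2"))  # -> My [filtered] is hunter2
      ```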

    8. Click Deploy Model.

    Next Steps

    The Models page opens with your model in a deploying state. Once the model has finished deploying, you can view the model details and manage the model by expanding its listing on the Models page.

    To create API keys for your deployed model, see Generate Model Service API Keys.
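    Once you have an API key, a minimal sketch of calling your deployed model might look like the following, assuming the model exposes an OpenAI-compatible chat completions endpoint. The endpoint URL, model name, and key below are placeholders; copy the real values from your model's details page.

    ```python
    # Minimal sketch of calling a deployed model, assuming an OpenAI-compatible
    # chat completions endpoint. All angle-bracket values are placeholders.
    import requests

    MODEL_ENDPOINT = "https://<your-model-endpoint>/v1/chat/completions"
    API_KEY = "<your-model-service-api-key>"

    response = requests.post(
        MODEL_ENDPOINT,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": "<deployed-model-name>",
            "messages": [{"role": "user", "content": "Summarize our return policy."}],
        },
        timeout=30,
    )
    response.raise_for_status()
    print(response.json()["choices"][0]["message"]["content"])
    ```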