diff --git a/docs/assets/images/guides/mlops/serving/deployment_simple_form_py_endp_env.png b/docs/assets/images/guides/mlops/serving/deployment_simple_form_py_endp_env.png new file mode 100644 index 0000000000..285ebb899b Binary files /dev/null and b/docs/assets/images/guides/mlops/serving/deployment_simple_form_py_endp_env.png differ diff --git a/docs/concepts/mlops/serving.md index 37c3d4c358..b7c9596ef3 100644 --- a/docs/concepts/mlops/serving.md +++ b/docs/concepts/mlops/serving.md @@ -1,19 +1,18 @@ -In Hopsworks, you can easily deploy models from the model registry in KServe or in Docker containers (for Hopsworks Community). -KServe is the defacto open-source framework for model serving on Kubernetes. -You can deploy models in either programs, using the HSML library, or in the UI. +In Hopsworks, you can easily deploy models from the model registry using [KServe](https://kserve.github.io/website/latest/), the standard open-source framework for model serving on Kubernetes. +You can deploy models programmatically using the HSML library or via the UI. A KServe model deployment can include the following components: -**`Transformer`** +**`Predictor (KServe component)`** -: A ^^pre-processing^^ and ^^post-processing^^ component that can transform model inputs before predictions are made, and predictions before these are delivered back to the client. +: A predictor runs a model server (Python, TensorFlow Serving, or vLLM) that loads a trained model, handles inference requests and returns predictions. 
-**`Predictor`** +**`Transformer (KServe component)`** -: A predictor is a ML model in a Python object that takes a feature vector as input and returns a prediction as output. +: A ^^pre-processing^^ and ^^post-processing^^ component that can transform model inputs before predictions are made, and predictions before these are delivered back to the client. Not available for vLLM deployments. **`Inference Logger`** -: Hopsworks logs inputs and outputs of transformers and predictors to a ^^Kafka topic^^ that is part of the same project as the model. +: Hopsworks logs inputs and outputs of transformers and predictors to a ^^Kafka topic^^ that is part of the same project as the model. Not available for vLLM deployments. **`Inference Batcher`** @@ -21,12 +20,18 @@ A KServe model deployment can include the following components: **`Istio Model Endpoint`** -: You can publish a model over ^^REST(HTTP)^^ or ^^gRPC^^ using a Hopsworks API key. +: You can publish a model over ^^REST(HTTP)^^ or ^^gRPC^^ using a Hopsworks API key, accessible via path-based routing through Istio. API keys have scopes to ensure the principle of least privilege access control to resources managed by Hopsworks. + !!! warning "Host-based routing" + The Istio Model Endpoint supports host-based routing for inference requests; however, this approach is considered legacy. Path-based routing is recommended for new deployments. + Models deployed on KServe in Hopsworks can be easily integrated with the Hopsworks Feature Store using either a Transformer or Predictor Python script, that builds the predictor's input feature vector using the application input and pre-computed features from the Feature Store. !!! info "Model Serving Guide" More information can be found in the [Model Serving guide](../../user_guides/mlops/serving/index.md). + +!!! tip "Python deployments" + For deploying Python scripts without a model artifact, see the [Python Deployments](../../user_guides/projects/python-deployment.md) page. 
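The `Transformer` component described above is implemented as a Python script. By Hopsworks convention it defines a `Transformer` class with `preprocess` and `postprocess` hooks; the sketch below is illustrative only, and the rescaling and labelling logic inside the hooks is hypothetical:

```python
class Transformer(object):
    """Illustrative transformer script: pre- and post-process inference payloads."""

    def preprocess(self, inputs):
        # Transform model inputs before they reach the predictor,
        # e.g., rescale a hypothetical raw amount feature.
        inputs["instances"] = [
            [float(x) / 100.0 for x in instance] for instance in inputs["instances"]
        ]
        return inputs

    def postprocess(self, outputs):
        # Transform predictions before they are returned to the client,
        # e.g., map hypothetical raw scores to labels.
        outputs["predictions"] = [
            "fraud" if score > 0.5 else "legit" for score in outputs["predictions"]
        ]
        return outputs
```

Because the transformer runs in its own KServe component, heavy pre-processing like this scales independently of the predictor.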
diff --git a/docs/user_guides/fs/feature_view/feature-vectors.md b/docs/user_guides/fs/feature_view/feature-vectors.md index c16778a532..8f47b012e2 100644 --- a/docs/user_guides/fs/feature_view/feature-vectors.md +++ b/docs/user_guides/fs/feature_view/feature-vectors.md @@ -239,7 +239,6 @@ However, you can retrieve the untransformed feature vectors without applying mod entry=[{"id": 1}, {"id": 2}], transform=False ) - ``` ## Retrieving feature vector without on-demand features @@ -258,7 +257,6 @@ To achieve this, set the parameters `transform` and `on_demand_features` to `Fa entry=[{"id": 1}, {"id": 2}], transform=False, on_demand_features=False ) - ``` ## Passing Context Variables to Transformation Functions @@ -274,7 +272,6 @@ After [defining a transformation function using a context variable](../transform entry=[{"pk1": 1}], transformation_context={"context_parameter": 10} ) - ``` ## Choose the right Client diff --git a/docs/user_guides/fs/feature_view/helper-columns.md b/docs/user_guides/fs/feature_view/helper-columns.md index cbb22f305e..cb394e2cf9 100644 --- a/docs/user_guides/fs/feature_view/helper-columns.md +++ b/docs/user_guides/fs/feature_view/helper-columns.md @@ -41,7 +41,6 @@ for computing the [on-demand feature](../../../concepts/fs/feature_group/on_dema inference_helper_columns=["expiry_date"], ) - ``` ### Inference Data Retrieval @@ -88,7 +87,6 @@ However, they can be optionally fetched with inference or training data. ] ] - ``` #### Online inference @@ -129,7 +127,6 @@ However, they can be optionally fetched with inference or training data. passed_features={"days_valid": days_valid}, ) - ``` ## Training Helper columns @@ -156,7 +153,6 @@ For example one might want to use feature like `category` of the purchased produ training_helper_columns=["category"], ) - ``` ### Training Data Retrieval @@ -190,7 +186,6 @@ However, they can be optionally fetched. training_dataset_version=1, training_helper_columns=True ) - ``` !!! 
note diff --git a/docs/user_guides/fs/feature_view/model-dependent-transformations.md b/docs/user_guides/fs/feature_view/model-dependent-transformations.md index 66ebfe518a..cf1cbd142c 100644 --- a/docs/user_guides/fs/feature_view/model-dependent-transformations.md +++ b/docs/user_guides/fs/feature_view/model-dependent-transformations.md @@ -55,7 +55,6 @@ Additionally, Hopsworks also allows users to specify custom names for transforme transformation_functions=[add_two, add_one_multiple], ) - ``` ### Specifying input features @@ -77,7 +76,6 @@ The features to be used by a model-dependent transformation function can be spec ], ) - ``` ### Using built-in transformations @@ -106,7 +104,6 @@ The only difference is that they can either be retrieved from the Hopsworks or i ], ) - ``` To attach built-in transformation functions from the `hopsworks` module they can be directly imported into the code from `hopsworks.builtin_transformations`. @@ -134,7 +131,6 @@ To attach built-in transformation functions from the `hopsworks` module they can ], ) - ``` ## Using Model Dependent Transformations @@ -160,7 +156,6 @@ Model-dependent transformation functions can also be manually applied to a featu # Apply Model Dependent transformations encoded_feature_vector = fv.transform(feature_vector) - ``` ### Retrieving untransformed feature vector and batch inference data @@ -185,5 +180,4 @@ To achieve this, set the `transform` parameter to False. # Fetching untransformed batch data. 
untransformed_batch_data = feature_view.get_batch_data(transform=False) - ``` diff --git a/docs/user_guides/fs/feature_view/training-data.md index d57d452603..de3c3c68fb 100644 --- a/docs/user_guides/fs/feature_view/training-data.md +++ b/docs/user_guides/fs/feature_view/training-data.md @@ -154,7 +154,6 @@ Once you have [defined a transformation function using a context variable](../tr transformation_context={"context_parameter": 10}, ) - ``` ## Read training data with primary key(s) and event time diff --git a/docs/user_guides/mlops/serving/api-protocol.md index b5f11e8989..01e69ab146 100644 --- a/docs/user_guides/mlops/serving/api-protocol.md +++ b/docs/user_guides/mlops/serving/api-protocol.md @@ -3,11 +3,12 @@ ## Introduction Hopsworks supports both REST and gRPC as API protocols for sending inference requests to model deployments. -While REST API protocol is supported in all types of model deployments, support for gRPC is only available for models served with [KServe](predictor.md#serving-tool). +While the REST API protocol is supported in all types of model deployments, gRPC is only supported for **Python model server** deployments with a model artifact. -!!! warning - At the moment, the gRPC API protocol is only supported for **Python model deployments** (e.g., scikit-learn, xgboost). - Support for Tensorflow model deployments is coming soon. +!!! warning "gRPC constraints" + - gRPC is only supported for Python model server deployments + - A model artifact is required; gRPC is not available for Python deployments (scripts without a model artifact) + - gRPC uses port 8081 with the `h2c` protocol ## GUI @@ -40,17 +41,7 @@ To navigate to the advanced creation form, click on `Advanced options`. ### Step 3: Select the API protocol -Enabling gRPC as the API protocol for a model deployment requires KServe as the serving platform for the deployment. 
-Make sure that KServe is enabled by activating the corresponding checkbox. - -

-

- KServe enabled in advanced deployment form -
Enable KServe in the advanced deployment form
-
-

- -Then, you can select the API protocol to be enabled in your model deployment. +You can select the API protocol to be enabled in your model deployment in the advanced deployment form.

@@ -102,7 +93,6 @@ Once you are done with the changes, click on `Create new deployment` at the bott my_deployment = ms.create_deployment(my_predictor) my_deployment.save() - ``` ### API Reference diff --git a/docs/user_guides/mlops/serving/autoscaling.md b/docs/user_guides/mlops/serving/autoscaling.md new file mode 100644 index 0000000000..e61870a9ed --- /dev/null +++ b/docs/user_guides/mlops/serving/autoscaling.md @@ -0,0 +1,55 @@ +--- +description: Documentation on how to configure scaling for a deployment +--- + +# How To Configure Scaling For A Deployment + +## Introduction + +Deployments use [Knative Pod Autoscaler (KPA)](https://knative.dev/docs/serving/autoscaling/) to automatically scale the number of replicas based on traffic. + +??? info "Show scale metrics" + + | Scale Metric | Default Target | Description | + | ------------ | -------------- | ------------------------------------ | + | RPS | 200 | Requests per second per replica | + | CONCURRENCY | 100 | Concurrent requests per replica | + +**Scaling parameters:** + +- `minInstances` — Minimum replicas (0 enables scale-to-zero) +- `maxInstances` — Maximum replicas (must be ≥1, cannot be less than min) +- `panicWindowPercentage` — Panic window as percentage of stable window (default: 10.0, range: 1-100) +- `stableWindowSeconds` — Stable window duration in seconds (default: 60, range: 6-3600) +- `panicThresholdPercentage` — Traffic threshold to trigger panic mode (default: 200.0, must be >0) +- `scaleToZeroRetentionSeconds` — Time to retain pods before scaling to zero (default: 0, must be ≥0) + +!!! note "Cluster-level constraints" + Administrators can set cluster-wide limits on the maximum and minimum number of instances. When the minimum is set to 0, scale-to-zero is enforced for all deployments. 
+ +## Code + +=== "Python" + + ```python + from hsml.resources import PredictorResources, Resources + + minimum_res = Resources(cores=1, memory=256, gpus=1) + maximum_res = Resources(cores=2, memory=512, gpus=1) + + predictor_res = PredictorResources( + num_instances=1, + requests=minimum_res, + limits=maximum_res + ) + + my_predictor = ms.create_predictor( + my_model, + resources=predictor_res, + # autoscaling + min_instances=1, + max_instances=5, + scale_metric="RPS", + scale_target=100 + ) + ``` diff --git a/docs/user_guides/mlops/serving/deployment-state.md b/docs/user_guides/mlops/serving/deployment-state.md index 7bc60e5815..f65fe16269 100644 --- a/docs/user_guides/mlops/serving/deployment-state.md +++ b/docs/user_guides/mlops/serving/deployment-state.md @@ -86,7 +86,6 @@ Additionally, you can find the nº of instances currently running by scrolling d ```python deployment = ms.get_deployment("mydeployment") - ``` ### Step 3: Inspect deployment state @@ -98,7 +97,6 @@ Additionally, you can find the nº of instances currently running by scrolling d state.describe() - ``` ### Step 4: Check nº of running instances @@ -112,7 +110,6 @@ Additionally, you can find the nº of instances currently running by scrolling d # nº of transformer instances deployment.transformer.resources.describe() - ``` ### API Reference @@ -127,16 +124,23 @@ The status of a deployment is a high-level description of its current state. ??? info "Show deployment status" - | Status | Description | - | -------- | ------------------------------------------------------------------------------------------------------------------------ | - | CREATED | Deployment has never been started | - | STARTING | Deployment is starting | - | RUNNING | Deployment is ready and running. Predictions are served without additional latencies. | - | IDLE | Deployment is ready, but idle. 
Higher latencies (i.e., cold-start) are expected in the first incoming inference requests | - | FAILED | Deployment is in a failed state, which can be due to multiple reasons. More details can be found in the condition | - | UPDATING | Deployment is applying updates to the running instances | - | STOPPING | Deployment is stopping | - | STOPPED | Deployment has been stopped | + | Status | Description | + | -------- | ----------------------------------------------------------------------------------------------------------------------------------------------- | + | CREATING | Deployment artifacts are being prepared | + | CREATED | Deployment has never been started | + | STARTING | Deployment is starting | + | RUNNING | Deployment is ready and running. Predictions are served without additional latencies. | + | IDLE | Deployment is ready but scaled to zero or has no active replicas. Higher latencies (cold-start) are expected on the first inference request. | + | FAILED | Terminal state. The deployment has encountered an unrecoverable error. More details can be found in the status condition. | + | UPDATING | Deployment is applying updates to the running instances | + | STOPPING | Deployment is stopping | + | STOPPED | Deployment has been stopped | + +## How States Are Determined + +Deployment state is determined from multiple sources: the database state (whether the deployment has been deployed and its revision), KServe InferenceService conditions, pod presence (available replicas for predictor and transformer), and the artifact filesystem (whether the deployment artifact files are ready). + +A revision ID and deployment version are used to distinguish between STARTING (first generation) and UPDATING (subsequent changes to a running deployment). 
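As a rough illustration of how a client might interpret the deployment statuses listed in the table above (the status strings come from the table; the grouping logic is our own sketch, not a Hopsworks API):

```python
# Statuses copied from the deployment status table; grouping is illustrative.
COLD_START_STATUSES = {"STARTING", "IDLE"}
TERMINAL_STATUSES = {"FAILED", "STOPPED"}

def expect_cold_start(status: str) -> bool:
    """True if the first inference request may see extra (cold-start) latency."""
    return status in COLD_START_STATUSES

def needs_intervention(status: str) -> bool:
    """True if the deployment will not become ready without user action."""
    return status in TERMINAL_STATUSES
```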
## Deployment conditions diff --git a/docs/user_guides/mlops/serving/deployment.md b/docs/user_guides/mlops/serving/deployment.md index c514a0fd8f..aa075b9c56 100644 --- a/docs/user_guides/mlops/serving/deployment.md +++ b/docs/user_guides/mlops/serving/deployment.md @@ -8,12 +8,13 @@ description: Documentation on how to deployment Machine Learning (ML) models and In this guide, you will learn how to create a new deployment for a trained model. -!!! warning - This guide assumes that a model has already been trained and saved into the Model Registry. - To learn how to create a model in the Model Registry, see [Model Registry Guide](../registry/index.md#exporting-a-model) +!!! note + This guide covers model deployments, which require a model saved in the Model Registry. + To learn how to create a model in the Model Registry, see [Model Registry Guide](../registry/index.md#exporting-a-model). + For Python deployments (running a Python script without a model artifact), see [Python Deployments](../../projects/python-deployment.md). -Deployments are used to unify the different components involved in making one or more trained models online and accessible to compute predictions on demand. -For each deployment, there are four concepts to consider: +Model deployments are used to unify the different components involved in making one or more trained models online and accessible to compute predictions on demand. +For each model deployment, there are four concepts to understand: !!! info "" 1. [Model files](#model-files) @@ -42,29 +43,17 @@ Both options will open the deployment creation form. A simplified creation form will appear including the most common deployment fields from all available configurations. We provide default values for the rest of the fields, adjusted to the type of deployment you want to create. -In the simplified form, select the model framework used to train your model. 
-Then, select the model you want to deploy from the list of available models under `pick a model`. - -After selecting the model, the rest of fields are filled automatically. -We pick the last model version and model artifact version available in the Model Registry. -Moreover, we infer the deployment name from the model name. - -!!! notice "Deployment name validation rules" - A valid deployment name can only contain characters a-z, A-Z and 0-9. - -!!! info "Predictor script for Python models" - For Python models, you must select a custom [predictor script](#predictor) that loads and runs the trained model by clicking on `From project` or `Upload new file`, to choose an existing script in the project file system or upload a new script, respectively. - -If you prefer, change the name of the deployment, model version or [artifact version](#artifact-files). -Then, click on `Create new deployment` to create the deployment for your model. +In the simplified form, choose the model server that will be used to serve your model.

- Select the model framework -
Select the model framework
+ Select the model server +
Select the model server

+Then, select the model you want to deploy from the list of available models under `pick a model`. +

Select the model @@ -72,6 +61,19 @@ Then, click on `Create new deployment` to create the deployment for your model.

+After selecting the model, select a model version and give your model deployment a name. + +!!! notice "Deployment name validation rules" + A valid deployment name can only contain characters a-z, A-Z and 0-9. + +!!! info "Predictor script for Python models" + For Python models, you must select a custom [predictor script](#predictor) that loads and runs the trained model by clicking on `From project` or `Upload new file`, to choose an existing script in the project file system or upload a new script, respectively. + +!!! info "Server configuration file for vLLM" + For vLLM deployments, a server configuration file is required. See the [Predictor Guide](predictor.md#server-configuration-file) for more details. + +Lastly, click on `Create new deployment` to create the deployment for your model. + ### Step 3 (Optional): Advanced configuration Optionally, you can access and adjust other parameters of the deployment configuration by clicking on `Advanced options`. @@ -82,28 +84,14 @@ Optionally, you can access and adjust other parameters of the deployment configu
Advanced options. Go to advanced deployment creation form

- -You will be redirected to a full-page deployment creation form where you can see all the default configuration values we selected for your deployment and adjust them according to your use case. -Apart from the aforementioned simplified configuration, in this form you can setup the following components: - -!!! info "Deployment advanced options" - 1. [Predictor](#predictor) - 2. [Transformer](#transformer) - 3. [Inference logger](predictor.md#inference-logger) - 4. [Inference batcher](predictor.md#inference-batcher) - 5. [Resources](predictor.md#resources) - 6. [API protocol](predictor.md#api-protocol) +You will be redirected to a full-page deployment creation form, where you can review all default configuration values and customize them to fit your requirements. In addition to the basic settings, this form allows you to further configure the [Predictor](#predictor) and [Transformer](#transformer) KServe components of your model deployment. Once you are done with the changes, click on `Create new deployment` at the bottom of the page to create the deployment for your model. -### Step 4: (Kueue enabled) Select a Queue - -If the cluster is installed with Kueue enabled, you will need to select a queue in which the deployment should run. -This can be done from `Advance configuration -> Scheduler section`. +!!! info "Predictor script" + Depending on the model server, a predictor script may be required to load and serve your model. See the [Predictor Guide](predictor.md) for more details. -![Default queue for job](../../../assets/images/guides/project/scheduler/job_queue.png) - -### Step 5: Deployment creation +### Step 4: Deployment creation Wait for the deployment creation process to finish. @@ -114,7 +102,7 @@ Wait for the deployment creation process to finish.

-### Step 6: Deployment overview +### Step 5: Deployment overview Once the deployment is created, you will be redirected to the list of all your existing deployments in the project. You can use the filters on the top of the page to easily locate your new deployment. @@ -150,44 +138,28 @@ After that, click on the new deployment to access the overview page. mr = project.get_model_registry() ``` -### Step 2: Create deployment +### Step 2: Retrieve your trained model -Retrieve the trained model you want to deploy. +Retrieve the trained model you want to deploy using the Model Registry handle. === "Python" ```python my_model = mr.get_model("my_model", version=1) - ``` -#### Option A: Using the model object - -=== "Python" - - ```python - my_deployment = my_model.deploy() - +### Step 3: Deploy your trained model - ``` - -#### Option B: Using the Model Serving handle +Create a deployment for your model by calling `.deploy()` on the model metadata object. This will create a deployment for your model with default values. === "Python" ```python - # get Hopsworks Model Serving handle - ms = project.get_model_serving() - - my_predictor = ms.create_predictor(my_model) - my_deployment = my_predictor.deploy() - - # or - my_deployment = ms.create_deployment(my_predictor) - my_deployment.save() - + my_deployment = my_model.deploy() + # optionally, start your model deployment + my_deployment.start() ``` ### API Reference @@ -203,15 +175,12 @@ Inside a model deployment, the local path to the model files is stored in the `M Moreover, you can explore the model files under the `/Models///Files` directory using the File Browser. !!! warning - All files under `/Models` are managed by Hopsworks. - Changes to model files cannot be reverted and can have an impact on existing model deployments. + All files under `/Models` and `/Deployments` are managed by Hopsworks. + Manual changes to these files cannot be reverted and can have an impact on existing model deployments. 
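Inside the deployment, a predictor script typically resolves the model files through the `MODEL_FILES_PATH` environment variable described above. A minimal sketch (the `Predict` class with a `predict` method follows the common Hopsworks predictor-script shape; the pickle filename and the toy prediction logic are hypothetical):

```python
import os


class Predict(object):
    """Sketch of a Python predictor script."""

    def __init__(self):
        # MODEL_FILES_PATH is set by Hopsworks inside the running deployment.
        model_dir = os.environ["MODEL_FILES_PATH"]
        self.model_file = os.path.join(model_dir, "model.pkl")
        # A real script would load the model here, e.g.:
        # self.model = joblib.load(self.model_file)

    def predict(self, inputs):
        # Stand-in logic; a real script would call self.model.predict(inputs).
        return [sum(instance) for instance in inputs]
```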
## Artifact Files -Artifact files are files involved in the correct startup and running of the model deployment. -The most important files are the **predictor** and **transformer scripts**. -The former is used to load and run the model for making predictions. -The latter is typically used to apply transformations on the model inputs at inference time before making predictions. +Artifact files are essential for the proper initialization and operation of a model deployment. The most critical artifact files are the **predictor** and **transformer scripts**. The predictor script loads the trained model and handles prediction requests, while the transformer script applies any necessary input transformations before inference. Predictor and transformer scripts run on separate components and, therefore, scale independently of each other. !!! tip @@ -220,40 +189,26 @@ Predictor and transformer scripts run on separate components and, therefore, sca Additionally, artifact files can also contain a **server configuration file** that helps detach configuration used within the model deployment from the model server or the implementation of the predictor and transformer scripts. Inside a model deployment, the local path to the configuration file is stored in the `CONFIG_FILE_PATH` environment variable (see [environment variables](../serving/predictor.md#environment-variables)). -Every model deployment runs a specific version of the artifact files, commonly referred to as artifact version. ==One or more model deployments can use the same artifact version== (i.e., same predictor and transformer scripts). -Artifact versions are unique for the same model version. - -When a new deployment is created, a new artifact version is generated in two cases: - -- the artifact version in the predictor is set to `CREATE` (see [Artifact Version](./predictor.md#environment-variables)) -- no model artifact with the same files has been created before. 
+Each deployment tracks its artifact files through a ==deployment version== — an integer (1, 2, 3...) that is incremented whenever the artifact content changes (e.g., updating a predictor script or configuration file). Inside a model deployment, the local path to the artifact files is stored in the `ARTIFACT_FILES_PATH` environment variable (see [environment variables](../serving/predictor.md#environment-variables)). -Moreover, you can explore the artifact files under the `/Models///Artifacts/` directory using the File Browser. !!! warning - All files under `/Models` are managed by Hopsworks. - Changes to artifact files cannot be reverted and can have an impact on existing model deployments. + All files under `/Models` and `/Deployments` are managed by Hopsworks. + Manual changes to these files cannot be reverted and can have an impact on existing model deployments. -!!! tip "Additional files" - Currently, the artifact files can only include predictor and transformer scripts, and a configuration file. - Support for additional files (e.g., other resources) is coming soon. +!!! tip "vLLM omni mode" + For vLLM deployments, the server configuration file supports a `#HOPSWORKS omni: true` directive to enable omni mode. ## Predictor -Predictors are responsible for running the model server that loads the trained model, listens to inference requests and returns prediction results. +Predictors are responsible for running the model server that loads the trained model, handles inference requests and returns prediction results. To learn more about predictors, see the [Predictor Guide](predictor.md) -!!! note - Only one predictor is supported in a deployment. - -!!! info - Model artifacts are assigned an incremental version number, being `0` the version reserved for model artifacts that do not contain predictor or transformer scripts (i.e., shared artifacts containing only the model files). 
- ## Transformer Transformers are used to apply transformations on the model inputs before sending them to the predictor for making predictions using the model. To learn more about transformers, see the [Transformer Guide](transformer.md). !!! warning - Transformers are only supported in KServe deployments. + Transformers are not available for vLLM deployments. diff --git a/docs/user_guides/mlops/serving/external-access.md b/docs/user_guides/mlops/serving/external-access.md index a02c8c2500..a4371bc370 100644 --- a/docs/user_guides/mlops/serving/external-access.md +++ b/docs/user_guides/mlops/serving/external-access.md @@ -119,9 +119,13 @@ You can create API keys to authenticate your inference requests by clicking on t Depending on the type of model deployment, the URI of the model server can differ (e.g., `/chat/completions` for LLM deployments or `/predict` for traditional model deployments). You can find the corresponding URI on every model deployment card. -In addition to the `Authorization` header containing the API key, the `Host` header needs to be set according to the model deployment where the inference requests are sent to. -This header is used by the ingress to route the inference requests to the corresponding model deployment. -You can find the `Host` header value in the model deployment card. +Inference requests use path-based routing with the format: + +```text +https:///v1/// +``` + +Include the `Authorization` header containing the API key. See the [REST API Guide](rest-api.md) for details on path construction per model server type. !!! tip "Code snippets" For clients sending inference requests using libraries similar to curl or OpenAI API-compatible libraries (e.g., LangChain), you can find code snippet examples by clicking on the `Curl >_` and `LangChain >_` buttons. 
diff --git a/docs/user_guides/mlops/serving/index.md b/docs/user_guides/mlops/serving/index.md index bc248e01f3..68612ef47f 100644 --- a/docs/user_guides/mlops/serving/index.md +++ b/docs/user_guides/mlops/serving/index.md @@ -3,19 +3,19 @@ ## Deployment Assuming you have already created a model in the [Model Registry](../registry/index.md), a deployment can now be created to prepare a model artifact for this model and make it accessible for running predictions behind a REST or gRPC endpoint. -Follow the [Deployment Creation Guide](deployment.md) to create a Deployment for your model. -### Predictor +Refer to the [Deployment Creation Guide](deployment.md) for step-by-step instructions on creating a deployment for your model. For details on monitoring the status and lifecycle of an existing deployment, see the [Deployment State Guide](deployment-state.md). -Predictors are responsible for running a model server that loads a trained model, handles inference requests and returns predictions, see the [Predictor Guide](predictor.md). +!!! tip "Python deployments" + If you want to deploy a Python script without a model artifact, see the [Python Deployments](../../projects/python-deployment.md) page. -### Transformer +### Predictor (KServe component) -Transformers are used to apply transformations on the model inputs before sending them to the predictor for making predictions using the model, see the [Transformer Guide](transformer.md). +Predictors are responsible for running a model server that loads a trained model, handles inference requests and returns predictions, see the [Predictor Guide](predictor.md). -### Resource Allocation +### Transformer (KServe component) -Configure the resources to be allocated for predictor and transformer in a model deployment, see the [Resource Allocation Guide](resources.md). 
+Transformers are used to apply transformations on the model inputs before sending them to the predictor for making predictions using the model, see the [Transformer Guide](transformer.md). ### Inference Batcher @@ -25,9 +25,25 @@ Configure the predictor to batch inference requests, see the [Inference Batcher Configure the predictor to log inference requests and predictions, see the [Inference Logger Guide](inference-logger.md). +### Resource Allocation + +Configure the resources to be allocated for predictor and transformer in a model deployment, see the [Resource Allocation Guide](resources.md). + +### Autoscaling + +Configure autoscaling for your model deployment, including scale-to-zero, scale metrics and scaling parameters, see the [Autoscaling Guide](autoscaling.md). + +### Scheduling + +Configure scheduling for your model deployment using Kueue queues, see the [Scheduling Guide](scheduling.md). + +### API Protocol + +Choose between REST and gRPC API protocols for your model deployment, see the [API Protocol Guide](api-protocol.md). + ### REST API -Send inference requests to deployed models using REST API, see the [Rest API Guide](rest-api.md). +Send inference requests to deployed models using REST API, see the [REST API Guide](rest-api.md). ### Troubleshooting diff --git a/docs/user_guides/mlops/serving/inference-batcher.md b/docs/user_guides/mlops/serving/inference-batcher.md index d978b900d9..4a75072cff 100644 --- a/docs/user_guides/mlops/serving/inference-batcher.md +++ b/docs/user_guides/mlops/serving/inference-batcher.md @@ -3,7 +3,7 @@ ## Introduction Inference batching can be enabled to increase inference request throughput at the cost of higher latencies. -The configuration of the inference batcher depends on the serving tool and the model server used in the deployment. +The configuration of the inference batcher depends on the model server used in the deployment. See the [compatibility matrix](#compatibility-matrix). 
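Conceptually, an inference batcher buffers incoming requests and flushes them to the model server when either a maximum batch size is reached or a timeout expires. The plain-Python sketch below illustrates that mechanism only; it is not the hsml `InferenceBatcher` API, and the parameter names are chosen for illustration:

```python
import time


class BatcherSketch:
    """Illustrative request batcher: flush on max_batch_size or timeout (seconds)."""

    def __init__(self, max_batch_size=4, timeout=0.5):
        self.max_batch_size = max_batch_size
        self.timeout = timeout
        self._buffer = []
        self._first_arrival = 0.0

    def add(self, request, now=None):
        """Buffer a request; return a full or timed-out batch, else None."""
        now = time.monotonic() if now is None else now
        if not self._buffer:
            self._first_arrival = now
        self._buffer.append(request)
        full = len(self._buffer) >= self.max_batch_size
        timed_out = (now - self._first_arrival) >= self.timeout
        if full or timed_out:
            batch, self._buffer = self._buffer, []
            return batch  # would be sent to the model server as one inference call
        return None
```

A real batcher also flushes on a timer even when no new request arrives; that detail is omitted for brevity. The trade-off is visible in the knobs: a larger batch size raises throughput, while a longer timeout raises worst-case latency.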
## GUI @@ -99,7 +99,6 @@ Once you are done with the changes, click on `Create new deployment` at the bott my_deployment = ms.create_deployment(my_predictor) my_deployment.save() - ``` ### API Reference @@ -110,11 +109,11 @@ Once you are done with the changes, click on `Create new deployment` at the bott ??? info "Show supported inference batcher configuration" - | Serving tool | Model server | Inference batching | Fine-grained configuration | - | ------------ | ------------------ | ------------------ | ------- | - | Docker | Flask | ❌ | - | - | | TensorFlow Serving | ✅ | ❌ | - | Kubernetes | Flask | ❌ | - | - | | TensorFlow Serving | ✅ | ❌ | - | KServe | Flask | ✅ | ✅ | - | | TensorFlow Serving | ✅ | ✅ | + | Model server | Inference batching | Fine-grained configuration | + | ------------------ | ------------------ | -------------------------- | + | Python | ✅ | ✅ | + | TensorFlow Serving | ✅ | ✅ | + | vLLM | ❌ | — | + +!!! note "Timeout parameter" + The `timeout` parameter sets the request timeout in seconds for the inference batcher. If a batch is not filled within this time, the available requests are sent as a partial batch. diff --git a/docs/user_guides/mlops/serving/inference-logger.md b/docs/user_guides/mlops/serving/inference-logger.md index 64012324b7..71ec2db2e6 100644 --- a/docs/user_guides/mlops/serving/inference-logger.md +++ b/docs/user_guides/mlops/serving/inference-logger.md @@ -6,7 +6,19 @@ Once a model is deployed and starts making predictions as inference requests arr Hopsworks supports logging both inference requests and predictions as events to a Kafka topic for analysis. -!!! warning "Topic schemas vary depending on the serving tool. See [below](#topic-schema)" +!!! warning "Inference logging is not supported for vLLM deployments." + +!!! 
info "Logging modes" + Three logging modes are available: + + | Mode | Logger Mode | Description | + | ------------- | ------------ | ------------------------- | + | ALL | `"all"` | Log both inputs and outputs | + | PREDICTIONS | `"response"` | Log model outputs only | + | MODEL_INPUTS | `"request"` | Log model inputs only | + +!!! note "Kafka topic requirements" + The Kafka topic must use the `inferenceschema` subject. Schema v4+ is required for KServe topics. ## GUI @@ -113,7 +125,6 @@ Once you are done with the changes, click on `Create new deployment` at the bott my_deployment = ms.create_deployment(my_predictor) my_deployment.save() - ``` ### API Reference @@ -122,47 +133,23 @@ Once you are done with the changes, click on `Create new deployment` at the bott ## Topic schema -The schema of Kafka events varies depending on the serving tool. -In KServe deployments, model inputs and predictions are logged in separate events, but sharing the same `requestId` field. -In non-KServe deployments, the same event contains both the model input and prediction related to the same inference request. - -??? 
example "Show kafka topic schemas" - - === "KServe" - - ``` json - { - "fields": [ - { "name": "servingId", "type": "int" }, - { "name": "modelName", "type": "string" }, - { "name": "modelVersion", "type": "int" }, - { "name": "requestTimestamp", "type": "long" }, - { "name": "responseHttpCode", "type": "int" }, - { "name": "inferenceId", "type": "string" }, - { "name": "messageType", "type": "string" }, - { "name": "payload", "type": "string" } - ], - "name": "inferencelog", - "type": "record" - } - ``` - - === "Docker / Kubernetes" - - ``` json - { - "fields": [ - { "name": "modelId", "type": "int" }, - { "name": "modelName", "type": "string" }, - { "name": "modelVersion", "type": "int" }, - { "name": "requestTimestamp", "type": "long" }, - { "name": "responseHttpCode", "type": "int" }, - { "name": "inferenceRequest", "type": "string" }, - { "name": "inferenceResponse", "type": "string" }, - { "name": "modelServer", "type": "string" }, - { "name": "servingTool", "type": "string" } - ], - "name": "inferencelog", - "type": "record" - } - ``` +Model inputs and predictions are logged in separate events, sharing the same `requestId` field. + +??? example "Show kafka topic schema" + + ``` json + { + "fields": [ + { "name": "servingId", "type": "int" }, + { "name": "modelName", "type": "string" }, + { "name": "modelVersion", "type": "int" }, + { "name": "requestTimestamp", "type": "long" }, + { "name": "responseHttpCode", "type": "int" }, + { "name": "inferenceId", "type": "string" }, + { "name": "messageType", "type": "string" }, + { "name": "payload", "type": "string" } + ], + "name": "inferencelog", + "type": "record" + } + ``` diff --git a/docs/user_guides/mlops/serving/predictor.md b/docs/user_guides/mlops/serving/predictor.md index 4470ffac75..2ae45ce929 100644 --- a/docs/user_guides/mlops/serving/predictor.md +++ b/docs/user_guides/mlops/serving/predictor.md @@ -14,20 +14,21 @@ In this guide, you will learn how to configure a predictor for a trained model. 
Predictors are the main component of deployments. They are responsible for running a model server that loads a trained model, handles inference requests and returns predictions. -They can be configured to use different model servers, serving tools, log specific inference data or scale differently. -In each predictor, you can configure the following components: +They can be configured to use different model servers and resources, or to scale differently. +In each predictor, you can set the following configuration options: !!! info "" 1. [Model server](#model-server) - 2. [Serving tool](#serving-tool) - 3. [User-provided script](#user-provided-script) - 4. [Server configuration file](#server-configuration-file) - 5. [Python environments](#python-environments) - 6. [Transformer](#transformer) - 7. [Inference Logger](#inference-logger) - 8. [Inference Batcher](#inference-batcher) - 9. [Resources](#resources) - 10. [API protocol](#api-protocol) + 2. [User-provided script](#user-provided-script) + 3. [Server configuration file](#server-configuration-file) + 4. [Python environments](#python-environments) + 5. [Transformer](#transformer) + 6. [Inference Logger](#inference-logger) + 7. [Inference Batcher](#inference-batcher) + 8. [Resources](#resources) + 9. [Autoscaling](#autoscaling) + 10. [Scheduling](#scheduling) + 11. [API protocol](#api-protocol) ## GUI @@ -55,7 +56,7 @@ For example if you registered the model as a TensorFlow model using `ModelRegist

- Select the model framework + Select the model server
Select the backend

@@ -69,7 +70,7 @@ All models compatible with the selected backend will be listed in the model drop

-Moreover, you can optionally select a predictor script (see [Step 3 (Optional): Select a predictor script](#step-3-optional-select-a-predictor-script)), enable KServe (see [Step 4 (Optional): Enable KServe](#step-6-optional-enable-kserve)) or change other advanced configuration (see [Step 5 (Optional): Other advanced options](#step-7-optional-other-advanced-options)). +Moreover, you can optionally select a predictor script (see [Step 3 (Optional): Select a predictor script](#step-3-optional-select-a-predictor-script)) or change other advanced configuration (see [Step 6 (Optional): Other advanced options](#step-6-optional-other-advanced-options)). Otherwise, click on `Create new deployment` to create the deployment for your model. ### Step 3 (Optional): Select a predictor script @@ -102,7 +103,7 @@ To create your own it is recommended to [clone](../../projects/python/python_env ### Step 5 (Optional): Select a configuration file !!! note - Only available for LLM deployments. + Only available for LLM deployments. Required for vLLM deployments. You can select a configuration file to be added to the [artifact files](deployment.md#artifact-files). If a predictor script is provided, this configuration file will be available inside the model deployment at the local path stored in the `CONFIG_FILE_PATH` environment variable. @@ -116,10 +117,9 @@ You can find all configuration parameters supported by the vLLM server in the [v

-### Step 6 (Optional): Enable KServe +### Step 6 (Optional): Other advanced options -Other configuration such as the serving tool, is part of the advanced options of a deployment. -To navigate to the advanced creation form, click on `Advanced options`. +To access the advanced deployment configuration, click on `Advanced options`.

@@ -128,25 +128,16 @@ To navigate to the advanced creation form, click on `Advanced options`.

-Here, you change the [serving tool](#serving-tool) for your deployment by enabling or disabling the KServe checkbox. +Here, you can change the default values of your predictor: -

-

- KServe in advanced deployment form -
KServe checkbox in the advanced deployment form
-
-

- -### Step 7 (Optional): Other advanced options - -Additionally, you can adjust the default values of the rest of components: - -!!! info "Predictor components" +!!! info "Predictor configuration" 1. [Transformer](#transformer) 2. [Inference logger](#inference-logger) 3. [Inference batcher](#inference-batcher) 4. [Resources](#resources) - 5. [API protocol](#api-protocol) + 5. [Autoscaling](#autoscaling) + 6. [Scheduling](#scheduling) + 7. [API protocol](#api-protocol) Once you are done with the changes, click on `Create new deployment` at the bottom of the page to create the deployment for your model. @@ -203,65 +194,6 @@ Once you are done with the changes, click on `Create new deployment` at the bott # return self.model.predict(result) ``` -=== "Predictor (vLLM deployments only)" - - ``` python - import os - from vllm import **version**, AsyncEngineArgs, AsyncLLMEngine - from typing import Iterable, AsyncIterator, Union, Optional - from kserve.protocol.rest.openai import ( - CompletionRequest, - ChatPrompt, - ChatCompletionRequestMessage, - ) - from kserve.protocol.rest.openai.types import Completion - from kserve.protocol.rest.openai.types.openapi import ChatCompletionTool - - class Predictor(): - - def __init__(self): - """ Initialization code goes here""" - - # (optional) if any, access the configuration file via os.environ["CONFIG_FILE_PATH"] - config = ... - - print("Starting vLLM backend...") - engine_args = AsyncEngineArgs( - model=os.environ["MODEL_FILES_PATH"], - **config - ) - - # "self.vllm_engine" is required as the local variable with the vllm engine handler - self.vllm_engine = AsyncLLMEngine.from_engine_args(engine_args) - - # - # NOTE: Default implementations of the apply_chat_template and create_completion methods are already provided. 
- # If needed, you can override these methods as shown below - # - - #def apply_chat_template( - # self, - # messages: Iterable[ChatCompletionRequestMessage], - # chat_template: Optional[str] = None, - # tools: Optional[list[ChatCompletionTool]] = None, - #) -> ChatPrompt: - # """Converts a prompt or list of messages into a single templated prompt string""" - - # prompt = ... # apply chat template on the message to build the prompt - # return ChatPrompt(prompt=prompt) - - #async def create_completion( - # self, request: CompletionRequest - #) -> Union[Completion, AsyncIterator[Completion]]: - # """Generate responses using the vLLM engine""" - # - # generators = self.vllm_engine.generate(...) - # - # # Completion: used for returning a single answer (batch) - # # AsyncIterator[Completion]: used for returning a stream of answers - # return ... - ``` - !!! info "Jupyter magic" In a jupyter notebook, you can add `%%writefile my_predictor.py` at the top of the cell to save it as a local file. 
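To sanity-check the predictor contract outside Hopsworks, the template above can be exercised with a self-contained sketch. The in-memory "model" below is a stand-in assumption for illustration only; in a real deployment, `__init__` would load the trained model from the `MODEL_FILES_PATH` directory instead.

```python
import os

# MODEL_FILES_PATH is set by the deployment at runtime; a stand-in value
# keeps this sketch runnable outside Hopsworks.
os.environ.setdefault("MODEL_FILES_PATH", "/tmp/model_files")

class Predict(object):
    def __init__(self):
        """Initialization code goes here. A real predictor would load the
        trained model from os.environ["MODEL_FILES_PATH"], e.g. via joblib."""
        # Trivial stand-in model: sums each feature vector.
        self.model = lambda rows: [sum(r) for r in rows]

    def predict(self, inputs):
        """Serve one inference request: `inputs` is the list of instances
        from the request payload; return one prediction per instance."""
        return self.model(inputs)

predictor = Predict()
print(predictor.predict([[1, 2], [3, 4]]))  # [3, 7]
```

The same two methods are all the model server needs: construction loads state once, and `predict` maps a request payload to predictions.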
@@ -279,10 +211,9 @@ Once you are done with the changes, click on `Create new deployment` at the bott "/Projects", project.name, uploaded_file_path ) - ``` -### Step 4: Define predictor +### Step 4: Pass predictor configuration to deployment === "Python" @@ -293,11 +224,8 @@ Once you are done with the changes, click on `Create new deployment` at the bott my_model, # optional model_server="PYTHON", - serving_tool="KSERVE", script_file=predictor_script_path, ) - - ``` ### Step 5: Create a deployment with the predictor @@ -311,7 +239,6 @@ Once you are done with the changes, click on `Create new deployment` at the bott my_deployment = ms.create_deployment(my_predictor) my_deployment.save() - ``` ### API Reference @@ -320,42 +247,39 @@ Once you are done with the changes, click on `Create new deployment` at the bott ## Model Server -Hopsworks Model Serving supports deploying models with a Flask server for python-based models, TensorFlow Serving for TensorFlow / Keras models and vLLM for Large Language Models (LLMs). -Today, you can deploy PyTorch models as python-based models. +Hopsworks Model Serving supports deploying models with a Python model server for python-based models (scikit-learn, xgboost, pytorch...), TensorFlow Serving for TensorFlow / Keras models and vLLM for Large Language Models (LLMs). ??? info "Show supported model servers" - | Model Server | Supported | ML Models and Frameworks | - | ------------------ | --------- | ----------------------------------------------------------------------------------------------- | - | Flask | ✅ | python-based (scikit-learn, xgboost, pytorch...) 
| - | TensorFlow Serving | ✅ | keras, tensorflow | - | TorchServe | ❌ | pytorch | - | vLLM | ✅ | vLLM-supported models (see [list](https://docs.vllm.ai/en/v0.7.1/models/supported_models.html)) | + | Model Server | ML Models and Frameworks | + | ------------------ | ----------------------------------------------------------------------------------------------- | + | Python | python-based (scikit-learn, xgboost, pytorch...) | + | TensorFlow Serving | keras, tensorflow | + | vLLM | vLLM-supported models (see [list](https://docs.vllm.ai/en/v0.7.1/models/supported_models.html)) | -## Serving tool +??? info "Show framework compatibility matrix" -In Hopsworks, model servers are deployed on Kubernetes. -There are two options for deploying models on Kubernetes: using [KServe](https://kserve.github.io/website/latest/) inference services or Kubernetes built-in deployments. ==KServe is the recommended way to deploy models in Hopsworks==. + | Model Framework | Allowed Model Server | Script Required? | Notes | + | --------------- | -------------------- | ---------------- | ------------------------------------------------------------------------- | + | TENSORFLOW | TensorFlow Serving | No (not allowed) | Must have model artifact. Path needs `variables/` + `.pb` | + | SKLEARN | Python | No | Uses sklearn built-in KServe runtime. Files: `.joblib`, `.pkl`, `.pickle` | + | PYTHON | Python | Yes | Custom Python deployment | + | TORCH | Python | Yes | PyTorch deployment | + | LLM | Python or vLLM | Depends | Python requires script, vLLM optional | -The following is a comparative table showing the features supported by each of them. +!!! warning "vLLM restrictions" + The vLLM model server has the following restrictions: -??? 
info "Show serving tools comparison" + - No transformer support + - No inference logging + - Config file must be YAML (`.yml`/`.yaml`) when no predictor script is provided + - Only LLM model framework is supported - | Feature / requirement | Kubernetes (enterprise) | KServe (enterprise) | - | ----------------------------------------------------- | ----------------------- | ------------------------- | - | Autoscaling (scale-out) | ✅ | ✅ | - | Resource allocation | ➖ min. resources | ✅ min / max. resources | - | Inference logging | ➖ simple | ✅ fine-grained | - | Inference batching | ➖ partially | ✅ | - | Scale-to-zero | ❌ | ✅ after 30s of inactivity | - | Transformers | ❌ | ✅ | - | Low-latency predictions | ❌ | ✅ | - | Multiple models | ❌ | ➖ (python-based) | - | User-provided predictor required
(python-only) | ✅ | ❌ | +All deployments use [KServe](https://kserve.github.io/website/latest/) as the serving platform, providing autoscaling (including scale-to-zero), fine-grained resource allocation, inference logging, inference batching, and transformers. ## User-provided script -Depending on the model server and serving platform used in the model deployment, you can (or need) to provide your own python script to load the model and make predictions. +Depending on the model server used in the model deployment, you can (or need) to provide your own python script to load the model and make predictions. This script is referred to as **predictor script**, and is included in the [artifact files](../serving/deployment.md#artifact-files) of the model deployment. The predictor script needs to implement a given template depending on the model server of the model deployment. @@ -363,13 +287,11 @@ See the templates in [Step 2](#step-2-optional-implement-a-predictor-script). ??? info "Show supported user-provided predictors" - | Serving tool | Model server | User-provided predictor script | - | ------------ | ------------------ | ---------------------------------------------------- | - | Kubernetes | Flask server | ✅ (required) | - | | TensorFlow Serving | ❌ | - | KServe | Fast API | ✅ (only required for artifacts with multiple models) | - | | TensorFlow Serving | ❌ | - | | vLLM | ✅ (optional) | + | Model server | User-provided predictor script | + | ------------------ | ---------------------------------------------------- | + | Python | ✅ (only required for artifacts with multiple models) | + | TensorFlow Serving | ❌ | + | vLLM | ✅ (optional) | ### Server configuration file @@ -378,28 +300,58 @@ In other words, by modifying the configuration file of an existing model deploym Inside a model deployment, the local path to the configuration file is stored in the `CONFIG_FILE_PATH` environment variable (see [environment variables](#environment-variables)). !!! 
warning "Configuration file format" - The configuration file can be of any format, except in vLLM deployments **without a predictor script** for which a YAML file is ==required==. + The configuration file can be of any format, except in vLLM deployments **without a predictor script** for which a YAML file (`.yml`/`.yaml`) is ==required==. + When a predictor script is present, any format is allowed (the user is responsible for parsing it). !!! note "Passing arguments to vLLM via configuration file" - For vLLM deployments **without a predictor script**, the server configuration file is ==required== and it is used to configure the vLLM server. - For example, you can use this configuration file to specify the chat template or LoRA modules to be loaded by the vLLM server. + For vLLM deployments **without a predictor script**, the server configuration file is required and is used to configure the vLLM server. + For example, you can use this configuration file to specify the chat template or LoRA modules to be loaded by the vLLM server. See all available parameters in the [official documentation](https://docs.vllm.ai/en/v0.7.1/serving/openai_compatible_server.html#command-line-arguments-for-the-server). ### Environment variables A number of environment variables are available in the predictor to ease its implementation. -??? info "Show environment variables" +??? info "Show common environment variables" + + These variables are available in all deployments. 
+ + | Name | Description | + | ---------------------- | -------------------------------------------------- | + | DEPLOYMENT_NAME | Name of the current deployment | + | DEPLOYMENT_VERSION | Version of the deployment | + | ARTIFACT_FILES_PATH | Local path to the artifact files | + | REST_ENDPOINT | Hopsworks REST API endpoint | + | HOPSWORKS_PROJECT_ID | ID of the project | + | HOPSWORKS_PROJECT_NAME | Name of the project | + | HOPSWORKS_PUBLIC_HOST | Hopsworks public hostname | + | API_KEY | API key for authenticating with Hopsworks services | + | PROJECT_ID | Project ID (for Feature Store access) | + | PROJECT_NAME | Project name (for Feature Store access) | + | SECRETS_DIR | Path to secrets directory (`/keys`) | + | MATERIAL_DIRECTORY | Path to TLS certificates (`/certs`) | + | REQUESTS_VERIFY | SSL verification setting | + +??? info "Show predictor-specific environment variables" + + These variables are set for predictor components. + + | Name | Description | + | ---------------- | -------------------------------------------------- | + | SCRIPT_PATH | Full path to the predictor script | + | SCRIPT_NAME | Prefixed filename of the predictor script | + | CONFIG_FILE_PATH | Local path to the configuration file (if provided) | + | IS_PREDICTOR | Set to `true` for predictor components | + +??? 
info "Show model artifact environment variables (when model is present)" - | Name | Description | - | ------------------- | -------------------------------------------------------------------- | - | MODEL_FILES_PATH | Local path to the model files | - | ARTIFACT_FILES_PATH | Local path to the artifact files | - | CONFIG_FILE_PATH | Local path to the configuration file | - | DEPLOYMENT_NAME | Name of the current deployment | - | MODEL_NAME | Name of the model being served by the current deployment | - | MODEL_VERSION | Version of the model being served by the current deployment | - | ARTIFACT_VERSION | Version of the model artifact being served by the current deployment | + These variables are only available when the deployment has a model artifact. + + | Name | Description | + | ---------------- | ---------------------------------------------------------------- | + | MODEL_FILES_PATH | Local path to the model files (`/var/lib/hopsworks/model_files`) | + | MODEL_NAME | Name of the model being served by the current deployment | + | MODEL_VERSION | Version of the model being served by the current deployment | ## Python environments @@ -408,13 +360,11 @@ To create a new Python environment see [Python Environments](../../projects/pyth ??? 
info "Show supported Python environments" - | Serving tool | Model server | Editable | Predictor | Transformer | - | ------------ | ------------------ | -------- | ------------------------------------------ | ------------------------------ | - | Kubernetes | Flask server | ❌ | `pandas-inference-pipeline` only | ❌ | - | | TensorFlow Serving | ❌ | (official) tensorflow serving image | ❌ | - | KServe | Fast API | ✅ | any `inference-pipeline` image | any `inference-pipeline` image | - | | TensorFlow Serving | ✅ | (official) tensorflow serving image | any `inference-pipeline` image | - | | vLLM | ✅ | `vllm-inference-pipeline` or `vllm-openai` | any `inference-pipeline` image | + | Model server | Editable | Predictor | Transformer | + | ------------------ | -------- | ------------------------------------------ | ------------------------------ | + | Python | ✅ | any `inference-pipeline` image | any `inference-pipeline` image | + | TensorFlow Serving | ✅ | (official) tensorflow serving image | any `inference-pipeline` image | + | vLLM | ✅ | `vllm-inference-pipeline` or `vllm-openai` | any `inference-pipeline` image | !!! note The selected Python environment is used for both predictor and transformer. @@ -443,6 +393,18 @@ To learn about the different configuration available for the inference batcher, Resources include the number of replicas for the deployment as well as the resources (i.e., memory, CPU, GPU) to be allocated per replica. To learn about the different combinations available, see the [Resources Guide](resources.md). +## Autoscaling + +Deployments use Knative Pod Autoscaler (KPA) to automatically scale the number of replicas based on traffic, including scale-to-zero. + +To learn about the different autoscaling parameters, see the [Autoscaling Guide](autoscaling.md). + +## Scheduling + +If the cluster has Kueue enabled, you can select a queue for your deployment from the advanced configuration. 
Queues control resource allocation and scheduling priority across the cluster. + +For full details on scheduling configuration, see the [Scheduling Guide](scheduling.md). + ## API protocol Hopsworks supports both REST and gRPC as the API protocols to send inference requests to model deployments. diff --git a/docs/user_guides/mlops/serving/resources.md b/docs/user_guides/mlops/serving/resources.md index 2ba350d1a9..cec5985e31 100644 --- a/docs/user_guides/mlops/serving/resources.md +++ b/docs/user_guides/mlops/serving/resources.md @@ -6,8 +6,7 @@ description: Documentation on how to allocate resources to a model deployment ## Introduction -Depending on the serving tool used to deploy a trained model, resource allocation can be configured at different levels. -While deployments on Docker containers only support a fixed number of resources (CPU and memory), using Kubernetes or KServe allows a better exploitation of the resources available in the platform, by enabling you to specify how many CPUs, GPUs, and memory are allocated to a deployment. +Resource allocation can be configured per component (predictor and transformer) in a deployment, allowing you to specify how many CPUs, GPUs, and memory are allocated. See the [compatibility matrix](#compatibility-matrix). ## GUI @@ -103,7 +102,6 @@ Once you are done with the changes, click on `Create new deployment` at the bott num_instances=2, requests=minimum_res, limits=maximum_res ) - ``` ### Step 4: Create a deployment with the resource configuration @@ -126,7 +124,6 @@ Once you are done with the changes, click on `Create new deployment` at the bott my_deployment = ms.create_deployment(my_predictor) my_deployment.save() - ``` ### API Reference @@ -137,11 +134,26 @@ Once you are done with the changes, click on `Create new deployment` at the bott ??? 
info "Show supported resource allocation configuration" - | Serving tool | Component | Resources | - | ------------ | ----------- | --------------------------- | - | Docker | Predictor | Fixed | - | | Transformer | ❌ | - | Kubernetes | Predictor | Minimum resources | - | | Transformer | ❌ | - | KServe | Predictor | Minimum / maximum resources | - | | Transformer | Minimum / maximum resources | + | Component | Resources | + | ----------- | --------------------------- | + | Predictor | Minimum / maximum resources | + | Transformer | Minimum / maximum resources | + +??? info "Show resource defaults" + + | Field | Default Request | Default Limit | + | ----------------- | --------------- | ---------------- | + | CPU (cores) | 0.2 | -1 (unlimited) | + | Memory (MB) | 32 | -1 (unlimited) | + | GPUs | 0 | 0 | + | Shared Memory (MB)| 128 | — | + +!!! warning "Validation rules" + - Requested cores cannot exceed limits (unless limit is -1, meaning unlimited) + - Requested memory cannot exceed limits (unless limit is -1) + - GPU requests must equal GPU limits + +## Autoscaling + +Deployments can be configured to automatically scale the number of replicas based on traffic. +To learn about the different autoscaling parameters, see the [Autoscaling Guide](autoscaling.md). diff --git a/docs/user_guides/mlops/serving/rest-api.md b/docs/user_guides/mlops/serving/rest-api.md index d7e99de1e1..cb0a2ea983 100644 --- a/docs/user_guides/mlops/serving/rest-api.md +++ b/docs/user_guides/mlops/serving/rest-api.md @@ -8,10 +8,16 @@ This document explains how to interact with a model deployment via REST API. ## Base URL -Deployed models are accessible through the Istio ingress gateway. +Deployed models are accessible through the Istio ingress gateway using path-based routing. The URL to interact with a model deployment is provided on the model deployment page in the Hopsworks UI. 
-The URL follows the format `http:///`, where `RESOURCE_PATH` depends on the [`Predictor.model_server`][hsml.predictor.Predictor.model_server] (e.g., vLLM, TensorFlow Serving, SKLearn ModelServer). +The URL follows the format: + +```text +https://<istio_gateway>/v1/<project_name>/<deployment_name>/<rest_of_path> +``` + +Where `rest_of_path` depends on the model server type (see [Path construction](#path-construction)).
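The URL assembly described above can be sketched as a small helper. The gateway hostname and names below are placeholders for illustration, not real endpoints.

```python
def inference_url(gateway, project, deployment, rest_of_path):
    """Assemble a path-based inference URL. `rest_of_path` depends on the
    model server, e.g. 'v1/models/<name>:predict' for TF Serving / sklearn,
    or 'v1/chat/completions' for vLLM."""
    return f"https://{gateway}/v1/{project}/{deployment}/{rest_of_path}"

# Hypothetical gateway host and names, for illustration only:
url = inference_url("hopsworks.example.com", "my_project", "fraud",
                    "v1/models/fraud:predict")
print(url)  # https://hopsworks.example.com/v1/my_project/fraud/v1/models/fraud:predict
```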

@@ -23,7 +29,7 @@ The URL follows the format `http:///`, where `R ## Authentication All requests must include an API Key for authentication. -You can create an API by following this [guide](../../projects/api_key/create_api_key.md). +You can create an API key by following this [guide](../../projects/api_key/create_api_key.md). Include the key in the Authorization header: @@ -33,20 +39,45 @@ Authorization: ApiKey <api_key_value> ## Headers -| Header | Description | Example Value | -| --------------- | ------------------------------------------- | ------------------------------------ | -| `Host` | Model’s hostname, provided in Hopsworks UI. | `fraud.test.hopsworks.ai` | -| `Authorization` | API key for authentication. | `ApiKey <api_key_value>` | -| `Content-Type` | Request payload type (always JSON). | `application/json` | +| Header | Description | Example Value | +| --------------- | ----------------------------------- | ----------------------- | +| `Authorization` | API key for authentication. | `ApiKey <api_key_value>` | +| `Content-Type` | Request payload type (always JSON). | `application/json` | + +## Inference Verbs + +Different inference verbs are available depending on the model server and use case. + +??? info "Show inference verbs" + + | Verb | Typical Use | + | ---------------------------- | ---------------------------------- | + | `predict` | TF Serving, sklearn, custom Python | + | `classify` | TF Serving classification | + | `regress` | TF Serving regression | + | `v1/completions` | vLLM OpenAI completions | + | `v1/chat/completions` | vLLM OpenAI chat | + | `openai/v1/completions` | vLLM inference pipeline | + | `openai/v1/chat/completions` | vLLM inference pipeline chat | + | `test` | Legacy test endpoint | + +## Path Construction + +The full URL is constructed by combining the base path with a model server-specific suffix. 
+ +| Scenario | Path Format | Example | +| ------------------------- | ---------------------------------------------------- | ---------------------------------------------- | +| TF Serving | `/v1/<project_name>/<deployment_name>/v1/models/<model_name>:<verb>` | `/v1/my_project/fraud/v1/models/fraud:predict` | +| Python/Sklearn with model | `/v1/<project_name>/<deployment_name>/v1/models/<model_name>:<verb>` | `/v1/my_project/fraud/v1/models/fraud:predict` | +| vLLM | `/v1/<project_name>/<deployment_name>/<verb>` | `/v1/my_project/my-llm/v1/chat/completions` | +| Python deployments | `/v1/<project_name>/<deployment_name>/<verb>` | `/v1/my_project/my-app/predict` | ## Request Format -The request format depends on the model sever being used. +The request format depends on the model server being used. -For predictive inference (i.e., for Tensorflow or SkLearn or Python Serving). -The request must be sent as a JSON object containing an `inputs` or `instances` field. +For predictive inference (TensorFlow, sklearn, or Python model server), the request must be sent as a JSON object containing an `inputs` or `instances` field. See [more information on the request format](https://kserve.github.io/website/docs/concepts/architecture/data-plane/v1-protocol#request-format). -An example for this is given below. !!! example "REST API example for Predictive Inference (Tensorflow or SkLearn or Python Serving)" === "Python" 
data = {"inputs": [[4641025220953719, 4920355418495856]]} - headers = { - "Host": "fraud.test.hopsworks.ai", - "Authorization": "ApiKey 8kDOlnRlJU4kiV1Y.RmFNJY3XKAUSqmJZ03kbUbXKMQSHveSBgMIGT84qrM5qXMjLib7hdlfGeg8fBQZp", - "Content-Type": "application/json", - } + headers = {"Authorization": "ApiKey <api_key_value>", "Content-Type": "application/json"} response = requests.post( - "http://10.87.42.108/v1/models/fraud:predict", headers=headers, json=data + "https://<istio_gateway>/v1/my_project/fraud/v1/models/fraud:predict", + headers=headers, + json=data, ) print(response.json()) - - ``` === "Curl" ```bash - curl -X POST "http://10.87.42.108/v1/models/fraud:predict" \ - -H "Host: fraud.test.hopsworks.ai" \ - -H "Authorization: ApiKey 8kDOlnRlJU4kiV1Y.RmFNJY3XKAUSqmJZ03kbUbXKMQSHveSBgMIGT84qrM5qXMjLib7hdlfGeg8fBQZp" \ + curl -X POST "https://<istio_gateway>/v1/my_project/fraud/v1/models/fraud:predict" \ + -H "Authorization: ApiKey <api_key_value>" \ -H "Content-Type: application/json" \ -d '{ "inputs": [ - [ - 4641025220953719, - 4920355418495856 - ] + [4641025220953719, 4920355418495856] ] }' ``` -For generative inference (i.e vLLM) the response follows the [OpenAI specification](https://platform.openai.com/docs/api-reference/chat/create). +For generative inference (vLLM), the request follows the [OpenAI specification](https://platform.openai.com/docs/api-reference/chat/create). + +!!! 
example "vLLM chat completions" + === "Python" + + ```python + import requests + + data = { + "model": "my-llm", + "messages": [{"role": "user", "content": "Hello, how are you?"}], + } + + headers = {"Authorization": "ApiKey ", "Content-Type": "application/json"} + + response = requests.post( + "https:///v1/my_project/my-llm/v1/chat/completions", + headers=headers, + json=data, + ) + print(response.json()) + ``` + + === "Curl" + + ```bash + curl -X POST "https:///v1/my_project/my-llm/v1/chat/completions" \ + -H "Authorization: ApiKey " \ + -H "Content-Type: application/json" \ + -d '{ + "model": "my-llm", + "messages": [ + {"role": "user", "content": "Hello, how are you?"} + ] + }' + ``` + +## Python Client Methods + +The Hopsworks Python client provides convenience methods for constructing endpoint URLs. + +=== "Python" + + ```python + deployment = ms.get_deployment("my-deployment") + + # Base endpoint URL (for custom/Python deployments) + # Returns: https:///v1// + endpoint_url = deployment.get_endpoint_url() + + # Standard model inference URL + # Returns: https:///v1///v1/models/:predict + inference_url = deployment.get_inference_url() + + # vLLM OpenAI-compatible base URL (vLLM deployments only) + # Returns: https:///v1///v1 + # Append /chat/completions or /completions for specific endpoints + openai_url = deployment.get_openai_url() + ``` + +## CORS + +The Istio EnvoyFilter handles CORS preflight (`OPTIONS`) requests automatically. Allowed origins can be configured via `istio.envoyFilter.corsAllowedOrigins` in the Helm chart configuration. ## Response The model returns predictions in a JSON object. The response depends on the model server implementation. -You can find more information regarding specific model servers in the [Kserve documentation](https://kserve.github.io/website/docs/intro). +You can find more information regarding specific model servers in the [KServe documentation](https://kserve.github.io/website/docs/intro). + +??? 
info "Legacy: Host-based routing" + Prior to path-based routing, requests were routed using a `Host` header matching the InferenceService hostname. + This method is still used internally by the Hopsworks backend when proxying inference requests. + + ``` + Host: ..hopsworks.ai + ``` + + Each InferenceService gets its own Knative-generated hostname, and routing depends on the `Host` header matching Istio VirtualService rules. Path-based routing (described above) is the preferred method for external access. diff --git a/docs/user_guides/mlops/serving/scheduling.md b/docs/user_guides/mlops/serving/scheduling.md new file mode 100644 index 0000000000..b24d5f4605 --- /dev/null +++ b/docs/user_guides/mlops/serving/scheduling.md @@ -0,0 +1,63 @@ +--- +description: Documentation on how to configure scheduling options for a model deployment +--- + +# How To Configure Scheduling For A Model Deployment + +## Introduction + +Scheduling configuration determines how and where your model deployment pods are placed in the Kubernetes cluster. Hopsworks supports Kubernetes scheduler abstractions such as node affinity, anti-affinity, and priority classes, as well as advanced scheduling with Kueue queues and topologies. + +All scheduling options are available in jobs, Jupyter notebooks, model deployments and Python deployments. + +## GUI + +### Step 1: Create new deployment + +If you have at least one model already trained and saved in the Model Registry, navigate to the deployments page by clicking on the `Deployments` tab on the navigation menu on the left. + +
+<!-- figure: Deployments navigation tab -->
+
+Once in the deployments page, you can create a new deployment by either clicking on `New deployment` (if there are no existing deployments) or on `Create new deployment` at the top-right corner.
+Both options will open the deployment creation form.
+
+### Step 2: Go to advanced options
+
+A simplified creation form will appear including the most common deployment fields from all available configurations.
+In the simplified form, click on `Advanced options` to navigate to the advanced creation form.
+
+### Step 3: Configure scheduling
+
+In the advanced creation form, navigate to the `Scheduler` section to configure the scheduling options for your deployment.
+
+#### Queues and Topologies (Kueue)
+
+If the cluster has Kueue enabled, you can select a queue for your deployment. Queues control resource allocation and scheduling priority across the cluster. Administrators define quotas on how many resources a queue can use, and queues can be grouped in cohorts to borrow resources from each other.
+
+![Select a queue for the deployment](../../../assets/images/guides/project/scheduler/job_queue.png)
+
+You can also select a topology unit to control how deployment pods are co-located. For example, you can require all pods to run on the same host to minimize network latency.
+
+![Select a topology unit for the deployment](../../../assets/images/guides/project/scheduler/job_topology_unit.png)
+
+#### Affinity, Anti-Affinity, and Priority Classes
+
+You can configure node affinity, anti-affinity, and priority classes for your deployment:
+
+- **Affinity**: Constrains which nodes the deployment pods can run on based on node labels (e.g., GPU nodes, specific zones).
+- **Anti-Affinity**: Prevents pods from running on nodes with specific labels.
+- **Priority Class**: Determines the scheduling and eviction priority of pods. Higher priority pods are scheduled first and can preempt lower priority pods.
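The three options above map onto standard Kubernetes pod-spec fields. As a rough sketch, the equivalent spec fragment might look like the following (the label keys, values, and the priority class name are hypothetical examples, not values shipped with Hopsworks):

```python
# Sketch of the Kubernetes pod-spec fields the scheduling form configures.
# "accelerator"/"gpu", "app": "my-deployment" and "high-priority" are
# hypothetical; use the labels and priority classes defined in your cluster.
pod_spec_fragment = {
    "affinity": {
        "nodeAffinity": {
            # Affinity: only schedule on nodes whose labels match
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {"key": "accelerator", "operator": "In", "values": ["gpu"]}
                        ]
                    }
                ]
            }
        },
        "podAntiAffinity": {
            # Anti-affinity: prefer not to co-locate replicas on the same host
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": 100,
                    "podAffinityTerm": {
                        "labelSelector": {"matchLabels": {"app": "my-deployment"}},
                        "topologyKey": "kubernetes.io/hostname",
                    },
                }
            ]
        },
    },
    # Priority class: higher-priority pods are scheduled first and may
    # preempt lower-priority ones
    "priorityClassName": "high-priority",
}
```

The available node labels and priority classes are defined by cluster administrators; the form only lets you pick among them.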
+
+![Affinity and Priority Classes](../../../assets/images/guides/project/scheduler/job_configuration.png)
+
+## Learn more
+
+For detailed documentation on scheduling abstractions and cluster-level configuration, see the following guides:
+
+- [Scheduler](../../projects/scheduling/kube_scheduler.md) — Affinity, anti-affinity, priority classes, and project-level defaults
+- [Kueue Details](../../projects/scheduling/kueue_details.md) — Queues, cohorts, topologies, and resource flavors
diff --git a/docs/user_guides/mlops/serving/transformer.md b/docs/user_guides/mlops/serving/transformer.md
index 9abf279d59..2f6bc4067c 100644
--- a/docs/user_guides/mlops/serving/transformer.md
+++ b/docs/user_guides/mlops/serving/transformer.md
@@ -9,10 +9,13 @@ description: Documentation on how to configure a KServe transformer for a model

In this guide, you will learn how to configure a transformer in a deployment. Transformers are used to apply transformations on the model inputs before sending them to the predictor for making predictions using the model.

-They run on a built-in Flask server provided by Hopsworks and require a user-provided python script implementing the [Transformer class](#step-2-implement-transformer-script).
+They run on a Python inference image via `kserve_server_launcher.sh` and require a user-provided Python script (`.py` or `.ipynb`) implementing the [Transformer class](#step-2-implement-transformer-script).

-???+ warning
-    Transformers are only supported in deployments using KServe as serving tool.
+!!! warning
+    Transformers are only supported in KServe deployments and are not available for the vLLM model server.
+
+!!! info "Independent scaling"
+    The transformer has independent scaling, resources, and node affinity from the predictor. This allows you to scale the pre/post-processing separately from the model inference.

A transformer has two configurable components:

@@ -53,17 +56,7 @@ To navigate to the advanced creation form, click on `Advanced options`.
### Step 3: Select a transformer script

-Transformers require KServe as the serving platform for the deployment.
-Make sure that KServe is enabled for this deployment by activating the corresponding checkbox.
-
-<!-- figure: Enable KServe in the advanced deployment form -->
-
-Then, if the transformer script is already located in Hopsworks, click on `From project` and navigate through the file system to find your script.
+If the transformer script is already located in Hopsworks, click on `From project` and navigate through the file system to find your script.
Otherwise, you can click on `Upload new file` to upload the transformer script now.

@@ -146,7 +139,6 @@ Once you are done with the changes, click on `Create new deployment` at the bott
            "/Projects", project.name, uploaded_file_path
        )
-
    ```

### Step 4: Define a transformer

@@ -162,7 +154,6 @@ Once you are done with the changes, click on `Create new deployment` at the bott

    my_transformer = Transformer(script_file)
-
    ```

### Step 5: Create a deployment with the transformer

@@ -177,7 +168,6 @@ Once you are done with the changes, click on `Create new deployment` at the bott
    my_deployment = ms.create_deployment(my_predictor, transformer=my_transformer)

    my_deployment.save()
-
    ```

### API Reference

@@ -195,10 +185,28 @@ A number of different environment variables is available in the transformer to e

??? info "Show environment variables"

-    | Name                | Description                                                          |
-    | ------------------- | -------------------------------------------------------------------- |
-    | ARTIFACT_FILES_PATH | Local path to the model artifact files                               |
-    | DEPLOYMENT_NAME     | Name of the current deployment                                       |
-    | MODEL_NAME          | Name of the model being served by the current deployment             |
-    | MODEL_VERSION       | Version of the model being served by the current deployment          |
-    | ARTIFACT_VERSION    | Version of the model artifact being served by the current deployment |
+    **Transformer-specific:**
+
+    | Name           | Description                                 |
+    | -------------- | ------------------------------------------- |
+    | IS_TRANSFORMER | Set to `true` for transformer components    |
+    | SCRIPT_PATH    | Full path to the transformer script         |
+    | SCRIPT_NAME    | Prefixed filename of the transformer script |
+
+    **Common:**
+
+    | Name                   | Description                                                 |
+    | ---------------------- | ----------------------------------------------------------- |
+    | ARTIFACT_FILES_PATH    | Local path to the model artifact files                      |
+    | DEPLOYMENT_NAME        | Name of the current deployment                              |
+    | DEPLOYMENT_VERSION     | Version of the deployment                                   |
+    | MODEL_NAME             | Name of the model being served by the current deployment    |
+    | MODEL_VERSION          | Version of the model being served by the current deployment |
+    | REST_ENDPOINT          | Hopsworks REST API endpoint                                 |
+    | HOPSWORKS_PROJECT_ID   | ID of the project                                           |
+    | HOPSWORKS_PROJECT_NAME | Name of the project                                         |
+    | API_KEY                | API key for authenticating with Hopsworks services          |
+    | PROJECT_ID             | Project ID (for Feature Store access)                       |
+    | PROJECT_NAME           | Project name (for Feature Store access)                     |
+    | SECRETS_DIR            | Path to secrets directory (`/keys`)                         |
+    | MATERIAL_DIRECTORY     | Path to TLS certificates (`/certs`)                         |
diff --git a/docs/user_guides/mlops/serving/troubleshooting.md b/docs/user_guides/mlops/serving/troubleshooting.md
index f02d942ab8..645236b30d 100644
--- a/docs/user_guides/mlops/serving/troubleshooting.md
+++ b/docs/user_guides/mlops/serving/troubleshooting.md
@@ -9,9 +9,13 @@ description: Documentation on how to troubleshoot a model deployment

In this guide, you will learn how to troubleshoot a deployment that is having issues to serve a trained model. But before that, it is important to understand how [deployment states](deployment-state.md) are defined and the possible transitions between conditions.

+Before a deployment starts, it goes through a CREATING phase where deployment artifacts are prepared.
When a deployment is starting, it follows an ordered sequence of [states](deployment-state.md#deployment-conditions) before becoming ready for serving predictions. Similarly, it follows an ordered sequence of states when being stopped, although with fewer steps.

+!!! warning "FAILED is a terminal state"
+    If a deployment reaches the FAILED state, it cannot recover on its own. You must stop and restart the deployment to attempt recovery.
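Because FAILED is terminal, recovery is always an explicit stop-then-start cycle. A minimal sketch, assuming an HSML-style deployment object that exposes the `stop()` and `start()` methods used elsewhere in this guide:

```python
def restart_deployment(deployment):
    """Recover a deployment stuck in the terminal FAILED state.

    Assumes an HSML-style deployment object (e.g. the result of
    ms.get_deployment("mydeployment")) exposing stop() and start().
    """
    deployment.stop()   # leave the terminal FAILED state
    deployment.start()  # schedule fresh predictor/transformer pods
```

After restarting, re-check the deployment state and its logs (as shown in the steps below) to confirm the underlying issue is actually fixed; otherwise the deployment will simply fail again.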
+
## GUI

### Step 1: Inspect deployment status

@@ -135,7 +139,6 @@ Once in the OpenSearch Dashboards, you can search for keywords, apply multiple f

    ```python
    deployment = ms.get_deployment("mydeployment")
-
    ```

### Step 3: Get current deployment's predictor state

@@ -147,7 +150,6 @@ Once in the OpenSearch Dashboards, you can search for keywords, apply multiple f

    state.describe()
-
    ```

### Step 4: Explore transient logs

@@ -157,7 +159,6 @@ Once in the OpenSearch Dashboards, you can search for keywords, apply multiple f

    ```python
    deployment.get_logs(component="predictor|transformer", tail=10)
-
    ```

### API Reference
diff --git a/docs/user_guides/projects/python-deployment.md b/docs/user_guides/projects/python-deployment.md
new file mode 100644
index 0000000000..331a834434
--- /dev/null
+++ b/docs/user_guides/projects/python-deployment.md
@@ -0,0 +1,184 @@
+---
+description: Documentation on how to create Python deployments
+---
+
+# Python Deployment
+
+## Introduction
+
+Python deployments allow you to deploy a Python script as a service without requiring a model artifact in the Model Registry.
+This is useful for custom inference pipelines, feature view deployments, or any Python-based program that needs to be served behind an HTTP endpoint.
+
+!!! warning "Incoming requests are directed to port 8080"
+    Python deployments run your script directly on port 8080. Therefore, make sure your implementation listens on port 8080 for incoming requests.
+
+!!! info "gRPC protocol not supported"
+    Currently, only the REST API protocol is supported.
+
+!!! tip "Use your favourite HTTP server"
+    There are no constraints on the framework or library used — you can use Flask, FastAPI, or any other HTTP server.
+
+## GUI
+
+### Step 1: Create new deployment
+
+Navigate to the deployments page by clicking on the `Deployments` tab on the navigation menu on the left.
+
+<!-- figure: Deployments navigation tab -->
+
+Then, click on `New Python deployment`.
+
+### Step 2: Configure the deployment
+
+Choose a name for your Python deployment. Then, provide the script for your Python program by clicking on `From project` or `Upload new file`.
+
+### Step 3 (Optional): Change Python environment
+
+Python deployments run the scripts in one of the [Python Environments](../projects/python/python_env_overview.md) available in your project. This environment must have all the necessary dependencies for your Python program.
+
+Hopsworks provides a collection of built-in environments like `minimal-inference-pipeline`, `pandas-inference-pipeline` or `torch-inference-pipeline` with different sets of libraries pre-installed. By default, the `pandas-inference-pipeline` Python environment is used in Python deployments.
+
+To create your own environment, it is recommended to [clone](../projects/python/python_env_clone.md) the `minimal-inference-pipeline` or `pandas-inference-pipeline` environment and install additional dependencies needed for your Python program.
+
+<!-- figure: Select an environment for the Python program -->
+
+### Step 4 (Optional): Advanced configuration
+
+Click on `Advanced options` to configure your Python deployment further, including:
+
+!!! info ""
+    1. [Resource allocation](#resource-allocation)
+    2. [Autoscaling](#autoscaling)
+    3. [Scheduling](#scheduling)
+
+Once you are done with the changes, click on `Create new Python deployment` at the bottom of the page to create the Python deployment.
+
+## Code
+
+### Step 1: Connect to Hopsworks
+
+=== "Python"
+
+    ```python
+    import hopsworks
+
+    project = hopsworks.login()
+
+    # get Hopsworks Model Serving handle
+    ms = project.get_model_serving()
+    ```
+
+### Step 2: Implement a Python script
+
+=== "Python"
+
+    ```python
+    import uvicorn
+    from fastapi import FastAPI
+
+    app = FastAPI()
+
+
+    @app.get("/ping")
+    async def ping():
+        return {"status": "ready"}
+
+
+    @app.post("/echo")
+    async def echo(data: dict):
+        return data
+
+
+    if __name__ == "__main__":
+        uvicorn.run(app, host="0.0.0.0", port=8080)
+    ```
+
+!!! info "Jupyter magic"
+    In a jupyter notebook, you can add `%%writefile python_server.py` at the top of the cell to save it as a local file.
+
+### Step 3: Upload the script to your project
+
+=== "Python"
+
+    ```python
+    import os
+
+    dataset_api = project.get_dataset_api()
+
+    uploaded_file_path = dataset_api.upload("python_server.py", "Resources", overwrite=True)
+    script_path = os.path.join("/Projects", project.name, uploaded_file_path)
+    ```
+
+### Step 4: Create a deployment
+
+=== "Python"
+
+    ```python
+    py_server = ms.create_endpoint(
+        name="pyserver",
+        script_file=script_path
+    )
+    py_deployment = py_server.deploy()
+    ```
+
+### Step 5: Send requests
+
+=== "Python"
+
+    ```python
+    import requests
+
+    url = py_deployment.get_endpoint_url()
+
+    response = requests.post(f"{url}/echo", json={"key": "value"})
+    print(response.json())
+    ```
+
+## Resource Allocation
+
+Configure CPU, memory, and GPU allocation for your Python deployment.
+Each deployment component has separate request and limit values.
+
+For full details on resource configuration, see the [Resource Allocation Guide](../mlops/serving/resources.md).
+
+## Autoscaling
+
+Deployments use ==Knative Pod Autoscaler (KPA)== to automatically scale the number of replicas based on traffic. You can configure the minimum and maximum number of instances as well as the scale metric (requests per second or concurrency).
+
+For full details on autoscaling parameters, see the [Autoscaling Guide](../mlops/serving/autoscaling.md).
+
+## Scheduling
+
+If the cluster has Kueue enabled, you can select a queue for your deployment from the advanced configuration. Queues control resource allocation and scheduling priority across the cluster.
+
+For full details on scheduling configuration, see the [Scheduling Guide](../mlops/serving/scheduling.md).
+
+## Environment Variables
+
+??? info "Show available environment variables"
+
+    | Name                   | Description                                        |
+    | ---------------------- | -------------------------------------------------- |
+    | DEPLOYMENT_NAME        | Name of the current deployment                     |
+    | DEPLOYMENT_VERSION     | Version of the deployment                          |
+    | ARTIFACT_FILES_PATH    | Local path to the artifact files                   |
+    | SCRIPT_PATH            | Full path to the Python script                     |
+    | SCRIPT_NAME            | Prefixed filename of the Python script             |
+    | CONFIG_FILE_PATH       | Local path to the configuration file (if provided) |
+    | REST_ENDPOINT          | Hopsworks REST API endpoint                        |
+    | HOPSWORKS_PROJECT_ID   | ID of the project                                  |
+    | HOPSWORKS_PROJECT_NAME | Name of the project                                |
+    | API_KEY                | API key for authenticating with Hopsworks services |
+    | PROJECT_ID             | Project ID (for Feature Store access)              |
+    | PROJECT_NAME           | Project name (for Feature Store access)            |
+    | SECRETS_DIR            | Path to secrets directory (`/keys`)                |
+    | MATERIAL_DIRECTORY     | Path to TLS certificates (`/certs`)                |
diff --git a/mkdocs.yml b/mkdocs.yml
index f98f3dd9b2..654e5840c3 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -190,6 +190,7 @@ nav:
      - Api Keys:
          - Create API Key: user_guides/projects/api_key/create_api_key.md
      - AWS IAM Roles: user_guides/projects/iam_role/iam_role_chaining.md
+     - Python Deployment: user_guides/projects/python-deployment.md
  - MLOps:
      - user_guides/mlops/index.md
      - Model Registry:
@@ -205,15 +206,17 @@ nav:
          - Model Evaluation Images: user_guides/mlops/registry/model_evaluation_images.md
      - Model Serving:
          - user_guides/mlops/serving/index.md
-         - Deployment:
-             - Deployment creation: user_guides/mlops/serving/deployment.md
-             - Deployment state: user_guides/mlops/serving/deployment-state.md
-             - Predictor: user_guides/mlops/serving/predictor.md
-             - Transformer: user_guides/mlops/serving/transformer.md
-             - Resource Allocation: user_guides/mlops/serving/resources.md
-             - Inference Logger: user_guides/mlops/serving/inference-logger.md
-             - Inference Batcher: user_guides/mlops/serving/inference-batcher.md
-             - API Protocol: user_guides/mlops/serving/api-protocol.md
+         - Model Deployment:
+             - Deployment Creation: user_guides/mlops/serving/deployment.md
+             - Deployment State: user_guides/mlops/serving/deployment-state.md
+             - Predictor (KServe): user_guides/mlops/serving/predictor.md
+             - Transformer (KServe): user_guides/mlops/serving/transformer.md
+             - Inference Logger: user_guides/mlops/serving/inference-logger.md
+             - Inference Batcher: user_guides/mlops/serving/inference-batcher.md
+             - Resource Allocation: user_guides/mlops/serving/resources.md
+             - Autoscaling: user_guides/mlops/serving/autoscaling.md
+             - Scheduling: user_guides/mlops/serving/scheduling.md
+             - API Protocol: user_guides/mlops/serving/api-protocol.md
      - REST API: user_guides/mlops/serving/rest-api.md
      - Troubleshooting: user_guides/mlops/serving/troubleshooting.md
      - External Access: user_guides/mlops/serving/external-access.md