(tmp) #556
23 changes: 14 additions & 9 deletions docs/concepts/mlops/serving.md
Original file line number Diff line number Diff line change
@@ -1,32 +1,37 @@
In Hopsworks, you can easily deploy models from the model registry in KServe or in Docker containers (for Hopsworks Community).
KServe is the defacto open-source framework for model serving on Kubernetes.
You can deploy models in either programs, using the HSML library, or in the UI.
In Hopsworks, you can easily deploy models from the model registry using [KServe](https://kserve.github.io/website/latest/), the standard open-source framework for model serving on Kubernetes.
You can deploy models programmatically using the HSML library or via the UI.
A KServe model deployment can include the following components:

**`Transformer`**
**`Predictor (KServe component)`**

: A ^^pre-processing^^ and ^^post-processing^^ component that can transform model inputs before predictions are made, and predictions before these are delivered back to the client.
: A predictor runs a model server (Python, TensorFlow Serving, or vLLM) that loads a trained model, handles inference requests and returns predictions.

**`Predictor`**
**`Transformer (KServe component)`**

: A predictor is a ML model in a Python object that takes a feature vector as input and returns a prediction as output.
: A ^^pre-processing^^ and ^^post-processing^^ component that can transform model inputs before predictions are made, and predictions before these are delivered back to the client. Not available for vLLM deployments.

**`Inference Logger`**

: Hopsworks logs inputs and outputs of transformers and predictors to a ^^Kafka topic^^ that is part of the same project as the model.
: Hopsworks logs inputs and outputs of transformers and predictors to a ^^Kafka topic^^ that is part of the same project as the model. Not available for vLLM deployments.

**`Inference Batcher`**

: Inference requests can be batched to improve throughput (at the cost of slightly higher latency).

**`Istio Model Endpoint`**

: You can publish a model over ^^REST(HTTP)^^ or ^^gRPC^^ using a Hopsworks API key.
: You can publish a model over ^^REST(HTTP)^^ or ^^gRPC^^ using a Hopsworks API key, accessible via path-based routing through Istio.
API keys have scopes to ensure the principle of least privilege access control to resources managed by Hopsworks.

!!! warning "Host-based routing"
The Istio Model Endpoint supports host-based routing for inference requests; however, this approach is considered legacy. Path-based routing is recommended for new deployments.
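
As an illustration of what a path-based inference request looks like, here is a sketch using only the Python standard library. The host, API key, deployment name, and exact URL layout are assumptions and will differ per cluster:

```python
import json
import urllib.request

# Hypothetical values: the real host and URL layout depend on your cluster.
HOPSWORKS_HOST = "https://hopsworks.example.com"
API_KEY = "my-api-key"
DEPLOYMENT = "fraudmodel"

# With path-based routing the deployment name lives in the URL path,
# so no per-model Host header is needed (unlike legacy host-based routing).
payload = {"instances": [[1.0, 2.0, 3.0]]}
request = urllib.request.Request(
    url=f"{HOPSWORKS_HOST}/v1/models/{DEPLOYMENT}:predict",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"ApiKey {API_KEY}",
        "Content-Type": "application/json",
    },
    method="POST",
)
# urllib.request.urlopen(request) would send it; omitted here.
```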

Models deployed on KServe in Hopsworks can be easily integrated with the Hopsworks Feature Store using either a Transformer or a Predictor Python script that builds the predictor's input feature vector from the application input and pre-computed features from the Feature Store.

<img src="../../../assets/images/concepts/mlops/kserve.svg">

!!! info "Model Serving Guide"
More information can be found in the [Model Serving guide](../../user_guides/mlops/serving/index.md).

!!! tip "Python deployments"
For deploying Python scripts without a model artifact, see the [Python Deployments](../../user_guides/projects/python-deployment.md) page.
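
As a rough sketch of what such a feature-enriching script can look like (the `preprocess`/`postprocess` hooks follow the usual KServe-style transformer convention; the in-memory `feature_cache` is a hypothetical stand-in for a real feature store lookup, not the actual Hopsworks API):

```python
class Transformer:
    """Sketch of a KServe-style transformer script: enrich each request
    with precomputed features before prediction, and reshape the raw
    model output before returning it to the client."""

    def __init__(self):
        # Hypothetical stand-in for a feature store connection; a real
        # script would fetch precomputed feature vectors at request time.
        self.feature_cache = {42: [0.1, 0.7, 0.2]}

    def preprocess(self, inputs):
        # Build the predictor's input vectors from the request payload
        # (application input) plus precomputed features keyed by entity id.
        instances = [
            self.feature_cache.get(item["id"], [0.0, 0.0, 0.0]) + [item["amount"]]
            for item in inputs["instances"]
        ]
        return {"instances": instances}

    def postprocess(self, outputs):
        # Post-process predictions before they go back to the client.
        return {"scores": outputs["predictions"]}
```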
3 changes: 0 additions & 3 deletions docs/user_guides/fs/feature_view/feature-vectors.md
@@ -239,7 +239,6 @@ However, you can retrieve the untransformed feature vectors without applying model-dependent transformations
entry=[{"id": 1}, {"id": 2}], transform=False
)


```

## Retrieving feature vector without on-demand features
@@ -258,7 +257,6 @@ To achieve this, set the parameters `transform` and `on_demand_features` to `False`
entry=[{"id": 1}, {"id": 2}], transform=False, on_demand_features=False
)


```

## Passing Context Variables to Transformation Functions
@@ -274,7 +272,6 @@ After [defining a transformation function using a context variable](../transform
entry=[{"pk1": 1}], transformation_context={"context_parameter": 10}
)


```

## Choose the right Client
5 changes: 0 additions & 5 deletions docs/user_guides/fs/feature_view/helper-columns.md
@@ -41,7 +41,6 @@ for computing the [on-demand feature](../../../concepts/fs/feature_group/on_dema
inference_helper_columns=["expiry_date"],
)


```

### Inference Data Retrieval
@@ -88,7 +87,6 @@ However, they can be optionally fetched with inference or training data.
]
]


```

#### Online inference
@@ -129,7 +127,6 @@ However, they can be optionally fetched with inference or training data.
passed_features={"days_valid": days_valid},
)


```

## Training Helper columns
@@ -156,7 +153,6 @@ For example one might want to use feature like `category` of the purchased product
training_helper_columns=["category"],
)


```

### Training Data Retrieval
@@ -190,7 +186,6 @@ However, they can be optionally fetched.
training_dataset_version=1, training_helper_columns=True
)


```

!!! note
@@ -55,7 +55,6 @@ Additionally, Hopsworks also allows users to specify custom names for transformed features
transformation_functions=[add_two, add_one_multiple],
)


```

### Specifying input features
@@ -77,7 +76,6 @@ The features to be used by a model-dependent transformation function can be specified
],
)


```

### Using built-in transformations
@@ -106,7 +104,6 @@ The only difference is that they can either be retrieved from the Hopsworks or i
],
)


```

To attach built-in transformation functions from the `hopsworks` module they can be directly imported into the code from `hopsworks.builtin_transformations`.
@@ -134,7 +131,6 @@ To attach built-in transformation functions from the `hopsworks` module they can
],
)


```

## Using Model Dependent Transformations
@@ -160,7 +156,6 @@ Model-dependent transformation functions can also be manually applied to a feature vector
# Apply Model Dependent transformations
encoded_feature_vector = fv.transform(feature_vector)


```

### Retrieving untransformed feature vector and batch inference data
@@ -185,5 +180,4 @@ To achieve this, set the `transform` parameter to False.
# Fetching untransformed batch data.
untransformed_batch_data = feature_view.get_batch_data(transform=False)


```
1 change: 0 additions & 1 deletion docs/user_guides/fs/feature_view/training-data.md
@@ -154,7 +154,6 @@ Once you have [defined a transformation function using a context variable](../tr
transformation_context={"context_parameter": 10},
)


```

## Read training data with primary key(s) and event time
22 changes: 6 additions & 16 deletions docs/user_guides/mlops/serving/api-protocol.md
@@ -3,11 +3,12 @@
## Introduction

Hopsworks supports both REST and gRPC as API protocols for sending inference requests to model deployments.
While REST API protocol is supported in all types of model deployments, support for gRPC is only available for models served with [KServe](predictor.md#serving-tool).
While the REST API protocol is supported in all types of model deployments, gRPC is only supported for **Python model server** deployments with a model artifact.

!!! warning
At the moment, the gRPC API protocol is only supported for **Python model deployments** (e.g., scikit-learn, xgboost).
Support for Tensorflow model deployments is coming soon.
!!! warning "gRPC constraints"
- gRPC is only supported for Python model server deployments
- A model artifact is required; gRPC is not available for Python script deployments, which have no model artifact
- gRPC uses port 8081 with the `h2c` protocol

## GUI

@@ -40,17 +41,7 @@ To navigate to the advanced creation form, click on `Advanced options`.

### Step 3: Select the API protocol

Enabling gRPC as the API protocol for a model deployment requires KServe as the serving platform for the deployment.
Make sure that KServe is enabled by activating the corresponding checkbox.

<p align="center">
<figure>
<img style="max-width: 85%; margin: 0 auto" src="../../../../assets/images/guides/mlops/serving/deployment_adv_form_kserve.png" alt="KServe enabled in advanced deployment form">
<figcaption>Enable KServe in the advanced deployment form</figcaption>
</figure>
</p>

Then, you can select the API protocol to be enabled in your model deployment.
You can select the API protocol to be enabled in your model deployment in the advanced deployment form.

<p align="center">
<figure>
@@ -102,7 +93,6 @@ Once you are done with the changes, click on `Create new deployment` at the bottom
my_deployment = ms.create_deployment(my_predictor)
my_deployment.save()


```

### API Reference
55 changes: 55 additions & 0 deletions docs/user_guides/mlops/serving/autoscaling.md
@@ -0,0 +1,55 @@
---
description: Documentation on how to configure scaling for a deployment
---

# How To Configure Scaling For A Deployment

## Introduction

Deployments use [Knative Pod Autoscaler (KPA)](https://knative.dev/docs/serving/autoscaling/) to automatically scale the number of replicas based on traffic.

??? info "Show scale metrics"

| Scale Metric | Default Target | Description |
| ------------ | -------------- | ------------------------------------ |
| RPS | 200 | Requests per second per replica |
| CONCURRENCY | 100 | Concurrent requests per replica |

**Scaling parameters:**

- `minInstances` — Minimum replicas (0 enables scale-to-zero)
- `maxInstances` — Maximum replicas (must be ≥1, cannot be less than min)
- `panicWindowPercentage` — Panic window as percentage of stable window (default: 10.0, range: 1-100)
- `stableWindowSeconds` — Stable window duration in seconds (default: 60, range: 6-3600)
- `panicThresholdPercentage` — Traffic threshold to trigger panic mode (default: 200.0, must be >0)
- `scaleToZeroRetentionSeconds` — Time to retain pods before scaling to zero (default: 0, must be ≥0)

!!! note "Cluster-level constraints"
Administrators can set cluster-wide limits on the maximum and minimum number of instances. When the minimum is set to 0, scale-to-zero is enforced for all deployments.
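
The sizing rule behind these parameters can be approximated as follows. This is a simplified illustration of how the KPA decides on a replica count, not the actual Knative implementation:

```python
import math

def desired_replicas(observed_load, target_per_replica,
                     current_replicas, panic_threshold_pct=200.0):
    """Simplified sketch of KPA sizing: scale so that each replica sees
    at most `target_per_replica` (RPS or concurrency). If observed load
    exceeds panic_threshold_pct% of current capacity, the autoscaler
    enters panic mode and refuses to scale down."""
    desired = math.ceil(observed_load / target_per_replica)
    capacity = current_replicas * target_per_replica
    panicking = observed_load >= capacity * (panic_threshold_pct / 100.0)
    if panicking:
        desired = max(desired, current_replicas)  # never scale down in panic
    return desired

# 450 RPS with a target of 200 RPS per replica -> 3 replicas
print(desired_replicas(450, 200, current_replicas=2))
```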

## Code

=== "Python"

```python
from hsml.resources import PredictorResources, Resources

minimum_res = Resources(cores=1, memory=256, gpus=1)
maximum_res = Resources(cores=2, memory=512, gpus=1)

predictor_res = PredictorResources(
num_instances=1,
requests=minimum_res,
limits=maximum_res
)

my_predictor = ms.create_predictor(
my_model,
resources=predictor_res,
# autoscaling
min_instances=1,
max_instances=5,
scale_metric="RPS",
scale_target=100
)
```
30 changes: 17 additions & 13 deletions docs/user_guides/mlops/serving/deployment-state.md
@@ -86,7 +86,6 @@ Additionally, you can find the nº of instances currently running by scrolling down
```python
deployment = ms.get_deployment("mydeployment")


```

### Step 3: Inspect deployment state
@@ -98,7 +97,6 @@ Additionally, you can find the nº of instances currently running by scrolling down

state.describe()


```

### Step 4: Check nº of running instances
@@ -112,7 +110,6 @@ Additionally, you can find the nº of instances currently running by scrolling down
# nº of transformer instances
deployment.transformer.resources.describe()


```

### API Reference
@@ -127,16 +124,23 @@ The status of a deployment is a high-level description of its current state.

??? info "Show deployment status"

| Status | Description |
| -------- | ------------------------------------------------------------------------------------------------------------------------ |
| CREATED | Deployment has never been started |
| STARTING | Deployment is starting |
| RUNNING | Deployment is ready and running. Predictions are served without additional latencies. |
| IDLE | Deployment is ready, but idle. Higher latencies (i.e., cold-start) are expected in the first incoming inference requests |
| FAILED | Deployment is in a failed state, which can be due to multiple reasons. More details can be found in the condition |
| UPDATING | Deployment is applying updates to the running instances |
| STOPPING | Deployment is stopping |
| STOPPED | Deployment has been stopped |
| Status | Description |
| -------- | ----------------------------------------------------------------------------------------------------------------------------------------------- |
| CREATING | Deployment artifacts are being prepared |
| CREATED | Deployment has never been started |
| STARTING | Deployment is starting |
| RUNNING | Deployment is ready and running. Predictions are served without additional latencies. |
| IDLE | Deployment is ready but scaled to zero or has no active replicas. Higher latencies (cold-start) are expected on the first inference request. |
| FAILED | Terminal state. The deployment has encountered an unrecoverable error. More details can be found in the status condition. |
| UPDATING | Deployment is applying updates to the running instances |
| STOPPING | Deployment is stopping |
| STOPPED | Deployment has been stopped |

## How States Are Determined

Deployment state is determined from multiple sources:

- the database state (whether the deployment has been deployed and its revision)
- KServe InferenceService conditions
- pod presence (available replicas for the predictor and transformer)
- the artifact filesystem (whether the deployment artifact files are ready)

A revision ID and deployment version are used to distinguish between STARTING (first generation) and UPDATING (subsequent changes to a running deployment).
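
When automating against these states, a small polling helper is often useful. In this sketch, `get_status` is a placeholder callable standing in for however you read the deployment's status (nothing here is a Hopsworks API); the status names match the table above:

```python
import time

def await_status(get_status, target="RUNNING",
                 timeout_s=300.0, poll_interval_s=1.0):
    """Poll `get_status()` until it returns `target`, raising if the
    deployment reaches the terminal FAILED state or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status()
        if status == target:
            return status
        if status == "FAILED":
            raise RuntimeError("deployment entered FAILED state")
        time.sleep(poll_interval_s)
    raise TimeoutError(f"deployment did not reach {target} in {timeout_s}s")

# Simulated status sequence, e.g. while a deployment starts up.
states = iter(["CREATING", "STARTING", "RUNNING"])
print(await_status(lambda: next(states), poll_interval_s=0.0))
```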

## Deployment conditions
