(tmp) #556
23 changes: 14 additions & 9 deletions docs/concepts/mlops/serving.md
Original file line number Diff line number Diff line change
@@ -1,32 +1,37 @@
In Hopsworks, you can easily deploy models from the model registry in KServe or in Docker containers (for Hopsworks Community).
KServe is the defacto open-source framework for model serving on Kubernetes.
You can deploy models in either programs, using the HSML library, or in the UI.
In Hopsworks, you can easily deploy models from the model registry using [KServe](https://kserve.github.io/website/latest/), the standard open-source framework for model serving on Kubernetes.
You can deploy models programmatically using the HSML library or via the UI.
A KServe model deployment can include the following components:

**`Transformer`**
**`Predictor (KServe component)`**

: A ^^pre-processing^^ and ^^post-processing^^ component that can transform model inputs before predictions are made, and predictions before these are delivered back to the client.
: A predictor runs a model server (Python, TensorFlow Serving, or vLLM) that loads a trained model, handles inference requests and returns predictions.

**`Predictor`**
**`Transformer (KServe component)`**

: A predictor is a ML model in a Python object that takes a feature vector as input and returns a prediction as output.
: A ^^pre-processing^^ and ^^post-processing^^ component that can transform model inputs before predictions are made, and predictions before these are delivered back to the client. Not available for vLLM deployments.

**`Inference Logger`**

: Hopsworks logs inputs and outputs of transformers and predictors to a ^^Kafka topic^^ that is part of the same project as the model.
: Hopsworks logs inputs and outputs of transformers and predictors to a ^^Kafka topic^^ that is part of the same project as the model. Not available for vLLM deployments.

**`Inference Batcher`**

: Inference requests can be batched to improve throughput (at the cost of slightly higher latency).

**`Istio Model Endpoint`**

: You can publish a model over ^^REST(HTTP)^^ or ^^gRPC^^ using a Hopsworks API key.
: You can publish a model over ^^REST(HTTP)^^ or ^^gRPC^^ using a Hopsworks API key, accessible via path-based routing through Istio.
API keys have scopes to ensure the principle of least privilege access control to resources managed by Hopsworks.

!!! warning "Host-based routing"
The Istio Model Endpoint supports host-based routing for inference requests; however, this approach is considered legacy. Path-based routing is recommended for new deployments.
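
As an illustration of what a path-based inference request looks like, here is a sketch using only the Python standard library. The host, API key, deployment name, and exact URL layout are assumptions and will differ per cluster:

```python
import json
import urllib.request

# Hypothetical values: the real host and URL layout depend on your cluster.
HOPSWORKS_HOST = "https://hopsworks.example.com"
API_KEY = "my-api-key"
DEPLOYMENT = "fraudmodel"

# With path-based routing the deployment name lives in the URL path,
# so no per-model Host header is needed (unlike legacy host-based routing).
payload = {"instances": [[1.0, 2.0, 3.0]]}
request = urllib.request.Request(
    url=f"{HOPSWORKS_HOST}/v1/models/{DEPLOYMENT}:predict",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"ApiKey {API_KEY}",
        "Content-Type": "application/json",
    },
    method="POST",
)
# urllib.request.urlopen(request) would send it; omitted here.
```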

Models deployed on KServe in Hopsworks can be easily integrated with the Hopsworks Feature Store using either a Transformer or a Predictor Python script that builds the predictor's input feature vector from the application input and pre-computed features from the Feature Store.

<img src="../../../assets/images/concepts/mlops/kserve.svg">

!!! info "Model Serving Guide"
More information can be found in the [Model Serving guide](../../user_guides/mlops/serving/index.md).

!!! tip "Python deployments"
For deploying Python scripts without a model artifact, see the [Python Deployments](../../user_guides/projects/python-deployment.md) page.
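
As a rough sketch of what such a feature-enriching script can look like (the `preprocess`/`postprocess` hooks follow the usual KServe-style transformer convention; the in-memory `feature_cache` is a hypothetical stand-in for a real feature store lookup, not the actual Hopsworks API):

```python
class Transformer:
    """Sketch of a KServe-style transformer script: enrich each request
    with precomputed features before prediction, and reshape the raw
    model output before returning it to the client."""

    def __init__(self):
        # Hypothetical stand-in for a feature store connection; a real
        # script would fetch precomputed feature vectors at request time.
        self.feature_cache = {42: [0.1, 0.7, 0.2]}

    def preprocess(self, inputs):
        # Build the predictor's input vectors from the request payload
        # (application input) plus precomputed features keyed by entity id.
        instances = [
            self.feature_cache.get(item["id"], [0.0, 0.0, 0.0]) + [item["amount"]]
            for item in inputs["instances"]
        ]
        return {"instances": instances}

    def postprocess(self, outputs):
        # Post-process predictions before they go back to the client.
        return {"scores": outputs["predictions"]}
```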
3 changes: 0 additions & 3 deletions docs/user_guides/fs/feature_view/feature-vectors.md
@@ -239,7 +239,6 @@ However, you can retrieve the untransformed feature vectors without applying model-dependent transformations
entry=[{"id": 1}, {"id": 2}], transform=False
)


```

## Retrieving feature vector without on-demand features
@@ -258,7 +257,6 @@ To achieve this, set the parameters `transform` and `on_demand_features` to `False`
entry=[{"id": 1}, {"id": 2}], transform=False, on_demand_features=False
)


```

## Passing Context Variables to Transformation Functions
@@ -274,7 +272,6 @@ After [defining a transformation function using a context variable](../transform
entry=[{"pk1": 1}], transformation_context={"context_parameter": 10}
)


```

## Choose the right Client
5 changes: 0 additions & 5 deletions docs/user_guides/fs/feature_view/helper-columns.md
@@ -41,7 +41,6 @@ for computing the [on-demand feature](../../../concepts/fs/feature_group/on_dema
inference_helper_columns=["expiry_date"],
)


```

### Inference Data Retrieval
@@ -88,7 +87,6 @@ However, they can be optionally fetched with inference or training data.
]
]


```

#### Online inference
@@ -129,7 +127,6 @@ However, they can be optionally fetched with inference or training data.
passed_features={"days_valid": days_valid},
)


```

## Training Helper columns
@@ -156,7 +153,6 @@ For example one might want to use feature like `category` of the purchased product
training_helper_columns=["category"],
)


```

### Training Data Retrieval
@@ -190,7 +186,6 @@ However, they can be optionally fetched.
training_dataset_version=1, training_helper_columns=True
)


```

!!! note
@@ -55,7 +55,6 @@ Additionally, Hopsworks also allows users to specify custom names for transformed features
transformation_functions=[add_two, add_one_multiple],
)


```

### Specifying input features
@@ -77,7 +76,6 @@ The features to be used by a model-dependent transformation function can be specified
],
)


```

### Using built-in transformations
@@ -106,7 +104,6 @@ The only difference is that they can either be retrieved from the Hopsworks or i
],
)


```

To attach built-in transformation functions from the `hopsworks` module they can be directly imported into the code from `hopsworks.builtin_transformations`.
@@ -134,7 +131,6 @@ To attach built-in transformation functions from the `hopsworks` module they can
],
)


```

## Using Model Dependent Transformations
@@ -160,7 +156,6 @@ Model-dependent transformation functions can also be manually applied to a feature vector
# Apply Model Dependent transformations
encoded_feature_vector = fv.transform(feature_vector)


```

### Retrieving untransformed feature vector and batch inference data
@@ -185,5 +180,4 @@ To achieve this, set the `transform` parameter to False.
# Fetching untransformed batch data.
untransformed_batch_data = feature_view.get_batch_data(transform=False)


```
1 change: 0 additions & 1 deletion docs/user_guides/fs/feature_view/training-data.md
@@ -154,7 +154,6 @@ Once you have [defined a transformation function using a context variable](../tr
transformation_context={"context_parameter": 10},
)


```

## Read training data with primary key(s) and event time
22 changes: 6 additions & 16 deletions docs/user_guides/mlops/serving/api-protocol.md
@@ -3,11 +3,12 @@
## Introduction

Hopsworks supports both REST and gRPC as API protocols for sending inference requests to model deployments.
While REST API protocol is supported in all types of model deployments, support for gRPC is only available for models served with [KServe](predictor.md#serving-tool).
While the REST API protocol is supported in all types of model deployments, gRPC is only supported for **Python model server** deployments with a model artifact.

!!! warning
At the moment, the gRPC API protocol is only supported for **Python model deployments** (e.g., scikit-learn, xgboost).
Support for Tensorflow model deployments is coming soon.
!!! warning "gRPC constraints"
- gRPC is only supported for Python model server deployments
- A model artifact is required; gRPC is not available for Python script deployments, which have no model artifact
- gRPC uses port 8081 with the `h2c` protocol

## GUI

@@ -40,17 +41,7 @@ To navigate to the advanced creation form, click on `Advanced options`.

### Step 3: Select the API protocol

Enabling gRPC as the API protocol for a model deployment requires KServe as the serving platform for the deployment.
Make sure that KServe is enabled by activating the corresponding checkbox.

<p align="center">
<figure>
<img style="max-width: 85%; margin: 0 auto" src="../../../../assets/images/guides/mlops/serving/deployment_adv_form_kserve.png" alt="KServe enabled in advanced deployment form">
<figcaption>Enable KServe in the advanced deployment form</figcaption>
</figure>
</p>

Then, you can select the API protocol to be enabled in your model deployment.
You can select the API protocol to be enabled in your model deployment in the advanced deployment form.

<p align="center">
<figure>
@@ -102,7 +93,6 @@ Once you are done with the changes, click on `Create new deployment` at the bottom
my_deployment = ms.create_deployment(my_predictor)
my_deployment.save()


```

### API Reference
55 changes: 55 additions & 0 deletions docs/user_guides/mlops/serving/autoscaling.md
@@ -0,0 +1,55 @@
---
description: Documentation on how to configure scaling for a deployment
---

# How To Configure Scaling For A Deployment

## Introduction

Deployments use [Knative Pod Autoscaler (KPA)](https://knative.dev/docs/serving/autoscaling/) to automatically scale the number of replicas based on traffic.

??? info "Show scale metrics"

| Scale Metric | Default Target | Description |
| ------------ | -------------- | ------------------------------------ |
| RPS | 200 | Requests per second per replica |
| CONCURRENCY | 100 | Concurrent requests per replica |

**Scaling parameters:**

- `minInstances` — Minimum replicas (0 enables scale-to-zero)
- `maxInstances` — Maximum replicas (must be ≥1, cannot be less than min)
- `panicWindowPercentage` — Panic window as percentage of stable window (default: 10.0, range: 1-100)
- `stableWindowSeconds` — Stable window duration in seconds (default: 60, range: 6-3600)
- `panicThresholdPercentage` — Traffic threshold to trigger panic mode (default: 200.0, must be >0)
- `scaleToZeroRetentionSeconds` — Time to retain pods before scaling to zero (default: 0, must be ≥0)

!!! note "Cluster-level constraints"
Administrators can set cluster-wide limits on the maximum and minimum number of instances. When the minimum is set to 0, scale-to-zero is enforced for all deployments.
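
The sizing rule behind these parameters can be approximated as follows. This is a simplified illustration of how the KPA decides on a replica count, not the actual Knative implementation:

```python
import math

def desired_replicas(observed_load, target_per_replica,
                     current_replicas, panic_threshold_pct=200.0):
    """Simplified sketch of KPA sizing: scale so that each replica sees
    at most `target_per_replica` (RPS or concurrency). If observed load
    exceeds panic_threshold_pct% of current capacity, the autoscaler
    enters panic mode and refuses to scale down."""
    desired = math.ceil(observed_load / target_per_replica)
    capacity = current_replicas * target_per_replica
    panicking = observed_load >= capacity * (panic_threshold_pct / 100.0)
    if panicking:
        desired = max(desired, current_replicas)  # never scale down in panic
    return desired

# 450 RPS with a target of 200 RPS per replica -> 3 replicas
print(desired_replicas(450, 200, current_replicas=2))
```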

## Code

=== "Python"

```python
from hsml.resources import PredictorResources, Resources

minimum_res = Resources(cores=1, memory=256, gpus=1)
maximum_res = Resources(cores=2, memory=512, gpus=1)

predictor_res = PredictorResources(
num_instances=1,
requests=minimum_res,
limits=maximum_res
)

my_predictor = ms.create_predictor(
my_model,
resources=predictor_res,
# autoscaling
min_instances=1,
max_instances=5,
scale_metric="RPS",
scale_target=100
)
```
30 changes: 17 additions & 13 deletions docs/user_guides/mlops/serving/deployment-state.md
@@ -86,7 +86,6 @@ Additionally, you can find the nº of instances currently running by scrolling down
```python
deployment = ms.get_deployment("mydeployment")


```

### Step 3: Inspect deployment state
@@ -98,7 +97,6 @@ Additionally, you can find the nº of instances currently running by scrolling down

state.describe()


```

### Step 4: Check nº of running instances
@@ -112,7 +110,6 @@ Additionally, you can find the nº of instances currently running by scrolling down
# nº of transformer instances
deployment.transformer.resources.describe()


```

### API Reference
@@ -127,16 +124,23 @@ The status of a deployment is a high-level description of its current state.

??? info "Show deployment status"

| Status | Description |
| -------- | ------------------------------------------------------------------------------------------------------------------------ |
| CREATED | Deployment has never been started |
| STARTING | Deployment is starting |
| RUNNING | Deployment is ready and running. Predictions are served without additional latencies. |
| IDLE | Deployment is ready, but idle. Higher latencies (i.e., cold-start) are expected in the first incoming inference requests |
| FAILED | Deployment is in a failed state, which can be due to multiple reasons. More details can be found in the condition |
| UPDATING | Deployment is applying updates to the running instances |
| STOPPING | Deployment is stopping |
| STOPPED | Deployment has been stopped |
| Status | Description |
| -------- | ----------------------------------------------------------------------------------------------------------------------------------------------- |
| CREATING | Deployment artifacts are being prepared |
| CREATED | Deployment has never been started |
| STARTING | Deployment is starting |
| RUNNING | Deployment is ready and running. Predictions are served without additional latencies. |
| IDLE | Deployment is ready but scaled to zero or has no active replicas. Higher latencies (cold-start) are expected on the first inference request. |
| FAILED | Terminal state. The deployment has encountered an unrecoverable error. More details can be found in the status condition. |
| UPDATING | Deployment is applying updates to the running instances |
| STOPPING | Deployment is stopping |
| STOPPED | Deployment has been stopped |

## How States Are Determined

Deployment state is determined from multiple sources:

- the database state (whether the deployment has been deployed and its revision)
- KServe InferenceService conditions
- pod presence (available replicas for the predictor and transformer)
- the artifact filesystem (whether the deployment artifact files are ready)

A revision ID and deployment version are used to distinguish between STARTING (first generation) and UPDATING (subsequent changes to a running deployment).
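
When automating against these states, a small polling helper is often useful. In this sketch, `get_status` is a placeholder callable standing in for however you read the deployment's status (nothing here is a Hopsworks API); the status names match the table above:

```python
import time

def await_status(get_status, target="RUNNING",
                 timeout_s=300.0, poll_interval_s=1.0):
    """Poll `get_status()` until it returns `target`, raising if the
    deployment reaches the terminal FAILED state or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status()
        if status == target:
            return status
        if status == "FAILED":
            raise RuntimeError("deployment entered FAILED state")
        time.sleep(poll_interval_s)
    raise TimeoutError(f"deployment did not reach {target} in {timeout_s}s")

# Simulated status sequence, e.g. while a deployment starts up.
states = iter(["CREATING", "STARTING", "RUNNING"])
print(await_status(lambda: next(states), poll_interval_s=0.0))
```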

## Deployment conditions
