MLflow Tracking Server

MLflow provides experiment tracking for training and fine-tuning workloads. In Alauda AI, MLflow is installed as a cluster plugin through the MLflow Operator. The operator deploys an MLflow Tracking Server, exposes it through the platform ingress, and adds an MLFlow entry in the Tools menu.

Prerequisites

  • Alauda AI is installed on the target cluster.
  • A PostgreSQL database is available for MLflow metadata.
  • The platform OAuth/OIDC provider is configured.
  • For namespace-backed workspaces, each target namespace carries the label that the MLflow configuration selects, for example mlflow-enabled=true.

Install Or Upgrade

  1. Upload the MLflow cluster plugin package to the global cluster.
  2. In the Web Console, go to Administrator > Marketplace > Upload Packages and verify that the MLflow package version appears.
  3. Install or upgrade the MLflow cluster plugin for the target cluster.
  4. Configure the PostgreSQL host, port, username, and password.
  5. Enable multi-tenancy when users should access MLflow workspaces backed by Kubernetes namespaces.
  6. After the plugin status shows as running, open Alauda AI > Tools > MLFlow.

Workspace Access

MLflow workspaces map to Kubernetes namespaces. Only namespaces matching the configured label selector are visible as workspaces.

Example namespace:

apiVersion: v1
kind: Namespace
metadata:
  name: team-a
  labels:
    mlflow-enabled: "true"
  annotations:
    mlflow.kubeflow.org/workspace-description: "Team A MLflow workspace"
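
The visibility rule can be sketched in Python. This is a minimal illustration that assumes an equality-based selector such as mlflow-enabled=true; the function name is_workspace_visible is hypothetical and is not part of MLflow or the operator, and the selector syntax accepted by your configuration may be broader.

```python
def is_workspace_visible(namespace_labels, selector):
    """Return True if every key/value pair in the selector is present in the labels."""
    return all(namespace_labels.get(key) == value for key, value in selector.items())


# Assumed selector, taken from the mlflow-enabled=true example.
selector = {"mlflow-enabled": "true"}

team_a_labels = {"mlflow-enabled": "true"}
team_b_labels = {"team": "b"}

print(is_workspace_visible(team_a_labels, selector))  # True
print(is_workspace_visible(team_b_labels, selector))  # False
```

Kubernetes label values are strings, which is why the manifest quotes "true" rather than using a YAML boolean.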

Grant users access to MLflow resources in a workspace with Kubernetes RBAC:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: mlflow-manager
  namespace: team-a
rules:
  - apiGroups: ["mlflow.kubeflow.org"]
    resources: ["experiments", "datasets", "registeredmodels"]
    verbs: ["get", "list", "create", "update", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: mlflow-manager
  namespace: team-a
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: mlflow-manager
subjects:
  - kind: Group
    name: team-a-mlflow-users
    apiGroup: rbac.authorization.k8s.io
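
To reason about which requests the Role above permits, the rule matching can be approximated in a few lines of Python. This is a simplified sketch, not the Kubernetes authorizer: the helper name role_allows is hypothetical, and it ignores wildcards ("*"), resourceNames, and ClusterRoles.

```python
def role_allows(rules, api_group, resource, verb):
    """Return True if any rule grants the verb on the resource in the API group."""
    for rule in rules:
        if (
            api_group in rule.get("apiGroups", [])
            and resource in rule.get("resources", [])
            and verb in rule.get("verbs", [])
        ):
            return True
    return False


# Rules copied from the mlflow-manager Role.
mlflow_manager_rules = [
    {
        "apiGroups": ["mlflow.kubeflow.org"],
        "resources": ["experiments", "datasets", "registeredmodels"],
        "verbs": ["get", "list", "create", "update", "delete"],
    }
]

print(role_allows(mlflow_manager_rules, "mlflow.kubeflow.org", "experiments", "create"))  # True
print(role_allows(mlflow_manager_rules, "mlflow.kubeflow.org", "experiments", "patch"))   # False
```

The second call returns False because the Role does not list the patch verb.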

Client Configuration

Set the MLflow tracking URI to the platform route and select the workspace:

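import mlflow

# Replace <platform> and <cluster-name> with the values for your environment.
mlflow.set_tracking_uri("https://<platform>/clusters/<cluster-name>/mlflow")
# The workspace name is the name of the backing Kubernetes namespace.
mlflow.set_workspace("team-a")

with mlflow.start_run():
    # Parameters logged inside this block are recorded on the active run.
    mlflow.log_param("tenant", "team-a")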
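
When the same script runs against several clusters, it can help to assemble the tracking URI from its parts instead of hard-coding it. The helper below is a hypothetical sketch that follows the URL pattern shown above; the host name and cluster name in the example are placeholders, not real endpoints.

```python
from urllib.parse import quote


def build_tracking_uri(platform_host, cluster_name):
    """Build the MLflow tracking URI for a cluster behind the platform ingress."""
    if not platform_host or not cluster_name:
        raise ValueError("platform_host and cluster_name are required")
    # Percent-encode the cluster name so it is safe as a single path segment.
    return f"https://{platform_host}/clusters/{quote(cluster_name, safe='')}/mlflow"


print(build_tracking_uri("platform.example.com", "workload-1"))
# https://platform.example.com/clusters/workload-1/mlflow
```

The resulting string can be passed to mlflow.set_tracking_uri or exported as the MLFLOW_TRACKING_URI environment variable.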

For HTTP clients, pass the workspace name in the X-MLFLOW-WORKSPACE header:

curl \
  -H "X-MLFLOW-WORKSPACE: team-a" \
  "https://<platform>/clusters/<cluster-name>/mlflow/api/2.0/mlflow/experiments/search"

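The same request can be built with the Python standard library. This sketch only mirrors the curl call above: the function name build_search_request is hypothetical, the host and cluster name are placeholders, and authentication is omitted because it depends on how the platform OAuth/OIDC provider is configured.

```python
import urllib.request


def build_search_request(tracking_uri, workspace):
    """Build a GET request for experiments/search scoped to one workspace."""
    url = tracking_uri.rstrip("/") + "/api/2.0/mlflow/experiments/search"
    return urllib.request.Request(
        url,
        headers={"X-MLFLOW-WORKSPACE": workspace},
        method="GET",
    )


request = build_search_request(
    "https://platform.example.com/clusters/workload-1/mlflow",  # placeholder URI
    "team-a",
)
print(request.full_url)
print(request.header_items())
```

The request is built but not sent; pass it to urllib.request.urlopen to send it. urllib normalizes the capitalization of header names, which is harmless because HTTP header names are case-insensitive.
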
High Availability And Storage

MLflow uses an external PostgreSQL database for metadata. Use a highly available PostgreSQL service for production environments.

The default artifact path is local to the MLflow pod. For production, configure durable artifact storage through the MLflow deployment settings before users store experiment artifacts.

The default MLflow server deployment is not a multi-replica, highly available deployment unless the release notes for your version state otherwise.

Troubleshooting

  • If the MLFlow Tools menu entry is missing, verify that the aml-mlflow-menu-config ConfigMap exists in the MLflow namespace and has the label aml.cpaas.io/centralMenuItem: "true".
  • If a workspace is not visible, verify that its namespace matches the MLflow workspace label selector.
  • If requests are denied, check that a RoleBinding in the workspace namespace binds the user, or one of the user's groups, to a Role that grants the required verbs on the MLflow resources.
  • If the server does not start, verify PostgreSQL connectivity and credentials.
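
For the last item, a quick way to rule out network problems is a plain TCP connection test from a machine or pod that shares the MLflow server's network path. The sketch below uses only the Python standard library; the function name can_reach is hypothetical. It confirms only that the port accepts connections and does not validate the username or password.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Call it with the host and port configured during installation, for example can_reach("<postgres-host>", 5432), where 5432 is the default PostgreSQL port. A False result points to DNS, routing, or firewall issues rather than credentials.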