Platform Engineering for AI Teams: What's Different

AI and ML workloads break the assumptions that traditional developer platforms are built on. Here's how to extend your IDP to support data scientists and ML engineers.

R2SA Technologies
10 min read

Internal Developer Platforms built for traditional software engineering teams break in predictable ways when data scientists and ML engineers try to use them. The workflows are different, the compute requirements are different, and the artifacts are different.

Here’s what we’ve learned extending IDPs to support AI and ML workloads.

Where Traditional IDPs Break for AI Teams

Compute requirements. Your standard developer platform probably offers CPU-based compute with a few GB of RAM. ML engineers need GPU access, large-memory instances for data processing, and the ability to run long-running batch jobs. None of this fits the stateless, CPU-bound service profile that most Kubernetes-based platforms are tuned for.

Interactive workflows. Software engineers mostly work through CI/CD pipelines. Data scientists work interactively — notebooks, exploratory analysis, iterative model training. They need JupyterHub or similar, integrated with your platform’s auth and storage.

Artifact management. Software engineers produce Docker images and Helm charts. ML engineers produce models, datasets, and experiment results. An artifact registry set up for images and charts probably doesn’t version or track these well.

Non-deterministic outputs. You can’t test an ML model the same way you test software. CI/CD pipelines that run tests and gate on pass/fail don’t translate directly to model evaluation.
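One way to adapt a pass/fail gate to non-deterministic training is to score the model over several seeded runs and compare the mean against a baseline with a tolerance, rather than asserting an exact output. The sketch below is illustrative only: `fake_train_and_score` is a hypothetical stand-in for a real training run, and the baseline and tolerance values are assumptions.

```python
import statistics
from typing import Callable, Sequence


def evaluate_candidate(
    train_and_score: Callable[[int], float],
    seeds: Sequence[int],
    baseline: float,
    tolerance: float,
) -> dict:
    """Gate on the mean score across seeds instead of one exact output."""
    scores = [train_and_score(seed) for seed in seeds]
    mean = statistics.mean(scores)
    return {
        "mean": mean,
        "spread": statistics.pstdev(scores),
        # Pass if the mean is no more than `tolerance` below the baseline.
        "passed": mean >= baseline - tolerance,
    }


def fake_train_and_score(seed: int) -> float:
    # Hypothetical stand-in for a training run: accuracy varies with the seed.
    return 0.90 + (seed % 3) * 0.01


result = evaluate_candidate(
    fake_train_and_score, seeds=[0, 1, 2, 3], baseline=0.91, tolerance=0.01
)
print(result)
```

Reporting the spread alongside the mean helps reviewers judge whether a marginal pass reflects a real improvement or seed noise.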

The Extensions That Actually Help

Self-service GPU access. Implement a GPU quota system with self-service provisioning. Data scientists should be able to request a GPU notebook or training job without raising a ticket. We use Kubernetes resource quotas per team, with a Backstage plugin for self-service provisioning.
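As a rough illustration of the per-team quota, the sketch below builds a Kubernetes ResourceQuota manifest as a Python dict. It assumes one namespace per team (named after the team) and the NVIDIA device plugin's `nvidia.com/gpu` resource name; the team name and GPU count are hypothetical. A self-service plugin could generate and apply something similar, though this is not a description of any particular plugin.

```python
import json


def gpu_quota_manifest(team: str, gpus: int, gpu_resource: str = "nvidia.com/gpu") -> dict:
    """Build a ResourceQuota that caps total GPU requests in a team namespace."""
    if gpus < 0:
        raise ValueError("gpus must be non-negative")
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        # Assumption: the namespace is named after the team.
        "metadata": {"name": f"{team}-gpu-quota", "namespace": team},
        "spec": {
            # Quota for extended resources such as GPUs uses the `requests.` prefix.
            "hard": {f"requests.{gpu_resource}": str(gpus)},
        },
    }


manifest = gpu_quota_manifest("ml-research", gpus=4)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be applied like any other manifest, since Kubernetes accepts JSON as well as YAML.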

Managed notebook environment. Deploy JupyterHub on your cluster, integrated with your SSO and connected to shared storage. Pre-build images with common ML libraries. This largely removes the “works on my laptop” problem for notebooks.

Experiment tracking integration. Deploy MLflow or Weights & Biases in your platform. Make it the default — pre-configure the tracking URI in your notebook images so experiments are tracked automatically.
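With MLflow, pre-configuring the tracking URI can be as simple as setting the `MLFLOW_TRACKING_URI` environment variable, which the MLflow client reads by default. A minimal sketch of a helper that produces the environment for a notebook container follows; the server address is a hypothetical in-cluster URL, and the HTTP(S)-only check is an assumption about a platform-hosted tracking server.

```python
from urllib.parse import urlparse

# Hypothetical in-cluster address; substitute your own tracking server.
DEFAULT_TRACKING_URI = "http://mlflow.platform.svc.cluster.local:5000"


def notebook_env(tracking_uri: str = DEFAULT_TRACKING_URI) -> dict:
    """Environment variables to inject into every notebook container."""
    parsed = urlparse(tracking_uri)
    # Assumption: the platform only exposes the tracking server over HTTP(S).
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"not an HTTP(S) tracking URI: {tracking_uri!r}")
    return {"MLFLOW_TRACKING_URI": tracking_uri}


env = notebook_env()
print(env)
```

Setting the variable in the image or the notebook spawner means users get tracking without adding any configuration code to their notebooks.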

Model registry. A model registry (MLflow’s is fine for most teams) gives models the same lifecycle management that software gets from a container registry — versioning, staging, promotion, rollback.
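To make the lifecycle concrete, here is a toy in-memory registry showing versioning, promotion, and rollback. It is a sketch of the concept only, not MLflow's API; the class names, stage names, and artifact URIs are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ModelVersion:
    version: int
    artifact_uri: str
    stage: str = "none"  # none, staging, production, or archived


@dataclass
class ModelRegistry:
    """Toy registry; a real one persists this state and stores the artifacts."""
    versions: Dict[str, List[ModelVersion]] = field(default_factory=dict)
    production_history: Dict[str, List[int]] = field(default_factory=dict)

    def register(self, name: str, artifact_uri: str) -> ModelVersion:
        existing = self.versions.setdefault(name, [])
        mv = ModelVersion(version=len(existing) + 1, artifact_uri=artifact_uri)
        existing.append(mv)
        return mv

    def get(self, name: str, version: int) -> ModelVersion:
        return self.versions[name][version - 1]

    def current_production(self, name: str) -> Optional[int]:
        history = self.production_history.get(name, [])
        return history[-1] if history else None

    def promote(self, name: str, version: int, stage: str) -> None:
        if stage not in ("staging", "production"):
            raise ValueError(f"unknown stage: {stage}")
        if stage == "production":
            current = self.current_production(name)
            if current == version:
                return  # already serving this version
            if current is not None:
                self.get(name, current).stage = "archived"
            self.production_history.setdefault(name, []).append(version)
        self.get(name, version).stage = stage

    def rollback(self, name: str) -> int:
        """Archive the current production version and restore the previous one."""
        history = self.production_history.get(name, [])
        if len(history) < 2:
            raise RuntimeError("no earlier production version to roll back to")
        self.get(name, history.pop()).stage = "archived"
        self.get(name, history[-1]).stage = "production"
        return history[-1]


registry = ModelRegistry()
registry.register("churn", "s3://models/churn/1")  # hypothetical URIs
registry.register("churn", "s3://models/churn/2")
registry.promote("churn", 1, "production")
registry.promote("churn", 2, "production")
restored = registry.rollback("churn")
print(restored, registry.get("churn", 2).stage)
```

Keeping a production history, rather than just a current pointer, is what makes rollback a single operation.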

Data access patterns. ML workloads need large-scale data access. Integrate your platform with your data lake (S3, GCS, ADLS) and provide pre-configured credentials via your secrets management system. Don’t make data scientists manage cloud credentials manually.

The Golden Path for ML

Applying the golden path concept to ML means defining a standard way to go from experiment to production model:

  1. Notebook (exploration)
  2. Python script (reproducible experiment)
  3. MLflow experiment (tracked, versioned)
  4. Training pipeline (reproducible, parameterised)
  5. Model registry (versioned, with evaluation metrics)
  6. Serving infrastructure (KServe, with monitoring)

Each step should be achievable through self-service, with clear standards for what “done” looks like at each stage.
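One way to make "done" checkable is to encode each stage's criteria and report the earliest stage that still has gaps. In the sketch below, the stage names and criteria mirror the golden path listed above; the evidence format and the function name are assumptions, and how each criterion is actually verified is left to the platform.

```python
from typing import Dict, List, Optional, Set, Tuple

# Stages and criteria follow the golden path above, in order.
GOLDEN_PATH: List[Tuple[str, Set[str]]] = [
    ("notebook", {"exploration"}),
    ("python_script", {"reproducible"}),
    ("mlflow_experiment", {"tracked", "versioned"}),
    ("training_pipeline", {"reproducible", "parameterised"}),
    ("model_registry", {"versioned", "evaluation_metrics"}),
    ("serving", {"monitoring"}),
]


def first_incomplete_stage(evidence: Dict[str, Set[str]]) -> Optional[str]:
    """Return the earliest stage whose criteria are not all met, or None."""
    for stage, required in GOLDEN_PATH:
        # `<=` on sets is a subset test: every required criterion must be present.
        if not required <= evidence.get(stage, set()):
            return stage
    return None


# Hypothetical project: the experiment is tracked but not yet versioned.
evidence = {
    "notebook": {"exploration"},
    "python_script": {"reproducible"},
    "mlflow_experiment": {"tracked"},
}
blocked_at = first_incomplete_stage(evidence)
print(blocked_at)
```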

Governance Without Bureaucracy

AI models need governance that software doesn’t: model cards, fairness evaluations, data lineage, and approval workflows for high-stakes deployments.

The mistake is building heavyweight approval gates that slow everything down. The better approach: automate the documentation (generate model cards from MLflow metadata), enforce quality gates in the pipeline (evaluation metrics must meet thresholds to promote), and reserve human review for genuinely high-stakes deployments.
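A minimal sketch of that approach, under stated assumptions: metrics are higher-is-better, the metadata arrives as plain dicts (in practice it might be read from an MLflow run's logged parameters and metrics), and the function names, metric names, and thresholds are hypothetical.

```python
from typing import Dict


def promotion_decision(
    metrics: Dict[str, float], thresholds: Dict[str, float], high_stakes: bool
) -> str:
    """Return 'rejected', 'needs_review', or 'auto_promote'."""
    # A missing metric counts as failing its threshold.
    failing = [
        name
        for name, minimum in thresholds.items()
        if metrics.get(name, float("-inf")) < minimum
    ]
    if failing:
        return "rejected"
    # Only deployments flagged as high-stakes wait for a human.
    return "needs_review" if high_stakes else "auto_promote"


def render_model_card(
    name: str, version: int, params: Dict[str, object], metrics: Dict[str, float]
) -> str:
    """Generate a plain-text model card from run metadata."""
    lines = [f"Model card: {name} v{version}", "", "Parameters:"]
    lines += [f"  {key}: {value}" for key, value in sorted(params.items())]
    lines += ["", "Evaluation metrics:"]
    lines += [f"  {key}: {value:.3f}" for key, value in sorted(metrics.items())]
    return "\n".join(lines)


metrics = {"auc": 0.91, "recall": 0.80}
thresholds = {"auc": 0.90, "recall": 0.75}
decision = promotion_decision(metrics, thresholds, high_stakes=False)
card = render_model_card("churn", 3, {"max_depth": 6}, metrics)
print(decision)
print(card)
```

Because the gate and the card both run in the pipeline, routine promotions need no manual paperwork, and reviewers of high-stakes models start from a generated document.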

Practical Advice

Start by talking to your data science and ML engineering teams. Their biggest pain points are probably not what you expect. In our experience, the top three are:

  1. Getting GPU access without a lengthy approval process
  2. Managing Python dependencies and environments
  3. Sharing code and results with colleagues

These are all solvable with relatively modest platform investment. Fix the basics before building sophisticated MLOps infrastructure.


Extending your developer platform to support AI and ML workloads? Get in touch — this is one of our core specialisms.
