Technical Documentation

Deep Dives

Detailed technical write-ups on the systems I've built.
Architecture decisions, trade-offs, and implementation details.

☁️

Global Cloud Provider

Cross-Cluster AI Training Infrastructure

🏗️ System Architecture

K8s on K8s: Recursive Virtualization Architecture

Two-Layer Virtual Kubelet for Multi-Cluster GPU Pooling

Note: Due to confidentiality agreements, specific implementation details have been abstracted. The patterns described represent general industry practices.

When building large-scale AI SaaS platforms, we faced a critical business challenge: customers were paying for expensive on-demand container instances, while our reserved GPU clusters sat underutilized.

The root cause? Kubernetes wasn't designed for cross-cluster resource pooling. Each cluster was an isolated island.

Our solution: a two-layer Virtual Kubelet architecture that presents multiple GPU clusters to users as a single, unified resource pool. This enables:

  • Transparent routing of workloads to reserved instances first
  • Centralized billing and quota management
  • Seamless failover between clusters

    The architecture achieves significant cost reduction by maximizing reserved instance utilization before falling back to on-demand resources.
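
As a rough illustration of that reserved-first policy, the routing decision can be reduced to a small scoring function. This is a minimal sketch, not the production scheduler: the ClusterView fields, cluster names, and route_pod helper below are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClusterView:
    name: str
    reserved: bool   # backed by reserved GPU capacity
    free_gpus: int   # GPUs currently schedulable


def route_pod(requested_gpus: int, clusters: list[ClusterView]) -> Optional[ClusterView]:
    """Pick a target GPU cluster for a workload.

    Policy: exhaust reserved capacity first, then fall back to
    on-demand clusters; return None if nothing can fit the request.
    """
    # Prefer reserved clusters; break ties by most free GPUs.
    candidates = sorted(
        (c for c in clusters if c.free_gpus >= requested_gpus),
        key=lambda c: (not c.reserved, -c.free_gpus),
    )
    return candidates[0] if candidates else None


# Example: an 8-GPU job lands on the reserved cluster even though
# the on-demand cluster has more headroom.
pool = [
    ClusterView("gpu-reserved-a", reserved=True, free_gpus=16),
    ClusterView("gpu-ondemand-b", reserved=False, free_gpus=64),
]
print(route_pod(8, pool).name)  # -> gpu-reserved-a
```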

Key Components

  • L1: Managed K8s Frontend (User-facing Kubernetes)
  • L2: CPU Cluster (Control Plane / The Brain)
  • L3: GPU Clusters (Data Plane / The Muscle)
  • Internal Load Balancer for cross-cluster networking
  • VK Layer 2 Injection Modules
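
The Layer-2 injection pattern boils down to a pass-through provider: the frontend (L1) cluster sees one virtual node per GPU cluster, and pods scheduled onto that node are replayed against the real (L3) cluster's API server. Below is a purely conceptual sketch using the official kubernetes Python client; the class name, kubeconfig context, and namespace are assumptions, and a production Virtual Kubelet provider would normally be written in Go against the virtual-kubelet SDK.

```python
from kubernetes import client, config


class PassThroughProvider:
    """Conceptual pass-through 'virtual kubelet' provider.

    Pods scheduled onto the virtual node in the frontend cluster are
    forwarded verbatim to a downstream GPU cluster's API server.
    """

    def __init__(self, kube_context: str, namespace: str = "default"):
        # Each provider instance points at one downstream GPU cluster (L3).
        api_client = config.new_client_from_config(context=kube_context)
        self.core = client.CoreV1Api(api_client)
        self.namespace = namespace

    def create_pod(self, pod_manifest: dict) -> None:
        # Replay the pod spec against the downstream cluster.
        self.core.create_namespaced_pod(self.namespace, body=pod_manifest)

    def get_pod_status(self, name: str):
        # Surface downstream pod status back to the frontend cluster.
        return self.core.read_namespaced_pod_status(name, self.namespace).status

    def delete_pod(self, name: str) -> None:
        self.core.delete_namespaced_pod(name, self.namespace)
```
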
📦

Tier-1 Tech Company

LLM Training Platform for Foundation Models

⚡ AIOps Platform

LLM Training Platform Architecture

Internal Tools for Large-Scale GPU Fleet Management

Note: Due to confidentiality agreements, specific implementation details have been abstracted. The patterns described represent general industry practices.

We operate an LLM Training Platform managing thousands of GPUs distributed across multiple managed Kubernetes clusters. As the Internal Tools Team, we handle Health Checks, Node Remediation, and operational automation, primarily serving scientists and ML researchers.

The platform provides:

  • Airflow DAGs for orchestrating Health Checks, Fault Isolation, and Node Remediation
  • RESTful APIs via Serverless Functions for cluster info and job management
  • CLI Tools for scientists to submit jobs and query status
  • Persistent Layer tracking GPU serial numbers, node status, and job history

    Result: Multi-million dollar annual savings through automated fault detection and reduced GPU idle time.
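
As a rough sketch of how that orchestration hangs together, the DAG below wires the Health Check, Fault Isolation, and Node Remediation steps in sequence. It assumes Airflow 2.x; the task callables, DAG id, and schedule are illustrative placeholders, not the production DAG.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def run_health_checks(**context):
    # Placeholder: probe GPU nodes (e.g. GPU health and interconnect checks).
    ...

def isolate_faulty_nodes(**context):
    # Placeholder: cordon and drain nodes flagged by the health checks.
    ...

def remediate_nodes(**context):
    # Placeholder: reboot/reimage nodes, then uncordon once they pass re-checks.
    ...


with DAG(
    dag_id="gpu_fleet_health",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",  # hypothetical cadence
    catchup=False,
) as dag:
    health = PythonOperator(task_id="health_checks", python_callable=run_health_checks)
    isolate = PythonOperator(task_id="fault_isolation", python_callable=isolate_faulty_nodes)
    remediate = PythonOperator(task_id="node_remediation", python_callable=remediate_nodes)

    health >> isolate >> remediate
```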

Key Components

  • Managed K8s Clusters (Thousands of GPUs)
  • Airflow DAGs for Health Check & Remediation
  • Serverless Functions (RESTful APIs)
  • Relational Database for Job/Node/Cluster metadata
  • GitOps-based CI/CD
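
To make the API tier concrete, here is a minimal serverless-style handler that answers a node-status query from the relational metadata store. The event shape, table name, and environment variable are assumptions, and sqlite stands in for the real database driver; the actual functions, schema, and auth are abstracted away.

```python
import json
import os
import sqlite3  # stand-in for the real relational database driver


def handler(event, context):
    """Return current status for one GPU node (generic serverless entry point).

    Expects a hypothetical API-gateway-style event:
    {"queryStringParameters": {"node_id": "..."}}.
    """
    node_id = (event.get("queryStringParameters") or {}).get("node_id")
    if not node_id:
        return {"statusCode": 400, "body": json.dumps({"error": "node_id is required"})}

    # In production this would be a managed relational DB; sqlite keeps the sketch self-contained.
    conn = sqlite3.connect(os.environ.get("NODE_DB_PATH", "fleet.db"))
    row = conn.execute(
        "SELECT node_id, gpu_serial, status FROM nodes WHERE node_id = ?",
        (node_id,),
    ).fetchone()
    conn.close()

    if row is None:
        return {"statusCode": 404, "body": json.dumps({"error": "unknown node"})}

    return {
        "statusCode": 200,
        "body": json.dumps({"node_id": row[0], "gpu_serial": row[1], "status": row[2]}),
    }
```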