Deep Dives
Detailed technical write-ups on the systems I've built.
Architecture decisions, trade-offs, and implementation details.
Global Cloud Provider
Cross-Cluster AI Training Infrastructure
K8s on K8s: Recursive Virtualization Architecture
Two-Layer Virtual Kubelet for Multi-Cluster GPU Pooling
Note: Due to confidentiality agreements, specific implementation details have been abstracted. The patterns described represent general industry practices.
When building large-scale AI SaaS platforms, we faced a critical business challenge: customers were paying for expensive on-demand container instances, while our reserved GPU clusters sat underutilized.
The root cause? Kubernetes wasn't designed for cross-cluster resource pooling. Each cluster was an isolated island.
Our solution: a two-layer Virtual Kubelet architecture that presents multiple GPU clusters to users as a single, unified resource pool. This enables:
- Transparent routing of workloads to reserved instances first
- Centralized billing and quota management
- Seamless failover between clusters
The architecture achieves significant cost reduction by maximizing reserved instance utilization before falling back to on-demand resources.
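To make the reserved-first routing concrete, here is a minimal sketch of the placement decision in Python. The cluster names, the `ClusterPool` shape, and the capacity check are hypothetical illustrations of the policy, not the actual Virtual Kubelet provider code.

```python
from dataclasses import dataclass

@dataclass
class ClusterPool:
    name: str
    kind: str        # "reserved" or "on-demand" (hypothetical labels)
    free_gpus: int

def place_workload(pools: list[ClusterPool], gpus_needed: int) -> ClusterPool | None:
    """Route a workload to reserved capacity first, falling back to on-demand."""
    for kind in ("reserved", "on-demand"):
        candidates = [p for p in pools if p.kind == kind and p.free_gpus >= gpus_needed]
        if candidates:
            # Among eligible pools of the preferred kind, pick the one with the most headroom.
            return max(candidates, key=lambda p: p.free_gpus)
    return None  # No capacity anywhere: the caller queues or rejects the workload.

# Example with hypothetical cluster names: two reserved pools and one on-demand pool.
pools = [
    ClusterPool("reserved-us-east", "reserved", free_gpus=4),
    ClusterPool("reserved-us-west", "reserved", free_gpus=16),
    ClusterPool("ondemand-burst", "on-demand", free_gpus=64),
]
print(place_workload(pools, gpus_needed=8).name)  # -> reserved-us-west
```

In the real architecture this decision sits behind the Virtual Kubelet layer, so users see one pool while the failover and billing logic happens transparently.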
🔧 Case Studies
K8s on K8s: Recursive Virtualization
Virtual Kubelet-based Ray Resource Pooling Architecture
Time-Window Node Scheduling
Cron-Driven GPU Pool Time-Sharing Architecture
Mock PV: Cross-Cluster Storage
Two-Phase Provisioning for Storage Virtualization
Tier-1 Tech Company
LLM Training Platform for Foundation Models
LLM Training Platform Architecture
Internal Tools for Large-Scale GPU Fleet Management
Note: Due to confidentiality agreements, specific implementation details have been abstracted. The patterns described represent general industry practices.
We operate an LLM Training Platform managing thousands of GPUs distributed across multiple managed Kubernetes clusters. As the Internal Tools Team, we handle Health Checks, Node Remediation, and operational automation, primarily serving scientists and ML researchers.
The platform provides:
- Airflow DAGs for orchestrating Health Checks, Fault Isolation, and Node Remediation
- RESTful APIs via Serverless Functions for cluster info and job management
- CLI Tools for scientists to submit jobs and query status
- Persistent Layer tracking GPU serial numbers, node status, and job history
Result: Multi-million-dollar annual savings through automated fault detection and reduced GPU idle time.
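As a rough illustration of how the orchestration layer fits together, the sketch below wires the three stages (health checks, fault isolation, node remediation) into an Airflow DAG, assuming Airflow 2.x. The DAG id, schedule, and task bodies are placeholders rather than the production pipeline.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def run_health_checks(**context):
    # Placeholder: probe GPU/node health (e.g. driver state, ECC errors, link tests).
    pass

def isolate_faulty_nodes(**context):
    # Placeholder: cordon nodes that failed checks so new jobs avoid them.
    pass

def remediate_nodes(**context):
    # Placeholder: drain, repair or reimage, then return nodes to the pool.
    pass

with DAG(
    dag_id="gpu_node_remediation",  # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
) as dag:
    health = PythonOperator(task_id="health_checks", python_callable=run_health_checks)
    isolate = PythonOperator(task_id="fault_isolation", python_callable=isolate_faulty_nodes)
    remediate = PythonOperator(task_id="node_remediation", python_callable=remediate_nodes)

    # Health checks gate fault isolation, which gates remediation.
    health >> isolate >> remediate
```

The same pattern extends to per-cluster variants of the DAG, with the persistent layer recording which nodes were isolated and why.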