
Hi, I'm Bill Hsu
Built a Heterogeneous Compute Platform for Multi-Cluster AI Training
Infrastructure Software Engineer II @ Alibaba Cloud | Ex-Amazon AGI
Building the infrastructure that powers next-gen AI
Where I've Made Impact
What I Bring to Your Team
Cross-Cluster Architecture Expert
Designed Dual-Layer Virtual Kubelet for massive-scale heterogeneous GPU fleets
- Unified Resource Orchestration across clusters
- Federated Identity Mesh with secure AuthN/AuthZ
- Hybrid Network Fabric for low-latency communication
- 25x capacity scaling for humanoid-robotics training
Massive Cost Savings
Proven track record of multi-million dollar optimizations
- Multi-million dollar annual savings at Amazon AGI
- 40% cost reduction at Alibaba Cloud
- Migration from Serverless to Reserved instances
- Thousands of scaling requests handled efficiently
Multi-Agent Systems & LLM Agent Architecture
Authored RCAgent, a trust-first multi-agent kernel for distributed-systems incident triage
- Supervisor-Worker architecture with a 4-Gate Hallucination Defense
- New-skill gating via pass^3 ≥ 80% (Anthropic τ-bench consistency metric)
- Meta-Tool over a hierarchical skill tree: on average 6 of 200+ tools exposed per call
- ~40% auto-healing on confirmed-cause incidents
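As a rough illustration of the pass^3 gate above, here is a minimal sketch. It assumes the gate uses the pass^k estimator as commonly defined for τ-bench (the chance that k trials drawn from n recorded trials all succeed, averaged over tasks). The function names `pass_hat_k` and `skill_passes_gate` and the input layout are hypothetical and are not taken from RCAgent.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate pass^k for one task: C(successes, k) / C(trials, k)."""
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    if not 1 <= k <= trials:
        raise ValueError("k must be between 1 and trials")
    # math.comb returns 0 when k > successes, so the estimate is 0.0 there.
    return comb(successes, k) / comb(trials, k)


def skill_passes_gate(per_task_results, k: int = 3, threshold: float = 0.80) -> bool:
    """Hypothetical gate: mean pass^k across tasks must reach the threshold.

    per_task_results is a list with one inner list of booleans per task,
    each boolean recording whether one trial of that task succeeded.
    """
    if not per_task_results:
        raise ValueError("at least one task is required")
    scores = [pass_hat_k(sum(results), len(results), k) for results in per_task_results]
    return sum(scores) / len(scores) >= threshold
```

With three trials per task and k = 3, a task counts only if all three trials succeed, so a single flaky run drops that task's score to zero. That strictness is what makes the metric a consistency check and not an average success rate.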
Security & Reliability at Scale
Enterprise-grade security for AI/ML infrastructure
- Novel credential injection via Service Accounts
- Automated token rotation for Cross-Cluster auth
- Secure Enclave telemetry with DCGM metrics
- Significant engineer-hours saved monthly via automation
Technical Expertise
Cloud Native & Kubernetes
Infrastructure & Automation
AI Infrastructure & GPU
Certifications & Achievements
Built for Massive-Scale GPU Infrastructure
From a recursive K8s-on-K8s compute platform to the 0→1 SaaS products and multi-agent systems built on top.
Distributed-systems engineering across cloud, platform, and product.
Architecture simplified for confidentiality. Patterns represent general industry practices.
Heterogeneous Compute Platform
The Challenge: Enterprise AI training demands massive GPU fleets, but Kubernetes wasn't designed for cross-cluster resource pooling: each cluster is an isolated island.
Built a recursive K8s-on-K8s compute platform (virtual-node-on-virtual-node) that unifies dedicated GPU and serverless CPU pools across clusters into one substrate. It is the foundation that two 0→1 SaaS products run on, for a humanoid-robotics training & simulation customer.
Key Engineering Decisions:
- Recursive K8s-on-K8s presents isolated clusters as one logical compute substrate
- Cross-Cluster Identity Mesh: application-layer routing + per-pod secrets-mount, no static credentials
- Two 0→1 SaaS products on top: AI dev workstations + distributed training & simulation scheduler
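The "per-pod secrets-mount, no static credentials" decision can be pictured with a small client-side sketch. It assumes the credential is a token file mounted into the pod and rotated in place, as Kubernetes does for projected service account tokens. The class name `MountedTokenSource` is hypothetical, and this is not the platform's actual implementation.

```python
from pathlib import Path

# Standard in-pod location of the Kubernetes service account token.
# Whether the platform uses this exact path is an assumption.
DEFAULT_TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"


class MountedTokenSource:
    """Serve a bearer token from a mounted file, reloading it after rotation."""

    def __init__(self, path: str = DEFAULT_TOKEN_PATH) -> None:
        self._path = Path(path)
        self._mtime_ns = None
        self._token = None

    def token(self) -> str:
        # Re-read only when the file's modification time has changed.
        mtime_ns = self._path.stat().st_mtime_ns
        if mtime_ns != self._mtime_ns:
            self._token = self._path.read_text().strip()
            self._mtime_ns = mtime_ns
        return self._token

    def auth_header(self) -> dict:
        # Build the header per request so a rotated token is picked up.
        return {"Authorization": f"Bearer {self.token()}"}
```

Because the token is read from the mount on demand, nothing long-lived is baked into images or configuration, and rotation needs no restart of the workload.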
What People Say About My Work
Mr. Hsu possesses a rare combination of expertise in GPU cluster management, distributed systems, and Kubernetes orchestration that is essential for building the next generation of AI training platforms.
His most significant contribution has been the design and implementation of our Dual-Layer Virtual Kubelet architecture, a groundbreaking system that enables centralized orchestration of a massive-scale heterogeneous GPU fleet across multiple clusters, achieving ~40% TCO reduction while expanding training capacity by 25x.
Let's Build Something Amazing Together
Why Schedule a Call?
- Discuss how I can solve your infrastructure challenges
- Share ideas about scaling AI/ML systems
- Explore potential collaboration opportunities
- Get insights from my experience at scale
My Availability
Pacific Time (PST/PDT)
Mon-Fri: 9 AM - 6 PM
Response within 24 hours



