
Hi, I'm Bill Hsu
Architected Cross-Cluster Infrastructure for Massive-Scale GPU Fleets
Infrastructure Software Engineer II @ Alibaba Cloud | Ex-Amazon AGI
Building the infrastructure that powers next-gen AI
Where I've Made Impact
What I Bring to Your Team
Cross-Cluster Architecture Expert
Designed Dual-Layer Virtual Kubelet for massive-scale heterogeneous GPU fleets
- Unified Resource Orchestration across clusters
- Federated Identity Mesh with secure AuthN/AuthZ
- Hybrid Network Fabric for low-latency communication
- 25x capacity scaling for Unitree G1 robot training
Massive Cost Savings
Proven track record of multi-million dollar optimizations
- Multi-million dollar annual savings at Amazon AGI
- 40% cost reduction at Alibaba Cloud
- Migration from Serverless to Reserved instances
- Thousands of scaling requests handled efficiently
AIOps & Self-Healing Systems
Architecting autonomous infrastructure that fixes itself
- ~90% troubleshooting time reduction per incident
- Custom Kubernetes Controllers for auto-remediation
- Closed-loop telemetry pipeline with SysOM
- Zero data loss in isolated GPU sandboxes
Security & Reliability at Scale
Enterprise-grade security for AI/ML infrastructure
- Novel credential injection via Service Accounts
- Automated token rotation for Cross-Cluster auth
- Secure Enclave telemetry with DCGM metrics
- Significant engineer-hours saved monthly via automation
Technical Expertise
Cloud Native & Kubernetes
Infrastructure & Automation
AI Infrastructure & GPU
Certifications & Achievements
Built for Massive-Scale GPU Infrastructure
From Ray AI SaaS platforms to self-healing GPU fleets for LLM training.
Solving distributed-systems challenges at petabyte scale.
Architecture simplified for confidentiality. Patterns represent general industry practices.
Federated AI Infrastructure
The Challenge: Enterprise AI training demands massive GPU fleets, but Kubernetes wasn't designed for cross-cluster orchestration at this scale.
Built a Ray AI SaaS platform on Managed Kubernetes with a dual-layer Virtual Kubelet architecture. It powers training for enterprise AI customers by unifying isolated GPU clusters into one logical pool.
Key Engineering Decisions:
- Virtual Kubelet presents isolated clusters as a single logical pool
- Federated Identity Mesh solves cross-cluster auth with automated token rotation
- TimeWindow scheduling shifts workloads to off-peak hours, yielding significant cost savings
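To make the time-window idea concrete, here is a minimal Python sketch of how a scheduler might decide whether a deferrable job can start now or should wait for the next off-peak window. It is an illustration only, not the production implementation: the function names `in_window` and `next_start` and the 22:00 to 06:00 window are assumptions chosen for this example.

```python
from datetime import datetime, time, timedelta


def in_window(now: datetime, start: time, end: time) -> bool:
    """Return True if `now` falls inside the daily window [start, end)."""
    t = now.time()
    if start <= end:
        return start <= t < end
    # Window wraps past midnight, e.g. 22:00 -> 06:00.
    return t >= start or t < end


def next_start(now: datetime, start: time, end: time) -> datetime:
    """Return `now` if already inside the window, else the next opening time."""
    if in_window(now, start, end):
        return now
    candidate = datetime.combine(now.date(), start)
    if candidate <= now:
        # Today's window has already opened and closed; use tomorrow's.
        candidate += timedelta(days=1)
    return candidate


if __name__ == "__main__":
    # Hypothetical off-peak window for the example: 22:00 to 06:00.
    off_peak_start, off_peak_end = time(22, 0), time(6, 0)
    submitted = datetime(2024, 1, 1, 14, 0)
    print(next_start(submitted, off_peak_start, off_peak_end))
```

A job submitted at 14:00 would be held until 22:00 the same day, while a job submitted at 23:30 would start immediately. A real scheduler would also need to account for time zones, job duration, and preemption, which this sketch leaves out.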
What People Say About My Work
Mr. Hsu possesses a rare combination of expertise in GPU cluster management, distributed systems, and Kubernetes orchestration that is essential for building the next generation of AI training platforms.
His most significant contribution has been the design and implementation of our Dual-Layer Virtual Kubelet architecture, a groundbreaking system that enables centralized orchestration of a massive-scale heterogeneous GPU fleet across multiple clusters, achieving ~40% TCO reduction while expanding training capacity by 25x.
Let's Build Something Amazing Together
Why Schedule a Call?
- Discuss how I can solve your infrastructure challenges
- Share ideas about scaling AI/ML systems
- Explore potential collaboration opportunities
- Get insights from my experience at scale
My Availability
Pacific Time (PST/PDT)
Mon-Fri: 9 AM - 6 PM
Response within 24 hours



