Available for Senior/Staff SDE | SRE | MLE Roles | GPU/AI Infrastructure Specialist

Hi, I'm Bill Hsu
Architected Cross-Cluster Infrastructure for 10,000+ GPUs 🚀
Site Reliability Engineer II @ Alibaba Cloud | Ex-Amazon AGI
Building the infrastructure that powers next-gen AI
Where I've Made Impact
🖥️ GPUs Managed: 10,000+
⚡ MTTR Reduction: 90%
📄 Publications: 3

Amazon AGI
- Saved $9M/year in infrastructure costs
- Architected Cross-Cluster systems for 10,000+ GPUs
- Built self-healing AIOps reducing MTTR by 90%
Annual Savings: $9M

Alibaba Cloud
- Current role: Site Reliability Engineer II
- AI Training Platform resource management
- 40% cost reduction through optimization
Cost Reduction: 40%

Amazon Nova
- Co-Author on Amazon Nova Technical Report
- Contributed to frontier AI model development
- Published Dec 2024

Unitree Robotics
- 25x scaling for G1-D humanoid robot training
- GPU infrastructure for reinforcement learning
- Cross-cluster orchestration
Training Scale: 25x

UC San Diego
- M.S. Computer Science (AI/ML focus), 2023
- Computational Neuroscience & Machine Learning Research
- Full-stack Developer @ UCSD IT
What I Bring to Your Team
🏗️ Cross-Cluster Architecture Expert
Designed Dual-Layer Virtual Kubelet for 10,000+ heterogeneous GPUs
- Unified Resource Orchestration across clusters
- Federated Identity Mesh with secure AuthN/AuthZ
- Hybrid Network Fabric for low-latency communication
- 25x scaling for Unitree G1 robot training
💰 Massive Cost Savings
Proven track record of multi-million dollar optimizations
- $9M annual savings at Amazon AGI
- 40% cost reduction at Alibaba Cloud
- Migration from Serverless to Reserved instances
- 3,000+ scaling requests handled efficiently
⚡ AIOps & Self-Healing Systems
Architecting autonomous infrastructure that fixes itself
- 90% MTTR reduction (10hr → 1hr)
- Custom Kubernetes Controllers for auto-remediation
- Closed-loop telemetry pipeline with SysOM
- Zero data loss in isolated GPU sandboxes
🔐 Security & Reliability at Scale
Enterprise-grade security for AI/ML infrastructure
- Novel credential injection via Service Accounts
- 9-hour token rotation for Cross-Cluster auth
- Secure Enclave telemetry with DCGM metrics
- 100 engineer-hours saved monthly via automation
Technical Expertise
Cloud Native & Kubernetes
- K8s Internals (Operators, CRDs, Virtual Kubelet): 95%
- Cross-Cluster Architecture: 95%
- AWS EKS / Helm / Docker: 90%
- Service Mesh & Identity Management: 90%
Infrastructure & Automation
- Terraform / AWS CDK / CloudFormation: 90%
- Golang (Primary Language): 95%
- Python / TypeScript: 85%
- VPC Networking / gRPC / eBPF: 85%
AI Infrastructure & GPU
- NVIDIA A100/H100 Optimization: 90%
- Ray Cluster/Serve: 85%
- PyTorch Distributed Training: 80%
- Prometheus / Grafana / DCGM: 90%
Certifications & Achievements
🎓 UCSD MS Computer Science | 📚 3 Publications | 🏆 Amazon Nova Co-Author | ⚙️ Kubernetes Expert
Systems I Could Be Managing for You
LIVE SIMULATION
Active GPUs: 9,847
Requests/sec: 1,245
Uptime: 99.99%
Active Nodes: 342
What People Say About My Work
💡 These are simulated testimonials based on actual impact metrics
"Bill's Dual-Layer Virtual Kubelet architecture revolutionized our cross-cluster GPU management. His solution enabled 25x scaling for Unitree G1 robot training while achieving a 40% cost reduction."
Engineering Director, AnalyticDB AI Platform, Alibaba Cloud
Let's Build Something Amazing Together
Why Schedule a Call?
- Discuss how I can solve your infrastructure challenges
- Share ideas about scaling AI/ML systems
- Explore potential collaboration opportunities
- Get insights from my experience at scale
My Availability
Pacific Time (PST/PDT)
Mon-Fri: 9 AM - 6 PM
Response within 24 hours