Available for Senior/Staff SDE | SRE | MLE Roles | GPU/AI Infrastructure Specialist

Hi, I'm Bill Hsu

Architected Cross-Cluster Infrastructure for Massive-Scale GPU Fleets 🚀

Infrastructure Software Engineer II @ Alibaba Cloud | Ex-Amazon AGI
Building the infrastructure that powers next-gen AI

Where I've Made Impact

🖥️ Massive-Scale GPU Fleet

⚡ Drastic MTTR Reduction

📄 Publications

Amazon AGI

Multi-million dollar infrastructure savings
Architected Cross-Cluster systems for massive-scale GPU fleets
Built self-healing AIOps tooling that drastically reduced MTTR

Alibaba Cloud

Current: Infrastructure Software Engineer II
Resource management for the AI Training Platform
40% cost reduction through resource optimization

Amazon Nova

Co-Author on Amazon Nova Technical Report
Contributed to frontier AI model development
Published Dec 2024

Unitree Robotics

25x capacity scaling for G1 humanoid robot training
GPU infrastructure for reinforcement learning
Cross-cluster orchestration

UC San Diego

M.S. Computer Science (AI/ML focus), 2023
Computational Neuroscience & Machine Learning Research
Full-stack Developer @ UCSD IT

What I Bring to Your Team

🏗️

Cross-Cluster Architecture Expert

Designed Dual-Layer Virtual Kubelet for massive-scale heterogeneous GPU fleets

Unified Resource Orchestration across clusters
Federated Identity Mesh with secure AuthN/AuthZ
Hybrid Network Fabric for low-latency communication
25x capacity scaling for Unitree G1 robot training

💰

Massive Cost Savings

Proven track record of multi-million dollar optimizations

Multi-million dollar annual savings at Amazon AGI
40% cost reduction at Alibaba Cloud
Migration from Serverless to Reserved instances
Thousands of scaling requests handled efficiently

⚡

AIOps & Self-Healing Systems

Architecting autonomous infrastructure that fixes itself

~90% troubleshooting time reduction per incident
Custom Kubernetes Controllers for auto-remediation
Closed-loop telemetry pipeline with SysOM
Zero data loss in isolated GPU sandboxes

🔐

Security & Reliability at Scale

Enterprise-grade security for AI/ML infrastructure

Novel credential injection via Service Accounts
Automated token rotation for Cross-Cluster auth
Secure Enclave telemetry with DCGM metrics
Significant engineer-hours saved monthly via automation

Technical Expertise

Cloud Native & Kubernetes

K8s Internals (Operators, CRDs, Virtual Kubelet): 95%

Cross-Cluster Architecture: 95%

AWS EKS / Helm / Docker: 90%

Service Mesh & Identity Management: 90%

Infrastructure & Automation

Terraform / AWS CDK / CloudFormation: 90%

Golang (Primary Language): 95%

Python / TypeScript: 85%

VPC Networking / gRPC / eBPF: 85%

AI Infrastructure & GPU

NVIDIA A100/H100 Optimization: 90%

Ray Cluster/Serve: 85%

PyTorch Distributed Training: 80%

Prometheus / Grafana / DCGM: 90%

Certifications & Achievements

🎓 UCSD MS Computer Science
📚 3 Publications
🏆 Amazon Nova Co-Author
⚙️ Kubernetes Expert

System Architecture Spotlight

Built for Massive-Scale GPU Infrastructure

From Ray AI SaaS platforms to self-healing GPU fleets for LLM training.
Solving distributed-systems challenges at petabyte scale.

Architecture simplified for confidentiality. Patterns represent general industry practices.

Federated AI Infrastructure

The Challenge: Enterprise AI training demands massive GPU fleets, but Kubernetes wasn't designed for cross-cluster orchestration at this scale.

Built a Ray AI SaaS platform on Managed Kubernetes with a Dual-Layer Virtual Kubelet architecture. The platform powers training for enterprise AI customers by unifying isolated GPU clusters into one logical pool.
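
To make the "one logical pool" idea concrete, here is a toy sketch. It is not the production Virtual Kubelet code. The `Cluster` type, the `pool_capacity` and `place` helpers, and the best-fit placement rule are all hypothetical names and choices made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cluster:
    """Hypothetical view of one isolated GPU cluster."""
    name: str
    free_gpus: int


def pool_capacity(clusters: list[Cluster]) -> int:
    """Total free GPUs when all clusters are treated as one logical pool."""
    return sum(c.free_gpus for c in clusters)


def place(gpus_needed: int, clusters: list[Cluster]) -> Optional[str]:
    """Best-fit placement (an assumed policy): choose the cluster with the
    least free capacity that still fits the job, and reserve the GPUs.
    Returns the chosen cluster name, or None if no single cluster fits."""
    fits = [c for c in clusters if c.free_gpus >= gpus_needed]
    if not fits:
        return None
    best = min(fits, key=lambda c: c.free_gpus)
    best.free_gpus -= gpus_needed
    return best.name
```

A caller submits a job to the pool and never names a cluster; the placement step decides where it lands. A real scheduler would also account for GPU type, topology, and data locality, which this sketch ignores.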

Key Engineering Decisions:

Virtual Kubelet presents isolated clusters as a single logical pool
Federated Identity Mesh solves cross-cluster auth with automated token rotation
TimeWindow scheduling shifts workloads to off-peak hours → significant cost savings
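
The off-peak scheduling decision can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the platform's actual scheduler. The 22:00 to 06:00 window and the `in_window` and `next_start` helpers are hypothetical.

```python
from datetime import datetime, time, timedelta

# Assumed off-peak window (22:00 to 06:00 local time); not taken from the real system.
OFF_PEAK_START = time(22, 0)
OFF_PEAK_END = time(6, 0)


def in_window(now: datetime, start: time = OFF_PEAK_START, end: time = OFF_PEAK_END) -> bool:
    """Return True if `now` falls inside [start, end), where the window may wrap past midnight."""
    t = now.time()
    if start <= end:
        return start <= t < end
    return t >= start or t < end


def next_start(now: datetime, start: time = OFF_PEAK_START, end: time = OFF_PEAK_END) -> datetime:
    """Earliest time at or after `now` at which a deferrable job may start."""
    if in_window(now, start, end):
        return now
    candidate = datetime.combine(now.date(), start)
    if candidate < now:
        # Today's window has already closed, so wait for tomorrow's.
        candidate += timedelta(days=1)
    return candidate
```

For example, a deferrable job submitted at 14:00 would be held until 22:00 the same day, while a job submitted at 23:30 would start immediately.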

Virtual Kubelet · Golang · gRPC · Kubernetes

What People Say About My Work

Mr. Hsu possesses a rare combination of expertise in GPU cluster management, distributed systems, and Kubernetes orchestration that is essential for building the next generation of AI training platforms.

His most significant contribution has been the design and implementation of our Dual-Layer Virtual Kubelet architecture, a groundbreaking system that enables centralized orchestration of a massive-scale heterogeneous GPU fleet across multiple clusters—achieving ~40% TCO reduction while expanding training capacity by 25x.

Panfeng Zhou, Senior Staff Engineer, Alibaba Cloud
20+ years in database & distributed systems • Led team that set TPC-C world record

Let's Build Something Amazing Together

Why Schedule a Call?

Discuss how I can solve your infrastructure challenges
Share ideas about scaling AI/ML systems
Explore potential collaboration opportunities
Get insights from my experience at scale

My Availability

Pacific Time (PST/PDT)

Mon-Fri: 9 AM - 6 PM

Response within 24 hours