Available for Senior/Staff SDE | MLE Roles | GPU/AI Infrastructure Specialist

Hi, I'm Bill Hsu

Built a Heterogeneous Compute Platform for Multi-Cluster AI Training 🚀

Infrastructure Software Engineer II @ Alibaba Cloud | Ex-Amazon AGI
Building the infrastructure that powers next-gen AI

Connect on LinkedIn | 📅 Schedule a Call | 📄 Download Resume


Where I've Made Impact

📚 Citations (Google Scholar)

🚀 Capacity Scaling (Alibaba)

⚡ MTTR Reduction (Amazon)

📄 Publications & Patents

Amazon AGI

Multi-million dollar annualized GPU-fleet savings
GPU lifecycle management at massive scale (Airflow DAG)
Automated fault remediation cutting troubleshooting ~90%

Alibaba Cloud

Current — Infrastructure Software Engineer II
Heterogeneous compute platform for multi-cluster AI training
Two 0→1 SaaS products + a multi-agent RCA kernel

Amazon Nova

Co-Author on Amazon Nova Technical Report
Contributed to frontier AI model development
Published Dec 2024

Unitree Robotics

Order-of-magnitude scaling for G1 humanoid robot training
GPU infrastructure for reinforcement learning
Cross-cluster orchestration

UC San Diego

M.S. Computer Science (AI/ML focus), 2023
Computational Neuroscience & Machine Learning Research
Full-stack Developer @ UCSD IT

What I Bring to Your Team

🏗️

Cross-Cluster Architecture Expert

Designed Dual-Layer Virtual Kubelet for massive-scale heterogeneous GPU fleets

Unified Resource Orchestration across clusters
Federated Identity Mesh with secure AuthN/AuthZ
Hybrid Network Fabric for low-latency communication
25x capacity scaling for humanoid-robotics training

💰

Massive Cost Savings

Proven track record of multi-million dollar optimizations

Multi-million dollar annual savings at Amazon AGI
40% cost reduction at Alibaba Cloud
Migration from Serverless to Reserved instances
Thousands of scaling requests handled efficiently

🤖

Multi-Agent Systems & LLM Agent Architecture

Authored RCAgent — a trust-first multi-agent kernel for distributed-systems incident triage

Supervisor-Worker architecture with a 4-Gate Hallucination Defense
New-skill gating via pass^3 ≥ 80% (the pass^k consistency metric introduced by τ-bench)
Meta-Tool over a hierarchical skill tree — avg 6 of 200+ tools per call
~40% auto-healing on confirmed-cause incidents
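
To make the pass^3 gate above concrete, here is a minimal sketch of the pass^k estimator: for a task with c successes out of n trials, the chance that k fresh trials all succeed is estimated as C(c, k) / C(n, k). The function names, the per-task averaging, and the way the 80% threshold is applied are assumptions for illustration, not a description of how RCAgent implements its gate.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the probability that k independent trials of one task all pass.

    Uses the combinatorial estimator C(successes, k) / C(trials, k).
    """
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    if not 1 <= k <= trials:
        raise ValueError("k must be between 1 and trials")
    # comb() returns 0 when k > successes, so the estimate is 0.0 in that case.
    return comb(successes, k) / comb(trials, k)


def skill_passes_gate(per_task_results, k: int = 3, threshold: float = 0.80) -> bool:
    """Hypothetical gate: average pass^k over tasks must reach the threshold.

    per_task_results is a list of (successes, trials) tuples, one per eval task.
    """
    if not per_task_results:
        raise ValueError("need at least one task result")
    scores = [pass_hat_k(c, n, k) for c, n in per_task_results]
    return sum(scores) / len(scores) >= threshold


if __name__ == "__main__":
    # A task that passes 4 of 5 trials scores only 0.4 on pass^3.
    print(pass_hat_k(4, 5, 3))
    print(skill_passes_gate([(5, 5), (5, 5)]))
```

The metric is deliberately strict: a skill that succeeds 80% of the time on a single attempt scores far lower on pass^3, which is why it suits gating on consistency rather than on best-case success.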

🔐

Security & Reliability at Scale

Enterprise-grade security for AI/ML infrastructure

Novel credential injection via Service Accounts
Automated token rotation for Cross-Cluster auth
Secure Enclave telemetry with DCGM metrics
Significant engineer-hours saved monthly via automation

Technical Expertise

Cloud Native & Kubernetes

K8s Internals (Operators, CRDs, Virtual Kubelet): 95%

Cross-Cluster Architecture: 95%

AWS EKS / Helm / Docker: 90%

Service Mesh & Identity Management: 90%

Infrastructure & Automation

Terraform / AWS CDK / CloudFormation: 90%

Golang (Primary Language): 95%

Python / TypeScript: 85%

VPC Networking / gRPC / eBPF: 85%

AI Infrastructure & GPU

NVIDIA A100/H100 Optimization: 90%

Ray Cluster/Serve: 85%

PyTorch Distributed Training: 80%

Prometheus / Grafana / DCGM: 90%

Certifications & Achievements

🎓 UCSD MS Computer Science | 📚 3 Publications | 🏆 Amazon Nova Co-Author | ⚙️ Kubernetes Expert

System Architecture Spotlight

Built for Massive-Scale GPU Infrastructure

From a recursive K8s-on-K8s compute platform to the 0→1 SaaS products and multi-agent systems built on top.
Distributed-systems engineering across cloud, platform, and product.

Architecture simplified for confidentiality. Patterns represent general industry practices.

Heterogeneous Compute Platform

The Challenge: Enterprise AI training demands massive GPU fleets, but Kubernetes wasn't designed for cross-cluster resource pooling — each cluster is an isolated island.

Built a recursive K8s-on-K8s compute platform (virtual-node-on-virtual-node) that unifies dedicated GPU and serverless CPU pools across clusters into one substrate. It is the foundation for two 0→1 SaaS products serving a humanoid-robotics training & simulation customer.

Key Engineering Decisions:

Recursive K8s-on-K8s presents isolated clusters as one logical compute substrate
Cross-Cluster Identity Mesh: application-layer routing + per-pod secrets-mount, no static credentials
Two 0→1 SaaS products on top: AI dev workstations + distributed training & simulation scheduler
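
As a rough illustration of the first decision above (presenting isolated clusters as one logical substrate), the sketch below sums per-cluster free capacity into a single virtual-node view and picks a backing cluster for a pod with a simple best-fit rule. Every name here (`ClusterCapacity`, `virtual_node_allocatable`, `place_pod`) and the placement rule itself are hypothetical; the real platform is confidential and is built on Virtual Kubelet in Go, not on this code.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ClusterCapacity:
    """Free resources reported by one backing cluster (hypothetical shape)."""
    name: str
    free_gpus: int
    free_cpus: int


def virtual_node_allocatable(clusters: Sequence[ClusterCapacity]) -> dict:
    """Capacity the virtual node would advertise: the sum over all clusters."""
    return {
        "gpu": sum(c.free_gpus for c in clusters),
        "cpu": sum(c.free_cpus for c in clusters),
    }


def place_pod(clusters: Sequence[ClusterCapacity], gpus: int, cpus: int) -> Optional[str]:
    """Best-fit placement: choose the fitting cluster that leaves the fewest GPUs idle.

    Returns the cluster name, or None when no single cluster can host the pod.
    """
    candidates = [c for c in clusters if c.free_gpus >= gpus and c.free_cpus >= cpus]
    if not candidates:
        return None
    # Tie-break on name so the choice is deterministic.
    return min(candidates, key=lambda c: (c.free_gpus - gpus, c.name)).name


if __name__ == "__main__":
    fleet = [
        ClusterCapacity("gpu-a", free_gpus=8, free_cpus=64),
        ClusterCapacity("gpu-b", free_gpus=4, free_cpus=32),
        ClusterCapacity("cpu-serverless", free_gpus=0, free_cpus=512),
    ]
    print(virtual_node_allocatable(fleet))
    print(place_pod(fleet, gpus=4, cpus=16))
```

Note that the aggregate view can advertise more capacity than any single cluster holds, so a request can fit the virtual node yet have no valid placement; the sketch returns `None` in that case rather than splitting a pod across clusters.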

Virtual Kubelet | Golang | gRPC | Kubernetes


What People Say About My Work

Mr. Hsu possesses a rare combination of expertise in GPU cluster management, distributed systems, and Kubernetes orchestration that is essential for building the next generation of AI training platforms.

His most significant contribution has been the design and implementation of our Dual-Layer Virtual Kubelet architecture, a groundbreaking system that enables centralized orchestration of a massive-scale heterogeneous GPU fleet across multiple clusters—achieving ~40% TCO reduction while expanding training capacity by 25x.

Panfeng Zhou, Senior Staff Engineer, Alibaba Cloud
20+ years in database & distributed systems • Led team that set TPC-C world record

Let's Build Something Amazing Together

Why Schedule a Call?

Discuss how I can solve your infrastructure challenges
Share ideas about scaling AI/ML systems
Explore potential collaboration opportunities
Get insights from my experience at scale

My Availability

Pacific Time (PST/PDT)

Mon-Fri: 9 AM - 6 PM

Response within 24 hours