Available for Senior/Staff SDE | MLE Roles | GPU/AI Infrastructure Specialist
Hi, I'm Bill Hsu

Built a Heterogeneous Compute Platform for Multi-Cluster AI Training 🚀

Infrastructure Software Engineer II @ Alibaba Cloud | Ex-Amazon AGI
Building the infrastructure that powers next-gen AI

What I Bring to Your Team

🏗️

Cross-Cluster Architecture Expert

Designed Dual-Layer Virtual Kubelet for massive-scale heterogeneous GPU fleets

  • Unified Resource Orchestration across clusters
  • Federated Identity Mesh with secure AuthN/AuthZ
  • Hybrid Network Fabric for low-latency communication
  • 25x capacity scaling for humanoid-robotics training
💰

Massive Cost Savings

Proven track record of multi-million-dollar optimizations

  • Multi-million-dollar annual savings at Amazon AGI
  • 40% cost reduction at Alibaba Cloud
  • Migration from Serverless to Reserved instances
  • Thousands of scaling requests handled efficiently
🤖

Multi-Agent Systems & LLM Agent Architecture

Authored RCAgent, a trust-first multi-agent kernel for distributed-systems incident triage

  • Supervisor-Worker architecture with a 4-Gate Hallucination Defense
  • New-skill gating via pass^3 ≥ 80% (the pass^k consistency metric from τ-bench)
  • Meta-Tool over a hierarchical skill tree: on average 6 of 200+ tools exposed per call
  • ~40% auto-healing on confirmed-cause incidents
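
A minimal sketch of how a pass^3 ≥ 80% gate can be evaluated. It is an illustration, not RCAgent's code. It assumes each candidate skill is run n times per task with c successes, and uses the standard pass^k estimator C(c, k) / C(n, k), averaged over tasks. The names `PassHatK`, `GateSkill`, and `TaskResult` are hypothetical.

```go
package main

import "fmt"

// binom returns C(n, k) as a float64, or 0 when k is out of range.
func binom(n, k int) float64 {
	if k < 0 || k > n {
		return 0
	}
	result := 1.0
	for i := 1; i <= k; i++ {
		// After step i the value equals C(n-k+i, i), so it stays an exact integer.
		result = result * float64(n-k+i) / float64(i)
	}
	return result
}

// PassHatK estimates pass^k for one task: the probability that k trials drawn
// without replacement from n recorded trials (c of them successful) all succeed.
func PassHatK(n, c, k int) float64 {
	if k <= 0 || n < k {
		return 0 // not enough trials to estimate pass^k
	}
	return binom(c, k) / binom(n, k)
}

// TaskResult is a hypothetical record of repeated trials on one evaluation task.
type TaskResult struct {
	Trials    int
	Successes int
}

// GateSkill averages pass^k over all tasks and admits the skill only when the
// average reaches the threshold (0.80 for the gate described above).
func GateSkill(results []TaskResult, k int, threshold float64) (float64, bool) {
	if len(results) == 0 {
		return 0, false
	}
	sum := 0.0
	for _, r := range results {
		sum += PassHatK(r.Trials, r.Successes, k)
	}
	score := sum / float64(len(results))
	return score, score >= threshold
}

func main() {
	// Five tasks, five trials each; one task failed a single trial.
	results := []TaskResult{
		{Trials: 5, Successes: 5},
		{Trials: 5, Successes: 5},
		{Trials: 5, Successes: 4},
		{Trials: 5, Successes: 5},
		{Trials: 5, Successes: 5},
	}
	score, admit := GateSkill(results, 3, 0.80)
	fmt.Printf("pass^3 = %.2f, admit = %v\n", score, admit)
}
```

Because pass^k requires all k sampled trials to succeed, a single flaky trial costs far more than it would under an average success rate, which is why it suits a trust-first gate.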
🔐

Security & Reliability at Scale

Enterprise-grade security for AI/ML infrastructure

  • Novel credential injection via Service Accounts
  • Automated token rotation for Cross-Cluster auth
  • Secure Enclave telemetry with DCGM metrics
  • Significant engineer-hours saved monthly via automation

Technical Expertise

Cloud Native & Kubernetes

K8s Internals (Operators, CRDs, Virtual Kubelet): 95%
Cross-Cluster Architecture: 95%
AWS EKS / Helm / Docker: 90%
Service Mesh & Identity Management: 90%

Infrastructure & Automation

Terraform / AWS CDK / CloudFormation: 90%
Golang (Primary Language): 95%
Python / TypeScript: 85%
VPC Networking / gRPC / eBPF: 85%

AI Infrastructure & GPU

NVIDIA A100/H100 Optimization: 90%
Ray Cluster/Serve: 85%
PyTorch Distributed Training: 80%
Prometheus / Grafana / DCGM: 90%

Certifications & Achievements

🎓 UCSD MS Computer Science | 📚 3 Publications | 🏆 Amazon Nova Co-Author | ⚙️ Kubernetes Expert

System Architecture Spotlight

Built for Massive-Scale GPU Infrastructure

From a recursive K8s-on-K8s compute platform to the 0→1 SaaS products and multi-agent systems built on top.
Distributed-systems engineering across cloud, platform, and product.

HETEROGENEOUS COMPUTE PLATFORM: RECURSIVE K8s-ON-K8s FOUNDATION

0→1 SaaS PRODUCTS · run on the platform
  • AI Dev Workstations: interactive · dual-plane networking
  • Distributed Training & Sim Scheduler: mixed CPU/GPU job dispatch

HETEROGENEOUS COMPUTE PLATFORM: THE FOUNDATION
  • Recursive K8s-on-K8s: virtual-node-on-virtual-node
  • Cross-Cluster Identity Mesh: app-layer routing · per-pod secrets-mount · no static credentials
  • Unified Compute Substrate: dedicated GPU pools + serverless CPU pools across multiple clusters

Unified substrate · ~40% TCO reduction · 25x capacity scaling

Architecture simplified for confidentiality. Patterns represent general industry practices.

Heterogeneous Compute Platform

The Challenge: Enterprise AI training demands massive GPU fleets, but Kubernetes wasn't designed for cross-cluster resource pooling: each cluster is an isolated island.

Built a recursive K8s-on-K8s compute platform (virtual-node-on-virtual-node) that unifies dedicated GPU pools and serverless CPU pools across clusters into one substrate. It is the foundation for two 0→1 SaaS products built for a humanoid-robotics training and simulation customer.

Key Engineering Decisions:

  • Recursive K8s-on-K8s presents isolated clusters as one logical compute substrate
  • Cross-Cluster Identity Mesh: application-layer routing + per-pod secrets-mount, no static credentials
  • Two 0→1 SaaS products on top: AI dev workstations + distributed training & simulation scheduler
Virtual Kubelet | Golang | gRPC | Kubernetes
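
To make the "one logical compute substrate" idea concrete, here is a deliberately simplified sketch. It does not use the Virtual Kubelet API and is not the platform's code. It assumes each member cluster reports free capacity for one pool, and that GPU jobs are dispatched to dedicated GPU pools while CPU-only jobs go to serverless CPU pools. All types and function names (`Pool`, `Request`, `Aggregate`, `Place`) are hypothetical.

```go
package main

import "fmt"

// PoolKind distinguishes the two pool types named in the text.
type PoolKind string

const (
	DedicatedGPU  PoolKind = "dedicated-gpu"
	ServerlessCPU PoolKind = "serverless-cpu"
)

// Pool is a hypothetical view of one member cluster's free capacity.
type Pool struct {
	Cluster  string
	Kind     PoolKind
	FreeGPUs int
	FreeCPUs int
}

// Request is what a training or simulation job asks for.
type Request struct {
	GPUs int
	CPUs int
}

// Aggregate sums free capacity across pools: the single capacity figure a
// virtual node could advertise to the layer above it.
func Aggregate(pools []Pool) (gpus, cpus int) {
	for _, p := range pools {
		gpus += p.FreeGPUs
		cpus += p.FreeCPUs
	}
	return gpus, cpus
}

// Place returns the first cluster whose pool is of the right kind and fits
// the request. GPU jobs use dedicated GPU pools; CPU-only jobs use serverless
// CPU pools.
func Place(pools []Pool, req Request) (string, bool) {
	want := ServerlessCPU
	if req.GPUs > 0 {
		want = DedicatedGPU
	}
	for _, p := range pools {
		if p.Kind == want && p.FreeGPUs >= req.GPUs && p.FreeCPUs >= req.CPUs {
			return p.Cluster, true
		}
	}
	return "", false
}

func main() {
	pools := []Pool{
		{Cluster: "cluster-a", Kind: DedicatedGPU, FreeGPUs: 8, FreeCPUs: 64},
		{Cluster: "cluster-b", Kind: DedicatedGPU, FreeGPUs: 16, FreeCPUs: 128},
		{Cluster: "cluster-c", Kind: ServerlessCPU, FreeCPUs: 512},
	}
	gpus, cpus := Aggregate(pools)
	fmt.Println("advertised capacity:", gpus, "GPUs,", cpus, "CPUs")

	cluster, ok := Place(pools, Request{GPUs: 12, CPUs: 32})
	fmt.Println("12-GPU job placed on:", cluster, ok)
}
```

The aggregate figure can exceed what any single cluster can satisfy, so a real scheduler still has to place each pod on one concrete pool, as `Place` does here with a first-fit rule.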

What People Say About My Work

"Mr. Hsu possesses a rare combination of expertise in GPU cluster management, distributed systems, and Kubernetes orchestration that is essential for building the next generation of AI training platforms.

His most significant contribution has been the design and implementation of our Dual-Layer Virtual Kubelet architecture, a groundbreaking system that enables centralized orchestration of a massive-scale heterogeneous GPU fleet across multiple clusters, achieving ~40% TCO reduction while expanding training capacity by 25x."

Panfeng Zhou
Senior Staff Engineer, Alibaba Cloud
20+ years in database & distributed systems • Led team that set TPC-C world record

Let's Build Something Amazing Together

Why Schedule a Call?

  • Discuss how I can solve your infrastructure challenges
  • Share ideas about scaling AI/ML systems
  • Explore potential collaboration opportunities
  • Get insights from my experience at scale

My Availability

Pacific Time (PST/PDT)

Mon-Fri: 9 AM - 6 PM

Response within 24 hours
