Available for Senior/Staff SDE | SRE | MLE Roles | GPU/AI Infrastructure Specialist
Bill Hsu

Hi, I'm Bill Hsu

Architected Cross-Cluster Infrastructure for Massive-Scale GPU Fleets 🚀

Infrastructure Software Engineer II @ Alibaba Cloud | Ex-Amazon AGI
Building the infrastructure that powers next-gen AI

↓ See My Impact ↓

What I Bring to Your Team

🏗️

Cross-Cluster Architecture Expert

Designed Dual-Layer Virtual Kubelet for massive-scale heterogeneous GPU fleets

  • Unified Resource Orchestration across clusters
  • Federated Identity Mesh with secure AuthN/AuthZ
  • Hybrid Network Fabric for low-latency communication
  • 25x capacity scaling for Unitree G1 robot training

💰

Massive Cost Savings

Proven track record of multi-million dollar optimizations

  • Multi-million dollar annual savings at Amazon AGI
  • 40% cost reduction at Alibaba Cloud
  • Migration from Serverless to Reserved instances
  • Thousands of scaling requests handled efficiently
⚑

AIOps & Self-Healing Systems

Architecting autonomous infrastructure that fixes itself

  • ~90% troubleshooting time reduction per incident
  • Custom Kubernetes Controllers for auto-remediation
  • Closed-loop telemetry pipeline with SysOM
  • Zero data loss in isolated GPU sandboxes

🔐

Security & Reliability at Scale

Enterprise-grade security for AI/ML infrastructure

  • Novel credential injection via Service Accounts
  • Automated token rotation for Cross-Cluster auth
  • Secure Enclave telemetry with DCGM metrics
  • Significant engineer-hours saved monthly via automation

Technical Expertise

Cloud Native & Kubernetes

K8s Internals (Operators, CRDs, Virtual Kubelet): 95%
Cross-Cluster Architecture: 95%
AWS EKS / Helm / Docker: 90%
Service Mesh & Identity Management: 90%

Infrastructure & Automation

Terraform / AWS CDK / CloudFormation: 90%
Golang (Primary Language): 95%
Python / TypeScript: 85%
VPC Networking / gRPC / eBPF: 85%

AI Infrastructure & GPU

NVIDIA A100/H100 Optimization: 90%
Ray Cluster/Serve: 85%
PyTorch Distributed Training: 80%
Prometheus / Grafana / DCGM: 90%

Certifications & Achievements

🎓 UCSD MS Computer Science
📚 3 Publications
🏆 Amazon Nova Co-Author
⚙️ Kubernetes Expert

System Architecture Spotlight

Built for Massive-Scale GPU Infrastructure

From Ray AI SaaS platforms to self-healing GPU fleets for LLM training.
Solving distributed-systems challenges at petabyte scale.

RECURSIVE VIRTUALIZATION ARCHITECTURE

User-Facing Layer
  Managed K8s • User Workloads
Control Plane
  Orchestration • Scheduling • Identity Management
  Virtual Kubelet • Cross-Cluster Networking
Data Plane
  GPU Resource Pools • Training Jobs

Unified resource pool • Significant cost reduction • Zero-downtime operations

Architecture simplified for confidentiality. Patterns represent general industry practices.

Federated AI Infrastructure

The Challenge: Enterprise AI training demands massive GPU fleets, but Kubernetes wasn't designed for cross-cluster orchestration at this scale.

Built a Ray AI SaaS platform on Managed Kubernetes with a Dual-Layer Virtual Kubelet architecture. The platform powers training for enterprise AI customers by unifying isolated GPU clusters into one logical pool.

Key Engineering Decisions:

  • Virtual Kubelet presents isolated clusters as a single logical pool
  • Federated Identity Mesh solves cross-cluster auth with automated token rotation
  • TimeWindow scheduling shifts workloads to off-peak hours → significant cost savings

Virtual Kubelet • Golang • gRPC • Kubernetes
Technical Deep Dive →

What People Say About My Work

"

Mr. Hsu possesses a rare combination of expertise in GPU cluster management, distributed systems, and Kubernetes orchestration that is essential for building the next generation of AI training platforms.

His most significant contribution has been the design and implementation of our Dual-Layer Virtual Kubelet architecture, a groundbreaking system that enables centralized orchestration of a massive-scale heterogeneous GPU fleet across multiple clusters, achieving ~40% TCO reduction while expanding training capacity by 25x.

Panfeng Zhou
Senior Staff Engineer, Alibaba Cloud
20+ years in database & distributed systems • Led team that set TPC-C world record

Let's Build Something Amazing Together

Why Schedule a Call?

  • Discuss how I can solve your infrastructure challenges
  • Share ideas about scaling AI/ML systems
  • Explore potential collaboration opportunities
  • Get insights from my experience at scale

My Availability

Pacific Time (PST/PDT)

Mon-Fri: 9 AM - 6 PM

Response within 24 hours