About Me

Message for Recruiters & Hiring Managers

I am Yan-Cheng (Bill) Hsu, a Site Reliability Engineer II at Alibaba Cloud, specializing in AI Training Platform resource management. With a Master's degree from UC San Diego, I bring deep expertise in building and managing large-scale GPU infrastructure for AI/ML workloads.

At Alibaba Cloud, I architect cross-cluster AI training infrastructure for cutting-edge robotics (Unitree G1-D), designing systems that centralize 10,000+ heterogeneous GPUs while reducing costs by 40%. My work on Federated Identity Mesh and AIOps observability demonstrates my ability to solve complex distributed systems challenges.

Previously, at Amazon AGI Org, I built the GPU infrastructure powering Amazon Nova, architecting systems that saved $1.5M annually and reduced troubleshooting time by 90% across a fleet of 7,000+ GPUs. I am a co-author of the Amazon Nova technical report.

My research includes publications in the journal Sensors and at IEEE APSIPA ASC 2023 on deep learning and time series transformers. I combine strong systems engineering skills with AI/ML expertise to build reliable, cost-effective infrastructure at scale.

Career Timeline