About Me
Message for Recruiters & Hiring Managers
I am Yan-Cheng (Bill) Hsu, an Infrastructure Software Engineer II at Alibaba Cloud, specializing in resource management for the AI Training Platform. I hold a Master's degree from UC San Diego and bring deep expertise in building and managing large-scale GPU infrastructure for AI/ML workloads.
At Alibaba Cloud, I architect cross-cluster AI training infrastructure for cutting-edge robotics (Unitree G1-D), designing systems that centralize large, heterogeneous GPU fleets while cutting total cost of ownership (TCO) by ~40% and scaling capacity 25x. My work on Federated Identity Mesh and AIOps observability reflects my ability to solve complex distributed systems challenges.
Previously, at Amazon AGI Org, I built the GPU infrastructure powering Amazon Nova, architecting systems that delivered multi-million-dollar annual savings and reduced troubleshooting time by ~90% per incident. I am a co-author of the Amazon Nova technical report.
My research includes publications in the journal Sensors and at IEEE APSIPA ASC 2023 on deep learning and time series transformers. I combine strong systems engineering skills with AI/ML expertise to build reliable, cost-effective infrastructure at scale.
