Chief Architect - Cluster Management Reliability - Permanent

Huawei

  • Dublin
  • Permanent
  • Full-time
  • 1 month ago
About the job
Are you a technical leader interested in applying your deep expertise in managing large scale production clusters to the problems involved in building highly reliable hyperscale cloud infrastructure? We are looking for people who are motivated by the hard challenge of building and modernizing the cluster management layer of a very large public cloud deployment.
The Cloud Reliability Lab at the Huawei Ireland Research center has a mission to bring world class reliability to Huawei Cloud by solving cross-functional problems that span hardware, software, networking, monitoring and operations. We have teams working in all these areas with a diverse mix of people including industry veterans, academic researchers, and Ph.D. student interns. In your role, you will collaborate with the local teams in Ireland, research centers across Europe, and engineering teams around the world.
Responsibilities
  • Lead the organization responsible for defining and defending meaningful cluster management reliability KPIs.
  • Work closely with cross-functional teams including hardware systems engineering, SRE, and cloud service owners to understand the gaps in the cluster management ecosystem and create solutions that meet their stringent reliability demands.
  • Define and execute key technical projects necessary to achieve highly reliable and scalable scheduling and workload management abstractions.
  • Research and develop key foundational capabilities around multi-tenancy, isolation, performance optimization, observability, simulation, hitless upgrades, machine failure resilience, failure domain awareness, autoscaling, etc.
  • Maintain academic partnerships, collaborate with hardware vendors, and engage with open communities that are relevant to this domain.
  • Publish key findings in relevant conferences and journals, or file patents as appropriate.
Requirements
  • Ph.D. or Master’s degree in Computer Science or a related field.
  • 10+ years of experience leading organizations or teams that support large scale cloud infrastructure.
  • Deep practical experience in designing and scaling clusters of more than 10,000 machines using technologies like Kubernetes, OpenStack, Google Borg, or Facebook Twine.
  • Technical mastery of foundational technologies used in cluster management, such as cgroups, namespaces, overlay networking, and hardware virtualization.
  • Strong API design and system design skills, along with the fluency to read and write code in a modern programming language like Go or Rust.
  • Exceptional communication skills required to negotiate, collaborate with and educate cross-functional teams across the globe.
  • Optional: Understanding of the Open Compute Project (OCP) ecosystem.
Benefits
  • Competitive salary package
  • Long-term personal growth space
  • Opportunities to work on high profile initiatives that impact the whole company
  • Opportunities to work with the brightest minds in software engineering (including Huawei Fellows and world-renowned professors)
  • A multi-cultural, international working environment
  • Work for an international world leader, an established yet still rapidly growing Fortune 500 company
Check out Life at Huawei Ireland Research Centre:
DUE TO THE HIGH VOLUME OF REPLIES, ONLY CANDIDATES WHO ARE SHORTLISTED FOR INTERVIEW WILL BE CONTACTED.
Privacy Statement
Please read and understand our West European Recruitment Privacy Notice before submitting your personal data to Huawei so that you fully understand how we process and manage your personal data received.