Chief Architect - Hardware Platforms Reliability - Permanent

Huawei

  • Dublin
  • Permanent
  • Full-time
  • 1 month ago
Job Description:
About the job
Are you an engineer or researcher interested in applying your expertise in constructing reliable server class hardware systems to the problems involved in building highly reliable hyperscale cloud infrastructure? We are looking for people who are motivated by hard challenges that lie at the intersection of hardware failures, operating systems, and large-scale distributed systems.
The Cloud Reliability Lab at the Huawei Ireland Research center has a mission to bring world class reliability to Huawei Cloud by solving cross-functional problems that span hardware, software, networking, monitoring and operations. We have teams working in all these areas with a diverse mix of people including industry veterans, academic researchers, and Ph.D. student interns. In your role, you will collaborate with the local teams in Ireland, research centers across Europe, and engineering teams around the world.
Responsibilities:
  • Lead the organization responsible for defining and defending meaningful hardware reliability KPIs for low-level infrastructure services (compute, storage, networking).
  • Work closely with cross-functional teams including hardware engineers, system architects, and software developers, to create solutions that meet stringent reliability requirements.
  • Define and execute key technical projects necessary to minimize the impact of hardware faults on cloud workloads.
  • Understand deep technical issues behind hardware pathologies and come up with a long-term strategy to improve the reliability experience of cloud customers.
  • Maintain academic partnerships, collaborate with hardware vendors and engage with open communities that are relevant to this domain.
  • Publish key findings in relevant conferences & journals or file patents as appropriate.
Requirements:
  • Ph.D. or Master's degree in Computer Science or a related field.
  • 10+ years of experience leading hardware platforms organizations or teams that support large-scale cloud infrastructure.
  • Deep expertise in system-level architecture, reliability engineering, fault tolerance mechanisms, and optimizing RAS architectures in a data center environment.
  • Experience in managing the end-to-end lifecycle (not just production qualification) of server-class hardware, including the skills to navigate complex TCO tradeoffs between efficiency and reliability.
  • Basic proficiency in simulation tools and reliability analysis techniques (e.g., fault injection, reliability block diagrams, survival analysis, failure rate analysis).
  • Exceptional communication skills to negotiate with, collaborate with, and educate cross-functional teams across the globe.
  • Optional: Hands-on experience working with AWS, Azure, GCP or other cloud systems.
Benefits:
  • Competitive salary package
  • Long-term personal growth space
  • Opportunities to work on high profile initiatives that impact the whole company
  • Opportunities to work with the brightest minds in software engineering (including Huawei Fellows and world-renowned professors)
  • A multi-cultural, international working environment
  • Work for an international world leader, an established yet still rapidly growing Fortune 500 company
Check out Life at Huawei Ireland Research Centre:
DUE TO THE HIGH VOLUME OF REPLIES, ONLY CANDIDATES WHO ARE SHORTLISTED FOR INTERVIEW WILL BE CONTACTED.
Privacy Statement
Please read and understand our West European Recruitment Privacy Notice before submitting your personal data to Huawei so that you fully understand how we process and manage your personal data received.
