Machine Learning Infrastructure Observability - Expert Software Engineer

Huawei

  • Dublin
  • Permanent
  • Full-time
  • 1 month ago
Job Description:

Company Overview: Our cutting-edge technology company is at the forefront of the AI revolution, and we're seeking an Expert to join our talented team. As a global leader in Cloud & ML infrastructure, we operate large fleets of ML accelerators and distributed systems. Our work directly impacts the rapid development and deployment of AI models across various domains.

Role Summary: As an Expert, you will be a pivotal force in shaping the efficiency, reliability, and scalability of our ML infrastructure by designing and developing observability solutions and tools. Your role involves close collaboration with technical leaders across multidisciplinary domains, including cloud infrastructure and ML software systems. Together, we aim to design observability that supports operational excellence across our fleet, ensuring seamless ML experiences for our customers.

Responsibilities:
  • Design and develop our ML fleet infrastructure observability/monitoring, including GPU clusters, distributed storage, and compute nodes.
  • Design and develop observability for AI cluster operations to support proactive maintenance and capacity planning.
  • Drive efficiency improvements and guide AI/ML operations engineers on observability best practices.
  • Evaluate cutting-edge observability technologies for hardware accelerators and next-generation networking infrastructure.
  • Provide technical leadership and mentorship to junior SREs, SDEs and Data Scientists.
Requirements:
  • Bachelor's or Master's degree in Computer Science, Electrical Engineering, or related field.
  • Minimum 5 years of hands-on experience in SRE or DevOps roles focused specifically on ML infrastructure, including AI infrastructure and AI monitoring.
  • Proficiency in Linux, low-level debugging, and system performance analysis.
  • Strong scripting skills (Python, Bash) for automation and monitoring.
  • Experience with Kubernetes, Docker, and container orchestration.
  • Excellent communication skills and ability to collaborate across teams.
Benefits:
  • Competitive salary package
  • Room for long-term personal growth
  • Opportunities to work on high-profile initiatives that impact the whole company
  • Opportunities to work with the brightest minds in software engineering (including Huawei Fellows and world-renowned professors)
  • A multi-cultural, international working environment
  • Work for an international world leader, an established yet still rapidly growing Fortune 500 company
Check out Life at Huawei Ireland Research Centre:

DUE TO THE HIGH VOLUME OF REPLIES, ONLY CANDIDATES WHO ARE SHORTLISTED FOR INTERVIEW WILL BE CONTACTED.

Privacy Statement
Please read and understand our West European Recruitment Privacy Notice before submitting your personal data to Huawei so that you fully understand how we process and manage the personal data we receive.
