Machine Learning Infrastructure Observability - Expert Software Engineer
Huawei
- Dublin
- Permanent
- Full-time
- Design and develop observability and monitoring for our ML fleet infrastructure, including GPU clusters, distributed storage, and compute nodes.
- Design and develop observability for AI cluster operations to support proactive maintenance and capacity planning.
- Drive efficiency improvements and guide AI/ML operations engineers on observability best practices.
- Evaluate cutting-edge observability technologies for hardware accelerators and next-generation networking infrastructure.
- Provide technical leadership and mentorship to junior SREs, SDEs and Data Scientists.
- Bachelor's or Master's degree in Computer Science, Electrical Engineering, or a related field.
- Minimum 5 years of hands-on experience in SRE or DevOps roles focused on ML infrastructure, including AI infrastructure and AI monitoring.
- Proficiency in Linux, low-level debugging, and system performance analysis.
- Strong scripting skills (Python, Bash) for automation and monitoring.
- Experience with Kubernetes, Docker, and container orchestration.
- Excellent communication skills and ability to collaborate across teams.
- Competitive salary package
- Room for long-term personal growth
- Opportunities to work on high-profile initiatives that impact the whole company
- Opportunities to work with the brightest minds in software engineering (including Huawei Fellows and world-renowned professors)
- A multi-cultural, international working environment
- Work for an international world leader, an established yet still rapidly growing Fortune 500 company