Senior Site Reliability Engineer - Ireland
Arista Networks
- Dublin
- Permanent
- Full-time
- Design, build, and deploy production systems with a focus on scalability, reliability, observability, and performance, ensuring systems meet stringent security standards
- Develop and maintain comprehensive automation solutions to eliminate toil and streamline operational efficiency across production environments
- Proactively monitor production systems, establish intelligent alerting strategies, and implement automated incident response mechanisms to minimise downtime
- Create and maintain detailed incident response runbooks; conduct thorough postmortem analyses following incidents to identify root causes and prevent recurrence
- Collaborate with software engineering teams to identify and resolve infrastructural bottlenecks, designing innovative solutions that enhance product deployment workflows
- Manage and optimise monitoring infrastructure using industry-standard tools, ensuring comprehensive visibility across all systems
- Plan, communicate, and execute maintenance windows on production systems with minimal disruption to service availability
- Triage platform and infrastructural issues with decisiveness and analytical rigour; engage with third-party vendors and support teams as required
- Deploy new systems and updates in a staged, risk-managed manner, ensuring safe and incremental rollouts
- Survey and adopt best practices in infrastructure and platform management to maintain secure, scalable, and fault-tolerant systems
- Study the design and implementation details of open-source systems to enhance troubleshooting capabilities and accelerate issue resolution
- Work transparently with stakeholders to communicate system status, planned maintenance, and infrastructure improvements
- Bachelor's degree in Computer Science, Engineering, or equivalent professional experience (5+ years in a related infrastructure or systems role)
- Proficiency in one or more programming languages (Go, Python, or Bash shell scripting), with the ability to implement medium-complexity automation workflows
- Strong knowledge of Linux or UNIX from both administration and debugging perspectives
- Hands-on experience operating software systems, infrastructure, and complex applications at scale in production environments
- Demonstrated expertise in infrastructure-as-code principles and practices
- Strong problem-solving and software troubleshooting skills with a methodical, analytical approach
- Experience with server provisioning, particularly from storage and networking perspectives
- Proven ability to work collaboratively within cross-functional teams and communicate technical concepts clearly
- Experience with incident response, postmortem analysis, and continuous improvement methodologies
- Experience with container orchestration platforms, particularly Kubernetes
- Hands-on experience with Docker and virtualisation technologies
- Proficiency in managing monitoring stacks, including Prometheus and Grafana
- Experience with CI/CD systems such as GitLab CI/CD or Spinnaker
- Knowledge of infrastructure-as-code frameworks, particularly Terraform
- Experience managing databases such as PostgreSQL or equivalent relational database management systems
- Experience with artifact repositories and Docker registries
- Familiarity with cloud platforms (Google Cloud Platform, Amazon Web Services, or Microsoft Azure)
- Understanding of distributed systems architecture and principles
- Experience with performance tuning and system optimisation
- Knowledge of security best practices in infrastructure and systems design
- On-call support experience and comfort with incident response responsibilities