
Staff Site Reliability Engineer (Swing shift, 4 days a week)
- Dublin
- Permanent
- Full-time
- Provide relief and sustainable resolution to issues within our infrastructure.
- Conduct root cause analysis of incidents and implement preventive measures.
- Participate in troubleshooting bridges and provide support during critical incidents.
- Use your experience in software development, systems engineering, and networking to proactively prevent repeatable issues.
- Drive initiatives with partner teams to improve the reliability and performance of the infrastructure through improved system design.
- Foster a culture that does not tolerate manual activity, resulting in a highly automated environment that delivers scalable solutions.
- Design, develop, and maintain scalable and reliable systems.
- Implement and manage monitoring, alerting, and incident response processes.
- Collaborate with development teams to ensure the reliability and performance of new features.
- Automate repetitive tasks to improve efficiency and reduce human error.
- Innovate and continuously improve system reliability, performance, and capacity.
- Experience in leveraging or critically thinking about how to integrate AI into work processes, decision-making, or problem-solving. This may include using AI-powered tools, automating workflows, analyzing AI-driven insights, or exploring AI's potential impact on the function or industry.
- 8+ years of experience in a Site Reliability Engineering or similar role.
- A degree in Computer Science, Engineering, or a related field.
- Self-motivated go-getter attitude with a proven ability to lead and drive initiatives across the organization.
- The ability to inspire collaboration, navigate ambiguity, and drive initiatives from concept to successful execution, consistently delivering impactful results.
- Extensive experience with ITIL-based IT operations, including incident, problem, and change management.
- Advanced expertise in Unix/Linux system administration, including troubleshooting memory, processes, storage, network connectivity, and performance issues using command-line utilities and shell scripting.
- Proficient in automation tools and security best practices, ensuring robust, scalable, and secure production operations across diverse environments.
- Comprehensive knowledge of networking protocols, including TCP/IP, DNS, HTTP/HTTPS, TLS/SSL, FTP/SFTP, and DHCP, among others.
- Solid experience with relational databases such as MySQL or Postgres, including performance tuning and query optimization.
- Experience with infrastructure-as-code and configuration management tools like Terraform, Puppet, or Ansible.
- Strong programming skills in languages such as Python, Go, or Java.
- Cloud experience across AWS, Azure, or GCP.
- Proficiency in using monitoring and logging tools like Splunk, Prometheus, Grafana, or ELK stack.
- Experience with Kubernetes to orchestrate the deployment, scaling, and management of containers.
- Excellent problem-solving skills and attention to detail.
- Excellent written and verbal communication skills, with the ability to clearly articulate solutions to technical problems.
- Ability to work shifts running from 3 pm to 1 am.
- Certifications in one or more public cloud platforms.
- Exposure to DevOps and Agile methodologies.
- Familiarity with CI/CD pipelines and tools like Jenkins or GitLab CI.
- Understanding of development on the ServiceNow platform.