Lead Site Reliability Engineer
OpenText
- Cork
- Permanent
- Full-time
- Managing and scaling distributed data services including Kafka, Elasticsearch, Cassandra, Solr, Redis, and OpenSearch.
- Building and maintaining infrastructure in on-prem and public cloud environments (AWS, Azure, GCP).
- Developing and maintaining IaC templates using Terraform and Ansible.
- Ensuring systems are patched, secure, and compliant with internal standards.
- Collaborating with SRE and engineering teams to design, deploy, and monitor data platforms.
- Supporting incident resolution and participating in on-call rotations for critical services.
- Contributing to capacity planning and performance tuning of data services.
- Creating and updating documentation such as operational procedures, change plans, and incident reports.
- Participating in training and knowledge-sharing activities.
- Learning new technologies and contributing to automation and reliability improvements.
- Supporting service requests and ensuring SLA/OLA compliance.
- Working shifts and participating in a 24x7 on-call rotation, as required.
- Bachelor’s degree in Computer Science or a related field.
- 5+ years of experience in Information Technology, with a focus on large-scale enterprise systems.
- 3+ years of experience managing distributed data platforms (Kafka, Elasticsearch, Cassandra, Solr, Redis).
- Experience with OpenSearch is a strong asset.
- 2+ years of experience with automation tools such as Terraform and Ansible.
- Solid understanding of Linux systems and cloud infrastructure (AWS, Azure, GCP).
- Strong troubleshooting and problem-solving skills.
- Excellent written and verbal communication skills.
- Detail-oriented, reliable, and self-driven.
- Ability to work in a fast-paced environment and manage multiple priorities.
- Familiarity with ITIL principles; certification is a plus.
- Experience with observability tools (e.g., Prometheus, Grafana, ELK stack) is a plus.