Manager Enterprise Operations
Mastercard
- Dublin
- Permanent
- Full-time
We are seeking a highly driven and technically strong Manager – Site Reliability Engineering (SRE) / Enterprise Operations to lead mission-critical cloud and platform operations across multi-cloud and hybrid environments. This role will drive reliability engineering practices, operational excellence, observability maturity, incident governance, and 24x7 enterprise support models.
The ideal candidate combines deep technical expertise with operational leadership and has experience managing globally distributed teams supporting production workloads at scale.
Key Responsibilities
1. Reliability Engineering & Production Stability
- Lead reliability practices aligned with SRE principles (SLIs, SLOs, error budgets).
- Own platform availability, performance, resiliency, and capacity management.
- Drive proactive reliability improvements through automation and root cause elimination.
- Establish and monitor service health KPIs across cloud and on-prem platforms.
- Manage 24x7 Enterprise Operations (L1/L2/L3) support model across global regions.
- Lead Major Incident Management (MIM) and ensure rapid restoration of services.
- Govern incident, problem, and change management processes.
- Conduct blameless RCAs and ensure preventive actions are implemented.
- Oversee operations across AWS, Azure, PCF and Kubernetes-based platforms.
- Ensure strong governance of CI/CD pipelines, production deployments, and release controls.
- Implement standardized monitoring, alerting, and observability frameworks.
- Drive infrastructure-as-code, configuration management, and automation initiatives.
- Establish enterprise-wide monitoring standards using tools such as Splunk, Dynatrace, Prometheus, and Grafana.
- Drive alert rationalization to reduce noise and improve signal quality.
- Implement single-pane-of-glass visibility across multi-cloud environments.
- Partner with Security teams to ensure operational compliance.
- Support vulnerability management, patch governance, and cloud security posture monitoring.
- Ensure audit readiness for production platforms.
- Identify repetitive operational tasks and drive an automation-first mindset.
- Improve MTTR, change failure rate, and deployment frequency.
- Drive operational maturity through SOP standardization and runbook automation.
- Lead and mentor SRE and Enterprise Ops engineers.
- Drive skill uplift (certifications, cross-training, DevOps capabilities).
- Collaborate with Engineering, Security, and Product teams.
- Provide executive-level reporting on reliability and operational risk.
- 10–12+ years of experience in Cloud Operations / SRE / Enterprise Production Support.
- 3–5+ years in a leadership or managerial role.
- Strong experience in AWS, Azure, and/or PCF cloud environments.
- Deep understanding of Kubernetes, container orchestration, and distributed systems.
- Experience managing 24x7 support models.
- Strong background in incident management and production governance.
- Experience implementing SLO/SLI frameworks.
- Proficiency in CI/CD tools (Jenkins, GitHub Actions, Azure DevOps, etc.).
- Strong automation skills (Python, Bash, Terraform, Ansible).
- AWS Solutions Architect (Associate/Professional)
- Azure Architect / DevOps Engineer
- AWS/Azure SysOps
- AWS/Azure Networking
- Certified Kubernetes Administrator (CKA)
- ITIL (preferred)
This role supports mission-critical enterprise platforms operating 24x7.
The selected candidate must be willing and able to work in rotational shifts, including weekends, night shifts, and on-call coverage.
Key Competencies
- Operational Excellence
- Reliability Engineering Mindset
- Crisis Leadership
- Automation-First Thinking
- Data-Driven Decision Making
- Strong Communication & Executive Reporting