
Software Architect, Reliability Engineering
- Ireland
- Permanent
- Full-time
- Partner with senior technical leaders across Twilio to set and communicate the reliability strategy, translating business goals into measurable outcomes.
- Influence company-wide architectural decisions while balancing long-term vision with near-term and compliance needs.
- Lead the design, implementation, and operation of scalable solutions and paved roads that enable reliable, high-traffic services.
- Guide architectural decisions toward availability, performance, resilience, and cost efficiency using Kubernetes, AWS, Terraform, and modern observability tooling.
- Ensure integrity and quality across the service lifecycle; design fault-tolerant architectures, incident response, disaster recovery, and capacity/cost management.
- Collaborate with product and cross-functional teams to identify reliability risks and convert them into actionable designs, programs, and tooling.
- Establish and champion reliability practices and drive systemic improvements.
- Mentor and grow engineers and technical leaders.
- Track and apply emerging SRE, cloud, and large-scale systems best practices; introduce pragmatic innovations that improve reliability at scale.
- 15+ years of experience in Reliability Engineering, Software Engineering, or DevOps roles with a focus on infrastructure, backend systems, and reliability, including experience at the principal or architect level.
- Strong experience in driving strategic technical decisions and defining long-term technical vision.
- In-depth understanding of the role of Reliability Engineering in a large and diverse SaaS organization.
- Experience driving cross-org technical architecture outcomes.
- Knowledge of cloud architecture, DevOps practices, and large-scale systems design with microservices.
- Bachelor's or Master's degree in Computer Science, Engineering, or a related field (or equivalent experience).
- Strong production experience, including operational management, scaling, partitioning strategies, and tuning for performance and reliability in high-scale environments.
- Hands-on experience with Kubernetes (e.g., EKS), deploying and managing stateful services, and cloud services like AWS.
- Proficiency in infrastructure-as-code tools such as Terraform or CloudFormation for automating infrastructure.
- Expertise in observability tools (e.g., Prometheus, Grafana, Datadog) for monitoring distributed systems and setting up alerting.
- Proficient in at least one programming language (e.g., Go, Python, Java) for building automation and tooling.
- Experience designing incident response processes, SLOs/SLIs, runbooks, and participating in on-call rotations.
- Experience running cross-functional post-incident reviews and driving improvements.
- Strong understanding of distributed systems principles, including consensus, durability, throughput, and availability tradeoffs.
- Proven track record of leading reliability improvements in data-intensive or mission-critical systems and collaborating with engineering teams.
- Excellent problem-solving, analytical, verbal, and written communication skills, with the ability to work in cross-functional and distributed environments.
- Demonstrated leadership in mentoring teams, influencing decisions, and balancing long-term objectives with short-term needs.
- Ability to influence and build effective working relationships with all levels of the organization.
- Specific experience owning and operating large AWS footprints.
- Knowledge of Kubernetes architecture and concepts.
- Experience with data technologies like Apache Kafka, AWS MSK, or similar for reliable streaming.
- Passion for building reliable products, with prior projects in high-availability systems.