Reliability Engineering Manager

3 weeks ago


Santiago de Querétaro, México · Petco · Full-time

Summary:

Lead and grow a team of Reliability Engineers responsible for designing, implementing, and operating the platforms, practices, and tooling that ensure the availability, performance, and resilience of our production systems. You will drive reliability, scalability, and operational excellence across the software delivery lifecycle, while aligning reliability goals with business and product objectives. This role requires strong leadership, deep technical understanding of distributed systems and SRE practices, and a strategic mindset to manage risk, guide incident response, and continuously improve reliability outcomes.

Duties & Responsibilities:

  • Lead and manage a team of Site Reliability Engineers, providing coaching, mentorship, and performance feedback.
  • Partner with senior leadership to define reliability objectives and align SRE strategies with overall business and product goals.
  • Define, implement, and evolve SLOs, SLIs, and error budgets in collaboration with product and engineering teams.
  • Oversee the reliability, performance, and capacity of production systems, including incident management, post-incident reviews, and problem management.
  • Drive automation for operational tasks, deployments, and recovery playbooks to reduce toil and improve consistency.
  • Design and maintain infrastructure and platform reliability using infrastructure as code tools such as Terraform, Ansible, or similar.
  • Guide the implementation and management of containerized and cloud-native platforms (for example, Kubernetes) with a focus on resilience, scalability, and safe rollouts.
  • Own observability practices and tooling (logging, metrics, tracing, alerting) to ensure proactive detection and fast diagnosis of issues.
  • Champion best practices for security, compliance, and governance in production environments.
  • Collaborate with cross-functional teams to ensure reliability is considered in architecture, design, and release planning.
  • Foster a culture of blameless incident reviews, learning, and continuous improvement within the Reliability Engineering organization.
  • Manage relationships with external vendors and service providers that support reliability, monitoring, and infrastructure needs.

Minimum Qualifications:

  • Bachelor’s degree in Computer Science, Engineering, or a related field.
  • 5+ years of experience in Site Reliability Engineering, Production Engineering, or related fields, including at least 2 years in a leadership or management role.
  • Strong proficiency in scripting or programming languages such as Python, Java, Node.js, or Next.js.
  • Experience operating large-scale systems on cloud platforms such as AWS, Azure, or Google Cloud Platform.
  • In-depth knowledge of containerization and orchestration technologies such as Docker and Kubernetes.
  • Experience with infrastructure as code and configuration management tools (for example, Terraform, Ansible, or similar).
  • Hands-on experience with observability and incident management tools (for example, Prometheus, Grafana, Datadog, PagerDuty, or equivalents).
  • Solid understanding of SRE principles, including SLOs/SLIs, error budgets, capacity planning, and incident response.
  • Excellent problem-solving, troubleshooting, and communication skills, with the ability to influence and collaborate across teams.


