Reliability Engineering Manager
3 weeks ago
Summary: Lead and grow a team of Reliability Engineers responsible for designing, implementing, and operating the platforms, practices, and tooling that ensure the availability, performance, and resilience of our production systems. You will drive reliability, scalability, and operational excellence across the software delivery lifecycle, while aligning reliability goals with business and product objectives. This role requires strong leadership, deep technical understanding of distributed systems and SRE practices, and a strategic mindset to manage risk, guide incident response, and continuously improve reliability outcomes.

Duties & Responsibilities:
- Lead and manage a team of Site Reliability Engineers, providing coaching, mentorship, and performance feedback.
- Partner with senior leadership to define reliability objectives and align SRE strategies with overall business and product goals.
- Define, implement, and evolve SLOs, SLIs, and error budgets in collaboration with product and engineering teams.
- Oversee the reliability, performance, and capacity of production systems, including incident management, post-incident reviews, and problem management.
- Drive automation for operational tasks, deployments, and recovery playbooks to reduce toil and improve consistency.
- Design and maintain infrastructure and platform reliability using infrastructure as code tools such as Terraform, Ansible, or similar.
- Guide the implementation and management of containerized and cloud-native platforms (for example, Kubernetes) with a focus on resilience, scalability, and safe rollouts.
- Own observability practices and tooling (logging, metrics, tracing, alerting) to ensure proactive detection and fast diagnosis of issues.
- Champion best practices for security, compliance, and governance in production environments.
- Collaborate with cross-functional teams to ensure reliability is considered in architecture, design, and release planning.
- Foster a culture of blameless incident reviews, learning, and continuous improvement within the Reliability Engineering organization.
- Manage relationships with external vendors and service providers that support reliability, monitoring, and infrastructure needs.

Minimum Qualifications:
- Bachelor’s degree in Computer Science, Engineering, or a related field.
- 5+ years of experience in Site Reliability Engineering, Production Engineering, or related fields, including at least 2 years in a leadership or management role.
- Strong proficiency in scripting or programming languages and frameworks such as Python, Java, Node.js, or Next.js.
- Experience operating large-scale systems on cloud platforms such as AWS, Azure, or Google Cloud Platform.
- In-depth knowledge of containerization and orchestration technologies such as Docker and Kubernetes.
- Experience with infrastructure as code and configuration management tools (for example, Terraform, Ansible, or similar).
- Hands-on experience with observability and incident management tools (for example, Prometheus, Grafana, Datadog, PagerDuty, or equivalents).
- Solid understanding of SRE principles, including SLOs/SLIs, error budgets, capacity planning, and incident response.
- Excellent problem-solving, troubleshooting, and communication skills, with the ability to influence and collaborate across teams.
-
Reliability Engineering Manager
3 weeks ago
Querétaro, México | Petco | Full-time
Summary: Lead and grow a team of Reliability Engineers responsible for designing, implementing, and operating the platforms, practices, and tooling that ensure the availability, performance, and resilience of our production systems. You will drive reliability, scalability, and operational excellence across the software delivery lifecycle, while aligning...
-
Manager Reliability Engineering
3 weeks ago
Querétaro, Qro., México | Petco | Full-time
Lead and grow a team of Reliability Engineers responsible for designing, implementing, and operating the platforms, practices, and tooling that ensure the availability, performance, and resilience of our production systems. You will drive reliability, scalability, and operational excellence across the software delivery lifecycle, while aligning reliability...
-
Reliability Engineering Manager
2 weeks ago
Querétaro City, México | Petco | Full-time
Summary: Lead and grow a team of Reliability Engineers responsible for designing, implementing, and operating the platforms, practices, and tooling that ensure the availability, performance, and resilience of our production systems. You will drive reliability, scalability, and operational excellence across the software delivery lifecycle, while aligning...
-
Reliability Engineering Manager
3 weeks ago
Querétaro, Qro., México | Petco | Full-time
Summary: Lead and grow a team of Reliability Engineers responsible for designing, implementing, and operating the platforms, practices, and tooling that ensure the availability, performance, and resilience of our production systems. You will drive reliability, scalability, and operational excellence across the software delivery lifecycle, while aligning...
-
Reliability Engineering Manager
3 weeks ago
Querétaro City, México | Petco | Full-time
Summary: Lead and grow a team of Reliability Engineers responsible for designing, implementing, and operating the platforms, practices, and tooling that ensure the availability, performance, and resilience of our production systems. You will drive reliability, scalability, and operational excellence across the software delivery lifecycle, while aligning...
-
Reliability Engineering Manager
3 weeks ago
Santiago de Querétaro, México | Petco | Full-time
Summary: Lead and grow a team of Reliability Engineers responsible for designing, implementing, and operating the platforms, practices, and tooling that ensure the availability, performance, and resilience of our production systems. You will drive reliability, scalability, and operational excellence across the software delivery lifecycle, while aligning...
-
Manager Reliability Engineering
4 weeks ago
Querétaro, Qro., México | Petco | Full-time
Lead a team responsible for designing, implementing, and maintaining the infrastructure and processes that support the development, deployment, and operation of our software systems. You will play a critical role in driving efficiency, reliability, and scalability across our software development lifecycle while ensuring alignment with business objectives....