Site Reliability Engineer

1 week ago

Monterrey, N.L., Mexico | Concord USA | Full-time

Location: Hybrid in Monterrey, MX. 8 days a month on-site.

A relocation stipend may be available for candidates not currently based in Monterrey.

Requirement: Must be legally authorized to work for any Mexican employer without sponsorship, now or in the future.

About Us

Concord isn't your typical consulting firm; we're an execution-focused company passionate about delivering results. Our mission is to help clients enhance customer experiences, optimize operations, and revolutionize product offerings through seamless integration, optimization, and activation of technology and data.

Our services and solutions include Digital Experience (Salesforce, Headless Commerce, UI/UX), Data and Analytics (Snowflake, Databricks, Martech Analytics), and Engineering and Application Services (Application Modernization, Greenfield Apps, Portal Buildout, etc.).

About the Role

We are seeking a strategic, technically adept, and hands-on SRE Manager to lead the reliability, scalability, and operational excellence of our production systems. This role is ideal for a leader who thrives in high-pressure environments, excels at debugging complex production issues, and is passionate about building and mentoring high-performing teams.

The SRE Manager will be responsible for hiring and managing a team of SREs, driving incident response and postmortem processes, and collaborating with multiple product teams to build and maintain robust CI/CD pipelines and deployment practices. This role demands a strong sense of ownership, a deep understanding of cloud-native infrastructure, and the ability to lead by example.

Business Alignment

The SRE Manager will partner with business stakeholders to ensure reliability goals support customer experience, compliance, and growth targets. This includes aligning SRE initiatives with broader business objectives such as revenue protection, innovation, and regulatory adherence.

Key Responsibilities

Build and lead a high-performing Site Reliability Engineering team.
Create individualized development plans for SREs, encourage participation in industry conferences, and support certification programs.
Debug and resolve complex production issues, ensuring minimal downtime and rapid recovery.
Own the incident lifecycle, including coordination, communication, and creation of detailed postmortem documentation.
Implement blameless postmortems and maintain a library of runbooks for common incident types.
Follow up with product teams to ensure resolution and implementation of long-term fixes.
Partner with internal product and engineering teams to understand infrastructure needs and deliver scalable, secure, and reliable solutions.
Drive the design, implementation, and automation of cloud infrastructure using Azure, Terraform, and Kubernetes (AKS).
Lead the adoption and management of tools such as Argo CD, Argo Workflows, Azure DevOps, and Octopus Deploy.
Architect and manage API Gateways, WAFs, Service Mesh, and multi-cloud networking (VNets, private networks).
Establish and enforce deployment best practices, including documentation, versioning, rollback strategies, and environment management.
Collaborate with product teams to build and maintain CI/CD pipelines, ensuring reliable and repeatable deployments.
Foster a culture of ownership, accountability, and continuous improvement across the team.
Define and track key performance indicators (KPIs) for system reliability and team effectiveness.
Define and manage Service Level Objectives (SLOs) and error budgets for all critical services.
Lead the adoption of advanced observability tools for proactive reliability management.
Collaborate with security, compliance, and architecture teams through joint reviews, shared dashboards, and audits to ensure infrastructure meets enterprise standards.

Required Qualifications

10+ years of experience in infrastructure, DevOps, or SRE roles, with 3+ years in a technical leadership or management capacity.
Proven experience debugging and resolving production issues in large-scale systems.
Experience building and scaling cloud-native infrastructure on Azure.
Deep expertise in Kubernetes (AKS), CI/CD pipelines, and Infrastructure as Code (Terraform).
Strong understanding of networking, VNets, private cloud connectivity, and multi-cloud architectures.
Hands-on experience with Argo CD, Argo Workflows, and Azure DevOps.
Demonstrated ability to hire, mentor, and lead engineering teams.
Excellent communication and stakeholder management skills.
Strong problem-solving mindset with a bias for action and ownership.
Ability to create and maintain detailed deployment documentation and lead by example in operational excellence.
Advanced English proficiency (C1 or C2) with proven success collaborating in global, English-speaking environments.

Preferred Qualifications

Experience supporting internal product teams or platform engineering organizations.
Familiarity with FinOps, cost optimization, and cloud governance.
Exposure to compliance frameworks (SOC 2, ISO, HIPAA).
Experience with service mesh technologies (Istio, Linkerd).
Knowledge of emerging technologies such as AI/ML ops, edge computing, and sustainability practices.

What Success Looks Like

A high-performing SRE team that operates with autonomy and accountability.
Internal customers view the SRE team as a trusted partner in delivering reliable, scalable systems.
Infrastructure is automated, observable, and resilient by design.
Incidents are rare, well-managed, and always lead to learning and improvement.
CI/CD pipelines are robust, well-documented, and consistently deliver high-quality deployments.

***

Concord is an execution partner helping organizations drive digital transformation, modernization, and scalable technology solutions. We deliver results that solve real business challenges. We operate globally and are growing fast, shaping the future of technology. Join a team trusted by top companies to drive strategic growth and operational excellence.


Site Reliability Engineer

1 week ago

Monterrey, Nuevo León, Mexico | NOV | Full-time

Description: Site Reliability Engineer (SRE) – Application Performance Monitoring (APM). Location: Monterrey, Nuevo León, Mexico (Hybrid – candidates must reside in Monterrey or the metropolitan area). Language requirement: Fluent English (spoken and written). About the Role: We're looking for a Site Reliability Engineer (SRE) with a passion for Application...

Site Reliability Engineer

2 weeks ago

Monterrey, Nuevo León, Mexico | NOV | Full-time

Site Reliability Engineer (SRE) – Application Performance Monitoring (APM). Location: Monterrey, Nuevo León, Mexico (Hybrid – candidates must reside in Monterrey or the metropolitan area). Language requirement: Fluent English (spoken and written). About the Role: We're looking for a Site Reliability Engineer (SRE) with a passion for Application Performance...

Senior Site Reliability Engineer

5 days ago

Monterrey, Mexico | Datalogics | Full-time

**Senior Site Reliability Engineer (Hybrid from Monterrey)**
- **MXN $1,020,000 - $1,260,000/year (gross)**
- **Equity and comprehensive health benefits**
- **Hybrid from Monterrey, Mexico**
- **Full-time, Contract of Employment**

Are you passionate about optimizing software release processes in a fast-paced environment? Do you enjoy building tools...

Site Reliability Engineer

7 days ago

Monterrey, Mexico | BairesDev | Full-time

At BairesDev®, we've been leading the way in technology projects for over 15 years. We deliver cutting-edge solutions to giants like Google and the most innovative startups in Silicon Valley. Our diverse 4,000+ team, composed of the world's Top 1% of tech talent, works remotely on roles that drive significant impact worldwide. Site Reliability Engineer at...

Site Reliability Engineer

2 weeks ago

Monterrey, Mexico | National Oilwell Varco | Full-time

Overview: We are seeking a highly motivated and experienced Site Reliability Engineer (SRE) with a specialization in Application Performance Monitoring (APM) to join our team. You will be a key player in ensuring the reliability, performance, and scalability of our mission-critical applications and systems. You will work closely with software engineering and...

Site Reliability Engineer

2 weeks ago

Monterrey, Mexico | Concord USA | Full-time

Location: Hybrid in Monterrey, MX. 8 days a month on-site. Possibility to get a travel or relocation stipend for travel. Type of Employment: contract to hire. Initial 6-12 month contract with pay in USD. About Us: Concord isn't your typical consulting firm; we're an execution-focused company passionate about delivering results. Our mission is to help clients...

Senior Site Reliability Engineer

4 days ago

Monterrey, Mexico | Canonical | Full-time

Join to apply for the Senior Site Reliability Engineer role at Canonical. Canonical is a leading provider of open source software and operating systems to the global enterprise and technology markets. Our platform, Ubuntu, is widely used in breakthrough enterprise initiatives such as public cloud, data science, AI, engineering innovation and IoT. Our...

Senior Site Reliability Engineer

3 days ago

Monterrey, Mexico | Canonical | Full-time

Join to apply for the Senior Site Reliability Engineer role at Canonical. Canonical is a leading provider of open source software and operating systems to the global enterprise and technology markets. Our platform, Ubuntu, is widely used in breakthrough enterprise initiatives such as public cloud, data science, AI, engineering innovation and IoT. Our...

Site Reliability Engineer

2 weeks ago

Monterrey, Mexico | British American Tobacco | Full-time

**BAT MEXICO IS SEARCHING FOR A SITE RELIABILITY ENGINEER (Cloud & Data Center)**

**JOB TITLE**: Site Reliability Engineer (Cloud & Data Center)
**FUNCTION**: Digital Business Solutions
**SUB FUNCTION**: IDT Services
**CITY & COUNTRY**: Monterrey, Mexico

**ROLE SUMMARY**

**What are the key objectives and expectations from this role?**

**What is the direct impact of...