Site Reliability Engineer

5 days ago

Santiago de Querétaro, Querétaro de Arteaga, México MedTrainer Full-time

Company Description

MedTrainer is an innovator in the healthcare industry, changing the landscape of technology offerings with our Platform Solution, which comprises our proprietary Learning Management System (LMS), our core focus on Compliance Training, and our Managed Services offering in Credentialing and Compliance Management.

We impact thousands of healthcare providers, and we are building the future of healthcare through innovation, scale, and collaboration.

Job Description

Looking for a Site Reliability Engineer who can build, scale, maintain, and monitor highly available, secure, and cost-efficient cloud platforms and Kubernetes workloads with a strong focus on reliability engineering practices (SLIs/SLOs, error budgets, incident response, postmortems). Own production readiness and operational excellence across infrastructure and delivery tooling. Ensure performance, uptime, and scalability while maintaining high standards of code quality and thoughtful design. Lead the transition and continuous improvement of applications and infrastructure toward resilient, automated, and observable systems.

Qualifications

Bachelor's in Computer Science, equivalent degree, or equivalent professional experience.
3+ years working on distributed systems and cloud operations.
Strong hands-on experience with at least two major cloud providers (Azure, AWS, GCP) and their managed Kubernetes services.
Deep experience architecting and/or operating large Kubernetes clusters: workload identity, networking, storage, autoscaling, upgrades, security, and multi-tenancy.
Container expertise (Docker/OCI), including packaging and configuration; service mesh experience is a plus.
Advanced GitHub Actions expertise: reusable workflows/composites, concurrency/queueing, environments and approvals, OIDC federation, artifacts, caching, dependency review, and policy-as-code.
Strong Python skills (required) for Pulumi-based IaC, tooling, and automation; Golang knowledge is a plus.
Familiarity with CI/CD and change management, and experience with progressive delivery.
Observability stack experience and alerting practices tied to SLOs.
Configuration of cloud-native networking, storage, Linux, security controls, and cost governance.
Experience migrating and scaling infrastructure across clouds.
Relevant certifications (e.g., CKA) are a plus.
Advanced English (optional)

Responsibilities

Design, build, and operate production-grade Kubernetes (AKS) clusters and supporting services with high availability, security, and cost optimization.
Architect, implement, and maintain CI/CD using GitHub Actions (advanced), including reusable workflows, matrices, environments, required approvals, OIDC-based cloud auth, self-hosted runners, and policy controls.
Define, codify, and evolve Infrastructure as Code with Pulumi (Python) as the primary stack; create reusable components, enforce code reviews, testing, and documentation.
Develop and maintain configuration management with Ansible (roles, collections, inventories, playbooks) for OS, middleware, and app operations.
Implement progressive delivery and deployment strategies (blue/green, canary, feature flags) and automate rollback/roll-forward based on health checks and SLOs.
Establish comprehensive observability (metrics, logs, traces, profiles) with alerting tied to SLIs; drive capacity planning, performance tuning, and chaos/resiliency testing.
Lead incident management and on-call response; coordinate triage, communication, mitigation, root-cause analysis, and follow-through on corrective actions.
Partner with product and engineering to design for reliability (readiness/liveness probes, graceful shutdown, backpressure, retries/timeouts, circuit breakers).
Implement security best practices (least privilege, secrets management) and ensure compliance with internal policies and audits.
Continuously review existing systems, eliminate toil via automation, reduce technical debt, and document operational runbooks and standards.

Essential technologies and/or skills:

Exceptional problem-solving, with the ability to anticipate and remediate issues before they affect business productivity.
Proven experience handling production environments and being available for emergencies.
Clear, calm communication with technical and non-technical audiences.
Passion for detail and a structured, methodical mindset in design, execution, and documentation.
Professional, positive approach with strong ethics and high working morale.
Curiosity to learn, bias for automation, and a true can-do attitude.
Cloud Platforms (Azure, AWS)
Version control tools (Git/GitHub)
Continuous Integration servers (GitHub Actions as primary)
Configuration management (Ansible)
Containers (Docker/OCI)
Infrastructure Orchestration (Pulumi/Python)
Monitoring and analytics (metrics/logs/traces, APM, alerting)
Secrets management and security scanning/signing
Incident management and on-call tooling
Python (scripting level)
MySQL

Additional Information

What We Offer

Competitive monthly net salary: $45,000 – $70,000 MXN.
100% remote work from anywhere in Mexico.
Major Medical Insurance and healthcare coverage.
Home office and ergonomics support (internet, electricity, office chair).
Professional development opportunities, including English classes.
Wellness benefits such as TotalPass gym discounts.
Savings plan.
Paid time off, including personal days.
A collaborative, international, and growth-oriented environment.

All your information will be kept confidential according to EEO guidelines.

Site Reliability Engineer

2 weeks ago

Santiago de Querétaro, Querétaro de Arteaga, México RELEX Solutions Full-time

Technical Service Consultant/Site Reliability Engineer. Based at: RELEX office in Mexico. Employment type: Permanent, full-time. Travel: Some ad hoc travel to client sites and the Atlanta office may be required. The RELEX team in the Americas is growing, and we're now looking for a Technical Consultant/Site Reliability Engineer. You'll join our global continuous...
Site Reliability Engineer

1 week ago

Ciudad de México, Ciudad de México Azkait Full-time

AZKAIT is a Mexican company that finds and connects the best IT talent with companies in Latin America and the United States. We are looking for your talent as a Site Reliability Engineer (SRE). Requirements: Bachelor's or engineering degree in Systems, Computer Science, or a related field. 5+ years of experience in SRE, DevOps, or Software Engineering roles. Experience...
Site Reliability Engineer

23 hours ago

Ciudad de México, Ciudad de México Mastercard Full-time

Our Purpose: Mastercard powers economies and empowers people in 200+ countries and territories worldwide. Together with our customers, we're helping build a sustainable economy where everyone can prosper. We support a wide range of digital payments choices, making transactions secure, simple, smart and accessible. Our technology and innovation, partnerships...
Senior Cloud Site Reliability Engineer

5 days ago

Santiago de Querétaro, Querétaro de Arteaga, México RELEX Solutions Full-time

We are now looking for a full-time Cloud SRE to join our RELEX team in Mexico. As a Senior Cloud SRE, you will be building, maintaining, and operating various tool sets and operations in a highly critical environment. We are looking for a proactive, open-minded, and self-driven person who will help us enhance private cloud on-site support activities into the next...
Site Reliability Engineer

2 weeks ago

Ciudad de México, Ciudad de México Sur Full-time

As the Site Reliability Engineer you will support and scale the infrastructure powering their secure, mission-critical SaaS platform. You must be confident in operating and debugging both modern infrastructure (cloud-native, containerized services) and classic Windows production environments (IIS, SQL Server AlwaysOn, Service Broker), with the ability to...
Lead Site Reliability Engineer

1 week ago

Ciudad de México, Ciudad de México Pathlock Full-time

About Pathlock: Pathlock is a leader in application security, access governance, and compliance automation. Our cloud-based solutions help organizations secure critical applications, mitigate risk, and enforce policies across a diverse IT landscape. Job Summary: Join Pathlock, a fast-growing leader in Governance, Access and Compliance, where you'll help shape...
Site Reliability Engineer

2 weeks ago

Ciudad de México, Ciudad de México Tech Mahindra Full-time

We're hiring. We are seeking a talented Site Reliability Engineer (SRE) in CDMX with robust experience in Azure environments, Kubernetes, and DevOps practices. Your mission will be to ensure the reliability, scalability, and automation of our critical platforms. If you thrive on solving complex challenges, automating processes, and ensuring seamless operations,...
FBS Site Reliability Engineer

2 weeks ago

Ciudad de México, Ciudad de México Capgemini Full-time

Our Client is one of the United States' largest insurers, providing a wide range of insurance and financial services products with gross written premiums well over US$25 Billion (P&C). They proudly serve more than 10 million U.S. households with more than 19 million individual policies across all 50 states through the efforts of over 48,000 exclusive and...
Senior Site Reliability Engineer

5 days ago

Ciudad de México, Ciudad de México Thomson Reuters México Full-time

Are you passionate about the chance to bring your experience to a world-class company that is market-leading for both content and technology? If yes, we're looking for you. Join our team. The Senior Site Reliability Engineer (SRE) will implement Site Reliability Engineering and DevOps best practices. Feed non-functional requirements into the product backlog,...
