Site Reliability Engineer
5 days ago
Company Description
MedTrainer is an innovator in the healthcare industry, changing the landscape of technology offerings with its Platform Solution, which comprises our proprietary Learning Management System (LMS), our core focus on Compliance Training, and our Managed Services offering in Credentialing and Compliance Management.
We impact thousands of healthcare providers, and we are building the future of healthcare through innovation, scale, and collaboration.
Job Description
Looking for a Site Reliability Engineer who can build, scale, maintain, and monitor highly available, secure, and cost-efficient cloud platforms and Kubernetes workloads with a strong focus on reliability engineering practices (SLIs/SLOs, error budgets, incident response, postmortems). Own production readiness and operational excellence across infrastructure and delivery tooling. Ensure performance, uptime, and scalability while maintaining high standards of code quality and thoughtful design. Lead the transition and continuous improvement of applications and infrastructure toward resilient, automated, and observable systems.
Qualifications
- Bachelor's in Computer Science, equivalent degree, or equivalent professional experience.
- 3+ years working on distributed systems and cloud operations.
- Strong hands-on experience with at least two major cloud providers (Azure, AWS, GCP) and their managed Kubernetes services.
- Deep experience architecting and/or operating large Kubernetes clusters: workload identity, networking, storage, autoscaling, upgrades, security, and multi-tenancy.
- Container expertise (Docker/OCI), including packaging and configuration; service mesh experience is a plus.
- Advanced GitHub Actions expertise: reusable workflows/composites, concurrency/queueing, environments and approvals, OIDC federation, artifacts, caching, dependency review, and policy-as-code.
- Strong Python skills (required) for Pulumi-based IaC, tooling, and automation; Golang knowledge is a plus.
- Familiarity with CI/CD, change management, and experience in progressive delivery.
- Observability stack experience and alerting practices tied to SLOs.
- Configuration of cloud-native networking, storage, Linux, security controls, and cost governance.
- Experience migrating and scaling infrastructure across clouds.
- Relevant certifications (e.g., CKA) are a plus.
- Advanced English (optional)
Responsibilities
- Design, build, and operate production-grade Kubernetes (AKS) clusters and supporting services with high availability, security, and cost optimization.
- Architect, implement, and maintain CI/CD using GitHub Actions (advanced), including reusable workflows, matrices, environments, required approvals, OIDC-based cloud auth, self-hosted runners, and policy controls.
- Define, codify, and evolve Infrastructure as Code with Pulumi (Python) as the primary stack; create reusable components, enforce code reviews, testing, and documentation.
- Develop and maintain configuration management with Ansible (roles, collections, inventories, playbooks) for OS, middleware, and app operations.
- Implement progressive delivery and deployment strategies (blue/green, canary, feature flags) and automate rollback/roll-forward based on health checks and SLOs.
- Establish comprehensive observability (metrics, logs, traces, profiles) with alerting tied to SLIs; drive capacity planning, performance tuning, and chaos/resiliency testing.
- Lead incident management and on-call response; coordinate triage, communication, mitigation, root-cause analysis, and follow-through on corrective actions.
- Partner with product and engineering to design for reliability (readiness/liveness probes, graceful shutdown, backpressure, retries/timeouts, circuit breakers).
- Implement security best practices (least privilege, secrets management) and ensure compliance with internal policies and audits.
- Continuously review existing systems, eliminate toil via automation, reduce technical debt, and document operational runbooks and standards.
Essential technologies and/or skills:
- Exceptional problem-solving, with the ability to anticipate and remediate issues before they affect business productivity.
- Proven experience handling production environments and being available for emergencies.
- Clear, calm communication with technical and non-technical audiences.
- Passion for detail and a structured, methodical mindset in design, execution, and documentation.
- Professional, positive approach with strong ethics and high working morale.
- Curiosity to learn, bias for automation, and a true can-do attitude.
- Cloud Platforms (Azure, AWS)
- Version control tools (Git/GitHub)
- Continuous Integration servers (GitHub Actions as primary)
- Configuration management (Ansible)
- Containers (Docker/OCI)
- Infrastructure Orchestration (Pulumi/Python)
- Monitoring and analytics (metrics/logs/traces, APM, alerting)
- Secrets management and security scanning/signing
- Incident management and on-call tooling
- Python (scripting level)
- MySQL
Additional Information
What We Offer
- Competitive monthly net salary: $45,000 – $70,000 MXN.
- 100% remote work from anywhere in Mexico.
- Major Medical Insurance and healthcare coverage.
- Home office and ergonomics support (internet, electricity, office chair).
- Professional development opportunities, including English classes.
- Wellness benefits such as TotalPass gym discounts.
- Savings plan.
- Paid time off, including personal days.
- A collaborative, international, and growth-oriented environment.
All your information will be kept confidential according to EEO guidelines.
-
Site Reliability Engineer
2 weeks ago
Santiago de Querétaro, Querétaro de Arteaga, México. RELEX Solutions. Full-time.
Technical Service Consultant/Site Reliability Engineer
Based at: RELEX office in Mexico
Employment type: Permanent, full-time
Travel: Some ad hoc travel to client sites and the Atlanta office may be required
The RELEX team in the Americas is growing, and we're now looking for a Technical Consultant/Site Reliability Engineer. You'll join our global continuous...
-
Site Reliability Engineer
1 week ago
Ciudad de México, Ciudad de México. Azkait. Full-time.
AZKAIT is a Mexican company that finds and connects the best IT talent with companies in Latin America and the United States.
We are looking for your talent as a Site Reliability Engineer (SRE).
Requirements:
- Bachelor's or engineering degree in Systems, Computer Science, or a related field.
- 5+ years of experience in SRE, DevOps, or Software Engineering roles.
- Experience...
-
Site Reliability Engineer
23 hours ago
Ciudad de México, Ciudad de México. Mastercard. Full-time.
Our Purpose
Mastercard powers economies and empowers people in 200+ countries and territories worldwide. Together with our customers, we're helping build a sustainable economy where everyone can prosper. We support a wide range of digital payments choices, making transactions secure, simple, smart and accessible. Our technology and innovation, partnerships...
-
Senior Cloud Site Reliability Engineer
5 days ago
Santiago de Querétaro, Querétaro de Arteaga, México. RELEX Solutions. Full-time.
We are now looking for a full-time Cloud SRE to join our RELEX team in Mexico. As a Senior Cloud SRE you will be building, maintaining, and operating various tool sets and operations in a highly critical environment. We are looking for a proactive, open-minded, and self-driven person who will help us enhance private cloud on-site support activities into the next...
-
Site Reliability Engineer
2 weeks ago
Ciudad de México, Ciudad de México Sur. Full-time.
As the Site Reliability Engineer you will support and scale the infrastructure powering their secure, mission-critical SaaS platform. You must be confident in operating and debugging both modern infrastructure (cloud-native, containerized services) and classic Windows production environments (IIS, SQL Server AlwaysOn, Service Broker), with the ability to...
-
Lead Site Reliability Engineer
1 week ago
Ciudad de México, Ciudad de México. Pathlock. Full-time.
About Pathlock:
Pathlock is a leader in application security, access governance, and compliance automation. Our cloud-based solutions help organizations secure critical applications, mitigate risk, and enforce policies across a diverse IT landscape.
Job Summary:
Join Pathlock, a fast-growing leader in Governance, Access and Compliance, where you'll help shape...
-
Site Reliability Engineer
2 weeks ago
Ciudad de México, Ciudad de México. Tech Mahindra. Full-time.
We're hiring! We are seeking a talented Site Reliability Engineer (SRE) in CDMX with robust experience in Azure environments, Kubernetes, and DevOps practices. Your mission will be to ensure the reliability, scalability, and automation of our critical platforms. If you thrive on solving complex challenges, automating processes, and ensuring seamless operations,...
-
FBS Site Reliability Engineer
2 weeks ago
Ciudad de México, Ciudad de México. Capgemini. Full-time.
Our Client is one of the United States' largest insurers, providing a wide range of insurance and financial services products with gross written premiums well over US$25 Billion (P&C). They proudly serve more than 10 million U.S. households with more than 19 million individual policies across all 50 states through the efforts of over 48,000 exclusive and...
-
Senior Site Reliability Engineer
5 days ago
Ciudad de México, Ciudad de México. Thomson Reuters México. Full-time.
Are you passionate about the chance to bring your experience to a world-class company that is market-leading for both content and technology? If yes, we're looking for you.
Join our team. The Senior Site Reliability Engineer (SRE) will implement Site Reliability Engineering and DevOps best practices. Feed non-functional requirements into the product backlog,...