System Reliability Engineer

6 days ago

WorkFromHome, Mexico Zipdev Full-time

Description
We’re looking for a passionate and experienced System Reliability Engineer to play a key role in designing, implementing, and maintaining our evolving cloud‑native platform. You’ll be instrumental in shaping our reliability practices, automating operational tasks, and driving continuous improvement across our systems. This is an exciting time to join us as we embark on significant refactoring efforts and continue to leverage cutting‑edge technologies.

What You’ll Do
Design, build and maintain highly available, scalable and resilient systems on Google Cloud Platform (GCP).
Proactively monitor system health, performance and capacity, identifying and resolving issues before they impact users.
Develop and implement automation for infrastructure provisioning, deployment and operational tasks (e.g., CI/CD pipelines, disaster recovery).
Collaborate with development teams to ensure new features are designed and implemented with reliability and operational excellence in mind.
Manage and optimize our MongoDB Atlas instances, ensuring data integrity, performance and security.
Lead the refactoring effort of our Redis services to a more scalable and resilient Pub/Sub or Kafka‑based architecture.
Participate in on‑call rotations and incident response, conducting thorough post‑mortems and implementing preventative measures.
Contribute to the development of best‑practice runbooks and documentation for system operations.
Identify and implement opportunities for cost optimization without compromising reliability.

Requirements
5 years of experience in a System Reliability Engineering, DevOps or Site Reliability Engineering role.
Strong hands‑on experience with Google Cloud Platform (GCP) services (e.g., Compute Engine, Kubernetes Engine, Cloud SQL, Cloud Monitoring, Cloud Functions, Networking).
Proven expertise in managing and optimizing MongoDB Atlas (or other cloud‑hosted) databases.
Solid experience with containerization technologies, particularly Docker and Kubernetes.
Demonstrated experience with Infrastructure as Code (e.g., Terraform, Cloud Deployment Manager).
Proficiency in scripting languages such as Python, Go or Bash.
Familiarity with message queuing systems like Redis, RabbitMQ or Kafka; direct experience with Kafka or Google Cloud Pub/Sub is a must.
Familiarity with Prometheus, Grafana or similar monitoring and alerting tools.
Experience with service mesh technologies (e.g., Istio).
Experience with CI/CD tools and practices.
Strong understanding of network protocols, security best practices and distributed systems.
Excellent problem‑solving skills with a methodical approach to troubleshooting complex issues.
Ability to communicate effectively with both technical and non‑technical stakeholders.
A proactive mindset with a commitment to continuous learning and improvement.

Benefits
Work remotely Monday – Friday, 40 hours a week (no weekends).
Vacation: 10 business days a year.
Holidays: 5 National Holidays a year.
Company Holidays: 5 Company Holidays a year (Christmas Eve, Christmas Day, New Year’s Eve, New Year’s Day, Zipdev Day).
Parental Leave.
Health Care Reimbursement.
Active Lifestyle Reimbursement.
Quarterly Home Office Reimbursement.
Payroll Deduction Purchase Plans.
Longevity Bonus.
Continuous Learning Bonus.
Access to Training and Professional Development Platforms.
Did we mention it’s REMOTE?

Key Skills
Kubernetes, FMEA, Continuous Improvement, Elasticsearch, Go, Root Cause Analysis, Maximo, CMMS, Maintenance, Mechanical Engineering, Manufacturing, Troubleshooting

Employment Type: Full‑Time
Department / Functional Area: Engineering
Experience: years
Vacancy: 1

One of our core values at Zipdev is Be authentic. That’s why we encourage you to answer the application form in your own words; we are interested in getting to know you, not a digital assistant.

Wondering how our remote environment or our payment method works? We’ve put together some helpful answers in our FAQs at the bottom of our career site. Take a look and let us know if you have any other questions.

Reliability Engineer • Mexico City, Mexico

Site Reliability Engineer

2 weeks ago

WorkFromHome, Mexico KI people Full-time

In Search of the Best Global IT & Digital Talent. We are looking for a Site Reliability Engineer to work in hybrid mode from GDL, MTY or CDMX on a multicultural project with stability and growth in the short, medium and long term. Role Overview: The SRE Operations...
Senior Cloud Reliability Engineer — Remote

6 days ago

WorkFromHome, Mexico Zipdev Full-time

A technology company is seeking a System Reliability Engineer in Mexico City. The role involves designing and maintaining resilient systems on Google Cloud Platform, enhancing reliability practices, and automating operational tasks. Ideal candidates have 5+ years of experience, strong skills in GCP and container technologies, and a proactive mindset....
Remote Site Reliability Engineer

3 weeks ago

WorkFromHome, Mexico Resend Full-time

A modern email platform company is seeking a Site Reliability Engineer for a fully remote position. In this role, you will enhance system reliability and automation, monitor performance parameters, and collaborate with engineering teams. Ideal candidates will have over 5 years in Site Reliability or Infrastructure Engineering, strong skills in Node.js and...
Site Reliability Engineer

4 weeks ago

WorkFromHome, Mexico - Full-time

JOB DESCRIPTION Site Reliability Engineer (SRE) - Application Performance Monitoring (APM) Location: Monterrey, Nuevo León, Mexico (Hybrid - candidates must reside in Monterrey or the metropolitan area) Language requirement: Fluent English (spoken and written) About the Role We're looking for a Site Reliability Engineer (SRE) with a passion for Application...
Sr. Site Reliability Engineer

2 weeks ago

WorkFromHome, Mexico Nova Full-time

Sr. Site Reliability Engineer (Remote, Mexico) at Nova.
Site Reliability Engineer

1 week ago

WorkFromHome, Mexico National Oilwell Varco, Inc. Full-time

Site Reliability Engineer (SRE) – Application Performance Monitoring (APM) Location: Monterrey, Nuevo León, Mexico (Hybrid – candidates must reside in Monterrey or the metropolitan area) Language requirement: Fluent English (spoken and written) About the Role We’re looking for a Site Reliability Engineer (SRE) with a passion for Application...
Senior Site Reliability Engineer

6 days ago

WorkFromHome, Mexico American Express Full-time

A leading financial services company is looking for a Software Reliability Engineer in Mexico, Ciudad de México. The role requires at least 8 years of experience in system design, algorithms, and software engineering. You will be responsible for ensuring high availability and reliability of applications, using your expertise in cloud-native principles and...
Senior Site Reliability Engineer

2 weeks ago

WorkFromHome, Mexico Finastra Full-time

A global provider of financial software solutions is seeking a Senior Site Reliability Engineer in Mexico to drive operations and ensure reliability for their Cloud services. The ideal candidate will have extensive experience in software development, particularly in Cloud environments like Azure. They will work on implementing robust infrastructure and...
Remote SRE: Incident Management

3 days ago

WorkFromHome, Mexico F5 Networks, Inc. Full-time

A global technology company seeks a Reliability Engineer to ensure system reliability and performance. This role includes managing incident responses, designing observability tools, and automating operations. Candidates should have a Bachelor's degree in IT or related fields along with 3+ years in Site Reliability Engineering, and experience with automation...
Remote Site Reliability Engineer – Chaos Testing

4 weeks ago

WorkFromHome, Mexico Capgemini Full-time

A leading global consulting firm seeks a Reliability Engineer to oversee infrastructure tasks, ensuring system resilience and minimizing downtime. The role requires at least 6 years of relevant experience and fluency in English. Candidates should possess strong coding skills, particularly in Java and C++, along with hands-on troubleshooting experience. The...
