Cloud Infrastructure SRE/Engineer

3 weeks ago

Mexico City Natsoft Full-time

Senior Lead Recruiter – Hiring for IT industry (Mostly)

Apply to robertoms@natsoft.us with your English resume. After reading the full job description, put the name of the role you are applying for (Tier 3 or Tier 4) in the subject line.

Contract: Contractor
LATAM: (Colombia/Argentina)
Number of positions: (8) Tier 3 and (8) Tier 4
Shift Coverage: For the Tier 4 team, the mode of operation involves up to 12-hour shifts and/or providing on-call duties and participating in regular on-call rotations. Shift roster sample attached.
Core Skills: Very good logic (Bash OR preferably Python) + (Linux OR Kubernetes). English is a must.

We are seeking a highly skilled and motivated Cloud Infrastructure Engineer to join our Infrastructure Customer Engineering and Support team, part of the Red Hat Telco Cloud Organization. This role offers the opportunity to contribute to either the Tier 3 (L3S) or Tier 4 (L4S) Engineering team, both of which work closely together to ensure the performance, availability, and reliability of our cloud-based services and underlying infrastructure. The successful candidate will act as a critical technical subject matter expert, applying strong analytical knowledge to quickly diagnose and resolve complex issues across the entire cloud stack.

This team is responsible for managing, troubleshooting, and optimizing containerized applications and infrastructure deployed on Kubernetes, Red Hat OpenShift, and OpenStack platforms. Tier 3 will also support Nokia Container Services (NCS) and CloudBand Infrastructure Software (CBIS) products.

Tier 3 Engineers serve as the Subject Matter Experts (SME) for core cloud infrastructure technologies, lead the investigation and resolution of complex, high-severity customer issues, and provide end-to-end Escalation, Monitoring, and Emergency (EME) support, acting as a final escalation point to ensure service availability and meet SLAs.
This is a special Engineering task force dedicated to preventing and solving the most critical and strategic customer issues encountered in the field. Tier 4 Engineers deep dive into troubleshooting, traversing layers from high-level Kubernetes errors to pinpointing a kernel bug. They are deeply involved in technologies like Nokia Container Services (NCS) and CloudBand Infrastructure Software (CBIS), private clouds based on Kubernetes and OpenStack, and collaborate closely with developers and product engineers to bridge the gap between infrastructure and software.

Main Responsibilities (for both teams)

Manage, troubleshoot, and optimize containerized applications and infrastructure deployed on platforms like Kubernetes, Red Hat OpenShift, and OpenStack.
Lead the investigation and resolution of complex, high-severity customer incidents.
Utilize expertise to quickly identify root causes and implement effective, durable solutions.
Prepare and conduct rigorous Root Cause Analysis (RCA) for critical incidents to identify systemic issues and prevent recurrence.
Develop, test, and maintain robust automation scripts using Python and Ansible to streamline daily operational tasks and improve overall service efficiency.
Provide immediate support for urgent cases as part of an on-call rotation.
Stay current with industry best practices and emerging technologies in cloud and containerization.

Required Skills and Experience

Core Technical Expertise

Linux Expertise: Strong knowledge and proven hands-on experience with advanced Linux (CentOS) system administration. Familiarity with Red Hat and CentOS is highly valued.
Networking Foundations: Strong knowledge of core networking principles (TCP/IP, routing, load balancing, firewalls) in a cloud environment. A solid grasp of computer networking fundamentals, such as understanding of VLANs and IP routing, is a must.
Containerization & Virtualization: Strong knowledge of Kubernetes orchestration, OpenStack platforms, and Docker/containerization. Knowledge in areas like Podman, Kubernetes, Helm, and/or OpenStack, KVM/QEMU is a significant advantage.
Scripting and Automation: Solid Python scripting skills for task automation and system management. Proficiency in scripting with Bash and Python, or the willingness to learn and adapt, as well as familiarity with Ansible, is required.
Root Cause Analysis (RCA): Expertise in preparation and implementation of RCAs.
Escalation and Monitoring: Proven experience with EME (Escalation, Monitoring, and Emergency) management processes.

Problem-Solving and Operations

Problem-Solving Mindset: Sharp troubleshooting skills combined with an analytical mindset to dissect and address complex challenges.
Deep Dive Troubleshooting: Ability to traverse layers, starting with high-level K8s error messages, to pinpoint low-level issues like kernel bugs.
Monitoring and Logging: Experience with monitoring and logging tools such as ELK (Elasticsearch, Logstash, Kibana), Prometheus, and Grafana will be invaluable.

Beneficial Expertise (Added Advantage)

Advanced Networking Tools: Familiarity with advanced tools and technologies such as Calico, Multus, and Open vSwitch.
Storage Systems: Proficiency with storage solutions such as Ceph and Rook.
Database Expertise: Understanding of relational databases such as MySQL and MariaDB, as well as experience with etcd.
Certifications: One or more certifications from the list below will be considered an added advantage: RHCSA, RHCE, CKA, EX280 (Red Hat Certified Specialist in OpenShift Administration), EX380 (Red Hat Certified Specialist in OpenShift Automation and API Management).

Scheduling and Mode of Operation

Our team operates on a follow-the-sun model, ensuring 24/7 coverage and rapid response times. A rotational on-call schedule is a mandatory part of this position for both teams.
On-call duties include direct customer contact and prompt engagement in designated war rooms for case investigation and resolution. For the Tier 4 team, the mode of operation may involve up to 12-hour shifts and/or providing on-call duties and participating in regular on-call rotations. Interventions are performed in live environments and must strictly follow the Standard Network Touch Policy.

Seniority level: Mid-Senior level
Employment type: Contract
Job function: Information Technology
Industries: Information Technology & Services and Telecommunications
Location: Bogota, D.C., Capital District, Colombia

Azure SRE: Cloud Reliability

4 weeks ago

Mexico City Pinnacle Talent Placement Full-time

A global technology consulting company is seeking a Site Reliability Engineer (SRE) in Mexico City. This position focuses on supporting and maintaining cloud-based services with Microsoft Azure, ensuring high availability and reliability. You will work closely with various teams to enhance system scalability and performance. The role requires deep expertise...
Azure SRE: Cloud Reliability Engineer

2 weeks ago

Mexico City Pinnacle Talent Placement Full-time

A global technology consulting firm is seeking a Site Reliability Engineer (SRE) – Technical Member in Mexico City. The ideal candidate will have 5+ years of experience in SRE or DevOps roles, with strong skills in cloud infrastructure and automation tools. Key responsibilities include providing operational support for a cloud platform, collaborating with...
DevOps / SRE Engineer – Azure Platform

1 week ago

Mexico City Pinnacle Talent Placement Full-time

Are you legally eligible to work where you live? We are not able to sponsor VISAs. Who We Are We are a worldwide technology consulting firm providing advanced software and cloud services to enterprise organizations. Our experts design and deliver scalable systems, transform legacy infrastructure, and deploy automation within complex, distributed ecosystems....
Azure SRE

2 weeks ago

Mexico City Pinnacle Talent Placement Full-time

Are you legally eligible to work where you live? We are not able to sponsor VISAs. Who We Are We are a global technology consulting company that delivers innovative software and cloud solutions for enterprise clients. Our teams specialize in building scalable platforms, and implementing automation across large distributed environments. With engineering...
Remote Senior Cloud

3 weeks ago

Mexico City BairesDev Full-time

A leading tech solutions firm is seeking a Senior Infrastructure & Cloud Engineer for a remote position. The role involves designing and maintaining cloud infrastructure solutions while automating deployment processes. Candidates should have over four years of experience and strong knowledge in cloud platforms. The firm offers excellent compensation,...
Senior Cloud Infrastructure Engineer

4 weeks ago

Mexico City Helix OpCo LLC Full-time

A leading healthcare technology firm is seeking a Senior Software Engineer focused on Infrastructure & Cloud in Mexico City. In this role, you will innovate and secure our cloud infrastructure, collaborate with product teams, and optimize cloud strategies. You will need at least 5 years of experience in cloud architecture and software development, preferably...
Remote SRE in CDMX — Azure, Kubernetes,

2 weeks ago

Mexico City Tech Mahindra Full-time

A leading tech company is seeking a Site Reliability Engineer (SRE) in Mexico City. The role requires solid experience with Azure environments, Kubernetes, and DevOps practices. Key responsibilities include optimizing CI/CD pipelines, managing cloud infrastructure, and automating processes. Advanced English proficiency is essential. The position offers...
Remote Cloud Operations Engineer — SRE

4 days ago

Mexico City Third-Party Job Posts Full-time

A leading hospitality technology company is hiring a Cloud Operations Engineer to maintain operational stability across AWS-based systems. The role involves incident management, collaboration with engineering teams, and execution of operational tasks. Ideal candidates should have 3-4 years in DevOps or SRE, practical experience with Kubernetes, and...