Site Reliability Engineer

4 weeks ago


Central Region, Mexico | AgileEngine | Full-time

Site Reliability Engineer (Middle/Senior)

AgileEngine is an Inc. 5000 company that creates award‑winning software for Fortune 500 brands and trailblazing startups across 17+ industries. We rank among the leaders in application development and AI/ML, and our people‑first culture has earned us multiple Best Place to Work awards.

Why Join Us

Looking for a place to grow, make an impact, and work with people who care? We’d love to meet you.

What You Will Do

Shift: Monday – Thursday, 8 AM – 7 PM PST (11 AM – 10 PM EST), with rotating on‑call.
On‑call shifts: every 6 weeks, one week as primary responder and the following week as secondary.

• Manage alerts daily, check systems, and escalate issues as needed.
• Provide 24×7 on‑call support for critical SaaS events.
• Be available for emergencies when team members are not present.
• Document issues and remediation steps.
• Proactively create appropriate monitors in the EKS/K8s ecosystem.
• Deploy to EKS/K8s clusters using Terraform and Helm.
• Learn and maintain existing infrastructure running under Docker Swarm.
• Improve existing infrastructure health by implementing checks and scripts to correct known issues.
• Maintain and develop deployment code.
• Automate manual tasks.
• Implement and integrate new technologies in our Cloud Infrastructure.
• Collaborate with other teams and departments to provide the highest level of support.
• Apply a real customer focus when planning deployments/updates, considering their impact before changes.
• Work closely with Support, Customer Success, Migration, and Professional Services teams to provide best‑in‑class SaaS service.
• Perform RCA and take corrective actions to prevent recurrence.
• Create and assign alert‑related actions to the appropriate team after investigation.
• Handle support requests for environment‑specific actions.
• Identify and provide automation requirements to improve RCA.

Must Haves

• 2+ years of professional experience.
• Experience working with Datadog.
• Hands‑on experience as an AWS Cloud Engineer.
• Working knowledge of EKS, Terraform, and Helm.
• Experience with Docker and Docker Swarm.
• Good understanding of AWS IAM roles and policies.
• Experience logging and monitoring AWS resources using CloudWatch Logs.
• Experience working in a Linux environment.
• Proficient in Bash and/or Python scripting.
• Strong understanding of web technologies such as REST APIs.
• Experience with monitoring solutions, such as Grafana and Prometheus.
• Excellent oral and written communication skills.
• Customer‑facing communication skills to explain issues and RCAs.
• Experience in product/app support for SaaS‑based products.
• Understanding of APIs, databases, systems architecture, and design.
• Experience designing, implementing, and operating in a DevSecOps environment.
• Ability to work independently as well as within a collaborative environment.
• Technical aptitude with a desire to learn new and evolving technologies.
• Upper‑intermediate English proficiency.

Nice to Haves

• Experience with GCP or Azure.
• Certifications: AWS Certified DevOps Engineer – Professional or AWS Certified Advanced Networking – Specialty.

Perks and Benefits

• Professional growth: Accelerate your journey with mentorship, TechTalks, and personalized roadmaps.
• Competitive compensation: USD‑based compensation with budgets for education, fitness, and team activities.
• A selection of exciting projects: Work on modern solutions for Fortune 500 enterprises and product brands.
• Flexible time: Tailor your schedule; choose to work from home or the office.

Job Details

• Seniority level: Mid‑Senior
• Employment type: Full‑time
• Job function: Information Technology
• Industries: IT Services and IT Consulting
• Location: Guadalajara, Jalisco, Mexico


  • Site Reliability Engineer

    2 weeks ago


    Central Region, Mexico | Oracle | Full-time

    A leading cloud solutions provider in Mexico is seeking a skilled Cloud Region Build Site Reliability Engineer to join its team. This full-time role focuses on ensuring the performance, availability, and scalability of cloud infrastructure services. Responsibilities include building and maintaining OCI infrastructure, responding to incidents, and improving...

  • Site Reliability Engineer

    4 weeks ago


    Central Region, Mexico | FICO | Full-time

    Site Reliability Engineer - Engineer I. Locations: Guadalajara, Mexico. Time type: Full time. Posted on: Posted Yesterday. Job requisition ID: 31193. FICO (NYSE: FICO) is a leading global analytics software company, helping businesses in 100+ countries make better decisions. Join our world-class team today...


  • Central Region, Mexico | AgileEngine | Full-time

    A leading software development company is seeking a Site Reliability Engineer (Middle/Senior) in Guadalajara, Jalisco, focused on supporting critical SaaS systems and enhancing cloud infrastructure. The role demands experience with AWS, EKS, and automation tools like Terraform and Helm. Ideal candidates are skilled in communication, customer focus, and...


  • Central Region, Jalisco, Mexico | GrainChain Inc | Full-time

    We are looking for new talent! GrainChain is a technology company dedicated to closing the digital divide in the agricultural industry. Our platforms make transactions fast, secure, and simple for our users. We are looking for a Site Reliability Engineer capable of integrating and automating the areas of...


  • Central Region, Mexico | F5 | Full-time

    Site Reliability Engineer – Incident Management. At F5, we strive to bring a better digital world to life. Our teams empower organizations across the globe to create, secure, and run applications that enhance how we experience our evolving digital world. We are passionate...


  • Central Region, Jalisco, Mexico | GrainChain Inc | Full-time

    A technology company in the agricultural sector is looking for a Site Reliability Engineer to integrate and automate the development and operations areas, ensuring the quality and delivery of software solutions. The ideal candidate will have experience in scripting, Linux infrastructure, and CI/CD tools. An inclusive environment is offered...


  • Central Region, Mexico | Jabil | Full-time

    A global product solutions company in Jalisco is seeking a Site Reliability Test Engineer to maintain and improve their Cloud Test Platform. This role involves supporting manufacturing server operations, responding to production issues, and enhancing usability of test applications. The ideal candidate has a BS degree in a related field, 5-8 years of relevant...


  • Central Region, Jalisco, Mexico | FICO | Full-time

    Site Reliability Engineer - Engineer I. Locations: Guadalajara, Mexico. Time type: Full time. Posted on: Posted Yesterday. Job requisition ID: 31193. FICO (NYSE: FICO) is a leading global analytics software company, helping businesses in 100+ countries make better decisions. Join our world-class team today...


  • Central Region, Jalisco, Mexico | ValorH | Full-time

    Conceivable Life Sciences is pioneering the world's first AI-powered, automated IVF laboratory, revolutionizing reproductive healthcare through cutting-edge robotics and artificial intelligence. We are seeking a passionate and dedicated Site Reliability Cloud Engineer to design, implement, and maintain the entire cloud infrastructure of our growing company...


  • Central Region, Mexico | Tata Consultancy Services Limited | Full-time

    IT Site Reliability Engineer / Architect. We are seeking a talented and experienced IT Engineer / Architect with a strong focus on site reliability engineering responsibilities to join our team. As a key member of our team, you will be responsible for ensuring the reliability, scalability, and performance of our infrastructure and applications, with a...