Site Reliability Developer 4

3 days ago


Ciudad de México, Ciudad de México · Oracle · Full-time
Description

What You'll Do

  • Capacity Engineering – Act as a strategic capacity partner, immersing yourself in the end-to-end architecture and performance of SaaS production services. Ensure mission-critical workloads, including emerging agentic AI and MLOps pipelines, are forecasted, scaled, and optimized for OCI cloud capacity at enterprise scale.
  • Cost Engineering – Translate SaaS capacity architectures into cost models that improve efficiency year over year. Partner with Cost Engineers to drive down infrastructure costs while enhancing reliability, producing actionable forecasts and executive-level insights.
  • AI/MLOps & Automation – Apply deep knowledge of AI, MLOps, and orchestration to streamline operations, eliminate technical debt, and propose automation opportunities. Collaborate with AI/ML Ops and data engineering teams to evolve architectures, enhance scalability, and influence future OCI feature sets.
  • Run-the-Business Support – Deliver detailed capacity roadmaps that define tuning, scaling, and demand characteristics. Communicate inflection points and future requirements to the Cloud Capacity Run-the-Business organization for seamless planning.
  • Technical Expertise – Leverage a strong foundation in cloud capacity topologies (compute, storage, network) to identify dependencies and drive service reliability improvements. Prior experience across DB, middleware, containers, or networking is valuable in translating complex architectures into capacity supply requirements.
  • Cross-Team Collaboration – Engage confidently across all levels of the organization, from ICs to executives, as a trusted advisor on SaaS capacity. Present data-driven insights with clarity and executive presence.
  • Curiosity & Breadth – Approach services with professional curiosity, exploring APIs, profiling workloads, and analyzing anomalies to anticipate demand and performance needs.

Your Experience

  • Bachelor's degree in Computer Science or related field; Master's preferred
  • Relevant Cloud MLOps / AI certifications (e.g., AWS ML Specialty, GCP ML Engineer, Azure AI, NVIDIA MLOps, Linux Foundation MLOps Practitioner)
  • 10+ years of senior engineering experience across one or more domains: databases (Oracle DB preferred), virtualization/middleware, container orchestration, networking, or monitoring/observability
  • Proven expertise in forecasting, scaling, and cost-optimizing capacity for AI/ML and MLOps workloads, including dynamic and agentic workloads, across hybrid and cloud environments
  • Strong knowledge of Oracle Cloud Infrastructure (OCI) services
  • Advanced analytical skills with experience building and interpreting complex models (Excel or equivalent)
  • Exceptional communication and stakeholder-management skills; ability to translate engineering into executive-ready narratives
  • Experience driving initiatives in fast-paced, dynamic, cross-functional environments


Responsibilities
  • Partner with SRE and Product Engineering on shared ownership of SaaS services, ensuring reliability, security, scale, and performance across OCI.
  • Forecast, design, and optimize capacity for AI/ML pipelines, MLOps platforms, and emerging agentic workloads; model dynamic demand and define scaling strategies.
  • Translate complex product architectures into capacity and cost models, aligning infrastructure with SaaS business priorities.
  • Drive automation and orchestration initiatives to reduce technical debt, accelerate delivery, and enhance service resiliency.
  • Serve as an escalation point for complex, cross-stack issues, leveraging deep knowledge of service topology and dependencies.
  • Collaborate with development teams to evolve SaaS capacity architectures, propose cloud feature enhancements, and guide the addition of new capabilities to the Oracle Cloud portfolio.
  • Deliver clear communication of scale, capacity, performance, and cost characteristics to stakeholders, from engineers to executives.
  • Apply professional curiosity to explore APIs, workload profiles, and anomalies, turning insights into capacity and reliability improvements.


Qualifications

Career Level - IC4


