Principal Site Reliability Engineer

hace 2 semanas


Zapopan, Jalisco, México Oracle A tiempo completo
Description

As a senior member of the Site Reliability Engineering (SRE) team, you'll take ownership of highly available systems, influence service design, and work across teams to drive resiliency, automation, and operational excellence. This is a hands-on engineering role where deep infrastructure knowledge meets software engineering expertise, ideal for experienced SREs ready to take the lead.

Responsibilities

What You'll Do:

  • Lead the design, automation, and support of OCI services with a focus on resiliency, security, scalability, and performance.
  • Own and improve the end-to-end reliability metrics (SLOs, SLAs, KPIs) for your services.
  • Design and implement high-availability architectures and standards for large-scale distributed systems.
  • Serve as the ultimate escalation point for complex operational issues, using a deep understanding of service topologies and interdependencies.
  • Architect and build automation and orchestration tools that reduce manual work and prevent problem recurrence.
  • Collaborate with development teams to improve service designs, optimize deployments, and implement best practices for operational efficiency.
  • Guide technical decision-making and mentor junior SREs and developers across teams.
  • Participate in and lead postmortems, root cause analysis, and preventative design changes.
  • Contribute to capacity planning, demand forecasting, and long-term service scalability strategies.
  • Participate in a rotational on-call schedule to ensure the health and availability of production services.

What We're Looking For:

  • Advanced experience with Linux systems administration
  • Strong programming skills in Python (with automation libraries)
  • Advanced Bash/Shell scripting
  • Deep understanding of distributed systems, networking, and service architecture
  • Solid knowledge of databases and how they behave in production (SQL or NoSQL)
  • Strong understanding of CI/CD pipelines, Agile methodologies, and DevOps best practices
  • Experience writing and maintaining unit tests and production-grade software
  • Proven ability to lead cross-functional efforts and technical problem-solving in live environments

Nice to Have:

  • Hands-on experience with monitoring and observability tools (Grafana, Prometheus, New Relic, etc.)
  • Familiarity with Oracle Cloud Infrastructure (OCI) or other cloud platforms (AWS, Azure, GCP)
  • Experience with Infrastructure-as-Code (Terraform, Ansible) and container orchestration (Kubernetes)
Qualifications

Career Level - IC4



  • Zapopan, Jalisco, México Oracle A tiempo completo

    We are looking for a skilled and motivated Cloud Region Build Site Reliability Engineer (SRE) to join our Oracle Cloud Infrastructure Region Build team. In this role, you will be responsible for building, deploying, and maintaining compute cloud infrastructure services across multiple regions to ensure high availability, scalability, and performance. You...


  • Zapopan, Jalisco, México Oracle A tiempo completo

    DescriptionWe are looking for a skilled and motivated Cloud Region Build Site Reliability Engineer (SRE) to join our Oracle Cloud Infrastructure Region Build team. In this role, you will be responsible for building, deploying, and maintaining compute cloud infrastructure services across multiple regions to ensure high availability, scalability, and...


  • Zapopan, Jalisco, México GrainChain Inc A tiempo completo

    Estamos en busca de nuevos talentosGrainChaines una empresa tecnológica dedicada a reducir la brecha digital en la industria agrícola. Nuestras plataformas facilitan las transacciones de manera rápida, seguras y sencillas para nuestros usuarios. Estamos en búsqueda de unSite Reliability Engineercapaz de integrar y automatizar las áreas de desarrollo y...


  • Zapopan, Jalisco, México GrainChain Inc A tiempo completo

    Estamos en busca de nuevos talentosGrainChain es una empresa tecnológica dedicada a reducir la brecha digital en la industria agrícola. Nuestras plataformas facilitan las transacciones de manera rápida, seguras y sencillas para nuestros usuarios. Estamos en búsqueda de un Site Reliability Engineer capaz de integrar y automatizar las áreas de desarrollo...


  • Zapopan, Jalisco, México GrainChain A tiempo completo

    Estamos en busca de nuevos talentosGrainChain es una empresa tecnológica dedicada a reducir la brecha digital en la industria agrícola. Nuestras plataformas facilitan las transacciones de manera rápida, seguras y sencillas para nuestros usuarios. Estamos en búsqueda de un Site Reliability Engineer capaz de integrar y automatizar las áreas de desarrollo...


  • Zapopan, Jalisco, México Oracle A tiempo completo

    DescriptionWork with the Site Reliability Engineering (SRE) team on the shared full stack ownership of a collection of services and/or technology areas. Understand the end-to-end configuration, technical dependencies, and overall behavioral characteristics of production services. Responsible for the design and delivery of the mission critical stack, focusing...


  • Zapopan, Jalisco, México Oracle A tiempo completo

    Job DescriptionAs a senior member of the Site Reliability Engineering (SRE) team, you'll take ownership of highly available systems, influence service design, and work across teams to drive resiliency, automation, and operational excellence. This is a hands-on engineering role where deep infrastructure knowledge meets software engineering expertise, ideal...


  • Zapopan, Jalisco, México Oracle A tiempo completo

    DescriptionSolve complex problems related to infrastructure cloud services and build automation to prevent problem recurrence. Design, write, and deploy software to improve the availability, scalability, and efficiency of Oracle products and services. Design and develop designs, architectures, standards, and methods for large-scale distributed systems....


  • Zapopan, Jalisco, México Oracle A tiempo completo

    DescriptionWe are looking for a skilled and motivated Cloud Region Build Site Reliability Engineer (SRE) to join our Oracle Cloud Infrastructure Region Build team. In this role, you will be responsible for building, deploying, and maintaining compute cloud infrastructure services across multiple regions to ensure high availability, scalability, and...


  • Zapopan, Jalisco, México Oracle A tiempo completo

    Job DescriptionWe are looking for a skilled and motivated Cloud Region Build Site Reliability Engineer (SRE) to join our Oracle Cloud Infrastructure Region Build team. In this role, you will be responsible for building, deploying, and maintaining compute cloud infrastructure services across multiple regions to ensure high availability, scalability, and...