Sr. SRE

2 weeks ago


México NTD Software Full-time

The Site Reliability Engineer (SRE) is responsible for ensuring the reliability, scalability, and performance of production systems. The role focuses on monitoring, alerting, and dashboard creation with a strong emphasis on SRE tools like Grafana, Prometheus, and Datadog. The ideal candidate should have hands-on experience with Python scripting and be able to collaborate effectively with cross-functional teams to address service issues and improve system reliability.

Requirements
  • 4+ years of experience in similar roles
  • Fluent English
  • Experience with creating and modifying Grafana dashboards for system monitoring.
  • Knowledge of Prometheus for setting up and maintaining monitoring systems.
  • Experience with Datadog for user and system monitoring.
  • Hands-on experience with Python scripting for automation and other tasks.
  • Understanding of SRE practices, including monitoring, alerting, and incident response.
  • Ability to create and enhance runbooks for incident response and remediation.
  • Experience with DevOps practices, such as CI/CD and infrastructure automation, is desired as a secondary skill set.
  • Strong communication skills to collaborate with cross-functional teams and stakeholders.
  • Ability to proactively identify and address service issues.
  • Familiarity with ITIL processes, including Service Management, Knowledge Management, and Incident Management.
  • Experience with user and system monitoring, remediation, and implementation to maintain service stability.
Responsibilities
  • Create and modify Grafana dashboards to monitor system performance and user experience.
  • Set up and maintain monitoring and alerting systems using Prometheus and Datadog.
  • Collaborate with cross-functional teams to improve service reliability and respond to incidents.
  • Develop and enhance runbooks for incident response and remediation.
  • Proactively work with alerting to ensure timely detection of issues and minimize downtime.
  • Implement monitoring, remediation, and other operational practices to maintain high service levels.

  • Ciudad de México SimCorp Full-time

    Senior Site Reliability Engineer (SRE/Azure). Location: Manila. Posted 30+ days ago. Job requisition ID: R-206253. Who we are: For over 50 years, we have worked closely with investment and asset managers to become the world’s leading...

  • Sr. DevOps Engineer

    1 week ago


    Ciudad de México Digital@FEMSA Careers Full-time

    Digital@FEMSA is the technology innovation division that offers digital solutions to simplify our customers' lives. It is made up of businesses that use technology to create practical, reliable tools, such as the Spin by OXXO payment method, as well as a diverse, multidisciplinary team focused on developing...


  • Ciudad de México SimCorp Full-time

    Sr. Site Reliability Engineer (Azure). Location: Manila. Time type: Full time. Posted 30+ days ago. Job requisition ID: R-206416. Who we are: For over 50 years, we have worked closely with investment and asset managers to become the world’s leading provider of integrated investment...