Current jobs related to Site Reliability Engineer - Guadalajara, Jalisco - F5


  • Guadalajara, Jalisco, Mexico NTT DATA Full-time

    SRE - Site Reliability Engineer. We are currently seeking a Site Reliability Engineer to join our team in GDL, Jalisco (MX-JAL), Mexico (MX). Perform L1.5 activities such as monitoring, deployment, rollback. Monitor the efficiency of the Azure cloud systems to prevent outages and initiate an Incident Management bridge in case of an outage. Troubleshoot Azure...


  • Guadalajara, Jalisco, Mexico NTT DATA North America Full-time

    SRE – Site Reliability Engineer. We are currently seeking a Site Reliability Engineer to join our team in GDL, Jalisco (MX-JAL), Mexico (MX). Perform L1.5 activities such as monitoring, deployment, rollback. Monitor the efficiency of the Azure cloud systems to prevent outages and initiate an Incident Management bridge in case of an outage. Troubleshoot Azure...


  • Guadalajara, Jalisco, Mexico rctsglobal Full-time

    Site Reliability Engineer (SRE). Overview: We're looking for a passionate and hands-on Site Reliability Engineer (SRE) to join our team. This role is critical for ensuring the stability, performance, and scalability of our production services. You'll be the bridge between development and operations, with a strong focus on using code to manage infrastructure and...


  • Guadalajara, Jalisco, Mexico ValorH Full-time

    Conceivable Life Sciences is pioneering the world's first AI-powered, automated IVF laboratory, revolutionizing reproductive healthcare through cutting-edge robotics and artificial intelligence. We are seeking a passionate and dedicated Site Reliability Cloud Engineer to design, implement, and maintain the entire cloud infrastructure of our growing company (~60...


  • Guadalajara, Jalisco, Mexico Oracle Full-time

    Description: Solve complex problems related to infrastructure cloud services and build automation to prevent problem recurrence. Design, write, and deploy software to improve the availability, scalability, and efficiency of Oracle products and services. Design and develop designs, architectures, standards, and methods for large-scale distributed systems....


  • Guadalajara, Jalisco, Mexico FICO Full-time

    FICO (NYSE: FICO) is a leading global analytics software company, helping businesses in 100+ countries make better decisions. Join our world-class team today and fulfill your career potential. The Opportunity: "The Site Reliability Engineer is an overlay of software development and systems engineering. Your responsibility is a full-stack support role, managing...


  • Guadalajara, Jalisco, Mexico Wizeline Full-time

    Senior Site Reliability Engineer. We are: Wizeline, a global AI-native technology solutions provider, develops cutting-edge, AI-powered digital products and platforms. We partner with clients to leverage data and AI, accelerating market entry and driving business transformation. As a global community of innovators, we foster a culture of growth, collaboration,...


  • Guadalajara, Jalisco, Mexico FICO Full-time

    FICO (NYSE: FICO) is a leading global analytics software company, helping businesses in 100+ countries make better decisions. Join our world-class team today and fulfill your career potential. The Opportunity: "The Site Reliability Engineering group is a global team responsible for providing 24x7 operational support of the company's Cloud, SaaS, ASP and hosted...


  • Guadalajara, Jalisco, Mexico Finastra Full-time

    Who are we? At Finastra, we are a dynamic global provider of open finance software solutions, dedicated to expanding access to financial services. Our innovative applications span Lending, Payments, Treasury and Capital Markets, and Universal Banking. Proudly serving over 8,000 customers, including 45 of the world's top 50 banks, we aim to boost financial...

Site Reliability Engineer

2 weeks ago


Guadalajara, Jalisco, Mexico F5 Full-time

At F5, we strive to bring a better digital world to life. Our teams empower organizations across the globe to create, secure, and run applications that enhance how we experience our evolving digital world. We are passionate about cybersecurity, from protecting consumers from fraud to enabling companies to focus on innovation.

Everything we do centers around people. That means we obsess over how to make the lives of our customers, and their customers, better. And it means we prioritize a diverse F5 community where each individual can thrive.

Position Summary
The Reliability Engineer will be a critical contributor within the Site Reliability Engineering (SRE) and Incident Management team, focusing on ensuring the availability, reliability, and performance of critical systems and services. This role is responsible for managing and facilitating major incident response efforts, ensuring that service disruptions are quickly identified, triaged, and resolved. As an incident facilitator, the Reliability Engineer will take the lead during high-pressure situations, collaborating with cross-functional teams to restore service and drive root cause analysis to prevent future issues. Clear and consistent communication will be critical to the success of the incident management team and processes.

In addition to incident management, the Reliability Engineer will apply technical expertise to design, deploy, and manage modern observability tools, including synthetic monitoring and infrastructure monitoring solutions. The ideal candidate will demonstrate a mix of strong technical skills, effective communication, and the ability to remain composed and solutions-oriented under pressure.

Key Responsibilities
Incident Response and Management

  • Lead the resolution of major incidents by managing the end-to-end incident lifecycle, including detection, escalation, troubleshooting, and resolution.
  • Serve as the incident facilitator during escalations, ensuring effective, clear, and timely communication between all stakeholders to drive collaborative problem-solving.
  • Ensure appropriate handoffs and escalations between global engineering and incident management teams.
  • Coordinate root cause analysis (RCA) efforts, facilitating discussions to identify contributing factors, lessons learned, and long-term corrective actions to reduce the likelihood of recurrence.
  • Create, document, and improve incident response and management processes, defining clear roles and responsibilities for all participants during incidents.
  • Ensure stakeholders and leadership across business and technical teams are kept informed with clear, concise updates during incidents, minimizing customer and business impact.
  • Maintain open lines of communication by making sure engineering teams engage in communication processes during incidents and clearly understand their responsibilities.

Observability Tools Design and Implementation

  • Design, implement, and manage end-to-end observability solutions, including synthetic monitoring, infrastructure monitoring, tracing, and metrics monitoring systems.
  • Evaluate, deploy, and maintain observability and monitoring tools such as DataDog, Grafana, LogicMonitor, Splunk, New Relic or similar platforms.
  • Maintain and manage escalation tooling such as VictorOps or PagerDuty to ensure teams across the organization have up-to-date schedules and escalation processes.
  • Build and maintain monitoring and alerting for critical systems, ensuring that warnings and issues are quickly identified and actionable in real time.
  • Drive the standardization of monitoring practices across teams, ensuring critical applications, systems, and infrastructure components are well-instrumented and monitored.
  • Develop infrastructure monitoring pipelines leveraging telemetry, logging, tracing, metrics, and visualization tools to provide accurate insights into production system health.
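
As an illustration of the synthetic monitoring mentioned above, the following is a minimal Python sketch of a scripted health probe, not a description of F5's tooling. The names `CheckResult`, `http_status`, and `run_synthetic_check`, the 200-only success rule, and the one-second latency threshold are all assumptions made for the example.

```python
import time
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CheckResult:
    url: str
    ok: bool
    status: Optional[int]
    latency_s: float
    detail: str


def http_status(url: str, timeout_s: float = 5.0) -> int:
    """Fetch the URL with the standard library and return its HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        return response.status


def run_synthetic_check(
    url: str,
    fetch: Callable[[str], int] = http_status,
    max_latency_s: float = 1.0,  # hypothetical threshold for this sketch
) -> CheckResult:
    """Probe one endpoint and classify it as healthy or unhealthy."""
    started = time.perf_counter()
    try:
        status = fetch(url)
    except Exception as exc:  # network errors, timeouts, HTTP 4xx/5xx from urlopen
        elapsed = time.perf_counter() - started
        return CheckResult(url, False, None, elapsed, f"request failed: {exc}")
    elapsed = time.perf_counter() - started
    if status != 200:
        return CheckResult(url, False, status, elapsed, f"unexpected status {status}")
    if elapsed > max_latency_s:
        return CheckResult(url, False, status, elapsed, "response slower than threshold")
    return CheckResult(url, True, status, elapsed, "healthy")
```

Passing the fetch function as a parameter keeps the probe testable without network access; a scheduler or monitoring platform would call it on an interval and alert on results where `ok` is false.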

Process Development and Automation

  • Support efforts to define and document standard operating procedures for managing incidents, alerts, system failures, and post-incident reviews across global teams.
  • Collaborate with development, infrastructure, and security teams to improve system reliability through efficient processes and workflows.
  • Advocate for the development and implementation of SLAs, SLOs, and error budgets to support decision-making and prioritization in reliability efforts.
  • Identify and implement opportunities to automate manual operational tasks to further reduce incident response and resolution times.
  • Work closely with the service desk to ensure consistent incident management practices and appropriate escalations to the major incident management team.
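
To make the SLO and error-budget item above concrete, here is a minimal sketch of the usual request-based arithmetic. The function name `error_budget` and the sample numbers are assumptions for illustration, not figures from F5.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Return the allowed, remaining, and consumed error budget for one window.

    slo_target is the fraction of requests that must succeed, e.g. 0.999.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests < 0 or failed_requests < 0:
        raise ValueError("request counts must be non-negative")

    # The budget is the share of requests that may fail without breaching the SLO.
    allowed_failures = (1.0 - slo_target) * total_requests
    remaining = allowed_failures - failed_requests
    if allowed_failures > 0:
        consumed_fraction = failed_requests / allowed_failures
    else:
        # No traffic means no budget; any failure counts as fully consumed.
        consumed_fraction = 0.0 if failed_requests == 0 else float("inf")

    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": remaining,
        "consumed_fraction": consumed_fraction,
        "breached": remaining < 0,
    }


if __name__ == "__main__":
    # Hypothetical window: 99.9% SLO, one million requests, 250 failures.
    print(error_budget(0.999, 1_000_000, 250))
```

With a 99.9% target over one million requests, roughly 1,000 failures are allowed, so 250 failures consume about a quarter of the budget. A team might use the consumed fraction to decide when to pause risky releases.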

Collaboration and Communication

  • Partner with engineering, operations, and security teams to confirm observability tools and monitoring approaches meet their needs and align with organizational standards.
  • Actively engage during incident scenarios to ensure identification and mobilization of the appropriate resources, facilitating collaboration across teams and ensuring best practices are followed.
  • Contribute to a culture of shared responsibility and blameless postmortems by documenting and communicating findings from incident responses.
  • Proactively provide input to the SRE Manager to recommend improvements in processes, tools, and systems to enhance team capabilities and outcomes.

Qualifications

  • Education: Bachelor's degree in Computer Science, Information Technology, or a related field (or equivalent professional experience).
  • 3+ years of professional experience in Site Reliability Engineering (SRE), System Engineering, DevOps, or IT Operations roles.
  • Highly experienced as a major incident manager, incident commander, or similar role, with a proven ability to facilitate, communicate, and drive resolution of technical incidents.
  • Strong understanding of ITIL principles and their application in incident management.
  • Experience with observability tools such as DataDog, Grafana, LogicMonitor, Splunk, New Relic, or similar technologies.
  • Experience with synthetic monitoring, infrastructure monitoring, and metrics and tracing monitoring tools.
  • Experience with hybrid infrastructure environments and an understanding of monitoring signals from static on-premises infrastructure, cloud-based ephemeral infrastructure, and SaaS applications.
  • Strong understanding of telemetry, logging, tracing, and their roles in system monitoring and observability pipelines.
  • Experience with Python, Go, Bash, or a similar language to develop and maintain monitoring and automation scripts.
  • Proven ability to remain calm and effective during high-pressure situations, facilitating resolution in a methodical, professional manner.

Preferred Qualifications

  • Certifications: AWS Certified Solutions Architect (Associate or higher) or Microsoft Certified: Azure Administrator/Architect.
  • ITIL Foundation Certification.
  • Experience with Infrastructure-as-Code (IaC) tools such as Terraform, CloudFormation, or Ansible as part of observability and monitoring pipelines.
  • Experience building tooling using modern infrastructure patterns such as containerization and serverless.
  • Experience implementing SLAs, SLOs, and error budgets in environments operating under Site Reliability Engineering or ITIL frameworks.
  • Knowledge of network and system security, including secure configurations, traffic monitoring, and network observability.

The Job Description is intended to be a general representation of the responsibilities and requirements of the job. However, the description may not be all-inclusive, and responsibilities and requirements are subject to change.

Please note that F5 only contacts candidates through an F5 email address or an automated email notification from Workday.

Equal Employment Opportunity
It is the policy of F5 to provide equal employment opportunities to all employees and employment applicants without regard to unlawful considerations of race, religion, color, national origin, sex, sexual orientation, gender identity or expression, age, sensory, physical, or mental disability, marital status, veteran or military status, genetic information, or any other classification protected by applicable local, state, or federal laws. This policy applies to all aspects of employment, including, but not limited to, hiring, job assignment, compensation, promotion, benefits, training, discipline, and termination. F5 offers a variety of reasonable accommodations for candidates. Requesting an accommodation is completely voluntary. F5 will assess the need for accommodations in the application process separately from those that may be needed to perform the job. Request by contacting