Lead Site Reliability Engineer

7 days ago


Remote, Mexico · EPAM Systems, Inc. · Full-time

We are seeking a Lead Site Reliability Engineer to join our team.

In this role, you will help drive the reliability and performance of critical systems for a leading client. You will work in a collaborative environment focused on innovation and operational excellence. Please note that the client operates in the US Central Time Zone, from 8 am to 5 pm CST.

Responsibilities

  • Champion SRE practices in both code development and system design from the outset
  • Ensure application resiliency, stability, responsiveness, and observability across all services and applications
  • Deliver comprehensive end-to-end observability for the application portfolio
  • Work closely with Enterprise Architects on new application designs and recommend improvements for solution resiliency
  • Participate in change review meetings to assess the impact of proposed changes
  • Support the change deployment process by verifying application health before and after changes
  • Provide on-call support for production incidents
  • Set up and manage monitoring and alerting using available tools
  • Collaborate with developers to embed operational excellence throughout the software lifecycle
  • Partner with development teams to conduct performance testing, identify bottlenecks, and optimize for capacity and efficiency

Requirements

  • Minimum 5 years of experience in Site Reliability Engineering
  • At least one year of experience leading and managing development teams
  • Practical experience with cloud deployment, monitoring, and operations analysis tools such as Dynatrace, Grafana, Splunk, Kubernetes, and Prometheus
  • Ability to work closely with developers on agile teams to ensure SRE principles are integrated into application code
  • Experience supporting multi-cloud environments, hybrid models, and PaaS solutions
  • Fluent English skills (written and spoken) at B2+ level or higher

Nice to have

  • Experience with Apache Kafka for distributed messaging
  • Proficiency in Java for backend development and troubleshooting
  • Strong background in observability and troubleshooting within distributed systems

We offer

  • Career plan and real growth opportunities
  • Unlimited access to LinkedIn Learning solutions
  • Constant training, mentoring, online corporate courses, eLearning and more
  • English classes with a certified teacher
  • Support for employee initiatives (algorithms club, Toastmasters, agile club, and more)
  • Enjoyable working environment (gaming room, napping area, amenities, events, sports teams, and more)
  • Flexible work schedule and dress code
  • Collaboration in a multicultural environment, sharing best practices from around the globe
  • Direct hire by EPAM, 100% on payroll
  • Statutory benefits (IMSS, INFONAVIT, 25% vacation bonus)
  • Insurance: life and major medical expenses, with dental and vision coverage (for the employee and direct family members)
  • 13% employee savings fund, capped at the legal limit
  • Grocery coupons
  • 30-day December bonus
  • Employee Stock Purchase Plan
  • 12 vacation days
  • Official Mexican holidays, plus 5 extra holidays (Maundy Thursday and Good Friday, November 2nd, December 24th and 31st)
  • Monthly non-taxable amount for the electricity and internet bills

EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.

By applying to our role, you agree that your personal data may be used as set out in EPAM's Privacy Notice and Policy.


