Lead Site Reliability Engineer

3 weeks ago


Remote, Mexico · EPAM Systems, Inc. · Full-time

We are looking for an experienced **Site Reliability Engineer (SRE)** to take a leadership role in ensuring the stability, scalability, and performance of our cloud infrastructure on **Google Cloud Platform (GCP)**. As an SRE, you will be at the forefront of optimizing system reliability, automating processes, and collaborating with engineering teams to enhance operational excellence. If you're passionate about **infrastructure-as-code, automation, and building resilient systems**, we’d love to hear from you.

**Responsibilities**

- Lead reliability initiatives to optimize system performance, scalability, and cost efficiency
- Manage and participate in on-call rotations, providing 24/7 support for critical infrastructure
- Troubleshoot incidents, conduct root cause analysis (RCA), and implement long-term solutions
- Deploy and manage microservices in alignment with release cycles
- Design and maintain infrastructure-as-code solutions using Terraform
- Collaborate with development teams to improve system reliability, performance, and cloud resource management
- Oversee incident response and ticket management using ServiceNow and Jira
- Maintain and expand internal knowledge bases on infrastructure and monitoring

**Requirements**

- 5+ years of experience in SRE, DevOps, or system administration roles
- Expertise in Google Cloud Platform (GCP) and cloud-native architectures
- Hands-on experience with incident management and monitoring tools (ServiceNow, Cloud Monitoring, etc.)
- Strong debugging and problem-solving skills for complex technical issues
- Proficiency in GitHub and infrastructure-as-code best practices
- Strong communication and teamwork skills, with a proactive mindset

**Nice to have**

- Experience with Kubernetes and containerization technologies
- Deep understanding of CI/CD pipelines and related tools
- Familiarity with Prometheus, Grafana, Catchpoint, and ELK for monitoring and logging

**We offer**

- Career plan and real growth opportunities
- Unlimited access to LinkedIn Learning solutions
- International Mobility Plan within 25 countries
- Continuous training, mentoring, online corporate courses, eLearning, and more
- English classes with a certified teacher
- Support for employees' initiatives (Algorithms club, Toastmasters, Agile club, and more)
- Enjoyable working environment (gaming room, napping area, amenities, events, sports teams, and more)
- Flexible work schedule and dress code
- Collaboration in a multicultural environment, sharing best practices from around the globe
- Hired directly by EPAM, 100% on payroll
- Statutory benefits (IMSS, INFONAVIT, 25% vacation bonus)
- Insurance: life and major medical expenses, with dental and vision coverage (for the employee and direct family members)
- 13% employee savings fund, capped at the legal limit
- Grocery coupons
- 30-day December bonus
- Employee Stock Purchase Plan
- 12 vacation days plus 4 floating days
- Official Mexican holidays, plus 5 extra holidays (Maundy Thursday, Good Friday, November 2nd, December 24th, and December 31st)
- Monthly non-taxable allowance for electricity and internet bills

EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multinational teams, contribute to a wide range of innovative projects that deliver creative, cutting-edge solutions, and have the opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.



  • Remote, Mexico · thegetch mexico · Full-time

    **Role:** Site Reliability Engineer **Openings:** 10+ hires **Location:** any city with a TCS office (Querétaro, Guadalajara, Mexico City, or Monterrey) **Salary:** 25-33 USD/hr **English communication:** advanced **Experience:** 4+ years **Site Reliability Engineer responsibilities:** Gather and analyze system metrics...


  • Remote, Mexico · Tekshapers Inc · Full-time

    **Position:** Lead Site Reliability Engineer **Location:** Remote **Duration:** Contract - Lead and mentor a team of SREs to ensure operational excellence and maximize the reliability and availability of client systems. - Minimum 10 years of work experience in DevOps/SRE, including leadership roles. - Architect and design highly scalable and available...


  • Remote, Mexico · EPAM Systems, Inc. · Full-time

    We are looking for an experienced **Lead Site Reliability Engineer** to join our team. In this role, you will play a pivotal part in the Reliability Tooling team, taking responsibility for writing and reviewing code, making key technical decisions, and mentoring engineers within your squad. This position requires a strong grasp of SRE principles and best...


  • Remote, Mexico · EPAM Systems, Inc. · Full-time

    Join our team as a **Lead Site Reliability Engineer** dedicated to providing advanced support for critical Azure-based systems. **Responsibilities** - Resolve complex incidents to ensure system availability - Maintain reliability and performance of Azure-based enterprise infrastructure - Deploy observability, monitoring, and logging tools - Automate...


  • Remote, Mexico · Right Balance · Full-time

    **Overview** We're looking for a Site Reliability Engineer. Headquartered in Los Angeles, California, Right Balance provides top-tier technology talent for innovative companies in the US. We’re among the top 50 companies to watch in LA. **Engagement Details** Our client is a US-based company producing video solutions with the mission to advance scientific...


  • Remote, Mexico · EPAM Systems, Inc. · Full-time

    We are seeking a Lead Site Reliability Engineer to join our team. In this role, you will help drive the reliability and performance of critical systems for a leading client. You will work in a collaborative environment focused on innovation and operational excellence. Please note that the client operates in the US Central Time Zone from 8 am CST to 5 pm...


  • Remote, Mexico · Luxoft · Full-time

    **Project description:** Do you like working with existing and new software product development teams? This position focuses on instrumenting end-to-end observability and visibility for business-critical systems through log ingestion, metrics, and traces. You will function as a site reliability engineer (SRE) who collaborates with product teams, infrastructure...


  • Remote, Mexico · EPAM Systems, Inc. · Full-time

    We are seeking an experienced **Senior Site Reliability Engineer** to join our team. As a key member of the Reliability Tooling team, you will be responsible for writing and reviewing code, contributing to critical technical decisions, and mentoring engineers within your squad. This role requires a deep understanding of SRE principles and best practices, as...