Senior Site Reliability Engineer
4 days ago
Join our team as a **Senior Site Reliability Engineer**, where you will maintain and improve our product monitoring system, manage incident responses, and facilitate collaboration between operations and development teams.
**Responsibilities**
- Maintain and improve the product monitoring system
- Manage incident response including troubleshooting, resolution, documentation, and post-mortem analysis
- Share knowledge and lessons learned across teams
- Act as a bridge between operations and development teams
- Build automation solutions for log analysis, testing production environments, and alert automation
- Monitor system health, performance, and service level indicators, objectives, and agreements (SLI/SLO/SLA)
- Document knowledge and procedures related to incident management
- Conduct post-incident reviews and implement improvements
- Provide on-call support during and outside regular working hours
- Collaborate with development and operations to improve reliability and efficiency
- Use tools like PagerDuty, ELK/Kibana, SEQ logging, Prometheus, and Grafana for monitoring and incident management
- Develop and maintain scripts and automation using Python, C#, and Bash
- Manage infrastructure and orchestration with SaltStack and Docker
- Support project management and issue tracking using Azure DevOps and Wiki
- Maintain source code management using Git
**Requirements**
- 3+ years in Site Reliability Engineering, including experience building solutions from scratch
- Strong expertise in cloud providers and automation scripting with Bash and Python
- Deep domain knowledge of Oil & Gas industry operations and incident resolution
- Proven experience managing incident response and on-call support
- Familiarity with monitoring tools including Prometheus and Grafana
- Experience with logging tools such as ELK/Kibana and SEQ logging
- Knowledge of infrastructure and orchestration tools like SaltStack and Docker
- Basic network knowledge including inbound/outbound and firewall rules
- Experience with project management and issue tracking tools like Azure DevOps
- Proficient in source code management using Git
- Strong documentation and knowledge-sharing skills
- Ability to conduct thorough post-incident reviews
- Excellent troubleshooting and problem-solving skills
- Good communication skills with English proficiency at B2+ level
**Nice to have**
- Experience with PagerDuty for incident management
- Familiarity with C# programming
- Knowledge of SQL and MongoDB databases
- Experience with Zededa infrastructure
- Prior involvement in Oil & Gas field operations support
**We offer**
- Career plan and real growth opportunities
- Unlimited access to LinkedIn learning solutions
- Constant training, mentoring, online corporate courses, eLearning and more
- English classes with a certified teacher
- Support for employee initiatives (algorithms club, Toastmasters, agile club, and more)
- Enjoyable working environment (gaming room, napping area, amenities, events, sports teams, and more)
- Flexible work schedule and dress code
- Collaborate in a multicultural environment and share best practices from around the globe
- Hired directly by EPAM & 100% under payroll
- Statutory benefits (IMSS, INFONAVIT, 25% vacation bonus)
- Insurance: life and major medical expenses, with dental and vision coverage (for the employee and direct family members)
- 13% employee savings fund, capped at the legal limit
- Grocery coupons
- 30-day December bonus
- Employee Stock Purchase Plan
- 12 vacation days
- Official Mexican holidays, plus 5 extra holidays (Maundy Thursday and Good Friday, November 2nd, December 24th & 31st)
- Monthly non-taxable amount for the electricity and internet bills
EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.
-
Site Reliability Engineer
1 week ago
Remote, Mexico | thegetch mexico | Full-time
**Role: Site Reliability Engineer**
**Openings: more than 10 hires**
**Location: any city with a TCS office presence (Queretaro, Guadalajara, Mexico City, or Monterrey)**
**Salary: 25-33 USD/hr**
**English communication: advanced**
**Experience: 4+ years**
**Site Reliability Engineer responsibilities**: Gather and analyze system metrics...
-
Senior Site Reliability Engineer
1 week ago
Remote, Mexico | EPAM Systems, Inc. | Full-time
We are seeking an experienced **Senior Site Reliability Engineer** to join our team. As a key member of the Reliability Tooling team, you will be responsible for writing and reviewing code, contributing to critical technical decisions, and mentoring engineers within your squad. This role requires a deep understanding of SRE principles and best practices, as...
-
Senior Site Reliability Engineer
2 weeks ago
Remote, Mexico | EPAM Systems, Inc. | Full-time
Join our team as a **Senior Site Reliability Engineer** focused on delivering advanced support for critical Azure-based systems.
**Responsibilities**
- Troubleshoot and resolve complex incidents to maintain system uptime
- Ensure reliability and performance of Azure-based enterprise infrastructure
- Implement observability, monitoring, and logging solutions
- ...
-
Senior Site Reliability Engineer
3 weeks ago
Remote, Mexico | EPAM Systems | Full-time
**DESCRIPTION**: Join EPAM as a **Senior Site Reliability Engineer specializing in AWS!** In this role, you'll ensure fleet services reliability and availability under the SRE model. If you have a good track record of highly scalable, distributed systems projects and previous experience working as an SRE, we'd love to hear from you. EPAM is a leading global...
-
Senior Azure Site Reliability Engineer
2 weeks ago
Remote, Mexico | Pinnacle | Full-time
**Job Title**: Senior Azure Site Reliability Engineer
**Reports To**: Azure Site Reliability Lead
**About us**: Welcome to Pinnacle, the ultimate destination for sports enthusiasts seeking an exhilarating sportsbook and gaming experience! Established in 1998, we have solidified our position as one of the globe's foremost licensed online gaming companies....
-
Site Reliability Engineer
15 hours ago
Remote, Mexico | Luxoft | Full-time
**Project description**: Do you like to work with existing and new software product development teams? This position is to instrument end-to-end observability and visibility for business-critical systems with log ingestion, metrics, and traces. You will function as a site reliability engineer (SRE) who will collaborate with product teams, infrastructure...
-
Site Reliability Engineer
4 days ago
Remote, Mexico | Right Balance | Full-time
**Overview** We're looking for a Site Reliability Engineer. Headquartered in Los Angeles, California, Right Balance provides top-tier technology talent for innovative companies in the US. We’re in the top 50 companies to watch in LA.
**Engagement Details** Our client is a USA-based company producing video solutions with the mission to advance scientific...
-
Site Reliability Engineer
2 days ago
Remote, Mexico | Right Balance | Full-time
**Overview** We're looking for a Site Reliability Engineer. Headquartered in Los Angeles, California, Right Balance provides top-tier technology talent for innovative companies in the US. We’re in the top 50 companies to watch in LA.
**Engagement Details** Our client is a USA-based company producing video solutions with the mission to advance scientific...