Site Reliability Engineer
2 weeks ago
We are looking for a skilled **Site Reliability Engineer** to join our team.
This position will focus on supporting the LatAm timezone, working closely with a team of SREs and a hands-on Lead SRE, while collaborating with a European-based SRE team. The role ensures seamless follow-the-sun 24/7 on-call support for a customer platform comprising multiple Java backend services.
**Responsibilities**
- Deliver 12/7 on-call support for Java backend services, ensuring consistent platform performance and uptime
- Oversee API Gateway observability to monitor and safeguard service health
- Implement and deploy patches to address issues in Java code and cloud infrastructure components
- Build and maintain metrics and dashboards to evaluate and enhance platform stability and performance
- Develop and refine runbooks for EOS backend services to optimize operational workflows
- Track and monitor Service Level Objectives (SLOs), addressing errors and contributing code changes to improve service reliability
- Diagnose and resolve complex system issues using logs and telemetry to identify root causes effectively
- Work with various teams to ensure operational readiness and improve incident response processes
**Requirements**
- A Bachelor’s or Master’s degree in Computer Science, Information Technology, or a related field
- At least 2 years of experience in Java backend development and working in DevOps or SRE roles
- Expertise with Amazon DynamoDB, Amazon ElastiCache, and other AWS cloud services
- Proficiency with Git for version control and Gradle for build automation
- Strong knowledge of observability practices and troubleshooting in distributed systems
- Skilled in analyzing logs and telemetry to diagnose and resolve complex system challenges
- Strong written communication skills to document operational issues during live incident responses
- Dedication to improving SLOs across multiple systems using repeatable and structured processes
- Fluent English communication skills, both written and spoken, at a B2 level or higher
**Nice to have**
- Familiarity with Apache Cassandra for managing distributed databases
- Knowledge of Grafana for creating and maintaining observability dashboards
- Proficiency in backend development using Java and Scala
- Experience with Terraform for managing infrastructure as code
**We offer**
- Career plan and real growth opportunities
- Unlimited access to LinkedIn learning solutions
- International Mobility Plan within 25 countries
- Constant training, mentoring, online corporate courses, eLearning and more
- English classes with a certified teacher
- Support for employee initiatives (Algorithms club, Toastmasters, agile club and more)
- Enjoyable working environment (gaming room, napping area, amenities, events, sports teams and more)
- Flexible work schedule and dress code
- Collaborate in a multicultural environment and share best practices from around the globe
- Hired directly by EPAM & 100% on payroll
- Statutory benefits (IMSS, INFONAVIT, 25% vacation bonus)
- Insurance: life and major medical expenses with dental & vision coverage (for the employee and direct family members)
- 13% employee savings fund, capped at the legal limit
- Grocery coupons
- 30-day December bonus
- Employee Stock Purchase Plan
- 12 vacation days plus 4 floating days
- Official Mexican holidays, plus 5 extra holidays (Maundy Thursday and Good Friday, November 2nd, December 24th & 31st)
- Monthly non-taxable amount for the electricity and internet bills
EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.
-
Site Reliability Engineer
3 weeks ago
Remote, Mexico | thegetch mexico | Full-time
**Role: Site Reliability Engineer**
**Openings: more than 10 hires**
**Location: any city with TCS office presence (Queretaro, Guadalajara, Mexico City or Monterrey)**
**Salary: 25-33 USD/hr**
**English communication: advanced**
**Experience: 4+ years**
**Site Reliability Engineer responsibilities**: Gather and analyze system metrics...
-
Site Reliability Engineer
2 weeks ago
Remote, Mexico | Right Balance | Full-time
**Overview**
We're looking for a Site Reliability Engineer. Headquartered in Los Angeles, California, Right Balance provides top-tier technology talent for innovative companies in the US. We’re in the top 50 companies to watch in LA.
**Engagement Details**
Our client is a USA-based company producing video solutions with the mission to advance scientific...
-
Site Reliability Engineer
1 week ago
Remote, Mexico | Right Balance | Full-time
**Overview**
We're looking for a Site Reliability Engineer. Headquartered in Los Angeles, California, Right Balance provides top-tier technology talent for innovative companies in the US. We’re in the top 50 companies to watch in LA.
**Engagement Details**
Our client is a USA-based company producing video solutions with the mission to advance scientific...
-
Lead Site Reliability Engineer
4 weeks ago
Remote, Mexico | Tekshapers Inc | Full-time
**Position: Lead Site Reliability Engineer**
**Location: Remote**
**Duration: Contract**
- Lead and mentor a team of SREs to ensure operational excellence and maximize the reliability and availability of client systems.
- Minimum 10 years of work experience in DevOps/SRE, including leadership roles.
- Architect and design highly scalable and available...
-
Site Reliability Engineer
1 week ago
Remote, Mexico | Luxoft | Full-time
**Project description**: Do you like to work with existing and new software product development teams? This position is to instrument end-to-end observability and visibility for business-critical systems with log ingestion, metrics, and traces. You will function as a site reliability engineer (SRE) that will collaborate with product teams, infrastructure...
-
Senior Site Reliability Engineer
3 weeks ago
Remote, Mexico | EPAM Systems, Inc. | Full-time
Join our team as a **Senior Site Reliability Engineer** focused on delivering advanced support for critical Azure-based systems.
**Responsibilities**
- Troubleshoot and resolve complex incidents to maintain system uptime
- Ensure reliability and performance of Azure-based enterprise infrastructure
- Implement observability, monitoring, and logging solutions
- ...
-
Lead Site Reliability Engineer
3 weeks ago
Remote, Mexico | EPAM Systems, Inc. | Full-time
Join our team as a **Lead Site Reliability Engineer** dedicated to providing advanced support for critical Azure-based systems.
**Responsibilities**
- Resolve complex incidents to ensure system availability
- Maintain reliability and performance of Azure-based enterprise infrastructure
- Deploy observability, monitoring, and logging tools
- Automate...
-
Site Reliability Engineer
7 days ago
Remote, Mexico | Luxoft | Full-time
**Project description**: Do you like to work with existing and new software product development teams? This position is to instrument end-to-end observability and visibility for business-critical systems with log ingestion, metrics, and traces. You will function as a site reliability engineer (SRE) that will collaborate with product teams, infrastructure...
-
Lead Site Reliability Engineer
4 weeks ago
Remote, Mexico | EPAM Systems, Inc. | Full-time
We are looking for an experienced **Site Reliability Engineer (SRE)** to take a leadership role in ensuring the stability, scalability, and performance of our cloud infrastructure on **Google Cloud Platform (GCP)**. As an SRE, you will be at the forefront of optimizing system reliability, automating processes, and collaborating with engineering teams to...
-
Senior Azure Site Reliability Engineer
3 weeks ago
Remote, Mexico | Pinnacle | Full-time
**Job Title**: Senior Azure Site Reliability Engineer
**Reports To**: Azure Site Reliability Lead
**About us**: Welcome to Pinnacle, the ultimate destination for sports enthusiasts seeking an exhilarating sportsbook and gaming experience! Established in 1998, we have solidified our position as one of the globe's foremost licensed online gaming companies....