Senior Site Reliability Engineer
7 days ago
We are seeking an experienced **Senior Site Reliability Engineer** to join our team.
This role will cover the LatAm timezone, working collaboratively with a team of SREs and a hands-on Lead SRE, while also coordinating with a European-based SRE team. The position ensures follow-the-sun 24/7 on-call support for a customer platform that includes multiple Java backend services.
**Responsibilities**
- Provide 12/7 on-call support for Java backend services, ensuring platform reliability and availability
- Own API Gateway observability to monitor and maintain service health
- Prepare and deploy patches to resolve issues in Java code and related service cloud infrastructure
- Develop and maintain metrics and dashboards to assess and improve platform health
- Create and enhance runbooks for all EOS backend services to streamline operational processes
- Monitor Service Level Objectives (SLOs) for backend services, addressing errors and submitting code changes to improve them
- Troubleshoot complex system issues using logs and telemetry to identify and resolve root causes efficiently
- Collaborate with cross-functional teams to ensure operational excellence and incident response readiness
**Requirements**
- Bachelor’s or Master’s degree in Computer Science or a related field
- At least 3 years of experience in Java backend development and DevOps/SRE roles
- Proficiency with Amazon DynamoDB, Amazon ElastiCache, and other AWS services
- Experience with Git and Gradle for version control and build automation
- Strong understanding of observability and troubleshooting in distributed systems
- Skilled in using logs and telemetry to diagnose and resolve complex systems issues
- Effective written communication skills to document operational issues during live incident responses
- Motivation to track and improve SLOs across multiple systems through repeatable processes
- Fluent English communication skills, both written and spoken, at a B2 level or higher
**Nice to have**
- Familiarity with Apache Cassandra for distributed database management
- Experience with Apache Kafka for real-time data streaming
- Knowledge of Grafana for building and maintaining observability dashboards
- Proficiency in Java and Scala for backend development
- Hands-on experience with Kubernetes for container orchestration
- Knowledge of Terraform for infrastructure as code
**We offer**
- Career plan and real growth opportunities
- Unlimited access to LinkedIn learning solutions
- International Mobility Plan within 25 countries
- Constant training, mentoring, online corporate courses, eLearning and more
- English classes with a certified teacher
- Support for employee initiatives (algorithms club, Toastmasters, agile club and more)
- Enjoyable working environment (gaming room, napping area, amenities, events, sports teams and more)
- Flexible work schedule and dress code
- Collaborate in a multicultural environment and share best practices from around the globe
- Direct hire by EPAM, 100% on payroll
- Benefits required by law (IMSS, INFONAVIT, 25% vacation bonus)
- Insurance: life and major medical expenses, with dental and vision coverage (for the employee and direct family members)
- 13% employee savings fund, capped at the legal limit
- Grocery coupons
- 30-day December bonus
- Employee Stock Purchase Plan
- 12 vacation days plus 4 floating days
- Official Mexican holidays, plus 5 extra holidays (Maundy Thursday and Good Friday, November 2nd, December 24th and 31st)
- Monthly non-taxable amount for the electricity and internet bills
EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.