Lead Site Reliability Engineer
2 weeks ago
Join our team as a **Lead Site Reliability Engineer** dedicated to providing advanced support for critical Azure-based systems.

**Responsibilities**
- Resolve complex incidents to ensure system availability
- Maintain reliability and performance of Azure-based enterprise infrastructure
- Deploy observability, monitoring, and logging tools
- Automate infrastructure management with Terraform and scripting technologies
- Improve system performance and uptime through centralized monitoring
- Collaborate with multiple teams to enhance service reliability
- Perform root cause analysis and oversee postmortems for incidents
- Configure deployment pipelines in Azure DevOps for secure workflows
- Write and maintain automation scripts for incident recovery and recurring tasks
- Enhance monitoring frameworks with platforms like Prometheus and Grafana
- Respond promptly to incidents to meet SLA expectations
- Facilitate integration of monitoring data from Azure and AWS environments
- Advance service reliability and observability practices continuously
- Document processes and incident resolutions thoroughly
- Take part in Agile team events and balance task priorities

**Requirements**
- Minimum 5 years of experience in site reliability engineering or comparable DevOps roles
- 1+ years of demonstrated leadership experience
- Knowledge of Azure services, including AKS, Azure Monitor, Application Insights, Log Analytics, Cosmos DB, and PostgreSQL
- Expertise in infrastructure automation using Azure DevOps and Terraform
- Proficiency in scripting languages such as Bash, PowerShell, and Python
- Skills in monitoring tools including Prometheus and Grafana
- Background in incident management and ITSM processes with analytical capability for root cause investigations
- Competency in resolving technical challenges promptly in high-pressure situations
- Experience in Agile workflows and fast-paced operational environments
- Ability to communicate effectively in written and verbal formats for teamwork and documentation
- Capability to configure alerts that prevent SLA breaches proactively
- Understanding of cloud scaling techniques and security best practices
- Knowledge of Kubernetes administration for orchestration tasks
- Ability to collaborate with diverse functional teams seamlessly
- English proficiency of B2 or higher

**Nice to have**
- Background in AWS services, such as EKS, RDS, CloudWatch, and X-Ray
- Familiarity with distributed logging systems and tools for incident automation
- Certifications such as Microsoft Azure Administrator or AWS Certified DevOps Engineer
- Understanding of Kubernetes configurations for scaling and advanced networking setups
- Proficiency in observability tools such as OpenSearch for AWS environments

**We offer**
- Career plan and real growth opportunities
- Unlimited access to LinkedIn learning solutions
- International Mobility Plan within 25 countries
- Constant training, mentoring, online corporate courses, eLearning and more
- English classes with a certified teacher
- Support for employee initiatives (Algorithms club, Toastmasters, Agile club and more)
- Enjoyable working environment (gaming room, napping area, amenities, events, sports teams and more)
- Flexible work schedule and dress code
- Collaborate in a multicultural environment and share best practices from around the globe
- Hired directly by EPAM & 100% under payroll
- Benefits required by law (IMSS, INFONAVIT, 25% vacation bonus)
- Insurance: life, and major medical expenses with dental & vision coverage (for the employee and direct family members)
- 13% employee savings fund, capped at the legal limit
- Grocery coupons
- 30-day December bonus
- Employee Stock Purchase Plan
- 12 vacation days plus 4 floating days
- Official Mexican holidays, plus 5 extra holidays (Maundy Thursday and Good Friday, November 2nd, December 24th & 31st)
- Monthly non-taxable amount for the electricity and internet bills

EPAM is a leading global provider of digital platform engineering and development services.
We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.
-
Site Reliability Engineer
1 week ago
Remote, Mexico | thegetch mexico | Full-time
**Role: Site Reliability Engineer**
**Openings: more than 10 hires**
**Location: any city with TCS Office presence (Queretaro, Guadalajara, Mexico City or Monterrey)**
**Salary: 25-33 USD/hr**
**English communication: advanced**
**Experience: 4+ years**
**Site Reliability Engineer responsibilities**: Gather and analyze system metrics...
-
Lead Site Reliability Engineer
3 weeks ago
Remote, Mexico | Tekshapers Inc | Full-time
**Position: Lead Site Reliability Engineer**
**Location: Remote**
**Duration: Contract**
- Lead and mentor a team of SREs to ensure operational excellence and maximize the reliability and availability of client systems.
- Minimum 10 years of work experience in DevOps/SRE, including leadership roles.
- Architect and design highly scalable and available...
-
Lead Site Reliability Engineer
1 week ago
Remote, Mexico | EPAM Systems, Inc. | Full-time
We are looking for an experienced **Lead Site Reliability Engineer** to join our team. In this role, you will play a pivotal part in the Reliability Tooling team, taking responsibility for writing and reviewing code, making key technical decisions, and mentoring engineers within your squad. This position requires a strong grasp of SRE principles and best...
-
Lead Site Reliability Engineer
3 weeks ago
Remote, Mexico | EPAM Systems, Inc. | Full-time
We are looking for an experienced **Site Reliability Engineer (SRE)** to take a leadership role in ensuring the stability, scalability, and performance of our cloud infrastructure on **Google Cloud Platform (GCP)**. As an SRE, you will be at the forefront of optimizing system reliability, automating processes, and collaborating with engineering teams to...
-
Site Reliability Engineer
4 days ago
Remote, Mexico | Right Balance | Full-time
**Overview**
We're looking for a Site Reliability Engineer. Headquartered in Los Angeles, California, Right Balance provides top-tier technology talent for innovative companies in the US. We’re in the top 50 companies to watch in LA.
**Engagement Details**
Our client is a USA-based company producing video solutions with the mission to advance scientific...
-
Site Reliability Engineer
2 days ago
Remote, Mexico | Right Balance | Full-time
**Overview**
We're looking for a Site Reliability Engineer. Headquartered in Los Angeles, California, Right Balance provides top-tier technology talent for innovative companies in the US. We’re in the top 50 companies to watch in LA.
**Engagement Details**
Our client is a USA-based company producing video solutions with the mission to advance scientific...
-
Lead Site Reliability Engineer
7 days ago
Remote, Mexico | EPAM Systems, Inc. | Full-time
We are seeking a Lead Site Reliability Engineer to join our team. In this role, you will help drive the reliability and performance of critical systems for a leading client. You will work in a collaborative environment focused on innovation and operational excellence. Please note, the client operates in the US Central Time Zone from 8 am CST to 5 pm...
-
Site Reliability Engineer
15 hours ago
Remote, Mexico | Luxoft | Full-time
**Project description**: Do you like to work with existing and new software product development teams? This position is to instrument end-to-end observability and visibility for business-critical systems with log ingestion, metrics, and traces. You will function as a site reliability engineer (SRE) who will collaborate with product teams, infrastructure...
-
Senior Site Reliability Engineer
1 week ago
Remote, Mexico | EPAM Systems, Inc. | Full-time
We are seeking an experienced **Senior Site Reliability Engineer** to join our team. As a key member of the Reliability Tooling team, you will be responsible for writing and reviewing code, contributing to critical technical decisions, and mentoring engineers within your squad. This role requires a deep understanding of SRE principles and best practices, as...