Site Reliability Engineer

1 week ago

WorkFromHome, México KI people Full-time

In Search of the Best Global IT & Digital Talent

We are looking for a Site Reliability Engineer to work in hybrid mode from GDL, MTY, or CDMX on a multicultural project with stability and growth in the short, medium, and long term.

Role Overview: The SRE Operations specialist focuses on B2B application support, providing round-the-clock support and identifying opportunities for self-healing automation and proactive health checks. The role requires specialization in the Site Reliability Engineering (SRE) mode of operations and helps onboard applications to any SRE orchestration framework for higher business resiliency. It calls for strong IT operations experience, analytical skills, and a mindset of proactive issue identification.

The SRE Ops engineer champions Site Reliability Engineering and collaborates with the customer and the business to troubleshoot issues, identify root causes, and find opportunities for automation and proactive health checks, investigating application code as needed. The role requires a good understanding of different architecture types (legacy and modern applications) and their logging mechanisms, along with exposure to observability tools such as APPD, ELK, and FullStory.

As a proactive support engineer, the SRE Ops engineer diagnoses anomalies and drives the necessary remediations across the teams involved. The engineer works with the existing L2 support team to understand production issues and to participate in and contribute to RCA. The engineer also identifies gaps in proactive health checks, automates and implements self-healing mechanisms wherever needed, and works with the SRE orchestration team to prepare applications for onboarding to the SRE orchestration framework.

Qualifications (basic): Bachelor’s degree in computer science or a related field. 3+ years of related experience in IT Operations SRE platform/Service Cloud operations.
Responsibilities:
- Work with the existing L2 support team to understand production issues, and actively participate in and contribute to Root Cause Analyses (RCAs).
- Identify gaps in proactive health checks and implement new checks to detect potential issues before they impact production.
- Automate and implement self-healing mechanisms wherever needed to minimize manual intervention and improve system resilience.
- Collaborate with the SRE orchestration team to onboard and operationalize the SRE orchestration framework.
- Diagnose anomalies in production environments and drive the necessary remediations across the teams involved.

Mandatory Skills:
- Proven IT operations experience, with a focus on production support.
- Strong analytical and problem-solving skills, with the ability to troubleshoot complex issues, identify root causes, and spot opportunities for automation and proactive health checks.
- A mindset of proactive issue identification and prevention.
- Ability to investigate application code (e.g., debugging, log analysis) to understand system behavior.
- Understanding of different application architecture types (legacy and modern) and their logging mechanisms.
- Exposure to observability tools such as AppDynamics (APPD), ELK Stack (Elasticsearch, Logstash, Kibana), FullStory, Prometheus, and Grafana.
- Proficiency in scripting.

Nice-to-Have Skills:
- Knowledge of cloud platforms (Azure/GCP).
- Knowledge of SQL.
- Networking concepts for diagnosing issues.
- Experience with SRE (Site Reliability Engineering) principles and practices.
- Experience with SRE orchestration frameworks.
- Knowledge of scripting languages (e.g., Python, Bash, PowerShell) for automation.
- Experience with containerization technologies (e.g., Docker, Kubernetes).
- Knowledge of infrastructure-as-code tools (e.g., Terraform, Ansible).
- Experience with CI/CD pipelines.
- Excellent communication and collaboration skills.

Other Relevant Experience:
- Experience working as part of an SRE Operations team practicing an SRE orchestration framework.
- Experience and desire to work in a global delivery environment.
- Ability to work in a team in a diverse, multi-stakeholder environment.

Offer:
- Payroll
- Direct hire by client
- Multicultural teams
- Permanent project

If you are looking for a new professional challenge, this is a good opportunity. Let's talk about your next professional experience.

Seniority level: Mid-Senior level
Employment type: Full-time
Job function: Engineering and Information Technology
Industries: Human Resources Services

Site Reliability Engineer

4 weeks ago

WorkFromHome, México Hcl International Ltd Full-time

Senior Site Reliability Engineer: Site Reliability Engineer to join this fast-growing, well-funded business with cloud built on AWS. With first-class skills in AWS, the Site Reliability Engineer must demonstrate expertise in spinning up featured environments. Reporting to the CTO, this is an excellent opportunity for an ambitious Site Reliability Engineer to...
Site Reliability Engineer

1 week ago

WorkFromHome, México BairesDev Full-time

Site Reliability Engineer - Remote Work | REF# At BairesDev, we've been leading the way in...
Sr. Site Reliability Engineer

1 week ago

WorkFromHome, México Nova Full-time

Sr. Site Reliability Engineer (Remote, Mexico)
Site Reliability Engineer

1 week ago

WorkFromHome, México BairesDev Full-time

Site Reliability Engineer - Remote Work | REF# At BairesDev, we've been...
Remote Site Reliability Engineer

2 weeks ago

WorkFromHome, México Resend Full-time

A modern email platform company is seeking a Site Reliability Engineer for a fully remote position. In this role, you will enhance system reliability and automation, monitor performance parameters, and collaborate with engineering teams. Ideal candidates will have over 5 years in Site Reliability or Infrastructure Engineering, strong skills in Node.js and...
Site Reliability Engineer

3 weeks ago

WorkFromHome, México - Full-time

JOB DESCRIPTION Site Reliability Engineer (SRE) - Application Performance Monitoring (APM) Location: Monterrey, Nuevo León, Mexico (Hybrid - candidates must reside in Monterrey or the metropolitan area) Language requirement: Fluent English (spoken and written) About the Role We're looking for a Site Reliability Engineer (SRE) with a passion for Application...
Site Reliability Engineer

4 days ago

WorkFromHome, México National Oilwell Varco, Inc. Full-time

Site Reliability Engineer (SRE) – Application Performance Monitoring (APM) Location: Monterrey, Nuevo León, Mexico (Hybrid – candidates must reside in Monterrey or the metropolitan area) Language requirement: Fluent English (spoken and written) About the Role We’re looking for a Site Reliability Engineer (SRE) with a passion for Application...
Senior Site Reliability Engineer

1 week ago

WorkFromHome, México DuckDuckGo Full-time

Who We Are: Hi, we're DuckDuckGo, the online protection company and remote-first team of 300+ on a mission to raise the standard of trust online. Founded in 2008 and profitable since 2014, our annual revenue now exceeds $100 million USD. Millions use our browser on Mac, Windows, iOS, and Android, our search engine,...
Site Reliability Engineer

4 weeks ago

WorkFromHome, México S&P Global Full-time

Site Reliability Engineer - Data Support | S&P Dow Jones Indices. We are seeking a Site Reliability Engineer - Data Support to be a key player in the implementation and support of our Global Index Data Platform, which supports our major headline indices like the S&P 500 and the Dow Jones Industrial Average, as well as the co-branded indices with our exchange partners such as...
