Principal SRE – Cloud Automation
6 days ago
This role requires an SRE mindset combined with AI/ML expertise and strong application engineering skills across public and private cloud environments.
Key Responsibilities
- End-to-end service ownership: design for telemetry, security, resiliency, scalability, and performance; lead sizing/architecture; drive service health reviews and process simplification.
- Incident management and prevention: lead postmortems/RCAs, coordinate fixes, define repair items, and implement data-driven prevention and continuous improvement.
- AI/ML and GenAI delivery: design and integrate solutions with LLMs, RAG, agentic workflows, and conversational AI; build low-latency model serving and retraining pipelines.
- Application engineering: develop performant microservices for distributed, containerized, cloud-native systems.
- Automation: eliminate toil by automating operational workflows, recovery procedures, code delivery, and configuration management; build internal tools and reusable scripts/services to accelerate delivery and reduce errors.
- Observability: define and implement monitoring, logging, alerting, and tracing strategies; establish SLOs/SLIs/error budgets; improve diagnostics and performance visibility for rapid triage.
- Cross-functional collaboration: partner with product, operations, and data teams to translate requirements into secure, scalable solutions; communicate effectively with technical and non-technical stakeholders.
Minimum Qualifications
- BS/MS in Computer Science or related field; 10+ years of software engineering in cloud environments.
- Strong background in distributed systems/microservices using Java/Python; SQL/data modeling; Python for AI/automation.
- SRE/DevOps expertise: systems and networking fundamentals, application security, observability, performance analysis, and incident response.
- Proven SDLC excellence: code quality, reviews, version control, CI/CD, testing, and release engineering.
- Excellent written and verbal communication; English fluency.
Preferred/Technical Skills
- AI/ML/GenAI: experience with foundational models, RAG, agentic architectures; model deployment, optimization, monitoring, and retraining.
- Cloud and containers: experience with containerization, orchestration, and resilient, fault-tolerant microservices.
- Observability: hands-on experience designing dashboards, alerts, traces, logs, and metrics; defining SLOs/SLIs and error budgets; on-call readiness and runbook quality.
- Operations: performance tuning across Java/Python and SQL for large-scale enterprise applications; strong Linux/Unix expertise; capacity planning and reliability reviews.
- Automation and scripting: proficiency in scripting to automate operational workflows, tooling, and CI/CD tasks (e.g., shell scripting, Python, configuration-as-code, task runners).
- Familiarity with enterprise ERP applications and standard DevOps tooling and practices.
Qualifications
Career Level - IC4
-
SRE Developer
2 days ago
Guadalajara, Jalisco, México · TouchTunes · Full-time
SRE Developer. Location: Guadalajara. Your mission in the SRE team: As a Site Reliability Engineer (SRE) embedded in our mobile app development squads, you will work side-by-side with backend and mobile engineers to ensure new features and services are reliable, scalable, and maintainable from day one. You'll bring an operational mindset into the development...
-
Guadalajara, Jalisco, México · Oracle · Full-time
Description: Solve complex problems related to infrastructure cloud services and build automation to prevent problem recurrence. Design, write, and deploy software to improve the availability, scalability, and efficiency of Oracle Cloud products and services. Design and develop designs, architectures, standards, and methods for large-scale distributed...
-
Site Reliability Developer 3
6 days ago
Guadalajara, Jalisco, México · Oracle · Full-time
Description: Solve complex problems related to infrastructure cloud services and build automation to prevent problem recurrence. Design, write, and deploy software to improve the availability, scalability, and efficiency of Oracle products and services. Design and develop designs, architectures, standards, and methods for large-scale distributed systems....
-
Manager, ERP Cloud Operations
6 days ago
Guadalajara, Jalisco, México · Oracle · Full-time
Description: Manage a team that designs, develops, troubleshoots, and debugs software programs for databases, applications, tools, networks, etc. Responsibilities: As a manager, you will lead people and apply your knowledge of SRE to manage tasks associated with cloud operations, developing, debugging or designing software applications, operating systems and...
-
Guadalajara, Jalisco, México · Oracle · Full-time
Job Description: Solve complex problems related to infrastructure cloud services and build automation to prevent problem recurrence. Design, write, and deploy software to improve the availability, scalability, and efficiency of Oracle Cloud products and services. Design and develop designs, architectures, standards, and methods for large-scale distributed...
-
Azure Cloud Engineer
2 weeks ago
Guadalajara, Jalisco, México · Slalom Consulting · Full-time
Who You'll Work With: As a modern technology company, our Slalom Technologists are disrupting the market and bringing to life the art of the possible for our clients. We have passion for building strategies, solutions, and creative products to help our clients solve their most complex and interesting business problems. We surround our technologists with...
-
Manager, ERP Cloud Operations
2 weeks ago
Guadalajara, Jalisco, México · Oracle · Full-time
Job Description: Manage a team that designs, develops, troubleshoots, and debugs software programs for databases, applications, tools, networks, etc. Responsibilities: As a manager, you will lead people and apply your knowledge of SRE to manage tasks associated with cloud operations, developing, debugging or designing software applications, operating systems and...
-
Cloud Test Engineer
2 weeks ago
Guadalajara, Jalisco, México · Insulet Corporation · Full-time
Insulet started in 2000 with an idea and a mission to enable our customers to enjoy simplicity, freedom and healthier lives through the use of our Omnipod product platform. In the last two decades we have improved the lives of hundreds of thousands of patients by using innovative technology that is wearable, waterproof, and lifestyle accommodating. We are...
-
AWS Cloud DevSecOps Engineer
2 weeks ago
Guadalajara, Jalisco, México · Incedo Inc. · Full-time
About the Company: Leading global investment firm serving endowments, foundations, healthcare organizations, pension plans, and private clients. With over 1,200 employees across ten global offices, we deliver tailored portfolio management services grounded in independence, integrity, and deep investment expertise. Role Overview: As a member of the Cloud team, the...
-
Site Reliability Engineer
2 days ago
Guadalajara, Jalisco, México · tbo · Full-time
We are looking for a highly skilled Site Reliability Engineer (SRE) to join our team and ensure the reliability, scalability, and efficiency of our platforms and services. The ideal candidate will have extensive hands-on experience in Kubernetes, cloud platforms, infrastructure automation, and observability, while also bringing an analytical mindset and...