Senior Site Reliability Engineer, Observability

7 days ago

Ciudad de México, Chainlink Labs, Full-time

**About Us**

The Observability Team enables Chainlink development and empowers engineers to continue building and supporting crucial products and services that have a profound impact on the blockchain industry. Reliability is vital to the success of our company. As a Senior SRE, you will help us accelerate and enable other engineering teams by increasing self-service and decreasing cognitive load.

This job would be perfect for someone who has a strong DevOps mentality, is passionate about building and maintaining a mature GitOps environment, and has experience focusing on observability. The entire engineering team is expanding, and you would have plenty of opportunities to build, learn, and grow.

We all have different backgrounds and are determined to help you succeed no matter where you are or who you are. If you think you would do a great job at Chainlink, we are looking forward to speaking with you, even if you don't match 100% of the job requirements: those describe people we've usually had a great time working with, but they're not a tick-box exercise.

**Your Impact**
- Build and orchestrate a modern OTel-based observability platform.
- Support multiple telemetry types, such as metrics, logs, and traces.
- Define and support modern observability governance, and solve observability problems at scale.
- Ensure reliability, security, and performance exceed our defined SLAs
- Work with engineers from across the company to help troubleshoot issues, deploy new products and services, and increase velocity while decreasing cognitive load
- Lead the design and deployment of monitoring/observability services to detect and alert the team of needed action.
- Ingest, aggregate, transform, and utilize data from a multitude of sources in our real-time data pipeline.
- Oversee the availability, performance, and supportability of our observability infrastructure.
- Create processes around alert response operations and support the team to ensure the reliable delivery of oracle data.
- Make recommendations to ensure sufficient metrics are collected to create alerts with every new feature release.
- Champion reliability and security by taking the time to do your work right the first time

**Requirements**:

- 7+ years of relevant professional experience. You have probably worked on a DevOps, infrastructure, SRE, and/or platform team before
- Ability to develop software outside of the scope of typical infrastructure requirements and configurations
- Experience programming in C, C++, Java, Python, Go, Perl, or Ruby
- Expert knowledge in all aspects of designing, developing, and managing large real-time systems
- Experience with monitoring and logging. You know how to export metrics using Prometheus, have built a Grafana dashboard or two, and have experience with a centralized logging solution like the ELK Stack, Splunk, or the Grafana stack.
- Experience with distributed systems and container orchestration. You have maintained or even built Kubernetes clusters before and feel comfortable deploying completely new services on them
- Strong communication skills. You can give and receive constructive feedback, and you do not shy away from planning meetings and code reviews

**Desired Qualifications**
- Excitement for blockchain, Web 3.0, and similar decentralized technologies.
- Experience running any infrastructure in the blockchain/web3 space
- Ability to scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity
- Experience working remotely in a distributed team
- A strong desire to grow and challenge yourself. We would expect you to constantly find ways to improve and automate services to reduce toil

**Some of the tools and services we use daily or almost daily are**:

- AWS; Terraform/Terragrunt; Kubernetes, Calico, and ArgoCD; Prometheus and Grafana; GitHub Actions; Packer

We expect you to be comfortable with most of those tools and very proficient in several of them.

All roles with Chainlink Labs are global and remote-based. Unless otherwise stated, we ask that you try to overlap some working hours with Eastern Standard Time (EST).

**Commitment to Equal Opportunity**

