Site Reliability Engineer job in Stockholm, Sweden

Site Reliability Engineer

Workplace: Stockholm, Sweden

Expires: August 28, 2025

The Site Reliability Engineer will expand and enhance the Observability sub-track within the Performance and Observability Team at Wolt. This role focuses on building and improving an observability platform used by all Wolt engineers, handling telemetry data at scale, championing observability best practices, and collaborating closely with counterparts at DoorDash for next-generation platform development.

Main requirements:

Proven experience in Software Engineering, SRE, or similar roles focused on observability and scalable systems.
Experience with OpenTelemetry and observability infrastructure strategy.
Strong computer science fundamentals and engineering principles.
Proficiency in Go (preferred) or Python for building automation and distributed systems software.
Hands-on experience with observability and tracing tools such as Datadog, Prometheus, Mimir, Elasticsearch, Grafana, and Jaeger.
Expertise with cloud platforms (AWS, GCP, Azure) and managing cloud infrastructure with Kubernetes and Docker.
Deep knowledge of building and maintaining reliable, scalable distributed systems.
Solid understanding of SRE principles, incident management, and fault tolerance.
Experience with infrastructure-as-code tools like Terraform or Ansible.
Familiarity with CI/CD pipelines and automated delivery.
Strong analytical and problem-solving skills for complex distributed systems.
Excellent communication and collaboration abilities to work cross-functionally.
Willingness to engage directly with application code for observability integration.
Experience with Unix systems, networking, Docker, and Kubernetes.
Openness to feedback and continuous improvement.

Responsibilities:

Build and enhance the observability platform and tools used across Wolt.
Architect, build, and maintain the observability stack so that it reliably handles growing volumes of telemetry data.
Champion and guide observability best practices internally.
Own initiatives to improve quality, efficiency, and reliability of observability.
Apply SRE practices where they deliver business impact.
Participate in on-call rotations to address incidents and outages.
Standardize observability resources via tools and documentation to improve developer productivity.
Triage and resolve production issues related to observability.
Contribute to open-source projects by sharing internal tools with the community.

Required hard skills:

Software Engineering, SRE or DevOps with observability focus
OpenTelemetry
Go (preferred) or Python programming
Observability tools: Datadog, Prometheus, Mimir, Elasticsearch, Grafana, Jaeger
Cloud platforms: AWS, GCP, Azure
Kubernetes and Docker container management
Infrastructure-as-code: Terraform or Ansible
CI/CD pipelines and automated testing
Unix systems and networking
Distributed systems architecture and reliability
Incident response and fault-tolerant design

Recommended hard skills:

Managing large-scale Elasticsearch or similar databases
Operating distributed event streaming platforms like Apache Kafka
Open-source contributions in observability, cloud or platform engineering

Soft skills:

Analytical and problem-solving skills
Excellent communication and collaboration
Cross-functional teamwork
Openness to feedback and continuous learning
Customer-focused mindset

Coding languages:

Go
Python

Operating systems:

Unix
Linux

Natural languages:

English (Proficient)

Cultural skills:

Remote work across Nordic and Baltic countries
Collaborative work culture
Feedback-oriented growth
