Site Reliability Engineer
Workplace: Stockholm, Sweden
Expires: August 28, 2025
The Site Reliability Engineer will expand and enhance the Observability sub-track within the Performance and Observability Team at Wolt. This role focuses on building and improving an observability platform used by all Wolt engineers, handling telemetry data at scale, championing observability best practices, and collaborating closely with counterparts at DoorDash for next-generation platform development.
Main requirements:
  • Proven experience in Software Engineering, SRE, or similar roles focused on observability and scalable systems.
  • Experience with OpenTelemetry and observability infrastructure strategy.
  • Strong computer science fundamentals and engineering principles.
  • Proficiency in Go (preferred) or Python for building automation and distributed systems software.
  • Hands-on experience with observability and tracing tools such as Datadog, Prometheus, Mimir, Elasticsearch, Grafana, and Jaeger.
  • Expertise with cloud platforms (AWS, GCP, Azure) and managing cloud infrastructure with Kubernetes and Docker.
  • Deep knowledge of building and maintaining reliable, scalable distributed systems.
  • Solid understanding of SRE principles, incident management, and fault tolerance.
  • Experience with infrastructure-as-code tools like Terraform or Ansible.
  • Familiarity with CI/CD pipelines and automated delivery.
  • Strong analytical and problem-solving skills for complex distributed systems.
  • Excellent communication and collaboration abilities to work cross-functionally.
  • Willingness to engage directly with application code for observability integration.
  • Experience with Unix systems, networking, Docker, and Kubernetes.
  • Openness to feedback and continuous improvement.
Responsibilities:
  • Build and enhance the observability platform and tools used across Wolt.
  • Architect, build, and maintain the observability stack so that it reliably handles growing volumes of telemetry data.
  • Champion and guide observability best practices internally.
  • Own initiatives to improve quality, efficiency, and reliability of observability.
  • Apply SRE practices to deliver business impact.
  • Participate in on-call rotations to address incidents and outages.
  • Standardize observability resources via tools and documentation to improve developer productivity.
  • Triage and resolve production issues related to observability.
  • Contribute to open-source projects by sharing internal tools with the community.
Required hard skills:
  • Software Engineering, SRE, or DevOps with an observability focus
  • OpenTelemetry
  • Go (preferred) or Python programming
  • Observability tools: Datadog, Prometheus, Mimir, Elasticsearch, Grafana, Jaeger
  • Cloud platforms: AWS, GCP, Azure
  • Kubernetes and Docker container management
  • Infrastructure-as-code: Terraform or Ansible
  • CI/CD pipelines and automated testing
  • Unix systems and networking
  • Distributed systems architecture and reliability
  • Incident response and fault-tolerant design
Recommended hard skills:
  • Managing large-scale Elasticsearch or similar databases
  • Operating distributed event streaming platforms like Apache Kafka
  • Open-source contributions in observability, cloud, or platform engineering
Soft skills:
  • Analytical and problem-solving skills
  • Excellent communication and collaboration
  • Cross-functional teamwork
  • Open to feedback and continuous learning
  • Customer-focused mindset
Coding languages:
  • Go
  • Python
Operating systems:
  • Unix
  • Linux
Natural languages:
  • English (Proficient)
Cultural skills:
  • Remote work across Nordic and Baltic countries
  • Collaborative work culture
  • Feedback-oriented growth