Site Reliability Engineer

City of Zagreb
Information Technology (IT)
On-site
3-5 years of professional experience
Short Description

On behalf of our client Daytona, we are looking for a Site Reliability Engineer to join their team in Zagreb or Split. You will be responsible for the reliability, availability, and performance of the infrastructure that powers Daytona's sandbox and development environment platform. You will work at the intersection of software engineering and operations to ensure the systems run seamlessly at scale, enabling AI agents and developers worldwide to execute code in secure, isolated environments with sub-100ms spin-up times.

Description

  • Own and maintain the reliability of Daytona's cloud infrastructure, ensuring platform uptime targets and SLA commitments are consistently met
  • Design, implement, and maintain comprehensive observability across the platform, including distributed tracing, metrics collection, and centralized logging (OpenTelemetry, Prometheus, Grafana)
  • Monitor key health metrics across sandbox orchestration, compute clusters, and API services; define and track SLIs/SLOs for all critical platform components
  • Build and improve incident response processes, including on-call rotations, runbooks, post-incident reviews, and automated remediation workflows
  • Manage and optimize Kubernetes clusters, container orchestration, and infrastructure-as-code deployments (Terraform, Helm)
  • Collaborate with engineering teams to improve system resilience through capacity planning, load testing, chaos engineering, and failure mode analysis
  • Automate operational toil through scripting and tooling to reduce manual intervention and improve deployment velocity
  • Participate in architecture reviews with a focus on scalability, fault tolerance, and disaster recovery
  • Manage and optimize PostgreSQL and other data stores for performance, replication, and backup reliability
  • Contribute to CI/CD pipeline reliability and deployment safety mechanisms (canary releases, feature flags, rollback procedures)

Requirements

  • 3+ years of experience in an SRE, DevOps, or Platform Engineering role
  • Strong proficiency with Kubernetes and container orchestration in production environments
  • Hands-on experience with observability tools and practices (OpenTelemetry, Prometheus, Grafana, ELK/Loki, or similar)
  • Solid understanding of Linux systems, networking, and troubleshooting at scale
  • Experience with infrastructure-as-code tools (Terraform, Pulumi, or CloudFormation)
  • Proficiency in at least one programming language (Go, Python, or Bash)
  • Familiarity with cloud platforms (AWS, GCP, or Azure)
  • Experience with PostgreSQL administration and performance tuning
  • Strong incident management and communication skills
  • Familiarity with agentic workflows and AI coding assistants (Claude Code, Cursor, Opencode, or similar)

Manpower d.o.o., Ulica grada Vukovara 23, 10000 Zagreb, processes the personal data from your job application in order to carry out the selection procedure, that is, to take steps at the request of the data subject prior to entering into an employment contract, in accordance with Article 6(1)(b) of the General Data Protection Regulation. More information on how we process your personal data is available in the privacy policy.
