Site Reliability Engineer

City of Zagreb
IT and Telecommunication
On-site
3-5 years of professional experience
Short Description

On behalf of our client Daytona, we are looking for a Site Reliability Engineer to join their team in Zagreb or Split. You will be responsible for the reliability, availability, and performance of the infrastructure that powers Daytona's sandbox and development environment platform. You will work at the intersection of software engineering and operations to ensure the platform's systems run seamlessly at scale, enabling AI agents and developers worldwide to execute code in secure, isolated environments with sub-100ms spin-up times.

Description

  • Own and maintain the reliability of Daytona's cloud infrastructure, ensuring platform uptime targets and SLA commitments are consistently met
  • Design, implement, and maintain comprehensive observability across the platform, including distributed tracing, metrics collection, and centralized logging (OpenTelemetry, Prometheus, Grafana)
  • Monitor key health metrics across sandbox orchestration, compute clusters, and API services; define and track SLIs/SLOs for all critical platform components
  • Build and improve incident response processes, including on-call rotations, runbooks, post-incident reviews, and automated remediation workflows
  • Manage and optimize Kubernetes clusters, container orchestration, and infrastructure-as-code deployments (Terraform, Helm)
  • Collaborate with engineering teams to improve system resilience through capacity planning, load testing, chaos engineering, and failure mode analysis
  • Automate operational toil through scripting and tooling to reduce manual intervention and improve deployment velocity
  • Participate in architecture reviews with a focus on scalability, fault tolerance, and disaster recovery
  • Manage and optimize PostgreSQL and other data stores for performance, replication, and backup reliability
  • Contribute to CI/CD pipeline reliability and deployment safety mechanisms (canary releases, feature flags, rollback procedures)

Requirements

  • 3+ years of experience in an SRE, DevOps, or Platform Engineering role
  • Strong proficiency with Kubernetes and container orchestration in production environments
  • Hands-on experience with observability tools and practices (OpenTelemetry, Prometheus, Grafana, ELK/Loki, or similar)
  • Solid understanding of Linux systems, networking, and troubleshooting at scale
  • Experience with infrastructure-as-code tools (Terraform, Pulumi, or CloudFormation)
  • Proficiency in at least one programming language (Go, Python, or Bash)
  • Familiarity with cloud platforms (AWS, GCP, or Azure)
  • Experience with PostgreSQL administration and performance tuning
  • Strong incident management and communication skills
  • Familiarity with agentic workflows and AI coding assistants (Claude Code, Cursor, Opencode, or similar)

Manpower d.o.o., Ulica grada Vukovara 23, 10000 Zagreb, processes the personal data in your job application for the purpose of conducting the selection process and taking steps at the request of the data subject prior to entering into an employment contract, in accordance with Article 6(1)(b) of the General Data Protection Regulation. More information about how we process your personal data is available in our privacy policy.