Site Reliability Engineer

City of Zagreb

Information Technology (IT)

On-site

3-5 years of professional experience

Short Description

On behalf of our client Daytona, we are looking for a Site Reliability Engineer to join their team in Zagreb or Split. You will be responsible for the reliability, availability, and performance of the infrastructure that powers Daytona's sandbox and development environment platform. You will work at the intersection of software engineering and operations to ensure the platform's systems run seamlessly at scale, enabling AI agents and developers worldwide to execute code in secure, isolated environments with sub-100ms spin-up times.

Description

Own and maintain the reliability of Daytona's cloud infrastructure, ensuring platform uptime targets and SLA commitments are consistently met
Design, implement, and maintain comprehensive observability across the platform, including distributed tracing, metrics collection, and centralized logging (OpenTelemetry, Prometheus, Grafana)
Monitor key health metrics across sandbox orchestration, compute clusters, and API services; define and track SLIs/SLOs for all critical platform components
Build and improve incident response processes, including on-call rotations, runbooks, post-incident reviews, and automated remediation workflows
Manage and optimize Kubernetes clusters, container orchestration, and infrastructure-as-code deployments (Terraform, Helm)
Collaborate with engineering teams to improve system resilience through capacity planning, load testing, chaos engineering, and failure mode analysis
Automate operational toil through scripting and tooling to reduce manual intervention and improve deployment velocity
Participate in architecture reviews with a focus on scalability, fault tolerance, and disaster recovery
Manage and optimize PostgreSQL and other data stores for performance, replication, and backup reliability
Contribute to CI/CD pipeline reliability and deployment safety mechanisms (canary releases, feature flags, rollback procedures)

Requirements

3+ years of experience in an SRE, DevOps, or Platform Engineering role
Strong proficiency with Kubernetes and container orchestration in production environments
Hands-on experience with observability tools and practices (OpenTelemetry, Prometheus, Grafana, ELK/Loki, or similar)
Solid understanding of Linux systems, networking, and troubleshooting at scale
Experience with infrastructure-as-code tools (Terraform, Pulumi, or CloudFormation)
Proficiency in at least one programming language (Go, Python, or Bash)
Familiarity with cloud platforms (AWS, GCP, or Azure)
Experience with PostgreSQL administration and performance tuning
Strong incident management and communication skills
Familiarity with agentic workflows and AI coding assistants (Claude Code, Cursor, Opencode, or similar)

