Job (Project) Description:
We are seeking a highly skilled Senior Site Reliability Engineer (SRE) to play a key role in strengthening and scaling our SaaS product ecosystem. In this role, you will work closely with engineering teams to ensure exceptional observability, availability, reliability, and performance of our cloud-based solutions.
The ideal candidate brings deep expertise in Azure cloud infrastructure, automation, and modern reliability engineering practices. You will apply engineering principles, operational excellence, and software craftsmanship to elevate our production systems and user-facing services.
Requirements:
- 5+ years of experience in SRE, DevOps, Cloud Engineering, or related roles.
- Strong expertise with Azure Cloud (infrastructure, networking, identity, security, scaling).
- Hands-on experience with:
  - Terraform (infrastructure as code)
  - Kubernetes (AKS preferred)
  - GitHub Actions or other CI/CD tools
- Solid understanding of monitoring, observability, incident response, and reliability best practices.
- Strong skills in systems engineering (OS, networking, storage, scaling).
- Experience participating in or leading on-call rotations.
- Strong troubleshooting and analytical skills with a proactive approach to reliability.
Job Responsibilities:
Cloud Infrastructure & Scalability
- Design, build, and maintain Azure-based cloud infrastructure supporting large-scale SaaS environments.
- Ensure seamless scalability, enabling the platform to support hundreds of thousands of concurrent users.
- Manage cloud infrastructure using Terraform, GitHub Actions, Kubernetes, and other automation tools.
Monitoring, Observability & Reliability
- Develop advanced, proactive monitoring and alerting systems that detect early symptoms before they escalate into incidents.
- Drive improvements in availability, reliability, performance, and capacity planning.
- Implement modern practices in observability, system health monitoring, logging, and tracing.
Automation & Operational Excellence
- Continuously improve and automate operational processes to ensure deployments, upgrades, and maintenance are smooth, predictable, and repeatable.
- Reduce toil by automating routine tasks and enhancing system resilience.
- Apply SRE principles: define and track SLIs/SLOs, manage error budgets, and improve incident response processes.
Collaboration & Technical Leadership
- Work closely with product engineering teams, influencing architectural decisions and ensuring reliability is embedded into design.
- Partner with developers to enhance system performance, improve CI/CD pipelines, and optimize cloud usage.
- Share best practices and mentor other engineers in DevOps and SRE methodologies.
Incident Management
- Participate in an on-call rotation (PagerDuty) to respond to and resolve incidents impacting service availability.
- Support product engineers during customer incidents, driving fast recovery and thorough root cause analysis.
- Contribute to post-incident reviews and implement corrective actions to prevent recurrence.
What We Offer:
- Competitive salary;
- Opportunities for professional growth and advancement;
- A collaborative and innovative work environment;
- Support for professional development (webinars, conferences, training courses, etc.);
- Flexible work environment (in-office, remote, or hybrid depending on preferences and manager approval).