Description
As a DevOps Engineer on our SRE team, you will play a key role in ensuring the reliability, scalability, and performance of our platform. You’ll work on deploying, configuring, and monitoring services in production, while using automation to support change management, incident response, and capacity planning. Collaborating closely with development teams, you’ll help define and uphold SLAs through clear SLOs and SLIs, driving continuous improvements in service reliability and operational efficiency.
Responsibilities
- Monitoring system health and ensuring high availability in production
- Building tools and systems to manage infrastructure and applications
- Improving software reliability, performance, and delivery speed
- Analyzing metrics to identify bottlenecks and optimize performance
- Supporting distributed applications with operational expertise
- Collaborating with developers on testing, releases, and service enhancements
- Leading efforts in automation, sustainable operations, and system design
- Balancing development speed with system reliability using defined SLOs
Qualifications
- Proficiency in at least one high-level language (e.g., Python, Java, Go, or JavaScript)
- Understanding of networking basics, including TCP/IP, VPCs, and VPNs
- Experience with monitoring tools like Datadog, Prometheus, Grafana, or ELK
- Familiarity with automation tools like Ansible, Terraform, or Crossplane
- Solid knowledge of operating systems, Docker, Kubernetes, and Helm
- Hands-on experience with DevOps practices and CI/CD pipelines (e.g., Jenkins)
- A proactive mindset for spotting issues and improving performance
Bonus points if you have:
- Experience with incident management and troubleshooting
- Strong communication and teamwork skills
- Familiarity with cloud-native applications