ABOUT THE ROLE
We are looking for a DevOps / Site Reliability Engineer to join the Prediction Platform team. You will be responsible for designing and operating the infrastructure layer of the platform, ensuring it is robust, secure, observable, and cost-efficient.
This role is central to enabling our engineers and researchers to move fast while maintaining the highest standards of reliability, automation, and governance. You will work closely with Generative AI engineers, quantitative developers, and the platform tech lead.
RESPONSIBILITIES
- Infrastructure as Code (IaC)
- Design, implement, and maintain reproducible infrastructure using Terraform, AWS CDK, or similar.
- Manage AWS accounts, IAM policies, VPCs, and network configuration to keep the platform secure and scalable.
- Observability & Reliability
- Implement logging, metrics, tracing, and alerting across services (e.g., CloudWatch, Prometheus, Grafana, OpenTelemetry).
- Build automated health checks and incident response playbooks.
- Automation & CI/CD
- Enhance and maintain the CI/CD stack (e.g., Jenkins, GitLab CI, or GitHub Actions).
- Automate repetitive tasks and develop internal tooling/scripts (Python, Bash, Go, etc.).
- Security & Governance
- Enforce access control, secrets management, and data protection policies.
- Partner with IT Core to align with CFM’s security and compliance standards.
- Cost Optimization
- Monitor AWS usage and implement cost-optimization strategies (e.g., right-sizing, spot instances, S3 lifecycle policies).
- Provide dashboards and reporting on cloud spend, with actionable recommendations.
- Cross-Team Collaboration
- Partner with Generative AI engineers to integrate RAG pipelines, vector databases, and agentic workflows into production.
- Support quantitative developers by providing stable, secure, and performant APIs and environments.