We are seeking a talented and experienced DevOps Engineer to join our team. In this role, you will contribute to the further development of Arena, a web-based software platform for reinforcement learning training and RLOps.
As a DevOps Engineer, you will be responsible for designing, implementing, and maintaining the cloud infrastructure, CI/CD pipelines, and deployment systems that enable businesses to build and deploy reinforcement learning models at scale.
Responsibilities
• Design and maintain robust, scalable cloud infrastructure to support high-performance reinforcement learning workloads and distributed training environments
• Build and optimise CI/CD pipelines for both our open-source framework and Arena enterprise platform, ensuring reliable deployments and automated testing
• Implement and manage containerisation strategies using Docker and Kubernetes for ML model training, deployment, and orchestration
• Develop infrastructure as code (IaC) solutions using tools like Terraform, CloudFormation, or Pulumi to ensure reproducible and version-controlled infrastructure
• Monitor system performance, implement alerting and logging solutions, and troubleshoot production issues across distributed ML training environments
• Collaborate with ML engineers to optimise resource allocation and cost efficiency for compute-intensive RL training workloads
• Implement security best practices, manage access controls, and ensure compliance with enterprise security requirements
• Automate operational tasks including backup strategies, disaster recovery procedures, and system maintenance
• Support the deployment and scaling of GPU clusters and distributed computing resources for reinforcement learning applications
• Maintain high availability and performance of production systems serving ML models to external customers
Requirements
• Bachelor's degree or higher in Computer Science, Engineering, or a related field, or 3+ years of relevant DevOps/infrastructure experience
• Strong experience with cloud platforms (AWS, GCP, Azure) and their ML/AI services, with expertise in managing compute-intensive workloads
• Proficiency in containerisation technologies (Docker, Kubernetes) and container orchestration for ML workloads
• Experience with Infrastructure as Code tools (Terraform, CloudFormation, Pulumi) and configuration management
• Solid understanding of CI/CD principles and tools (GitHub Actions, GitLab CI, Jenkins) with experience in ML pipeline automation
• Knowledge of monitoring and observability tools (Prometheus, Grafana, OpenObserve) and their application to ML systems
• Experience with GPU infrastructure management and distributed computing frameworks for machine learning
• Familiarity with MLOps practices and tools for model deployment, versioning, and lifecycle management
• Strong scripting skills in Python, Bash, or similar languages for automation tasks
• Understanding of networking, security, and database management in cloud environments
• Experience with high-performance computing environments and job scheduling systems is a plus
• Knowledge of machine learning workflows and the unique infrastructure requirements of ML training and inference
• Strong problem-solving skills and ability to work in a fast-paced, collaborative environment
• Excellent communication skills and experience working with cross-functional teams
Compensation
• Competitive salary + significant stock options.
• 30 days of holiday, plus bank holidays, per year.
• Flexible working-from-home and 6-month remote working policies.
• Enhanced parental leave.
• Learning budget of £500 per calendar year for books, training courses and conferences.
• Company pension scheme.
• Regular team socials and quarterly all-company parties.
• Bike2Work scheme.
Join the fast-growing AgileRL team and play a key role in the development of cutting-edge reinforcement learning tooling and infrastructure.
Apply below