Role Overview
Collaborating with a top academic research lab focused on advancing AI agents in real-world system environments. We're seeking high-performing software engineers based in Five Eyes countries to rigorously evaluate and improve terminal-based agents through the Terminal-Bench 2.0 benchmark suite. This is a short-term, high-intensity contract ideal for engineers with deep systems-level expertise and a passion for hands-on problem-solving. Due to the complexity of the tasks, high engagement and consistent weekly availability are critical.
Key Responsibilities
- Systematically analyze, solve, and document benchmark tasks involving Docker, shell scripting, and Linux system administration
- Evaluate agent outputs for correctness, reproducibility, and reliability across complex multi-step CLI workflows
- Provide detailed, evidence-based reasoning grounded in code structure and terminal behavior
- Synthesize information across files and configurations to assess end-to-end architecture
- Contribute high-quality reference solutions and diagnostic insights to improve agent performance metrics
Ideal Qualifications
- 2+ years of hands-on experience at top-tier tech companies, quant firms, or elite startups
- Bachelorβs or Masterβs in Computer Science or related field from a top 50β100 global university
- Deep familiarity with terminal workflows, Linux environments, and shell scripting
- Strong knowledge of Docker, Git, Python, and distributed systems concepts
- Demonstrated ability to trace, debug, and explain complex system behaviors across multiple files
- Commitment to intellectual honesty, clarity, and rigorous methodology
Application Process
- Submit your resume and brief experience summary
- Qualified applicants will be invited to complete a short-form technical assessment
- We typically follow up within 3β5 business days with next steps
Contract and Payment Terms
- You will be engaged as an independent contractor.
- This is a fully remote role that can be completed on your own schedule.
- Projects can be extended, shortened, or concluded early depending on needs and performance.