
Freelance Agent Evaluation Engineer

Remote from: Romania
Salary (yearly, USD): 100,000
Employment type: Full Time
Apply before: 11 Jun 2026
Views / Applies: 319 / 116

About Mindrift

Mindrift connects AI experts and clients to advance Generative AI models.


AI Summary

This role involves creating challenging tasks and evaluation criteria to test AI coding agents within simulated development environments. You will build virtual companies, design tasks, write tests, and iterate with AI agents to ensure evaluations are fair and rigorous. The position is project-based, not permanent, and requires strong software development skills, particularly in Python and full-stack technologies. Ideal candidates have 5+ years of experience, a background in test automation, and comfort working with Docker, CI/CD, and infrastructure tools. The work is demanding because frontier models are already proficient coders, making it difficult to design tasks that genuinely challenge them.

Job Complexity

AI Insight: The role requires deep technical expertise in software development, test design, and AI evaluation, along with the ability to create complex tasks that challenge advanced AI models. The need to balance test strictness and leniency, iterate with AI agents, and understand model failure modes adds significant complexity.

Salary Analysis

Median: USD 100,000
US Market: USD 80,000 – USD 150,000
AI Insight: The offered compensation is up to $50 per hour, which for a 20-hour task equates to $1,000 per task, while the annual salary equivalent is listed as $100,000. This is competitive for a freelance evaluation engineering role, though earnings may vary by project. The market range for similar roles in the US is typically $80,000 – $150,000 annually, depending on experience and project complexity.

Key Skills

Python · Test Automation · AI Evaluation · Docker · CI/CD · Full-Stack Development · FastAPI · pytest · React · Software Engineering

Sample Cover Letter

I am writing to express my strong interest in the Freelance Agent Evaluation Engineer opportunity at Mindrift. With over 5 years of experience in software development and test automation, I have a deep understanding of Python, full-stack technologies, and CI/CD pipelines that aligns closely with the requirements of this role.

My background includes designing complex test suites and evaluating system behavior, which directly translates to creating challenging tasks for AI coding agents. I am comfortable working with Docker, infrastructure tools like Postgres and Redis, and have experience iterating with AI models to refine evaluation criteria.

I am particularly drawn to the challenge of designing tasks that push the boundaries of current AI capabilities, as I enjoy understanding where models fail and creating scenarios that reveal meaningful differences in performance. My proficiency in English and ability to work independently make me well-suited for this project-based opportunity.

I am excited about the prospect of contributing to the advancement of AI evaluation and would be thrilled to bring my skills to your team. Thank you for considering my application.

Sample Interview Questions & Answers

Q: Describe a time you designed a test suite for a complex system. How did you ensure it was both thorough and not overly restrictive?
A: I once built a test suite for a microservices-based e-commerce platform. I started by mapping out all critical user journeys and edge cases. To avoid over-restriction, I used parameterized tests and allowed for multiple valid solutions. I also implemented a review process where peers could flag tests that were too strict or too lenient, iterating based on feedback.
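
For illustration, here is a minimal pytest sketch of the parameterized approach described above; the checkout module, apply_discount function, and rounding tolerance are hypothetical stand-ins, not artifacts of any real project.

```python
# A minimal sketch, assuming a hypothetical `checkout.apply_discount(price, percent)`.
import math

import pytest

from checkout import apply_discount  # hypothetical module under test


@pytest.mark.parametrize(
    "price, percent, expected",
    [
        (100.00, 10, 90.00),
        (19.99, 50, 10.00),  # 9.99 or 10.00 are both acceptable roundings
        (0.00, 25, 0.00),
    ],
)
def test_apply_discount_accepts_valid_roundings(price, percent, expected):
    # Compare within a small tolerance so any correct rounding strategy
    # passes, rather than pinning the test to one implementation.
    assert math.isclose(apply_discount(price, percent), expected, abs_tol=0.01)
```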
Q: How would you approach creating a task that challenges a frontier AI model? What factors do you consider?
A: I would first analyze known failure modes of current models, such as handling ambiguous requirements or multi-step reasoning. I would design a task that requires understanding a complex codebase, making trade-offs, and producing a solution that balances correctness and efficiency. I would also include adversarial scenarios where a straightforward answer is wrong, forcing the model to think critically.
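
As a toy illustration of "a straightforward answer is wrong": counting overlapping occurrences of a substring. Python's built-in str.count() skips overlaps, so the obvious one-liner fails the adversarial case below. This is a sketch for illustration only, not a task from the project.

```python
def count_occurrences(haystack: str, needle: str) -> int:
    """Count occurrences of needle in haystack, including overlapping ones."""
    count, start = 0, 0
    while (idx := haystack.find(needle, start)) != -1:
        count += 1
        start = idx + 1  # advance by one so overlapping matches are counted
    return count


def test_overlapping_occurrences():
    assert "aaa".count("aa") == 1               # the "straightforward" answer
    assert count_occurrences("aaa", "aa") == 2  # the correct answer
```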
Q: Explain your experience with Docker and how you would set up a simulated environment for evaluating an AI agent.
A: I have used Docker extensively to create isolated development environments. For evaluation, I would containerize a Linux workstation with all necessary tools (terminal, CLI, MCP servers) and a real web application codebase. I would ensure the environment is reproducible and includes realistic context like documentation and ticket history. The agent would interact with this environment as if it were a real developer.
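
A minimal sketch of what driving such a container could look like from Python, using subprocess and the Docker CLI; the image name, mount path, and run_task.sh entrypoint are illustrative assumptions.

```python
# A minimal sketch, assuming a Dockerfile in the current directory and a
# hypothetical run_task.sh entrypoint inside the task codebase.
import subprocess

IMAGE = "agent-eval-workstation"  # hypothetical image name


def build_environment() -> None:
    # Build the workstation image from a pinned Dockerfile so every
    # evaluation run starts from the same reproducible state.
    subprocess.run(["docker", "build", "-t", IMAGE, "."], check=True)


def run_agent_session(workdir: str) -> int:
    # Run one isolated, disposable session with the task codebase mounted;
    # --rm discards the container so sessions cannot contaminate each other.
    result = subprocess.run(
        ["docker", "run", "--rm", "-v", f"{workdir}:/workspace", IMAGE,
         "bash", "-lc", "cd /workspace && ./run_task.sh"],
        capture_output=True, text=True,
    )
    return result.returncode


if __name__ == "__main__":
    build_environment()
    print("exit code:", run_agent_session("/tmp/task-42"))
```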
Q: How do you handle the challenge of writing tests that accept all correct solutions while rejecting incorrect ones, especially when there are many valid approaches?
A: I start by defining clear evaluation criteria that focus on the outcome rather than the implementation. I use property-based testing and allow for multiple solution patterns. I also test the tests themselves by running them against known good and bad solutions, iterating until they correctly classify all cases. I pay close attention to edge cases and potential false positives/negatives.
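
A minimal property-based sketch using the Hypothesis library, along the lines described above; normalize_path and the idempotence property are illustrative assumptions about the solution under test.

```python
# A minimal sketch, assuming a hypothetical `solution.normalize_path(path)`.
from hypothesis import given, strategies as st

from solution import normalize_path  # hypothetical solution under test


@given(st.text(alphabet="abc/.", min_size=1))
def test_normalize_is_idempotent(path):
    # Property: normalizing twice equals normalizing once, for every input.
    # This rejects buggy solutions without pinning down one implementation.
    once = normalize_path(path)
    assert normalize_path(once) == once
```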
Q: Describe your experience with CI/CD pipelines, specifically GitHub Actions, and how you would use them in this role.
A: I have used GitHub Actions to automate testing and deployment workflows. For this role, I would set up pipelines to automatically run evaluation tests when a task is submitted, ensuring consistent and fast feedback. I would configure triggers for pull requests and use labels to categorize tasks. I would also monitor results to catch regressions in evaluation quality.
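
As a sketch of the label-based gating mentioned above, here is a small Python step a workflow job could run to decide whether evaluation should proceed; the run-evaluation label name is an assumed convention, while GITHUB_EVENT_PATH is the standard variable GitHub Actions sets to the event payload file.

```python
# A minimal sketch of a workflow gate; exit code 0 means "run evaluation".
import json
import os
import sys


def should_run_evaluation() -> bool:
    event_path = os.environ.get("GITHUB_EVENT_PATH")
    if not event_path:
        return False  # not running inside GitHub Actions
    with open(event_path) as f:
        event = json.load(f)
    labels = {lbl["name"] for lbl in event.get("pull_request", {}).get("labels", [])}
    return "run-evaluation" in labels  # hypothetical label convention


if __name__ == "__main__":
    sys.exit(0 if should_run_evaluation() else 1)
```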

Please submit your CV in English and indicate your level of English proficiency.

Mindrift connects specialists with project-based AI opportunities for leading tech companies, focused on testing, evaluating, and improving AI systems. Participation is project-based, not permanent employment.

What this opportunity involves

We’re building a dataset to evaluate AI coding agents — how well a model handles real-world developer tasks. You’ll create challenging tasks and evaluation criteria within realistic simulated environments:

  • Build virtual companies following a high-level plan – codebase, infrastructure, and context (conversations, documentation, tickets) that form a realistic environment with development history
  • Assemble and calibrate tasks from intermediate states of the virtual company: craft the prompt, define evaluation criteria, and ensure the task is solvable and the evaluation is fair
  • Design tasks set in isolated environments – emulations of a developer’s workstation: a Linux machine with development tools (terminal, CLI), MCP servers (repository, task tracker, messenger, documentation, etc.), and a real web application codebase
  • Write tests that accept all correct solutions and reject incorrect ones – neither too strict (breaking on valid approaches) nor too lenient (passing bad ones); see the sketch after this list
  • Iterate with an AI agent on tests – verifying they catch real problems, don’t miss bad solutions, and don’t break on good ones
  • Review code written by agents, analyze why an agent failed or succeeded, and design edge cases and adversarial scenarios
  • Iterate based on feedback from expert QA reviewers who score your work on quality criteria
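
To make the "not too strict, not too lenient" point concrete, here is a minimal sketch of an outcome-focused test; the solution module and the merge_intervals task are hypothetical examples, not actual project tasks. The test pins down the observable contract, including an adversarial touching-intervals case, while leaving the algorithm and output order free.

```python
# A minimal sketch, assuming a hypothetical `solution.merge_intervals(intervals)`
# whose spec says touching intervals such as (1, 2) and (2, 3) must merge.
import pytest

from solution import merge_intervals  # hypothetical solution under test


@pytest.mark.parametrize(
    "intervals, expected",
    [
        ([], []),
        ([(1, 3)], [(1, 3)]),
        ([(1, 3), (2, 6), (8, 10)], [(1, 6), (8, 10)]),
        # Adversarial case: touching intervals must merge (a common off-by-one trap).
        ([(1, 2), (2, 3)], [(1, 3)]),
    ],
)
def test_merge_intervals_contract(intervals, expected):
    result = merge_intervals(intervals)
    # Compare as sorted lists so solutions returning any order still pass.
    assert sorted(result) == sorted(expected)
```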

What this is NOT

  • Not data labeling
  • Not prompt engineering
  • Not writing code from scratch – the agent writes most of the code; you guide and evaluate

A significant part of the work is done together with AI – it’s very hard to create tasks that challenge frontier models without using frontier models.

What we look for

This opportunity is a good fit for experienced developers, software engineers, and/or test automation specialists open to part-time, non-permanent projects. Ideally, contributors will have:

  • Degree in Computer Science, Software Engineering, or related fields
  • 5+ years in software development, primarily Python (FastAPI, pytest, async/await, subprocess, file operations)
  • Background in full-stack development, with experience building React-based interfaces (JavaScript/TypeScript) and robust back-end systems
  • Experience writing tests (functional, integration — not just running them)
  • Experience with Docker containers, and familiarity with infrastructure tools (Postgres, Kafka, Redis)
  • CI/CD understanding (GitHub Actions as a user: triggers, labels, reading results)
  • English proficiency – B2

You don’t need to be an expert in every item, but you should be comfortable reading and reasoning about code across the stack.

Why this is hard

  1. Frontier models are already good at coding. Creating a task that genuinely challenges the best models is non-trivial. You need to deeply understand where models fail and what scenarios reveal the difference between a good and a bad solution.
  2. Tasks have many valid solutions. Writing tests that accept all correct solutions and reject incorrect ones is harder than it sounds.

How it works

Apply → Pass qualification(s) → Join a project → Complete tasks → Get paid

Effort estimate

Tasks for this project are estimated to take around 20 hours to complete, depending on complexity. This is an estimate and not a schedule requirement; you choose when and how to work. Tasks must be submitted by the deadline and meet the listed acceptance criteria to be accepted.

Compensation

On this project, contributors can earn up to $50 per hour equivalent, depending on their level and pace of contribution.

Compensation varies across projects depending on scope, complexity, and required expertise. Please note that other projects on the platform may offer different earning levels based on their requirements.


This job listing has been manually reviewed by the Jobicy Trust & Safety Team for compliance with our posting guidelines, including verification of the company's legitimacy, accuracy of job details, clarity of remote work policy, and absence of misleading or fraudulent content.
