Strategy
How to evaluate a dev agency's AI workflow before you sign
90% of development teams now use AI coding tools (GitHub 2025 survey). AI-assisted engineers ship boilerplate 30-50% faster. AI-generated pull requests also contain 1.7x more bugs than human-written code (GitClear 2025). The question isn't whether your agency uses AI. The question is how.
Every agency you talk to in 2026 will mention AI. They'll tell you it makes them faster, cheaper, better. Some of them are right. Others are shipping AI-generated code with no review process, no security scanning, and no senior oversight. You can't tell the difference from a sales deck.
You can tell the difference from ten specific questions.
Why an agency's AI workflow matters before you sign
Agencies that use AI without review processes ship faster at first. Then the rework starts. AI tools hallucinate API calls that don't exist. They generate code that passes basic tests but fails under real-world conditions. They reproduce security vulnerabilities from their training data. Without senior engineers catching these patterns, you pay twice: once for the initial build, and again to fix what the AI got wrong.
Agencies that refuse AI leave speed on the table. Scaffolding a CRUD interface, generating boilerplate, writing standard form validations; these are tasks where AI saves hours per week. An agency that insists on writing every line by hand is spending your budget on work a tool handles in minutes.
You want the middle ground: AI acceleration with senior review gates. The ten questions below help you identify which agencies have found it and which are guessing.
10 questions to ask any agency about their AI workflow
1. Which AI tools does your team use, and for what tasks?
This question separates agencies with a real AI workflow from those using buzzwords in their pitch deck. An agency with a structured process will name specific tools for specific tasks: Cursor for scaffolding new components, Claude Code for refactoring legacy functions, GitHub Copilot for autocomplete suggestions during pair programming.
Green flag: specific tool-to-task mapping. "We use Cursor for generating React components and Claude Code for breaking apart large functions." Red flag: vague answers like "we use AI for everything" or an inability to name their tools. Both signal that the team hasn't defined boundaries around AI usage.
2. What percentage of your code is AI-generated vs human-written?
This question reveals how dependent the agency is on AI output. A healthy ratio sits between 20% and 40% AI-generated code with human review on every line. That range means the team uses AI for repetitive tasks while engineers own the architecture, business logic, and edge-case handling.
Green flag: a specific percentage with context. "About 30% of our code starts as AI output, concentrated in CRUD operations and form validations. Engineers rewrite 10-15% of that during review." Red flag: "most of our code is AI-generated" or "we don't track that." The first means they've outsourced engineering judgment to a language model. The second means they don't have a process at all.
3. Who reviews AI-generated code before it ships?
Code review is the single most important quality gate in any AI-augmented workflow. Every pull request, whether a human wrote it or an AI generated it, should go through the same review process. The reviewer needs enough experience to catch subtle errors that pass tests but break in production.
Green flag: senior engineers review every PR. The agency treats AI output the same as junior developer output; it needs sign-off from someone who understands the system. Red flag: no review process, or junior developers reviewing AI-generated code. Junior engineers often lack the context to identify hallucinated API calls or deprecated patterns that AI tools produce confidently.
4. How do you handle AI hallucinations in code?
AI tools generate plausible-looking code that calls APIs that don't exist, references deprecated methods, or invents configuration options. These hallucinations compile and sometimes pass basic tests. They break in production when the nonexistent API returns a 404 or the deprecated method gets removed in the next framework update.
Green flag: the agency gives you specific examples of hallucinations they've caught. "Last month, Copilot suggested a Stripe API method that was removed in v2023-08. Our reviewer caught it because the type signature didn't match our SDK version." Red flag: "that doesn't happen with our tools." It happens with every AI coding tool. An agency that claims otherwise hasn't looked closely enough.
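The review gate described above can be partly automated. The sketch below is a hypothetical illustration, not any real vendor's SDK: `PaymentsClient` and its methods are invented names standing in for whatever SDK your project uses. It shows the principle behind catching a hallucinated method before merge; in practice a type checker such as mypy running against the SDK's type stubs does this same job at review time.

```python
# Hypothetical example: a minimal stand-in for a payments SDK.
# PaymentsClient and create_charge are invented for illustration;
# they do not correspond to any real vendor API.

class PaymentsClient:
    """Stand-in SDK exposing only the methods that actually exist."""

    def create_charge(self, amount_cents: int, currency: str) -> dict:
        return {"amount": amount_cents, "currency": currency, "status": "succeeded"}


def call_exists(client: object, method_name: str) -> bool:
    """Cheap pre-merge check: does the method the AI suggested exist on the SDK?"""
    return callable(getattr(client, method_name, None))


client = PaymentsClient()

# A method an AI might plausibly hallucinate vs. the one the SDK really has.
print(call_exists(client, "capture_all_charges"))  # False: hallucinated
print(call_exists(client, "create_charge"))        # True: real
```

The point is that hallucinations are mechanically detectable: the suggested call either resolves against the installed SDK version or it doesn't, which is why agencies that pin SDK versions and run type checks in CI catch these before production does.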
5. What's your security scanning process for AI-generated code?
AI tools reproduce vulnerable patterns from their training data. A 2024 Stanford study found that developers using AI coding assistants produced code with 2.74x more security vulnerabilities than developers working without AI. The AI doesn't flag its own vulnerable output. You need automated scanning in the CI pipeline to catch what human review misses.
Green flag: automated SAST (static application security testing) and DAST (dynamic application security testing) tools running on every commit. Tools like Snyk, Semgrep, or SonarQube integrated into the CI pipeline so vulnerable code can't merge without a security review. Red flag: "we rely on manual review" or "we trust the AI to write secure code." Manual review alone misses injection patterns and insecure deserialization that automated scanners catch in seconds.
6. Can you show me a recent PR with AI-assisted code?
This is the transparency test. An agency with a mature AI workflow will walk you through a real pull request. They'll show you what the AI generated, what the reviewer changed, and why. They'll point to comments where an engineer flagged a hallucinated dependency or rewrote a function the AI over-complicated.
Green flag: willingness to share. The agency opens a PR, shows the diff, and explains their review comments. This takes five minutes and tells you more about their process than any slide deck. Red flag: "our process is proprietary" or an outright refusal. If they can't show you a single example, they either don't have a process worth showing or they're hiding the quality of their AI-assisted output.
7. How does AI affect your project timeline and pricing?
AI tools save time on specific tasks: scaffolding a data model, generating test boilerplate, creating standard API endpoints. These savings are real and measurable: 30-50% faster on repetitive code. A good agency passes some of those savings to you through lower costs or increased scope within the same budget.
Green flag: specific claims tied to specific tasks. "AI saves us 8-12 hours per sprint on CRUD scaffolding. That lets us include the admin dashboard in your initial scope instead of pushing it to phase two." Red flag: "AI makes everything faster" without task-level specifics. This usually means the agency hasn't measured their AI impact and is using the claim as a marketing line.
8. What tasks do you NOT use AI for?
This question is more revealing than asking what they use AI for. Experienced teams know where AI creates risk. Architecture decisions require understanding trade-offs across the entire system. Security-critical code needs a human who understands threat models. Database migrations can destroy production data if the AI generates an incorrect rollback script. Business logic encodes your competitive advantage; handing it to a model trained on public code is a poor bet.
Green flag: a clear list of AI-free zones. "We don't use AI for architecture decisions, database migrations, authentication flows, payment processing logic, or anything touching PII." Red flag: "we use AI for everything." An agency that applies AI to every task hasn't thought about where AI creates more risk than value.
9. How do you handle intellectual property with AI tools?
Some AI coding tools send your code to third-party servers for processing. Whether those snippets are retained, and whether they feed model training, depends on the tool, the plan tier, and settings your organization may need to configure: GitHub Copilot, Claude Code, and Cursor all route code context through vendor servers, and their retention policies vary by plan and change over time. If your project involves proprietary algorithms, trade secrets, or regulated data, you need to know where your code goes.
Green flag: the agency has a documented data policy. They know which tools send data externally, they've opted out of training data collection where possible, and they avoid sending proprietary business logic to public models. Red flag: no policy. If the agency hasn't considered where your code ends up when they paste it into an AI tool, they're exposing your IP without your consent.
10. What happens when AI tools produce wrong output on my project?
AI will produce incorrect output. That's a certainty, not a risk. The question is who pays for the fix. If the agency uses AI to speed up their work, the cost of AI mistakes belongs to the agency. You hired them to deliver working software, not to debug their tools at your expense.
Green flag: the agency eats the cost of rework caused by AI errors. Their fixed-price quote accounts for the reality that AI output needs correction. Your invoice doesn't include line items for "debugging AI-generated code." Red flag: billable hours for debugging AI output. If you're paying hourly rates for an engineer to fix what the AI broke, you're subsidizing a tool that benefits the agency's efficiency while increasing your cost.
Red flags vs green flags at a glance
| Green flag | Red flag |
|---|---|
| Names specific AI tools for specific tasks | Vague claims: "we use AI for everything" |
| 20-40% AI-generated code with tracked metrics | "Most of our code is AI-generated" or no tracking |
| Senior engineers review every PR | No review process, or juniors reviewing AI output |
| Gives examples of catching AI hallucinations | "That doesn't happen with our tools" |
| Automated SAST/DAST scanning in CI pipeline | Manual review only, or "we trust the AI" |
| Walks you through a real PR with AI code | Refuses to show examples; "proprietary process" |
| AI savings tied to specific tasks and timelines | "AI makes everything faster" with no specifics |
| Clear list of tasks where AI is not used | No AI-free zones for security or architecture |
| Documented data policy for AI tools | No policy on where your code goes |
| Agency absorbs cost of AI rework | Billable hours to debug AI mistakes |
The 29% trust gap
Stack Overflow's 2025 Developer Survey found that only 29% of developers trust AI-generated code without review. The remaining 71% treat AI output as a first draft that needs human verification. The best agencies share this skepticism.
Think about what that means for your project. If 71% of professional developers don't trust AI output without review, an agency that ships AI-generated code with no review process is operating below the standard that most individual developers hold themselves to. They're not being efficient. They're skipping the step that separates working software from code that breaks in production.
The agencies worth hiring treat AI as a drafting tool. AI writes the first version. A senior engineer rewrites the parts that matter, catches the hallucinations, fixes the security gaps, and makes the architectural calls that determine whether your software scales or collapses under its own complexity at 10x traffic.
How Savi uses AI in client projects
Every Savi project is staffed with 1-2 senior engineers who own the full stack. Those engineers use Cursor and Claude Code for scaffolding, boilerplate generation, and mechanical refactoring. Every line of AI output goes through the same PR review process as human-written code. If the AI produces it, a senior engineer reviews it before it touches the main branch.
AI handles the repetitive 60%: CRUD endpoints, form validations, data model scaffolding, test boilerplate. Engineers handle architecture, security, business logic, and the integration work that requires understanding how your system fits together. On ZestAMC's 5-portal finance platform, AI handled CRUD scaffolding for the investor and sub-broker dashboards while senior engineers built the payout calculation engine and compliance audit trails. The result: a $10M+ AUM platform shipped in 30 days with zero security incidents in production.
You communicate directly with your engineer. No project manager layer. No game of telephone where your requirements get translated three times before someone writes code. That direct line means you can ask any of the ten questions above and get an answer from the person doing the work. For a deeper look at what AI coding tools can and can't do, read our breakdown of AI coding assistants in 2026. If you're curious about what happens when teams skip the review step entirely, our post on the real cost of vibe coding covers the failure modes in detail.
Frequently asked questions
Should I hire an agency that uses AI coding tools?
Yes, if they have a structured review process. Agencies that pair AI tools with senior engineer review ship boilerplate 30-50% faster without increasing bug rates. The red flag is agencies that use AI without code review gates or can't explain which tasks AI handles versus which tasks humans own.
How do I know if an agency's AI workflow is safe?
Ask three questions: Who reviews AI-generated code? What security scanning runs in the CI pipeline? What's their policy on sending your code to third-party AI APIs? Safe agencies run automated SAST/DAST scans, have senior engineers review every pull request, and use AI tools with clear data retention policies.
Does AI-generated code have more bugs?
GitClear's 2025 analysis found AI-generated pull requests contain 1.7x more bugs than human-written code. The primary causes are hallucinated APIs, deprecated method calls, and missing edge-case handling. Senior code review catches these issues before they reach production.
Will AI make my software project cheaper?
AI reduces cost on repetitive tasks like CRUD scaffolding, boilerplate generation, and standard UI components. These savings range from 10% to 25% on typical projects. AI does not reduce cost on architecture decisions, security design, business logic, or integration work. Agencies that claim AI cuts total project cost by 50% or more are cutting corners on review.
What AI coding tools do professional developers use in 2026?
The most common tools are GitHub Copilot (autocomplete), Cursor (AI-assisted editing), and Claude Code (refactoring and code generation). Professional teams use these tools for specific tasks like scaffolding and boilerplate, not for architecture decisions or security-critical code. 90% of development teams report using at least one AI coding tool (GitHub 2025 survey).
Related reading
AI coding assistants: what they can and can't do for your product
84% of developers use AI coding tools. They ship boilerplate 30-50% faster. They also generate 2.74x more security vulnerabilities.
How to evaluate and hire a software development agency
Red flags, green flags, and the questions that separate competent agencies from those that will burn your budget.
The real cost of vibe coding: what Lovable and Bolt won't tell you
You're burning 400 credits an hour fixing AI mistakes. 30-40% of your prompts go to debugging.
Want to see how we use AI on real projects?
We'll walk you through our workflow, tools, and review process in a 30-minute call. No pitch, no obligation.
Book a free consultation