The AI Engineer Hiring Guide: Why Most Candidates Are Lying


A motley gang of venture capitalists, LinkedIn influencers, and recently-laid-off product managers have hijacked your company’s hiring pipeline and are threatening to sink it in 30 days with "AI strategy" unless their demands are met. The demand? A full-stack AI Engineer who can build agentic systems, understands the difference between Context Graphs and GraphRAG, knows that MCP and A2A aren’t competing protocols but solve different problems, and actually ships things.

You post the job listing. Within 48 hours, you’re drowning in 847 resumes from people who took a two-hour prompt engineering course on Udemy and now list "AI/ML Expert" on their LinkedIn headline. You’ve got former data analysts claiming they "architected enterprise LLM solutions." You’ve got fresh graduates with "GPT-4 Integration" listed as a skill, which apparently means they called the API once in a hackathon project that crashed.

The Love Boat is sinking. Do you:

  • (A) Hire the first person who can spell "transformer" correctly?
  • (B) Give up and tell the board that AI is overhyped?
  • (C) Learn how to actually evaluate AI Engineers?

Let me save you some time: the answer is (C). But first, you need to understand what you’re actually looking for. Because right now, you probably don’t.



The market is hot. The signal-to-noise ratio is terrible.

According to LinkedIn’s Jobs on the Rise 2025 and 2026 reports, AI Engineer has been the #1 fastest-growing job title for two consecutive years. A 130% increase in AI engineering talent since 2016. 7 out of every 1,000 LinkedIn members globally are now considered AI engineering talent. 1.3 million new AI-related jobs in just two years.

Salary range? $106,000 to $206,000 depending on who you ask. Median prior experience? 3.7 years. Most common prior roles? Software Engineer, Data Scientist, Full Stack Engineer.

Here’s the problem: this explosive growth means everybody and their dog now claims to be an AI Engineer. The title has been inflated, diluted, and co-opted by people who think "AI Engineer" sounds fancier than "person who copied code from ChatGPT and hoped it worked."

You’re going to see three types of candidates. Bootcamp graduates who actually believe they’re qualified to build production AI systems. Brilliant superstars who build custom evaluation harnesses for fun. And a large number of "maybes" who seem like they might be able to contribute something.

You don’t want to hire any of the maybes. Ever.

What an AI Engineer actually is (30-second version)

An AI Engineer is a software engineer who specializes in integrating, deploying, and building applications with foundational AI models. They don’t build models from scratch (that’s an ML Engineer). They don’t analyze data for business insights (that’s a Data Scientist). They’re not prompt engineers. Prompt engineering is a baseline skill that AI Engineers have, like how all software engineers know Git.

The practical distinction: ML Engineers build models. AI Engineers build with models.

ML Engineers work with structured data (databases, tables, feature vectors). AI Engineers work with unstructured data (text, images, video, audio). Ask a candidate about their last project. If they talk about training loss curves, that’s an ML Engineer. If they talk about RAG pipelines, evaluation harnesses, and agent orchestration, that’s an AI Engineer.


The 60-Second Cheat Sheet

If you’re a hiring manager who doesn’t have time to read a 7,000-word guide, here’s your one-screen summary:

✅ Must-Have Signals (all six required)

  1. Has shipped a real AI system to production – not a hackathon demo, not a tutorial project, a real system with real users
  2. Built their own evaluation harness – custom test sets, automated scoring, not just "we checked if it looked good"
  3. Has failure stories – can describe an AI system that broke in production and how they diagnosed/fixed it
  4. Demonstrates cost discipline – knows when to use cheap fast models vs expensive reasoning models, has optimized API spend
  5. Shows security awareness – understands prompt injection is OWASP’s #1 LLM risk, can discuss mitigations
  6. Understands context management – can explain compaction strategies, their tradeoffs, and why you can’t just stuff everything into the context window

🚫 Instant No-Hire Signals (any one is disqualifying)

  • "I built a RAG pipeline" with no follow-up on retrieval strategies or failure modes
  • Ingestion is just "chunk and embed." No awareness that you need to create summaries, hierarchies, and metadata at ingestion time to support different query types.
  • No evaluation infrastructure – relied entirely on "it looked right" or provider benchmarks
  • No debugging stories – never encountered a weird LLM failure in production
  • Dismisses prompt injection – "that’s not a concern for our use case"
  • Only knows one model – all projects use OpenAI, can’t discuss tradeoffs
  • Can’t explain their own code – classic vibe coder who prompts until something works
  • Doesn’t understand tool calling mechanics – thinks the LLM "executes" tools, can’t explain where execution actually happens or the difference between a tool schema and a tool call
  • No context management strategy – "we just send the whole conversation" or blank stare when asked about compaction

The Fastest Filter: The Trace Review

Hand the candidate a real agent trace with a failure. Give them 15 minutes. Ask:

  • "Where did it go wrong?"
  • "What would you instrument to debug this?"
  • "What would you change first?"

This single exercise eliminates 70% of pretenders. Real AI Engineers live in traces. Tutorial followers have never seen one.
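If you don't have a real failure trace on hand, a synthetic one takes minutes to build. The sketch below (all names and the trace itself are invented for illustration) shows an agent stuck re-issuing the same tool call because an empty result never changes its plan, plus the kind of trivial repeated-call detector a strong candidate might reach for first:

```python
import json

# A synthetic agent trace: the model keeps issuing the identical
# search call because the empty result never changes its plan.
trace = [
    {"role": "assistant", "tool_call": {"name": "search_docs", "args": {"query": "refund policy"}}},
    {"role": "tool", "name": "search_docs", "content": "[]"},
    {"role": "assistant", "tool_call": {"name": "search_docs", "args": {"query": "refund policy"}}},
    {"role": "tool", "name": "search_docs", "content": "[]"},
    {"role": "assistant", "tool_call": {"name": "search_docs", "args": {"query": "refund policy"}}},
]

def find_repeated_calls(trace, threshold=2):
    """Flag tool calls issued more than `threshold` times with identical args."""
    counts = {}
    for step in trace:
        call = step.get("tool_call")
        if call:
            key = (call["name"], json.dumps(call["args"], sort_keys=True))
            counts[key] = counts.get(key, 0) + 1
    return {k: n for k, n in counts.items() if n > threshold}

print(find_repeated_calls(trace))  # flags search_docs issued 3x with identical args
```

A candidate who has debugged real agents will spot the loop by eye before writing any code; what you're watching for is whether they then ask about the retry budget, the tool's error contract, and what the harness should have done after the second empty result.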


The Interview Framework

At the end of the interview, you must be prepared to make a sharp decision. There are only two outcomes: Hire or No Hire.

Never say "Hire, but not for my team." That translates to "No Hire."

Never say "Maybe, I can’t tell." If you can’t tell, that means No Hire. A bad hire in AI engineering doesn’t just cost you one headcount. They’ll ship broken systems that create technical debt for years.

This framework is adapted from Joel Spolsky’s Guerrilla Guide to Interviewing, updated for the AI age.

Round 1: Technical Fundamentals (45 minutes)

Start with a question about their most impactful AI project. Let them talk for 10-15 minutes, uninterrupted. Look for: passion (do they get animated?), depth (can they explain the technical details?), and honesty (do they acknowledge failures and limitations?).

Then go into technical questions:

System architecture: "How do you build reliable systems on top of non-deterministic components?" "Walk me through how you’d design an agent with proper guardrails, fallbacks, and hard loop cutoffs." "What feedback mechanisms do you use to catch when an agent is going off the rails?"

Tool calling mechanics: "Walk me through exactly what happens when an LLM ‘calls’ a tool. Where does execution actually occur?" "What’s the difference between a tool schema and a tool call?" This single question eliminates a surprising number of candidates who’ve only followed tutorials.

Retrieval depth: "Walk me through how you’d diagnose and fix a RAG system that’s returning irrelevant results." "When would you use knowledge graphs versus hybrid semantic-keyword search?"

Ingestion design: "A user wants to ask ‘summarize document X.’ How does your ingestion pipeline need to be designed to support that?" If they don’t mention pre-computing summaries at ingestion time, they’ve only built demos.

Evaluation literacy: "How would you evaluate a RAG-based system? What metrics?" "What’s the difference between offline and online evaluation?"

Practical knowledge: "You need to deploy a 70B parameter model. Walk me through quantization options." "What is context rot (accuracy degradation when relevant information is buried in long contexts), and how do you prevent it?"

Context management: "Your agent is in a multi-turn conversation that’s approaching the context limit. What are your options, and what are the tradeoffs of each?" This question separates people who’ve built real conversational systems from those who’ve only done single-turn demos.

Round 2: System Design (60 minutes)

Give them a real problem:

"Design an AI agent that can autonomously research topics and produce reports."

"Design a RAG-based enterprise Q&A system with confidential documents, multiple user roles, and citation requirements."

"Design a content moderation system using LLMs that needs to handle text, images, and adversarial inputs."

Evaluate on:

  • Did they clarify requirements before jumping in?
  • Did they discuss failure modes proactively?
  • Did they mention evaluation and monitoring?
  • Did they consider security implications?
  • Did they discuss cost/performance tradeoffs?
  • For RAG systems: Did they think about ingestion design, not just retrieval?

A senior AI Engineer should naturally bring up model selection, context management, evaluation, and graceful degradation. If they only talk about the happy path, they’re not ready.

Round 3: The Trace Review (15-30 minutes)

This is the killer exercise. Prepare an agent trace from a real (or realistic) failure:

  • An agent that got stuck in a loop
  • A RAG system that retrieved irrelevant context
  • A multi-turn conversation where context rot caused a wrong answer
  • An agent that misused a tool based on ambiguous instructions

Hand it to the candidate. Ask them to diagnose it. Watch how they work.

What you’re looking for:

  • Do they know how to read traces?
  • Do they form hypotheses and test them?
  • Do they identify the root cause, not just the symptom?
  • Do they suggest instrumentation they’d add?
  • Do they propose a fix and explain why it would work?

Candidates who’ve shipped real systems will be comfortable here. Tutorial followers will flounder.

Round 4: Practical Exercise (90 minutes, can be take-home)

Options:

"Build a simple RAG pipeline for this set of documents. Include an evaluation harness."

"Here’s an existing agent implementation. Identify three bugs or improvement opportunities and fix one."

"Write evaluation criteria for [specific use case]. Then implement an automated judge."

Look for: code quality, error handling, edge case consideration, and whether they actually test their work.
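To calibrate what "includes an evaluation harness" means, here is a minimal sketch: a fixed test set, an automated scorer, and a pass/fail gate you could wire into CI. Everything here is illustrative; the keyword-overlap scorer is a deliberate placeholder for an LLM judge or retrieval-grounded metrics, and `fake_system` stands in for the system under test:

```python
# Minimal evaluation harness: fixed test set, automated scorer,
# pass/fail gate. The keyword scorer is a placeholder -- in practice
# you'd swap in an LLM judge. All names here are illustrative.

TEST_SET = [
    {"question": "What is our refund window?",
     "must_mention": ["30 days", "receipt"]},
    {"question": "Do we ship internationally?",
     "must_mention": ["international", "customs"]},
]

def fake_system(question):
    # Stand-in for the system under test.
    answers = {
        "What is our refund window?": "Refunds within 30 days with a receipt.",
        "Do we ship internationally?": "Yes, international orders may incur customs fees.",
    }
    return answers.get(question, "")

def score(answer, must_mention):
    """Fraction of required facts the answer actually mentions."""
    hits = sum(1 for kw in must_mention if kw.lower() in answer.lower())
    return hits / len(must_mention)

def run_harness(system, test_set, threshold=0.8):
    results = [score(system(c["question"]), c["must_mention"]) for c in test_set]
    mean = sum(results) / len(results)
    return {"mean_score": mean, "passed": mean >= threshold}

print(run_harness(fake_system, TEST_SET))
```

A candidate who submits even something this simple, unprompted, is ahead of most of the field. A candidate who submits nothing but "I ran it a few times and it looked right" has told you everything you need to know.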

The Ingestion Pipeline Test (This Is Where Real Understanding Shows)

If you give them the RAG exercise, pay close attention to how they think about ingestion, not just retrieval. This is where you separate people who understand the system end-to-end from people who followed a LangChain tutorial.

The naive approach: split documents into chunks, embed each chunk, store in vector database. Done.

The problem: this approach can only answer questions where the answer exists verbatim in a chunk and where the query semantically matches that chunk. It completely fails on:

  • "Summarize document X." Semantic search matches on the word "summarize," which is useless. The chunks contain document content, not summaries. You’ll retrieve random chunks that happen to be vaguely similar to the concept of summarization.

  • "What are the key themes across these documents?" No individual chunk contains cross-document themes. The information doesn’t exist in retrievable form.

  • "How does section 3 relate to the conclusion?" Chunks don’t know what section they came from or how sections relate to each other. Document hierarchy was destroyed during chunking.

  • "What’s missing from this analysis?" Finding gaps requires understanding what should be there. Pure semantic similarity can’t reason about absence.

What a real AI Engineer thinks about at ingestion time:

  1. Pre-compute summaries. Generate document-level and section-level summaries during ingestion and embed them as separate chunks with metadata indicating they’re summaries. Now when someone asks "summarize X," you can retrieve the actual summary.

  2. Preserve hierarchy. Tag chunks with their position in document structure: which document, which section, which subsection, what comes before and after. Store parent-child relationships. Now you can answer questions about document structure.

  3. Extract entities and relationships. During ingestion, pull out key entities (people, products, dates, concepts) and their relationships. Store these as structured metadata or in a knowledge graph. Now you can do multi-hop reasoning.

  4. Generate multiple representations. The same content might need different embeddings for different query types. A chunk might have: its literal embedding, a "what question does this answer?" embedding (HyDE-style), a topic/category embedding, and a summary embedding.

  5. Create cross-document features. If you have multiple documents, generate features that span them: common themes, contradictions, timeline of events, key entities that appear across documents.
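The first two of those ideas can be sketched in a few lines. This is an illustrative shape only, with `summarize` stubbed out; in a real pipeline it's an LLM call, and every record would also get an embedding before indexing:

```python
# Ingestion sketch: besides raw chunks, emit derived artifacts --
# per-section summaries, a document-level summary, and hierarchy
# metadata -- so queries like "summarize document X" have something
# to retrieve. `summarize` is a stub for an LLM call.

def summarize(text):
    # Placeholder for an LLM summarization call.
    return "SUMMARY: " + text[:60] + "..."

def ingest(doc_id, sections):
    records = []
    for i, (heading, body) in enumerate(sections):
        # 1. Raw chunks, tagged with their position in the hierarchy.
        records.append({
            "doc_id": doc_id, "kind": "chunk", "section": heading,
            "section_index": i, "text": body,
        })
        # 2. Per-section summaries as separately retrievable records.
        records.append({
            "doc_id": doc_id, "kind": "section_summary", "section": heading,
            "section_index": i, "text": summarize(body),
        })
    # 3. One document-level summary for "summarize document X" queries.
    full_text = " ".join(body for _, body in sections)
    records.append({"doc_id": doc_id, "kind": "doc_summary", "text": summarize(full_text)})
    return records

records = ingest("doc-42", [("Intro", "We studied widget failure rates."),
                            ("Results", "Failures dropped 40% after the fix.")])
print([r["kind"] for r in records])
```

The point isn't the code; it's that the record types beyond `chunk` exist at all. A candidate whose ingestion emits only one kind of record has, by construction, a system that can only answer one kind of question.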

The interview signal:

When a candidate describes their RAG approach, listen for whether they think about retrieval at ingestion time. Do they ask "what kinds of questions will users ask?" before deciding how to chunk and embed? Do they create derived artifacts (summaries, entity extracts, hierarchical metadata) that don’t exist in the source documents but are essential for certain query types?

If their entire answer is "chunk it, embed it, retrieve top-k," they’ve only built demos. They’ve never had a user ask "summarize this document" and watched their system return garbage because the retrieval fundamentally couldn’t support that query type.

Questions to probe this:

  • "How would you design an ingestion pipeline to support queries like ‘summarize document X’?"
  • "How would you handle a question that requires understanding the relationship between two different sections of a document?"
  • "What artifacts do you create at ingestion time beyond just the chunk embeddings?"

Red flags:

  • Ingestion is just "split and embed"
  • No awareness that query types constrain ingestion design
  • Never created document summaries or hierarchical metadata during ingestion
  • Thinks retrieval problems are always solved by better embedding models or reranking (sometimes the information just doesn’t exist in retrievable form)

Round 5: Behavioral Deep-Dive (45 minutes)

"Tell me about a time when an AI system you built failed in production. What happened, how did you discover it, and how did you fix it?"

"Describe a situation where you had to make a tradeoff between model capability and cost/latency."

"Tell me about a time you disagreed with a colleague about technical approach."

What you’re evaluating: Ownership (do they take responsibility?), Learning (do they improve from failures?), Judgment (can they make good decisions under uncertainty?), Communication (can they explain technical concepts clearly?).


The AI Engineer Scorecard

Print this out. Fill it in after each interview. Make your decision within 15 minutes of the interview ending.


TECHNICAL FOUNDATIONS (Score 1-4)

  • [ ] Understands system architecture for non-deterministic components (guardrails, fallbacks, loop cutoffs)
  • [ ] Can explain tool calling mechanics (schema vs invocation, where execution happens)
  • [ ] Can explain evaluation approaches in depth
  • [ ] Knows context management strategies and their tradeoffs
  • [ ] Has strong software engineering fundamentals
  • [ ] Demonstrates security awareness (prompt injection, OWASP LLM Top 10)

Score: ___ | Notes: ___

PRACTICAL EXPERIENCE (Score 1-4)

  • [ ] Has built and shipped real AI systems to production
  • [ ] Has created custom evaluation harnesses (not just "it looked right")
  • [ ] Understands ingestion design (summaries, hierarchies, metadata, not just chunk-and-embed)
  • [ ] Knows retrieval strategies beyond basic cosine similarity (hybrid search, reranking, knowledge graphs)
  • [ ] Can discuss production failures and how they diagnosed/fixed them
  • [ ] Shows evidence of continuous experimentation and curiosity

Score: ___ | Notes: ___

SYSTEM THINKING (Score 1-4)

  • [ ] Clarifies requirements before jumping to solutions
  • [ ] Considers failure modes proactively (not just the happy path)
  • [ ] Discusses cost/latency/capability tradeoffs without prompting
  • [ ] Mentions monitoring, observability, and debugging from the start
  • [ ] Designs for graceful degradation when components fail
  • [ ] Knows when NOT to use LLMs (and can articulate why)

Score: ___ | Notes: ___

COMMUNICATION & JUDGMENT (Score 1-4)

  • [ ] Explains technical concepts clearly to different audiences
  • [ ] Demonstrates appropriate confidence calibration (knows what they don’t know)
  • [ ] Shows learning from past mistakes without excuses
  • [ ] Makes good decisions under uncertainty
  • [ ] Exhibits genuine curiosity (asks good questions, wants to understand why)
  • [ ] Can push back constructively when they disagree

Score: ___ | Notes: ___


SCORING:

  • 4 = Strong Hire signal (exceeds expectations, would be excited to work with)
  • 3 = Hire signal (meets bar, solid performance)
  • 2 = Concerning (below bar, some red flags)
  • 1 = Strong No Hire (major gaps, clear problems)

DECISION RUBRIC:

  • All 3s or above, at least one 4 → Strong Hire
  • All 3s or above → Hire
  • Any 2 → Lean No Hire (discuss with team)
  • Any 1 → No Hire

Final Decision: HIRE / NO HIRE

Rationale (2-3 sentences): ___


Master Red Flags

Any one of these is a serious concern. Multiple means No Hire.

On Retrieval:

  • "I built a RAG pipeline" with no follow-up on retrieval strategies or failure modes
  • Only uses cosine similarity (no hybrid search, no reranking)
  • Can’t explain when they’d use knowledge graphs vs semantic search vs keyword fallback
  • Never heard of HyDE, contextual retrieval, or late chunking
  • Describes retrieval as a single step (no query rewriting, no relevance filtering, no fallbacks)

On Ingestion:

  • Ingestion pipeline is just "split and embed." No derived artifacts.
  • Never created document summaries or hierarchical metadata during ingestion
  • No awareness that query types constrain ingestion design
  • Can’t explain how to support "summarize document X" queries (hint: you need pre-computed summaries)
  • Thinks all retrieval problems are solved by better embeddings or reranking

On Evaluation:

  • No custom evaluation harness ("we checked if it looked right")
  • Relies on MMLU or public benchmarks for model selection
  • Can’t explain offline vs online evaluation
  • No CI/CD integration for evals

On Models:

  • "We just use GPT-4 for everything"
  • Only knows one provider
  • Can’t discuss quantization tradeoffs
  • Thinks open-source models "aren’t good enough" (hasn’t checked since 2023)

On Agents:

  • "I used LangChain to build an agent" with no depth on patterns, tool-calling, or error handling
  • Has never built an MCP server or heard of A2A protocols. Building an MCP server is straightforward, and anyone claiming agent experience should have done it.
  • Can’t explain how to prevent runaway execution or infinite loops
  • No human-in-the-loop considerations

On Tool Calling:

  • "The LLM executes the tool." Doesn’t understand that LLMs output text and the harness executes.
  • "LangChain handles the tools." Doesn’t understand that the engineer, not the framework, defines which tools exist.
  • Can’t explain the difference between a tool schema and a tool call
  • No mention of validation or error handling for malformed tool calls
  • Never heard of Zod, Pydantic, or JSON Schema for validating tool arguments
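The schema/call distinction fits in a few lines. The *schema* is what you send the model so it knows a tool exists; the *call* is the JSON the model emits back; validating that call before executing it is the harness's job. This is a plain-dict sketch with invented names; in practice you'd reach for Pydantic, Zod, or a JSON Schema validator:

```python
# Tool schema vs tool call, in miniature. The schema describes a tool
# to the model; the call is what the model emits. The harness must
# validate before executing. (Illustrative shape, not any provider's
# exact wire format.)

TOOL_SCHEMA = {
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "parameters": {
        "required": ["city"],
        "properties": {"city": {"type": "str"}, "units": {"type": "str"}},
    },
}

def validate_call(schema, call):
    """Return a list of problems with a model-emitted tool call."""
    errors = []
    if call.get("name") != schema["name"]:
        errors.append("unknown tool: %r" % call.get("name"))
    args = call.get("arguments", {})
    params = schema["parameters"]
    for req in params["required"]:
        if req not in args:
            errors.append("missing required argument: %r" % req)
    for key in args:
        if key not in params["properties"]:
            errors.append("unexpected argument: %r" % key)
    return errors

good = {"name": "get_weather", "arguments": {"city": "Oslo"}}
bad = {"name": "get_weather", "arguments": {"zip": "90210"}}
print(validate_call(TOOL_SCHEMA, good))  # []
print(validate_call(TOOL_SCHEMA, bad))   # two errors
```

A candidate who can whiteboard something like this, and then explain why the real version needs type coercion and a retry message back to the model on failure, understands tool calling. One who can't has only ever watched a framework do it.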

On Context Management:

  • "We just send the whole conversation." Hasn’t hit real scale or thought about costs.
  • "We truncate to the last 10 messages" with no awareness of tradeoffs
  • Never heard of summarization or semantic selection as compaction strategies
  • Can’t explain when to use sliding window vs summarization vs hybrid approaches
  • No awareness that long contexts hurt performance even before hitting token limits

On Security:

  • Dismisses prompt injection as not a concern
  • "We sanitize user inputs" (ignores indirect injection)
  • Gives LLMs same permissions as users
  • No awareness of OWASP LLM Top 10

On Experience:

  • No war stories about production failures
  • GitHub shows only tutorial repos
  • Can’t name papers or techniques from the last 6 months
  • No side projects, no experiments, no curiosity

On Chinese Models:

  • Recommends DeepSeek or Qwen for enterprise data without discussing data sovereignty, security findings, or geopolitical risk

The uncomfortable truth about the 100x developer

Here’s the thing that everyone in AI hiring needs to get through their head: the 100x AI Engineer isn’t a myth. They exist. I’ve worked with them. They’re the people who can take a vague product requirement like "make our support system smarter" and deliver, within weeks, a multi-agent system with custom evaluation, proper monitoring, graceful fallbacks, and actual measurable business impact.

The gap between the best AI Engineers and the average ones is larger than in almost any other software discipline, because the field is moving so fast and the leverage is so high. A great AI Engineer can automate in days what used to be impossible. A mediocre one will spend months building something that kind of works sometimes and fails mysteriously in production.

The 100x developer exists, but so does the 0.1x developer who ships broken systems with confident smiles. The entire point of a rigorous interview process is to tell them apart.

The Trait That Matters More Than Any Technical Skill

I haven’t said this explicitly yet, but it’s the most important thing in this entire guide:

The best AI Engineers can’t stop tinkering. They’re pathologically curious.

You can teach someone GraphRAG. You can train them on evaluation harnesses. You can hand them documentation on MCP and A2A protocols. But you can’t teach someone to be the kind of curious that keeps them up at night wondering "what if I tried it this way instead?"

The field is evolving so fast that no "AI Engineering Best Practices" book can stay current. Any book published today is outdated by the time it hits shelves. Any certification program is teaching yesterday’s techniques. Any bootcamp curriculum was written six months ago, which in AI time is approximately the Paleolithic era.

The AI Engineers you want are the ones who:

  • Read papers for fun. Not because someone assigned them, but because they saw a title and thought "wait, could this fix that thing I’ve been stuck on?"

  • Have side projects that make no business sense. They’re running experiments at 2am not because they’re on deadline, but because they had an idea and couldn’t sleep until they tested it.

  • Ask "what if?" constantly. What if we chained three models together? What if we used the model to evaluate itself? What if we built a feedback loop where the system improves its own prompts?

  • Push until it breaks. They don’t stop at "it works." They want to know: How far can I push this? What are the edge cases? Where does it fail?

  • Get weirdly excited about new releases. When a new model drops, they’re not waiting for the blog post summary. They’re running it through their eval suite within hours.

Interview Signal: "What’s something you tried recently that didn’t work?"

A great AI Engineer has stories for this question. They tried using a smaller model and found surprising capability gaps. They experimented with a new chunking strategy and it degraded performance in ways they’re still investigating. They built a self-improving prompt system and discovered fascinating failure modes.

A mediocre candidate has nothing. They followed tutorials. They never wondered what would happen if they tried something weird.


Appendix: What Technical Depth Looks Like

This section is for interviewers who want to understand the technical landscape well enough to evaluate candidates. If you’re a CTO who just needs the scorecard, you can stop here.

The Model Landscape

A competent AI Engineer needs working knowledge of the major model families and their current state:

Claude (Anthropic): Opus 4.5 is the current flagship, with 200K context standard and 1M tokens in beta for Sonnet models. Leads on SWE-bench Verified (80.9%) and is widely considered the best coding model. Extended thinking mode for complex reasoning.

GPT-5.2 (OpenAI): The current GPT-5 series flagship with 400K total context (272K input, 128K output). State-of-the-art on long-context reasoning and tool-calling. The OpenAI o3 reasoning model remains available for specific use cases. First model to cross 90% on ARC-AGI.

Gemini (Google): 2.5 Pro ships with 1M context (2M coming soon) and native multimodal processing across text, audio, images, and video. Gemini 3 Pro also released. Strong on reasoning benchmarks, deeply integrated into Google Cloud.

Llama 4 (Meta): A cautionary tale. The April 2025 release was marred by controversy: Meta used a non-public "experimental" version for LMArena benchmarks, real-world performance disappointed (especially on coding), and the 10M context window claim was questionable—no model was trained on prompts longer than 256K. Yann LeCun, Meta’s chief AI scientist, admitted the benchmarks were "fudged a little bit" before leaving to start his own company. Meta has since restructured under new Superintelligence Labs leadership (Alexandr Wang from Scale AI), with new models ("Mango" and "Avocado") planned for H1 2026. Worth watching, but treat current Llama 4 claims skeptically.

Mistral: Cost-efficient MoE architectures for Europe-based deployments with GDPR compliance advantages.

Qwen and DeepSeek (China): Competitive performance at aggressive price points, but with all the data sovereignty and security concerns they carry.

The open-source vs closed-source gap has basically closed, but not because of Llama 4. DeepSeek and Qwen models now compete with frontier models on many benchmarks at aggressive price points. Model selection is no longer "just use OpenAI." It’s a complex decision involving cost, latency, privacy, compliance, and whether you need reasoning capabilities. An AI Engineer who doesn’t know this landscape will make expensive mistakes.

Context Engineering

One of the clearest signs that someone actually knows what they’re doing: they understand the difference between prompt engineering and context engineering.

Prompt engineering is crafting input text to elicit desired responses. Context engineering (a term championed by Anthropic) is "the discipline of designing dynamic systems that provide the right information and tools, in the right format, at the right time." As Andrej Karpathy put it: "In every industrial-strength LLM app, context engineering is the delicate art and science of filling the context window with just the right information for the next step."

Telling question: "What is context rot, and how do you prevent it?" If they look blank, they’re not ready. (Answer: LLM accuracy degrades when relevant information is buried in longer contexts. You prevent it by dynamically curating the smallest possible set of high-signal tokens per request.)

Context Management and Compaction Strategies

This is where tutorial followers get exposed. They’ve never built a system that runs longer than a demo conversation, so they’ve never hit the context limit. They’ve never had to make hard choices about what stays and what goes.

The problem: Every LLM has a context window limit. Even the big ones (200K tokens for Claude, 1M for Gemini) run out eventually. And long before you hit the hard limit, you hit performance degradation: the "lost in the middle" problem where accuracy drops for information buried in long contexts. You also hit cost problems, because you’re paying per token.

"Just send everything" is not a strategy. It’s expensive, slow, and actively hurts quality. Real AI Engineers understand compaction strategies and their tradeoffs:

1. Truncation (Sliding Window)

The simplest approach: keep the most recent N messages, drop the rest.

  • Pros: Dead simple. Predictable token count. No latency overhead.
  • Cons: Loses everything old. If the user said something important 50 messages ago, it’s gone. The agent has no memory of early conversation context.
  • When to use: Simple chatbots where early context genuinely doesn’t matter. Quick prototypes.
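Truncation really is this simple, which is both its appeal and its limitation. A minimal sketch (names illustrative):

```python
from collections import deque

# Sliding-window truncation in its entirety: pin the system prompt,
# keep the last `max_turns` messages, drop everything older.

def sliding_window(system_prompt, messages, max_turns=4):
    window = deque(messages, maxlen=max_turns)  # deque discards the oldest
    return [{"role": "system", "content": system_prompt}] + list(window)

history = [{"role": "user", "content": f"message {i}"} for i in range(10)]
ctx = sliding_window("You are a support agent.", history)
print([m["content"] for m in ctx])
# system prompt + messages 6 through 9; messages 0-5 are simply gone
```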

2. Summarization

Use an LLM to compress older messages into a summary, keep recent messages verbatim.

  • Pros: Preserves meaning from older context. Agent can still reference early conversation themes.
  • Cons: Adds latency (extra LLM call). Costs money. Summarization is lossy. Specific details get dropped. Summaries can drift from original meaning over many iterations.
  • When to use: Conversational agents where maintaining coherent long-term context matters more than preserving exact quotes.

3. Hierarchical Summarization

Multiple tiers: recent messages verbatim, older messages summarized, very old messages summarized-again into higher-level summaries.

  • Pros: Scales to very long conversations. Preserves different levels of detail at different time horizons.
  • Cons: Complex to implement. Multiple summarization steps compound lossy compression. Hard to know when to promote messages between tiers.
  • When to use: Long-running agents, persistent assistants, scenarios where conversations span days or weeks.

4. Semantic Selection

Instead of keeping recent messages, keep the most relevant messages based on semantic similarity to the current query.

  • Pros: Can surface old-but-relevant context that sliding window would drop. More efficient use of context budget.
  • Cons: Requires embedding and retrieval infrastructure. Adds latency. May miss context that’s relevant in non-obvious ways. Conversation flow can feel disjointed if the agent "forgets" recent messages that weren’t semantically matched.
  • When to use: When relevance matters more than recency. Often combined with other strategies.

5. Hybrid Approaches

The real answer is usually a combination:

  • Keep the system prompt (always)
  • Keep the last N messages verbatim (recency)
  • Summarize messages from N to M turns ago
  • Drop or deeply compress anything older
  • Inject retrieved context based on current query (semantic selection)

6. Knowing When to Reset

Sometimes the right answer is to start fresh. If the conversation has drifted so far from the original topic that old context is actively confusing the model, compaction won’t save you. Real AI Engineers build systems that can recognize when to reset context and do it gracefully.
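The hybrid recipe can be sketched in a few lines. This is a deliberately minimal shape, with `summarize` stubbed out where a real system would call a cheap model, and without the semantic-selection leg:

```python
# Hybrid compaction sketch: pin the system prompt, keep the last
# `keep_verbatim` messages untouched, collapse the older span into a
# single summary message. `summarize` is a stub for an LLM call.

def summarize(messages):
    # Placeholder: a real implementation would call a cheap model here.
    return "Summary of %d earlier messages." % len(messages)

def compact(system_prompt, messages, keep_verbatim=4):
    if len(messages) <= keep_verbatim:
        return [{"role": "system", "content": system_prompt}] + messages
    older, recent = messages[:-keep_verbatim], messages[-keep_verbatim:]
    return (
        [{"role": "system", "content": system_prompt}]
        + [{"role": "system", "content": summarize(older)}]
        + recent
    )

history = [{"role": "user", "content": f"turn {i}"} for i in range(10)]
ctx = compact("You are a support agent.", history)
print(len(ctx), ctx[1]["content"])
```

Even a sketch like this surfaces the real design questions: when do you re-summarize, how do you keep the summary from drifting, and what happens when the user asks about something that only survives inside it. Those are exactly the follow-ups to put to a candidate.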

The tradeoffs are real:

| Strategy | Latency | Cost | Memory Fidelity | Implementation Complexity |
| --- | --- | --- | --- | --- |
| Truncation | None | Low | Poor (old context lost) | Trivial |
| Summarization | High (LLM call) | Medium | Medium (lossy) | Medium |
| Hierarchical | High | High | Medium-Good | High |
| Semantic Selection | Medium (embedding + retrieval) | Medium | Variable | High |
| Hybrid | Variable | Variable | Good | High |

Interview questions that expose understanding:

  • "Your agent is in a multi-turn conversation approaching the context limit. Walk me through your options and their tradeoffs."
  • "How do you decide what to keep and what to drop when compacting context?"
  • "What’s the difference between summarization and semantic selection for context management?"
  • "How do you handle a situation where the user references something from early in a long conversation?"
  • "What signals would tell you it’s time to reset context rather than compact it?"

Red flags:

  • "We just send the whole conversation." Hasn’t hit real scale.
  • "We truncate to the last 10 messages" with no discussion of tradeoffs. Hasn’t thought about it.
  • Never heard of summarization or semantic selection as compaction strategies
  • Can’t discuss when they’d use one approach vs another
  • No awareness that long contexts hurt performance even before hitting limits

Quantization

If a candidate claims AI Engineer experience but can’t explain quantization tradeoffs, they’ve never deployed a model.

GGUF for CPU and hybrid CPU-GPU inference; it works well on Apple Silicon. GPTQ for GPU-optimized inference with Hessian-based calibration. AWQ (Activation-aware Weight Quantization) for quality-sensitive applications. The quality/speed tradeoffs are highly task-dependent; anyone claiming specific percentages without citing a benchmark is guessing.

Real AI Engineers can discuss when they’d choose one over another based on their hardware, latency requirements, and quality tolerance.
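The core mechanic behind all of these schemes is the same: map floats to low-bit integers with a scale factor, accepting some round-trip error. A toy sketch of symmetric int8 quantization (real formats like GGUF, GPTQ, and AWQ add per-group scales, calibration data, and mixed precision on top of this):

```python
# Toy symmetric int8 quantization of a weight vector. Production formats are
# far more sophisticated, but the scale-and-round core is the same.

def quantize_int8(weights):
    """Map floats to the int8 range [-127, 127] with one symmetric scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.12, -0.87, 0.33, 0.95, -0.04]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Round-trip error is bounded by half a quantization step per weight.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```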

Agentic Systems

The biggest shift in AI engineering has been the move from single-turn LLM interactions to multi-step agentic systems. An AI Engineer must understand:

Agent frameworks: LangGraph for complex multi-step reasoning. CrewAI for role-based multi-agent systems. LlamaIndex for data-centric applications. AutoGen for multi-agent conversations. OpenAI’s Agents SDK for lightweight deployments. Google’s ADK for enterprise systems. Vercel AI SDK for TypeScript/React apps with provider-agnostic model access. llms.py for lightweight CLI and gateway access to 530+ models.

Protocols: MCP (Model Context Protocol) for connecting AI systems with external tools (how agents reach tools and data). A2A (Agent-to-Agent Protocol) from Google Cloud for inter-agent communication (how agents coordinate with other agents).

Agent patterns: Self-verification loops. LLM-as-a-judge. Multi-agent orchestration: hierarchical, collaborative, conversational, handoff-based.

Tool Calling: The Litmus Test for Real Understanding

This is one of the best filters for separating real AI Engineers from tutorial followers. Ask a candidate: "Walk me through exactly what happens when an LLM ‘calls’ a tool." If they can’t explain the mechanics, they don’t understand how agentic systems actually work.

The fundamental misconception: LLMs don’t execute tools. They can’t. An LLM is a text prediction engine. It has no ability to make HTTP requests, query databases, or run code. When people say an LLM "calls a tool," they’re describing a multi-step process where the LLM’s role is just one piece.

What actually happens:

  1. You define the tool schema. This is a structured description of what tools are available, what parameters they accept, and what they return. You write this. Not the framework. Not the LLM. You.

  2. The schema gets injected into the prompt. The LLM receives the tool definitions as part of its context, typically as JSON Schema or a similar format. The LLM now "knows" what tools exist and how to request them.

  3. The LLM outputs a tool call request. When the LLM decides it needs to use a tool, it outputs structured text (usually JSON) that says "I want to call tool X with parameters Y." This is just text. Nothing has executed yet.

  4. Your harness parses and validates the output. Your code (or your framework’s code) parses the LLM’s output, extracts the tool call request, and validates it against your schema. Does the tool exist? Are the parameters the right types? Are required fields present?

  5. Your harness executes the tool. Your code actually runs the function, makes the API call, queries the database. Whatever the tool does. The LLM is not involved in this step.

  6. The result gets injected back into context. Your harness takes the tool’s output and adds it to the conversation context, then calls the LLM again so it can incorporate the result.

  7. Loop until done. The LLM either requests another tool call or produces a final response.
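The whole loop fits on a page. Here is a stripped-down sketch with a scripted stub in place of the real model so the control flow is visible; `call_llm`, the tool names, and the message format are illustrative assumptions:

```python
import json

# Step 1: WE define the tools. A "tool" is just a function we own.
TOOLS = {"get_weather": lambda args: {"temp_c": 18, "city": args["location"]}}

# Scripted "LLM": first turn requests a tool, second turn gives a final answer.
_responses = [
    '{"tool": "get_weather", "arguments": {"location": "San Francisco"}}',
    "It is 18 degrees in San Francisco.",
]
def call_llm(messages):
    return _responses.pop(0)

def run_agent(user_message, max_turns=5):
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_turns):
        output = call_llm(messages)            # step 3: the LLM outputs text
        try:
            request = json.loads(output)       # step 4: parse the request
        except ValueError:
            return output                      # not a tool call: final answer
        tool = TOOLS.get(request.get("tool"))  # step 4: validate it exists
        if tool is None:
            messages.append({"role": "system", "content": "Unknown tool."})
            continue
        result = tool(request["arguments"])    # step 5: YOUR code executes
        messages.append({"role": "tool",       # step 6: inject the result
                         "content": json.dumps(result)})
    return "Stopped after max_turns."
```

Notice where execution happens: inside `run_agent`, in your process. The model only ever emits and consumes text.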

Tool Schema vs Tool Call:

A tool schema is the definition: the contract that describes what a tool does, what inputs it accepts, and what it returns. You write this once. Example:

{
  "name": "get_weather",
  "description": "Get current weather for a location",
  "parameters": {
    "type": "object",
    "properties": {
      "location": { "type": "string", "description": "City name" },
      "units": { "type": "string", "enum": ["celsius", "fahrenheit"] }
    },
    "required": ["location"]
  }
}

A tool call is a specific invocation: the LLM’s request to use that tool with specific arguments. The LLM outputs this:

{
  "tool": "get_weather",
  "arguments": { "location": "San Francisco", "units": "celsius" }
}

The schema is the blueprint. The tool call is a request that conforms to that blueprint.

Who defines the tools?

The AI Engineer defines the tools. This is the key point. The LLM doesn’t know what tools should exist. The framework doesn’t decide what your business logic needs. You analyze your use case, determine what capabilities the agent needs, design the tool interfaces, implement the actual functions, and write schemas that describe them clearly enough for the LLM to use them correctly.

Frameworks like LangChain provide decorators and utilities to make schema generation easier, but you’re still defining the tools. If a candidate thinks the framework "provides the tools," they’ve only used pre-built examples.

Schema validation matters:

When the LLM outputs a tool call, you can’t trust it blindly. LLMs hallucinate parameters, invent tools that don’t exist, and produce malformed JSON. Real AI Engineers validate rigorously:

  • Zod (TypeScript): Runtime schema validation with excellent type inference. Define your schema once, get TypeScript types and runtime validation.
  • Pydantic (Python): Similar approach. Define models, get validation and serialization.
  • JSON Schema: The underlying standard that most tool-calling APIs use. Understanding JSON Schema is foundational.

A candidate who’s built production agentic systems will have opinions about validation. They’ve been burned by malformed tool calls. They know you need to handle the case where the LLM requests a tool that doesn’t exist, or passes a string where you expected a number, or omits required fields.
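Even without a library, the checks those tools perform are easy to state. A hand-rolled sketch against the `get_weather` schema above (in practice you would reach for Pydantic, Zod, or a JSON Schema validator rather than this):

```python
# Minimal validation of an LLM tool-call request: unknown tool, missing
# required fields, wrong types, out-of-enum values. Illustrative only.

SCHEMAS = {
    "get_weather": {
        "required": ["location"],
        "types": {"location": str, "units": str},
        "enums": {"units": {"celsius", "fahrenheit"}},
    }
}

def validate_call(call):
    """Return None if valid, else a human-readable error string."""
    schema = SCHEMAS.get(call.get("tool"))
    if schema is None:
        return "unknown tool"
    args = call.get("arguments", {})
    for field in schema["required"]:
        if field not in args:
            return f"missing required field: {field}"
    for field, value in args.items():
        expected = schema["types"].get(field)
        if expected and not isinstance(value, expected):
            return f"wrong type for {field}"
        allowed = schema["enums"].get(field)
        if allowed and value not in allowed:
            return f"invalid value for {field}"
    return None
```

The error string goes back into the LLM's context so it can retry; silently dropping the bad call is how agents get stuck.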

Interview questions that expose understanding:

  • "Walk me through exactly what happens when an agent ‘calls’ a tool. Where does execution actually occur?"
  • "What’s the difference between a tool schema and a tool call?"
  • "Who decides what tools an agent has access to? Where does that definition live?"
  • "How do you handle it when the LLM outputs an invalid tool call?"
  • "What’s your approach to schema validation? What libraries do you use?"

Red flags:

  • "The LLM executes the tool." No. The LLM outputs text. Your code executes.
  • "LangChain handles the tools." LangChain handles orchestration. You define what tools exist.
  • Can’t explain the difference between schema and invocation
  • No mention of validation or error handling
  • Never heard of Zod, Pydantic, or JSON Schema

Retrieval Systems

The gap between real AI Engineers and tutorial followers becomes a chasm here.

Why Basic RAG Fails:

  1. Semantic gap: "What caused the delay?" won’t match "Timeline slipped due to vendor issues"
  2. Chunking destroys context: Answers spanning two chunks become meaningless
  3. No relationship reasoning: Multi-hop queries need graph traversal, not cosine similarity
  4. Recency blindness: Old documents score the same as authoritative new ones
  5. Lost in the middle: LLMs perform worse when relevant info is buried in context
  6. Query-document mismatch: Questions and statements don’t embed similarly

SOTA retrieval in 2026:

  • HyDE (Hypothetical Document Embeddings): Generate a hypothetical answer, embed that instead of the query
  • Hybrid search: Combine semantic similarity + keyword (BM25) + metadata filtering + recency weighting
  • Contextual retrieval: Prepend LLM-generated context to chunks before embedding
  • Late chunking: Embed entire document first, then split embeddings to preserve cross-chunk attention
  • Knowledge graphs: Extract entities and relationships for multi-hop reasoning
  • Agentic RAG: Multi-step retrieval with query rewriting, relevance grading, and self-correction

| Layer | 2023 Approach | 2026 SOTA |
|---|---|---|
| Chunking | Fixed 512 tokens | Semantic chunking, late chunking, hierarchical |
| Embedding | Single vector per chunk | Multi-vector, contextual headers, summary embeddings |
| Search | Pure cosine similarity | Hybrid (semantic + BM25 + metadata + recency) |
| Retrieval | Top-k and done | Agentic loop with reranking, query rewriting, relevance grading |
| Reasoning | Stuff chunks in prompt | Knowledge graphs, multi-hop traversal |
| Validation | None | Hallucination detection, source verification |
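One concrete piece of that hybrid-search row: reciprocal rank fusion (RRF), a standard way to merge a BM25 ranking and a semantic ranking without needing their raw scores to be comparable. The document IDs and rankings below are made up:

```python
def rrf(rankings, k=60):
    """Fuse ranked lists of doc IDs into one ranking (k=60 is conventional)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Each list contributes 1/(k + rank); high ranks dominate.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc_a", "doc_c", "doc_b"]      # keyword hits
semantic_ranking = ["doc_b", "doc_a", "doc_d"]  # embedding hits
fused = rrf([bm25_ranking, semantic_ranking])   # doc_a wins: high in both
```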

Ingestion: Where Retrieval Succeeds or Fails

Most RAG tutorials focus on retrieval. But retrieval can only find what you put in. If your ingestion pipeline doesn’t create the right artifacts, no amount of clever retrieval will save you.

The core insight: You must think about query patterns at ingestion time. What questions will users ask? What information do they need that doesn’t exist verbatim in the source documents?

If a user will ask "summarize document X," you need a summary to retrieve. Pure chunk embeddings won’t help. Semantic search on "summarize" will return random content that’s vaguely related to the concept of summarization, not the actual summary.

What to create at ingestion time:

| Artifact | Why You Need It | Query Types It Enables |
|---|---|---|
| Document summaries | Source docs don’t contain their own summaries | "Summarize X", "What’s the main point of X?" |
| Section summaries | Enable mid-level abstraction | "What does section 3 cover?" |
| Hierarchical metadata | Chunks lose document structure | "How does section 3 relate to the conclusion?" |
| Entity extracts | Enable structured queries | "What does the doc say about [person/product]?" |
| Cross-document themes | Individual docs don’t know about each other | "What are the common themes across these docs?" |
| "Questions this answers" | Bridge query-document semantic gap | Better matching for question-style queries |

The failure mode nobody talks about:

A user asks: "Summarize the Q3 report."

What happens with naive RAG: Semantic search embeds this query and looks for similar chunks. It finds chunks that are semantically similar to "summarize" and "Q3 report." Maybe a chunk that mentions Q3 results, maybe a chunk about summarization from a different document. None of these are the summary. The system returns garbage or hallucinates.

What happens with proper ingestion: During ingestion, you generated a summary of the Q3 report and stored it with metadata {type: "summary", document: "Q3 Report"}. Your retrieval system recognizes "summarize X" as a summary request, queries for documents matching "Q3 report" with type "summary," and returns the actual pre-computed summary.

The difference isn’t retrieval sophistication. It’s whether the thing the user wants exists in retrievable form.
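In code, the routing side of this is unglamorous string-and-metadata work. A sketch, where the store, the metadata keys, and the `semantic_search` stub are all illustrative:

```python
# Route "summarize X" requests to pre-computed summary artifacts by metadata;
# fall back to semantic search over chunks for everything else.

STORE = [
    {"type": "summary", "document": "Q3 Report", "text": "Q3 revenue grew..."},
    {"type": "chunk", "document": "Q3 Report", "text": "Vendor delays in July..."},
]

def semantic_search(query):
    # Placeholder for embedding-based retrieval over chunks.
    return [a for a in STORE if a["type"] == "chunk"][:1]

def retrieve(query):
    if query.lower().startswith("summarize"):
        target = query[len("summarize"):].strip(" .")
        if target.lower().startswith("the "):
            target = target[4:]
        hits = [a for a in STORE
                if a["type"] == "summary"
                and target.lower() in a["document"].lower()]
        if hits:
            return hits          # the pre-computed summary, not chunk soup
    return semantic_search(query)
```

A production router would classify intent with an LLM or a classifier rather than a prefix check, but the principle is identical: the summary has to exist before it can be retrieved.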

Evaluation

Generic benchmarks (MMLU, HumanEval, SWE-bench) are useful for initial model selection but won’t tell you anything useful about your specific application. MMLU is saturated. HumanEval has contamination concerns.

Real AI Engineers build custom evaluation harnesses specific to their application. They create holdout test sets they don’t share publicly. They understand that evaluation criteria evolve, so they build systems that support rapid iteration. They use LLM-as-judge patterns for subjective failures. They run evals in CI/CD before every deployment.
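The shape of such a harness is simple even when the grading isn't. A minimal sketch in which the test set, the system under test, and the exact-match grader are all stubs; swapping in an LLM-as-judge changes `grade`, not the harness:

```python
# Bare-bones application-specific eval harness: fixed cases, a system under
# test, a grader, and a pass rate you can gate deployments on.

TEST_SET = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]

def system_under_test(prompt):
    # Placeholder for your actual LLM application.
    return {"2 + 2": "4", "capital of France": "Paris"}.get(prompt, "")

def grade(output, expected):
    return output.strip() == expected  # an LLM-as-judge call would go here

def run_evals():
    results = [grade(system_under_test(c["input"]), c["expected"])
               for c in TEST_SET]
    return sum(results) / len(results)  # pass rate; fail CI below a threshold
```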

Security

Prompt injection is OWASP’s #1 risk for LLM applications. It comes in two forms: direct injection (users typing "ignore previous instructions") and indirect injection (malicious instructions hidden in documents the LLM processes).

Real AI Engineers implement input validation and sanitization, instruction hierarchy (training LLMs to prioritize privileged instructions), fine-grained permissions (don’t give the LLM the same access as the user), and sandboxing for code execution.
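The fine-grained-permissions point is worth making concrete: the agent's allow-list should be intersected with, and narrower than, the user's own access. The tool names and permission sets here are illustrative:

```python
# The agent may call a tool only if BOTH the user could do it themselves AND
# the tool is on the agent's deliberately narrow allow-list.

USER_PERMISSIONS = {"read_docs", "send_email", "delete_account"}
AGENT_ALLOW_LIST = {"read_docs", "send_email"}  # no destructive actions

def agent_can_call(tool_name):
    return tool_name in AGENT_ALLOW_LIST and tool_name in USER_PERMISSIONS
```

This way a successful injection can at worst make the agent misuse its narrow toolset, not the user's full account.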

PRC-Hosted Models: A Risk Assessment

DeepSeek and Qwen are impressive models at aggressive price points. But there are documented risks that any AI Engineer should understand:

NIST’s CAISI evaluated DeepSeek in September 2025 and found it was 94% susceptible to jailbreaking (versus 8% for US reference models). It was 12x more likely to follow malicious agent hijacking instructions. Hijacked agents successfully sent phishing emails, downloaded malware, and exfiltrated credentials.

DeepSeek refuses or deflects on politically sensitive topics (Tiananmen Square, Taiwan, Uyghur treatment). Its internal documentation mentions being designed for "government-aligned responses."

All data is stored on servers in the People’s Republic of China. China’s National Intelligence Law requires cooperation with intelligence services (though the legal interpretation is debated and the practical risk is uncertain). Multiple governments have restricted DeepSeek: Taiwan government-wide, Australia on government devices, Italy from app stores.

The practical guidance:

For regulated or confidential enterprise data: avoid PRC-hosted services unless you have a contractual, technical, and legal risk plan that addresses data sovereignty, security vulnerabilities, and geopolitical uncertainty.

For non-sensitive workloads: treat it like any other high-risk vendor. Threat model it, gate it, log it, and assume it will be targeted.


Share This Guide

If you’re a hiring manager, a CTO, or a recruiter trying to figure out what the hell an "AI Engineer" actually is: please share this with your team.

The industry is drowning in imposters right now. We’re in the "webmaster" era of AI, where everyone has added "AI/ML Expert" to their LinkedIn headline after a 4-hour YouTube tutorial. Companies are paying six-figure salaries to people who can’t explain why their RAG system returns garbage for multi-hop queries.

Bad hires are expensive. Not just the salary: the broken systems they ship, the technical debt they create, the months wasted before you realize they can’t do the job.

Real AI Engineers are rare and valuable. Finding them is hard because they’re buried under an avalanche of resume spam.

Save a hiring manager from the imposters. Send them this guide.


Hell, By the Time You’re Reading This…

…half of the specific techniques I’ve mentioned are probably already outdated. That’s not a bug; that’s the point.

This guide isn’t meant to be a permanent reference. It’s meant to show you what depth looks like so you can recognize it when you see it. The specific techniques will change. The expectation that your AI Engineer knows the current state of the art, has opinions about it, and is actively experimenting with what comes next? That’s permanent.

An AI Engineer worth hiring reads this guide and thinks: "Yeah, but what about [technique I haven’t mentioned]?" They have additions, corrections, disagreements.

That’s exactly who you want to hire.


Thank You for Reading

If you made it this far, you’re serious about hiring well. Good for you.

I’m Jason Roell, Head of Engineering at Vurvey, where we’re building AI that builds AI worlds from the feedback and voices of always-on AI consumer companions, giving regular people a voice and paying them for it. I wrote this because I’ve been on both sides of the AI hiring problem: trying to find real AI Engineers, and watching talented people get passed over while imposters land roles they can’t handle.

Let’s connect:

Did I miss something? Is there a red flag that should be on this list? A technique I didn’t cover? Reach out.

And if you’re an AI Engineer reading this, nodding along, thinking "finally, someone gets it": we should talk. The best people in this field find each other.

The Love Boat is counting on you.



The technical landscape described here is current as of January 2026. In six months, half of it will be outdated. That’s the job.

];

$date =

;

$category =

,

;

$author =

;

$next =

;
