AI interviews have changed quickly over the last few years. The classical machine learning questions that defined the field for a long time are still on the list, but they sit next to a newer set of questions about large language models, retrieval augmented generation, evaluation, and the practical realities of putting AI into production. A strong candidate today is expected to be fluent in both worlds.
The questions below are a representative set of the kinds of things that come up in interviews for roles that involve building, deploying, or working closely with AI systems. They cover foundational concepts, modern generative AI, and the applied judgment that interviewers usually care about most. The answers are written as the kind of response a strong candidate might give in a real interview, with enough depth to demonstrate understanding but without becoming a textbook.
Use the list as a study guide, as a prompt for your own preparation, or as a starting point if you are the interviewer trying to assemble a question set that actually distinguishes good candidates from average ones.
Foundations and Classical Machine Learning
1. What is the difference between supervised, unsupervised, and reinforcement learning?
Supervised learning trains a model on labeled examples where each input has a known target output, so the model learns to map inputs to outputs. Classification and regression are the two main flavors. Unsupervised learning works on data without labels and tries to find structure in it, such as clusters of similar items or a lower dimensional representation of the data. Reinforcement learning is different in kind. An agent learns to take actions in an environment by receiving rewards or penalties, and the goal is to learn a policy that maximizes long term reward. The three are not mutually exclusive in practice. Many modern systems combine them, for example large language models trained with supervised learning on text and then refined with reinforcement learning from human feedback.
2. Explain the bias variance tradeoff.
Bias is the error a model makes because its form is too simple to capture the underlying pattern in the data. Variance is the error a model makes because it is too sensitive to the specific training data it saw and would behave differently on a different sample from the same distribution. A model with high bias underfits. A model with high variance overfits. The two errors usually trade against each other. Simpler models tend to have higher bias and lower variance. More flexible models tend to have lower bias and higher variance. The goal of model selection is to find the point where the total expected error on new data is minimized, which usually means a model that is rich enough to fit the real pattern but not so flexible that it fits noise.
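For squared error loss, the tradeoff can be written as a standard decomposition. The sketch below assumes the data follow y = f(x) + ε with zero mean noise of variance σ², and that the expectation is taken over training sets and noise. The symbols are introduced here for illustration and are not part of the answer above.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Only the first two terms depend on the model, which is why model selection is about balancing them rather than driving either one to zero.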
3. What is overfitting and how do you prevent it?
Overfitting is when a model performs well on training data but poorly on data it has not seen before, because it has learned patterns specific to the training set rather than patterns that generalize. The usual prevention techniques include collecting more data when possible, using regularization such as L1 or L2 penalties on model weights, applying dropout in neural networks, using simpler models, early stopping during training, and cross validation to check generalization. Data augmentation also helps in domains like images and text. For deep learning, weight decay is a common tool, and batch normalization, while primarily an optimization stabilizer, often has a secondary regularizing effect in practice. The most important practical habit is to always evaluate on a held out test set that the model never saw during training or model selection.
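Early stopping is one of the easier techniques in that list to show concretely. The following is a minimal sketch, assuming you already record one validation loss per epoch. The function name, the patience rule, and the loss values are illustrative choices, not taken from any particular framework.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stopped_at_epoch, best_epoch) for a sequence of validation losses.

    Training stops once `patience` epochs have passed without the validation
    loss improving on the best value by more than `min_delta`.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    # Never triggered: training ran to the last recorded epoch.
    return len(val_losses) - 1, best_epoch


# Made-up validation losses: improvement stalls after epoch 2.
val_losses = [1.0, 0.8, 0.7, 0.72, 0.75, 0.74, 0.9]
stopped_at, best_epoch = early_stopping_epoch(val_losses, patience=3)
```

In practice you would restore the weights saved at best_epoch rather than keep the weights from the epoch where training stopped.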
4. When does precision matter more than recall, and vice versa?
Precision is the fraction of items the model flagged as positive that really are positive. Recall is the fraction of truly positive items that the model caught. Which one matters more depends on the cost of each type of error. Precision matters more when false positives are expensive. A spam filter that flags real email as spam is worse than one that misses some spam, so precision is the priority. Recall matters more when false negatives are expensive. A cancer screening test that misses cases is much worse than one that flags some healthy patients for follow up, so recall is the priority. In practice, you often optimize a combined metric like F1, or choose an operating point on the precision recall curve that reflects the actual cost balance in your application.
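The definitions are easy to mix up under pressure, so it can help to compute them once by hand. Here is a small sketch for binary labels, where 1 means positive. The function name and the example labels are made up for illustration.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels: 2 true positives, 1 false positive, 2 false negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
# precision = 2/3, recall = 1/2, f1 = 4/7
```

Here the model is right about two of the three items it flagged, but it caught only half of the real positives.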
5. What is cross validation and why use it?
Cross validation is a way to estimate how well a model will generalize without needing a large dedicated test set. The most common form, k fold cross validation, splits the training data into k roughly equal parts called folds, trains the model k times, each time holding out a different fold as the validation set, and averages the resulting scores. It uses the data more efficiently than a single train and validation split and gives a more stable estimate of model performance. It is particularly useful for tuning hyperparameters or comparing models when data is limited. The main cost is computation, since the model is trained k times.
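The mechanics can be sketched in a few lines of plain Python. This is an illustrative version, not a library implementation. The helper names are hypothetical, the "model" is just the mean of the training targets, and the data is not shuffled, which real code would usually do first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) pairs for k fold cross validation."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


def cross_validate(xs, ys, k, fit, score):
    """Train k times, each time scoring on the held out fold, and average."""
    scores = []
    for train, val in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(score(model, [xs[i] for i in val], [ys[i] for i in val]))
    return sum(scores) / len(scores)


def fit_mean(xs, ys):
    # Toy "model": always predict the mean of the training targets.
    return sum(ys) / len(ys)


def mse(model, xs, ys):
    return sum((y - model) ** 2 for y in ys) / len(ys)


xs = list(range(10))
ys = [2.0 * x for x in xs]
avg_mse = cross_validate(xs, ys, k=5, fit=fit_mean, score=mse)
```

Every sample lands in a validation fold exactly once, which is what makes the averaged score use the data efficiently.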
6. Explain gradient descent and the difference between batch, stochastic, and mini batch versions.
Gradient descent is an optimization method that updates model parameters in the direction that reduces the loss function, taking small steps proportional to the negative gradient of the loss with respect to the parameters. Batch gradient descent computes the gradient using all the training data before each update, which is accurate but slow and memory intensive on large datasets. Stochastic gradient descent uses a single training example per update, which makes each step cheap but noisy. The noise can help the optimizer escape shallow local minima, but it also causes oscillation near the optimum. Mini batch gradient descent uses a small batch per update, typically between thirty two and a few thousand examples, which is the practical compromise used in almost all modern deep learning. Modern optimizers like Adam and AdamW build on mini batch gradient descent with adaptive learning rates per parameter.
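The three variants differ only in how many examples feed each update, which a short sketch makes visible. The example below fits a line to noise-free toy data with mean squared error. The function name, learning rate, and epoch count are illustrative assumptions rather than recommended settings.

```python
import random


def minibatch_gd(xs, ys, batch_size, lr=0.05, epochs=200, seed=0):
    """Fit y = w * x + b by gradient descent on mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)  # new random batches every epoch
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            # Gradient of the batch's mean squared error with respect to w and b.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            # Step against the gradient.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


xs = [i / 10 for i in range(20)]      # 0.0, 0.1, ..., 1.9
ys = [3.0 * x + 1.0 for x in xs]      # true slope 3, true intercept 1
w, b = minibatch_gd(xs, ys, batch_size=4)
```

Setting batch_size to len(xs) turns this into batch gradient descent, and setting it to 1 gives stochastic gradient descent, with no other change to the loop.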
Deep Learning and Neural Networks
7. How does a neural network learn? Explain backpropagation at a high level.
A neural network is a function with many parameters, organized in layers, that maps inputs to outputs. Training works by computing a loss that measures how wrong the network's output is on a batch of examples, then adjusting the parameters to reduce that loss. Backpropagation is the algorithm that efficiently computes the gradient of the loss with respect to every parameter in the network. It works by applying the chain rule of calculus, propagating the gradient backward through the layers, so each layer learns how its parameters should change to reduce the final loss. The gradients are then used by an optimizer like Adam to update the parameters. The process is repeated many times across the training data until the loss converges or stops improving on a validation set.
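A tiny worked example makes the chain rule concrete. The sketch below uses a made-up network with one input, one tanh hidden unit, and one output, computes the gradients by hand in the backward direction, and checks them against finite differences. All names and parameter values are illustrative.

```python
import math


def forward(params, x, target):
    """One hidden tanh unit, linear output, squared error loss."""
    w1, b1, w2, b2 = params
    z = w1 * x + b1
    h = math.tanh(z)
    y = w2 * h + b2
    loss = (y - target) ** 2
    return loss, (z, h, y)


def backward(params, x, target):
    """Gradients of the loss, obtained by applying the chain rule layer by layer."""
    w1, b1, w2, b2 = params
    _, (z, h, y) = forward(params, x, target)
    dloss_dy = 2 * (y - target)          # loss -> output
    dw2 = dloss_dy * h                   # output -> w2
    db2 = dloss_dy                       # output -> b2
    dloss_dh = dloss_dy * w2             # output -> hidden activation
    dloss_dz = dloss_dh * (1 - h ** 2)   # through tanh, whose derivative is 1 - tanh^2
    dw1 = dloss_dz * x                   # hidden pre-activation -> w1
    db1 = dloss_dz                       # hidden pre-activation -> b1
    return [dw1, db1, dw2, db2]


def numerical_grad(params, x, target, eps=1e-6):
    """Central finite differences, used only to check the analytic gradients."""
    grads = []
    for i in range(len(params)):
        up = list(params)
        down = list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((forward(up, x, target)[0] - forward(down, x, target)[0]) / (2 * eps))
    return grads


params = [0.5, -0.3, 0.8, 0.1]
analytic = backward(params, x=1.5, target=2.0)
numeric = numerical_grad(params, x=1.5, target=2.0)
```

The backward pass reuses dloss_dy and dloss_dz instead of recomputing them for each parameter, and that reuse is where the efficiency of backpropagation comes from.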
8. What is the difference between CNNs, RNNs, and Transformers?
Convolutional neural networks, or CNNs, use shared weight filters that slide across spatial inputs, making them well suited to images and any data with local structure. Recurrent neural networks, or RNNs, including LSTMs and GRUs, process sequences one element at a time and maintain a hidden state, which makes them suited to sequential data but slow to train and limited in how much earlier context they can use effectively. Transformers replaced RNNs as the dominant architecture for sequence modeling after the 2017 "Attention Is All You Need" paper. They use a self attention mechanism that lets every position in the input attend to every other position in parallel, which is much more efficient on modern hardware and captures long range dependencies better. Transformers are now used not only for language but also for vision, audio, code, and many other modalities.
9. What is attention and why does it matter?
Attention is a mechanism that lets a model dynamically weight different parts of its input when producing each part of its output. In self attention, every token in a sequence computes a weighted combination of all the tokens in the sequence, itself included, where the weights depend on how relevant each token is to the current one. The practical effect is that the model can directly reference any part of the input no matter how far away it is, rather than having to compress everything through a sequential hidden state. Attention is the core building block of Transformers and a big part of why modern language models work as well as they do.
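The weighting step can be sketched with scaled dot product attention, the form used in Transformers. This is a simplified illustration in plain Python. The queries, keys, and values are passed in directly, whereas a real Transformer derives them from the token embeddings through learned projections, and the toy vectors are made up.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Relevance of every position to this query.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the weighted combination of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy "token" vectors, used as queries, keys, and values alike.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
```

Each row of weights sums to one, and the first token places more weight on itself and the third token than on the second, because those are the vectors it overlaps with.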
Generative AI and Large Language Models
10. How does a large language model work, at a high level?
A large language model is a Transformer based neural network trained on a very large corpus of text to predict the next token given the preceding tokens. The training data is broken into tokens, which are sub word units, and the model learns the statistical patterns of language by being asked to predict each token in turn. After this pretraining stage, the model is usually fine tuned with supervised examples of helpful responses and then aligned with a preference optimization method, often reinforcement learning from human feedback but also newer approaches such as direct preference optimization or related variants, to make it useful as an assistant. At inference time, the model generates text one token at a time, sampling from the probability distribution it predicts for the next token at each step.
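The inference loop described above is short enough to sketch. In the example below, toy_next_logits is a hypothetical stand-in for a trained model, and the four word vocabulary is made up. Only the loop structure and the temperature scaling are the point.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def generate(next_logits, prompt_tokens, vocab, max_new_tokens, temperature=1.0, seed=0):
    """Autoregressive generation: predict, sample one token, append, repeat."""
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = softmax_with_temperature(next_logits(tokens), temperature)
        tokens.append(rng.choices(vocab, weights=probs, k=1)[0])
    return tokens


vocab = ["the", "cat", "sat", "."]


def toy_next_logits(tokens):
    # Hypothetical stand-in for a trained model: it strongly prefers
    # whichever word follows the last token in vocab order.
    nxt = (vocab.index(tokens[-1]) + 1) % len(vocab)
    return [5.0 if i == nxt else 0.0 for i in range(len(vocab))]


generated = generate(toy_next_logits, ["the"], vocab, max_new_tokens=3, temperature=0.5)
```

Because each new token is appended to the context before the next prediction, an early sampling choice influences everything generated after it.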
11. What is retrieval augmented generation and when should you use it?
Retrieval augmented generation, or RAG, is a pattern where a language model is given relevant context retrieved from an external knowledge source as part of its prompt, rather than relying only on what it learned during training. A typical RAG pipeline embeds documents into a vector database, retrieves the most relevant chunks for a given query, and includes them in the prompt so the model can answer based on that material. RAG is useful when the model needs access to information that was not in its training data, that changes frequently, or that is proprietary to a specific organization. It is also a common way to reduce hallucinations and to make answers grounded in citable sources.
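The retrieve-then-prompt flow can be sketched without any external services. In the example below, a bag of words vector stands in for a real embedding model, and an in-memory list stands in for a vector database. Both are deliberate simplifications, and the documents and function names are made up for illustration.

```python
import math
from collections import Counter


def embed(text):
    # Toy stand-in for an embedding model: a bag of words count vector.
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, top_k=2):
    """Return the top_k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


def build_rag_prompt(query, chunks, top_k=2):
    """Place the retrieved chunks in the prompt ahead of the question."""
    retrieved = retrieve(query, chunks, top_k)
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(retrieved))
    return (
        "Answer using only the context below and cite the bracketed source numbers.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )


# Made-up knowledge base.
chunks = [
    "refunds are processed within five business days",
    "the office is closed on public holidays",
    "refunds require the original receipt",
]
prompt = build_rag_prompt("how are refunds processed", chunks, top_k=2)
```

The resulting prompt contains the two refund chunks and omits the unrelated one, so the model answers from material it can cite.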
12. What is the difference between prompting, fine tuning, and RAG?
All three are ways to adapt a general purpose model to a specific use case, and they sit on a spectrum of cost and complexity. Prompting changes the model's behavior by giving it carefully constructed inputs at inference time, with no change to the model weights. It is the fastest and cheapest to try and works well for many tasks. Fine tuning updates the model weights using additional training data, which can specialize behavior in ways prompting cannot, but is more expensive and requires care to avoid degrading the model. RAG injects external context at query time without changing the weights, which is the right pattern when the goal is to ground answers in a specific knowledge base rather than to change the model's underlying behavior. In production systems, the three are often combined.
13. What are hallucinations and how do you mitigate them?
A hallucination is when a language model produces output that sounds confident and plausible but is factually wrong or unsupported by any real source. It happens because the model is fundamentally a probability distribution over plausible text rather than a database of facts, and nothing in the basic training objective requires the output to be true. Mitigations include grounding the model in retrieved documents using RAG, instructing the model to cite sources and to decline to answer when no source is available, lowering the sampling temperature for factual tasks, asking the model to express uncertainty when relevant, validating model outputs against external systems where possible, and using evaluations that specifically test for factual accuracy on representative inputs. No single technique eliminates hallucinations, so production systems usually layer several.
14. How do you evaluate a language model output?
Evaluation of generative output is harder than evaluation of classification, because there is rarely a single correct answer. The most common approaches include human evaluation against rubrics for criteria like helpfulness, accuracy, and tone, automated metrics for specific properties such as exact match, factuality against a reference, code that runs correctly, or adherence to a required format, and model graded evaluation where a stronger model is used as a judge to score outputs against criteria. For production systems, evaluation usually involves a curated test set of representative prompts, automated checks that catch regressions, and ongoing human review of a sample of real traffic. The harder question, often discussed in interviews, is how to evaluate quality in a way that catches real failures rather than just measuring superficial properties.
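The automated part of that picture is often a small regression harness. The following sketch shows two of the checks mentioned above, exact match and adherence to a required format, plus a loop over a curated test set. The function names, the test cases, and the fake model are all hypothetical.

```python
import json


def exact_match(output, reference):
    """Case and whitespace insensitive exact match."""
    return output.strip().lower() == reference.strip().lower()


def is_valid_json_with_keys(output, required_keys):
    """Format check: output must parse as a JSON object containing the required keys."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and all(k in parsed for k in required_keys)


def run_eval(cases, generate):
    """Fraction of test cases where the model output matches the expected answer."""
    results = [exact_match(generate(c["prompt"]), c["expected"]) for c in cases]
    return sum(results) / len(results)


# Made-up test set and a fake "model" backed by a lookup table.
cases = [
    {"prompt": "2+2", "expected": "4"},
    {"prompt": "capital of France", "expected": "Paris"},
]
fake_outputs = {"2+2": " 4 ", "capital of France": "Lyon"}
score = run_eval(cases, lambda prompt: fake_outputs[prompt])
```

Checks like these catch regressions cheaply, but they only measure the properties they encode, which is why they are paired with human review and model graded evaluation.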
15. What is the difference between zero shot, few shot, and fine tuned approaches?
Zero shot means the model is asked to perform a task with no examples of the task in the prompt. Few shot means the model is given a small number of examples in the prompt to anchor the pattern, usually two to ten. Fine tuning means the model is trained on many examples of the task so that the behavior becomes part of the model itself. The general progression in cost and capability runs from zero shot, which is free and often sufficient for simple tasks, through few shot, which is a cheap way to lift quality on more structured tasks, to fine tuning, which is the right move when prompting cannot reach the quality bar or when the task is run at high enough volume that the inference cost savings of a smaller fine tuned model justify the up front training work.
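The difference between zero shot and few shot is purely in how the prompt is assembled, as a short sketch shows. The prompt layout, the function name, and the example reviews below are illustrative assumptions, not a format any particular model requires.

```python
def build_classification_prompt(task, query, examples=()):
    """Assemble a prompt; with no examples it is zero shot, with some it is few shot."""
    parts = [task]
    for text, label in examples:
        parts.append(f"Input: {text}\nOutput: {label}")
    # The final block leaves the output empty for the model to complete.
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


task = "Classify the sentiment of each review as positive or negative."

zero_shot = build_classification_prompt(task, "great product")

few_shot = build_classification_prompt(
    task,
    "great product",
    examples=[
        ("arrived broken and late", "negative"),
        ("works exactly as described", "positive"),
    ],
)
```

Both prompts go to the same unchanged model. Fine tuning, by contrast, would move those examples out of the prompt and into the training data.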
Applied Judgment and Production
16. How would you approach a brand new machine learning problem?
A reasonable answer walks through a structured process. First, clarify the business problem and what success would look like in concrete terms before touching any data. Second, define an evaluation metric that reflects what actually matters, including how the model's errors will translate into business outcomes. Third, look at the data carefully, understand what is available and what is missing, and check for obvious quality issues. Fourth, start with a simple baseline, often a heuristic or a basic model, to establish a reference point. Fifth, iterate on the model with the simplest changes first, validating each change against a held out validation set and against the actual business outcome where possible, and reserving the test set for a final check so that it stays untouched by model selection. Sixth, plan for deployment from the beginning, including how the model will be monitored, retrained, and rolled back if it fails in production. The structure of the answer matters more than the specific framework. Interviewers want to see that the candidate thinks about the problem end to end rather than jumping straight to model selection.
17. What are the main considerations when deploying a model to production?
Deployment involves a different set of concerns than model development. Latency and throughput need to meet the requirements of the application, which often means model compression, batching, caching, or choosing a smaller model. Reliability requires monitoring of input distributions for drift, output distributions for anomalies, and downstream business metrics for impact. Versioning of both the model and the data it was trained on is important for reproducibility and for rollback. Cost needs to be tracked, particularly for large model inference. Security and privacy require thinking about what data the model sees, where it is stored, and what could leak in outputs. Compliance matters in regulated domains. And the deployment pipeline itself needs to be testable and reversible, because models will need to be updated. The MLOps discipline exists to cover all of this.
18. How do you think about bias and fairness in machine learning systems?
Bias can enter at every stage of the pipeline. Training data can reflect historical patterns that are themselves biased. Feature engineering can encode unintended proxies for protected attributes. Model architecture and objectives can amplify existing patterns. Evaluation can miss problems that only show up for specific subgroups. Deployment can apply the model to populations that were underrepresented in training. A serious approach to fairness includes representative training data, careful feature design, evaluation broken out by subgroup rather than only in aggregate, explicit fairness metrics chosen to match the application, ongoing monitoring in production, and a clear escalation path when problems are found. The right metrics and the right tradeoffs depend on the domain, and interviewers usually care that the candidate can name the tensions involved rather than that they recite a single formula.
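Evaluation broken out by subgroup is simple to sketch, and it shows why aggregate metrics can hide problems. The example below uses made-up labels and two hypothetical groups. Accuracy is used only for brevity, and the right metric depends on the application.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return accuracy computed separately for each subgroup."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(t == p)
    return {g: correct[g] / total[g] for g in total}


# Made-up predictions for two groups of four examples each.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

overall = sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)
per_group = accuracy_by_group(y_true, y_pred, groups)
# overall is 0.75, but group "a" scores 1.0 while group "b" scores 0.5
```

The aggregate number looks acceptable while one group gets predictions no better than a coin flip, which is the kind of gap that only a per group breakdown surfaces.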
19. What is the difference between AI, machine learning, deep learning, and generative AI?
AI is the broad field of building systems that perform tasks that would normally require human intelligence. Machine learning is a subset of AI that focuses on systems that learn patterns from data rather than being explicitly programmed. Deep learning is a subset of machine learning that uses neural networks with many layers, and is responsible for most of the progress of the last fifteen years in vision, speech, and language. Generative AI is a category of deep learning systems whose primary output is new content such as text, images, audio, video, or code. Large language models and diffusion image models are the most visible examples. The terms are often used loosely, and a strong candidate can use them precisely when the conversation needs precision.
20. Tell me about a project you worked on. How would you do it differently today?
This is the question that most often distinguishes strong candidates. The best answers describe a concrete project with specifics: the problem being solved, the approach taken, the outcome, and an honest reflection on what could have been done better. The "differently today" part matters because the field moves quickly, and a candidate who would still build the same system the same way they did three years ago is probably not staying current. A strong answer might note that a problem solved with a custom model in 2022 could reasonably be addressed today with a prompted general purpose model and a RAG layer, or that an evaluation approach that was state of the art at the time has since been superseded by better methods. The point is not that newer is always better. It is that the candidate is reflective, current, and able to make judgments rather than just execute a recipe.
How to Use These Questions
If you are preparing for an interview, the most useful exercise is to answer each question out loud, not to memorize a written response. Interviewers can tell when a candidate is reciting and when they are reasoning, and the difference matters more than the polish of the answer. Practice explaining each concept at two levels, once for a technical peer and once for a non technical stakeholder. Many real interview questions test the second skill explicitly.
If you are designing an interview loop, the questions above cover most of the territory worth testing, but the best loops also include a hands on component. A short coding or system design exercise, a take home where the candidate builds or evaluates a small model, or a discussion of a real problem your team is working on, all tend to reveal more than any number of pure recall questions. The recall questions are best used as warmups and as filters, not as the main signal.
The bar for AI roles has risen quickly, in part because the surface area of the field keeps growing. A candidate today is expected to know classical machine learning, deep learning, modern generative AI, and the practical concerns of running models in production. Few candidates know all of it deeply. The strongest ones know which parts they know well, are honest about the parts they do not, and can reason through unfamiliar questions rather than getting stuck on them.
The Bottom Line
The questions on this list cover the conceptual core of what AI interviews tend to test, but the list is not a substitute for actually building, deploying, or working with AI systems. The candidates who interview well are usually the ones who have spent real time on real problems, who can talk about specific tradeoffs they faced, and who can extend a familiar concept to an unfamiliar situation in real time.
Use the questions as scaffolding for your preparation, but make sure the scaffolding is supporting something solid. The interviewer is almost always trying to figure out two things: whether this candidate understands the field well enough to make good decisions in it, and whether they will keep learning as the field continues to change. Strong answers to specific questions matter, but the underlying signal of judgment and curiosity matters more.