A practical DOE glossary for R&D chemists and industrial scientists — no statistics degree required.
"Everything should be made as simple as possible, but not simpler." — Albert Einstein
You run the same reaction three times. Same reagents, same equipment, same procedure. But your yields come back 74%, 76%, and 73%. That's variation — and it's everywhere in a lab. The goal of statistics isn't to pretend variation doesn't exist. It's to understand it: how much is normal, what's causing it, and when something is genuinely changing your results versus just random noise. Before you can improve any process, you have to understand its natural variation first.
You're pipetting the same solution 30 times. Every measurement comes out slightly different — 147.2, 147.5, 146.9 mL. Plot all 30 values and you get a bump in the middle with tails on each side. That's a normal distribution. It's telling you something real: your process has natural variation, and that's okay. A tight bump means a precise process. A wide bump means something is inconsistent. Statistics gives you a way to measure that width — and to notice when something is genuinely shifting your bump left or right.
You ran a reaction five times and got yields of 71%, 74%, 73%, 76%, and 72%. Your average (add them up, divide by 5) is 73.2%. Simple — but powerful. The average gives you a single number that represents your process. The real value comes when you start comparing averages: before vs. after a change, one lab vs. another, one catalyst vs. another. That's when average stops being a math concept and starts being a decision-making tool.
These are three different ways to describe the "center" of your data. The mean is your average: add everything up, divide by the count. The median is the middle value when you line everything up in order. The mode is the most frequently occurring value. In a perfect normal distribution, all three are identical. In real lab data they often diverge, and that divergence is informative: a mean pulled away from the median usually means skewed data or an outlier dragging the average around.
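If you want to see all three side by side, Python's built-in statistics module computes them directly. A minimal sketch, using hypothetical yield numbers (one value repeated so the mode is well defined):

```python
import statistics

# Hypothetical yields (%); 73 appears twice so the mode is meaningful
yields = [71, 74, 73, 76, 72, 73]

print(statistics.mean(yields))    # ~73.17 -> the average
print(statistics.median(yields))  # 73.0   -> middle value once sorted
print(statistics.mode(yields))    # 73     -> most frequent value
```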
Your average yield is 73%. Great — but how consistent are you? Standard deviation answers that. A small standard deviation means your results cluster tightly around 73% every time. A large one means you're all over the place — 65% one day, 81% the next. Two labs can have the same average yield but completely different standard deviations, meaning one is reliable and one is unpredictable. Before you trust any average, always ask: what's the standard deviation? That's the number that tells you whether your process is actually under control.
Variance is simply standard deviation squared. So why does it exist? Because mathematically, variance is easier to work with when you're combining sources of variation — which is exactly what DOE does. If your total batch-to-batch variation has multiple sources (raw materials, temperature, operator, equipment), variance lets you add those sources up and figure out which one is the biggest culprit.
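Here's a minimal sketch of both, using Python's statistics module and the five hypothetical yields from the average example above. Note the units: standard deviation is in the same units as your data, variance is in those units squared.

```python
import statistics

# The five hypothetical runs from the average example
yields = [71, 74, 73, 76, 72]

sd = statistics.stdev(yields)      # sample standard deviation, in %
var = statistics.variance(yields)  # sample variance = sd squared, in %^2

print(f"mean = {statistics.mean(yields):.1f}%")
print(f"standard deviation = {sd:.2f}%")  # ~1.92%
print(f"variance = {var:.2f}")            # ~3.70, and sd**2 == var
```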
Every experiment you run is secretly a hypothesis test — you just may not have been calling it that. You change the reaction temperature and check if yield improves. You're asking: "Is this change real, or did I just get lucky?" Hypothesis testing is the formal framework for answering that question using your data. It keeps you honest. Without it, humans naturally see patterns that aren't there — we're wired to. Hypothesis testing forces the data to make the call, not your gut.
Before you run an experiment, you start with a default assumption: "My change did nothing. Temperature doesn't affect yield. The new catalyst is no better than the old one." That's the null hypothesis — your starting position of skepticism. Your experiment then tries to disprove it. If your data is strong enough, you reject the null hypothesis and conclude something real is happening. If not, you can't claim a win.
You ran the experiment. Your yield went up. But your skeptical labmate says it's just luck. The p-value settles the argument. It tells you: "If nothing was actually going on, how likely would I be to see results this extreme just by chance?"
A p-value of 0.03 means that if nothing real were happening, you'd see a result this extreme only 3% of the time: pretty convincing. A p-value of 0.40 means results like yours turn up 40% of the time by chance alone: not convincing at all. The cutoff scientists usually use is 0.05. Below that, the result is conventionally treated as real. Above it, you need more data or a bigger effect.
When your p-value drops below 0.05, your result is called "statistically significant." That just means the result is unlikely to be random chance. It says nothing about whether the effect is big enough to matter in practice, so check the size of the effect too.
You tweaked your synthesis: new temperature, same everything else. Yield went from 72% to 76%. Is that real? A t-test answers that directly. It looks at both results, accounts for the natural variation in your process, and tells you whether a 4% jump is a genuine signal or something you'd see by chance on a good week. You get a p-value out the other end. Below 0.05 and you can walk into your team meeting and say "this change works better," with the data to back it up.
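A quick sketch of what that looks like in practice, using scipy's two-sample t-test on hypothetical before/after replicate yields (Welch's variant, which doesn't assume the two groups vary equally):

```python
from scipy import stats

# Hypothetical replicate yields (%) before and after the temperature change
old = [72.1, 71.4, 73.0, 72.5, 71.8]
new = [76.2, 75.1, 76.8, 75.5, 76.4]

# Welch's t-test: doesn't assume the two groups have equal variance
t_stat, p_value = stats.ttest_ind(new, old, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.5f}")
# A p-value below 0.05 says the jump is unlikely to be a lucky week
```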
Every time humidity in the lab is high, your reaction yield drops. Interesting hunch — but is it real? Correlation puts a number on it, from −1 to +1. Close to +1 means when one goes up, the other goes up too. Close to −1 means inverse relationship. Close to 0 means no relationship. Track humidity and yield for a few weeks, calculate correlation, and suddenly you have a lead worth investigating instead of just a gut feeling.
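Putting a number on that hunch takes one function call. A sketch with scipy and hypothetical humidity/yield records:

```python
from scipy import stats

# Hypothetical daily records: relative humidity (%) and yield (%)
humidity = [35, 42, 50, 58, 63, 70, 78]
yields   = [76, 75, 74, 73, 71, 70, 68]

r, p = stats.pearsonr(humidity, yields)
print(f"r = {r:.2f}, p = {p:.4f}")  # r near -1: yield falls as humidity rises
# Remember: correlation is a lead to investigate, not proof of causation
```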
Most scientists test one variable at a time — change temperature, keep everything else the same, see what happens. It feels logical, but it's actually inefficient and can mislead you. DOE is a smarter approach: you deliberately vary multiple factors at once in a structured way. You get more information from fewer experiments, you discover which factors matter most, and — crucially — you discover when factors interact with each other in ways you'd never catch one at a time.
Every experiment has three kinds of variables. Independent variables are what you deliberately change — temperature, pressure, catalyst amount. Dependent variables are what you measure — yield, purity, reaction time. Controlled variables are everything you hold constant so they don't mess up your results — same equipment, same supplier, same operator. Getting these straight before you start is the difference between a clean experiment and one where you can't figure out why your results don't make sense.
You switched to a new solvent and your yield jumped 10%. Exciting! But you also switched suppliers that same week. Now you don't know which change caused the improvement — they're confounded. Confounding variables are the hidden culprits that sneak into your experiments and muddy your conclusions. DOE is specifically designed to avoid this by structuring your experiments so that changes don't accidentally overlap.
These sound the same but mean very different things. Repeats are when you run the same experiment multiple times in one sitting — same day, same batch, same operator. They tell you about short-term consistency. Replicates are truly independent runs — different days, possibly different batches or operators. They capture the full natural variation of your process. Replicates are statistically much more powerful. If you only do repeats, you might convince yourself your process is rock-solid, when really you've just had a good day.
There's another distinction worth knowing: experimental error vs. sampling or test error. Experimental error is the variation that comes from actually running the experiment — slight differences in temperature control, weighing, timing, or technique. Sampling or test error is the variation that comes from measuring the result — how consistently you're pulling a sample, how precise your analytical instrument is. Both contribute to your overall variability, but they have different root causes and different fixes. Better lab technique reduces experimental error. Better analytical methods reduce sampling and test error.
You're running 20 experiments today. If you do all the high-temperature runs in the morning and all the low-temperature runs in the afternoon, and your equipment warms up over the day — you've just confounded temperature with time. Randomization fixes this by scrambling the order of your experiments. It sounds almost too simple, but it's one of the most powerful tools in experimental design. It protects your results from hidden trends you didn't even know were there — equipment drift, reagent degradation, even operator fatigue.
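Scrambling the run order is one line of code. A sketch with Python's random module and a hypothetical run list; recording the seed keeps the randomized order reproducible in your lab notebook:

```python
import random

# Hypothetical run list: 2x2 factor combinations, each to be run twice
runs = [(t, c) for t in ("low_T", "high_T") for c in ("low_cat", "high_cat")] * 2

random.seed(42)       # record the seed so the order is reproducible
random.shuffle(runs)  # break any link between factor settings and time of day

for i, run in enumerate(runs, start=1):
    print(f"run {i}: {run}")
```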
Your DOE spans two days. You know day-to-day variation exists in your lab — humidity changes, fresh reagents, different energy levels. Rather than ignoring that or letting it contaminate your results, you block for it: deliberately split your experiments across days in a balanced way, and account for the day effect in your analysis.
You want to test two temperatures (low and high) and two catalyst amounts (low and high). A factorial design says: run all four combinations — low/low, low/high, high/low, high/high. That's it. Simple grid. But from those four experiments you learn the effect of temperature, the effect of catalyst amount, and whether they interact with each other. Compare that to one-at-a-time testing, which would take more runs and still miss the interactions entirely. Factorial design is the workhorse of DOE — efficient, informative, and surprisingly simple once you see it in action.
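Enumerating those combinations is exactly what itertools.product does. A minimal sketch with hypothetical factor levels:

```python
from itertools import product

# Hypothetical factor levels for a full 2x2 factorial
temperatures = [60, 80]  # degrees C
catalyst = [0.5, 1.0]    # mol%

design = list(product(temperatures, catalyst))
print(design)  # [(60, 0.5), (60, 1.0), (80, 0.5), (80, 1.0)]
# Four runs, and you learn both main effects plus the interaction
```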
In your factorial experiment, the main effect of temperature is simply: on average, how much did yield change when you went from low to high temperature — regardless of what the catalyst was doing? It's the clean, isolated contribution of one factor. Main effects are usually what you report first: "Increasing temperature improved yield by 8% on average." They give you the headline. But they don't tell the whole story — that's where interactions come in.
Here's where DOE gets exciting. You test temperature and catalyst loading together. At low catalyst, high temperature helps. But at high catalyst, high temperature actually hurts yield. That's an interaction — and you would never find it testing one variable at a time. Interactions are incredibly common in chemistry, and missing them leads to bad decisions.
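Both quantities fall out of simple arithmetic on the four corner yields. A sketch using hypothetical 2x2 results that mirror the story above; note the interaction here is expressed as a difference of differences, and DOE software may report half this value depending on its coding convention:

```python
# Hypothetical 2x2 factorial yields (%), keyed by (temperature, catalyst) level
y = {("low", "low"): 70, ("high", "low"): 78,
     ("low", "high"): 74, ("high", "high"): 72}

# Main effect of temperature: average at high T minus average at low T
temp_effect = (y[("high", "low")] + y[("high", "high")]) / 2 \
            - (y[("low", "low")] + y[("low", "high")]) / 2

# Interaction as a difference of differences: is the temperature
# effect at high catalyst the same as at low catalyst?
interaction = (y[("high", "high")] - y[("low", "high")]) \
            - (y[("high", "low")] - y[("low", "low")])

print(f"temperature main effect: {temp_effect:+.1f}%")  # +3.0%
print(f"interaction:             {interaction:+.1f}%")  # -10.0%
# High T helps by 8% at low catalyst but hurts by 2% at high catalyst
```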
You've done your factorial DOE and found that temperature and pH both affect your yield. Now you want to find the sweet spot — the exact combination that maximizes yield. A response surface experiment maps out the landscape between your variables like a topographic map, with peaks (high yield) and valleys (low yield). Instead of just "high vs. low," you're now exploring the full space. It's how you go from "temperature matters" to "the optimal temperature is 68°C at pH 7.2." This is optimization territory — and it's where DOE pays off big.
Every measurement you take has two components: the real effect you're trying to detect (signal) and the random variation that obscures it (noise). A 5% yield improvement sounds great — unless your process naturally varies by ±8%, in which case you can't see it through the noise. The whole game of statistics is improving your signal-to-noise ratio: tighter experimental control reduces noise, more replicates help you average out noise, and good DOE design amplifies signal.
Your average yield is 73%. But what you really want to know is: where does the true yield of this process actually live? A confidence interval gives you a range, say 71% to 75%, and says you can be 95% confident the true value falls in there. It's more honest than a single number because it shows your uncertainty. When you're comparing two processes and their confidence intervals don't overlap, that's strong evidence they're genuinely different. When they do overlap, you need more data before drawing conclusions.
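Computing that range yourself is straightforward with scipy's t-distribution. A sketch using the five hypothetical yields from earlier:

```python
import statistics
from scipy import stats

# The five hypothetical replicate yields from earlier
yields = [71, 74, 73, 76, 72]

mean = statistics.mean(yields)
sem = statistics.stdev(yields) / len(yields) ** 0.5  # standard error of the mean

# 95% CI from the t-distribution (the right choice for small samples)
low, high = stats.t.interval(0.95, len(yields) - 1, loc=mean, scale=sem)
print(f"mean = {mean:.1f}%, 95% CI = ({low:.1f}%, {high:.1f}%)")  # ~(70.8, 75.6)
```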
You've run a factorial DOE and you want to know: does temperature actually matter, or is it just noise? The F-value answers that. It's essentially a ratio — how much variation is caused by your factor compared to how much variation is just experimental error — the natural, unavoidable inconsistency in your lab even when you try to hold everything constant. A large F-value means your factor is a real, dominant driver. A small F-value means its effect is too small to distinguish from experimental error.
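For a quick feel for F-values, scipy's one-way ANOVA (a simpler cousin of full DOE analysis) compares groups of replicate runs directly. A sketch with hypothetical yields at three temperatures:

```python
from scipy import stats

# Hypothetical replicate yields (%) at three temperature settings
at_60C = [70.1, 71.3, 70.8]
at_70C = [74.5, 75.2, 74.0]
at_80C = [73.2, 72.6, 73.9]

f_value, p_value = stats.f_oneway(at_60C, at_70C, at_80C)
print(f"F = {f_value:.1f}, p = {p_value:.5f}")
# Large F: variation between temperatures dwarfs the run-to-run error
```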
You've tested low and high temperature and yield went up as temperature increased. So... just keep cranking up the heat? Not so fast. Many chemical processes have a sweet spot — yield improves up to a point, then drops off as you overshoot. That curve is a quadratic effect, and a straight line can't capture it. Quadratic terms let your DOE model bend — they detect whether your response has a peak or valley rather than just a slope.
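Detecting that bend takes a second-order fit. A minimal sketch with numpy's polyfit on hypothetical temperature/yield data, solving for the peak of the fitted parabola:

```python
import numpy as np

# Hypothetical yields across five temperatures: rise, peak, then fall
temperature = np.array([50, 60, 70, 80, 90])
yield_pct = np.array([68, 74, 77, 75, 69])

a, b, c = np.polyfit(temperature, yield_pct, deg=2)  # fit y = a*T**2 + b*T + c
peak_T = -b / (2 * a)  # vertex of the fitted parabola
print(f"quadratic model peaks near {peak_T:.0f} C")
# A straight-line fit would miss this peak entirely
```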
This is the whole point. After mapping your response surface and fitting your model — including interactions and quadratic terms — you can now ask: what combination of temperature, pH, catalyst loading, and time gives me the absolute best yield? The optimum value is that answer. DOE software finds the peak of your response surface mathematically and tells you exactly where to run your process.
Instead of years of intuition-based tweaking, you get to the optimum in a structured, defensible, reproducible way. And because you understand the landscape around the optimum, you also know how sensitive it is — whether you need to control temperature to ±0.5°C or whether ±5°C is fine.
You already know p-values from hypothesis testing (Term 9) — but in a regression model, they show up on every single term: temperature, pH, catalyst loading, their interactions, maybe their squared terms. Each one is asking the same question: "Is this term's contribution to the model real, or just noise?" A p-value below 0.05 for a coefficient means that factor genuinely belongs in your model. A high p-value means it's not pulling its weight — and you should consider dropping it to simplify the model.
You already know confidence intervals give you a range around an estimate (Term 24) — but in a regression model, every coefficient gets one. Your model says temperature has a coefficient of +2.3% yield per degree. The 95% confidence interval might be +1.1% to +3.5%. That tells you two things: the effect is real (it doesn't cross zero), and you know roughly how big it is. Narrow intervals mean you've nailed down the effect precisely. Wide intervals mean you need more data before betting the process on that number.
You've run your DOE. You have yield data at different combinations of temperature, pH, and catalyst loading. Regression is what connects the dots — it fits a mathematical equation to your data that describes how each factor influences your response. Instead of a table of results, you get a model: a formula you can plug numbers into and get a predicted yield out. It's the engine under the hood of DOE analysis. Every response surface, every prediction, every optimum value comes from a regression model fitted to your experimental data.
Your DOE software runs the analysis and hands you something like: Yield = 73.2 + 4.1×Temperature − 2.8×pH + 1.6×Temperature×pH. That's your regression equation — the actual mathematical model of your process. Each term tells you something: how much yield changes per unit of temperature, how pH pushes it down, and how those two interact. You can plug in any combination of factor levels and get a predicted yield. It's not just a summary of past experiments — it's a forward-looking tool for making process decisions.
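Once you have the equation, prediction is just arithmetic. A sketch that wraps the example equation above in a function, assuming (as DOE software typically does) that the factors are in coded units, roughly -1 to +1 across the tested range:

```python
def predicted_yield(temperature, ph):
    """The example model from the text; factors assumed in coded units."""
    return 73.2 + 4.1 * temperature - 2.8 * ph + 1.6 * temperature * ph

# Plug in any factor combination within the tested range
print(predicted_yield(temperature=1.0, ph=-1.0))  # 78.5
print(predicted_yield(temperature=0.5, ph=0.5))   # 74.25
```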
You've fitted a regression model to your yield data. R² tells you: what percentage of the variation in my results does this model actually explain? An R² of 0.92 means the model accounts for 92% of the variation you saw — the remaining 8% is unexplained noise. An R² of 0.45 means your model is missing something important. In lab DOE work, you're generally hoping for R² above 0.80 before trusting model predictions for process decisions.
Adjusted R² is R²'s more honest sibling. It penalizes you for adding terms to your model that aren't actually earning their place. If a new term genuinely improves your model's explanatory power, Adjusted R² goes up. If you're just throwing in extra factors to inflate R², Adjusted R² goes down — or stays flat — and calls you out. When R² and Adjusted R² diverge significantly, that's a signal your model may be over-fitted: it's memorizing your experimental data rather than capturing the real underlying process.
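You can watch both numbers behave with a quick fit in statsmodels. A sketch on simulated, hypothetical data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Simulated, hypothetical data: yield driven by temperature plus noise
temperature = np.linspace(-1, 1, 12)  # coded units
yields = 73 + 4 * temperature + rng.normal(0, 1, 12)

X = sm.add_constant(temperature)      # adds the intercept column
model = sm.OLS(yields, X).fit()

print(f"R-squared          = {model.rsquared:.3f}")
print(f"adjusted R-squared = {model.rsquared_adj:.3f}")
# Add a junk factor and R-squared creeps up while adjusted R-squared doesn't
```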
In your regression equation, every factor has a number in front of it — that's the coefficient. It tells you exactly how much your response changes for a one-unit change in that factor, holding everything else constant. A temperature coefficient of +3.5 means every degree you raise temperature adds 3.5% to your yield (at least across the range you tested). A negative coefficient means that factor is hurting you. The bigger the absolute value, the more powerful that factor's influence. Coefficients are how you rank which levers in your process are worth pulling hard and which ones barely move the needle.
The intercept is the constant in your regression equation — the baseline predicted yield when all your factors are at their reference or zero point. In practice, it's often not the most interesting number in your model (you care more about how factors shift yield up or down). But it matters for making accurate predictions: without the right intercept, the whole equation is shifted off. It also gives you a reality check — if your intercept predicts a wildly unrealistic yield when all factors are at zero, that's a signal your model may not be valid outside the range of your experimental data.
You run 16 experiments in your DOE. Fifteen come back with yields clustering nicely around your model predictions. One comes back at 41% — way off. That's an outlier, and it deserves attention before you do anything else. Outliers can mean a genuine process discovery (something interesting happened), a data entry error (you wrote 41 when you meant 74), a contaminated batch, or an equipment failure. The worst thing you can do is silently delete it. The second worst is let it quietly warp your regression model without investigating it first.
Not all data points have equal influence on your regression model. Some experimental runs sit right in the middle of your design space — they're well-supported by their neighbors, and they can't pull the model far on their own. Others sit at the extremes — corner points, or runs with unusual factor combinations — and they have disproportionate power to tug the fitted line toward themselves. That pulling power is called leverage, and it's measured by the hat value (so named because of the notation used in regression math). A high leverage point isn't automatically a problem — corner points in a factorial design are supposed to have high leverage. But when a high-leverage point also has a large residual (its actual result is far from what the model predicts), that combination is where real damage gets done.
Here's the question leverage doesn't fully answer: if I removed this data point entirely, how much would my model change? Cook's Distance answers that directly. It combines a point's leverage (its positional influence) with its residual (how wrong the model is at that point) into a single number. A large Cook's Distance means that one run is significantly shaping your entire model — and if it were removed or corrected, your coefficients, predictions, and optimum could shift noticeably. It's the most complete single diagnostic for identifying runs that are quietly running the show. DOE software will flag these for you.
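statsmodels computes both diagnostics from a fitted model via get_influence(). A sketch on hypothetical data with the 41% run from the outlier example injected, flagged using the common 4/n rule of thumb for Cook's Distance:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)

# Hypothetical DOE runs with one suspicious result injected
temperature = np.linspace(-1, 1, 16)
yields = 73 + 4 * temperature + rng.normal(0, 1, 16)
yields[3] = 41  # the 41% run from the outlier example

model = sm.OLS(yields, sm.add_constant(temperature)).fit()
influence = model.get_influence()

leverage = influence.hat_matrix_diag   # hat values: pulling power by position
cooks_d = influence.cooks_distance[0]  # Cook's Distance per run

threshold = 4 / len(yields)            # common rule-of-thumb cutoff
for i, (h, d) in enumerate(zip(leverage, cooks_d), start=1):
    flag = "  <-- investigate" if d > threshold else ""
    print(f"run {i:2d}: leverage = {h:.2f}, Cook's D = {d:.2f}{flag}")
```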