A practical DOE glossary for R&D chemists and industrial scientists — no statistics degree required.
"Everything should be made as simple as possible, but not simpler." — Albert Einstein
You run the same reaction three times. Same reagents, same equipment, same procedure. But your yields come back 74%, 76%, and 73%. That's variation — and it's everywhere in a lab. The goal of statistics isn't to pretend variation doesn't exist. It's to understand it: how much is normal, what's causing it, and when something is genuinely changing your results versus just random noise. Before you can improve any process, you have to understand its natural variation first.
You're pipetting the same solution 30 times. Every measurement comes out slightly different — 147.2, 147.5, 146.9 mL. Plot all 30 values and you get a bump in the middle with tails on each side. That's a normal distribution. It's telling you something real: your process has natural variation, and that's okay. A tight bump means a precise process. A wide bump means something is inconsistent. Statistics gives you a way to measure that width — and to notice when something is genuinely shifting your bump left or right.
You ran a reaction five times and got yields of 71%, 74%, 73%, 76%, and 72%. Your average (add them up, divide by 5) is 73.2%. Simple — but powerful. The average gives you a single number that represents your process. The real value comes when you start comparing averages: before vs. after a change, one lab vs. another, one catalyst vs. another. That's when average stops being a math concept and starts being a decision-making tool.
These are three different ways to describe the "center" of your data. The mean is your average: add everything up, divide by the count. The median is the middle value when you line everything up in order. The mode is the most frequently occurring value. In a perfect normal distribution, all three are identical. In real lab data they often diverge, and that divergence is informative: a mean pulled away from the median usually means skewed data or an outlier dragging the average around.
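If you want to see all three side by side, Python's built-in statistics module computes them directly. A minimal sketch, using hypothetical yield numbers (one value repeated so the mode is well defined):

```python
import statistics

# Hypothetical yields (%); 73 appears twice so the mode is meaningful
yields = [71, 74, 73, 76, 72, 73]

print(statistics.mean(yields))    # ~73.17 -> the average
print(statistics.median(yields))  # 73.0   -> middle value once sorted
print(statistics.mode(yields))    # 73     -> most frequent value
```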
Your average yield is 73%. Great — but how consistent are you? Standard deviation answers that. A small standard deviation means your results cluster tightly around 73% every time. A large one means you're all over the place — 65% one day, 81% the next. Two labs can have the same average yield but completely different standard deviations, meaning one is reliable and one is unpredictable. Before you trust any average, always ask: what's the standard deviation? That's the number that tells you whether your process is actually under control.
Variance is simply standard deviation squared. So why does it exist? Because mathematically, variance is easier to work with when you're combining sources of variation — which is exactly what DOE does. If your total batch-to-batch variation has multiple sources (raw materials, temperature, operator, equipment), variance lets you add those sources up and figure out which one is the biggest culprit.
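Here's a minimal sketch of both, using Python's statistics module and the five hypothetical yields from the average example above. Note the units: standard deviation is in the same units as your data, variance is in those units squared.

```python
import statistics

# The five hypothetical runs from the average example
yields = [71, 74, 73, 76, 72]

sd = statistics.stdev(yields)      # sample standard deviation, in %
var = statistics.variance(yields)  # sample variance = sd squared, in %^2

print(f"mean = {statistics.mean(yields):.1f}%")
print(f"standard deviation = {sd:.2f}%")  # ~1.92%
print(f"variance = {var:.2f}")            # ~3.70, and sd**2 == var
```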
Every experiment you run is secretly a hypothesis test — you just may not have been calling it that. You change the reaction temperature and check if yield improves. You're asking: "Is this change real, or did I just get lucky?" Hypothesis testing is the formal framework for answering that question using your data. It keeps you honest. Without it, humans naturally see patterns that aren't there — we're wired to. Hypothesis testing forces the data to make the call, not your gut.
Before you run an experiment, you start with a default assumption: "My change did nothing. Temperature doesn't affect yield. The new catalyst is no better than the old one." That's the null hypothesis — your starting position of skepticism. Your experiment then tries to disprove it. If your data is strong enough, you reject the null hypothesis and conclude something real is happening. If not, you can't claim a win.
You ran the experiment. Your yield went up. But your skeptical labmate says it's just luck. The p-value settles the argument. It tells you: "If nothing was actually going on, how likely would I be to see results this extreme just by chance?"
A p-value of 0.03 means that if nothing real were happening, you'd see a result this extreme only 3% of the time: pretty convincing. A p-value of 0.40 means results like yours turn up 40% of the time by chance alone: not convincing at all. The cutoff scientists usually use is 0.05. Below that, the result is conventionally treated as real. Above it, you need more data or a bigger effect.
When your p-value drops below 0.05, your result is called "statistically significant." That just means the result is unlikely to be random chance. It says nothing about whether the effect is big enough to matter in practice, so check the size of the effect too.
You tweaked your synthesis: new temperature, same everything else. Yield went from 72% to 76%. Is that real? A t-test answers that directly. It looks at both results, accounts for the natural variation in your process, and tells you whether a 4% jump is a genuine signal or something you'd see by chance on a good week. You get a p-value out the other end. Below 0.05 and you can walk into your team meeting and say "this change works better," with the data to back it up.
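A quick sketch of what that looks like in practice, using scipy's two-sample t-test on hypothetical before/after replicate yields (Welch's variant, which doesn't assume the two groups vary equally):

```python
from scipy import stats

# Hypothetical replicate yields (%) before and after the temperature change
old = [72.1, 71.4, 73.0, 72.5, 71.8]
new = [76.2, 75.1, 76.8, 75.5, 76.4]

# Welch's t-test: doesn't assume the two groups have equal variance
t_stat, p_value = stats.ttest_ind(new, old, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.5f}")
# A p-value below 0.05 says the jump is unlikely to be a lucky week
```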
Every time humidity in the lab is high, your reaction yield drops. Interesting hunch — but is it real? Correlation puts a number on it, from −1 to +1. Close to +1 means when one goes up, the other goes up too. Close to −1 means inverse relationship. Close to 0 means no relationship. Track humidity and yield for a few weeks, calculate correlation, and suddenly you have a lead worth investigating instead of just a gut feeling.
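Putting a number on that hunch takes one function call. A sketch with scipy and hypothetical humidity/yield records:

```python
from scipy import stats

# Hypothetical daily records: relative humidity (%) and yield (%)
humidity = [35, 42, 50, 58, 63, 70, 78]
yields   = [76, 75, 74, 73, 71, 70, 68]

r, p = stats.pearsonr(humidity, yields)
print(f"r = {r:.2f}, p = {p:.4f}")  # r near -1: yield falls as humidity rises
# Remember: correlation is a lead to investigate, not proof of causation
```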
Most scientists test one variable at a time — change temperature, keep everything else the same, see what happens. It feels logical, but it's actually inefficient and can mislead you. DOE is a smarter approach: you deliberately vary multiple factors at once in a structured way. You get more information from fewer experiments, you discover which factors matter most, and — crucially — you discover when factors interact with each other in ways you'd never catch one at a time.
Every experiment has three kinds of variables. Independent variables are what you deliberately change — temperature, pressure, catalyst amount. Dependent variables are what you measure — yield, purity, reaction time. Controlled variables are everything you hold constant so they don't mess up your results — same equipment, same supplier, same operator. Getting these straight before you start is the difference between a clean experiment and one where you can't figure out why your results don't make sense.
You switched to a new solvent and your yield jumped 10%. Exciting! But you also switched suppliers that same week. Now you don't know which change caused the improvement — they're confounded. Confounding variables are the hidden culprits that sneak into your experiments and muddy your conclusions. DOE is specifically designed to avoid this by structuring your experiments so that changes don't accidentally overlap.
These sound the same but mean very different things. Repeats are when you run the same experiment multiple times in one sitting — same day, same batch, same operator. They tell you about short-term consistency. Replicates are truly independent runs — different days, possibly different batches or operators. They capture the full natural variation of your process. Replicates are statistically much more powerful. If you only do repeats, you might convince yourself your process is rock-solid, when really you've just had a good day.
There's another distinction worth knowing: experimental error vs. sampling or test error. Experimental error is the variation that comes from actually running the experiment — slight differences in temperature control, weighing, timing, or technique. Sampling or test error is the variation that comes from measuring the result — how consistently you're pulling a sample, how precise your analytical instrument is. Both contribute to your overall variability, but they have different root causes and different fixes. Better lab technique reduces experimental error. Better analytical methods reduce sampling and test error.
You're running 20 experiments today. If you do all the high-temperature runs in the morning and all the low-temperature runs in the afternoon, and your equipment warms up over the day — you've just confounded temperature with time. Randomization fixes this by scrambling the order of your experiments. It sounds almost too simple, but it's one of the most powerful tools in experimental design. It protects your results from hidden trends you didn't even know were there — equipment drift, reagent degradation, even operator fatigue.
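Scrambling the run order is one line of code. A sketch with Python's random module and a hypothetical run list; recording the seed keeps the randomized order reproducible in your lab notebook:

```python
import random

# Hypothetical run list: 2x2 factor combinations, each to be run twice
runs = [(t, c) for t in ("low_T", "high_T") for c in ("low_cat", "high_cat")] * 2

random.seed(42)       # record the seed so the order is reproducible
random.shuffle(runs)  # break any link between factor settings and time of day

for i, run in enumerate(runs, start=1):
    print(f"run {i}: {run}")
```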
Your DOE spans two days. You know day-to-day variation exists in your lab — humidity changes, fresh reagents, different energy levels. Rather than ignoring that or letting it contaminate your results, you block for it: deliberately split your experiments across days in a balanced way, and account for the day effect in your analysis.
You want to test two temperatures (low and high) and two catalyst amounts (low and high). A factorial design says: run all four combinations — low/low, low/high, high/low, high/high. That's it. Simple grid. But from those four experiments you learn the effect of temperature, the effect of catalyst amount, and whether they interact with each other. Compare that to one-at-a-time testing, which would take more runs and still miss the interactions entirely. Factorial design is the workhorse of DOE — efficient, informative, and surprisingly simple once you see it in action.
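Enumerating those combinations is exactly what itertools.product does. A minimal sketch with hypothetical factor levels:

```python
from itertools import product

# Hypothetical factor levels for a full 2x2 factorial
temperatures = [60, 80]  # degrees C
catalyst = [0.5, 1.0]    # mol%

design = list(product(temperatures, catalyst))
print(design)  # [(60, 0.5), (60, 1.0), (80, 0.5), (80, 1.0)]
# Four runs, and you learn both main effects plus the interaction
```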
In your factorial experiment, the main effect of temperature is simply: on average, how much did yield change when you went from low to high temperature — regardless of what the catalyst was doing? It's the clean, isolated contribution of one factor. Main effects are usually what you report first: "Increasing temperature improved yield by 8% on average." They give you the headline. But they don't tell the whole story — that's where interactions come in.
Here's where DOE gets exciting. You test temperature and catalyst loading together. At low catalyst, high temperature helps. But at high catalyst, high temperature actually hurts yield. That's an interaction — and you would never find it testing one variable at a time. Interactions are incredibly common in chemistry, and missing them leads to bad decisions.
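Both quantities fall out of simple arithmetic on the four corner yields. A sketch using hypothetical 2x2 results that mirror the story above; note the interaction here is expressed as a difference of differences, and DOE software may report half this value depending on its coding convention:

```python
# Hypothetical 2x2 factorial yields (%), keyed by (temperature, catalyst) level
y = {("low", "low"): 70, ("high", "low"): 78,
     ("low", "high"): 74, ("high", "high"): 72}

# Main effect of temperature: average at high T minus average at low T
temp_effect = (y[("high", "low")] + y[("high", "high")]) / 2 \
            - (y[("low", "low")] + y[("low", "high")]) / 2

# Interaction as a difference of differences: is the temperature
# effect at high catalyst the same as at low catalyst?
interaction = (y[("high", "high")] - y[("low", "high")]) \
            - (y[("high", "low")] - y[("low", "low")])

print(f"temperature main effect: {temp_effect:+.1f}%")  # +3.0%
print(f"interaction:             {interaction:+.1f}%")  # -10.0%
# High T helps by 8% at low catalyst but hurts by 2% at high catalyst
```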
You've done your factorial DOE and found that temperature and pH both affect your yield. Now you want to find the sweet spot — the exact combination that maximizes yield. A response surface experiment maps out the landscape between your variables like a topographic map, with peaks (high yield) and valleys (low yield). Instead of just "high vs. low," you're now exploring the full space. It's how you go from "temperature matters" to "the optimal temperature is 68°C at pH 7.2." This is optimization territory — and it's where DOE pays off big.
Every measurement you take has two components: the real effect you're trying to detect (signal) and the random variation that obscures it (noise). A 5% yield improvement sounds great — unless your process naturally varies by ±8%, in which case you can't see it through the noise. The whole game of statistics is improving your signal-to-noise ratio: tighter experimental control reduces noise, more replicates help you average out noise, and good DOE design amplifies signal.
Your average yield is 73%. But what you really want to know is: where does the true yield of this process actually live? A confidence interval gives you a range, say 71% to 75%, and says you can be 95% confident the true value falls in there. It's more honest than a single number because it shows your uncertainty. When you're comparing two processes and their confidence intervals don't overlap, that's strong evidence they're genuinely different. When they do overlap, you need more data before drawing conclusions.
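Computing that range yourself is straightforward with scipy's t-distribution. A sketch using the five hypothetical yields from earlier:

```python
import statistics
from scipy import stats

# The five hypothetical replicate yields from earlier
yields = [71, 74, 73, 76, 72]

mean = statistics.mean(yields)
sem = statistics.stdev(yields) / len(yields) ** 0.5  # standard error of the mean

# 95% CI from the t-distribution (the right choice for small samples)
low, high = stats.t.interval(0.95, len(yields) - 1, loc=mean, scale=sem)
print(f"mean = {mean:.1f}%, 95% CI = ({low:.1f}%, {high:.1f}%)")  # ~(70.8, 75.6)
```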
You've run a factorial DOE and you want to know: does temperature actually matter, or is it just noise? The F-value answers that. It's essentially a ratio — how much variation is caused by your factor compared to how much variation is just experimental error — the natural, unavoidable inconsistency in your lab even when you try to hold everything constant. A large F-value means your factor is a real, dominant driver. A small F-value means its effect is too small to distinguish from experimental error.
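For a quick feel for F-values, scipy's one-way ANOVA (a simpler cousin of full DOE analysis) compares groups of replicate runs directly. A sketch with hypothetical yields at three temperatures:

```python
from scipy import stats

# Hypothetical replicate yields (%) at three temperature settings
at_60C = [70.1, 71.3, 70.8]
at_70C = [74.5, 75.2, 74.0]
at_80C = [73.2, 72.6, 73.9]

f_value, p_value = stats.f_oneway(at_60C, at_70C, at_80C)
print(f"F = {f_value:.1f}, p = {p_value:.5f}")
# Large F: variation between temperatures dwarfs the run-to-run error
```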
You've tested low and high temperature and yield went up as temperature increased. So... just keep cranking up the heat? Not so fast. Many chemical processes have a sweet spot — yield improves up to a point, then drops off as you overshoot. That curve is a quadratic effect, and a straight line can't capture it. Quadratic terms let your DOE model bend — they detect whether your response has a peak or valley rather than just a slope.
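Detecting that bend takes a second-order fit. A minimal sketch with numpy's polyfit on hypothetical temperature/yield data, solving for the peak of the fitted parabola:

```python
import numpy as np

# Hypothetical yields across five temperatures: rise, peak, then fall
temperature = np.array([50, 60, 70, 80, 90])
yield_pct = np.array([68, 74, 77, 75, 69])

a, b, c = np.polyfit(temperature, yield_pct, deg=2)  # fit y = a*T**2 + b*T + c
peak_T = -b / (2 * a)  # vertex of the fitted parabola
print(f"quadratic model peaks near {peak_T:.0f} C")
# A straight-line fit would miss this peak entirely
```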
This is the whole point. After mapping your response surface and fitting your model — including interactions and quadratic terms — you can now ask: what combination of temperature, pH, catalyst loading, and time gives me the absolute best yield? The optimum value is that answer. DOE software finds the peak of your response surface mathematically and tells you exactly where to run your process.
Instead of years of intuition-based tweaking, you get to the optimum in a structured, defensible, reproducible way. And because you understand the landscape around the optimum, you also know how sensitive it is — whether you need to control temperature to ±0.5°C or whether ±5°C is fine.
You already know p-values from hypothesis testing (Term 9) — but in a regression model, they show up on every single term: temperature, pH, catalyst loading, their interactions, maybe their squared terms. Each one is asking the same question: "Is this term's contribution to the model real, or just noise?" A p-value below 0.05 for a coefficient means that factor genuinely belongs in your model. A high p-value means it's not pulling its weight — and you should consider dropping it to simplify the model.
You already know confidence intervals give you a range around an estimate (Term 24) — but in a regression model, every coefficient gets one. Your model says temperature has a coefficient of +2.3% yield per degree. The 95% confidence interval might be +1.1% to +3.5%. That tells you two things: the effect is real (it doesn't cross zero), and you know roughly how big it is. Narrow intervals mean you've nailed down the effect precisely. Wide intervals mean you need more data before betting the process on that number.
You've run your DOE. You have yield data at different combinations of temperature, pH, and catalyst loading. Regression is what connects the dots — it fits a mathematical equation to your data that describes how each factor influences your response. Instead of a table of results, you get a model: a formula you can plug numbers into and get a predicted yield out. It's the engine under the hood of DOE analysis. Every response surface, every prediction, every optimum value comes from a regression model fitted to your experimental data.
Your DOE software runs the analysis and hands you something like: Yield = 73.2 + 4.1×Temperature − 2.8×pH + 1.6×Temperature×pH. That's your regression equation — the actual mathematical model of your process. Each term tells you something: how much yield changes per unit of temperature, how pH pushes it down, and how those two interact. You can plug in any combination of factor levels and get a predicted yield. It's not just a summary of past experiments — it's a forward-looking tool for making process decisions.
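Once you have the equation, prediction is just arithmetic. A sketch that wraps the example equation above in a function, assuming (as DOE software typically does) that the factors are in coded units, roughly -1 to +1 across the tested range:

```python
def predicted_yield(temperature, ph):
    """The example model from the text; factors assumed in coded units."""
    return 73.2 + 4.1 * temperature - 2.8 * ph + 1.6 * temperature * ph

# Plug in any factor combination within the tested range
print(predicted_yield(temperature=1.0, ph=-1.0))  # 78.5
print(predicted_yield(temperature=0.5, ph=0.5))   # 74.25
```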
You've fitted a regression model to your yield data. R² tells you: what percentage of the variation in my results does this model actually explain? An R² of 0.92 means the model accounts for 92% of the variation you saw — the remaining 8% is unexplained noise. An R² of 0.45 means your model is missing something important. In lab DOE work, you're generally hoping for R² above 0.80 before trusting model predictions for process decisions.
Adjusted R² is R²'s more honest sibling. It penalizes you for adding terms to your model that aren't actually earning their place. If a new term genuinely improves your model's explanatory power, Adjusted R² goes up. If you're just throwing in extra factors to inflate R², Adjusted R² goes down — or stays flat — and calls you out. When R² and Adjusted R² diverge significantly, that's a signal your model may be over-fitted: it's memorizing your experimental data rather than capturing the real underlying process.
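You can watch both numbers behave with a quick fit in statsmodels. A sketch on simulated, hypothetical data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Simulated, hypothetical data: yield driven by temperature plus noise
temperature = np.linspace(-1, 1, 12)  # coded units
yields = 73 + 4 * temperature + rng.normal(0, 1, 12)

X = sm.add_constant(temperature)      # adds the intercept column
model = sm.OLS(yields, X).fit()

print(f"R-squared          = {model.rsquared:.3f}")
print(f"adjusted R-squared = {model.rsquared_adj:.3f}")
# Add a junk factor and R-squared creeps up while adjusted R-squared doesn't
```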
In your regression equation, every factor has a number in front of it — that's the coefficient. It tells you exactly how much your response changes for a one-unit change in that factor, holding everything else constant. A temperature coefficient of +3.5 means every degree you raise temperature adds 3.5% to your yield (at least across the range you tested). A negative coefficient means that factor is hurting you. The bigger the absolute value, the more powerful that factor's influence. Coefficients are how you rank which levers in your process are worth pulling hard and which ones barely move the needle.
The intercept is the constant in your regression equation — the baseline predicted yield when all your factors are at their reference or zero point. In practice, it's often not the most interesting number in your model (you care more about how factors shift yield up or down). But it matters for making accurate predictions: without the right intercept, the whole equation is shifted off. It also gives you a reality check — if your intercept predicts a wildly unrealistic yield when all factors are at zero, that's a signal your model may not be valid outside the range of your experimental data.
You run 16 experiments in your DOE. Fifteen come back with yields clustering nicely around your model predictions. One comes back at 41% — way off. That's an outlier, and it deserves attention before you do anything else. Outliers can mean a genuine process discovery (something interesting happened), a data entry error (you wrote 41 when you meant 74), a contaminated batch, or an equipment failure. The worst thing you can do is silently delete it. The second worst is let it quietly warp your regression model without investigating it first.
Not all data points have equal influence on your regression model. Some experimental runs sit right in the middle of your design space — they're well-supported by their neighbors, and they can't pull the model far on their own. Others sit at the extremes — corner points, or runs with unusual factor combinations — and they have disproportionate power to tug the fitted line toward themselves. That pulling power is called leverage, and it's measured by the hat value (so named because of the notation used in regression math). A high leverage point isn't automatically a problem — corner points in a factorial design are supposed to have high leverage. But when a high-leverage point also has a large residual (its actual result is far from what the model predicts), that combination is where real damage gets done.
Here's the question leverage doesn't fully answer: if I removed this data point entirely, how much would my model change? Cook's Distance answers that directly. It combines a point's leverage (its positional influence) with its residual (how wrong the model is at that point) into a single number. A large Cook's Distance means that one run is significantly shaping your entire model — and if it were removed or corrected, your coefficients, predictions, and optimum could shift noticeably. It's the most complete single diagnostic for identifying runs that are quietly running the show. DOE software will flag these for you.
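statsmodels computes both diagnostics from a fitted model via get_influence(). A sketch on hypothetical data with the 41% run from the outlier example injected, flagged using the common 4/n rule of thumb for Cook's Distance:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)

# Hypothetical DOE runs with one suspicious result injected
temperature = np.linspace(-1, 1, 16)
yields = 73 + 4 * temperature + rng.normal(0, 1, 16)
yields[3] = 41  # the 41% run from the outlier example

model = sm.OLS(yields, sm.add_constant(temperature)).fit()
influence = model.get_influence()

leverage = influence.hat_matrix_diag   # hat values: pulling power by position
cooks_d = influence.cooks_distance[0]  # Cook's Distance per run

threshold = 4 / len(yields)            # common rule-of-thumb cutoff
for i, (h, d) in enumerate(zip(leverage, cooks_d), start=1):
    flag = "  <-- investigate" if d > threshold else ""
    print(f"run {i:2d}: leverage = {h:.2f}, Cook's D = {d:.2f}{flag}")
```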