Multiple Linear and Nonlinear Regression

Module 11

statistics
Author

Caio dos Santos

Published

April 12, 2026

Introduction

In Module 10, we kept things simple by design: one predictor, one response, and a straight line. That simplicity was intentional. It let us focus on the core habits that actually matter: starting with a biological question, checking whether the model shape makes sense, and separating statistical significance from practical reality.

But real agronomy is seldom that simple.

Think about corn yield for a moment. How much does it depend on rainfall alone? Maybe some. But yield also responds to many factors, like soil texture, organic matter, pH, temperature, and management choices. All of these act together.

And here’s the thing: many of those relationships are not straight lines. They curve. They plateau. They follow S-shaped patterns.

So Module 11 takes the habits from Module 10 and extends them to handle two real-world challenges:

  • Multiple regression: When you have several predictors, not just one
  • Nonlinear modeling: When the relationship is curved, not straight

The core question doesn’t change:

What model structure gives interpretable, biologically reasonable, and practically useful predictions?

We’re just using better tools to answer it.

Part 1: Multiple Linear Regression

The Model: From One Predictor to Many

Here’s the key idea: We’re going to keep the same basic model structure, but expand it from one predictor to many.

In Module 10, we looked at this simple model:

\[Y = \beta_0 + \beta_1 X + \varepsilon\]

Now we add more predictors:

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon \]

The algebra looks the same. But here’s where things get interesting: the interpretation changes fundamentally.

In simple regression, \(\beta_1\) is straightforward:

“For every one-unit increase in \(X\), \(Y\) changes by \(\beta_1\) units”

In multiple regression, each coefficient tells a different story:

“For every one-unit increase in \(X_j\), \(Y\) changes by \(\beta_j\) units after accounting for all the other predictors.

Let me say that again because it’s crucial: after accounting for all the other predictors. This is what makes multiple regression powerful and also what makes it tricky to interpret.

When you have multiple predictors, each coefficient is what we call a partial effect. It’s the association between one predictor and the response after adjusting for the other predictors in the model.

This is a conditional association, not automatically a causal effect.

Here is a concrete illustration to make this real:

Suppose two counties in Minnesota receive exactly the same amount of rainfall. But one has higher organic matter in its soil than the other. If the model gives \(\beta_{om} = 0.8\), here’s what that means:

Holding rainfall (and all other variables) constant, yield increases by 0.8 units for each one-unit increase in OM

Notice what we’re doing here: we’re comparing two counties that are similar in everything except organic matter.

We’re not comparing “all high-OM counties” vs “all low-OM counties” (because those would differ in rainfall too, and a thousand other things).

That’s the whole point. This is what makes it a partial effect.

This does NOT mean: We have created a controlled experiment. We didn’t go out and manipulate organic matter while holding rainfall steady. We’re just using the model to statistically adjust for the differences.

This adjusted interpretation is why multiple regression matters. Without it, the key question can’t be answered: “Is this factor actually important, or is it just correlated with something else that matters?”

Case Study: Which Soil and Climate Factors Actually Drive Corn Yield?

Here’s a concrete scenario that makes this necessary. Imagine you’re a soil scientist working with a cooperative extension office. A farmer walks in and asks:

“I know rainfall matters for yield, but in this region, which soil properties matter most? Is it organic matter? Texture? pH?”

That’s a great question. But here’s why you can’t just run three separate regressions:

Across counties, rainfall, organic matter, and soil texture are all correlated. Areas with high organic matter typically also have better rainfall. If you regressed yield on OM alone, you’d see a big positive effect. But is that really because of OM, or is it just because those counties happen to have better rainfall too?

This is where multiple regression helps. Instead of asking “Does OM matter?” ask the real question: “After accounting for rainfall, does OM still explain variation in yield?”

Let’s work through it with county-level data from Minnesota and Wisconsin.

     county state  stco      ppt      whc     sand     silt      clay        om
1 Abbeville    SC 45001 562.2991 22.94693 39.26054 21.38839 39.351063  36.92471
2    Acadia    LA 22001 751.1478 37.83115 12.82896 54.19231 32.978734 105.79133
3  Accomack    VA 51001 584.6807 18.56290 74.82042 16.82072  8.358863 103.11425
4       Ada    ID 16001 124.9502 15.54825 44.49350 40.48000 15.026510  77.61499
5     Adair    IA 19001 598.4478 28.91851 17.82011 48.44056 33.739326 259.53255
6     Adair    KY 21001 698.6234 20.87349 15.66781 48.16733 36.164866  73.05356
   kwfactor  kffactor      sph    slope  tfactor      corn  soybean  cotton
1 0.2045719 0.2044325 5.413008 42.59177 4.666834  57.75000 16.75385 525.500
2 0.4432803 0.4431914 6.370533 34.92266 4.855202  92.42500 28.72857      NA
3 0.2044544 0.2044544 5.166446       NA 4.011007 116.99714 30.44857 575.125
4 0.4106704 0.4198545 7.648190       NA 2.853957 145.74138       NA      NA
5 0.3541369 0.3541369 6.215139 55.24201 4.691089 133.02571 41.70857      NA
6 0.3166089 0.3725781 5.351832 37.65075 4.119289  95.76765 37.50000      NA
     wheat
1 30.17500
2 38.11000
3 56.07407
4 97.26296
5 34.04000
6 37.42105

For this example, the analysis focuses on Minnesota and Wisconsin.

[1] 156   6
      corn             ppt             sand            clay       
 Min.   : 56.83   Min.   :379.8   Min.   :10.16   Min.   : 1.241  
 1st Qu.:102.43   1st Qu.:469.7   1st Qu.:33.42   1st Qu.:13.164  
 Median :122.47   Median :514.1   Median :40.46   Median :21.139  
 Mean   :118.29   Mean   :506.9   Mean   :44.43   Mean   :20.045  
 3rd Qu.:134.35   3rd Qu.:548.5   3rd Qu.:54.77   3rd Qu.:26.117  
 Max.   :152.65   Max.   :598.6   Max.   :93.37   Max.   :53.693  
       om              sph       
 Min.   : 101.4   Min.   :5.394  
 1st Qu.: 269.3   1st Qu.:6.184  
 Median : 359.8   Median :6.953  
 Mean   : 455.2   Mean   :6.803  
 3rd Qu.: 560.6   3rd Qu.:7.428  
 Max.   :2021.0   Max.   :7.815  

Plot the Data First

Before fitting anything, inspect the data. No exceptions.

A scatterplot matrix isn’t pretty, but it’s honest. You’ll see which variables move together, which relationships look straight or curved, and how much your predictors are correlating with each other. Your eyes will catch patterns that a test will miss.

Start with a Simple Model (Two Predictors)

Let’s start with a simple model using only rainfall and organic matter to predict corn yield. This might feel like a stripped-down model, but that’s exactly the point. Simplicity helps us see the core logic before we add complexity.

model_simple <- lm(corn ~ ppt + om, data = model_data)
summary(model_simple)

Call:
lm(formula = corn ~ ppt + om, data = model_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-43.252 -11.079  -0.541  12.149  30.401 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -5.284728  12.638306  -0.418 0.676423    
ppt          0.257903   0.024659  10.459  < 2e-16 ***
om          -0.015734   0.004223  -3.726 0.000273 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.22 on 153 degrees of freedom
Multiple R-squared:  0.4403,    Adjusted R-squared:  0.433 
F-statistic: 60.19 on 2 and 153 DF,  p-value: < 2.2e-16

So you’ve got the output. Now what? Here’s what NOT to do: don’t just scan for low p-values and move on.

Instead, scrutinize each coefficient carefully. First, the sign: does it match biology? Does more rainfall increase yield? That makes sense. If rainfall decreases yield, that’s a red flag unless we’re talking about poorly drained soils.

Then check uncertainty. Is the p-value small, and do the confidence intervals exclude zero? Wide confidence intervals mean the model doesn’t really know where the effect lives. Finally, plausibility: Given your data and model assumptions, does this make biological sense? Sometimes you get a technically valid result that’s biologically nonsensical.

Visualizing Partial Effects

If graphs help you, here’s a direct way to see partial effects.

Let’s use a stripped-down teaching model with just two predictors: rainfall and organic matter.

\[ \text{corn} = \beta_0 + \beta_{ppt}\,\text{ppt} + \beta_{om}\,\text{om} + \varepsilon \]

This makes the partial-effect idea easier to see without extra moving parts. In this setup, \(\beta_{om}\) is the change in predicted yield for a one-unit increase in OM after accounting for rainfall.

How to read it quickly:

  • At any fixed rainfall, the vertical gap between the two lines is the OM partial effect.
  • The lines are parallel because there is no interaction term (ppt:om). That means the OM effect is constant across rainfall values in this model.

Pause and reflect: If two counties have the same rainfall but different OM, the model-predicted yield gap between those counties is the partial OM effect.

Expand to the Full Model

Once that logic is clear, we can expand to the full model with additional predictors:

model_full <- lm(corn ~ ppt + sand + clay + om + sph, data = model_data)
summary(model_full)

Call:
lm(formula = corn ~ ppt + sand + clay + om + sph, data = model_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-42.432  -4.342   1.108   7.035  23.278 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -189.02125   22.76727  -8.302 5.52e-14 ***
ppt            0.37140    0.02165  17.159  < 2e-16 ***
sand          -0.33075    0.10746  -3.078  0.00248 ** 
clay          -1.16506    0.24579  -4.740 4.92e-06 ***
om            -0.01679    0.00307  -5.469 1.85e-07 ***
sph           24.21303    2.02991  11.928  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.61 on 150 degrees of freedom
Multiple R-squared:  0.7651,    Adjusted R-squared:  0.7573 
F-statistic: 97.71 on 5 and 150 DF,  p-value: < 2.2e-16

This is the same idea as before, just with more variables being adjusted for at the same time. You’re not changing the logic, only increasing realism.

Check Assumptions

Just like in Module 10, we need to look at the diagnostics. Here’s what these four plots are actually telling you:

par(mfrow = c(1, 2))
plot(model_full, which = 1:2)

par(mfrow = c(1, 1))

Residuals vs Fitted: You want to see a random cloud centered around zero. No patterns. If it fans out (wider at high values than low), that’s heteroscedasticity—your model loses confidence at some predictions. If you see a U-shape or wave, the model is systematically biased.

Normal Q-Q: Points should track the diagonal line. Small wiggles at the tails are usually fine in large samples, but a systematic S-shape means residuals aren’t normally distributed. That’s worth noting. In larger samples, this is usually more about inference precision (p-values and intervals) than mean-prediction quality.

Quick check: If at least one diagnostic shows a clear pattern, pause before interpreting coefficients as if everything is fine.

What’s Going On When Coefficients Change?

Suppose you fit a model with both sand and clay. You get a negative coefficient on sand. Then you remove clay from the model and refit. Now sand has a positive coefficient. Something flipped.

Your first thought might be: “That’s weird. Did I make a mistake?”

Not necessarily. This is actually a prediction.

Here’s what’s happening:

Sand and clay are correlated in most cases (areas with more sand have less clay). When clay is in the model, the model is extracting the partial effect of sand after already accounting for clay. When you remove clay, sand’s job changes. Now sand has to explain more variation because clay isn’t there to help. So the sign can flip.

What’s the lesson here? This is a consequence of multicollinearity—when predictors are correlated, they can share or reallocate explained variation among themselves. It’s one reason why you have to be careful when interpreting individual coefficients.

Diagnose Multicollinearity

Here’s the problem: If two predictors are highly correlated, their partial effects become unstable. A small change in the data can flip the sign or magnitude of one of the coefficients. This is bad because you can’t trust what the coefficient is telling you.

The correlation matrix is a good first check:

predictor_cor <- cor(model_data[, c("ppt", "sand", "clay", "om", "sph")])
round(predictor_cor, 2)
       ppt  sand  clay    om   sph
ppt   1.00 -0.19 -0.09  0.04 -0.46
sand -0.19  1.00 -0.84  0.34 -0.43
clay -0.09 -0.84  1.00 -0.28  0.67
om    0.04  0.34 -0.28  1.00 -0.04
sph  -0.46 -0.43  0.67 -0.04  1.00

Look for pairs of predictors with correlations near 0.7 or higher (in absolute value). This is not an absolute cutoff but it’s a good indication that standard errors could increase, which typically leads to larger p-values and wider confidence intervals. That makes real effects harder to detect as predictors compete for the same variation.

Pause and reflect: if your goal is prediction, correlated predictors can still help. If your goal is interpretation, correlation becomes a much bigger problem.

Does that automatically mean you need to remove a predictor? Not necessarily. But it absolutely means you need to interpret carefully. Not all multicollinearity is fatal, but you have to know when you’re dealing with it.

Worked Example: Variance Inflation Factor (VIF)

The correlation matrix is useful, but here’s an even more direct measure: the Variance Inflation Factor. It tells you how much the uncertainty of a coefficient is inflated due to correlations with other predictors.

Here’s the idea: Imagine you regressed one predictor (say, sand) on all the other predictors. If you can predict sand really well from the other variables, that means sand is redundant. This is because it’s explaining variation that’s already covered. In that case, sand’s coefficient will be uncertain, and its standard error will increase, which usually pushes p-values upward.

The math is:

\[ VIF_j = \frac{1}{1 - R_j^2} \]

where \(R_j^2\) comes from regressing \(X_j\) on the other predictors.

Here’s how to interpret VIF: values around 1 mean minimal collinearity—the predictor is doing its own thing. Values above about 5 suggest meaningful collinearity and loss of precision. Values above about 10 are often considered problematic. These are heuristics, not hard rules, so context still matters.

Watch out for compositional variables. Soil texture is the classic trap: sand, silt, and clay must sum to 100% by definition. That’s not collinearity—that’s perfect linear dependence. Your VIF will be infinite. This is a special case that requires special handling.

Here’s how it plays out with a worked example:

Let’s run VIF on two scenarios:

Let’s include all three soil texture classes as predictors in a model along with organic matter, and let’s take a look at the summary of this model.

model_inflated <- lm(corn ~ sand + silt + clay + om, data = counties_mn_wi)
summary(model_inflated)

Call:
lm(formula = corn ~ sand + silt + clay + om, data = counties_mn_wi)

Residuals:
    Min      1Q  Median      3Q     Max 
-57.877 -11.757   1.916  11.088  32.694 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.072e+07  3.825e+07  -1.326    0.187
sand         5.072e+05  3.825e+05   1.326    0.187
silt         5.072e+05  3.825e+05   1.326    0.187
clay         5.072e+05  3.825e+05   1.326    0.187
om          -1.669e-03  5.111e-03  -0.327    0.744

Residual standard error: 18.38 on 151 degrees of freedom
  (3 observations deleted due to missingness)
Multiple R-squared:  0.2907,    Adjusted R-squared:  0.272 
F-statistic: 15.47 on 4 and 151 DF,  p-value: 1.249e-10

Estimates and standard errors for all three predictors seem to be inflated, right? Let’s take a look at the VIF for these predictors. We will use the vif() function in the library called car, which stands for Companion to Applied Regression.

library(car)
vif(model_inflated)
        sand         silt         clay           om 
1.983537e+13 8.446109e+12 4.548837e+12 1.143113e+00 

We were talking about rules of thumb and saying that values above 10 were problematic… Those values are pretty extreme.

Let’s get rid of one of the soil texture classes, look at the summary and compute the VIF again.

model_adjusted <- lm(corn ~ sand + clay + om, data = counties_mn_wi)
summary(model_adjusted)

Call:
lm(formula = corn ~ sand + clay + om, data = counties_mn_wi)

Residuals:
    Min      1Q  Median      3Q     Max 
-57.741 -11.886   1.901  11.373  32.935 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 184.200776  13.190346  13.965  < 2e-16 ***
sand         -1.003902   0.162054  -6.195 5.21e-09 ***
clay         -1.013062   0.330605  -3.064  0.00258 ** 
om           -0.002201   0.005108  -0.431  0.66709    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18.42 on 152 degrees of freedom
  (3 observations deleted due to missingness)
Multiple R-squared:  0.2825,    Adjusted R-squared:  0.2683 
F-statistic: 19.95 on 3 and 152 DF,  p-value: 5.898e-11

The estimates for the effects of sand and clay seem a little more plausible, also, the standard error of the effect of these two predictors seems way less inflated. Let’s compute the VIF to check.

vif(model_adjusted)
    sand     clay       om 
3.533378 3.371310 1.135529 

The VIF looks more under control now, indicating that keeping sand and clay as predictors could work.

What’s happening here?

In the first case, we included all three texture fractions—sand, silt, and clay. The VIF values are huge for the texture variables. Why? Because they’re compositional: they sum to 100% by definition. This means if you know sand and clay, you automatically know silt (it’s 100% minus the others). So the model can’t distinguish between their effects. This is perfect linear dependence.

In the second case, we dropped one texture variable (silt). Now the VIF values are finite and interpretable: sand ≈ 3.5, clay ≈ 3.3, OM ≈ 1.1.

With compositional data, you have to drop one category to avoid perfect collinearity. Once you do that, you can estimate the partial effects properly.

Keep It as Simple as Possible (But Not Simpler)

Here’s a temptation: the more predictors you add to a model, the higher \(R^2\) gets. It always does. So why not just throw everything in?

Because more is not always better.

A simpler model is easier to interpret, easier to communicate to stakeholders, and less prone to overfitting. So the real question is: Do the extra predictors earn their place?

This is a decision step, not a math contest.

Let’s compare a full model (all five predictors) with a reduced model (dropping clay):

model_reduced <- lm(corn ~ ppt + sand + om + sph, data = model_data)

summary(model_reduced)

Call:
lm(formula = corn ~ ppt + sand + om + sph, data = model_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-50.342  -5.438   1.410   7.135  22.276 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.031e+02  2.412e+01  -8.421 2.69e-14 ***
ppt          3.815e-01  2.302e-02  16.575  < 2e-16 ***
sand         5.577e-02  7.479e-02   0.746    0.457    
om          -1.607e-02  3.277e-03  -4.903 2.41e-06 ***
sph          1.953e+01  1.895e+00  10.306  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.34 on 151 degrees of freedom
Multiple R-squared:  0.7299,    Adjusted R-squared:  0.7227 
F-statistic:   102 on 4 and 151 DF,  p-value: < 2.2e-16
anova(model_reduced, model_full)
Analysis of Variance Table

Model 1: corn ~ ppt + sand + om + sph
Model 2: corn ~ ppt + sand + clay + om + sph
  Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
1    151 19423                                  
2    150 16893  1    2530.3 22.468 4.922e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(model_reduced, model_full)
              df      AIC
model_reduced  6 1207.310
model_full     7 1187.536

Now look at the results together. Here’s what each number is telling you:

ANOVA p-value: If this is large (> 0.05), it says “clay doesn’t add much once you’ve already got the other variables.” The reduced model explains essentially the same variation.

(This comparison assumes the models are nested and fit on the same dataset with identical observations.)

AIC (Akaike Information Criterion): Lower is better. But AIC penalizes complexity, so a simpler model can have lower AIC even if \(R^2\) is slightly smaller. Think of it as “How much better would I have to get if I’m going to justify adding more complexity?”

Adjusted \(R^2\): This penalizes unnecessary predictors. Compare this between models, not the raw \(R^2\).

So how do you decide in practice?

  • ANOVA p-value is large (the dropped predictor adds little)
  • AIC is similar or lower for the reduced model

Lesson: model selection is a biological and statistical decision, not a race for the highest \(R^2\).

Part 2: Nonlinear Relationships

Straight lines are convenient for teaching but biology rarely cooperates.

So here’s the transition from Part 1: once a linear model starts missing structure, we need a model that matches biology better.

Think about nitrogen response in corn. At low N rates, each additional kg gives a big yield bump. But keep adding N and the returns shrink. Eventually more N does nothing. That’s not a straight line—it’s curved.

Or think about plant growth over time. Early on, growth is slow. Then it accelerates. Then it slows again as the plant matures. That’s an S-curve, not a line.

When you fit a straight line to curved data, your residuals will show systematic patterns. That’s your signal: It’s time to move beyond linear.

Here’s the diagnostic approach—and notice it’s the same habit from Module 10:

Look at the residuals.

Nonlinear modeling isn’t a different workflow. It’s what you do after the linear model shows clear problems in the data.

When does a linear model genuinely fail? Here are the warning signs:

  1. Residual plots show systematic curvature. Not random scatter, but a U-shape or wave pattern. The model over-predicts at low values and under-predicts at high values (or vice versa). That’s your signal that the true relationship is curved.

  2. Predictions become biologically nonsensical. A linear model for plant growth might predict negative biomass at early time points. A nitrogen response model might predict infinite yield at very high N rates. That’s not realistic, and it signals that curvature matters.

  3. You know the mechanism is curved. Sometimes domain knowledge tells you the relationship must follow an S-curve (logistic growth) or a linear-plateau (nutrient saturation). Don’t ignore that knowledge just because a line “fits.”

Now comes the choice:

There are two main strategies, and they’re useful in different situations:

  • Strategy 1: Stay in a linear framework but add polynomial terms (quadratic, cubic, etc.). This keeps fitting simple.
  • Strategy 2: Fit a nonlinear model directly with a biological curve shape built in. This requires more care and can extrapolate more sensibly when the chosen form matches the underlying mechanism.

Pause and reflect: if you care most about a quick local fit, start with polynomial terms. If you care most about biologically meaningful parameters and realistic extrapolation, go nonlinear.

Strategy 1: Polynomial Regression in a Linear Framework

When to use polynomials

Polynomials (squared terms, cubic terms, etc.) are attractive because:

  • They stay in the linear regression framework, so all the tools you know work
  • They’re easy to fit
  • You can use all the diagnostics from Module 10

They work well when you need to capture curvature but don’t have strong prior knowledge about what that curve should look like.

Example: Nitrogen Response with Curvature

Look at a simple linear fit to nitrogen response:

corn_n <- read.csv("data/corn_nrate_mono.csv")

ggplot(corn_n, aes(x = n_rate, y = yield)) +
  geom_point(pch = 21, size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = "tomato") +
  labs(x = "N rate", y = "Yield") +
  theme_bw()

The straight line captures the general trend, but the response is clearly curved. It’s steep at low N and flattens out at high N. Adding a squared term captures this bend:

model_poly <- lm(yield ~ n_rate + I(n_rate^2), data = corn_n)
summary(model_poly)

Call:
lm(formula = yield ~ n_rate + I(n_rate^2), data = corn_n)

Residuals:
     Min       1Q   Median       3Q      Max 
-17.3977  -4.2973   0.8189   5.9189  15.1804 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 97.417690   4.341903  22.437 8.86e-12 ***
n_rate       0.622452   0.087159   7.142 7.57e-06 ***
I(n_rate^2) -0.001557   0.000348  -4.474 0.000626 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.909 on 13 degrees of freedom
Multiple R-squared:  0.9021,    Adjusted R-squared:  0.887 
F-statistic: 59.89 on 2 and 13 DF,  p-value: 2.755e-07

The model is:

\[ Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \varepsilon \]

When \(\beta_2\) is negative (like here), the curve bends downward and represents diminishing returns—each additional unit of N produces less gain.

Here’s the crucial part: Don’t interpret \(\beta_1\) alone. When a squared term is present, the effect of \(X\) depends on both \(\beta_1\) and \(\beta_2\). The marginal effect of \(X\) is \(\beta_1 + 2\beta_2 X\), so it depends on the value of \(X\) (i.e., the slope changes along the curve).

Quick check: at low N and high N, does the model imply the same slope? If not, that’s exactly why curvature matters.

The best way to understand the effect? Make a prediction plot:

The curve shows what’s actually happening: steep initial response, then flattening. That’s what matters. That’s more meaningful than any coefficient you could report.

Strategy 2: Fit Nonlinear Directly with nls()

When to use nonlinear models

Nonlinear models are powerful when:

  • You understand the mechanism and can describe the curve shape (monomolecular, logistic, etc.)
  • Parameter values have biological meaning and you want to report them
  • You need more control over extrapolation (a line predicts infinite growth; a plateau model doesn’t)

The tradeoff is complexity. Nonlinear models require careful fitting, diagnostics, and validation.

Example: Nutrient Response with a Plateau

At low N rates, yield jumps with each additional kilogram. But eventually, adding more N doesn’t help—the yield plateaus.

A straight line would predict infinite yield at extreme N rates. That’s nonsense. What we need is a model that rises linearly at first, then levels off. That’s exactly what the linear-plateau model does.

For this, we will use the nlraa library, which stands for Nonlinear Regression for Agricultural Applications. If you want to know more about this, I highly recommend the manuscript Nonlinear Regression Models and Applications in Agricultural Research, written by Sotirios Archontoulis and Fernando Miguez.

library(nlraa)
corn_n <- read.csv("data/corn_nrate_mono.csv")

ggplot(corn_n, aes(x = n_rate, y = yield)) +
  geom_point(pch = 21, size = 2) +
  labs(x = "N rate", y = "Yield") +
  theme_bw()

Now fit the linear-plateau model:

model_linp <- nls(yield ~ SSlinp(n_rate, a, b, xs), data = corn_n)
summary(model_linp)

Formula: yield ~ SSlinp(n_rate, a, b, xs)

Parameters:
    Estimate Std. Error t value Pr(>|t|)    
a   96.69732    4.39980  21.978 1.15e-11 ***
b    0.53390    0.07778   6.864 1.15e-05 ***
xs 111.84420   13.32118   8.396 1.31e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.8 on 13 degrees of freedom

Number of iterations to convergence: 1 
Achieved convergence tolerance: 1.826e-11

What do these parameters mean? This is the kind of thing that makes nonlinear models powerful:

  • a: The intercept—predicted yield at zero N.
  • b: The slope in the linear phase—each additional kg of N adds this much yield, up to the join point.
  • xs: The join point—the N rate at which the yield stops responding and plateaus. This is often close to the agronomic optimum and can inform economic decisions, but the true economic optimum depends on crop price, fertilizer cost, and risk.

The fitted curve:

See how the curve rises steeply then flattens sharply at the join point? The xs parameter tells you exactly where that happens. That’s the kind of number you can hand to a farmer.

Lesson: polynomial models can describe shape, but mechanistic nonlinear models often give you parameters you can act on.

Example: Plant Growth Following an S-Curve

Now a different scenario: a crop physiologist tracking soybean biomass from emergence to maturity.

Early on, growth is slow (small plant, limited leaf area). Then it accelerates—the plant has more leaves photosynthesizing, growth compounds. Finally, it slows again as the plant matures and stops investing in biomass accumulation.

That pattern is an S-curve (sigmoidal). It’s not a line, and a polynomial might fit it, but an S-curve model will tell you when growth is fastest (the inflection point), which is biologically interesting.

Fit a logistic (S-curve) model:

model_logis <- nls(tdm ~ SSlogis(dae, Asym, xmid, scal), data = soy_tdm)
summary(model_logis)

Formula: tdm ~ SSlogis(dae, Asym, xmid, scal)

Parameters:
     Estimate Std. Error t value Pr(>|t|)    
Asym 857.5043    11.7558   72.94 2.12e-07 ***
xmid  59.7606     0.7525   79.42 1.51e-07 ***
scal  10.8261     0.6574   16.47 7.96e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.64 on 4 degrees of freedom

Number of iterations to convergence: 0 
Achieved convergence tolerance: 3.018e-06

Again, notice the parameters have meaning:

  • Asym: Maximum biomass the plant ever reaches
  • xmid: The day when growth is fastest (inflection point of the S-curve)
  • scal: How steeply the curve rises. Bigger value = sharper transition

For a crop physiologist, xmid tells you when the plant is at peak growth rate. Asym tells you about ultimate biomass potential. These are actionable biological insights.

Polynomial vs Nonlinear: Which One Should You Use?

Use this decision rule:

  • If you need a quick flexible fit and a familiar workflow, start with polynomial terms.
  • If you need interpretable biological parameters (like join point or asymptote) and realistic extrapolation, use nonlinear models.

Neither is always best. The right choice depends on your question, your data range, and whether you need explanation, prediction, or both.

Common Pitfalls and How to Avoid Them

Let me call out the errors that tend to trap people:

1. Confusing correlation with causation.

Multiple regression finds patterns. It doesn’t prove causation. A model predicting yield from soil properties might just be picking up confounding relationships. Domain knowledge and careful reasoning still matter. Statistics is a tool, not a truth machine.

2. Trusting \(R^2\) too much.

A model can explain 95% of variation in your training data and still fail badly on new observations. That’s overfitting. Always evaluate likely performance on future or held-out data.

3. Misinterpreting coefficients when multicollinearity is present.

Correlated predictors make individual coefficients unstable. Check the correlation matrix and VIF. If you see red flags, focus on what the model actually predicts, not what individual coefficients say.

4. Ignoring curved residuals.

If your residual plot shows systematic curvature instead of random scatter, the model is biased. Fix it with polynomial terms or nonlinear models.

5. Fitting a fancy nonlinear model without validation.

Just because you can fit an S-curve doesn’t mean you should. Check whether the algorithm converged, whether parameter values make sense, and whether different starting values give the same answer. If something looks off, it probably is.

Summary: What Changes and What Doesn’t

Module 11 builds on Module 10. You’re using more complex models, sure. But the core thinking stays the same. And I want to emphasize this because it’s the real lesson: The tools change. The reasoning doesn’t.

Here’s what every analysis should look like:

1. Start with a biological question, not a dataset.

There’s a difference between “What variables predict yield?” and “I have this dataset, let me analyze it.” The first is research. The second is busywork.

2. Plot the data before fitting anything.

No exceptions. Scatterplot matrices, histograms, everything. Your eyes catch patterns that tests miss.

3. Check assumptions.

Residual plots, normality checks, heteroscedasticity. Are they perfect? No. But do they tell a coherent story about what’s working and what’s not? They should.

4. Compare alternatives using statistics AND biology.

AIC is useful. But also ask: “Does this make biological sense?” Both matter equally.

5. Pick the simplest model that’s honest.

Not the fanciest. Not the highest \(R^2\). The simplest one that does real work.

The core thinking is simple: Think like a biologist first. Use statistics to check whether your reasoning holds up, not as a substitute for thinking.

That’s the part that actually matters.