Sunday, February 27, 2011

Stepwise Regression in R

Let me start with a disclaimer:  I am not an advocate of stepwise regression.  I teach it in a doctoral seminar (because it's in the book, and because the students may encounter it reading papers), but I try to point out to them some of its limitations.  If you want to read some interesting discussions on the issues with stepwise, search the USENET group sci.stat.math for references to stepwise.  (If you are a proponent of stepwise, I suggest that you don flame retardant underwear first.)

Since I teach stepwise in my seminar, I would like to demonstrate it in R (not to mention some of my students are learning R while doing their homework, which includes a stepwise problem).  The catch is that R seems to lack any library routines to do stepwise as it is normally taught.  There is a function (leaps::regsubsets) that does both best subsets regression and a form of stepwise regression, but it uses AIC or BIC to select models.  That's fine for best subsets, but stepwise (at least as I've seen it in every book or paper where I've encountered it) uses nested model F tests to make decisions.  Again, if you search around, you can find some (fairly old) posts on help forums by people searching (fruitlessly, it appears) for a stepwise implementation in R.
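To make the distinction concrete, the nested-model F test I have in mind is the partial F test that R's anova function performs when handed two nested fits. A minimal illustration, using the built-in mtcars data:

    # Nested-model (partial) F test: do hp and drat add anything to a model already containing wt?
    reduced <- lm(mpg ~ wt, data = mtcars)
    full <- lm(mpg ~ wt + hp + drat, data = mtcars)
    anova(reduced, full)  # the F statistic and Pr(>F) test whether the extra coefficients are all zero

Stepwise regression, as usually taught, applies tests of this sort (against p-value thresholds) to decide which single term to add or drop at each step.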

So I finally forced myself to write a stepwise function for R.  My knowledge of R coding is very, very limited, so I made no attempt to tack on very many bells and whistles, let alone make it a library package.  Unlike most R routines, it does not create an object; it just merrily writes to the standard output stream.  There are a number of limitations (expressed in the comments), and I've only tested it on a few data sets.  All that said, I'm going to post it below, in case someone else is desperate to do conventional stepwise regression in R.

=== code follows ===
    #
    # This is an R function to perform stepwise regression based on a "nested model" F test for inclusion/exclusion
    # of a predictor.  To keep it simple, I made no provision for forcing certain variables to be included in
    # all models, did not allow for specification of a data frame, and skipped some consistency checks (such as whether
    # the initial model is a subset of the full model).
    #
    # One other note: since the code uses R's drop1 and add1 functions, it respects hierarchy in models. That is,
    # regardless of p values, it will not attempt to drop a term while retaining a higher order interaction
    # involving that term, nor will it add an interaction term if the lower order components are not all present.
    # (You can of course defeat this by putting interactions into new variables and feeding it what looks like
    # a first-order model.)
    #
    # Consider this to be "beta" code (and feel free to improve it).  I've done very limited testing on it.
    #
    # Author: Paul A. Rubin (rubin@msu.edu)
    #
    stepwise <- function(full.model, initial.model, alpha.to.enter, alpha.to.leave) {
      # full.model is the model containing all possible terms
      # initial.model is the first model to consider
      # alpha.to.enter is the significance level below which a variable may enter the model
      # alpha.to.leave is the significance level above which a variable may be deleted from the model
      # (Useful things for someone to add: specification of a data frame; a list of variables that must be included)
      full <- lm(full.model);  # fit the full model
      msef <- (summary(full)$sigma)^2;  # MSE of full model
      n <- length(full$residuals);  # sample size
      allvars <- attr(full$terms, "predvars");  # this gets a list of all predictor variables (not actually used below)
      current <- lm(initial.model);  # this is the current model
      while (TRUE) {  # process each model until we break out of the loop
        temp <- summary(current);  # summary output for the current model
        rnames <- rownames(temp$coefficients);  # list of terms in the current model
        print(temp$coefficients);  # write the model description
        p <- dim(temp$coefficients)[1];  # current model's size
        mse <- (temp$sigma)^2;  # MSE for current model
        cp <- (n-p)*mse/msef - (n-2*p);  # Mallows' Cp
        fit <- sprintf("\nS = %f, R-sq = %f, R-sq(adj) = %f, C-p = %f",
                       temp$sigma, temp$r.squared, temp$adj.r.squared, cp);
        write(fit, file="");  # show the fit
        write("=====", file="");  # print a separator
        if (p > 1) {  # don't try to drop a term if only one is left
          d <- drop1(current, test="F");  # looks for significance of terms based on F tests
          pmax <- max(d[-1,6]);  # maximum p-value of any term (have to skip the first row, the <none> row, whose p-value is NA)
          if (pmax > alpha.to.leave) {
            # we have a candidate for deletion
            var <- rownames(d)[d[,6] == pmax];  # name of variable to delete
            if (length(var) > 1) {
              # the first entry will be an NA contributed by the <none> row (whose p-value is NA)
              # there also could be ties for worst p-value
              # taking the second entry if there is more than one is a safe solution to both issues
              var <- var[2];
            }
            write(paste("--- Dropping", var, "\n"), file="");  # print out the variable to be dropped
            f <- formula(current);  # current formula
            f <- as.formula(paste(f[2], "~", paste(f[3], var, sep=" - ")));  # modify the formula to drop the chosen variable (by subtracting it)
            current <- lm(f);  # fit the modified model
            next;  # return to the top of the loop
          }
        }
        # if we get here, we failed to drop a term; try adding one
        # note: add1 throws an error if nothing can be added (current == full), which we trap with tryCatch
        a <- tryCatch(add1(current, scope=full, test="F"), error=function(e) NULL);  # looks for significance of possible additions based on F tests
        if (is.null(a)) {
          break;  # there are no unused variables (or something went splat), so we bail out
        }
        pmin <- min(a[-1,6]);  # minimum p-value of any candidate term (skipping the <none> row again)
        if (pmin < alpha.to.enter) {
          # we have a candidate for addition to the model
          var <- rownames(a)[a[,6] == pmin];  # name of variable to add
          if (length(var) > 1) {
            # same issues with ties and the <none> row as above
            var <- var[2];
          }
          write(paste("+++ Adding", var, "\n"), file="");  # print the variable being added
          f <- formula(current);  # current formula
          f <- as.formula(paste(f[2], "~", paste(f[3], var, sep=" + ")));  # modify the formula to add the chosen variable
          current <- lm(f);  # fit the modified model
          next;  # return to the top of the loop
        }
        # if we get here, we failed to make any changes to the model; time to punt
        break;
      } 
    }
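To give a sense of how it is called, here is one possible invocation using the built-in swiss data. Because the function takes no data frame argument, the variable names are fully qualified, and the two alpha thresholds below are just illustrative choices:

    stepwise(
      swiss$Fertility ~ swiss$Agriculture + swiss$Examination + swiss$Education +
        swiss$Catholic + swiss$Infant.Mortality,  # full model: every candidate predictor
      swiss$Fertility ~ 1,  # initial model: intercept only
      0.05,                 # alpha.to.enter
      0.10                  # alpha.to.leave
    )

The function writes each intermediate model, its fit statistics, and the variables added or dropped to standard output; as written above, it does not return a value.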

Update (08/21/17): I've posted an updated/improved (?) version of the code, including a demonstration, in an R notebook.

40 comments:

  1. Paul,
    Love the implementation. I've been using the leaps package to do backwards selection but, like you, I'm not a huge fan of BS or stepwise. To me it's basically a good start, but that's all it is.

    One method I used was a "bootstrap": do several replications of the selection process over random samples of the data, then make a list of the most frequently selected variables across those runs. Again, not perfect, but it may provide a smoother analysis.
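    (A rough sketch of the kind of resampling Larry describes, not his actual code, using regsubsets from the leaps package with the built-in mtcars data standing in for a real problem:)

    library(leaps)
    B <- 200  # number of bootstrap replications
    picks <- character(0)
    for (b in seq_len(B)) {
      boot <- mtcars[sample(nrow(mtcars), replace = TRUE), ]  # resample rows with replacement
      rs <- summary(regsubsets(mpg ~ ., data = boot, method = "backward",
                               nvmax = ncol(mtcars) - 1))
      best <- which.min(rs$bic)  # subset size with the lowest BIC in this replication
      picks <- c(picks, names(which(rs$which[best, -1])))  # predictors chosen at that size
    }
    sort(table(picks), decreasing = TRUE)  # selection frequency for each predictor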

    I'll give your routine a try next time I'm creating a model.

  2. @Larry: Thanks for the comment, and I appreciate having a second set of eyes on it.

    The bulk of the opposition to stepwise I've seen on the Net also applies to best subsets, and (if I dare paraphrase some lengthy discussions) seems to center on letting the data dictate the model. I share those concerns, and caution students accordingly, but in exploratory mode you sometimes need help paring down a universe of models into a set small enough that you can actually study them.

    In those situations, my first choice is best subsets, and the only time I'd ever advocate stepwise would be as a pre-filter for best subsets (when you have too many variables for best subsets to handle). In those situations, I run (bi-directional) stepwise multiple times, first starting from an empty model, then a full model, then an initial model consisting of all the variables excluded from the previous models. Any variable that can't find traction in any of those runs gets the heave-ho, until the set of remaining variables is small enough to run through best subsets. Your bootstrapping idea is different, but in a similar vein.
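    (A rough sketch of those passes. It assumes stepwise() has been tweaked to return its final model as an lm object, which the updated version linked at the end of the post does; the built-in swiss data stands in for a larger variable pool.)

    full <- swiss$Fertility ~ swiss$Agriculture + swiss$Examination + swiss$Education +
      swiss$Catholic + swiss$Infant.Mortality
    m1 <- stepwise(full, swiss$Fertility ~ 1, 0.05, 0.10)  # pass 1: start from an empty model
    m2 <- stepwise(full, full, 0.05, 0.10)                  # pass 2: start from the full model
    kept <- setdiff(union(names(coef(m1)), names(coef(m2))), "(Intercept)")  # survivors so far
    # A third pass would start from a model containing only the variables not in 'kept';
    # anything that never enters in any pass gets the heave-ho before running best subsets.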

    I'm curious why you like backward selection. I've steadily abjured both backward and forward selection. The few papers where I've seen either used seemed to be trying to create an ordering of variables from strongest to weakest predictors.

  3. I don't really have a reason why I like backwards selection better, other than that it's quicker than stepwise. I've read that Forward Selection can have a weaker selection criterion because it doesn't look at all the data, while Backward Selection works from all the data and excludes from there. That's a poor man's description, but it's all I remember.

    I forgot to mention. You might want to put your code on the R tag at Github.
    https://github.com/languages/R

  4. Paul, just a side remark: you might want to add a syntax highlighter for beautifying code, so it's easier to read. It does not need to be complicated; you can use online services such as http://alexgorbatchev.com/SyntaxHighlighter/ (R is not supported there, but other services might handle it). Just a suggestion :-)

  5. @Larry: I took a look at Github, but I think it might be overkill for one little function, particularly since I have no ambitions to maintain it. (Not to mention I have a SourceForge account, and account proliferation already has my eyes crossing at random intervals.) If someone smacks me with a two-by-four and I suddenly develop an ambition to blow up that function into an R package, I'll revisit Github. Thanks for the suggestion, though.

  6. @Bo: Hadn't thought of syntax highlighting, but you're right, it makes the code more readable. After a little research, I settled on Pretty-R, which did a nice job -- other than introducing the dreaded horizontal scroll.

  7. Sorry for making you work on a Sunday :-) Looks much better, nice feature with links into R-docs on function names.

  8. Nice implementation. In Oracle Crystal Ball, we have implemented both forward and iterative stepwise regression, and in my experience, when there are too many variables to select from, iterative stepwise regression fares better than the forward variant. Although, as you have mentioned, the result from any of these methods should be a starting point for more in-depth analysis, rather than the finish line.

  9. "Too many variables" is the key for me. If I'm dredging for a model (not working from a theoretical basis), I'd rather use best subsets than stepwise. When the pool of variables is huge (which has happened to me with time series when I have indicators for month interacting with everything that can't outrun them), then I may use stepwise to winnow the pool for best subsets.

  10. Thanks for this, that is exactly what I need. Can you post an example of an input model and how you input alpha?

    Thanks for your work, Ofer.

  11. @Ofer: Done (http://orinanobworld.blogspot.com/2011/05/stepwise-regression-in-r-part-ii.html).

  12. Thank you for the wonderful job.

  13. Hi Paul:
    This is a really nice code and I like it.
    I think we can even use this code for the Forward selection and Backward elimination as well. For example, to fit a forward selection with alpha to enter as 0.15, we can write stepwise(full.model, initial.model,0.15,1). And to fit a backward elimination with alpha to drop as 0.1, we write stepwise(initial.model,full.model,1,0.1). Here I interchanged the order of initial and full model in the arguments. Is that right? If this is correct, then you may want to add this capability in the description. And by the way, I tested your code over all three methods, and it is working like a charm.

    Thank you.
    Sina

    Replies
    1. Sina,

      Whether your method for backward elimination works will depend on what you specify for the initial model. If the initial model contains an unfortunate subset of the variables, the stepwise function can get caught in a loop, adding and deleting the same terms. The safe way to do backward elimination is to set the third parameter (alpha.to.enter) to 0 rather than 1. The first argument can be anything, as long as it is a valid model. The second argument should be the full model, as you had it.
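      (For concreteness, here is what the two special cases might look like with the built-in swiss data; the thresholds are arbitrary.)

      full <- swiss$Fertility ~ swiss$Agriculture + swiss$Examination + swiss$Education +
        swiss$Catholic + swiss$Infant.Mortality
      # Forward selection only: start from an empty model; alpha.to.leave = 1 means no term can ever be dropped.
      stepwise(full, swiss$Fertility ~ 1, 0.15, 1)
      # Backward elimination only: start from the full model; alpha.to.enter = 0 means no term can ever be added.
      stepwise(full, full, 0, 0.10)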

  14. To prevent the new formula from capturing a local variable (I think) I had to add `env = environment(f)` to the line `f <- as.formula(paste(f[2], "~", paste(f[3], var, sep=" - ")), env = environment(f));` I assume others have not had this capture issue, which makes me nervous that I don't actually grok what's going on. If you do please share!

    Replies
    1. I've been running some tests. Let's say the data is in a data frame named df. If the full and initial models are specified with fully qualified names, such as "df\$y", "df\$x", etc., things work fine, even if some of the variable names match names of local variables in the function body. On the other hand, if you attach df and then specify the arguments as "y", "x", etc., collisions between data field names and local variables in the function cause the sorts of problems your "env = ..." cures.

    2. hey paul. thanks for the quick follow up. sorry i forgot to include the formula i was using. the odd thing is that it uses fully qualified names:

      f <- "d$evals ~ d$Min + d$Max + d$Mean + d$Count"
      mx <- stepwise(as.formula(f), as.formula(f), 0.05, 0.05)

      i'll let you know if i find time to hack at this further and come up with anything, but alas school starts again soon ....

    3. Should have warned you, the site uses paired dollar signs to indicate a LaTeX math expression. Hence the munging of your first line there. I get the gist, though.

      I can't see any way that fully qualified names could collide with any local variable names in the function (and they don't in any of my tests). Once I post the new and possibly improved version, maybe you can test whether the problem still occurs.

      Incidentally, the code you wrote should set mx equal to NULL (unless you tweaked my original code). The original did not return anything, just printed stuff. The new version returns the final model (an lm object).

    4. cheers ... i tried markdown earlier ... yet another syntax ... oh the joy

      you are correct i added "return(current)".

      happy to experiment with the new code! if i still have this issue i'll try to work up a complete simple example for you to play with. given time :) i'll do that with the current code, but alas school starts soon....

    5. Check the tail end of the post above for a link to the update.

  15. I had to pre-filter the call to stepwise with the following to remove variables whose p-values were assigned NA by lm. This seems a bit of a kludge, so I'm curious if there is a better way
    ```
    mx <- lm(as.formula(f))
    c <- mx$coefficients
    c <- c[!is.na(c)]
    # FIX ME if len(c) == 1 panic :)
    f <- paste("d$evals ~ ", paste(names(c)[-1], collapse=" + "))
    ...
    mx <- stepwise(as.formula(f), as.formula(f), 0.05, 0.05)
    ```

    p.s., I'd also like to learn how to replace 'Anonymous' with my google account name, which apparently requires more than selecting such in the 'Comment as' drop down ... :(

    Replies
    1. I've been working on an update to the code, which fixes some cases it did not handle properly. (The original code was, as noted in the post, not really production-ready.) One of the fixes is to screen out NAs in the p-values. I was already working around the NA for the intercept, but I had not defended against multicollinearity, which also causes NAs.

      I'll update the post when I've got an improved version of the code (hopefully soon).

      Regarding anonymity, I'm always logged into my Google account, so the first entry in the select reads "Paul Rubin (Google)". If I browse anonymously, the Google entry reads "Google Account" and selecting it causes Blogger to ask me to log in, after which the comment is recorded with my name.

    2. cheers paul! thanks in advance for the update!!

      fyi, in addition to the multicollinearity, i got NA's for some of the p-values when i had too few sample data points (which i suspect is far from kosher), but users always blame the library.... eh? :)

      thanks for the pointers on anonymity. i tried logging out (the page thinks i'm already logged in) and then logging back in. it is even willing to notify me correctly! i also poked around in the google settings.... starting to think i'm just unloved!! :)

    3. For an all-knowing, all-seeing Big Brother of the Internet, Google seems to be having an awfully hard time keeping straight who you are. ;-)

      Thanks for the bit about insufficient sample size; I'm going to add it to my unit tests. I should be able to avoid throwing around NAs, but I suspect that stepwise is doomed to fail (albeit hopefully gracefully) when you have more variables than observations. As we used to say in graduate school (at Michigan State, a university with a strong agriculture program), "you can't milk ducks". (Oh, and they get rather testy if you try.)

    4. nice :)
      (did my graduate work at UW-Madison ... we had cows! still never tried to milk one ....) cheers!

    5. Check the tail end of the post above for a link to the update.

    6. Forgot to ask: did you ever try to milk a badger? Probably even worse than milking a duck (although badgers are at least mammals).

    7. hey paul, alas, i'm proving inept at finding the link to the update :(
      likely about as good as my badger milking, but i'm not sure that is causal ;-)

    8. The update is tucked in just below the scrollable code window; the link is the word "posted". I have to admit it's a tad inconspicuous. Anyway, the link (to a recent post that explains changes and links to the new code) is https://orinanobworld.blogspot.com/2017/08/updated-stepwise-regression-function.html.

  16. hey paul

    finally found time to play with the update. worked like a charm with one minor modification (likely caused by my not truly grokking R's type system).
    i call your code with

    f <- "d$evals ~ d$Min + d$Max + d$Mean + d$Count"
    mx <- stepwise(as.formula(f), as.formula(f), 0.05, 0.05)

    thus in stepwise.R full.model (and initial.model) are of type 'language'. when printed objects of type language include , which did not parse well later on.
    The following change seemed to sort the issue.

    if (is.language(full.model) | is.character(full.model)) {
      fm <- as.formula(full.model)
    } else {
      fm <- as.formula(capture.output(print(full.model)))
    }

    if (is.language(initial.model) | is.character(initial.model)) {
      im <- as.formula(initial.model)
    } else {
      im <- as.formula(capture.output(print(initial.model)))
    }

    thanks again for the update ... saved me bookoo time!!

    Replies
    1. You're welcome for the update ... but you lost me here. I created a data frame "d" with the five variables you used, computed f as above and computed mx both as written (using as.formula(f) as the model arguments to stepwise) and using just f as the model arguments. Both ran fine, produced the same results and generated printed output with no glitches I can see.

      If you use as.formula(f) as a model argument, its class is "formula", which causes is.language(as.formula(f)) to return TRUE. So your hack basically says "if he entered a formula or if he entered a string, set fm to as.formula(full.model)", which pretty much guarantees the condition will evaluate to true and the line will execute. I think that will cause problems if you have a collision between a variable name in the model and a variable name used in the function, which is why I needed the goofy looking else clause in the first place.

      You said that, when printed, objects of type language include that caused problems. Did you mean a comma, or did something not print in your comment?

  17. hey paul! i get the environment printed out ... ok, only when the formula is created in a function:

    #!/usr/bin/env Rscript

    bar <- function(gg, ff)
    {
      print(gg)
      print(ff)
    }

    doit <- function(f)
    {
      g <- "A ~ B"
      bar(as.formula(g), f)
    }

    f <- "d$evals ~ d$Min + d$Max + d$Mean"
    doit(as.formula(f))

    outputs

    A ~ B

    d$evals ~ d$Min + d$Max + d$Mean


    so when i run the following as part of stepwise

    print(R.version.string)
    print(typeof(full.model))
    print(full.model)
    print(is.language(full.model))
    if (is.character(full.model)) {
    # if (is.language(full.model) | is.character(full.model)) {
      cat("good\n")
      fm <- as.formula(full.model)
    } else {
      cat("bad\n")
      fm <- as.formula(capture.output(print(full.model)))
    }

    i get an error

    [1] "R version 3.2.3 (2015-12-10)"
    [1] "language"
    d$evals ~ d$Min + d$Max + d$Mean + d$Count

    [1] TRUE
    bad
    Error in parse(text = x, keep.source = FALSE) :
    :2:1: unexpected '<'
    1: d$evals ~ d$Min + d$Max + d$Mean + d$Count
    2: <
    ^
    Calls: doit ... formula -> formula.character -> formula -> eval -> parse
    Execution halted

    this is under linux. i get the same thing on a mac with
    version.string R version 3.2.4 (2016-03-10)

    i gather that you don't get the "" as part of the printed output.

    adding 'is.language(full.model) |' solved my problem. i've got stepwise doing what i need. i certainly don't expect you to keep hacking on my account. on the flip side, i'm happy to dig deeper if such would bring you value!

    sorry this got kinda long ....

  18. Reminder: Because this site uses MathJax for rendering LaTeX math notation, you need to escape dollar signs to avoid messes like that above. I'm not sure if you need to escape tildes; I'm doing so below just to be safe.

    I substituted your code (print statements etc.) into the stepwise function, created a dummy data frame "d" with some random data, and ran "stepwise(f, f, 0.05, 0.05)". The output (excluding the actual stepwise output) was:

    [1] "R version 3.4.1 (2017-06-30)"
    [1] "character"
    [1] "d\$evals \~ d\$Min + d\$Max + d\$Mean"
    [1] FALSE
    good

    (This is on Linux Mint, by the way, although I doubt the OS matters.) Then I ran "stepwise(as.formula(f), as.formula(f), 0.05, 0.05)" and got the following:

    [1] "R version 3.4.1 (2017-06-30)"
    [1] "language"
    d\$evals \~ d\$Min + d\$Max + d\$Mean
    [1] TRUE
    bad

    followed by, again, the correct stepwise output. I got no error messages from R.

    I'm using the current version of R, so maybe that makes a difference?

    Replies
    1. I should add that the quotes around the formula in the first instance (when it is type "character") are expected and nothing, I think, to worry about.

  19. sorry about the formatting ... can't wait for the 'here is my new way' wars to end .... so i'm curious what you get as output when running the following from the command line (i'm also curious how to get mathjax to indent a line ... first two options from their web page failed, so i gave up :( ) anyway here is the code:

    #!/usr/bin/env Rscript

    bar <- function(gg, ff)
    {
      print(gg)
      print(ff)
    }

    doit <- function(f)
    {
      g <- "A ~ B"
      bar(as.formula(g), f)
    }

    f <- "C ~ D"
    doit(as.formula(f))


    $ x.R
    A ~ B

    C ~ D

    so for the formula (f <- C ~ D) created in the 'global' space there is no attached environment printed, but for the formula created in function doit ( g <- A ~ B), the printing (of A ~ B) includes an environment. i'm guessing its part of the closure, but i'm not sure ....

    Replies
    1. Yes, I get an environment identifier in the output for the first formula (but not the second). Note that you can suppress that if you wish. Change print(gg) to print(gg, showEnv = F) and see what you get.

  20. cheers paul. for some reason i need to add 'showEnv=F' to the print call in 'fm <- as.formula(capture.output(print(full.model)))', but it sounds like you don't. i got it working so all is well. thanks for all your help!!

    Replies
    1. You're welcome, and you're correct about it not being an issue for me (for some reason). I just reran my unit tests, and none of them printed out "" even once. I'm glad you're set now.

    2. Okay, I've sorted this out. The issue arises consistently if (and only if) you define your models inside a function, rather than in the global environment. I've never done that, which is why the bug didn't bite me, but I can definitely reproduce it. So I've added a unit test for it and posted a new version of the notebook with the fix (as above, for both fm and im) in it.

  21. excellent! ... right ... dam computer geeks creating models in functions and all. glad you were able to grok the issue and make the code more robust!!

