Sunday, February 27, 2011

Stepwise Regression in R

Let me start with a disclaimer:  I am not an advocate of stepwise regression.  I teach it in a doctoral seminar (because it's in the book, and because the students may encounter it reading papers), but I try to point out to them some of its limitations.  If you want to read some interesting discussions on the issues with stepwise, search the USENET group sci.stat.math for references to stepwise.  (If you are a proponent of stepwise, I suggest that you don flame retardant underwear first.)

Since I teach stepwise in my seminar, I would like to demonstrate it in R (not to mention some of my students are learning R while doing their homework, which includes a stepwise problem).  The catch is that R seems to lack any library routines to do stepwise as it is normally taught.  There is a function (leaps::regsubsets) that does both best subsets regression and a form of stepwise regression, but it uses AIC or BIC to select models.  That's fine for best subsets, but stepwise (at least as I've seen it in every book or paper where I've encountered it) uses nested model F tests to make decisions.  Again, if you search around, you can find some (fairly old) posts on help forums by people searching (fruitlessly, it appears) for a stepwise implementation in R.
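To make the distinction concrete, the nested-model F test I have in mind is the partial F test that R's anova function performs when handed two nested fits. A minimal illustration, using the built-in mtcars data:

    # Nested-model (partial) F test: do hp and drat add anything to a model already containing wt?
    reduced <- lm(mpg ~ wt, data = mtcars)
    full <- lm(mpg ~ wt + hp + drat, data = mtcars)
    anova(reduced, full)  # the F statistic and Pr(>F) test whether the extra coefficients are all zero

Stepwise regression, as usually taught, applies tests of this sort (against p-value thresholds) to decide which single term to add or drop at each step.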

So I finally forced myself to write a stepwise function for R.  My knowledge of R coding is very, very limited, so I made no attempt to tack on very many bells and whistles, let alone make it a library package.  Unlike most R routines, it does not create an object; it just merrily writes to the standard output stream.  There are a number of limitations (expressed in the comments), and I've only tested it on a few data sets.  All that said, I'm going to post it below, in case someone else is desperate to do conventional stepwise regression in R.

=== code follows ===
    #
    # This is an R function to perform stepwise regression based on a "nested model" F test for inclusion/exclusion
    # of a predictor.  To keep it simple, I made no provision for forcing certain variables to be included in
    # all models, did not allow for specification of a data frame, and skipped some consistency checks (such as whether
    # the initial model is a subset of the full model).
    #
    # One other note: since the code uses R's drop1 and add1 functions, it respects hierarchy in models. That is,
    # regardless of p values, it will not attempt to drop a term while retaining a higher order interaction
    # involving that term, nor will it add an interaction term if the lower order components are not all present.
    # (You can of course defeat this by putting interactions into new variables and feeding it what looks like
    # a first-order model.)
    #
    # Consider this to be "beta" code (and feel free to improve it).  I've done very limited testing on it.
    #
    # Author: Paul A. Rubin (rubin@msu.edu)
    #
    stepwise <- function(full.model, initial.model, alpha.to.enter, alpha.to.leave) {
      # full.model is the model containing all possible terms
      # initial.model is the first model to consider
      # alpha.to.enter is the significance level below which a variable may enter the model
      # alpha.to.leave is the significance level above which a variable may be deleted from the model
      # (Useful things for someone to add: specification of a data frame; a list of variables that must be included)
      full <- lm(full.model);  # fit the full model
      msef <- (summary(full)$sigma)^2;  # MSE of full model
      n <- length(full$residuals);  # sample size
      allvars <- attr(full$terms, "predvars");  # this gets a list of all predictor variables (not actually used below)
      current <- lm(initial.model);  # this is the current model
      while (TRUE) {  # process each model until we break out of the loop
        temp <- summary(current);  # summary output for the current model
        rnames <- rownames(temp$coefficients);  # list of terms in the current model
        print(temp$coefficients);  # write the model description
        p <- dim(temp$coefficients)[1];  # current model's size
        mse <- (temp$sigma)^2;  # MSE for current model
        cp <- (n-p)*mse/msef - (n-2*p);  # Mallows' Cp
        fit <- sprintf("\nS = %f, R-sq = %f, R-sq(adj) = %f, C-p = %f",
                       temp$sigma, temp$r.squared, temp$adj.r.squared, cp);
        write(fit, file="");  # show the fit
        write("=====", file="");  # print a separator
        if (p > 1) {  # don't try to drop a term if only one is left
          d <- drop1(current, test="F");  # looks for significance of terms based on F tests
          pmax <- max(d[-1,6]);  # maximum p-value of any term (have to skip the first row, the <none> row, whose p-value is NA)
          if (pmax > alpha.to.leave) {
            # we have a candidate for deletion
            var <- rownames(d)[d[,6] == pmax];  # name of variable to delete
            if (length(var) > 1) {
              # the first entry will be an NA contributed by the <none> row (whose p-value is NA)
              # there also could be ties for worst p-value
              # taking the second entry if there is more than one is a safe solution to both issues
              var <- var[2];
            }
            write(paste("--- Dropping", var, "\n"), file="");  # print out the variable to be dropped
            f <- formula(current);  # current formula
            f <- as.formula(paste(f[2], "~", paste(f[3], var, sep=" - ")));  # modify the formula to drop the chosen variable (by subtracting it)
            current <- lm(f);  # fit the modified model
            next;  # return to the top of the loop
          }
        }
        # if we get here, we failed to drop a term; try adding one
        # note: add1 throws an error if nothing can be added (current == full), which we trap with tryCatch
        a <- tryCatch(add1(current, scope=full, test="F"), error=function(e) NULL);  # looks for significance of possible additions based on F tests
        if (is.null(a)) {
          break;  # there are no unused variables (or something went splat), so we bail out
        }
        pmin <- min(a[-1,6]);  # minimum p-value of any candidate term (skipping the <none> row again)
        if (pmin < alpha.to.enter) {
          # we have a candidate for addition to the model
          var <- rownames(a)[a[,6] == pmin];  # name of variable to add
          if (length(var) > 1) {
            # same issues with ties and the <none> row as above
            var <- var[2];
          }
          write(paste("+++ Adding", var, "\n"), file="");  # print the variable being added
          f <- formula(current);  # current formula
          f <- as.formula(paste(f[2], "~", paste(f[3], var, sep=" + ")));  # modify the formula to add the chosen variable
          current <- lm(f);  # fit the modified model
          next;  # return to the top of the loop
        }
        # if we get here, we failed to make any changes to the model; time to punt
        break;
      } 
    }
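To give a sense of how it is called, here is one possible invocation using the built-in swiss data. Because the function takes no data frame argument, the variable names are fully qualified, and the two alpha thresholds below are just illustrative choices:

    stepwise(
      swiss$Fertility ~ swiss$Agriculture + swiss$Examination + swiss$Education +
        swiss$Catholic + swiss$Infant.Mortality,  # full model: every candidate predictor
      swiss$Fertility ~ 1,  # initial model: intercept only
      0.05,                 # alpha.to.enter
      0.10                  # alpha.to.leave
    )

The function writes each intermediate model, its fit statistics, and the variables added or dropped to standard output; as written above, it does not return a value.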

Update (08/21/17): I've posted an updated/improved (?) version of the code, including a demonstration, in an R notebook.

40 comments:

  1. Paul,
    Love the implementation. I've been using the leaps package to do backwards selection but, like you, I'm not a huge fan of BS or stepwise. To me it's basically a good start, but that's all it is.

    One method I used was a "bootstrap": do several replications of the selection process over random samples of the data, then make a list of the most frequently selected variables across those runs. Again, not perfect, but it may provide a smoother analysis.
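    (A rough sketch of the kind of resampling Larry describes, not his actual code, using regsubsets from the leaps package with the built-in mtcars data standing in for a real problem:)

    library(leaps)
    B <- 200  # number of bootstrap replications
    picks <- character(0)
    for (b in seq_len(B)) {
      boot <- mtcars[sample(nrow(mtcars), replace = TRUE), ]  # resample rows with replacement
      rs <- summary(regsubsets(mpg ~ ., data = boot, method = "backward",
                               nvmax = ncol(mtcars) - 1))
      best <- which.min(rs$bic)  # subset size with the lowest BIC in this replication
      picks <- c(picks, names(which(rs$which[best, -1])))  # predictors chosen at that size
    }
    sort(table(picks), decreasing = TRUE)  # selection frequency for each predictor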

    I'll give your routine a try next time I'm creating a model.

  2. @Larry: Thanks for the comment, and I appreciate having a second set of eyes on it.

    The bulk of the opposition to stepwise I've seen on the Net also applies to best subsets, and (if I dare paraphrase some lengthy discussions) seems to center on letting the data dictate the model. I share those concerns, and caution students accordingly, but in exploratory mode you sometimes need help paring down a universe of models into a set small enough that you can actually study them.

    In those situations, my first choice is best subsets, and the only time I'd ever advocate stepwise would be as a pre-filter for best subsets (when you have too many variables for best subsets to handle). In those situations, I run (bi-directional) stepwise multiple times, first starting from an empty model, then a full model, then an initial model consisting of all the variables excluded from the previous models. Any variable that can't find traction in any of those runs gets the heave-ho, until the set of remaining variables is small enough to run through best subsets. Your bootstrapping idea is different, but in a similar vein.
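    (A rough sketch of those passes. It assumes stepwise() has been tweaked to return its final model as an lm object, which the updated version linked at the end of the post does; the built-in swiss data stands in for a larger variable pool.)

    full <- swiss$Fertility ~ swiss$Agriculture + swiss$Examination + swiss$Education +
      swiss$Catholic + swiss$Infant.Mortality
    m1 <- stepwise(full, swiss$Fertility ~ 1, 0.05, 0.10)  # pass 1: start from an empty model
    m2 <- stepwise(full, full, 0.05, 0.10)                  # pass 2: start from the full model
    kept <- setdiff(union(names(coef(m1)), names(coef(m2))), "(Intercept)")  # survivors so far
    # A third pass would start from a model containing only the variables not in 'kept';
    # anything that never enters in any pass gets the heave-ho before running best subsets.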

    I'm curious why you like backward selection. I've steadily abjured both backward and forward selection. The few papers where I've seen either used seemed to be trying to create an ordering of variables from strongest to weakest predictors.

  3. I don't really have a reason why I like backwards selection better, other than that it's quicker than stepwise. I've read that Forward Selection can have a weaker selection criterion because it doesn't look at all the data, while Backward Selection works from all the data and excludes from there. That's a poor man's description, but it's all I remember.

    I forgot to mention. You might want to put your code on the R tag at Github.
    https://github.com/languages/R

  4. Paul, just a side remark: you might want to add a syntax highlighter for beautifying code, so it's easier to read. It does not need to be complicated; you can use online services such as http://alexgorbatchev.com/SyntaxHighlighter/ (R is not supported there, but other services might handle it). Just a suggestion :-)

  5. @Larry: I took a look at Github, but I think it might be overkill for one little function, particularly since I have no ambitions to maintain it. (Not to mention I have a SourceForge account, and account proliferation already has my eyes crossing at random intervals.) If someone smacks me with a two-by-four and I suddenly develop an ambition to blow up that function into an R package, I'll revisit Github. Thanks for the suggestion, though.

  6. @Bo: Hadn't thought of syntax highlighting, but you're right, it makes the code more readable. After a little research, I settled on Pretty-R, which did a nice job -- other than introducing the dreaded horizontal scroll.

  7. Sorry for making you work on a Sunday :-) Looks much better, nice feature with links into R-docs on function names.

  8. Nice implementation. In Oracle Crystal Ball, we have implemented both forward and iterative stepwise regression, and in my experience, when there are too many variables to select from, iterative stepwise regression fares better than the forward variant. Although, as you have mentioned, the result from any of these methods should be a starting point for more in-depth analysis, rather than the finish line.

  9. "Too many variables" is the key for me. If I'm dredging for a model (not working from a theoretical basis), I'd rather use best subsets than stepwise. When the pool of variables is huge (which has happened to me with time series when I have indicators for month interacting with everything that can't outrun them), then I may use stepwise to winnow the pool for best subsets.

  10. Thanks for this, that is exactly what I need. Can you post an example of an input model and how you input alpha?

    Thanks for your work, Ofer.

  11. @Ofer: Done (http://orinanobworld.blogspot.com/2011/05/stepwise-regression-in-r-part-ii.html).

  12. Thank you for the wonderful job.

  13. Hi Paul:
    This is a really nice code and I like it.
    I think we can even use this code for the Forward selection and Backward elimination as well. For example, to fit a forward selection with alpha to enter as 0.15, we can write stepwise(full.model, initial.model,0.15,1). And to fit a backward elimination with alpha to drop as 0.1, we write stepwise(initial.model,full.model,1,0.1). Here I interchanged the order of initial and full model in the arguments. Is that right? If this is correct, then you may want to add this capability in the description. And by the way, I tested your code over all three methods, and it is working like a charm.

    Thank you.
    Sina

    Replies
    1. Sina,

      Whether your method for backward elimination works will depend on what you specify for the initial model. If the initial model contains an unfortunate subset of the variables, the stepwise function can get caught in a loop, adding and deleting the same terms. The safe way to do backward elimination is to set the third parameter (alpha.to.enter) to 0 rather than 1. The first argument can be anything, as long as it is a valid model. The second argument should be the full model, as you had it.
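      (For concreteness, here is what the two special cases might look like with the built-in swiss data; the thresholds are arbitrary.)

      full <- swiss$Fertility ~ swiss$Agriculture + swiss$Examination + swiss$Education +
        swiss$Catholic + swiss$Infant.Mortality
      # Forward selection only: start from an empty model; alpha.to.leave = 1 means no term can ever be dropped.
      stepwise(full, swiss$Fertility ~ 1, 0.15, 1)
      # Backward elimination only: start from the full model; alpha.to.enter = 0 means no term can ever be added.
      stepwise(full, full, 0, 0.10)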

  14. To prevent the new formula from capturing a local variable (I think) I had to add `env = environment(f)` to the line `f <- as.formula(paste(f[2], "~", paste(f[3], var, sep=" - ")), env = environment(f));` I assume others have not had this capture issue, which makes me nervous that I don't actually grok what's going on. If you do please share!

    Replies
    1. I've been running some tests. Let's say the data is in a data frame named df. If the full and initial models are specified with fully qualified names, such as "df\$y", "df\$x", etc., things work fine, even if some of the variable names match names of local variables in the function body. On the other hand, if you attach df and then specify the arguments as "y", "x", etc., collisions between data field names and local variables in the function cause the sorts of problems your "env = ..." cures.

    2. hey paul. thanks for the quick follow up. sorry i forgot to include the formula i was using. the odd thing is that it uses fully qualified names:

      f <- "d$evals ~ d$Min + d$Max + d$Mean + d$Count"
      mx <- stepwise(as.formula(f), as.formula(f), 0.05, 0.05)

      i'll let you know if i find time to hack at this further and come up with anything, but alas school starts again soon ....

    3. Should have warned you, the site uses paired dollar signs to indicate a LaTeX math expression. Hence the munging of your first line there. I get the gist, though.

      I can't see any way that fully qualified names could collide with any local variable names in the function (and they don't in any of my tests). Once I post the new and possibly improved version, maybe you can test whether the problem still occurs.

      Incidentally, the code you wrote should set mx equal to NULL (unless you tweaked my original code). The original did not return anything, just printed stuff. The new version returns the final model (an lm object).

    4. cheers ... i tried markdown earlier ... yet another syntax ... oh the joy

      you are correct i added "return(current)".

      happy to experiment with the new code! if i still have this issue i'll try to work up a complete simple example for you to play with. given time :) i'll do that with the current code, but alas school starts soon....

    5. Check the tail end of the post above for a link to the update.

  15. I had to pre-filter the call to stepwise with the following to remove variables whose p-values were assigned NA by lm. This seems a bit of a kludge, so I'm curious if there is a better way
    ```
    mx <- lm(as.formula(f))
    c <- mx$coefficients
    c <- c[!is.na(c)]
    # FIX ME if len(c) == 1 panic :)
    f <- paste("d$evals ~ ", paste(names(c)[-1], collapse=" + "))
    ...
    mx <- stepwise(as.formula(f), as.formula(f), 0.05, 0.05)
    ```

    p.s., I'd also like to learn how to replace 'Anonymous' with my google account name, which apparently requires more than selecting such in the 'Comment as' drop down ... :(

    Replies
    1. I've been working on an update to the code, which fixes some cases it did not handle properly. (The original code was, as noted in the post, not really production-ready.) One of the fixes is to screen out NAs in the p-values. I was already working around the NA for the intercept, but I had not defended against multicollinearity, which also causes NAs.

      I'll update the post when I've got an improved version of the code (hopefully soon).

      Regarding anonymity, I'm always logged into my Google account, so the first entry in the select reads "Paul Rubin (Google)". If I browse anonymously, the Google entry reads "Google Account" and selecting it causes Blogger to ask me to log in, after which the comment is recorded with my name.

    2. cheers paul! thanks in advance for the update!!

      fyi, in addition to the multicollinearity, i got NA's for some of the p-values when i had too few sample data points (which i suspect is far from kosher), but users always blame the library.... eh? :)

      thanks for the pointers on anonymity. i tried logging out (the page thinks i'm already logged in) and then logging back in. it is even willing to notify me correctly! i also poked around in the google settings.... starting to think i'm just unloved!! :)

    3. For an all-knowing, all-seeing Big Brother of the Internet, Google seems to be having an awfully hard time keeping straight who you are. ;-)

      Thanks for the bit about insufficient sample size; I'm going to add it to my unit tests. I should be able to avoid throwing around NAs, but I suspect that stepwise is doomed to fail (albeit hopefully gracefully) when you have more variables than observations. As we used to say in graduate school (at Michigan State, a university with a strong agriculture program), "you can't milk ducks". (Oh, and they get rather testy if you try.)

    4. nice :)
      (did my graduate work at UW-Madison ... we had cows! still never tried to milk one ....) cheers!

    5. Check the tail end of the post above for a link to the update.

    6. Forgot to ask: did you ever try to milk a badger? Probably even worse than milking a duck (although badgers are at least mammals).

    7. hey paul, alas, i'm proving inept at finding the link to the update :(
      likely about as good as my badger milking, but i'm not sure that is causal ;-)

    8. The update is tucked in just below the scrollable code window; the link is the word "posted". I have to admit it's a tad inconspicuous. Anyway, the link (to a recent post that explains changes and links to the new code) is https://orinanobworld.blogspot.com/2017/08/updated-stepwise-regression-function.html.

  16. hey paul

    finally found time to play with the update. worked like a charm with one minor modification (likely caused by my not truly grokking R's type system).
    i call your code with

    f <- "d$evals ~ d$Min + d$Max + d$Mean + d$Count"
    mx <- stepwise(as.formula(f), as.formula(f), 0.05, 0.05)

    thus in stepwise.R full.model (and initial.model) are of type 'language'. when printed objects of type language include , which did not parse well later on.
    The following change seemed to sort the issue.

    if (is.language(full.model) | is.character(full.model)) {
      fm <- as.formula(full.model)
    } else {
      fm <- as.formula(capture.output(print(full.model)))
    }

    if (is.language(initial.model) | is.character(initial.model)) {
      im <- as.formula(initial.model)
    } else {
      im <- as.formula(capture.output(print(initial.model)))
    }

    thanks again for the update ... saved me bookoo time!!

    Replies
    1. You're welcome for the update ... but you lost me here. I created a data frame "d" with the five variables you used, computed f as above and computed mx both as written (using as.formula(f) as the model arguments to stepwise) and using just f as the model arguments. Both ran fine, produced the same results and generated printed output with no glitches I can see.

      If you use as.formula(f) as a model argument, its class is "formula", which causes is.language(as.formula(f)) to return TRUE. So your hack basically says "if he entered a formula or if he entered a string, set fm to as.formula(full.model)", which pretty much guarantees the condition will evaluate to true and the line will execute. I think that will cause problems if you have a collision between a variable name in the model and a variable name used in the function, which is why I needed the goofy looking else clause in the first place.

      You said that, when printed, objects of type language include that caused problems. Did you mean a comma, or did something not print in your comment?

  17. hey paul! i get the environment printed out ... ok, only when the formula is created in a function:

    #!/usr/bin/env Rscript

    bar <- function(gg, ff)
    {
      print(gg)
      print(ff)
    }

    doit <- function(f)
    {
      g <- "A ~ B"
      bar(as.formula(g), f)
    }

    f <- "d$evals ~ d$Min + d$Max + d$Mean"
    doit(as.formula(f))

    outputs

    A ~ B

    d$evals ~ d$Min + d$Max + d$Mean


    so when i run the following as part of stepwise

    print(R.version.string)
    print(typeof(full.model))
    print(full.model)
    print(is.language(full.model))
    if (is.character(full.model)) {
    # if (is.language(full.model) | is.character(full.model)) {
      cat("good\n")
      fm <- as.formula(full.model)
    } else {
      cat("bad\n")
      fm <- as.formula(capture.output(print(full.model)))
    }

    i get an error

    [1] "R version 3.2.3 (2015-12-10)"
    [1] "language"
    d$evals ~ d$Min + d$Max + d$Mean + d$Count

    [1] TRUE
    bad
    Error in parse(text = x, keep.source = FALSE) :
    :2:1: unexpected '<'
    1: d$evals ~ d$Min + d$Max + d$Mean + d$Count
    2: <
    ^
    Calls: doit ... formula -> formula.character -> formula -> eval -> parse
    Execution halted

    this is under linux. i get the same thing on a mac with
    version.string R version 3.2.4 (2016-03-10)

    i gather that you don't get the "" as part of the printed output.

    adding 'is.language(full.model) |' solved my problem. i've got stepwise doing what i need. i certainly don't expect you to keep hacking on my account. on the flip side, i'm happy to dig deeper if such would bring you value!

    sorry this got kinda long ....

  18. Reminder: Because this site uses MathJax for rendering LaTeX math notation, you need to escape dollar signs to avoid messes like that above. I'm not sure if you need to escape tildes; I'm doing so below just to be safe.

    I substituted your code (print statements etc.) into the stepwise function, created a dummy data frame "d" with some random data, and ran "stepwise(f, f, 0.05, 0.05)". The output (excluding the actual stepwise output) was:

    [1] "R version 3.4.1 (2017-06-30)"
    [1] "character"
    [1] "d\$evals \~ d\$Min + d\$Max + d\$Mean"
    [1] FALSE
    good

    (This is on Linux Mint, by the way, although I doubt the OS matters.) Then I ran "stepwise(as.formula(f), as.formula(f), 0.05, 0.05)" and got the following:

    [1] "R version 3.4.1 (2017-06-30)"
    [1] "language"
    d\$evals \~ d\$Min + d\$Max + d\$Mean
    [1] TRUE
    bad

    followed by, again, the correct stepwise output. I got no error messages from R.

    I'm using the current version of R, so maybe that makes a difference?

    Replies
    1. I should add that the quotes around the formula in the first instance (when it is type "character") are expected and nothing, I think, to worry about.

  19. sorry about the formatting ... can't wait for the 'here is my new way' wars to end .... so i'm curious what you get as output when running the following from the command line (i'm also curious how to get mathjax to indent a line ... first two options from their web page failed, so i gave up :( ) anyway here is the code:

    #!/usr/bin/env Rscript

    bar <- function(gg, ff)
    {
      print(gg)
      print(ff)
    }

    doit <- function(f)
    {
      g <- "A ~ B"
      bar(as.formula(g), f)
    }

    f <- "C ~ D"
    doit(as.formula(f))


    $ x.R
    A ~ B

    C ~ D

    so for the formula (f <- C ~ D) created in the 'global' space there is no attached environment printed, but for the formula created in function doit ( g <- A ~ B), the printing (of A ~ B) includes an environment. i'm guessing its part of the closure, but i'm not sure ....

    Replies
    1. Yes, I get an environment identifier in the output for the first formula (but not the second). Note that you can suppress that if you wish. Change print(gg) to print(gg, showEnv = F) and see what you get.

  20. cheers paul. for some reason i need to add 'showEnv=F' to the print call in 'fm <- as.formula(capture.output(print(full.model)))', but it sounds like you don't. i got it working so all is well. thanks for all your help!!

    Replies
    1. You're welcome, and you're correct about it not being an issue for me (for some reason). I just reran my unit tests, and none of them printed out "" even once. I'm glad you're set now.

    2. Okay, I've sorted this out. The issue arises consistently if (and only if) you define your models inside a function, rather than in the global environment. I've never done that, which is why the bug didn't bite me, but I can definitely reproduce it. So I've added a unit test for it and posted a new version of the notebook with the fix (as above, for both fm and im) in it.

  21. excellent! ... right ... dam computer geeks creating models in functions and all. glad you were able to grok the issue and make the code more robust!!

