|
||||||||
"What is logistic regression?logistic regression: a type of regression used when the dependant variable is binary or ordinal. A lot of statistics is concerned with predicting the value of a continuous variable: blood pressure, intelligence, oxygen levels, wealth and so on. This kind of statistics dominates undergraduate courses, and in social science analysis. But what do you do if your dependant variable is binary? What if, for example, you're running a medical study where you want to predict whether someone will live or die in a particular treatment regime? In this case, your dependent variable, survival, can only have two values. It isn't continuous, it's binary. In the past, an accepted way around this was to simply use standard linear regression, and treat the dependent as if it was binary. If the two values were coded as 0 and 1, then any value of .5 or above would be treated as a 1, and anything below .5 would be treated as a zero. However, this approach is no longer considered acceptable, as it has several problems. What's wrong with regressing against binary dependent variables?The first problem is apparent to even a casual observer: the predicted values have no meaning. If you dependent variable can only be zero or one (such as alive or dead), then a value of 3 would indicate that the subject is alive. But what else? This isn't a probability value or likelihood, or even a percentage. It has no real-world interpretation. Equally meaningless is comparing different predicted values to each other. An even more serious problem is that such an analysis violates many assumptions of linear regression. For example, the assumption of homoscedacity won't hold. Homoscedasticity means that the variance around the dependent variable is similar for all values of the independent variable. Variance for a distribution of a binary variable is PQ where P is the probability of a zero, and Q is the probability of a 1. 
Another assumption that is violated is that the residuals, Y-Y', are normally distributed.

A better solution is to use either discriminant analysis (DA) or logistic regression.

Advantages and disadvantages of logistic regression

Logistic regression has several advantages over discriminant analysis:

* It is more robust: the independent variables don't have to be normally distributed, or have equal variance in each group.
* It does not assume a linear relationship between the IV and DV.
* It may handle nonlinear effects.
* You can add explicit interaction and power terms.
* The DV need not be normally distributed.
* There is no homogeneity of variance assumption.
* Normally distributed error terms are not assumed.
* It does not require that the independents be interval.
* It does not require that the independents be unbounded.

With all this flexibility, you might wonder why anyone would ever use discriminant analysis or any other method of analysis. Unfortunately, the advantages of logistic regression come at a cost: it requires much more data to achieve stable, meaningful results. With standard regression and DA, 20 data points per predictor is typically considered the lower bound. For logistic regression, at least 50 data points per predictor are necessary to achieve stable results.

How is logistic regression done?

We've already talked about why you can't regress against the binary variable (0-1 values), so that's out. What about the probability of a 1? This would be a number between 0 and 1. In fact, I have seen research papers where the authors have done just that. However, it's not good practice. The probability is continuous, but it is still bounded between 0 and 1: a value of 1.1 or -3 makes no sense. What's needed is a continuous, unbounded dependent variable.

The dependent variable in a logistic regression is the log of the odds. Or in mathspeak, ln(p/(1-p)), where p is the probability of a 1. This is known as the logit.
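To see how the logit turns a bounded probability into an unbounded quantity, here is a minimal Python sketch of ln(p/(1-p)). The function name `logit` and the sample probabilities are assumptions chosen for illustration.

```python
import math


def logit(p: float) -> float:
    """Log of the odds, ln(p / (1 - p)), for a probability strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return math.log(p / (1.0 - p))


# Illustrative probabilities only: values near 0 give large negative logits,
# 0.5 gives a logit of zero, and values near 1 give large positive logits.
for p in (0.001, 0.1, 0.5, 0.9, 0.999):
    print(f"p = {p:<5}  logit = {logit(p):+.3f}")
```

A probability of 0.5 corresponds to even odds and a logit of zero, and the logit grows without limit in either direction as p approaches 0 or 1.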
The logit is the dependent variable against which the independent variables are regressed.

Interpreting the results of a logistic regression

At first glance, logistic regression results look familiar, especially to someone used to standard regression: there is a regression equation, complete with coefficients for all the variables. However, these regress against the logit, not the dependent variable itself!

Logit = a + bX1 + cX2

Formula for converting logit to probabilities

If your regression equation is Logit = a + bX1 + cX2 etc., then the first step is to calculate the logit using that formula. The logit is then converted into a probability using this formula:

p = e^Logit / (1 + e^Logit)

This number gives you the probability of a 1, given the current configuration of all the predictors. For example, it might give the probability of survival given various lifestyle factors, or the probability of contracting a disease.

Effect size

Logistic regression is a bit like regression, so people who are familiar with regression ask "what's the R value?" In standard regression, R (or R squared in particular) gives you an idea of how powerful your equation is at predicting the variable of interest. An R close to 1 is a very strong prediction, whereas a small R, closer to zero, indicates a weak relationship.

There is no direct equivalent of R for logistic regression. However, to keep people happy who insist on an R value, statisticians have come up with several R-like measures for logistic regression. They are not R itself, because R has no meaning in logistic regression. Some of the better known ones are:

* Cox and Snell's R-Square
* Pseudo-R-Square
* Hagle and Mitchell's Pseudo-R-Square

Uses

Logistic regression is perfect for situations where you are trying to predict whether something "happens" or not. A patient survives a treatment, a person contracts a disease, a student passes a course. These are binary outcome measures.
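To make the earlier conversion from logit to probability concrete, the Python sketch below inverts ln(p/(1-p)) and applies it to a regression equation of the form Logit = a + bX1 + cX2. The function names and the coefficient values are hypothetical, picked only for illustration; they do not come from any fitted model.

```python
import math


def logit_to_probability(logit_value: float) -> float:
    """Invert ln(p / (1 - p)): p = e^logit / (1 + e^logit)."""
    if logit_value >= 0:
        # Algebraically equivalent form that avoids overflow for large logits.
        return 1.0 / (1.0 + math.exp(-logit_value))
    e = math.exp(logit_value)
    return e / (1.0 + e)


def predicted_probability(a: float, b: float, c: float, x1: float, x2: float) -> float:
    """Probability of a 1 for the equation Logit = a + b*X1 + c*X2."""
    return logit_to_probability(a + b * x1 + c * x2)


# Hypothetical coefficients and predictor values, for illustration only.
a, b, c = -1.0, 0.5, 0.25
for x1, x2 in ((0.0, 0.0), (2.0, 0.0), (4.0, 4.0)):
    p = predicted_probability(a, b, c, x1, x2)
    print(f"X1 = {x1}, X2 = {x2}  ->  probability of a 1 = {p:.3f}")
```

Whatever value the regression equation produces, the converted result always lies between 0 and 1, which is what makes it interpretable as a probability.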
Logistic regression is particularly useful where the dataset is very large and the predictor variables do not behave in orderly ways or do not obey the assumptions required of discriminant analysis. The results of logistic regression are somewhat mystifying, since the original variable of interest (such as survival) disappears and is replaced by the logit. But with a good understanding of how to convert logit values into probabilities, this method can be a powerful tool.

© 2007 David Dufty

Further reading: Statnotes binary logistic regression " |
||||||||