Page: Multiple Logistic Regression
Stop!!! Is your dependent variable dichotomous? If so, proceed. Is it a polychotomous categorical variable that you can collapse into a dichotomous variable? Then do so and proceed. If it is neither of those, logistic regression is not the right tool; go back and choose a different technique.
You may find StataCorp's videos on logistic regression helpful.
When ready, you may find the generic multiple logistic regression do-file (withheld) a useful complement to this Page.
Logistic regression requires:
Dependent Variable: Dichotomous*
Independent Variables: Interval or dichotomous**
*It's not uncommon to have to collapse categories in a polychotomous ordinal variable in order to make it amenable to logistic regression. Alternatively, one can use ordered logistic regression (ologit) in such cases.
**If you have a polychotomous variable, use i. or create dummies; always leave a reference category out of the model.
Basic Command Structure
There are two different commands, and they are equivalent. They'll simply return your results in different form (a logit is the natural logarithm of an odds ratio). The logistic command will return odds ratios; the logit command will return logit coefficients.
. logistic dv iv1 iv2 iv3
. logit dv iv1 iv2 iv3
Command Structure for Automatically Generating Table of Results
One important thing – the “label” option tells Stata to use the variable label in the table, rather than the variable name (since variable names often suck). So, it’s worth putting thought into those variable labels ahead of time so that you don’t have to change them much (if at all) once they’re in the table.
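Note: eststo and esttab come from Ben Jann's user-written estout package, so you may need to install it before first use:
. ssc install estout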
. eststo: logit dv iv1
You can also have Stata run the regression and not show the results, but store them “quietly”.
. eststo: quietly logit dv iv1 iv2
. eststo: logit dv iv1 iv2 i.iv3
. esttab using S:\!!!!SO401_Fall24\Students\Wade_Roberts\ProjectFolder\filename.rtf, label title(Table 1: Logistic Regression Results) nonumbers mtitles("Model 1" "Model 2" "Model 3") ci(#) b(#) compress replace nobaselevels nodepvars eform
. eststo clear
If you want to have the odds ratios reported in the table (as opposed to logits), you’d just include the eform (exponentiated form) option at the end of the esttab command, as in the example above.
The #s refer to the number of decimal places for the confidence intervals (ci) and the betas (coefficients) (b); I usually use 3 for both. The “nobaselevels” option tells it to leave off the reference group in i.catvar cases. You can look at the . help esttab screen to see all the rest.
After the esttab command is run, Stata will toss up a link in the Results window to the rtf document that was created (saved wherever the path in your using clause points; with a bare filename, that's the current working directory). Just click on it and Word will open it up (and you can subsequently save it as a Word document; it just works better to start it as an rtf (rich text file) document for some reason). The resulting table might not be exactly as you want it, but it's pretty darn close.
. eststo clear [run this command to clear your slate so that you can do models for a different table; that different table should be given a different .rtf filename.]
In addition to, or in lieu of, a table of regression results, you might also opt to produce a forest plot of the results. With the following sequence of commands, you can plot the estimates (ORs) and CIs for one or more models, though I recommend keeping it to three or fewer models in a graph.
. logistic dv iv1 iv2
. estimates store model1
. logistic dv iv1 iv2 iv3
. estimates store model2
. coefplot model1 model2, drop(_cons) xline(1) eform
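Note: coefplot is a user-written command (also by Ben Jann), so you may need to install it first:
. ssc install coefplot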
How do I incorporate polychotomous categorical variables into my regression? No form of regression allows you to include a polychotomous categorical variable as an independent variable "as is," particularly if it is nominal (ordinal can be a gray area). It is inappropriate to enter a multiple-category nominal variable (e.g., race, marital status) into the regression as a single numeric variable, since a higher value doesn’t mean anything. In basic terms, NEVER put a polychotomous categorical nominal variable into the model "as is." We can deal with this in a couple of ways. Before we can bring such a variable into the equation, we need to dummy it out. We then leave the reference category (the “default category”) out of the equation; each category you do include is interpreted relative to that reference category. Additionally, NEVER put an i. prefix in front of a true interval variable.
- Use the i. prefix; it does the dummying out for you. See AGIS, p. 281-282.
. logistic dv iv1 iv2 i.race
What if you want to know whether two categories in your polychotomous categorical variable are significantly different from one another, but neither is the reference group? A quick check is whether their confidence intervals overlap: if they do not overlap, the two are significantly different. (If they do overlap, they may or may not be; overlap is only a rough guide.) This is even easier to see if you use a graphing option that includes the 95% CIs (see further below). The more reliable approach is to run test commands after your regression, where the null hypothesis is that there is no difference between the two coefficients. If the p-value from the test command is <.05, then you can reject the null and be confident that there is a significant difference between the two.
. test 2.race=3.race
. test 2.race=4.race
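If you have several categories to compare, Stata's built-in pwcompare postestimation command will run all of the pairwise comparisons at once (after logit/logistic, the comparisons are on the logit-coefficient scale):
. pwcompare race, effects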
- Generate dummies first. Leave one out of the model (per categorical variable) to serve as the reference category and include the rest in the model. To generate dummies from a polychotomous variable, use the following command. It will generate a dummy variable for each category and name them catvar1, catvar2, etc. A worked example follows the note below.
. tab catvar, gen(catvar)
Note: You can do the same thing for ordinal variables as you do for nominal variables. Other options include treating ordinal variables as quasi-interval or collapsing them so as to make a dichotomous variable.
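For instance, a minimal sketch assuming a four-category race variable (hypothetical variable names):
. tab race, gen(race)
. logistic dv iv1 iv2 race2 race3 race4 [race1 is left out to serve as the reference category]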
Interpreting Logistic Regression Results
As with multivariate OLS regression, only dichotomous and interval independent variables are allowed in logistic regression. The odds ratios that are produced thus reflect the association between a dichotomous IV and a dichotomous DV, and/or the association between an interval IV and a dichotomous DV. For the former (dichotomous and dichotomous), the odds ratio is exactly what it says -- the ratio of the odds for each of your independent variable groups. The odds for each IV group are calculated based on the "presence/absence" ratio for that group. The odds of group 1 are then divided by the odds of group 0 to establish the odds ratio. For the second type of IV (interval), the interpretation (and calculation) of the odds ratio is a little different, though the same basic logic applies. Behind the scenes, Stata will calculate the odds of Y=1 at various levels of X (e.g., X=1, 2, 3, 4, etc.). The odds ratio for an interval IV is based on how the ratio (of odds), in general, changes for a one-unit increase in X. If there is no change in odds as X increases, then the odds ratio will be 1 (1 x something = no change in that thing). An odds ratio greater than 1 indicates an increase in the odds of Y=1 for a unit increase in X. An odds ratio less than 1 indicates a decrease in the odds of Y=1 for a unit increase in X.
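To make the dichotomous-by-dichotomous case concrete, here is a worked example with made-up numbers: suppose 30 of 100 cases in group 1 are a 1 on Y, versus 20 of 100 cases in group 0. You can do the arithmetic right in Stata:
. display "odds, group 1 = " 30/70
. display "odds, group 0 = " 20/80
. display "odds ratio = " (30/70)/(20/80)
The odds are .43 and .25, so the OR is about 1.71: group 1's odds of being a 1 on Y are roughly 71% higher than group 0's.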
For each odds ratio a z-value is calculated (this is the test statistic). These z-values come about as a result of standardizing the logistic regression coefficient when testing whether or not the IV is associated with the DV: the z-value is the logit coefficient divided by its standard error.
Ignore the pseudo-R-squared. It doesn’t mean what R-squared means in OLS regression and many statisticians argue for ignoring it. To be honest, there's a lot of disagreement about what it actually reflects. If you want a neat alternative, you can have Stata calculate the percentage of cases in your sample that were correctly predicted by the resulting model. To do so, run the following post-estimation command:
. estat classification
Logistic regression can report out logits or odds ratios. They’re the same thing, only in different mathematical form (logit is the natural logarithm of the odds ratio).
. logit dv iv1 iv2 iv3
. logistic dv iv1 iv2 iv3
Logits are "similar" to regression coefficients in OLS regression. They can be negative or positive; zero is the null hypothesis (i.e., no effect). You should simply note which are statistically significant and whether they increase (+) or decrease (-) the likelihood of Y (keep in mind you are not estimating a value that can take on any number, as with OLS regression; you’re assessing how each IV impacts the likelihood of Y being a 1 rather than a 0).
Odds ratios are a bit different. They range only from zero up. With odds ratios, a value of 1 is the equivalent of 0 (i.e., no effect or association). In fact, the null hypothesis will be that an OR = 1. An odds ratio that is less than 1 indicates that as X goes up, there's a corresponding decrease in the likelihood of Y. An odds ratio greater than 1 indicates that an increase in X increases the likelihood of Y. Of course, an OR of 1 indicates that the IV does not affect, in either direction, the likelihood of Y. I recommend using the following command immediately after a logistic regression to help interpret odds ratios. Focus on the % and the %StdX columns (the legend is included in the output). The SDofX column tells you what a standard deviation of X is in each case.
. logistic dv iv1 iv2 iv3
. listcoef, help percent
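Note: listcoef comes from Long and Freese's user-written SPost package, not official Stata. If Stata doesn't recognize the command, the following will locate the package so you can install it:
. findit spost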
Comparing Effect Sizes
To compare effect sizes in logistic regression, you first have to consider whether your IVs are measured on different scales. If they are on the same scale, you can simply compare their odds ratios or logits. If they are on different scales, which is typically the case, you will have to compare the change in the odds of your DV given a standard deviation change in each IV. To do this, run the listcoef, help percent post-estimation command and consult the %StdX column described above.
I actually recommend comparing the %StdX value for interval variables to the straight-up odds ratio for dichotomous variables. While it may make sense to look at the change in odds for Y for a standard deviation change in an interval X variable, it doesn't make any sense to do a standard deviation change in a dichotomous X variable. So, in terms of comparison, I'd compare the regular ORs for dichotomous variables and the %StdX values for any interval variables.
You can also use dominance analysis to assess independent variables' relative importance to the model.
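One user-written implementation is the domin command (offered here only as a pointer; it requires some model-specific options for logistic regression, so consult its help file for the exact syntax):
. ssc install domin
. help domin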
Logistic Regression and Margins (Predicted Probabilities)
One nice feature in Stata is the ability to calculate predicted probabilities of Y based on logit or logistic regression results. These are much more intuitive to understand than either odds ratios or logits. They give you a sense of the probability of someone with characteristic X (or combination of characteristics) being a 1 on Y (whatever Y is). You would run the following command(s) after a particular logistic regression result. Note: It's best to rerun your logistic regression and i. even the dummy variables (the margins command expects it in many cases).
. logistic dv i.catvar i.dummy etc....
The following command will show you the predicted probabilities for each category of a categorical variable, holding other variables in the model at their means. StataCorp's videos on margins may also be helpful.
. margins catvar, atmeans
. margins female, atmeans [Note: you will need to i. your categorical variable (even if it's just a dummy) in your regression before the margins command will work]
To calculate predicted probabilities for an interval variable, you’ll have to specify “points” at which to calculate those probabilities (since an interval can take on so many unique values). I recommend running the codebook command on your interval variable to remind yourself of the range, so as to inform your lower and higher values.
. margins, at(intervar = (lowervalue(increment)highervalue)) atmeans
. margins, at(gpa = (0(.5)4)) atmeans
The following show you how you can specify combinations of characteristics:
. margins, at(age=(20(10)80) race=1)
. margins, at(age=(20(10)80) race=1 male=1)
Keep in mind what the margins (predicted probabilities) are showing. If I saw a predicted probability of .65 for males, I would say that "males have a 65% probability of being/experiencing/exhibiting Y" (whatever Y is), "with other factors held constant at their means." That's a bit different than saying 65% of males are Y.
To read more on the margins command, see this web page. There's also googling to be done.
How do I build models (sets of variables)? It really depends on how you want to tell the empirical story. Here are some strategies:
- Simple: Sometimes you may want to keep things simple and just run a full model that includes all of your control and primary independent variables of interest. This will allow you to determine the unique, independent effect of each variable, controlling for the others. There's often something to be said for simplicity.
- Mediation and/or Effect-Persistence Approach: Sometimes people like to establish a relationship between their primary IV(s) of interest and their DV. They then seek to “unpack” this relationship by incorporating “mediating variables” (see Allison, p. 60-61). Theoretically, mediating variables, once incorporated, take over some or all of the impact of your primary IV(s). Alternatively, you might conduct a similar set of models with the intent of identifying any initial relationship between your primary IV(s) and your DV. In the subsequent model you add the control variables to assess whether the effect(s) of your primary IV(s) persist, even after controlling for covariates.
- Sets approach: Sometimes people like to group their IVs into categories of variables, introducing each group of variables one at a time. For example, you might include sociodemographic variables as a set; next you might include affiliation variables; etc. You might consider doing each group one at a time and then a “full model” with all of the variables included.
- Moderation/Interaction Effect Approach: Sometimes you may hypothesize that the effect of X1 on Y is moderated by (or contingent on) the value of X2. In these circumstances, you can include an interaction term in your model. Interpreting interaction effects from coefficients (or ORs) is often difficult, so translating your results into a margins graph is a wise move (see the sketch after this list).
- Curvilinear effect: If you suspect, based on the literature or an examination of the bivariate relationship, that X's effect on Y is not linear (linearity is assumed in regression), you can assess for a curvilinear relationship by including a squared term of X (X-squared). In a Stata regression command, it could be included in the following manner: regress dv c.intvar##c.intvar. In the predxcon command, you would specify your intvar as the xvar in the command but then also include the poly(2) option, which will include the quadratic (squared) term. E.g., predxcon dv, xvar(intvar) f(specify low value) t(specify high value) inc(specify increment) poly(2) adjust(include control variables here) graph
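To make the moderation approach concrete, here's a minimal sketch using hypothetical variable names (age as an interval IV, female as a dummy):
. logistic dv c.age##i.female iv2
. margins female, at(age=(20(10)80))
. marginsplot
The resulting plot shows the predicted probability of Y across age, separately by gender; non-parallel lines suggest moderation.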
The nestreg command provides a way of doing some of this model building (I actually don't like using nestreg... but I figured I'd toss it in). It only works when you add variables, not when you alternate or take away variables as you go across models. See AGIS, p. 278-279. The nestreg command also provides you a simple report of how each variable or group of variables changes model fit (the R2 change in OLS; a block test in logistic regression). One additional benefit of using nestreg is that it will hold your sample constant across all of the models. As an example:
. nestreg: logistic dv (model 1 IVs) (IVs to be added in model 2) (IVs to be added in model 3)
If you don't want to use nestreg (e.g., when you don't want models to simply be additive), you may still want to hold your sample constant across models so that changes in the sample don't account for changes in regression coefficients. To do so, you can use an "if" qualifier that tells Stata to run each model only on cases not missing any of the variables that will eventually make it into the models. As an example:
. logistic dv iv1 iv2 if !missing(dv,iv1,iv2,iv3,iv4)
. logistic dv iv1 iv2 iv3 iv4 if !missing(dv,iv1,iv2,iv3,iv4)
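A related trick (a sketch of a common Stata idiom, not from the assigned reading): run the fullest model first, flag the cases Stata actually used with the e(sample) function, and then restrict the earlier models to those cases.
. logistic dv iv1 iv2 iv3 iv4
. gen insample = e(sample)
. logistic dv iv1 iv2 if insample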
How many variables can I include in the regression model? In general, you want no more than 1 IV per 10 observations. This ensures that Stata has enough data from which to calculate estimates. So, if you have a sample of 60, keep the number of IVs to 6 or fewer. It’s even better to keep it to a ratio of 20 observations to every 1 IV. Balance the need to be thorough with the desire to keep things simple. In other words, don't throw in the kitchen sink just because you can. Be judicious.
Graphing Options
The most accessible way to present results from logistic regression is not through logits or odds ratios, which can be difficult to understand, but rather through predicted probabilities (margins). These tell/show you the predicted probability of Y for different values of your categorical and/or interval independent variables.
Immediately after a particular logistic regression model:
. margins catvar, atmeans
. marginsplot, xdimension(catvar)
The following versions won’t “connect the dots” with a line.
. margins catvar, atmeans
. marginsplot, xdimension(catvar) recast(bar)
. marginsplot, xdimension(catvar) recast(dot)
Or, run coefplot right after the logistic regression (to plot the ORs). It's difficult to show the reference category in this kind of graph, though (unlike with marginsplot).
. coefplot, drop(_cons) xline(1) eform xtitle(Odds ratio)
Alternatively, see Acock, p. 312 (ch. 11), where he talks about doing a bar graph of the odds ratios. The bars display the percentage change in odds for a 1 standard deviation increase in each X.
http://www.stata.com/meeting/germany14/abstracts/materials/de14_jann.pdf
Other Concerns
Diagnostics -- Multicollinearity (See the Diagnostics Page).
Interaction effects
One may hypothesize that the effect of X1 on Y varies depending on the value of X2. We refer to these kinds of situations as interaction effects. Unfortunately, interaction effects can get quite complicated to interpret, particularly when interacting two interval variables. It can be easier to interpret interaction effects between an interval and a categorical variable, or between two categorical variables, though. In fact, there are user-written commands that do a great job of making this easy and that graph the results so that you can visualize the interaction.
The following command will interact your specified interval variable and your specified categorical variable while controlling for designated control variables. Note: predxcon will automatically determine whether an OLS or logistic regression is required.
. predxcon dv, xvar(intvar) f(lowervalue of intvar) t(uppervalue of intvar) i(desired increments) class(catvar) adjust(control variables) results graph
The following command will interact your two categorical variables.
. predxcat dv, xvar(catvar1 catvar2) adjust(control vars) results graph
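Note: predxcon and predxcat are user-written commands, so they may need to be installed first; the following will locate them:
. findit predxcon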
To read more about interactions, see Acock Ch. 10 on multiple regression, p. 282-289.
This webpage shows an interesting way to graphically represent the interaction of two continuous variables.
You may also find the following sites at UCLA helpful:
- Categorical by continuous variable interaction (logistic regression)
- Continuous by continuous variable interaction (logistic regression)
- Categorical by categorical variable interaction (logistic regression)