Page: Univariate Descriptive Statistics and Graphs
Summary statistics and their associated graphs do more than provide useful information about our variables; they are invaluable for getting to know one's data before moving on to more complex analyses. Sometimes, simple univariate descriptive statistics are all we want to know (e.g., the mean and standard deviation of a standardized test score).
Your first step in getting to know your variables (beyond reading the documentation and codebook) is to use the codebook command:
. codebook var1 var2 var3
See this Page for how to use the codebook results to identify your variables as interval or categorical.
Interval Variables
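For interval variables, the most commonly used descriptive statistics are a measure of central tendency (usually the mean; sometimes the median; in rare cases, the mode), the standard deviation (a measure of dispersion), and the range. The summarize command reports the number of observations, mean, standard deviation, minimum, and maximum; adding the detail option also reports percentiles (including the median), variance, skewness, and kurtosis.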
. summarize intvar
. sum intvar, detail
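As a minimal sketch of what these commands make available, the lines below use the auto dataset that ships with Stata (an assumption for illustration; price stands in for intvar). After summarize with the detail option, the results are stored in r() and can be displayed or reused. The tabstat command is an alternative when you want a compact table of selected statistics.
. sysuse auto, clear
. summarize price, detail
. display "median = " r(p50) "   IQR = " r(p75)-r(p25)
. tabstat price, statistics(mean sd median min max range)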
Graphing Options
For interval or continuous variables, one of the most informative graphing options is the box plot. One can use a simple bar graph, but that will usually show only the mean, which is not all that interesting. The box plot, on the other hand, gives the viewer a sense of how the population or sample is distributed on a particular interval variable. The line inside the box represents the median (50th percentile) value. The lower and upper ends of the box represent the 25th and 75th percentile values (thus the middle 50% of cases fall within the box). The whiskers extend to the lower and upper "adjacent values." The upper adjacent value (whisker) is the highest value in the data set that is not greater than the upper quartile (75th percentile) value + 1.5*the interquartile range (which is the 75th percentile - the 25th percentile). The lower adjacent value (whisker) is the lowest value in the data set that is not less than the lower quartile (25th percentile) value - 1.5*the interquartile range. Any dots represent "outside" values (i.e., outliers) that fall beyond the adjacent values.
. graph box intvar
In some cases you may opt to suppress the display of outside values in the box plot (they are hidden from the graph, not dropped from the data):
. graph box intvar, nooutside
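The whisker rule described above can be checked by hand. The following is a minimal sketch, assuming Stata's bundled auto dataset with price standing in for intvar; the scalar names (iqr, upper_fence, lower_fence) are hypothetical and chosen only for illustration. It computes the two 1.5*IQR limits, then finds the most extreme observed values that still lie within them, which should correspond to the whisker ends drawn by graph box.
. sysuse auto, clear
. quietly summarize price, detail
. scalar iqr = r(p75) - r(p25)
. scalar upper_fence = r(p75) + 1.5*scalar(iqr)
. scalar lower_fence = r(p25) - 1.5*scalar(iqr)
. * adjacent values: the most extreme observations still inside the limits
. quietly summarize price if price >= scalar(lower_fence) & price <= scalar(upper_fence)
. display "lower adjacent value = " r(min) "   upper adjacent value = " r(max)
. * outside values: observations beyond the limits (plotted as dots)
. count if (price < scalar(lower_fence) | price > scalar(upper_fence)) & !missing(price)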
If you wish to visualize the distribution that the box plot is representing, you can use the stripplot command. The settings are only a starting point. Feel free to adjust as needed.
. stripplot intvar, box(barw(0.2)) pct(0.1) boffset(-0.15) vertical stack height(0.4)
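Note that stripplot is not part of official Stata; it is a user-written command distributed through the SSC archive and must be installed once before use. A minimal sketch of installing and trying it, assuming internet access and the bundled auto dataset (mpg stands in for intvar):
. ssc install stripplot
. sysuse auto, clear
. stripplot mpg, box(barw(0.2)) pct(0.1) boffset(-0.15) vertical stack height(0.4)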
Or, you can show the box plot and/or stripplot over a categorical variable:
. graph box intvar, over(catvar)
. stripplot intvar, box(barw(0.2)) pct(0.1) boffset(-0.15) vertical stack height(0.4) over(catvar)
Yet another option is the violin plot, which can be created using the user-written vioplot command. In the resulting graph, the marker in the middle shows the median, the box shows the interquartile range, and the spikes extend to the upper and lower adjacent values. The outer area shows a kernel density (smoothed) representation of the distribution, drawn in mirror image. The wider the area, the higher the estimated density of cases at that value. The widest point thus marks the mode of the smoothed distribution.
. vioplot intvar
. vioplot intvar, over(catvar)
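Because vioplot is user-written, it has to be installed before its first use. The lines below are a sketch under the assumption that the command is obtained from the SSC archive and that the bundled auto dataset is used for practice (mpg stands in for intvar, foreign for catvar):
. ssc install vioplot
. sysuse auto, clear
. vioplot mpg, over(foreign)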
You may also use a histogram to visualize your interval variable. Adding the normal option overlays a normal density curve for comparison.
. histogram intvar
. histogram intvar, normal
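A short sketch of two common adjustments, assuming the bundled auto dataset with mpg standing in for intvar: width() sets the bin width, and frequency puts counts rather than densities on the vertical axis. The width of 5 is an arbitrary choice for illustration.
. sysuse auto, clear
. histogram mpg, width(5) frequency normal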
Categorical Variables
For categorical variables, descriptive statistics are usually limited to frequency distributions (counts) and associated percentages.
. tab catvar
. fre catvar
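Note that tab (short for tabulate) is built in, whereas fre is a user-written command available from the SSC archive. By default, tabulate leaves missing values out of the table; the missing option includes them. A minimal sketch, assuming the bundled auto dataset, where rep78 stands in for catvar and contains a few missing values:
. sysuse auto, clear
. tabulate rep78
. tabulate rep78, missing
. ssc install fre
. fre rep78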
Graphing Options
The first graphing command displays the actual counts (frequencies) for each category of your variable. The second command displays the percent of respondents in each category. The third and fourth commands show different pie graph configurations. Because there are many possible options, you may prefer to build any of these graphs through the Stata menu system. You can also start with the basic commands and make subsequent changes in the Graph Editor.
. graph bar (count), over(catvar)
. graph bar, over(catvar)
. graph pie, over(catvar)
. graph pie, over(catvar) pie(_all, explode)
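As a sketch of the kind of adjustments that can be made from the command line rather than in the Graph Editor, the lines below add value labels to the bars and slices. They assume the bundled auto dataset (rep78 standing in for catvar) and a Stata version in which graph bar accepts (percent) and (count) without a variable (Stata 13 or later); the axis title and number format are arbitrary choices.
. sysuse auto, clear
. graph bar (percent), over(rep78) blabel(bar, format(%4.1f)) ytitle("Percent of cars")
. graph pie, over(rep78) plabel(_all percent, format(%4.1f))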