Page: Cleaning and Prepping Data
Cleaning and preparing data for use in analyses requires skill and patience. Much of the work of data analysis Links to an external site. revolves around getting raw or messy data prepared for analysis. The goal should be to clean and prepare your data with as few commands as possible. This contributes to both transparency and your ability to diagnose any problems you may later detect. It is always best to take a few minutes to think before acting. Even better would be to actually type/write out what the new variable will look like and then contemplate what options you have for getting there.
Here are some of the more common clean/prep commands. You may also want to consult the Data Management and Prep section of the Stata tutorial video webpage.
Cloning
Use if the variable is “ready to go” except for its name. Cloning does exactly as the command's name implies -- it generates a clone of a variable while giving it a new name. Any value/variable labels will be cloned in the process. The original variable is left unchanged.
Cloning/renaming tutorial
Links to an external site.
. clonevar newvar = oldvar
Recoding a Categorical Variable
This command is ideal for collapsing categories. It can also be used to reverse code or assign missing values a period. Because you can generate a new variable in the process, this command typically makes cloning unnecessary.
This command can also be used when categorizing an interval variable.
This command does require that you understand and use appropriately the various value assignment options. For example: (# = # “label”)(#/# = #)(# # = #), etc.
-
-
- # = # (current value = this value in new variable)
- #/# = # (the slash indicates “through”; e.g., 1/4 = 1)
- # # = # (putting a space between the two numbers is the equivalent of saying “or”
- Including a "label" in quotes after each assignment will help generate a new label set.
-
Recoding categorical variable tutorial
Links to an external site.
. recode ………, gen(newvar) label(newvarlabelset) test
Reverse Coding Categorical Variables
This is a user-written command that simply reverses your variable (var), generating a new variable in the process. That new variable will have a "rev" prefix attached (e.g., revvar). This command is nice to use if everything is set (variable name, value labels, missing is .) except the order needs to be reversed. You may opt to follow with a follow-up “replace” command if the missing values need to be set to . still.
Reverse coding categorical variables tutorial
Links to an external site.
. revrs var
Encode
Occasionally you may have a variable that is comprised of text entries (e.g., a community organization may have typed in clients' race/ethnicities rather than assigning numeric values -- white; Black; Hispanic, etc.). You may want to convert these text entries into numeric values and have the different text entries become the value labels for those numeric values. Fortunately, Stata makes this fairly easy. Once converted, you may need to collapse categories and/or recode (e.g., collapsing Hispanic and hispanic).
. encode textvar, gen(newvar)
Converting a string variable (that should be viewed as numeric) to numeric format
Occasionally you may have what should be seen by Stata as a numeric variable (e.g., 1, 2, 10, 3, 45) interpreted as a string variable because the variable includes a word, letter, or comma, etc. in some entries (e.g., N/A). Assuming you want to have Stata view the numeric entries as the numbers they are and simply turn the text entries to missing data, you can use the following code (the force command forces non-numeric entries to be converted to missing). Ultimately, you may want to work within the dialogue box for "destring" as there are other options you may want to take advantage of.
. destring var, generate(newvar) force
Encoding and Creating Dummy Variables from String Variables Containing "Check All That Apply"
It's not uncommon to have a question on a survey where respondents are asked to "check all that apply." This typically results in a string (text) variable where you have any number of combinations of fixed text responses, separated by a semi-colon or similar. One way to deal with this is to generate dummy variables for each of the available options. Stata is able to search the list of checked responses and assign a 0 or 1 based on its presence or absence.
. gen option1 = strpos(stringvar, "option1")>0
An example would be a question/variable that asks about social media use (socialmedia), for which respondents could select Facebook, Instagram, Twitter, or Other. Different respondents will obviously select different options or combinations of options. The resulting variable will be a string variable. In this example, you could create a dummy variable for twitter using the following command.
. gen twitter = strpos(socialmedia, "Twitter")>0
Generate/Replace Commands
The generate/replace commands are great to use if you just need to generate a new variable from an old one using a mathematical transformation (e.g., generate mhik=mhi/1000).
Be careful about missing values. (replace newvar = # if existingvar > # will assign missing values a #. So, type: replace newvar = # if existingvar > # & existingvar < . As an example, say you have a race variable like the following that you want to turn into a dichotomous variable. Take note of how the commands in this case need to take into account this quirk in Stata so that respondents who have missing values on race are not assigned a 0 in the new "person of color" (poc) variable.
-
- Race
- 1 – White (100 respondents)
- 2 – Black (50 respondents)
- 3 – Hispanic (35 respondents)
- 4 – Asian (30 respondents)
- 5 – other (10 respondents)
- . (5 respondents have missing values on race)
- . generate poc =.
- . replace poc = 0 if race == 1
- . replace poc = 1 if race > 1 & race < .
- Race
If constructing a new categorical variable from existing variables, you’ll still need to create a value label set. This is most easily done from within the variables manager window.
When using generate/replace, you may need to think very carefully about how you use different options (> < => != & | etc.)
. generate newvar = .
. replace newvar = 1 if (insert condition(s))