After many trials and tribulations in data preparation, I present to you my bird’s-eye guide to the process.
Disclaimer: This is by no means comprehensive, but it is how I like to think about the big-picture steps, with tips in R.
Step 1: Make sure your data file came with a data dictionary. This will be your new best friend, who just so happens to know everything there is to know about your dataset. Get to know it, take it out to dinner, and get comfortable with it.
Step 2: Begin the tidying process. And with tidying comes the tidyverse, the collection of R packages that is my holy grail. I personally like to begin by selecting my rows and columns of interest. Most of the time, I like to do this in small, bite-sized pieces first, like a quick pilot study, to make the data more digestible. For example, I might isolate 5 individuals along with my features of interest and get comfortable with them.
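Here is a minimal sketch of what that pilot subset could look like with dplyr. The data frame `raw` and its column names are made up for illustration; swap in your own.

```r
library(dplyr)

# Hypothetical raw data; the column names are invented for this example.
raw <- data.frame(
  id     = 1:8,
  age    = c(34, 51, 27, 45, 62, 39, 58, 30),
  bmi    = c(22.1, -4, 27.5, 99, 31.2, 24.8, 26.0, -4),
  height = c(170, 165, 180, 158, 175, 169, 172, 160),
  site   = c("A", "B", "A", "C", "B", "A", "C", "B")
)

# Keep only the columns of interest, then the first 5 individuals.
pilot <- raw %>%
  select(id, age, bmi) %>%
  slice_head(n = 5)

pilot
```

If you would rather pick rows by a condition than by position, `filter()` does that job, for example `filter(site == "A")`.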
Step 3: Now I like to inspect my columns of interest, and this is where the pipe operator %>% starts to go crazy. Here, I mutate my columns according to my needs: recode, rename, reorder, change the data type, you name it. You can recode your missing data too, but some decision-making must go into this step. For example, let’s say my variable for BMI has two types of missing data: a value of -4 indicates “not available” and a value of 99 indicates “not evaluated.” If you want to treat both as “missing,” you can recode them as NA; if not, you can keep them as is. CAVEAT: it depends on your data, and there are different types of missing data! However, converting your missing data to NA is helpful because some packages may not recognize the way your missing data is coded, which matters especially if you choose to impute your data. I like to use the mice R package for imputation.
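To make the BMI example concrete, here is a rough sketch of this step with dplyr. The data frame, the `sex` coding (1 and 2), and the new column name `participant_id` are all assumptions for illustration; only the -4 and 99 codes come from the example above.

```r
library(dplyr)

# Hypothetical data: -4 = "not available", 99 = "not evaluated".
df <- data.frame(
  id  = 1:6,
  bmi = c(22.1, -4, 27.5, 99, 31.2, -4),
  sex = c(1, 2, 2, 1, 2, 1)
)

clean <- df %>%
  mutate(
    # Treat both missing-data codes as NA.
    bmi = if_else(bmi %in% c(-4, 99), NA_real_, bmi),
    # Recode the numeric codes and change the data type to a factor.
    sex = factor(sex, levels = c(1, 2), labels = c("male", "female"))
  ) %>%
  # Rename a column, then reorder the columns.
  rename(participant_id = id) %>%
  relocate(sex, .after = participant_id)

# Count how many BMI values ended up missing.
sum(is.na(clean$bmi))
```

If the two codes mean different things for your analysis, one option is to recode only one of them and leave the other as is. Once missing values are coded as NA, the data frame is in the form that imputation functions such as `mice::mice()` work with.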
Step 4: Your data is looking good; your variables are coded the way you want them, and your missing data has been handled. You can now do other things like normalizing, scaling, and combining variables if needed. Here I usually find myself having to create composite variables from my base variables.
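As a sketch of what this can look like, the example below scales one variable two ways and builds a composite score. The questionnaire items (`item1` to `item3`) and the choice of a row mean as the composite are assumptions; your composite might be a sum, a ratio, or something else entirely.

```r
library(dplyr)

# Hypothetical data with three questionnaire items.
df <- data.frame(
  id    = 1:5,
  age   = c(34, 51, 27, 45, 62),
  item1 = c(3, 4, 2, 5, 1),
  item2 = c(4, 4, 1, 5, 2),
  item3 = c(2, 5, 2, 4, NA)
)

prepared <- df %>%
  mutate(
    # Standardize to mean 0 and SD 1; as.numeric() turns the
    # one-column matrix returned by scale() into a plain vector.
    age_z = as.numeric(scale(age)),
    # Min-max normalization to the range [0, 1].
    age_01 = (age - min(age)) / (max(age) - min(age)),
    # Composite variable: mean of the items, ignoring missing ones.
    score_mean = rowMeans(cbind(item1, item2, item3), na.rm = TRUE)
  )

prepared
```

Whether to ignore missing items with `na.rm = TRUE` or to let the composite become NA is another one of those decisions that depends on your data.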
Step 5: It’s nice to check that your data cleaning is going well by visualizing your data along the way. Plot a quick histogram to confirm that each variable’s distribution looks right. Pull up some summary stats to verify that your variables made it through cleaning as intended.
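A quick check along these lines might look like the sketch below. The BMI values are simulated purely for illustration, and the bin count is an arbitrary choice.

```r
library(ggplot2)

# Hypothetical cleaned data: simulated BMI values plus two NAs.
set.seed(1)
clean <- data.frame(bmi = c(rnorm(50, mean = 26, sd = 4), NA, NA))

# Summary stats: check the range and how many NAs remain.
summary(clean$bmi)

# Histogram of the distribution; na.rm = TRUE drops the NAs
# without a warning.
p <- ggplot(clean, aes(x = bmi)) +
  geom_histogram(bins = 15, na.rm = TRUE) +
  labs(x = "BMI", y = "Count")

print(p)
```

A leftover spike at -4 or 99 in a plot like this would be a hint that a missing-data code slipped through the recoding step.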
You’ve made it from dirty to clean — Hoorah! This was a simple guide and I appreciate you following along. We in the biz like to say “garbage in, garbage out” but if you keep it tidy, it’ll be mighty!