Question: What are best practices and tools to help clean up data sets?

stevie asked on August 23, 2019 19:07
173 views | 0 answers | #20642


I was recently at an event in Black River Falls, WI with @crispinpierce and his students @Aleah and @Cbarnes9, who were presenting on the Purple Air data they have been collecting in Western Wisconsin in areas affected by frac sand mining. @Aleah mentioned they have a good bit of data from this that needs to be cleaned up before it can be used, and a lot of it she has been cleaning by hand. I was wondering if people could share some best practices around cleaning data sets, and some tools you've found helpful for this? I have a couple I'll share below!



17 Comments

@Aleah I recently took a Data Carpentry course on this topic which offered some neat tools and practices.

For best practices, here are some of the things they highlight:

  • Never modify your raw data. Always make a copy before making any changes. (This was their biggest one; they brought it up repeatedly.)
  • Keep track of all of the steps you take to clean your data.
  • Record metadata in a separate plain text file.
  • Use data validation to prevent accidentally entering invalid data.
  • Avoid spreading data across multiple tabs (but do use a new tab to record data cleaning or manipulations).
  • Record zeros as zeros.
  • Use an appropriate null value to record missing data.
  • Don’t use formatting to convey information or to make your spreadsheet look pretty.
  • Place comments in a separate column.
  • Record units in column headers.
  • Include only one piece of information in a cell.
  • Avoid spaces, numbers and special characters in column headers.
  • Avoid special characters in your data.
  • Data stored in common spreadsheet formats will often not be read correctly into data analysis software, introducing errors into your data.
  • Exporting data from spreadsheets to formats like CSV or TSV puts it in a format that can be used consistently by most programs.
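Several of these practices can be enforced in code rather than by hand when exporting. Here's a minimal, hypothetical Python sketch (the column names and the "NA" null marker are just examples, not from the course) of writing a tidy CSV with units in the headers, zeros recorded as zeros, and an explicit null value for missing readings:

```python
import csv
import io

# Explicit null marker so missing data is distinguishable from a true zero.
NULL = "NA"

# Hypothetical sensor rows: units live in the column headers,
# one piece of information per cell, comments in their own column.
rows = [
    {"date_utc": "2019-08-23", "pm25_ugm3": 12.4, "comment": ""},
    {"date_utc": "2019-08-24", "pm25_ugm3": 0.0, "comment": "calibration day"},
    {"date_utc": "2019-08-25", "pm25_ugm3": None, "comment": ""},  # missing reading
]

def to_clean_csv(rows, fieldnames):
    """Serialize rows to CSV text, substituting NULL for missing values."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        writer.writerow({k: (NULL if row[k] is None else row[k]) for k in fieldnames})
    return buf.getvalue()

print(to_clean_csv(rows, ["date_utc", "pm25_ugm3", "comment"]))
```

Because the raw `rows` are never modified, the original data survives the export untouched.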

For tools, one they introduced that I really enjoyed for cleaning data was OpenRefine. Their website describes it as "a powerful tool for working with messy data: cleaning it; transforming it from one format into another; and extending it with web services and external data." There's a good intro video on their homepage. And for the full lesson plans from the course I took, you can find the syllabus, lessons, and resources they shared here!

Hope this helps!

Hi @stevie, @Aleah, @Cbarnes9, and @crispinpierce,
stevie's list is excellent, but perhaps more detailed than is necessary. Please correct me if I am mistaken about what the UWEC group is doing, but Purple Air allows its data to be downloaded (http://www.purpleair.com/sensorlist), and this can be opened with EXCEL. I have found that data cleaning (which can mean different things to different people, and depends on the analysis software one wants to use) is relatively simple within EXCEL. However, many of the functions one might need are not commonly described. If Aleah can post about 20 rows of data and describe the cleaning she is doing by hand, I might be able to help out. I have found no problem manipulating this downloaded data in EXCEL and importing it into R for analysis and graphing.


Hi @jeffalk , thanks for offering help! Which are your favorite functions to use within Excel for data cleaning? Can you describe your process here?



Hi @liz, I just lost my lengthy comment. I clicked on the "preview" button below and then wanted to change some things and couldn't get back and all was gone. Maybe I'll try again tomorrow.


oh no! I wonder if there's a way to make it more obvious that once someone has clicked "preview" to preview their comment, that the "preview" button must be clicked again to resume editing. What do you think?



Hello @liz; that might help if someone, like me, had sense enough to look at the buttons down at the bottom (as I did not!). Here, I'll try again.

I was hoping @Aleah would respond with an example of what she was working on. I believe context and example are often more helpful for understanding than an attempt at abstract explanation. So I will supply some context and example.

I was doing some home heating analysis for our home last winter. I have data loggers around my house and yard for temperature, and some loggers to indicate the times the furnaces are on. But I do not have anything for other outside weather factors. So I went to www.ncdc.noaa.gov, the National Climatic Data Center, and downloaded data from the Iowa City airport, since I live in Iowa City. Workbook 1 below contains 20 rows of data, columns A through K, selected from the download.

Workbook1.xlsx

It should be clear that many of Stevie's best practices are violated and the data needs cleaning. Columns L and beyond are what I did and workbook 2 below will help explain in detail:

Workbook2.xlsx

First, column A, labelled "Date", is text. One can tell because it is ALIGNED on the left. Workbook 2 columns C through G show how I extract year, month, day, hour, and minute from cell A2 using the "MID" Excel function; use the EXCEL Help to get the details of the MID function. Looking at cell C2 shows how to get the year. Note the multiplication by "1" (*1): the MID function returns text format, and multiplying by "1" changes it to standard, numerical format. Cell H2 shows how to use the "DATE" Excel function to put these together to get date and time into a standard format. (Ignore the "+1462", which I had to add to compensate for the difference between MS dating and Macintosh dating.) These could all be put into one function if one is so inclined.

Workbook 2 cell D5 shows cell D8 from workbook 1. This is also aligned left and is text. The glossary for the download told me that text in that position was a very small amount of precipitation, so I wanted to ignore text. Workbook 2 cell F5 shows that if the cell is empty, I code it as "-99"; if it is text, I code it as "-199"; otherwise it just stays as the number it is. Workbook 2 cell E8 is from workbook 1 cell E8. I was only interested in whether there was rain or snow, so I used the MID function and a series of "IF" statements to extract what I wanted, as shown in cell H8. I use "-99" and "-199" often because these numbers are unlikely to be actual weather data values (and they will be obvious in a graph). Of course I have to write down this coding so I remember it.

One just uses Edit > Fill > Down to apply these transforms to the entire data set. Everything I wanted is then copied into columns S through AE, as in workbook 1. I begin a NEW workbook, copy these cells, and use Paste Special > Values to paste them into the new workbook. Paste Special > Values eliminates the functional history, leaving everything as plain numerical values. This is then saved as ".csv" and imports easily into R.

It should be obvious that I am not a "programmer". Efficiency and elegance are not my top skills; I prefer something I understand and can effect quickly. Date and time are the only potential problems I saw in downloaded Purple Air data, so I hope this helps Aleah.
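For anyone doing the same cleanup outside of Excel, the MID/DATE extraction and the -99/-199 coding translate to a few lines of Python. This sketch assumes a YYYYMMDDhhmm text timestamp, so the slicing offsets would need adjusting to the actual layout of the NCDC download:

```python
from datetime import datetime

def parse_date(text):
    """Python counterpart of the MID + DATE formulas: slice a text
    timestamp (assumed YYYYMMDDhhmm here) into a real datetime."""
    year = int(text[0:4])    # like MID(A2, 1, 4) * 1
    month = int(text[4:6])   # like MID(A2, 5, 2) * 1
    day = int(text[6:8])
    hour = int(text[8:10])
    minute = int(text[10:12])
    return datetime(year, month, day, hour, minute)

def code_value(cell):
    """Mirror the coding scheme: empty cell -> -99, text -> -199,
    anything numeric stays as the number it is."""
    cell = cell.strip()
    if cell == "":
        return -99.0
    try:
        return float(cell)
    except ValueError:
        return -199.0
```

As in the spreadsheet version, the sentinel values -99 and -199 stand out in a graph, and `int(...)` plays the role of the `*1` trick that converts text to numbers.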

As for my project of home heating analysis it dawned on me in the spring that whether my roof was snow covered and white (and reflective) or clear and charcoal could make a difference on heat in the attic and second floor. Unfortunately NCDC data won't tell me about my roof color. Maybe next winter!


Thank you very much @jeffalk ! Great narrative.




@IshaGupta18 maybe your guide to Simple Data Grapher could address some of this? 👍


hi @IshaGupta18 i'd love to see your guide to Simple Data grapher when it's ready :)




Oh man. Where to start? Even before you correct based on a model, you would want some normalization by timestamp at maybe 5 minute or hour intervals. https://t.co/3Ns6mSHzHT

— Calvin ([@CalvinCupini](/profile/CalvinCupini)) August 26, 2019
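The timestamp normalization Calvin describes can be sketched with the standard library alone; the 5-minute interval and the mean as the aggregate are assumptions (a median may suit noisy sensor data better):

```python
from collections import defaultdict
from datetime import datetime, timedelta

def floor_to_interval(ts, minutes=5):
    """Round a timestamp down to the start of its interval."""
    return ts.replace(second=0, microsecond=0) - timedelta(minutes=ts.minute % minutes)

def normalize(readings, minutes=5):
    """Average (timestamp, value) pairs within fixed intervals,
    returning {interval_start: mean_value} in time order."""
    bins = defaultdict(list)
    for ts, value in readings:
        bins[floor_to_interval(ts, minutes)].append(value)
    return {ts: sum(vals) / len(vals) for ts, vals in sorted(bins.items())}
```

Once readings are snapped to a shared grid like this, data from different sensors can be compared row by row before any model-based correction.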




@stevie, @warren, @jeffalk, @Aleah and @Cbarnes9. Our issue with the data files is to be able to amalgamate numerous 24-hour csv files that are written to the PurpleAir SD card into a single Excel workbook so we can analyze and plot all the data.

Four such files are attached to this note. Any help would be appreciated.

(Having trouble dragging and dropping the four files into this post.)


Hi Crispin - about getting these files up, sorry you're having trouble. Curious, are you dragging them into your email in reply or directly onto the Public Lab website? If you're on your email, you can try clicking the link to the website. You can also just email them to me and I can try to help!!



Hi @stevie, @Aleah, @Cbarnes9, and @crispinpierce, I may be jumping the gun here, but since .csv files can be opened by EXCEL and copied and pasted into a single workbook, I assume the problem lies in what is meant by "amalgamate". So besides some example files, please supply some more explanation of the amalgamating you have in mind. Thanks,
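If "amalgamate" simply means stacking the daily files into one table, that can be scripted. Here's a minimal Python sketch, assuming each 24-hour file shares the same header row (which should hold for PurpleAir SD-card logs from the same firmware); the file names in the demo are made up:

```python
import csv
import glob
import os
import tempfile

def amalgamate(pattern, out_path):
    """Concatenate every CSV matching `pattern` into one file,
    keeping the header row from the first file only."""
    paths = sorted(glob.glob(pattern))
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for i, path in enumerate(paths):
            with open(path, newline="") as f:
                reader = csv.reader(f)
                header = next(reader)  # skip repeated headers after the first file
                if i == 0:
                    writer.writerow(header)
                writer.writerows(reader)

# Demo with two tiny stand-in daily files in a temp directory:
d = tempfile.mkdtemp()
for name, row in [("day1.csv", ["2019-08-23 00:00", "12.1"]),
                  ("day2.csv", ["2019-08-24 00:00", "9.8"])]:
    with open(os.path.join(d, name), "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["time", "pm25"])
        w.writerow(row)

merged = os.path.join(d, "all_days.csv")
amalgamate(os.path.join(d, "day*.csv"), merged)
```

The merged .csv then opens as a single sheet in Excel, or reads straight into R with `read.csv`.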


We might also find a time to jump on a video call, and do some screensharing to get on the same page (literally) and make some progress! Every Tuesday at 2pm Central / 3pm ET, anyone can click into our "office hours" here: https://zoom.us/j/934187763 . Notes here: https://pad.publiclab.org/p/opencall



