Public Lab Wiki documentation



sandbox-cleaning-and-organizing-data

This is a revision from November 23, 2021 23:51.

After you’ve collected environmental data from a sensor, monitor, or other piece of equipment, one of the next steps is to organize and “clean” it!

Cleaning includes making sure the dataset is complete and consistent. Organizing the data into a table in a meaningful way gets it ready for making charts, graphs, and other visualizations. Below are some resources on cleaning data, including making tables of tidy data.


Making tables of tidy data

Stylized text providing an overview of Tidy Data. The top reads “Tidy data is a standard way of mapping the meaning of a dataset to its structure. - Hadley Wickham.” On the left reads “In tidy data: each variable forms a column; each observation forms a row; each cell is a single measurement.” There is an example table on the lower right with columns ‘id’, ‘name’ and ‘color’ with observations for different cats, illustrating tidy data structure.


There are two sets of anthropomorphized data tables. The top group of three tables are all rectangular and smiling, with a shared speech bubble reading “our columns are variables and our rows are observations!” Text to the left of that group reads “The standard structure of tidy data means that ‘tidy datasets are all alike…’” The lower group of four tables are all different shapes, look ragged and concerned, and have different speech bubbles reading (from left to right) “my column are values and my rows are variables”, “I have variables in columns AND in rows”, “I have multiple variables in a single column”, and “I don’t even KNOW what my deal is.” Next to the frazzled data tables is the text “‘...but every messy dataset is messy in its own way.’ -Hadley Wickham”.

Images: Illustrations from the Openscapes blog “Tidy Data for reproducibility, efficiency, and collaboration” by Julia Lowndes and Allison Horst, CC BY
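
As an illustration of one messy layout described above, where column headers hold values rather than variable names, here is a minimal Python sketch that reshapes such a table into tidy form. The sensor IDs, times, and readings are made-up placeholders, and the function name `to_tidy` and the column names `time` and `pm25` are hypothetical choices for this example only.

```python
# Made-up "wide" table: one row per sensor, one column per time of day.
# The column headers ("10:00", "10:01") are values, not variable names.
wide_rows = [
    {"id": "A1", "10:00": 12.1, "10:01": 12.4},
    {"id": "B2", "10:00": 8.3, "10:01": 8.0},
]

def to_tidy(rows, id_key="id", variable="time", value="pm25"):
    """Turn each non-id column of each row into its own observation row."""
    tidy = []
    for row in rows:
        for key, cell in row.items():
            if key == id_key:
                continue  # the id is repeated on every tidy row instead
            tidy.append({id_key: row[id_key], variable: key, value: cell})
    return tidy

tidy_rows = to_tidy(wide_rows)
for record in tidy_rows:
    print(record)
```

After reshaping, each row is one observation (one sensor at one time) and each column is one variable. Spreadsheet tools and data libraries offer their own ways to do the same reshaping; this loop is only meant to show the idea.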



An example of “tidy data” from an air quality sensor might look like this:

An example of tidy air quality data in a table. There are four columns of variables with names and units 'id', 'date (dd-mm-yy)', 'time (hh:mm:ss)', and 'PM2.5 concentration (micrograms per meter cubed)'. There are four rows of data.


Each variable forms a column: sensor ID number, date, time, and the air quality measurement of particulate matter are individual variables. Each variable gets its own column in the table. The column header at the top lists the variable name and its units of measurement.

Each observation forms a row: this sensor took an air quality measurement every minute. Each measurement gets its own row in the table.

Each cell is a single measurement: each cell in the table holds one piece of data, such as one time or one PM2.5 measurement.
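
The three rules above can be sketched in a few lines of Python using only the standard library. This is an illustrative example, not a prescribed workflow: the sensor ID, dates, times, and PM2.5 values are made-up placeholders, and the column names are assumptions modeled on the example table.

```python
import csv
import io

# Hypothetical column names modeled on the example table; units go in the header.
COLUMNS = ["id", "date (dd-mm-yy)", "time (hh:mm:ss)", "PM2.5 (ug/m3)"]

# Made-up placeholder readings: one row per observation, one value per cell.
ROWS = [
    ["A1", "23-11-21", "10:00:00", "12.1"],
    ["A1", "23-11-21", "10:01:00", "12.4"],
    ["A1", "23-11-21", "10:02:00", "11.9"],
]

def write_tidy_csv(columns, rows):
    """Return CSV text with one header row and one line per observation."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return buffer.getvalue()

def read_tidy_csv(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))

csv_text = write_tidy_csv(COLUMNS, ROWS)
records = read_tidy_csv(csv_text)
print(csv_text)
print(records[0]["PM2.5 (ug/m3)"])
```

Because every row has the same columns, any single reading can be looked up by its column name, which is what makes a tidy table easy to chart or summarize later.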



Cleaning data

More to come here!
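
In the meantime, here is a rough sketch of what checking a dataset for completeness and consistency could look like in Python. It assumes a tidy table like the air quality example above, with one reading expected every minute. The rows are made-up placeholders that contain deliberate problems, and the function names `is_number` and `find_problems` are hypothetical.

```python
from datetime import datetime, timedelta

# Made-up placeholder rows: (id, date dd-mm-yy, time hh:mm:ss, PM2.5 reading as text).
rows = [
    ("A1", "23-11-21", "10:00:00", "12.1"),
    ("A1", "23-11-21", "10:01:00", ""),       # empty cell
    ("A1", "23-11-21", "10:03:00", "11.9"),   # the 10:02:00 reading is missing
    ("A1", "23-11-21", "10:04:00", "n/a"),    # not a number
]

def is_number(text):
    """Return True if the text can be read as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False

def find_problems(rows, step=timedelta(minutes=1)):
    """Return (row_index, message) pairs describing possible issues."""
    problems = []
    previous = None
    for index, (sensor_id, date, time, value) in enumerate(rows):
        # Completeness: every cell should hold a value.
        if not all(cell.strip() for cell in (sensor_id, date, time, value)):
            problems.append((index, "empty cell"))
        # Consistency: the reading should be a number.
        elif not is_number(value):
            problems.append((index, "reading is not a number"))
        try:
            stamp = datetime.strptime(f"{date} {time}", "%d-%m-%y %H:%M:%S")
        except ValueError:
            problems.append((index, "unreadable date or time"))
            continue
        # Completeness: timestamps should advance by the expected step.
        if previous is not None and stamp - previous != step:
            problems.append((index, "gap or irregular interval before this row"))
        previous = stamp
    return problems

for index, message in find_problems(rows):
    print(f"row {index}: {message}")
```

This sketch only flags possible issues. Deciding what to do with a flagged row (fix it, drop it, or keep it with a note) depends on the dataset and on how the data will be used.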



Questions on organizing and cleaning data




More resources on organizing and cleaning data