Anyone who has ever tried to do analysis on data which was collected without any kind of validation knows the misery of trying to tidy it up so the results can be used without too much pain.
Issues that prevent instant analysis include:
- Leading spaces
- Trailing spaces
- Extra spaces between words
- Spelling mistakes
- Use of letters instead of digits (letter O instead of the figure 0)
- Use of digits instead of letters (figure 0 instead of letter O)
- Missing data on an ID field
- Incorrectly formed dates
... the list goes on!
To save a lot of time and heartache, it's important to make sure your data collection as is clean as it can be - even if it doesn’t feel necessary at the time.
Perhaps you have a spreadsheet shared amongst a team for tracking something, maybe using Google Forms (other forms automation solutions are available) to get booking information for an event, you might even have custom web forms which send data into a SQL database.
You might want to use that data in a way you don't currently intend one day.
If saving from web pages into a database it is vitally important that you protect against the various text field attack vectors (SQL injection is a notable example).
Our advice when implementing your data collection system (the best system being some sort of form) is to provide a pick list for people to choose instead of free text.
There are several more solutions for helping with data entry which have the added benefit of saving keystrokes including date pickers and mandatory fields.
If you are struggling with cleaning up a dataset, need help adding data validation to your solution or help building a solution from scratch, please let us know - we're looking forward to helping you.
#data #people #automation #sql #database #building #forms #dataquality #cleandata