When implementing big data, a frequently used strategy is to harvest all the data possible and then look for ways to extract the most helpful metrics for analysis later. This generates a large volume of data, but a company’s dataset often ends up including a lot of unnecessary information that simply gets in the way. Data collection does not have to be so inefficient; there is a better way.
A saying often attributed to Mark Twain goes, “Data is like garbage. You’d better know what you are going to do with it before you collect it.” Planning for the collection of data is as important as the collection itself, and the data field is critical to the planning process. The results of big data extraction, after all, can only be as good as the data field they are based on.
So, how do you improve the data field? Start by defining the purpose and context of the data in your application, and everything else will follow from there.
Defining Purpose and Context
It is risky to collect a large dataset in the hope of finding uses for it later. Without a defined purpose and a plan for extraction, the data may not be collected in a form that suits any future use, or additional data may be needed to make it useful. Those in decision-making positions need to ask themselves, “What do we hope to accomplish by collecting and analyzing certain datasets?” If management is unable to answer that question, no volume of data will improve things.
Once purpose has been established, decision-makers must look at context. A company may extract large volumes of data that fit within its purpose while at the same time finding that some of that data is not relevant to the task at hand. Data should be utilized within the context of the specific goals and tasks that make up the greater purpose.
Create a Data Dictionary
There is a concept within the big data paradigm known as a ‘data dictionary’. This dictionary defines multiple parameters, covering everything from data element definitions to validation rules, and gives the data a standard form. The dictionary is critical to those tasked with collecting and analyzing data: it defines what they are working with and helps them communicate with stakeholders to ensure requirements are being met. In addition, it groups information in one place, which makes database design and management easier. Below is a simple example of a data dictionary for an employee data study. Naturally, other fields can be used, such as the precision of a measurement or the confidence in an answer, but the idea is to clearly define the parameters of your data using fields that add value to the data.
| Attribute Name | Type | Allowed Values | Notes |
| Employee ID Number | Numeric | 0001-9999 | ID number assigned to participant in sequential order |
| Group Number | Numeric | 1-30 | Group assigned to participant based on ID number |
| Age in Years | Numeric | 18-75 | Participant’s age |
| Gender | Numeric | 1=male, 2=female | Participant’s gender |
| Date of Survey | Date (mm/dd/yyyy) | 01/01/2017-01/01/2018 | When the employee completed the survey |
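A data dictionary like the one above can also be expressed in code, so that the same definitions drive validation at collection time. The following is a minimal Python sketch, not a definitive implementation. The names `Field`, `DATA_DICTIONARY`, `int_in_range`, `date_in_range`, and `validate`, as well as the record keys such as `employee_id`, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Callable


@dataclass(frozen=True)
class Field:
    """One row of the data dictionary (hypothetical structure)."""
    name: str
    notes: str
    is_valid: Callable[[object], bool]


def int_in_range(low, high):
    """Build a check that accepts integers between low and high, inclusive."""
    def check(value):
        is_int = isinstance(value, int) and not isinstance(value, bool)
        return is_int and low <= value <= high
    return check


def date_in_range(start, end):
    """Build a check that accepts mm/dd/yyyy strings between start and end."""
    def check(value):
        try:
            parsed = datetime.strptime(value, "%m/%d/%Y").date()
        except (TypeError, ValueError):
            return False
        return start <= parsed <= end
    return check


# Mirrors the example table; the dictionary keys are assumed names.
DATA_DICTIONARY = {
    "employee_id": Field("Employee ID Number", "Assigned in sequential order",
                         int_in_range(1, 9999)),
    "group": Field("Group Number", "Assigned based on ID number",
                   int_in_range(1, 30)),
    "age": Field("Age in Years", "Participant's age", int_in_range(18, 75)),
    "gender": Field("Gender", "1=male, 2=female", int_in_range(1, 2)),
    "survey_date": Field("Date of Survey", "When the survey was completed",
                         date_in_range(date(2017, 1, 1), date(2018, 1, 1))),
}


def validate(record):
    """Return a list of problems for one record; an empty list means valid."""
    problems = []
    for key, field in DATA_DICTIONARY.items():
        if key not in record:
            problems.append(f"{field.name}: missing")
        elif not field.is_valid(record[key]):
            problems.append(f"{field.name}: value {record[key]!r} not allowed")
    for key in record:
        if key not in DATA_DICTIONARY:
            problems.append(f"{key}: not defined in the data dictionary")
    return problems
```

Calling `validate` on each incoming record flags out-of-range values, missing fields, and fields that were never defined, which keeps undefined data from entering the dataset in the first place.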
Conduct Regular Data Purges
Not all data is useful, despite what many big data evangelists claim. Data is only useful if it fits into an organization’s purpose and the tasks necessary to fulfill that purpose. As such, there is no point in keeping data that has no use. A regular purge of useless data will improve the data field by keeping it as small and concise as possible.
Failing to purge unnecessary data just causes the field to grow with each passing day. The more data present in the field, the more difficult it is to find and extract the data that serves the purpose, and the more room there is for irrelevant data to bias the analysis.
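As a rough illustration of what a purge might look like, the Python sketch below keeps only the fields an organization has decided it needs and discards everything else. The names `NEEDED_FIELDS` and `purge`, and the sample field names, are assumptions made for this example; a real purge would follow the organization's own definitions and retention rules.

```python
# Fields the organization has decided serve its purpose (assumed names).
NEEDED_FIELDS = {"employee_id", "group", "age", "gender", "survey_date"}


def purge(records, needed_fields=NEEDED_FIELDS):
    """Return trimmed copies of records plus a count of removed values.

    Fields outside needed_fields are dropped, and records left with no
    fields at all are discarded entirely.
    """
    kept = []
    removed_values = 0
    for record in records:
        trimmed = {k: v for k, v in record.items() if k in needed_fields}
        removed_values += len(record) - len(trimmed)
        if trimmed:
            kept.append(trimmed)
    return kept, removed_values
```

Returning the number of removed values alongside the trimmed records makes each purge easy to log and review.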
Track Data Changes
Taking regular snapshots of the data field makes it possible to track changes in collection and analysis over time. Why does this matter? Because changes within the company may change either purpose or context. Data field snapshots make it easier to realign the data field to accommodate any such changes.
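One simple way to picture a snapshot is as a summary of which fields are being collected and how many records contain each one, which can then be compared against an earlier summary. The Python sketch below works under that assumption. The functions `take_snapshot` and `diff_snapshots` and the snapshot layout are hypothetical, not a standard format.

```python
from collections import Counter


def take_snapshot(records, label):
    """Summarize the data field: record count and per-field usage counts."""
    field_counts = Counter()
    for record in records:
        field_counts.update(record.keys())
    return {
        "label": label,
        "record_count": len(records),
        "field_counts": dict(field_counts),
    }


def diff_snapshots(old, new):
    """Report which fields appeared or disappeared between two snapshots."""
    old_fields = set(old["field_counts"])
    new_fields = set(new["field_counts"])
    return {
        "added_fields": sorted(new_fields - old_fields),
        "removed_fields": sorted(old_fields - new_fields),
        "record_count_change": new["record_count"] - old["record_count"],
    }
```

Comparing snapshots taken at regular intervals shows when new fields crept into collection or old ones fell out of use, which is a prompt to check whether the data field still matches the current purpose and context.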
Big data is a powerful concept that helps many organizations make better decisions. For it to perform as intended, though, the data field must be continually improved. The more deliberate the data collection, the more meaningful and useful the results.