Best Practices for Managing Data, Both Big and Small
By Linda Powell, Chief Data Officer, Consumer Financial Protection Bureau
What is Big Data?
In simple terms, big data is just more: more complexity, more volume, and more velocity. To illustrate the pace of change, ten years ago I tracked my steps per day with a pedometer. Now I can wear a Fitbit that measures my number of steps, heart rate, stairs climbed, sleep patterns, and more for every hour of the day. With a Fitbit my data are more complex, the volume of the data is larger, and the velocity of information is higher. Wearable technology is just one example of the application of big data.
Is Big Data Managed Differently?
Yes and no. Data management best practices are the same regardless of the size of the data. However, the need to organize data efficiently is magnified as the size and complexity grow. A sound data management program must focus on the fundamentals before it can deliver reliable data analytics and business intelligence. In Maslow’s hierarchy of needs you need food and safety before you can have love, esteem, or self-actualization. Focusing on love or esteem when you don’t have food or safety makes it hard to survive. Similarly, trying to create an analytics program without an appropriate technical and governance infrastructure is, at best, inefficient.
It is common for an analyst to start with data management by getting some data and some statistical software. Under this approach, the analyst may spend 80 percent of their time doing prep work to organize and understand the data and 20 percent performing analysis. This approach works for an isolated project but does not enable or support an enterprise analytics program.
The most often overlooked fundamental of data management is the curation of metadata. Metadata is the information that describes the data, much like the label on a soup can. Metadata tells you what you have and what it means, and it comes in several kinds. Metadata can be a catalog: just as a library card catalog tells you about the library’s book collection without handing you the books, a data catalog tells you which datasets exist without giving you access to the data. Metadata can also be organized at the data element level to give meaning to the data items within a dataset, much like the information on the soup can label tells you about the contents.
In fact, I often equate data management to a can of soup. The soup is the data, the can is the database, and the label is the metadata. If you rip the label off the can, you can still eat the soup but you may not know if it is cream of mushroom or cream of celery soup until you open the can. Now imagine a pantry full of cans with no labels.
To understand what you have, you would need to open all of the cans and examine the contents. When the only metadata maintained about the content of a file or dataset is the name of the file, you have the equivalent of a pantry where the only labels are soup, beans, vegetable, fruit, and other. It makes cooking challenging and, if you have to worry about food allergies, potentially dangerous.
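The two kinds of metadata described above, a catalog entry for the dataset and definitions for each data element, can be sketched in a few lines of Python. This is only an illustration: the class names, fields, and the wearable-style dataset are hypothetical and not drawn from any particular metadata standard.

```python
from dataclasses import dataclass, field


@dataclass
class ElementMetadata:
    """Element-level metadata: the 'label' for one data item."""
    name: str
    definition: str
    data_type: str
    unit: str = ""


@dataclass
class DatasetMetadata:
    """Catalog-level metadata: describes a dataset without holding its data."""
    title: str
    description: str
    elements: dict = field(default_factory=dict)

    def add_element(self, element: ElementMetadata) -> None:
        self.elements[element.name] = element

    def describe(self, column: str) -> str:
        """Return the label for a column, or say that none was recorded."""
        element = self.elements.get(column)
        if element is None:
            return f"{column}: no metadata recorded"
        unit = f" ({element.unit})" if element.unit else ""
        return f"{element.name}{unit}: {element.definition}"


# Hypothetical dataset, for illustration only.
activity = DatasetMetadata(
    title="daily_activity",
    description="Hourly readings from a wearable device",
)
activity.add_element(
    ElementMetadata("steps", "Steps taken during the hour", "integer", "count")
)

print(activity.describe("steps"))
print(activity.describe("hr"))
```

The second call shows the unlabeled can: a column that exists in a file but has no recorded definition can still be used, but only after someone opens it up and works out what it contains.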
Another data management fundamental is the use of data standards. Standardized definitions are often overlooked or undervalued, yet standards ensure that things are interoperable, easier to work with, and more efficient. Our lives are filled with standards, whether they apply to our electrical outlets, internet protocols, or the dictionaries we use. Just like these standards, data standards allow for the seamless flow of data across users without needing to transform and redefine the data at each stage of use. In short, standards help make data management and usage more efficient, easier, and cheaper, and they result in higher quality data.
The use of data standards can also reduce the need for cross-references that allow the joining of different datasets. For example, companies have a variety of identifiers including tax IDs, stock tickers, and various IDs assigned by vendors. When trying to join financial statement data with stock prices and credit ratings, the first step is to create a cross-reference of identifiers for all the companies you are analyzing. As noted above, this type of prep work can account for 80 percent of the time spent on an analytic project. Alternatively, if the cross-reference data is curated and maintained in a database that all of the analysts can use, the analyst can skip the data prep work and go straight to the analysis. When everyone adheres to a data standard that uses the same identifier for companies, the need to curate a cross-reference is eliminated.
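A minimal sketch of the cross-reference idea follows, assuming three small datasets that each identify companies differently. All identifiers, values, and the function name are hypothetical and serve only to show the shape of the join.

```python
# Hypothetical cross-reference (crosswalk): one row per company,
# listing every identifier that company is known by.
crosswalk = [
    {"tax_id": "T-001", "ticker": "AAA", "vendor_id": "V9"},
    {"tax_id": "T-002", "ticker": "BBB", "vendor_id": "V7"},
]

# Three datasets, each keyed by a different identifier.
financials = {"T-001": {"revenue": 120.0}, "T-002": {"revenue": 80.0}}  # by tax ID
prices = {"AAA": 10.5, "BBB": 22.0}                                     # by ticker
ratings = {"V9": "A", "V7": "BB"}                                       # by vendor ID


def join_via_crosswalk(crosswalk, financials, prices, ratings):
    """Combine the three datasets into one row per company."""
    rows = []
    for ids in crosswalk:
        rows.append({
            "tax_id": ids["tax_id"],
            "revenue": financials[ids["tax_id"]]["revenue"],
            "price": prices[ids["ticker"]],
            "rating": ratings[ids["vendor_id"]],
        })
    return rows


combined = join_via_crosswalk(crosswalk, financials, prices, ratings)
for row in combined:
    print(row)
```

If all three sources used the same company identifier, the crosswalk and the translation step inside the loop would disappear, and the join would reduce to a lookup on one shared key.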
Metadata, standards, and crosswalks are critical elements in addressing complexity. To address the problem of more volume, there are new technologies such as Massively Parallel Processing databases and Hadoop that help enable faster processing and retrieval when working with terabytes of data. These technologies offer advantages such as speed, and many of them are open source, which can help reduce costs. However, retraining end users to access data from unfamiliar platforms can be time consuming and costly. Fortunately, there continue to be advances that help these new tools replicate common access methods such as SQL.
Is More Data Always Better?
In short, it depends. Weather forecasting was one of the first uses of big data, and having more data has facilitated better predictions of weather hazards to help save lives. Another driver of the prevalence of big data is the evolution of social media. Social media consumes and analyzes large quantities of data to derive new insights, such as making suggestions on what individuals might like or who they might know. But many of the types of analysis currently performed using big data were successfully conducted before the term entered the lexicon. So the question, “is more better?” remains.
Historically, economists would evaluate financial markets by examining samples from within the market. Using a sample requires knowledge of how the sample was drawn and in some cases weights that need to be reviewed and updated. The evolution of technology has made it easier to process more data and in some cases to draw on the entire population of interest. However, the costs associated with collecting and curating more data can outweigh the benefits of not having to define samples or potential precision improvements, and challenges in understanding the wider applicability and validity of findings often remain. Determining the appropriate complexity, volume, and velocity of a data source needs to be based on the costs, benefits, and needs of what a user plans to do with the data.
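The role of sample weights mentioned above can be illustrated with a toy calculation. The population, the two-firm sample, and the weights below are invented for the example; the point is only that a sample needs correct weights to recover what the full population would show.

```python
def weighted_mean(values, weights):
    """Weighted average; each weight is the number of population units
    that a sampled observation represents."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical population: 8 small firms with a value of 10 each
# and 2 large firms with a value of 50 each.
population = [10.0] * 8 + [50.0] * 2
population_mean = sum(population) / len(population)

# Hypothetical sample: one small firm and one large firm.
sample_values = [10.0, 50.0]
sample_weights = [8, 2]

unweighted_estimate = sum(sample_values) / len(sample_values)
weighted_estimate = weighted_mean(sample_values, sample_weights)

print(population_mean)      # 18.0
print(unweighted_estimate)  # 30.0
print(weighted_estimate)    # 18.0
```

Working with the entire population removes the need to maintain the weights, but as the paragraph above notes, that convenience has to be set against the cost of collecting and curating the extra data.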
Best practices in data management are important to ensure you have efficient, accurate, and usable sources of data to make sound business decisions. An organization with limited, uncomplicated data will reap benefits and reduce maintenance costs over time by following data management best practices. As complexity, volume, and velocity increase, the need for adherence to best practices becomes critical. If I only buy and eat chicken noodle soup, the label may be less important than if I want to buy and store a variety of soups and know what I’m eating for dinner.