Leveraging Big Data for Data Analytics
By Derek Wilson, President & CEO, CDO Advisors LLC
Companies in the modern IT environment manage three types of data: structured, unstructured, and streaming. Structured data is what our legacy database platforms are built to store. It is mapped and stored in well-defined tables, such as those in relational database management systems. Unstructured data is stored without a predefined structure. Examples include Word documents, emails, images, and audio files. Streaming data comes from devices that send out data on regular schedules. Examples include log files, IoT devices such as smart thermostats, and trading floor information. Leveraging the cloud makes setting up and collecting these types of data affordable for companies of any size.
Once you know what type of data you want to collect, you need to consider how to ingest and manage the data to prepare it for your business users. In a modern data architecture, there are three base frameworks for storing data: raw, sandbox, and normalized. The first layer of data ingestion for an analytics environment is the raw framework. Any data required for analytics is pulled from the source systems as is, directly into the raw framework. The raw framework stores structured, unstructured, and streaming data without additional cleansing or transformation. This framework becomes the single source of data. The next layer is the sandbox framework. In this environment, you take data from the raw framework and transform it into usable information for your users. This could include applying mappings or creating aggregations to summarize data.
Data in the sandbox environment is not for full production use. It should be used to test out new data structures, mash-ups, and data sources. Limit the use of this environment to data scientists or analysts who need immediate access to data and understand that its quality may be suspect. In the normalized framework, you store data in the manner that best allows your users to access it. Data in this framework is cleaned and readily available for anyone in the organization to leverage. Access to the data could be through structured database tables, data warehouses, and flattened tables, in addition to API calls to unstructured or streaming data. Reporting and analytics tools connect to this data to supply a single source of truth to the enterprise.
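To make the flow between the three frameworks concrete, here is a minimal sketch in Python. It is an illustration only, not a reference implementation: the record layout, the function names build_sandbox and promote_to_normalized, and the sample values are all hypothetical, and a real environment would sit on cloud storage and a database rather than in-memory lists.

```python
from collections import defaultdict

# Hypothetical raw framework: records are kept exactly as the sources sent them.
raw_layer = [
    {"source": "crm", "kind": "structured", "payload": {"customer": "C1", "amount": "120.50"}},
    {"source": "crm", "kind": "structured", "payload": {"customer": "C1", "amount": "30.00"}},
    {"source": "crm", "kind": "structured", "payload": {"customer": "C2", "amount": "not-a-number"}},
    {"source": "thermostat", "kind": "streaming", "payload": {"device": "T9", "temp_c": 21.5}},
]


def build_sandbox(raw_records):
    """Sandbox framework: apply a mapping and an aggregation to raw data.

    Rows that fail the mapping are returned separately, because data
    quality in the sandbox may be suspect.
    """
    totals = defaultdict(float)
    suspect = []
    for record in raw_records:
        if record["source"] != "crm":
            continue  # this experiment only looks at the (assumed) CRM feed
        payload = record["payload"]
        try:
            amount = float(payload["amount"])  # mapping: text to number
        except ValueError:
            suspect.append(record)
            continue
        totals[payload["customer"]] += amount  # aggregation: total per customer
    return dict(totals), suspect


def promote_to_normalized(totals):
    """Normalized framework: cleaned, flattened rows for reporting tools."""
    return [
        {"customer_id": customer, "total_amount": round(amount, 2)}
        for customer, amount in sorted(totals.items())
    ]


sandbox_totals, suspect_rows = build_sandbox(raw_layer)
normalized_layer = promote_to_normalized(sandbox_totals)
```

Note that the raw list is never modified. Each later framework is derived from it, which is what lets the raw framework remain the single source of data.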
In addition to having a strong framework to store the various formats of data, you should create the appropriate personas for your organization. A persona represents a cluster of your user base and documents how each persona prefers to interact with data and reporting tools. For example, most organizations have a set of users that want to interact with their data through email or canned reports; you could call this group Casual Report Users. They do not want to click on links and change parameters to get to their required data. Other examples of core personas to consider are Business Analyst, Analytics Consumer, and Data Scientist.
While some of these may also be role titles, you can create a persona for each that clearly defines its interactions with, and necessary access to, the data. In addition, you should define the data quality and data frequency for each persona. A Data Scientist persona may need direct, ad hoc access to the raw framework. A Business Analyst persona may need access to the cleansed and well-formatted normalized framework to run reports or ad hoc queries. Taking the time to clearly define personas will enable you to provide the appropriate level and types of access that your users require. It also allows your IT function to build for a finite number of user types and to know exactly what the agreed-upon data requirements are across the organization.
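One way to document personas so that IT can build against them is a small, explicit definition per persona. The sketch below is a hypothetical example in Python: the field names, the refresh frequencies, and the access granted to the Analytics Consumer persona are assumptions made for illustration, and your organization would agree on its own values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    name: str
    frameworks: frozenset  # storage frameworks the persona may read
    delivery: str          # how the persona prefers to interact with data
    refresh: str           # agreed data frequency (illustrative values)


# Hypothetical persona catalog; every value here is an assumption.
PERSONAS = {
    "casual_report_user": Persona(
        "Casual Report User", frozenset({"normalized"}),
        "emailed canned reports", "weekly"),
    "business_analyst": Persona(
        "Business Analyst", frozenset({"normalized"}),
        "reports and ad hoc queries", "daily"),
    "analytics_consumer": Persona(
        "Analytics Consumer", frozenset({"normalized"}),
        "dashboards", "daily"),
    "data_scientist": Persona(
        "Data Scientist", frozenset({"raw", "sandbox", "normalized"}),
        "direct ad hoc access", "as data arrives"),
}


def can_access(persona_key, framework):
    """Return True if the persona is allowed to read the given framework."""
    return framework in PERSONAS[persona_key].frameworks
```

With a catalog like this, a check such as can_access("data_scientist", "raw") returns True, while can_access("casual_report_user", "raw") returns False. The point is that the number of user types is finite and written down, not that this particular structure is required.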
Once you have built out the production big data environment along with the proper frameworks and personas, you will be able to create better data analytics than you could with your legacy data environment. With this big data environment set up, you will achieve faster delivery of analytics. Users will know where to access the data and the best way for them to access it based on their personas. Users will also have access to more data of all types: structured, unstructured, and streaming. Predictive analytics can be created more quickly because the required data is located in a single environment. Data scientists will spend less time searching source systems for the data they require. Finally, you can build out a consolidated 360-degree view of the customer. Data from all parts of the frameworks can be pulled into a single view of your customer and customer experiences.
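As a rough illustration of the 360-degree view, the following Python sketch combines one structured, one unstructured, and one streaming source into a single record per customer. The input lists, field names, and the build_customer_view function are hypothetical, and the sketch assumes every source already carries a shared customer_id, which in practice often requires matching work first.

```python
from collections import defaultdict

# Hypothetical inputs, one per type of data, all keyed by an assumed customer_id.
orders = [  # structured
    {"customer_id": "C1", "total": 150.5},
    {"customer_id": "C2", "total": 42.0},
]
support_emails = [  # unstructured, reduced here to a subject line
    {"customer_id": "C1", "subject": "Late delivery"},
]
device_events = [  # streaming
    {"customer_id": "C1", "event": "thermostat_offline"},
    {"customer_id": "C2", "event": "thermostat_online"},
]


def build_customer_view(orders, emails, events):
    """Combine all three data types into one record per customer."""
    view = defaultdict(
        lambda: {"order_total": 0.0, "email_subjects": [], "device_events": []}
    )
    for order in orders:
        view[order["customer_id"]]["order_total"] += order["total"]
    for email in emails:
        view[email["customer_id"]]["email_subjects"].append(email["subject"])
    for event in events:
        view[event["customer_id"]]["device_events"].append(event["event"])
    return dict(view)


customer_360 = build_customer_view(orders, support_emails, device_events)
```

Because all three sources live in one environment, the combined view is a single pass over each source instead of a search across separate systems.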
Building out a modern big data architecture for data analytics can be done systematically. Start by understanding the types of data you need to ingest, along with the personas that will access the data. Next, set up the raw, sandbox, and normalized frameworks to store and prepare that data. Finally, use the data to create predictive analytics, reporting, and data analysis that give your organization actionable insights.