The Advent of Data Science
By Larry Pickett, CIO, Purdue Pharma
Data science is the application of advanced processes and technologies to extract knowledge or insights from large volumes of disparate data. Both scientific research and commercial operations rely on understanding and processing large amounts of data. Previous approaches were more basic, relying purely on statistical analysis and data mining, whereas data science enables the creation of advanced machine learning algorithms, along with predictive and prescriptive analytics, to give a much richer view of more complex data. When these techniques are applied to datasets on a powerful hardware and software analytics platform, large volumes of data can be processed very quickly to accelerate business decision-making and results. At Purdue Pharma, we call this our Big Data Analytics Platform; its creation was inspired by our CEO Mark Timney’s directive to use facts and data to inform and support business decisions as we execute our new corporate strategy.
Purdue Pharma IT took on the challenge of creating the Big Data Analytics Platform, including building a data sciences team, in a non-traditional way. Observing that the pharmaceutical and healthcare industries have been laggards in the use of data analytics technologies and data science principles, we investigated technologies and approaches from other industries, specifically the financial industry. Leveraging the services of specialized vendors that make high-performance servers and analytics platforms for high-frequency trading institutions such as banks and hedge funds, we built a platform with specifications similar to those currently used in the trading industry.
Data Science is emerging as a competitive weapon to create economic value and drive business growth for pharma companies
The new platform combines three elements in a unique and powerful way: first, a high-performance, purpose-built server cluster with more than 1 TB of RAM and more than 200 cores, using some of the same technologies used to build CRAY supercomputers; second, a proprietary ultra-high-performance database; and third, an experienced data sciences team recruited from the financial services industry. We partnered with our business colleagues in R&D, Commercial, and Business Development, who married their specialized domain knowledge and their datasets with the technology to solve some of our more challenging analytics-related problems.
Purdue Pharma’s Big Data Analytics Platform provides capabilities we were previously unable to deliver. We created new machine learning algorithms that complete simulations in a fraction of the time required on traditional enterprise systems. This is enabling us to attack one of the most significant operational problems across multiple business areas: the data analytics backlog. For example, the system allowed us to perform complex data mining, pivots, and data manipulations across 17 billion insurance claim records in a matter of seconds. We were also able to load and aggregate commercial and claims datasets onto the same platform at the same time, enabling queries across both datasets that were previously not possible.
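To make the idea of a query across both datasets concrete, here is a minimal sketch in plain Python. It is only an illustration, not the platform's actual implementation: the record layout, the shared "region" key, and the function names are all assumptions made for this example.

```python
from collections import defaultdict


def claims_per_key(claims, key):
    """Count claim records for each value of a shared key (hypothetical field)."""
    counts = defaultdict(int)
    for record in claims:
        counts[record[key]] += 1
    return dict(counts)


def join_sales_with_claims(sales, claims, key="region"):
    """Attach a claim count to every commercial (sales) row that shares the key."""
    counts = claims_per_key(claims, key)
    return [{**row, "claim_count": counts.get(row[key], 0)} for row in sales]


# Illustrative records only; field names and values are invented for the example.
sales = [
    {"region": "NE", "units_sold": 120},
    {"region": "SW", "units_sold": 80},
]
claims = [
    {"region": "NE", "claim_id": 1},
    {"region": "NE", "claim_id": 2},
    {"region": "MW", "claim_id": 3},
]

combined = join_sales_with_claims(sales, claims)
```

Once both datasets sit on one platform, a question such as "how many claims were filed in each region where we sell?" becomes a single aggregate-and-join step rather than an export from two separate systems.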
On our traditional enterprise database systems, our most expensive and time-consuming data analysis process was identifying cohorts of interest in claims data. In collaboration with the Risk Management and Epidemiology team, our analytics team identified the common themes for future projects of interest and created an interface on the new analytics platform that reduced cohort creation time from hours to minutes. This helps researchers quickly analyze drug and disease patterns across billions of claims records using the ultra-high-performance database and export the cohort to other tools (such as Excel or SAS) for statistical analysis.
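The cohort step described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the interface our team built: the `Claim` fields, the placeholder codes, and the selection rule (patients with a given diagnosis who also received a given drug) are hypothetical.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    # Hypothetical claim layout; real claims data carries many more fields.
    patient_id: str
    diagnosis_code: str
    drug_code: str


def build_cohort(claims, diagnosis_code, drug_code):
    """Return sorted IDs of patients who have both the diagnosis and the drug."""
    diagnosed, treated = set(), set()
    for claim in claims:  # single pass, so any iterable of claims works
        if claim.diagnosis_code == diagnosis_code:
            diagnosed.add(claim.patient_id)
        if claim.drug_code == drug_code:
            treated.add(claim.patient_id)
    return sorted(diagnosed & treated)


def cohort_to_csv(patient_ids):
    """Serialize the cohort as CSV text that Excel or SAS can import."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["patient_id"])
    for patient_id in patient_ids:
        writer.writerow([patient_id])
    return buffer.getvalue()


# Illustrative records only; the codes are placeholders, not real code sets.
claims = [
    Claim("P1", "DX-A", "RX-1"),
    Claim("P2", "DX-A", "RX-2"),
    Claim("P3", "DX-B", "RX-1"),
    Claim("P2", "DX-C", "RX-1"),
]

cohort = build_cohort(claims, diagnosis_code="DX-A", drug_code="RX-1")
export = cohort_to_csv(cohort)
```

In this sketch a patient qualifies even when the diagnosis and the drug appear on different claims, which is one possible design choice; a real cohort definition would typically add date windows and exclusion criteria.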
Another benefit of this platform and approach is that it enabled our company to quickly break down the silos of data and information across different departments and leverage these datasets across the entire company. With the millions of dollars that companies spend on purchasing data each year, allowing broad access to the same data across the company reduces duplicate spend and provides an immediate payback. A cross-functional governance committee was established to further ensure that this new analytics platform is delivering continued business results through applications such as sales and demand forecasting, ad-hoc statistical analysis, identification of products to target for acquisition, and health outcomes and pharmacoeconomic analysis.
Last year, Purdue Pharma IT enhanced the analytics strategy by creating a web-browser-based application using R Shiny technology that, for the first time, gave cross-functional access to CDISC-compliant clinical trial data across nine product lines and 114 studies. Developed in collaboration with data management, statistical programming, and clinical scientists, this new tool allows users to build custom views across domain-specific datasets relevant to their review needs. This scalable, array-driven design, which draws on a single common library, eliminated redundancy in both programming and storage and now serves as the platform for producing various forms of output, including data visualizations.