Solving the Dark Data Puzzle

Repost from my work blog.


In this era of Big Data, where more than 90% of the world’s stored information was created in the , a challenge has emerged that we see very frequently.  The problem is “Dark Data”: data and information that is segregated or hidden within an organization and accessible to only a few people.

First, let’s address why Dark Data is a problem at all.  There are countless examples throughout history where one person or team makes a discovery but a completely different person or team creates a useful application.  Two that come to mind are ether (for medical anesthesia) and Teflon (for nonstick cookware).  With all the data accumulated in research and development today, there are even more opportunities for applications that come from solving the Dark Data puzzle.

At ExtensionEngine, a lot of our work involves solving the Dark Data problem.  For example, we recently completed a project for a large property and casualty insurance company that had data spread across nearly 30 different regional offices.  We aggregated that information into a single data warehouse and then made it accessible throughout the organization, from executive management to field sales reps on iPads and many others in between.  Another Dark Data solution we built and manage supports Harvard Business School professor Noam Wasserman’s research on entrepreneurial organizations and the founders who launch them.  For many years Wasserman’s research was “Dark Data,” but through a collaboration (CompStudy) with industry players including Ernst & Young, WilmerHale and Park Square, an executive search firm, we were able to create an application that provides compensation benchmarking data for entrepreneurial executives.

Solving the Dark Data puzzle requires addressing the following issues:

  • Creating a central data repository.  Data is often spread out across systems and formats.  Bringing it all together under a single ontology is key.
  • Managing access.  One of the main legacy reasons for Dark Data is fear of it falling into the wrong hands.  Designing a process and system to manage and secure access to the data is a top priority.
  • Cleaning data.  Even the best systems will have data quality issues.  It must be quick and easy to keep dirty data out of the system, and to clean or expunge any dirty data that does get in.
  • Automatic publishing.  Ultimately, the key to solving the Dark Data problem is creating a way to automatically publish the right data in an informative fashion to the right people.  Often this means publishing using interactive charts and graphs.
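
To make the four steps above a little more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not a description of any system mentioned in this post: the office names, field names, roles, and helper functions are all hypothetical. It maps two differently formatted office feeds onto one shared schema, drops rows that fail a basic quality check, and publishes a view restricted to the fields a given role is allowed to see.

```python
# Hypothetical field-name mappings from each office's format into one shared schema.
FIELD_MAPS = {
    "boston": {"PolicyNo": "policy_id", "Prem": "premium"},
    "denver": {"policy_number": "policy_id", "annual_premium": "premium"},
}

# Hypothetical roles and the fields each role is allowed to see.
ROLE_FIELDS = {
    "executive": {"policy_id", "premium", "office"},
    "field_sales": {"policy_id", "office"},
}


def normalize(office, record):
    """Rename office-specific fields to the shared schema and tag the source office."""
    mapping = FIELD_MAPS[office]
    row = {mapping[key]: value for key, value in record.items() if key in mapping}
    row["office"] = office
    return row


def is_clean(row):
    """Accept a row only if it has a policy id and a positive numeric premium."""
    premium = row.get("premium")
    return bool(row.get("policy_id")) and isinstance(premium, (int, float)) and premium > 0


def build_repository(feeds):
    """Merge every office feed into one list, keeping only clean rows."""
    rows = (normalize(office, rec) for office, recs in feeds.items() for rec in recs)
    return [row for row in rows if is_clean(row)]


def publish(repository, role):
    """Return the view of the repository restricted to the fields a role may see."""
    allowed = ROLE_FIELDS[role]
    return [{k: v for k, v in row.items() if k in allowed} for row in repository]
```

Calling build_repository on a dictionary of per-office record lists and then publish(repository, "field_sales") would return the same rows with the premium field removed.  A real deployment would put these rules in a data warehouse and an access-control layer rather than in application code.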

With the right platform in place, “accidental innovation” can be accelerated, driving growth not just in the ever-expanding body of human knowledge but also in business performance.