The Eight Fold Path Towards Analytic Bliss


Updated July 2014

The analytics landscape is vast. Data Science emerges from the confluence of statistics, machine learning, resurgent soft AI, data mining and networks. With such diverse influences, it is easy to get bogged down in the details of the individual techniques being promoted and competing for mindshare. Here is a high-level flight over the terrain and its major tributaries. The focus is not on specific algorithms and techniques, but on general approaches to bringing analytics to clients. Indeed, the same technique may show up in more than one of the ‘paths’ to analytic enlightenment.

Go Fast:
Rewrite algorithms so they go faster: low-level parallelization, faster convergence, probabilistic methods, avoiding making all possible comparisons (the combinatorial explosion), code optimization. In particular, techniques such as Deep Learning, Natural Language Processing, and network analytics have reached a level of maturity and optimization that allows them to be used practically on large corpora of data; indeed they are the basis of many current startups.
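
A minimal sketch of one Go Fast tactic, on toy data with NumPy assumed: the same O(n²) pairwise computation, first as nested Python loops, then as a single vectorized expression that pushes the inner loops down into optimized low-level code.

```python
# Toy "Go Fast" rewrite: identical pairwise distances, loop vs. vectorized.
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(1000, 3))   # 1000 toy points in 3 dimensions

def pairwise_slow(x):
    """Naive double loop in pure Python: correct but slow."""
    n = len(x)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            d[i, j] = np.sqrt(((x[i] - x[j]) ** 2).sum())
    return d

def pairwise_fast(x):
    """The same comparisons, expressed as one broadcasted operation."""
    diff = x[:, None, :] - x[None, :, :]      # shape (n, n, 3)
    return np.sqrt((diff ** 2).sum(axis=-1))  # shape (n, n)

# Results match to floating-point rounding; the vectorized version is
# typically orders of magnitude faster on data of any real size.
assert np.allclose(pairwise_slow(points[:50]), pairwise_fast(points[:50]))
```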

Go Big (Data):
Modify existing algorithms so they fall into a map-reduce or split-apply-combine pattern, move data faster through the network (bigger pipes), tie data processing to disk storage rather than RAM. Over the last two years, the limitations of the original Hadoop patterns for complex analytics problems have become more apparent, and in-memory analytics are gaining ground.
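
A minimal sketch of that recasting, on toy data: a group mean is not itself associative, but the (sum, count) partials it is built from are, so each shard can be mapped independently and the partials combined at the end, which is exactly the map-reduce shape.

```python
# Toy map-reduce / split-apply-combine: per-key mean across data shards.
from collections import defaultdict

shard_1 = [("sensor_a", 2.0), ("sensor_b", 5.0), ("sensor_a", 4.0)]
shard_2 = [("sensor_b", 7.0), ("sensor_a", 6.0)]   # lives on another node

def map_shard(records):
    """Map step: per-key (sum, count) partials for one shard."""
    partials = defaultdict(lambda: (0.0, 0))
    for key, value in records:
        s, c = partials[key]
        partials[key] = (s + value, c + 1)
    return partials

def reduce_partials(all_partials):
    """Reduce step: merge partials from every shard, then finish the mean."""
    merged = defaultdict(lambda: (0.0, 0))
    for partials in all_partials:
        for key, (s, c) in partials.items():
            ms, mc = merged[key]
            merged[key] = (ms + s, mc + c)
    return {key: s / c for key, (s, c) in merged.items()}

print(reduce_partials([map_shard(shard_1), map_shard(shard_2)]))
# {'sensor_a': 4.0, 'sensor_b': 6.0}
```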

Be Real:
Applies to sensor streams of real-time data in healthcare, finance, nuclear facilities and energy grids, so that a critical decision can be made minutes earlier. Five minutes can be the difference between a healthy baby and one with cerebral palsy. As personal health monitoring and wearables become mainstream, this field will explode as people seek insight into themselves.
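
A minimal sketch of the idea, on a toy heart-rate-like stream: score each new reading against a sliding window of recent history so an alert can fire within seconds of the anomaly rather than in a nightly batch. The window size and threshold here are illustrative assumptions.

```python
# Toy streaming anomaly detector: z-score against a sliding window.
from collections import deque
import math

WINDOW = 60        # number of recent readings to keep
THRESHOLD = 3.0    # alert when a reading is > 3 standard deviations out

window = deque(maxlen=WINDOW)

def on_reading(value):
    """Called once per incoming reading; returns True if it looks anomalous."""
    alarm = False
    if len(window) >= 10:                     # wait for a minimal baseline
        mean = sum(window) / len(window)
        var = sum((x - mean) ** 2 for x in window) / len(window)
        std = math.sqrt(var) or 1e-9          # guard against zero variance
        alarm = abs(value - mean) / std > THRESHOLD
    window.append(value)
    return alarm

# Toy stream: steady heart-rate-like values, then a sudden spike.
stream = [72, 71, 73, 72, 74, 73, 72, 71, 73, 72, 73, 72, 140]
for v in stream:
    if on_reading(v):
        print("alert:", v)   # fires on 140
```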

Be Conservative:
Most analytics stacks focus on methods that have been used for thirty years and are already implemented in higher-end BI platforms and advanced statistical packages. Focus is on making such techniques available to broader audiences. The usual suspects, classic multivariate statistical methods such as Principal Components Analysis, Discriminant Analysis, Logistic Regression and kernel methods, are moving from specialized statistical software to mainstream analytical databases.
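
A minimal sketch of that conservative stack in scikit-learn, on a stock toy dataset: the thirty-year-old workhorses (PCA for reduction, then Logistic Regression) chained into one cross-validated pipeline.

```python
# Toy "Be Conservative" pipeline: PCA + logistic regression, cross-validated.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)

model = make_pipeline(
    PCA(n_components=2),                # classic multivariate reduction
    LogisticRegression(max_iter=1000),  # classic classifier on the components
)
scores = cross_val_score(model, X, y, cv=5)
print("accuracy: %.2f (+/- %.2f)" % (scores.mean(), scores.std()))
```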

Be Edgy:
Focus is on techniques coming out of the research community but not yet broadly implemented in commercial software (though they may be open source). Natural language processing (NLP) and graphical models used to be edgy, but no longer are. Ensemble methods such as Random Forests still live closer to the bleeding edge, in R and Python libraries. The US and European brain mapping projects are introducing new edgy techniques tied to state-of-the-art diagnostic sensing methods.
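
A minimal sketch using one of those Python implementations, scikit-learn's Random Forest, on a stock toy dataset:

```python
# Toy ensemble example: a Random Forest classifier via scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))

# A side benefit of the ensemble: per-feature importances fall out for free.
print("most important feature index:", forest.feature_importances_.argmax())
```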

Be Best Practices:
Bring in statistical best practices to confirm that data properties match the assumptions of analytical methods; the software acts as an expert system emulating an experienced data analyst. As the range and availability of analytical techniques increases and reaches a wider audience, there is a greater need for the discipline, due diligence and quality-assurance practices developed by statisticians and scientists in various disciplines. Getting a data-driven answer is getting easier and easier. But is it the right data? With the right assumptions? Correctly sampled, filtered and cleaned, so as to give clearly interpretable answers?
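
A minimal sketch of software playing that expert-system role; the specific test choices below are illustrative assumptions. Before a two-sample comparison, check the normality and equal-variance assumptions the textbook t-test rests on, warn when they fail, and fall back to a rank-based test the way an experienced analyst would.

```python
# Toy assumption-checking "expert system" around a two-sample comparison.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(10.0, 2.0, 40)
b = rng.lognormal(2.3, 0.4, 40)   # deliberately skewed: violates normality

for name, sample in [("a", a), ("b", b)]:
    _, p = stats.shapiro(sample)
    if p < 0.05:
        print(f"warning: sample {name} looks non-normal (Shapiro p={p:.3f})")

_, p_var = stats.levene(a, b)
if p_var < 0.05:
    print(f"warning: unequal variances (Levene p={p_var:.3f})")

# Fall back to a rank-based test, which does not assume normality.
print(stats.mannwhitneyu(a, b))
```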

Be Easy:
Make predictive modeling and computationally intensive analytic techniques available to a wider audience with less analytics domain knowledge or programming ability. Integration of analytics into automated workflows means the end user is often unaware of the underlying analytics; instead they interact with the results, whether those be a comparison of prices, the likelihood of winning a case, or an interactive visualization summarizing relationships in an area they are researching.
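
A minimal sketch of what that looks like in code; every name here is hypothetical and the model is a toy. The point is the shape of the interface: a plain question in, a probability out, no analytics vocabulary in between.

```python
# Toy "Be Easy" interface: the model is hidden behind one plain function.
import numpy as np
from sklearn.linear_model import LogisticRegression

_model = LogisticRegression(max_iter=1000)
_model.fit(np.array([[1, 200], [3, 50], [5, 300], [2, 80]]),  # toy case history
           np.array([0, 1, 0, 1]))                            # 1 = case was won

def likelihood_of_winning(years_litigated, filings):
    """All the end user sees: a question in, a probability out."""
    return _model.predict_proba([[years_litigated, filings]])[0][1]

print(f"{likelihood_of_winning(2, 90):.0%}")   # e.g. "64%"
```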

Be Visual:
Make analytics easier to use and interpret via better visuals. This could be considered a version of Be Easy, but the focus differs: it emphasizes techniques from statistical graphics, infographics and data visualization to focus perception and insight. There are several schools of approach here: statistical graphics, which insists on a clear statistical relationship between data and display; and infographics, where designers who are also data scientists deliver new computational techniques to summarize data. There is an ongoing shift from visualization as a data summary and interpretation framework to visualization as the user interface into the data.
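
A minimal sketch of the statistical-graphics school, on toy data with matplotlib assumed: keep the mapping from data to display direct and legible, showing the raw observations alongside the fitted trend they imply.

```python
# Toy statistical graphic: raw data plus the least-squares fit it supports.
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 80)
y = 1.8 * x + rng.normal(0, 2.5, 80)   # toy linear relationship plus noise

slope, intercept = np.polyfit(x, y, 1)  # simple least-squares fit

plt.scatter(x, y, alpha=0.6, label="observations")
xs = np.linspace(0, 10, 2)
plt.plot(xs, slope * xs + intercept,
         label=f"fit: y = {slope:.1f}x + {intercept:.1f}")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()
```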

Context:

If the above is the general landscape, context is the specifics of your situation in the landscape, and the best way for you to make gains from your unique starting point.

0. Data Science is fuelled by a grand confluence of statistics, data mining, soft AI, complex networks, machine learning and text processing techniques, tied to big data distributed processing patterns. These techniques have broken out of their confines in academia and custom software to be much more generally available. The uptake of, and business appetite for, Data Science is aided and abetted by the rise of information visualization to develop systems of strategic interpretation of, and increasingly interaction with, the data collected by organizations. But each of these analytics “schools” has very different perspectives and “favourite” techniques, so there is a need to identify minimal sets of reliable techniques for a particular application domain. When do you cluster via K-Means versus Fuzzy C-Means, and what kinds of problems are suitable for classification in the first place?
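
A minimal sketch of that K-Means question, on toy data: K-Means commits each point to exactly one cluster, while a fuzzy scheme grades membership. The soft memberships below are derived from K-Means distances for illustration only, not a full Fuzzy C-Means implementation.

```python
# Toy contrast: hard K-Means labels vs. graded (fuzzy-style) memberships.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=3).fit(X)
hard_labels = km.labels_              # each point: exactly one cluster

dist = km.transform(X)                # distance from each point to each centroid
inv = 1.0 / (dist + 1e-9)
soft = inv / inv.sum(axis=1, keepdims=True)   # graded membership per cluster

print("hard label of point 0:", hard_labels[0])
print("soft memberships of point 0:", soft[0])   # e.g. [0.93, 0.07]
```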

1. The best path will differ for different application domains. You need to understand the skills of those likely to do the analysis and those likely to consume the results in a particular domain.

2. Two current routes to make analytics more widely available to technically capable staff who are not necessarily domain specialists in a specific school of analytics:

  • extension of relational-database-driven ETL to more complex transformations that can be tracked via a dynamic data model (a sketch follows this list). Recall that 80 percent of the effort is getting data into shape, often using the simpler analytics techniques associated with ETL.
  • visualization, which can bring analytics to less technical audiences and help encode best practices.
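
A minimal sketch of the first route, on toy data with pandas assumed: each transformation is a named step recorded in a lightweight log, a stand-in for the dynamic data model that would track lineage.

```python
# Toy tracked ETL: each transformation is a named, logged step.
import pandas as pd

raw = pd.DataFrame({
    "customer": ["a", "a", "b", None],
    "amount": ["10.5", "3.0", "bad", "7.2"],
})

steps = []  # lightweight stand-in for a dynamic data model / lineage log

def step(name, df, fn):
    steps.append(name)
    return fn(df)

df = step("drop_missing_keys", raw, lambda d: d.dropna(subset=["customer"]))
df = step("coerce_amounts", df,
          lambda d: d.assign(amount=pd.to_numeric(d["amount"], errors="coerce")))
df = step("drop_bad_amounts", df, lambda d: d.dropna(subset=["amount"]))
df = step("total_per_customer", df,
          lambda d: d.groupby("customer", as_index=False)["amount"].sum())

print(df)     # customer "a" totals 13.5; "b"'s only row had a bad amount
print(steps)  # the recorded lineage of the final table
```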

3. For any applied domain you need to identify who will build the models and who will use the model outputs, and be clear about how analytics tools and outputs are presented to the differing skill sets of the analytics producers and the analytics consumers. Build on the knowledge that already exists.

4. Current focus and buzz in industry remains on going big, but that’s just the first wave. If trends in Data Science follow the history of Relational Databases and Business Intelligence, once there is standardization on a platform, the longer-term gains are for the technology to Be Easy, Be Best Practices and Be Visual. Be Conservative focuses on a minimal analytics stack and does not differentiate. Go Fast will be critical in certain problem domains and less relevant elsewhere. You can Be Edgy commercially by being conservative academically: focus on methods with a five-to-ten-year track record and good theoretical analysis, tied to an applied domain where the method is obviously superior to existing techniques. Ideally there exist several open source implementations that can be studied and optimized, and relationships with the relevant open source and academic communities will be a channel for introducing innovation.

Focussing on Be Real will force you to make gains everywhere else: a stretch problem that drives innovation. Moving beyond prediction from historical data to simulation of possible futures is still under-represented in the Data Science landscape.
