Big Data, Small Models – The 4F Approach
Kajal Mukhopadhyay, Ph.D.
August 17, 2015
Most of us understand the concept of Big Data as it is a commonly used term within the digital advertising industry. However, the environment of Big Data is sometimes confusing, particularly when it comes to physically dealing with it. Storing, accessing, harnessing, deriving insights and building data models on Big Data all become a major hurdle for operations and IT infrastructures. Additionally, businesses have difficulty justifying the investment in building such a platform.
The definition of Big Data includes the 3Vs (a fourth V was defined by IBM at the beginning of 2015). In simple terms these are:
Volume – data size in terabytes, petabytes, exabytes and more
Variety – types of data such as structured, unstructured like textual, audio, video, beacon signals
Velocity – the speed at which data is created, stored and accessed, often in real time
Veracity – uncertainty, noise, quality and usability of the data
The most common data challenge for businesses is volume, followed by velocity and variety.
With Big Data also comes the need to understand the business value of harnessing it, including but not limited to: data processing, summarization, modeling, predictive analytics and data visualization. Finding a uniform solution or platform to deliver Big Data analytics is a challenge for many organizations, let alone justifying the cost of building such solutions.
To overcome these challenges, I would like to introduce a new way of operationalizing Big Data analytics. By leveraging selected analytical methodologies along with modeling principles, we can fit data models to the present landscape. The term I use is the 4Fs (purposefully rhyming with the 4Vs to make it easier to remember):
- Focus on one KPI at a time: Analyzing too many goals, objectives or KPIs within a single data model is difficult with Big Data. Try building one KPI at a time within a single analytic framework. If needed, combine individual KPIs at a higher level for broader insights (newer statistical methods may be needed to achieve this). In most cases, a single-KPI model is far more effective than a complex, multi-KPI model.
- Few controlling factors: In an ideal world, goals and objectives are influenced by numerous factors. In reality, however, only a few have the most impact on the KPIs. Identify and use those few controlling factors in your model. Often a set of two-by-two causal models is good enough to explain the KPI behavior.
- Fast computation: Fast computation is critical in a Big Data environment. Computation algorithms for any type of data analytics and visualization must be distributed (parallel processing), additive (map/reduce) and modular (multiple model joins).
- Forward-looking: The efficacy of Big Data models relies on the predictive accuracy of the outcome variables or short-term forecasts. Analytics should focus on building models that predict the “next best outcome” within a given set of objectives. Rather than explaining historical behavior, such models should learn algorithmically and predict the most likely outcome.
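The "fast computation" principle above can be sketched in a few lines of Python. This is a minimal illustration of the additive (map/reduce) property, not an implementation from the article: the partition layout and the click/impression counts are hypothetical, and the KPI shown (click-through rate) is just one example of a single-KPI computation.

```python
from functools import reduce

# Hypothetical per-partition records: (clicks, impressions) pairs.
# In a real cluster each inner list would live on a different node.
partitions = [
    [(12, 400), (7, 350)],           # partition on node 1
    [(30, 900)],                     # partition on node 2
    [(5, 150), (9, 300), (4, 100)],  # partition on node 3
]

def map_partition(records):
    """Map step: collapse one partition into an additive partial sum."""
    clicks = sum(c for c, _ in records)
    impressions = sum(i for _, i in records)
    return (clicks, impressions)

def combine(a, b):
    """Reduce step: partial sums combine by plain addition, so the
    computation is distributable and order-independent."""
    return (a[0] + b[0], a[1] + b[1])

partials = [map_partition(p) for p in partitions]  # can run in parallel
clicks, impressions = reduce(combine, partials)
ctr = clicks / impressions  # the single KPI (focus on one at a time)
print(f"CTR = {ctr:.4f}")
```

Because the partial sums are additive, the reduce step gives the same answer no matter how the data is partitioned, which is what makes the computation modular as well as fast.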
To apply the 4F principles to data modeling, I suggest the following three types of popular and widely used statistical methods. All are simple to execute and implement, and they provide fast answers.
- The first is a class of descriptive methods that utilize frequency distributions, binning, classifications and statistical tests. With Big Data, any measure of KPIs built on these principles will have smaller standard errors. Any bias can be adjusted using simple A/B tests, differences with respect to benchmarks, trends, and a universal control applied to the data systems under certain environments.
- The second is the class of models based on the principles of conditional probabilities and variations of Bayes classifiers. Typically, one can store these as some form of recommendation table. Such models can be used to build fast inferences from indices, odds ratios, scores and other measures of KPIs.
- Lastly, Bayesian networks and graphs constitute a class of models that can be used in a Big Data environment. Much of the artificial intelligence and machine-learning work in computational engineering is based on these types of causal models.
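To make the second class concrete, here is a minimal Python sketch of a conditional-probability "recommendation table" and an odds ratio computed from an implied two-by-two table. The segment names and conversion counts are invented for illustration; a production system would build the same table from logged events at scale.

```python
from collections import Counter

# Hypothetical event log: (segment, converted) pairs.
events = [
    ("mobile", True), ("mobile", False), ("mobile", True),
    ("mobile", False), ("mobile", True),
    ("desktop", True), ("desktop", False), ("desktop", False),
    ("desktop", False), ("desktop", False),
]

# Conditional-probability table P(converted | segment): computed once,
# then stored and reused for fast inference.
totals = Counter(seg for seg, _ in events)
conversions = Counter(seg for seg, conv in events if conv)
table = {seg: conversions[seg] / totals[seg] for seg in totals}

def odds(p):
    """Convert a probability into odds, p / (1 - p)."""
    return p / (1 - p)

# Odds ratio from the implied 2x2 table (mobile vs. desktop):
odds_ratio = odds(table["mobile"]) / odds(table["desktop"])
print(table, odds_ratio)
```

The lookup table answers "what is the likely outcome for this segment?" in constant time, which is the kind of fast, forward-looking inference the 4F principles call for.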
For any practitioner of Big Data analysis, adopting a pragmatic approach to data modeling, such as the 4Fs, is of tremendous value. All three classes of models can be combined to create new ways of analyzing humongous data while being efficient, fast and statistically relevant.
References
- The Four V’s of Big Data – IBM Infographics. http://www.ibmbigdatahub.com/infographic/four-vs-big-data
- Marketing Analytics Conference – The DMA, Chicago, March 9-11, 2015. http://www.slideshare.net/KajalMukhopadhyayPhD/mac-presentation-kajal-mukhopadhyay-20140307
- Pearl, Judea. Causality: Models, Reasoning and Inference, 2nd Edition, Cambridge University Press, New York, 2009.