Tuesday 18 November 2014

Big Data is not just about size...

The term "Big Data" gets used a lot these days, and some people think you only have Big Data problems if you have billions of records and petabytes or even zettabytes of information in your system. In fact most articles I've seen start off by saying how much data modern systems have to deal with.

However, Big Data is not only the domain of the Googles of this world. You can have Big Data problems in projects that don't have anywhere near that much data.

In 2001 a research note on data management by analyst Doug Laney coined the 3 V's: Volume, Velocity, and Variety.

Volume is simply the size of the data, in raw bytes or number of records. A million records is not that many if they are simple and small; if they are large and hard to index, however, a multimillion-record database can become troublesome.

Velocity is the speed at which the data is being collected. Website logs are a great example of this: they can generate thousands and thousands of new records every hour, even for relatively small sites.
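
To put rough numbers on that, here's a quick back-of-envelope sketch (the five-requests-per-second rate is an assumed figure, not a measurement from any real site):

```python
# Back-of-envelope log volume for a fairly quiet website.
# The request rate is an assumption used purely for illustration.
requests_per_second = 5
lines_per_hour = requests_per_second * 60 * 60   # 18,000
lines_per_year = lines_per_hour * 24 * 365       # 157,680,000

print(f"{lines_per_hour:,} log lines per hour")
print(f"{lines_per_year:,} log lines per year")
```

Even at that modest rate, a year of logs runs to over 150 million lines.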

Variety is the complexity of the data. Anyone who has written SQL knows that even small databases can become sluggish if you try to link too many tables in a single query.
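
To make that concrete, here's a minimal sketch using Python's built-in sqlite3 module; the six-table order-tracking schema is entirely invented for illustration:

```python
import sqlite3

# A hypothetical order-tracking schema -- the table and column names
# are made up for this example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders    (id INTEGER PRIMARY KEY, customer_id INTEGER);
    CREATE TABLE items     (id INTEGER PRIMARY KEY, order_id INTEGER, product_id INTEGER);
    CREATE TABLE products  (id INTEGER PRIMARY KEY, supplier_id INTEGER, name TEXT);
    CREATE TABLE suppliers (id INTEGER PRIMARY KEY, region_id INTEGER, name TEXT);
    CREATE TABLE regions   (id INTEGER PRIMARY KEY, name TEXT);
""")

# Six tables linked in one query: every extra join gives the query
# planner more work, and without the right indexes even a small
# database can slow down noticeably.
query = """
    SELECT c.name, p.name, s.name, r.name
    FROM customers c
    JOIN orders    o ON o.customer_id = c.id
    JOIN items     i ON i.order_id    = o.id
    JOIN products  p ON p.id          = i.product_id
    JOIN suppliers s ON s.id          = p.supplier_id
    JOIN regions   r ON r.id          = s.region_id
"""
for row in conn.execute(query):
    print(row)
```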

Today some people like to add other "Vs" to the mix:

Veracity is about biases and "noise" in the data: how meaningful the data actually is for the analysis you want to run. Any statistical analysis will turn up the odd result that falls way outside the line of best fit for the rest of the data. How often this occurs, and how you have to deal with it, can affect your data solution.
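
As a minimal sketch of spotting those odd results, here's one crude approach: flag anything more than two standard deviations from the mean. The readings below are invented, and real pipelines usually use more robust measures:

```python
import statistics

# Hypothetical sensor readings -- the 42.0 is our deliberate outlier.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 42.0, 10.2, 10.1]

mean = statistics.mean(readings)
stdev = statistics.stdev(readings)

# Flag anything more than two standard deviations from the mean.
# Whether you discard, cap, or investigate such points is exactly the
# kind of decision veracity forces on your data solution.
outliers = [x for x in readings if abs(x - mean) > 2 * stdev]
print(outliers)  # [42.0]
```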

Validity is similar to veracity and deals with the applicability of the data to the question being asked; for example, web proxies and caches mean that server logs are not a true reflection of actual page views.

Volatility may sound like speed, but rather than how fast the data is coming in, it deals with how fast the data goes out of date. Share trading prices are out of date almost before you receive them, with some companies making micro-trades that rely on millisecond differences between buy and sell times.
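
A minimal sketch of guarding against that kind of staleness might look like the following; the 200-millisecond freshness budget is an invented figure, not a real trading threshold:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical freshness budget: quotes older than this are ignored.
MAX_AGE = timedelta(milliseconds=200)

def is_stale(quote_time, now=None):
    """Return True if a price quote is too old to act on."""
    now = now or datetime.now(timezone.utc)
    return now - quote_time > MAX_AGE

quote_time = datetime.now(timezone.utc) - timedelta(milliseconds=350)
print(is_stale(quote_time))  # True -- 350 ms old against a 200 ms budget
```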

and finally...

Value is a measure of how much the data is worth in real-world terms. This is often the most important factor because it drives the business.

Now if any of these factors reaches the point where it requires developers to introduce special strategies, processes, or systems in order to manipulate and store the data, then you have just crossed into the world of Big Data.

So the "big" in Big Data refers more to the requirements of the data than the data itself.

That's it, Big Data explained. You're welcome.

Now as far as solutions to the problems of Big Data...

You can take your pick because, frankly, there are as many Big Data solutions as there are Big Data problems.

The technologies that most projects encounter first are storage- and retrieval-specific: clustering, sharding, distributed storage, cloud-based infrastructure, search-based applications, key-value stores, and other NoSQL approaches to data storage.
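
As a taste of that first group, here's a minimal sketch of hash-based sharding, one of the storage strategies named above (the shard count and keys are purely illustrative):

```python
import hashlib

NUM_SHARDS = 4  # illustrative; real systems pick this to match hardware

def shard_for(key: str) -> int:
    """Map a record key to a shard so keys spread evenly across nodes."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

for user_id in ("alice", "bob", "carol", "dave"):
    print(user_id, "-> shard", shard_for(user_id))
```

Each record lands on a predictable node, so no single machine has to hold, or search, everything.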

Then, as you approach the more complex tasks of analysis and manipulation, you start to hit distributed computation, genetic and machine-learning algorithms, signal processing, time-series analysis, wavelets, and other semi-predictive techniques.
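
And as a taste of the second group, here's a minimal time-series sketch, a simple moving average, one of the gentlest of the analysis techniques listed above (the window size and values are illustrative):

```python
from collections import deque

def moving_average(values, window=3):
    """Smooth a series by averaging each point with its recent history."""
    buf = deque(maxlen=window)  # keeps only the last `window` values
    out = []
    for v in values:
        buf.append(v)
        out.append(sum(buf) / len(buf))
    return out

print(moving_average([3, 5, 4, 8, 7, 9, 6]))
```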

In the end, Big Data is simple: it happens when the requirements of your data application go beyond the systems you're currently using, and the solution is to re-evaluate those systems, starting with how you store and organise your data.
