For many, this question is almost an irrelevance. The question that should always start the conversation is “What do you want to achieve?”, yet in my personal experience it never has, and when I have introduced it I have been made to feel uncomfortable. Many feel that they must have a big data project in their portfolio, and the why and the how are of less importance. A high proportion want answers to fairly simple questions they can’t currently get answered and are led to believe that answering those questions is indeed big data, but it rarely is.
Let me make this very simple. With few exceptions, there is only one reason why you might want a Big Data solution: because you have so much data that anything less could not analyse it and provide answers in the timescales you need. There are two key elements here: timescales and the volume of data to be analysed.
Timescales are the simpler of the two, so let’s deal with them first. There tend to be two: 1. instant responses and 2. non-urgent responses. The latter is by far the more common and is typified by the “data warehouse” approach. The former is typified by the “search engine” scenario.
Although the search engine appears to be providing instant response, in reality it is merely searching well-ordered indexes that have been populated at a leisurely pace, so in fact it does not differ as much from the data warehouse scenario as one might at first think.
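To make the point concrete, here is a minimal sketch of the idea, not how any real search engine is built. The function names `build_index` and `search` and the three sample documents are my own assumptions for illustration. The slow work happens once, at leisure; the "instant" answer is just a lookup.

```python
from collections import defaultdict

def build_index(documents):
    """Slow, offline step: map each word to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Fast, online step: intersect the precomputed sets for each query word."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Made-up documents, purely for illustration.
documents = {
    1: "big data needs big servers",
    2: "small data fits in a spreadsheet",
    3: "servers cost money",
}

index = build_index(documents)      # populated at a leisurely pace
print(search(index, "big data"))    # answered from the index, not the raw text
print(search(index, "servers"))
```

Nothing in `search` ever reads the documents themselves, which is why the response feels immediate.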
The data warehouse is a model of efficiency where the questions are carefully defined in advance, most of the processing is done up front and the answers are stored away until needed. Often further processing is then carried out at the point of consumption.
Again you may be thinking that there are more parallels than differences between the two approaches apart from all the hype. You’d be right.
What does Big data do that is different?
Well, the term as we understand it owes its existence to Google’s own solutions to the search engine problem. Perhaps another penny has dropped for you now. Hadoop, MapReduce and all those sexy terms refer to a simple and very powerful approach to getting a huge job done efficiently.
The infrastructure relies on the idea of dividing each job into smaller jobs and continuing to do so until each is quite manageable and then delegating them to different machines. If you’re a software engineer, think Jackson. A simplified view might be that you have five people doing operational work and a manager coordinating that work and responding with a single answer to his sponsor. If you ever attended management courses you will surely remember this type of organisation. Well that’s the big idea.
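The manager-and-workers picture can be sketched in a few lines. This is only an illustration of the divide, delegate and combine pattern, not Hadoop itself: threads on one machine stand in for separate servers, and the names `split_job`, `count_words`, `combine` and `word_count` are hypothetical ones I have chosen for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def split_job(lines, n_workers):
    """Manager: divide the job into one chunk per worker."""
    return [lines[i::n_workers] for i in range(n_workers)]

def count_words(chunk):
    """Worker (the 'map' step): count the words in its own chunk only."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts

def combine(partials):
    """Manager (the 'reduce' step): merge partial answers into a single answer."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

def word_count(lines, n_workers=5):
    chunks = split_job(lines, n_workers)
    # Threads stand in for separate machines in this sketch.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(count_words, chunks))
    return combine(partials)

# Made-up input, purely for illustration.
lines = ["big data is big", "data is data", "small data"]
print(word_count(lines))
```

The answer is the same whether one worker or five do the counting; only the elapsed time changes, and only when the job is big enough to be worth splitting.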
Why is this better? Well, it allows a very large number of servers to work on parts of the problem at the same time, thus reducing the time to complete. This allows one to demand immediate answers to questions that are more efficiently dealt with over a longer time-frame, and therein lies the risk.
Is Big data Machine Learning?
No, it definitely is not, but of course it can be useful for doing this. However, it is very important to understand that there is a plethora of tools, many free and some you already have in the toolkit such as Excel, that can very effectively carry out machine learning tasks if you take a little time to learn them. It is far easier to learn and carry out such analysis on tools like SQL Server, Excel, etc. than it is to spin up a big data factory on AWS and become a data scientist just to find out if it will rain tomorrow.
Very few questions you are likely to want answers to require anything more than traditional statistical approaches or even simpler BI reporting. These can be carried out very effectively on stunning data volumes and extremely complex problem domains with tools like Excel (try Solver or explore the many regression functions), Power BI, KNIME, RapidMiner, MATLAB, Octave, Google Fusion Tables and Tableau. Many are free and very good tutorials can be found online. The best thing about these tools is that you can test your hypothesis and decide whether a major project is worthwhile.
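As a rough illustration of how little machinery a hypothesis test of this kind needs, here is a simple least-squares regression written with nothing but the Python standard library. It is a sketch of the same calculation a spreadsheet regression function performs, not a recommendation of one tool over another; the function name `fit_line` and the numbers are made up for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept, plus R squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared

# Made-up figures: does the outcome rise with the input?
spend = [1, 2, 3, 4, 5]
sales = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept, r_squared = fit_line(spend, sales)
print(f"slope={slope:.2f} intercept={intercept:.2f} r_squared={r_squared:.3f}")
```

If a quick check like this shows no relationship in a small extract, that is worth knowing before anyone commissions a cluster.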
How big is big?
Well, there are truly big problems and yours may well be one of them, but the vast majority of questions can be answered with a well-specified Windows server or the platform of your personal preference.
Remember also that remarkably small samples are known to provide extraordinary insights that improve very little when the sample is expanded.
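A small experiment shows the effect. The population below is synthetic (random values I generate purely for illustration, with an arbitrary seed), and `sample_error` is a hypothetical helper name. The sketch estimates the population average from samples of increasing size and reports how far each estimate is from the true figure.

```python
import random
import statistics

random.seed(42)  # arbitrary seed so the run is repeatable

# Synthetic population: 200,000 values centred on 50 with a spread of 10.
population = [random.gauss(50, 10) for _ in range(200_000)]
true_mean = statistics.fmean(population)

def sample_error(n):
    """Absolute error of the mean estimated from a random sample of size n."""
    sample = random.sample(population, n)
    return abs(statistics.fmean(sample) - true_mean)

for n in (100, 1_000, 10_000, 100_000):
    print(f"sample of {n:>7}: error {sample_error(n):.3f}")
```

For a simple average, the typical error shrinks roughly with the square root of the sample size, so a hundred times more data buys only about ten times more precision.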
For a better technical analysis than I could offer, have a look at this very good blog. It gets to the point much faster than I do.
As usual, comments only via email.