How much data is “Big Data”?

For many, this question is almost an irrelevance. The question that should always start the conversation is “What do you want to achieve?”, yet in my experience it never does, and when I have raised it I have been made to feel uncomfortable. Many feel that they must have a big data project in their portfolio, and the why and the how are of less importance. A high proportion want answers to fairly simple questions they can’t currently get answered, and are led to believe that answering those questions is indeed big data.
Let me make this very simple. With few exceptions, there is only one reason why you might want a Big Data solution: you have so much data that anything less could not analyse it and provide answers in the timescales you need. There are two key elements here: timescales and the volume of data to be analysed.

Timescales are the simpler of the two, so let’s deal with them first. There tend to be two timescales: 1. instant response and 2. non-urgent response. The latter is by far the more common and is typified by the “data warehouse” approach; the former is typified by the “search engine” scenario.
Although the search engine appears to provide an instant response, in reality it is merely searching well-ordered indexes that have been populated at a leisurely pace, so it does not differ as much from the data warehouse scenario as one might at first think.
The data warehouse is a model of efficiency where the questions are carefully defined in advance, most of the processing is done up front, and the answers are stored away until needed. Often further processing is then carried out at the point of consumption.
Again, you may be thinking that, hype aside, there are more parallels than differences between the two approaches. You’d be right.

What does Big Data do that is different?

Well, the term as we understand it owes its existence to Google’s own solution to the search engine problem. Perhaps another penny has dropped for you now. Hadoop, MapReduce and all those sexy terms refer to a simple and very powerful approach to getting a huge job done efficiently.

The infrastructure relies on the idea of dividing each job into smaller jobs, and continuing to do so until each is quite manageable, then delegating them to different machines. If you’re a software engineer, think Jackson Structured Programming. A simplified view might be that you have five people doing operational work and a manager coordinating that work and responding with a single answer to his sponsor. If you ever attended management courses you will surely remember this type of organisation. Well, that’s the big idea.
Why is this better? Well, it allows a vast number of servers to work on parts of the problem at the same time, thus speeding up the time to complete. It allows one to demand immediate answers to questions that would be more efficiently dealt with over a longer time-frame, and therein lies the risk.
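The divide-and-delegate idea can be sketched in a few lines of Python. This is a toy word count with invented data; in a real MapReduce cluster, each `map_chunk` call would run on a different worker machine, and the merge would be the manager’s job:

```python
from collections import Counter
from functools import reduce

# Hypothetical corpus, split into chunks that could each go to a worker.
chunks = [
    "big data is not always big",
    "data warehouses answer questions defined in advance",
    "search engines read well ordered indexes",
]

def map_chunk(chunk):
    # Each worker counts words in its own chunk, independently of the others.
    return Counter(chunk.split())

def reduce_counts(a, b):
    # The "manager" merges the workers' partial answers into one.
    return a + b

partials = [map_chunk(c) for c in chunks]  # would run in parallel on a cluster
total = reduce(reduce_counts, partials, Counter())

print(total["data"])  # "data" appears once in each of two chunks
```

The five-workers-and-a-manager picture above is exactly this: the maps are the operational work, the reduce is the coordination.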
Is Big Data Machine Learning?
No, it definitely is not, though of course it can be useful for doing it. However, it is very important to understand that there is a plethora of tools, many free and some already in your toolkit, such as Excel, that can very effectively carry out machine learning tasks if you take a little time to learn them. It is infinitely easier to learn and carry out such analysis in tools like SQL Server or Excel than it is to spin up a big data factory on AWS and become a data scientist just to find out whether it will rain tomorrow.

Very few questions you are likely to want answered require anything more than traditional statistical approaches, or even simpler BI reporting, and these can be carried out very effectively on substantial data volumes and extremely complex problem domains with tools like Excel (try Solver or explore the many regression functions), Power BI, KNIME, RapidMiner, MATLAB, Octave, Google Fusion Tables and Tableau. Many are free, and very good tutorials can be found online. The best thing about these tools is that you can test your hypothesis and decide whether a major project is worthwhile.
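To show just how little machinery a first test of a hypothesis needs, here is an ordinary least-squares trend line in plain standard-library Python, the same fit Excel’s trendline or regression functions produce, run on invented monthly sales figures:

```python
# Ordinary least squares on made-up data: month number vs. sales.
months = [1, 2, 3, 4, 5, 6]
sales = [10.0, 12.1, 13.9, 16.2, 18.0, 20.1]

n = len(months)
mean_x = sum(months) / n
mean_y = sum(sales) / n

# slope = covariance(x, y) / variance(x)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(months, sales)) \
        / sum((x - mean_x) ** 2 for x in months)
intercept = mean_y - slope * mean_x

forecast = slope * 7 + intercept  # predict month 7
print(round(slope, 2), round(forecast, 1))
```

If the fitted line tells you something useful at this scale, you have your answer; if not, you have learned that cheaply, long before committing to a major project.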

How big is big?

Well, there are truly big problems, and yours may well be one of them, but the vast majority of questions can be answered with a well-specified Windows server, or the platform of your choice.
Remember also that remarkably small samples are known to provide extraordinary insights that improve very little when the sample is expanded.
For a better technical analysis than I could offer, have a look at this very good blog; it gets to the point much faster than I do.

As usual, comments only via email. No new subscribers being accepted at this time.


I won’t be long-winded about this; I’ll discuss it via email with anyone who is interested, but I’ll break with my usual mode and come straight to the point.
A great many people who know little at all about machine learning, and even less about people, and many more who are simply oblivious to the potential consequences of their words, are talking about the miraculous things we can expect from machine learning.

What is ML in a nutshell?
Academics break ML into two modes:  Supervised and Unsupervised.
In the former, we give the machine a large corpus of content and ask it to decide what will happen next, or to find other similar instances. A translation service, for example, begins this way and learns after a while to translate without help.
In the latter, we give it a body of content and ask what it makes of it. Google search is an example of this approach: it simply makes sense of what it finds.

Often we give it a few hints, like “classify this for me and establish links”, as in Google search. This would be a “classification problem”. We might, on the other hand, ask it to read the racing papers and decide who will win the four o’clock today. This would be a “regression problem”, because we are asking it to look at the past and predict a future value. Yes, all of this is highly condensed, as promised; if you are an expert you don’t need my explanations.
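A classification problem really can be this small. Here is a toy one-nearest-neighbour classifier in Python, with invented numbers, that gives a new point the label of its closest known example:

```python
import math

# Toy training data (invented): (feature1, feature2) -> label.
training = [
    ((1.0, 1.0), "spam"),
    ((1.2, 0.8), "spam"),
    ((5.0, 5.0), "ham"),
    ((4.8, 5.2), "ham"),
]

def classify(point):
    # 1-nearest-neighbour: return the label of the closest training example.
    nearest = min(training, key=lambda pair: math.dist(point, pair[0]))
    return nearest[1]

print(classify((1.1, 0.9)))  # lands near the "spam" cluster
print(classify((5.1, 4.9)))  # lands near the "ham" cluster
```

Swap the labels for numbers and interpolate instead of voting, and you have the regression flavour of the same idea.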
Understanding what the customer will want next year, predicting the weather, finding oil under the sea, predicting tumours: the challenges are endless and the rewards enormous.

What is the loop of self-destruction?
The loop happens when the machine begins to make judgements that influence the data, and then discovers exactly what it predicted. Social media is a good, though not the sole, example.

As with humans, this will give it the machine equivalent of a big head, and possibly some citations, which will lead to even greater confidence and fewer checks, and before anybody spots it, it is all too late.
If any movie producers out there are stuck for an idea, I am available to help with the plot. Here is a simple example we are all aware of:
Joe Gel and Josephine Lotion, our dear friends, represent an enormous body of intelligent and informed people who spend most of their waking hours checking back with their phones for reassurance. Joe searches Google for Tom Raspberry, his favourite politician, and receives a huge list of pages. The ML in Google notes his interests and begins sending him dozens of articles about Tom Raspberry: what he says and does and what people say about him. Unwittingly, our pal Joe becomes astonished that the whole world seems obsessed with Tom R, and realises, subconsciously, how important it is to be aware of Tom R. He begins to tweet, and to have the odd Facebook conversation about something he read. Immediately the ML in Facebook, and the one in Twitter, home in on his apparent obsession with Tom R, and both begin to bombard him with content and introduce him to thousands of people with the same problem. Poor Joe.

Now our machine does a recce to see what people are talking about, and it discovers that millions are talking and reading about Tom Raspberry. It concludes that this is the way to keep the customer happy, so it ups its game and heightens the emphasis. It also confidently announces that Tom R will undoubtedly be unstoppable in the forthcoming election.

Joe and Josephine realise the importance of not standing in the way of a social crowd and are not about to be shunned, so subconsciously they begin to take more interest in the positive stories about Tom, which now triggers the machine to filter their feeds, search results, friend recommendations and so on further toward the positive. You don’t need me to finish the plot. There is only one way this is going. Imagine if the secret services relied on this kind of information to brief their bosses. But they do, don’t they?
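The whole plot can be run as a toy simulation, with every number invented: exposure drives clicks, the recommender boosts whatever got clicks, and a fringe topic snowballs on nothing but its own feedback:

```python
# Toy model of the self-reinforcing loop; all numbers here are invented.
topics = {"Tom Raspberry": 0.10, "everything else": 0.90}  # initial share of the feed

def one_round(share):
    # Users click roughly in proportion to what they are shown...
    clicks = dict(share)
    # ...and the recommender over-promotes anything that drew clicks at all:
    # the sub-linear exponent boosts small shares relative to large ones.
    boosted = {topic: c ** 0.8 for topic, c in clicks.items()}
    total = sum(boosted.values())
    return {topic: b / total for topic, b in boosted.items()}

share = dict(topics)
for _ in range(20):
    share = one_round(share)

# A 10% niche interest has snowballed toward half of the feed.
print(round(share["Tom Raspberry"], 2))
```

Nothing in the loop ever consults the real world; the machine is measuring, and then amplifying, its own previous output.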

You may well think, as I do, that despite the sheer “wrongness” of rigging democracy, whether by design or by accident, it matters little who is elected anyhow. In that case, imagine the same scenario when the machine turns its hand to guiding change in a government department or a large business, or to guiding product development, or even to finding the cure for cancer. If you would like to see many better examples, with a strong scientific analysis, check out Weapons of Math Destruction.

One wonderfully simple yet highly destructive application of ML that I have seen up close is the call-centre automated system that recognises your telephone number, calculates your value as a customer and decides whether you will be answered, how long you will have to wait and whether you get to speak to somebody skilful. Just to update my card details for a £20-a-month hosting service, I had 11 hours of my time wasted, had my service disrupted and was threatened by a bot with a £150 fine to put the service back on.
I hate to disappoint you, but if you have ever had an IM conversation with a patient lady on the support portal, “that was no lady”, nor was it my wife; it was a distant cousin of Cortana.
If she did not know the answer, or more likely the question, you were never going to be served.
If you are wondering what might happen to your pension, your job and your home if these guys get involved in stock trading, take a look here. According to a 2014 report, sixty to seventy per cent of price changes are driven not by new information from the real world but by “self-generated activities”.

It’s not all negative by any means. I actually do use ML to predict the winners of tomorrow’s racing, with a consistent level of profit. When I get it wrong, usually after a late night of programming with insufficient testing, my winnings disappear very quickly into someone else’s pocket, and I sit up and take notice.
I sincerely hope that someone starts sitting up and taking notice soon of the impact of poorly programmed bots that are already beginning to increase risk for the most powerful nations on earth.