Get that Big Data Project under way without breaking the bank

The background is a Must-Read.

Big data was announced to me in 2011, when I was busy trying to create some insight from more than 100 acquisitions of a global publisher. Back then it meant Hadoop, NoSQL and instant processing as opposed to Data Warehouses, Data Marts and BI services.
By 2013 the term Data Scientist had come of age and many people I knew used this moniker to attract bigger salaries, not all of them well practised in beating information out of data either. Today Big Data means something different again and that may well change further.


The above needs no introduction and has no owner. Certainly, it became the underlying war cry after the early madness, when the first wave of data warehousing went bad and they began tearing these over-engineered monstrosities down. It means “UNLESS YOU KNOW WHAT YOU WANT OUT, WHY ON EARTH ARE YOU DOING THIS?”


So you would think, but experience tells us that people, even when they are at the top of the tree, often want to be told they are right far more than they want insight. Every Data Scientist can relate to that job, or those jobs. Every Data Scientist also knows that without a satisfactory answer to the short question he/she is up against it.

Every organisation with turnover greater than about 20m has found that it must have key systems to support acquiring and managing customers, managing stock and materials and a myriad of other processes. The cost of developing this software is beyond most and the skill to do it well is very rare, hence they buy packaged systems such as CRM, ERP and CMS, build some others and run them mostly as silos. There is now, in every business, a myriad of unrelated systems, each owning part of the audit trail for each customer, each product and so forth.
These COTS systems have locked-down data stores, many with numbered tables and numbered data columns just to provide further barriers. Each vendor ultimately wants to sell you a one-size-fits-all solution. Upscale this to a global group of, say, 100 operating companies and you begin to see how bad things can get. This is why, when you call the call-centre in India, they have no idea who you are or what you want, and care less, regardless of how polite they might be. Everybody knows this situation is impossible to run efficiently, and when it comes to getting a higher-level strategic view for key decisions and product or service strategy, the average business is struggling in the land of hunches.
Data warehousing and then Big Data are politically and technically different ways of trying to pull some insight out of the trail of transactions after the fact and learn something that might be used to improve the bottom line in some way in future. For another insight read this article

The hype and how it may relate to your situation

The toughest thing to find amongst all the “two and a half quintillion bytes, or 2,500,000,000,000,000,000 bytes, of data produced in 2017” is a single verifiable example of a business that used big data to gain insight it could not have found elsewhere more cheaply and made significant investment returns. Yes, they certainly do exist, but I don’t need a Big Data operation to gain the insight that they are rare enough to trigger some caution before jumping in with a big yahoo!

In my day-to-day consulting activities, I have seen a few claim to have a “Big-Data” project but none that both qualified for that title and also delivered significant value. The nearest was a recommendation engine which annoyed more customers than it helped and could have been achieved much more effectively by my two best people in the Philippines at a cost of around 50k p.a. just combing transaction records for clues and calling customers to verify their hypotheses.
The majority of my Footsie 100, S&P 500 and public sector clients have weak or non-existent data marts or data warehouses and rely on extraordinarily risky data collected by one or two “Gurus” via exports, dumps, SQL and good old Excel, with weak or non-existent quality control. This includes reports going into board meetings in some cases. It is very easy to see why when you tackle the sheer scale of the job of making the data from their many systems (not even catalogued in many cases) sufficiently compatible to use in any kind of analysis.

If these businesses had all their valuable data in compatible formats and accessible to a data analyst with the most basic tools, even Excel in many cases, the gains would be enormous. However, this is beyond them, and Big Data is rarely offering to tackle this problem in any significant way, or at all. By the way, any data analyst who suggests that big data can use data without first cleansing it is a “Porky Pie merchant” and you should show him the door. So before you blow your budget on fancy mind-reading tools in the cloud, take a little time to think about what you are hoping to achieve and introduce your evangelists to more practically minded people so that both can benefit. It can only do you good.

Integration is not Big Data and vice versa

Don’t fall into the trap of confusing data analysis and Big Data with systems integration. Integrating systems is about creating real-time, or near-real-time, connections between systems so that customers and employees can see a broader or even 360-degree view of key process areas such as marketing, delivery and customer support. For example, a customer service person can see the details of the order last week, the attempted delivery yesterday and the customer ticket opened early this morning, and react appropriately. This is not a challenge that would normally be solved by data analysis or Big Data solutions, but by intelligent system integration. For a better view read this

The real challenge of data science

Each organisation and each CIO has a different reason for setting out on this quest, but ultimately, every business investment must deliver a return at a certain level to be acceptable to the business. We all understand at some level the importance of agile thinking and the necessity of being able to change tack in response to changing environments, but an acceptable end-goal is non-negotiable.
If you have the fortitude and foresight to begin with “why?”, then “what?” will become achievable and meaningful.


Most readers will have expected to jump to “what?”, having gotten over the surprise of starting with “why?”. Read on.

Before you alienate everyone who used to be on your side, by potentially replacing them with a suspect new system, or shifting their perceived power-base, or undermining hard-working and valuable BI teams, think long and hard about the “Why” question above and then about who should be owning and driving this initiative.

Next, think about whether your core values, beliefs and principles as a business are up for grabs, or have potential to be reviewed, and who will own the governance of this aspect of the initiative and any reviews and actions deriving from them. Do both of these things well and you may survive the next step.

Finally, it is very important to ensure that technical leads and business sponsors are genuinely business specialists who understand the business need rather than the data and technology. Data and technology skills can be brought in, whereas inside knowledge takes time to establish. A good balance of business sponsorship and external people can drive a very healthy, objective culture where people are both supported and challenged, and that will contribute strongly to your Big Data Project success.

What and How are sometimes interchangeable in terms of when they occur, and of course they are iterative. Sometimes (often, if you really know what you are doing) in information gathering you need to gather some high-level information, process it and get a broad feel for the landscape before deciding what options you have and what to do in the next step of orientation. I often use the OODA loop as a metaphor for this process: orientation, building a hypothesis, then repeating until you home in on the few actionable nuggets you had hoped for, or the Abort signal.
Just be aware of the following:
Working with no hypothesis is not investing, it is walking in a casino and putting your money on red 7. Working with a hypothesis is even more risky if you are in love with your idea. If you interrogate data enough it will tell you anything you want to hear and if you engage in practices like Machine Learning with some of the less understood algorithms in particular, you will have no problem at all in finding what you want to find. Be warned.
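To make the warning concrete, here is a minimal sketch in Python of how interrogating enough data produces a "finding" out of nothing. Everything in it is an assumption for illustration: the target and all 200 candidate "drivers" are pure random noise, and the function names and sizes are invented. It simply reports the strongest correlation found between any noise column and the noise target.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def best_noise_correlation(n_rows=30, n_features=200, seed=1):
    """Strongest absolute correlation between a random target and random features.

    Nothing here is related to anything else: every value is independent noise.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_rows)]
        best = max(best, abs(pearson(feature, target)))
    return best


if __name__ == "__main__":
    print("one candidate tried:  ", round(best_noise_correlation(n_features=1), 2))
    print("200 candidates tried: ", round(best_noise_correlation(n_features=200), 2))
```

The more candidate columns you try against a small sample, the stronger the best "relationship" looks, even though none exists. This is why a hypothesis should be written down before the search starts and checked against data that played no part in finding it.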

Furthermore, an assumption in many Big Data projects is that you will uncover insights you had not expected. This stems from the underlying assumption that some form of Machine Learning will be used. If this is a central theme in your project, then you need to accept that “Stuff you don’t know”, as per the Johari window, is in fact unknown, i.e. it may not exist. Highlight this, place it on your risk log and engage the best brains to tackle it with the right measures.

Establish with your governance team what you believe you can achieve, where your boundaries are, how and when you will review this governance and in terms of business process and bottom line, what you are hoping to achieve. Then hold regular reviews with project management of how you are progressing towards these goals and the state of Risks and Opportunities.


Sooner or later you are going to have to engage people who can boil down the high-level business goals into project steps and technical approaches and form a way-map of how you hope to progress. Even if this is very exploratory and the second milestone is knowing whether this project is feasible, it is vital to have reliable milestones, signposts and decision stages, if only to drive the direction of work and the effort and technology needed to accomplish it, and of course “How much?”, which is itself a key part of “How?”.

You have probably already guessed that this is something of an iterative process because you need high level costs first and then more accurate costs and so-forth. It has probably also occurred to you that the earlier the technical evangelists take part in this planning the better, if everyone is to save some time and accountability is to be maximised.


Like all projects, Big Data projects need timeframes and milestones, and while all these can be managed and changed within typical project approaches, including traditional and agile, they can’t be ignored altogether.

Below are selected comments from some key business leaders with experience of Big Data initiatives.


There are many ideas about where this data should be stored, cleaned and analysed and it can cause enormous extra challenges if handled badly.

The key to data, whether big or not, is data quality. In the simplest terms, quality means that the data is complete, clean, safe in type and structure, and comparable. We could add to that, but this is the basic minimum.

For example, if certain fields are missing altogether then counts will clearly be misleading. If the data is right but in the wrong format, that won’t work either. If we have age ranges or date ranges that are not comparable, then comparisons won’t work or will be misleading. If some seasons or some product lines are well represented and others are not, the comparisons will look very misleading.
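The checks described above need nothing exotic. The following is a minimal sketch in Python, using only the standard library, of what a first-pass quality report might look like. The field names, the date format and the sample records are all hypothetical, chosen purely for illustration; a real rule set would come from your own data governance.

```python
from datetime import datetime

# Hypothetical rule set: field name -> parser that raises ValueError on a bad value.
RULES = {
    "customer_id": int,
    "order_date": lambda text: datetime.strptime(text, "%Y-%m-%d"),
    "amount": float,
}


def quality_report(records, rules=RULES):
    """Count missing and badly formatted values for each field in the rule set."""
    report = {field: {"missing": 0, "bad_format": 0} for field in rules}
    for record in records:
        for field, parse in rules.items():
            value = record.get(field)
            if value is None or str(value).strip() == "":
                report[field]["missing"] += 1
                continue
            try:
                parse(str(value).strip())
            except ValueError:
                report[field]["bad_format"] += 1
    return report


def coverage(records, group_field):
    """Count records per group to expose under-represented seasons or product lines."""
    counts = {}
    for record in records:
        key = record.get(group_field) or "(missing)"
        counts[key] = counts.get(key, 0) + 1
    return counts


# Illustrative records only: one wrong date format, one missing id, one bad amount.
sample = [
    {"customer_id": "101", "order_date": "2017-03-01", "amount": "19.99", "product_line": "books"},
    {"customer_id": "102", "order_date": "01/03/2017", "amount": "5.00", "product_line": "books"},
    {"customer_id": "", "order_date": "2017-03-02", "amount": "n/a", "product_line": "journals"},
]

if __name__ == "__main__":
    print(quality_report(sample))
    print(coverage(sample, "product_line"))
```

A report like this, run before any analysis, tells you whether counts and comparisons can be trusted at all. It is deliberately dull: the point is that the quality regime matters far more than the platform it runs on.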

Given that the difference between vital, accurate information and utterly misleading garbage is very small and entirely indiscernible in a report, it can’t be stressed enough how important it is to get excited about the data, not the technology.

What all that means for the “where?” question is that you either need to be able to export the quality regime successfully or get the data to a central point. Given the importance of some of this information to corporate decisions it is natural to want it at head office, but often very successful distant businesses have their own regimes that are best left intact, and the lightest touch possible can be attractive.
Accessing data over long distances is not technically difficult regardless of where it is stored, so data warehouses, data lakes and remote data marts are all very legitimate approaches that in some cases can live side by side. At the risk of repeating myself, the only thing you need to centralise is data quality assurance.


Interviews taken from

Ruben Sigala, chief analytics officer, Caesars Entertainment: “What we found challenging, and what I find in my discussions with a lot of my counterparts that is still a challenge, is finding the set of tools that enable organizations to efficiently generate value through the process. I hear about individual wins in certain applications, but having a more sort of cohesive ecosystem in which this is fully integrated is something that I think we are all struggling with.”

Zoher Karu, vice president, global customer optimization and data, eBay: “One of the biggest challenges is around data privacy and what is shared versus what is not shared.”

Ruben Sigala: “You have to start with the charter of the organization. You have to be very specific about the aim of the function within the organization and how it’s intended to interact with the broader business.”

Vince Campisi, chief information officer, GE Software:
“One of the things we’ve learned is when we start and focus on an outcome, it’s a great way to deliver value quickly and get people excited about the opportunity. And it’s taken us to places we haven’t expected to go before. So we may go after a particular outcome and try and organize a data set to accomplish that outcome. Once you do that, people start to bring other sources of data and other things that they want to connect.”

Ash Gupta, chief risk officer, American Express: “The first change we had to make was just to make our data of higher quality. We have a lot of data, and sometimes we just weren’t using that data and we weren’t paying as much attention to its quality as we now need to. That was, one, to make sure that the data has the right lineage, that the data has the right permissible purpose to serve the customers.”

Victor Nilson: “Talent is everything, right? You have to have the data, and, clearly, AT&T has a rich wealth of data. But without talent, it’s meaningless. Talent is the differentiator. The right talent will go find the right technologies; the right talent will go solve the problems out there.”

Zoher Karu: “Talent is critical along any data and analytics journey. And analytics talent by itself is no longer sufficient, in my opinion. We cannot have people with singular skills. And the way I build out my organization is I look for people with a major and a minor. You can major in analytics, but you can minor in marketing strategy.”




Big Data V Data Analysis

1.       Making decisions data-driven
Yes, when understood properly this is sage advice, but it is unlikely to ever be part of a Big-Data project. Of course there is no law to say you can’t call it that.

Decisions need to be made by accountable people and this is unlikely to go away for a very long time. These decisions should be informed by insight gained from data wherever possible and should be tempered by experience and tacit knowledge.
Formula One cars collect and analyse more data than any other organisation, machine, process or system, but there is no plan for a self-driving F1 car, and the leading engineers are very clear that the driver is the key component and is not even to be taxed with too much information.

Even the vast amount of information collected within an F1 car does not need the infrastructure of a Big Data installation and can be handled in a much simpler way.

2.       Finding new insight

This type of use case is quoted over and over by bloggers who know nothing at all about systems or data analysis, and includes:

Analysing customers’ actions and finding new opportunities to make a recommendation. This is the most common case quoted. There have been very adequate machines around since before 2010 to do just this and their success, though limited by a range of things, reached its maximum sometime around 2013. This is unlikely ever to be improved on, for many reasons: customers’ limited desire to spend money, their limited need to keep buying the same thing they were interested in last week, flaws in the view that we are all alike, and many more fairly obvious shortcomings. A small PC with a free RDBMS could do this job for even the biggest retailer just as efficiently with a little data planning and governance. The key is not to slow down the eCommerce by crunching this stuff while the customer waits or on the same platform.

The only algorithm, and not a very smart one, that shows any results at all is the one that says “a high proportion of people who bought A also bought B”, and therefore allows you to offer B as an extra, place adverts in front of them next day, mail them a deal and so on. That is really it for the moment, by and large. This approach does not need to know why, because it is not concerned about upsetting a few customers. This latter situation will change rapidly though, as customers vote with their feet, or mouse.
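As a rough illustration of how little machinery the “bought A, also bought B” rule needs, here is a minimal sketch using Python’s built-in sqlite3 module as the free RDBMS. The table name, column names, products and the also_bought function are all hypothetical, invented for this example; a real retailer would run something similar offline against a copy of its order lines, not on the live eCommerce platform.

```python
import sqlite3

# Hypothetical order lines (order_id, product). Illustrative data only.
ORDER_LINES = [
    (1, "kettle"), (1, "teapot"),
    (2, "kettle"), (2, "teapot"), (2, "mugs"),
    (3, "kettle"),
    (4, "teapot"), (4, "mugs"),
]


def also_bought(order_lines, product):
    """Return (other_product, share) pairs for orders containing `product`.

    share = fraction of orders containing `product` that also contain
    other_product, sorted with the highest share first.
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE order_lines (order_id INTEGER, product TEXT)")
    con.executemany("INSERT INTO order_lines VALUES (?, ?)", order_lines)
    rows = con.execute(
        """
        SELECT b.product,
               COUNT(DISTINCT b.order_id) * 1.0 /
               (SELECT COUNT(DISTINCT order_id)
                  FROM order_lines
                 WHERE product = :p) AS share
          FROM order_lines AS a
          JOIN order_lines AS b
            ON a.order_id = b.order_id AND a.product <> b.product
         WHERE a.product = :p
         GROUP BY b.product
         ORDER BY share DESC, b.product
        """,
        {"p": product},
    ).fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    for other, share in also_bought(ORDER_LINES, "kettle"):
        print(f"{share:.0%} of kettle orders also contained {other}")
```

The whole rule is one self-join and a count. It says nothing about why customers buy the two things together, which is exactly the limitation described above.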

3.       Social media insight

Some examples of gaining insight, especially those related to the US election and Brexit, refer to an area that must surely be curtailed very soon. This is social media businesses, and those with phone apps in your pocket, eavesdropping, analysing mood and attitude, generating mood and attitude, stalking every move, every call, every text and accosting people at times when they are judged to be vulnerable with whatever garbage the advertiser will pay for. For example, one Australian firm was found targeting teenagers on Facebook who felt worthless.

I have no doubt this type of thing could be politically or financially rewarding, as can hacking, robbery, murder and all sorts of crimes. I do hope you are not planning to get involved in this sort of thing. Aside from that, the data being snooped on does not belong to these companies and, regardless of any agreement users were forced to sign, basic human rights in most civilisations would make this behaviour illegal. We just need a little time for more people to wake up and start demanding action.
This is one example that does require Big Data to handle the enormous task of analysing all this data. In Google’s case, their analysis machines burn an amount of electricity equivalent to that of the city of San Francisco.

4.       Preventing fraud
This is a very annoying and lazy misuse of technology that really ought to be curbed.

How many times have you got to your hotel in a strange country, walked outside, eaten and had your card refused? That is Big Data, or rather small, annoying, pointless data, making dumb assumptions. The reason the bottom line suggests it is effective is that you, the unfortunate traveller, had to call them and prove who you were, if indeed you could get through from abroad. There is nothing intelligent at all about any of this, and as soon as there is an alternative bank, I will become an ex-customer. Systems people tried this approach to monitor for hacks etc. and are also very underwhelmed by the results.

5.       Monitoring inanimate things like water supply pipelines
In India where water is so scarce, DLF use big data to predict problems and aid with maintenance and avoidance. This is big data and machine learning working well to do something useful and valuable.

There are many opportunities to use data analysis to improve the value we provide to our customers, monetise that value effectively and hang on to some of it. Most of these opportunities can be grasped with fairly basic equipment and tools, and very few of them need to involve huge data volumes or very pricey infrastructure. There are also opportunities to really add to our knowledge by allowing ML algorithms to adopt different viewpoints we would not even think of and highlight different opportunities for us to pursue and take advantage of. Just follow the basic steps outlined above and you will save a great deal of wasted effort and come away with what you had hoped for, or even better.


The difference between wasting a great deal of time and money on nothing and building a route to insight and improvement for your business lies in following a few simple processes and breaking your journey into achievable steps, even when the early steps might be described as “Where are we?”, and “Why are we doing this?”.

The things to avoid are hype, “Suck it and see” mentality, disrespect for data quality, confusing Systems integration with Big data/Data analysis, sitting on the fence too long, trying to do it all yourself, handing it all to a technical supplier, talking sexy technical terms instead of business outcomes, listening to the guy who is good with computers.