Distributed Analytics in IoT – Why Positioning is Key

The current global focus on the “Internet of Things (IoT)” has highlighted the extreme importance of sensor-based, intelligent and ubiquitous systems in introducing greater efficiency into our lives. This brings a natural challenge, as the load that data places on our networks and cloud infrastructures continues to increase. Velocity, variety and volume are the attributes to consider when designing your IoT solution, and it is then necessary to decide where and when the analytical algorithms should execute on the data sets.

Apart from classical data centers, there is huge potential in the various compute sources across the IoT landscape. We live in a world where compute is at every juncture, from our mobile phones to our sensor devices, gateways and cars. Leveraging this normally idle compute is important in meeting the data analytics requirements of IoT, and future research will need to consider these challenges. There are three main classical architecture principles that can be applied to analytics: centralized, decentralized and distributed.

The first, centralized, is the best known and understood today. It is a pretty simple concept: centralized compute across clusters of physical nodes acts as the landing zone (ingestion point) for data coming from multiple locations, so the data sits in one place for analytics. By contrast, a decentralized architecture uses multiple large clusters arranged hierarchically in a tree-like structure. In that analogy the leaves sit close to the data sources, so they can compute on the data earlier or distribute it more efficiently for analysis. Some form of grouping can be applied, for example per geographical location, or a hierarchy set up to distribute the jobs.
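To make the decentralized, tree-like idea a little more concrete, here is a minimal Python sketch (the node classes, regions and values are purely illustrative, not taken from any particular product) of leaf nodes close to the sources pre-aggregating readings and forwarding only compact summaries up to a root cluster:

```python
from statistics import mean

class LeafNode:
    """Sits close to the sensors and aggregates raw readings locally."""
    def __init__(self, region):
        self.region = region
        self.readings = []

    def ingest(self, value):
        self.readings.append(value)

    def summarise(self):
        # Forward a compact summary up the tree instead of every raw reading.
        return {"region": self.region,
                "count": len(self.readings),
                "mean": mean(self.readings),
                "max": max(self.readings)}


class RootNode:
    """Central cluster at the top of the tree; sees only summaries, never raw data."""
    def __init__(self):
        self.summaries = []

    def collect(self, summary):
        self.summaries.append(summary)

    def global_view(self):
        return {"devices_reporting": sum(s["count"] for s in self.summaries),
                "regional_means": {s["region"]: s["mean"] for s in self.summaries}}


# Usage: two geographically grouped leaves feeding one root.
root = RootNode()
for region, values in {"dublin": [20.1, 21.3, 19.8], "belfast": [18.2, 18.9]}.items():
    leaf = LeafNode(region)
    for v in values:
        leaf.ingest(v)
    root.collect(leaf.summarise())

print(root.global_view())
```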

Lastly, in a distributed architecture, which is the most suitable for devices in IoT, the compute is everywhere. Generally speaking, the further you move from the centralized model, the smaller the compute becomes, right down to the silicon on the devices themselves. It should therefore be possible to push analytics tasks closer to the device. In that way, these analytics jobs can act as a data filter and decision maker, determining whether quick insight can be gained from smaller data sets at the edge or beyond, and whether to push the data to the cloud or discard it. Naturally, this type of architecture brings more constraints and stronger requirements for network management, security and monitoring, covering not only the devices but the traffic itself. It makes more sense to bring the computation to the data than the data to a centralized processing location.
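As a rough illustration of that filter-and-decide role at the edge, the sketch below (the limit, field names and stand-in transport are assumptions for the example) extracts a quick local insight from a small window of readings and only forwards it to the cloud when a limit is breached; otherwise the raw data is discarded on the device:

```python
from statistics import mean

def edge_filter(window, push_to_cloud, limit=30.0):
    """Quick insight at the edge: summarise locally, forward only when it matters."""
    summary = {"n": len(window), "mean": mean(window), "max": max(window)}
    if summary["max"] > limit:
        # Anomalous window: ship summary plus raw detail for deeper, central analytics.
        push_to_cloud({"summary": summary, "raw": window})
        return True
    # Quiet window: raw readings are discarded here and never cross the network.
    return False

# Usage with a stand-in transport (a real device might publish over MQTT or HTTP).
sent = []
edge_filter([20.0, 20.2, 19.9], push_to_cloud=sent.append)   # quiet: dropped at the edge
edge_filter([20.1, 35.7, 21.0], push_to_cloud=sent.append)   # limit breached: forwarded
print(len(sent))   # 1
```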

There is a direct relationship between the smartness of the devices and the selection and effectiveness of these three architectures. As our silicon gets smarter, more powerful and more efficient, more and more compute will become available, which should mean less strain on the cloud. And as we distribute the compute, our solutions should become more resilient, because there is no single point of failure.

In summary, “Intelligent Infrastructures” now form the crux of the IoT paradigm. IoT practitioners will have more choice over where they place their analytics jobs, so they can best utilize the compute that is available and control latency for faster response, meeting the real-time requirements of the business metamorphosis that is under way.

Why IoT practitioners need to “Wide Lens” the concept of a Data Lake

As we transition towards the vast number of devices that will be internet-enabled by 2020 (experts estimate anything from 50 to 200 billion), it seems that the cloud architectures currently being proposed fall somewhat short of the features required to meet customers’ data requirements in 2020.

I won’t dive deeply into describing the technology stack of a Data Lake in this post (Ben Greene from Analytics Engines in Belfast, who I visit on Wednesday en route to Enter Conf, does a nice job of that in his blog). As a quick side step, looking at the Analytics Engines website, I saw that customer choice and ease of use are among the architecture pillars of their AE Big Data Analytics Software Stack: quick to deploy, modular and configurable, with lots of optional high-performance appliances. It’s neat to say the least, and I am looking forward to seeing more.

The concept of a Data Lake has a large reputation in current tech chatter, and rightly so: it has huge advantages in enterprise architecture scenarios. Consider the use case of a multinational company with 30,000+ employees, countless geographically spread locations and multiple business functions. Where is all the data? That is normally a challenging question, with multiple databases, repositories and, more recently, Hadoop-enabled technologies storing the company’s data. This is the very reason a business data lake (BDL) is a huge advantage to the corporation. If a company has a Data Architect at its disposal, it can develop a BDL architecture (such as the one shown below, ref – Pivotal) that acts as a landing zone for all its enterprise data. This makes a huge amount of sense. Imagine being the CEO of that company: as we see changes in the Data Protection Act(s) over the next decade, the company can take the right steps towards managing, scaling and, most importantly, protecting its data sets. All of this leads to a more effective data governance strategy.

[Image: Pivotal business data lake architecture]
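As a small, hypothetical sketch of the “landing zone” idea, the snippet below lands raw events into a file-based lake partitioned by business function, source and date; the paths and fields are invented for illustration and are not part of the Pivotal architecture above:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def land_event(lake_root, business_unit, source, event):
    """Append a raw event to the landing zone, partitioned so it can be found later."""
    now = datetime.now(timezone.utc)
    partition = Path(lake_root) / business_unit / source / now.strftime("%Y/%m/%d")
    partition.mkdir(parents=True, exist_ok=True)
    path = partition / "events.jsonl"
    with path.open("a") as f:
        f.write(json.dumps({"ingested_at": now.isoformat(), **event}) + "\n")
    return path

# Usage: two business functions landing data side by side in the same lake.
land_event("datalake", "facilities", "hvac-sensor-42", {"temp_c": 21.4})
land_event("datalake", "logistics", "fleet-gps", {"lat": 53.35, "lon": -6.26})
```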

Now shift focus to 2020 (or even before?) and take a look at the customer landscape. The customers who will require what the concept of a BDL provides today will need far more choice, and won’t necessarily be willing to pay huge sums for that service. While there is some customer choice today, such as Pivotal Cloud Foundry, Amazon Web Services, Google Cloud and Windows Azure, even these services are targeted at a consumer base of startups and upwards in the business maturity life cycle. The vast majority of cloud services customers in the future will be everyone around us, the homes we live in and beyond, and the requirement to store data in a far-distant data center might not be as critical for them. They are expected to need far more choice.

Consider the case of building monitoring data, which could be useful to a wider audience in a secure linked open data sets (LODs) topology. For example, a smart grid provider might be interested in energy data from all the buildings, to suggest optimal profiles that reduce impact on the grid. Perhaps the provider might even be willing to pay for that data? This is where data valuation discussions come into play, which are outside the scope of this blog. But the building itself, or its tenants, might not need to store all their humidity and temperature data, for example. They might want some quick insight up front, and then choose to bin that data in their home, based on some simple protocol describing the data usage, as sketched below.
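As a toy example of what such a usage protocol could look like, the sketch below keeps the quick insight, bins the raw readings and flags what may be shared, per metric; the metric names, retention periods and sharing flags are entirely invented:

```python
# A toy "data usage protocol": per metric, whether to keep raw data, for how long,
# and who it may be shared with.
USAGE_POLICY = {
    "energy_kwh":  {"keep_raw": True,  "retention_days": 365, "share_with": ["grid_provider"]},
    "temperature": {"keep_raw": False, "retention_days": 0,   "share_with": []},
    "humidity":    {"keep_raw": False, "retention_days": 0,   "share_with": []},
}

def apply_policy(metric, readings):
    """Keep the quick insight, then retain or bin the raw readings per the policy."""
    policy = USAGE_POLICY.get(metric, {"keep_raw": False, "retention_days": 0, "share_with": []})
    insight = {"metric": metric, "latest": readings[-1], "mean": sum(readings) / len(readings)}
    raw = readings if policy["keep_raw"] else None   # binned locally when not needed
    return insight, raw, policy["share_with"]

insight, raw, recipients = apply_policy("temperature", [20.1, 20.4, 19.9])
print(insight, raw, recipients)   # insight retained, raw readings binned, nothing shared
```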

Whilst a BDL is built on the premise of “store everything”, and that will bring value for organisations monitoring the consumers of their resources, individual consumers might not be willing to pay for it.

To close, the key enablers of these concepts are real-time edge analytics and increased data architecture choice, and this is beginning to happen. Cisco has introduced edge analytics services into its routers, which is a valid approach to ensuring that the consumer has choice, and it is taking the right approach, with different services for different verticals (Retail, IT, Mobility).

In my next blog, Edge Analytics will be the focus area, where we will dive deeper into the question: “where do we put our compute?”

IoT and Governance. It’s a game of RISK

Due to the sheer volume of devices and data, and the security and networking topologies that result from IoT, it is natural for there to be a lot of questions and legal challenges around governance and privacy. How do I know my data is secure? Where is my data stored? If I lose a device, what happens to data in flight?

The National Fraud Intelligence Bureau has said that 70% of the 230,845 frauds recorded in 2013/2014 included a cyber element, compared to 40% five years ago. This would indicate that we aren’t doing a very good job of protecting the existing internet-enabled devices, so why should we be adding more? If we internet-enable our light bulbs and heating systems (Nest being acquired by Google is a good example) to control them from our mobile phones, can those devices be hacked to tunnel into our mobile phone data?

It is not only the individual consumer who needs to be aware of privacy and governance. Businesses too, when they adopt IoT, must commit resources to the legal requirements and implications of IoT enablement. A key aspect of this will be ensuring their internal teams are aligned on IoT, and more specifically on security, data protection and privacy.

More and more, governments and regulatory bodies have IoT in their remit. This includes the EU Commission, which published a report recommending that IoT should be designed from the beginning to meet suitable governance requirements and rights, including the right of deletion, data portability and privacy.

The draft Data Protection Regulation addresses some of these measures including:

  • Privacy by design and default – to ensure that the default position is the least possible accessibility of personal data
  • Consent
  • Profiling – clearer guidelines on when data collected to build a person’s profile can be used lawfully, for example to analyse or predict a particular factor such as a person’s preferences, reliability, location or health
  • Privacy policies
  • Enforcement and sanctions – violations of data privacy obligations could result in fines of up to 5% of annual worldwide turnover or €100m, whichever is greater

The first point above, privacy by design, is unfortunately normally an afterthought. Whilst not a requirement of the Data Protection Act, it makes the compliance exercise much smoother, and taking such an approach brings advantages in building trust and minimizing risk.

IoT presents a number of challenges that must be addressed by European privacy regulators as IoT evolves. It is predicted that scrutiny of these challenges will increase as the number of devices grows.

Some of the challenges include:

  • Lack of control over the data trajectory path
  • The lack of awareness by the user of the device’s capabilities
  • Risk associated with processing data beyond its original scope, especially with advances in predictive and analytic engines
  • Lack of anonymity for users
  • Everyday devices that previously posed no threat becoming open to threat

As can be seen from these challenges, there are common characteristics, such as control, security and visibility, which make governance of IoT a bigger challenge than expected.

Finally, governance in IoT is expected to follow other technologies. Up to now, the software industry has not had a single set of standards for the complete service portfolio (including cloud), although governments are addressing this. Geographically, different regulations are commonplace in different jurisdictions in IT, so IoT is predicted to follow suit.