Turning the Big Data Warehouse Architecture on its Head

May 31, 2018

Rise of the cloud-based enterprise data warehouse

We’re at a point where the cloud-based enterprise data warehouse is rapidly gaining momentum. The ease of storing, managing and analyzing large volumes of data in the cloud, along with sizable cost savings in hardware, software licenses and maintenance, is driving workloads away from traditional on-premise hardware systems. Research from 2017 shows that nearly 40% of existing workloads (transactional and warehouse) have already shifted to the cloud, and forecasts that almost 60% will have shifted by the end of this decade, led by offerings such as AWS Redshift, Azure SQL Data Warehouse and Google BigQuery.

The fatal flaw of any data warehouse – data transfer

But as any technology gains steam and becomes a de facto standard, there’s always something new lurking around the corner that builds on its successes while overcoming its fatal flaws.

The fatal flaw of any data warehouse (whether on-premise or cloud-based) is undoubtedly the “movement” of data: the need to transfer it across geographies. As the amount of newly generated data from all things connected (e.g. sensors, machines, smartphones) continues to rise to terabyte and petabyte levels within an enterprise, it simply takes too long and costs too much money to constantly move data across geographies to a central location.

The large cloud providers have tried to address this issue in part with transfer services that ship large volumes of data to the cloud, both Internet-based (limited by the laws of physics) and physical. AWS offers its Snowball and Snowmobile appliances for physical transfer, and Google has helpfully published its own analysis showing that it can take days, months or even years to ship large volumes of data across geographies, and when it might make sense to ship physically instead. Even with a decent 100 Mbps Internet connection, it is estimated to take 12 days to ship 10TB of data to the cloud over the Internet. In a world where many connected devices generate hundreds of megabytes or even terabytes of data every day, this latency is unacceptable. In the meantime, new data continues to be generated more quickly than it can be shipped. The result is unacceptable delays in gaining actionable insight from newly generated data, and the inability to analyze a complete dataset that mixes the newest data with historical data, as a time series analysis requires.
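To make the arithmetic behind that estimate concrete, here is a rough back-of-the-envelope sketch in Python. The function names and the 75% effective-utilization figure are assumptions chosen for illustration, not numbers taken from any provider: a fully saturated 100 Mbps link would move 10TB in a little over 9 days, and a 12-day estimate is roughly what you get once protocol overhead and contention are allowed for.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(data_tb, link_mbps, efficiency=1.0):
    """Days needed to push data_tb terabytes over a link of link_mbps.

    efficiency is the fraction of nominal bandwidth actually achieved
    (an assumed value, not a measured one).
    """
    bits = data_tb * 1e12 * 8                    # decimal terabytes to bits
    seconds = bits / (link_mbps * 1e6 * efficiency)
    return seconds / SECONDS_PER_DAY


def daily_capacity_tb(link_mbps, efficiency=1.0):
    """Terabytes the link can carry in one day."""
    bits_per_day = link_mbps * 1e6 * efficiency * SECONDS_PER_DAY
    return bits_per_day / 8 / 1e12


ideal = transfer_days(10, 100)                   # fully saturated link
realistic = transfer_days(10, 100, efficiency=0.75)
capacity = daily_capacity_tb(100)

print(f"10TB at 100 Mbps, ideal:         {ideal:.1f} days")
print(f"10TB at 100 Mbps, 75% effective: {realistic:.1f} days")
print(f"Most the link can ship per day:  {capacity:.2f} TB")
```

The last figure is the one that matters for the backlog: under these assumptions, a site that generates more than about 1 TB per day can never catch up on a 100 Mbps link, no matter how long it waits.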

Edge computing to the rescue? – directionally correct, but not a fix

I know what you’re thinking… isn’t this what edge computing services are supposed to address? Real-time analysis of data can be accomplished by bringing storage and compute closer to the devices generating the data. The idea of extending to the edge is directionally correct, but doesn’t tackle the problem of data transfer head-on.

The edge needs to be far more intelligent than what most edge computing services are able to provide today. Most edge computing software runs embedded within an edge device or a nearby edge gateway. These devices are designed for low cost and low power consumption, which means they are suitable for basic analytics, including local real-time event processing, and can retain data for query for several hours at most. Analysis can be done within a single edge, but without awareness, or federation, of the data being generated at other edges. So ultimately data needs to be reduced and then shipped to a big data-center to be stored in a data warehouse for long-term retention and deep analysis.

But how does one determine what data to reduce? What data should be purged and lost forever? How do you know? There are certainly some clear places to reduce data, such as status notifications (e.g. simple working/not-working). But the majority of data will likely be useful for subsequent analysis or to run machine learning algorithms against – whether ad hoc, forensic or time series analysis. So there’s an unfortunate tradeoff: reduce data at the edge to lessen the volume for transfer, or keep all of the data so that insights can be gleaned from it over extended periods of time – days, months or, for some use cases, even years.

Distributing the data-warehouse – anywhere you need it

What’s needed is a distributed analytics platform that retains the positive attributes of on-premise and cloud-based big data warehouses but eliminates the need to move data altogether. And that’s exactly what we’ve accomplished at Edge Intelligence.

Structured and semi-structured data can be efficiently stored across geographies, close to wherever the data is generated, and scalable to tens, hundreds, or even thousands of locations. These locations can be at the edge (e.g. retail stores, manufacturing sites, oil rigs, banking branches, hospitals, wireless base-stations). They can also leverage “nearby” public cloud and micro data-center locations with on-demand storage and compute. Customers have complete flexibility to store their data on-premise, in the cloud, or at the edge – or any combination thereof – depending on the use case and economics.

While the data is distributed across geographies, the management and analysis is federated, so the entire system acts logically as though all of the data were in a single central location. Complex event processing, machine learning, and querying of data with standard SQL commands, data visualization tools and interfaces can all be done easily without having to move data across geographies and disparate systems. Customers choose how long they wish to retain data within a system that can easily scale to petabytes. Encryption is applied to all in-flight and at-rest data, along with authentication measures, to ensure the safety of data stored across the entire platform.
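As a rough illustration of the general idea of federation (and not a description of how Edge Intelligence’s product is implemented), the Python sketch below keeps each site’s rows in its own local database, sends the same SQL aggregate to every site, and merges only the small partial results. The site names, table layout and helper functions are all hypothetical, and in-memory SQLite databases stand in for geographically separate stores.

```python
import sqlite3


def make_site(rows):
    """Create a stand-in for one site's local store (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (sensor TEXT, temp_c REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    return conn


# Each site computes a partial aggregate locally; raw rows never leave it.
PARTIAL_SQL = (
    "SELECT sensor, COUNT(*), SUM(temp_c) FROM readings GROUP BY sensor"
)


def federated_avg(sites):
    """Combine per-site (count, sum) pairs into one global average per sensor."""
    totals = {}
    for conn in sites.values():
        for sensor, count, total in conn.execute(PARTIAL_SQL):
            seen_count, seen_total = totals.get(sensor, (0, 0.0))
            totals[sensor] = (seen_count + count, seen_total + total)
    return {
        sensor: total / count for sensor, (count, total) in totals.items()
    }


sites = {
    "store_a": make_site([("fridge", 4.0), ("fridge", 6.0), ("oven", 180.0)]),
    "store_b": make_site([("fridge", 2.0), ("oven", 200.0), ("oven", 220.0)]),
}

result = federated_avg(sites)
print(result)
```

The point of the sketch is the shape of the traffic: each site returns one small row per sensor rather than its full history, yet the caller sees a single answer as if all of the readings lived in one place. Averages are merged from counts and sums because averaging the per-site averages would give the wrong result when sites hold different numbers of rows.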

Bring the warehouse to the data

It’s time to turn the big data warehouse architecture on its head. Instead of continuing to try to move large volumes of data to the warehouse, let’s start moving the warehouse to the data. Rethink the data warehouse as you know it and say goodbye to the lost business insights and headaches that are a direct result of having to transfer data across geographies.

Request a product demonstration so we can show you just how easy it is to gain instant access to all of your data, across geographies.

Neil Cohen is the VP of Marketing at Edge Intelligence
