Leading the public Web into big data

Mark Little, vice president, Middleware Engineering at Red Hat, explains how the Large-Scale Elastic Architecture for Data-as-a-Service (LEADS) will enable enterprises to leverage all of the public data on the Web.

Little says the investment needed for storing and processing even a small portion of the Internet is exorbitant.
By Mark Little | Published December 6, 2013


Data, both public and private, is growing at an unfathomably rapid rate. In the time it takes you to read this sentence, the volume of data in existence worldwide will have grown by more than the volume of data contained in all of the books ever written. On YouTube alone, 35 hours of video are uploaded every minute, and unstructured data is growing at a rate of 80% year on year. As companies large and small capture massive volumes of customer data, and as the regulatory framework around data retention tightens, data storage has evolved from a routine IT process into a major business issue.
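It is worth pausing on what an 80% annual growth rate implies, since compounding makes it far more dramatic than it sounds. A quick back-of-the-envelope check (the five-year horizon is an illustrative assumption, not a figure from the article):

```python
# Compounding 80% year-on-year growth: a data set of relative size 1.0
# grows to roughly 19x its original size in just five years.
volume = 1.0
for year in range(5):
    volume *= 1.8  # 80% growth each year

print(round(volume, 1))  # 18.9
```

In other words, a company storing 10 TB of unstructured data today would, at that rate, be managing close to 190 TB within five years.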

In some sectors, such as meteorology, seismology, power distribution and financial services, gathering huge volumes of data is an integral feature of the business. These businesses have grown up with big data, and they have pockets deep enough to cover the cost of the necessary IT infrastructure. But businesses of all kinds are now engaging with their customers across multiple channels: on the web, across mobile devices, through social media and face to face.

The benefits of capturing and securely analysing data should be obvious to all, yet according to a recent IDC report less than 1% of data is analysed and more than 80% of it remains unprotected. Why, then, is so little analysed, and why does so much go unprotected? Put simply, big data is too big for all but the largest companies. Traditional storage models do not scale well enough, while the processing power required to quickly make sense of an almost limitless long tail of potentially useful information is beyond all but those with the deepest pockets.

The objective of LEADS is to build a decentralised DaaS framework that runs on an elastic collection of micro-clouds. LEADS will provide a means to gather, store, and query publicly available data, as well as process this data in real-time. In addition, the public data can be enriched with private data maintained on behalf of a client, and the processing of the real-time data can be augmented with historical versions of the public and private data.
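The enrichment idea can be pictured as a join between two stores: query results from the public, crawled data are combined with whatever private records a client holds for the same items. The sketch below is purely illustrative; the class and function names (`PublicStore`, `PrivateStore`, `enrich`) are assumptions for the example, not the LEADS API.

```python
from dataclasses import dataclass, field

@dataclass
class PublicStore:
    """Toy stand-in for the crawled, publicly available Web data."""
    pages: dict = field(default_factory=dict)  # url -> page text

    def query(self, keyword: str) -> list[str]:
        """Return the URLs of public pages mentioning the keyword."""
        return [url for url, text in self.pages.items() if keyword in text]

@dataclass
class PrivateStore:
    """Client-maintained private data, keyed by the same URLs."""
    notes: dict = field(default_factory=dict)  # url -> private annotation

def enrich(public: PublicStore, private: PrivateStore, keyword: str) -> list[tuple[str, str]]:
    """Join public query results with any private annotations the client holds."""
    return [(url, private.notes.get(url, "")) for url in public.query(keyword)]

public = PublicStore(pages={"example.org/a": "big data", "example.org/b": "cats"})
private = PrivateStore(notes={"example.org/a": "competitor site"})
print(enrich(public, private, "big data"))
# [('example.org/a', 'competitor site')]
```

The same pattern extends to the historical dimension the article mentions: versioned snapshots of both stores would let the join run against the data as it stood at an earlier point in time.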

Clearly, the financial investment needed for crawling, storing, and processing even a small portion of the internet is very high, making such a task prohibitively expensive for SMEs and start-up companies.

The monetary cost of the infrastructure is among the critical factors determining how to store big data. As noted, this problem is especially acute for SMEs, which have limited resources. To appeal, therefore, any new solution must be priced competitively with, or below, conventional data centres.

The LEADS platform will be designed from the ground up to account for privacy, security, energy-efficiency, availability, elastic scalability, and performance considerations. The project will be validated on use-cases involving the crawling of web data and its exploitation in different application domains.

In selecting technology partners, the European Union, which is behind this initiative, needed to avoid dependence on proprietary providers from the outset. Choosing between products that purportedly offer a big data solution is a hurdle daunting enough to deter even the most seasoned IT managers. All of the main proprietary storage vendors have an offering in the big data space, typically a package comprising their own hardware with preconfigured software.

Using open source for LEADS prevented the project from becoming locked into a particular hardware vendor, or into the high software licensing costs associated with proprietary operating systems, middleware and applications. It also brings all the advantages of continuous testing, refinement and innovation. Furthermore, open source technology can work alongside existing storage infrastructure.

An open source scale-out solution for unstructured data is used so that storage can grow as required, creating an effectively limitless pool of data. It is a solution that spans seamlessly into the cloud: by accessing cloud-based storage resources, capacity can be turned on or off according to demand. This is particularly useful where customer demand is difficult to predict.
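The elastic principle described above can be sketched in a few lines: rather than provisioning for peak load, the node pool is resized so that combined capacity always just covers current demand. This is an illustrative toy, not a LEADS component, and the node capacity figure is an arbitrary assumption.

```python
import math

def required_nodes(demand_gb: float, node_capacity_gb: float = 100.0) -> int:
    """Smallest node count whose combined capacity covers current demand.

    Always keeps at least one node so the service stays reachable.
    """
    return max(1, math.ceil(demand_gb / node_capacity_gb))

# Demand spikes, then falls away; the node pool follows it in both directions.
for demand in (40, 550, 120):
    print(f"{demand} GB -> {required_nodes(demand)} nodes")
```

Because capacity is released when demand drops, the cost tracks actual usage, which is precisely what makes the model attractive to SMEs that cannot justify buying hardware sized for an occasional peak.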

Of course, predicting demand, particularly in recent years, is a major headache. Make no mistake, data growth is becoming the biggest challenge for enterprises managing their own data centre hardware infrastructure, and the economic downturn forced many IT managers to defer infrastructure and technology upgrade cycles. LEADS provides an economical approach to processing large amounts of data by sharing the collection, storage and querying of public and private data.
