Big data is big business – ask anyone in the data center industry (or enterprise in general) and chances are fairly high that you’ll get an earful. Though it’s existed in some form for quite some time, we’re just now starting to realize the true power that’s hidden just below the surface – the marketing and organizational insights that can be garnered with the capacity to analyze it. The trouble lies in reaching that capacity – and the massive strain it’s going to put on data centers and businesses the world over. Big data is, by its very nature, an incredibly broad concept, one which spans a wide array of industries, technologies, and organizations.
Perhaps as a result of this scope, it’s been driving quite a few powerful, unique, and innovative technologies all across the board – virtually all of them designed to help organizations cope with the strain of big data management. Today, we’re going to take a closer look at a few of those technologies, courtesy of a TechRepublic interview with Dr. Satwant Kaur, author of Transitioning Embedded Systems to Intelligent Environments, who has been referred to as “The First Lady of Emerging Technologies.”
Database technology, Kaur said, is undergoing a significant evolution as a result of the big data craze. This is primarily out of necessity: “traditional, row-oriented databases are excellent for online transaction processing with high update speeds, but they fall short on query performance as the data volumes grow and as data becomes more unstructured.”
Column-oriented databases solve this problem, Kaur explained, because of how they store data: values from the same column are kept together, so a query reads only the columns it needs and similar values compress well. This allows for “huge data compression and very fast query times,” though at the cost of considerably slower updates.
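To make the trade-off concrete, here is a minimal sketch in Python. It is a toy model, not how any particular database is implemented: the `orders` data, the field names, and the `run_length_encode` and `append_record` helpers are all made up for illustration.

```python
# Toy comparison of row-oriented and column-oriented layouts.
# All names and data here are hypothetical, chosen only for illustration.

# Row-oriented: each record is stored together.
rows = [
    {"order_id": 1, "region": "east", "amount": 20},
    {"order_id": 2, "region": "east", "amount": 35},
    {"order_id": 3, "region": "west", "amount": 15},
    {"order_id": 4, "region": "west", "amount": 30},
]

# Column-oriented: each field is stored together as its own list.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Collapse consecutive repeated values into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


def append_record(columns, record):
    """Adding one record means one write per column, not one write in total."""
    for name, value in record.items():
        columns[name].append(value)


# Query: total amount. The row layout visits every record to pull out one field...
total_from_rows = sum(row["amount"] for row in rows)
# ...while the column layout reads a single list and ignores the other fields.
total_from_columns = sum(columns["amount"])

# Similar values sit next to each other in a column, so they compress well.
compressed_region = run_length_encode(columns["region"])
```

Both totals agree, but the column version never touches `order_id` or `region`. The `append_record` helper hints at the cost side: a single new record has to be written into every column.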
A switch from rows to columns isn’t the only way databases are changing, either. Kaur also brought up schema-less (NoSQL) databases in the interview. These databases handle unstructured data by scaling out across distributed nodes, gaining performance by doing away with many conventional database restrictions, such as read-write consistency.
There’s one word to describe why the MapReduce programming model has gained so much traction as a result of big data: scalability. MapReduce allows organizations to easily and efficiently coordinate distributed computing operations, running programs in parallel across thousands of servers or server clusters. It accomplishes this by letting developers supply just two functions – “Map,” which converts input records into intermediate key-value pairs, and “Reduce,” which combines the values that share a key – while the MapReduce infrastructure itself handles distributing and managing the program’s execution.
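The model is easier to picture with the classic word-count example. What follows is a single-process sketch in plain Python, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are invented for illustration, and the grouping step that a real framework performs across machines is simulated here with a dictionary.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group values by key. On a real cluster the framework does this step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all the values emitted for one key."""
    return key, sum(values)


def run_job(documents):
    """Run map, shuffle, and reduce in sequence on one machine."""
    mapped = []
    for document in documents:
        # Each document could be mapped on a different server.
        mapped.extend(map_phase(document))
    grouped = shuffle(mapped)
    # Each key could be reduced on a different server.
    return dict(reduce_phase(key, values) for key, values in grouped.items())


documents = ["big data is big", "data centers store data"]
word_counts = run_job(documents)
```

The point of the split is that `map_phase` and `reduce_phase` never need to know about each other or about the cluster. Because each call depends only on its own input, the work can be spread over as many machines as are available.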
MapReduce in and of itself has gained a great deal of ground in the data market, but one implementation in particular stands out: Hadoop. It is by far the most popular implementation of the model, and the fact that it’s completely open-source is a huge plus for many IT professionals and data center management officials. It’s flexible, powerful, and easy to implement, and it was designed from the ground up to manage large volumes of fluctuating data.
Another technology pointed out by Kaur as particularly vital where big data is concerned was Hive, a “SQL-like bridge which allows conventional Business Intelligence applications to run queries against a Hadoop cluster.” Like Hadoop itself, Hive is open-source, and it was originally developed by Facebook to address its own big data concerns. Hive’s strength lies in the fact that it makes Hadoop considerably more familiar and easy to use for professionals versed only in traditional Business Intelligence – it bridges the gap between BI and Big Data.
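To see why a SQL-like layer is attractive, compare a one-line aggregate query with the map and reduce steps needed to compute the same answer by hand. The sketch below is plain Python and does not show Hive’s actual translation. The `orders` data and all function names are hypothetical, and the query appears only as a comment.

```python
from collections import defaultdict

# The kind of query a BI analyst might write (generic SQL-style, comment only):
#   SELECT region, SUM(amount) FROM orders GROUP BY region;

# Hypothetical input records: (region, amount).
orders = [("east", 20), ("west", 15), ("east", 35), ("west", 30)]


def map_order(order):
    """Map: turn one record into a (grouping key, value) pair."""
    region, amount = order
    return region, amount


def reduce_region(region, amounts):
    """Reduce: aggregate every value that shares a grouping key."""
    return region, sum(amounts)


def group_by_sum(records):
    """Hand-written equivalent of the GROUP BY query above."""
    grouped = defaultdict(list)
    for key, value in map(map_order, records):
        grouped[key].append(value)
    return dict(reduce_region(key, values) for key, values in grouped.items())


totals_by_region = group_by_sum(orders)
```

The analyst’s version is one declarative line, while the hand-written version needs three functions even for this small case. Closing that gap is what a SQL-like bridge is for.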
Kaur pointed to Pig as another emergent technology: “another bridge that tries to bring Hadoop closer to the realities of developers and business users, similar to Hive.” The difference is that Pig uses a “Perl-like” language instead of an “SQL-like” language to execute queries over data stored on a Hadoop cluster. Like Hive and Hadoop, it’s fully open-source.
WibiData is yet another technology which attempts to bridge the disparity between big data and more traditional analytics (I’m noticing a trend here). Built on top of HBase, WibiData combines web analytics with the power of Hadoop. It’s easy to see why such a technology has caught on, as it’s fairly rare these days to see a webmaster who hasn’t dealt with unstructured data at least in passing. To them, WibiData is probably a breath of fresh air: a solution which allows them to more easily work with their user data and respond to user behavior in real time with better content, better decisions, and better communication.
“Perhaps the greatest limitation of Hadoop,” explained Kaur, “is that it’s a very low-level implementation of MapReduce, requiring extensive developer knowledge to operate.” Platfora attempts to address this issue by turning user queries into Hadoop jobs automatically. Essentially, this allows virtually anyone to access the power of Hadoop for their unstructured data needs. Again, it’s easy to see why this technology has taken off – not everyone who requires analysis of unstructured data has the knowledge base to use Hadoop on its own.
Storage technologies – and, by association, vendors and OEMs – have been exerting a great deal of influence on the data industry over the past several years. This influence is only going to become more intense as data volumes grow ever larger and the demand for efficient, effective, and low-cost storage techniques and technologies becomes higher and higher. Of all the entries on this list, this is probably the one which has the most direct impact on the data center – after all, where do you think all this unstructured data is being crunched?
SkyTree is a predictive analytics platform which utilizes machine learning to tackle the issue of Big Data, allowing for automatic data exploration which stands head and shoulders above more conventional methods (which, by and large, simply do not work). Again, it’s quite clear why it’s gaining so much ground – it’s easy to use, efficient, and incredibly effective, to boot.
Last, but certainly not least, we have cloud computing as a whole. As I’m sure you’ve all noticed, many of the emerging technologies on this list are at least partially based in the cloud. That’s no accident – as the need for better analytics becomes ever more dire, I feel that more and more organizations are going to start flocking to the cloud, as they realize the true value the technology holds for them in storage, management, and analysis of unstructured data.