Those are questions that many of us ask these days, and there's a good reason for that. Just like the early days of cloud computing, when every one of us had a different idea about what it actually was, Big Data technologies are commonly mistaken as relevant only to the petabyte club members. But the truth is that there are benefits to using Hadoop, for example, even at lower scales of data. Some of these benefits are also discussed in an interview with Mike Olson, Cloudera CEO.
- Affordable infrastructure: You do not need to purchase expensive hardware or a high-end storage infrastructure for Hadoop. The idea is to use commodity hardware with locally attached storage. The hardware and storage can be physical or virtual, it can be on premises or hosted on a public cloud such as Amazon Elastic Compute Cloud (EC2), and you can dynamically add or remove resources to scale with changing processing demands.
- Rapid data processing: Many of us start our journey with single-node Hadoop clusters in test environments, but Hadoop is designed to run on multi-node clusters, which allow parallel and balanced data processing. Today there are already companies, such as the music-streaming service Spotify, that manage Hadoop environments with hundreds of nodes, each capable of independently processing the pieces of data it holds, much faster than any single server could, regardless of how many CPUs it has.
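The split-and-combine idea behind that parallelism can be sketched in a few lines of Python. This is a toy simulation of the MapReduce model, not Hadoop's actual API; the function and variable names are all illustrative:

```python
# Toy simulation of MapReduce-style word counting: each "node" maps the
# chunk of data it holds locally, and a reduce phase merges the partial
# results. On a real cluster the map calls run in parallel on separate nodes.
from collections import Counter
from itertools import chain

def map_chunk(lines):
    """Map phase: count words in the chunk this node holds locally."""
    return Counter(chain.from_iterable(line.split() for line in lines))

def reduce_counts(partial_counts):
    """Reduce phase: merge the per-node partial counts into one result."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total

# Three "nodes", each holding its own slice of the data set.
chunks = [
    ["to be or not to be"],
    ["that is the question"],
    ["to stream or not to stream"],
]
partials = [map_chunk(chunk) for chunk in chunks]
totals = reduce_counts(partials)
print(totals["to"])  # 4
```

Because each map call touches only its own chunk, adding nodes shortens the map phase roughly linearly, which is exactly why hundreds of modest machines can outrun one large server.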
- Manage unstructured data: Unlike traditional relational databases that rely on predefined data schemas, the Hadoop Distributed File System (HDFS) allows you to store any format of data in an unstructured manner. This data can be videos, photos, music, streams of social media content, or anything else. That doesn't mean you do not need to plan ahead and figure out which business questions you want to answer, but it definitely means that when new questions arise, it will be much easier for you to make the adjustments that will allow you to answer them.
- Redundancy & High Availability: Hadoop distributes each piece of data to multiple nodes in the cluster (number of copies is configurable) so if one of the nodes fails (which is more likely to happen due to the use of commodity hardware), you will not lose any data. This eliminates the need to use expensive RAID devices or commercial cluster software.
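For example, the number of copies is controlled by the `dfs.replication` property in `hdfs-site.xml`; the value below is Hadoop's common default of three replicas per block:

```xml
<!-- hdfs-site.xml: keep 3 copies of every block across the cluster -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```

With three replicas spread across nodes, the cluster survives the loss of any single machine without data loss, and the missing copies are re-replicated automatically.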
- Mainframe cost reduction: Most companies that manage large data repositories on mainframes seek ways to reduce CPU peaks in order to cut down software license costs. This is often quite a challenge, especially at quarter-end or year-end, or during holiday shopping seasons. By shifting some of that processing activity to Hadoop you can reduce your costs, and sometimes even get the processing done faster.
It will take some time until the Internet of Things hits us all and we are flooded by an unmanageable amount of data that leaves us no choice but to use Big Data technologies. But there is no reason to wait until then when we can adopt these technologies now, for the benefits they provide and the value they add in addressing challenges that are not necessarily volume dependent.