Can we back it up... What is Hadoop?


Can we back it up... What is Hadoop?

First and foremost, Hadoop is software. It’s something you download, install and run. Second, it’s freely available from an open source community (the Apache Software Foundation). Third, it’s a collection of many open source projects that are designed to be integrated to give you never-seen-before data storage and processing power. Folks refer to this whole collection of software and technologies as the “Hadoop ecosystem.”

So, what then is Hortonworks?

Hortonworks is a for-profit company that does the dirty work of pulling together all those open source products in the Hadoop ecosystem and distributing them as the Hortonworks Data Platform (HDP). It’s still free, and Hortonworks makes its money from training and professional services. Hortonworks doesn’t change the software, so HDP is just Hadoop software shrink-wrapped for enterprise customers. It’s called a Hadoop distribution, and there are a number of others, including ones from Cloudera, Teradata, Pivotal and MapR.

What does Hadoop do?

Companies that run Hadoop load it on a collection of commodity computers (run-of-the-mill servers you buy from vendors like Dell or HP). There’s nothing special about them, so they’re relatively cheap! If a company has 10 computers, it’s called a 10-node Hadoop cluster. A 10-node cluster is a starting point. Companies usually expand to 30-, 50-, or 100-node clusters over time. Large companies, like Yahoo, run tens of thousands of nodes in their Hadoop clusters!

Once Hadoop is loaded, companies use the cluster to store data - a lot of data! If you have a 10-node cluster and each node has a 2-terabyte disk drive, your cluster has 20 terabytes of raw storage. That’s a lot. The beauty of Hadoop is that when you need more storage than that, you don’t go and buy fancy (expensive) data storage devices. You just buy more commodity (i.e. cheap) computers and add them to your cluster. Hadoop can now store more data. That’s how Yahoo has grown to 10,000+ nodes.
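The storage arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration: the function names are made up for this example and are not part of Hadoop, and the figures are raw disk capacity, ignoring replication and any other overhead.

```python
import math


def raw_cluster_capacity_tb(nodes, disk_tb_per_node):
    """Raw storage of a cluster: node count times per-node disk size.

    Hypothetical helper for illustration only; not a Hadoop API.
    """
    if nodes < 0 or disk_tb_per_node < 0:
        raise ValueError("nodes and disk size must be non-negative")
    return nodes * disk_tb_per_node


def nodes_needed(target_tb, disk_tb_per_node):
    """Smallest number of nodes whose combined raw disk reaches target_tb.

    Hypothetical helper for illustration only; not a Hadoop API.
    """
    if disk_tb_per_node <= 0:
        raise ValueError("disk size must be positive")
    return math.ceil(target_tb / disk_tb_per_node)


# The example from the text: 10 nodes with a 2-terabyte drive each.
print(raw_cluster_capacity_tb(10, 2))  # 20

# Growing the cluster: nodes required to reach 50 TB with the same drives.
print(nodes_needed(50, 2))  # 25
```

The point of the second function is the one the text makes: to get more storage, you raise the node count rather than buy a different kind of device.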
But that’s not the only benefit. With all that data stored on all those computers, Hadoop can run data analysis programs on it and produce interesting new insights. Companies are learning to find value in these new insights. With 10 nodes in your cluster, Hadoop can have analytics running simultaneously on all 10 nodes. So, in effect, you have a super high-performance computer running on vanilla hardware - and THAT is revolutionary.

Why do companies need Hadoop?

Companies previously stored types of data, such as financial, customer, inventory, and employee data, that could pretty much fit in an Excel spreadsheet. It was numbers and words, pretty much. They don’t actually use Excel, because they need speed and performance for organizing, sorting, reporting and all the other things they do with their data. But the point is, that data was fairly predictable.

Then came social media, and now we’re getting floods of tweets, posts, pictures, videos, music, and other kinds of data that just don’t fit well into something that thinks like a spreadsheet. Also, more and more machines are producing data of their own. Consider factory robots, or automobiles, or all the servers in company data centers. They’re all cranking out different kinds of data that companies KNOW have value, if they could just store and analyze it. That’s what Hadoop does well. It is built to handle these new kinds of data at the volumes we are now seeing. The chart shown above, Most Common New Types of Data, shows some new data types that just crush traditional data storage technologies.

There’s lots of buzz, but is it real?

Yes, it is real! Check out the following data. The big data market is expected to grow about 32% a year to $23.8 billion in 2016. The market for analytics software is predicted to reach $51 billion by 2016. Global spending on big data will grow 48% between 2014 and 2019. Big data revenue will reach $135 billion by the end of 2019.
The numbers speak for themselves. Again, this is real!

Companies already have data storage and analysis tools. How does Hadoop fit in?

Check out Use Case: EDW with Hadoop on the left. This chart shows that Hadoop (labeled the “Big Data Platform”) fits in nicely with the existing Enterprise Data Warehouse that many large companies presently use to house their data. In other words, Hadoop doesn’t REPLACE traditional data storage and analysis tools. It dovetails with them and provides capabilities that traditional systems can’t.

Is Hadoop another name for “The Cloud”?

Nope. When people talk about The Cloud, they’re saying that instead of having their computer resources spread all over the company, they’re going to consolidate software tools and data storage in one central location, which many people call the cloud. Sometimes, they’ll centralize but keep it within their own control. In other words, their own employees will maintain and manage their cloud. That’s called a Private Cloud. Other times, the company will say, “We don’t want the hassle of managing it.” So, companies like Amazon Web Services will “rent” computer resources to those kinds of companies. What providers like Amazon offer is sometimes called a Public Cloud. Hadoop software can be run and operated in either a private OR public cloud.

Got it. For the most part. What’s with all the weird Hadoop names?

Well, it’s kind of a fad to come up with names that may or may not mean anything. The projects definitely want to avoid creating acronyms. That’s very passé. Look at the chart below for some of the project names that are part of the Hadoop ecosystem.

And how does this translate to training?

Well, that is a whole other discussion. Check back NEXT WEEK for more on training.

Posted on: Sat, 11 Oct 2014 14:48:43 +0000
