The new issue of Software Developer's Journal is out now!
Software Developer's Journal presents 'Hadoop and friends', a huge compilation of knowledge on Apache Hadoop and its components!
Read about old and new solutions across the magazine's 100 pages.
Apache Hadoop Ecosystem content list:
Grokking the Menagerie: An Introduction to the Hadoop Software Ecosystem
by Blake Matheny
The Hadoop ecosystem offers a rich set of libraries, applications, and systems with which you can build scalable big data applications. For a newcomer to Hadoop, it can be daunting to understand all the available tools and where they fit in. Knowing the right terms and tools can make getting started with this exciting set of technologies an enjoyable process.
Hands-on with Hadoop
by David Arthur
The Apache Hadoop ecosystem is home to a variety of libraries and back-end services that enable the storage and processing of vast amounts of data. At the lowest level there is a complex and robust Java API. Many higher-level abstractions have emerged in recent years. We will look at the Java API, Apache Hive, Apache Pig, and, as an added bonus, Apache Solr.
Data Collection and Management in the Era of Big Data
by Aaron Kimball
Conventional industries such as telecommunications, healthcare, retail, and finance as well as high-tech sectors like online gaming, video and media streaming and mobile applications are now beginning to accumulate data at web scale. To stay current, software engineers will need to gain expertise in skills for collecting and managing Big Data. In this article we will explore practical lessons in how to collect and manage large data from diverse data sources.
Introduction to MapReduce
by Rémy Saissy
Processing and querying massive amounts of data, indexing the web… We hear about Hadoop and MapReduce a lot these days. But how does MapReduce work? How does one read the output of a job? That is the discussion at hand.
Hadoop: solving problems with MapReduce
by Sergey Enin
A new type of computer science problem has appeared within the last 10–20 years: new technologies provide access to large amounts of data. This raises a question: how can such huge data sets be stored, processed, and analyzed effectively?
Data clustering using MapReduce: A Look at Various Clustering Algorithms Implemented with the MapReduce Paradigm
by Varad Meru
Data clustering has always been one of the most sought-after problems to solve in machine learning. It is highly applicable in industrial settings, and an active community of researchers and developers applies clustering techniques in fields such as recommendation systems, image processing, gene sequence analysis, and text analytics, and in competitions such as the Netflix Prize and Kaggle.
Writing Hive UDFs and serdes
by Alex Dean
In this article you will learn how to write a user-defined function ("UDF") to work with the Apache Hive platform. We will start gently with an introduction to Hive, then move on to developing the UDF and writing tests for it. We will write our UDF in Java, but use Scala’s SBT as our build tool and write our tests in Scala with Specs2.
Distributed coordination with ZooKeeper
by Scott Leberknight
Consider a distributed system with multiple servers, each of which is responsible for holding data and performing operations on that data. This could be a distributed search engine, a distributed build system, or even something like Hadoop which has both a distributed file system and a Map/Reduce data processing framework that operates on the data in the file system.
Curator and Exhibitor: A better way to use and manage Apache ZooKeeper
by Jordan Zimmerman
Apache ZooKeeper is an invaluable component for service-oriented architectures. It effectively solves the difficult problem of distributed coordination (e.g. locking and leader selection). However, it can be tricky to use correctly. Curator and Exhibitor were written to make using and operating ZooKeeper easier and less error-prone.
Bucket Cache: A CMS, Heap Fragmentation and Big Cache on HBase Solution
by Chunhui Shen
In contrast to the current cache mechanisms in HBase (LruBlockCache and SlabCache), this article introduces a new block cache called BucketCache. It can greatly reduce CMS collections and heap fragmentation caused by JVM GC, and it also supports a large cache space for high read performance by using a high-speed disk such as Fusion-io.
Hybrid approach to enable real-time queries to end users
by Benoit Perroud
Since it became an Apache Top Level Project in early 2008, Hadoop has established itself as the de facto industry standard for batch processing. The two layers composing its core, HDFS and MapReduce, are strong building blocks for data processing. Running data analysis and crunching petabytes of data is no longer fiction. But the MapReduce framework does have two major downsides: query latency and data freshness.
One of Alibaba.com's Stories with ZooKeeper
by Fuqiang Wang
Apache ZooKeeper (a.k.a. ZK) is a coordination service mostly for distributed systems. Most people like to compare ZooKeeper with Google’s Chubby, but I don’t think they were made for exactly the same purpose. Why is ZooKeeper a coordination service while Chubby is usually called a distributed lock service?
Pattern, a Machine Learning Library for Cascading
by Paco Nathan
Pattern is a machine learning project within the Cascading API, which is used for building enterprise data workflows. Cascading provides an abstraction layer on top of Apache Hadoop and other distributed computing topologies.
Direction of Hadoop Development
by Ted Yu
Coming back from HBTC 2012 (Hadoop and Big Data Technology Conference), I was overwhelmed by the wide adoption of HBase in China. Prominent companies such as Taobao (Alibaba), Huawei, Intel and IBM are all endorsing HBase. In the US, HBase has actually become NoSQL and this trend is even more visible in China.
Apache Drill: Newcomer in the Hadoop Ecosystem
by Jacques Nadeau & Ted Dunning
Apache Drill is a new open source project that runs ad hoc queries on a variety of data sources. Why Drill? The answer is simple: there is a need for speed. Hadoop is wildly popular and successful, but it’s based on batch processing and has high inherent latency. There is a need for ad hoc, real-time query and analysis of large data sets, and that’s where the speed and flexibility of Drill come in.