This blog gives you an introduction to the Hadoop cluster ecosystem using the HDInsight data service on the Microsoft Azure cloud.

HDInsight is a data service offered by Azure to create a cluster environment for your big data, which is available in a storage container as blobs. HDInsight offers three main cluster types:

  • Hadoop
  • HBase
  • Storm

According to Microsoft, Azure HDInsight deploys and provisions Apache Hadoop clusters in the cloud, providing a software framework designed to manage, analyze, and report on big data. The Hadoop core provides reliable data storage with the Hadoop Distributed File System (HDFS), and a simple MapReduce programming model to process and analyze, in parallel, the data stored in this distributed system.

HDFS uses data replication to address the hardware, software, and OS issues that arise when deploying such highly distributed systems, i.e. cluster nodes.
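To build intuition for how replication makes a cluster fault tolerant, here is a minimal sketch (plain Python, not the HDFS API; the round-robin placement and node names are illustrative assumptions, since real HDFS placement is rack-aware):

```python
REPLICATION_FACTOR = 3  # the HDFS default

def place_replicas(block_id, nodes, replication=REPLICATION_FACTOR):
    """Assign a block to `replication` distinct nodes (simple round-robin here)."""
    return [nodes[(block_id + i) % len(nodes)] for i in range(replication)]

def readable(block_id, placement, live_nodes):
    """A block stays readable as long as at least one replica is on a live node."""
    return any(node in live_nodes for node in placement[block_id])

nodes = ["node1", "node2", "node3", "node4"]
placement = {b: place_replicas(b, nodes) for b in range(8)}

# Even with node2 down, every block still has a live replica.
live = {"node1", "node3", "node4"}
print(all(readable(b, placement, live) for b in placement))  # True
```

With three replicas per block, losing any single node still leaves two copies of every block, which is why HDFS can tolerate the node failures mentioned above.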

Here are the Hadoop technologies in HDInsight dataservice as per Microsoft documentation.

  • Ambari:

    – Apache Ambari provisions, manages, and monitors Apache Hadoop clusters in the Azure cloud. It includes a collection of operator tools and a set of APIs that reduce the complexity of Hadoop and simplify the operation of clusters.

  • Avro:

    – Avro is a data serialization library for the Microsoft .NET environment.

  • HBase:

    – Apache HBase is a non-relational database built on Hadoop and designed for large amounts of unstructured and semi-structured data, spanning millions of rows and columns. HBase clusters on HDInsight are configured to store data directly in storage container blobs, with low latency and increased elasticity.

  • HDFS:

    – The Hadoop Distributed File System (HDFS) is a distributed file system that, together with MapReduce and YARN, forms the core of the Hadoop ecosystem. HDFS is the standard file system used in Hadoop clusters on HDInsight.

  • Hive:

    – Apache Hive is data warehouse software built on Hadoop that allows you to query and manage large datasets in distributed storage using a SQL-like language called HiveQL. Hive translates queries into a series of jobs, and is better suited to structured data than to unstructured data.
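For intuition, a HiveQL aggregation such as `SELECT dept, COUNT(*) FROM employees GROUP BY dept` behaves like the grouping logic below (plain Python, illustrative only; the table and column names are made up, and Hive actually compiles such queries into jobs over the cluster):

```python
from collections import Counter

# Rows of a "table" with a fixed schema -- the structured data Hive suits best.
employees = [
    {"name": "asha", "dept": "sales"},
    {"name": "ravi", "dept": "sales"},
    {"name": "mei",  "dept": "ops"},
]

# Rough equivalent of: SELECT dept, COUNT(*) FROM employees GROUP BY dept
counts = Counter(row["dept"] for row in employees)
print(dict(counts))  # {'sales': 2, 'ops': 1}
```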

  • Mahout:

    – Apache Mahout is a scalable library of machine learning algorithms that run on Hadoop. Using principles of statistics, machine learning applications teach systems to learn from data and to use past outcomes to determine future behavior.

  • MapReduce and YARN:

    – Hadoop MapReduce is a software framework for writing applications that process big data in parallel. It splits large datasets and organizes the data into key-value pairs for processing. YARN is the next generation of MapReduce, and splits resource management and job scheduling into separate entities.
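The key-value processing model described above can be sketched in plain Python with the classic word-count example (illustrative only; real MapReduce jobs are typically written in Java against the Hadoop API, and the shuffle step is performed by the framework):

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in an input split."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: fold each key's values into a single result."""
    return key, sum(values)

lines = ["big data on hadoop", "hadoop stores big data"]
mapped = chain.from_iterable(map_phase(line) for line in lines)
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts["hadoop"])  # 2
```

Because each map call and each reduce call is independent, the framework can run them in parallel across the data nodes of the cluster.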

  • Oozie:

    – Apache Oozie is a workflow coordination system that manages Hadoop jobs. It can also be used to schedule jobs specific to a system, such as Java programs and shell scripts.

  • Pig:

    – Apache Pig is a high-level platform that allows you to perform complex MapReduce transformations on very large datasets using a simple scripting language called Pig Latin. You can also create user-defined functions (UDFs) in Pig Latin to run within Hadoop.

  • Sqoop:

    – Sqoop is a tool that efficiently transfers bulk data between Hadoop and relational databases, such as SQL Server, or other structured datastores.

  • Zookeeper:

    – Apache ZooKeeper coordinates processes in large distributed systems by means of a shared hierarchical namespace of data registers (znodes). Znodes contain the small amounts of metadata needed to coordinate processes.
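The hierarchical namespace of znodes can be pictured as a small tree of path-addressed registers. The following is a toy sketch of the idea in plain Python (not the ZooKeeper client API; the `/locks` paths and payloads are invented for illustration):

```python
class ZNodeTree:
    """Toy znode store: slash-separated paths mapping to small metadata blobs."""

    def __init__(self):
        self.nodes = {"/": b""}  # the root znode

    def create(self, path, data=b""):
        """Create a znode; its parent must already exist, as in ZooKeeper."""
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.nodes:
            raise KeyError(f"parent znode {parent} does not exist")
        self.nodes[path] = data

    def get(self, path):
        return self.nodes[path]

    def children(self, path):
        """List the direct children of a znode."""
        prefix = path.rstrip("/") + "/"
        return sorted(p[len(prefix):] for p in self.nodes
                      if p.startswith(prefix) and "/" not in p[len(prefix):])

tree = ZNodeTree()
tree.create("/locks")
tree.create("/locks/job-1", b"node3")   # e.g. which worker currently holds a lock
print(tree.children("/locks"))          # ['job-1']
```

Processes coordinate by creating, watching, and deleting such small registers rather than by exchanging bulk data.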

Along with these technologies, you can use business intelligence tools such as Excel, PowerPivot, SQL Server Analysis Services, and Reporting Services to retrieve, analyze, and report on data integrated with HDInsight, using either the Power Query add-in or the Microsoft Hive ODBC Driver.

Creating an HDInsight Hadoop Cluster – a quick walkthrough in the Azure cloud:

Prerequisites:

  • A valid, registered Azure account with any one of the supported subscriptions.
  • A valid storage account, and under it a new container which stores big data as blobs.
  • Additional storage accounts (if you require more than one).
  • The number of data nodes, which determines the cluster size (refer to the pricing details before choosing the number of data nodes).
  • A preconfigured virtual network, preferably a LAN network with local IPs for all nodes.

Steps to implement an HDInsight Hadoop cluster:

  • Log in to your Azure account. Select Data Services → HDInsight → Custom Create. In my case I am selecting Custom Create to show all the options for creating the cluster. If you simply want a Hadoop cluster, you can select the Hadoop option here directly.

  • Provide a cluster name (in my case, Vembu hadoop cluster) and select the subscription you are going to use for creating this cluster. Refer to Microsoft's documentation for the different subscription levels and their pricing. Choose the cluster type as Hadoop, and select Hadoop version 3.1, the latest version that supports the YARN service.

  • Type the number of data nodes (in my case I selected 4 nodes), and select the region or virtual network. In my case I had already created a virtual network (LAN network) with virtual network subnets, as in the snapshot.

  • Enter a username and password for the cluster. If you want to use an Azure SQL database for the same Hadoop cluster, select the option “Enter the Hive/Oozie Metastore”.

  • On the next screen, you need to provide the storage details. As already mentioned, I created a storage account with a new container in the storage services. Here “clus2” is the name of the storage account, which holds the new container “vhds”. So I am selecting these inputs as in the screenshot below. I also selected 0 additional storage accounts, which means I use only one storage account, storing the data in the “vhds” container only.

  • Select the tick mark; the cluster environment will then be created and become available to use.

In forthcoming blogs I will explore the implementation of other Hadoop services on these cluster nodes and share the details.
