This compilation of top Hadoop interview questions is your guide to cracking a Hadoop job interview and your key to a Big Data career!
If you have been asked a Hadoop interview question that is not listed here, kindly post it in the comment section.
1. What is Hadoop MapReduce?
The Hadoop MapReduce framework is used to process large data sets in parallel across a Hadoop cluster. Data analysis uses a two-step process: map and reduce.
2. How does Hadoop MapReduce work?
Take word counting as an example: during the map phase, MapReduce counts the words in each document, while during the reduce phase it aggregates those counts across the entire collection of documents. During the map phase, the input data is divided into splits, which are analyzed by map tasks running in parallel across the Hadoop cluster.
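The word-count flow described above can be sketched in plain Python. This is a minimal single-process illustration, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this example, and each input string stands in for one input split.

```python
from collections import defaultdict

def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one split."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(mapped_pairs):
    """Group values by key and sort the keys, as the framework does
    between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))

def reduce_phase(key, values):
    """Reduce step: aggregate all the counts collected for one word."""
    return key, sum(values)

def word_count(documents):
    """Run map over every document, then shuffle, then reduce."""
    mapped = []
    for document in documents:  # on a real cluster these run in parallel
        mapped.extend(map_phase(document))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

Calling `word_count(["big data big cluster", "data node"])` counts each word across both documents.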
3. Explain what shuffling is in MapReduce.
The shuffle is the process by which the system sorts the map outputs and transfers them to the reducers as inputs.
4. Explain what the distributed cache is in the MapReduce framework.
The distributed cache is an important feature provided by the MapReduce framework. DistributedCache is used when you want to share files across all nodes in a Hadoop cluster. The files could be executable JAR files or simple properties files.
5. Explain what NameNode is in Hadoop.
NameNode is the node where Hadoop stores all the file location information for HDFS (Hadoop Distributed File System). In other words, NameNode is the centrepiece of an HDFS file system. It keeps a record of all the files in the file system and tracks the file data across the cluster or multiple machines.
6. Explain what JobTracker is in Hadoop. What actions does Hadoop follow?
In Hadoop, JobTracker is used for submitting and tracking MapReduce jobs. JobTracker runs in its own JVM process.
Hadoop performs the following actions:
- Client applications submit jobs to the JobTracker
- The JobTracker communicates with the NameNode to determine the data location
- The JobTracker locates TaskTracker nodes that are near the data or have available slots
- It submits the work to the chosen TaskTracker nodes
- When a task fails, the JobTracker is notified and decides what to do next
- The TaskTracker nodes are monitored by the JobTracker
7. Explain what a heartbeat is in HDFS.
A heartbeat is a signal sent from a DataNode to the NameNode, and from a TaskTracker to the JobTracker. If the NameNode or JobTracker stops receiving the signal, it assumes there is a problem with that DataNode or TaskTracker.
8. Explain what combiners are and when you should use a combiner in a MapReduce job.
Combiners are used to increase the efficiency of a MapReduce program. A combiner reduces the amount of data that needs to be transferred across to the reducers. If the operation performed is commutative and associative, you can use your reducer code as a combiner. Hadoop does not guarantee that the combiner will be executed.
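To make the combiner's effect concrete, here is a small plain-Python sketch (not Hadoop code; `map_words` and `sum_by_key` are hypothetical names). It assumes a word count, where the reduce operation is addition, so the same function can act as combiner and reducer.

```python
from collections import Counter

def map_words(line):
    """Map step: emit (word, 1) for every word in a line."""
    return [(word, 1) for word in line.split()]

def sum_by_key(pairs):
    """Sum values per key. Addition is commutative and associative,
    so this one function can serve as both combiner and reducer."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

# Two map tasks, each with its own local output.
map_outputs = [map_words("a b a a"), map_words("b b a")]

# Records that would cross the network without and with a combiner.
without_combiner = [pair for output in map_outputs for pair in output]
with_combiner = [pair for output in map_outputs for pair in sum_by_key(output)]

# The final reduce gives the same answer either way.
final_plain = sum_by_key(without_combiner)
final_combined = sum_by_key(with_combiner)
```

Fewer records reach the reducer when the combiner runs, yet the final result is unchanged. That is why the job must stay correct even if Hadoop skips the combiner.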
9. What happens when a DataNode fails?
When a DataNode fails:
- The JobTracker and NameNode detect the failure
- All tasks on the failed node are re-scheduled
- The NameNode replicates the user's data to another node
10. Explain what speculative execution is.
During speculative execution, Hadoop launches a certain number of duplicate tasks. Multiple copies of the same map or reduce task can be executed on different slave nodes. In simple words, if a particular node is taking a long time to complete a task, Hadoop creates a duplicate task on another node. The copy that finishes first is retained, and the copies that do not finish first are killed.
11. Explain what the basic parameters of a Mapper are.
The basic parameters of a Mapper are:
- LongWritable and Text (the input key and value)
- Text and IntWritable (the output key and value)
12. Explain what the function of the MapReduce partitioner is.
The function of the MapReduce partitioner is to make sure that all the values of a single key go to the same reducer, which in turn helps distribute the map output evenly over the reducers.
13. Explain the difference between an Input Split and an HDFS Block.
A Split is a logical division of the data, while an HDFS Block is a physical division of the data.
14. Explain what happens in TextInputFormat.
In TextInputFormat, each line in the text file is a record. The value is the content of the line, while the key is the byte offset of the line. For instance, key: LongWritable, value: Text.
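The key/value shape described above can be illustrated with a short plain-Python sketch. It is not the Hadoop implementation. `text_records` is a hypothetical helper that yields the byte offset of each line as the key and the line's text as the value.

```python
def text_records(data: bytes):
    """Yield (byte_offset, line_text) pairs, one per line of input."""
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        # The key is where the line starts; the value is the line itself
        # without its trailing newline characters.
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)
```

For the input `b"hello\nworld\nhadoop"`, the second record starts at byte offset 6 because the first line occupies five bytes plus a newline.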
15. Mention the main configuration parameters that a user needs to specify to run a MapReduce job.
The user of the MapReduce framework needs to specify:
- The job's input locations in the distributed file system
- The job's output location in the distributed file system
- The class containing the map function
- The class containing the reduce function
- The JAR file containing the mapper, reducer and driver classes
16. Explain what WebDAV is in Hadoop.
WebDAV is a set of extensions to HTTP that supports editing and updating files. On most operating systems, WebDAV shares can be mounted as filesystems, so it is possible to access HDFS as a standard filesystem by exposing HDFS over WebDAV.
17. Explain what Sqoop is in Hadoop.
Sqoop is a tool used to transfer data between relational database management systems (RDBMS) and Hadoop HDFS. Using Sqoop, data can be imported from an RDBMS such as MySQL or Oracle into HDFS, and exported from HDFS files back to an RDBMS.
18. Explain how JobTracker schedules a task.
The TaskTrackers send heartbeat messages to the JobTracker at regular intervals to let the JobTracker know that they are active and functioning. Each message also informs the JobTracker about the number of available slots, so the JobTracker can stay up to date on where in the cluster work can be delegated.
19. Explain what SequenceFileInputFormat is.
SequenceFileInputFormat is used for reading sequence files. A sequence file is a specific compressed binary file format that is optimized for passing data from the output of one MapReduce job to the input of another MapReduce job.
20. Explain what conf.setMapperClass does.
conf.setMapperClass sets the mapper class and everything related to the map job, such as reading the data and generating key-value pairs out of the mapper.
21. Explain what Hadoop is.
It is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides enormous processing power and massive storage for any type of data.
22. Mention what is the difference between an RDBMS and Hadoop?
The “HDFS Block” is the physical division of the data, while the “Input Split” is the logical division of the data. HDFS divides the data into blocks for storage, whereas for processing, MapReduce divides the data into input splits and assigns each one to a mapper function.
23. Mention the Hadoop core components.
Hadoop core components include HDFS and MapReduce.
24. What is NameNode in Hadoop?
NameNode in Hadoop is where Hadoop stores all the file location information for HDFS. It is the master node on which the JobTracker runs, and it holds the metadata.
25. Mention what are the data components used by Hadoop?
Data components used by Hadoop are
26. Mention what is the data storage component used by Hadoop?
The data storage component used by Hadoop is HBase.
27. Mention the most common input formats defined in Hadoop.
The most common input formats defined in Hadoop are the Text Input Format, the Key Value Input Format and the Sequence File Input Format.
28. In Hadoop what is InputSplit?
It splits input files into chunks and assigns each split to a mapper for processing.
29. For a Hadoop job, how will you write a custom partitioner?
To write a custom partitioner for a Hadoop job, follow these steps:
- Create a new class that extends the Partitioner class
- Override the getPartition method
- In the wrapper that runs the MapReduce job, add the custom partitioner to the job by using the setPartitionerClass method, or add the custom partitioner to the job as a config file
30. For a job in Hadoop, is it possible to change the number of mappers to be created?
No, it is not possible to change the number of mappers to be created. The number of mappers is determined by the number of input splits.
31. Explain what a sequence file is in Hadoop.
A sequence file is used to store binary key/value pairs. Unlike a regular compressed file, a sequence file supports splitting even when the data inside the file is compressed.
32. When the NameNode is down, what happens to the JobTracker?
The NameNode is the single point of failure in HDFS, so when the NameNode is down, your cluster becomes unavailable.
33. Explain how indexing is done in HDFS.
Hadoop has a unique way of indexing. Once the data is stored as per the block size, HDFS keeps storing the last part of the data, which says where the next part of the data will be.
34. Explain whether it is possible to search for files using wildcards.
Yes, it is possible to search for files using wildcards.
35. List Hadoop's three configuration files.
The three configuration files are core-site.xml, mapred-site.xml and hdfs-site.xml.
36. Explain how you can check whether the NameNode is working, besides using the jps command.
Besides using the jps command, to check whether the NameNode is working you can also use
37. Explain what “map” and “reducer” are in Hadoop.
In Hadoop, a map is a phase in HDFS query solving. A map reads data from an input location, and outputs a key value pair according to the input type.
In Hadoop, a reducer collects the output generated by the mapper, processes it, and creates a final output of its own.
38. In Hadoop, which file controls reporting?
In Hadoop, the hadoop-metrics.properties file controls reporting.
39. List the network requirements for using Hadoop.
The network requirements for using Hadoop are:
- Password-less SSH connection
- Secure Shell (SSH) for launching server processes
40. Mention what is rack awareness?
Rack awareness is the way in which the NameNode determines how to place blocks based on the rack definitions.
41. Explain what a TaskTracker is in Hadoop.
A TaskTracker in Hadoop is a slave node daemon in the cluster that accepts tasks from a JobTracker. It also sends heartbeat messages to the JobTracker at regular intervals to confirm that the TaskTracker is still alive.
42. Mention which daemons run on the master node and the slave nodes.
The daemon that runs on the master node is “NameNode”
The daemons that run on each slave node are “TaskTracker” and “DataNode”
43. Explain how you can debug Hadoop code.
The popular methods for debugging Hadoop code are:
- By using the web interface provided by the Hadoop framework
- By using Counters
44. Explain what storage and compute nodes are.
The storage node is the machine or computer where your file system resides to store the processing data.
The compute node is the computer or machine where your actual business logic is executed.
45. Mention what is the use of Context Object?
The Context Object enables the mapper to interact with the rest of the Hadoop system. It includes configuration data for the job, as well as interfaces that allow it to emit output.
46. Mention what is the next step after Mapper or MapTask?
The next step after Mapper or MapTask is that the output of the Mapper is sorted and partitions are created for the output.
47. Mention what the default partitioner is in Hadoop.
In Hadoop, the default partitioner is a “Hash” Partitioner.
48. Explain what is the purpose of RecordReader in Hadoop?
In Hadoop, the RecordReader loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper.
49. Explain how is data partitioned before it is sent to the reducer if no custom partitioner is defined in Hadoop?
If no custom partitioner is defined in Hadoop, then a default partitioner computes a hash value for the key and assigns the partition based on the result.
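The hash-then-assign idea can be sketched in plain Python. This is an illustration under stated assumptions, not Hadoop's own code: `java_string_hash` mimics Java's `String.hashCode` as a stand-in for whatever hash the real key class provides, and `get_partition` masks off the sign bit and takes the remainder by the number of reducers, which is the general shape of hash partitioning.

```python
def java_string_hash(text):
    """32-bit hash in the style of Java's String.hashCode (stand-in hash)."""
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # keep the value within 32 bits
    return h

def get_partition(key, num_reducers):
    """Map a key to a reducer index in the range [0, num_reducers)."""
    # Clearing the top bit keeps the value non-negative before the modulo.
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers
```

Because the result depends only on the key and the reducer count, every occurrence of a key lands in the same partition no matter which mapper emitted it.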
50. Explain what happens when Hadoop has spawned 50 tasks for a job and one of the tasks fails.
Hadoop restarts the failed task on some other TaskTracker. Only if the task fails more than the defined limit is the whole job killed.
51. Mention what is the best way to copy files between HDFS clusters?
The best way to copy files between HDFS clusters is by using multiple nodes and the distcp command, so the workload is shared.
52. Mention what is the difference between HDFS and NAS?
HDFS data blocks are distributed across local drives of all machines in a cluster while NAS data is stored on dedicated hardware.
53. Mention how Hadoop is different from other data processing tools?
In Hadoop, you can increase or decrease the number of mappers without worrying about the volume of data to be processed.
54. Mention what the JobConf class does.
The JobConf class separates different jobs running on the same cluster. It handles job-level settings, such as declaring a job in a real environment.
55. Mention the Hadoop MapReduce API contract for a key and value class.
For a key and value class, the Hadoop MapReduce API contract has two parts:
- The value must implement the org.apache.hadoop.io.Writable interface
- The key must implement the org.apache.hadoop.io.WritableComparable interface
56. Mention what are the three modes in which Hadoop can be run?
The three modes in which Hadoop can be run are
- Pseudo distributed mode
- Standalone (local) mode
- Fully distributed mode
57. Mention what the text input format does.
The text input format treats each line of the file as a record. The value is the text of the whole line, while the key identifies the line by its position in the file. The mapper receives the value as a Text parameter and the key as a LongWritable parameter.
58. Mention how many InputSplits are made by the Hadoop framework for files of 64 KB, 65 MB and 127 MB.
Assuming a 64 MB block size, Hadoop will make 5 splits:
- 1 split for the 64 KB file
- 2 splits for the 65 MB file
- 2 splits for the 127 MB file
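The arithmetic behind those counts can be checked with a few lines of Python. This is a simplified model that assumes one split per 64 MB block and rounds up. `count_splits` and `BLOCK_SIZE` are names invented for the example, and a real input format may apply extra rules, such as folding a very small trailing remainder into the previous split.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # assumed block (and split) size: 64 MB

def count_splits(file_size, split_size=BLOCK_SIZE):
    """Number of splits when every split covers at most split_size bytes."""
    if file_size == 0:
        return 0
    return -(-file_size // split_size)  # ceiling division on integers

file_sizes = {
    "64 KB": 64 * 1024,
    "65 MB": 65 * 1024 * 1024,
    "127 MB": 127 * 1024 * 1024,
}
splits = {name: count_splits(size) for name, size in file_sizes.items()}
total_splits = sum(splits.values())
```

Under this model the 65 MB file needs a second split for its last 1 MB, and the 127 MB file still fits in two.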
59. Mention what the distributed cache is in Hadoop.
The distributed cache in Hadoop is a facility provided by the MapReduce framework. It is used to cache files at the time of execution of the job. The framework copies the necessary files to the slave node before the execution of any task at that node.
60. Explain how the Hadoop classpath plays a vital role in stopping or starting Hadoop daemons.
The classpath consists of a list of directories containing the JAR files needed to stop or start daemons.
61. What do the four V’s of Big Data denote?
IBM has a nice, simple explanation for the four critical features of big data:
- Volume – Scale of data
- Velocity – Analysis of streaming data
- Variety – Different forms of data
- Veracity – Uncertainty of data
62. How does big data analysis help businesses increase their revenue? Give an example.
Big data analysis is helping businesses differentiate themselves. For example, Walmart, the world's largest retailer in 2014 in terms of revenue, is using big data analytics to increase its sales through better predictive analytics, customized recommendations and new products launched based on customer preferences and needs. Walmart observed a significant 10% to 15% increase in online sales, amounting to $1 billion in incremental revenue. Many more companies, such as Facebook, Twitter, LinkedIn, Pandora, JPMorgan Chase and Bank of America, are using big data analytics to boost their revenue.
63. Name some companies that use Hadoop.
Yahoo (one of the biggest users and a contributor of more than 80% of the Hadoop code)
64. Differentiate between Structured and Unstructured data.
Data which can be stored in traditional database systems in the form of rows and columns, for example the online purchase transactions can be referred to as Structured Data. Data which can be stored only partially in traditional database systems, for example, data in XML records can be referred to as semi structured data. Unorganized and raw data that cannot be categorized as semi structured or structured data is referred to as unstructured data. Facebook updates, Tweets on Twitter, Reviews, web logs, etc. are all examples of unstructured data.
65. On what concept does the Hadoop framework work?
The Hadoop framework works on the following two core components:
HDFS – Hadoop Distributed File System is the Java-based file system for scalable and reliable storage of large datasets. Data in HDFS is stored in the form of blocks, and it operates on the master-slave architecture.
Hadoop MapReduce – This is a Java-based programming paradigm of the Hadoop framework that provides scalability across various Hadoop clusters. MapReduce distributes the workload into various tasks that can run in parallel. Hadoop jobs perform 2 separate tasks: the map job and the reduce job. The map job breaks down the data sets into key-value pairs or tuples. The reduce job then takes the output of the map job and combines the data tuples into a smaller set of tuples. The reduce job is always performed after the map job is executed.
66. What are the main components of a Hadoop Application?
Hadoop applications draw on a wide range of technologies that provide great advantages in solving complex business problems.
Core components of a Hadoop application are:
- Hadoop Common
- Hadoop MapReduce
Data Access Components are – Pig and Hive
Data Storage Component is – HBase
Data Integration Components are – Apache Flume, Sqoop, Chukwa
Data Management and Monitoring Components are – Ambari, Oozie and Zookeeper.
Data Serialization Components are – Thrift and Avro
Data Intelligence Components are – Apache Mahout and Drill.
67. What is Hadoop streaming?
The Hadoop distribution has a generic application programming interface for writing map and reduce jobs in any desired programming language, such as Python, Perl or Ruby. This is referred to as Hadoop Streaming. Users can create and run jobs with any kind of shell script or executable as the mapper or reducer.
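As a rough illustration of what a streaming mapper and reducer could look like in Python, the sketch below works on lines of text and emits tab-separated key/value lines, which is the usual convention for streaming-style jobs. The names `mapper` and `reducer` are just for this example, and the reducer assumes its input arrives sorted by key, as it would after the shuffle.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' line for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Sum the counts for each word; input must be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"

def run_locally(lines):
    """Local stand-in for the job: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))
```

In an actual streaming job, each function would live in its own script that reads from standard input and prints to standard output. `run_locally` only imitates the sort that the framework performs between the two steps.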
68. What is the best hardware configuration to run Hadoop?
The best configuration for executing Hadoop jobs is dual-core machines or dual processors with 4 GB or 8 GB of RAM that use ECC memory. Hadoop benefits greatly from ECC memory, even though it is not low-end. ECC memory is recommended for running Hadoop because most Hadoop users have experienced various checksum errors when using non-ECC memory. However, the hardware configuration also depends on the workflow requirements and can change accordingly.
69. What are the most commonly defined input formats in Hadoop?
The most common Input Formats defined in Hadoop are:
- Text Input Format – This is the default input format defined in Hadoop.
- Key Value Input Format – This input format is used for plain text files wherein the files are broken down into lines.
- Sequence File Input Format – This input format is used for reading sequence files.
70. What are the steps involved in deploying a big data solution?
- Data Ingestion – The first step in deploying a big data solution is to extract data from different sources, which could be an Enterprise Resource Planning system like SAP, a CRM like Salesforce or Siebel, an RDBMS like MySQL or Oracle, or log files, flat files, documents, images and social media feeds. This data needs to be stored in HDFS. Data can be ingested either through batch jobs that run every 15 minutes, once every night and so on, or through streaming in real time with latencies from 100 ms to 120 seconds.
- Data Storage – The subsequent step after ingesting data is to store it either in HDFS or NoSQL database like HBase. HBase storage works well for random read/write access whereas HDFS is optimized for sequential access.
- Data Processing – The final step is to process the data using one of the processing frameworks such as MapReduce, Spark, Pig or Hive.
71. How will you choose various file formats for storing and processing data using Apache Hadoop ?
The decision to choose a particular file format is based on the following factors-
- Schema evolution to add, alter and rename fields.
- Usage pattern like accessing 5 columns out of 50 columns vs accessing most of the columns.
- Splittability to be processed in parallel.
- Read/Write/Transfer performance vs block compression saving storage space
File formats that can be used with Hadoop include CSV, JSON, columnar formats, sequence files, Avro and Parquet files.
CSV files – CSV files are an ideal fit for exchanging data between Hadoop and external systems. It is advisable not to use header and footer lines when using CSV files.
JSON files – In JSON files, every record stands on its own. JSON stores both data and schema together in each record and also enables complete schema evolution and splittability. However, JSON files do not support block-level compression.
Avro files – This kind of file format is best suited for long-term storage with a schema. Avro files store metadata with data and also let you specify an independent schema for reading the files.
Parquet files – A columnar file format that supports block-level compression and is optimized for query performance, as it allows the selection of 10 or fewer columns from records with 50+ columns.
72. What is Big Data?
Big data is defined as the voluminous amount of structured, unstructured or semi-structured data that has huge potential for mining but is so large that it cannot be processed using traditional database systems. Big data is characterized by its high velocity, volume and variety that requires cost effective and innovative methods for information processing to draw meaningful business insights. More than the volume of the data – it is the nature of the data that defines whether it is considered as Big Data or not.
73. What is a block and block scanner in HDFS?
Block – The minimum amount of data that can be read or written is generally referred to as a “block” in HDFS. The default size of a block in HDFS is 64 MB in Hadoop 1.x (128 MB in Hadoop 2.x and later).
Block Scanner – Block Scanner tracks the list of blocks present on a DataNode and verifies them to find any kind of checksum errors. Block Scanners use a throttling mechanism to reserve disk bandwidth on the datanode.
74. Explain the difference between NameNode, Backup Node and Checkpoint NameNode.
NameNode: NameNode is at the heart of the HDFS file system and manages the metadata, i.e. the data of the files is not stored on the NameNode; rather, it holds the directory tree of all the files present in the HDFS file system on a Hadoop cluster. NameNode uses two files for the namespace:
fsimage file – It keeps track of the latest checkpoint of the namespace.
edits file – It is a log of changes that have been made to the namespace since the checkpoint.
Checkpoint Node: Checkpoint Node keeps track of the latest checkpoint in a directory that has the same structure as the NameNode's directory. Checkpoint Node creates checkpoints for the namespace at regular intervals by downloading the edits and fsimage files from the NameNode and merging them locally. The new image is then uploaded back to the active NameNode.
Backup Node: Backup Node also provides checkpointing functionality like that of the Checkpoint Node, but it also maintains its own up-to-date in-memory copy of the file system namespace that is in sync with the active NameNode.
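The checkpoint merge of fsimage and edits can be pictured with a toy model in Python. This is only a sketch: the dictionary standing in for the namespace, the three operation names and the `apply_edits` function are all assumptions made for illustration, and the real on-disk formats are far richer.

```python
def apply_edits(fsimage, edits):
    """Replay an edit log on top of a namespace image and return the
    new image. Paths map to a value such as a replication factor."""
    namespace = dict(fsimage)  # work on a copy, like a local merge
    for op, path, *args in edits:
        if op == "create":
            namespace[path] = args[0]
        elif op == "delete":
            namespace.pop(path, None)
        elif op == "rename":
            namespace[args[0]] = namespace.pop(path)
        else:
            raise ValueError(f"unknown edit operation: {op}")
    return namespace

old_image = {"/a.txt": 3}
edit_log = [
    ("create", "/b.txt", 2),
    ("rename", "/a.txt", "/c.txt"),
    ("delete", "/b.txt"),
]
new_image = apply_edits(old_image, edit_log)
```

After the merge, the new image reflects every logged change, so the edit log can start again from empty.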
75. What is commodity hardware?
Commodity hardware refers to inexpensive systems that do not have high availability or high quality. Commodity hardware still includes RAM, because there are specific services that need to be executed in RAM. Hadoop can run on any commodity hardware and does not require supercomputers or a high-end hardware configuration to execute jobs.
76. What is the port number for NameNode, Task Tracker and Job Tracker?
NameNode 50070
Job Tracker 50030
Task Tracker 50060
77. Explain the process of inter-cluster data copying.
HDFS provides a distributed data copying facility through DistCP from source to destination. If the data is copied between Hadoop clusters, it is referred to as inter-cluster data copying. DistCP requires both source and destination to have a compatible or the same version of Hadoop.
78. How can you overwrite the replication factors in HDFS?
The replication factor in HDFS can be modified or overwritten in 2 ways:
- Using the Hadoop FS shell, the replication factor can be changed on a per-file basis using the command below:
$ hadoop fs -setrep -w 2 /my/test_file (test_file is the filename whose replication factor will be set to 2)
- Using the Hadoop FS shell, the replication factor of all files under a given directory can be modified using the command below:
$ hadoop fs -setrep -w 5 /my/test_dir (test_dir is the name of the directory, and all the files in this directory will have a replication factor set to 5)
79. Explain the difference between NAS and HDFS.
NAS runs on a single machine, so there is no data redundancy, whereas HDFS runs on a cluster of different machines, so there is data redundancy because of the replication protocol.
NAS stores data on dedicated hardware, whereas in HDFS all the data blocks are distributed across the local drives of the machines.
In NAS, data is stored independently of the computation, and hence Hadoop MapReduce cannot be used for processing, whereas HDFS works with Hadoop MapReduce because the computations are moved to the data.
80. Explain what happens if during the PUT operation, HDFS block is assigned a replication factor 1 instead of the default value 3.
Replication factor is a property of HDFS that can be set for the entire cluster to adjust the number of times the blocks are replicated, to ensure high data availability. For every block that is stored in HDFS with a replication factor of n, the cluster will have n-1 duplicated blocks. So, if the replication factor during the PUT operation is set to 1 instead of the default value 3, there will be a single copy of the data. Under these circumstances, if the DataNode crashes, the only copy of the data would be lost.
81. What is the process to change the files at arbitrary locations in HDFS?
HDFS does not support modifications at arbitrary offsets in a file, nor multiple writers. Files are written by a single writer in append-only format, i.e. writes to a file in HDFS are always made at the end of the file.
82. Explain about the indexing process in HDFS.
Indexing process in HDFS depends on the block size. HDFS stores the last part of the data that further points to the address where the next part of data chunk is stored.
83. What is a rack awareness and on what basis is data stored in a rack?
All the data nodes put together form a storage area i.e. the physical location of the data nodes is referred to as Rack in HDFS. The rack information i.e. the rack id of each data node is acquired by the NameNode. The process of selecting closer data nodes depending on the rack information is known as Rack Awareness.
The contents of the file are divided into data blocks as soon as the client is ready to load the file into the Hadoop cluster. After consulting with the NameNode, the client allocates 3 data nodes for each data block. For each data block, there exist 2 copies in one rack, and the third copy is present in another rack. This is generally referred to as the Replica Placement Policy.
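The "two copies on one rack, one on another" rule can be sketched as follows. This is a deterministic toy in Python, not the NameNode's algorithm. `place_replicas`, the rack names and the node names are hypothetical, and it assumes the first copy stays on the writer's rack while the other two share a single remote rack, whereas a real cluster also weighs load and picks nodes at random.

```python
def place_replicas(nodes_by_rack, writer_rack):
    """Pick three data nodes for one block: one on the writer's rack
    and two on a single different rack."""
    first = nodes_by_rack[writer_rack][0]
    # Choose the first other rack (in name order) with at least two nodes.
    remote_rack = next(
        rack
        for rack in sorted(nodes_by_rack)
        if rack != writer_rack and len(nodes_by_rack[rack]) >= 2
    )
    second, third = nodes_by_rack[remote_rack][:2]
    return [first, second, third]

racks = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
chosen = place_replicas(racks, "rack1")
```

Losing either rack entirely still leaves at least one copy of the block reachable.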
84. What happens to a NameNode that has no data?
There does not exist any NameNode without data. If it is a NameNode then it should have some sort of data in it.
85. What happens when a user submits a Hadoop job while the NameNode is down? Does the job go on hold or does it fail?
The Hadoop job fails when the NameNode is down.
86. What happens when a user submits a Hadoop job while the Job Tracker is down? Does the job go on hold or does it fail?
The Hadoop job fails when the Job Tracker is down.
87. Whenever a client submits a hadoop job, who receives it?
NameNode receives the Hadoop job which then looks for the data requested by the client and provides the block information. JobTracker takes care of resource allocation of the hadoop job to ensure timely completion.
88. What do you understand by edge nodes in Hadoop?
Edge nodes are the interface between the Hadoop cluster and the external network. Edge nodes are used for running cluster administration tools and client applications. Edge nodes are also referred to as gateway nodes.
89. Explain the usage of Context Object.
Context Object is used to help the mapper interact with other Hadoop systems. It can be used for updating counters, reporting progress and providing application-level status updates. The Context Object holds the configuration details for the job, as well as interfaces that help it generate the output.
90. What are the core methods of a Reducer?
The 3 core methods of a reducer are:
- setup() – This method of the reducer is used for configuring various parameters like the input data size, distributed cache, heap size, etc.
Function definition: public void setup(context)
- reduce() – This is the heart of the reducer; it is called once per key with the associated reduce task.
Function definition: public void reduce(Key, Value, context)
- cleanup() – This method is called only once at the end of the reduce task for clearing all the temporary files.
Function definition: public void cleanup(context)
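The order in which these three methods are called can be mimicked in plain Python. The sketch below is not the Hadoop API: `SumReducer`, `run_reduce_task` and the list used as a stand-in context are invented for illustration. What it shows is the lifecycle: setup once, reduce once per key, cleanup once.

```python
class SumReducer:
    """Toy reducer with the same three lifecycle hooks."""

    def setup(self, context):
        # Runs once before any key is processed.
        self.keys_seen = 0

    def reduce(self, key, values, context):
        # Runs once per key with all the values for that key.
        self.keys_seen += 1
        context.append((key, sum(values)))

    def cleanup(self, context):
        # Runs once after the last key has been processed.
        context.append(("__keys_seen__", self.keys_seen))

def run_reduce_task(reducer, grouped_input):
    """Drive the reducer the way a reduce task would."""
    context = []  # stand-in for the output collector
    reducer.setup(context)
    for key, values in grouped_input:
        reducer.reduce(key, values, context)
    reducer.cleanup(context)
    return context

output = run_reduce_task(SumReducer(), [("a", [1, 2]), ("b", [5])])
```

The final entry written by `cleanup` is only there to make the call order visible in the output.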
91. Explain the partitioning, shuffle and sort phases.
Shuffle Phase – Once the first map tasks are completed, the nodes continue to perform several other map tasks and also exchange the intermediate outputs with the reducers as required. This process of moving the intermediate outputs of map tasks to the reducer is referred to as shuffling.
Sort Phase – Hadoop MapReduce automatically sorts the set of intermediate keys on a single node before they are given as input to the reducer.
Partitioning Phase – The process that determines which intermediate keys and values will be received by each reducer instance is referred to as partitioning. The destination partition is the same for any given key, irrespective of the mapper instance that generated it.
92. How to write a custom partitioner for a Hadoop MapReduce job?
Steps to write a custom partitioner for a Hadoop MapReduce job:
- A new class must be created that extends the pre-defined Partitioner class.
- The getPartition method of the Partitioner class must be overridden.
- The custom partitioner can be added to the job as a config file in the wrapper that runs Hadoop MapReduce, or it can be added to the job by using the set method of the Partitioner class.
93. What are side data distribution techniques in Hadoop?
The extra read only data required by a hadoop job to process the main dataset is referred to as side data. Hadoop has two side data distribution techniques –
- Using the job configuration – This technique should not be used for transferring more than a few kilobytes of data, as it can put pressure on the memory usage of the Hadoop daemons, particularly if your system is running several Hadoop jobs.
- Distributed Cache – Rather than serializing side data using the job configuration, it is suggested to distribute data using hadoop’s distributed cache mechanism.
94. When should you use HBase and what are the key components of HBase?
HBase should be used when the big data application has:
- A variable schema
- Data stored in the form of collections
- A need for key-based access to data while retrieving
Key components of HBase are:
- Region – This component contains the memory data store and HFile.
- Region Server – This monitors the Region.
- HBase Master – It is responsible for monitoring the region server.
- Zookeeper – It takes care of the coordination between the HBase Master component and the client.
- Catalog Tables – The two important catalog tables are ROOT and META. The ROOT table tracks where the META table is, and the META table stores all the regions in the system.
95. What are the different operational commands in HBase at record level and table level?
Record-level operational commands in HBase are put, get, increment, scan and delete.
Table-level operational commands in HBase are describe, list, drop, disable and scan.
96. What is Row Key?
Every row in an HBase table has a unique identifier known as RowKey. It is used for grouping cells logically and it ensures that all cells that have the same RowKeys are co-located on the same server. RowKey is internally regarded as a byte array.
97. Explain the difference between RDBMS data model and HBase data model.
RDBMS is a schema based database whereas HBase is schema less data model.
RDBMS does not have support for in-built partitioning whereas in HBase there is automated partitioning.
RDBMS stores normalized data whereas HBase stores de-normalized data.
98. Explain about the different catalog tables in HBase?
The two important catalog tables in HBase are ROOT and META. The ROOT table tracks where the META table is, and the META table stores all the regions in the system.
99. What are column families? What happens if you alter the block size of ColumnFamily on an already populated database?
The logical division of data is represented through a key known as the column family. Column families are the basic unit of physical storage on which compression features can be applied. In an already populated database, when the block size of a column family is altered, the old data remains within the old block size whereas new data that comes in takes the new block size. When compaction takes place, the old data takes the new block size so that the existing data is read correctly.
100. Explain the difference between HBase and Hive.
HBase and Hive are completely different Hadoop based technologies. Hive is a data warehouse infrastructure on top of Hadoop whereas HBase is a NoSQL key value store that runs on top of Hadoop. Hive helps SQL savvy people to run MapReduce jobs whereas HBase supports 4 primary operations – put, get, scan and delete. HBase is ideal for real time querying of big data whereas Hive is an ideal choice for analytical querying of data collected over a period of time.
101. Explain the process of row deletion in HBase.
On issuing a delete command in HBase through the HBase client, data is not actually deleted from the cells but rather the cells are made invisible by setting a tombstone marker. The deleted cells are removed at regular intervals during compaction.
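As a rough illustration of the behaviour described above, the following Python sketch models a store in which a delete only hides a cell and compaction removes it later. It is a toy model, not the HBase client API: the class TinyStore and its methods are hypothetical names, and versions and timestamps are left out.

```python
class TinyStore:
    """Toy model of tombstone deletes (hypothetical, not the HBase API)."""

    def __init__(self):
        self.cells = {}          # (row, column) -> value
        self.tombstones = set()  # keys hidden by a delete marker

    def put(self, row, column, value):
        self.cells[(row, column)] = value

    def delete(self, row, column):
        # Nothing is removed yet: the cell is only marked as deleted.
        self.tombstones.add((row, column))

    def get(self, row, column):
        key = (row, column)
        if key in self.tombstones:
            return None  # the cell is invisible to readers
        return self.cells.get(key)

    def compact(self):
        # Compaction physically drops marked cells and their markers.
        for key in self.tombstones:
            self.cells.pop(key, None)
        self.tombstones.clear()


store = TinyStore()
store.put("row1", "cf:name", "alice")
store.delete("row1", "cf:name")
hidden_value = store.get("row1", "cf:name")
still_stored = ("row1", "cf:name") in store.cells
store.compact()
stored_after_compaction = ("row1", "cf:name") in store.cells
```

In this model a get after the delete returns nothing even though the cell is still held in storage; it only disappears once compact() runs.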
102. What are the different types of tombstone markers in HBase for deletion?
There are 3 different types of tombstone markers in HBase for deletion –
- Family Delete Marker – This marker marks all columns of a column family.
- Version Delete Marker – This marker marks a single version of a column.
- Column Delete Marker – This marker marks all the versions of a column.
103. Explain about HLog and WAL in HBase.
All edits in the HStore are stored in the HLog. Every region server has one HLog, which contains entries for the edits of all regions served by that Region Server. WAL stands for Write Ahead Log, in which all the HLog edits are written immediately. In case of deferred log flush, WAL edits remain in memory till the flush period.
104. Explain about some important Sqoop commands other than import and export.
Create Job (--create)
Here we are creating a job with the name myjob, which imports table data from an RDBMS table to HDFS. The following command creates a job that imports data from the employee table in the db database to HDFS.
$ sqoop job --create myjob \
-- import \
--connect jdbc:mysql://localhost/db \
--username root \
--table employee -m 1
Verify Job (--list)
The ‘--list’ argument is used to verify the saved jobs. The following command lists the saved Sqoop jobs.
$ sqoop job --list
Inspect Job (--show)
The ‘--show’ argument is used to inspect a particular job and its details. The following command shows the details of a job called myjob.
$ sqoop job --show myjob
Execute Job (--exec)
The ‘--exec’ option is used to execute a saved job. The following command executes a saved job called myjob.
$ sqoop job --exec myjob
105. How Sqoop can be used in a Java program?
The Sqoop jar should be included in the classpath of the Java program. After this, the Sqoop.runTool() method must be invoked. The necessary parameters should be passed to Sqoop programmatically, just as they would be on the command line.
106. What is the process to perform an incremental data load in Sqoop?
An incremental data load in Sqoop synchronizes the modified or updated data (often referred to as delta data) from the RDBMS to Hadoop. The delta data can be loaded through the incremental load command in Sqoop.
Incremental load can be performed by using the Sqoop import command or by loading the data into Hive without overwriting it. The different attributes that need to be specified during incremental load in Sqoop are –
- Mode (--incremental) – The mode defines how Sqoop will determine what the new rows are. The mode can have the value append or lastmodified.
- Col (--check-column) – This attribute specifies the column that should be examined to find out the rows to be imported.
- Value (--last-value) – This denotes the maximum value of the check column from the previous import operation.
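The three attributes above can be pictured with a small Python sketch of the row selection. This is only a model of the idea, not Sqoop code: the function names and the sample rows are assumptions made for illustration.

```python
def select_incremental(rows, mode, check_column, last_value):
    """Return the rows a hypothetical incremental import would pick up."""
    if mode not in ("append", "lastmodified"):
        raise ValueError("mode must be 'append' or 'lastmodified'")
    # Both modes compare the check column with the value saved by the
    # previous run. They differ in what the column holds: a growing id
    # for append, a modification timestamp for lastmodified.
    return [row for row in rows if row[check_column] > last_value]


def next_last_value(rows, check_column, last_value):
    """Largest check-column value seen so far, to be used by the next run."""
    return max([row[check_column] for row in rows] + [last_value])


employees = [
    {"id": 1, "name": "asha"},
    {"id": 2, "name": "ben"},
    {"id": 3, "name": "chen"},
]
delta = select_incremental(employees, "append", "id", last_value=1)
new_last_value = next_last_value(employees, "id", last_value=1)
```

With a previous last value of 1, only the rows with ids 2 and 3 are selected, and 3 becomes the last value for the following run.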
107. Is it possible to do an incremental import using Sqoop?
Yes, Sqoop supports two types of incremental imports –
To insert only new rows, append should be used in the import command; to insert new rows and also update existing ones, lastmodified should be used in the import command.
108. How can you check all the tables present in a single database using Sqoop?
The command to check the list of all tables present in a single database using Sqoop is as follows-
sqoop list-tables --connect jdbc:mysql://localhost/user
109. How are large objects handled in Sqoop?
Sqoop provides the capability to store large sized data in a single field based on the type of data. Sqoop supports the ability to store –
- CLOBs – Character Large Objects
- BLOBs – Binary Large Objects
Large objects in Sqoop are handled by importing them into a file referred to as a “LobFile”, i.e. Large Object File. The LobFile can store records of huge size, so each record in the LobFile is a large object.
110. Can free form SQL queries be used with Sqoop import command? If yes, then how can they be used?
Sqoop allows us to use free form SQL queries with the import command. The import command should be used with the -e or --query option to execute free form SQL queries. When using the -e or --query option with the import command, the --target-dir value must be specified.
111. Differentiate between Sqoop and distCP.
DistCP utility can be used to transfer data between clusters whereas Sqoop can be used to transfer data only between Hadoop and RDBMS.
112. What are the limitations of importing RDBMS tables into Hcatalog directly?
There is an option to import RDBMS tables into HCatalog directly by using the --hcatalog-database option together with --hcatalog-table, but the limitation is that several arguments such as --as-avrofile, --direct, --as-sequencefile, --target-dir and --export-dir are not supported.
113. Is it suggested to place the data transfer utility Sqoop on an edge node?
It is not suggested to place Sqoop on an edge node or gateway node because the high data transfer volumes could put at risk the ability of Hadoop services on the same node to communicate. Messages are the lifeblood of any Hadoop service, and high latency could result in the whole node being cut off from the Hadoop cluster.
114. Explain about the core components of Flume.
The core components of Flume are –
Event – The single log entry or unit of data that is transported.
Source – The component through which data enters Flume workflows.
Sink – It is responsible for transporting data to the desired destination.
Channel – It is the duct between the Source and the Sink.
Agent – Any JVM that runs Flume.
Client – The component that transmits events to the source that operates within the agent.
115. Does Flume provide 100% reliability to the data flow?
Yes, Apache Flume provides end to end reliability because of its transactional approach in data flow.
116. How can Flume be used with HBase?
Apache Flume can be used with HBase using one of the two HBase sinks –
HBaseSink (org.apache.flume.sink.hbase.HBaseSink) supports secure HBase clusters and also the new HBase IPC that was introduced in HBase 0.96.
AsyncHBaseSink (org.apache.flume.sink.hbase.AsyncHBaseSink) has better performance than HBaseSink as it makes non-blocking calls to HBase.
Working of the HBaseSink –
In HBaseSink, a Flume Event is converted into HBase Increments or Puts. The serializer implements HBaseEventSerializer and is instantiated when the sink starts. For every event, the sink calls the initialize method in the serializer, which then translates the Flume Event into HBase Increments and Puts to be sent to the HBase cluster.
Working of the AsyncHBaseSink –
The serializer for AsyncHBaseSink implements AsyncHBaseEventSerializer. The initialize method is called only once by the sink when it starts. The sink invokes the setEvent method and then calls the getIncrements and getActions methods, similar to the HBase sink. When the sink stops, the cleanUp method of the serializer is called.
117. Explain about the different channel types in Flume. Which channel type is faster?
The 3 different built in channel types available in Flume are –
- MEMORY Channel – Events are read from the source into memory and passed to the sink.
- JDBC Channel – JDBC Channel stores the events in an embedded Derby database.
- FILE Channel – File Channel writes the contents to a file on the file system after reading the event from a source. The file is deleted only after the contents are successfully delivered to the sink.
MEMORY Channel is the fastest channel among the three; however, it carries the risk of data loss. The channel that you choose depends on the nature of the big data application and the value of each event.
118. Which is the reliable channel in Flume to ensure that there is no data loss?
FILE Channel is the most reliable channel among the 3 channels JDBC, FILE and MEMORY.
119. Explain about the replication and multiplexing selectors in Flume.
Channel Selectors are used to handle multiple channels. Based on the Flume header value, an event can be written just to a single channel or to multiple channels. If a channel selector is not specified to the source then by default it is the Replicating selector. Using the replicating selector, the same event is written to all the channels in the source’s channels list. Multiplexing channel selector is used when the application has to send different events to different channels.
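The difference between the two selectors can be sketched in a few lines of Python. This is a simplified model rather than Flume code: the function names, the header name "region" and the channel names are assumptions chosen for the example.

```python
def replicating_selector(event, channels):
    """Every channel in the source's channel list receives the event."""
    return list(channels)


def multiplexing_selector(event, header, mapping, default):
    """Pick channels from the value of one event header."""
    value = event["headers"].get(header)
    return mapping.get(value, default)


channels = ["c1", "c2", "c3"]
event = {"headers": {"region": "eu"}, "body": b"log line"}

replicated_to = replicating_selector(event, channels)
routed_to = multiplexing_selector(
    event,
    header="region",
    mapping={"eu": ["c1"], "us": ["c2"]},
    default=["c3"],
)
unmatched = multiplexing_selector(
    {"headers": {}, "body": b""},
    header="region",
    mapping={"eu": ["c1"], "us": ["c2"]},
    default=["c3"],
)
```

The replicating selector ignores the headers entirely, while the multiplexing selector falls back to a default channel when the header value has no mapping.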
120. How can a multi-hop agent be set up in Flume?
The Avro RPC Bridge mechanism is used to set up a multi-hop agent in Apache Flume.
121. What is the standard location or path for Hadoop Sqoop scripts?
122. Does Apache Flume provide support for third party plug-ins?
Yes. Apache Flume has a plug-in based architecture, which is why most data analysts use it: it can load data from external sources and transfer it to external destinations.
123. Is it possible to leverage real time analysis on the big data collected by Flume directly? If yes, then explain how.
Data from Flume can be extracted, transformed and loaded in real-time into Apache Solr servers using MorphlineSolrSink.
124. Differentiate between FileSink and FileRollSink.
The major difference between HDFS FileSink and FileRollSink is that HDFS File Sink writes the events into the Hadoop Distributed File System (HDFS) whereas File Roll Sink stores the events into the local file system.
125. Can Apache Kafka be used without Zookeeper?
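In the ZooKeeper-based versions of Kafka that this guide assumes, it is not possible to use Apache Kafka without ZooKeeper, because if ZooKeeper is down Kafka cannot serve client requests. Newer Kafka releases can also run in KRaft mode, which does not need ZooKeeper.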
126. Name a few companies that use Zookeeper.
Yahoo, Solr, Helprace, Neo4j, Rackspace
127. What is the role of Zookeeper in HBase architecture?
In HBase architecture, ZooKeeper is the monitoring server that provides different services like tracking server failures and network partitions, maintaining configuration information, establishing communication between the clients and region servers, and using ephemeral nodes to identify the available servers in the cluster.
128. Explain about ZooKeeper in Kafka
Apache Kafka uses ZooKeeper to be a highly distributed and scalable system. ZooKeeper is used by Kafka to store various configurations and use them across the Hadoop cluster in a distributed manner. To achieve distributed-ness, configurations are distributed and replicated throughout the leader and follower nodes in the ZooKeeper ensemble. We cannot connect to Kafka directly by bypassing ZooKeeper, because if ZooKeeper is down Kafka will not be able to serve client requests.
129. Explain how Zookeeper works
ZooKeeper is referred to as the King of Coordination, and distributed applications use ZooKeeper to store and facilitate important configuration information updates. ZooKeeper works by coordinating the processes of distributed applications. ZooKeeper is a robust replicated synchronization service with eventual consistency. A set of nodes is known as an ensemble, and persisted data is distributed between multiple nodes.
3 or more independent servers collectively form a ZooKeeper cluster and elect a master. A client connects to any one of the servers and migrates to another if that node fails. The ensemble of ZooKeeper nodes is alive as long as the majority of nodes are working. The master node in ZooKeeper is selected dynamically by consensus within the ensemble, so if the master node fails, the role of master migrates to another node that is selected dynamically. Writes are linear and reads are concurrent in ZooKeeper.
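The majority rule above can be written down as a small calculation. The following Python sketch is only arithmetic on ensemble sizes; the function names are made up for this example and are not part of ZooKeeper.

```python
def has_quorum(ensemble_size, alive):
    """True while a strict majority of the ensemble is still working."""
    return alive > ensemble_size // 2


def tolerated_failures(ensemble_size):
    """How many servers can fail before the majority is lost."""
    return (ensemble_size - 1) // 2


three_node_ok = has_quorum(3, 2)       # 2 of 3 is a majority
three_node_down = has_quorum(3, 1)     # 1 of 3 is not
four_node_split = has_quorum(4, 2)     # exactly half is not a majority
```

One consequence is that a 4-server ensemble tolerates no more failures than a 3-server one, which is why odd ensemble sizes are the usual choice.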
130. List some examples of Zookeeper use cases.
Found by Elastic uses Zookeeper comprehensively for resource allocation, leader election, high priority notifications and discovery. The entire Found service is built up of various systems that read from and write to Zookeeper.
Apache Kafka, which depends on ZooKeeper, is used by LinkedIn.
Storm, which relies on ZooKeeper, is used by popular companies like Groupon and Twitter.
131. What are watches?
Client disconnection can be a troublesome problem, especially when we need to keep track of the state of Znodes at regular intervals. ZooKeeper has an event system referred to as a watch, which can be set on a Znode to trigger an event whenever the Znode is removed or altered, or whenever new children are created below it.
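A watch can be pictured as a callback registered on a znode. The Python sketch below is a toy model and not a ZooKeeper client: the class WatchedZnode is hypothetical, and it assumes the classic one-time trigger behaviour, where a watch fires once and must be set again to receive further events.

```python
class WatchedZnode:
    """Toy znode with one-shot watches (hypothetical, not a real client)."""

    def __init__(self, data=b""):
        self.data = data
        self._watchers = []

    def get(self, watch=None):
        # Reading the data can optionally leave a watch behind.
        if watch is not None:
            self._watchers.append(watch)
        return self.data

    def set(self, data):
        self.data = data
        # Fire each registered watch once, then forget it.
        fired, self._watchers = self._watchers, []
        for callback in fired:
            callback("changed")


events = []
node = WatchedZnode(b"v1")
node.get(watch=events.append)
node.set(b"v2")   # triggers the watch
node.set(b"v3")   # no watch is registered any more
```

Because the watch is consumed by the first change, the second update goes unnoticed unless the client reads the znode again and sets a new watch.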
132. What problems can be addressed by using Zookeeper?
In the development of distributed systems, creating custom protocols for coordinating the Hadoop cluster often results in failure and frustration for the developers. The architecture of a distributed system can be prone to deadlocks, inconsistency and race conditions. This leads to various difficulties in making the Hadoop cluster fast, reliable and scalable. To address all such problems, Apache ZooKeeper can be used as a coordination service to write correct distributed applications without having to reinvent the wheel.
133. What are different modes of execution in Apache Pig?
Apache Pig runs in 2 modes- one is the “Pig (Local Mode) Command Mode” and the other is the “Hadoop MapReduce (Java) Command Mode”. Local Mode requires access to only a single machine where all files are installed and executed on a local host whereas MapReduce requires accessing the Hadoop cluster.
134. Explain about co-group in Pig.
COGROUP operator in Pig is used to work with multiple tuples. COGROUP operator is applied on statements that contain or involve two or more relations. The COGROUP operator can be applied on up to 127 relations at a time. When using the COGROUP operator on two tables at once-Pig first groups both the tables and after that joins the two tables on the grouped columns.
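To make the grouping step concrete, here is a small Python sketch of what a co-group of two relations produces. It is an illustration written for this answer, not Pig Latin or Pig internals; relations are modelled as lists of tuples and the key is given by position.

```python
from collections import defaultdict


def cogroup(rel_a, key_a, rel_b, key_b):
    """Group two relations by key; each key maps to (tuples_a, tuples_b)."""
    groups = defaultdict(lambda: ([], []))
    for t in rel_a:
        groups[t[key_a]][0].append(t)
    for t in rel_b:
        groups[t[key_b]][1].append(t)
    return {key: groups[key] for key in sorted(groups)}


orders = [(1, "book"), (2, "pen")]
payments = [(1, "card"), (3, "cash")]
grouped = cogroup(orders, 0, payments, 0)
```

Every key that appears in either relation gets one entry, and a relation with no matching tuples contributes an empty list for that key.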
135. Explain about the SMB Join in Hive.
In SMB join in Hive, each mapper reads a bucket from the first table and the corresponding bucket from the second table and then a merge sort join is performed. Sort Merge Bucket (SMB) join in hive is mainly used as there is no limit on file or partition or table join. SMB join can best be used when the tables are large. In SMB join the columns are bucketed and sorted using the join columns. All tables should have the same number of buckets in SMB join.
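The per-bucket merge step can be sketched in Python as follows. This is a simplified model of a sort merge bucket join under the assumptions stated in the answer (both tables bucketed and sorted on the join column, same number of buckets); the function names and the sample data are hypothetical.

```python
def merge_join(left, right):
    """Join two lists of (key, value) pairs that are already sorted by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left row of this key with every right row.
            while i < len(left) and left[i][0] == lk:
                for k in range(j, j_end):
                    out.append((lk, left[i][1], right[k][1]))
                i += 1
            j = j_end
    return out


def smb_join(left_buckets, right_buckets):
    """Merge-join bucket n of one table with bucket n of the other."""
    if len(left_buckets) != len(right_buckets):
        raise ValueError("both tables need the same number of buckets")
    result = []
    for lb, rb in zip(left_buckets, right_buckets):
        result.extend(merge_join(lb, rb))
    return result


left_table = [[(2, "a"), (4, "b")], [(1, "c"), (3, "d")]]
right_table = [[(2, "x"), (2, "y")], [(3, "z")]]
joined = smb_join(left_table, right_table)
```

Since each bucket pair is already sorted, the join needs only one forward pass over both buckets and never has to hold a whole table in memory.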
136. How can you connect an application, if you run Hive as a server?
When running Hive as a server, the application can be connected in one of 3 ways –
ODBC Driver – This supports the ODBC protocol.
JDBC Driver – This supports the JDBC protocol.
Thrift Client – This client can be used to make calls to all Hive commands using different programming languages like PHP, Python, Java, C++ and Ruby.
137. What does the overwrite keyword denote in Hive load statement?
The overwrite keyword in a Hive load statement deletes the contents of the target table and replaces them with the files referred to by the file path, i.e. the files referred to by the file path are added to the table in place of the previous contents.
138. What is SerDe in Hive? How can you write your own custom SerDe?
SerDe is a Serializer DeSerializer. Hive uses SerDe to read and write data from tables. Generally, users prefer to write a Deserializer instead of a SerDe as they want to read their own data format rather than writing to it. If the SerDe supports DDL i.e. basically SerDe with parameterized columns and different column types, the users can implement a Protocol based DynamicSerDe rather than writing the SerDe from scratch.
139. What are the stable versions of Hadoop?
Release 2.7.1 (stable)
Release 1.2.1 (stable)
140. What is Apache Hadoop YARN?
YARN is a powerful and efficient feature rolled out as a part of Hadoop 2.0. YARN is a large scale distributed system for running big data applications.
141. Is YARN a replacement of Hadoop MapReduce?
YARN is not a replacement of Hadoop but it is a more powerful and efficient technology that supports MapReduce and is also referred to as Hadoop 2.0 or MapReduce 2.
142. What are the additional benefits YARN brings in to Hadoop?
Effective utilization of resources, as multiple applications can run in YARN while sharing a common pool of resources. In Hadoop MapReduce there are separate slots for Map and Reduce tasks whereas in YARN there is no fixed slot. The same container can be used for Map and Reduce tasks, leading to better utilization.
YARN is backward compatible, so all the existing MapReduce jobs can continue to run on it.
Using YARN, one can even run applications that are not based on the MapReduce model.
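The utilization point can be illustrated with a simplified count of how many tasks may run at once. The function names and the numbers below are assumptions chosen for illustration; real schedulers also account for memory, CPU and queues.

```python
def runnable_with_slots(map_tasks, reduce_tasks, map_slots, reduce_slots):
    """Fixed slots: map tasks cannot borrow idle reduce slots, and vice versa."""
    return min(map_tasks, map_slots) + min(reduce_tasks, reduce_slots)


def runnable_with_containers(map_tasks, reduce_tasks, containers):
    """Generic containers: any free container can take any task."""
    return min(map_tasks + reduce_tasks, containers)


# A node with capacity for 4 tasks and a map-heavy workload of 6 map tasks.
slot_model = runnable_with_slots(6, 0, map_slots=2, reduce_slots=2)
container_model = runnable_with_containers(6, 0, containers=4)
```

In the slot model the two reduce slots sit idle while map tasks wait, whereas the container model puts all four units of capacity to work.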
143. How can native libraries be included in YARN jobs?
There are two ways to include native libraries in YARN jobs –
- By setting -Djava.library.path on the command line, but in this case there is a chance that the native libraries might not be loaded correctly, which can lead to errors.
- The better option is to set LD_LIBRARY_PATH in the .bashrc file.
144. Explain the differences between Hadoop 1.x and Hadoop 2.x
In Hadoop 1.x, MapReduce is responsible for both processing and cluster management whereas in Hadoop 2.x processing is taken care of by other processing models and YARN is responsible for cluster management.
Hadoop 2.x scales better when compared to Hadoop 1.x with close to 10000 nodes per cluster.
Hadoop 1.x has single point of failure problem and whenever the NameNode fails it has to be recovered manually. However, in case of Hadoop 2.x StandBy NameNode overcomes the SPOF problem and whenever the NameNode fails it is configured for automatic recovery.
Hadoop 1.x works on the concept of slots whereas Hadoop 2.x works on the concept of containers and can also run generic tasks.
145. What are the core changes in Hadoop 2.0?
Hadoop 2.x provides an upgrade to Hadoop 1.x in terms of resource management, scheduling and the manner in which execution occurs. In Hadoop 2.x the cluster resource management capabilities work in isolation from the MapReduce specific programming logic. This helps Hadoop to share resources dynamically between multiple parallel processing frameworks like Impala and the core MapReduce component. Hadoop 2.x allows workable and fine grained resource configuration, leading to efficient and better cluster utilization so that the application can scale to process a larger number of jobs.
146. Differentiate between NFS, Hadoop NameNode and JournalNode.
HDFS is a write once file system, so a user cannot update files once they exist; files can only be written once and then read. However, certain scenarios in the enterprise environment, like file uploading, file downloading, file browsing or data streaming, cannot all be achieved using standard HDFS. This is where the distributed file system protocol Network File System (NFS) is used. NFS allows access to files on remote machines in much the same way as applications access the local file system.
Namenode is the heart of the HDFS file system that maintains the metadata and tracks where the file data is kept across the Hadoop cluster.
StandBy Nodes and Active Nodes communicate with a group of light weight nodes to keep their state synchronized. These are known as Journal Nodes.
147. What are the modules that constitute the Apache Hadoop 2.0 framework?
Hadoop 2.0 contains four important modules of which 3 are inherited from Hadoop 1.0 and a new module YARN is added to it.
Hadoop Common – This module consists of all the basic utilities and libraries that are required by other modules.
HDFS – Hadoop Distributed File System, which stores huge volumes of data on commodity machines across the cluster.
MapReduce – Java based programming model for data processing.
YARN – This is a new module introduced in Hadoop 2.0 for cluster resource management and job scheduling.
148. How is the distance between two nodes defined in Hadoop?
Measuring bandwidth is difficult in Hadoop, so the network is represented as a tree. The distance between two nodes in the tree plays a vital role in forming a Hadoop cluster and is defined by the network topology and the Java interface DNSToSwitchMapping. The distance between two nodes is equal to the sum of their distances to their closest common ancestor. The method getDistance(Node node1, Node node2) is used to calculate the distance between two nodes, with the assumption that the distance from a node to its parent node is always 1.
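The distance rule can be worked through with a short Python sketch. It is not the Hadoop getDistance implementation: it assumes each node is described by a path of the form /datacenter/rack/node, which is a convention picked for this example, and counts one hop per level up to the closest common ancestor.

```python
def distance(path1, path2):
    """Sum of the hops from each node up to their closest common ancestor."""
    a = path1.strip("/").split("/")
    b = path2.strip("/").split("/")
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    # Each remaining path component is one hop to a parent (distance 1).
    return (len(a) - common) + (len(b) - common)


same_node = distance("/d1/r1/n1", "/d1/r1/n1")
same_rack = distance("/d1/r1/n1", "/d1/r1/n2")
same_datacenter = distance("/d1/r1/n1", "/d1/r2/n3")
different_datacenter = distance("/d1/r1/n1", "/d2/r3/n4")
```

Under these assumptions the distance grows by 2 for every level of the tree that separates the two nodes: 0 on the same node, 2 on the same rack, 4 across racks and 6 across data centers.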
149. How will you test data quality?
All of the data that has been collected could be important, but not all data is equal, so it is necessary to first define where the data came from and how it will be used and consumed. Data that will be consumed by vendors or customers within the business ecosystem should be checked for quality and needs to be cleaned. This can be done by applying stringent data quality rules and by inspecting different properties like conformity, perfection, repetition, reliability, validity, completeness of data, etc.
150. What are the challenges that you encounter when testing large datasets?
- More data needs to be substantiated.
- Testing large datasets requires automation.
- Testing options across all platforms need to be defined.
151. How to use Apache Zookeeper command line interface?
ZooKeeper has command line client support for interactive use. The command line interface of ZooKeeper is similar to the file and shell system of UNIX. Data in ZooKeeper is stored in a hierarchy of Znodes where each znode can contain data, similar to a file. Each znode can also have children, just like directories in the UNIX file system.
The zookeeper-client command is used to launch the command line client. If the initial prompt is hidden by the log messages after entering the command, users can just hit ENTER to view the prompt.
152. What are the different types of Znodes?
There are 2 types of Znodes, namely Ephemeral and Sequential Znodes.
Znodes that get destroyed as soon as the client that created them disconnects are referred to as Ephemeral Znodes.
A Sequential Znode is one for which a sequence number is chosen by the ZooKeeper ensemble and appended to the name that the client assigns to the znode.
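Both znode types can be modelled in a few lines of Python. The class ToyZk below is a hypothetical stand-in, not a ZooKeeper client; it assumes the sequence number is a 10-digit zero-padded counter appended to the requested name, and it keeps a single counter for simplicity where ZooKeeper keeps one per parent znode.

```python
class ToyZk:
    """Toy model of ephemeral and sequential znodes (not a real client)."""

    def __init__(self):
        self.nodes = {}   # path -> owning session id, or None if not ephemeral
        self.counter = 0

    def create(self, path, session=None, ephemeral=False, sequential=False):
        if sequential:
            # The service, not the client, picks the number.
            path = f"{path}{self.counter:010d}"
            self.counter += 1
        self.nodes[path] = session if ephemeral else None
        return path

    def close_session(self, session):
        if session is None:
            return
        # Ephemeral znodes vanish together with the session that made them.
        self.nodes = {p: o for p, o in self.nodes.items() if o != session}


zk = ToyZk()
first = zk.create("/app/lock-", session="s1", ephemeral=True, sequential=True)
second = zk.create("/app/lock-", session="s2", ephemeral=True, sequential=True)
zk.create("/app/config")
zk.close_session("s1")
remaining = sorted(zk.nodes)
```

After the first session closes, its lock znode is gone while the other session's znode and the ordinary config znode remain.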
153. What is the size of the biggest hadoop cluster a company X operates?
Asking this question helps a Hadoop job seeker understand the Hadoop maturity curve at a company. Based on the answer of the interviewer, a candidate can judge how much an organization invests in Hadoop and their enthusiasm to buy big data products from various vendors. The candidate can also get an idea of the hiring needs of the company based on their Hadoop infrastructure.
154. For what kind of big data problems, did the organization choose to use Hadoop?
Asking this question to the interviewer shows the candidate’s keen interest in understanding the reason for the Hadoop implementation from a business perspective. This question gives the impression to the interviewer that the candidate is not merely interested in the Hadoop developer job role but is also interested in the growth of the company.
155. Based on the answer to question no 1, the candidate can ask the interviewer why the hadoop infrastructure is configured in that particular way, why the company chose to use the selected big data tools and how workloads are constructed in the hadoop environment.
Asking this question to the interviewer gives the impression that you are not just interested in maintaining the big data system and developing products around it but are also seriously thoughtful on how the infrastructure can be improved to help business growth and make cost savings.
156. What kind of data the organization works with or what are the HDFS file formats the company uses?
The question gives the candidate an idea on the kind of big data he or she will be handling if selected for the hadoop developer job role. Based on the data, it gives an idea on the kind of analysis they will be required to perform on the data.
157. What is the most complex problem the company is trying to solve using Apache Hadoop?
Asking this question helps the candidate know more about the upcoming projects he or she might have to work on and the challenges around them. Knowing this beforehand helps the interviewee prepare on his or her areas of weakness.
158. Will I get an opportunity to attend big data conferences? Or will the organization incur any costs involved in taking advanced hadoop or big data certification?
This is a very important question that you should be asking the interviewer. It helps a candidate understand whether the prospective hiring manager is interested in and supportive of the professional development of the employee.
159. If you are an experienced hadoop professional then you are likely to be asked questions like –
The number of nodes you have worked with in a cluster.
Which hadoop distribution have you used in your recent project.
Your experience on working with special configurations like High Availability.
The data volume you have worked with in your most recent project.
What are the various tools you used in the big data and hadoop projects you have worked on?
Your answer to these interview questions will help the interviewer understand your expertise in Hadoop based on the size of the hadoop cluster and number of nodes. Based on the highest volume of data you have handled in your previous projects, interviewer can assess your overall experience in debugging and troubleshooting issues involving huge hadoop clusters.
The number of tools you have worked with helps an interviewer judge whether you are aware of the overall Hadoop ecosystem and not just MapReduce. Whether you are selected depends on how well you communicate the answers to all these questions.
160. What are the challenges that you faced when implementing hadoop projects?
Interviewers are interested to know more about the various issues you have encountered in the past when working with Hadoop clusters and to understand how you addressed them. The way you answer this question tells a lot about your expertise in troubleshooting and debugging Hadoop clusters. The more issues you have encountered, the more likely it is that you have become an expert in that area of Hadoop. Ensure that you list all the issues that you have troubleshot.
161. How were you involved in data modelling, data ingestion, data transformation and data aggregation?
You are likely to be involved in one or more phases when working with big data in a Hadoop environment. The answer to this question helps the interviewer understand what kind of tools you are familiar with. If you answer that your focus was mainly on data ingestion, then they can expect you to be well-versed with Sqoop and Flume; if you answer that you were involved in data analysis and data transformation, then it gives the interviewer the impression that you have expertise in using Pig and Hive.
162. What is your favourite tool in the hadoop ecosystem?
The answer to this question will help the interviewer know more about the big data tools that you are well-versed with and are interested in working with. If you show affinity towards a particular tool, then it is more probable that you will be deployed to work on that tool. If you say that you have a good knowledge of all the popular big data tools like Pig, Hive, HBase, Sqoop and Flume, then it shows that you have knowledge about the Hadoop ecosystem as a whole.
163. In your previous project, did you maintain the Hadoop cluster in-house or use Hadoop in the cloud?
Most of the organizations still do not have the budget to maintain hadoop cluster in-house and they make use of hadoop in the cloud from various vendors like Amazon, Microsoft, Google, etc. Interviewer gets to know about your familiarity with using hadoop in the cloud because if the company does not have an in-house implementation then hiring a candidate who has knowledge about using hadoop in the cloud is worth it.
164. Explain “Big Data” and what are five V’s of Big Data?
“Big data” is the term for a collection of large and complex data sets, that makes it difficult to process using relational database management tools or traditional data processing applications. It is difficult to capture, curate, store, search, share, transfer, analyze, and visualize Big data. Big Data has emerged as an opportunity for companies. Now they can successfully derive value from their data and will have a distinct advantage over their competitors with enhanced business decisions making capabilities.
Tip: It will be a good idea to talk about the 5Vs in such questions, whether it is asked specifically or not!
Volume: The volume represents the amount of data which is growing at an exponential rate i.e. in Petabytes and Exabytes.
Velocity: Velocity refers to the rate at which data is growing, which is very fast. Today, yesterday’s data are considered as old data. Nowadays, social media is a major contributor in the velocity of growing data.
Variety: Variety refers to the heterogeneity of data types. In other words, the data that is gathered comes in a variety of formats like videos, audios, csv, etc. These various formats represent the variety of data.
Veracity: Veracity refers to doubt or uncertainty about the available data due to data inconsistency and incompleteness. Available data can sometimes get messy and may be difficult to trust. With many forms of big data, quality and accuracy are difficult to control. The volume is often the reason behind the lack of quality and accuracy in the data.
Value: It is all well and good to have access to big data, but unless we can turn it into value it is useless. By turning it into value I mean: is it adding to the benefits of the organization? Is the organization working on Big Data achieving high ROI (Return On Investment)? Unless working on Big Data adds to their profits, it is useless.
As we know Big Data is growing at an accelerating rate, so the factors associated with it are also evolving. To go through them and understand it in detail, I recommend you to go through Big Data Tutorial blog.
165. What is Hadoop and what are its components?
When “Big Data” emerged as a problem, Apache Hadoop evolved as a solution to it. Apache Hadoop is a framework which provides us various services or tools to store and process Big Data. It helps in analyzing Big Data and making business decisions out of it, which can’t be done efficiently and effectively using traditional systems.
Tip: Now, while explaining Hadoop, you should also explain the main components of Hadoop, i.e.:
Storage unit– HDFS (NameNode, DataNode)
Processing framework– YARN (ResourceManager, NodeManager)
166. What are HDFS and YARN?
HDFS (Hadoop Distributed File System) is the storage unit of Hadoop. It is responsible for storing different kinds of data as blocks in a distributed environment. It follows master and slave topology.
Tip: It is recommended to explain the HDFS components too i.e.
NameNode: NameNode is the master node in the distributed environment and it maintains the metadata information for the blocks of data stored in HDFS like block location, replication factors etc.
DataNode: DataNodes are the slave nodes, which are responsible for storing data in the HDFS. NameNode manages all the DataNodes.
YARN (Yet Another Resource Negotiator) is the processing framework in Hadoop, which manages resources and provides an execution environment to the processes.
Tip: Similarly, as we did in HDFS, we should also explain the two components of YARN:
ResourceManager: It receives the processing requests, and then passes the parts of requests to corresponding NodeManagers accordingly, where the actual processing takes place. It allocates resources to applications based on the needs.
NodeManager: NodeManager is installed on every DataNode and it is responsible for execution of the task on every single DataNode.
167. Tell me about the various Hadoop daemons and their roles in a Hadoop cluster.
Generally, approach this question by first explaining the HDFS daemons, i.e. NameNode, DataNode and Secondary NameNode, then moving on to the YARN daemons, i.e. ResourceManager and NodeManager, and lastly explaining the JobHistoryServer.
NameNode: It is the master node which is responsible for storing the metadata of all the files and directories. It has information about the blocks that make up a file, and where those blocks are located in the cluster.
DataNode: It is the slave node that contains the actual data.
Secondary NameNode: It periodically merges the changes (edit log) with the FsImage (Filesystem Image), present in the NameNode. It stores the modified FsImage into persistent storage, which can be used in case of failure of NameNode.
ResourceManager: It is the central authority that manages resources and schedules applications running on top of YARN.
NodeManager: It runs on slave machines, and is responsible for launching the application’s containers (where applications execute their part), monitoring their resource usage (CPU, memory, disk, network) and reporting these to the ResourceManager.
JobHistoryServer: It maintains information about MapReduce jobs after the Application Master terminates.
168. Compare HDFS with Network Attached Storage (NAS).
In this question, first explain NAS and HDFS, and then compare their features as follows:
Network-attached storage (NAS) is a file-level computer data storage server connected to a computer network, providing data access to a heterogeneous group of clients. NAS can be either hardware or software which provides services for storing and accessing files. Hadoop Distributed File System (HDFS), in contrast, is a distributed filesystem that stores data using commodity hardware.
In HDFS, data blocks are distributed across all the machines in a cluster, whereas in NAS, data is stored on dedicated hardware.
HDFS is designed to work with the MapReduce paradigm, where computation is moved to the data. NAS is not suitable for MapReduce since data is stored separately from the computations.
HDFS uses commodity hardware, which is cost effective, whereas a NAS is a high-end storage device that comes at a high cost.
169. What are the basic differences between relational database and HDFS?
Here are the key differences between HDFS and a relational database:
Data types: An RDBMS relies on structured data, and the schema of the data is always known. Hadoop can store any kind of data, be it structured, unstructured or semi-structured.
Processing: An RDBMS provides limited or no processing capabilities. Hadoop allows us to process the data, which is distributed across the cluster, in a parallel fashion.
Schema on read vs. write: An RDBMS is based on ‘schema on write’, where schema validation is done before loading the data. Hadoop, on the contrary, follows a ‘schema on read’ policy.
Read/write speed: In an RDBMS, reads are fast because the schema of the data is already known. In HDFS, writes are fast because no schema validation happens during an HDFS write.
Cost: An RDBMS is typically licensed software, so you have to pay for it. Hadoop is an open source framework, so there is no need to pay for the software.
Best fit use case: An RDBMS is used for OLTP (Online Transactional Processing) systems. Hadoop is used for data discovery, data analytics or OLAP systems.
170. List the difference between Hadoop 1 and Hadoop 2.
This is an important question and while answering this question, we have to mainly focus on two points i.e. Passive NameNode and YARN architecture.
In Hadoop 1.x, “NameNode” is the single point of failure. In Hadoop 2.x, we have Active and Passive “NameNodes”. If the active “NameNode” fails, the passive “NameNode” takes charge. Because of this, high availability can be achieved in Hadoop 2.x.
Also, in Hadoop 2.x, YARN provides a central resource manager. With YARN, you can now run multiple applications in Hadoop, all sharing a common resource. MRV2 is a particular type of distributed application that runs the MapReduce framework on top of YARN. Other tools can also perform data processing via YARN, which was a problem in Hadoop 1.x.
Hadoop 1.x: NameNode is a single point of failure; processing uses MRV1 (JobTracker & TaskTracker).
Hadoop 2.x: Active & Passive NameNodes; processing uses MRV2/YARN (ResourceManager & NodeManager).
171. What are active and passive “NameNodes”?
In HA (High Availability) architecture, we have two NameNodes – Active “NameNode” and Passive “NameNode”.
Active “NameNode” is the “NameNode” which works and runs in the cluster.
Passive “NameNode” is a standby “NameNode”, which holds the same data as the active “NameNode”.
When the active “NameNode” fails, the passive “NameNode” replaces the active “NameNode” in the cluster. Hence, the cluster is not left without a “NameNode”, which removes the single point of failure.
172. Why does one remove or add nodes in a Hadoop cluster frequently?
One of the most attractive features of the Hadoop framework is its utilization of commodity hardware. However, this leads to frequent “DataNode” crashes in a Hadoop cluster. Another striking feature of the Hadoop framework is the ease of scaling in line with rapid growth in data volume. For these two reasons, one of the most common tasks of a Hadoop administrator is to commission (add) and decommission (remove) “DataNodes” in a Hadoop cluster.
173. What happens when two clients try to access the same file in the HDFS?
HDFS supports exclusive writes only.
When the first client contacts the “NameNode” to open the file for writing, the “NameNode” grants a lease to the client to create this file. When the second client tries to open the same file for writing, the “NameNode” will notice that the lease for the file is already granted to another client, and will reject the open request for the second client.
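The lease behaviour described above can be pictured with a small Python sketch. This is a toy model, not HDFS code: the LeaseManager class, its method names and the client identifiers are all hypothetical, and real leases also expire and get renewed, which is left out here.

```python
class LeaseManager:
    """Toy model of NameNode write leases: at most one writer per path."""

    def __init__(self):
        self._holders = {}  # path -> id of the client holding the lease

    def open_for_write(self, path, client):
        holder = self._holders.get(path)
        if holder is not None and holder != client:
            # Another client already holds the lease: reject the open request.
            raise PermissionError(f"{path} is already leased to {holder}")
        self._holders[path] = client
        return True

    def close(self, path, client):
        # Only the lease holder can release the lease.
        if self._holders.get(path) == client:
            del self._holders[path]


leases = LeaseManager()
leases.open_for_write("/data/log.txt", "client-1")
try:
    leases.open_for_write("/data/log.txt", "client-2")
    second_writer_rejected = False
except PermissionError:
    second_writer_rejected = True
```

Once the first client closes the file, a later open_for_write call from the second client succeeds in this model.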
174. How does NameNode tackle DataNode failures?
NameNode periodically receives a Heartbeat (signal) from each DataNode in the cluster, which implies that the DataNode is functioning properly.
A block report contains a list of all the blocks on a DataNode. If a DataNode fails to send a heartbeat message for a specific period of time, it is marked dead.
The NameNode then replicates the blocks of the dead node to other DataNodes using the replicas created earlier.
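As a rough illustration of this logic, the sketch below marks nodes dead after a missed-heartbeat timeout and then works out which blocks have fallen below the replication factor. The function names, node names and the 30-second timeout are assumptions made for the example; the real timeout is derived from HDFS configuration.

```python
def find_dead_nodes(last_heartbeat, now, timeout_s):
    """Return the nodes whose last heartbeat is older than timeout_s seconds."""
    return sorted(node for node, seen in last_heartbeat.items()
                  if now - seen > timeout_s)


def blocks_to_rereplicate(block_locations, dead_nodes, replication):
    """Return {block: number of replicas missing} for under-replicated blocks."""
    dead = set(dead_nodes)
    missing = {}
    for block, nodes in block_locations.items():
        live = [n for n in nodes if n not in dead]
        if len(live) < replication:
            missing[block] = replication - len(live)
    return missing


# Hypothetical cluster state: dn3 has been silent for 80 seconds.
last_heartbeat = {"dn1": 100.0, "dn2": 95.0, "dn3": 20.0, "dn4": 99.0}
dead = find_dead_nodes(last_heartbeat, now=100.0, timeout_s=30.0)

# Block locations as the NameNode would know them from block reports.
block_locations = {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn1", "dn2", "dn4"],
}
todo = blocks_to_rereplicate(block_locations, dead, replication=3)
```

Here only blk_1 lost a replica, so it is the only block scheduled for re-replication from one of its surviving copies.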
175. What will you do when NameNode is down?
The NameNode recovery process involves the following steps to get the Hadoop cluster up and running:
Use the file system metadata replica (FsImage) to start a new NameNode.
Then, configure the DataNodes and clients so that they acknowledge the newly started NameNode.
Now the new NameNode will start serving clients after it has completed loading the last checkpoint FsImage (for metadata information) and received enough block reports from the DataNodes.
However, on large Hadoop clusters this NameNode recovery process may take a lot of time, and this becomes an even greater challenge during routine maintenance. Therefore, we have the HDFS High Availability architecture, which is covered in the HA architecture blog.
176. What is a checkpoint?
In brief, “Checkpointing” is a process that takes an FsImage and an edit log and compacts them into a new FsImage. Thus, instead of replaying an edit log, the NameNode can load the final in-memory state directly from the FsImage. This is a far more efficient operation and reduces NameNode startup time. Checkpointing is performed by the Secondary NameNode.
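A minimal sketch of the idea, assuming the FsImage can be modelled as a dictionary from path to block list and the edit log as a list of operations. The operation names and the checkpoint function are invented for this example and do not reflect the real on-disk formats.

```python
def checkpoint(fsimage, edit_log):
    """Apply every edit-log entry to a copy of the FsImage; return the new image."""
    new_image = dict(fsimage)
    for op, path, *args in edit_log:
        if op == "create":
            new_image[path] = args[0]            # args[0] is the block list
        elif op == "delete":
            new_image.pop(path, None)
        elif op == "rename":
            new_image[args[0]] = new_image.pop(path)  # args[0] is the new path
        else:
            raise ValueError(f"unknown operation: {op}")
    return new_image


old_image = {"/a.txt": ["blk_1"]}
edits = [
    ("create", "/b.txt", ["blk_2"]),
    ("rename", "/a.txt", "/c.txt"),
    ("delete", "/b.txt"),
]
new_image = checkpoint(old_image, edits)
```

After the checkpoint, the edit log can start empty again, and a restart only needs to load new_image rather than replay all three operations.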
177. How is HDFS fault tolerant?
When data is stored in HDFS, the NameNode has it replicated to several DataNodes. The default replication factor is 3. You can change the replication factor as per your need. If a DataNode goes down, the NameNode will automatically have the data copied to another node from the remaining replicas and make the data available. This provides fault tolerance in HDFS.
178. Can NameNode and DataNode be a commodity hardware?
The smart answer to this question would be: DataNodes are commodity hardware, like personal computers and laptops, as they store data and are required in large numbers. But from your experience you can tell that the NameNode is the master node and it stores metadata about all the blocks stored in HDFS. It requires a lot of memory (RAM), so the NameNode needs to be a high-end machine with good memory space.
179. Why do we use HDFS for applications having large data sets and not when there are a lot of small files?
HDFS is more suitable for a large amount of data in a single file than for a small amount of data spread across multiple files. As you know, the NameNode stores the metadata information regarding the file system in RAM. Therefore, the amount of memory puts a limit on the number of files in an HDFS file system. In other words, too many files will lead to the generation of too much metadata, and storing this metadata in RAM will become a challenge. As a rule of thumb, metadata for a file, block or directory takes about 150 bytes.
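To make the rule of thumb concrete, here is a back-of-the-envelope calculation. It assumes 150 bytes per file object and per block object, a 128 MB block size, and ignores directories; the function name and the example file sizes are illustrative only.

```python
import math

BYTES_PER_OBJECT = 150  # rule-of-thumb figure quoted above


def namenode_bytes(num_files, file_size, block_size=128 * 1024**2):
    """Estimate NameNode memory for num_files files of file_size bytes each."""
    blocks_per_file = max(1, math.ceil(file_size / block_size))
    # One object for the file itself plus one object per block.
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT


# Roughly the same 1 GB of data, stored two different ways.
one_big_file = namenode_bytes(num_files=1, file_size=1024**3)
many_small_files = namenode_bytes(num_files=10_000, file_size=100 * 1024)
```

Under these assumptions the single large file costs about 1.35 KB of NameNode memory, while the 10,000 small files cost about 3 MB for a similar amount of data.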
180. How do you define “block” in HDFS? What is the default block size in Hadoop 1 and in Hadoop 2? Can it be changed?
Blocks are nothing but the smallest contiguous location on your hard drive where data is stored. HDFS stores each file as blocks and distributes them across the Hadoop cluster. Files in HDFS are broken down into block-sized chunks, which are stored as independent units.
Hadoop 1 default block size: 64 MB
Hadoop 2 default block size: 128 MB
Yes, the block size can be configured. The dfs.block.size parameter (named dfs.blocksize in Hadoop 2, where the older name is deprecated) can be set in the hdfs-site.xml file to set the size of a block in a Hadoop environment.
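The effect of the block size on how a file is chunked can be shown with a few lines of arithmetic. The helper name and the 500 MB file are made up for the example; sizes are in whole megabytes to keep the numbers readable.

```python
def block_sizes(file_size_mb, block_size_mb):
    """Return the sizes (in MB) of the blocks a file is split into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    # The last block only holds the leftover data, not a full block.
    return [block_size_mb] * full_blocks + ([remainder] if remainder else [])


hadoop1_blocks = block_sizes(500, 64)    # Hadoop 1 default block size
hadoop2_blocks = block_sizes(500, 128)   # Hadoop 2 default block size
```

With the Hadoop 2 default, the 500 MB file becomes three 128 MB blocks plus one 116 MB block; with the Hadoop 1 default it becomes eight blocks, which means twice as many block entries for the NameNode to track.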
181. What does ‘jps’ command do?
The ‘jps’ command helps us check whether the Hadoop daemons are running. It lists the Java processes running on the machine, which includes Hadoop daemons such as NameNode, DataNode, ResourceManager, NodeManager, etc.
182. How do you define “Rack Awareness” in Hadoop?
Rack Awareness is the algorithm by which the “NameNode” decides how blocks and their replicas are placed, based on rack definitions, so as to keep network traffic between “DataNodes” within the same rack where possible. With a replication factor of 3 (the default), the policy is that “for every block of data, two copies will exist in one rack, third copy in a different rack”. This rule is known as the “Replica Placement Policy”.
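The placement rule can be sketched as follows. This is a deliberately simplified, deterministic model: the topology dictionary, node names and place_replicas function are hypothetical, and the real policy also weighs node load and free space and picks targets randomly.

```python
def place_replicas(topology, writer_node):
    """Pick three DataNodes so that two replicas share one rack and one does not.

    topology maps a rack name to the list of DataNodes in that rack.
    """
    # First replica: the node the client is writing from.
    local_rack = next(rack for rack, nodes in topology.items()
                      if writer_node in nodes)
    # Second and third replicas: two nodes on one other rack.
    remote_rack = next(rack for rack in sorted(topology)
                       if rack != local_rack and len(topology[rack]) >= 2)
    second, third = sorted(topology[remote_rack])[:2]
    return [writer_node, second, third]


topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
replicas = place_replicas(topology, "dn1")
```

Losing either rack still leaves at least one replica of the block, while only one copy of the data has to cross the link between racks during the write.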
183. What is “speculative execution” in Hadoop?
If a node appears to be executing a task more slowly than expected, the master node can redundantly execute another instance of the same task on another node. Then, the task which finishes first will be accepted and the other one is killed. This process is called “speculative execution”.
184. How can I restart “NameNode” or all the daemons in Hadoop?
This question can have two answers, and we will discuss both. We can restart the NameNode by the following methods:
You can stop the NameNode individually using the ./sbin/hadoop-daemon.sh stop namenode command and then start it again using the ./sbin/hadoop-daemon.sh start namenode command.
To stop and start all the daemons, use ./sbin/stop-all.sh and then ./sbin/start-all.sh, which will stop all the daemons first and then start them again.
These script files reside in the sbin directory inside the Hadoop directory.
185. What is the difference between an “HDFS Block” and an “Input Split”?
The “HDFS Block” is the physical division of the data while “Input Split” is the logical division of the data. HDFS divides data into blocks for storage, whereas for processing, MapReduce divides the data into input splits and assigns each split to a mapper function.
186. Name the three modes in which Hadoop can run.
The three modes in which Hadoop can run are as follows:
- Standalone (local) mode: This is the default mode if we don’t configure anything. In this mode, all the components of Hadoop, such as NameNode, DataNode, ResourceManager, and NodeManager, run as a single Java process. This mode uses the local filesystem.
- Pseudo-distributed mode: A single-node Hadoop deployment is considered to be running in pseudo-distributed mode. In this mode, all the Hadoop services, including both the master and the slave services, run on a single compute node.
- Fully distributed mode: A Hadoop deployment in which the Hadoop master and slave services run on separate nodes is said to be in fully distributed mode.
187. What is “MapReduce”? What is the syntax to run a “MapReduce” program?
It is a framework/programming model that is used for processing large data sets over a cluster of computers using parallel programming. The syntax to run a MapReduce program is hadoop jar hadoop_jar_file.jar /input_path /output_path.
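To show the programming model itself, here is a single-process word count in plain Python. It only imitates the map, shuffle and reduce steps in memory; the function names are chosen for this example and nothing here uses the Hadoop API or runs on a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values by key and sort the keys."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(key, values):
    """Reduce: add up the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())


counts = word_count(["big data big cluster", "data node"])
```

On a real cluster each call to map_phase would run near the block holding its input, and the shuffle would move the grouped pairs across the network to the reducers.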
188. What are the main configuration parameters in a “MapReduce” program?
The main configuration parameters which users need to specify in “MapReduce” framework are:
Job’s input locations in the distributed file system
Job’s output location in the distributed file system
Input format of data
Output format of data
Class containing the map function
Class containing the reduce function
JAR file containing the mapper, reducer and driver classes
189. State the reason why we can’t perform “aggregation” (addition) in mapper? Why do we need the “reducer” for this?
This answer includes many points, so we will go through them sequentially.
We cannot perform “aggregation” (addition) in mapper because sorting does not occur in the “mapper” function. Sorting occurs only on the reducer side and without sorting aggregation cannot be done.
During “aggregation”, we need the output of all the mapper functions, which may not be possible to collect in the map phase because mappers may be running on different machines where the data blocks are stored.
And lastly, if we try to aggregate data at mapper, it requires communication between all mapper functions which may be running on different machines. So, it will consume high network bandwidth and can cause network bottlenecking.
190. Explain “Distributed Cache” in a “MapReduce Framework”.
Distributed Cache can be explained as a facility provided by the MapReduce framework to cache files needed by applications. Once you have cached a file for your job, the Hadoop framework will make it available on each and every data node where your map/reduce tasks are running. Then you can access the cache file as a local file in your Mapper or Reducer job.
191. How do “reducers” communicate with each other?
This is a tricky question. The “MapReduce” programming model does not allow “reducers” to communicate with each other. “Reducers” run in isolation.
192. What does a “MapReduce Partitioner” do?
A “MapReduce Partitioner” makes sure that all the values of a single key go to the same “reducer”, thus allowing even distribution of the map output over the “reducers”. It redirects the “mapper” output to the “reducer” by determining which “reducer” is responsible for the particular key.
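A hash-based partitioner can be sketched in a few lines. Hadoop’s default partitioner uses the key’s Java hash code modulo the number of reducers; the Python version below substitutes a CRC32 checksum as a stable stand-in, and the function name and sample keys are assumptions made for the example.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Map a key to a reducer index in the range [0, num_reducers)."""
    # CRC32 is used because Python's built-in hash() of strings varies between runs.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


mapper_output = [("hadoop", 1), ("hive", 1), ("hadoop", 1), ("pig", 1)]

# Group the mapper output by the reducer each key is routed to.
assignments = defaultdict(list)
for key, value in mapper_output:
    assignments[partition(key, num_reducers=3)].append((key, value))
```

Because the reducer index depends only on the key, both ("hadoop", 1) pairs land on the same reducer regardless of which mapper produced them.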
193. How will you write a custom partitioner?
Custom partitioner for a Hadoop job can be written easily by following the below steps:
Create a new class that extends the Partitioner class.
Override the getPartition method, in the wrapper that runs in MapReduce.
Add the custom partitioner to the job by using the setPartitionerClass method, or add the custom partitioner to the job as a config file.
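In Hadoop the custom partitioner is a Java class, but the shape of the logic can be shown in Python. The class below is a hypothetical stand-in whose get_partition method mirrors the (key, value, number of partitions) arguments of getPartition; the first-letter routing rule is invented purely for illustration.

```python
class FirstLetterPartitioner:
    """Route keys starting with a-m to partition 0 and all other keys to partition 1."""

    def get_partition(self, key, value, num_partitions):
        first = key[:1].lower()
        bucket = 0 if "a" <= first <= "m" else 1
        # The modulo keeps the result valid when the job has a single reducer.
        return bucket % num_partitions


partitioner = FirstLetterPartitioner()
routes = {key: partitioner.get_partition(key, 1, 2)
          for key in ["apple", "Mango", "zebra"]}
```

A rule like this is useful when the output of each reducer should cover a known key range, at the cost of uneven load if the keys are not spread evenly across the ranges.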
194. What is the purpose of “RecordReader” in Hadoop?
The “InputSplit” defines a slice of work, but does not describe how to access it. The “RecordReader” class loads the data from its source and converts it into (key, value) pairs suitable for reading by the “Mapper” task. The “RecordReader” instance is defined by the “Input Format”.
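As an illustration, the sketch below turns the bytes of one split into (byte offset, line) pairs, which is the key/value shape produced for plain text input. The function name is an assumption, and the handling of lines that straddle two splits, which a real record reader must deal with, is left out.

```python
def line_records(split_bytes, split_offset=0):
    """Yield (byte_offset, line_text) pairs for the lines in one split."""
    offset = split_offset
    for raw in split_bytes.splitlines(keepends=True):
        # The key is the position of the line; the value is the line without its newline.
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)


records = list(line_records(b"alpha\nbeta\ngamma\n"))
```

Each pair in records is what a single call to the map function would receive as its key and value.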
195. What is a “Combiner”?
A “Combiner” is a mini “reducer” that performs the local “reduce” task. It receives the input from the “mapper” on a particular “node” and sends the output to the “reducer”. “Combiners” help in enhancing the efficiency of “MapReduce” by reducing the quantum of data that is required to be sent to the “reducers”.
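The saving a combiner brings can be seen in a small word-count sketch. The function names are chosen for this example, and the combiner simply reuses the summing logic a word-count reducer would apply, run locally on one mapper’s output.

```python
from collections import Counter


def map_words(line):
    """Mapper: emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Combiner: sum the counts per word on the mapper's own node."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


mapper_output = map_words("to be or not to be")
combined_output = combine(mapper_output)
```

Six pairs leave the mapper but only four leave the combiner, so less data has to cross the network to the reducers. This works here because addition can be applied in stages without changing the final result.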
196. What do you know about “SequenceFileInputFormat”?
“SequenceFileInputFormat” is an input format for reading sequence files. A sequence file is a specific compressed binary file format which is optimized for passing data from the output of one “MapReduce” job to the input of another “MapReduce” job.
Sequence files can be generated as the output of other MapReduce tasks and are an efficient intermediate representation for data that is passing from one MapReduce job to another.
197. What are the benefits of Apache Pig over MapReduce?
Apache Pig is a platform, developed by Yahoo, used to analyze large data sets by representing them as data flows. It is designed to provide an abstraction over MapReduce, reducing the complexities of writing a MapReduce program.
Pig Latin is a high-level data flow language, whereas MapReduce is a low-level data processing paradigm.
Without writing complex Java implementations in MapReduce, programmers can achieve the same implementations very easily using Pig Latin.
Apache Pig reduces the length of the code by approx 20 times (according to Yahoo). Hence, this reduces the development period by almost 16 times.
Pig provides many built-in operators to support data operations like joins, filters, ordering, sorting etc., whereas performing the same functions in MapReduce is a humongous task.
Performing a Join operation in Apache Pig is simple, whereas it is difficult in MapReduce to perform a Join operation between data sets, as it requires multiple MapReduce tasks to be executed sequentially to fulfill the job.
In addition, Pig also provides nested data types like tuples, bags, and maps that are missing from MapReduce.
198. What are different data types in Pig Latin?
Pig Latin can handle both atomic data types like int, float, long, double etc. and complex data types like tuple, bag and map.
Atomic data types: Atomic or scalar data types are the basic data types found in most languages, such as int, long, float and double, along with chararray (string) and bytearray.
Complex Data Types: Complex data types are Tuple, Map and Bag.
199. What are the different relational operations in “Pig Latin” you worked with?
Different relational operators are:
- for each
- order by
200. What is a UDF?
If some functions are unavailable among the built-in operators, we can programmatically create User Defined Functions (UDFs) to provide that functionality using other languages like Java, Python, Ruby, etc. and embed them in the script file.
201. What is “SerDe” in “Hive”?
Apache Hive is a data warehouse system, developed by Facebook, built on top of Hadoop and used for analyzing structured and semi-structured data. Hive abstracts the complexity of Hadoop MapReduce.
The “SerDe” interface allows you to instruct “Hive” about how a record should be processed. A “SerDe” is a combination of a “Serializer” and a “Deserializer”. “Hive” uses “SerDe” (and “FileFormat”) to read and write the table’s row.
202. Can the default “Hive Metastore” be used by multiple users (processes) at the same time?
“Derby database” is the default “Hive Metastore”. Multiple users (processes) cannot access it at the same time. It is mainly used to perform unit tests.
203. What is the default location where “Hive” stores table data?
The default location where Hive stores table data is inside HDFS in /user/hive/warehouse.
204. What is Apache HBase?
HBase is an open source, multidimensional, distributed, scalable and a NoSQL database written in Java. HBase runs on top of HDFS (Hadoop Distributed File System) and provides BigTable (Google) like capabilities to Hadoop. It is designed to provide a fault tolerant way of storing large collection of sparse data sets. HBase achieves high throughput and low latency by providing faster Read/Write Access on huge data sets.
205. What are the components of Apache HBase?
HBase has three major components, i.e. HMaster Server, HBase Region Server and ZooKeeper.
Region Server: A table can be divided into several regions. A group of regions is served to the clients by a Region Server.
HMaster: It coordinates and manages the Region Servers (similar to how the NameNode manages DataNodes in HDFS).
ZooKeeper: ZooKeeper acts as a coordinator inside the HBase distributed environment. It helps in maintaining server state inside the cluster by communicating through sessions.
206. What are the components of Region Server?
The components of a Region Server are:
WAL: Write Ahead Log (WAL) is a file attached to every Region Server inside the distributed environment. The WAL stores the new data that hasn’t been persisted or committed to the permanent storage.
Block Cache: Block Cache resides at the top of the Region Server. It stores frequently read data in memory.
MemStore: It is the write cache. It stores all the incoming data before committing it to the disk or permanent memory. There is one MemStore for each column family in a region.
HFile: HFile is stored in HDFS. It stores the actual cells on the disk.
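How these components cooperate on a write can be sketched with a toy model. The ToyRegion class, its flush threshold and its method names are hypothetical; it keeps a single MemStore rather than one per column family and leaves out the Block Cache entirely.

```python
class ToyRegion:
    """Toy write path: append to the WAL, buffer in the MemStore, flush to HFiles."""

    def __init__(self, flush_threshold=3):
        self.wal = []        # every write is logged here first
        self.memstore = {}   # in-memory write cache
        self.hfiles = []     # immutable sorted files, oldest first
        self.flush_threshold = flush_threshold

    def put(self, row, value):
        self.wal.append((row, value))
        self.memstore[row] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Persist the MemStore contents as a new sorted HFile, then empty the cache.
        self.hfiles.append(dict(sorted(self.memstore.items())))
        self.memstore.clear()

    def get(self, row):
        # Check the newest data first: MemStore, then HFiles from newest to oldest.
        if row in self.memstore:
            return self.memstore[row]
        for hfile in reversed(self.hfiles):
            if row in hfile:
                return hfile[row]
        return None


region = ToyRegion(flush_threshold=2)
region.put("r1", "a")
region.put("r2", "b")   # reaches the threshold and triggers a flush
region.put("r1", "c")   # newer value stays in the MemStore for now
```

If the server crashed at this point, the unflushed write to r1 could be rebuilt from the WAL, which is why the log entry is written before the MemStore is updated.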
207. Mention the differences between “HBase” and “Relational Databases”?
HBase is an open source, multidimensional, distributed, scalable NoSQL database written in Java. HBase runs on top of HDFS and provides BigTable-like capabilities to Hadoop. Let us see the differences between HBase and a relational database:
HBase is schema-less, whereas a relational database is schema based.
HBase is a column-oriented data store, whereas a relational database is a row-oriented data store.
HBase is used to store de-normalized data, whereas a relational database is used to store normalized data.
HBase contains sparsely populated tables, whereas a relational database contains thin tables.
Automated partitioning is done in HBase, whereas a relational database has no such provision or built-in support for partitioning.
208. Explain “WAL” in HBase?
Write Ahead Log (WAL) is a file attached to every Region Server inside the distributed environment. The WAL stores the new data that hasn’t been persisted or committed to the permanent storage. It is used in case of failure to recover the data sets.
209. What is Apache Spark?
Apache Spark is a framework for real time data analytics in a distributed computing environment. It executes in-memory computations to increase the speed of data processing.
It can be up to 100x faster than MapReduce for large scale data processing by exploiting in-memory computations and other optimizations.
210. Can you build “Spark” with any particular Hadoop version?
Yes, one can build “Spark” for a specific Hadoop version. Check out this blog to learn more about building YARN and HIVE on Spark.
211. Define RDD.
RDD is the acronym for Resilient Distributed Datasets: a fault-tolerant collection of elements that can be operated on in parallel. The partitioned data in an RDD is immutable and distributed, and the RDD is a key component of Apache Spark.
212. What is Apache ZooKeeper and Apache Oozie?
Apache ZooKeeper coordinates with various services in a distributed environment. It saves a lot of time by performing synchronization, configuration maintenance, grouping and naming.
Apache Oozie is a scheduler which schedules Hadoop jobs and binds them together as one logical work. There are two kinds of Oozie jobs:
Oozie Workflow: This is a sequential set of actions to be executed. You can think of it as a relay race, where each athlete waits for the previous one to complete their part.
Oozie Coordinator: These are the Oozie jobs which are triggered when data is made available to them. Think of this as the stimulus-response system in our body. In the same manner as we respond to an external stimulus, an Oozie coordinator responds to the availability of data and rests otherwise.
213. How do you configure an “Oozie” job in Hadoop?
“Oozie” is integrated with the rest of the Hadoop stack supporting several types of Hadoop jobs such as “Java MapReduce”, “Streaming MapReduce”, “Pig”, “Hive” and “Sqoop”.