Hive Scenario-Based Interview Questions

Interview questions on Hive may be direct or application-based.
  • What applications are supported by Hive? …
  • What are the different tables available in Hive? …
  • What is the difference between external and managed tables? …
  • Where does the data of a Hive table get stored? …
  • Can Hive be used in OLTP systems?

In this post, we have put together the best Hive interview questions and answers for beginner, intermediate and experienced candidates. These questions and quizzes are meant for quick browsing before an interview, or as a detailed guide to the Hive topics interviewers look for.

15 most asked Hive Interview Questions and Answers

Answer: If we drop the partition directory, say hdfs://sandbox-hdp.hortonworks.com:8020/apps/hive/warehouse/bdp.db/partitioned_test_external2_parquet/yearofexperience=3, from the HDFS location, the partition will still be listed when you run show partitions on the table.
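The partition is still listed because show partitions reads the Hive metastore rather than HDFS, so deleting the directory by hand leaves the partition metadata behind. A minimal sketch of how the stale entry could be inspected and removed, assuming the database is named bdp (inferred from the bdp.db segment of the path) and that the partition column is numeric:

```sql
-- Assumption: the database is `bdp`, inferred from the warehouse path bdp.db.
USE bdp;

-- Still lists yearofexperience=3, because the listing comes from the metastore.
SHOW PARTITIONS partitioned_test_external2_parquet;

-- Remove the stale metadata entry explicitly.
ALTER TABLE partitioned_test_external2_parquet
  DROP IF EXISTS PARTITION (yearofexperience = 3);
```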

4. Let’s take the same Hive partitioned table as in the previous question. If we drop the partition, will we be able to access the data?

3. A Hive table is created partitioned by a column, say yearofexperience. If we create a directory, say yearofexperience=3, at the HDFS path of the table and dump a data set that matches the table structure into it, will the data be available when we run a select query on the table?
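In general the data is not visible straight away: a directory created directly in HDFS is unknown to the metastore, so a select does not return its rows until the partition is registered. A sketch of the two usual ways to register it, using a hypothetical table name employee_part:

```sql
-- Hypothetical table name, for illustration only.

-- Option 1: register the single partition explicitly.
ALTER TABLE employee_part ADD IF NOT EXISTS PARTITION (yearofexperience = 3);

-- Option 2: let Hive scan the table location and add every missing partition.
MSCK REPAIR TABLE employee_part;

-- The rows in the yearofexperience=3 directory can now be queried.
SELECT * FROM employee_part WHERE yearofexperience = 3 LIMIT 10;
```

Either option is enough on its own; MSCK REPAIR TABLE is convenient when several partition directories were created at once.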

2. A Hive table is created as an external table at a location, say hdfs://usr/data/table_name. If we dump a data set that matches the table structure into that location, will we be able to fetch the records from the table using a select query?

1. Let’s say a Hive table is created as an external table. If we drop the table, will the data be accessible?
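For questions 1 and 2, the behaviour follows from how external tables work: Hive keeps only the metadata, while the files stay wherever the table location points. The sketch below illustrates both cases; the table name, columns, delimiter and path are hypothetical.

```sql
-- Hypothetical table, columns and location, for illustration only.
CREATE EXTERNAL TABLE employee_ext (
  id   INT,
  name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/data/employee_ext';

-- Files copied into the location (for example with `hdfs dfs -put`) are
-- readable right away, as long as they match the declared layout.
SELECT * FROM employee_ext LIMIT 10;

-- Dropping an external table removes only the metastore entry;
-- the files under /user/data/employee_ext stay in HDFS.
DROP TABLE employee_ext;
```

Re-running the same CREATE EXTERNAL TABLE statement afterwards makes the surviving files queryable again.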


1) What is the difference between Pig and Hive?

Pig vs Hive

Criteria | Pig | Hive
Type of Data | Usually used for semi-structured data. | Used for structured data.
Schema | Schema is optional. | Requires a well-defined schema.
Language | A procedural data flow language. | A declarative language that follows a SQL dialect.
Purpose | Mainly used for programming. | Mainly used for reporting.
General Usage | Usually used on the client side of the Hadoop cluster. | Usually used on the server side of the Hadoop cluster.
Coding Style | Verbose | More like SQL


2) What is the difference between HBase and Hive?

Hive vs HBase

HBase | Hive
Does not allow execution of SQL queries. | Allows execution of most SQL queries.
Runs on top of HDFS. | Runs on top of Hadoop MapReduce.
Is a NoSQL database. | Is a data warehouse framework.
Supports record-level insert, update and delete operations. | Does not support record-level insert, update and delete on ordinary (non-transactional) tables.


2) I do not need the index created in the first question anymore. How can I delete the above index named index_bonuspay?

DROP INDEX index_bonuspay ON employee;
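The statement that created the index is not included in this text, so the following is only a sketch of the full lifecycle, assuming the index was built on a hypothetical bonus column of the employee table. Note that index support was removed in Hive 3.0, so these statements apply to earlier releases.

```sql
-- Assumption: the index covers a hypothetical `bonus` column of employee.
CREATE INDEX index_bonuspay
ON TABLE employee (bonus)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;

-- Confirm which indexes exist on the table.
SHOW FORMATTED INDEX ON employee;

-- IF EXISTS avoids an error when the index has already been removed.
DROP INDEX IF EXISTS index_bonuspay ON employee;
```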

In this case, the total memory available in the cluster is 40 GB. Initially, both ApplicationA and ApplicationB have some resources allocated to the jobs in their respective queues: ApplicationA holds 20 GB and ApplicationB holds 8 GB, so only 12 GB (40 GB - (20 GB + 8 GB)) remains in the cluster. Each queue then requests a map task of size 32 GB. ApplicationA needs another 12 GB for its map task to run, and the fair scheduler grants the container requesting 12 GB of memory to ApplicationA. ApplicationB needs another 24 GB to run its map task. That memory is not available for the application, and hence the DRF will try to use 8 GB from memory, and the remaining 20 GB will be used from the CPU.

3) Imagine that you are uploading a 500 MB file into HDFS. 100 MB of the data has been uploaded successfully, and another client wants to read it while the upload is still in progress. What happens in this scenario? Will the 100 MB that has already been uploaded be visible?

The performance of Hadoop is heavily influenced by the number of map and reduce tasks. More tasks increase the framework overhead, but they also allow better load balancing and reduce the cost of failures. At one end of the spectrum, a single map task and a single reduce task mean no distribution at all. At the other end, the framework may run out of resources to run all the tasks.
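From a Hive session, the task counts can be influenced with a few session settings. The sketch below assumes Hive is running on the MapReduce engine; the numeric values are illustrative, not recommendations.

```sql
-- Ask for an explicit number of reduce tasks for the queries that follow.
SET mapreduce.job.reduces=8;

-- Or let Hive estimate it: roughly one reducer per 256 MB of input,
-- capped by hive.exec.reducers.max.
SET mapreduce.job.reduces=-1;
SET hive.exec.reducers.bytes.per.reducer=268435456;
SET hive.exec.reducers.max=100;

-- Larger input splits mean fewer map tasks.
SET mapreduce.input.fileinputformat.split.maxsize=268435456;
```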

One approach is to create a Hive table that uses the HBase table as its data source. An existing HBase table, even one containing multiple column families and columns, can be mapped to Hive with the CREATE EXTERNAL TABLE statement. The HBase columns have to be mapped as well, and the mapping is validated against the column families of the existing HBase table. Specifying the name of the HBase table is optional. With this setup, any change made to the table in HBase is reflected in the Hive table as well.
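A minimal sketch of such a mapping is shown below. The Hive table name, the HBase table name, the columns and the column families (personal, professional) are all hypothetical; only the storage handler class and the two property names come from Hive's HBase integration.

```sql
-- Hypothetical names: Hive table hbase_employee, HBase table employee,
-- column families personal and professional.
CREATE EXTERNAL TABLE hbase_employee (
  rowkey STRING,
  name   STRING,
  city   STRING,
  salary INT
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'hbase.columns.mapping' = ':key,personal:name,personal:city,professional:salary'
)
TBLPROPERTIES ('hbase.table.name' = 'employee');

-- Queries read the current contents of the HBase table.
SELECT name, salary FROM hbase_employee LIMIT 10;
```

The entries in hbase.columns.mapping line up positionally with the Hive columns, and :key marks the HBase row key.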

Although the default block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x, for this scenario let us assume a block size of 100 MB, which means the file is split into 5 blocks, each replicated 3 times (the default replication factor). Let’s consider an example of how a block is written to HDFS:


3) Can you list a few commonly used Hive services?

  • Command Line Interface (cli)
  • Hive Web Interface (hwi)
  • HiveServer (hiveserver)
  • rcfilecat, a tool for printing the contents of an RC file
  • Jar
  • Metastore

4) Suppose that I want to monitor all the open and aborted transactions in the system along with the transaction id and the transaction state. Can this be achieved using Apache Hive?

Hive 0.13.0 and later versions support the SHOW TRANSACTIONS command, which helps administrators monitor Hive transactions.
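A short sketch of how this looks in a session, assuming transactions are not already enabled in the cluster configuration (in which case the two SET lines are unnecessary):

```sql
-- Transactions need the DbTxnManager and concurrency support enabled.
SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- Lists open and aborted transactions with their id and state.
SHOW TRANSACTIONS;

-- Related: locks currently held or requested.
SHOW LOCKS;
```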

FAQ

What issues are faced with Hive in real-world use?

Many real-world problems need nested queries, whereas Hive's support for subqueries is limited. Older Hive versions also have no subtract (MINUS) operation, so the usual workaround is to take the two tables and perform a left outer join between them, with a condition that keeps only the unmatched rows.
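The left outer join workaround can be sketched as follows. The two tables and the customer_id column are hypothetical.

```sql
-- Hypothetical tables: all_customers(customer_id), churned_customers(customer_id).
-- Returns the customers present in all_customers but absent from churned_customers.
-- DISTINCT mirrors the set semantics of a subtract operation.
SELECT DISTINCT a.customer_id
FROM all_customers a
LEFT OUTER JOIN churned_customers c
  ON a.customer_id = c.customer_id
WHERE c.customer_id IS NULL;
```

Rows from the left table that find no match get NULLs for the right table's columns, so the IS NULL filter keeps exactly the unmatched rows.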

Can we change settings within a Hive session? If yes, how?

Yes, settings can be changed within a Hive session using the SET command, which adjusts the Hive job settings for a specific query. For example, the following command shows that buckets are occupied according to the table definition: hive> SET hive.
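A few typical uses of SET are sketched below. The properties are my own picks for illustration and are not recovered from the truncated command above; hive.enforce.bucketing in particular only has an effect on Hive releases before 2.0.

```sql
-- Show the current value of one property.
SET hive.execution.engine;

-- Override a property for the rest of this session only.
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Bucketing-related example (honoured by Hive releases before 2.0).
SET hive.enforce.bucketing=true;

-- List every Hive and Hadoop setting visible to the session.
SET -v;
```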

Can Hive be used for real-time queries?

Hive cannot be used for real-time data querying, since it takes quite some time to return results. Its subquery support is also limited compared with a conventional RDBMS. Hive does not support online transaction processing (OLTP); it is designed for online analytical processing (OLAP).

Does Hive support record-level operations?

Hive is a data warehouse framework, whereas HBase is a NoSQL database. While Hive can run most SQL queries, HBase does not allow SQL queries. Hive does not support record-level insert, update, and delete operations on an ordinary table (they are available only on ACID transactional tables, from Hive 0.14 onward), whereas HBase supports these operations natively.
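For completeness, here is a sketch of record-level operations on a transactional table, which Hive 0.14 and later allow. The table and its columns are hypothetical, and the session settings are assumed not to be enabled cluster-wide already; Hive releases before 2.0 may additionally need bucketing enforcement switched on.

```sql
-- Transactions need the DbTxnManager and concurrency support enabled.
SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- Hypothetical ACID table: bucketed, stored as ORC, marked transactional.
CREATE TABLE employee_acid (
  id    INT,
  name  STRING,
  bonus DOUBLE
)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

INSERT INTO TABLE employee_acid VALUES (1, 'Asha', 1000.0), (2, 'Ravi', 1500.0);
UPDATE employee_acid SET bonus = 2000.0 WHERE id = 1;
DELETE FROM employee_acid WHERE id = 2;
```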
