50 Important Hive Interview Questions For 2021

What is a SerDe in Hive?

SerDe is short for Serializer/Deserializer. Hive uses a SerDe to read and write data from tables. An important concept behind Hive is that it does NOT own the format of the data stored in the Hadoop File System. Users can write files to HDFS with whatever tools or mechanisms they like (for example, “CREATE EXTERNAL TABLE” or “LOAD DATA INPATH”) and rely on Hive to correctly “parse” that file format into a form Hive can use. The SerDe is the powerful (and customizable) mechanism Hive uses to “parse” data stored in HDFS.
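
For illustration, here is a minimal sketch of attaching a SerDe to a table at creation time. The table and column names are invented for this example; org.apache.hive.hcatalog.data.JsonSerDe is the JSON SerDe bundled with Hive's HCatalog:

    -- Hypothetical table whose rows are stored as one JSON object per line.
    CREATE TABLE employee_json (
      name   STRING,
      salary DOUBLE
    )
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    STORED AS TEXTFILE;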

Is it possible to change the default location of a managed table?

Yes. With the help of the LOCATION keyword, we can change the default location of a managed table while creating it in Hive. To do so, the user specifies the desired storage path as the value of the LOCATION clause.
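
A minimal sketch, with an invented table name and path, of setting a custom location at creation time:

    -- Hypothetical managed table stored under a non-default warehouse path.
    CREATE TABLE sales_managed (
      id     INT,
      amount DOUBLE
    )
    LOCATION '/user/hive/custom_warehouse/sales_managed';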

What is HiveServer2?

HiveServer2 is a server interface, and part of Hive Services, that enables remote clients to execute queries against Hive and retrieve the results. The current implementation (HS2), based on Thrift RPC, is an improved version of HiveServer1 and supports multi-client concurrency and authentication. It is designed to provide better support for open-API clients such as the JDBC and ODBC drivers.
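
As a quick illustration, a remote client can reach HiveServer2 through its JDBC interface using the Beeline shell; the host, database, and user below are placeholders (10000 is the conventional default port):

    beeline -u "jdbc:hive2://localhost:10000/default" -n hiveuser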

Can a single data file have multiple schemas in Hive?

Yes. Because Hive creates a schema and lays it on top of an existing data file, one data file can have multiple schemas. Each schema is saved in the Hive metastore, and the data is not parsed or serialized to disk in any given schema; the schema is applied only when the data is read. For example, if a data file in HDFS has 5 columns (name, job, dob, id, salary), we can define multiple schemas by choosing any number of columns from that list (a table with 3 columns, 5 columns, or even 6 columns).
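
A minimal sketch of this schema-on-read behavior, with invented names and an assumed comma-delimited file under /data/employees: two external tables read the same directory with different column lists, and the columns are resolved only when a query runs:

    -- Full five-column view of the file.
    CREATE EXTERNAL TABLE emp_full (
      name STRING, job STRING, dob STRING, id INT, salary DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/employees';

    -- Narrower three-column view of the very same data; trailing fields
    -- in each row are simply ignored at read time.
    CREATE EXTERNAL TABLE emp_slim (
      name STRING, job STRING, dob STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/employees';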

Can multiple clients access the metastore concurrently with the default configuration?

No. The default metastore configuration allows only one Hive session to be opened at a time for accessing the metastore, so if multiple clients try to access it simultaneously, they will get an error. One has to use a standalone metastore, i.e. a local or remote metastore configuration in Apache Hive, to allow multiple clients to access it concurrently.
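
As a hedged sketch, a standalone metastore is typically configured by pointing hive-site.xml at an external database; the host and database names below are placeholders:

    <!-- Hypothetical hive-site.xml fragment: back the metastore with MySQL
         so several Hive clients can connect at the same time. -->
    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://metastore-host:3306/metastore_db</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
    </property>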

A common way to load data into Hive is to create an external table that points to an HDFS directory. You can copy a file into that HDFS location using either of the HDFS commands put or cp. Here, once I create the table named PAGE_VIEW_STG, I use the HDFS put command to load the data into the table.
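
A minimal sketch of that flow, assuming comma-delimited text and an invented staging path (the column list is illustrative):

    -- Hypothetical staging table over an HDFS directory.
    CREATE EXTERNAL TABLE page_view_stg (
      view_time    STRING,
      user_id      BIGINT,
      page_url     STRING,
      referrer_url STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/user/data/staging/page_view';

    -- Copy the local file into the table's directory (Hive CLI dfs syntax).
    dfs -put /tmp/pv_2016-03-09.txt /user/data/staging/page_view/;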

Note that you can transform the initial data and load it into another Hive table, as shown in this example. The file /tmp/pv_2016-03-09.txt contains the page views served on 9 March 2016. These page views are loaded into the PAGE_VIEW table from the initial staging table, PAGE_VIEW_STG, using the following statement.
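
One possible form of that statement, assuming a date-partitioned PAGE_VIEW table with the staging columns sketched above:

    -- Hypothetical transform-and-load from the staging table.
    FROM page_view_stg pvs
    INSERT OVERWRITE TABLE page_view PARTITION (dt = '2016-03-09')
    SELECT pvs.view_time, pvs.user_id, pvs.page_url, pvs.referrer_url;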

3) Can you list a few commonly used Hive services?

  • Command Line Interface (cli)
  • Hive Web Interface (hwi)
  • HiveServer (hiveserver)
  • Printing the contents of an RC file with the rcfilecat tool
  • Jar
  • Metastore
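
Each of these can be launched through the hive command's --service flag; a few illustrative invocations (the file path is a placeholder):

    hive --service cli                        # interactive shell (the default)
    hive --service hwi                        # Hive web interface
    hive --service hiveserver                 # server for remote clients
    hive --service metastore                  # standalone metastore service
    hive --service rcfilecat /path/file.rc    # print an RC file's contents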

4) Suppose that I want to monitor all the open and aborted transactions in the system, along with the transaction ID and the transaction state. Can this be achieved using Apache Hive?

Hive 0.13.0 and later versions support the SHOW TRANSACTIONS command, which helps administrators monitor open and aborted Hive transactions.
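
For example:

    -- Lists each transaction's ID and state (plus the requesting user
    -- and host), covering open and aborted transactions.
    SHOW TRANSACTIONS;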

19. Why do we need buckets?

There are two main reasons for bucketing a partition:

  • A map-side join requires the data belonging to a unique join key to be present in the same partition. But what about the cases where your partition key differs from the join key? In those cases, you can still perform a map-side join by bucketing the table on the join key (see the sketch after this list).
  • Bucketing makes sampling more efficient and therefore helps decrease query time.
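
A minimal sketch, with invented table and column names, of bucketing on a prospective join key:

    -- Hypothetical table bucketed on user_id, the intended join key.
    SET hive.enforce.bucketing = true;   -- required on older Hive releases

    CREATE TABLE clicks_bucketed (
      user_id BIGINT,
      url     STRING
    )
    CLUSTERED BY (user_id) INTO 32 BUCKETS;

    -- Sampling can now read a single bucket instead of the whole table.
    SELECT * FROM clicks_bucketed TABLESAMPLE (BUCKET 1 OUT OF 32 ON user_id);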

5. Why does Hive not store metadata information in HDFS?

Hive stores metadata information in the metastore, which is backed by an RDBMS rather than HDFS. An RDBMS is chosen to achieve low latency, since HDFS read/write operations are time-consuming.
