- What applications are supported by Hive?
- What are the different tables available in Hive?
- What is the difference between external and managed tables?
- Where does the data of a Hive table get stored?
- Can Hive be used in OLTP systems?
- Can a table name be changed in Hive?
Prepare better with the best interview questions and answers, and walk away with top interview tips. These interview questions and answers will boost your core interview skills and help you perform better. Be smarter with every interview.
Basically, Hive is the tool we use to process structured data in Hadoop. It is a data warehouse infrastructure that resides on top of Hadoop and is used to summarize Big Data, making querying and analysis easy.
The metastore is the central repository of Apache Hive metadata. It stores metadata for Hive tables (such as their schema and location) and partitions in a relational database, and it provides client access to this information through the metastore service API.
By default, Hive offers an embedded Derby database instance backed by the local disk for the metastore. This is what we call the embedded metastore configuration.
Apache Hive provides different built-in operators for data operations on the tables present inside the Apache Hive warehouse.
Hive operators are used for mathematical operations on operands and return a specific value as per the logic applied.
Apache Hive offers another technique for decomposing table data sets into more manageable parts. This technique is called bucketing.
In Hive, tables or partitions are subdivided into buckets based on the hash of a column in the table, giving extra structure to the data that may be used for more efficient queries, as in the sketch below.
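A minimal sketch of a bucketed table; the table name, columns, and bucket count are illustrative assumptions:

```sql
-- Rows are assigned to one of 32 buckets by hashing user_id.
CREATE TABLE user_actions (
  user_id     INT,
  action      STRING,
  action_time STRING
)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;
```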
Internal Table (Managed Table): A managed table is also known as an internal table and is the default table type in Hive. When a user creates a table in Hive without specifying it as external, a managed table is created by default.
If we create a table as a managed table, the table is created in a specific location in HDFS, under the Hive warehouse directory.
External Tables: An external table is mostly created when the data is also used outside of Hive. Whenever we want to be able to delete the table’s metadata while keeping the table’s data as it is, we use an external table; dropping an external table only deletes the schema of the table. A hedged example follows.
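A minimal sketch, assuming tab-delimited files already sit under a hypothetical /data/logs/ directory in HDFS:

```sql
CREATE EXTERNAL TABLE logs_ext (
  log_ts  STRING,
  message STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/logs/';

-- Dropping the table removes only the metadata; the files under /data/logs/ remain in HDFS.
DROP TABLE logs_ext;
```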
We can also use an ODBC driver application, since Hive supports ODBC connections to the Hive server.
SerDe is short for Serializer/Deserializer. Hive uses a SerDe to read and write data from tables. An important concept behind Hive is that it does NOT own the file format in which data is stored in HDFS. Users can write files to HDFS with whatever tool or mechanism they prefer (for example, “CREATE EXTERNAL TABLE” or “LOAD DATA INPATH”) and then use Hive to correctly parse that file format so Hive can query it. A SerDe is the powerful (and customizable) mechanism that Hive uses to parse data stored in HDFS; a hedged example is shown below.
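For instance, a table can be declared with the built-in OpenCSVSerde so Hive knows how to parse CSV files; the table name, columns, and path here are assumptions:

```sql
CREATE EXTERNAL TABLE sales_csv (
  order_id STRING,
  amount   STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",")
STORED AS TEXTFILE
LOCATION '/data/sales_csv/';
```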
Hive uses ObjectInspector to analyze the internal structure of the row object and also the structure of the individual columns. ObjectInspector provides a uniform way to access complex objects that can be stored in multiple formats in memory, including:
- an instance of a Java class (Thrift or native Java),
- a standard Java object (for example, java.util.List for structs and arrays, java.util.Map for maps), and
- a lazily-initialized object.
Yes, with the help of the LOCATION keyword, we can change the default location of a managed table while creating it. To do so, the user needs to specify the desired storage path of the managed table as the value of the LOCATION keyword, as in the sketch below.
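A minimal sketch; the table name and HDFS path are placeholders:

```sql
-- The managed table's data will live under the given path instead of the default warehouse directory.
CREATE TABLE employees_managed (
  id   INT,
  name STRING
)
LOCATION '/user/custom/warehouse/employees_managed';
```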
HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools, such as Pig and MapReduce, to more easily read and write data on the grid.
HCatalog can be used to share data structures with external systems. It gives users of other Hadoop tools access to the Hive metastore so that they can read and write data in the Hive data warehouse.
The main purpose of the Hive Thrift server is to allow access to Hive over a single port.
Thrift itself is a software framework for scalable cross-language services development. It allows clients written in languages including Java, C++, Ruby, and many others to programmatically access Hive remotely.
HiveServer2 (HS2) is a server interface and part of Hive Services that enables remote clients to execute queries against Hive and retrieve the results. The current implementation, based on Thrift RPC, is an improved version of HiveServer1 and supports multi-client concurrency and authentication. It is designed to provide better support for open-API clients such as JDBC and ODBC drivers.
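For example, a remote client can reach HiveServer2 over JDBC through Beeline; the host, port, and user name below are placeholders:

```bash
# Connect to HiveServer2 on its default port (10000) and run queries remotely.
beeline -u "jdbc:hive2://hs2-host:10000/default" -n hiveuser
```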
If the data is already present in HDFS, the user need not use LOAD DATA, which moves the files into /user/hive/warehouse/. The user simply has to define the table using the EXTERNAL keyword, which creates the table definition in the Hive metastore while leaving the files where they are.
Hive stores metadata information in the metastore using an RDBMS instead of HDFS. The main reason for choosing an RDBMS is to achieve low latency, because HDFS read/write operations are time-consuming.
Hive organizes tables into partitions for grouping similar types of data together based on a column or partition key. Each table can have one or more partition keys to identify a particular partition. Physically, a partition is nothing but a sub-directory within the table directory, as the sketch below illustrates.
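A minimal sketch; the table and partition column are hypothetical:

```sql
CREATE TABLE page_views (
  user_id  INT,
  page_url STRING
)
PARTITIONED BY (view_date STRING);

-- Each partition is stored as a sub-directory of the table directory,
-- e.g. .../page_views/view_date=2016-03-09/
```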
Hive creates a schema and applies it on top of an existing data file, so one can have multiple schemas for one data file. Each schema is saved in the Hive metastore, and the data is not parsed or serialized to disk according to a given schema; the schema is applied only when the data is retrieved. For example, if the data file has 5 columns (name, job, dob, id, salary), we can define multiple schemas by choosing any number of columns from that list (a table with 3 columns, 5 columns, or even 6 columns).
However, while querying, if we specify any column beyond that list, the query will return NULL values for it.
Whenever we run Hive in embedded mode, it automatically creates a local metastore; before creating it, Hive checks whether a metastore already exists or not. This behavior is defined in the configuration file hive-site.xml by the property javax.jdo.option.ConnectionURL, whose default value is jdbc:derby:;databaseName=metastore_db;create=true.
To change this behavior, change the database location to an absolute path, so the metastore will be used from that location.
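A hedged sketch of that hive-site.xml entry with an absolute Derby path (the path itself is a placeholder):

```xml
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <!-- Absolute path, so the same metastore is used regardless of the working directory. -->
  <value>jdbc:derby:;databaseName=/home/hiveuser/metastore_db;create=true</value>
</property>
```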
RLIKE is a special function in Hive that evaluates to true if any substring of A matches B, where B follows Java regular expression syntax. Users do not need to put the % symbol for a simple match with RLIKE.
Moreover, RLIKE comes in handy when the string contains extra spaces: it satisfies such scenarios without using the TRIM function. Suppose A has the value ‘Express  ’ (with two extra spaces) and B has the value ‘Express’; in this situation, RLIKE works without TRIM, as the hedged example below shows.
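A minimal sketch of that scenario (recent Hive versions allow a SELECT without a FROM clause):

```sql
-- 'Express  ' has trailing spaces; RLIKE matches because a substring matches the pattern.
SELECT 'Express  ' RLIKE 'Express';   -- true, no TRIM or % wildcard needed
-- LIKE without a wildcard requires an exact match, so the trailing spaces make it fail.
SELECT 'Express  ' LIKE 'Express';    -- false
```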
Yes, by using the hive -e option, we can run any kind of Hive query directly from the terminal without logging into the Hive shell.
You can also save the output to a file by using the Linux ‘>’ redirection operator, as shown below:
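A hedged sketch; the query and output path are placeholders:

```bash
# Run a query non-interactively and redirect its output to a file.
hive -e 'SELECT * FROM sample_table LIMIT 10' > /tmp/hive_output.txt
```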
In a local metastore configuration, the metastore service runs in the same JVM in which the Hive service is running and connects to a database running in a separate JVM, either on the same machine or on a remote machine.
In the remote metastore configuration, the metastore service runs on its own separate JVM and not in the Hive service JVM. Other processes communicate with the metastore server using Thrift Network APIs. You can have one or more metastore servers in this case to provide more availability.
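For reference, clients locate a remote metastore through the hive.metastore.uris property in hive-site.xml; the hosts below are placeholders:

```xml
<property>
  <name>hive.metastore.uris</name>
  <!-- A comma-separated list allows more than one metastore server for availability. -->
  <value>thrift://metastore-host-1:9083,thrift://metastore-host-2:9083</value>
</property>
```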
There are two ways to know the current database: one is temporary, set for the current CLI session, and the other is persistent, configured so that it applies to every session.
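One common way of doing this (an assumption about what the answer refers to) is the hive.cli.print.current.db property: setting it in the CLI lasts only for the session, while putting the same property in hive-site.xml or .hiverc makes it persistent.

```sql
-- Temporary, for this session only: show the current database in the CLI prompt.
set hive.cli.print.current.db=true;

-- Alternatively, query it directly.
SELECT current_database();
```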
The default metastore configuration allows only one Hive session to be opened at a time for accessing the metastore. Therefore, if multiple clients try to access the metastore at the same time, they will get an error. One has to use a standalone metastore, i.e. a local or remote metastore configuration, in Apache Hive to allow multiple clients to access it concurrently.
A common way to load data into Hive is to create an external table. You can create an external table that points to an HDFS directory and copy an external file into that HDFS location using the HDFS put (or copyFromLocal) command. Here, once I create the table named PAGE_VIEW_STG, I use the HDFS put command to load the data into the table.
Note that you can transform the initial data and load it into another Hive table, as sketched below. The file /tmp/pv_2016-03-09.txt contains the page views served on 9 March 2016. These page views are loaded into the PAGE_VIEW table from the initial staging table named PAGE_VIEW_STG, using an INSERT ... SELECT statement.
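A hedged reconstruction of that flow; the column list, file format, and HDFS paths are assumptions rather than the article's exact DDL, and a PAGE_VIEW table with matching columns is assumed to already exist:

```sql
-- Staging table over an HDFS directory.
CREATE EXTERNAL TABLE page_view_stg (
  view_time    INT,
  user_id      BIGINT,
  page_url     STRING,
  referrer_url STRING,
  ip           STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/data/staging/page_view';

-- From the shell: hadoop fs -put /tmp/pv_2016-03-09.txt /user/data/staging/page_view/

-- Transform the staged rows and load them into the final table.
INSERT OVERWRITE TABLE page_view
SELECT view_time, user_id, page_url, referrer_url, ip
FROM page_view_stg;
```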
One can use the SequenceFile format, which will group these small files together to form a single sequence file. The steps followed in doing so are sketched below.
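A minimal sketch of those steps, assuming hypothetical table names and an HDFS directory full of small text files:

```sql
-- 1. An external table over the directory that holds the many small files.
CREATE EXTERNAL TABLE small_files_tmp (line STRING)
LOCATION '/data/small_files/';

-- 2. A target table stored in the SequenceFile format.
CREATE TABLE merged_seq (line STRING)
STORED AS SEQUENCEFILE;

-- 3. Rewrite the data into the SequenceFile table, consolidating the small files.
INSERT OVERWRITE TABLE merged_seq
SELECT line FROM small_files_tmp;
```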
Hence, a single SequenceFile is generated that contains the data present in all of the input files, and the problem of having lots of small files is eliminated.
Yes, we can run UNIX shell commands from Hive by using the ! mark before the command. For example, !pwd at the Hive prompt will print the current directory.
Hive Interview Questions for Freshers & Experienced
Here are Hive interview questions and answers for freshers as well as experienced candidates to get their dream job.
Explain what Hive is.
Hive is an ETL and data warehousing tool developed on top of the Hadoop Distributed File System (HDFS). It is a data warehouse framework for querying and analyzing data stored in HDFS. Hive is open-source software that lets programmers analyze large data sets on Hadoop.
The different vendor-specific distributions of Hadoop are Cloudera, MapR, Amazon EMR, Microsoft Azure HDInsight, IBM InfoSphere, and Hortonworks (now part of Cloudera).
Second, it really doesn’t matter much if you could not answer a few questions, but it matters that whatever you answered, you answered with confidence. So just feel confident during your interview. We at tutorialspoint wish you the best of luck in getting a good interviewer and all the very best for your future endeavors. Cheers 🙂
Dear readers, these Hive interview questions have been designed specially to get you acquainted with the nature of questions you may encounter during your interview for the subject of Hive. In my experience, good interviewers hardly plan to ask any particular question during your interview; normally, questions start with some basic concept of the subject and later continue based on further discussion and what you answer, for example: What are the different types of tables available in Hive?
FAQ
How does Hive work in Hadoop?
- RDBMS vs Hadoop?
- Explain Big Data and its characteristics?
- What is Hadoop and list its components?
- What is YARN and explain its components?
- What is the difference between a regular file system and HDFS?
- What are the Hadoop daemons and explain their roles in a Hadoop cluster?
Can you avoid MapReduce on Hive?
What is the maximum data size Hive can handle?