In this post we are going to learn about 2 different types of Hive tables and the significance of each. There are 2 types of tables in Hive and they are Managed Table and External table.
Definitions & When To Use What
Managed Table has full control over its dataset. That is, when you drop the table the table’s dataset or files will also be deleted from HDFS. External Table does not have full control over its dataset. That is, when you drop the table the dataset is not deleted from HDFS.
Now this explanation brings up a very important question – When do you use managed table and when do you use external table?
You would choose to use Managed Table when Hive is the only application using the dataset. Where as you would choose to use External Table when the underlying dataset pointed by the Hive table is shared by many applications like Pig, MapReduce jobs etc.
Think about this. When multiple applications are interested in a dataset, would you keep multiple copies of the same dataset one for each application? No you wouldn’t. Because most likely your dataset will be in the magnitude of gigabytes or terabytes and so it does not make sense to keep multiple copies of the dataset. Which means when a single copy of the dataset is shared between applications. You don’t want Hive to delete the dataset when the table is dropped. So when the data behind the Hive table is shared by multiple applications it is better to make the table an external table.
Managed Table – Creation & Drop Experiment
Now that we understand the difference between Managed and External table lets see how to create a Managed table and how to create an external table.
By default when you create a table, it is created as a Managed table. If you want to create external table you have to specify the keyword external when you create the table.
CREATE TABLE IF NOT EXISTS stocks ( exch string, symbol string, ymd string, price_open float, price_high float, price_low float, price_close float, volume int, price_adj_close float) ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘,’;
Let execute a describe command on the stocks table and check out the table type. It will say managed table.
We have already loaded the table with data. Let me now demonstrate what happens when we drop the table. Table is now dropped lets check out the location attribute in HDFS. Oops the dataset is now dropped as well. So if anyone is referring to this dataset, tough luck, we deleted the data.
Like what you are reading? You would like our live webinars too. Get notified when we hot webinars.
External Table – Creation & Drop Experiment
Lets now look at the external table. First we will create an External table. Look at the syntax, we have to specify the EXTERNAL keyword.
CREATE EXTERNAL TABLE IF NOT EXISTS stocks_ext ( exch string, symbol string, ymd string, price_open float, price_high float, price_low float, price_close float, volume int, price_adj_close float) ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘,’;
Let execute a describe command on the external table we just created and check out the table type and it would say external table.
Lets now load the external table and verify the location of this table before and after dropping the table. Here is the command to load the dataset.
hive> LOAD DATA INPATH ‘input/hive/stocks_db’ INTO TABLE stocks_ext;
Lets drop the table and then check out the location behind the table. Now we can still see the dataset. which is exactly what we expect to see with external table.
So knowing when to use Managed table and when to use external table is crucial. Agree? Always make the table external when Hive is not the only tool using or managing the data pointed by the table.
Like what you read? You would like our live webinars too. Get notified when we hot webinars.