How to use Hive for CRUD- Run updates and deletes on Hive
Hive is an Apache open source project which give ability to have relational database structure on hadoop platform. Hive uses Hadoop core components like HDFS for data storage and Map reduce to execute jobs. Since big data platform is maturing beyond map reduce, Hive project also started adopting new technology emerging on the platform. Execution engine in Hive has become changeable based on the requirement, at present it can be opted from more advanced technology like Tez or Spark.
The ability of Hive to form a relational database like structure on
files owes to its meta store. Hive meta store is a place, usually a relational database like MySQL or Derby which stored meta data about the files. File meta data include file structure, folder structure, partitions, buckets , serialization and De-serialization options for file read and write etc.
Usage of Hive for ETL is a very common practice these days. Hive's relational database capability helps to migrate data warehouse or data stores built in relational databases like Oracle, Teradata, Netezza etc. But as we know that relational database ensure ACID on transactions. i.e. atomicity, consistency, isolation and durability among transactions. Hive project is also working towards it.
Compared to relational databases, advantages of Hive are the flexibility and options it gives the user
- Different file formats - Hive supports all major file formats in hadoop, right from text file to compressed files like Parquet or ORC
- Different kind of optimization techniques - Hive support rule based optimization, cost based optimization and correlation optimization
- Different execution engines -Hive support map reduce, Spark or Tez - User can set execution engine on a query level based on his/her
- Optional Transaction engines.
From 0.3 version onward Hive started supporting CRUD (create, read, update and delete). The biggest challenge in Hadoop is random access of data from files and in place update or delete. That is largely due to the initial design of Hadoop to support really huge data transactions. This can be compared to bulk operation on relational databases, where logs are not kept and huge chunk of data is read/written to tables. Hadoop is designed to support high volume transactions. Inserts and appends work fine, but random update, delete is tricky. To overcome these challenge Hive writes a new file in case of updates/deletes and old file is obsoleted.
Think about a file with fields ID and NAME.
Assume that, we need to update John to Jane, what Hive does is, it keeps a new row with ID=1 and NAME as Jane and old row is obsoleted. We can remove this old row using a technique called Compaction, which can be run outside of the update process. Once the data is updated (i.e new file created) Hive meta store gets the new file information.
How does Hive know where the data exists which should be updated?
When we update row where ID=1 , Hive should know the data location of all rows belonging to ID=1, otherwise it will need a full table scan and the updates will be scattered across location. To overcome this complexity Hive supports only update on files where bucketing is enabled. Bucketing can be compared to hash partitioning in relational databases. Bucketing make sure data is evenly distributed in the cluster.
So in our case we need to make sure data is distributed based on ID field. Bucketing can be done on any field which make sure that data is distributed across the nodes. Best practice is to pick 2n buckets based on the data volume and nodes available on the cluster.
create table customer(id int ,name string ) clustered by (id) into 2 buckets stored as orc TBLPROPERTIES('transactional'='true');
At present Hive supports only update on ORC data. ORC is a columnar storage format on hadoop where data is stored based on columns rather than storing rows. This method is suitable if you are selecting only a few columns in the queries. This helps in faster retrieval of selected columns rather than reading all columns for a row.
insert into table customer values (1,'John'),(2,'Doe');
Once the data in inserted we can see on the Hive file location data is bucketed into two as we chose two buckets. The files are default created in a Delta folder.
Run the update statement based on ID (bucketed column)
update customer set name="Jane" where id=1;
Once the data is updated, you can see new directory is created for delta and updates are saved. Old directory is still in tact.
New delta directory with the changed file.
Old delta files can be removed either during update process itself. To do that set up the compaction properties in table creation script.
CREATE TABLE customer (
CLUSTERED BY (id) INTO 2 BUCKETS STORED AS ORC
compactor.mapreduce.map.memory.mb - specify compaction map job properties
compactorthreshold.Hive.compactor.delta.num.threshold - trigger minor compaction if there are more than 4 delta directories
compactorthreshold.Hive.compactor.delta.pct.threshold - trigger major compaction if the ratio of size of delta files to size of base files is greater than 50%
If the compaction has to be done outside load window, an ALTER statement can be issued at any point of time.
ALTER TABLE customer COMPACT 'minor' WITH OVERWRITE TBLPROPERTIES ("compactor.mapreduce.map.memory.mb"="3072");
-- specify compaction map job properties
ALTER TABLE customer COMPACT 'major' WITH OVERWRITE TBLPROPERTIES ("tblprops.orc.compress.size"="8192");
-- change any other Hive table properties
After issuing compaction process
ALTER TABLE customer COMPACT 'major'; the folders are renamed to Base and files are rearranged in the folder.
Compaction can be done on a whole table level or partition level, if the data is partitioned.
ALTER TABLE table_name [PARTITION (partiti [, ...])]COMPACT 'compaction_type'
[WITH OVERWRITE TBLPROPERTIES ("property"="value" [, ...])];
So far we have seen build activities to make updates/deletes happen on Hive. To enable CRUD on Hive configuration files also needs some modification.
Below properties has to be changed in Hive-site.xml .
Hope this helps to run CRUD on Hive tables. Please use below section to comment / discuss.