Metadata Syncing Between Two Hive Metastores

Teja Dogiparthi
4 min read · Nov 13, 2020


Overview

This post assumes familiarity with the Hadoop framework and Apache Hive, data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. By the end of this blog, you will know how tables can be synced between metastores in near real time.

Problem Scenario

The problem we faced was syncing metadata between two Hive metastores, which can be in different regions, in the same region, or even in entirely different accounts.

Approach We Considered

Based on our research and brainstorming, we dug into Hive Metastore Listeners and understood how to connect to a Hive metastore in an environment and gather information about its metadata changes.

What we showcase in this article are straightforward implementations to get you started.

Hive Metastore Listeners

In short, we create a class with our custom logic that extends an interface or abstract class, and that class is invoked whenever a listener event fires.

Metastore Listeners: Properties and Classes

Each listener type comes as a pair: a property and a class.

The property determines where the event is triggered, and the class is the one you extend so that the Hive runner knows how to call it.

The advantage of using a metastore listener is that it cannot affect query processing; it is read-only.

We will work with the hive.metastore.event.listeners property.

Within that property, we can choose between a list of events from the MetaStoreEventListener Abstract Class.
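
As an illustration, the property can be registered in hive-site.xml. The listener class name below (com.example.MetastoreSyncListener) is a hypothetical placeholder for your own implementation:

<!-- hive-site.xml: register a custom metastore listener.
     The class name is a hypothetical placeholder. -->
<property>
  <name>hive.metastore.event.listeners</name>
  <value>com.example.MetastoreSyncListener</value>
</property>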

Methods from the MetaStoreEventListener with their Parameters

To work with it, we override the methods we want; when the corresponding event occurs, we receive the event object as a parameter.

Since we want to capture only the events that actually change metadata, we override just those methods and reconstruct the entire DDL based on each event.
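
Here is a minimal sketch of such a listener; the publish helper is hypothetical and stands in for whatever serializes the event and pushes it to the queue:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.metastore.MetaStoreEventListener;
import org.apache.hadoop.hive.metastore.api.MetaException;
import org.apache.hadoop.hive.metastore.events.CreateTableEvent;
import org.apache.hadoop.hive.metastore.events.DropPartitionEvent;

public class MetastoreSyncListener extends MetaStoreEventListener {

    public MetastoreSyncListener(Configuration config) {
        super(config);
    }

    @Override
    public void onCreateTable(CreateTableEvent event) throws MetaException {
        // Reconstruct the CREATE TABLE DDL from event.getTable() and publish it.
        publish("CREATE_TABLE",
                event.getTable().getDbName(),
                event.getTable().getTableName());
    }

    @Override
    public void onDropPartition(DropPartitionEvent event) throws MetaException {
        // Partition drops also change metadata, so they are captured as well.
        publish("DROP_PARTITION",
                event.getTable().getDbName(),
                event.getTable().getTableName());
    }

    // Hypothetical helper: build the JSON event shown later in this post
    // and push it to the source queue of this Hive server.
    private void publish(String statementType, String database, String table) {
        // ... serialize and enqueue ...
    }
}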

Architecture

The architecture of the solution.

Event Lifecycle

We use a pub-sub (publish and subscribe) model. A listener is installed on all the Hive server nodes, and every server has its own queue. Once an event is generated, it is pushed to the source queue; subscribers then consume the events from that queue.
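
The post does not name a specific queue technology. As one possible sketch, assuming Amazon SQS as the queue (plausible given the EMR and AWS-region references, but an assumption), the publish step could look like this:

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.SendMessageRequest;

public class EventPublisher {

    private final AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();

    // queueUrl is a hypothetical placeholder for this server's source queue.
    public void publish(String queueUrl, String eventJson) {
        // Push the serialized event JSON onto the source queue.
        sqs.sendMessage(new SendMessageRequest()
                .withQueueUrl(queueUrl)
                .withMessageBody(eventJson));
    }
}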

We store the event's metadata: what kind of event it is, when it was generated, and where it was generated.

A sample event generated by the listener:

{
  "database": "Database that has to be synced.",
  "ddl": "Type of DDL.",
  "event_id": "674fc9f3-2f10-4a2a-8755-bd41923e6dd7 (UUID)",
  "listener_generation_time": "2020-04-23 23:44:14 GMT",
  "src_cluster_id": "Cluster from which the event is generated",
  "src_cluster_type": "Type of the cluster (Eg: EMR)",
  "src_region": "us-west-2",
  "statement_type": "DROP_PARTITION",
  "table_name": "Table Name that has to be synced",
  "table_type": "EXTERNAL_TABLE",
  "target_region(us-east-target)": [{
    "status": "Success",
    "timestamp": "2020-03-13 18:11:35 GMT"
  }, {
    "status": "Status of the event",
    "timestamp": "2020-04-24 06:07:09 GMT"
  }]
}

The keys in the event are mostly self-explanatory: they are the metadata of a particular event, describing what kind of event it is, where it was generated, and where it has to be executed.

Now that we have the event, we need a way to determine which events have to be synced to which metastores, and which tables or databases have to be synced. For that purpose, we store a configuration record in a database.

A sample configuration record looks like this:

{
  "filteredTables": {
    "database.*": {
      "lastSnapshot": "2019-10-14 23:22:35 GMT",
      "targetlocation": "NA"
    }
  },
  "lastSnapShot": "2019-09-19 23:15:54 GMT",
  "sourceRegion": "us-east-1",
  "targetRegion": ""
}

In this record, the "filteredTables" field determines which tables are to be synced, whether an entire database or just a single table, and its "lastSnapshot" value is the timestamp from which events have to be synced. Events generated before that timestamp, or for tables not specified in "filteredTables", are discarded. The "sourceRegion" and "targetRegion" fields determine from which region to which region a table is synced.

Looking at the configuration record, we discard an event if its table name is not present in "filteredTables". If it is present, we send the event to the target queue; whoever has subscribed to the target queue receives the event and executes the DDL carried in it.
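
A minimal sketch of that routing check, assuming the "filteredTables" keys are database.table regex patterns as in the sample record (the class and method names are hypothetical):

import java.time.Instant;
import java.util.Map;

public class EventRouter {

    // Returns true if the event should be forwarded to the target queue.
    // filteredTables maps a "database.table" regex pattern to its lastSnapshot.
    public static boolean shouldSync(String database, String table,
                                     Instant eventTime,
                                     Map<String, Instant> filteredTables) {
        String qualifiedName = database + "." + table;
        for (Map.Entry<String, Instant> entry : filteredTables.entrySet()) {
            boolean matchesFilter = qualifiedName.matches(entry.getKey());
            boolean afterSnapshot = eventTime.isAfter(entry.getValue());
            if (matchesFilter && afterSnapshot) {
                return true;  // forward to the target queue
            }
        }
        return false;         // discard the event
    }
}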

