Meteor — A metadata collection framework
TL;DR: Meteor is an easy-to-use, plugin-driven metadata collection framework that extracts metadata from different sources and sinks it to any data catalog.
Before jumping into a detailed overview of Meteor, let's look at why it was created and what purpose it serves.
At Gojek, almost 100 million active users use 20+ products, generating more than 1,000 terabytes of data that must be processed daily to support solid, data-driven decisions. This data takes various forms, such as message queues and data stores, and can wander through multiple systems, properties, and types. It is scattered across different formats, e.g. JSON and Protobuf, each with its own schema and metadata.
What is Metadata?
Metadata is information that describes one or more aspects of a particular piece of data. It summarises basic information about the data, making it easier to find and work with particular instances of it. Each system, database, and message queue has its own definition and structure for data. Take the DE ecosystem at Gojek: a message queuing system like Kafka streams real-time data to data stores with the help of topic information. A table is the part of a database that groups data. Similarly, a dashboard is part of Metabase or Grafana and carries information about the charts it contains.
Metadata answers questions such as where the data comes from, who created a table, which tables are similar, and when a table was last updated. This information helps in understanding the data efficiently.
Collecting metadata effectively was a major challenge for the team, as there was no tool that was easy to use, reusable, portable, and flexible enough to be adapted to generic use cases. This led to the creation of Meteor. But the challenges did not end there. The old Meteor, which ran as a service, could not cope with new demands: it was non-trivial to move it to an approach that offered a single solution for every available data source, whether a bucket, topic, database, dashboard, or table, and that could process the metadata and sink it easily to its destination. This drove the decision to move to a plugin-based metadata extractor.
Meteor is a plugin-driven agent for collecting metadata. Its role is to extract metadata from a variety of data sources and sink it to a variety of APIs.
The workflow is defined in 3 stages:
- Extraction: The process of extracting data from a source and transforming it into a format that can be consumed by the agent.
- Processing: The process of enriching or otherwise transforming the extracted data before it is sent on.
- Sink: The process of sending the processed data to one or more destinations, as defined in recipes.
Meteor is built as a plugin-driven system in which each stage is handled by its own plugin. It therefore has three types of plugins: Extractors, Processors, and Sinks.
Here is some Meteor jargon to help you get familiar with how it works:
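The three stages can be pictured as three small interfaces chained together. The following is a minimal sketch in Go, not Meteor's actual code: the `Record` type, the interface method names, and the `staticExtractor`, `labelProcessor`, and `memorySink` plugins are all hypothetical and exist only to illustrate the flow.

```go
package main

import (
	"fmt"
	"strings"
)

// Record is a hypothetical, simplified unit of metadata (not Meteor's real model).
type Record struct {
	Name   string
	Labels map[string]string
}

// Extractor pulls metadata records from a single source.
type Extractor interface {
	Extract() ([]Record, error)
}

// Processor enriches or transforms one record after extraction.
type Processor interface {
	Process(r Record) (Record, error)
}

// Sink delivers the processed records to one destination.
type Sink interface {
	Sink(records []Record) error
}

// Run chains the three stages: one extractor, any number of processors and sinks.
func Run(e Extractor, ps []Processor, sinks []Sink) error {
	records, err := e.Extract()
	if err != nil {
		return fmt.Errorf("extract: %w", err)
	}
	for i := range records {
		for _, p := range ps {
			if records[i], err = p.Process(records[i]); err != nil {
				return fmt.Errorf("process: %w", err)
			}
		}
	}
	for _, s := range sinks {
		if err := s.Sink(records); err != nil {
			return fmt.Errorf("sink: %w", err)
		}
	}
	return nil
}

// staticExtractor returns a fixed list of table names (stand-in for a real source).
type staticExtractor struct{ tables []string }

func (s staticExtractor) Extract() ([]Record, error) {
	out := make([]Record, 0, len(s.tables))
	for _, t := range s.tables {
		out = append(out, Record{Name: t, Labels: map[string]string{}})
	}
	return out, nil
}

// labelProcessor attaches one key/value label to every record.
type labelProcessor struct{ key, value string }

func (l labelProcessor) Process(r Record) (Record, error) {
	r.Labels[l.key] = l.value
	return r, nil
}

// memorySink collects formatted records in memory instead of calling an API.
type memorySink struct{ lines []string }

func (m *memorySink) Sink(records []Record) error {
	for _, r := range records {
		m.lines = append(m.lines, fmt.Sprintf("%s team=%s", r.Name, r.Labels["team"]))
	}
	return nil
}

func main() {
	sink := &memorySink{}
	err := Run(
		staticExtractor{tables: []string{"orders", "drivers"}},
		[]Processor{labelProcessor{key: "team", value: "data-platform"}},
		[]Sink{sink},
	)
	if err != nil {
		fmt.Println("run failed:", err)
		return
	}
	fmt.Println(strings.Join(sink.lines, "\n"))
}
```

Keeping each stage behind its own interface is what lets a new source or destination be added without touching the other two stages.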
- Job: A metadata extraction task from a single data source.
- Recipe: A set of instructions and configurations defined by the user in a YAML file that describes how a particular job will be performed. It should specify the source from which the metadata will be fetched, the processors to apply to the metadata, and the sinks that act as its destinations.
- Extractor: The type of plugin that extracts metadata from a source. Plugins currently exist for many kinds of sources, including databases, dashboards, and topics. A single job can have only one source.
- Processor: The type of plugin that handles the processing stage, enriching or otherwise processing the metadata after extraction. A single recipe can have multiple processors.
- Sink: The type of plugin that acts as a destination, sending metadata to a variety of third-party APIs and catalog services, including Columbus, HTTP, BigQuery, Kafka, and many others.
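To make the relationship between these terms concrete, here is a rough sketch of a recipe modelled as a Go struct. It is an illustration under assumptions, not Meteor's real recipe schema: the field names, the `Validate` method, the `enrich` processor, and the config keys are hypothetical, and the "at least one sink" rule is an assumption drawn from the description above rather than a documented constraint.

```go
package main

import (
	"errors"
	"fmt"
)

// Plugin names one plugin and its settings. Field names are illustrative only.
type Plugin struct {
	Name   string
	Config map[string]string
}

// Recipe describes one job: a single source, optional processors, and its sinks.
type Recipe struct {
	Name       string
	Source     Plugin   // a job has only one source
	Processors []Plugin // zero or more processors
	Sinks      []Plugin // one or more destinations (assumed minimum of one)
}

// Validate checks the constraints sketched above.
func (r Recipe) Validate() error {
	if r.Source.Name == "" {
		return errors.New("recipe needs a source")
	}
	if len(r.Sinks) == 0 {
		return errors.New("recipe needs at least one sink")
	}
	return nil
}

func main() {
	recipe := Recipe{
		Name:   "kafka-to-catalog", // hypothetical recipe name
		Source: Plugin{Name: "kafka", Config: map[string]string{"broker": "localhost:9092"}},
		Processors: []Plugin{
			{Name: "enrich", Config: map[string]string{"team": "data-platform"}},
		},
		Sinks: []Plugin{{Name: "http"}, {Name: "columbus"}},
	}
	if err := recipe.Validate(); err != nil {
		fmt.Println("invalid recipe:", err)
		return
	}
	fmt.Printf("%s: 1 source, %d processors, %d sinks\n",
		recipe.Name, len(recipe.Processors), len(recipe.Sinks))
}
```

Because `Source` is a single value while `Processors` and `Sinks` are slices, the "one source per job" rule is enforced by the type itself rather than by a runtime check.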
Key features of Meteor:
- No Dependency: Written in Go, it compiles into a single binary with no external dependencies.
- Extensible: The plugin system allows new sources and sinks to be added easily.
- Ecosystem: Extract metadata from many popular services with a wide range of service plugins.
- Customisable: Add your own processors and sinks to suit your use cases.
- Runtime: Meteor can run inside VMs or containers with a minimal memory footprint.
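One common way to get this kind of extensibility is a registry that maps plugin names to factory functions. The sketch below shows the idea for sinks only. It is an assumption about how such a system could be built, not a description of Meteor's internals: `Registry`, `SinkFactory`, and `printSink` are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

// Sink is a hypothetical destination plugin interface.
type Sink interface {
	Send(payload string) error
}

// SinkFactory builds a sink from its configuration.
type SinkFactory func(config map[string]string) (Sink, error)

// Registry maps plugin names to factories so sinks can be referred to by name.
type Registry struct {
	factories map[string]SinkFactory
}

// NewRegistry returns an empty registry.
func NewRegistry() *Registry {
	return &Registry{factories: make(map[string]SinkFactory)}
}

// Register adds a new sink type; duplicate names are rejected.
func (r *Registry) Register(name string, f SinkFactory) error {
	if _, exists := r.factories[name]; exists {
		return fmt.Errorf("sink %q already registered", name)
	}
	r.factories[name] = f
	return nil
}

// Build looks up a factory by name and constructs the sink.
func (r *Registry) Build(name string, config map[string]string) (Sink, error) {
	f, ok := r.factories[name]
	if !ok {
		return nil, fmt.Errorf("unknown sink %q", name)
	}
	return f(config)
}

// Names lists registered sinks in alphabetical order.
func (r *Registry) Names() []string {
	names := make([]string, 0, len(r.factories))
	for n := range r.factories {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

// printSink writes payloads to standard output with a prefix.
type printSink struct{ prefix string }

func (p printSink) Send(payload string) error {
	_, err := fmt.Println(p.prefix + payload)
	return err
}

func main() {
	reg := NewRegistry()
	_ = reg.Register("print", func(c map[string]string) (Sink, error) {
		return printSink{prefix: c["prefix"]}, nil
	})
	sink, err := reg.Build("print", map[string]string{"prefix": "[meta] "})
	if err != nil {
		fmt.Println(err)
		return
	}
	_ = sink.Send("table=orders")
	fmt.Println(reg.Names())
}
```

With this shape, adding a new destination means writing one type and one `Register` call; nothing else in the agent has to change.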
Meteor has a feature-rich CLI that helps users interact with it. Users can perform several actions, from listing all supported plugins to generating a recipe. For Mac users, it provides a Homebrew formula, so installing it is as easy as running:
brew install odpf/taps/meteor
The Open DataOps Foundation (ODPF) is the backbone of Gojek's next-gen collaborative, domain-driven, and distributed data platform. It is built and managed for the community by our brightest engineers in the Data Platform team at Gojek.
Meteor is also part of ODPF and open to contributors. Have a look at the Meteor documentation or repository to explore more, and feel free to try out our other projects on the ODPF GitHub. Excited by the work we do with ODPF? We have several openings in our Data Platform team.