Metadata
See Metadata.
Data catalog rise
Data catalogs are not new, but they have been on the rise in recent years. Notable mentions:
- 2015 Apache Atlas (Hortonworks)
- 2016 Data Access Layer (Twitter) closed source
- 2017 Dataportal (Airbnb) closed source
- 2017 AWS Glue Data Catalog (AWS) closed source
- 2018 Databook (Uber) closed source
- 2018 Marquez (WeWork)
- 2018 Metacat (Netflix) open sourced, but no documentation
- 2019 Amundsen (Lyft)
- 2019 DataHub (LinkedIn)
- 2020 Nemo (Facebook) closed source
- 2020 Lexikon (Spotify) closed source
- 2020 Artifact (Shopify) closed source
- 2020 Data Catalog (Google Cloud) closed source
- 2021 OpenMetadata
This shows both the need for such tools and the complexity companies face when dealing with data in the modern world.
Data catalog and co
Many tools and methodologies do some sort of metadata cataloging. For example:
- Data catalogs
- Data hubs
- Data lakes
Data catalogs provide metadata management capabilities:
- Data dictionary
- Controlled Vocabulary
- Taxonomy
- Ontology
But they often do more than that:
- Data browsing
- Data lineage (data provenance)
- Data governance
- Data collaboration (big data management, data warehousing)
- Data quality management
This shows the deep roots of the “data catalog” approach in RDBMSs and data warehouses in general. On the other hand, data lineage, browsing, and quality don’t make sense in the context of data serialisation schemas.
New tool
While there are a lot of open source tools, they all seem to be DB-centric. I see great potential in a general metadata tool:
- It can be a central place for business vocabulary
- It can be a central place for all schemas (what they call data dictionary)
- Schemas themselves can be connected via relationships and form a graph, which would allow the use of graph query languages (SPARQL, Cypher, Gremlin) for data discovery
- It can provide search and tagging capability
- It can go as far as generating type definitions, encoders, decoders
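To make the idea above more concrete, here is a minimal sketch of schemas stored as a graph with tagging, search, and traversal. Everything in it is an assumption made for illustration: the `SchemaNode` and `SchemaGraph` classes, the `Trip`/`Rider`/`Address` schemas, and the relationship labels are hypothetical, and a plain breadth-first traversal stands in for a real graph query language such as SPARQL, Cypher, or Gremlin.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class SchemaNode:
    """One schema in the catalog (hypothetical structure)."""
    name: str
    description: str = ""
    tags: set = field(default_factory=set)


class SchemaGraph:
    """A directed graph of schemas with labelled relationships."""

    def __init__(self):
        self.nodes = {}  # schema name -> SchemaNode
        self.edges = {}  # source name -> list of (label, target name)

    def add_schema(self, name, description="", tags=()):
        self.nodes[name] = SchemaNode(name, description, set(tags))
        self.edges.setdefault(name, [])

    def relate(self, source, label, target):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both schemas must be registered first")
        self.edges[source].append((label, target))

    def search(self, text):
        """Case-insensitive substring search over names and descriptions."""
        needle = text.lower()
        return sorted(
            node.name
            for node in self.nodes.values()
            if needle in node.name.lower() or needle in node.description.lower()
        )

    def with_tag(self, tag):
        return sorted(n.name for n in self.nodes.values() if tag in n.tags)

    def reachable(self, start):
        """Breadth-first traversal: schemas reachable from `start`."""
        seen = {start}
        queue = deque([start])
        order = []
        while queue:
            current = queue.popleft()
            for _label, target in self.edges[current]:
                if target not in seen:
                    seen.add(target)
                    order.append(target)
                    queue.append(target)
        return order


# Hypothetical example data.
catalog = SchemaGraph()
catalog.add_schema("Trip", "A completed ride", tags={"core"})
catalog.add_schema("Rider", "Person who requested the trip", tags={"core", "pii"})
catalog.add_schema("Address", "Postal address", tags={"pii"})
catalog.relate("Trip", "requested_by", "Rider")
catalog.relate("Rider", "lives_at", "Address")

print(catalog.search("trip"))
print(catalog.with_tag("pii"))
print(catalog.reachable("Trip"))
```

In a real tool the graph would more likely live in a graph database, so that discovery queries could be written in one of the query languages mentioned above instead of hand-written traversals.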
Uber’s metadata infrastructure
It seems that Uber’s metadata infrastructure is quite close to what I described above. It had two tools at its core: Databook and Dragon. Both are closed source, but their successors are open source.
flowchart LR
u[Uber metadata infrastructure]
u --> Dragon --> Hydra
u --> Databook --> OpenMetadata -.- JSONSchema
Unfortunately, they are disconnected: OpenMetadata uses JSON Schema for schema description, not Dragon/Hydra.
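As a rough illustration of what a JSON Schema based description makes possible, the sketch below generates a Python type definition from a small schema. The `Table` schema is hypothetical (it is not taken from OpenMetadata), the helper names `python_type` and `generate_dataclass` are made up for this example, and only a handful of JSON Schema keywords are handled. The generated code assumes Python 3.9+ for `list[...]` annotations.

```python
import json

# Hypothetical schema in JSON Schema style; not an actual OpenMetadata schema.
TABLE_SCHEMA = json.loads("""
{
  "title": "Table",
  "type": "object",
  "properties": {
    "name": {"type": "string"},
    "rowCount": {"type": "integer"},
    "tags": {"type": "array", "items": {"type": "string"}}
  },
  "required": ["name"]
}
""")

# JSON Schema primitive types mapped to Python type names.
PRIMITIVES = {"string": "str", "integer": "int", "number": "float", "boolean": "bool"}


def python_type(prop):
    """Translate one property definition into a Python type hint string."""
    kind = prop.get("type")
    if kind in PRIMITIVES:
        return PRIMITIVES[kind]
    if kind == "array":
        return f"list[{python_type(prop.get('items', {}))}]"
    return "object"  # fallback for anything this sketch does not cover


def generate_dataclass(schema):
    """Return Python source code for a dataclass matching the schema."""
    required = set(schema.get("required", []))
    props = schema.get("properties", {})
    # Required fields first: fields without defaults must precede defaulted ones.
    ordered = sorted(props, key=lambda name: name not in required)
    lines = [
        "from dataclasses import dataclass",
        "from typing import Optional",
        "",
        "",
        "@dataclass",
        f"class {schema['title']}:",
    ]
    if not ordered:
        lines.append("    pass")
    for name in ordered:
        hint = python_type(props[name])
        if name in required:
            lines.append(f"    {name}: {hint}")
        else:
            lines.append(f"    {name}: Optional[{hint}] = None")
    return "\n".join(lines) + "\n"


source = generate_dataclass(TABLE_SCHEMA)
print(source)
```

Encoders and decoders could be generated in the same way, by walking the schema once more and emitting serialisation code per field.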
Dragon
A little algebra goes a long way
Dragon is based on Algebraic Property Graphs (a mathematical concept from the same authors).
- Evolution of the Graph Schema, 2018
- Anything-to-Graph, 2021
Hydra
Hydra is a transformation toolkit along the lines of Dragon, but open source, and with a more advanced type system and other new features.