Data catalogs

Metadata

See Metadata.

Data catalog rise

Data catalogs are not new, but they have been on the rise in recent years. Notable mentions:

2015 Apache Atlas (Hortonworks)
2016 Data Access Layer (Twitter) closed source
2017 Dataportal (Airbnb) closed source
2017 AWS Glue Data Catalog (AWS) closed source
2018 Databook (Uber) closed source
2018 Marquez (WeWork)
2018 Metacat (Netflix) open sourced, but no documentation
2019 Amundsen (Lyft)
2019 DataHub (LinkedIn)
2020 Nemo (Facebook) closed source
2020 Lexikon (Spotify) closed source
2020 Artifact (Shopify) closed source
2020 Data Catalog (Google Cloud) closed source
2021 OpenMetadata

This list shows both the need for such tools and the complexity companies face when dealing with data in the modern world.

Data catalog and co

Many tools and methodologies do some sort of metadata cataloging. For example:

Data catalogs
Data hubs
Data lakes

Data catalogs provide metadata management capabilities:

Data dictionary
Controlled vocabulary
Taxonomy
Ontology

But they often do more than that:

Data browsing
Data lineage (data provenance)
Data governance
Data collaboration (big data management, data warehousing)
Data quality management

This shows the deep roots of the “data catalog” approach in RDBMSs and data warehouses in general. On the other hand, data lineage, browsing, and quality don’t make sense in the context of data serialisation schemas.

New tool

While there are a lot of open source tools, they all seem to be DB-centric. I see great potential in a general metadata tool:

It can be a central place for business vocabulary
It can be a central place for all schemas (what they call data dictionary)
- Schemas themselves can be connected via relationships and form a graph, which would allow using graph query languages (SPARQL, Cypher, Gremlin) for data discovery
It can provide search and tagging capability
It can go as far as generating type definitions, encoders, decoders
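
To make the idea more concrete, here is a minimal sketch of schemas stored as nodes in a graph, with tagging and a simple traversal for discovery. Everything in it is an assumption made for illustration: the `Schema` and `SchemaCatalog` names, the `references`/`embeds` relation labels, and the `Trip`/`Rider`/`Money` schemas are all hypothetical. A real tool would more likely sit on top of a graph database and expose one of the query languages mentioned above, rather than a hand-written breadth-first search.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Schema:
    """A schema entry: a name, field name -> type name, and free-form tags."""
    name: str
    fields: dict[str, str]
    tags: set[str] = field(default_factory=set)


class SchemaCatalog:
    """Hypothetical in-memory catalog: schemas are nodes, relationships are labeled edges."""

    def __init__(self) -> None:
        self.schemas: dict[str, Schema] = {}
        # schema name -> list of (relation label, target schema name)
        self.edges: dict[str, list[tuple[str, str]]] = {}

    def add(self, schema: Schema) -> None:
        self.schemas[schema.name] = schema
        self.edges.setdefault(schema.name, [])

    def relate(self, source: str, relation: str, target: str) -> None:
        if source not in self.schemas or target not in self.schemas:
            raise KeyError("both schemas must be registered first")
        self.edges[source].append((relation, target))

    def find_by_tag(self, tag: str) -> list[str]:
        """Names of all schemas carrying the given tag, sorted alphabetically."""
        return sorted(name for name, s in self.schemas.items() if tag in s.tags)

    def reachable(self, start: str) -> list[str]:
        """Schemas reachable from `start` by following edges, in breadth-first order."""
        seen = {start}
        queue = deque([start])
        order: list[str] = []
        while queue:
            current = queue.popleft()
            for _relation, target in self.edges[current]:
                if target not in seen:
                    seen.add(target)
                    order.append(target)
                    queue.append(target)
        return order


catalog = SchemaCatalog()
catalog.add(Schema("Trip", {"id": "string", "rider_id": "string", "fare": "Money"}, {"core"}))
catalog.add(Schema("Rider", {"id": "string", "email": "string"}, {"core", "pii"}))
catalog.add(Schema("Money", {"amount": "decimal", "currency": "string"}))
catalog.relate("Trip", "references", "Rider")
catalog.relate("Trip", "embeds", "Money")

print(catalog.find_by_tag("pii"))
print(catalog.reachable("Trip"))
```

The same node-and-edge structure could also feed the last point in the list: walking a schema's `fields` is enough to emit a type definition in a target language.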

Uber’s metadata infrastructure

It seems that Uber’s metadata infrastructure is quite close to what I described above. They had two tools at the core: Databook and Dragon. Both are closed source, but their successors, OpenMetadata and Hydra respectively, are open source.

Unfortunately, the successors are disconnected: OpenMetadata uses JSON Schema for schema description, not Dragon/Hydra.

Dragon

A little algebra goes a long way

Dragon is based on Algebraic Property Graphs (a mathematical concept from the same authors).

Evolution of the Graph Schema, 2018
Anything-to-Graph, 2021

Hydra

Hydra is a transformation toolkit along the lines of Dragon, but open source, and with a more advanced type system and other new features.
