Metadata

See Metadata.

Data catalog rise

While data catalogs are not new in general, they have been on the rise in recent years. Notable mentions:

This shows both the need for them and the complexity that companies face when dealing with data in the modern world.

Data catalog and co

Many kinds of software and methodologies do some form of metadata cataloging. For example:

  • Data catalogs
  • Data hubs
  • Data lakes

Data catalogs provide metadata management capabilities:

  • Data dictionary
  • Controlled Vocabulary
  • Taxonomy
  • Ontology

But they often do more than that:

  • Data browsing
  • Data lineage (data provenance)
  • Data governance
  • Data collaboration (big data management, data warehousing)
  • Data quality management

This shows how deeply the “data catalog” approach is rooted in RDBMSs and data warehouses in general. On the other hand, data lineage, browsing, and quality don’t make sense in the context of data serialisation schemas.

New tool

While there are a lot of open source tools, they all seem to be DB-centric. I see great potential in a general metadata tool:

  • It can be a central place for business vocabulary
  • It can be a central place for all schemas (what they call data dictionary)
    • Schemas themselves can be connected via relationships and form a graph, which would allow using graph query languages (SPARQL, Cypher, Gremlin) for data discovery
  • It can provide search and tagging capability
  • It can go as far as generating type definitions, encoders, decoders
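To make the "schemas as a graph" idea more concrete, here is a minimal sketch in plain Python. It is not how any existing catalog works: the `SchemaGraph` class, its method names, and the `Order`/`Customer`/`Address` schemas are all made up for illustration, and a breadth-first traversal stands in for what a real graph query language (SPARQL, Cypher, Gremlin) would do.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class SchemaNode:
    """A registered schema with free-form tags (hypothetical structure)."""
    name: str
    tags: set[str] = field(default_factory=set)


class SchemaGraph:
    """Toy in-memory catalog: schemas are nodes, relationships are labelled edges."""

    def __init__(self) -> None:
        self.nodes: dict[str, SchemaNode] = {}
        # Adjacency list: source schema -> list of (relationship, target schema).
        self.edges: dict[str, list[tuple[str, str]]] = {}

    def add_schema(self, name: str, *tags: str) -> None:
        self.nodes[name] = SchemaNode(name, set(tags))
        self.edges.setdefault(name, [])

    def relate(self, source: str, relationship: str, target: str) -> None:
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both schemas must be registered first")
        self.edges[source].append((relationship, target))

    def find_by_tag(self, tag: str) -> list[str]:
        """Tag search: names of all schemas carrying the given tag, sorted."""
        return sorted(n.name for n in self.nodes.values() if tag in n.tags)

    def reachable_from(self, start: str) -> list[str]:
        """Breadth-first traversal: every schema reachable from `start`."""
        seen = {start}
        queue = deque([start])
        order: list[str] = []
        while queue:
            current = queue.popleft()
            for _relationship, target in self.edges[current]:
                if target not in seen:
                    seen.add(target)
                    order.append(target)
                    queue.append(target)
        return order


catalog = SchemaGraph()
catalog.add_schema("Order", "sales")
catalog.add_schema("Customer", "sales", "pii")
catalog.add_schema("Address", "pii")
catalog.relate("Order", "placed_by", "Customer")
catalog.relate("Customer", "lives_at", "Address")

print(catalog.find_by_tag("pii"))       # schemas tagged as containing PII
print(catalog.reachable_from("Order"))  # everything an Order transitively refers to
```

A question like "which schemas does `Order` depend on that carry PII?" then becomes the intersection of the two results, which is the kind of data discovery a graph query language would express directly.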

Uber’s metadata infrastructure

It seems that Uber’s metadata infrastructure is quite close to what I described above. They had two tools at the core: Databook and Dragon. Both of them are closed-source, but their successors are open source.

flowchart LR
	u[Uber metadata infrastructure]
	u --> Dragon --> Hydra
	u --> Databook --> OpenMetadata -.- JSONSchema

Unfortunately, the two successors are disconnected: OpenMetadata uses JSON Schema for schema description, not Dragon/Hydra.
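As a rough illustration of the "generating type definitions" idea from the New tool section, applied to a JSON Schema description, here is a small sketch. It is not OpenMetadata's or Hydra's code: `generate_dataclass`, the `PRIMITIVES` mapping, and the `User` schema are invented for this example, and it only handles flat objects with primitive property types.

```python
import json

# Mapping from JSON Schema primitive types to Python type names (deliberately partial).
PRIMITIVES = {"string": "str", "integer": "int", "number": "float", "boolean": "bool"}


def generate_dataclass(schema: dict) -> str:
    """Render Python dataclass source for a flat JSON Schema object."""
    required = set(schema.get("required", []))
    properties = schema.get("properties", {})
    lines = ["@dataclass", f"class {schema['title']}:"]
    # Required fields go first: dataclass fields without defaults
    # must precede fields with defaults. sorted() is stable, so the
    # schema's own ordering is kept within each group.
    for name in sorted(properties, key=lambda n: n not in required):
        py_type = PRIMITIVES[properties[name]["type"]]
        if name in required:
            lines.append(f"    {name}: {py_type}")
        else:
            lines.append(f"    {name}: Optional[{py_type}] = None")
    if not properties:
        lines.append("    pass")
    return "\n".join(lines)


# Hypothetical schema, used only for this example.
USER_SCHEMA = json.loads("""
{
  "title": "User",
  "type": "object",
  "properties": {
    "id": {"type": "integer"},
    "nickname": {"type": "string"},
    "email": {"type": "string"}
  },
  "required": ["id", "email"]
}
""")

source = generate_dataclass(USER_SCHEMA)
print(source)
```

The generated text still needs `from dataclasses import dataclass` and `from typing import Optional` prepended before it can be imported. Encoders and decoders could be generated the same way, from the same schema.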

Dragon

A little algebra goes a long way

Dragon is based on Algebraic Property Graphs (a mathematical concept from the same authors).

Hydra

Hydra is a transformation toolkit along the lines of Dragon, but open source, and with a more advanced type system and other new features.