Metadata

 

Metadata is “data about data.”

It describes, explains, or gives context to actual data so that systems—and humans—can understand, manage, and use it effectively.

Think of it as tags or labels around your data that tell you what it is, how to use it, who owns it, and so on.

In technical terms, metadata is structured information that describes data assets across dimensions such as schema, technical properties, governance rules, quality, and usage. It is typically stored in catalogs, metastores, or format-specific logs (for example, a Delta table's transaction log).

 

Major metadata types:

  • Descriptive metadata
      • What: Human-readable information used to discover and understand data (title, description, keywords, owners).
      • Example: A catalog entry with “Customer Orders (EU) – daily snapshot, used for billing reports.”
  • Structural metadata
      • What: Information about how pieces of data fit together: schemas, primary/foreign keys, partitions, table relationships, file hierarchies.
      • Example: Table orders has columns (order_id INT, customer_id INT, amount DECIMAL) and is partitioned by order_date.
  • Administrative / governance metadata
      • What: Access rights, classifications (e.g., PII, confidential), retention rules, regulatory flags, and lifecycle policies.
      • Example: Column email tagged as PII, with access restricted to the “Compliance” group and a 7-year retention policy.
  • Technical metadata
      • What: File and table formats (Parquet, Delta), compression, sizes, row counts, physical locations, engine versions, statistics.
      • Example: A Delta table’s DESCRIBE DETAIL output lists format, table id, location, creation time, partition columns, number of files, and total size in bytes.
  • Operational metadata
      • What: Job runs, timestamps, row counts processed, error logs, and pipeline statuses, used to operate and debug data flows.
      • Example: A metadata record saying a nightly job loaded 120M rows at 02:03 UTC and dropped 0.2% as invalid.
  • Quality metadata
      • What: Validity, completeness, freshness, and accuracy indicators attached to datasets or columns.
  • Usage / behavioral metadata
      • What: Who queried a table, which dashboards use it, how frequently, and common query patterns.
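To make these categories concrete, here is a minimal Python sketch of what a single catalog entry might look like when several metadata types are kept together. The class names (ColumnMeta, DatasetMeta), the field names, the bucket path, and the choice to put an email column on the orders table are all assumptions made for illustration; real catalogs and metastores define their own models.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class ColumnMeta:
    name: str
    dtype: str
    tags: List[str] = field(default_factory=list)  # governance tags, e.g. "PII"


@dataclass
class DatasetMeta:
    # Descriptive metadata
    title: str
    description: str
    owner: str
    # Structural metadata
    columns: List[ColumnMeta]
    partition_by: List[str]
    # Technical metadata
    file_format: str
    location: str
    # Administrative / governance metadata
    retention_years: int
    # Operational metadata (filled in by pipeline runs)
    last_loaded_at: Optional[datetime] = None
    last_row_count: Optional[int] = None

    def pii_columns(self) -> List[str]:
        """Names of columns carrying the PII governance tag."""
        return [c.name for c in self.columns if "PII" in c.tags]


# Hypothetical entry; the path and the email column are illustrative only.
orders = DatasetMeta(
    title="Customer Orders (EU)",
    description="Daily snapshot, used for billing reports.",
    owner="billing-team",
    columns=[
        ColumnMeta("order_id", "INT"),
        ColumnMeta("customer_id", "INT"),
        ColumnMeta("amount", "DECIMAL"),
        ColumnMeta("email", "STRING", tags=["PII"]),
    ],
    partition_by=["order_date"],
    file_format="parquet",
    location="s3://example-bucket/orders/",
    retention_years=7,
    last_loaded_at=datetime(2024, 1, 1, 2, 3, tzinfo=timezone.utc),
    last_row_count=120_000_000,
)

print(orders.pii_columns())
```

Keeping the categories as separate groups of fields on one record is only one possible design; many systems store operational and usage metadata in separate tables because they change far more often than descriptive or structural metadata.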

    Why Metadata Matters in Data Engineering

  • Improves data discovery: Data catalogs (e.g., Collibra, Alation, DataHub) use metadata to help you find datasets.
  • Enables schema enforcement & validation: Without metadata, you can't enforce schemas or detect breaking changes.
  • Supports governance & compliance: Helps track sensitive data (PII), ownership, and access control.
  • Boosts pipeline reliability: Operational metadata shows trends, failures, and performance.
  • Powers optimizations: Engines like Spark and BigQuery use metadata such as partition information and column statistics to plan queries and skip irrelevant data.
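As a rough illustration of the schema enforcement point, the sketch below checks an incoming record against an expected schema and reports anything that would count as a breaking change. The function name schema_violations, the EXPECTED_SCHEMA mapping, and the sample records are hypothetical; production systems usually rely on the schema stored in a catalog, metastore, or table format rather than a hand-written dictionary.

```python
from decimal import Decimal
from typing import Any, Dict, List

# Expected schema for the orders table (structural metadata).
EXPECTED_SCHEMA: Dict[str, type] = {
    "order_id": int,
    "customer_id": int,
    "amount": Decimal,
}


def schema_violations(record: Dict[str, Any], expected: Dict[str, type]) -> List[str]:
    """Return a list of human-readable schema problems (empty if none)."""
    problems = []
    # Check every expected column for presence and type.
    for col, typ in expected.items():
        if col not in record:
            problems.append(f"missing column: {col}")
        elif not isinstance(record[col], typ):
            actual = type(record[col]).__name__
            problems.append(f"{col}: expected {typ.__name__}, got {actual}")
    # Flag columns the schema does not know about.
    for col in record:
        if col not in expected:
            problems.append(f"unexpected column: {col}")
    return problems


good = {"order_id": 1, "customer_id": 7, "amount": Decimal("19.99")}
bad = {"order_id": "1", "amount": Decimal("5.00"), "coupon": "X"}

print(schema_violations(good, EXPECTED_SCHEMA))
print(schema_violations(bad, EXPECTED_SCHEMA))
```

The first call returns an empty list, while the second reports a type mismatch, a missing column, and an unexpected column. Without the expected schema, none of these three problems could be detected before the data reached downstream consumers.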

    💡 Example –

  • In a data lake or warehouse, metadata is the combination of labels, manuals, and audit trails for every dataset. Without it, the data exists but is effectively unusable or risky. With rich metadata, teams can discover the right tables quickly, trust them, and use them safely and efficiently.
  • A Parquet file has:
      • Schema metadata (columns, types)
      • Compression type
      • Row group layout (partition values themselves usually come from the directory layout or the table format rather than from the file)
      • Row counts and per-column statistics such as min/max values for each row group
  • This helps engines read only the necessary blocks → faster queries.
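The "read only the necessary blocks" idea can be sketched in plain Python. This is a simplified simulation of min/max statistics pruning, not the Parquet API: the RowGroupStats class, the groups_to_read function, and the numbers in footer are made up for illustration. A real engine reads comparable statistics from the file's footer metadata.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroupStats:
    index: int         # position of the row group in the file
    num_rows: int      # row count recorded in the metadata
    min_amount: float  # smallest value of `amount` in this group
    max_amount: float  # largest value of `amount` in this group


def groups_to_read(stats: List[RowGroupStats], lower_bound: float) -> List[int]:
    """Indexes of row groups that may contain rows with amount > lower_bound.

    A group whose maximum is not above the bound cannot match,
    so it is skipped without reading any of its data.
    """
    return [g.index for g in stats if g.max_amount > lower_bound]


# Hypothetical footer statistics for a file with three row groups.
footer = [
    RowGroupStats(index=0, num_rows=1000, min_amount=0.5, max_amount=49.9),
    RowGroupStats(index=1, num_rows=1000, min_amount=50.0, max_amount=199.0),
    RowGroupStats(index=2, num_rows=800, min_amount=10.0, max_amount=75.0),
]

# Equivalent of: SELECT ... WHERE amount > 100
selected = groups_to_read(footer, lower_bound=100.0)
skipped_rows = sum(g.num_rows for g in footer if g.index not in selected)

print(selected)      # [1]
print(skipped_rows)  # 1800
```

Only the second row group can contain matching rows, so 1,800 of the 2,800 rows are never read. The decision is made purely from metadata, which is why accurate statistics matter for query performance.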
