Explain TFX Metadata Store data model definitions

GOAL

  1. Explain the following definitions in plain, simple English?
  2. Give many practical examples of what they can contain (a model, model hyperparameters, just and only data, …)?
  3. What does each of them do?

ORIGINAL DOC

This is the original ML Metadata | TFX | TensorFlow page that I have issues with. In my description I will point back to this. “The Metadata Store uses the following data model to record and retrieve metadata from the storage backend.”:

  • ArtifactType: describes an artifact’s type and its properties that are stored in the metadata store. You can register these types on-the-fly with the metadata store in code, or you can load them in the store from a serialized format. Once you register a type, its definition is available throughout the lifetime of the store.
  • An Artifact: describes a specific instance of an ArtifactType, and its properties that are written to the metadata store.
  • An ExecutionType: describes a type of component or step in a workflow, and its runtime parameters.
  • An Execution: is a record of a component run or a step in an ML workflow and the runtime parameters. An execution can be thought of as an instance of an ExecutionType. Executions are recorded when you run an ML pipeline or step.
  • An Event: is a record of the relationship between artifacts and executions. When an execution happens, events record every artifact that was used by the execution, and every artifact that was produced. These records allow for lineage tracking throughout a workflow. By looking at all events, MLMD knows what executions happened and what artifacts were created as a result. MLMD can then recurse back from any artifact to all of its upstream inputs.
  • A ContextType: describes a type of conceptual group of artifacts and executions in a workflow, and its structural properties. For example: projects, pipeline runs, experiments, owners etc.
  • A Context: is an instance of a ContextType. It captures the shared information within the group. For example: project name, changelist commit id, experiment annotations etc. It has a user-defined unique name within its ContextType.
  • An Attribution: is a record of the relationship between artifacts and contexts.
  • An Association: is a record of the relationship between executions and contexts.

How I see them

(Please help me by correcting my descriptions or let me know if they are right. The statements are how I perceive the descriptions, and the questions are the things I don’t understand and need answered.)

  • Def ArtifactType (what I can infer from the documentation):
    • Contains the base data
    • Contains many iterations and multiple modified versions of the base data
    • Defines the data types
    • Has properties
    • Stores data as metadata in metadata storage, e.g.: a database, in RAM.
    • What is an ArtifactType?
    • What are all the possible things that it can store?
    • What are all the properties it has?
  • 1 Artifact (what I can infer from the documentation):
    • 1 version of the modified data
    • 1 version of the modified data’s properties
    • 1 version of the modified data’s data types
    • 1 version of a specific instance of an ArtifactType, and its properties that are written to the metadata store.
    • !! BUT THEN “List all Artifacts of a specific type. Example: all Models that have been trained.” → So a saved-down model can also be an Artifact. This documentation is just terrible; ArtifactType and Artifact point at each other without explaining what either of them actually is. It doesn’t make any sense.
    • What is an Artifact?
    • What are all the possible things that it can store?
    • What are all the properties it has?
  • ExecutionType:
    • What is a “component in a workflow”?
    • What is a “step in a workflow, and its runtime parameters”?
    • Because this TEXT area has NO workflow chart to point at, while there is one for the previous section tfx/guide/mlmd#metadata_store and there is one after it at tfx/guide/mlmd#integrate_ml_metadata_into_your_ml_workflows .
  • Execution:
    • What is a record here?
    • What is a component?
    • What is a component run?
    • What runtime parameters are we talking about?
    • What is Execution overall?
    • 1 version of a specific instance of ExecutionType.
    • Executions are saved to metadata storage (e.g.: RAM or a database) when you run an ML pipeline or step.
  • Event:
    • Is a record of the relationship between artifacts and executions.
    • To me it is not clear why this step is even necessary, because Event and Execution sound like they fulfill the exact same purpose.
    • To me it seems like the Execution saves itself down, so why do we need an Event to save it down again?
    • This is the only understandable statement in this definition “By looking at all events, MLMD knows what executions happened and what artifacts were created as a result.”
  • ContextType:
    • What is “conceptual group of artifacts and executions” ? Especially what is “conceptual” about them?
    • Perfect examples; this is what all the other descriptions should be like.
  • Context:
    • 1 version of the ContextType.
    • Again what is this “conceptual group of artifacts and executions” ? Especially what is “conceptual” about them?
    • Again GREAT examples.
  • Attribution:
    • simple and understandable description
    • If all the elements already exist and describe themselves, why is this necessary?
  • Association:
    • simple and understandable description
    • If all the elements already exist and describe themselves, why is this necessary?

Previous recommendations

  • Just not helpful; the colab also does not answer the basic definitions - /tfx/tutorials/mlmd/mlmd_tutorial

Cc’ing @Robert_Crowe, who might be able to help.

@Robert_Crowe teaches the course that I am studying. I have also posted my question on that course’s deeplearning.ai forum, but so far no help (plus on tons of online forums). I would love to reach out to him and offer constructive feedback on this specific part of the documentation.

Thanks for the feedback! The documentation starts from the abstract before ending in the specific, and it sounds like the better user experience would be to map the specific to the abstract upfront.

If you take the colab Penguin identifier example:
Artifacts:

  1. Palmer Penguins dataset
  2. Penguin species identifier model

ArtifactTypes:

  1. Dataset
  2. Model

Where “Artifacts” are instances of an “ArtifactType”. If you are trying multiple different training strategies, you can imagine you will end up with multiple models penguinGuesserA, penguinGuesserB, penguinGuesserC, etc. of type “Model” for each strategy you are experimenting with.
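
To make that type/instance split concrete, here is a minimal sketch against the ml-metadata Python API with an in-memory SQLite backend; the type name “Model”, the “strategy” property, and the URIs are made-up for illustration, not taken from the penguin colab:

```python
from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

# In-memory SQLite backend, convenient for trying the API out.
config = metadata_store_pb2.ConnectionConfig()
config.sqlite.SetInParent()
store = metadata_store.MetadataStore(config)

# ArtifactType: registered once; it only declares what a "Model" may record.
model_type = metadata_store_pb2.ArtifactType()
model_type.name = "Model"
model_type.properties["strategy"] = metadata_store_pb2.STRING
model_type_id = store.put_artifact_type(model_type)

# Artifacts: one per trained model, each an instance of the same "Model" type.
models = []
for strategy in ("penguinGuesserA", "penguinGuesserB", "penguinGuesserC"):
    model = metadata_store_pb2.Artifact()
    model.type_id = model_type_id
    model.uri = "/tmp/models/" + strategy  # where the actual bits live
    model.properties["strategy"].string_value = strategy
    models.append(model)
model_ids = store.put_artifacts(models)  # one id per stored Artifact
```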

Possible things an ArtifactType can store:

  1. Dataset artifact type in this penguins example stores:
  • species Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ade…
  • bill_length_mm 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1,…
  • bill_depth_mm 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1,…
  • flipper_length_mm 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 18…
  • body_mass_g 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475,…
  • year 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200…
    where each field (species, bill_length, etc.) is a property
  2. PushedModel artifact type in this penguins example stores:
  • tfx_version: 1.0.0
  • producer_component: Pusher
  • pushed: 1
  • state: published
  • name: pushed_model
  • pushed_version: 1627118772
  • pushed_destination: /tmp/tmpgs_w7rwz/serving_model/penguins_classi…
    where each line is a property and its value
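
Connecting those property lists back to the API: each entry is a key/value stored on the Artifact itself (the ArtifactType only declares which property names and value types are allowed), and “list all Models that have been trained” is a single query. A sketch, reusing the hypothetical `store` and “Model” type from the snippet above:

```python
# Pull back every Artifact of type "Model" and inspect its properties.
for model in store.get_artifacts_by_type("Model"):
    print(model.id, model.uri, model.properties["strategy"].string_value)
```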

If this is helpful, I’ll keep going and explain the other concepts in a similar way, but if not, I’ll let others chime in.


Thank you, @psp for this really thorough feedback. That’s not marketing-speak either! Really thankful.

(context: I’m the Product Manager that supports the MLMD team)

You’re right that the API docs are precise/abstract and not the quickest to scan and learn for someone new. That’s because MLMD’s users have mostly been other ML Infrastructure engineers, e.g. TensorFlow Extended (TFX).

Our goal with MLMD is to (1) make sure we record ~all that happens in ML pipelines over time and (2) make it useful and accessible afterwards. I.e. write it all down and then let you run reports, analyses, etc. automatically. Maybe you want to see all the other pipelines at your company that trained on the same dataset, and sort the models by best loss, etc. You can see an experimental project called NitroML that uses MLMD this way. Note, that’s an experimental project and I wouldn’t necessarily use it as a reference implementation!

This is, admittedly, rather general: a ~graph datastore for arbitrary directed-acyclic-graphs with some ML semantics thrown in.

In the meantime, here are some very cursory analogies that might help:

MLMD is sort of a grammar for ML.

  • Artifacts are nouns. A model, a dataset, a model analysis result, etc.
  • Executions are verbs: ingesting training data, transforming it, training the model, analyzing the model
  • Contexts are somewhat like sentences. E.g. a DAG of artifacts and executions linked together can be a Context.
  • a component is, for example, a TFX Component like the Trainer. A component is specific to TFX. MLMD is slightly more general. So when TFX uses MLMD, a component run is recorded as an execution in MLMD. But if MLMD is used outside TFX, not all MLMD executions are from TFX components. The docs don’t say that we mean a TFX component, though. You’re right to point that out.
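
Staying with the grammar analogy, here is a compact sketch of how the remaining pieces (ExecutionType/Execution, Event, ContextType/Context, Attribution, Association) look in the ml-metadata Python API. All names (“Trainer”, “Experiment”, “penguin-experiment-1”) are invented for illustration, and the snippet assumes the imports, `store`, and `model_ids` from the earlier sketches plus a `dataset_artifact_id` recorded the same way:

```python
# "Verb": an ExecutionType describes a kind of step; an Execution is one run of it.
trainer_type = metadata_store_pb2.ExecutionType()
trainer_type.name = "Trainer"
trainer_type.properties["state"] = metadata_store_pb2.STRING
trainer_type_id = store.put_execution_type(trainer_type)

train_run = metadata_store_pb2.Execution()
train_run.type_id = trainer_type_id
train_run.properties["state"].string_value = "COMPLETED"
[train_run_id] = store.put_executions([train_run])

# Events are the lineage edges: this run read a dataset and wrote a model.
input_event = metadata_store_pb2.Event()
input_event.artifact_id = dataset_artifact_id  # assumed to exist already
input_event.execution_id = train_run_id
input_event.type = metadata_store_pb2.Event.INPUT

output_event = metadata_store_pb2.Event()
output_event.artifact_id = model_ids[0]  # from the first sketch
output_event.execution_id = train_run_id
output_event.type = metadata_store_pb2.Event.OUTPUT
store.put_events([input_event, output_event])

# "Sentence": a Context groups artifacts and executions; Attributions attach
# artifacts to it, Associations attach executions to it.
experiment_type = metadata_store_pb2.ContextType()
experiment_type.name = "Experiment"
experiment_type_id = store.put_context_type(experiment_type)

experiment = metadata_store_pb2.Context()
experiment.type_id = experiment_type_id
experiment.name = "penguin-experiment-1"  # unique within its ContextType
[experiment_id] = store.put_contexts([experiment])

attribution = metadata_store_pb2.Attribution()
attribution.artifact_id = model_ids[0]
attribution.context_id = experiment_id

association = metadata_store_pb2.Association()
association.execution_id = train_run_id
association.context_id = experiment_id
store.put_attributions_and_associations([attribution], [association])
```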

I’ve added a task for us (the MLMD team) to improve the docs for a reader who doesn’t necessarily already have years of experience working on these problems, and to clarify what we mean by component (and any other non-MLMD-specific definitions).


Sorry to be late to the party, but glad to see that Chloe and Ben jumped in! You’re in capable hands!
