Python API to generate publish metadata dicts

I’m currently working on integrating OpenCue with OpenPype, and I keep running into a problem I’ve struggled with ever since I started working on the OP core code: much of the publish system relies on metadata dicts that are not documented anywhere. It is very tedious to guess which keys are required and which aren’t, where they belong (context.data, instance.data, representations…), and which feature they relate to.
Creating documentation could be very helpful and a good start, but I’m afraid contributors won’t stick to it and as reviewers we easily forget this part.

My proposal is that we use classes and functions to build these metadata dicts. This way, we can define which arguments are required and which are optional (kwargs), and the purpose of each argument is documented in the docstring, where we are less likely to forget it (a linter may be a great help as well).
For example, we could create an InstanceData class which would look like:

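class InstanceData(dict):
    def __init__(self, family: str, subset: str, comment: str = "", **kwargs):
        """blabla.

        Args:
            ....
        """
        # Further required and optional arguments are elided here;
        # **kwargs only stands in for them in this sketch.
        super().__init__(**kwargs)
        self['family'] = family
        self['subset'] = subset
        self['comment'] = comment

    def add_representations(self, representations):
        """doc"""
        # A single RepresentationData is a dict and therefore iterable,
        # so test for dict (not Iterable) before wrapping it in a list.
        if isinstance(representations, dict):
            representations = [representations]
        self.setdefault('representations', []).extend(representations)


class RepresentationData(dict):
    ...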

Just cross-referencing. This Issue #1355: Define representation as data class is closely related to your request.

Also, +1 to the feature request. @fabiaserra has also been requesting an abstraction, and it has come up more often recently too.

Preferably we’d actually create a dataclass, which in a way allows us to “lock” the structure of the data.

I’m not sure I get your point here.

A class that is itself a dict can still hold any key-value pair without a strict definition, which makes it loose by design.

Here’s a quick example:

from dataclasses import dataclass, field, asdict
from typing import List, Optional, Set, Union


@dataclass
class Colorspace:
    config_path: str
    colorspace: str
    display: str
    view: str


@dataclass
class Representation:

    name: str
    ext: Optional[str] = None

    # List of filenames or file sequences
    files: Optional[List[Union[str, List]]] = field(default_factory=list)

    # Metadata
    comment: Optional[str] = ""

    # Frame data
    frameStart: Optional[int] = None
    frameEnd: Optional[int] = None
    handleStart: Optional[int] = None
    handleEnd: Optional[int] = None

    frames: Optional[Set[int]] = field(default_factory=set)

    # Colorspace
    colorspace: Optional[Colorspace] = None

    def add_sequence(self, files):
        self.files.extend(files)

    def add_single_file(self, file):
        self.files.append(file)

    def get_frames(self):
        if self.frames:
            # Explicit frames
            return sorted(self.frames)

        elif self.frameStart is not None and self.frameEnd is not None:
            # Frame range, extended outwards by the handles on both sides
            frame_start = self.frameStart
            frame_end = self.frameEnd
            handle_start = self.handleStart or 0
            handle_end = self.handleEnd or 0
            frame_start_handle = frame_start - handle_start
            frame_end_handle = frame_end + handle_end

            return list(range(
                frame_start_handle, frame_end_handle + 1
            ))


# Example usage
colorspace = Colorspace(
    config_path="/path/to/config.ocio",
    colorspace="acesCG",
    display="sRGB",
    view="sRGB"
)

repre = Representation(name="usd",
                       frameStart=1001,
                       frameEnd=1010,
                       colorspace=colorspace)
repre.add_single_file("/path/to/file.usda")

print(repre.get_frames())
print(repre.colorspace)

# Note that Python still does not prevent you from doing this:
# repre.colorspace = "A"

# And to turn it into a dict
print(asdict(repre))
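
The point about a dict being loose while a dataclass can be “locked” could be sketched with the standard library alone. This is only a minimal illustration, assuming a frozen dataclass would be acceptable for publish data; `LooseRepresentation` and `LockedRepresentation` are hypothetical names, not existing OpenPype classes.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Optional


class LooseRepresentation(dict):
    """Dict subclass: any key is accepted, typos included."""


@dataclass(frozen=True)
class LockedRepresentation:
    """Hypothetical frozen variant: attributes are fixed after creation."""
    name: str
    frameStart: Optional[int] = None
    frameEnd: Optional[int] = None


loose = LooseRepresentation(name="usd")
loose["frameStrat"] = 1001  # the typo goes unnoticed

locked = LockedRepresentation(name="usd", frameStart=1001, frameEnd=1010)

error = None
try:
    locked.frameStrat = 1001  # same typo on the frozen dataclass
except FrozenInstanceError as exc:
    error = exc
    print("rejected:", exc)
```

Note that `frozen=True` only blocks assignment after creation. The type hints are still not enforced at runtime, so passing a string for `frameStart` to the constructor would go through.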

Yes okay, I really like this design. I’ve seen the proposal to use the attrs module, and from reading it, it feels very comfortable.
Is it something you are open to start right now for OPv3 or do you prefer waiting for v4?

The amount of work on this is likely smaller when targeting AYON only, also because there’s a refactor/rename of some things, like subset to product_name or whatever. At least designing it around the new naming conventions would be great.

I wonder if @milan might have an up-to-date document with the expected data for subsets, versions, representations, etc.


JSON Schema validations in OpenPype’s predecessor Avalon

Originally in Avalon, prior to OpenPype, there were also JSONSchema validations (even though very basic, and never updated) that could’ve helped streamline e.g. the publishing behavior. We could do a validation like enforcing that instance.data contains frameStart, or that it is optional but if it exists then it MUST be an integer, etc. Here’s e.g. a link to an old one that should have had frameStart, frameEnd, handleStart, handleEnd, etc. defined for clarity. The same could’ve existed for instance and then a publish validator could have done:

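import pyblish.api
from avalon import schema  # assumed: Avalon's schema module

class ValidateInstanceSchema(pyblish.api.InstancePlugin):
    order = pyblish.api.ValidatorOrder + 0.499

    def process(self, instance):
        # Raises if instance.data does not match the schema
        schema.validate(instance.data)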

To ensure that during publishing the instance actually adheres to it.
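
The “optional, but if it exists it MUST be an integer” rule can be sketched without any dependency. This is a hand-rolled stand-in, not Avalon’s actual JSON Schema files and not the `jsonschema` library; `INSTANCE_SCHEMA` and `validate_instance_data` are hypothetical names, and the listed keys are only the ones mentioned in this thread.

```python
# Hypothetical schema: key -> (expected type, required?)
INSTANCE_SCHEMA = {
    "family": (str, True),
    "subset": (str, True),
    "frameStart": (int, False),
    "frameEnd": (int, False),
    "handleStart": (int, False),
    "handleEnd": (int, False),
}


def validate_instance_data(data, schema=INSTANCE_SCHEMA):
    """Return a list of human-readable problems (empty when valid)."""
    problems = []
    for key, (expected_type, required) in schema.items():
        if key not in data:
            if required:
                problems.append(f"missing required key: {key}")
            continue
        value = data[key]
        # bool is a subclass of int, so reject it explicitly for int fields
        is_bool_for_int = expected_type is int and isinstance(value, bool)
        if not isinstance(value, expected_type) or is_bool_for_int:
            problems.append(
                f"{key} must be {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
    return problems


good = {"family": "render", "subset": "renderMain", "frameStart": 1001}
bad = {"family": "render", "frameStart": "1001"}

print(validate_instance_data(good))
print(validate_instance_data(bad))
```

Inside a validator plugin like the one above, a non-empty list would simply be turned into a raised error so the publish stops.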

They were very useful for getting to know the required structure of what is basically just dict or JSON data. But in the end the requirements were never updated, and over time the data structure still became much more loose.

The concept of the schemas matches somewhat with defining a dataclass or clear data structure, so I wanted to mention it.

I think introducing more abstraction for contributors can be good, and it should assist them as you mentioned.

(From my short experience) there are many possible things for developers to do, and tailored guides would give them only what they need to know about that specific thing.
This is what I was trying to achieve here

I’m not just talking about introducing new guides but also some dev templates.

Those were my 2 cents and I hope they are related to the topic.

This is our goal, precisely because of the arguments you mentioned. I would recommend using dataclasses as suggested by @BigRoy. However, at a higher level, significant effort is required across the board, so we will be implementing it for AYON. Allow me to elaborate further.

What’s problematic with the current system:

In addition to lacking documentation and a clear, unambiguous definition, it is challenging to maintain and track changes. The metadata dictionaries essentially consist of curated dumps of the pyblish context and relevant instance data.

When it comes to render farm publishing, the approach depends on how the jobs on the farm are created and what information is needed. In the case of Deadline, there are “PluginInfo” and “JobInfo” that are somewhat abstracted here:

and with RoyalRender, there is RoyalRenderJob here:

Unfortunately, job definitions cannot be clearly defined as they can vary significantly between renderers and DCCs.

The current representation data provided by OP/AYON is no longer sufficient. It attempts to describe various types of data without a rigid structure. For instance, certain types of data require time-related information to be stored on the representation, while others do not.

Through our collaboration with the team behind the OpenAssetIO project, we have come up with the idea of using Traits and Specifications. These concepts are currently being formalized in the MediaCreation package. For a sneak peek:

This is just a first draft from the AYON side of things, probably too much. The idea is to use a combination of those to define Specifications that describe data, instead of the current system of representations. This will of course come with some sort of schemas, versioning and validation.
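
To picture the idea of combining traits into a specification, here is a rough sketch in plain Python. It is not the OpenAssetIO or MediaCreation API; `FileLocation`, `FrameRange` and `Specification` are hypothetical names chosen only to show how time-related information could be attached to some data and left off other data.

```python
from dataclasses import dataclass, field
from typing import Dict, Type


@dataclass
class FileLocation:
    """Hypothetical trait: where the data lives."""
    path: str


@dataclass
class FrameRange:
    """Hypothetical trait: only meaningful for time-dependent data."""
    frame_start: int
    frame_end: int


@dataclass
class Specification:
    """A described piece of data as a set of traits, keyed by trait type."""
    traits: Dict[Type, object] = field(default_factory=dict)

    def add(self, trait):
        self.traits[type(trait)] = trait
        return self  # allow chaining

    def has(self, trait_type):
        return trait_type in self.traits

    def get(self, trait_type):
        return self.traits.get(trait_type)


# A single texture needs no time information...
texture = Specification().add(FileLocation("/path/to/texture.exr"))

# ...while an image sequence carries a FrameRange trait as well.
sequence = (
    Specification()
    .add(FileLocation("/path/to/render.%04d.exr"))
    .add(FrameRange(frame_start=1001, frame_end=1010))
)

print(texture.has(FrameRange))
print(sequence.get(FrameRange))
```

The difference from a single Representation class is that a consumer asks for the traits it needs, rather than checking which optional fields happen to be filled in.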

Creating and publishing job results involves a dependency graph. The job submitter relies on specific information to generate jobs, while publishing the results requires both that information and the products of the job. In order to publish current Representations (or future asset Specifications), a well-defined set of information must be passed from the initial submitter process to the publishing process.
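
The hand-off from the submitter process to the publishing process could be sketched as a dataclass that is serialized to JSON on one side and rebuilt on the other. This is an assumption about how such a payload might look, not the current OP/AYON metadata format; `PublishJobPayload` and its fields are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class PublishJobPayload:
    """Hypothetical hand-off data written by the submitter."""
    subset: str
    family: str
    frameStart: int
    frameEnd: int
    expected_files: List[str] = field(default_factory=list)

    def to_json(self):
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, text):
        # Unknown or missing keys raise TypeError instead of passing silently
        return cls(**json.loads(text))


# Submitter side
payload = PublishJobPayload(
    subset="renderMain",
    family="render",
    frameStart=1001,
    frameEnd=1002,
    expected_files=["render.1001.exr", "render.1002.exr"],
)
serialized = payload.to_json()

# Publishing side (e.g. a job running on the farm)
restored = PublishJobPayload.from_json(serialized)
print(restored == payload)
```

Because the constructor rejects unexpected or missing keys, a mismatch between what the submitter writes and what the publisher expects fails early rather than somewhere deep in a plugin.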

This is a brief overview of the problem, and I would be happy to discuss it in more detail here.

Just wanted to leave a note that we might also be well off using pydantic for this. It is also used for the AYON server settings, so it might remain familiar over time.

Note that it’s Py3.7+ only, however, but I believe it allows way stricter behavior than just an attr dataclass?

Just to note that on our own internal plugin pipes we have wrapped the data dicts with dataclasses and it has made managing the code a lot easier so this change would be very very welcome!

Also a note on pydantic: it doesn’t work in some versions of DCCs due to an incompatibility with PySide2. I can’t remember the specifics, sadly, other than it being 3ds Max 2021 or 2022, but it is something to explore before locking into pydantic, as it would render AYON nonfunctional on certain DCC versions.

Here’s the bug: PySide2 conflicting with pydantic signatures · Issue #2264 · napari/napari · GitHub