Output Schema System
Overview
The output schema system provides explicit, versioned metadata definitions for every type of file produced by the RDFAnalyzerCore framework. Rather than relying on implicit conventions or runtime discovery, each output type is described by a dedicated schema class whose format contract is captured in a single OutputManifest YAML file written alongside the job outputs.
Why it exists: downstream tools — datacards, plotting scripts, statistical frameworks, batch-merge steps — need to know the exact structure of the outputs they consume. By persisting a manifest, any consumer can:
- Verify output files have the expected format without reading the ROOT files.
- Detect when outputs were produced by an older (incompatible) schema version.
- Detect when outputs are stale (schema version is current but the environment has changed, e.g. a newer framework git commit).
- Validate inputs before merging multiple job outputs.
The framework covers six output categories:
| Category | Produced by | Schema class |
|---|---|---|
| Skims | RootOutputSink |
SkimSchema |
| Histograms | NDHistogramManager |
HistogramSchema |
| Metadata / provenance | ProvenanceService |
MetadataSchema |
| Cutflows | CounterService |
CutflowSchema |
| LAW artifacts | law/nano_tasks.py, law/opendata_tasks.py |
LawArtifactSchema |
| Intermediate artifacts | Pipeline caching layer | IntermediateArtifactSchema |
OutputManifest
OutputManifest is the top-level container that holds references to all schema objects produced in a single analysis job.
Fields
| Field | Type | Description |
|---|---|---|
manifest_version |
int |
Container format version (must equal CURRENT_VERSION = 1). |
skim |
SkimSchema \| None |
Schema for the skim ROOT file, if produced. |
histograms |
HistogramSchema \| None |
Schema for the histogram ROOT file, if applicable. |
metadata |
MetadataSchema \| None |
Schema for the provenance metadata output, if applicable. |
cutflow |
CutflowSchema \| None |
Schema for the cutflow output, if applicable. |
law_artifacts |
list[LawArtifactSchema] |
Schemas for LAW task artifacts. Empty list if none. |
intermediate_artifacts |
list[IntermediateArtifactSchema] |
Schemas for cached intermediate artifacts. |
regions |
list[RegionDefinition] |
Named analysis regions declared by RegionManager. |
nuisance_groups |
list[NuisanceGroupDefinition] |
Nuisance / systematic group definitions. |
framework_hash |
str \| None |
Git commit hash of RDFAnalyzerCore at job-submission time. |
user_repo_hash |
str \| None |
Git commit hash of the user analysis repository. |
config_mtime |
str \| None |
UTC ISO 8601 modification time of the job configuration file. |
dataset_manifest_provenance |
DatasetManifestProvenance \| None |
Identity and query record for the dataset manifest used. |
At least one of skim, histograms, metadata, cutflow, law_artifacts, or intermediate_artifacts must be populated; validate() returns an error if all are absent.
Methods
| Method | Signature | Description |
|---|---|---|
save_yaml |
(path: str) -> None |
Serialise the manifest to a YAML file. Parent directories are created automatically. |
load_yaml |
(path: str) -> OutputManifest |
Classmethod. Load and deserialise a manifest from a YAML file. |
validate |
() -> list[str] |
Run structural validation on all contained schemas. Returns a (possibly empty) list of error strings. |
check_version_compatibility |
(manifest: OutputManifest) -> None |
Static method. Raise SchemaVersionError if any stored schema version differs from the current code version. |
provenance |
() -> ProvenanceRecord |
Build a ProvenanceRecord from the manifest’s recorded hashes and timestamps. |
resolve |
(current_provenance=None) -> dict[str, ArtifactResolutionStatus] |
Convenience wrapper for resolve_manifest() that uses this manifest’s stored provenance as the baseline. |
YAML Example
manifest_version: 1
framework_hash: a1b2c3d4e5f6
user_repo_hash: 9f8e7d6c5b4a
config_mtime: "2024-03-15T12:00:00+00:00"
skim:
schema_version: 1
output_file: output/sample_0.root
tree_name: Events
branches:
- Muon_pt
- Muon_eta
- MET_pt
histograms:
schema_version: 1
output_file: output/sample_0_meta.root
histogram_names:
- muon_pt_vs_eta
axes:
- variable: Muon_pt
bins: 50
lower_bound: 0.0
upper_bound: 200.0
label: "Muon p_{T} [GeV]"
- variable: Muon_eta
bins: 30
lower_bound: -3.0
upper_bound: 3.0
label: "#eta"
metadata:
schema_version: 1
output_file: output/sample_0_meta.root
provenance_dir: provenance
required_keys:
- framework.git_hash
- config.hash
optional_keys:
- dataset_manifest.file_hash
cutflow:
schema_version: 1
output_file: output/sample_0_meta.root
counter_keys:
- sample_0.total
- sample_0.weighted
law_artifacts: []
intermediate_artifacts: []
regions: []
nuisance_groups: []
dataset_manifest_provenance: null
Schema Types
SkimSchema
Describes a ROOT TTree skim file written by RootOutputSink.
| Field | Type | Default | Description |
|---|---|---|---|
schema_version |
int |
1 |
Schema format version. |
output_file |
str |
"" |
Path or pattern of the output ROOT file. Must not be empty. |
tree_name |
str |
"Events" |
Name of the TTree inside the ROOT file. Must not be empty. |
branches |
list[str] |
[] |
Expected branch names. Empty list means all branches are accepted. |
from output_schema import SkimSchema
skim = SkimSchema(
output_file="output/sample_0.root",
tree_name="Events",
branches=["Muon_pt", "Muon_eta", "MET_pt"],
)
errors = skim.validate() # [] if valid
HistogramAxisSpec
Describes a single axis dimension of a THnSparseF histogram.
| Field | Type | Default | Description |
|---|---|---|---|
variable |
str |
"" |
Branch/column name used to fill this axis. Must not be empty. |
bins |
int |
0 |
Number of bins. Must be > 0. |
lower_bound |
float |
0.0 |
Lower edge of the axis range. Must be less than upper_bound. |
upper_bound |
float |
1.0 |
Upper edge of the axis range. |
label |
str |
"" |
Human-readable axis label (supports ROOT LaTeX notation). |
HistogramSchema
Describes THnSparseF histogram objects saved by NDHistogramManager into the meta-output ROOT file.
| Field | Type | Default | Description |
|---|---|---|---|
schema_version |
int |
1 |
Schema format version. |
output_file |
str |
"" |
Path or pattern of the meta-output ROOT file. Must not be empty. |
histogram_names |
list[str] |
[] |
Names of the THnSparseF objects expected in the file. |
axes |
list[HistogramAxisSpec] |
[] |
Axis specifications shared across all histograms in this schema. |
from output_schema import HistogramSchema, HistogramAxisSpec
histograms = HistogramSchema(
output_file="output/sample_0_meta.root",
histogram_names=["muon_pt_vs_eta"],
axes=[
HistogramAxisSpec(variable="Muon_pt", bins=50, lower_bound=0.0, upper_bound=200.0, label="Muon p_{T} [GeV]"),
HistogramAxisSpec(variable="Muon_eta", bins=30, lower_bound=-3.0, upper_bound=3.0, label="#eta"),
],
)
MetadataSchema
Describes the provenance metadata output written by ProvenanceService as TNamed objects inside a TDirectory in the meta-output ROOT file.
| Field | Type | Default | Description |
|---|---|---|---|
schema_version |
int |
1 |
Schema format version. |
output_file |
str |
"" |
Path or pattern of the meta-output ROOT file. Must not be empty. |
provenance_dir |
str |
"provenance" |
Name of the TDirectory holding the provenance objects. Must not be empty. |
required_keys |
list[str] |
PROVENANCE_REQUIRED_KEYS |
Keys that must be present. Defaults to the framework-defined required set. |
optional_keys |
list[str] |
PROVENANCE_OPTIONAL_KEYS |
Keys that may optionally be present. |
Default required provenance keys (PROVENANCE_REQUIRED_KEYS):
framework.git_hash
framework.git_dirty
framework.build_timestamp
framework.compiler
root.version
config.hash
executor.num_threads
Default optional provenance keys (PROVENANCE_OPTIONAL_KEYS):
analysis.git_hash
analysis.git_dirty
env.container_tag
filelist.hash
dataset_manifest.file_hash
dataset_manifest.query_params
dataset_manifest.resolved_entries
CutflowSchema
Describes event-count histograms written by CounterService.
| Field | Type | Default | Description |
|---|---|---|---|
schema_version |
int |
1 |
Schema format version. |
output_file |
str |
"" |
Path or pattern of the ROOT file containing counter objects. Typically the same as the meta-output file. Must not be empty. |
counter_keys |
list[str] |
[] |
Ordered counter key names expected in the output. Empty list disables key-level validation. |
LawArtifactSchema
Describes a single output file produced by a LAW workflow task (PrepareNANOSample, BuildNANOSubmission, SubmitNANOJobs, MonitorNANOJobs, RunNANOAnalysisJob, …).
| Field | Type | Default | Description |
|---|---|---|---|
schema_version |
int |
1 |
Schema format version. |
artifact_type |
str |
"" |
Logical type of the artifact. Must be one of the recognised types (see below). |
path_pattern |
str |
"" |
Glob-compatible path pattern for the expected output file(s). |
format |
str |
"text" |
Serialisation format: "json", "text", "shell", or "root". |
Recognised artifact_type values (LAW_ARTIFACT_TYPES):
| Value | LAW task |
|---|---|
prepare_sample |
PrepareNANOSample |
build_submission |
BuildNANOSubmission |
submit_jobs |
SubmitNANOJobs |
monitor_jobs |
MonitorNANOJobs |
run_job |
RunNANOAnalysisJob |
monitor_state |
State-monitoring tasks |
IntermediateArtifactSchema
Describes a mid-pipeline cached result that can be reused across workflow stages.
| Field | Type | Default | Description |
|---|---|---|---|
schema_version |
int |
1 |
Schema format version. |
artifact_kind |
str |
"" |
Logical kind of the intermediate artifact. Must be one of the recognised kinds (see below). |
output_file |
str |
"" |
Path or pattern of the output ROOT file. Must not be empty. |
tree_name |
str |
"Events" |
Name of the TTree inside the ROOT file. Must not be empty. |
columns |
list[str] |
[] |
Branch/column names materialised in this snapshot. Empty list means all columns are included. |
Recognised artifact_kind values (INTERMEDIATE_ARTIFACT_KINDS):
| Value | Description |
|---|---|
preselection |
Output after an initial preselection filter is applied. |
reduced_skim |
Skim produced after an event-level reduction step. |
column_snapshot |
Snapshot caching derived column/branch values. |
enriched_skim |
Skim enriched with pre-computed physics objects. |
RegionDefinition
Describes a named analysis region declared by RegionManager. Regions select a sub-sample of events via a boolean filter column, and can form a parent–child hierarchy.
| Field | Type | Default | Description |
|---|---|---|---|
schema_version |
int |
1 |
Schema format version. |
name |
str |
"" |
Unique region name (e.g. "signal", "control_wjets"). Must not be empty. |
filter_column |
str |
"" |
Boolean dataframe column that selects events in this region. Must not be empty. |
parent |
str |
"" |
Name of the parent region, or empty string for a root region. |
description |
str |
"" |
Optional human-readable description. |
The validate_region_hierarchy() helper validates a list of RegionDefinition objects, checking for duplicate names, unknown parent references, and cycles in the parent–child graph.
NuisanceGroupDefinition
Collects related systematic variations and records which processes, regions, and downstream tools they apply to.
| Field | Type | Default | Description |
|---|---|---|---|
schema_version |
int |
1 |
Schema format version. |
name |
str |
"" |
Unique group name (e.g. "jet_energy_scale"). Must not be empty. |
group_type |
str |
"shape" |
One of "shape", "rate", "normalization", "other". |
systematics |
list[str] |
[] |
Base names of systematic variations (e.g. ["JES", "JER"]). Each is expected to have Up and Down shifts. |
processes |
list[str] |
[] |
Processes this group applies to. Empty list means all processes. |
regions |
list[str] |
[] |
Analysis regions this group applies to. Empty list means all regions. |
output_usage |
list[str] |
[] |
Downstream tools: "histogram", "datacard", "plot". Empty list means all outputs. |
description |
str |
"" |
Optional human-readable description. |
correlation_group |
str |
"" |
Optional label grouping correlated systematics across definitions. |
The validate_nuisance_coverage() helper checks that all declared systematics have corresponding Up and Down variations present in the output.
Artifact Resolution and Caching
ProvenanceRecord
A snapshot of the versioning context at the time an artifact was produced. Used to determine whether a previously produced artifact is still up-to-date.
| Field | Type | Description |
|---|---|---|
framework_hash |
str \| None |
Git commit hash of the RDFAnalyzerCore framework. |
user_repo_hash |
str \| None |
Git commit hash of the user analysis repository. |
config_mtime |
str \| None |
UTC ISO 8601 modification time of the job configuration file. |
dataset_manifest_hash |
str \| None |
SHA-256 hash of the dataset manifest file. |
Any field may be None when the information is unavailable. Two records are considered matching when every field that is non-None in both records has the same value. Fields that are None in either record are treated as “unknown” and do not cause a mismatch.
from output_schema import ProvenanceRecord
recorded = manifest.provenance()
current = ProvenanceRecord(framework_hash="new_fw_hash")
if not recorded.matches(current):
print("Artifact is stale – consider regenerating")
ArtifactResolutionStatus
An enum.Enum with three values:
| Value | Meaning |
|---|---|
COMPATIBLE |
Schema version matches CURRENT_VERSION and provenance matches (or no comparison was requested). No regeneration needed. |
STALE |
Schema version is current but recorded provenance differs from the current environment. The artifact can still be used but should be regenerated when convenient. |
MUST_REGENERATE |
Schema version does not match CURRENT_VERSION. The artifact is incompatible and must be regenerated before use. |
resolve_artifact()
def resolve_artifact(
artifact,
recorded_provenance: ProvenanceRecord | None = None,
current_provenance: ProvenanceRecord | None = None,
) -> ArtifactResolutionStatus
Determines whether a single schema object is compatible, stale, or must be regenerated by applying the three-step resolution rules:
- MUST_REGENERATE –
artifact.schema_version != artifact.CURRENT_VERSION. - STALE – both provenance records are provided and
recorded_provenance.matches(current_provenance)isFalse. - COMPATIBLE – version is current and provenances match (or comparison was not requested).
resolve_manifest()
def resolve_manifest(
manifest: OutputManifest,
current_provenance: ProvenanceRecord | None = None,
) -> dict[str, ArtifactResolutionStatus]
Resolves all schemas in a manifest. Uses manifest.provenance() as the recorded provenance baseline. Returns a dict mapping role names to statuses:
{
"skim": ArtifactResolutionStatus.COMPATIBLE,
"histograms": ArtifactResolutionStatus.STALE,
"law_artifacts[0]": ArtifactResolutionStatus.COMPATIBLE,
...
}
CachedArtifact
Represents an intermediate artifact written to disk with an attached schema and provenance sidecar.
| Field | Type | Description |
|---|---|---|
artifact_path |
str |
Path to the cached artifact file. |
manifest |
OutputManifest |
Full OutputManifest describing the schemas and provenance. |
cached_at |
str |
ISO 8601 UTC timestamp when the artifact was cached. |
Cache Sidecar Files
The sidecar file is written at {artifact_path}.cache.yaml (the suffix constant CACHE_SIDECAR_SUFFIX = ".cache.yaml"). For example, output/presel.root gets a sidecar at output/presel.root.cache.yaml.
write_cache_sidecar(artifact_path, manifest, cached_at=None) -> str
Writes a CachedArtifact as YAML to {artifact_path}.cache.yaml. Returns the absolute path of the written sidecar.
read_cache_sidecar(artifact_path) -> CachedArtifact
Reads and deserialises the sidecar for a given artifact. Raises FileNotFoundError if no sidecar exists, ValueError if the content is not a valid YAML mapping.
check_cache_validity(artifact_path, current_provenance=None, strict=False) -> ArtifactResolutionStatus
Reads the sidecar and applies resolve_manifest() to determine whether the cached artifact can be reused. Resolution rules (in order):
- MUST_REGENERATE – the artifact file or sidecar does not exist.
- MUST_REGENERATE – any schema version in the sidecar does not match
CURRENT_VERSION. - STALE – all schema versions are current but recorded provenance does not match
current_provenance. - COMPATIBLE – all versions are current and provenance matches (or no
current_provenancewas supplied).
When strict=True, STALE is promoted to MUST_REGENERATE.
from output_schema import (
write_cache_sidecar, check_cache_validity,
ArtifactResolutionStatus, ProvenanceRecord,
)
# After producing presel.root:
write_cache_sidecar("output/presel.root", manifest)
# Later, before reusing the cache:
current = ProvenanceRecord(framework_hash="abc123", user_repo_hash="xyz789")
status = check_cache_validity("output/presel.root", current_provenance=current)
if status == ArtifactResolutionStatus.COMPATIBLE:
pass # safe to reuse
elif status == ArtifactResolutionStatus.STALE:
pass # usable but outdated – regenerate when convenient
elif status == ArtifactResolutionStatus.MUST_REGENERATE:
pass # must regenerate before use
Manifest Merging
After batch jobs produce individual per-sample manifests, the outputs must be merged (e.g. via hadd). The schema API provides two functions for this workflow.
validate_merge_inputs()
def validate_merge_inputs(
manifests: list[OutputManifest],
required_roles: list[str] | None = None,
) -> list[str]
Validates a collection of manifests for merge compatibility. This is the canonical pre-merge validation API: call it before any hadd or histogram-addition step. Returns a (possibly empty) list of human-readable error strings — an empty list means all inputs are compatible and the merge may proceed.
Checks performed (in order):
- At least one manifest is provided.
- Every manifest passes its own
validate()check. - Every manifest passes
check_version_compatibility()(no outdated schema versions). - All manifests expose the same set of scalar artifact roles (
skim,histograms,metadata,cutflow). - For each shared scalar role, all manifests carry the same
schema_version. - All manifests contain the same number of
law_artifacts. - Corresponding
law_artifactsentries share the sameschema_version. - Same count and version checks for
intermediate_artifacts. - If
required_rolesis supplied, every listed role must be present in every manifest.
Valid values for required_roles: "skim", "histograms", "metadata", "cutflow", "law_artifacts", "intermediate_artifacts".
merge_manifests()
def merge_manifests(
manifests: list[OutputManifest],
framework_hash: str | None = None,
user_repo_hash: str | None = None,
required_roles: list[str] | None = None,
) -> OutputManifest
Builds a merged OutputManifest from validated inputs. Calls validate_merge_inputs() internally and raises MergeInputValidationError if any checks fail. Schema definitions (including output_file path patterns) are copied from the first manifest. After the actual file merge (e.g. hadd), callers should update the output_file fields on the returned manifest’s schemas to point to the merged output path before saving.
MergeInputValidationError
RuntimeError subclass raised when merge inputs fail validation. Contains all diagnostics from all invalid inputs so callers can report every problem in a single message.
from output_schema import (
OutputManifest, merge_manifests, MergeInputValidationError,
)
manifests = [OutputManifest.load_yaml(p) for p in manifest_paths]
try:
merged = merge_manifests(
manifests,
framework_hash=current_fw_hash,
required_roles=["histograms"],
)
except MergeInputValidationError as exc:
print(f"Cannot merge: {exc}")
raise
# Run hadd / histogram addition here...
merged.histograms.output_file = "merged_meta.root"
merged.save_yaml("merged/output_manifest.yaml")
Schema Versioning
Version Constants
Each schema class has a module-level version constant and a CURRENT_VERSION class attribute. Bump the module-level constant (and add a note to the version table in output_schema.py) whenever you make a breaking change to that output format.
| Constant | Value | Schema |
|---|---|---|
SKIM_SCHEMA_VERSION |
1 |
SkimSchema |
HISTOGRAM_SCHEMA_VERSION |
1 |
HistogramSchema |
METADATA_SCHEMA_VERSION |
1 |
MetadataSchema |
CUTFLOW_SCHEMA_VERSION |
1 |
CutflowSchema |
LAW_ARTIFACT_SCHEMA_VERSION |
1 |
LawArtifactSchema |
INTERMEDIATE_ARTIFACT_SCHEMA_VERSION |
1 |
IntermediateArtifactSchema |
REGION_DEFINITION_VERSION |
1 |
RegionDefinition |
NUISANCE_GROUP_DEFINITION_VERSION |
1 |
NuisanceGroupDefinition |
OUTPUT_MANIFEST_VERSION |
1 |
OutputManifest |
SCHEMA_REGISTRY
A dict[str, int] mapping schema names to their current versions. Useful for programmatic queries.
from output_schema import SCHEMA_REGISTRY
print(SCHEMA_REGISTRY)
# {
# "skim": 1,
# "histogram": 1,
# "metadata": 1,
# "cutflow": 1,
# "law_artifact": 1,
# "intermediate_artifact": 1,
# "region_definition": 1,
# "nuisance_group_definition": 1,
# "output_manifest": 1,
# }
SchemaVersionError
RuntimeError subclass raised by check_version_compatibility() when a loaded schema version does not match CURRENT_VERSION. Catch this separately from other RuntimeError exceptions to apply migration logic or emit clear user-facing messages.
Usage Examples
Creating an OutputManifest
from output_schema import (
OutputManifest,
SkimSchema,
HistogramSchema,
HistogramAxisSpec,
MetadataSchema,
CutflowSchema,
RegionDefinition,
NuisanceGroupDefinition,
)
manifest = OutputManifest(
skim=SkimSchema(
output_file="output/sample_0.root",
tree_name="Events",
branches=["Muon_pt", "Muon_eta", "MET_pt"],
),
histograms=HistogramSchema(
output_file="output/sample_0_meta.root",
histogram_names=["muon_pt"],
axes=[
HistogramAxisSpec(
variable="Muon_pt",
bins=50,
lower_bound=0.0,
upper_bound=200.0,
label="Muon p_{T} [GeV]",
)
],
),
metadata=MetadataSchema(output_file="output/sample_0_meta.root"),
cutflow=CutflowSchema(output_file="output/sample_0_meta.root"),
regions=[
RegionDefinition(name="signal", filter_column="pass_signal_sel"),
RegionDefinition(
name="signal_tight",
filter_column="pass_tight_sel",
parent="signal",
),
],
nuisance_groups=[
NuisanceGroupDefinition(
name="jet_energy_scale",
group_type="shape",
systematics=["JES", "JER"],
output_usage=["histogram", "datacard"],
)
],
framework_hash="a1b2c3d4",
user_repo_hash="9f8e7d6c",
)
errors = manifest.validate()
if errors:
raise ValueError("Schema validation failed:\n" + "\n".join(errors))
manifest.save_yaml("output/output_manifest.yaml")
Loading and Validating a Manifest
from output_schema import OutputManifest, SchemaVersionError
# Load the manifest
manifest = OutputManifest.load_yaml("output/output_manifest.yaml")
# Validate structural correctness
errors = manifest.validate()
if errors:
for e in errors:
print(f" ERROR: {e}")
# Check that all schema versions are current (raises if not)
try:
OutputManifest.check_version_compatibility(manifest)
except SchemaVersionError as exc:
print(f"Schema version mismatch: {exc}")
# apply migration logic or abort
Checking Artifact Status
from output_schema import (
OutputManifest,
ProvenanceRecord,
resolve_manifest,
ArtifactResolutionStatus,
)
manifest = OutputManifest.load_yaml("output/output_manifest.yaml")
current = ProvenanceRecord(
framework_hash="new_fw_hash_abc",
user_repo_hash="new_user_hash_xyz",
)
statuses = resolve_manifest(manifest, current_provenance=current)
for role, status in statuses.items():
if status == ArtifactResolutionStatus.MUST_REGENERATE:
print(f" {role}: MUST REGENERATE (schema version mismatch)")
elif status == ArtifactResolutionStatus.STALE:
print(f" {role}: stale (environment changed, regenerate when convenient)")
else:
print(f" {role}: compatible")
Merging Manifests
from output_schema import (
OutputManifest,
merge_manifests,
MergeInputValidationError,
)
import glob
manifest_paths = glob.glob("jobs/*/output_manifest.yaml")
manifests = [OutputManifest.load_yaml(p) for p in manifest_paths]
try:
merged = merge_manifests(
manifests,
framework_hash="a1b2c3d4",
required_roles=["histograms"],
)
except MergeInputValidationError as exc:
print(f"Cannot merge: {exc}")
raise
# Run the actual file merge (e.g. hadd) here, then update output paths:
merged.histograms.output_file = "merged/merged_meta.root"
merged.save_yaml("merged/output_manifest.yaml")