Dataset Manifest
Overview
The dataset manifest is the primary metadata registry for HEP dataset samples in RDFAnalyzerCore. It provides a rich, structured model that replaces flat sample-config text files with a fully-typed, queryable YAML format.
What it stores:
- Integrated luminosity (global and per-year)
- Per-sample physics normalisation: cross-section, filter efficiency, k-factors, sum of weights
- Dataset identity: year, era, campaign, physics process, sample group, stitching IDs
- File discovery: DAS paths for CMS NanoAOD, explicit file lists, XRootD URLs
- Dataset type:
"mc"or"data" - Derived dataset lineage via
parentreferences - Friend tree / sidecar attachments for extra branches from separate ROOT files
Analyses and LAW tasks query datasets through this metadata model rather than relying solely on flat file lists, enabling reproducible and auditable dataset selection.
YAML Format
The top-level structure of a dataset manifest YAML file:
lumi: 59740.0 # Global / fallback integrated luminosity in pb^-1
lumi_by_year: # Optional per-year luminosities (take precedence over lumi)
"2018": 59740.0
"2022": 38010.0
"2023": 27010.0
whitelist: [] # Rucio site whitelist (empty = all sites allowed)
blacklist: [] # Rucio site blacklist
datasets:
- name: DYJets_2018
year: 2018
era: null # null for MC
campaign: Run2 UL
process: DYJets
group: DY
dtype: mc
xsec: 6077.22
filter_efficiency: 1.0
kfac: 1.0
extra_scale: 1.0
sum_weights: 1234567.0
stitch_id: 0
das: /DYJetsToLL_M-50_TuneCP5_13TeV-amcatnloFXFX-pythia8/RunIISummer20UL18NanoAODv9-106X_upgrade2018_realistic_v16_L1v1-v2/NANOAODSIM
files: []
parent: null
friend_trees: []
- name: SingleMuon_2018C
year: 2018
era: Run2018C
campaign: null
process: data
group: data
dtype: data
xsec: null
filter_efficiency: 1.0
kfac: 1.0
extra_scale: 1.0
sum_weights: null
stitch_id: null
das: /SingleMuon/Run2018C-UL2018_MiniAODv2_NanoAODv9-v1/NANOAOD
files: []
parent: null
friend_trees: []
- name: DYJets_2018_skim
year: 2018
era: null
campaign: Run2 UL
process: DYJets
group: DY
dtype: mc
xsec: 6077.22
filter_efficiency: 1.0
kfac: 1.0
extra_scale: 1.0
sum_weights: 1234567.0
stitch_id: null
das: null
files:
- /data/skims/DYJets_2018_skim.root
- root://eosserver.cern.ch//eos/cms/DYJets_2018_skim.root
parent: DYJets_2018
friend_trees:
- alias: calib
tree_name: Events
files:
- /data/calib/DYJets_2018_calib.root
directory: null
globs:
- .root
antiglobs: []
index_branches:
- run
- luminosityBlock
DatasetEntry
A DatasetEntry is a Python dataclass (@dataclass) that models a single HEP dataset sample.
Identity Fields
| Field | Type | Default | Description |
|---|---|---|---|
name |
str |
(required) | Unique sample identifier. Used by submission scripts and LAW tasks. Must be unique within a manifest. |
year |
int \| None |
None |
Data-taking or MC production year (e.g. 2022, 2023). |
era |
str \| None |
None |
Run era within the year (e.g. "Run2018C"). Set to None for MC. |
campaign |
str \| None |
None |
MC production campaign tag (e.g. "Run3Summer22NanoAODv12"). |
process |
str \| None |
None |
Physics process label (e.g. "ttbar", "wjets", "data"). |
group |
str \| None |
None |
Sample grouping label for collecting stitched or HT-binned samples (e.g. "wjets_ht"). All samples sharing a group can be queried together with DatasetManifest.query(group=...). |
stitch_id |
int \| None |
None |
Integer stitching code written into the event stream and read back by the CounterService counterIntWeightBranch mechanism. |
sample_type |
int \| None |
None |
Optional analysis-specific integer sample code forwarded to legacy batch workflows. Use this when you need a numeric per-sample code while preserving dtype for framework-level mc / data handling. |
Normalisation Fields
| Field | Type | Default | Description |
|---|---|---|---|
xsec |
float \| None |
None |
Cross-section in pb. Leave None for real data. |
filter_efficiency |
float |
1.0 |
Generator-level filter efficiency (fraction surviving the generator filter). |
kfac |
float |
1.0 |
QCD k-factor to scale the LO cross-section to (N)NLO. |
extra_scale |
float |
1.0 |
Additional arbitrary scale factor. |
sum_weights |
float \| None |
None |
Sum of generator weights. Computed at run-time when None. |
Type Field
| Field | Type | Default | Description |
|---|---|---|---|
dtype |
str |
"mc" |
Dataset type. Either "mc" (simulation) or "data" (real data). |
File Discovery Fields
| Field | Type | Default | Description |
|---|---|---|---|
das |
str \| None |
None |
Comma-separated DAS path(s) for CMS NanoAOD Rucio discovery (e.g. /TTto2L2Nu.../NANOAODSIM). |
files |
list[str] |
[] |
Explicit list of ROOT file paths or XRootD URLs. Used instead of das for open-data or local file inputs. |
Provenance Field
| Field | Type | Default | Description |
|---|---|---|---|
parent |
str \| None |
None |
Name of the parent DatasetEntry from which this dataset was derived. Enables lineage tracking for skimmed or filtered datasets. |
Sidecar Field
| Field | Type | Default | Description |
|---|---|---|---|
friend_trees |
list[FriendTreeConfig] |
[] |
Friend tree / sidecar configurations to attach to this dataset. See FriendTreeConfig. |
Methods
| Method | Description |
|---|---|
validate() |
Not present as a standalone method on DatasetEntry; validation is performed at the manifest level via DatasetManifest.validate(). |
to_dict() -> dict |
Return the full entry as a plain Python dict suitable for YAML serialisation. None values are preserved. |
from_dict(data: dict) -> DatasetEntry |
Classmethod. Construct a DatasetEntry from a plain dict. Unknown keys are silently ignored for forward-compatibility. Nested friend_trees dicts are deserialised automatically. |
to_legacy_dict() -> dict[str, str] |
Return a {key: str} dict compatible with the legacy getSampleList / _get_sample_list format. Only non-None values are included; all values are stringified. |
FriendTreeConfig
A FriendTreeConfig describes a sidecar or friend tree input attached to a dataset. The framework attaches the friend TChain to the main chain before the RDataFrame is created, making all friend branches immediately accessible.
Both local file paths and XRootD remote URLs (root://...) are supported in the files list.
Fields
| Field | Type | Default | Description |
|---|---|---|---|
alias |
str |
(required) | Alias used to access friend tree branches in ROOT. Branches from a friend registered as "calib" are accessible as calib.pt, calib.eta, etc. |
tree_name |
str |
"Events" |
Name of the TTree inside the friend ROOT file(s). |
files |
list[str] |
[] |
Explicit list of ROOT file paths or XRootD URLs. Takes precedence over directory when both are set. |
directory |
str \| None |
None |
Path to a local directory scanned recursively for ROOT files. Used when no explicit files list is provided. |
globs |
list[str] |
[".root"] |
Filename patterns to include when scanning directory. |
antiglobs |
list[str] |
[] |
Filename patterns to exclude when scanning directory. |
index_branches |
list[str] |
[] |
Branch names used as event identifiers for index-based matching. When non-empty, the framework calls TChain::BuildIndex using the first two entries as the major and minor index keys. |
index_branches Common Configurations
| Value | Use case |
|---|---|
["run", "luminosityBlock"] |
Match on run and luminosity block |
["run", "event"] |
Match on run and event number |
[] (empty) |
Position-based (sequential-order) matching — the default ROOT friend tree behaviour |
Use Case
Attach pre-computed calibration corrections or tagger outputs from separate ROOT files:
friend_trees:
- alias: calib
tree_name: Events
files:
- /data/calib_2022.root
- root://eosserver.cern.ch//eos/data/calib_remote.root
index_branches:
- run
- luminosityBlock
Methods
| Method | Description |
|---|---|
to_dict() -> dict |
Return the config as a plain Python dict suitable for YAML serialisation. |
from_dict(data: dict) -> FriendTreeConfig |
Classmethod. Construct from a plain dict. Unknown keys are silently ignored. |
DatasetManifest API
DatasetManifest is a container for a collection of DatasetEntry objects with rich querying capabilities.
Constructor
DatasetManifest(
datasets: list[DatasetEntry] | None = None,
lumi: float = 1.0,
whitelist: list[str] | None = None,
blacklist: list[str] | None = None,
lumi_by_year: dict[int, float] | None = None,
)
Properties
| Attribute | Type | Description |
|---|---|---|
datasets |
list[DatasetEntry] |
All registered dataset entries. |
lumi |
float |
Global integrated luminosity in pb⁻¹. Used as the fallback when no per-year entry is found. |
lumi_by_year |
dict[int, float] |
Optional per-year luminosity mapping (year → pb⁻¹). Takes precedence over the global lumi in lumi_for(). |
whitelist |
list[str] |
Rucio site whitelist (empty = all sites allowed). |
blacklist |
list[str] |
Rucio site blacklist. |
Query Methods
by_name(name: str) -> DatasetEntry | None
Return the DatasetEntry with the given name, or None if not found.
entry = manifest.by_name("ttbar_powheg_2022")
query(**kwargs) -> list[DatasetEntry]
Filter datasets by metadata criteria. All supplied keyword arguments are combined with logical AND. Each argument may be a single value or a list of values (list matches any entry whose field is contained in that list — logical OR within the field).
| Parameter | Type | Description |
|---|---|---|
year |
int \| list[int] |
Filter by data-taking year. |
era |
str \| list[str] |
Filter by run era. |
campaign |
str \| list[str] |
Filter by MC production campaign. |
process |
str \| list[str] |
Filter by physics process label. |
group |
str \| list[str] |
Filter by sample group label. |
dtype |
str \| list[str] |
Filter by dataset type ("mc" or "data"). |
parent |
str \| list[str] |
Filter by parent dataset name. |
# All 2022 MC ttbar samples
ttbar_2022 = manifest.query(year=2022, dtype="mc", process="ttbar")
# Samples across multiple years
multi = manifest.query(year=[2022, 2023], dtype="mc", process="ttbar")
# All samples in a stitched group
wjets = manifest.query(group="wjets_ht")
lumi_for(year=None, era=None) -> float
Return the luminosity in pb⁻¹ for the given year. Look-up order:
- If
yearis given and present inlumi_by_year, return that value. - Otherwise fall back to the global
lumi.
Note: The
eraparameter is reserved for future per-era look-up and is currently ignored.
lumi_2022 = manifest.lumi_for(year=2022) # from lumi_by_year
lumi_all = manifest.lumi_for() # global lumi
Discovery helpers
| Method | Description |
|---|---|
get_groups() -> list[str] |
Sorted list of unique group labels across all datasets. |
get_processes() -> list[str] |
Sorted list of unique process labels. |
get_years() -> list[int] |
Sorted list of unique years. |
get_eras() -> list[str] |
Sorted list of unique era labels across all datasets. |
get_eras_for_year(year: int) -> list[str] |
Sorted list of unique era labels for a specific year. |
I/O Methods
DatasetManifest.load(path: str) -> DatasetManifest
Classmethod. Auto-detect format and load a manifest:
- Files ending in
.yaml/.ymlare loaded as YAML manifests. - All other files are interpreted as legacy key=value text configs.
DatasetManifest.load_yaml(path: str) -> DatasetManifest
Classmethod. Load a YAML manifest file.
DatasetManifest.save_yaml(path: str) -> None
Serialise the manifest to a YAML file. Parent directories are created automatically.
DatasetManifest.file_hash(path: str) -> str
Static method. Return the SHA-256 hex digest of the manifest file at path (64-character lowercase hex string, or "<not found>" if the file cannot be opened). Useful for recording the exact manifest revision used in a workflow task, enabling reproducible dataset selection.
h = DatasetManifest.file_hash("datasets.yaml")
# "3a5f2b..." – stable identifier for this manifest revision
Validation
validate() -> list[str]
Check the manifest for consistency and return a list of error/warning messages. An empty list means the manifest is consistent.
Checks performed:
- Duplicate names — every
DatasetEntry.namemust be unique. - Missing parent references — if
entry.parentis set, aDatasetEntrywith that name must exist in the manifest. - Unknown
lumi_by_yearkeys — years inlumi_by_yearthat do not appear in any dataset are reported as"Warning:"messages.
errors = manifest.validate()
if errors:
for msg in errors:
print(msg)
Container Protocol
DatasetManifest supports len() and iteration:
print(f"Manifest contains {len(manifest)} datasets")
for entry in manifest:
print(entry.name)
Python Examples
Creating a Manifest Programmatically
from dataset_manifest import DatasetManifest, DatasetEntry, FriendTreeConfig
entry_mc = DatasetEntry(
name="ttbar_powheg_2022",
year=2022,
campaign="Run3Summer22NanoAODv12",
process="ttbar",
group="ttbar",
dtype="mc",
xsec=98.34,
kfac=1.0,
sum_weights=8_765_432.0,
das="/TTto2L2Nu_TuneCP5_13p6TeV_powheg-pythia8/Run3Summer22NanoAODv12-130X_mcRun3_2022_realistic_v5-v2/NANOAODSIM",
)
entry_data = DatasetEntry(
name="Muon_Run2022C",
year=2022,
era="Run2022C",
process="data",
group="data",
dtype="data",
das="/Muon/Run2022C-22Sep2023-v1/NANOAOD",
)
manifest = DatasetManifest(
datasets=[entry_mc, entry_data],
lumi=38010.0,
lumi_by_year={2022: 38010.0, 2023: 27010.0},
)
errors = manifest.validate()
if errors:
raise ValueError("Manifest invalid:\n" + "\n".join(errors))
manifest.save_yaml("datasets.yaml")
Loading from YAML
from dataset_manifest import DatasetManifest
manifest = DatasetManifest.load("datasets.yaml")
# Check the manifest revision
file_hash = DatasetManifest.file_hash("datasets.yaml")
print(f"Manifest revision: {file_hash}")
# Validate
errors = manifest.validate()
if errors:
for msg in errors:
print(f" {msg}")
Querying Datasets
from dataset_manifest import DatasetManifest
manifest = DatasetManifest.load("datasets.yaml")
# Single dataset by name
ttbar = manifest.by_name("ttbar_powheg_2022")
if ttbar:
print(f"ttbar xsec: {ttbar.xsec} pb")
# All MC samples for 2022 ttbar
samples = manifest.query(year=2022, dtype="mc", process="ttbar")
print(f"Found {len(samples)} ttbar MC samples")
# Samples from multiple years at once
run3 = manifest.query(year=[2022, 2023], dtype="mc")
print(f"Run 3 MC samples: {len(run3)}")
# All samples in a stitched W+jets group
wjets = manifest.query(group="wjets_ht")
# Per-year luminosity
lumi_2022 = manifest.lumi_for(year=2022)
lumi_2023 = manifest.lumi_for(year=2023)
lumi_total = lumi_2022 + lumi_2023
print(f"Total Run 3 lumi: {lumi_total:.0f} pb^-1")
Iterating Over MC and Data Datasets Separately
from dataset_manifest import DatasetManifest
manifest = DatasetManifest.load("datasets.yaml")
mc_samples = manifest.query(dtype="mc")
data_samples = manifest.query(dtype="data")
print("MC samples:")
for sample in mc_samples:
lumi = manifest.lumi_for(year=sample.year)
effective_xsec = (
sample.xsec * sample.filter_efficiency * sample.kfac * sample.extra_scale
if sample.xsec is not None
else None
)
print(f" {sample.name}: xsec_eff = {effective_xsec:.4f} pb" if effective_xsec else f" {sample.name}")
print("\nData samples:")
for sample in data_samples:
print(f" {sample.name} (era: {sample.era}, year: {sample.year})")
# Discover all unique years and eras
print(f"\nYears in manifest: {manifest.get_years()}")
for year in manifest.get_years():
print(f" {year}: eras = {manifest.get_eras_for_year(year)}")
Working with Friend Trees
from dataset_manifest import DatasetManifest, DatasetEntry, FriendTreeConfig
entry = DatasetEntry(
name="DYJets_2022",
year=2022,
process="DYJets",
dtype="mc",
xsec=6077.22,
das="/DYJetsToLL_M-50_TuneCP5_13p6TeV-amcatnloFXFX-pythia8/Run3Summer22NanoAODv12-130X_mcRun3_2022_realistic_v5-v2/NANOAODSIM",
friend_trees=[
FriendTreeConfig(
alias="calib",
tree_name="Events",
files=[
"/data/calib/DYJets_2022_jerc.root",
"root://eosserver.cern.ch//eos/cms/DYJets_2022_jerc_remote.root",
],
index_branches=["run", "luminosityBlock"],
),
FriendTreeConfig(
alias="tagger",
tree_name="Events",
directory="/data/taggers/2022/",
globs=[".root"],
antiglobs=["_failed"],
),
],
)
# Serialise
d = entry.to_dict()
# Load back
from dataset_manifest import DatasetEntry
reloaded = DatasetEntry.from_dict(d)
assert reloaded.friend_trees[0].alias == "calib"
Integration with LAW Tasks
LAW workflow tasks (PrepareNANOSample, BuildNANOSubmission, RunNANOAnalysisJob, …) accept the dataset manifest via the --dataset-manifest command-line parameter:
law run RunNANOAnalysisJob \
--dataset-manifest datasets.yaml \
--config config/analysis.yaml \
--sample DYJets_2022 \
--branch 0
Within a LAW task, the manifest is typically loaded, queried, and the resolved dataset selection is recorded in DatasetManifestProvenance for full reproducibility:
from dataset_manifest import DatasetManifest
from output_schema import DatasetManifestProvenance, OutputManifest
# Load the manifest
manifest = DatasetManifest.load(self.dataset_manifest)
manifest_hash = DatasetManifest.file_hash(self.dataset_manifest)
# Query the relevant subset
query = {"year": self.year, "dtype": "mc", "process": self.process}
entries = manifest.query(**query)
# Record provenance for the output manifest
prov = DatasetManifestProvenance(
manifest_path=self.dataset_manifest,
manifest_hash=manifest_hash,
query_params=query,
resolved_entry_names=[e.name for e in entries],
)
output_manifest = OutputManifest(
skim=...,
dataset_manifest_provenance=prov,
framework_hash=current_fw_hash,
)
output_manifest.save_yaml(self.output_manifest_path)
The DatasetManifestProvenance records:
| Field | Description |
|---|---|
manifest_path |
Path to the manifest file. |
manifest_hash |
SHA-256 hex digest of the manifest file (from DatasetManifest.file_hash()). |
query_params |
Dict of keyword arguments passed to DatasetManifest.query(). |
resolved_entry_names |
Ordered list of DatasetEntry.name values selected by the query. |
This makes every dataset selection fully auditable and reproducible: given the manifest hash and query parameters, the resolved entry list can be independently verified.
Legacy Format Compatibility
For backward compatibility, DatasetManifest.load() also accepts the legacy key=value text format used by older submission scripts:
lumi=59740.0
WL=T2_DE_RWTH,T2_CH_CERN
BL=T2_US_Vanderbilt
name=DYJets_2018 xsec=6077.22 das=/DYJetsToLL/... type=mc kfac=1.0 norm=1234567.0 year=2018
name=SingleMuon_2018C das=/SingleMuon/... type=data year=2018 era=Run2018C
Global directives:
lumi=<float>— integrated luminosityWL=site1,site2— site whitelistBL=site1,site2— site blacklist
Per-sample recognised fields: name, xsec, das, type ("mc"/"data"), kfac, extraScale, filterEfficiency, norm (sum of weights), year, era, campaign, process, group, stitch_id, parent, fileList (comma-separated file paths).
For YAML manifests, keep framework typing in dtype and place any analysis-specific integer code in sample_type. Submission tools can then materialize that numeric code into per-job integer configs without overloading the main config type key.
The YAML format is preferred for new analyses; the legacy format is maintained solely for compatibility with existing submission scripts.