Configuration Validation
RDFAnalyzerCore ships a configuration validation tool that catches misspelled keys, missing files, invalid numeric values and structural inconsistencies before a long analysis run starts. Fixing problems early avoids silent data-quality issues and wasted CPU time on the grid.
The tool lives in core/python/validate_config.py and supports two independent validation modes:
| Mode | Input | Use when |
|---|---|---|
| Submit config | key=value text file | validating a snapshot/analysis submission config before running |
| Analysis config | YAML file | validating regions, nuisances, histograms, friend trees and plugins |
CLI Usage
python -m validate_config --config path/to/submit_config.txt [--mode {auto,nano,opendata}]
python -m validate_config --analysis-config path/to/analysis.yaml
The two modes are mutually exclusive; exactly one of --config or --analysis-config must be supplied.
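The "exactly one of" rule can be expressed with a mutually exclusive argument group. The following is a minimal sketch using the standard library `argparse`; the function name `build_parser` is hypothetical and the real tool may be structured differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags (an assumption,
    not the tool's actual source)."""
    parser = argparse.ArgumentParser(prog="validate_config")
    # required=True enforces that exactly one of the two inputs is given.
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--config", metavar="PATH")
    group.add_argument("--analysis-config", metavar="PATH", dest="analysis_config")
    parser.add_argument(
        "--mode", choices=["auto", "nano", "opendata"], default="auto"
    )
    return parser


args = build_parser().parse_args(["--config", "submit_config.txt", "--mode", "nano"])
print(args.config, args.mode)
```

With this layout, passing both flags or neither makes `argparse` print a usage message and exit before any validation runs.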
Options
| Flag | Description |
|---|---|
| --config PATH | Validate a submit config text file (see Submit Config Validation). |
| --analysis-config PATH | Validate a full YAML analysis config (see Analysis Config Validation). |
| --mode {auto,nano,opendata} | Applies only to --config. Controls how the embedded sample config is interpreted. auto (default) detects the mode from the content of the sample config file. |
Exit codes
| Code | Meaning |
|---|---|
| 0 | Validation passed (warnings may still be printed). |
| 1 | One or more errors were found, or the config file was not found. |
Example output
Config validation warnings:
- Missing 'norm' in sample config (line 4): 'norm' will be calculated on the fly and is deprecated
Config validation OK
Config validation failed:
- Missing required key 'sampleConfig' in submit config
- File not found for 'floatConfig': /path/to/float_config.txt
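The relationship between the printed messages and the exit code can be sketched as follows. The header strings are taken from the example output above; the function name `report` and the exact indentation of the bullet lines are assumptions for illustration.

```python
def report(errors: list[str], warnings: list[str]) -> int:
    """Print findings in the style of the example output and return the
    exit code (0 = passed, 1 = errors). Hypothetical helper."""
    if warnings:
        print("Config validation warnings:")
        for message in warnings:
            print(f"  - {message}")
    if errors:
        print("Config validation failed:")
        for message in errors:
            print(f"  - {message}")
        return 1
    # Warnings alone do not fail the run.
    print("Config validation OK")
    return 0
```

A caller would typically pass the return value to `sys.exit()` so that batch scripts can stop before submitting jobs.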
Submit Config Validation
validate_submit_config(config_path, mode='auto')
Validates a key=value text file used to configure the snapshot / analysis submission step.
from validate_config import validate_submit_config
errors, warnings = validate_submit_config("configs/submit_config.txt", mode="nano")
Returns (errors: list[str], warnings: list[str]).
Raises ValidationError if config_path does not exist.
What is checked
Required keys
The following keys must be present and non-empty in every submit config:
| Key | Purpose |
|---|---|
| sampleConfig | Path to the sample/dataset config file (text or YAML). |
| saveDirectory | Output directory for the snapshot/analysis results. |
| saveTree | Name of the ROOT tree to save. |
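As an illustration of the required-key check, here is a minimal sketch that parses key=value text and reports missing or empty required keys. The handling of `#` comments and blank lines is an assumption (the source does not specify the comment syntax), and both function names are hypothetical.

```python
REQUIRED_KEYS = ("sampleConfig", "saveDirectory", "saveTree")


def parse_key_value(text: str) -> dict[str, str]:
    """Parse key=value lines. Assumption: blank lines and '#' comments
    are ignored; the real parser may behave differently."""
    entries: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        entries[key.strip()] = value.strip()
    return entries


def missing_required(entries: dict[str, str]) -> list[str]:
    """Return one error per required key that is absent or empty."""
    return [
        f"Missing required key '{key}' in submit config"
        for key in REQUIRED_KEYS
        if not entries.get(key)
    ]
```

Note that a key with an empty value (`saveDirectory=`) is treated the same as a missing key, matching the "present and non-empty" wording above.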
Optional file-existence checks
If any of the following keys is present and non-empty, the referenced file is checked for existence on the filesystem:
| Key | Must exist |
|---|---|
| sampleConfig | Yes – error if the file is missing |
| floatConfig | Yes – error if the file is missing |
| intConfig | Yes – error if the file is missing |
| saveConfig | No – warning if the file is missing |
Paths are resolved relative to the parent directory of the submit config file.
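The resolution rule can be sketched with `pathlib`. The helper name `resolve_relative` is hypothetical, and passing absolute paths through unchanged is an assumption that the source does not confirm.

```python
from pathlib import Path


def resolve_relative(submit_config: str, value: str) -> Path:
    """Resolve `value` against the directory containing the submit config.

    Assumption: absolute paths are used as-is.
    """
    candidate = Path(value)
    if candidate.is_absolute():
        return candidate
    return Path(submit_config).resolve().parent / candidate


path = resolve_relative("configs/submit_config.txt", "float_config.txt")
print(path, path.exists())
```

In practice this means a submit config can be validated from any working directory, as long as the files it references sit at the expected location next to it.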
threads value
When threads is present its value must be an integer, the string auto, or the string max; anything else is an error.
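A minimal sketch of this rule, assuming the value arrives as a string from the key=value file. The function name and the error wording are hypothetical; whether negative integers are accepted is not stated in the source, so this sketch accepts anything `int()` can parse.

```python
from typing import Optional


def threads_error(value: str) -> Optional[str]:
    """Return an error message for an invalid `threads` value, else None."""
    text = value.strip()
    if text in ("auto", "max"):
        return None
    try:
        int(text)  # assumption: any integer literal is acceptable
    except ValueError:
        return (
            f"Invalid 'threads' value {value!r}: "
            "expected an integer, 'auto' or 'max'"
        )
    return None
```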
Sample config validation
The sample config (pointed to by sampleConfig) is validated according to its format:
- YAML (.yaml/.yml) – loaded via DatasetManifest.load_yaml() and then mode-specific field checks are applied (see below).
- Text (any other extension) – parsed as key=value entries and mode-specific checks are applied.
Mode detection
When --mode auto is used (the default) the mode is inferred from the sample config content:
- OpenData – any entry that has a purely numeric das field (a CERN Open Data record ID) triggers OpenData mode.
- NANO – everything else.
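The detection rule above can be sketched as a single pass over the parsed sample entries. The function name `detect_mode` and the representation of samples as a list of dicts are assumptions for illustration.

```python
def detect_mode(samples: list[dict[str, str]]) -> str:
    """Return 'opendata' if any entry has an all-digit `das` value,
    otherwise 'nano'. Hypothetical helper, not the tool's own code."""
    for sample in samples:
        das = str(sample.get("das", "")).strip()
        # str.isdigit() is False for the empty string and for DAS paths.
        if das.isdigit():
            return "opendata"
    return "nano"
```

A single numeric `das` value is enough to switch the whole file to OpenData mode, so mixing record IDs and DAS paths in one sample config would apply OpenData checks to every entry.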
NANO mode sample checks
Each sample entry is expected to contain:
| Field | Type | Requirement |
|---|---|---|
| das | string | Required (DAS path to CMS dataset) |
| xsec | float | Required; must be a valid float |
| type | int | Required; must be a valid integer |
| norm | float | Deprecated; warning if absent, error if present but not a float |
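To make the table concrete, the following sketch applies the four field rules to one sample entry. It assumes each entry has already been parsed into a dict of strings; the function name and message wording are hypothetical.

```python
def check_nano_sample(
    name: str, fields: dict[str, str]
) -> tuple[list[str], list[str]]:
    """Apply the NANO-mode field rules to one sample entry (a sketch)."""
    errors: list[str] = []
    warnings: list[str] = []
    if not fields.get("das"):
        errors.append(f"{name}: missing 'das'")
    # xsec and type are required and must parse as float and int.
    for key, cast in (("xsec", float), ("type", int)):
        if key not in fields:
            errors.append(f"{name}: missing '{key}'")
            continue
        try:
            cast(fields[key])
        except ValueError:
            errors.append(f"{name}: '{key}' is not a valid {cast.__name__}")
    # norm: absence is only a warning, a malformed value is an error.
    if "norm" not in fields:
        warnings.append(f"{name}: missing 'norm'")
    else:
        try:
            float(fields["norm"])
        except ValueError:
            errors.append(f"{name}: 'norm' is not a valid float")
    return errors, warnings
```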
OpenData mode sample checks
| Check | Severity |
|---|---|
| A recids entry must be present | Error |
| Each sample entry must have a das field (record ID) | Error |
| Numeric fields (xsec, norm, kfac, extraScale, lumi) must be valid floats | Error |
| type field must be a valid integer | Error |
Analysis Config Validation
validate_analysis_config(config_path)
Validates a comprehensive YAML analysis configuration file. All validation is structural – the function does not execute any C++ code or load ROOT files.
from validate_config import validate_analysis_config
errors, warnings = validate_analysis_config("configs/analysis.yaml")
Returns (errors: list[str], warnings: list[str]).
Raises ValidationError if config_path does not exist.
The recognised top-level keys are:
| Key | Validator called |
|---|---|
| regions | validate_regions_config() |
| nuisance_groups | validate_nuisance_groups_config() |
| histogram_config | validate_histogram_config_file() |
| friend_trees | validate_friend_config() |
| plugins | validate_plugin_config() (checked against _KNOWN_PLUGINS) |
Unknown top-level keys generate a warning. If both friend_trees and friends are present simultaneously, a warning is emitted and only friend_trees is validated.
validate_regions_config(regions_data, source_context='')
Validates the regions list. Each entry must be a dict with at minimum:
| Key | Requirement |
|---|---|
| name | Required |
| filter_column | Required |
| parent | Optional (string name of a parent region) |
| description | Optional |
After individual entry validation the region hierarchy is checked for cycles, unknown parent references and duplicate names via validate_region_hierarchy().
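The three hierarchy checks (duplicate names, unknown parents, cycles) can be sketched as below. This is an illustrative stand-in, not the source of validate_region_hierarchy(); the function name and messages are hypothetical, and entries are assumed to have already passed the per-entry check, so `name` is present.

```python
from typing import Optional


def check_region_hierarchy(regions: list[dict]) -> list[str]:
    """Report duplicate names, unknown parents and parent cycles."""
    errors: list[str] = []
    parents: dict[str, Optional[str]] = {}
    for region in regions:
        name = region["name"]
        if name in parents:
            errors.append(f"Duplicate region name '{name}'")
            continue
        parents[name] = region.get("parent")
    for name, parent in parents.items():
        if parent is not None and parent not in parents:
            errors.append(f"Region '{name}' has unknown parent '{parent}'")
    # Walk up the parent chain; returning to the start means a cycle.
    for start in parents:
        seen = {start}
        node = parents[start]
        while node is not None and node in parents:
            if node == start:
                errors.append(f"Region '{start}' is part of a parent cycle")
                break
            if node in seen:
                break  # chain reaches a cycle that does not contain `start`
            seen.add(node)
            node = parents[node]
    return errors
```

Each region in a cycle is reported once, which keeps the message list short while still naming every region that needs to be fixed.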
validate_nuisance_groups_config(nuisance_data, source_context='')
Validates the nuisance_groups list. Each entry must be a dict with at minimum a name key. Additional checks:
- Parsed via NuisanceGroupDefinition.from_dict() and its own validate() method.
- Duplicate name values are errors.
- A group with no systematics declared generates a warning.
- Each output_usage value must be one of the recognised usages (e.g. HISTOGRAM, DATACARD).
validate_histogram_config_file(path)
Validates a whitespace-separated histogram configuration text file. Each non-comment, non-blank line is treated as a histogram entry.
Required keys per entry:
| Key | Type | Constraint |
|---|---|---|
| name | string | Unique within the file |
| variable | string | None |
| bins | int | Must be positive |
| lowerBound | float | Must be less than upperBound |
| upperBound | float | Must be greater than lowerBound |
The validator also checks the optional channel, control-region and sample-category axis keys (channelBins, controlRegionBins, sampleCategoryBins, etc.) for correct numeric types and bound ordering. Unknown keys generate a warning. Duplicate histogram names are errors.
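The required-key rules can be illustrated with a sketch that checks one histogram entry. It assumes the line has already been split into a dict of string values, because the exact token syntax of the file is not described here; the function name and messages are hypothetical, and the optional axis keys are not covered.

```python
def check_histogram_entry(entry: dict[str, str]) -> list[str]:
    """Check the five required keys of one histogram entry (a sketch)."""
    required = ("name", "variable", "bins", "lowerBound", "upperBound")
    errors = [f"missing '{key}'" for key in required if key not in entry]
    if errors:
        return errors
    try:
        if int(entry["bins"]) <= 0:
            errors.append("'bins' must be positive")
    except ValueError:
        errors.append("'bins' is not a valid integer")
    try:
        lower = float(entry["lowerBound"])
        upper = float(entry["upperBound"])
    except ValueError:
        errors.append("'lowerBound' and 'upperBound' must be valid floats")
    else:
        if lower >= upper:
            errors.append("'lowerBound' must be less than 'upperBound'")
    return errors
```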
validate_friend_config(friends_data, source_context='')
Validates the friend_trees list. Each entry must be a dict with:
| Key | Requirement |
|---|---|
| alias | Required; must be unique |
| tree_name / treeName | Optional string |
| directory | Optional string |
| files / fileList | Optional list |
| globs / antiglobs | Optional lists |
| index_branches / indexBranches | Optional lists |
A warning is issued when an entry has neither a files/fileList source nor a directory. Unknown keys generate a warning.
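A minimal sketch of the alias and source checks, assuming the friend_trees list has already been loaded from YAML. The function name and messages are hypothetical, and the type checks and unknown-key warnings are left out for brevity.

```python
def check_friends(friends: list[dict]) -> tuple[list[str], list[str]]:
    """Check alias presence/uniqueness and warn on entries with no source."""
    errors: list[str] = []
    warnings: list[str] = []
    seen: set[str] = set()
    for index, friend in enumerate(friends):
        alias = friend.get("alias")
        if not alias:
            errors.append(f"friend_trees[{index}]: missing 'alias'")
            continue
        if alias in seen:
            errors.append(f"friend_trees[{index}]: duplicate alias '{alias}'")
        seen.add(alias)
        # Both spellings of the file list key are accepted.
        has_files = bool(friend.get("files") or friend.get("fileList"))
        if not has_files and not friend.get("directory"):
            warnings.append(
                f"friend '{alias}': neither files/fileList nor directory given"
            )
    return errors, warnings
```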
Plugin Config Validation
The plugins section of an analysis config is a mapping (dict) from plugin name to plugin-specific configuration. Each plugin config value may be a dict of options, null, or true (flag-style activation with no options).
validate_plugin_config(plugins_data, source_context='')
Checks that each plugin name is in the known registry and that all declared config keys satisfy type constraints. Unknown plugin names generate a warning (not an error) so that third-party or future plugins are not blocked.
_KNOWN_PLUGINS registry
| Plugin | Optional keys | Key types |
|---|---|---|
| RegionManager | enabled | bool |
| WeightManager | nominal_weight, scale_factors, normalizations, weight_variations | str, list, list, list |
| NDHistogramManager | histogram_config, output_file, region_aware | str, str, bool |
| CutflowManager | enabled | bool |
| CorrectionManager | corrections | list |
| BDTManager | models | list |
| OnnxManager | models | list |
| SofieManager | models | list |
| TriggerManager | triggers | list |
| GoldenJsonManager | json_files | list |
| KinematicFitManager | fits | list |
| NamedObjectManager | objects | list |
All plugins have an empty required_keys list – every key is optional. A plugin entry can therefore be as minimal as PluginName: true.
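The registry-driven check can be sketched as follows. `KNOWN_PLUGINS` here is a three-entry excerpt written for illustration and does not reproduce the layout of the real _KNOWN_PLUGINS object; the function name and messages are likewise hypothetical.

```python
# Illustrative excerpt: plugin name -> {optional key: expected type}.
KNOWN_PLUGINS = {
    "RegionManager": {"enabled": bool},
    "NDHistogramManager": {
        "histogram_config": str,
        "output_file": str,
        "region_aware": bool,
    },
    "TriggerManager": {"triggers": list},
}


def check_plugins(plugins: dict) -> tuple[list[str], list[str]]:
    """Warn on unknown plugins and report type mismatches on known keys."""
    errors: list[str] = []
    warnings: list[str] = []
    for name, config in plugins.items():
        if name not in KNOWN_PLUGINS:
            warnings.append(f"Unknown plugin '{name}'")
            continue
        if config is None or config is True:
            continue  # flag-style activation, nothing to type-check
        if not isinstance(config, dict):
            errors.append(f"Plugin '{name}': config must be a mapping, null or true")
            continue
        for key, expected in KNOWN_PLUGINS[name].items():
            if key in config and not isinstance(config[key], expected):
                errors.append(
                    f"Plugin '{name}': '{key}' must be of type {expected.__name__}"
                )
    return errors, warnings
```

Because every key is optional, only keys that are actually present are type-checked, so `RegionManager: true` and `RegionManager: {enabled: true}` both pass.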
Programmatic Usage
Both validators are importable as a Python library:
from validate_config import validate_analysis_config, validate_submit_config, ValidationError
# --- Analysis YAML config ---
try:
    errors, warnings = validate_analysis_config("configs/analysis.yaml")
except ValidationError as exc:
    print(f"File not found: {exc}")
    raise SystemExit(1)
for w in warnings:
    print(f"WARNING: {w}")
if errors:
    for e in errors:
        print(f"ERROR: {e}")
    raise SystemExit(1)
print("Analysis config is valid.")
# --- Submit config ---
try:
    errors, warnings = validate_submit_config("configs/submit_config.txt", mode="nano")
except ValidationError as exc:
    print(f"File not found: {exc}")
    raise SystemExit(1)
if errors:
    print("Submit config has errors:", errors)
The functions never raise on validation errors – they always return (errors, warnings) lists. ValidationError is only raised when the input file itself cannot be found.
Analysis YAML Config Format
Below is a fully annotated example covering all recognised top-level sections:
# ------------------------------------------------------------------
# regions – list of selection regions applied by RegionManager
# ------------------------------------------------------------------
regions:
  - name: preselection          # required; unique within the file
    filter_column: isPresel     # required; boolean branch in the RDataFrame
  - name: signal_region
    filter_column: isSignalRegion
    parent: preselection        # optional; must name another region in this list
    description: "Main SR"      # optional human-readable label
  - name: control_region_ttbar
    filter_column: isCRttbar
    parent: preselection
# ------------------------------------------------------------------
# nuisance_groups – systematic uncertainty groups for downstream tools
# ------------------------------------------------------------------
nuisance_groups:
  - name: JES                   # required; unique within the file
    group_type: SHAPE           # e.g. SHAPE or NORM
    systematics:
      - JES_up
      - JES_down
    processes: [ttbar, WJets]   # optional process filter
    regions: [signal_region]    # optional region filter
    output_usage:               # what outputs consume this group
      - HISTOGRAM
      - DATACARD
# ------------------------------------------------------------------
# histogram_config – path to a whitespace-separated histogram spec
# (resolved relative to the directory containing this YAML file)
# ------------------------------------------------------------------
histogram_config: histograms.txt
# ------------------------------------------------------------------
# friend_trees – sidecar ROOT trees merged at runtime
# ------------------------------------------------------------------
friend_trees:
  - alias: scores               # required; unique identifier
    tree_name: Events           # optional; defaults to the main tree name
    directory: /path/to/scores/ # optional; scanned for matching files
    globs:                      # optional; shell globs applied inside directory
      - "*.root"
    index_branches:             # optional; branches used to align the friend tree
      - run
      - luminosityBlock
      - event
# ------------------------------------------------------------------
# plugins – map of plugin name → config (all keys optional)
# ------------------------------------------------------------------
plugins:
  RegionManager:                # activates RegionManager with default settings
    enabled: true
  WeightManager:
    nominal_weight: "genWeight * puWeight"
    scale_factors: []
    normalizations: []
    weight_variations: []
  NDHistogramManager:
    histogram_config: histograms.txt
    output_file: histograms.root
    region_aware: true
  CutflowManager:
    enabled: true
  CorrectionManager:
    corrections:
      - { name: pileup, file: corrections/pileup.json, set: Collisions18_UltraLegacy }
  BDTManager:
    models:
      - { path: models/bdt_signal.pkl, output_branch: bdt_score }
  OnnxManager:
    models:
      - { path: models/nn_classifier.onnx, output_branch: nn_score }
  TriggerManager:
    triggers:
      - HLT_IsoMu27
      - HLT_Ele32_WPTight_Gsf
  GoldenJsonManager:
    json_files:
      - data/Cert_Run2018_13TeV_UL.json
  KinematicFitManager:
    fits:
      - { name: top_mass_fit, type: ttbar }
  NamedObjectManager:
    objects:
      - { name: jet_selector, type: JetSelector }