Production Manager Guide
The Production Manager provides a unified system for managing batch analysis productions in RDFAnalyzerCore. It handles job generation, submission, monitoring, validation, and failure recovery, persisting state to disk so that operations survive interruptions.
Table of Contents
- Overview
- Features
- Quick Start
- Architecture
- Usage Guide
- Monitoring
- Resilience and Recovery
- Backend Support
- Examples
- Troubleshooting
Overview
The Production Manager is designed to address common challenges in large-scale batch analyses:
- Unified Interface: Single system for the entire production lifecycle
- State Persistence: Can stop and restart without losing progress
- Progress Monitoring: Real-time status updates and progress tracking
- Output Validation: Automatic verification of job outputs
- Failure Recovery: Automatic resubmission of failed jobs
- Multiple Backends: Support for HTCondor and DASK
- Storage Support: Works in AFS and EOS areas
Features
Job Management
- Automatic job generation from data discovery (Rucio/CERN Open Data)
- Configuration validation before submission
- Per-job configuration management
- Shared executable and auxiliary file staging
- Automatic shared library (.so) discovery and staging
- C++ executable wrapper for DASK compatibility
Submission Backends
- HTCondor: Traditional batch submission with condor_submit
- Automatic .so file transfer to worker nodes
- LD_LIBRARY_PATH setup on execution nodes
- DASK: Python-based distributed computing with dask-jobqueue
- Python wrapper for C++ executables
- Shared library staging for remote execution
Monitoring and Validation
- Real-time progress monitoring (text or curses interface)
- Automatic job status updates
- Output file validation (existence, size, ROOT file integrity)
- Progress statistics and reporting
Resilience
- State persistence to JSON (can resume after disconnection)
- Automatic retry of failed jobs with configurable limits
- Graceful handling of connection failures
- Works in network file systems (AFS/EOS)
Quick Start
1. Create a Production
python core/python/production_submit.py \
--name my_analysis \
--config cfg/analysis_config.txt \
--sample-config cfg/samples.txt \
--exe build/analyses/MyAnalysis/myanalysis \
--submit
This will:
- Discover input files from Rucio
- Generate job configurations
- Submit jobs to HTCondor
- Save state to condorSub_my_analysis/production_state.json
2. Monitor Progress
# Interactive monitoring with curses interface
python core/python/production_monitor.py monitor --name my_analysis
# Simple text-based monitoring
python core/python/production_monitor.py monitor --name my_analysis --simple
# One-time status check
python core/python/production_monitor.py status --name my_analysis
3. Validate Outputs
python core/python/production_monitor.py validate --name my_analysis
4. Resubmit Failed Jobs
python core/python/production_monitor.py resubmit --name my_analysis
Architecture
Components
production_manager.py # Core ProductionManager class
├── ProductionConfig # Configuration dataclass
├── Job # Job tracking dataclass
├── JobStatus # Job status enumeration
└── ProductionManager # Main manager class
production_submit.py # Production creation and submission
production_monitor.py # Monitoring and management CLI
test_production_manager.py # Test suite
State Management
Production state is persisted to production_state.json in the work directory:
{
"production_name": "my_analysis",
"timestamp": 1234567890.0,
"jobs": {
"0": {
"job_id": 0,
"config_path": "/path/to/config.txt",
"output_path": "/path/to/output.root",
"status": "completed",
"condor_job_id": "12345.0",
"submit_time": 1234567890.0,
"attempts": 1
}
}
}
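Because the state file is plain JSON, it can also be inspected outside the manager. A minimal sketch (schema as shown above; the helper name is illustrative):

```python
import json
from collections import Counter
from pathlib import Path

def status_summary(state_file: str) -> Counter:
    """Tally job statuses from a production_state.json file."""
    state = json.loads(Path(state_file).read_text())
    return Counter(job["status"] for job in state["jobs"].values())
```

For example, calling it on a production's state file returns a mapping such as `Counter({'completed': 40, 'running': 8})`.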
Job Lifecycle
CREATED → SUBMITTED → RUNNING → COMPLETED → VALIDATED
              ↓
    FAILED / MISSING_OUTPUT
              ↓
         (resubmit)
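The diagram reads as a small state machine. A sketch of the transitions (status names follow the diagram; the actual JobStatus enumeration lives in production_manager.py and may differ in detail):

```python
from enum import Enum

class JobStatus(Enum):
    CREATED = "created"
    SUBMITTED = "submitted"
    RUNNING = "running"
    COMPLETED = "completed"
    VALIDATED = "validated"
    FAILED = "failed"
    MISSING_OUTPUT = "missing_output"

# Allowed forward transitions, mirroring the diagram above
TRANSITIONS = {
    JobStatus.CREATED: {JobStatus.SUBMITTED},
    JobStatus.SUBMITTED: {JobStatus.RUNNING, JobStatus.FAILED},
    JobStatus.RUNNING: {JobStatus.COMPLETED, JobStatus.FAILED},
    JobStatus.COMPLETED: {JobStatus.VALIDATED, JobStatus.MISSING_OUTPUT},
    # Resubmission moves a failed job back to SUBMITTED
    JobStatus.FAILED: {JobStatus.SUBMITTED},
    JobStatus.MISSING_OUTPUT: {JobStatus.SUBMITTED},
}

def can_transition(a: JobStatus, b: JobStatus) -> bool:
    """True if status a may legally advance to status b."""
    return b in TRANSITIONS.get(a, set())
```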
Usage Guide
Creating a Production
Option 1: Using production_submit.py (Recommended)
python core/python/production_submit.py \
--name my_production \
--config cfg/base_config.txt \
--sample-config cfg/samples.txt \
--exe /path/to/analyzer \
--output-dir /eos/user/u/username/outputs \
--size 30 \
--backend htcondor \
--stage-inputs \
--stage-outputs \
--submit
Options:
- --name: Production name (required)
- --config: Base analysis configuration (required)
- --sample-config: Sample list for data discovery (required)
- --exe: Path to analysis executable (required)
- --output-dir: Output directory (default: from config)
- --size: GB per job (default: 30)
- --backend: htcondor or dask (default: htcondor)
- --stage-inputs: Copy input files to worker nodes
- --stage-outputs: Copy outputs back from worker nodes
- --submit: Submit jobs immediately
- --dry-run: Generate but don’t submit
Option 2: Using ProductionManager API
from pathlib import Path
from production_manager import ProductionManager, ProductionConfig
# Create config
config = ProductionConfig(
name="my_production",
work_dir=Path("condorSub_my_production"),
exe_path=Path("/path/to/analyzer"),
base_config="cfg/config.txt",
output_dir=Path("/eos/user/u/username/outputs"),
backend="htcondor",
stage_inputs=True,
stage_outputs=True,
)
# Create manager
manager = ProductionManager(config)
# Generate jobs
file_lists = [
"root://xrootd/file1.root,root://xrootd/file2.root",
"root://xrootd/file3.root,root://xrootd/file4.root",
]
manager.generate_jobs(file_lists)
# Submit
manager.submit_jobs()
# Monitor
manager.update_status()
manager.print_progress()
Monitoring Productions
Interactive Monitoring
# Curses interface (default)
python core/python/production_monitor.py monitor --name my_production
# Simple text interface
python core/python/production_monitor.py monitor --name my_production --simple
# Custom refresh interval (seconds)
python core/python/production_monitor.py monitor --name my_production --refresh 60
The curses interface shows:
- Total jobs and status breakdown
- Progress bars for completion, validation, and failures
- Recent job activity with runtimes
- Real-time updates
Press q to quit the monitor.
One-Time Status Check
python core/python/production_monitor.py status --name my_production
List All Productions
# List productions in current directory
python core/python/production_monitor.py list
# List productions in specific directory
python core/python/production_monitor.py list --dir /path/to/submissions
Validating Outputs
python core/python/production_monitor.py validate --name my_production
This will:
- Check if output files exist
- Verify files are non-empty
- Attempt to open with ROOT to verify integrity
- Update job status to VALIDATED or MISSING_OUTPUT
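These checks can be approximated in a few lines. A sketch assuming PyROOT for the integrity step, falling back to the size check when ROOT is not importable (the function name is illustrative, not the manager's API):

```python
from pathlib import Path

def validate_output(path: str) -> str:
    """Mirror the checks above: existence, non-empty, then ROOT integrity."""
    p = Path(path)
    if not p.exists() or p.stat().st_size == 0:
        return "missing_output"
    try:
        import ROOT  # PyROOT; skip the integrity check if unavailable
        f = ROOT.TFile.Open(str(p))
        ok = bool(f) and not f.IsZombie()
        if f:
            f.Close()
        if not ok:
            return "missing_output"
    except ImportError:
        pass  # size-only validation without ROOT
    return "validated"
```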
Resubmitting Failed Jobs
# Resubmit with default retry limit (3)
python core/python/production_monitor.py resubmit --name my_production
# Resubmit with custom retry limit
python core/python/production_monitor.py resubmit --name my_production --max-attempts 5
Only jobs that haven’t exceeded the maximum attempts will be resubmitted.
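The selection rule amounts to a filter over the job records. A sketch using the field names from the state-file schema above (the real logic lives in production_monitor.py):

```python
def jobs_to_resubmit(jobs, max_attempts=3):
    """Failed or output-less jobs that still have attempts left."""
    return [j for j in jobs
            if j["status"] in ("failed", "missing_output")
            and j["attempts"] < max_attempts]
```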
Resilience and Recovery
State Persistence
The Production Manager automatically saves state after every operation:
- Job generation
- Status updates
- Submission
- Validation
If your session is interrupted:
# Resume by creating a new manager with the same work directory
python core/python/production_monitor.py status --work-dir condorSub_my_production
The manager will automatically load the previous state and continue from where it left off.
Handling Connection Failures
The Production Manager is designed to handle connection failures gracefully:
- AFS/EOS Token Expiration: State is saved locally, so you can renew tokens and resume
- Network Interruptions: Monitor can be restarted anytime
- Batch System Issues: Job status is queried fresh each time
Working in AFS/EOS
# Production in EOS
python core/python/production_submit.py \
--name my_prod \
--work-dir /eos/user/u/username/productions/my_prod \
--output-dir /eos/user/u/username/outputs \
--config cfg/config.txt \
--sample-config cfg/samples.txt \
--exe ./analyzer \
--stage-outputs
Note: When working in EOS, use --stage-outputs to avoid issues with worker node access.
Use --eos-sched flag when submitting to EOS scheduler:
python core/python/production_submit.py \
--name my_prod \
--work-dir /eos/user/u/username/productions/my_prod \
--config cfg/config.txt \
--sample-config cfg/samples.txt \
--exe ./analyzer \
--eos-sched \
--stage-outputs
Shared Library Management
The Production Manager automatically discovers and stages shared libraries (.so files) required by your C++ executable.
Automatic Discovery
When submitting jobs, the Production Manager:
- Uses ldd to find all shared library dependencies
- Filters out system libraries (already present on worker nodes)
- Copies custom libraries to work_dir/lib/
- Transfers them to worker nodes
- Sets LD_LIBRARY_PATH on the execution nodes
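In outline, the discovery step parses ldd output and copies anything outside the system prefixes. A sketch (prefix list and helper names are illustrative, not the manager's internals):

```python
import shutil
import subprocess
from pathlib import Path

# Libraries under these prefixes are assumed present on worker nodes already
SYSTEM_PREFIXES = ("/lib", "/lib64", "/usr/lib", "/usr/lib64")

def parse_ldd_line(line: str):
    """Extract the resolved library path from one line of ldd output, or None."""
    # ldd lines look like: "libFoo.so => /opt/mylibs/libFoo.so (0x00007f...)"
    if "=>" not in line:
        return None
    fields = line.split("=>", 1)[1].split()
    if not fields or not fields[0].startswith("/"):
        return None
    return fields[0]

def stage_shared_libs(exe: str, lib_dir: str) -> list:
    """Copy non-system .so dependencies of exe into lib_dir."""
    Path(lib_dir).mkdir(parents=True, exist_ok=True)
    out = subprocess.run(["ldd", exe], capture_output=True, text=True).stdout
    staged = []
    for line in out.splitlines():
        target = parse_ldd_line(line)
        if target and not target.startswith(SYSTEM_PREFIXES):
            shutil.copy2(target, lib_dir)
            staged.append(target)
    return staged
```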
Manual Library Staging
If automatic discovery doesn’t work, you can manually stage libraries:
# Create lib directory in work dir
mkdir -p condorSub_my_prod/lib
# Copy your custom libraries
cp /path/to/libMyLibrary.so condorSub_my_prod/lib/
cp /path/to/libAnotherLib.so condorSub_my_prod/lib/
# Submit - libraries will be transferred automatically
python production_submit.py --name my_prod ...
Backend Support
HTCondor Backend
Default backend using traditional HTCondor submission:
python core/python/production_submit.py \
--backend htcondor \
...
Features:
- Uses existing condor submission infrastructure
- Supports input/output staging with xrdcp
- Integrates with existing resubmit_jobs.py
- Well-tested and stable
- Automatic .so file transfer and LD_LIBRARY_PATH setup
DASK Backend
Python-based backend using DASK distributed:
# Install DASK first
pip install dask distributed dask-jobqueue
python core/python/production_submit.py \
--backend dask \
...
Features:
- Pure Python submission
- Dynamic scaling
- Better for interactive analysis
- Integration with Python workflows
- Python wrapper for C++ executables (cpp_wrapper.py)
- Automatic shared library discovery and transfer
- Compatible with HTCondor via dask-jobqueue
The DASK backend wraps C++ executables in Python to enable:
- Proper environment setup on worker nodes
- Shared library path configuration
- Error handling and logging
- Integration with Python-based workflows
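In outline, such a wrapper is a subprocess call with the staged lib/ directory prepended to LD_LIBRARY_PATH. A sketch of the idea (not the actual cpp_wrapper.py code; names are illustrative):

```python
import os
import subprocess
from pathlib import Path

def run_cpp_job(exe: str, config: str, lib_dir: str = "lib") -> int:
    """Run a C++ analyzer with staged shared libraries on LD_LIBRARY_PATH."""
    env = os.environ.copy()
    staged = str(Path(lib_dir).resolve())
    env["LD_LIBRARY_PATH"] = staged + os.pathsep + env.get("LD_LIBRARY_PATH", "")
    result = subprocess.run([exe, config], env=env,
                            capture_output=True, text=True)
    if result.returncode != 0:
        # Surface stderr in the worker log for debugging
        print(result.stderr)
    return result.returncode
```

Because the wrapper is plain Python, DASK can ship it to workers like any other task.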
Note: DASK backend requires dask-jobqueue package.
Examples
Example 1: Simple Production
# Generate and submit
python core/python/production_submit.py \
--name ttbar_analysis \
--config cfg/ttbar.txt \
--sample-config cfg/ttbar_samples.txt \
--exe build/analyses/TTbar/ttbar \
--submit
# Monitor
python core/python/production_monitor.py monitor --name ttbar_analysis
Example 2: Production with Staging
For analyses with large input files or unreliable xrootd access:
python core/python/production_submit.py \
--name large_production \
--config cfg/config.txt \
--sample-config cfg/samples.txt \
--exe ./analyzer \
--stage-inputs \
--stage-outputs \
--submit
Example 3: Resume After Interruption
# Original submission was interrupted
# Resume by checking status
python core/python/production_monitor.py status --name my_prod
# Submit any jobs that weren't submitted (using production_submit.py --submit)
# Or manually trigger submission:
cd condorSub_my_prod && condor_submit condor_submit.sub
# Continue monitoring
python core/python/production_monitor.py monitor --name my_prod
Example 4: Manage Multiple Productions
# List all productions
python core/python/production_monitor.py list
# Check status of each
for prod in prod1 prod2 prod3; do
echo "=== $prod ==="
python core/python/production_monitor.py status --name $prod
done
# Resubmit failures in all
for prod in prod1 prod2 prod3; do
python core/python/production_monitor.py resubmit --name $prod
done
Troubleshooting
Jobs Not Submitting
- Check that HTCondor is available: condor_q
- Verify the executable exists: ls -l /path/to/analyzer
- Check work directory permissions: ls -ld condorSub_*
Jobs Failing Immediately
- Check job logs: cat condorSub_my_prod/condor_logs/log_*.stderr
- Test the job locally (after generation):
  # Test by running the job's config directly
  cd condorSub_my_prod/job_0
  /path/to/analyzer job_config.txt
- Check the configuration: cat condorSub_my_prod/job_0/job_config.txt
Output Validation Failing
- Check that the output directory exists and is writable: ls -ld /path/to/outputs
- Check for disk space: df -h /path/to/outputs
- Verify ROOT files manually: root -l /path/to/outputs/output_0.root
State File Corruption
If production_state.json becomes corrupted:
# Backup current state
cp condorSub_my_prod/production_state.json condorSub_my_prod/production_state.json.bak
# Try to recover or regenerate
# (This will lose progress information but preserve job definitions)
python core/python/production_submit.py \
--name my_prod \
--work-dir condorSub_my_prod \
...
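If the state file cannot be parsed, the per-job configuration files are still on disk (the job_*/job_config.txt layout shown under Troubleshooting). A sketch for listing them before regenerating (helper name is illustrative):

```python
from pathlib import Path

def salvage_job_configs(work_dir: str) -> list:
    """List surviving per-job config files when production_state.json is unreadable."""
    return sorted(str(p) for p in Path(work_dir).glob("job_*/job_config.txt"))
```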
Permission Denied in AFS/EOS
# Check AFS token
klist
aklog
# Check EOS authentication
eos whoami
# Verify directory permissions
fs la condorSub_my_prod # For AFS
eos ls -l /eos/user/... # For EOS
Best Practices
- Use Descriptive Names: Choose meaningful production names
- Stage Outputs for EOS: Always use --stage-outputs when output is in EOS
- Monitor Regularly: Check progress periodically, especially for long productions
- Validate Before Merging: Always validate outputs before merging results
- Clean Up: Remove work directories after successful completion
- Test Locally First: Run test jobs locally before large-scale submission
Integration with Existing Tools
The Production Manager integrates with existing RDFAnalyzerCore tools:
- generateSubmissionFilesNANO.py: Used for data discovery
- submission_backend.py: Shared HTCondor submission logic
- resubmit_jobs.py: Can still be used for manual resubmission
- validate_config.py: Automatic configuration validation
Law Workflow Integration
The Production Manager integrates with the Law (Luigi Analysis Workflow) task framework for managing complex analysis workflows on batch systems. Law provides a powerful task-based workflow system with automatic dependency resolution, retry logic, and progress tracking.
Available Law Tasks
NANO Analysis Tasks
Law tasks for CMS NanoAOD analysis workflow:
SubmitNANOJobs: Submit analysis jobs to HTCondor or DASK
- Automatically discovers input datasets
- Generates per-file job configurations
- Stages shared libraries and executables
- Tracks job submission status
MonitorNANOJobs: Monitor running jobs
- Queries the batch system for job status
- Reports progress and completion rate
- Identifies failed jobs for resubmission
ValidateNANOOutputs: Validate ROOT output files
- Checks file integrity (ROOT file structure)
- Verifies that expected trees and branches exist
- Reports validation failures
Plotting Tasks
Law tasks for creating physics plots:
MakePlot: Generate a single stack plot
- Uses PlottingUtility Python bindings
- Supports data/MC ratio panels
- Configurable via PlotRequest objects
MakePlots: Batch plotting from configuration
- Generates multiple plots in parallel
- Reads plot specifications from config files
- Automatic output organization
Combine Datacard Tasks
Law tasks for CMS combine statistical analysis:
CreateDatacard: Generate CMS combine datacards
- Extracts histograms from analysis outputs
- Formats systematic uncertainties
- Writes the datacard in combine format
RunCombine: Execute statistical fits
- Runs the combine tool with the specified method
- Supports AsymptoticLimits, FitDiagnostics, etc.
- Collects fit results
Example Law Workflows
Basic NANO Analysis
# Submit analysis jobs
law run SubmitNANOJobs \
--config cfg/analysis.txt \
--dataset /DoubleMuon/Run2022*/NANOAOD \
--backend htcondor \
--work-dir condorSub_myanalysis
# Monitor progress
law run MonitorNANOJobs \
--work-dir condorSub_myanalysis
# Validate outputs
law run ValidateNANOOutputs \
--work-dir condorSub_myanalysis
Statistical Analysis Workflow
# Create datacards from analysis outputs
law run CreateDatacard \
--datacard-config cfg/datacard.yaml \
--name myRun \
--input-dir outputs/
# Run combine fit
law run RunCombine \
--name myRun \
--method AsymptoticLimits \
--datacard datacards/myRun.txt
Plotting Workflow
# Single plot
law run MakePlot \
--meta-file outputs/meta.root \
--output-file plots/pt.pdf \
--histogram-name jet_pt
# Batch plotting from config
law run MakePlots \
--plot-config cfg/plots.yaml \
--output-dir plots/
Law Task Dependencies
Law automatically manages task dependencies. For example, ValidateNANOOutputs depends on SubmitNANOJobs, so running:
law run ValidateNANOOutputs --config cfg/analysis.txt
Will automatically:
- Submit jobs (if not already done)
- Wait for jobs to complete
- Validate outputs
Task Configuration
Law tasks use the same configuration files as the Production Manager:
# analysis.txt - Same format as before
fileList=/path/to/inputs/*.root
saveFile=output.root
threads=-1
Additional Law-specific configuration can be provided via command-line parameters or Law config files (law.cfg).
Integration Benefits
Using Law with the Production Manager provides:
- Dependency Management: Automatic task ordering and execution
- Retry Logic: Failed tasks are automatically retried
- Caching: Completed tasks are not re-executed unnecessarily
- Parallel Execution: Independent tasks run in parallel
- Progress Tracking: Built-in status reporting and logging
- Workflow Visualization: Generate workflow graphs with law run --print-status
See Combine Integration for detailed datacard and statistical analysis workflows.
Future Enhancements
Planned improvements:
- Automatic output merging
- Email notifications for completion/failures
- Web-based monitoring dashboard
- Support for additional batch systems (SLURM, PBS)
- Job priority management
- Resource usage statistics
Support
For issues or questions:
- Check the troubleshooting section
- Review existing issues on GitHub
- Consult the main README and batch submission docs
Related Documentation: