Production Manager Guide
The Production Manager provides a unified system for managing batch analysis productions in RDFAnalyzerCore. It handles job generation, submission, monitoring, validation, and failure recovery, persisting state to disk so that operations survive interruptions.
Table of Contents
- Overview
- Features
- Quick Start
- Architecture
- Usage Guide
- Monitoring
- Resilience and Recovery
- Backend Support
- Examples
- Troubleshooting
Overview
The Production Manager is designed to address common challenges in large-scale batch analyses:
- Unified Interface: Single system for the entire production lifecycle
- State Persistence: Can stop and restart without losing progress
- Progress Monitoring: Real-time status updates and progress tracking
- Output Validation: Automatic verification of job outputs
- Failure Recovery: Automatic resubmission of failed jobs
- Multiple Backends: Support for HTCondor and DASK
- Storage Support: Works in AFS and EOS areas
Features
Job Management
- Automatic job generation from data discovery (Rucio/CERN Open Data)
- Configuration validation before submission
- Per-job configuration management
- Shared executable and auxiliary file staging
- Automatic shared library (.so) discovery and staging
- C++ executable wrapper for DASK compatibility
Submission Backends
- HTCondor: Traditional batch submission with condor_submit
- Automatic .so file transfer to worker nodes
- LD_LIBRARY_PATH setup on execution nodes
- DASK: Python-based distributed computing with dask-jobqueue
- Python wrapper for C++ executables
- Shared library staging for remote execution
Monitoring and Validation
- Real-time progress monitoring (text or curses interface)
- Automatic job status updates
- Output file validation (existence, size, ROOT file integrity)
- Progress statistics and reporting
Resilience
- State persistence to JSON (can resume after disconnection)
- Automatic retry of failed jobs with configurable limits
- Graceful handling of connection failures
- Works in network file systems (AFS/EOS)
Quick Start
1. Create a Production
python core/python/production_submit.py \
--name my_analysis \
--config cfg/analysis_config.txt \
--sample-config cfg/samples.txt \
--exe build/analyses/MyAnalysis/myanalysis \
--submit
This will:
- Discover input files from Rucio
- Generate job configurations
- Submit jobs to HTCondor
- Save state to condorSub_my_analysis/production_state.json
2. Monitor Progress
# Interactive monitoring with curses interface
python core/python/production_monitor.py monitor --name my_analysis
# Simple text-based monitoring
python core/python/production_monitor.py monitor --name my_analysis --simple
# One-time status check
python core/python/production_monitor.py status --name my_analysis
3. Validate Outputs
python core/python/production_monitor.py validate --name my_analysis
4. Resubmit Failed Jobs
python core/python/production_monitor.py resubmit --name my_analysis
Architecture
Components
production_manager.py # Core ProductionManager class
├── ProductionConfig # Configuration dataclass
├── Job # Job tracking dataclass
├── JobStatus # Job status enumeration
└── ProductionManager # Main manager class
production_submit.py # Production creation and submission
production_monitor.py # Monitoring and management CLI
test_production_manager.py # Test suite
State Management
Production state is persisted to production_state.json in the work directory:
{
"production_name": "my_analysis",
"timestamp": 1234567890.0,
"jobs": {
"0": {
"job_id": 0,
"config_path": "/path/to/config.txt",
"output_path": "/path/to/output.root",
"status": "completed",
"condor_job_id": "12345.0",
"submit_time": 1234567890.0,
"attempts": 1
}
}
}
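Because the state file is plain JSON, it can also be inspected outside the manager. A minimal sketch (schema as shown above; the helper name is illustrative):

```python
import json
from collections import Counter
from pathlib import Path

def status_summary(state_file: str) -> Counter:
    """Tally job statuses from a production_state.json file."""
    state = json.loads(Path(state_file).read_text())
    return Counter(job["status"] for job in state["jobs"].values())
```

For example, calling it on a production's state file returns a mapping such as `Counter({'completed': 40, 'running': 8})`.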
Job Lifecycle
CREATED → SUBMITTED → RUNNING → COMPLETED → VALIDATED
              ↓
    FAILED / MISSING_OUTPUT
              ↓
         (resubmit)
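The diagram reads as a small state machine. A sketch of the transitions (status names follow the diagram; the actual JobStatus enumeration lives in production_manager.py and may differ in detail):

```python
from enum import Enum

class JobStatus(Enum):
    CREATED = "created"
    SUBMITTED = "submitted"
    RUNNING = "running"
    COMPLETED = "completed"
    VALIDATED = "validated"
    FAILED = "failed"
    MISSING_OUTPUT = "missing_output"

# Allowed forward transitions, mirroring the diagram above
TRANSITIONS = {
    JobStatus.CREATED: {JobStatus.SUBMITTED},
    JobStatus.SUBMITTED: {JobStatus.RUNNING, JobStatus.FAILED},
    JobStatus.RUNNING: {JobStatus.COMPLETED, JobStatus.FAILED},
    JobStatus.COMPLETED: {JobStatus.VALIDATED, JobStatus.MISSING_OUTPUT},
    # Resubmission moves a failed job back to SUBMITTED
    JobStatus.FAILED: {JobStatus.SUBMITTED},
    JobStatus.MISSING_OUTPUT: {JobStatus.SUBMITTED},
}

def can_transition(a: JobStatus, b: JobStatus) -> bool:
    """True if status a may legally advance to status b."""
    return b in TRANSITIONS.get(a, set())
```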
Usage Guide
Creating a Production
Option 1: Using production_submit.py (Recommended)
python core/python/production_submit.py \
--name my_production \
--config cfg/base_config.txt \
--sample-config cfg/samples.txt \
--exe /path/to/analyzer \
--output-dir /eos/user/u/username/outputs \
--size 30 \
--backend htcondor \
--stage-inputs \
--stage-outputs \
--submit
Options:
- --name: Production name (required)
- --config: Base analysis configuration (required)
- --sample-config: Sample list for data discovery (required)
- --exe: Path to analysis executable (required)
- --output-dir: Output directory (default: from config)
- --size: GB per job (default: 30)
- --backend: htcondor or dask (default: htcondor)
- --stage-inputs: Copy input files to worker nodes
- --stage-outputs: Copy outputs back from worker nodes
- --submit: Submit jobs immediately
- --dry-run: Generate but don’t submit
Option 2: Using ProductionManager API
from pathlib import Path
from production_manager import ProductionManager, ProductionConfig
# Create config
config = ProductionConfig(
name="my_production",
work_dir=Path("condorSub_my_production"),
exe_path=Path("/path/to/analyzer"),
base_config="cfg/config.txt",
output_dir=Path("/eos/user/u/username/outputs"),
backend="htcondor",
stage_inputs=True,
stage_outputs=True,
)
# Create manager
manager = ProductionManager(config)
# Generate jobs
file_lists = [
"root://xrootd/file1.root,root://xrootd/file2.root",
"root://xrootd/file3.root,root://xrootd/file4.root",
]
manager.generate_jobs(file_lists)
# Submit
manager.submit_jobs()
# Monitor
manager.update_status()
manager.print_progress()
Monitoring Productions
Interactive Monitoring
# Curses interface (default)
python core/python/production_monitor.py monitor --name my_production
# Simple text interface
python core/python/production_monitor.py monitor --name my_production --simple
# Custom refresh interval (seconds)
python core/python/production_monitor.py monitor --name my_production --refresh 60
The curses interface shows:
- Total jobs and status breakdown
- Progress bars for completion, validation, and failures
- Recent job activity with runtimes
- Real-time updates
Press q to quit the monitor.
One-Time Status Check
python core/python/production_monitor.py status --name my_production
List All Productions
# List productions in current directory
python core/python/production_monitor.py list
# List productions in specific directory
python core/python/production_monitor.py list --dir /path/to/submissions
Validating Outputs
python core/python/production_monitor.py validate --name my_production
This will:
- Check if output files exist
- Verify files are non-empty
- Attempt to open with ROOT to verify integrity
- Update job status to VALIDATED or MISSING_OUTPUT
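These checks can be approximated in a few lines. A sketch assuming PyROOT for the integrity step, falling back to the size check when ROOT is not importable (the function name is illustrative, not the manager's API):

```python
from pathlib import Path

def validate_output(path: str) -> str:
    """Mirror the checks above: existence, non-empty, then ROOT integrity."""
    p = Path(path)
    if not p.exists() or p.stat().st_size == 0:
        return "missing_output"
    try:
        import ROOT  # PyROOT; skip the integrity check if unavailable
        f = ROOT.TFile.Open(str(p))
        ok = bool(f) and not f.IsZombie()
        if f:
            f.Close()
        if not ok:
            return "missing_output"
    except ImportError:
        pass  # size-only validation without ROOT
    return "validated"
```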
Resubmitting Failed Jobs
# Resubmit with default retry limit (3)
python core/python/production_monitor.py resubmit --name my_production
# Resubmit with custom retry limit
python core/python/production_monitor.py resubmit --name my_production --max-attempts 5
Only jobs that haven’t exceeded the maximum attempts will be resubmitted.
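The selection rule amounts to a filter over the job records. A sketch using the field names from the state-file schema above (the real logic lives in production_monitor.py):

```python
def jobs_to_resubmit(jobs, max_attempts=3):
    """Failed or output-less jobs that still have attempts left."""
    return [j for j in jobs
            if j["status"] in ("failed", "missing_output")
            and j["attempts"] < max_attempts]
```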
Resilience and Recovery
State Persistence
The Production Manager automatically saves state after every operation:
- Job generation
- Status updates
- Submission
- Validation
If your session is interrupted:
# Resume by creating a new manager with the same work directory
python core/python/production_monitor.py status --work-dir condorSub_my_production
The manager will automatically load the previous state and continue from where it left off.
Handling Connection Failures
The Production Manager is designed to handle connection failures gracefully:
- AFS/EOS Token Expiration: State is saved locally, so you can renew tokens and resume
- Network Interruptions: Monitor can be restarted anytime
- Batch System Issues: Job status is queried fresh each time
Working in AFS/EOS
# Production in EOS
python core/python/production_submit.py \
--name my_prod \
--work-dir /eos/user/u/username/productions/my_prod \
--output-dir /eos/user/u/username/outputs \
--config cfg/config.txt \
--sample-config cfg/samples.txt \
--exe ./analyzer \
--stage-outputs
Note: When working in EOS, use --stage-outputs to avoid issues with worker node access.
Use --eos-sched flag when submitting to EOS scheduler:
python core/python/production_submit.py \
--name my_prod \
--work-dir /eos/user/u/username/productions/my_prod \
--config cfg/config.txt \
--sample-config cfg/samples.txt \
--exe ./analyzer \
--eos-sched \
--stage-outputs
Shared Library Management
The Production Manager automatically discovers and stages shared libraries (.so files) required by your C++ executable.
Automatic Discovery
When submitting jobs, the Production Manager:
- Uses ldd to find all shared library dependencies
- Filters out system libraries (already present on worker nodes)
- Copies custom libraries to work_dir/lib/
- Transfers them to worker nodes
- Sets LD_LIBRARY_PATH on the execution nodes
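In outline, the discovery step parses ldd output and copies anything outside the system prefixes. A sketch (prefix list and helper names are illustrative, not the manager's internals):

```python
import shutil
import subprocess
from pathlib import Path

# Libraries under these prefixes are assumed present on worker nodes already
SYSTEM_PREFIXES = ("/lib", "/lib64", "/usr/lib", "/usr/lib64")

def parse_ldd_line(line: str):
    """Extract the resolved library path from one line of ldd output, or None."""
    # ldd lines look like: "libFoo.so => /opt/mylibs/libFoo.so (0x00007f...)"
    if "=>" not in line:
        return None
    fields = line.split("=>", 1)[1].split()
    if not fields or not fields[0].startswith("/"):
        return None
    return fields[0]

def stage_shared_libs(exe: str, lib_dir: str) -> list:
    """Copy non-system .so dependencies of exe into lib_dir."""
    Path(lib_dir).mkdir(parents=True, exist_ok=True)
    out = subprocess.run(["ldd", exe], capture_output=True, text=True).stdout
    staged = []
    for line in out.splitlines():
        target = parse_ldd_line(line)
        if target and not target.startswith(SYSTEM_PREFIXES):
            shutil.copy2(target, lib_dir)
            staged.append(target)
    return staged
```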
Manual Library Staging
If automatic discovery doesn’t work, you can manually stage libraries:
# Create lib directory in work dir
mkdir -p condorSub_my_prod/lib
# Copy your custom libraries
cp /path/to/libMyLibrary.so condorSub_my_prod/lib/
cp /path/to/libAnotherLib.so condorSub_my_prod/lib/
# Submit - libraries will be transferred automatically
python production_submit.py --name my_prod ...
Backend Support
HTCondor Backend
Default backend using traditional HTCondor submission:
python core/python/production_submit.py \
--backend htcondor \
...
Features:
- Uses existing condor submission infrastructure
- Supports input/output staging with xrdcp
- Integrates with existing resubmit_jobs.py
- Well-tested and stable
- Automatic .so file transfer and LD_LIBRARY_PATH setup
DASK Backend
Python-based backend using DASK distributed:
# Install DASK first
pip install dask distributed dask-jobqueue
python core/python/production_submit.py \
--backend dask \
...
Features:
- Pure Python submission
- Dynamic scaling
- Better for interactive analysis
- Integration with Python workflows
- Python wrapper for C++ executables (cpp_wrapper.py)
- Automatic shared library discovery and transfer
- Compatible with HTCondor via dask-jobqueue
The DASK backend wraps C++ executables in Python to enable:
- Proper environment setup on worker nodes
- Shared library path configuration
- Error handling and logging
- Integration with Python-based workflows
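In outline, such a wrapper is a subprocess call with the staged lib/ directory prepended to LD_LIBRARY_PATH. A sketch of the idea (not the actual cpp_wrapper.py code; names are illustrative):

```python
import os
import subprocess
from pathlib import Path

def run_cpp_job(exe: str, config: str, lib_dir: str = "lib") -> int:
    """Run a C++ analyzer with staged shared libraries on LD_LIBRARY_PATH."""
    env = os.environ.copy()
    staged = str(Path(lib_dir).resolve())
    env["LD_LIBRARY_PATH"] = staged + os.pathsep + env.get("LD_LIBRARY_PATH", "")
    result = subprocess.run([exe, config], env=env,
                            capture_output=True, text=True)
    if result.returncode != 0:
        # Surface stderr in the worker log for debugging
        print(result.stderr)
    return result.returncode
```

Because the wrapper is plain Python, DASK can ship it to workers like any other task.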
Note: DASK backend requires dask-jobqueue package.
Examples
Example 1: Simple Production
# Generate and submit
python core/python/production_submit.py \
--name ttbar_analysis \
--config cfg/ttbar.txt \
--sample-config cfg/ttbar_samples.txt \
--exe build/analyses/TTbar/ttbar \
--submit
# Monitor
python core/python/production_monitor.py monitor --name ttbar_analysis
Example 2: Production with Staging
For analyses with large input files or unreliable xrootd access:
python core/python/production_submit.py \
--name large_production \
--config cfg/config.txt \
--sample-config cfg/samples.txt \
--exe ./analyzer \
--stage-inputs \
--stage-outputs \
--submit
Example 3: Resume After Interruption
# Original submission was interrupted
# Resume by checking status
python core/python/production_monitor.py status --name my_prod
# Submit any jobs that weren't submitted (using production_submit.py --submit)
# Or manually trigger submission:
cd condorSub_my_prod && condor_submit condor_submit.sub
# Continue monitoring
python core/python/production_monitor.py monitor --name my_prod
Example 4: Manage Multiple Productions
# List all productions
python core/python/production_monitor.py list
# Check status of each
for prod in prod1 prod2 prod3; do
echo "=== $prod ==="
python core/python/production_monitor.py status --name $prod
done
# Resubmit failures in all
for prod in prod1 prod2 prod3; do
python core/python/production_monitor.py resubmit --name $prod
done
Troubleshooting
Jobs Not Submitting
- Check that HTCondor is available: condor_q
- Verify the executable exists: ls -l /path/to/analyzer
- Check work directory permissions: ls -ld condorSub_*
Jobs Failing Immediately
- Check job logs: cat condorSub_my_prod/condor_logs/log_*.stderr
- Test the job locally (after generation):
  # Test by running the job's config directly
  cd condorSub_my_prod/job_0
  /path/to/analyzer job_config.txt
- Check the configuration: cat condorSub_my_prod/job_0/job_config.txt
Output Validation Failing
- Check that the output directory exists and is writable: ls -ld /path/to/outputs
- Check for disk space: df -h /path/to/outputs
- Verify ROOT files manually: root -l /path/to/outputs/output_0.root
State File Corruption
If production_state.json becomes corrupted:
# Backup current state
cp condorSub_my_prod/production_state.json condorSub_my_prod/production_state.json.bak
# Try to recover or regenerate
# (This will lose progress information but preserve job definitions)
python core/python/production_submit.py \
--name my_prod \
--work-dir condorSub_my_prod \
...
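If the state file cannot be parsed, the per-job configuration files are still on disk (the job_*/job_config.txt layout shown under Troubleshooting). A sketch for listing them before regenerating (helper name is illustrative):

```python
from pathlib import Path

def salvage_job_configs(work_dir: str) -> list:
    """List surviving per-job config files when production_state.json is unreadable."""
    return sorted(str(p) for p in Path(work_dir).glob("job_*/job_config.txt"))
```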
Permission Denied in AFS/EOS
# Check AFS token
klist
aklog
# Check EOS authentication
eos whoami
# Verify directory permissions
fs la condorSub_my_prod # For AFS
eos ls -l /eos/user/... # For EOS
Best Practices
- Use Descriptive Names: Choose meaningful production names
- Stage Outputs for EOS: Always use --stage-outputs when output is in EOS
- Monitor Regularly: Check progress periodically, especially for long productions
- Validate Before Merging: Always validate outputs before merging results
- Clean Up: Remove work directories after successful completion
- Test Locally First: Run test jobs locally before large-scale submission
Integration with Existing Tools
The Production Manager integrates with existing RDFAnalyzerCore tools:
- generateSubmissionFilesNANO.py: Used for data discovery
- submission_backend.py: Shared HTCondor submission logic
- resubmit_jobs.py: Can still be used for manual resubmission
- validate_config.py: Automatic configuration validation
Law Workflow Integration
The Production Manager integrates with the Law (Luigi Analysis Workflow) task framework for managing complex analysis workflows on batch systems. Law provides a powerful task-based workflow system with automatic dependency resolution, retry logic, and progress tracking.
Available Law Tasks
NANO Analysis Tasks
Law tasks for CMS NanoAOD analysis workflow:
SubmitNANOJobs: Submit analysis jobs to HTCondor or DASK
- Automatically discovers input datasets
- Generates per-file job configurations
- Stages shared libraries and executables
- Tracks job submission status
MonitorNANOJobs: Monitor running jobs
- Queries the batch system for job status
- Reports progress and completion rate
- Identifies failed jobs for resubmission
ValidateNANOOutputs: Validate ROOT output files
- Checks file integrity (ROOT file structure)
- Verifies that expected trees and branches exist
- Reports validation failures
Plotting Tasks
Law tasks for creating physics plots:
MakePlot: Generate a single stack plot
- Uses PlottingUtility Python bindings
- Supports data/MC ratio panels
- Configurable via PlotRequest objects
MakePlots: Batch plotting from configuration
- Generates multiple plots in parallel
- Reads plot specifications from config files
- Automatic output organization
Combine Datacard Tasks
Law tasks for CMS combine statistical analysis:
CreateDatacard: Generate CMS combine datacards
- Extracts histograms from analysis outputs
- Formats systematic uncertainties
- Writes the datacard in combine format
RunCombine: Execute statistical fits
- Runs the combine tool with the specified method
- Supports AsymptoticLimits, FitDiagnostics, etc.
- Collects fit results
Example Law Workflows
Basic NANO Analysis
# Submit analysis jobs
law run SubmitNANOJobs \
--config cfg/analysis.txt \
--dataset /DoubleMuon/Run2022*/NANOAOD \
--backend htcondor \
--work-dir condorSub_myanalysis
# Monitor progress
law run MonitorNANOJobs \
--work-dir condorSub_myanalysis
# Validate outputs
law run ValidateNANOOutputs \
--work-dir condorSub_myanalysis
Statistical Analysis Workflow
# Create datacards from analysis outputs
law run CreateDatacard \
--datacard-config cfg/datacard.yaml \
--name myRun \
--input-dir outputs/
# Run combine fit
law run RunCombine \
--name myRun \
--method AsymptoticLimits \
--datacard datacards/myRun.txt
Plotting Workflow
# Single plot
law run MakePlot \
--meta-file outputs/meta.root \
--output-file plots/pt.pdf \
--histogram-name jet_pt
# Batch plotting from config
law run MakePlots \
--plot-config cfg/plots.yaml \
--output-dir plots/
Law Task Dependencies
Law automatically manages task dependencies. For example, ValidateNANOOutputs depends on SubmitNANOJobs, so running:
law run ValidateNANOOutputs --config cfg/analysis.txt
Will automatically:
- Submit jobs (if not already done)
- Wait for jobs to complete
- Validate outputs
Task Configuration
Law tasks use the same configuration files as the Production Manager:
# analysis.txt - Same format as before
fileList=/path/to/inputs/*.root
saveFile=output.root
threads=-1
Additional Law-specific configuration can be provided via command-line parameters or Law config files (law.cfg).
Integration Benefits
Using Law with the Production Manager provides:
- Dependency Management: Automatic task ordering and execution
- Retry Logic: Failed tasks are automatically retried
- Caching: Completed tasks are not re-executed unnecessarily
- Parallel Execution: Independent tasks run in parallel
- Progress Tracking: Built-in status reporting and logging
- Workflow Visualization: Generate workflow graphs with law run --print-status
See Combine Integration for detailed datacard and statistical analysis workflows.
Future Enhancements
Planned improvements:
- Automatic output merging
- Email notifications for completion/failures
- Web-based monitoring dashboard
- Support for additional batch systems (SLURM, PBS)
- Job priority management
- Resource usage statistics
Support
For issues or questions:
- Check the troubleshooting section
- Review existing issues on GitHub
- Consult the main README and batch submission docs
Related Documentation: