
Production Manager Guide

The Production Manager provides a unified system for managing batch analysis productions in RDFAnalyzerCore. It handles job generation, submission, monitoring, validation, and failure recovery, persisting state to support resilient operations.

Overview

The Production Manager is designed to address common challenges in large-scale batch analyses:

Features

Job Management

Submission Backends

Monitoring and Validation

Resilience

Quick Start

1. Create a Production

python core/python/production_submit.py \
    --name my_analysis \
    --config cfg/analysis_config.txt \
    --sample-config cfg/samples.txt \
    --exe build/analyses/MyAnalysis/myanalysis \
    --submit

This will:

  1. Discover input files from Rucio
  2. Generate job configurations
  3. Submit jobs to HTCondor
  4. Save state to condorSub_my_analysis/production_state.json

2. Monitor Progress

# Interactive monitoring with curses interface
python core/python/production_monitor.py monitor --name my_analysis

# Simple text-based monitoring
python core/python/production_monitor.py monitor --name my_analysis --simple

# One-time status check
python core/python/production_monitor.py status --name my_analysis

3. Validate Outputs

python core/python/production_monitor.py validate --name my_analysis

4. Resubmit Failed Jobs

python core/python/production_monitor.py resubmit --name my_analysis

Architecture

Components

production_manager.py          # Core ProductionManager class
├── ProductionConfig          # Configuration dataclass
├── Job                       # Job tracking dataclass
├── JobStatus                 # Job status enumeration
└── ProductionManager         # Main manager class

production_submit.py          # Production creation and submission
production_monitor.py         # Monitoring and management CLI
test_production_manager.py    # Test suite

State Management

Production state is persisted to production_state.json in the work directory:

{
  "production_name": "my_analysis",
  "timestamp": 1234567890.0,
  "jobs": {
    "0": {
      "job_id": 0,
      "config_path": "/path/to/config.txt",
      "output_path": "/path/to/output.root",
      "status": "completed",
      "condor_job_id": "12345.0",
      "submit_time": 1234567890.0,
      "attempts": 1
    }
  }
}

Job Lifecycle

CREATED → SUBMITTED → RUNNING → COMPLETED → VALIDATED
                          ↓
                       FAILED/MISSING_OUTPUT
                          ↓
                      (resubmit)
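The lifecycle above can be mirrored as a small Python enum. This is a sketch based only on the states shown in the diagram; the member names and transition table are assumptions, not the actual JobStatus definition in production_manager.py.

```python
from enum import Enum

class JobStatus(Enum):
    """Hypothetical mirror of the job states shown in the diagram."""
    CREATED = "created"
    SUBMITTED = "submitted"
    RUNNING = "running"
    COMPLETED = "completed"
    VALIDATED = "validated"
    FAILED = "failed"
    MISSING_OUTPUT = "missing_output"

# Allowed forward transitions; failed jobs loop back to SUBMITTED on resubmit
TRANSITIONS = {
    JobStatus.CREATED: {JobStatus.SUBMITTED},
    JobStatus.SUBMITTED: {JobStatus.RUNNING, JobStatus.FAILED},
    JobStatus.RUNNING: {JobStatus.COMPLETED, JobStatus.FAILED},
    JobStatus.COMPLETED: {JobStatus.VALIDATED, JobStatus.MISSING_OUTPUT},
    JobStatus.FAILED: {JobStatus.SUBMITTED},
    JobStatus.MISSING_OUTPUT: {JobStatus.SUBMITTED},
}
```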

Usage Guide

Creating a Production

python core/python/production_submit.py \
    --name my_production \
    --config cfg/base_config.txt \
    --sample-config cfg/samples.txt \
    --exe /path/to/analyzer \
    --output-dir /eos/user/u/username/outputs \
    --size 30 \
    --backend htcondor \
    --stage-inputs \
    --stage-outputs \
    --submit

Options:

Option 2: Using ProductionManager API

from pathlib import Path
from production_manager import ProductionManager, ProductionConfig

# Create config
config = ProductionConfig(
    name="my_production",
    work_dir=Path("condorSub_my_production"),
    exe_path=Path("/path/to/analyzer"),
    base_config="cfg/config.txt",
    output_dir=Path("/eos/user/u/username/outputs"),
    backend="htcondor",
    stage_inputs=True,
    stage_outputs=True,
)

# Create manager
manager = ProductionManager(config)

# Generate jobs
file_lists = [
    "root://xrootd/file1.root,root://xrootd/file2.root",
    "root://xrootd/file3.root,root://xrootd/file4.root",
]
manager.generate_jobs(file_lists)

# Submit
manager.submit_jobs()

# Monitor
manager.update_status()
manager.print_progress()

Monitoring Productions

Interactive Monitoring

# Curses interface (default)
python core/python/production_monitor.py monitor --name my_production

# Simple text interface
python core/python/production_monitor.py monitor --name my_production --simple

# Custom refresh interval (seconds)
python core/python/production_monitor.py monitor --name my_production --refresh 60

The curses interface shows:

Press q to quit the monitor.

One-Time Status Check

python core/python/production_monitor.py status --name my_production

List All Productions

# List productions in current directory
python core/python/production_monitor.py list

# List productions in specific directory
python core/python/production_monitor.py list --dir /path/to/submissions

Validating Outputs

python core/python/production_monitor.py validate --name my_production

This will:

  1. Check if output files exist
  2. Verify files are non-empty
  3. Attempt to open with ROOT to verify integrity
  4. Update job status to VALIDATED or MISSING_OUTPUT
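The four steps above can be sketched as a standalone check. This is a hand-written illustration, not the actual validation code; the lowercase status strings mirror the job states used in this guide, and the PyROOT fallback behaviour is an assumption.

```python
from pathlib import Path

def validate_output(path):
    """Classify an output file following the four validation steps above."""
    path = Path(path)
    # Steps 1-2: the file must exist and be non-empty
    if not path.exists() or path.stat().st_size == 0:
        return "missing_output"
    # Step 3: try to open with ROOT, if PyROOT is available
    try:
        import ROOT
    except ImportError:
        return "validated"  # without PyROOT, fall back to the size check
    f = ROOT.TFile.Open(str(path))
    ok = bool(f) and not f.IsZombie()
    if f:
        f.Close()
    return "validated" if ok else "missing_output"
```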

Resubmitting Failed Jobs

# Resubmit with default retry limit (3)
python core/python/production_monitor.py resubmit --name my_production

# Resubmit with custom retry limit
python core/python/production_monitor.py resubmit --name my_production --max-attempts 5

Only jobs that haven’t exceeded the maximum attempts will be resubmitted.

Resilience and Recovery

State Persistence

The Production Manager automatically saves state after every operation:

If your session is interrupted:

# Resume by creating a new manager with the same work directory
python core/python/production_monitor.py status --work-dir condorSub_my_production

The manager will automatically load the previous state and continue from where it left off.
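Resuming works because everything needed is in production_state.json. A minimal sketch of reloading it, using the schema from the example earlier in this guide; treating any job that is not completed or validated as pending is an assumption about the real logic.

```python
import json
from pathlib import Path

def load_pending_jobs(work_dir):
    """Reload production_state.json and list jobs still needing attention."""
    state = json.loads((Path(work_dir) / "production_state.json").read_text())
    done = {"completed", "validated"}
    return [job for job in state["jobs"].values() if job["status"] not in done]
```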

Handling Connection Failures

The Production Manager is designed to handle connection failures gracefully:

  1. AFS/EOS Token Expiration: State is saved locally, so you can renew tokens and resume
  2. Network Interruptions: Monitor can be restarted anytime
  3. Batch System Issues: Job status is queried fresh each time

Working in AFS/EOS

# Production in EOS
python core/python/production_submit.py \
    --name my_prod \
    --work-dir /eos/user/u/username/productions/my_prod \
    --output-dir /eos/user/u/username/outputs \
    --config cfg/config.txt \
    --sample-config cfg/samples.txt \
    --exe ./analyzer \
    --stage-outputs

Note: When working in EOS, use --stage-outputs to avoid issues with worker node access.

Use --eos-sched flag when submitting to EOS scheduler:

python core/python/production_submit.py \
    --name my_prod \
    --work-dir /eos/user/u/username/productions/my_prod \
    --config cfg/config.txt \
    --sample-config cfg/samples.txt \
    --exe ./analyzer \
    --eos-sched \
    --stage-outputs

Shared Library Management

The Production Manager automatically discovers and stages shared libraries (.so files) required by your C++ executable.

Automatic Discovery

When submitting jobs, the Production Manager:

  1. Uses ldd to find all shared library dependencies
  2. Filters out system libraries (already on worker nodes)
  3. Copies custom libraries to work_dir/lib/
  4. Transfers them to worker nodes
  5. Sets LD_LIBRARY_PATH on execution
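Step 2 of this process can be illustrated with a small parser for ldd output. This is a simplified sketch: filtering on the /lib and /usr/lib prefixes is an assumption, and the real discovery logic may use a different system-library heuristic.

```python
def parse_ldd(ldd_output):
    """Pick out non-system shared libraries from `ldd` output (step 2 above)."""
    libs = []
    for line in ldd_output.splitlines():
        if "=>" not in line:
            continue  # skip the vdso / interpreter lines
        parts = line.split("=>", 1)[1].split()
        if not parts or not parts[0].startswith("/"):
            continue  # "not found" or no resolved path
        path = parts[0]
        if path.startswith(("/lib", "/usr/lib")):
            continue  # system library, assumed present on worker nodes
        libs.append(path)
    return libs
```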

Manual Library Staging

If automatic discovery doesn’t work, you can manually stage libraries:

# Create lib directory in work dir
mkdir -p condorSub_my_prod/lib

# Copy your custom libraries
cp /path/to/libMyLibrary.so condorSub_my_prod/lib/
cp /path/to/libAnotherLib.so condorSub_my_prod/lib/

# Submit - libraries will be transferred automatically
python production_submit.py --name my_prod ...

Backend Support

HTCondor Backend

Default backend using traditional HTCondor submission:

python core/python/production_submit.py \
    --backend htcondor \
    ...

Features:

DASK Backend

Python-based backend using DASK distributed:

# Install DASK first
pip install dask distributed dask-jobqueue

python core/python/production_submit.py \
    --backend dask \
    ...

Features:

The DASK backend wraps C++ executables in Python to enable:

Note: DASK backend requires dask-jobqueue package.
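The wrapping idea can be sketched as follows, assuming dask.distributed is installed. The function names here are illustrative, not the actual backend API.

```python
import subprocess

def run_job(exe_path, config_path):
    """Run one C++ analyzer job as a subprocess and return its exit code."""
    proc = subprocess.run([exe_path, config_path],
                          capture_output=True, text=True)
    return proc.returncode

def submit_with_dask(exe_path, config_paths):
    """Fan jobs out over a DASK cluster (requires dask.distributed)."""
    from dask.distributed import Client
    client = Client()  # in production, e.g. a dask_jobqueue HTCondorCluster
    futures = [client.submit(run_job, exe_path, c) for c in config_paths]
    return client.gather(futures)
```

Because each C++ job becomes an ordinary Python callable, DASK's scheduler can retry, track, and gather results for it like any other task.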

Examples

Example 1: Simple Production

# Generate and submit
python core/python/production_submit.py \
    --name ttbar_analysis \
    --config cfg/ttbar.txt \
    --sample-config cfg/ttbar_samples.txt \
    --exe build/analyses/TTbar/ttbar \
    --submit

# Monitor
python core/python/production_monitor.py monitor --name ttbar_analysis

Example 2: Production with Staging

For analyses with large input files or unreliable xrootd access:

python core/python/production_submit.py \
    --name large_production \
    --config cfg/config.txt \
    --sample-config cfg/samples.txt \
    --exe ./analyzer \
    --stage-inputs \
    --stage-outputs \
    --submit

Example 3: Resume After Interruption

# Original submission was interrupted
# Resume by checking status
python core/python/production_monitor.py status --name my_prod

# Submit any jobs that weren't submitted (using production_submit.py --submit)
# Or manually trigger submission:
cd condorSub_my_prod && condor_submit condor_submit.sub

# Continue monitoring
python core/python/production_monitor.py monitor --name my_prod

Example 4: Manage Multiple Productions

# List all productions
python core/python/production_monitor.py list

# Check status of each
for prod in prod1 prod2 prod3; do
    echo "=== $prod ==="
    python core/python/production_monitor.py status --name $prod
done

# Resubmit failures in all
for prod in prod1 prod2 prod3; do
    python core/python/production_monitor.py resubmit --name $prod
done

Troubleshooting

Jobs Not Submitting

  1. Check HTCondor is available:
    condor_q
    
  2. Verify executable exists:
    ls -l /path/to/analyzer
    
  3. Check work directory permissions:
    ls -ld condorSub_*
    

Jobs Failing Immediately

  1. Check job logs:
    cat condorSub_my_prod/condor_logs/log_*.stderr
    
  2. Test job locally (after generation):
    # Test by running the job's config directly
    cd condorSub_my_prod/job_0
    /path/to/analyzer job_config.txt
    
  3. Check configuration:
    cat condorSub_my_prod/job_0/job_config.txt
    

Output Validation Failing

  1. Check output directory exists and is writable:
    ls -ld /path/to/outputs
    
  2. Check for disk space:
    df -h /path/to/outputs
    
  3. Verify ROOT files manually:
    root -l /path/to/outputs/output_0.root
    

State File Corruption

If production_state.json becomes corrupted:

# Backup current state
cp condorSub_my_prod/production_state.json condorSub_my_prod/production_state.json.bak

# Try to recover or regenerate
# (This will lose progress information but preserve job definitions)
python core/python/production_submit.py \
    --name my_prod \
    --work-dir condorSub_my_prod \
    ...

Permission Denied in AFS/EOS

# Check AFS token
klist
aklog

# Check EOS authentication
eos whoami

# Verify directory permissions
fs la condorSub_my_prod  # For AFS
eos ls -l /eos/user/...  # For EOS

Best Practices

  1. Use Descriptive Names: Choose meaningful production names
  2. Stage Outputs for EOS: Always use --stage-outputs when output is in EOS
  3. Monitor Regularly: Check progress periodically, especially for long productions
  4. Validate Before Merging: Always validate outputs before merging results
  5. Clean Up: Remove work directories after successful completion
  6. Test Locally First: Run test jobs locally before large-scale submission

Integration with Existing Tools

The Production Manager integrates with existing RDFAnalyzerCore tools:

Law Workflow Integration

The Production Manager integrates with the Law (Luigi Analysis Workflow) task framework for managing complex analysis workflows on batch systems. Law provides a powerful task-based workflow system with automatic dependency resolution, retry logic, and progress tracking.

Available Law Tasks

NANO Analysis Tasks

Law tasks for CMS NanoAOD analysis workflow:

Plotting Tasks

Law tasks for creating physics plots:

Combine Datacard Tasks

Law tasks for CMS combine statistical analysis:

Example Law Workflows

Basic NANO Analysis

# Submit analysis jobs
law run SubmitNANOJobs \
    --config cfg/analysis.txt \
    --dataset /DoubleMuon/Run2022*/NANOAOD \
    --backend htcondor \
    --work-dir condorSub_myanalysis

# Monitor progress
law run MonitorNANOJobs \
    --work-dir condorSub_myanalysis

# Validate outputs
law run ValidateNANOOutputs \
    --work-dir condorSub_myanalysis

Statistical Analysis Workflow

# Create datacards from analysis outputs
law run CreateDatacard \
    --datacard-config cfg/datacard.yaml \
    --name myRun \
    --input-dir outputs/

# Run combine fit
law run RunCombine \
    --name myRun \
    --method AsymptoticLimits \
    --datacard datacards/myRun.txt

Plotting Workflow

# Single plot
law run MakePlot \
    --meta-file outputs/meta.root \
    --output-file plots/pt.pdf \
    --histogram-name jet_pt

# Batch plotting from config
law run MakePlots \
    --plot-config cfg/plots.yaml \
    --output-dir plots/

Law Task Dependencies

Law automatically manages task dependencies. For example, ValidateNANOOutputs depends on SubmitNANOJobs, so running:

law run ValidateNANOOutputs --config cfg/analysis.txt

Will automatically:

  1. Submit jobs (if not already done)
  2. Wait for jobs to complete
  3. Validate outputs

Task Configuration

Law tasks use the same configuration files as the Production Manager:

# analysis.txt - Same format as before
fileList=/path/to/inputs/*.root
saveFile=output.root
threads=-1

Additional Law-specific configuration can be provided via command-line parameters or Law config files (law.cfg).

Integration Benefits

Using Law with the Production Manager provides:

See Combine Integration for detailed datacard and statistical analysis workflows.

Future Enhancements

Planned improvements:

Support

For issues or questions:


Related Documentation: