Skip to the content.

Python Bindings Implementation Summary

Overview

This document summarizes the implementation of Python bindings for RDFAnalyzerCore using pybind11.

What Was Implemented

Core Components

  1. AnalyzerPythonWrapper Class (core/bindings/python_bindings.cpp)
    • Wraps the C++ Analyzer class for Python access
    • Provides three main interfaces for defining variables and filters:
      • String-based JIT expressions (ROOT interpreter)
      • Numba function pointers
      • Numpy array integration
  2. Methods Exposed to Python
    • Define(name, expression, columns) - C++-style alias for string-based definitions
    • Filter(name, expression, columns) - C++-style alias for string-based filters
    • DefineVector(name, data_ptr, size, element_type) - C++-style alias for vector definitions
    • DefineJIT(name, expression, columns) - Define variables with C++ expressions
    • FilterJIT(name, expression, columns) - Apply filters with C++ expressions
    • DefineFromPointer(name, func_ptr, signature, columns) - Use numba-compiled functions
    • DefineFromVector(name, data_ptr, size, element_type) - Pass numpy arrays
    • save() - Trigger computation and save results (chainable)
    • configMap(key) - Access configuration values
    • setConfig(key, value), getConfigMap(), getConfigList(...) - Configuration management from Python
    • registerSystematic(...), getSystematics(), makeSystList(...) - Systematics management APIs
    • AddPlugin(role, pluginType), AddDefaultPlugins(), SetupPlugin(role) - Plugin lifecycle control
    • Plugin action APIs by role:
      • applyAllOnnxModels, applyOnnxModel, getOnnxModelNames, getOnnxModelFeatures
      • applyAllBDTs, applyBDT, getBDTNames, getBDTFeatures
      • applyCorrection, getCorrectionFeatures
      • applyAllTriggers, getTriggerGroups, getTriggers, getVetoes
      • applyAllSofieModels, applySofieModel, getSofieModelNames
  3. CMake Integration (core/bindings/CMakeLists.txt)
    • Optional build when Python 3.8+ and pybind11 are available
    • Automatically finds pybind11 via pip installation
    • Links against core library and ROOT
    • Output: build/python/rdfanalyzer.so module

Security Features

  1. Input Validation
    • Function signature validation in DefineFromPointer to prevent code injection
    • Whitelist of allowed element types in DefineFromVector
    • Validation checks for dangerous characters and malformed inputs
  2. User Feedback
    • Warning messages when attempting to redefine existing columns
    • Clear error messages for invalid inputs
    • Helpful exception messages with expected formats

Documentation

  1. User Documentation
    • docs/PYTHON_BINDINGS.md - Comprehensive usage guide (13KB)
    • docs/PYTHON_BINDINGS_TESTING.md - Testing guide (8KB)
    • examples/README.md - Quick start for examples (3KB)
    • Updated main README.md with Python sections
  2. Examples
    • examples/example_string_expressions.py - ROOT JIT usage
    • examples/example_numba_functions.py - Numba integration
    • examples/example_numpy_arrays.py - Numpy arrays
  3. API Documentation
    • Inline docstrings in pybind11 bindings
    • Python-style documentation in comments
    • Full API reference in user guide

Testing

  1. Test Suite (core/bindings/test_bindings.py)
    • Module import verification
    • Method existence checks
    • Numpy integration test
    • Numba integration test
    • Basic constructor test
  2. Build Validation (test_python_bindings.sh)
    • Checks ROOT availability
    • Verifies Python dependencies
    • Validates module was built
    • Runs test suite
    • Provides troubleshooting guidance

Technical Details

Systematic Variations

The bindings automatically handle systematic variations:

Memory Management

Performance

All three approaches provide near-native performance:

Files Changed/Added

New Files

core/bindings/
├── CMakeLists.txt              (1.5KB) - Build configuration
├── python_bindings.cpp         (15KB)  - Main bindings implementation
└── test_bindings.py            (4.7KB) - Test suite

examples/
├── README.md                   (3.2KB) - Examples documentation
├── example_string_expressions.py (2.4KB)
├── example_numba_functions.py    (3.9KB)
└── example_numpy_arrays.py       (3.2KB)

docs/
├── PYTHON_BINDINGS.md          (13KB) - User guide
└── PYTHON_BINDINGS_TESTING.md  (8.2KB) - Testing guide

test_python_bindings.sh         (2.9KB) - Build test script

Modified Files

README.md                       - Added Python bindings section
core/CMakeLists.txt            - Added bindings subdirectory

Requirements

Build Requirements

Runtime Requirements

Usage Patterns

Pattern 1: String Expressions (ROOT JIT)

analyzer = rdfanalyzer.Analyzer("config.txt")
analyzer.DefineJIT("pt_gev", "pt / 1000.0", ["pt"])
analyzer.FilterJIT("high_pt", "pt_gev > 25.0", ["pt_gev"])
analyzer.save()

Use when:

Pattern 2: Numba Functions

import numba, ctypes

@numba.cfunc("float64(float64)")
def convert(x):
    return x / 1000.0

ptr = ctypes.cast(convert.address, ctypes.c_void_p).value
analyzer.DefineFromPointer("pt_gev", ptr, "double(double)", ["pt"])

Use when:

Pattern 3: Numpy Arrays

import numpy as np

weights = np.array([1.0, 1.2, 0.8], dtype=np.float32)
analyzer.DefineFromVector("weight", weights.ctypes.data, len(weights), "float")

Use when:

Design Decisions

Why pybind11?

Why Wrapper Class?

Why Three Interfaces?

API Naming Parity Update

To reduce friction between C++ and Python usage, the bindings now expose C++-style method names directly in Python:

Legacy names (DefineJIT, FilterJIT, DefineFromVector) remain available for backward compatibility.

Why String-based JIT?

Limitations and Future Work

Current Limitations

  1. Template-based C++ callables (Analyzer::Define with arbitrary C++ lambdas) are not directly bindable from Python
  2. Direct ROOT::RDF::RNode manipulation is still not exposed
  3. Histogram booking structures (histInfo, selectionInfo) are not yet bound as Python-native types
  4. Runtime plugin registration currently supports built-in manager types only

Future Enhancements

Planned for future releases:

Security Considerations

Implemented Safeguards

  1. Type whitelisting - Only predefined types allowed in DefineFromVector
  2. Signature validation - Function signatures validated before use
  3. Input sanitization - Dangerous characters rejected
  4. Clear error messages - Users informed of invalid inputs

Remaining Risks

  1. Expression injection - DefineJIT accepts arbitrary C++ code (by design)
  2. Memory safety - Users must manage numpy array lifetimes
  3. Pointer validity - Function pointers assumed valid

Recommendations

Testing Strategy

Unit Tests

Integration Tests

Performance Tests

Build and Deployment

Build Process

# Install dependencies
pip install pybind11 numpy numba

# Configure and build
source env.sh  # Setup ROOT
cmake -S . -B build
cmake --build build -j$(nproc)

Output

build/python/rdfanalyzer.so    # Python module

Installation (Optional)

cmake --install build
# Or add build/python to PYTHONPATH

Conclusion

The Python bindings implementation provides:

✅ Three flexible interfaces for defining variables ✅ High performance with minimal overhead ✅ Comprehensive documentation and examples ✅ Security features to prevent code injection ✅ Automatic systematic variation handling ✅ Easy integration with numpy and numba ✅ Minimal changes to existing codebase

The implementation is production-ready and suitable for:

All requirements from the problem statement have been addressed: