
Raptor

Raptor is a domain-specific MLIR compiler for neural networks (ONNX format) targeting in-memory computing / processing-in-memory (PIM) architectures. It progressively lowers the ONNX-MLIR representation through a set of MLIR dialects down to target-specific artifacts (currently JSON code for the pimsim-nn simulator).

Overview

PIM architectures perform most of the computation directly in memory. Raptor's first supported target is pimsim-nn, which simulates a chip with:

  • a shared host memory,
  • a number of cores that do most of the computation directly in their memory (vector ops, vmm/mvm on ReRAM crossbars),
  • no branching instructions (branchless architecture) and no hardware loop support — any repeated work (e.g. convolutions) must be unrolled into explicit per-iteration instructions.

Because of this, the number of emitted instructions explodes quickly, and the compiler must optimize aggressively at every stage to keep compilation tractable.
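
To get a feel for the blow-up, here is a rough back-of-the-envelope estimate (a hypothetical cost model for illustration, not Raptor's actual one, with a made-up `ops_per_output` parameter):

```python
# Rough, hypothetical estimate of how many explicit instructions a fully
# unrolled convolution emits on a branchless target with no hardware loops
# (illustration only, NOT Raptor's real cost model).
def unrolled_conv_instructions(h_out, w_out, c_out, ops_per_output=3):
    """Each output element needs its own explicit instruction sequence
    (e.g. load, mvm, store), since no loop hardware can reuse them."""
    return h_out * w_out * c_out * ops_per_output

# Even a modest 56x56x64 output tensor already yields ~600k instructions:
print(unrolled_conv_instructions(56, 56, 64))  # 602112
```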

A second target, PulPim, is planned for an accelerator with RISC-V cores each carrying its own in-memory computing unit and crossbars. It will live in a dedicated dialect (future work).

Targets and simulators

pimsim-nn (under backend-simulators/pim/pimsim-nn) is used for performance estimates (latency, energy), but does not functionally execute the JSON code it consumes. To validate the numerical correctness of the JSON code produced by Raptor (or, for comparison, by the pimcomp compiler), we use a Rust simulator we maintain in-tree at backend-simulators/pim/pim-simulator.

Compilation pipeline

The PIM-related sources live under src/PIM and the tests under test/PIM. When working on this codebase, most changes should stay confined to those trees (you only need to look outside, e.g. at onnx-mlir or llvm, for framework-level details).

High-level lowering flow:

ONNX-MLIR ──► Spatial ──► Pim (tensor) ──► Pim (bufferized) ──► PIM JSON
  1. ONNX → Spatial (src/PIM/Conversion/ONNXToSpatial). Lowers ONNX ops into the spat dialect (src/PIM/Dialect/Spatial). Spatial models a high-level spatial in-memory accelerator: vmm/mvm operations are accelerated by storing a constant RHS matrix into a crossbar. Crossbars cannot be re-programmed during execution, have a limited fixed size, and there is a limited number of them per core. Conversion patterns are split by op family under Conversion/ONNXToSpatial/Patterns/{Math,NN,Tensor} (Conv, Gemm, MatMul, Elementwise, ReduceMean, Pool, Relu, Sigmoid, Softmax, Concat, Gather, Reshape, Resize, Split).
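
Since crossbars have a fixed size and cannot be re-programmed during execution, a constant RHS matrix larger than one crossbar has to be tiled across several of them. A minimal sketch of the resulting occupancy (an illustration under that tiling assumption, not Raptor's actual placement algorithm):

```python
import math

# Sketch (assumption, not Raptor's placement algorithm): a constant RHS weight
# matrix is split into crossbar_size x crossbar_size tiles, each programmed
# into its own crossbar once and never re-programmed at run time.
def crossbars_needed(rows, cols, crossbar_size):
    return math.ceil(rows / crossbar_size) * math.ceil(cols / crossbar_size)

# A 512x768 weight matrix on 256x256 crossbars occupies 2 * 3 = 6 of them:
print(crossbars_needed(512, 768, 256))  # 6
```

The per-core crossbar count then bounds how many such matrices a single core can host, which is one of the constraints the later merging step has to respect.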

  2. Spatial → Pim (src/PIM/Conversion/SpatialToPim). Lowers Spatial to the pim dialect (src/PIM/Dialect/Pim), which materializes PIM cores (pim.core), inter-core communication (pim.send / pim.receive), halts, and crossbar-level operations.
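
Conceptually, pim.send / pim.receive behave like matched point-to-point transfers between cores. A toy model of that pairing (our illustration, not the pim dialect's actual semantics):

```python
from collections import defaultdict, deque

# Toy model (illustration only, not the pim dialect semantics): each
# (src, dst) core pair gets a FIFO channel; send enqueues, receive dequeues.
class Fabric:
    def __init__(self):
        self.channels = defaultdict(deque)  # (src, dst) -> pending messages

    def send(self, src, dst, value):
        self.channels[(src, dst)].append(value)

    def receive(self, src, dst):
        return self.channels[(src, dst)].popleft()

fabric = Fabric()
fabric.send(0, 1, [1.0, 2.0])   # core 0 sends a tile to core 1
print(fabric.receive(0, 1))     # core 1 consumes it: [1.0, 2.0]
```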

  3. Merge compute nodes (src/PIM/Dialect/Spatial/Transforms/MergeComputeNodes). A DCP-inspired heuristic (Dynamic Critical Path — see the original scheduling paper by Kwok & Ahmad, DCP-eScience2007) that coarsens the virtual node graph and decides how to group compute nodes onto cores. Our implementation is only DCP-inspired: it is a heuristic with different assumptions from the paper (different cost model, constraints from crossbar capacity / core resources, and a windowed coarsening loop instead of full-graph reprioritization). The dcp-critical-window-size option controls how many lowest-slack virtual nodes each coarsening iteration considers (0 = legacy full-graph analysis). Related sources: DCPGraph/DCPAnalysis.cpp, Graph.cpp/.hpp, MergeComputeNodesPass.cpp.
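
The slack/window idea can be sketched on a toy DAG (a simplified illustration: the real DCPAnalysis uses a different cost model plus crossbar/core resource constraints):

```python
# Toy sketch of the windowed critical-path idea (illustration only; the real
# MergeComputeNodes pass has a different cost model and resource constraints).
# Slack = latest start - earliest start; nodes with zero slack lie on the
# critical path, and each coarsening iteration only considers the `window`
# lowest-slack nodes as merge candidates.
def slacks(succ, cost):
    nodes = list(cost)           # assumed listed in topological order
    pred = {n: [] for n in nodes}
    for u, vs in succ.items():
        for v in vs:
            pred[v].append(u)
    est = {}                     # earliest start times, forward pass
    for n in nodes:
        est[n] = max((est[p] + cost[p] for p in pred[n]), default=0)
    makespan = max(est[n] + cost[n] for n in nodes)
    lst = {}                     # latest start times, backward pass
    for n in reversed(nodes):
        lst[n] = min((lst[s] for s in succ.get(n, [])), default=makespan) - cost[n]
    return {n: lst[n] - est[n] for n in nodes}

def critical_window(succ, cost, window):
    s = slacks(succ, cost)
    return sorted(s, key=s.get)[:window]  # merge candidates this iteration

succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
cost = {"a": 2, "b": 3, "c": 1, "d": 2}
print(critical_window(succ, cost, window=2))  # ['a', 'b']
```

With window=0 one would instead analyze the full graph every iteration, which mirrors the legacy behavior of dcp-critical-window-size=0.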

  4. Bufferization (src/PIM/Dialect/Pim/Transforms/Bufferization). Converts tensor-semantics PIM IR into memref-semantics PIM IR using the standard MLIR BufferizableOpInterface machinery (OpBufferizationInterfaces.*, PimBufferization.td).

  5. PIM code generation (src/PIM/Pass/PimCodegen):

    • HostConstantFolding — folds host-side constants.
    • MaterializeHostConstantsPass — materializes the remaining host constants for emission.
    • VerificationPass — checks invariants before emission.
    • EmitPimJsonPass — emits the final PIM JSON consumed by pimsim-nn and pim-simulator.

Supporting pieces:

  • src/PIM/Compiler — PIM-specific compiler options (crossbar size/count, core count, DCP window, experimental conv impl, concat error handling, …) and PimCodeGen entry points.
  • src/PIM/Common — shared utilities (PimCommon, LabeledList).
  • src/PIM/Pass — auxiliary passes (MessagePass, CountInstructionPass) and the PIMPasses.h registry used by PimAccelerator.
  • src/PIM/PimAccelerator.{cpp,hpp} — accelerator entry point: registers dialects, passes, and plugs Raptor into the ONNX-MLIR driver.

Key compiler options

Pass these on the onnx-mlir command line when compiling for PIM:

  • --maccel=PIM — select the PIM accelerator.
  • --EmitSpatial / --EmitPim / --EmitPimBufferized / --EmitPimCodegen — stop the pipeline at the requested stage (default: EmitPimCodegen).
  • --pim-only-codegen — assume the input is already bufferized PIM IR and run only the codegen tail.
  • --crossbar-size=<N> / --crossbar-count=<N> — crossbar dimensions and per-core count.
  • --core-count=<N> — number of cores (-1 picks the minimum).
  • --dcp-critical-window-size=<N> — DCP coarsening window (0 = legacy).
  • --use-experimental-conv-impl — alternative convolution lowering.
  • --ignore-concat-error — soft-fail corner case in ConcatOp.

Validation

Functional validation lives in validation/ and drives the Rust pim-simulator to compare Raptor's output against a reference.

Per-operation validation (from validation/):

validate.py \
    --raptor-path ../cmake-build-release/Release/bin/onnx-mlir \
    --onnx-include-dir ../onnx-mlir/include

End-to-end network validation (example: first 4 layers of YOLOv11n):

validate.py \
    --raptor-path ../cmake-build-release/Release/bin/onnx-mlir \
    --onnx-include-dir ../onnx-mlir/include \
    --operations-dir ./networks/yolo11n/depth_04 \
    --crossbar-size 2048

Available networks under validation/networks/: vgg16, yolo11n. Available operations under validation/operations/: add, conv, div, gather, gemm, gemv, mul, pool, reduce_mean, relu, resize, sigmoid, softmax, split.

Rebuilding

Release build (fast):

cmake --build /home/nico/raptor/raptor/cmake-build-release --target onnx-mlir -j 30

A slower debug build is also available — configure it the same way but with -DCMAKE_BUILD_TYPE=Debug (see installation instructions below).

Build

Protobuf

Use the following commands to install protobuf:

git clone --depth 1 --branch v34.0 https://github.com/protocolbuffers/protobuf
cd protobuf
mkdir build
cd build
cmake .. -G Ninja -DCMAKE_BUILD_TYPE=Release
ninja
sudo ninja install

You can now remove the protobuf repo directory with:

cd ../..
rm -rf protobuf

MLIR

Follow the first part of the instructions here to build MLIR.

Remember to set -DCMAKE_BUILD_TYPE=Debug if you are developing on Raptor.

Moreover, when building in Debug mode, it is also recommended to use mold as the linker to reduce memory usage during linking (install it first if you don't have it already). Enable it by setting the option:

-DLLVM_USE_LINKER=mold

Raptor

Use the following commands to build Raptor.

Remember to set -DCMAKE_BUILD_TYPE=Debug for developing on Raptor.

In this case too, mold is recommended as the linker, to reduce link time and memory usage, by setting the options:

-DCMAKE_EXE_LINKER_FLAGS="-fuse-ld=mold" \
-DCMAKE_SHARED_LINKER_FLAGS="-fuse-ld=mold" \
-DCMAKE_MODULE_LINKER_FLAGS="-fuse-ld=mold"

git submodule update --init --recursive

MLIR_DIR=$(pwd)/onnx-mlir/llvm-project/build/lib/cmake/mlir
mkdir build && cd build
cmake .. -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DONNX_MLIR_ACCELERATORS=PIM \
  -DLLVM_ENABLE_ASSERTIONS=ON \
  -DMLIR_DIR=${MLIR_DIR}
cmake --build .

If the build fails because protobuf is missing uint definitions, patch the problematic files by adding #include <cstdint> to their includes.
