better spat computes merging
All checks were successful: Validate Operations / validate-operations (push) in 21m14s

This commit is contained in:
NiccoloN
2026-04-25 19:24:09 +02:00
parent 951baca106
commit 15e8edb9c4
11 changed files with 1477 additions and 369 deletions

README.md

@@ -1,5 +1,159 @@
# Raptor
Raptor is a domain-specific MLIR compiler for neural networks (ONNX format)
targeting in-memory computing / processing-in-memory (PIM) architectures.
It progressively lowers ONNX-MLIR through a set of MLIR dialects down to
target-specific artifacts (currently JSON code for the `pimsim-nn` simulator).
## Overview
PIM architectures perform most of the computation directly in memory.
Raptor's first supported target is `pimsim-nn`, which simulates a chip with:
- a shared host memory,
- a number of cores that do most of the computation directly in their memory
(vector ops, vmm/mvm on ReRAM crossbars),
- no branching instructions (branchless architecture) and no hardware loop
support — any repeated work (e.g. convolutions) must be unrolled into
explicit per-iteration instructions.
Because of this, the number of emitted instructions explodes quickly, and the
compiler must optimize aggressively at every stage to keep compilation
tractable.
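To see why full unrolling is so costly, consider a back-of-envelope estimate. This is an illustrative sketch only — the function name and the example shapes are invented here, not Raptor's actual cost model:

```python
# Back-of-envelope estimate of why unrolling blows up instruction counts
# on a branchless target with no hardware loops. Illustrative only; this
# is not Raptor's cost model.

def unrolled_mac_count(out_h, out_w, out_c, k_h, k_w, in_c):
    """Each output element of a convolution needs k_h * k_w * in_c
    multiply-accumulates; with no loop support, every repetition must
    be materialized as explicit per-iteration instructions."""
    return out_h * out_w * out_c * (k_h * k_w * in_c)

# A modest 3x3 convolution on a 56x56x64 feature map:
print(unrolled_mac_count(56, 56, 64, 3, 3, 64))  # 115605504 (~116M MACs)
```

Even if many MACs are batched into vector or crossbar operations, the per-iteration structure still has to be spelled out, which is why every lowering stage tries to shrink the emitted program.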
A second target, `PulPim`, is planned for an accelerator with RISC-V cores
each carrying its own in-memory computing unit and crossbars. It will live in
a dedicated dialect (future work).
### Targets and simulators
`pimsim-nn` (under `backend-simulators/pim/pimsim-nn`) is used for
**performance** estimates (latency, energy), but does not functionally execute
the JSON code it consumes. To validate the numerical correctness of the JSON
code produced by Raptor (or, for comparison, by the `pimcomp` compiler), we use
a Rust simulator we maintain in-tree at
`backend-simulators/pim/pim-simulator`.
## Compilation pipeline
The PIM-related sources live under `src/PIM` and the tests under `test/PIM`.
When working on this codebase, most changes should stay confined to those
trees (you only need to look outside, e.g. at `onnx-mlir` or `llvm`, for
framework-level details).
High-level lowering flow:
```
ONNX-MLIR ──► Spatial ──► Pim (tensor) ──► Pim (bufferized) ──► PIM JSON
```
1. **ONNX → Spatial** (`src/PIM/Conversion/ONNXToSpatial`).
Lowers ONNX ops into the `spat` dialect (`src/PIM/Dialect/Spatial`).
Spatial models a high-level spatial in-memory accelerator: vmm/mvm
operations are accelerated by storing a constant RHS matrix into a
crossbar. Crossbars cannot be re-programmed during execution, have a
limited fixed size, and there is a limited number of them per core.
Conversion patterns are split by op family under
`Conversion/ONNXToSpatial/Patterns/{Math,NN,Tensor}` (Conv, Gemm, MatMul,
Elementwise, ReduceMean, Pool, Relu, Sigmoid, Softmax, Concat, Gather,
Reshape, Resize, Split).
2. **Spatial → Pim** (`src/PIM/Conversion/SpatialToPim`).
Lowers Spatial to the `pim` dialect (`src/PIM/Dialect/Pim`), which
materializes PIM cores (`pim.core`), inter-core communication
(`pim.send` / `pim.receive`), halts, and crossbar-level operations.
3. **Merge compute nodes** (`src/PIM/Dialect/Spatial/Transforms/MergeComputeNodes`).
A DCP-inspired heuristic (Dynamic Critical Path — see the original
scheduling paper by Kwok & Ahmad,
[DCP-eScience2007](https://clouds.cis.unimelb.edu.au/papers/DCP-eScience2007.pdf))
that coarsens the virtual node graph and decides how to group compute
nodes onto cores. Our implementation is only DCP-*inspired*: it is a
heuristic with different assumptions from the paper (different cost
model, constraints from crossbar capacity / core resources, and a
windowed coarsening loop instead of full-graph reprioritization). The
`dcp-critical-window-size` option controls how many lowest-slack virtual
nodes each coarsening iteration considers (0 = legacy full-graph
analysis). Related sources: `DCPGraph/DCPAnalysis.cpp`, `Graph.cpp/.hpp`,
`MergeComputeNodesPass.cpp`.
4. **Bufferization** (`src/PIM/Dialect/Pim/Transforms/Bufferization`).
Converts tensor-semantics PIM IR into memref-semantics PIM IR using the
standard MLIR `BufferizableOpInterface` machinery
(`OpBufferizationInterfaces.*`, `PimBufferization.td`).
5. **PIM code generation** (`src/PIM/Pass/PimCodegen`):
- `HostConstantFolding` — folds host-side constants.
- `MaterializeHostConstantsPass` — materializes the remaining host
constants for emission.
- `VerificationPass` — checks invariants before emission.
- `EmitPimJsonPass` — emits the final PIM JSON consumed by `pimsim-nn`
and `pim-simulator`.
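The crossbar constraints from step 1 (fixed size, no re-programming during execution) mean a constant RHS matrix larger than one crossbar has to be split into tiles, one tile per crossbar. A minimal sketch of the resulting resource count — the function is hypothetical, not code from `ONNXToSpatial`:

```python
import math

# Illustrative sketch: a constant RHS matrix larger than one crossbar must
# be partitioned into fixed-size tiles, one per crossbar, because crossbars
# cannot be re-programmed at run time. Not Raptor's actual partitioning code.

def crossbars_needed(rows, cols, crossbar_size):
    """Number of crossbar_size x crossbar_size tiles that cover a
    rows x cols constant matrix."""
    return math.ceil(rows / crossbar_size) * math.ceil(cols / crossbar_size)

# A 4096x4096 Gemm weight on 2048x2048 crossbars:
print(crossbars_needed(4096, 4096, 2048))  # 4
```

Because the per-core crossbar count is also limited, this tile count directly constrains how compute nodes can be grouped onto cores in the merge step.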
Supporting pieces:
- `src/PIM/Compiler` — PIM-specific compiler options (crossbar size/count,
core count, DCP window, experimental conv impl, concat error handling, …)
and `PimCodeGen` entry points.
- `src/PIM/Common` — shared utilities (`PimCommon`, `LabeledList`).
- `src/PIM/Pass` — auxiliary passes (`MessagePass`, `CountInstructionPass`)
and the `PIMPasses.h` registry used by `PimAccelerator`.
- `src/PIM/PimAccelerator.{cpp,hpp}` — accelerator entry point: registers
dialects, passes, and plugs Raptor into the ONNX-MLIR driver.
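The windowed coarsening from step 3 can be sketched roughly as follows. This is a heavily simplified illustration of the window selection only — the node names, slack values, and feasibility predicate are invented, and the real `MergeComputeNodes` pass uses a different cost model plus crossbar/core resource constraints:

```python
# Simplified sketch of windowed lowest-slack coarsening (DCP-inspired).
# Invented example data; not the actual MergeComputeNodes implementation.

def coarsen_step(slack, window_size, can_merge):
    """Return the virtual nodes considered in one coarsening iteration:
    the `window_size` nodes with the least slack (i.e. the most critical
    ones), filtered by a merge-feasibility predicate. A window_size of 0
    mimics the legacy behaviour of analyzing the whole graph."""
    ordered = sorted(slack, key=slack.get)  # least slack first
    window = ordered if window_size == 0 else ordered[:window_size]
    return [n for n in window if can_merge(n)]

slack = {"conv1": 0, "gemm": 1, "conv2": 3, "relu": 7}
print(coarsen_step(slack, 2, lambda n: n != "gemm"))  # ['conv1']
```

The window keeps each iteration cheap on large graphs: only the most critical candidates are re-examined, instead of reprioritizing the full graph as in the original DCP formulation.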
## Key compiler options
Pass these on the `onnx-mlir` command line when compiling for PIM:
- `--maccel=PIM` — select the PIM accelerator.
- `--EmitSpatial` / `--EmitPim` / `--EmitPimBufferized` / `--EmitPimCodegen`
— stop the pipeline at the requested stage (default: `EmitPimCodegen`).
- `--pim-only-codegen` — assume the input is already bufferized PIM IR and
run only the codegen tail.
- `--crossbar-size=<N>` / `--crossbar-count=<N>` — crossbar dimensions and
per-core count.
- `--core-count=<N>` — number of cores (`-1` picks the minimum).
- `--dcp-critical-window-size=<N>` — DCP coarsening window (0 = legacy).
- `--use-experimental-conv-impl` — alternative convolution lowering.
- `--ignore-concat-error` — soft-fail a corner case in `ConcatOp`.
## Validation
Functional validation lives in `validation/` and drives the Rust
`pim-simulator` to compare Raptor's output against a reference.
Per-operation validation (from `validation/`):
```
validate.py \
--raptor-path ../cmake-build-release/Release/bin/onnx-mlir \
--onnx-include-dir ../onnx-mlir/include
```
End-to-end network validation (example: first 4 layers of YOLOv11n):
```
validate.py \
--raptor-path ../cmake-build-release/Release/bin/onnx-mlir \
--onnx-include-dir ../onnx-mlir/include \
--operations-dir ./networks/yolo11n/depth_04 \
--crossbar-size 2048
```
Available networks under `validation/networks/`: `vgg16`, `yolo11n`.
Available operations under `validation/operations/`: `add`, `conv`, `div`,
`gather`, `gemm`, `gemv`, `mul`, `pool`, `reduce_mean`, `relu`, `resize`,
`sigmoid`, `softmax`, `split`.
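At its core, the functional comparison the validation performs boils down to an elementwise tolerance check of the simulated output against the reference. A minimal sketch — the function and tolerances are hypothetical, not the actual `validate.py` logic:

```python
# Illustrative sketch of a functional-validation check: compare a simulated
# output against a reference elementwise, within relative and absolute
# tolerances. Hypothetical helper; not the actual validate.py code.

def outputs_match(actual, reference, rtol=1e-3, atol=1e-5):
    """True if every element is within atol + rtol * |ref| of the reference."""
    if len(actual) != len(reference):
        return False
    return all(abs(a - r) <= atol + rtol * abs(r)
               for a, r in zip(actual, reference))

print(outputs_match([1.0002, -0.5], [1.0, -0.5]))  # True
```

Tolerances matter here because fixed-point crossbar arithmetic does not reproduce floating-point reference results bit-exactly.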
## Rebuilding
Release build (fast):
```
cmake --build /home/nico/raptor/raptor/cmake-build-release --target onnx-mlir -j 30
```
A slower debug build is also available — configure it the same way but with
`-DCMAKE_BUILD_TYPE=Debug` (see installation instructions below).
## Build
### Protobuf