Update README and AGENTS

2026-05-27 15:09:30 +02:00
parent c6b02af7a9
commit 013ae0ac2a
2 changed files with 243 additions and 185 deletions
@@ -1,7 +1,7 @@
 - Always read the full README.md before doing anything.
 - Build commands:
-    - `cmake --build ./build_release --target onnx-mlir -j 30`
-    - `cmake --build ./build_debug --target onnx-mlir -j 30`
+    - `cmake --build ./build_release`
+    - `cmake --build ./build_debug`
 - Never use `ninja` directly: it bypasses cmake's configuration and invalidates the build cache.

 # Code changes
@@ -1,168 +1,178 @@
 # Raptor

-Raptor is a domain-specific MLIR compiler for neural networks (ONNX format)
-targeting in-memory computing / processing-in-memory (PIM) architectures.
-It progressively lowers ONNX-MLIR through a set of MLIR dialects down to
-target-specific artifacts (currently JSON code for the `pimsim-nn` simulator).
+Raptor is a domain-specific MLIR compiler for neural networks in ONNX format,
+targeting in-memory computing / processing-in-memory (PIM) architectures. It
+extends ONNX-MLIR with a PIM accelerator and progressively lowers ONNX-MLIR IR
+through custom MLIR dialects to simulator artifacts.
+
+The current target is the PIM simulator stack under `backend-simulators/pim`.
+Raptor emits binary per-core `.pim` instruction files by default, plus
+`memory.bin`, `config.json`, and weight binaries. It can also emit per-core JSON
+instruction files with `--pim-emit-json`.

 ## Overview

-PIM architectures perform most of the computation directly in memory.
-Raptor's first supported target is `pimsim-nn`, which simulates a chip with:
- a shared host memory,
- a number of cores that do most of the computation directly in their memory
-  (vector ops, vmm/mvm on ReRAM crossbars),
- no branching instructions (branchless architecture) and no hardware loop
-  support — any repeated work (e.g. convolutions) must be unrolled into
-  explicit per-iteration instructions.
+PIM architectures perform most computation directly in memory. The supported
+target models a chip with:
+- shared host memory,
+- multiple PIM cores,
+- ReRAM crossbars for vector-matrix / matrix-vector work,
+- explicit communication between cores,
+- no hardware branch or loop support in emitted simulator code.

-Because of this, the amount of emitted instructions explodes quickly and the
-compiler must optimize aggressively at every stage to keep compilation
-tractable.
-
-A second target, `PulPim`, is planned for an accelerator with RISC-V cores
-each carrying its own in-memory computing unit and crossbars. It will live in
-a dedicated dialect (future work).
+Because repeated work such as convolutions is eventually unrolled into explicit
+instructions, emitted instruction counts can grow quickly. Most compiler work
+focuses on lowering, scheduling, memory layout, and code-generation optimizations.

 ### Targets and simulators

-`pimsim-nn` (under `backend-simulators/pim/pimsim-nn`) is used for
-**performance** estimates (latency, energy), but does not functionally execute
-the JSON code it consumes. To validate the numerical correctness of the JSON
-code produced by Raptor (or, for comparison, by the `pimcomp` compiler), we use
-a Rust simulator we maintain in-tree at
-`backend-simulators/pim/pim-simulator`.
+- `backend-simulators/pim/pim-simulator` is the in-tree Rust functional
+  simulator used by validation. It reads Raptor's `pim/` artifact directory;
+  validation then compares its output against native ONNX-MLIR execution.
+- `backend-simulators/pim/pimsim-nn` is the performance simulator submodule.
+  The helper scripts in `pimcomp_utils/` are for comparison with PIMCOMP-NN and
+  contain local paths; treat them as local utilities, not portable workflows.

 ## Compilation pipeline

-The PIM-related sources live under `src/PIM` and the tests under `test/PIM`.
-When working on this codebase, most changes should stay confined to those
-trees (you only need to look outside, e.g. at `onnx-mlir` or `llvm`, for
-framework-level details).
+The PIM sources live under `src/PIM` and tests under `test/PIM`. CMake exposes
+them to ONNX-MLIR through generated shim directories under
+`onnx-mlir/src/Accelerators/PIM` and `onnx-mlir/test/accelerators/PIM`.

 High-level lowering flow:

 ```
-ONNX-MLIR ──► Spatial ──► Pim (tensor) ──► Pim (bufferized) ──► PIM code
+ONNX-MLIR -> Spatial -> Pim (tensor) -> Pim (bufferized) -> PIM artifacts
 ```

-1. **ONNX → Spatial** (`src/PIM/Conversion/ONNXToSpatial`).
-   Lowers ONNX ops into the `spat` dialect (`src/PIM/Dialect/Spatial`).
-   Spatial models a high-level spatial in-memory accelerator: vmm/mvm
-   operations are accelerated by storing a constant RHS matrix into a
-   crossbar. Crossbars cannot be re-programmed during execution, have a
-   limited fixed size, and there is a limited number of them per core.
-   Conversion patterns are split by op family under
-   `Conversion/ONNXToSpatial/Patterns/{Math,NN,Tensor}` (Conv, Gemm, MatMul,
-   Elementwise, ReduceMean, Pool, Relu, Sigmoid, Softmax, Concat, Gather,
-   Reshape, Resize, Split, etc...).
+1. **ONNX -> Spatial** (`src/PIM/Conversion/ONNXToSpatial`).
+   Lowers supported ONNX ops into the `spat` dialect
+   (`src/PIM/Dialect/Spatial`). Conversion patterns are split by op family under
+   `Patterns/{Math,NN,Tensor}` and currently cover Conv, Gemm, MatMul,
+   elementwise Add/Mul/Div, ReduceMean, pooling, Relu, Sigmoid, Softmax,
+   Concat, Gather, Reshape, Resize, and Split.

-2. **Spatial → Pim** (`src/PIM/Conversion/SpatialToPim`).
-   Lowers Spatial to the `pim` dialect (`src/PIM/Dialect/Pim`), which
-   materializes PIM cores (`pim.core`), inter-core communication
-   (`pim.send` / `pim.receive`), halts, and crossbar-level operations.
+2. **Merge compute nodes**
+   (`src/PIM/Dialect/Spatial/Transforms/MergeComputeNodes`).
+   Builds a compute graph, schedules it with the PEFT scheduler, and materializes
+   the merge schedule into Spatial IR. Supporting scheduling code lives under
+   `MergeComputeNodes/Scheduling`.

-3. **Merge compute nodes** (`src/PIM/Dialect/Spatial/Transforms/MergeComputeNodes`).
-   A PEFT heuristic that coarsens the virtual node graph and decides how to group compute
-   nodes onto cores. Our implementation is only DCP-*inspired*: it is a
-   heuristic with different assumptions from the paper (different cost
-   model, constraints from crossbar capacity / core resources, and a
-   windowed coarsening loop instead of full-graph reprioritization). The
-   `dcp-critical-window-size` option controls how many lowest-slack virtual
-   nodes each coarsening iteration considers (0 = legacy full-graph
-   analysis). Related sources: `DCPGraph/DCPAnalysis.cpp`, `Graph.cpp/.hpp`,
-   `MergeComputeNodesPass.cpp`.
+3. **Spatial -> Pim** (`src/PIM/Conversion/SpatialToPim`).
+   Lowers Spatial operations to the `pim` dialect (`src/PIM/Dialect/Pim`),
+   including `pim.core`, `pim.core_batch`, communication, tensor packing, global
+   tensor materialization, and return-path normalization.

 4. **Bufferization** (`src/PIM/Dialect/Pim/Transforms/Bufferization`).
-   Converts tensor-semantics PIM IR into memref-semantics PIM IR using the
-   standard MLIR `BufferizableOpInterface` machinery
-   (`OpBufferizationInterfaces.*`, `PimBufferization.td`).
+   Converts tensor-semantics PIM IR into memref-semantics PIM IR using MLIR's
+   bufferization interfaces.

-5. **Static memory coalescing** (`src/PIM/Dialect/Pim/Transforms/StaticMemoryCoalescing`).
-   Conservatively reuses same-typed local memref allocations inside PIM cores
-   after bufferization and before code generation.
+5. **Static memory coalescing**
+   (`src/PIM/Dialect/Pim/Transforms/StaticMemoryCoalescing`).
+   Reuses compatible local memref allocations inside PIM cores before codegen.

-6. **PIM code generation** (`src/PIM/Pass/PimCodegen`):
-   - `HostConstantFolding` — folds host-side constants.
-   - `MaterializeHostConstantsPass` — materializes the remaining host
-     constants for emission.
-   - `VerificationPass` — checks invariants before emission.
-   - `EmitPimJsonPass` — emits the final PIM JSON consumed by `pimsim-nn`
-     and `pim-simulator`.
+6. **PIM code generation** (`src/PIM/Pass/PimCodegen` and
+   `src/PIM/Compiler`).
+   Folds host constants, materializes remaining host constants, verifies PIM IR,
+   emits `.pim` core files, writes weights, and writes `memory.bin` /
+   `config.json`.

 Supporting pieces:
- `src/PIM/Compiler` — PIM-specific compiler options (crossbar size/count,
-  core count, DCP window, experimental conv impl, concat error handling, …)
-  and `PimCodeGen` entry points.
- `src/PIM/Common` — shared utilities (`PimCommon`, `LabeledList`).
- `src/PIM/Pass` — auxiliary passes (`MessagePass`)
-  and the `PIMPasses.h` registry used by `PimAccelerator`.
- `src/PIM/PimAccelerator.{cpp,hpp}` — accelerator entry point: registers
-  dialects, passes, and plugs Raptor into the ONNX-MLIR driver.
+- `src/PIM/Common` - shared IR, filesystem, diagnostics, reports, and utility
+  helpers.
+- `src/PIM/Compiler` - PIM compiler options, memory/address planning, binary
+  instruction format, artifact writing, weight emission, and codegen entry
+  points.
+- `src/PIM/Conversion/SpatialToGraphviz` - optional Spatial graphviz conversion
+  pass.
+- `src/PIM/Pass` - pass registration and auxiliary passes.
+- `src/PIM/PimAccelerator.{cpp,hpp}` - ONNX-MLIR accelerator entry point.

 ## Key compiler options

-Pass these on the `onnx-mlir` command line when compiling for PIM:
+Pass these to `onnx-mlir` when compiling for PIM:

- `--maccel=PIM` — select the PIM accelerator.
- `--EmitSpatial` / `--EmitPim` / `--EmitPimBufferized` / `--EmitPimCodegen`
-  — stop the pipeline at the requested stage (default: `EmitPimCodegen`).
- `--pim-only-codegen` — assume the input is already bufferized PIM IR and
-  run only the codegen tail.
- `--crossbar-size=<N>` / `--crossbar-count=<N>` — crossbar dimensions and
-  per-core count.
- `--core-count=<N>` — number of cores. Required for PIM compilation.
- `--pim-merge-scheduler={peft,dcp}` — scheduler used by the Spatial
-  merge-compute-nodes pass (default: `peft`).
- `--dcp-critical-window-size=<N>` — DCP coarsening window (0 = legacy).
- `--use-experimental-conv-impl` — alternative convolution lowering.
- `--ignore-concat-error` — soft-fail corner case in `ConcatOp`.
+- `--maccel=PIM` - select the PIM accelerator.
+- `--EmitSpatial`, `--EmitPim`, `--EmitPimBufferized`,
+  `--EmitPimCodegen` - stop the PIM pipeline at the requested stage. The PIM
+  default is `--EmitPimCodegen`.
+- `--core-count=<N>` - required positive core count for PIM compilation.
+- `--crossbar-size=<N>` - crossbar width/height. Default in code is `2`.
+- `--crossbar-count=<N>` - crossbars per core. Default in code is `256`.
+- `--pim-merge-scheduler=peft` - merge scheduler. `peft` is the only accepted
+  value in the current code.
+- `--pim-only-codegen` - assume input is already bufferized PIM IR and only run
+  the codegen tail.
+- `--pim-emit-json` - also emit `core_*.json` instruction files alongside
+  `core_*.pim`.
+- `--use-experimental-conv-impl` - use the alternate convolution lowering.
+- `--ignore-concat-error` - soft-fail a `ConcatOp` corner case.
+
+Example:
+
+```bash
+./build_release/Release/bin/onnx-mlir model.onnx -o /tmp/raptor/model \
+  --maccel=PIM --EmitPimCodegen \
+  --crossbar-size=2048 --crossbar-count=256 --core-count=1000
+```
+
+This writes PIM artifacts under `/tmp/raptor/pim/`.

 ## Validation

-Functional validation lives in `validation/` and drives the Rust
-`pim-simulator` to compare Raptor's output against a reference.
+Functional validation lives in `validation/`. It compiles ONNX models, builds a
+native ONNX-MLIR reference runner, generates random inputs, runs Raptor, runs
+the Rust PIM simulator, and compares outputs.

-Per-operation validation (from `validation/`):
+Python dependencies used by the validation scripts are `numpy`, `onnx`, and
+`colorama`. The simulator requires the Rust toolchain.

-```
-validate.py \
-    --raptor-path ../cmake-build-release/Release/bin/onnx-mlir \
-    --onnx-include-dir ../onnx-mlir/include \
+Per-operation validation from the repository root:
+
+```bash
+python3 validation/validate.py \
+  --raptor-path build_release/Release/bin/onnx-mlir \
+  --onnx-include-dir onnx-mlir/include \
  --core-count 1000
 ```

-End-to-end network validation (example: first 4 layers of YOLOv11n):
+Validate one network or a subset by pointing `--operations-dir` at any directory
+containing `.onnx` files:

-```
-validate.py \
-    --raptor-path ../cmake-build-release/Release/bin/onnx-mlir \
-    --onnx-include-dir ../onnx-mlir/include \
-    --operations-dir ./networks/yolo11n/depth_04 \
+```bash
+python3 validation/validate.py \
+  --raptor-path build_release/Release/bin/onnx-mlir \
+  --onnx-include-dir onnx-mlir/include \
+  --operations-dir validation/networks/yolo11n/depth_04 \
  --crossbar-size 2048 --crossbar-count 256 --core-count 1000
 ```

-Each validation run writes debugging artifacts into the benchmark's workspace
-directory (for example `validation/operations/gemm/small/`):
- `inputs/` — generated input CSVs used for the run.
- `outputs/` — reference outputs dumped by the native ONNX runner.
- `raptor/` — compiler artifacts:
-  `*.onnx.mlir`, `dialects/spatial0.mlir`, `dialects/spatial1_dcp_merged.mlir`,
-  `dialects/pim0.mlir`, `dialects/pim1_buff.mlir`, `dialects/pim2_coalesced.mlir`,
-  `dialects/pim3_folded.mlir`, `dialects/pim4_materialized.mlir`,
-  `pim/config.json`, `pim/core_*.pim`, `pim/memory.bin`, and reports under
-  `raptor/reports/` such as `dcp_merge_report.txt`,
-  `memory_report.txt`, and `static_memory_coalescing_report.txt`.
- `runner/` — generated reference runner source, build tree, and shared library.
- `simulation/out.bin` — raw simulator output dump used for output comparison.
+Useful validation options:
+- `--simulator-dir <path>` - override the auto-detected
+  `backend-simulators/pim/pim-simulator` path.
+- `--threshold <float>` - maximum allowed per-element output difference.
+- `--seed <int>` - RNG seed for generated inputs.
+- `--command-timeout-seconds <float>` - timeout for compiler, runner, and
+  simulator subprocesses.
+- `--verbose` - print subprocess logs and average PIM pass timings.
+- `--clean` - remove generated validation artifacts and exit.

-That means you usually do not need to rerun standalone `--EmitSpatial` or
-`--EmitPim` commands while debugging validation failures: the per-pass dialect
-dumps are already available under `raptor/dialects/`.
+Each validation run writes artifacts in the model workspace, for example under
+`validation/operations/gemm/small/`:
+- `inputs/` - generated input CSV files.
+- `outputs/` - native ONNX-MLIR reference outputs.
+- `raptor/` - compiler artifacts, including `*.onnx.mlir`, dialect dumps under
+  `dialects/`, reports under `reports/`, and final PIM artifacts under `pim/`.
+- `runner/` - generated reference runner source, build tree, and shared library.
+- `simulation/out.bin` - raw simulator output used for comparison.

-The validator does not currently expose a simulator tracing flag, but once a
-validation has produced `raptor/pim/` you can rerun the simulator manually with
-tracing enabled:
+The compiler currently dumps dialect snapshots such as `spatial0.mlir`,
+`spatial1_dcp_merged.mlir`, `pim0.mlir`, `pim1_buff.mlir`,
+`pim2_coalesced.mlir`, `pim3_folded.mlir`, and
+`pim4_materialized.mlir` when an output directory is available.
+
+To rerun the simulator manually with tracing after validation has produced a
+`raptor/pim/` directory:

 ```bash
 cd backend-simulators/pim/pim-simulator
@@ -174,90 +184,138 @@ cargo run --no-default-features --features tracing --release \
 ```

 With `--features tracing`, the simulator writes per-core traces as
-`simulation/TraceCore0`, `simulation/TraceCore1`, ... next to `simulation/out.bin`.
-The validator normally computes the `-d` dump ranges from `raptor/pim/config.json`
-and the model output shapes. If you need a clean slate before rerunning, use:
+`TraceCore0`, `TraceCore1`, ... next to `out.bin`. The validator normally
+computes the `-d` ranges from `raptor/pim/config.json` and model output shapes.
+
+Available validation networks under `validation/networks/`: `vgg16`,
+`yolo11n`, `yolo11nv2`.
+
+Available operation suites under `validation/operations/`: `add`, `concat`,
+`conv`, `div`, `gather`, `gemm`, `gemv`, `matmul`, `mul`, `pool`,
+`reduce_mean`, `relu`, `reshape`, `resize`, `sigmoid`, `softmax`, `split`.
+
+Generated operation tests can be regenerated with:

 ```bash
-validate.py --clean
+python3 validation/operations/gen_tests.py
 ```

-Available networks under `validation/networks/`: `vgg16`, `yolo11n`.
-Available operations under `validation/operations/`: `add`, `conv`, `div`,
-`gather`, `gemm`, `gemv`, `mul`, `pool`, `reduce_mean`, `relu`, `resize`,
-`sigmoid`, `softmax`, `split`.
-
-## Rebuilding
-
-Release build (fast):
-
-```
-cmake --build /home/nico/raptor/raptor/cmake-build-release --target onnx-mlir -j 30
-```
-
-A slower debug build is also available — configure it the same way but with
-`-DCMAKE_BUILD_TYPE=Debug` (see installation instructions below).
-
 ## Build

+Initialize submodules first:
+
+```bash
+git submodule update --init --recursive
+```
+
+The project follows ONNX-MLIR's build requirements. The CI workflow documents
+the currently used versions and setup:
+- CMake 4.3.0,
+- LLVM/MLIR checked out under `onnx-mlir/llvm-project`,
+- Protobuf `v34.0`,
+- Rust stable for `pim-simulator`,
+- Python packages `numpy`, `onnx`, `colorama` for validation.
+
 ### Protobuf

-Use the following commands to install protobuf:
-```
+Install Protobuf if your system does not already provide a compatible version:
+
+```bash
 git clone --depth 1 --branch v34.0 https://github.com/protocolbuffers/protobuf
-cd protobuf
-mkdir build
-cd build
-cmake .. -G Ninja -DCMAKE_BUILD_TYPE=Release
-ninja
-sudo ninja install
+cmake -S protobuf -B protobuf/build -G Ninja \
+  -DCMAKE_BUILD_TYPE=Release \
+  -Dprotobuf_BUILD_TESTS=OFF
+cmake --build protobuf/build
+sudo cmake --install protobuf/build
 ```

-You can now remove the protobuf repo directory with:
-```
-cd ../..
+You can then remove the temporary checkout:
+
+```bash
 rm -rf protobuf
 ```

-### Mlir
+### MLIR

-Follow the first part of instructions [here](onnx-mlir/docs/BuildOnLinuxOSX.md) to build mlir.
+Follow the ONNX-MLIR instructions in
+`onnx-mlir/docs/BuildOnLinuxOSX.md` to build LLVM/MLIR. The local Raptor build
+expects `MLIR_DIR` to point at the MLIR CMake package, for example:

-Remember to set ```-DCMAKE_BUILD_TYPE=Debug``` for developing on Raptor
-
-Moreover, if compiling with build type debug, it is also suggested to use
-mold as linker (you will need to install it if you don't have it already)
-to reduce memory usage during linking. You can use it by setting the options:
-```
-DLLVM_USE_LINKER=mold
+```bash
+MLIR_DIR=$(pwd)/onnx-mlir/llvm-project/build_release/lib/cmake/mlir
 ```

+If your LLVM build directory is named `build` instead of `build_release`, adjust
+the path accordingly.
+
 ### Raptor

-Use the following commands to build Raptor.
+Configure a release build:

-Remember to set ```-DCMAKE_BUILD_TYPE=Debug``` for developing on Raptor.
-
-Also in this case, it is suggested to use mold as linker to reduce link time and memory usage,
-setting the options:
-```
-DCMAKE_EXE_LINKER_FLAGS="-fuse-ld=mold" \
-DCMAKE_SHARED_LINKER_FLAGS="-fuse-ld=mold" \
-DCMAKE_MODULE_LINKER_FLAGS="-fuse-ld=mold"
-```
-
-```
-git submodule update --init --recursive
-
-MLIR_DIR=$(pwd)/onnx-mlir/llvm-project/build/lib/cmake/mlir
-mkdir build && cd build
-cmake .. -G Ninja \
+```bash
+MLIR_DIR=$(pwd)/onnx-mlir/llvm-project/build_release/lib/cmake/mlir
+cmake -S . -B build_release -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DONNX_MLIR_ACCELERATORS=PIM \
  -DLLVM_ENABLE_ASSERTIONS=ON \
  -DMLIR_DIR=${MLIR_DIR}
-cmake --build .
 ```

-If the build fails because of protobuf missing uint definitions,
-just patch the problematic files by adding ```#include <cstdint>``` to their includes.
+Configure a debug build similarly:
+
+```bash
+MLIR_DIR=$(pwd)/onnx-mlir/llvm-project/build_debug/lib/cmake/mlir
+cmake -S . -B build_debug -G Ninja \
+  -DCMAKE_BUILD_TYPE=Debug \
+  -DONNX_MLIR_ACCELERATORS=PIM \
+  -DLLVM_ENABLE_ASSERTIONS=ON \
+  -DMLIR_DIR=${MLIR_DIR}
+```
+
+For debug development, using `mold` can reduce link time and memory use:
+
+```bash
+cmake -S . -B build_debug -G Ninja \
+  -DCMAKE_BUILD_TYPE=Debug \
+  -DONNX_MLIR_ACCELERATORS=PIM \
+  -DLLVM_ENABLE_ASSERTIONS=ON \
+  -DMLIR_DIR=${MLIR_DIR} \
+  -DCMAKE_EXE_LINKER_FLAGS="-fuse-ld=mold" \
+  -DCMAKE_SHARED_LINKER_FLAGS="-fuse-ld=mold" \
+  -DCMAKE_MODULE_LINKER_FLAGS="-fuse-ld=mold"
+```
+
+Build the compiler with CMake:
+
+```bash
+cmake --build ./build_release
+cmake --build ./build_debug
+```
+
+Do not invoke `ninja` directly for this project; use `cmake --build` so CMake's
+configuration and generated shims stay consistent.
+
+If a build fails because Protobuf headers are missing fixed-width integer
+definitions, patch the affected Protobuf-generated files by adding
+`#include <cstdint>`.
+
+## Tests
+
+The Rust simulator has its own tests:
+
+```bash
+cd backend-simulators/pim/pim-simulator
+cargo test
+```
+
+## Repository layout
+
+- `src/PIM/` - PIM accelerator implementation.
+- `test/PIM/` - PIM C++ unit tests.
+- `validation/` - functional validation scripts, ONNX operation tests, network
+  slices, and pimsim config generation.
+- `backend-simulators/pim/pim-simulator/` - in-tree Rust functional simulator.
+- `backend-simulators/pim/pimsim-nn/` - performance simulator submodule.
+- `pimcomp_utils/` - local comparison helpers for PIMCOMP-NN.
+- `.github/actions/` and `.github/workflows/validate_operations.yml` - CI setup
+  for MLIR/Protobuf caching, building Raptor, and validation.
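
The validation workspace layout described in the README changes above can be sanity-checked with a short script. The following is a minimal sketch and is not part of this commit: the helper name `missing_artifacts` is hypothetical, and the placement of `memory.bin` and `core_*.pim` under `raptor/pim/` is an assumption drawn from the artifact lists in the README.

```python
from pathlib import Path

# Workspace entries listed in the README's Validation section.
EXPECTED_DIRS = ("inputs", "outputs", "raptor", "runner")
# Assumption: memory.bin sits next to config.json under raptor/pim/.
EXPECTED_FILES = (
    "simulation/out.bin",
    "raptor/pim/config.json",
    "raptor/pim/memory.bin",
)


def missing_artifacts(workspace):
    """Return the expected artifacts that are absent from `workspace`.

    Hypothetical helper for illustration; the validator itself may check
    its outputs differently.
    """
    root = Path(workspace)
    missing = [name + "/" for name in EXPECTED_DIRS if not (root / name).is_dir()]
    missing += [name for name in EXPECTED_FILES if not (root / name).is_file()]

    # At least one binary per-core instruction file is expected.
    pim_dir = root / "raptor" / "pim"
    if not (pim_dir.is_dir() and any(pim_dir.glob("core_*.pim"))):
        missing.append("raptor/pim/core_*.pim")
    return missing
```

Calling `missing_artifacts("validation/operations/gemm/small")` after a validation run should return an empty list if the layout matches the README. Any returned entry names what to look for before rerunning the simulator by hand.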