Update README and AGENTS

2026-05-27 15:09:30 +02:00
parent c6b02af7a9
commit 013ae0ac2a
2 changed files with 243 additions and 185 deletions
@@ -1,7 +1,7 @@
 - Always read the full README.md before doing anything.
 - Build commands:
-    - `cmake --build ./build_release --target onnx-mlir -j 30`
+    - `cmake --build ./build_release`
-    - `cmake --build ./build_debug --target onnx-mlir -j 30`
+    - `cmake --build ./build_debug`
 - Never use `ninja` directly: it bypasses cmake's configuration and invalidates the build cache.
 # Code changes
@@ -1,168 +1,178 @@
 # Raptor
-Raptor is a domain-specific MLIR compiler for neural networks (ONNX format)
+Raptor is a domain-specific MLIR compiler for neural networks in ONNX format,
-targeting in-memory computing / processing-in-memory (PIM) architectures.
+targeting in-memory computing / processing-in-memory (PIM) architectures. It
-It progressively lowers ONNX-MLIR through a set of MLIR dialects down to
+extends ONNX-MLIR with a PIM accelerator and progressively lowers ONNX-MLIR
-target-specific artifacts (currently JSON code for the `pimsim-nn` simulator).
+through custom MLIR dialects to simulator artifacts.
 The current target is the PIM simulator stack under `backend-simulators/pim`.
 Raptor emits binary per-core `.pim` instruction files by default, plus
 `memory.bin`, `config.json`, and weight binaries. It can also emit per-core JSON
 instruction files with `--pim-emit-json`.
 ## Overview
-PIM architectures perform most of the computation directly in memory.
+PIM architectures perform most computation directly in memory. The supported
-Raptor's first supported target is `pimsim-nn`, which simulates a chip with:
+target models a chip with:
- a shared host memory,
+- shared host memory,
- a number of cores that do most of the computation directly in their memory
+- multiple PIM cores,
-  (vector ops, vmm/mvm on ReRAM crossbars),
+- ReRAM crossbars for vector-matrix / matrix-vector work,
- no branching instructions (branchless architecture) and no hardware loop
+- explicit communication between cores,
-  support — any repeated work (e.g. convolutions) must be unrolled into
+- no hardware branch or loop support in emitted simulator code.
  explicit per-iteration instructions.
-Because of this, the amount of emitted instructions explodes quickly and the
+Because repeated work such as convolutions is eventually made explicit, emitted
-compiler must optimize aggressively at every stage to keep compilation
+instruction counts can grow quickly. Most compiler work therefore focuses on
-tractable.
+lowering, scheduling, memory layout, and code-generation optimizations.
 A second target, `PulPim`, is planned for an accelerator with RISC-V cores
 each carrying its own in-memory computing unit and crossbars. It will live in
 a dedicated dialect (future work).
 ### Targets and simulators
-`pimsim-nn` (under `backend-simulators/pim/pimsim-nn`) is used for
+- `backend-simulators/pim/pim-simulator` is the in-tree Rust functional
-**performance** estimates (latency, energy), but does not functionally execute
+  simulator used by validation. It reads Raptor's `pim/` artifact directory and
-the JSON code it consumes. To validate the numerical correctness of the JSON
+  compares simulator output against native ONNX-MLIR execution.
-code produced by Raptor (or, for comparison, by the `pimcomp` compiler), we use
+- `backend-simulators/pim/pimsim-nn` is the performance simulator submodule.
-a Rust simulator we maintain in-tree at
+  The helper scripts in `pimcomp_utils/` are for comparison with PIMCOMP-NN and
-`backend-simulators/pim/pim-simulator`.
+  contain local paths; treat them as local utilities, not portable workflows.
 ## Compilation pipeline
-The PIM-related sources live under `src/PIM` and the tests under `test/PIM`.
+The PIM sources live under `src/PIM` and tests under `test/PIM`. CMake exposes
-When working on this codebase, most changes should stay confined to those
+them to ONNX-MLIR through generated shim directories under
-trees (you only need to look outside, e.g. at `onnx-mlir` or `llvm`, for
+`onnx-mlir/src/Accelerators/PIM` and `onnx-mlir/test/accelerators/PIM`.
 framework-level details).
 High-level lowering flow:
 ```
-ONNX-MLIR ──► Spatial ──► Pim (tensor) ──► Pim (bufferized) ──► PIM code
+ONNX-MLIR -> Spatial -> Pim (tensor) -> Pim (bufferized) -> PIM artifacts
 ```
-1. **ONNX → Spatial** (`src/PIM/Conversion/ONNXToSpatial`).
+1. **ONNX -> Spatial** (`src/PIM/Conversion/ONNXToSpatial`).
-   Lowers ONNX ops into the `spat` dialect (`src/PIM/Dialect/Spatial`).
+   Lowers supported ONNX ops into the `spat` dialect
-   Spatial models a high-level spatial in-memory accelerator: vmm/mvm
+   (`src/PIM/Dialect/Spatial`). Conversion patterns are split by op family under
-   operations are accelerated by storing a constant RHS matrix into a
+   `Patterns/{Math,NN,Tensor}` and currently cover Conv, Gemm, MatMul,
-   crossbar. Crossbars cannot be re-programmed during execution, have a
+   elementwise Add/Mul/Div, ReduceMean, pooling, Relu, Sigmoid, Softmax,
-   limited fixed size, and there is a limited number of them per core.
+   Concat, Gather, Reshape, Resize, and Split.
   Conversion patterns are split by op family under
   `Conversion/ONNXToSpatial/Patterns/{Math,NN,Tensor}` (Conv, Gemm, MatMul,
   Elementwise, ReduceMean, Pool, Relu, Sigmoid, Softmax, Concat, Gather,
   Reshape, Resize, Split, etc...).
-2. **Spatial → Pim** (`src/PIM/Conversion/SpatialToPim`).
+2. **Merge compute nodes**
-   Lowers Spatial to the `pim` dialect (`src/PIM/Dialect/Pim`), which
+   (`src/PIM/Dialect/Spatial/Transforms/MergeComputeNodes`).
-   materializes PIM cores (`pim.core`), inter-core communication
+   Builds a compute graph, schedules it with the PEFT scheduler, and materializes
-   (`pim.send` / `pim.receive`), halts, and crossbar-level operations.
+   the merge schedule into Spatial IR. Supporting scheduling code lives under
   `MergeComputeNodes/Scheduling`.
-3. **Merge compute nodes** (`src/PIM/Dialect/Spatial/Transforms/MergeComputeNodes`).
+3. **Spatial -> Pim** (`src/PIM/Conversion/SpatialToPim`).
-   A PEFT heuristic that coarsens the virtual node graph and decides how to group compute
+   Lowers Spatial operations to the `pim` dialect (`src/PIM/Dialect/Pim`),
-   nodes onto cores. Our implementation is only DCP-*inspired*: it is a
+   including `pim.core`, `pim.core_batch`, communication, tensor packing, global
-   heuristic with different assumptions from the paper (different cost
+   tensor materialization, and return-path normalization.
   model, constraints from crossbar capacity / core resources, and a
   windowed coarsening loop instead of full-graph reprioritization). The
   `dcp-critical-window-size` option controls how many lowest-slack virtual
   nodes each coarsening iteration considers (0 = legacy full-graph
   analysis). Related sources: `DCPGraph/DCPAnalysis.cpp`, `Graph.cpp/.hpp`,
   `MergeComputeNodesPass.cpp`.
 4. **Bufferization** (`src/PIM/Dialect/Pim/Transforms/Bufferization`).
-   Converts tensor-semantics PIM IR into memref-semantics PIM IR using the
+   Converts tensor-semantics PIM IR into memref-semantics PIM IR using MLIR's
-   standard MLIR `BufferizableOpInterface` machinery
+   bufferization interfaces.
   (`OpBufferizationInterfaces.*`, `PimBufferization.td`).
-5. **Static memory coalescing** (`src/PIM/Dialect/Pim/Transforms/StaticMemoryCoalescing`).
+5. **Static memory coalescing**
-   Conservatively reuses same-typed local memref allocations inside PIM cores
+   (`src/PIM/Dialect/Pim/Transforms/StaticMemoryCoalescing`).
-   after bufferization and before code generation.
+   Reuses compatible local memref allocations inside PIM cores before codegen.
-6. **PIM code generation** (`src/PIM/Pass/PimCodegen`):
+6. **PIM code generation** (`src/PIM/Pass/PimCodegen` and
-   - `HostConstantFolding` — folds host-side constants.
+   `src/PIM/Compiler`).
-   - `MaterializeHostConstantsPass` — materializes the remaining host
+   Folds host constants, materializes remaining host constants, verifies PIM IR,
-     constants for emission.
+   emits `.pim` core files, writes weights, and writes `memory.bin` /
-   - `VerificationPass` — checks invariants before emission.
+   `config.json`.
   - `EmitPimJsonPass` — emits the final PIM JSON consumed by `pimsim-nn`
     and `pim-simulator`.
 Supporting pieces:
- `src/PIM/Compiler` — PIM-specific compiler options (crossbar size/count,
+- `src/PIM/Common` - shared IR, filesystem, diagnostics, reports, and utility
-  core count, DCP window, experimental conv impl, concat error handling, …)
+  helpers.
-  and `PimCodeGen` entry points.
+- `src/PIM/Compiler` - PIM compiler options, memory/address planning, binary
- `src/PIM/Common` — shared utilities (`PimCommon`, `LabeledList`).
+  instruction format, artifact writing, weight emission, and codegen entry
- `src/PIM/Pass` — auxiliary passes (`MessagePass`)
+  points.
-  and the `PIMPasses.h` registry used by `PimAccelerator`.
+- `src/PIM/Conversion/SpatialToGraphviz` - optional Spatial Graphviz conversion
- `src/PIM/PimAccelerator.{cpp,hpp}` — accelerator entry point: registers
+  pass.
-  dialects, passes, and plugs Raptor into the ONNX-MLIR driver.
+- `src/PIM/Pass` - pass registration and auxiliary passes.
 - `src/PIM/PimAccelerator.{cpp,hpp}` - ONNX-MLIR accelerator entry point.
 ## Key compiler options
-Pass these on the `onnx-mlir` command line when compiling for PIM:
+Pass these to `onnx-mlir` when compiling for PIM:
- `--maccel=PIM` — select the PIM accelerator.
+- `--maccel=PIM` - select the PIM accelerator.
- `--EmitSpatial` / `--EmitPim` / `--EmitPimBufferized` / `--EmitPimCodegen`
+- `--EmitSpatial`, `--EmitPim`, `--EmitPimBufferized`,
-  — stop the pipeline at the requested stage (default: `EmitPimCodegen`).
+  `--EmitPimCodegen` - stop the PIM pipeline at the requested stage. The PIM
- `--pim-only-codegen` — assume the input is already bufferized PIM IR and
+  default is `--EmitPimCodegen`.
-  run only the codegen tail.
+- `--core-count=<N>` - required positive core count for PIM compilation.
- `--crossbar-size=<N>` / `--crossbar-count=<N>` — crossbar dimensions and
+- `--crossbar-size=<N>` - crossbar width/height. Default in code is `2`.
-  per-core count.
+- `--crossbar-count=<N>` - crossbars per core. Default in code is `256`.
- `--core-count=<N>` — number of cores. Required for PIM compilation.
+- `--pim-merge-scheduler=peft` - merge scheduler. `peft` is the only accepted
- `--pim-merge-scheduler={peft,dcp}` — scheduler used by the Spatial
+  value in the current code.
-  merge-compute-nodes pass (default: `peft`).
+- `--pim-only-codegen` - assume input is already bufferized PIM IR and only run
- `--dcp-critical-window-size=<N>` — DCP coarsening window (0 = legacy).
+  the codegen tail.
- `--use-experimental-conv-impl` — alternative convolution lowering.
+- `--pim-emit-json` - also emit `core_*.json` instruction files alongside
- `--ignore-concat-error` — soft-fail corner case in `ConcatOp`.
+  `core_*.pim`.
 - `--use-experimental-conv-impl` - use the alternate convolution lowering.
 - `--ignore-concat-error` - soft-fail a ConcatOp corner case.
 Example:
 ```bash
 ./build_release/Release/bin/onnx-mlir model.onnx -o /tmp/raptor/model \
  --maccel=PIM --EmitPimCodegen \
  --crossbar-size=2048 --crossbar-count=256 --core-count=1000
 ```
 This writes PIM artifacts under `/tmp/raptor/pim/`.
 ## Validation
-Functional validation lives in `validation/` and drives the Rust
+Functional validation lives in `validation/`. It compiles ONNX models, builds a
-`pim-simulator` to compare Raptor's output against a reference.
+native ONNX-MLIR reference runner, generates random inputs, runs Raptor, runs
 the Rust PIM simulator, and compares outputs.
-Per-operation validation (from `validation/`):
+Python dependencies used by the validation scripts are `numpy`, `onnx`, and
 `colorama`. The simulator requires the Rust toolchain.
-```
+Per-operation validation from the repository root:
-validate.py \
+
-    --raptor-path ../cmake-build-release/Release/bin/onnx-mlir \
+```bash
-    --onnx-include-dir ../onnx-mlir/include \
+python3 validation/validate.py \
-    --core-count 1000
+  --raptor-path build_release/Release/bin/onnx-mlir \
  --onnx-include-dir onnx-mlir/include \
  --core-count 1000
 ```
-End-to-end network validation (example: first 4 layers of YOLOv11n):
+Validate one network or a subset by pointing `--operations-dir` at any directory
 containing `.onnx` files:
-```
+```bash
-validate.py \
+python3 validation/validate.py \
-    --raptor-path ../cmake-build-release/Release/bin/onnx-mlir \
+  --raptor-path build_release/Release/bin/onnx-mlir \
-    --onnx-include-dir ../onnx-mlir/include \
+  --onnx-include-dir onnx-mlir/include \
-    --operations-dir ./networks/yolo11n/depth_04 \
+  --operations-dir validation/networks/yolo11n/depth_04 \
-    --crossbar-size 2048 --crossbar-count 256 --core-count 1000
+  --crossbar-size 2048 --crossbar-count 256 --core-count 1000
 ```
-Each validation run writes debugging artifacts into the benchmark's workspace
+Useful validation options:
-directory (for example `validation/operations/gemm/small/`):
+- `--simulator-dir <path>` - override the auto-detected
- `inputs/` — generated input CSVs used for the run.
+  `backend-simulators/pim/pim-simulator` path.
- `outputs/` — reference outputs dumped by the native ONNX runner.
+- `--threshold <float>` - maximum allowed per-element output difference.
- `raptor/` — compiler artifacts:
+- `--seed <int>` - RNG seed for generated inputs.
-  `*.onnx.mlir`, `dialects/spatial0.mlir`, `dialects/spatial1_dcp_merged.mlir`,
+- `--command-timeout-seconds <float>` - timeout for compiler, runner, and
-  `dialects/pim0.mlir`, `dialects/pim1_buff.mlir`, `dialects/pim2_coalesced.mlir`,
+  simulator subprocesses.
-  `dialects/pim3_folded.mlir`, `dialects/pim4_materialized.mlir`,
+- `--verbose` - print subprocess logs and average PIM pass timings.
-  `pim/config.json`, `pim/core_*.pim`, `pim/memory.bin`, and reports under
+- `--clean` - remove generated validation artifacts and exit.
  `raptor/reports/` such as `dcp_merge_report.txt`,
  `memory_report.txt`, and `static_memory_coalescing_report.txt`.
 - `runner/` — generated reference runner source, build tree, and shared library.
 - `simulation/out.bin` — raw simulator output dump used for output comparison.
-That means you usually do not need to rerun standalone `--EmitSpatial` or
+Each validation run writes artifacts in the model workspace, for example under
-`--EmitPim` commands while debugging validation failures: the per-pass dialect
+`validation/operations/gemm/small/`:
-dumps are already available under `raptor/dialects/`.
+- `inputs/` - generated input CSV files.
 - `outputs/` - native ONNX-MLIR reference outputs.
 - `raptor/` - compiler artifacts, including `*.onnx.mlir`, dialect dumps under
  `dialects/`, reports under `reports/`, and final PIM artifacts under `pim/`.
 - `runner/` - generated reference runner source, build tree, and shared library.
 - `simulation/out.bin` - raw simulator output used for comparison.
-The validator does not currently expose a simulator tracing flag, but once a
+The compiler currently dumps dialect snapshots such as `spatial0.mlir`,
-validation has produced `raptor/pim/` you can rerun the simulator manually with
+`spatial1_dcp_merged.mlir`, `pim0.mlir`, `pim1_buff.mlir`,
-tracing enabled:
+`pim2_coalesced.mlir`, `pim3_folded.mlir`, and
 `pim4_materialized.mlir` when an output directory is available.
 To rerun the simulator manually with tracing after validation has produced a
 `raptor/pim/` directory:
 ```bash
 cd backend-simulators/pim/pim-simulator
@@ -174,90 +184,138 @@ cargo run --no-default-features --features tracing --release \
 ```
 With `--features tracing`, the simulator writes per-core traces as
-`simulation/TraceCore0`, `simulation/TraceCore1`, ... next to `simulation/out.bin`.
+`TraceCore0`, `TraceCore1`, ... next to `out.bin`. The validator normally
-The validator normally computes the `-d` dump ranges from `raptor/pim/config.json`
+computes the `-d` ranges from `raptor/pim/config.json` and model output shapes.
-and the model output shapes. If you need a clean slate before rerunning, use:
+
 Available validation networks under `validation/networks/`: `vgg16`,
 `yolo11n`, `yolo11nv2`.
 Available operation suites under `validation/operations/`: `add`, `concat`,
 `conv`, `div`, `gather`, `gemm`, `gemv`, `matmul`, `mul`, `pool`,
 `reduce_mean`, `relu`, `reshape`, `resize`, `sigmoid`, `softmax`, `split`.
 Generated operation tests can be regenerated with:
 ```bash
-validate.py --clean
+python3 validation/operations/gen_tests.py
 ```
 Available networks under `validation/networks/`: `vgg16`, `yolo11n`.
 Available operations under `validation/operations/`: `add`, `conv`, `div`,
 `gather`, `gemm`, `gemv`, `mul`, `pool`, `reduce_mean`, `relu`, `resize`,
 `sigmoid`, `softmax`, `split`.
 ## Rebuilding
 Release build (fast):
 ```
 cmake --build /home/nico/raptor/raptor/cmake-build-release --target onnx-mlir -j 30
 ```
 A slower debug build is also available — configure it the same way but with
 `-DCMAKE_BUILD_TYPE=Debug` (see installation instructions below).
 ## Build
 Initialize submodules first:
 ```bash
 git submodule update --init --recursive
 ```
 The project follows ONNX-MLIR's build requirements. The CI workflow documents
 the currently used versions and setup:
 - CMake 4.3.0 in CI,
 - LLVM/MLIR checked out under `onnx-mlir/llvm-project`,
 - Protobuf `v34.0`,
 - Rust stable for `pim-simulator`,
 - Python packages `numpy`, `onnx`, `colorama` for validation.
 ### Protobuf
-Use the following commands to install protobuf:
+Install Protobuf if your system does not already provide a compatible version:
-```
+
 ```bash
 git clone --depth 1 --branch v34.0 https://github.com/protocolbuffers/protobuf
-cd protobuf
+cmake -S protobuf -B protobuf/build -G Ninja \
-mkdir build
+  -DCMAKE_BUILD_TYPE=Release \
-cd build
+  -Dprotobuf_BUILD_TESTS=OFF
-cmake .. -G Ninja -DCMAKE_BUILD_TYPE=Release
+cmake --build protobuf/build
-ninja
+sudo cmake --install protobuf/build
 sudo ninja install
 ```
-You can now remove the protobuf repo directory with:
+You can then remove the temporary checkout:
-```
+
-cd ../..
+```bash
 rm -rf protobuf
 ```
-### Mlir
+### MLIR
-Follow the first part of instructions [here](onnx-mlir/docs/BuildOnLinuxOSX.md) to build mlir.
+Follow the ONNX-MLIR instructions in
 `onnx-mlir/docs/BuildOnLinuxOSX.md` to build LLVM/MLIR. The local Raptor build
 expects `MLIR_DIR` to point at the MLIR CMake package, for example:
-Remember to set ```-DCMAKE_BUILD_TYPE=Debug``` for developing on Raptor
+```bash
-
+MLIR_DIR=$(pwd)/onnx-mlir/llvm-project/build_release/lib/cmake/mlir
 Moreover, if compiling with build type debug, it is also suggested to use
 mold as linker (you will need to install it if you don't have it already)
 to reduce memory usage during linking. You can use it by setting the options:
 ```
 -DLLVM_USE_LINKER=mold
 ```
 If your LLVM build directory is named `build` instead of `build_release`, adjust
 the path accordingly.
 ### Raptor
-Use the following commands to build Raptor.
+Configure a release build:
-Remember to set ```-DCMAKE_BUILD_TYPE=Debug``` for developing on Raptor.
+```bash
-
+MLIR_DIR=$(pwd)/onnx-mlir/llvm-project/build_release/lib/cmake/mlir
-Also in this case, it is suggested to use mold as linker to reduce link time and memory usage,
+cmake -S . -B build_release -G Ninja \
 setting the options:
 ```
 -DCMAKE_EXE_LINKER_FLAGS="-fuse-ld=mold" \
 -DCMAKE_SHARED_LINKER_FLAGS="-fuse-ld=mold" \
 -DCMAKE_MODULE_LINKER_FLAGS="-fuse-ld=mold"
 ```
 ```
 git submodule update --init --recursive
 MLIR_DIR=$(pwd)/onnx-mlir/llvm-project/build/lib/cmake/mlir
 mkdir build && cd build
 cmake .. -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DONNX_MLIR_ACCELERATORS=PIM \
  -DLLVM_ENABLE_ASSERTIONS=ON \
  -DMLIR_DIR=${MLIR_DIR}
 cmake --build .
 ```
-If the build fails because of protobuf missing uint definitions,
+Configure a debug build similarly:
-just patch the problematic files by adding ```#include <cstdint>``` to their includes.
+
 ```bash
 MLIR_DIR=$(pwd)/onnx-mlir/llvm-project/build_debug/lib/cmake/mlir
 cmake -S . -B build_debug -G Ninja \
  -DCMAKE_BUILD_TYPE=Debug \
  -DONNX_MLIR_ACCELERATORS=PIM \
  -DLLVM_ENABLE_ASSERTIONS=ON \
  -DMLIR_DIR=${MLIR_DIR}
 ```
 For debug development, using `mold` can reduce link time and memory use:
 ```bash
 cmake -S . -B build_debug -G Ninja \
  -DCMAKE_BUILD_TYPE=Debug \
  -DONNX_MLIR_ACCELERATORS=PIM \
  -DLLVM_ENABLE_ASSERTIONS=ON \
  -DMLIR_DIR=${MLIR_DIR} \
  -DCMAKE_EXE_LINKER_FLAGS="-fuse-ld=mold" \
  -DCMAKE_SHARED_LINKER_FLAGS="-fuse-ld=mold" \
  -DCMAKE_MODULE_LINKER_FLAGS="-fuse-ld=mold"
 ```
 Build the compiler with CMake:
 ```bash
 cmake --build ./build_release
 cmake --build ./build_debug
 ```
 Do not invoke `ninja` directly for this project; use `cmake --build` so CMake's
 configuration and generated shims stay consistent.
 If a build fails because Protobuf headers are missing fixed-width integer
 definitions, patch the affected Protobuf-generated files by adding
 `#include <cstdint>`.
 ## Tests
 The Rust simulator has its own tests:
 ```bash
 cd backend-simulators/pim/pim-simulator
 cargo test
 ```
 ## Repository Layout
 - `src/PIM/` - PIM accelerator implementation.
 - `test/PIM/` - PIM C++ unit tests.
 - `validation/` - functional validation scripts, ONNX operation tests, network
  slices, and pimsim config generation.
 - `backend-simulators/pim/pim-simulator/` - in-tree Rust functional simulator.
 - `backend-simulators/pim/pimsim-nn/` - performance simulator submodule.
 - `pimcomp_utils/` - local comparison helpers for PIMCOMP-NN.
 - `.github/actions/` and `.github/workflows/validate_operations.yml` - CI setup
  for MLIR/Protobuf caching, building Raptor, and validation.
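
The updated README describes `--threshold <float>` as the maximum allowed per-element output difference between the reference run and the simulator. The check that description implies could be sketched as follows. This is only an illustration and not the code in `validation/validate.py`: the function names, the assumption that outputs are little-endian float32 values, and the sample data are all hypothetical.

```python
import struct


def read_float32_values(raw: bytes) -> list[float]:
    """Decode a byte buffer as little-endian float32 values (assumed layout)."""
    if len(raw) % 4 != 0:
        raise ValueError("buffer length is not a multiple of 4 bytes")
    count = len(raw) // 4
    return list(struct.unpack(f"<{count}f", raw))


def max_abs_difference(reference: list[float], actual: list[float]) -> float:
    """Return the largest per-element absolute difference between two outputs."""
    if len(reference) != len(actual):
        raise ValueError("output lengths differ")
    return max((abs(r - a) for r, a in zip(reference, actual)), default=0.0)


def outputs_match(reference: list[float], actual: list[float], threshold: float) -> bool:
    """True when every element differs by at most `threshold`."""
    return max_abs_difference(reference, actual) <= threshold


# Hypothetical data standing in for a reference output and a simulator dump.
reference = [1.0, 2.0, 3.0]
simulated = read_float32_values(struct.pack("<3f", 1.0, 2.0005, 3.0))
print(outputs_match(reference, simulated, threshold=1e-3))
```

A maximum-difference check like this fails as soon as a single element drifts past the bound, which matches the "per-element" wording in the option description. Whether the real validator uses an absolute or a relative difference is not stated in this change.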