From 013ae0ac2a6de3b563210de05d054262eafc8019 Mon Sep 17 00:00:00 2001 From: ilgeco Date: Wed, 27 May 2026 15:09:30 +0200 Subject: [PATCH] Update README and AGENTS --- AGENTS.md | 4 +- README.md | 424 +++++++++++++++++++++++++++++++----------------------- 2 files changed, 243 insertions(+), 185 deletions(-) diff --git a/AGENTS.md b/AGENTS.md index 37021ff..6e4cdbc 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -1,7 +1,7 @@ - Always read the full README.md before doing anything. - Build commands: - - `cmake --build ./build_release --target onnx-mlir -j 30` - - `cmake --build ./build_debug --target onnx-mlir -j 30` + - `cmake --build ./build_release` + - `cmake --build ./build_debug` - Never use `ninja` directly: it bypasses cmake's configuration and invalidates the build cache. # Code changes diff --git a/README.md b/README.md index 5a52b70..8507a2d 100644 --- a/README.md +++ b/README.md @@ -1,168 +1,178 @@ # Raptor -Raptor is a domain-specific MLIR compiler for neural networks (ONNX format) -targeting in-memory computing / processing-in-memory (PIM) architectures. -It progressively lowers ONNX-MLIR through a set of MLIR dialects down to -target-specific artifacts (currently JSON code for the `pimsim-nn` simulator). +Raptor is a domain-specific MLIR compiler for neural networks in ONNX format, +targeting in-memory computing / processing-in-memory (PIM) architectures. It +extends ONNX-MLIR with a PIM accelerator and progressively lowers ONNX-MLIR +through custom MLIR dialects to simulator artifacts. + +The current target is the PIM simulator stack under `backend-simulators/pim`. +Raptor emits binary per-core `.pim` instruction files by default, plus +`memory.bin`, `config.json`, and weight binaries. It can also emit per-core JSON +instruction files with `--pim-emit-json`. ## Overview -PIM architectures perform most of the computation directly in memory. 
-Raptor's first supported target is `pimsim-nn`, which simulates a chip with: -- a shared host memory, -- a number of cores that do most of the computation directly in their memory - (vector ops, vmm/mvm on ReRAM crossbars), -- no branching instructions (branchless architecture) and no hardware loop - support — any repeated work (e.g. convolutions) must be unrolled into - explicit per-iteration instructions. +PIM architectures perform most computation directly in memory. The supported +target models a chip with: +- shared host memory, +- multiple PIM cores, +- ReRAM crossbars for vector-matrix / matrix-vector work, +- explicit communication between cores, +- no hardware branch or loop support in emitted simulator code. -Because of this, the amount of emitted instructions explodes quickly and the -compiler must optimize aggressively at every stage to keep compilation -tractable. - -A second target, `PulPim`, is planned for an accelerator with RISC-V cores -each carrying its own in-memory computing unit and crossbars. It will live in -a dedicated dialect (future work). +Because repeated work such as convolutions is eventually made explicit, emitted +instruction counts can grow quickly. Most compiler work therefore focuses on +lowering, scheduling, memory layout, and code-generation optimizations. ### Targets and simulators -`pimsim-nn` (under `backend-simulators/pim/pimsim-nn`) is used for -**performance** estimates (latency, energy), but does not functionally execute -the JSON code it consumes. To validate the numerical correctness of the JSON -code produced by Raptor (or, for comparison, by the `pimcomp` compiler), we use -a Rust simulator we maintain in-tree at -`backend-simulators/pim/pim-simulator`. +- `backend-simulators/pim/pim-simulator` is the in-tree Rust functional + simulator used by validation. It reads Raptor's `pim/` artifact directory and + compares simulator output against native ONNX-MLIR execution. 
+- `backend-simulators/pim/pimsim-nn` is the performance simulator submodule. + The helper scripts in `pimcomp_utils/` are for comparison with PIMCOMP-NN and + contain local paths; treat them as local utilities, not portable workflows. ## Compilation pipeline -The PIM-related sources live under `src/PIM` and the tests under `test/PIM`. -When working on this codebase, most changes should stay confined to those -trees (you only need to look outside, e.g. at `onnx-mlir` or `llvm`, for -framework-level details). +The PIM sources live under `src/PIM` and tests under `test/PIM`. CMake exposes +them to ONNX-MLIR through generated shim directories under +`onnx-mlir/src/Accelerators/PIM` and `onnx-mlir/test/accelerators/PIM`. High-level lowering flow: ``` -ONNX-MLIR ──► Spatial ──► Pim (tensor) ──► Pim (bufferized) ──► PIM code +ONNX-MLIR -> Spatial -> Pim (tensor) -> Pim (bufferized) -> PIM artifacts ``` -1. **ONNX → Spatial** (`src/PIM/Conversion/ONNXToSpatial`). - Lowers ONNX ops into the `spat` dialect (`src/PIM/Dialect/Spatial`). - Spatial models a high-level spatial in-memory accelerator: vmm/mvm - operations are accelerated by storing a constant RHS matrix into a - crossbar. Crossbars cannot be re-programmed during execution, have a - limited fixed size, and there is a limited number of them per core. - Conversion patterns are split by op family under - `Conversion/ONNXToSpatial/Patterns/{Math,NN,Tensor}` (Conv, Gemm, MatMul, - Elementwise, ReduceMean, Pool, Relu, Sigmoid, Softmax, Concat, Gather, - Reshape, Resize, Split, etc...). +1. **ONNX -> Spatial** (`src/PIM/Conversion/ONNXToSpatial`). + Lowers supported ONNX ops into the `spat` dialect + (`src/PIM/Dialect/Spatial`). Conversion patterns are split by op family under + `Patterns/{Math,NN,Tensor}` and currently cover Conv, Gemm, MatMul, + elementwise Add/Mul/Div, ReduceMean, pooling, Relu, Sigmoid, Softmax, + Concat, Gather, Reshape, Resize, and Split. -2. **Spatial → Pim** (`src/PIM/Conversion/SpatialToPim`). 
- Lowers Spatial to the `pim` dialect (`src/PIM/Dialect/Pim`), which - materializes PIM cores (`pim.core`), inter-core communication - (`pim.send` / `pim.receive`), halts, and crossbar-level operations. +2. **Merge compute nodes** + (`src/PIM/Dialect/Spatial/Transforms/MergeComputeNodes`). + Builds a compute graph, schedules it with the PEFT scheduler, and materializes + the merge schedule into Spatial IR. Supporting scheduling code lives under + `MergeComputeNodes/Scheduling`. -3. **Merge compute nodes** (`src/PIM/Dialect/Spatial/Transforms/MergeComputeNodes`). - A PEFT heuristic that coarsens the virtual node graph and decides how to group compute - nodes onto cores. Our implementation is only DCP-*inspired*: it is a - heuristic with different assumptions from the paper (different cost - model, constraints from crossbar capacity / core resources, and a - windowed coarsening loop instead of full-graph reprioritization). The - `dcp-critical-window-size` option controls how many lowest-slack virtual - nodes each coarsening iteration considers (0 = legacy full-graph - analysis). Related sources: `DCPGraph/DCPAnalysis.cpp`, `Graph.cpp/.hpp`, - `MergeComputeNodesPass.cpp`. +3. **Spatial -> Pim** (`src/PIM/Conversion/SpatialToPim`). + Lowers Spatial operations to the `pim` dialect (`src/PIM/Dialect/Pim`), + including `pim.core`, `pim.core_batch`, communication, tensor packing, global + tensor materialization, and return-path normalization. 4. **Bufferization** (`src/PIM/Dialect/Pim/Transforms/Bufferization`). - Converts tensor-semantics PIM IR into memref-semantics PIM IR using the - standard MLIR `BufferizableOpInterface` machinery - (`OpBufferizationInterfaces.*`, `PimBufferization.td`). + Converts tensor-semantics PIM IR into memref-semantics PIM IR using MLIR's + bufferization interfaces. -5. **Static memory coalescing** (`src/PIM/Dialect/Pim/Transforms/StaticMemoryCoalescing`). 
- Conservatively reuses same-typed local memref allocations inside PIM cores - after bufferization and before code generation. +5. **Static memory coalescing** + (`src/PIM/Dialect/Pim/Transforms/StaticMemoryCoalescing`). + Reuses compatible local memref allocations inside PIM cores before codegen. -6. **PIM code generation** (`src/PIM/Pass/PimCodegen`): - - `HostConstantFolding` — folds host-side constants. - - `MaterializeHostConstantsPass` — materializes the remaining host - constants for emission. - - `VerificationPass` — checks invariants before emission. - - `EmitPimJsonPass` — emits the final PIM JSON consumed by `pimsim-nn` - and `pim-simulator`. +6. **PIM code generation** (`src/PIM/Pass/PimCodegen` and + `src/PIM/Compiler`). + Folds host constants, materializes remaining host constants, verifies PIM IR, + emits `.pim` core files, writes weights, and writes `memory.bin` / + `config.json`. Supporting pieces: -- `src/PIM/Compiler` — PIM-specific compiler options (crossbar size/count, - core count, DCP window, experimental conv impl, concat error handling, …) - and `PimCodeGen` entry points. -- `src/PIM/Common` — shared utilities (`PimCommon`, `LabeledList`). -- `src/PIM/Pass` — auxiliary passes (`MessagePass`) - and the `PIMPasses.h` registry used by `PimAccelerator`. -- `src/PIM/PimAccelerator.{cpp,hpp}` — accelerator entry point: registers - dialects, passes, and plugs Raptor into the ONNX-MLIR driver. +- `src/PIM/Common` - shared IR, filesystem, diagnostics, reports, and utility + helpers. +- `src/PIM/Compiler` - PIM compiler options, memory/address planning, binary + instruction format, artifact writing, weight emission, and codegen entry + points. +- `src/PIM/Conversion/SpatialToGraphviz` - optional Spatial graphviz conversion + pass. +- `src/PIM/Pass` - pass registration and auxiliary passes. +- `src/PIM/PimAccelerator.{cpp,hpp}` - ONNX-MLIR accelerator entry point. 
## Key compiler options -Pass these on the `onnx-mlir` command line when compiling for PIM: +Pass these to `onnx-mlir` when compiling for PIM: -- `--maccel=PIM` — select the PIM accelerator. -- `--EmitSpatial` / `--EmitPim` / `--EmitPimBufferized` / `--EmitPimCodegen` - — stop the pipeline at the requested stage (default: `EmitPimCodegen`). -- `--pim-only-codegen` — assume the input is already bufferized PIM IR and - run only the codegen tail. -- `--crossbar-size=` / `--crossbar-count=` — crossbar dimensions and - per-core count. -- `--core-count=` — number of cores. Required for PIM compilation. -- `--pim-merge-scheduler={peft,dcp}` — scheduler used by the Spatial - merge-compute-nodes pass (default: `peft`). -- `--dcp-critical-window-size=` — DCP coarsening window (0 = legacy). -- `--use-experimental-conv-impl` — alternative convolution lowering. -- `--ignore-concat-error` — soft-fail corner case in `ConcatOp`. +- `--maccel=PIM` - select the PIM accelerator. +- `--EmitSpatial`, `--EmitPim`, `--EmitPimBufferized`, + `--EmitPimCodegen` - stop the PIM pipeline at the requested stage. The PIM + default is `--EmitPimCodegen`. +- `--core-count=` - required positive core count for PIM compilation. +- `--crossbar-size=` - crossbar width/height. Default in code is `2`. +- `--crossbar-count=` - crossbars per core. Default in code is `256`. +- `--pim-merge-scheduler=peft` - merge scheduler. `peft` is the only accepted + value in the current code. +- `--pim-only-codegen` - assume input is already bufferized PIM IR and only run + the codegen tail. +- `--pim-emit-json` - also emit `core_*.json` instruction files alongside + `core_*.pim`. +- `--use-experimental-conv-impl` - use the alternate convolution lowering. +- `--ignore-concat-error` - soft-fail a ConcatOp corner case. 
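As a rough illustration, the options listed above could be assembled from a script. The sketch below is not part of Raptor: `build_pim_command` and its parameter names are hypothetical, and it only uses flags that appear in the list above. It rejects a non-positive core count up front because `--core-count=` is described as required and positive.

```python
from typing import List, Optional


def build_pim_command(
    compiler: str,
    model: str,
    output_base: str,
    core_count: int,
    crossbar_size: Optional[int] = None,
    crossbar_count: Optional[int] = None,
    emit_json: bool = False,
) -> List[str]:
    """Build an onnx-mlir argument list for PIM compilation (hypothetical helper)."""
    if core_count <= 0:
        raise ValueError("--core-count must be a positive integer")

    cmd = [compiler, model, "-o", output_base, "--maccel=PIM", "--EmitPimCodegen"]
    # Optional sizes are omitted when unset so the compiler's own defaults apply.
    if crossbar_size is not None:
        cmd.append(f"--crossbar-size={crossbar_size}")
    if crossbar_count is not None:
        cmd.append(f"--crossbar-count={crossbar_count}")
    cmd.append(f"--core-count={core_count}")
    if emit_json:
        # Also request per-core JSON instruction files next to the .pim files.
        cmd.append("--pim-emit-json")
    return cmd
```

The returned list can be passed to `subprocess.run` as-is; keeping the arguments as a list avoids shell quoting issues with model paths.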
+ +Example: + +```bash +./build_release/Release/bin/onnx-mlir model.onnx -o /tmp/raptor/model \ + --maccel=PIM --EmitPimCodegen \ + --crossbar-size=2048 --crossbar-count=256 --core-count=1000 +``` + +This writes PIM artifacts under `/tmp/raptor/pim/`. ## Validation -Functional validation lives in `validation/` and drives the Rust -`pim-simulator` to compare Raptor's output against a reference. +Functional validation lives in `validation/`. It compiles ONNX models, builds a +native ONNX-MLIR reference runner, generates random inputs, runs Raptor, runs +the Rust PIM simulator, and compares outputs. -Per-operation validation (from `validation/`): +Python dependencies used by the validation scripts are `numpy`, `onnx`, and +`colorama`. The simulator requires the Rust toolchain. -``` -validate.py \ - --raptor-path ../cmake-build-release/Release/bin/onnx-mlir \ - --onnx-include-dir ../onnx-mlir/include \ - --core-count 1000 +Per-operation validation from the repository root: + +```bash +python3 validation/validate.py \ + --raptor-path build_release/Release/bin/onnx-mlir \ + --onnx-include-dir onnx-mlir/include \ + --core-count 1000 ``` -End-to-end network validation (example: first 4 layers of YOLOv11n): +Validate one network or a subset by pointing `--operations-dir` at any directory +containing `.onnx` files: -``` -validate.py \ - --raptor-path ../cmake-build-release/Release/bin/onnx-mlir \ - --onnx-include-dir ../onnx-mlir/include \ - --operations-dir ./networks/yolo11n/depth_04 \ - --crossbar-size 2048 --crossbar-count 256 --core-count 1000 +```bash +python3 validation/validate.py \ + --raptor-path build_release/Release/bin/onnx-mlir \ + --onnx-include-dir onnx-mlir/include \ + --operations-dir validation/networks/yolo11n/depth_04 \ + --crossbar-size 2048 --crossbar-count 256 --core-count 1000 ``` -Each validation run writes debugging artifacts into the benchmark's workspace -directory (for example `validation/operations/gemm/small/`): -- `inputs/` — generated input 
CSVs used for the run. -- `outputs/` — reference outputs dumped by the native ONNX runner. -- `raptor/` — compiler artifacts: - `*.onnx.mlir`, `dialects/spatial0.mlir`, `dialects/spatial1_dcp_merged.mlir`, - `dialects/pim0.mlir`, `dialects/pim1_buff.mlir`, `dialects/pim2_coalesced.mlir`, - `dialects/pim3_folded.mlir`, `dialects/pim4_materialized.mlir`, - `pim/config.json`, `pim/core_*.pim`, `pim/memory.bin`, and reports under - `raptor/reports/` such as `dcp_merge_report.txt`, - `memory_report.txt`, and `static_memory_coalescing_report.txt`. -- `runner/` — generated reference runner source, build tree, and shared library. -- `simulation/out.bin` — raw simulator output dump used for output comparison. +Useful validation options: +- `--simulator-dir ` - override the auto-detected + `backend-simulators/pim/pim-simulator` path. +- `--threshold ` - maximum allowed per-element output difference. +- `--seed ` - RNG seed for generated inputs. +- `--command-timeout-seconds ` - timeout for compiler, runner, and + simulator subprocesses. +- `--verbose` - print subprocess logs and average PIM pass timings. +- `--clean` - remove generated validation artifacts and exit. -That means you usually do not need to rerun standalone `--EmitSpatial` or -`--EmitPim` commands while debugging validation failures: the per-pass dialect -dumps are already available under `raptor/dialects/`. +Each validation run writes artifacts in the model workspace, for example under +`validation/operations/gemm/small/`: +- `inputs/` - generated input CSV files. +- `outputs/` - native ONNX-MLIR reference outputs. +- `raptor/` - compiler artifacts, including `*.onnx.mlir`, dialect dumps under + `dialects/`, reports under `reports/`, and final PIM artifacts under `pim/`. +- `runner/` - generated reference runner source, build tree, and shared library. +- `simulation/out.bin` - raw simulator output used for comparison. 
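To make the `--threshold` semantics concrete, here is a minimal sketch of a per-element comparison. It is not the code in `validate.py`: the function names are hypothetical, both outputs are assumed to be already loaded as flat sequences of floats, and the difference is assumed to be absolute rather than relative.

```python
import math
from typing import Sequence, Tuple


def max_abs_difference(
    reference: Sequence[float], simulated: Sequence[float]
) -> Tuple[float, int]:
    """Return the largest absolute per-element difference and its index.

    Returns (0.0, -1) when the sequences are identical or empty.
    """
    if len(reference) != len(simulated):
        raise ValueError("output sizes differ")
    worst, worst_index = 0.0, -1
    for i, (r, s) in enumerate(zip(reference, simulated)):
        diff = abs(r - s)
        if math.isnan(diff):
            # A NaN on either side can never satisfy a threshold.
            return math.inf, i
        if diff > worst:
            worst, worst_index = diff, i
    return worst, worst_index


def outputs_match(
    reference: Sequence[float], simulated: Sequence[float], threshold: float
) -> bool:
    """True when no element differs by more than `threshold` (assumed absolute)."""
    worst, _ = max_abs_difference(reference, simulated)
    return worst <= threshold
```

Reporting the index of the worst element alongside the difference tends to be useful when inspecting the dialect dumps for the operation that produced it.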
-The validator does not currently expose a simulator tracing flag, but once a -validation has produced `raptor/pim/` you can rerun the simulator manually with -tracing enabled: +The compiler currently dumps dialect snapshots such as `spatial0.mlir`, +`spatial1_dcp_merged.mlir`, `pim0.mlir`, `pim1_buff.mlir`, +`pim2_coalesced.mlir`, `pim3_folded.mlir`, and +`pim4_materialized.mlir` when an output directory is available. + +To rerun the simulator manually with tracing after validation has produced a +`raptor/pim/` directory: ```bash cd backend-simulators/pim/pim-simulator @@ -174,90 +184,138 @@ cargo run --no-default-features --features tracing --release \ ``` With `--features tracing`, the simulator writes per-core traces as -`simulation/TraceCore0`, `simulation/TraceCore1`, ... next to `simulation/out.bin`. -The validator normally computes the `-d` dump ranges from `raptor/pim/config.json` -and the model output shapes. If you need a clean slate before rerunning, use: +`TraceCore0`, `TraceCore1`, ... next to `out.bin`. The validator normally +computes the `-d` ranges from `raptor/pim/config.json` and model output shapes. + +Available validation networks under `validation/networks/`: `vgg16`, +`yolo11n`, `yolo11nv2`. + +Available operation suites under `validation/operations/`: `add`, `concat`, +`conv`, `div`, `gather`, `gemm`, `gemv`, `matmul`, `mul`, `pool`, +`reduce_mean`, `relu`, `reshape`, `resize`, `sigmoid`, `softmax`, `split`. + +Generated operation tests can be regenerated with: ```bash -validate.py --clean +python3 validation/operations/gen_tests.py ``` -Available networks under `validation/networks/`: `vgg16`, `yolo11n`. -Available operations under `validation/operations/`: `add`, `conv`, `div`, -`gather`, `gemm`, `gemv`, `mul`, `pool`, `reduce_mean`, `relu`, `resize`, -`sigmoid`, `softmax`, `split`. 
- -## Rebuilding - -Release build (fast): - -``` -cmake --build /home/nico/raptor/raptor/cmake-build-release --target onnx-mlir -j 30 -``` - -A slower debug build is also available — configure it the same way but with -`-DCMAKE_BUILD_TYPE=Debug` (see installation instructions below). - ## Build +Initialize submodules first: + +```bash +git submodule update --init --recursive +``` + +The project follows ONNX-MLIR's build requirements. The CI workflow documents +the currently used versions and setup: +- CMake 4.3.0 in CI, +- LLVM/MLIR checked out under `onnx-mlir/llvm-project`, +- Protobuf `v34.0`, +- Rust stable for `pim-simulator`, +- Python packages `numpy`, `onnx`, `colorama` for validation. + ### Protobuf -Use the following commands to install protobuf: -``` +Install Protobuf if your system does not already provide a compatible version: + +```bash git clone --depth 1 --branch v34.0 https://github.com/protocolbuffers/protobuf -cd protobuf -mkdir build -cd build -cmake .. -G Ninja -DCMAKE_BUILD_TYPE=Release -ninja -sudo ninja install +cmake -S protobuf -B protobuf/build -G Ninja \ + -DCMAKE_BUILD_TYPE=Release \ + -Dprotobuf_BUILD_TESTS=OFF +cmake --build protobuf/build +sudo cmake --install protobuf/build ``` -You can now remove the protobuf repo directory with: -``` -cd ../.. +You can then remove the temporary checkout: + +```bash rm -rf protobuf ``` -### Mlir +### MLIR -Follow the first part of instructions [here](onnx-mlir/docs/BuildOnLinuxOSX.md) to build mlir. +Follow the ONNX-MLIR instructions in +`onnx-mlir/docs/BuildOnLinuxOSX.md` to build LLVM/MLIR. The local Raptor build +expects `MLIR_DIR` to point at the MLIR CMake package, for example: -Remember to set ```-DCMAKE_BUILD_TYPE=Debug``` for developing on Raptor - -Moreover, if compiling with build type debug, it is also suggested to use -mold as linker (you will need to install it if you don't have it already) -to reduce memory usage during linking. 
You can use it by setting the options: -``` --DLLVM_USE_LINKER=mold +```bash +MLIR_DIR=$(pwd)/onnx-mlir/llvm-project/build_release/lib/cmake/mlir ``` +If your LLVM build directory is named `build` instead of `build_release`, adjust +the path accordingly. + ### Raptor -Use the following commands to build Raptor. +Configure a release build: -Remember to set ```-DCMAKE_BUILD_TYPE=Debug``` for developing on Raptor. - -Also in this case, it is suggested to use mold as linker to reduce link time and memory usage, -setting the options: -``` --DCMAKE_EXE_LINKER_FLAGS="-fuse-ld=mold" \ --DCMAKE_SHARED_LINKER_FLAGS="-fuse-ld=mold" \ --DCMAKE_MODULE_LINKER_FLAGS="-fuse-ld=mold" -``` - -``` -git submodule update --init --recursive - -MLIR_DIR=$(pwd)/onnx-mlir/llvm-project/build/lib/cmake/mlir -mkdir build && cd build -cmake .. -G Ninja \ +```bash +MLIR_DIR=$(pwd)/onnx-mlir/llvm-project/build_release/lib/cmake/mlir +cmake -S . -B build_release -G Ninja \ -DCMAKE_BUILD_TYPE=Release \ -DONNX_MLIR_ACCELERATORS=PIM \ -DLLVM_ENABLE_ASSERTIONS=ON \ -DMLIR_DIR=${MLIR_DIR} -cmake --build . ``` -If the build fails because of protobuf missing uint definitions, -just patch the problematic files by adding ```#include ``` to their includes. +Configure a debug build similarly: + +```bash +MLIR_DIR=$(pwd)/onnx-mlir/llvm-project/build_debug/lib/cmake/mlir +cmake -S . -B build_debug -G Ninja \ + -DCMAKE_BUILD_TYPE=Debug \ + -DONNX_MLIR_ACCELERATORS=PIM \ + -DLLVM_ENABLE_ASSERTIONS=ON \ + -DMLIR_DIR=${MLIR_DIR} +``` + +For debug development, using `mold` can reduce link time and memory use: + +```bash +cmake -S . 
-B build_debug -G Ninja \ + -DCMAKE_BUILD_TYPE=Debug \ + -DONNX_MLIR_ACCELERATORS=PIM \ + -DLLVM_ENABLE_ASSERTIONS=ON \ + -DMLIR_DIR=${MLIR_DIR} \ + -DCMAKE_EXE_LINKER_FLAGS="-fuse-ld=mold" \ + -DCMAKE_SHARED_LINKER_FLAGS="-fuse-ld=mold" \ + -DCMAKE_MODULE_LINKER_FLAGS="-fuse-ld=mold" +``` + +Build the compiler with CMake: + +```bash +cmake --build ./build_release +cmake --build ./build_debug +``` + +Do not invoke `ninja` directly for this project; use `cmake --build` so CMake's +configuration and generated shims stay consistent. + +If a build fails because Protobuf headers are missing fixed-width integer +definitions, patch the affected Protobuf-generated files by adding +`#include `. + +## Tests + +The Rust simulator has its own tests: + +```bash +cd backend-simulators/pim/pim-simulator +cargo test +``` + +## Repository Layout + +- `src/PIM/` - PIM accelerator implementation. +- `test/PIM/` - PIM C++ unit tests. +- `validation/` - functional validation scripts, ONNX operation tests, network + slices, and pimsim config generation. +- `backend-simulators/pim/pim-simulator/` - in-tree Rust functional simulator. +- `backend-simulators/pim/pimsim-nn/` - performance simulator submodule. +- `pimcomp_utils/` - local comparison helpers for PIMCOMP-NN. +- `.github/actions/` and `.github/workflows/validate_operations.yml` - CI setup + for MLIR/Protobuf caching, building Raptor, and validation.
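
After a compile or validation run, a quick sanity check of the `pim/` artifact directory can catch incomplete output early. The sketch below is a hypothetical helper, not part of the repository. It only checks for the file names mentioned in this README (`config.json`, `memory.bin`, and at least one `core_*.pim`) and makes no assumption about their contents or about how weight binaries are named.

```python
from pathlib import Path
from typing import List


def missing_pim_artifacts(pim_dir: str) -> List[str]:
    """List the expected artifact names that are absent from `pim_dir`."""
    root = Path(pim_dir)
    if not root.is_dir():
        # Nothing was emitted at all: report every expected artifact.
        return ["config.json", "memory.bin", "core_*.pim"]

    missing = [
        name for name in ("config.json", "memory.bin")
        if not (root / name).is_file()
    ]
    # At least one per-core binary instruction file is expected.
    if not any(root.glob("core_*.pim")):
        missing.append("core_*.pim")
    return missing
```

An empty return value means the basic artifacts are present; it does not say anything about whether they are valid.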