# Raptor

Raptor is a domain-specific MLIR compiler for neural networks (ONNX format)
targeting in-memory computing / processing-in-memory (PIM) architectures.
It progressively lowers ONNX-MLIR through a set of MLIR dialects down to
target-specific artifacts (currently JSON code for the `pimsim-nn` simulator).

## Overview

PIM architectures perform most of the computation directly in memory.
Raptor's first supported target is `pimsim-nn`, which simulates a chip with:

- a shared host memory,
- a number of cores that do most of the computation directly in their memory
  (vector ops, vmm/mvm on ReRAM crossbars),
- no branching instructions (branchless architecture) and no hardware loop
  support — any repeated work (e.g. convolutions) must be unrolled into
  explicit per-iteration instructions.

Because of this, the number of emitted instructions explodes quickly and the
compiler must optimize aggressively at every stage to keep compilation
tractable.

A second target, `PulPim`, is planned for an accelerator with RISC-V cores,
each carrying its own in-memory computing unit and crossbars. It will live in
a dedicated dialect (future work).

### Targets and simulators

`pimsim-nn` (under `backend-simulators/pim/pimsim-nn`) is used for
**performance** estimates (latency, energy), but does not functionally execute
the JSON code it consumes. To validate the numerical correctness of the JSON
code produced by Raptor (or, for comparison, by the `pimcomp` compiler), we use
a Rust simulator we maintain in-tree at
`backend-simulators/pim/pim-simulator`.

## Compilation pipeline

The PIM-related sources live under `src/PIM` and the tests under `test/PIM`.
When working on this codebase, most changes should stay confined to those
trees (you only need to look outside, e.g. at `onnx-mlir` or `llvm`, for
framework-level details).

High-level lowering flow:

```
ONNX-MLIR ──► Spatial ──► Pim (tensor) ──► Pim (bufferized) ──► PIM JSON
```

1. **ONNX → Spatial** (`src/PIM/Conversion/ONNXToSpatial`).
   Lowers ONNX ops into the `spat` dialect (`src/PIM/Dialect/Spatial`).
   Spatial models a high-level spatial in-memory accelerator: vmm/mvm
   operations are accelerated by storing a constant RHS matrix into a
   crossbar. Crossbars cannot be re-programmed during execution, have a
   fixed, limited size, and there is a limited number of them per core.
   Conversion patterns are split by op family under
   `Conversion/ONNXToSpatial/Patterns/{Math,NN,Tensor}` (Conv, Gemm, MatMul,
   Elementwise, ReduceMean, Pool, Relu, Sigmoid, Softmax, Concat, Gather,
   Reshape, Resize, Split). The pipeline can be stopped at this and other
   stages with the `--Emit*` flags; see the sketch after this list.

2. **Spatial → Pim** (`src/PIM/Conversion/SpatialToPim`).
   Lowers Spatial to the `pim` dialect (`src/PIM/Dialect/Pim`), which
   materializes PIM cores (`pim.core`), inter-core communication
   (`pim.send` / `pim.receive`), halts, and crossbar-level operations.

3. **Merge compute nodes** (`src/PIM/Dialect/Spatial/Transforms/MergeComputeNodes`).
   A DCP-inspired heuristic (Dynamic Critical Path — see the original
   scheduling paper by Kwok & Ahmad,
   [DCP-eScience2007](https://clouds.cis.unimelb.edu.au/papers/DCP-eScience2007.pdf))
   that coarsens the virtual node graph and decides how to group compute
   nodes onto cores. Our implementation is only DCP-*inspired*: it is a
   heuristic with different assumptions from the paper (different cost
   model, constraints from crossbar capacity / core resources, and a
   windowed coarsening loop instead of full-graph reprioritization). The
   `dcp-critical-window-size` option controls how many lowest-slack virtual
   nodes each coarsening iteration considers (0 = legacy full-graph
   analysis). Related sources: `DCPGraph/DCPAnalysis.cpp`, `Graph.cpp/.hpp`,
   `MergeComputeNodesPass.cpp`.

4. **Bufferization** (`src/PIM/Dialect/Pim/Transforms/Bufferization`).
   Converts tensor-semantics PIM IR into memref-semantics PIM IR using the
   standard MLIR `BufferizableOpInterface` machinery
   (`OpBufferizationInterfaces.*`, `PimBufferization.td`).

5. **PIM code generation** (`src/PIM/Pass/PimCodegen`):
   - `HostConstantFolding` — folds host-side constants.
   - `MaterializeHostConstantsPass` — materializes the remaining host
     constants for emission.
   - `VerificationPass` — checks invariants before emission.
   - `EmitPimJsonPass` — emits the final PIM JSON consumed by `pimsim-nn`
     and `pim-simulator`.
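
An illustrative way to stop the pipeline at a given stage (the `model.onnx`
input and the bare `onnx-mlir` binary name are placeholders; the flags are
described under "Key compiler options" below):

```
# Stop after the Spatial lowering (stage 1):
onnx-mlir --maccel=PIM --EmitSpatial model.onnx

# Run the full pipeline down to PIM JSON (the default stage):
onnx-mlir --maccel=PIM --EmitPimCodegen model.onnx
```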

Supporting pieces:

- `src/PIM/Compiler` — PIM-specific compiler options (crossbar size/count,
  core count, DCP window, experimental conv impl, concat error handling, …)
  and `PimCodeGen` entry points.
- `src/PIM/Common` — shared utilities (`PimCommon`, `LabeledList`).
- `src/PIM/Pass` — auxiliary passes (`MessagePass`, `CountInstructionPass`)
  and the `PIMPasses.h` registry used by `PimAccelerator`.
- `src/PIM/PimAccelerator.{cpp,hpp}` — accelerator entry point: registers
  dialects, passes, and plugs Raptor into the ONNX-MLIR driver.

## Key compiler options

Pass these on the `onnx-mlir` command line when compiling for PIM (a combined
example follows the list):

- `--maccel=PIM` — select the PIM accelerator.
- `--EmitSpatial` / `--EmitPim` / `--EmitPimBufferized` / `--EmitPimCodegen`
  — stop the pipeline at the requested stage (default: `EmitPimCodegen`).
- `--pim-only-codegen` — assume the input is already bufferized PIM IR and
  run only the codegen tail.
- `--crossbar-size=<N>` / `--crossbar-count=<N>` — crossbar dimensions and
  per-core count.
- `--core-count=<N>` — number of cores (`-1` picks the minimum).
- `--dcp-critical-window-size=<N>` — DCP coarsening window (0 = legacy).
- `--use-experimental-conv-impl` — alternative convolution lowering.
- `--ignore-concat-error` — soft-fail on a corner case in `ConcatOp`.
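
Illustrative combined invocation (the model name and the numeric values are
placeholders, not project defaults):

```
onnx-mlir --maccel=PIM \
  --crossbar-size=1024 --crossbar-count=16 --core-count=-1 \
  --dcp-critical-window-size=32 \
  model.onnx
```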

## Validation

Functional validation lives in `validation/` and drives the Rust
`pim-simulator` to compare Raptor's output against a reference.

Per-operation validation (from `validation/`):

```
validate.py \
  --raptor-path ../cmake-build-release/Release/bin/onnx-mlir \
  --onnx-include-dir ../onnx-mlir/include
```

End-to-end network validation (example: the first 4 layers of YOLOv11n):

```
validate.py \
  --raptor-path ../cmake-build-release/Release/bin/onnx-mlir \
  --onnx-include-dir ../onnx-mlir/include \
  --operations-dir ./networks/yolo11n/depth_04 \
  --crossbar-size 2048
```

Available networks under `validation/networks/`: `vgg16`, `yolo11n`.
Available operations under `validation/operations/`: `add`, `conv`, `div`,
`gather`, `gemm`, `gemv`, `mul`, `pool`, `reduce_mean`, `relu`, `resize`,
`sigmoid`, `softmax`, `split`.
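
A minimal sketch for validating a single operation, assuming
`--operations-dir` also accepts a single operation directory (this layout is
an assumption, not documented above):

```
validate.py \
  --raptor-path ../cmake-build-release/Release/bin/onnx-mlir \
  --onnx-include-dir ../onnx-mlir/include \
  --operations-dir ./operations/conv
```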

## Rebuilding

Release build (fast):

```
cmake --build /home/nico/raptor/raptor/cmake-build-release --target onnx-mlir -j 30
```

A slower debug build is also available — configure it the same way but with
`-DCMAKE_BUILD_TYPE=Debug` (see installation instructions below).
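
A minimal sketch of the corresponding debug build command, assuming a separate
build directory configured with `-DCMAKE_BUILD_TYPE=Debug` (`cmake-build-debug`
is a placeholder name):

```
cmake --build /home/nico/raptor/raptor/cmake-build-debug --target onnx-mlir -j 30
```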

## Build

### Protobuf