ReduceMean + resnet

Fix direct import
Yolo Image Validator + new accept rule
2026-06-10 14:30:10 +02:00 · 2026-06-10 12:14:20 +02:00 · 2026-06-10 11:59:43 +02:00 · 2026-06-05 18:22:59 +02:00 · 2026-06-05 17:36:51 +02:00 · 2026-06-05 17:12:16 +02:00
376 changed files with 23506 additions and 13066 deletions
@@ -4,14 +4,11 @@

 .claude
 .codex
-AGENTS.md

 CMakeUserPresets.json

-build
-build_release
-cmake-build-debug
-cmake-build-release
+build_*
 compile.sh
+pimcomp_utils/*

 **/__*
@@ -0,0 +1,210 @@
+* Always read the full README.md before doing anything
+* Build commands:
+  * `cmake --build ./build_release`
+  * `cmake --build ./build_debug`
+* Never use `ninja` directly: it bypasses cmake's configuration and invalidates the build cache
+* Always try the release build first before building with the debug version
+* Use the debug build only when it is useful to obtain a clear stack trace with symbols, inspect names, place breakpoints, or test a small case interactively
+* The debug build is very slow, so use it only on small fast tests such as operation validations, not on network validations
+
+# Core engineering philosophy
+
+* Clean architecture matters as much as making the immediate test pass
+* Prefer fixes that preserve clear ownership boundaries, explicit invariants, and simple dataflow
+* Do not stack compensating fixes on top of earlier mistakes. If the current approach is becoming messy, stop and explain why
+* A correct fix should usually make the responsible producer, resolver, verifier, or lowering own the behavior directly
+* Avoid late repair passes, defensive cleanup, or broad rewrites when a cleaner owner-side fix is possible
+* Do not hide an upstream modeling bug by normalizing it later in the pipeline. Fix the producer when the producer owns the invariant
+* Prefer patterns/rewrites for local IR canonicalization. Use module walks only when pass-level structural analysis genuinely requires them
+* Prefer compact, structured designs over long case-by-case implementations
+
+# Think before coding
+
+* State assumptions explicitly before implementing when they affect the design
+* If multiple interpretations exist, present them instead of silently choosing one
+* If a simpler approach exists, say so and prefer it unless there is a clear reason not to
+* If something is unclear, stop, name what is confusing, and ask
+* If the requested or obvious approach would make the architecture worse, push back and propose a cleaner alternative
+
+# Code changes
+
+* Keep changes minimal and localized to the relevant parts of the code
+* Preserve the existing naming conventions and coding style used in the surrounding code
+* Keep code easy to read, well organized, and suitable for future extensibility
+* A function must not exceed roughly 200/250 lines. If a change pushes a function beyond that, extract focused helpers
+* Prefer clear naming and structure over comments. Add comments only when they materially improve clarity
+* Do not rename symbols, move files, or restructure modules unless that is necessary for the requested change
+* Avoid duplicate ad-hoc logic. If the same concept appears in multiple places, consider whether it deserves a shared helper/API
+* When adding a helper or API, ask:
+  * Could this be useful to another component now
+  * Is another component already implementing the same idea differently
+  * Is this likely to be needed by a future adjacent component
+  * What is the narrowest useful abstraction
+  * What is the correct ownership level for this API
+* If a shared API is justified, place it at the lowest clean layer that can be used by all relevant consumers without creating dependency cycles or leaking policy across layers
+* If an existing component should use a newly introduced shared API, refactor that component in the same patch when doing so is directly related and reduces duplication
+* Do not create broad frameworks just because a helper might someday be useful. Shared APIs should encode a real reusable concept, not speculative generality
+* If the reusable abstraction is plausible but not clearly needed yet, keep the code local and mention the possible future extraction separately
+
+# Avoid case-listing designs
+
+* Avoid solving problems with large chains of `if`/`else`, switches, or repeated special cases that enumerate every possible situation
+* Long case listings tend to overfit the current tests, grow the codebase, and hide the underlying abstraction
+* When you see a growing list of special cases, stop and look for the shared concept, data model, interface, or normalization step that would make the cases collapse
+* Prefer table-driven logic, traits/interfaces, small reusable predicates, structured dispatch, or producer-side normalization when they express the invariant more directly
+* A few explicit cases are fine when the domain is genuinely small and closed
+* If the list is likely to grow, refactor toward a cleaner and more compact design instead of adding another branch
+* When keeping a case list is the pragmatic choice, explain why the domain is closed or why a broader abstraction would be premature
+
+# Ownership and invariants
+
+Before implementing, identify the owner of the behavior:
+
+* A producer should emit IR/data that satisfies the contract of the next stage
+* A lowering should make representation changes explicit and semantically correct
+* A resolver should resolve existing structure without silently changing semantics
+* A verifier should reject invalid states with bounded, actionable diagnostics
+* Codegen should assume verified invariants and fail clearly if they are violated
+
+When fixing a bug:
+
+* State the invariant that was violated
+* State which component should own that invariant
+* Fix that component directly
+* Avoid fixes that merely mask the violation later in the pipeline
+* Add or preserve verification if the invariant is important enough to regress
+
+# Refactor and API policy
+
+You may propose or implement a refactor when:
+
+* the local fix would duplicate logic
+* the local fix would violate a layer boundary
+* the bug exists because responsibility is assigned to the wrong component
+* multiple components already implement ad-hoc variants of the same concept
+* a shared helper/API would make the code smaller, clearer, and easier to maintain
+* existing callers can be migrated cleanly without broad churn
+* the current implementation is turning into a long list of special cases instead of a structured solution
+
+When proposing or implementing a refactor:
+
+* Explain what responsibility is being moved or shared
+* Justify why the new location is the right ownership level
+* Keep the API narrow and named after the concept or invariant it represents
+* Migrate directly related existing users when that improves compactness and consistency
+* Separate changes required for correctness from optional cleanup
+* Avoid unrelated renames, formatting changes, or module moves
+* Do not expand a justified refactor beyond directly related callers
+
+Do not refactor when:
+
+* the issue is truly local and a local fix is clearer
+* the abstraction would have only one user and no clear adjacent use
+* the abstraction would mix policies from different layers
+* the refactor would affect unrelated behavior
+* the refactor is mainly aesthetic
+
+# Working style
+
+* Infer style and conventions from the existing code before introducing new patterns
+* When several implementation options are possible, prefer the simplest one that fits the current architecture and minimizes churn
+* Push back when the requested or obvious fix would make the architecture worse
+* If a cleaner fix requires a small refactor or shared helper/API, propose it explicitly and justify it
+* Avoid broad refactors unless explicitly requested or clearly necessary for correctness and maintainability
+* When tests fail, bucket failures by likely root cause and separate patch-related failures from pre-existing or out-of-scope failures
+
+# Simplicity first
+
+* Minimum code that solves the problem cleanly. Nothing speculative
+* No features beyond what was asked
+* No error handling for impossible scenarios
+* If you write 200 lines and it could be 50, rewrite it
+* Ask: “Would a senior engineer say this is overcomplicated?” If yes, simplify
+* Prefer direct, explicit code over generic machinery unless the generic machinery clearly reduces duplication and preserves boundaries
+
+# Fallbacks and defaults
+
+* Avoid silent fallback behavior when the semantic category is unknown
+* Do not treat “unknown” as “safe” unless the codebase already defines that convention
+* If a value cannot be classified, either preserve the existing behavior deliberately or fail with a clear diagnostic
+* When adding a fallback, state why it is semantically valid and what invariant makes it safe
+
+# Surgical changes
+
+* Touch only what you must
+* Clean up only the mess introduced by your own change
+* Do not “improve” adjacent code, comments, or formatting
+* Match existing style, even if you would personally do it differently
+* If you notice unrelated dead code, bad abstractions, or fragile design, mention it separately. Do not delete or rewrite it unless asked
+* When your changes create orphans, remove imports, variables, functions, or files made unused by your change
+* Every changed line should trace directly to the requested fix, a required cleanup, or a justified reuse/refactor decision
+
+# Diagnostics and verification
+
+* Use existing bounded diagnostic mechanisms for pass-level verification or codegen failures
+* Do not emit unbounded repeated diagnostics from loops or parallel workers
+* Diagnostics should identify the violated invariant and the relevant value/op when useful
+* Verifiers should reject invalid states, not repair them
+* Codegen should not compensate for invalid IR/data unless codegen is the owner of that invariant
+* Do not make failing tests pass by weakening verifiers, assertions, or diagnostics unless the check itself is proven wrong
+* If a check is too strict, explain the valid case it rejects and update the invariant accordingly
+* Prefer fixing invalid IR/data producers over relaxing consumers
+* If adding diagnostics only for debugging, remove them or cap them before finalizing
+
+# Temporary debugging code
+
+* Temporary diagnostics, dumps, assertions, and debug-only helpers must be removed or intentionally converted into bounded permanent diagnostics before finalizing
+* If debug instrumentation remains, explain why it is useful as permanent infrastructure
+* Do not leave noisy validation output behind
+
+# Performance awareness
+
+* Avoid algorithmic regressions in compiler passes, especially repeated full-module walks, repeated expensive analyses, or per-op recomputation inside nested loops
+* If a change adds a walk, cache, analysis, or structural traversal, justify why it is needed
+* For hot paths, prefer preserving existing asymptotic behavior unless a better structure is part of the requested change
+* If performance may change, mention the expected impact and suggest a targeted timing check
+
+# Goal-driven execution
+
+For multi-step tasks, state a brief plan:
+
+1. [Step] → verify: [check]
+2. [Step] → verify: [check]
+3. [Step] → verify: [check]
+
+Define success criteria before implementing:
+
+* For bug fixes, success means reproducing or identifying the failure, fixing the responsible owner, and verifying the targeted case
+* For refactors, success means preserving behavior while making ownership, reuse, or structure cleaner
+* For validation changes, success means checking both valid and invalid cases when applicable
+
+Transform tasks into verifiable goals:
+
+* “Fix the bug” → identify the invariant, reproduce the failure, fix the owner, verify the targeted case
+* “Add validation” → write or identify tests for invalid inputs, then make them pass/fail as expected
+* “Refactor X” → preserve behavior before and after, then run relevant tests
+
+# Final self-review
+
+Before reporting completion, check:
+
+* Did I fix the owner of the invariant rather than masking the issue downstream
+* Did I avoid broad case lists and ad-hoc special handling
+* Did I introduce a helper/API only at the right ownership level
+* Did I migrate directly related duplicate logic when doing so improves compactness
+* Did I avoid weakening verifiers or assertions unnecessarily
+* Did I remove temporary debugging code or make it bounded and intentional
+* Did I avoid unrelated formatting, renames, or cleanup
+* Did I consider performance impact for added walks, analyses, caches, or repeated computations
+* Did I run the required build/test commands
+* Did I clearly report remaining failures or risks
+
+When reporting back:
+
+* Say what changed
+* Say what was verified
+* Say what remains
+* When showing code in chat, make it easy to copy-paste into the codebase
+* Keep outputs focused on the changed parts
+* List bad practices, fragile assumptions, or cleaner alternatives separately
+* If a change is intentionally pragmatic rather than architecturally ideal, say so and explain the tradeoff
@@ -3,31 +3,99 @@ cmake_minimum_required(VERSION 3.20.0)

 project(raptor)

-# Add symlink to PIM as accelerator in onnx-mlir
-function(raptor_ensure_symlink link_path target_path)
-  get_filename_component(link_parent "${link_path}" DIRECTORY)
+# Materialize a CMake shim directory
+function(raptor_write_external_cmake_shim shim_dir external_source_dir description)
+  get_filename_component(real_external_source_dir "${external_source_dir}" REALPATH)
+  file(RELATIVE_PATH relative_external_source_dir "${shim_dir}" "${real_external_source_dir}")

-  if(NOT EXISTS "${link_parent}")
-    message(FATAL_ERROR "Directory not found: ${link_parent}")
-  endif()
-
-  if(NOT EXISTS "${link_path}")
-    message(STATUS "Creating symlink ${link_path} -> ${target_path}")
-    file(CREATE_LINK
-        "${target_path}"
-        "${link_path}"
-        SYMBOLIC
+  if (NOT EXISTS "${real_external_source_dir}/CMakeLists.txt")
+    message(FATAL_ERROR
+      "External CMake source directory not found or missing CMakeLists.txt:\n"
+      "  ${real_external_source_dir}"
    )
-  endif()
+  endif ()
+
+  if (IS_SYMLINK "${shim_dir}")
+    message(STATUS "Removing old full-directory symlink: ${shim_dir}")
+    file(REMOVE "${shim_dir}")
+  endif ()
+
+  if (EXISTS "${shim_dir}" AND NOT IS_DIRECTORY "${shim_dir}")
+    message(FATAL_ERROR "Expected directory or absent path, got file: ${shim_dir}")
+  endif ()
+
+  file(MAKE_DIRECTORY "${shim_dir}")
+
+  set(shim_file "${shim_dir}/CMakeLists.txt")
+  set(shim_contents
+    "get_filename_component(raptor_external_source_dir
+    \"\${CMAKE_CURRENT_LIST_DIR}/${relative_external_source_dir}\"
+    REALPATH
+)
+add_subdirectory(
+    \"\${raptor_external_source_dir}\"
+    \"\${CMAKE_CURRENT_BINARY_DIR}/raptor-external\"
+)
+if (DEFINED PIM_ENABLED)
+    set(PIM_ENABLED \"\${PIM_ENABLED}\" PARENT_SCOPE)
+endif ()
+"
+  )
+
+  if (EXISTS "${shim_file}")
+    file(READ "${shim_file}" old_contents)
+  else ()
+    set(old_contents "")
+  endif ()
+
+  if (NOT old_contents STREQUAL shim_contents)
+    file(WRITE "${shim_file}" "${shim_contents}")
+    message(STATUS "Wrote CMake shim for ${description}: ${shim_file}")
+  else ()
+    message(STATUS "CMake shim already up to date for ${description}")
+  endif ()
+
+  # Mirror the external tree's first-level entries into the shim directory
+  # so legacy includes like src/Accelerators/PIM/Compiler/... keep working.
+  file(GLOB children RELATIVE "${real_external_source_dir}" "${real_external_source_dir}/*")
+
+  foreach (child IN LISTS children)
+    if (child STREQUAL "CMakeLists.txt")
+      continue()
+    endif ()
+
+    set(real_child "${real_external_source_dir}/${child}")
+    set(shim_child "${shim_dir}/${child}")
+
+    if (IS_SYMLINK "${shim_child}")
+      file(READ_SYMLINK "${shim_child}" existing_link_target)
+      if (existing_link_target STREQUAL real_child)
+        continue()
+      endif ()
+      file(REMOVE_RECURSE "${shim_child}")
+    elseif (EXISTS "${shim_child}")
+      # Do not delete real files/directories. This protects the generated shim.
+      continue()
+    endif ()
+
+    file(CREATE_LINK
+      "${real_child}"
+      "${shim_child}"
+      SYMBOLIC
+    )
+  endforeach ()
 endfunction()

-raptor_ensure_symlink(
-    "${CMAKE_CURRENT_SOURCE_DIR}/onnx-mlir/src/Accelerators/PIM"
-    "${CMAKE_CURRENT_SOURCE_DIR}/src/PIM"
+raptor_write_external_cmake_shim(
+  "${CMAKE_CURRENT_SOURCE_DIR}/onnx-mlir/src/Accelerators/PIM"
+  "${CMAKE_CURRENT_SOURCE_DIR}/src/PIM"
+  "PIM accelerator"
 )
-raptor_ensure_symlink(
-    "${CMAKE_CURRENT_SOURCE_DIR}/onnx-mlir/test/accelerators/PIM"
-    "${CMAKE_CURRENT_SOURCE_DIR}/test/PIM"
+
+raptor_write_external_cmake_shim(
+  "${CMAKE_CURRENT_SOURCE_DIR}/onnx-mlir/test/accelerators/PIM"
+  "${CMAKE_CURRENT_SOURCE_DIR}/test/PIM"
+  "PIM accelerator tests"
 )

 # Patch onnx-mlir sources for PIM accelerator support.
@@ -38,21 +106,21 @@ function(raptor_apply_patch file_path anchor replacement description)

  # Already applied – replacement text is present
  string(FIND "${contents}" "${replacement}" already_applied_pos)
-  if(NOT already_applied_pos EQUAL -1)
+  if (NOT already_applied_pos EQUAL -1)
    message(STATUS "Patch already applied: ${description}")
    return()
-  endif()
+  endif ()

  # Anchor must exist for the patch to be applicable
  string(FIND "${contents}" "${anchor}" anchor_pos)
-  if(anchor_pos EQUAL -1)
+  if (anchor_pos EQUAL -1)
    message(FATAL_ERROR
      "Patch anchor not found – onnx-mlir may have changed.\n"
      "  Patch : ${description}\n"
      "  File  : ${file_path}\n"
      "  Anchor: ${anchor}"
    )
-  endif()
+  endif ()

  string(REPLACE "${anchor}" "${replacement}" patched "${contents}")
  file(WRITE "${file_path}" "${patched}")
@@ -1,223 +1,321 @@
 # Raptor

-Raptor is a domain-specific MLIR compiler for neural networks (ONNX format)
-targeting in-memory computing / processing-in-memory (PIM) architectures.
-It progressively lowers ONNX-MLIR through a set of MLIR dialects down to
-target-specific artifacts (currently JSON code for the `pimsim-nn` simulator).
+Raptor is a domain-specific MLIR compiler for neural networks in ONNX format,
+targeting in-memory computing / processing-in-memory (PIM) architectures. It
+extends ONNX-MLIR with a PIM accelerator and progressively lowers ONNX-MLIR
+through custom MLIR dialects to simulator artifacts.
+
+The current target is the PIM simulator stack under `backend-simulators/pim`.
+Raptor emits binary per-core `.pim` instruction files by default, plus
+`memory.bin`, `config.json`, and weight binaries. It can also emit per-core JSON
+instruction files with `--pim-emit-json`.

 ## Overview

-PIM architectures perform most of the computation directly in memory.
-Raptor's first supported target is `pimsim-nn`, which simulates a chip with:
- a shared host memory,
- a number of cores that do most of the computation directly in their memory
-  (vector ops, vmm/mvm on ReRAM crossbars),
- no branching instructions (branchless architecture) and no hardware loop
-  support — any repeated work (e.g. convolutions) must be unrolled into
-  explicit per-iteration instructions.
+PIM architectures perform most computation directly in memory. The supported
+target models a chip with:
+- shared host memory,
+- multiple PIM cores,
+- ReRAM crossbars for vector-matrix / matrix-vector work,
+- explicit communication between cores,
+- no hardware branch or loop support in emitted simulator code.

-Because of this, the amount of emitted instructions explodes quickly and the
-compiler must optimize aggressively at every stage to keep compilation
-tractable.
-
-A second target, `PulPim`, is planned for an accelerator with RISC-V cores
-each carrying its own in-memory computing unit and crossbars. It will live in
-a dedicated dialect (future work).
+Because repeated work such as convolutions is eventually made explicit, emitted
+instruction counts can grow quickly. Most compiler work therefore focuses on
+lowering, scheduling, memory layout, and code-generation optimizations.

 ### Targets and simulators

-`pimsim-nn` (under `backend-simulators/pim/pimsim-nn`) is used for
-**performance** estimates (latency, energy), but does not functionally execute
-the JSON code it consumes. To validate the numerical correctness of the JSON
-code produced by Raptor (or, for comparison, by the `pimcomp` compiler), we use
-a Rust simulator we maintain in-tree at
-`backend-simulators/pim/pim-simulator`.
+- `backend-simulators/pim/pim-simulator` is the in-tree Rust functional
+  simulator used by validation. It reads Raptor's `pim/` artifact directory and
+  compares simulator output against native ONNX-MLIR execution.
+- `backend-simulators/pim/pimsim-nn` is the performance simulator submodule.
+  The helper scripts in `pimcomp_utils/` are for comparison with PIMCOMP-NN and
+  contain local paths; treat them as local utilities, not portable workflows.

 ## Compilation pipeline

-The PIM-related sources live under `src/PIM` and the tests under `test/PIM`.
-When working on this codebase, most changes should stay confined to those
-trees (you only need to look outside, e.g. at `onnx-mlir` or `llvm`, for
-framework-level details).
+The PIM sources live under `src/PIM` and tests under `test/PIM`. CMake exposes
+them to ONNX-MLIR through generated shim directories under
+`onnx-mlir/src/Accelerators/PIM` and `onnx-mlir/test/accelerators/PIM`.

 High-level lowering flow:

 ```
-ONNX-MLIR ──► Spatial ──► Pim (tensor) ──► Pim (bufferized) ──► PIM JSON
+ONNX-MLIR -> Spatial -> Pim (tensor) -> Pim (bufferized) -> PIM artifacts
 ```

-1. **ONNX → Spatial** (`src/PIM/Conversion/ONNXToSpatial`).
-   Lowers ONNX ops into the `spat` dialect (`src/PIM/Dialect/Spatial`).
-   Spatial models a high-level spatial in-memory accelerator: vmm/mvm
-   operations are accelerated by storing a constant RHS matrix into a
-   crossbar. Crossbars cannot be re-programmed during execution, have a
-   limited fixed size, and there is a limited number of them per core.
-   Conversion patterns are split by op family under
-   `Conversion/ONNXToSpatial/Patterns/{Math,NN,Tensor}` (Conv, Gemm, MatMul,
-   Elementwise, ReduceMean, Pool, Relu, Sigmoid, Softmax, Concat, Gather,
-   Reshape, Resize, Split).
+1. **ONNX -> Spatial** (`src/PIM/Conversion/ONNXToSpatial`).
+   Lowers supported ONNX ops into the `spat` dialect
+   (`src/PIM/Dialect/Spatial`). Conversion patterns are split by op family under
+   `Patterns/{Math,NN,Tensor}` and currently cover Conv, Gemm, MatMul,
+   elementwise Add/Mul/Div, ReduceMean, pooling, Relu, Sigmoid, Softmax,
+   Concat, Gather, Reshape, Resize, and Split.

-2. **Spatial → Pim** (`src/PIM/Conversion/SpatialToPim`).
-   Lowers Spatial to the `pim` dialect (`src/PIM/Dialect/Pim`), which
-   materializes PIM cores (`pim.core`), inter-core communication
-   (`pim.send` / `pim.receive`), halts, and crossbar-level operations.
+2. **Merge compute nodes**
+   (`src/PIM/Dialect/Spatial/Transforms/MergeComputeNodes`).
+   Builds a compute graph, schedules it with the PEFT scheduler, and materializes
+   the merge schedule into Spatial IR. Supporting scheduling code lives under
+   `MergeComputeNodes/Scheduling`.

-3. **Merge compute nodes** (`src/PIM/Dialect/Spatial/Transforms/MergeComputeNodes`).
-   A DCP-inspired heuristic (Dynamic Critical Path — see the original
-   scheduling paper by Kwok & Ahmad,
-   [DCP-eScience2007](https://clouds.cis.unimelb.edu.au/papers/DCP-eScience2007.pdf))
-   that coarsens the virtual node graph and decides how to group compute
-   nodes onto cores. Our implementation is only DCP-*inspired*: it is a
-   heuristic with different assumptions from the paper (different cost
-   model, constraints from crossbar capacity / core resources, and a
-   windowed coarsening loop instead of full-graph reprioritization). The
-   `dcp-critical-window-size` option controls how many lowest-slack virtual
-   nodes each coarsening iteration considers (0 = legacy full-graph
-   analysis). Related sources: `DCPGraph/DCPAnalysis.cpp`, `Graph.cpp/.hpp`,
-   `MergeComputeNodesPass.cpp`.
+3. **Spatial -> Pim** (`src/PIM/Conversion/SpatialToPim`).
+   Lowers Spatial operations to the `pim` dialect (`src/PIM/Dialect/Pim`),
+   including `pim.core`, `pim.core_batch`, communication, tensor packing, global
+   tensor materialization, and return-path normalization.

 4. **Bufferization** (`src/PIM/Dialect/Pim/Transforms/Bufferization`).
-   Converts tensor-semantics PIM IR into memref-semantics PIM IR using the
-   standard MLIR `BufferizableOpInterface` machinery
-   (`OpBufferizationInterfaces.*`, `PimBufferization.td`).
+   Converts tensor-semantics PIM IR into memref-semantics PIM IR using MLIR's
+   bufferization interfaces.

-5. **Static memory coalescing** (`src/PIM/Dialect/Pim/Transforms/StaticMemoryCoalescing`).
-   Conservatively reuses same-typed local memref allocations inside PIM cores
-   after bufferization and before code generation.
+5. **Static memory coalescing**
+   (`src/PIM/Dialect/Pim/Transforms/StaticMemoryCoalescing`).
+   Reuses compatible local memref allocations inside PIM cores before codegen.

-6. **PIM code generation** (`src/PIM/Pass/PimCodegen`):
-   - `HostConstantFolding` — folds host-side constants.
-   - `MaterializeHostConstantsPass` — materializes the remaining host
-     constants for emission.
-   - `VerificationPass` — checks invariants before emission.
-   - `EmitPimJsonPass` — emits the final PIM JSON consumed by `pimsim-nn`
-     and `pim-simulator`.
+6. **PIM code generation** (`src/PIM/Pass/PimCodegen` and
+   `src/PIM/Compiler`).
+   Folds host constants, materializes remaining host constants, verifies PIM IR,
+   emits `.pim` core files, writes weights, and writes `memory.bin` /
+   `config.json`.

 Supporting pieces:
- `src/PIM/Compiler` — PIM-specific compiler options (crossbar size/count,
-  core count, DCP window, experimental conv impl, concat error handling, …)
-  and `PimCodeGen` entry points.
- `src/PIM/Common` — shared utilities (`PimCommon`, `LabeledList`).
- `src/PIM/Pass` — auxiliary passes (`MessagePass`, `CountInstructionPass`)
-  and the `PIMPasses.h` registry used by `PimAccelerator`.
- `src/PIM/PimAccelerator.{cpp,hpp}` — accelerator entry point: registers
-  dialects, passes, and plugs Raptor into the ONNX-MLIR driver.
+- `src/PIM/Common` - shared IR, filesystem, diagnostics, reports, and utility
+  helpers.
+- `src/PIM/Compiler` - PIM compiler options, memory/address planning, binary
+  instruction format, artifact writing, weight emission, and codegen entry
+  points.
+- `src/PIM/Conversion/SpatialToGraphviz` - optional Spatial graphviz conversion
+  pass.
+- `src/PIM/Pass` - pass registration and auxiliary passes.
+- `src/PIM/PimAccelerator.{cpp,hpp}` - ONNX-MLIR accelerator entry point.

 ## Key compiler options

-Pass these on the `onnx-mlir` command line when compiling for PIM:
+Pass these to `onnx-mlir` when compiling for PIM:

- `--maccel=PIM` — select the PIM accelerator.
- `--EmitSpatial` / `--EmitPim` / `--EmitPimBufferized` / `--EmitPimCodegen`
-  — stop the pipeline at the requested stage (default: `EmitPimCodegen`).
- `--pim-only-codegen` — assume the input is already bufferized PIM IR and
-  run only the codegen tail.
- `--crossbar-size=<N>` / `--crossbar-count=<N>` — crossbar dimensions and
-  per-core count.
- `--core-count=<N>` — number of cores (`-1` picks the minimum).
- `--dcp-critical-window-size=<N>` — DCP coarsening window (0 = legacy).
- `--use-experimental-conv-impl` — alternative convolution lowering.
- `--ignore-concat-error` — soft-fail corner case in `ConcatOp`.
+- `--maccel=PIM` - select the PIM accelerator.
+- `--EmitSpatial`, `--EmitPim`, `--EmitPimBufferized`,
+  `--EmitPimCodegen` - stop the PIM pipeline at the requested stage. The PIM
+  default is `--EmitPimCodegen`.
+- `--core-count=<N>` - required positive core count for PIM compilation.
+- `--crossbar-size=<N>` - crossbar width/height. Default in code is `2`.
+- `--crossbar-count=<N>` - crossbars per core. Default in code is `256`.
+- `--pim-merge-scheduler=peft` - merge scheduler. `peft` is the only accepted
+  value in the current code.
+- `--pim-only-codegen` - assume input is already bufferized PIM IR and only run
+  the codegen tail.
+- `--pim-emit-json` - also emit `core_*.json` instruction files alongside
+  `core_*.pim`.
+- `--use-experimental-conv-impl` - use the alternate convolution lowering.
+- `--ignore-concat-error` - soft-fail a ConcatOp corner case.
+
+Example:
+
+```bash
+./build_release/Release/bin/onnx-mlir model.onnx -o /tmp/raptor/model \
+  --maccel=PIM --EmitPimCodegen \
+  --crossbar-size=2048 --crossbar-count=256 --core-count=1000
+```
+
+This writes PIM artifacts under `/tmp/raptor/pim/`.

 ## Validation

-Functional validation lives in `validation/` and drives the Rust
-`pim-simulator` to compare Raptor's output against a reference.
+Functional validation lives in `validation/`. It compiles ONNX models, builds a
+native ONNX-MLIR reference runner, generates random inputs, runs Raptor, runs
+the Rust PIM simulator, and compares outputs.

-Per-operation validation (from `validation/`):
+Python dependencies used by the validation scripts are `numpy`, `onnx`, and
+`colorama`. The simulator requires the Rust toolchain.

-```
-validate.py \
-    --raptor-path ../cmake-build-release/Release/bin/onnx-mlir \
-    --onnx-include-dir ../onnx-mlir/include
+Per-operation validation from the repository root:
+
+```bash
+python3 validation/validate.py \
+  --raptor-path build_release/Release/bin/onnx-mlir \
+  --onnx-include-dir onnx-mlir/include \
+  --core-count 1000
 ```

-End-to-end network validation (example: first 4 layers of YOLOv11n):
+Validate one network or a subset by pointing `--operations-dir` at any directory
+containing `.onnx` files:

-```
-validate.py \
-    --raptor-path ../cmake-build-release/Release/bin/onnx-mlir \
-    --onnx-include-dir ../onnx-mlir/include \
-    --operations-dir ./networks/yolo11n/depth_04 \
-    --crossbar-size 2048 --crossbar-count 256 --core-count 1000
+```bash
+python3 validation/validate.py \
+  --raptor-path build_release/Release/bin/onnx-mlir \
+  --onnx-include-dir onnx-mlir/include \
+  --operations-dir validation/networks/yolo11n/depth_04 \
+  --crossbar-size 2048 --crossbar-count 256 --core-count 1000
 ```

-Available networks under `validation/networks/`: `vgg16`, `yolo11n`.
-Available operations under `validation/operations/`: `add`, `conv`, `div`,
-`gather`, `gemm`, `gemv`, `mul`, `pool`, `reduce_mean`, `relu`, `resize`,
-`sigmoid`, `softmax`, `split`.
+Useful validation options:
+- `--simulator-dir <path>` - override the auto-detected
+  `backend-simulators/pim/pim-simulator` path.
+- `--threshold <float>` - maximum allowed per-element output difference.
+- `--seed <int>` - RNG seed for generated inputs.
+- `--command-timeout-seconds <float>` - timeout for compiler, runner, and
+  simulator subprocesses.
+- `--verbose` - print subprocess logs and average PIM pass timings.
+- `--clean` - remove generated validation artifacts and exit.

-## Rebuilding
+Each validation run writes artifacts in the model workspace, for example under
+`validation/operations/gemm/small/`:
+- `inputs/` - generated input CSV files.
+- `outputs/` - native ONNX-MLIR reference outputs.
+- `raptor/` - compiler artifacts, including `*.onnx.mlir`, dialect dumps under
+  `dialects/`, reports under `reports/`, and final PIM artifacts under `pim/`.
+- `runner/` - generated reference runner source, build tree, and shared library.
+- `simulation/out.bin` - raw simulator output used for comparison.

-Release build (fast):
+The compiler currently dumps dialect snapshots such as `spatial0.mlir`,
+`spatial1_dcp_merged.mlir`, `pim0.mlir`, `pim1_buff.mlir`,
+`pim2_coalesced.mlir`, and `pim3_folded.mlir` when an output directory is
+available.

-```
-cmake --build /home/nico/raptor/raptor/cmake-build-release --target onnx-mlir -j 30
+To rerun the simulator manually with tracing after validation has produced a
+`raptor/pim/` directory:
+
+```bash
+cd backend-simulators/pim/pim-simulator
+cargo run --no-default-features --features tracing --release \
+  --package pim-simulator --bin pim-simulator -- \
+  -f /path/to/workspace/raptor/pim \
+  -o /path/to/workspace/simulation/out.bin \
+  -d <addr0>,<size0>,<addr1>,<size1>,...
 ```

-A slower debug build is also available — configure it the same way but with
-`-DCMAKE_BUILD_TYPE=Debug` (see installation instructions below).
+With `--features tracing`, the simulator writes per-core traces as
+`TraceCore0`, `TraceCore1`, ... next to `out.bin`. The validator normally
+computes the `-d` ranges from `raptor/pim/config.json` and model output shapes.
+
+Available validation networks under `validation/networks/`: `vgg16`,
+`yolo11n`, `yolo11nv2`.
+
+Available operation suites under `validation/operations/`: `add`, `concat`,
+`conv`, `div`, `gather`, `gemm`, `gemv`, `matmul`, `mul`, `pool`,
+`reduce_mean`, `relu`, `reshape`, `resize`, `sigmoid`, `softmax`, `split`.
+
+Generated operation tests can be regenerated with:
+
+```bash
+python3 validation/operations/gen_tests.py
+```

 ## Build

+Initialize submodules first:
+
+```bash
+git submodule update --init --recursive
+```
+
+The project follows ONNX-MLIR's build requirements. The CI workflow documents
+the currently used versions and setup:
+- CMake 4.3.0 in CI,
+- LLVM/MLIR checked out under `onnx-mlir/llvm-project`,
+- Protobuf `v34.0`,
+- Rust stable for `pim-simulator`,
+- Python packages `numpy`, `onnx`, `colorama` for validation.
+
 ### Protobuf

-Use the following commands to install protobuf:
-```
+Install Protobuf if your system does not already provide a compatible version:
+
+```bash
 git clone --depth 1 --branch v34.0 https://github.com/protocolbuffers/protobuf
-cd protobuf
-mkdir build
-cd build
-cmake .. -G Ninja -DCMAKE_BUILD_TYPE=Release
-ninja
-sudo ninja install
+cmake -S protobuf -B protobuf/build -G Ninja \
+  -DCMAKE_BUILD_TYPE=Release \
+  -Dprotobuf_BUILD_TESTS=OFF
+cmake --build protobuf/build
+sudo cmake --install protobuf/build
 ```

-You can now remove the protobuf repo directory with:
-```
-cd ../..
+You can then remove the temporary checkout:
+
+```bash
 rm -rf protobuf
 ```

-### Mlir
+### MLIR

-Follow the first part of instructions [here](onnx-mlir/docs/BuildOnLinuxOSX.md) to build mlir.
+Follow the ONNX-MLIR instructions in
+`onnx-mlir/docs/BuildOnLinuxOSX.md` to build LLVM/MLIR. The local Raptor build
+expects `MLIR_DIR` to point at the MLIR CMake package, for example:

-Remember to set ```-DCMAKE_BUILD_TYPE=Debug``` for developing on Raptor
-
-Moreover, if compiling with build type debug, it is also suggested to use
-mold as linker (you will need to install it if you don't have it already)
-to reduce memory usage during linking. You can use it by setting the options:
-```
-DLLVM_USE_LINKER=mold
+```bash
+MLIR_DIR=$(pwd)/onnx-mlir/llvm-project/build_release/lib/cmake/mlir
 ```

+If your LLVM build directory is named `build` instead of `build_release`, adjust
+the path accordingly.
+
 ### Raptor

-Use the following commands to build Raptor.
+Configure a release build:

-Remember to set ```-DCMAKE_BUILD_TYPE=Debug``` for developing on Raptor.
-
-Also in this case, it is suggested to use mold as linker to reduce link time and memory usage,
-setting the options:
-```
-DCMAKE_EXE_LINKER_FLAGS="-fuse-ld=mold" \
-DCMAKE_SHARED_LINKER_FLAGS="-fuse-ld=mold" \
-DCMAKE_MODULE_LINKER_FLAGS="-fuse-ld=mold"
-```
-
-```
-git submodule update --init --recursive
-
-MLIR_DIR=$(pwd)/onnx-mlir/llvm-project/build/lib/cmake/mlir
-mkdir build && cd build
-cmake .. -G Ninja \
+```bash
+MLIR_DIR=$(pwd)/onnx-mlir/llvm-project/build_release/lib/cmake/mlir
+cmake -S . -B build_release -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DONNX_MLIR_ACCELERATORS=PIM \
  -DLLVM_ENABLE_ASSERTIONS=ON \
  -DMLIR_DIR=${MLIR_DIR}
-cmake --build .
 ```

-If the build fails because of protobuf missing uint definitions,
-just patch the problematic files by adding ```#include <cstdint>``` to their includes.
+Configure a debug build similarly:
+
+```bash
+MLIR_DIR=$(pwd)/onnx-mlir/llvm-project/build_debug/lib/cmake/mlir
+cmake -S . -B build_debug -G Ninja \
+  -DCMAKE_BUILD_TYPE=Debug \
+  -DONNX_MLIR_ACCELERATORS=PIM \
+  -DLLVM_ENABLE_ASSERTIONS=ON \
+  -DMLIR_DIR=${MLIR_DIR}
+```
+
+For debug development, using `mold` can reduce link time and memory use:
+
+```bash
+cmake -S . -B build_debug -G Ninja \
+  -DCMAKE_BUILD_TYPE=Debug \
+  -DONNX_MLIR_ACCELERATORS=PIM \
+  -DLLVM_ENABLE_ASSERTIONS=ON \
+  -DMLIR_DIR=${MLIR_DIR} \
+  -DCMAKE_EXE_LINKER_FLAGS="-fuse-ld=mold" \
+  -DCMAKE_SHARED_LINKER_FLAGS="-fuse-ld=mold" \
+  -DCMAKE_MODULE_LINKER_FLAGS="-fuse-ld=mold"
+```
+
+Build the compiler with CMake:
+
+```bash
+cmake --build ./build_release
+cmake --build ./build_debug
+```
+
+Do not invoke `ninja` directly for this project; use `cmake --build` so CMake's
+configuration and generated shims stay consistent.
+
+If a build fails because Protobuf headers are missing fixed-width integer
+definitions, patch the affected Protobuf-generated files by adding
+`#include <cstdint>`.
+
+## Tests
+
+The Rust simulator has its own tests:
+
+```bash
+cd backend-simulators/pim/pim-simulator
+cargo test
+```
+
+## Repository Layout
+
+- `src/PIM/` - PIM accelerator implementation.
+- `test/PIM/` - PIM C++ unit tests.
+- `validation/` - functional validation scripts, ONNX operation tests, network
+  slices, and pimsim config generation.
+- `backend-simulators/pim/pim-simulator/` - in-tree Rust functional simulator.
+- `backend-simulators/pim/pimsim-nn/` - performance simulator submodule.
+- `pimcomp_utils/` - local comparison helpers for PIMCOMP-NN.
+- `.github/actions/` and `.github/workflows/validate_operations.yml` - CI setup
+  for MLIR/Protobuf caching, building Raptor, and validation.
@@ -43,7 +43,7 @@ struct Args {

    /// Comma separated list of (address,size) for memory output dump
    #[arg(short, long, value_delimiter = ',', num_args = 1.., value_name = "ADDR,SIZE")]
-    dump: Vec<i32>,
+    dump: Vec<usize>,
 }

 fn main() -> Result<()> {
@@ -67,7 +67,7 @@ fn main() -> Result<()> {
        .lock()
        .unwrap()
        .init(executor.cpu().num_core(), args.output.clone());
-    executor.execute();
+    executor.execute()?;
    dump_memory(executor, &args)?;
    Ok(())
 }
@@ -77,7 +77,7 @@ fn map_crossbars_to_cores<'c>(
    args: &Args,
    global_crossbars: &'c HashMap<String, Crossbar>,
 ) -> Vec<Vec<&'c Crossbar>> {
-    let mut res = Vec::new();
+    let mut res = vec![Vec::new()];
    let num_cores = config.get("core_cnt").unwrap().as_i64().unwrap() as i32;

    if let Some(folder) = args.folder.as_ref() {
@@ -168,7 +168,7 @@ fn get_crossbars(config: &Value, args: &Args) -> anyhow::Result<HashMap<String,
 }

 fn dump_memory(mut executor: pimcore::Executable, args: &Args) -> Result<()> {
-    let dumps: Vec<(i32, i32)> = args
+    let dumps: Vec<(usize, usize)> = args
        .dump
        .chunks_exact(2)
        .map(|chunk| (chunk[0], chunk[1]))
@@ -312,7 +312,7 @@ fn append_record(
        29 => {
            inst_data_builder
                .set_rd_u8(rd)
-                .set_imm_core(r2_or_imm)
+                .set_imm_core(r2_or_imm + 1)
                .set_imm_len(generic3)
                .set_offset_select_value(generic1, generic2);
            inst_builder.make_inst(send, inst_data_builder.build());
@@ -320,7 +320,7 @@ fn append_record(
        30 => {
            inst_data_builder
                .set_rd_u8(rd)
-                .set_imm_core(r2_or_imm)
+                .set_imm_core(r2_or_imm + 1)
                .set_imm_len(generic3)
                .set_offset_select_value(generic1, generic2);
            inst_builder.make_inst(recv, inst_data_builder.build());
@@ -366,23 +366,19 @@ fn binary_to_instructions(

 pub fn binary_to_executor<'a, 'b>(
    config: Value,
-    mut cores: impl Iterator<Item = &'b Vec<u8>>,
+    cores: impl Iterator<Item = &'b Vec<u8>>,
    crossbars: Vec<Vec<&'a Crossbar>>,
 ) -> Result<Executable<'a>> {
    let core_cnt = config
        .get("core_cnt")
        .context("missing core_cnt in config")?
        .as_i64()
-        .context("core_cnt is not an integer")? as i32
-        - 1;
+        .context("core_cnt is not an integer")? as i32;

    let cpu = CPU::new(core_cnt, crossbars);
    let mut core_insts_builder = CoreInstructionsBuilder::new(core_cnt as usize);
-    cores.next();
-    for core_indx in 1..=core_cnt {
-        let core_bytes = cores
-            .next()
-            .unwrap_or_else(|| panic!("cores files less than {}", core_indx));
+    for (external_core_indx, core_bytes) in cores.enumerate() {
+        let core_indx = external_core_indx as i32 + 1;
        let instructions = binary_to_instructions(core_bytes, core_indx)?;
        core_insts_builder.set_core(core_indx, instructions);
    }
@@ -396,6 +392,7 @@ mod tests {
        HEADER_SIZE, InstructionRecord, MAGIC, RECORD_SIZE, VERSION, binary_to_instructions,
    };
    use crate::{
+        functor_to_name,
        instruction_set::{InstructionsBuilder, instruction_data::InstructionDataBuilder},
        json_to_instruction::json_isa::json_to_instruction,
    };
@@ -490,7 +487,10 @@ mod tests {

        assert_eq!(json_instructions.len(), binary_instructions.len());
        for (json_inst, binary_inst) in json_instructions.iter().zip(binary_instructions.iter()) {
-            assert_eq!(json_inst.functor_name(), binary_inst.functor_name());
+            assert_eq!(
+                functor_to_name(json_inst.functor as usize),
+                functor_to_name(binary_inst.functor as usize)
+            );
            assert_eq!(json_inst.data, binary_inst.data);
        }
    }
@@ -1,3 +1,4 @@
+use crate::utility::AddressArg;
 use std::{collections::HashMap, fmt::Debug};
 use anyhow::{Context, Result, ensure};

@@ -9,6 +10,7 @@ use crate::{

 pub mod crossbar;

+
 #[derive(Debug, Clone)]
 pub struct CPU<'a> {
    cores: Box<[Core<'a>]>,
@@ -91,30 +93,26 @@ impl<'a> Core<'a> {
        self.memory.execute_load()
    }

-    pub fn execute_store<T>(&mut self, address: impl TryToUsize, element: &[T]) -> Result<()>
+    pub fn execute_store<T>(&mut self, address: impl AddressArg, element: &[T]) -> Result<()>
    where
        T: MemoryStorable,
    {
-        let address = address.try_into().context("address can not be negative")?;
+        let address = address.to_address_usize()?;
        self.memory.execute_store(address, element)
    }

    pub fn reserve_load(
        &mut self,
-        address: impl TryToUsize,
+        address: impl AddressArg,
        size: impl TryToUsize,
    ) -> Result<&mut CoreMemory> {
-        let address = address.try_into().context("address can not be negative")?;
+        let address = address.to_address_usize()?;
        let size = size.try_into().context("size can not be negative")?;
        self.memory.reserve_load(address, size)
    }

    pub fn set_register(&mut self, index: impl TryToUsize, value: i32) {
        let index = index.try_into().expect("index can not be negative");
-        assert!(
-            value >= 0,
-            "Register cannot be negative if happens remove this and go check where it's used as usize"
-        );
        self.registers[index] = value;
    }

@@ -123,11 +121,11 @@ impl<'a> Core<'a> {
        self.registers[index]
    }

-    pub fn load<T>(&mut self, address: impl TryToUsize, size: impl TryToUsize) -> Result<Vec<&[T]>>
+    pub fn load<T>(&mut self, address: impl AddressArg, size: impl TryToUsize) -> Result<Vec<&[T]>>
    where
        T: MemoryStorable,
    {
-        let address = address.try_into().context("address can not be negative")?;
+        let address = address.to_address_usize()?;
        let size = size.try_into().context("size can not be negative")?;
        self.memory.load(address, size)
    }
@@ -141,8 +139,8 @@ impl<'a> Core<'a> {
        (memory, crossbars)
    }

-    pub fn memset(&mut self, address: impl TryToUsize, size: impl TryToUsize, val: u8) -> Result<()> {
-        let address = address.try_into().context("address can not be negative")?;
+    pub fn memset(&mut self, address: impl AddressArg, size: impl TryToUsize, val: u8) -> Result<()> {
+        let address = address.to_address_usize()?;
        let size = size.try_into().context("size can not be negative")?;
        self.memory.memset(address, size, val)
    }
@@ -567,7 +567,7 @@ fn json_to_send(
    let (offset_select, offset_value) = json_to_offset(json.get("offset").unwrap());
    inst_data_builder
        .set_rd(rd)
-        .set_imm_core(core)
+        .set_imm_core(core + 1)
        .set_imm_len(size)
        .set_offset_select(offset_select)
        .set_offset_value(offset_value);
@@ -588,7 +588,7 @@ fn json_to_recv(
    let (offset_select, offset_value) = json_to_offset(json.get("offset").unwrap());
    inst_data_builder
        .set_rd(rd)
-        .set_imm_core(core)
+        .set_imm_core(core + 1)
        .set_imm_len(size)
        .set_offset_select(offset_select)
        .set_offset_value(offset_value);
@@ -1,49 +1,34 @@
-use core::panic;
-use std::io::{Read, Write};
-use std::{collections::HashMap, fs::File, io::BufReader};
-
-use serde_json::{Deserializer, Map, Value};
+use serde_json::Value;
+use std::{fs::File, io::BufReader};

 use crate::{
    CoreInstructionsBuilder, Executable,
-    cpu::{
-        CPU,
-        crossbar::{self, Crossbar},
-    },
-    instruction_set::{
-        InstructionsBuilder,
-        instruction_data::{self, InstructionData, InstructionDataBuilder},
-    },
-    json_to_instruction::{self, json_isa},
-    memory_manager::type_traits::TryToUsize,
+    cpu::{CPU, crossbar::Crossbar},
+    instruction_set::{InstructionsBuilder, instruction_data::InstructionDataBuilder},
+    json_to_instruction::json_isa,
 };

 pub fn json_to_executor<'a, 'b>(
    config: Value,
-    mut cores: &mut Vec<BufReader<File>>,
+    cores: &'b mut Vec<BufReader<File>>,
    crossbars: Vec<Vec<&'a Crossbar>>,
 ) -> Executable<'a> {
-    let cell_precision = config.get("cell_precision").unwrap().as_i64().unwrap() as i32;
-    let core_cnt = config.get("core_cnt").unwrap().as_i64().unwrap() as i32 - 1;
-    let xbar_count = config.get("xbar_array_count").unwrap().as_i64().unwrap() as i32;
-    let xbar_size = config.get("xbar_size").unwrap().as_array().unwrap();
-    let rows_crossbar = xbar_size[0].as_i64().unwrap() as i32;
-    let column_corssbar = xbar_size[1].as_i64().unwrap() as i32;
+    let core_cnt = config.get("core_cnt").unwrap().as_i64().unwrap() as i32;

-    let mut cpu = CPU::new(core_cnt, crossbars);
+    let cpu = CPU::new(core_cnt, crossbars);
    let mut core_insts_builder = CoreInstructionsBuilder::new(core_cnt as usize);
-    // Note: cores[0] is intentionally empty and discarded
-    for core_indx in 1..=core_cnt {
+    for (external_core_indx, json_core_reader) in cores.iter_mut().enumerate() {
+        let core_indx = external_core_indx as i32 + 1;
        let mut insts_builder = InstructionsBuilder::new();
        let mut inst_data_builder = InstructionDataBuilder::new();
        inst_data_builder.set_core_indx(core_indx).fix_core_indx();
-        let stream = Deserializer::from_reader(&mut cores[core_indx as usize]).into_iter::<Value>();
-
-        for (i, json_inst_result) in stream.enumerate() {
-            let json_inst = json_inst_result.expect("Failed to parse instruction");
-            // Pass the single Value to your parser
-            json_isa::json_to_instruction(&mut insts_builder, &mut inst_data_builder, &json_inst);
-            drop(json_inst);
+        let json_core: Value = serde_json::from_reader(json_core_reader)
+            .unwrap_or_else(|err| panic!("failed to parse core{}: {}", external_core_indx, err));
+        let json_core_insts = json_core
+            .as_array()
+            .unwrap_or_else(|| panic!("core{} has not a list of instruction", external_core_indx));
+        for json_inst in json_core_insts {
+            json_isa::json_to_instruction(&mut insts_builder, &mut inst_data_builder, json_inst);
        }
        core_insts_builder.set_core(core_indx, insts_builder.build());
    }
@@ -1,2 +1,2 @@
-mod json_isa;
+pub(crate) mod json_isa;
 pub mod json_to_executor;
@@ -1,3 +1,4 @@
+use std::cmp::min;
 use std::fmt::Debug;

 use anyhow::{Context, Result, bail, ensure};
@@ -86,7 +87,7 @@ where {
            size,
        };
        if self.memory.len() < address + size {
-            self.memory.resize((address + size) * 2, 0);
+            self.memory.resize(min((address + size) * 2, u32::MAX as usize), 0);
        }
        self.load_requests.push(load_request);
        Ok(self)
@@ -1,5 +1,6 @@
 #![allow(unused)]

+use anyhow::{Result, bail};
 use std::{
    collections::{HashMap, HashSet},
    time::{Duration, SystemTime},
@@ -87,6 +88,11 @@ pub struct Executable<'a> {
    send_recv: SendRecv,
 }

+struct DeadlockInfo {
+    cycle: String,
+    states: String,
+}
+
 fn print_status(core_instructions: &[CoreInstructions]) {
    let mut tot_instructions = 0;
    let mut progress = 0;
@@ -118,7 +124,7 @@ impl<'a> Executable<'a> {
        }
    }

-    pub fn execute<'b>(&'b mut self)
+    pub fn execute<'b>(&'b mut self) -> Result<()>
    where
        'a: 'b,
    {
@@ -153,7 +159,13 @@ impl<'a> Executable<'a> {
                }
                if (now.elapsed().unwrap() > Duration::from_secs(5)) {
                    print_status(cores_instructions);
-                    check_cycle(cpu, cores_instructions, send_recv);
+                    if let Some(deadlock) = detect_deadlock(cores_instructions) {
+                        bail!(
+                            "Deadlock cycle detected: {} [{}]",
+                            deadlock.cycle,
+                            deadlock.states
+                        );
+                    }
                    now = SystemTime::now();
                }
            }
@@ -178,8 +190,23 @@ impl<'a> Executable<'a> {
        }
        print_status(cores_instructions);

+        if let Some(deadlock) = detect_deadlock(cores_instructions) {
+            bail!(
+                "Deadlock cycle detected: {} [{}]",
+                deadlock.cycle,
+                deadlock.states
+            );
+        }
+        if cores_instructions
+            .iter()
+            .any(|core_inst| core_inst.program_counter < core_inst.instructions.len())
+        {
+            bail!("Execution stalled with unfinished instructions");
+        }
+
        #[cfg(feature = "profile_time")]
        TRACER.lock().unwrap().report();
+        Ok(())
    }

    pub fn cpu(&self) -> &CPU<'a> {
@@ -201,11 +228,11 @@ impl<'a> Executable<'a> {
    }
 }

-fn check_cycle(cpu: &mut CPU, cores_instructions: &[CoreInstructions], send_recv: &mut SendRecv) {
+fn detect_deadlock(cores_instructions: &[CoreInstructions]) -> Option<DeadlockInfo> {
    #[derive(Debug, PartialEq, Eq)]
    enum CoreState {
-        SendingTo(i32),
-        ReceivingFrom(i32),
+        SendingTo(i32, i32),
+        ReceivingFrom(i32, i32),
        Working,
        Halted,
    }
@@ -223,9 +250,9 @@ fn check_cycle(cpu: &mut CPU, cores_instructions: &[CoreInstructions], send_recv
        let (this_core, target_core) = data.get_core_immcore();

        if isa_recv(functor_address) {
-            states.insert(this_core, CoreState::ReceivingFrom(target_core));
+            states.insert(this_core, CoreState::ReceivingFrom(target_core, data.imm_len()));
        } else if isa_send(functor_address) {
-            states.insert(this_core, CoreState::SendingTo(target_core));
+            states.insert(this_core, CoreState::SendingTo(target_core, data.imm_len()));
        } else {
            states.insert(this_core, CoreState::Working);
        }
@@ -235,15 +262,15 @@ fn check_cycle(cpu: &mut CPU, cores_instructions: &[CoreInstructions], send_recv

    for (&core_id, state) in states.iter() {
        match state {
-            CoreState::SendingTo(target_core) => {
+            CoreState::SendingTo(target_core, size) => {
                let target_state = states.get(target_core).unwrap_or(&CoreState::Halted);
-                if target_state != &CoreState::ReceivingFrom(core_id) {
+                if target_state != &CoreState::ReceivingFrom(core_id, *size) {
                    wait_for.insert(core_id, *target_core);
                }
            }
-            CoreState::ReceivingFrom(target_core) => {
+            CoreState::ReceivingFrom(target_core, size) => {
                let target_state = states.get(target_core).unwrap_or(&CoreState::Halted);
-                if target_state != &CoreState::SendingTo(core_id) {
+                if target_state != &CoreState::SendingTo(core_id, *size) {
                    wait_for.insert(core_id, *target_core);
                }
            }
@@ -272,18 +299,41 @@ fn check_cycle(cpu: &mut CPU, cores_instructions: &[CoreInstructions], send_recv
            if in_path.contains(&waiting_for) {
                let cycle_start = path.iter().position(|&c| c == waiting_for).unwrap();
                let cycle = &path[cycle_start..];
+                let format_core = |core: &i32| (core - 1).to_string();

                let cycle_str = cycle
                    .iter()
-                    .map(|c| c.to_string())
+                    .map(format_core)
                    .collect::<Vec<_>>()
                    .join(" -> ");

-                let cycle_msg = format!("{} -> {}", cycle_str, waiting_for);
+                let cycle = cycle
+                    .iter()
+                    .copied()
+                    .chain(std::iter::once(waiting_for))
+                    .collect::<Vec<_>>();
+                let cycle_msg = format!("{} -> {}", cycle_str, waiting_for - 1);
+                let states_msg = cycle
+                    .iter()
+                    .filter_map(|core| {
+                        states.get(core).map(|state| match state {
+                            CoreState::SendingTo(target, size) => {
+                                format!("core {} send {}B -> {}", core - 1, size, target - 1)
+                            }
+                            CoreState::ReceivingFrom(source, size) => {
+                                format!("core {} recv {}B <- {}", core - 1, size, source - 1)
+                            }
+                            CoreState::Working => format!("core {} working", core - 1),
+                            CoreState::Halted => format!("core {} halted", core - 1),
+                        })
+                    })
+                    .collect::<Vec<_>>()
+                    .join(", ");

-                println!("Fatal: Deadlock cycle detected: {}", cycle_msg);
-                // bail!("Deadlock detected: {}", cycle_msg);
-                break; // Stop tracing
+                return Some(DeadlockInfo {
+                    cycle: cycle_msg,
+                    states: states_msg,
+                });
            }

            // Hit a known branch that didn't result in a cycle
@@ -294,6 +344,7 @@ fn check_cycle(cpu: &mut CPU, cores_instructions: &[CoreInstructions], send_recv
            current_core = waiting_for;
        }
    }
+    None
 }

 fn handle_wait_sync<'a, 'b, 'c>(
@@ -1,7 +1,45 @@
+use anyhow::{Result,Context};
 use std::{fmt::Debug, mem::transmute};

-use crate::memory_manager::type_traits::TryToUsize;

+pub trait AddressArg {
+    fn to_address_usize(self) -> Result<usize>;
+}
+
+impl AddressArg for usize {
+    fn to_address_usize(self) -> Result<usize> {
+        Ok(self)
+    }
+}
+
+impl AddressArg for u32 {
+    fn to_address_usize(self) -> Result<usize> {
+        Ok(self as usize)
+    }
+}
+
+impl AddressArg for u64 {
+    fn to_address_usize(self) -> Result<usize> {
+        usize::try_from(self).context("address does not fit in usize")
+    }
+}
+
+impl AddressArg for i32 {
+    fn to_address_usize(self) -> Result<usize> {
+        Ok(self as u32 as usize)
+    }
+}
+
+impl AddressArg for i64 {
+    fn to_address_usize(self) -> Result<usize> {
+        usize::try_from(self).context("address can not be negative")
+    }
+}
+
+
+fn address_to_usize(address: i32) -> usize {
+    address as u32 as usize
+}

 fn add_offset_impl(address: usize, offset_select : i32, offset_value : i32, id:i32) -> usize{
     assert!(offset_select == 1 || offset_select == 2 || offset_select == 4 || offset_value == 0, "offset_select not a bit field");
@@ -14,21 +52,21 @@ fn add_offset_impl(address: usize, offset_select : i32, offset_value : i32, id:i
 }


-pub fn add_offset_rd(address: impl TryToUsize, offset_select : i32, offset_value : i32) -> usize
+pub fn add_offset_rd(address: i32, offset_select : i32, offset_value : i32) -> usize
 {
-    let address = address.try_into().expect("address can not be negative");
+    let address = address_to_usize(address);
    add_offset_impl(address, offset_select, offset_value, 4)
 }

-pub fn add_offset_r1(address: impl TryToUsize, offset_select : i32, offset_value : i32) -> usize
+pub fn add_offset_r1(address: i32, offset_select : i32, offset_value : i32) -> usize
 {
-    let address = address.try_into().expect("address can not be negative");
+    let address = address_to_usize(address);
    add_offset_impl(address, offset_select, offset_value, 1)
 }

-pub fn add_offset_r2(address: impl TryToUsize, offset_select : i32, offset_value : i32) -> usize
+pub fn add_offset_r2(address: i32, offset_select : i32, offset_value : i32) -> usize
 {
-    let address = address.try_into().expect("address can not be negative");
+    let address = address_to_usize(address);
    add_offset_impl(address, offset_select, offset_value, 2)
 }

@@ -1,6 +1,11 @@
 use std::path::Path;

-use pimcore::{Executable, cpu::CPU, instruction_set::{InstructionsBuilder, instruction_data::InstructionDataBuilder, isa::*}};
+use pimcore::{
+    Executable,
+    cpu::crossbar::Crossbar,
+    instruction_set::{InstructionsBuilder, instruction_data::InstructionDataBuilder, isa::*},
+    memory_manager::CoreMemory,
+};

 fn simple_read(path:  &Path) -> Vec<f32> {
    if !path.exists() {
@@ -17,14 +22,12 @@ fn simple_read(path:  &Path) -> Vec<f32> {
 fn mvmul_f32(err: &str)
 where
 {
-    let mut cpu = CPU::new(0);
-    cpu.reserve_crossbar(1, 1024 * size_of::<f32>(),  1024);
-    let (memory, crossbars) = cpu.host().get_memory_crossbar();
-    let matrix = simple_read(Path::new("B.txt")) ;
-
-    
-    crossbars.get_mut(0).unwrap().execute_store( &matrix).unwrap();
-    let vector = simple_read(Path::new("A.txt"));
+    let matrix = simple_read(Path::new("tests/B.txt"));
+    let mut crossbar = Crossbar::new(1024 * size_of::<f32>(), 1024, CoreMemory::new());
+    crossbar.execute_store(&matrix).unwrap();
+    let mut cpu = pimcore::cpu::CPU::new(0, vec![vec![&crossbar]]);
+    let (memory, _) = cpu.host().get_memory_crossbar();
+    let vector = simple_read(Path::new("tests/A.txt"));
    memory.execute_store(0, &vector).unwrap();

    let mut inst_builder = InstructionsBuilder::new();
@@ -57,7 +60,7 @@ where
            .cpu_mut()
            .host()
            .load::<f32>(1024 * size_of::<f32>(), 1024*size_of::<f32>()).unwrap()[0].iter().zip(
-        simple_read(Path::new("X.txt")) ).all(|(&a,b) : (&f32, f32)| {a-b < 0.001}),
+        simple_read(Path::new("tests/X.txt")) ).all(|(&a,b) : (&f32, f32)| {a-b < 0.001}),
        "Wrong result for {}",
        err
    );
@@ -69,5 +72,3 @@ fn mvmul_big_test() {
   

 }
-
-
@@ -0,0 +1,5 @@
+use pimcore::cpu::CPU;
+
+pub fn empty_cpu(num_cores: usize) -> CPU<'static> {
+    CPU::new(num_cores, vec![Vec::new(); num_cores + 1])
+}
@@ -1,51 +1,103 @@
-use std::{fs, io::BufReader, path::Path};
+use std::{
+    fs::{self, File},
+    io::BufReader,
+    path::{Path, PathBuf},
+};

 use anyhow::{Context, Result};
-use pimcore::json_to_instruction::json_to_executor;
+use pimcore::{
+    cpu::crossbar::Crossbar,
+    json_to_instruction::json_to_executor,
+    memory_manager::CoreMemory,
+};
 use serde_json::Value;

-fn collect_json_from_subfolders<P: AsRef<Path>>(root: P) -> Result<Vec<(Value, Vec<Value>)>> {
+fn collect_examples<P: AsRef<Path>>(root: P) -> Result<Vec<PathBuf>> {
    let mut result = Vec::new();
    for entry in fs::read_dir(root)? {
        let entry = entry.context("Root not found")?;
        let path = entry.path();
        if path.is_dir() {
-            let mut cores = Vec::new();
-            let mut config: Option<Value> = None;
-            for sub_entry in fs::read_dir(&path)
-                .with_context(|| format!("File {} not readable", path.display()))?
-            {
-                let sub_entry =
-                    sub_entry.with_context(|| format!("File {} not readable", path.display()))?;
-                let sub_path = sub_entry.path();
-                if sub_path.is_file()
-                    && sub_path.extension().and_then(|s| s.to_str()) == Some("json")
-                {
-                    let file = fs::File::open(&sub_path)
-                        .with_context(|| format!("Subpath  {} not opened", sub_path.display()))?;
-                    let reader = BufReader::new(file);
-                    let val: Value = serde_json::from_reader(reader).with_context(|| format!(
-                        "Serde reader fail for subpath {}",
-                        sub_path.display()
-                    ))?;
-                    if sub_path.file_name().unwrap() == "config.json" {
-                        config = Some(val);
-                    } else {
-                        cores.push(val);
-                    }
-                }
-            }
-            result.push((config.unwrap(), cores));
+            result.push(path);
        }
    }
    Ok(result)
 }

+fn core_sort_key(path: &Path) -> i32 {
+    let stem = path.file_stem().unwrap().to_str().unwrap();
+    stem[5..].parse::<i32>().unwrap()
+}
+
+fn crossbar_sort_key(path: &Path) -> i32 {
+    let stem = path.file_stem().unwrap().to_str().unwrap();
+    stem[9..].parse::<i32>().unwrap()
+}
+
+fn load_crossbars(folder: &Path, config: &Value) -> Result<Vec<Vec<Crossbar>>> {
+    let xbar_size = config.get("xbar_size").unwrap().as_array().unwrap();
+    let rows = xbar_size[0].as_i64().unwrap() as usize;
+    let cols = xbar_size[1].as_i64().unwrap() as usize;
+    let core_cnt = config.get("core_cnt").unwrap().as_i64().unwrap() as usize;
+    let mut owned_crossbars = Vec::with_capacity(core_cnt + 1);
+    owned_crossbars.push(Vec::new());
+
+    for core_idx in 0..core_cnt {
+        let core_folder = folder.join(format!("core_{core_idx}"));
+        let mut core_crossbars = Vec::new();
+        if core_folder.is_dir() {
+            let mut paths: Vec<_> = fs::read_dir(&core_folder)?
+                .map(|entry| entry.map(|entry| entry.path()))
+                .collect::<std::io::Result<Vec<_>>>()?;
+            paths.sort_by_cached_key(|path| crossbar_sort_key(path));
+            for path in paths {
+                if path.extension().and_then(|ext| ext.to_str()) != Some("bin") {
+                    continue;
+                }
+                let bytes = fs::read(&path)
+                    .with_context(|| format!("failed to read crossbar {}", path.display()))?;
+                let mut crossbar = Crossbar::new(cols * 4, rows, CoreMemory::new());
+                crossbar.execute_store(&bytes)?;
+                core_crossbars.push(crossbar);
+            }
+        }
+        owned_crossbars.push(core_crossbars);
+    }
+    Ok(owned_crossbars)
+}
+
 #[test]
 fn json_folder_tester() {
-    let examples = collect_json_from_subfolders("data").unwrap();
-    for example in examples {
-        let (config, cores) = example;
-        json_to_executor::json_to_executor(config, cores.iter()).execute();
+    let examples = collect_examples("tests/data").unwrap();
+    for folder in examples {
+        let config_path = folder.join("config.json");
+        let config_file = File::open(&config_path).unwrap();
+        let config: Value = serde_json::from_reader(BufReader::new(config_file)).unwrap();
+
+        let core_cnt = config.get("core_cnt").unwrap().as_i64().unwrap() as usize;
+        let mut core_paths: Vec<_> = fs::read_dir(&folder)
+            .unwrap()
+            .map(|entry| entry.unwrap().path())
+            .filter(|path| path.extension().and_then(|ext| ext.to_str()) == Some("json"))
+            .filter(|path| path.file_name().unwrap() != "config.json")
+            .collect();
+        core_paths.sort_by_cached_key(|path| core_sort_key(path));
+        assert_eq!(core_paths.len(), core_cnt);
+
+        let mut core_readers: Vec<_> = core_paths
+            .into_iter()
+            .map(|path| BufReader::new(File::open(path).unwrap()))
+            .collect();
+
+        let owned_crossbars = load_crossbars(&folder, &config).unwrap();
+        let crossbars = owned_crossbars
+            .iter()
+            .map(|core_crossbars| core_crossbars.iter().collect())
+            .collect();
+
+        let mut executable = json_to_executor::json_to_executor(config, &mut core_readers, crossbars);
+        let memory = fs::read(folder.join("memory.bin")).unwrap();
+        executable.cpu_mut().host().execute_store(0, &memory).unwrap();
+        executable.execute();
    }
 }
@@ -1,11 +1,17 @@
+mod common;

-        use pimcore::{Executable, cpu::CPU, instruction_set::{InstructionType,  InstructionsBuilder, instruction_data::InstructionDataBuilder, isa::*}};
+use pimcore::{
+    Executable,
+    instruction_set::{
+        InstructionType, InstructionsBuilder, instruction_data::InstructionDataBuilder, isa::*,
+    },
+};


 #[test]
 #[should_panic(expected = "Function not found for the requested size") ]
 fn wrong_size_place_holder() {
-    let cpu = CPU::new(0);
+    let cpu = common::empty_cpu(0);
    let mut inst_builder = InstructionsBuilder::new();
    let mut idata_build = InstructionDataBuilder::new();
    idata_build.set_core_indx(0).fix_core_indx();
@@ -30,7 +36,7 @@ fn wrong_size_place_holder() {


 fn  place_holder(inst :  InstructionType) {
-    let mut cpu = CPU::new(0);
+    let mut cpu = common::empty_cpu(0);
    let mut idata_build = InstructionDataBuilder::new();
    idata_build.set_core_indx(0).fix_core_indx();
    inst(&mut cpu, idata_build.build()).unwrap();
@@ -1,8 +1,10 @@
+mod common;
+
 use pimcore::{
    Executable,
-    cpu::CPU,
+    cpu::crossbar::Crossbar,
    instruction_set::{InstructionsBuilder, instruction_data::InstructionDataBuilder, isa::*},
-    memory_manager::{MemoryStorable, type_traits::UpcastDestTraits},
+    memory_manager::{CoreMemory, MemoryStorable, type_traits::UpcastDestTraits},
 };

 /// VVADD Test
@@ -11,7 +13,7 @@ where
    F: From<f32> + std::fmt::Debug + PartialEq<F> + MemoryStorable,
    T: From<f32> + std::fmt::Debug + PartialEq<T> + MemoryStorable,
 {
-    let mut cpu = CPU::new(0);
+    let mut cpu = common::empty_cpu(0);
    let buff: [F; _] = [
        1.0.into(),
        2.0.into(),
@@ -115,7 +117,7 @@ where
    F: From<f32> + std::fmt::Debug + PartialEq<F> + MemoryStorable,
    T: From<f32> + std::fmt::Debug + PartialEq<T> + MemoryStorable,
 {
-    let mut cpu = CPU::new(0);
+    let mut cpu = common::empty_cpu(0);
    let buff: [F; _] = [
        1.0.into(),
        2.0.into(),
@@ -219,7 +221,7 @@ where
    F: From<f32> + std::fmt::Debug + PartialEq<F> + MemoryStorable,
    T: From<f32> + std::fmt::Debug + PartialEq<T> + MemoryStorable,
 {
-    let mut cpu = CPU::new(0);
+    let mut cpu = common::empty_cpu(0);
    let buff: [F; _] = [
        1.0.into(),
        2.0.into(),
@@ -323,7 +325,7 @@ where
    F: From<f32> + std::fmt::Debug + PartialEq<F> + MemoryStorable,
    T: From<f32> + std::fmt::Debug + PartialEq<T> + MemoryStorable,
 {
-    let mut cpu = CPU::new(0);
+    let mut cpu = common::empty_cpu(0);
    let buff: [F; _] = [
        1.0.into(),
        2.0.into(),
@@ -420,7 +422,7 @@ where
    F: From<f32> + std::fmt::Debug + PartialEq<F> + MemoryStorable,
    T: From<f32> + std::fmt::Debug + PartialEq<T> + MemoryStorable,
 {
-    let mut cpu = CPU::new(0);
+    let mut cpu = common::empty_cpu(0);
    let buff: [F; _] = [
        9.0.into(),
        2.0.into(),
@@ -524,7 +526,7 @@ where
    F: From<f32> + std::fmt::Debug + PartialEq<F> + MemoryStorable,
    T: From<f32> + std::fmt::Debug + PartialEq<T> + MemoryStorable,
 {
-    let mut cpu = CPU::new(0);
+    let mut cpu = common::empty_cpu(0);
    let buff: [F; _] = [
        9.0.into(),
        2.0.into(),
@@ -562,6 +564,7 @@ where
        vavg,
        idata_build
            .set_rdr1r2(3, 1, 1)
+            .set_offset_select(1)
            .set_imm_len(8 * size_of::<F>() as i32)
            .build(),
    );
@@ -617,7 +620,7 @@ where
    F: From<f32> + std::fmt::Debug + PartialEq<F> + MemoryStorable,
    T: From<f32> + std::fmt::Debug + PartialEq<T> + MemoryStorable,
 {
-    let mut cpu = CPU::new(0);
+    let mut cpu = common::empty_cpu(0);
    let buff: [F; _] = [
        (-9.0).into(),
        2.0.into(),
@@ -717,7 +720,7 @@ where
    F: From<f32> + std::fmt::Debug + PartialEq<F> + MemoryStorable,
    T: From<f32> + std::fmt::Debug + PartialEq<T> + MemoryStorable + UpcastDestTraits<T>,
 {
-    let mut cpu = CPU::new(0);
+    let mut cpu = common::empty_cpu(0);
    let buff: [F; _] = [
        0.1.into(),
        0.2.into(),
@@ -819,7 +822,7 @@ where
    F: From<f32> + std::fmt::Debug + PartialEq<F> + MemoryStorable,
    T: From<f32> + std::fmt::Debug + PartialEq<T> + MemoryStorable + UpcastDestTraits<T>,
 {
-    let mut cpu = CPU::new(0);
+    let mut cpu = common::empty_cpu(0);
    let buff: [F; _] = [
        0.1.into(),
        0.2.into(),
@@ -923,9 +926,6 @@ where
    M: From<f32> + std::fmt::Debug + PartialEq<M> + MemoryStorable,
    T: From<f32> + std::fmt::Debug + PartialEq<T> + MemoryStorable + UpcastDestTraits<T>,
 {
-    let mut cpu = CPU::new(0);
-    cpu.reserve_crossbar(1, 4 * size_of::<M>(),  4);
-    let (memory, crossbars) = cpu.host().get_memory_crossbar();
    let matrix: [M; _] = [
        1.0.into(),
        2.0.into(),
@@ -944,7 +944,10 @@ where
        15.0.into(),
        16.0.into(),
    ];
-    crossbars.get_mut(0).unwrap().execute_store( &matrix).unwrap();
+    let mut crossbar = Crossbar::new(4 * size_of::<M>(), 4, CoreMemory::new());
+    crossbar.execute_store(&matrix).unwrap();
+    let mut cpu = pimcore::cpu::CPU::new(0, vec![vec![&crossbar]]);
+    let (memory, _) = cpu.host().get_memory_crossbar();
    let vector: [F; _] = [
        1.0.into(),
        2.0.into(),
@@ -1054,5 +1057,3 @@ fn mvmul_test() {
    mvmul_test_generic::<f64,f64,f64>("mvmul<f64,f64,f64>",1);

 }
-
-
@@ -1,12 +1,13 @@
+mod common;
+
 use pimcore::{
    Executable, CoreInstructionsBuilder,
-    cpu::CPU,
    instruction_set::{InstructionsBuilder, instruction_data::InstructionDataBuilder, isa::*},
 };

 #[test]
 fn ld_test() {
-    let mut cpu = CPU::new(1);
+    let mut cpu = common::empty_cpu(1);
    let mut core_instruction_builder = CoreInstructionsBuilder::new(1);
    let buff: [f32; _] = [
        1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0,
@@ -41,7 +42,7 @@ fn ld_test() {

 #[test]
 fn st_test() {
-    let mut cpu = CPU::new(1);
+    let mut cpu = common::empty_cpu(1);
    let mut core_instruction_builder = CoreInstructionsBuilder::new(1);
    let buff: [f32; _] = [
        1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0,
@@ -76,7 +77,7 @@ fn st_test() {

 #[test]
 fn lldi_test() {
-    let cpu = CPU::new(1);
+    let cpu = common::empty_cpu(1);
    let mut core_instruction_builder = CoreInstructionsBuilder::new(1);
    let mut inst_builder = InstructionsBuilder::new();
    let mut idata_build = InstructionDataBuilder::new();
@@ -106,7 +107,7 @@ fn lldi_test() {

 #[test]
 fn lmv_test() {
-    let mut cpu = CPU::new(1);
+    let mut cpu = common::empty_cpu(1);
    let mut core_instruction_builder = CoreInstructionsBuilder::new(1);
    let buff: [f32; _] = [
        1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0,
@@ -148,7 +149,7 @@ fn lmv_test() {

 #[test]
 fn simple_send_recv_test() {
-    let mut cpu = CPU::new(2);
+    let mut cpu = common::empty_cpu(2);
    let mut core_instruction_builder = CoreInstructionsBuilder::new(2);
    let buff: [f32; _] = [
        1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0,
@@ -207,7 +208,7 @@ fn simple_send_recv_test() {

 #[test]
 fn multiple_send_recv_test() {
-    let mut cpu = CPU::new(4);
+    let mut cpu = common::empty_cpu(4);
    let mut core_instruction_builder = CoreInstructionsBuilder::new(4);
    let buff: [f32; _] = [
       1.0, 1.0, 1.0, 1.0, 1.0 
@@ -226,7 +227,7 @@ fn multiple_send_recv_test() {
    ];
    cpu.core(4).execute_store(0, &buff).unwrap();

-    let send_inst = |cpu :&mut CPU, core_instruction_builder: &mut CoreInstructionsBuilder, inst_builder: &mut InstructionsBuilder, from : i32,  to : i32| {
+    let send_inst = |inst_builder: &mut InstructionsBuilder, from: i32, to: i32| {
    let mut idata_build = InstructionDataBuilder::new();
    idata_build.set_core_indx(from).fix_core_indx();
    inst_builder.make_inst(sldi, idata_build.set_rdimm(1, from*size_of::<f32>() as i32).build());
@@ -240,7 +241,7 @@ fn multiple_send_recv_test() {
    );
    };

-    let recv_inst = |cpu :&mut CPU, core_instruction_builder: &mut CoreInstructionsBuilder, mut inst_builder: &mut InstructionsBuilder, to : i32,  from : i32| {
+    let recv_inst = |inst_builder: &mut InstructionsBuilder, to: i32, from: i32| {
    let mut idata_build = InstructionDataBuilder::new();
    idata_build.set_core_indx(to).fix_core_indx();
    inst_builder.make_inst(sldi, idata_build.set_rdimm(1, from*size_of::<f32>() as i32).build());
@@ -257,26 +258,26 @@ fn multiple_send_recv_test() {


    // 1 -> 3
-    send_inst(&mut cpu,&mut core_instruction_builder,&mut inst_builder,1, 3);
+    send_inst(&mut inst_builder, 1, 3);
    core_instruction_builder.set_core(1, inst_builder.build());

    // 2 -> 3
    // 2 <- 4
-    send_inst(&mut cpu,&mut core_instruction_builder,&mut inst_builder,2, 3);
-    recv_inst(&mut cpu,&mut core_instruction_builder,&mut inst_builder,2, 4);
+    send_inst(&mut inst_builder, 2, 3);
+    recv_inst(&mut inst_builder, 2, 4);
    core_instruction_builder.set_core(2, inst_builder.build());

    // 3 <- 2
    // 3 <- 4
    // 3 <- 1
-    recv_inst(&mut cpu,&mut core_instruction_builder,&mut inst_builder,3, 2);
-    recv_inst(&mut cpu,&mut core_instruction_builder,&mut inst_builder,3, 4);
-    recv_inst(&mut cpu,&mut core_instruction_builder,&mut inst_builder,3, 1);
+    recv_inst(&mut inst_builder, 3, 2);
+    recv_inst(&mut inst_builder, 3, 4);
+    recv_inst(&mut inst_builder, 3, 1);
    core_instruction_builder.set_core(3, inst_builder.build());
    // 4 -> 2
    // 4 -> 3
-    send_inst(&mut cpu,&mut core_instruction_builder,&mut inst_builder,4, 2);
-    send_inst(&mut cpu,&mut core_instruction_builder,&mut inst_builder,4, 3);
+    send_inst(&mut inst_builder, 4, 2);
+    send_inst(&mut inst_builder, 4, 3);
    core_instruction_builder.set_core(4, inst_builder.build());

    let mut executable = Executable::new(cpu, core_instruction_builder.build());
@@ -10,6 +10,56 @@ set(PIM_INCLUDE_PATH ${CMAKE_INCLUDE_OUTPUT_DIRECTORY})
 set(PIM_ONNX_MLIR_SRC_ROOT ${ONNX_MLIR_SRC_ROOT})
 set(PIM_ONNX_MLIR_BIN_ROOT ${ONNX_MLIR_BIN_ROOT})

+set(PIM_GENERATED_PATH_SHIM_TARGET "")
+get_filename_component(PIM_BIN_ROOT_NAME "${PIM_BIN_ROOT}" NAME)
+if (PIM_BIN_ROOT_NAME STREQUAL "raptor-external")
+  get_filename_component(PIM_GENERATED_PATH_SHIM_ROOT "${PIM_BIN_ROOT}" DIRECTORY)
+  set(PIM_GENERATED_PATH_SHIM_OUTPUTS)
+
+  function(add_pim_generated_path_shim relative_path)
+    set(real_file "${PIM_BIN_ROOT}/${relative_path}")
+    set(shim_file "${PIM_GENERATED_PATH_SHIM_ROOT}/${relative_path}")
+    get_filename_component(shim_dir "${shim_file}" DIRECTORY)
+
+    add_custom_command(
+      OUTPUT "${shim_file}"
+      DEPENDS "${real_file}"
+      COMMAND "${CMAKE_COMMAND}" -E make_directory "${shim_dir}"
+      COMMAND "${CMAKE_COMMAND}" -E rm -f "${shim_file}"
+      COMMAND "${CMAKE_COMMAND}" -E create_symlink "${real_file}" "${shim_file}"
+      VERBATIM
+    )
+
+    list(APPEND PIM_GENERATED_PATH_SHIM_OUTPUTS "${shim_file}")
+    set(PIM_GENERATED_PATH_SHIM_OUTPUTS "${PIM_GENERATED_PATH_SHIM_OUTPUTS}" PARENT_SCOPE)
+  endfunction()
+
+  file(GLOB_RECURSE pim_generated_path_scan_sources
+    CONFIGURE_DEPENDS
+    "${PIM_SRC_ROOT}/*.cpp"
+    "${PIM_SRC_ROOT}/*.hpp"
+  )
+
+  set(pim_generated_path_shims)
+  foreach (source_file IN LISTS pim_generated_path_scan_sources)
+    file(READ "${source_file}" source_contents)
+    string(REGEX MATCHALL "#include \"src/Accelerators/PIM/[^\"]+\\.inc\"" source_inc_matches "${source_contents}")
+
+    foreach (inc_match IN LISTS source_inc_matches)
+      string(REGEX REPLACE "^#include \"src/Accelerators/PIM/(.+)\"$" "\\1" relative_inc_path "${inc_match}")
+      list(APPEND pim_generated_path_shims "${relative_inc_path}")
+    endforeach ()
+  endforeach ()
+
+  list(REMOVE_DUPLICATES pim_generated_path_shims)
+  foreach (relative_inc_path IN LISTS pim_generated_path_shims)
+    add_pim_generated_path_shim("${relative_inc_path}")
+  endforeach ()
+
+  add_custom_target(OMPimGeneratedPathShims DEPENDS ${PIM_GENERATED_PATH_SHIM_OUTPUTS})
+  set(PIM_GENERATED_PATH_SHIM_TARGET OMPimGeneratedPathShims)
+endif ()
+
 set(PIM_PUBLIC_INCLUDE_DIRS
  ${ONNX_MLIR_SRC_ROOT}/include
  ${ONNX_MLIR_SRC_ROOT}
@@ -37,6 +87,9 @@ set(PIM_GENERATED_INCLUDE_DIRS

 function(add_pim_library name)
  add_onnx_mlir_library(${name} STATIC ${ARGN})
+  if (PIM_GENERATED_PATH_SHIM_TARGET)
+    add_dependencies(${name} ${PIM_GENERATED_PATH_SHIM_TARGET})
+  endif ()
 endfunction()

 add_subdirectory(Dialect)
@@ -68,6 +121,8 @@ add_pim_library(OMPIMAccel
  OMSpatialToPim
  OMPimCommon
  OMPimBufferization
-  OMPimStaticMemoryCoalescing
+  OMPimMemoryCoalescing
+  OMPimHostConstantFolding
+  OMPimVerification
  MLIRTensorInferTypeOpInterfaceImpl
 )
@@ -1,10 +1,15 @@
 add_pim_library(OMPimCommon
+  IR/AffineUtils.cpp
  IR/AddressAnalysis.cpp
+  IR/BatchCoreUtils.cpp
+  IR/ConstantUtils.cpp
  IR/CoreBlockUtils.cpp
  IR/EntryPointUtils.cpp
+  IR/LoopUtils.cpp
  IR/ShapeUtils.cpp
  IR/SubviewUtils.cpp
  IR/WeightUtils.cpp
+  Support/CheckedArithmetic.cpp
  Support/DebugDump.cpp
  Support/Diagnostics.cpp
  Support/FileSystemUtils.cpp
@@ -16,6 +21,8 @@ add_pim_library(OMPimCommon
  ${PIM_PUBLIC_INCLUDE_DIRS}

  LINK_LIBS PUBLIC
+  MLIRLinalgDialect
+  MLIRSCFDialect
  onnx
  SpatialOps
  PimOps
@@ -1,7 +1,11 @@
 #include "mlir/Dialect/Arith/IR/Arith.h"
+#include "mlir/Dialect/MemRef/IR/MemRef.h"
 #include "mlir/Dialect/SCF/IR/SCF.h"
+#include "mlir/IR/BuiltinAttributes.h"
 #include "mlir/Interfaces/DestinationStyleOpInterface.h"

+#include <limits>
+
 #include "src/Accelerators/PIM/Common/IR/AddressAnalysis.hpp"
 #include "src/Accelerators/PIM/Common/IR/ShapeUtils.hpp"
 #include "src/Accelerators/PIM/Dialect/Pim/PimOps.hpp"
@@ -28,6 +32,14 @@ mlir::Value resolveAlias(mlir::Value value, const StaticValueKnowledge* knowledg
  return value;
 }

+llvm::FailureOr<CompiledIndexExpr> compileIndexValueImpl(mlir::Value value);
+llvm::FailureOr<CompiledAddressExpr> compileContiguousAddressExprImpl(mlir::Value value);
+
+template <typename... Args>
+CompiledIndexExpr makeCompiledIndexExpr(Args&&... args) {
+  return CompiledIndexExpr(std::make_shared<CompiledIndexExprNode>(std::forward<Args>(args)...));
+}
+
 mlir::Value resolveLoopCarriedAliasImpl(mlir::Value value, const StaticValueKnowledge* knowledge) {
  value = resolveAlias(value, knowledge);

@@ -55,6 +67,288 @@ mlir::Value resolveLoopCarriedAliasImpl(mlir::Value value, const StaticValueKnow
 }

 llvm::FailureOr<int64_t> resolveOpFoldResult(mlir::OpFoldResult ofr, const StaticValueKnowledge* knowledge);
+llvm::FailureOr<int64_t> resolveIndexValueImpl(mlir::Value value, const StaticValueKnowledge* knowledge);
+
+static llvm::FailureOr<llvm::SmallVector<int64_t>> getStaticMemRefStrides(mlir::MemRefType type) {
+  llvm::SmallVector<int64_t> strides;
+  int64_t offset = 0;
+  if (failed(type.getStridesAndOffset(strides, offset)))
+    return mlir::failure();
+  if (llvm::any_of(strides, mlir::ShapedType::isDynamic))
+    return mlir::failure();
+  return strides;
+}
+
+static llvm::FailureOr<int64_t> resolveConstantGlobalLoad(mlir::memref::LoadOp loadOp,
+                                                          const StaticValueKnowledge* knowledge) {
+  auto getGlobalOp = loadOp.getMemRef().getDefiningOp<mlir::memref::GetGlobalOp>();
+  if (!getGlobalOp)
+    return mlir::failure();
+
+  auto moduleOp = loadOp->getParentOfType<mlir::ModuleOp>();
+  auto globalOp = lookupGlobalForGetGlobal(moduleOp, getGlobalOp);
+  if (!globalOp || !globalOp.getConstant() || !globalOp.getInitialValue())
+    return mlir::failure();
+
+  auto denseAttr = mlir::dyn_cast<mlir::DenseElementsAttr>(*globalOp.getInitialValue());
+  auto globalType = mlir::dyn_cast<mlir::MemRefType>(getGlobalOp.getType());
+  if (!denseAttr || !globalType || !globalType.hasStaticShape())
+    return mlir::failure();
+
+  auto elementType = denseAttr.getElementType();
+  if (!elementType.isIndex() && !elementType.isInteger())
+    return mlir::failure();
+
+  llvm::SmallVector<int64_t> indices;
+  indices.reserve(loadOp.getIndices().size());
+  for (mlir::Value index : loadOp.getIndices()) {
+    auto resolvedIndex = resolveIndexValueImpl(index, knowledge);
+    if (failed(resolvedIndex))
+      return mlir::failure();
+    indices.push_back(*resolvedIndex);
+  }
+
+  if (indices.size() != static_cast<size_t>(globalType.getRank()))
+    return mlir::failure();
+
+  auto strides = computeRowMajorStrides(globalType.getShape());
+  int64_t linearIndex = linearizeIndex(indices, strides);
+  if (linearIndex < 0 || linearIndex >= globalType.getNumElements())
+    return mlir::failure();
+
+  return denseAttr.getValues<llvm::APInt>()[linearIndex].getSExtValue();
+}
+
+static bool evaluateCmpPredicate(mlir::arith::CmpIPredicate predicate, int64_t lhs, int64_t rhs) {
+  switch (predicate) {
+  case mlir::arith::CmpIPredicate::eq:  return lhs == rhs;
+  case mlir::arith::CmpIPredicate::ne:  return lhs != rhs;
+  case mlir::arith::CmpIPredicate::slt: return lhs < rhs;
+  case mlir::arith::CmpIPredicate::sle: return lhs <= rhs;
+  case mlir::arith::CmpIPredicate::sgt: return lhs > rhs;
+  case mlir::arith::CmpIPredicate::sge: return lhs >= rhs;
+  case mlir::arith::CmpIPredicate::ult: return static_cast<uint64_t>(lhs) < static_cast<uint64_t>(rhs);
+  case mlir::arith::CmpIPredicate::ule: return static_cast<uint64_t>(lhs) <= static_cast<uint64_t>(rhs);
+  case mlir::arith::CmpIPredicate::ugt: return static_cast<uint64_t>(lhs) > static_cast<uint64_t>(rhs);
+  case mlir::arith::CmpIPredicate::uge: return static_cast<uint64_t>(lhs) >= static_cast<uint64_t>(rhs);
+  }
+
+  llvm_unreachable("unknown cmpi predicate");
+}
+
+llvm::FailureOr<int64_t> evaluateCompiledIndexExpr(const CompiledIndexExpr& expr,
+                                                   const StaticValueKnowledge& knowledge) {
+  if (!expr.node)
+    return mlir::failure();
+
+  switch (expr.node->kind) {
+  case CompiledIndexExprNode::Kind::Constant: return expr.node->constant;
+  case CompiledIndexExprNode::Kind::Symbol:   {
+    auto value = resolveAlias(expr.node->symbol, &knowledge);
+    auto iter = knowledge.indexValues.find(value);
+    if (iter != knowledge.indexValues.end())
+      return iter->second;
+    return mlir::failure();
+  }
+  case CompiledIndexExprNode::Kind::Add:
+  case CompiledIndexExprNode::Kind::Sub:
+  case CompiledIndexExprNode::Kind::Mul:
+  case CompiledIndexExprNode::Kind::DivUI:
+  case CompiledIndexExprNode::Kind::DivSI:
+  case CompiledIndexExprNode::Kind::RemUI:
+  case CompiledIndexExprNode::Kind::RemSI:
+  case CompiledIndexExprNode::Kind::MinUI:
+  case CompiledIndexExprNode::Kind::CmpI:  {
+    auto lhs = evaluateCompiledIndexExpr(expr.node->operands[0], knowledge);
+    auto rhs = evaluateCompiledIndexExpr(expr.node->operands[1], knowledge);
+    if (failed(lhs) || failed(rhs))
+      return mlir::failure();
+
+    switch (expr.node->kind) {
+    case CompiledIndexExprNode::Kind::Add: return *lhs + *rhs;
+    case CompiledIndexExprNode::Kind::Sub: return *lhs - *rhs;
+    case CompiledIndexExprNode::Kind::Mul: return *lhs * *rhs;
+    case CompiledIndexExprNode::Kind::DivUI:
+      if (*rhs == 0)
+        return mlir::failure();
+      return static_cast<int64_t>(static_cast<uint64_t>(*lhs) / static_cast<uint64_t>(*rhs));
+    case CompiledIndexExprNode::Kind::DivSI:
+      if (*rhs == 0 || (*lhs == std::numeric_limits<int64_t>::min() && *rhs == -1))
+        return mlir::failure();
+      return *lhs / *rhs;
+    case CompiledIndexExprNode::Kind::RemUI:
+      if (*rhs == 0)
+        return mlir::failure();
+      return static_cast<int64_t>(static_cast<uint64_t>(*lhs) % static_cast<uint64_t>(*rhs));
+    case CompiledIndexExprNode::Kind::RemSI:
+      if (*rhs == 0)
+        return mlir::failure();
+      if (*lhs == std::numeric_limits<int64_t>::min() && *rhs == -1)
+        return 0;
+      return *lhs % *rhs;
+    case CompiledIndexExprNode::Kind::MinUI:
+      return static_cast<int64_t>(std::min(static_cast<uint64_t>(*lhs), static_cast<uint64_t>(*rhs)));
+    case CompiledIndexExprNode::Kind::CmpI: return evaluateCmpPredicate(expr.node->predicate, *lhs, *rhs) ? 1 : 0;
+    default:                                llvm_unreachable("unexpected binary compiled index kind");
+    }
+  }
+  case CompiledIndexExprNode::Kind::Select: {
+    auto condition = evaluateCompiledIndexExpr(expr.node->operands[0], knowledge);
+    if (failed(condition))
+      return mlir::failure();
+    return evaluateCompiledIndexExpr(*condition != 0 ? expr.node->operands[1] : expr.node->operands[2], knowledge);
+  }
+  case CompiledIndexExprNode::Kind::ConstantGlobalLoad: {
+    if (!expr.node->globalOp || !expr.node->globalOp.getInitialValue())
+      return mlir::failure();
+
+    auto denseAttr = mlir::dyn_cast<mlir::DenseElementsAttr>(*expr.node->globalOp.getInitialValue());
+    auto globalType = mlir::dyn_cast<mlir::MemRefType>(expr.node->globalOp.getType());
+    if (!denseAttr || !globalType)
+      return mlir::failure();
+
+    llvm::SmallVector<int64_t> indices;
+    indices.reserve(expr.node->operands.size());
+    for (const CompiledIndexExpr& operand : expr.node->operands) {
+      auto resolvedIndex = evaluateCompiledIndexExpr(operand, knowledge);
+      if (failed(resolvedIndex))
+        return mlir::failure();
+      indices.push_back(*resolvedIndex);
+    }
+
+    int64_t linearIndex = linearizeIndex(indices, expr.node->globalStrides);
+    if (linearIndex < 0 || linearIndex >= globalType.getNumElements())
+      return mlir::failure();
+    return denseAttr.getValues<llvm::APInt>()[linearIndex].getSExtValue();
+  }
+  }
+
+  llvm_unreachable("unknown compiled index kind");
+}
+
+llvm::FailureOr<CompiledIndexExpr> compileConstantGlobalLoad(mlir::memref::LoadOp loadOp) {
+  auto getGlobalOp = loadOp.getMemRef().getDefiningOp<mlir::memref::GetGlobalOp>();
+  if (!getGlobalOp)
+    return mlir::failure();
+
+  auto moduleOp = loadOp->getParentOfType<mlir::ModuleOp>();
+  auto globalOp = lookupGlobalForGetGlobal(moduleOp, getGlobalOp);
+  if (!globalOp || !globalOp.getConstant() || !globalOp.getInitialValue())
+    return mlir::failure();
+
+  auto denseAttr = mlir::dyn_cast<mlir::DenseElementsAttr>(*globalOp.getInitialValue());
+  auto globalType = mlir::dyn_cast<mlir::MemRefType>(getGlobalOp.getType());
+  if (!denseAttr || !globalType || !globalType.hasStaticShape())
+    return mlir::failure();
+
+  auto elementType = denseAttr.getElementType();
+  if (!elementType.isIndex() && !elementType.isInteger())
+    return mlir::failure();
+
+  CompiledIndexExprNode expr;
+  expr.kind = CompiledIndexExprNode::Kind::ConstantGlobalLoad;
+  expr.globalOp = globalOp;
+  expr.globalStrides = computeRowMajorStrides(globalType.getShape());
+  expr.operands.reserve(loadOp.getIndices().size());
+  for (mlir::Value index : loadOp.getIndices()) {
+    auto compiledIndex = compileIndexValueImpl(index);
+    if (failed(compiledIndex))
+      return mlir::failure();
+    expr.operands.push_back(*compiledIndex);
+  }
+  return makeCompiledIndexExpr(std::move(expr));
+}
+
+llvm::FailureOr<CompiledIndexExpr> compileIndexValueImpl(mlir::Value value) {
+  if (auto constantOp = value.getDefiningOp<mlir::arith::ConstantOp>()) {
+    if (auto integerAttr = mlir::dyn_cast<mlir::IntegerAttr>(constantOp.getValue())) {
+      CompiledIndexExprNode expr;
+      expr.kind = CompiledIndexExprNode::Kind::Constant;
+      expr.constant = integerAttr.getInt();
+      return makeCompiledIndexExpr(std::move(expr));
+    }
+  }
+
+  mlir::Operation* definingOp = value.getDefiningOp();
+  if (!definingOp) {
+    CompiledIndexExprNode expr;
+    expr.kind = CompiledIndexExprNode::Kind::Symbol;
+    expr.symbol = value;
+    return makeCompiledIndexExpr(std::move(expr));
+  }
+
+  auto buildBinaryExpr = [&](CompiledIndexExprNode::Kind kind, mlir::Value lhsValue, mlir::Value rhsValue) {
+    auto lhs = compileIndexValueImpl(lhsValue);
+    auto rhs = compileIndexValueImpl(rhsValue);
+    if (failed(lhs) || failed(rhs))
+      return llvm::FailureOr<CompiledIndexExpr>(mlir::failure());
+    CompiledIndexExprNode expr;
+    expr.kind = kind;
+    expr.operands = {*lhs, *rhs};
+    return llvm::FailureOr<CompiledIndexExpr>(makeCompiledIndexExpr(std::move(expr)));
+  };
+
+  if (auto indexCastOp = mlir::dyn_cast<mlir::arith::IndexCastOp>(definingOp))
+    return compileIndexValueImpl(indexCastOp.getIn());
+  if (auto addOp = mlir::dyn_cast<mlir::arith::AddIOp>(definingOp))
+    return buildBinaryExpr(CompiledIndexExprNode::Kind::Add, addOp.getLhs(), addOp.getRhs());
+  if (auto subOp = mlir::dyn_cast<mlir::arith::SubIOp>(definingOp))
+    return buildBinaryExpr(CompiledIndexExprNode::Kind::Sub, subOp.getLhs(), subOp.getRhs());
+  if (auto mulOp = mlir::dyn_cast<mlir::arith::MulIOp>(definingOp))
+    return buildBinaryExpr(CompiledIndexExprNode::Kind::Mul, mulOp.getLhs(), mulOp.getRhs());
+  if (auto divOp = mlir::dyn_cast<mlir::arith::DivUIOp>(definingOp))
+    return buildBinaryExpr(CompiledIndexExprNode::Kind::DivUI, divOp.getLhs(), divOp.getRhs());
+  if (auto divOp = mlir::dyn_cast<mlir::arith::DivSIOp>(definingOp))
+    return buildBinaryExpr(CompiledIndexExprNode::Kind::DivSI, divOp.getLhs(), divOp.getRhs());
+  if (auto minOp = mlir::dyn_cast<mlir::arith::MinUIOp>(definingOp))
+    return buildBinaryExpr(CompiledIndexExprNode::Kind::MinUI, minOp.getLhs(), minOp.getRhs());
+  if (auto remOp = mlir::dyn_cast<mlir::arith::RemUIOp>(definingOp))
+    return buildBinaryExpr(CompiledIndexExprNode::Kind::RemUI, remOp.getLhs(), remOp.getRhs());
+  if (auto remOp = mlir::dyn_cast<mlir::arith::RemSIOp>(definingOp))
+    return buildBinaryExpr(CompiledIndexExprNode::Kind::RemSI, remOp.getLhs(), remOp.getRhs());
+  if (auto cmpOp = mlir::dyn_cast<mlir::arith::CmpIOp>(definingOp)) {
+    auto expr = buildBinaryExpr(CompiledIndexExprNode::Kind::CmpI, cmpOp.getLhs(), cmpOp.getRhs());
+    if (failed(expr))
+      return mlir::failure();
+    auto exprNode = std::make_shared<CompiledIndexExprNode>(*expr->node);
+    exprNode->predicate = cmpOp.getPredicate();
+    return CompiledIndexExpr(exprNode);
+  }
+  if (auto maxOp = mlir::dyn_cast<mlir::arith::MaxUIOp>(definingOp)) {
+    auto lhs = compileIndexValueImpl(maxOp.getLhs());
+    auto rhs = compileIndexValueImpl(maxOp.getRhs());
+    if (failed(lhs) || failed(rhs))
+      return mlir::failure();
+
+    CompiledIndexExprNode cmpExpr;
+    cmpExpr.kind = CompiledIndexExprNode::Kind::CmpI;
+    cmpExpr.predicate = mlir::arith::CmpIPredicate::uge;
+    cmpExpr.operands = {*lhs, *rhs};
+
+    CompiledIndexExprNode selectExpr;
+    selectExpr.kind = CompiledIndexExprNode::Kind::Select;
+    selectExpr.operands = {makeCompiledIndexExpr(std::move(cmpExpr)), *lhs, *rhs};
+    return makeCompiledIndexExpr(std::move(selectExpr));
+  }
+  if (auto selectOp = mlir::dyn_cast<mlir::arith::SelectOp>(definingOp)) {
+    auto condition = compileIndexValueImpl(selectOp.getCondition());
+    auto trueValue = compileIndexValueImpl(selectOp.getTrueValue());
+    auto falseValue = compileIndexValueImpl(selectOp.getFalseValue());
+    if (failed(condition) || failed(trueValue) || failed(falseValue))
+      return mlir::failure();
+    CompiledIndexExprNode expr;
+    expr.kind = CompiledIndexExprNode::Kind::Select;
+    expr.operands = {*condition, *trueValue, *falseValue};
+    return makeCompiledIndexExpr(std::move(expr));
+  }
+  if (auto loadOp = mlir::dyn_cast<mlir::memref::LoadOp>(definingOp))
+    return compileConstantGlobalLoad(loadOp);
+
+  CompiledIndexExprNode expr;
+  expr.kind = CompiledIndexExprNode::Kind::Symbol;
+  expr.symbol = value;
+  return makeCompiledIndexExpr(std::move(expr));
+}

 llvm::FailureOr<int64_t> resolveIndexValueImpl(mlir::Value value, const StaticValueKnowledge* knowledge) {
  value = resolveAlias(value, knowledge);
@@ -110,6 +404,24 @@ llvm::FailureOr<int64_t> resolveIndexValueImpl(mlir::Value value, const StaticVa
    return static_cast<int64_t>(static_cast<uint64_t>(*lhs) / static_cast<uint64_t>(*rhs));
  }

+  if (auto divOp = mlir::dyn_cast<mlir::arith::DivSIOp>(definingOp)) {
+    auto lhs = resolveIndexValueImpl(divOp.getLhs(), knowledge);
+    auto rhs = resolveIndexValueImpl(divOp.getRhs(), knowledge);
+    if (failed(lhs) || failed(rhs) || *rhs == 0)
+      return mlir::failure();
+    if (*lhs == std::numeric_limits<int64_t>::min() && *rhs == -1)
+      return mlir::failure();
+    return *lhs / *rhs;
+  }
+
+  if (auto minOp = mlir::dyn_cast<mlir::arith::MinUIOp>(definingOp)) {
+    auto lhs = resolveIndexValueImpl(minOp.getLhs(), knowledge);
+    auto rhs = resolveIndexValueImpl(minOp.getRhs(), knowledge);
+    if (failed(lhs) || failed(rhs))
+      return mlir::failure();
+    return static_cast<int64_t>(std::min(static_cast<uint64_t>(*lhs), static_cast<uint64_t>(*rhs)));
+  }
+
  if (auto remOp = mlir::dyn_cast<mlir::arith::RemUIOp>(definingOp)) {
    auto lhs = resolveIndexValueImpl(remOp.getLhs(), knowledge);
    auto rhs = resolveIndexValueImpl(remOp.getRhs(), knowledge);
@@ -118,6 +430,34 @@ llvm::FailureOr<int64_t> resolveIndexValueImpl(mlir::Value value, const StaticVa
    return static_cast<int64_t>(static_cast<uint64_t>(*lhs) % static_cast<uint64_t>(*rhs));
  }

+  if (auto remOp = mlir::dyn_cast<mlir::arith::RemSIOp>(definingOp)) {
+    auto lhs = resolveIndexValueImpl(remOp.getLhs(), knowledge);
+    auto rhs = resolveIndexValueImpl(remOp.getRhs(), knowledge);
+    if (failed(lhs) || failed(rhs) || *rhs == 0)
+      return mlir::failure();
+    if (*lhs == std::numeric_limits<int64_t>::min() && *rhs == -1)
+      return 0;
+    return *lhs % *rhs;
+  }
+
+  if (auto cmpOp = mlir::dyn_cast<mlir::arith::CmpIOp>(definingOp)) {
+    auto lhs = resolveIndexValueImpl(cmpOp.getLhs(), knowledge);
+    auto rhs = resolveIndexValueImpl(cmpOp.getRhs(), knowledge);
+    if (failed(lhs) || failed(rhs))
+      return mlir::failure();
+    return evaluateCmpPredicate(cmpOp.getPredicate(), *lhs, *rhs) ? 1 : 0;
+  }
+
+  if (auto selectOp = mlir::dyn_cast<mlir::arith::SelectOp>(definingOp)) {
+    auto condition = resolveIndexValueImpl(selectOp.getCondition(), knowledge);
+    if (failed(condition))
+      return mlir::failure();
+    return resolveIndexValueImpl(*condition != 0 ? selectOp.getTrueValue() : selectOp.getFalseValue(), knowledge);
+  }
+
+  if (auto loadOp = mlir::dyn_cast<mlir::memref::LoadOp>(definingOp))
+    return resolveConstantGlobalLoad(loadOp, knowledge);
+
  return mlir::failure();
 }

@@ -209,8 +549,10 @@ llvm::FailureOr<ResolvedContiguousAddress> resolveContiguousAddressImpl(mlir::Va
      if (!isMemoryContiguous(sourceType.getShape(), offsets, sizes, strides))
        return mlir::failure();

-      auto sourceStrides = computeRowMajorStrides(sourceType.getShape());
-      byteOffset += linearizeIndex(offsets, sourceStrides) * subviewType.getElementTypeBitWidth() / 8;
+      auto sourceStrides = getStaticMemRefStrides(sourceType);
+      if (failed(sourceStrides))
+        return mlir::failure();
+      byteOffset += linearizeIndex(offsets, *sourceStrides) * getElementTypeSizeInBytes(subviewType.getElementType());
      value = resolveAlias(subviewOp.getSource(), knowledge);
      continue;
    }
@@ -235,17 +577,206 @@ llvm::FailureOr<ResolvedContiguousAddress> resolveContiguousAddressImpl(mlir::Va
  }
 }

-} // namespace
+llvm::FailureOr<CompiledAddressExpr> compileContiguousAddressExprImpl(mlir::Value value) {
+  int64_t constantByteOffset = 0;
+  CompiledIndexExpr byteOffsetExpr;
+  {
+    CompiledIndexExprNode expr;
+    expr.kind = CompiledIndexExprNode::Kind::Constant;
+    expr.constant = 0;
+    byteOffsetExpr = makeCompiledIndexExpr(std::move(expr));
+  }

-llvm::FailureOr<int64_t> resolveIndexValue(mlir::Value value) { return resolveIndexValueImpl(value, nullptr); }
+  while (true) {
+    if (mlir::isa<mlir::BlockArgument>(value))
+      return CompiledAddressExpr {value, byteOffsetExpr};
+
+    mlir::Operation* definingOp = value.getDefiningOp();
+    if (!definingOp)
+      return mlir::failure();
+
+    if (auto dpsDefiningOp = mlir::dyn_cast<mlir::DestinationStyleOpInterface>(definingOp)) {
+      mlir::OpOperand* tiedOperand = dpsDefiningOp.getTiedOpOperand(mlir::dyn_cast<mlir::OpResult>(value));
+      if (!tiedOperand)
+        return mlir::failure();
+      value = tiedOperand->get();
+      continue;
+    }
+
+    if (auto forOp = mlir::dyn_cast<mlir::scf::ForOp>(definingOp)) {
+      auto result = mlir::dyn_cast<mlir::OpResult>(value);
+      if (!result)
+        return mlir::failure();
+
+      auto yieldOp = mlir::cast<mlir::scf::YieldOp>(forOp.getBody()->getTerminator());
+      mlir::Value yieldedValue = yieldOp.getOperand(result.getResultNumber());
+      if (auto blockArgument = mlir::dyn_cast<mlir::BlockArgument>(yieldedValue)) {
+        if (blockArgument.getOwner() == forOp.getBody() && blockArgument.getArgNumber() > 0
+            && static_cast<unsigned>(blockArgument.getArgNumber() - 1) < forOp.getInitArgs().size()) {
+          value = forOp.getInitArgs()[blockArgument.getArgNumber() - 1];
+          continue;
+        }
+      }
+
+      value = yieldedValue;
+      continue;
+    }
+
+    if (auto subviewOp = mlir::dyn_cast<mlir::memref::SubViewOp>(definingOp)) {
+      auto sourceType = mlir::dyn_cast<mlir::MemRefType>(subviewOp.getSource().getType());
+      auto subviewType = mlir::dyn_cast<mlir::MemRefType>(subviewOp.getType());
+      if (!sourceType || !subviewType || !sourceType.hasStaticShape() || !subviewType.hasStaticShape())
+        return mlir::failure();
+
+      llvm::SmallVector<int64_t> staticSizes;
+      staticSizes.reserve(subviewOp.getMixedSizes().size());
+      llvm::SmallVector<int64_t> staticStrides;
+      staticStrides.reserve(subviewOp.getMixedStrides().size());
+      llvm::SmallVector<int64_t> staticOffsets;
+      staticOffsets.reserve(subviewOp.getMixedOffsets().size());
+      bool hasOnlyStaticOffsets = true;
+
+      for (mlir::OpFoldResult offset : subviewOp.getMixedOffsets())
+        if (auto attr = mlir::dyn_cast<mlir::Attribute>(offset))
+          staticOffsets.push_back(mlir::cast<mlir::IntegerAttr>(attr).getInt());
+        else
+          hasOnlyStaticOffsets = false;
+      for (mlir::OpFoldResult size : subviewOp.getMixedSizes()) {
+        auto attr = mlir::dyn_cast<mlir::Attribute>(size);
+        if (!attr)
+          return mlir::failure();
+        staticSizes.push_back(mlir::cast<mlir::IntegerAttr>(attr).getInt());
+      }
+      for (mlir::OpFoldResult stride : subviewOp.getMixedStrides()) {
+        auto attr = mlir::dyn_cast<mlir::Attribute>(stride);
+        if (!attr)
+          return mlir::failure();
+        staticStrides.push_back(mlir::cast<mlir::IntegerAttr>(attr).getInt());
+      }
+
+      if (!isContiguousSubviewWithDynamicOffsets(
+            sourceType.getShape(), subviewOp.getMixedOffsets(), staticSizes, staticStrides)) {
+        return mlir::failure();
+      }
+
+      if (hasOnlyStaticOffsets) {
+        if (!isMemoryContiguous(sourceType.getShape(), staticOffsets, staticSizes, staticStrides))
+          return mlir::failure();
+
+        auto sourceStrides = getStaticMemRefStrides(sourceType);
+        if (failed(sourceStrides))
+          return mlir::failure();
+        constantByteOffset +=
+          linearizeIndex(staticOffsets, *sourceStrides) * getElementTypeSizeInBytes(subviewType.getElementType());
+      }
+      else {
+        auto sourceStrides = getStaticMemRefStrides(sourceType);
+        if (failed(sourceStrides))
+          return mlir::failure();
+        CompiledIndexExpr offsetExpr;
+        {
+          CompiledIndexExprNode expr;
+          expr.kind = CompiledIndexExprNode::Kind::Constant;
+          expr.constant = 0;
+          offsetExpr = makeCompiledIndexExpr(std::move(expr));
+        }
+
+        for (auto [mixedOffset, sourceStride] : llvm::zip_equal(subviewOp.getMixedOffsets(), *sourceStrides)) {
+          CompiledIndexExpr operandExpr;
+          if (auto attr = mlir::dyn_cast<mlir::Attribute>(mixedOffset)) {
+            CompiledIndexExprNode expr;
+            expr.kind = CompiledIndexExprNode::Kind::Constant;
+            expr.constant = mlir::cast<mlir::IntegerAttr>(attr).getInt() * sourceStride
+                          * getElementTypeSizeInBytes(subviewType.getElementType());
+            operandExpr = makeCompiledIndexExpr(std::move(expr));
+          }
+          else {
+            auto compiledOffset = compileIndexValueImpl(mlir::cast<mlir::Value>(mixedOffset));
+            if (failed(compiledOffset))
+              return mlir::failure();
+            CompiledIndexExpr scaleExpr;
+            {
+              CompiledIndexExprNode expr;
+              expr.kind = CompiledIndexExprNode::Kind::Constant;
+              expr.constant = sourceStride * getElementTypeSizeInBytes(subviewType.getElementType());
+              scaleExpr = makeCompiledIndexExpr(std::move(expr));
+            }
+
+            CompiledIndexExprNode expr;
+            expr.kind = CompiledIndexExprNode::Kind::Mul;
+            expr.operands = {*compiledOffset, scaleExpr};
+            operandExpr = makeCompiledIndexExpr(std::move(expr));
+          }
+
+          CompiledIndexExprNode expr;
+          expr.kind = CompiledIndexExprNode::Kind::Add;
+          expr.operands = {offsetExpr, operandExpr};
+          offsetExpr = makeCompiledIndexExpr(std::move(expr));
+        }
+
+        CompiledIndexExpr constantExpr;
+        {
+          CompiledIndexExprNode expr;
+          expr.kind = CompiledIndexExprNode::Kind::Constant;
+          expr.constant = constantByteOffset;
+          constantExpr = makeCompiledIndexExpr(std::move(expr));
+        }
+        CompiledIndexExprNode expr;
+        expr.kind = CompiledIndexExprNode::Kind::Add;
+        expr.operands = {constantExpr, offsetExpr};
+        byteOffsetExpr = makeCompiledIndexExpr(std::move(expr));
+        constantByteOffset = 0;
+      }
+
+      value = subviewOp.getSource();
+      continue;
+    }
+
+    if (auto castOp = mlir::dyn_cast<mlir::memref::CastOp>(definingOp)) {
+      value = castOp.getSource();
+      continue;
+    }
+    if (auto collapseOp = mlir::dyn_cast<mlir::memref::CollapseShapeOp>(definingOp)) {
+      value = collapseOp.getSrc();
+      continue;
+    }
+    if (auto expandOp = mlir::dyn_cast<mlir::memref::ExpandShapeOp>(definingOp)) {
+      value = expandOp.getSrc();
+      continue;
+    }
+
+    if (mlir::isa<mlir::memref::AllocOp, mlir::memref::GetGlobalOp>(definingOp)) {
+      if (constantByteOffset != 0) {
+        CompiledIndexExpr constantExpr;
+        {
+          CompiledIndexExprNode expr;
+          expr.kind = CompiledIndexExprNode::Kind::Constant;
+          expr.constant = constantByteOffset;
+          constantExpr = makeCompiledIndexExpr(std::move(expr));
+        }
+        if (byteOffsetExpr.node->kind == CompiledIndexExprNode::Kind::Constant && byteOffsetExpr.node->constant == 0)
+          byteOffsetExpr = constantExpr;
+        else {
+          CompiledIndexExprNode expr;
+          expr.kind = CompiledIndexExprNode::Kind::Add;
+          expr.operands = {constantExpr, byteOffsetExpr};
+          byteOffsetExpr = makeCompiledIndexExpr(std::move(expr));
+        }
+      }
+      return CompiledAddressExpr {value, byteOffsetExpr};
+    }
+
+    return mlir::failure();
+  }
+}
+
+} // namespace

 llvm::FailureOr<int64_t> resolveIndexValue(mlir::Value value, const StaticValueKnowledge& knowledge) {
  return resolveIndexValueImpl(value, &knowledge);
 }

-llvm::FailureOr<ResolvedContiguousAddress> resolveContiguousAddress(mlir::Value value) {
-  return resolveContiguousAddressImpl(value, nullptr);
-}
+llvm::FailureOr<CompiledIndexExpr> compileIndexExpr(mlir::Value value) { return compileIndexValueImpl(value); }

 llvm::FailureOr<ResolvedContiguousAddress> resolveContiguousAddress(mlir::Value value,
                                                                    const StaticValueKnowledge& knowledge) {
@@ -256,4 +787,21 @@ mlir::Value resolveLoopCarriedAlias(mlir::Value value, const StaticValueKnowledg
  return resolveLoopCarriedAliasImpl(value, &knowledge);
 }

+llvm::FailureOr<CompiledAddressExpr> compileContiguousAddressExpr(mlir::Value value) {
+  return compileContiguousAddressExprImpl(value);
+}
+
+llvm::FailureOr<int64_t> CompiledIndexExpr::evaluate(const StaticValueKnowledge& knowledge) const {
+  return evaluateCompiledIndexExpr(*this, knowledge);
+}
+
+llvm::FailureOr<ResolvedContiguousAddress> CompiledAddressExpr::evaluate(const StaticValueKnowledge& knowledge,
+                                                                         std::optional<unsigned> lane) const {
+  (void) lane;
+  auto resolvedOffset = byteOffset.evaluate(knowledge);
+  if (failed(resolvedOffset))
+    return mlir::failure();
+  return ResolvedContiguousAddress {base, *resolvedOffset};
+}
+
 } // namespace onnx_mlir
@@ -1,10 +1,14 @@
 #pragma once

+#include "mlir/Dialect/Arith/IR/Arith.h"
 #include "mlir/Dialect/MemRef/IR/MemRef.h"
 #include "mlir/IR/Value.h"

 #include "llvm/ADT/DenseMap.h"

+#include <memory>
+#include <optional>
+
 namespace onnx_mlir {

 /// Describes a value as a base addressable object plus a statically known
@@ -23,21 +27,68 @@ struct StaticValueKnowledge {
  StaticValueKnowledge() {}
 };

+struct CompiledIndexExprNode;
+
+struct CompiledIndexExpr {
+  std::shared_ptr<CompiledIndexExprNode> node;
+
+  CompiledIndexExpr() = default;
+  explicit CompiledIndexExpr(std::shared_ptr<CompiledIndexExprNode> node)
+  : node(std::move(node)) {}
+
+  llvm::FailureOr<int64_t> evaluate(const StaticValueKnowledge& knowledge) const;
+};
+
+struct CompiledIndexExprNode {
+  enum class Kind {
+    Constant,
+    Symbol,
+    Add,
+    Sub,
+    Mul,
+    DivUI,
+    DivSI,
+    RemUI,
+    RemSI,
+    MinUI,
+    CmpI,
+    Select,
+    ConstantGlobalLoad
+  };
+
+  Kind kind = Kind::Constant;
+  int64_t constant = 0;
+  mlir::Value symbol;
+  mlir::arith::CmpIPredicate predicate = mlir::arith::CmpIPredicate::eq;
+  mlir::memref::GlobalOp globalOp;
+  llvm::SmallVector<int64_t, 4> globalStrides;
+  llvm::SmallVector<CompiledIndexExpr, 4> operands;
+};
+
+struct CompiledAddressExpr {
+  mlir::Value base;
+  CompiledIndexExpr byteOffset;
+
+  llvm::FailureOr<ResolvedContiguousAddress> evaluate(const StaticValueKnowledge& knowledge,
+                                                      std::optional<unsigned> lane) const;
+};
+
 mlir::memref::GlobalOp lookupGlobalForGetGlobal(mlir::ModuleOp moduleOp, mlir::memref::GetGlobalOp getGlobalOp);

 /// Resolves a value to contiguous backing storage when that storage can be
 /// proven statically from aliases, DPS ties, casts, and subviews.
-llvm::FailureOr<ResolvedContiguousAddress> resolveContiguousAddress(mlir::Value value);
 llvm::FailureOr<ResolvedContiguousAddress> resolveContiguousAddress(mlir::Value value,
-                                                                    const StaticValueKnowledge& knowledge);
+                                                                    const StaticValueKnowledge& knowledge = {});

 /// Statically evaluates index-like SSA values, including simple integer
 /// arithmetic and loop facts recorded in `knowledge`.
-llvm::FailureOr<int64_t> resolveIndexValue(mlir::Value value);
-llvm::FailureOr<int64_t> resolveIndexValue(mlir::Value value, const StaticValueKnowledge& knowledge);
+llvm::FailureOr<int64_t> resolveIndexValue(mlir::Value value, const StaticValueKnowledge& knowledge = {});
+llvm::FailureOr<CompiledIndexExpr> compileIndexExpr(mlir::Value value);

 /// Follows alias, view, and DPS chains to recover the backing value of a
 /// loop-carried memref/result.
 mlir::Value resolveLoopCarriedAlias(mlir::Value value, const StaticValueKnowledge& knowledge);

+llvm::FailureOr<CompiledAddressExpr> compileContiguousAddressExpr(mlir::Value value);
+
 } // namespace onnx_mlir
@@ -0,0 +1,182 @@
+#include "mlir/Dialect/Arith/IR/Arith.h"
+#include "mlir/IR/Matchers.h"
+
+#include "AffineUtils.hpp"
+#include "ConstantUtils.hpp"
+
+using namespace mlir;
+
+namespace onnx_mlir {
+
+static FailureOr<int64_t> floorDivSigned(int64_t lhs, int64_t rhs) {
+  if (rhs <= 0)
+    return failure();
+
+  int64_t quotient = lhs / rhs;
+  int64_t remainder = lhs % rhs;
+  if (remainder != 0 && lhs < 0)
+    --quotient;
+  return quotient;
+}
+
+static FailureOr<int64_t> ceilDivSigned(int64_t lhs, int64_t rhs) {
+  if (rhs <= 0)
+    return failure();
+
+  int64_t quotient = lhs / rhs;
+  int64_t remainder = lhs % rhs;
+  if (remainder != 0 && lhs > 0)
+    ++quotient;
+  return quotient;
+}
+
+Value createOrFoldAffineApply(
+  RewriterBase& rewriter, Location loc, AffineMap map, ValueRange operands, Operation* constantAnchor) {
+  assert(constantAnchor && "expected a valid constant anchor");
+  assert(map.getNumResults() == 1 && "affine.apply expects a single-result affine map");
+
+  SmallVector<Attribute> operandConstants;
+  operandConstants.reserve(operands.size());
+  for (Value operand : operands) {
+    std::optional<int64_t> constantValue = matchConstantIndexValue(operand);
+    if (!constantValue)
+      return affine::AffineApplyOp::create(rewriter, loc, map, operands).getResult();
+    operandConstants.push_back(rewriter.getIndexAttr(*constantValue));
+  }
+
+  SmallVector<Attribute> foldedResults;
+  if (succeeded(map.constantFold(operandConstants, foldedResults)) && foldedResults.size() == 1)
+    if (auto constantResult = dyn_cast<IntegerAttr>(foldedResults.front()))
+      return getOrCreateIndexConstant(rewriter, constantAnchor, constantResult.getInt());
+
+  return affine::AffineApplyOp::create(rewriter, loc, map, operands).getResult();
+}
+
+Value createOrFoldAffineApply(
+  RewriterBase& rewriter, Location loc, AffineExpr expr, ValueRange dims, Operation* constantAnchor) {
+  AffineMap map = AffineMap::get(/*dimCount=*/dims.size(), /*symbolCount=*/0, expr);
+  return createOrFoldAffineApply(rewriter, loc, map, dims, constantAnchor);
+}
+
+Value affineMulConst(RewriterBase& rewriter, Location loc, Value value, int64_t multiplier, Operation* constantAnchor) {
+  assert(constantAnchor && "expected a valid constant anchor");
+  if (multiplier == 0)
+    return getOrCreateIndexConstant(rewriter, constantAnchor, 0);
+  if (multiplier == 1)
+    return value;
+
+  AffineExpr d0 = getAffineDimExpr(0, rewriter.getContext());
+  return createOrFoldAffineApply(rewriter, loc, d0 * multiplier, ValueRange {value}, constantAnchor);
+}
+
+Value affineModConst(RewriterBase& rewriter, Location loc, Value value, int64_t divisor, Operation* constantAnchor) {
+  assert(constantAnchor && "expected a valid constant anchor");
+  assert(divisor > 0 && "expected a positive affine.mod divisor");
+  if (divisor == 1)
+    return getOrCreateIndexConstant(rewriter, constantAnchor, 0);
+
+  AffineExpr d0 = getAffineDimExpr(0, rewriter.getContext());
+  return createOrFoldAffineApply(rewriter, loc, d0 % divisor, ValueRange {value}, constantAnchor);
+}
+
+Value affineFloorDivConst(
+  RewriterBase& rewriter, Location loc, Value value, int64_t divisor, Operation* constantAnchor) {
+  assert(constantAnchor && "expected a valid constant anchor");
+  assert(divisor > 0 && "expected a positive affine.floor_div divisor");
+  if (divisor == 1)
+    return value;
+
+  AffineExpr d0 = getAffineDimExpr(0, rewriter.getContext());
+  return createOrFoldAffineApply(rewriter, loc, d0.floorDiv(divisor), ValueRange {value}, constantAnchor);
+}
+
+FailureOr<int64_t> evaluateAffineExpr(AffineExpr expr, ArrayRef<int64_t> dims, ArrayRef<int64_t> symbols) {
+  if (auto constant = dyn_cast<AffineConstantExpr>(expr))
+    return constant.getValue();
+  if (auto dim = dyn_cast<AffineDimExpr>(expr)) {
+    unsigned position = dim.getPosition();
+    if (position >= dims.size())
+      return failure();
+    return dims[position];
+  }
+  if (auto symbol = dyn_cast<AffineSymbolExpr>(expr)) {
+    unsigned position = symbol.getPosition();
+    if (position >= symbols.size())
+      return failure();
+    return symbols[position];
+  }
+
+  auto binary = dyn_cast<AffineBinaryOpExpr>(expr);
+  if (!binary)
+    return failure();
+
+  FailureOr<int64_t> lhs = evaluateAffineExpr(binary.getLHS(), dims, symbols);
+  FailureOr<int64_t> rhs = evaluateAffineExpr(binary.getRHS(), dims, symbols);
+  if (failed(lhs) || failed(rhs))
+    return failure();
+
+  switch (binary.getKind()) {
+  case AffineExprKind::Add:      return *lhs + *rhs;
+  case AffineExprKind::Mul:      return *lhs * *rhs;
+  case AffineExprKind::FloorDiv: return floorDivSigned(*lhs, *rhs);
+  case AffineExprKind::CeilDiv:  return ceilDivSigned(*lhs, *rhs);
+  case AffineExprKind::Mod:      {
+    FailureOr<int64_t> div = floorDivSigned(*lhs, *rhs);
+    if (failed(div))
+      return failure();
+    return *lhs - *div * *rhs;
+  }
+  default: return failure();
+  }
+}
+
+FailureOr<int64_t> evaluateSingleResultAffineMap(AffineMap map, ArrayRef<int64_t> operands) {
+  if (map.getNumResults() != 1 || operands.size() != map.getNumInputs())
+    return failure();
+
+  ArrayRef<int64_t> dims(operands.data(), map.getNumDims());
+  ArrayRef<int64_t> symbols(operands.data() + map.getNumDims(), map.getNumSymbols());
+  return evaluateAffineExpr(map.getResult(0), dims, symbols);
+}
+
+FailureOr<int64_t> evaluateAffineApply(affine::AffineApplyOp affineApply, IndexValueResolver resolver) {
+  SmallVector<int64_t, 4> operands;
+  operands.reserve(affineApply.getMapOperands().size());
+  for (Value operand : affineApply.getMapOperands()) {
+    FailureOr<int64_t> folded = resolver(operand);
+    if (failed(folded))
+      return failure();
+    operands.push_back(*folded);
+  }
+
+  return evaluateSingleResultAffineMap(affineApply.getAffineMap(), operands);
+}
+
+bool isSingleResultSymbolFreeAffineMap(AffineMap map) { return map.getNumResults() == 1 && map.getNumSymbols() == 0; }
+
+bool isDimAndConstantAffineExpr(AffineExpr expr) {
+  switch (expr.getKind()) {
+  case AffineExprKind::Constant:
+  case AffineExprKind::DimId:    return true;
+  case AffineExprKind::SymbolId: return false;
+  case AffineExprKind::Add:      {
+    auto binaryExpr = cast<AffineBinaryOpExpr>(expr);
+    return isDimAndConstantAffineExpr(binaryExpr.getLHS()) && isDimAndConstantAffineExpr(binaryExpr.getRHS());
+  }
+  case AffineExprKind::Mul: {
+    auto binaryExpr = cast<AffineBinaryOpExpr>(expr);
+    return (isa<AffineConstantExpr>(binaryExpr.getLHS()) && isDimAndConstantAffineExpr(binaryExpr.getRHS()))
+        || (isa<AffineConstantExpr>(binaryExpr.getRHS()) && isDimAndConstantAffineExpr(binaryExpr.getLHS()));
+  }
+  case AffineExprKind::FloorDiv:
+  case AffineExprKind::CeilDiv:
+  case AffineExprKind::Mod:      {
+    auto binaryExpr = cast<AffineBinaryOpExpr>(expr);
+    return isa<AffineConstantExpr>(binaryExpr.getRHS()) && isDimAndConstantAffineExpr(binaryExpr.getLHS());
+  }
+  }
+
+  llvm_unreachable("unexpected affine expression kind");
+}
+
+} // namespace onnx_mlir
@@ -0,0 +1,55 @@
+#pragma once
+
+#include "mlir/Dialect/Affine/IR/AffineOps.h"
+#include "mlir/IR/AffineExpr.h"
+#include "mlir/IR/PatternMatch.h"
+#include "mlir/Support/LogicalResult.h"
+
+#include "llvm/ADT/FunctionExtras.h"
+
+namespace onnx_mlir {
+
+using IndexValueResolver = llvm::function_ref<llvm::FailureOr<int64_t>(mlir::Value)>;
+
+mlir::Value createOrFoldAffineApply(mlir::RewriterBase& rewriter,
+                                    mlir::Location loc,
+                                    mlir::AffineMap map,
+                                    mlir::ValueRange operands,
+                                    mlir::Operation* constantAnchor);
+
+mlir::Value createOrFoldAffineApply(mlir::RewriterBase& rewriter,
+                                    mlir::Location loc,
+                                    mlir::AffineExpr expr,
+                                    mlir::ValueRange dims,
+                                    mlir::Operation* constantAnchor);
+
+mlir::Value affineMulConst(mlir::RewriterBase& rewriter,
+                           mlir::Location loc,
+                           mlir::Value value,
+                           int64_t multiplier,
+                           mlir::Operation* constantAnchor);
+
+mlir::Value affineModConst(mlir::RewriterBase& rewriter,
+                           mlir::Location loc,
+                           mlir::Value value,
+                           int64_t divisor,
+                           mlir::Operation* constantAnchor);
+
+mlir::Value affineFloorDivConst(mlir::RewriterBase& rewriter,
+                                mlir::Location loc,
+                                mlir::Value value,
+                                int64_t divisor,
+                                mlir::Operation* constantAnchor);
+
+llvm::FailureOr<int64_t>
+evaluateAffineExpr(mlir::AffineExpr expr, llvm::ArrayRef<int64_t> dims, llvm::ArrayRef<int64_t> symbols = {});
+
+llvm::FailureOr<int64_t> evaluateSingleResultAffineMap(mlir::AffineMap map, llvm::ArrayRef<int64_t> operands);
+
+llvm::FailureOr<int64_t> evaluateAffineApply(mlir::affine::AffineApplyOp affineApply, IndexValueResolver resolver);
+
+bool isSingleResultSymbolFreeAffineMap(mlir::AffineMap map);
+
+bool isDimAndConstantAffineExpr(mlir::AffineExpr expr);
+
+} // namespace onnx_mlir
@@ -0,0 +1,32 @@
+#include "src/Accelerators/PIM/Common/IR/BatchCoreUtils.hpp"
+#include "src/Accelerators/PIM/Common/PimCommon.hpp"
+
+namespace onnx_mlir {
+
+llvm::SmallVector<int32_t> getBatchCoreIds(pim::PimCoreBatchOp coreBatchOp) {
+  auto coreIdsAttr = coreBatchOp->getAttrOfType<mlir::DenseI32ArrayAttr>(onnx_mlir::kCoreIdsAttrName);
+  assert(coreIdsAttr && "pim.core_batch requires coreIds array attribute");
+  return llvm::SmallVector<int32_t>(coreIdsAttr.asArrayRef().begin(), coreIdsAttr.asArrayRef().end());
+}
+
+llvm::SmallVector<int32_t> getLaneChunkCoreIds(llvm::ArrayRef<int32_t> coreIds, size_t laneCount, unsigned lane) {
+  llvm::SmallVector<int32_t> laneCoreIds;
+  laneCoreIds.reserve(coreIds.size() / laneCount);
+  for (size_t chunkIndex = 0; chunkIndex < coreIds.size() / laneCount; ++chunkIndex)
+    laneCoreIds.push_back(coreIds[chunkIndex * laneCount + lane]);
+  return laneCoreIds;
+}
+
+bool isExplicitHostMemCopyOperand(mlir::Operation* op, unsigned operandIndex) {
+  if (mlir::isa<pim::PimMemCopyHostToDevOp>(op))
+    return operandIndex == 3;
+  if (mlir::isa<pim::PimMemCopyDevToHostOp>(op))
+    return operandIndex == 2;
+  return false;
+}
+
+bool isExplicitDevToHostTargetOperand(mlir::Operation* op, unsigned operandIndex) {
+  return mlir::isa<pim::PimMemCopyDevToHostOp>(op) && operandIndex == 2;
+}
+
+} // namespace onnx_mlir
@@ -0,0 +1,18 @@
+#pragma once
+
+#include "llvm/ADT/ArrayRef.h"
+#include "llvm/ADT/SmallVector.h"
+
+#include "src/Accelerators/PIM/Dialect/Pim/PimOps.hpp"
+
+namespace onnx_mlir {
+
+llvm::SmallVector<int32_t> getBatchCoreIds(pim::PimCoreBatchOp coreBatchOp);
+
+llvm::SmallVector<int32_t> getLaneChunkCoreIds(llvm::ArrayRef<int32_t> coreIds, size_t laneCount, unsigned lane);
+
+bool isExplicitHostMemCopyOperand(mlir::Operation* op, unsigned operandIndex);
+
+bool isExplicitDevToHostTargetOperand(mlir::Operation* op, unsigned operandIndex);
+
+} // namespace onnx_mlir
@@ -0,0 +1,157 @@
+#include "mlir/Dialect/Arith/IR/Arith.h"
+#include "mlir/Dialect/Func/IR/FuncOps.h"
+#include "mlir/IR/Builders.h"
+#include "mlir/IR/BuiltinOps.h"
+#include "mlir/IR/Dialect.h"
+
+#include "ConstantUtils.hpp"
+
+using namespace mlir;
+
+namespace onnx_mlir {
+
+static std::optional<int64_t> getIndexConstantValue(arith::ConstantOp constantOp) {
+  if (!constantOp.getType().isIndex())
+    return std::nullopt;
+
+  auto intAttr = dyn_cast<IntegerAttr>(constantOp.getValue());
+  if (!intAttr || !intAttr.getType().isIndex())
+    return std::nullopt;
+
+  return intAttr.getInt();
+}
+
+Block* getConstantInsertionBlock(Operation* anchorOp) {
+  assert(anchorOp && "expected a valid anchor operation");
+
+  if (auto funcOp = dyn_cast<func::FuncOp>(anchorOp))
+    return &funcOp.getBody().front();
+  if (auto funcOp = anchorOp->getParentOfType<func::FuncOp>())
+    return &funcOp.getBody().front();
+  if (auto moduleOp = dyn_cast<ModuleOp>(anchorOp))
+    return moduleOp.getBody();
+  if (auto moduleOp = anchorOp->getParentOfType<ModuleOp>())
+    return moduleOp.getBody();
+  return anchorOp->getBlock();
+}
+
+Value getOrCreateConstant(OperationFolder& folder, Operation* anchorOp, Attribute value, Type type) {
+  assert(anchorOp && "expected a valid anchor operation");
+  Block* hostBlock = getConstantInsertionBlock(anchorOp);
+  for (Operation& op : *hostBlock) {
+    auto constantOp = dyn_cast<arith::ConstantOp>(&op);
+    if (!constantOp || constantOp.getType() != type || constantOp.getValue() != value)
+      continue;
+    return constantOp.getResult();
+  }
+
+  auto* arithDialect = anchorOp->getContext()->getOrLoadDialect<arith::ArithDialect>();
+  return folder.getOrCreateConstant(hostBlock, arithDialect, value, type);
+}
+
+Value getOrCreateConstant(RewriterBase& rewriter, Operation* anchorOp, Attribute value, Type type) {
+  assert(anchorOp && "expected a valid anchor operation");
+  Block* hostBlock = getConstantInsertionBlock(anchorOp);
+  for (Operation& op : *hostBlock) {
+    auto constantOp = dyn_cast<arith::ConstantOp>(&op);
+    if (!constantOp || constantOp.getType() != type || constantOp.getValue() != value)
+      continue;
+    return constantOp.getResult();
+  }
+
+  OpBuilder::InsertionGuard guard(rewriter);
+  rewriter.setInsertionPointToStart(hostBlock);
+  return arith::ConstantOp::create(rewriter, anchorOp->getLoc(), type, cast<TypedAttr>(value)).getResult();
+}
+
+Value getOrCreateConstantLike(OperationFolder& folder, arith::ConstantOp constantOp) {
+  return getOrCreateConstant(folder, constantOp.getOperation(), constantOp.getValue(), constantOp.getType());
+}
+
+Value getOrCreateIndexConstant(OperationFolder& folder, Operation* anchorOp, int64_t value) {
+  Builder builder(anchorOp->getContext());
+  return getOrCreateConstant(folder, anchorOp, builder.getIndexAttr(value), builder.getIndexType());
+}
+
+Value getOrCreateIndexConstant(RewriterBase& rewriter, Operation* anchorOp, int64_t value) {
+  Builder builder(anchorOp->getContext());
+  return getOrCreateConstant(rewriter, anchorOp, builder.getIndexAttr(value), builder.getIndexType());
+}
+
+void hoistAndUniquifyIndexConstants(func::FuncOp funcOp, RewriterBase& rewriter) {
+  if (funcOp.getBody().empty())
+    return;
+
+  Block& entryBlock = funcOp.getBody().front();
+  DenseMap<int64_t, Value> canonicalByValue;
+  SmallVector<arith::ConstantOp> constants;
+
+  funcOp.walk([&](arith::ConstantOp constantOp) {
+    if (!getIndexConstantValue(constantOp))
+      return;
+    constants.push_back(constantOp);
+  });
+
+  for (arith::ConstantOp constantOp : constants) {
+    auto value = getIndexConstantValue(constantOp);
+    if (!value || constantOp->getBlock() != &entryBlock)
+      continue;
+    canonicalByValue.try_emplace(*value, constantOp.getResult());
+  }
+
+  for (arith::ConstantOp constantOp : constants) {
+    auto value = getIndexConstantValue(constantOp);
+    if (!value)
+      continue;
+
+    Value canonical = canonicalByValue.lookup(*value);
+    if (!canonical) {
+      OpBuilder::InsertionGuard guard(rewriter);
+      rewriter.setInsertionPointToStart(&entryBlock);
+      Builder builder(funcOp.getContext());
+      canonical =
+        arith::ConstantOp::create(rewriter, constantOp.getLoc(), builder.getIndexType(), builder.getIndexAttr(*value));
+      canonicalByValue[*value] = canonical;
+    }
+
+    if (constantOp.getResult() == canonical)
+      continue;
+
+    constantOp.getResult().replaceAllUsesWith(canonical);
+  }
+
+  for (arith::ConstantOp constantOp : llvm::reverse(constants)) {
+    auto value = getIndexConstantValue(constantOp);
+    if (!value)
+      continue;
+    if (constantOp.getResult() == canonicalByValue.lookup(*value))
+      continue;
+    if (constantOp.use_empty())
+      rewriter.eraseOp(constantOp);
+  }
+}
+
+std::optional<int64_t> matchConstantIndexValue(Value value) {
+  if (!value || !value.getType().isIndex())
+    return std::nullopt;
+
+  if (auto constant = value.getDefiningOp<arith::ConstantIndexOp>())
+    return constant.value();
+
+  if (auto constant = value.getDefiningOp<arith::ConstantOp>())
+    if (auto intAttr = dyn_cast<IntegerAttr>(constant.getValue()); intAttr && intAttr.getType().isIndex())
+      return intAttr.getInt();
+
+  return std::nullopt;
+}
+
+std::optional<int64_t> matchConstantIndexValue(OpFoldResult value) {
+  if (auto attr = dyn_cast<Attribute>(value))
+    if (auto intAttr = dyn_cast<IntegerAttr>(attr); intAttr && intAttr.getType().isIndex())
+      return intAttr.getInt();
+  if (auto operand = dyn_cast<Value>(value))
+    return matchConstantIndexValue(operand);
+  return std::nullopt;
+}
+
+} // namespace onnx_mlir
@@ -0,0 +1,33 @@
+#pragma once
+
+#include "mlir/Dialect/Arith/IR/Arith.h"
+#include "mlir/Dialect/Func/IR/FuncOps.h"
+#include "mlir/IR/PatternMatch.h"
+#include "mlir/IR/Value.h"
+#include "mlir/Transforms/FoldUtils.h"
+
+#include <optional>
+
+namespace onnx_mlir {
+
+mlir::Block* getConstantInsertionBlock(mlir::Operation* anchorOp);
+
+mlir::Value
+getOrCreateConstant(mlir::OperationFolder& folder, mlir::Operation* anchorOp, mlir::Attribute value, mlir::Type type);
+
+mlir::Value
+getOrCreateConstant(mlir::RewriterBase& rewriter, mlir::Operation* anchorOp, mlir::Attribute value, mlir::Type type);
+
+mlir::Value getOrCreateConstantLike(mlir::OperationFolder& folder, mlir::arith::ConstantOp constantOp);
+
+mlir::Value getOrCreateIndexConstant(mlir::OperationFolder& folder, mlir::Operation* anchorOp, int64_t value);
+
+mlir::Value getOrCreateIndexConstant(mlir::RewriterBase& rewriter, mlir::Operation* anchorOp, int64_t value);
+
+void hoistAndUniquifyIndexConstants(mlir::func::FuncOp funcOp, mlir::RewriterBase& rewriter);
+
+std::optional<int64_t> matchConstantIndexValue(mlir::Value value);
+
+std::optional<int64_t> matchConstantIndexValue(mlir::OpFoldResult value);
+
+} // namespace onnx_mlir
@@ -1,24 +1,37 @@
 #include "mlir/Dialect/Arith/IR/Arith.h"
+#include "mlir/Dialect/MemRef/IR/MemRef.h"
 #include "mlir/Dialect/SCF/IR/SCF.h"

+#include "llvm/ADT/SmallVector.h"
+
 #include "src/Accelerators/PIM/Common/IR/CoreBlockUtils.hpp"
 #include "src/Accelerators/PIM/Dialect/Pim/PimOps.hpp"

 namespace onnx_mlir {

 bool isCoreStaticAddressOp(mlir::Operation* op) {
-  return mlir::isa<mlir::arith::ConstantOp,
-                   mlir::arith::AddIOp,
-                   mlir::arith::SubIOp,
-                   mlir::arith::MulIOp,
-                   mlir::arith::DivUIOp,
-                   mlir::arith::RemUIOp,
-                   mlir::arith::IndexCastOp,
-                   mlir::memref::AllocOp,
-                   mlir::memref::SubViewOp,
-                   mlir::memref::CastOp,
-                   mlir::memref::CollapseShapeOp,
-                   mlir::memref::ExpandShapeOp>(op);
+  if (mlir::isa<mlir::arith::ConstantOp,
+                mlir::arith::AddIOp,
+                mlir::arith::SubIOp,
+                mlir::arith::MulIOp,
+                mlir::arith::DivUIOp,
+                mlir::arith::DivSIOp,
+                mlir::arith::MinUIOp,
+                mlir::arith::RemUIOp,
+                mlir::arith::RemSIOp,
+                mlir::arith::IndexCastOp,
+                mlir::arith::CmpIOp,
+                mlir::memref::AllocOp,
+                mlir::memref::SubViewOp,
+                mlir::memref::CastOp,
+                mlir::memref::CollapseShapeOp,
+                mlir::memref::ExpandShapeOp>(op))
+    return true;
+
+  if (auto selectOp = mlir::dyn_cast<mlir::arith::SelectOp>(op))
+    return selectOp.getType().isIntOrIndex();
+
+  return false;
 }

 mlir::LogicalResult
@@ -29,6 +42,9 @@ walkPimCoreBlock(mlir::Block& block,
  for (mlir::Operation& op : block) {
    if (mlir::isa<pim::PimHaltOp, mlir::scf::YieldOp>(op) || isCoreStaticAddressOp(&op))
      continue;
+    if (auto loadOp = mlir::dyn_cast<mlir::memref::LoadOp>(op);
+        loadOp && succeeded(resolveIndexValue(loadOp.getResult(), knowledge)))
+      continue;

    if (auto forOp = mlir::dyn_cast<mlir::scf::ForOp>(op)) {
      mlir::Block& loopBody = forOp.getRegion().front();
@@ -64,4 +80,58 @@ walkPimCoreBlock(mlir::Block& block,
  return mlir::success(!hasFailure);
 }

+mlir::LogicalResult walkPimCoreBlockStructurally(
+  mlir::Block& block,
+  const StaticValueKnowledge& knowledge,
+  llvm::function_ref<mlir::LogicalResult(mlir::Operation&, const StaticValueKnowledge&)> callback) {
+  bool hasFailure = false;
+  for (mlir::Operation& op : block) {
+    if (mlir::isa<pim::PimHaltOp, mlir::scf::YieldOp>(op) || isCoreStaticAddressOp(&op))
+      continue;
+    if (auto loadOp = mlir::dyn_cast<mlir::memref::LoadOp>(op);
+        loadOp && succeeded(resolveIndexValue(loadOp.getResult(), knowledge)))
+      continue;
+
+    if (auto forOp = mlir::dyn_cast<mlir::scf::ForOp>(op)) {
+      mlir::Block& loopBody = forOp.getRegion().front();
+      auto lowerBound = resolveIndexValue(forOp.getLowerBound(), knowledge);
+      auto upperBound = resolveIndexValue(forOp.getUpperBound(), knowledge);
+      auto step = resolveIndexValue(forOp.getStep(), knowledge);
+      if (failed(lowerBound) || failed(upperBound) || failed(step)) {
+        forOp.emitOpError("requires statically evaluable scf.for bounds for PIM verification");
+        hasFailure = true;
+        continue;
+      }
+      if (*step <= 0) {
+        forOp.emitOpError("requires positive scf.for step for PIM verification");
+        hasFailure = true;
+        continue;
+      }
+
+      llvm::SmallVector<int64_t, 2> samples;
+      if (*lowerBound < *upperBound) {
+        samples.push_back(*lowerBound);
+        int64_t last = *lowerBound + ((*upperBound - 1 - *lowerBound) / *step) * *step;
+        if (last != *lowerBound)
+          samples.push_back(last);
+      }
+
+      for (int64_t inductionValue : samples) {
+        StaticValueKnowledge loopKnowledge = knowledge;
+        loopKnowledge.indexValues[forOp.getInductionVar()] = inductionValue;
+        for (auto [iterArg, iterValue] : llvm::zip_equal(forOp.getRegionIterArgs(), forOp.getInitArgs()))
+          loopKnowledge.aliases[iterArg] = iterValue;
+
+        if (failed(walkPimCoreBlockStructurally(loopBody, loopKnowledge, callback)))
+          hasFailure = true;
+      }
+      continue;
+    }
+
+    if (failed(callback(op, knowledge)))
+      hasFailure = true;
+  }
+  return mlir::success(!hasFailure);
+}
+
 } // namespace onnx_mlir
@@ -21,4 +21,12 @@ walkPimCoreBlock(mlir::Block& block,
                 const StaticValueKnowledge& knowledge,
                 llvm::function_ref<mlir::LogicalResult(mlir::Operation&, const StaticValueKnowledge&)> callback);

+/// Walks a `pim.core`-like body structurally for verification without
+/// enumerating full loop trip counts. Loop bounds must still be statically
+/// evaluable so address resolution remains well-defined.
+mlir::LogicalResult walkPimCoreBlockStructurally(
+  mlir::Block& block,
+  const StaticValueKnowledge& knowledge,
+  llvm::function_ref<mlir::LogicalResult(mlir::Operation&, const StaticValueKnowledge&)> callback);
+
 } // namespace onnx_mlir
@@ -0,0 +1,96 @@
+#include "mlir/Dialect/SCF/IR/SCF.h"
+
+#include "llvm/Support/MathExtras.h"
+
+#include <optional>
+
+#include "ConstantUtils.hpp"
+#include "LoopUtils.hpp"
+
+using namespace mlir;
+
+namespace onnx_mlir {
+
+namespace {
+
+static std::optional<int64_t> getStaticTripCount(Value lowerBound, Value upperBound, Value step) {
+  auto lower = matchConstantIndexValue(lowerBound);
+  auto upper = matchConstantIndexValue(upperBound);
+  auto stepValue = matchConstantIndexValue(step);
+  if (!lower || !upper || !stepValue)
+    return std::nullopt;
+  if (*stepValue <= 0)
+    return std::nullopt;
+  if (*upper <= *lower)
+    return int64_t {0};
+  return llvm::divideCeil(*upper - *lower, *stepValue);
+}
+
+} // namespace
+
+static LogicalResult validateNormalizedLoopYields(Location loc, ValueRange initArgs, ArrayRef<Value> yieldedValues) {
+  if (yieldedValues.size() == initArgs.size())
+    return success();
+
+  emitError(loc) << "normalized loop body yielded " << yieldedValues.size() << " values for " << initArgs.size()
+                 << " iter args";
+  return failure();
+}
+
+FailureOr<NormalizedLoopResult> buildNormalizedScfFor(OpBuilder& builder,
+                                                      Location loc,
+                                                      Value lowerBound,
+                                                      Value upperBound,
+                                                      Value step,
+                                                      ValueRange initArgs,
+                                                      NormalizedLoopBodyBuilder bodyBuilder) {
+  NormalizedLoopResult result;
+
+  if (auto stepValue = matchConstantIndexValue(step); stepValue && *stepValue <= 0) {
+    emitError(loc) << "normalized scf.for requires a positive step, got " << *stepValue;
+    return failure();
+  }
+
+  if (auto tripCount = getStaticTripCount(lowerBound, upperBound, step)) {
+    if (*tripCount == 0) {
+      llvm::append_range(result.results, initArgs);
+      return result;
+    }
+
+    if (*tripCount == 1) {
+      result.inductionVar = lowerBound;
+      if (failed(bodyBuilder(builder, loc, lowerBound, initArgs, result.results)))
+        return failure();
+      if (failed(validateNormalizedLoopYields(loc, initArgs, result.results)))
+        return failure();
+      return result;
+    }
+  }
+
+  result.loop = scf::ForOp::create(builder, loc, lowerBound, upperBound, step, initArgs);
+  result.inductionVar = result.loop.getInductionVar();
+
+  {
+    OpBuilder::InsertionGuard guard(builder);
+    Block* body = result.loop.getBody();
+    if (!body->empty())
+      if (auto yieldOp = dyn_cast<scf::YieldOp>(body->back()))
+        yieldOp->erase();
+    builder.setInsertionPointToEnd(body);
+    ValueRange iterArgs = result.loop.getRegionIterArgs();
+    if (failed(bodyBuilder(builder, loc, result.inductionVar, iterArgs, result.results))) {
+      result.loop.erase();
+      return failure();
+    }
+    if (failed(validateNormalizedLoopYields(loc, initArgs, result.results))) {
+      result.loop.erase();
+      return failure();
+    }
+    scf::YieldOp::create(builder, loc, result.results);
+  }
+  builder.setInsertionPointAfter(result.loop);
+  result.results.assign(result.loop.getResults().begin(), result.loop.getResults().end());
+  return result;
+}
+
+} // namespace onnx_mlir
@@ -0,0 +1,30 @@
+#pragma once
+
+#include "mlir/Dialect/SCF/IR/SCF.h"
+#include "mlir/IR/Builders.h"
+
+#include "llvm/ADT/STLFunctionalExtras.h"
+#include "llvm/ADT/SmallVector.h"
+
+namespace onnx_mlir {
+
+struct NormalizedLoopResult {
+  mlir::Value inductionVar;
+  llvm::SmallVector<mlir::Value, 4> results;
+  mlir::scf::ForOp loop;
+
+  bool wasInlined() const { return !loop; }
+};
+
+using NormalizedLoopBodyBuilder = llvm::function_ref<mlir::LogicalResult(
+  mlir::OpBuilder&, mlir::Location, mlir::Value, mlir::ValueRange, llvm::SmallVectorImpl<mlir::Value>&)>;
+
+mlir::FailureOr<NormalizedLoopResult> buildNormalizedScfFor(mlir::OpBuilder& builder,
+                                                            mlir::Location loc,
+                                                            mlir::Value lowerBound,
+                                                            mlir::Value upperBound,
+                                                            mlir::Value step,
+                                                            mlir::ValueRange initArgs,
+                                                            NormalizedLoopBodyBuilder bodyBuilder);
+
+} // namespace onnx_mlir
@@ -1,4 +1,5 @@
 #include "llvm/ADT/STLExtras.h"
+#include "llvm/Support/ErrorHandling.h"

 #include "src/Accelerators/PIM/Common/IR/ShapeUtils.hpp"

@@ -35,6 +36,30 @@ int64_t getNumElements(llvm::ArrayRef<int64_t> shape) {
  return numElements;
 }

+bool hasByteSizedElementType(mlir::Type elementType) {
+  if (mlir::isa<mlir::IndexType>(elementType))
+    return true;
+  if (auto intType = mlir::dyn_cast<mlir::IntegerType>(elementType))
+    return intType.getWidth() > 0 && intType.getWidth() % 8 == 0;
+  if (auto floatType = mlir::dyn_cast<mlir::FloatType>(elementType))
+    return floatType.getWidth() > 0 && floatType.getWidth() % 8 == 0;
+  return false;
+}
+
+size_t getElementTypeSizeInBytes(mlir::Type elementType) {
+  if (mlir::isa<mlir::IndexType>(elementType))
+    return mlir::IndexType::kInternalStorageBitWidth / 8;
+  if (auto intType = mlir::dyn_cast<mlir::IntegerType>(elementType))
+    return static_cast<size_t>(intType.getWidth() / 8);
+  if (auto floatType = mlir::dyn_cast<mlir::FloatType>(elementType))
+    return static_cast<size_t>(floatType.getWidth() / 8);
+  llvm_unreachable("expected byte-sized integer, float, or index element type");
+}
+
+size_t getShapedTypeSizeInBytes(mlir::ShapedType shapedType) {
+  return static_cast<size_t>(shapedType.getNumElements()) * getElementTypeSizeInBytes(shapedType.getElementType());
+}
+
 bool isMemoryContiguous(llvm::ArrayRef<int64_t> srcShape,
                        llvm::ArrayRef<int64_t> offsets,
                        llvm::ArrayRef<int64_t> sizes,
@@ -86,4 +111,56 @@ bool isMemoryContiguous(llvm::ArrayRef<int64_t> srcShape,
  return true;
 }

+bool isContiguousSubviewWithDynamicOffsets(llvm::ArrayRef<int64_t> sourceShape,
+                                           llvm::ArrayRef<mlir::OpFoldResult> mixedOffsets,
+                                           llvm::ArrayRef<int64_t> staticSizes,
+                                           llvm::ArrayRef<int64_t> staticStrides) {
+  if (sourceShape.size() != mixedOffsets.size() || sourceShape.size() != staticSizes.size()
+      || sourceShape.size() != staticStrides.size()) {
+    return false;
+  }
+
+  if (llvm::any_of(staticStrides, [](int64_t stride) { return stride != 1; }))
+    return false;
+
+  auto reversedTriples =
+    llvm::zip_equal(llvm::reverse(sourceShape), llvm::reverse(mixedOffsets), llvm::reverse(staticSizes));
+
+  auto firstNonZeroOrDynamicOffset = llvm::find_if(reversedTriples, [](auto triple) {
+    auto [_sourceDim, offset, _size] = triple;
+    if (auto attr = mlir::dyn_cast<mlir::Attribute>(offset))
+      return mlir::cast<mlir::IntegerAttr>(attr).getInt() != 0;
+    return true;
+  });
+
+  if (firstNonZeroOrDynamicOffset != reversedTriples.end()) {
+    auto [sourceDim, offset, size] = *firstNonZeroOrDynamicOffset;
+    if (auto attr = mlir::dyn_cast<mlir::Attribute>(offset)) {
+      int64_t staticOffset = mlir::cast<mlir::IntegerAttr>(attr).getInt();
+      if (size > sourceDim - staticOffset)
+        return false;
+    }
+
+    ++firstNonZeroOrDynamicOffset;
+    for (auto it = firstNonZeroOrDynamicOffset; it != reversedTriples.end(); ++it)
+      if (std::get<2>(*it) != 1)
+        return false;
+  }
+
+  auto reversedSizes = llvm::zip_equal(llvm::reverse(sourceShape), llvm::reverse(staticSizes));
+  auto firstDifferentSize = llvm::find_if(reversedSizes, [](auto pair) {
+    auto [sourceDim, size] = pair;
+    return size != sourceDim;
+  });
+
+  if (firstDifferentSize != reversedSizes.end()) {
+    ++firstDifferentSize;
+    for (auto it = firstDifferentSize; it != reversedSizes.end(); ++it)
+      if (std::get<1>(*it) != 1)
+        return false;
+  }
+
+  return true;
+}
+
 } // namespace onnx_mlir
@@ -1,8 +1,14 @@
 #pragma once

+#include "mlir/IR/BuiltinTypes.h"
+#include "mlir/IR/OpDefinition.h"
+#include "mlir/IR/Value.h"
+
 #include "llvm/ADT/ArrayRef.h"
 #include "llvm/ADT/SmallVector.h"

+#include <cstddef>
+
 namespace onnx_mlir {

 llvm::SmallVector<int64_t> computeRowMajorStrides(llvm::ArrayRef<int64_t> shape);
@@ -14,9 +20,20 @@ int64_t linearizeIndex(llvm::ArrayRef<int64_t> indices, llvm::ArrayRef<int64_t>

 int64_t getNumElements(llvm::ArrayRef<int64_t> shape);

+bool hasByteSizedElementType(mlir::Type elementType);
+
+size_t getElementTypeSizeInBytes(mlir::Type elementType);
+
+size_t getShapedTypeSizeInBytes(mlir::ShapedType shapedType);
+
 bool isMemoryContiguous(llvm::ArrayRef<int64_t> srcShape,
                        llvm::ArrayRef<int64_t> offsets,
                        llvm::ArrayRef<int64_t> sizes,
                        llvm::ArrayRef<int64_t> strides);

+bool isContiguousSubviewWithDynamicOffsets(llvm::ArrayRef<int64_t> sourceShape,
+                                           llvm::ArrayRef<mlir::OpFoldResult> mixedOffsets,
+                                           llvm::ArrayRef<int64_t> staticSizes,
+                                           llvm::ArrayRef<int64_t> staticStrides);
+
 } // namespace onnx_mlir
@@ -1,7 +1,6 @@
-#include "src/Accelerators/PIM/Common/IR/SubviewUtils.hpp"
-
 #include "mlir/IR/BuiltinTypeInterfaces.h"

+#include "src/Accelerators/PIM/Common/IR/SubviewUtils.hpp"
 #include "src/Accelerators/PIM/Common/PimCommon.hpp"

 using namespace mlir;
@@ -32,6 +31,19 @@ Value stripMemRefViewOps(Value value) {
  }
 }

+Value stripMemRefAddressingOps(Value value) {
+  while (true) {
+    if (auto subviewOp = value.getDefiningOp<memref::SubViewOp>()) {
+      value = subviewOp.getSource();
+      continue;
+    }
+    Value strippedValue = stripMemRefViewOps(value);
+    if (strippedValue == value)
+      return value;
+    value = strippedValue;
+  }
+}
+
 bool hasAllStaticSubviewParts(memref::SubViewOp subview) {
  return llvm::all_of(subview.getStaticOffsets(), [](int64_t value) { return !ShapedType::isDynamic(value); })
      && llvm::all_of(subview.getStaticSizes(), [](int64_t value) { return !ShapedType::isDynamic(value); })
@@ -82,4 +94,13 @@ FailureOr<SmallVector<int64_t>> getStaticSubviewOffsets(const StaticSubviewInfo&
  return staticOffsets;
 }

+bool isMemRefBaseAddressableValue(Value value) {
+  value = stripMemRefAddressingOps(value);
+  if (isa<BlockArgument>(value))
+    return true;
+
+  Operation* defOp = value.getDefiningOp();
+  return defOp && isa<memref::AllocOp, memref::GetGlobalOp>(defOp);
+}
+
 } // namespace onnx_mlir
@@ -20,6 +20,8 @@ mlir::Value stripMemRefCasts(mlir::Value value);

 mlir::Value stripMemRefViewOps(mlir::Value value);

+mlir::Value stripMemRefAddressingOps(mlir::Value value);
+
 bool hasAllStaticSubviewParts(mlir::memref::SubViewOp subview);

 llvm::FailureOr<StaticSubviewInfo> getStaticSubviewInfo(mlir::Value value);
@@ -27,4 +29,6 @@ llvm::FailureOr<StaticSubviewInfo> getStaticSubviewInfo(mlir::Value value);
 /// Returns the offsets in `info` as int64_t, failing if any offset is dynamic.
 llvm::FailureOr<llvm::SmallVector<int64_t>> getStaticSubviewOffsets(const StaticSubviewInfo& info);

+bool isMemRefBaseAddressableValue(mlir::Value value);
+
 } // namespace onnx_mlir
@@ -1,8 +1,14 @@
+#include "mlir/Dialect/Linalg/IR/Linalg.h"
 #include "mlir/Dialect/Tensor/IR/Tensor.h"
+#include "mlir/IR/BuiltinAttributes.h"
+#include "mlir/IR/BuiltinTypes.h"

 #include "llvm/ADT/SmallPtrSet.h"
 #include "llvm/ADT/SmallSet.h"

+#include "src/Accelerators/PIM/Common/IR/AddressAnalysis.hpp"
+#include "src/Accelerators/PIM/Common/IR/ShapeUtils.hpp"
+#include "src/Accelerators/PIM/Common/IR/SubviewUtils.hpp"
 #include "src/Accelerators/PIM/Common/IR/WeightUtils.hpp"
 #include "src/Accelerators/PIM/Dialect/Pim/PimOps.hpp"
 #include "src/Accelerators/PIM/Dialect/Spatial/SpatialOps.hpp"
@@ -19,29 +25,67 @@ void markWeightAlways(mlir::Operation* op) {

 namespace {

-template <typename MVMOpTy, typename VMMOpTy, typename ParentOpTy>
-bool hasMvmVmmWeightUse(ParentOpTy parentOp, unsigned weightIndex) {
+CompiledIndexExpr makeConstantExpr(int64_t constant) {
+  CompiledIndexExprNode expr;
+  expr.kind = CompiledIndexExprNode::Kind::Constant;
+  expr.constant = constant;
+  return CompiledIndexExpr(std::make_shared<CompiledIndexExprNode>(std::move(expr)));
+}
+
+CompiledIndexExpr makeBinaryExpr(CompiledIndexExprNode::Kind kind, CompiledIndexExpr lhs, CompiledIndexExpr rhs) {
+  CompiledIndexExprNode expr;
+  expr.kind = kind;
+  expr.operands = {std::move(lhs), std::move(rhs)};
+  return CompiledIndexExpr(std::make_shared<CompiledIndexExprNode>(std::move(expr)));
+}
+
+CompiledIndexExpr addExpr(CompiledIndexExpr lhs, CompiledIndexExpr rhs) {
+  return makeBinaryExpr(CompiledIndexExprNode::Kind::Add, std::move(lhs), std::move(rhs));
+}
+
+CompiledIndexExpr mulExpr(CompiledIndexExpr lhs, int64_t rhs) {
+  return makeBinaryExpr(CompiledIndexExprNode::Kind::Mul, std::move(lhs), makeConstantExpr(rhs));
+}
+
+llvm::FailureOr<llvm::SmallVector<int64_t>> getStaticMemRefTypeStrides(mlir::MemRefType type) {
+  llvm::SmallVector<int64_t> strides;
+  int64_t offset = 0;
+  if (failed(type.getStridesAndOffset(strides, offset)))
+    return mlir::failure();
+  if (llvm::is_contained(strides, mlir::ShapedType::kDynamic))
+    return mlir::failure();
+  return strides;
+}
+
+template <typename VMMOpTy, typename ParentOpTy>
+bool hasVmmWeightUse(ParentOpTy parentOp, unsigned weightIndex) {
+  auto weightArg = parentOp.getWeightArgument(weightIndex);
+  if (!weightArg)
+    return false;
  bool found = false;
  parentOp.walk([&](mlir::Operation* op) {
-    if (auto mvmOp = mlir::dyn_cast<MVMOpTy>(op))
-      found |= mvmOp.getWeightIndex() == weightIndex;
-    else if (auto vmmOp = mlir::dyn_cast<VMMOpTy>(op))
-      found |= vmmOp.getWeightIndex() == weightIndex;
+    if (auto vmmOp = mlir::dyn_cast<VMMOpTy>(op))
+      found |= vmmOp.getWeight() == *weightArg;
  });
  return found;
 }

-template <typename MVMOpTy, typename VMMOpTy, typename ParentOpTy>
-void walkMvmVmmWeightUses(ParentOpTy parentOp, llvm::function_ref<void(mlir::OpOperand&)> callback) {
+template <typename VMMOpTy, typename ParentOpTy>
+void walkVmmWeightUses(ParentOpTy parentOp, llvm::function_ref<void(mlir::OpOperand&)> callback) {
  auto weights = parentOp.getWeights();
  llvm::SmallSet<unsigned, 8> visited;
-  auto walkWeightIndex = [&](unsigned weightIndex) {
-    if (weightIndex < weights.size() && visited.insert(weightIndex).second)
-      callback(parentOp->getOpOperand(weightIndex));
+  auto walkWeight = [&](mlir::Value weight) {
+    for (unsigned weightIndex = 0; weightIndex < weights.size(); ++weightIndex) {
+      auto weightArg = parentOp.getWeightArgument(weightIndex);
+      if (!weightArg || *weightArg != weight)
+        continue;
+      if (visited.insert(weightIndex).second)
+        callback(parentOp->getOpOperand(weightIndex));
+      break;
+    }
  };

-  parentOp.walk([&](MVMOpTy op) { walkWeightIndex(op.getWeightIndex()); });
-  parentOp.walk([&](VMMOpTy op) { walkWeightIndex(op.getWeightIndex()); });
+  parentOp.walk([&](VMMOpTy op) { walkWeight(op.getWeight()); });
 }

 } // namespace
@@ -54,7 +98,7 @@ bool isSpatialMvmVmmWeightUse(mlir::OpOperand& use) {
  if (!computeOp || operandIndex >= computeOp.getWeights().size())
    return false;

-  return hasMvmVmmWeightUse<spatial::SpatMVMOp, spatial::SpatVMMOp>(computeOp, operandIndex);
+  return hasVmmWeightUse<spatial::SpatVMMOp>(computeOp, operandIndex);
 }

 bool hasOnlySpatialMvmVmmWeightUses(mlir::Value value) {
@@ -76,8 +120,8 @@ bool hasOnlySpatialMvmVmmWeightUses(mlir::Value value) {
        return expandShapeOp.getSrc() == currentValue && self(expandShapeOp.getResult(), self);
      if (auto collapseShapeOp = mlir::dyn_cast<mlir::tensor::CollapseShapeOp>(user))
        return collapseShapeOp.getSrc() == currentValue && self(collapseShapeOp.getResult(), self);
-      if (auto transposeOp = mlir::dyn_cast<mlir::ONNXTransposeOp>(user))
-        return transposeOp.getData() == currentValue && self(transposeOp.getResult(), self);
+      if (auto transposeOp = mlir::dyn_cast<mlir::linalg::TransposeOp>(user))
+        return transposeOp.getInput() == currentValue && self(transposeOp.getResult()[0], self);

      return false;
    });
@@ -90,19 +134,183 @@ void walkPimMvmVmmWeightUses(mlir::Operation* root, llvm::function_ref<void(mlir
  assert(root && "expected valid root op");
  root->walk([&](pim::PimCoreOp coreOp) {
    coreOp.walk([&](pim::PimVMMOp vmmOp) {
-      auto weights = coreOp.getWeights();
-      unsigned weightIndex = vmmOp.getWeightIndex();
-      if (weightIndex < weights.size())
-        callback(coreOp->getOpOperand(weightIndex));
+      if (auto weightIndex = resolveWeightIndex(coreOp.getOperation(), vmmOp.getWeight()))
+        callback(coreOp->getOpOperand(*weightIndex));
    });
  });
  root->walk([&](pim::PimCoreBatchOp coreBatchOp) {
-    auto weights = coreBatchOp.getWeights();
-    for (auto weight : weights)
-      for (mlir::OpOperand& use : weight.getUses())
-        if (use.getOwner() == coreBatchOp.getOperation())
-          callback(use);
+    coreBatchOp.walk([&](pim::PimVMMOp vmmOp) {
+      if (auto weightIndex = resolveWeightIndex(coreBatchOp.getOperation(), vmmOp.getWeight()))
+        callback(coreBatchOp->getOpOperand(*weightIndex));
+    });
  });
 }

+std::optional<unsigned> resolveWeightIndex(mlir::Operation* weightOwner, mlir::Value weight) {
+  weight = stripMemRefAddressingOps(weight);
+
+  if (auto coreOp = mlir::dyn_cast_or_null<pim::PimCoreOp>(weightOwner)) {
+    for (unsigned weightIndex = 0; weightIndex < coreOp.getWeights().size(); ++weightIndex)
+      if (coreOp.getWeightArgument(weightIndex) == weight)
+        return weightIndex;
+    return std::nullopt;
+  }
+
+  if (auto coreBatchOp = mlir::dyn_cast_or_null<pim::PimCoreBatchOp>(weightOwner)) {
+    for (unsigned weightIndex = 0; weightIndex < coreBatchOp.getWeights().size(); ++weightIndex)
+      if (coreBatchOp.getWeightArgument(weightIndex) == weight)
+        return weightIndex;
+    return std::nullopt;
+  }
+
+  return std::nullopt;
+}
+
+llvm::FailureOr<ResolvedWeightView>
+resolveWeightView(mlir::Operation* weightOwner, mlir::Value weight, const StaticValueKnowledge& knowledge) {
+  llvm::SmallVector<mlir::Operation*> viewOps;
+  mlir::Value current = weight;
+
+  while (true) {
+    if (mlir::Value directAlias = knowledge.aliases.lookup(current); directAlias && directAlias != current) {
+      current = directAlias;
+      continue;
+    }
+
+    if (auto defOp = current.getDefiningOp()) {
+      if (auto getGlobalOp = mlir::dyn_cast<mlir::memref::GetGlobalOp>(defOp)) {
+        auto moduleOp = weightOwner ? weightOwner->getParentOfType<mlir::ModuleOp>() : mlir::ModuleOp {};
+        auto globalOp = lookupGlobalForGetGlobal(moduleOp, getGlobalOp);
+        if (!globalOp || !globalOp.getInitialValue())
+          return mlir::failure();
+
+        auto denseAttr = mlir::dyn_cast<mlir::DenseElementsAttr>(*globalOp.getInitialValue());
+        if (!denseAttr)
+          return mlir::failure();
+
+        ResolvedWeightView view;
+        view.globalOp = globalOp;
+        view.shape.assign(denseAttr.getType().getShape().begin(), denseAttr.getType().getShape().end());
+        view.strides = computeRowMajorStrides(view.shape);
+
+        CompiledIndexExpr offsetExpr = makeConstantExpr(0);
+        for (mlir::Operation* viewOp : llvm::reverse(viewOps)) {
+          if (auto subview = mlir::dyn_cast<mlir::memref::SubViewOp>(viewOp)) {
+            for (auto [offset, stride, sourceStride] :
+                 llvm::zip_equal(subview.getMixedOffsets(), subview.getStaticStrides(), view.strides)) {
+              CompiledIndexExpr offsetValue = makeConstantExpr(0);
+              if (auto attr = mlir::dyn_cast<mlir::Attribute>(offset)) {
+                auto intAttr = mlir::dyn_cast<mlir::IntegerAttr>(attr);
+                if (!intAttr)
+                  return mlir::failure();
+                offsetValue = makeConstantExpr(intAttr.getInt());
+              }
+              else if (auto value = mlir::dyn_cast<mlir::Value>(offset)) {
+                auto compiledOffset = compileIndexExpr(value);
+                if (failed(compiledOffset))
+                  return mlir::failure();
+                offsetValue = *compiledOffset;
+              }
+              else {
+                return mlir::failure();
+              }
+              offsetExpr = addExpr(std::move(offsetExpr), mulExpr(std::move(offsetValue), sourceStride));
+            }
+            auto resultType = mlir::cast<mlir::MemRefType>(subview.getResult().getType());
+            auto resultStrides = getStaticMemRefTypeStrides(resultType);
+            if (failed(resultStrides))
+              return mlir::failure();
+            view.shape.assign(resultType.getShape().begin(), resultType.getShape().end());
+            view.strides = std::move(*resultStrides);
+            continue;
+          }
+
+          if (auto collapse = mlir::dyn_cast<mlir::memref::CollapseShapeOp>(viewOp)) {
+            auto resultType = mlir::cast<mlir::MemRefType>(collapse.getResult().getType());
+            auto resultStrides = getStaticMemRefTypeStrides(resultType);
+            if (failed(resultStrides))
+              return mlir::failure();
+            view.shape.assign(resultType.getShape().begin(), resultType.getShape().end());
+            view.strides = std::move(*resultStrides);
+            continue;
+          }
+
+          if (auto expand = mlir::dyn_cast<mlir::memref::ExpandShapeOp>(viewOp)) {
+            auto resultType = mlir::cast<mlir::MemRefType>(expand.getResult().getType());
+            auto resultStrides = getStaticMemRefTypeStrides(resultType);
+            if (failed(resultStrides))
+              return mlir::failure();
+            view.shape.assign(resultType.getShape().begin(), resultType.getShape().end());
+            view.strides = std::move(*resultStrides);
+            continue;
+          }
+
+          if (auto castOp = mlir::dyn_cast<mlir::memref::CastOp>(viewOp)) {
+            auto resultType = mlir::cast<mlir::MemRefType>(castOp.getResult().getType());
+            auto resultStrides = getStaticMemRefTypeStrides(resultType);
+            if (failed(resultStrides))
+              return mlir::failure();
+            view.shape.assign(resultType.getShape().begin(), resultType.getShape().end());
+            view.strides = std::move(*resultStrides);
+            continue;
+          }
+
+          return mlir::failure();
+        }
+
+        auto resolvedOffset = offsetExpr.evaluate(knowledge);
+        if (failed(resolvedOffset))
+          return mlir::failure();
+        view.offset = *resolvedOffset;
+        return view;
+      }
+
+      if (auto subview = mlir::dyn_cast<mlir::memref::SubViewOp>(defOp)) {
+        viewOps.push_back(defOp);
+        current = subview.getSource();
+        continue;
+      }
+
+      if (auto collapse = mlir::dyn_cast<mlir::memref::CollapseShapeOp>(defOp)) {
+        viewOps.push_back(defOp);
+        current = collapse.getSrc();
+        continue;
+      }
+
+      if (auto expand = mlir::dyn_cast<mlir::memref::ExpandShapeOp>(defOp)) {
+        viewOps.push_back(defOp);
+        current = expand.getSrc();
+        continue;
+      }
+
+      if (auto castOp = mlir::dyn_cast<mlir::memref::CastOp>(defOp)) {
+        viewOps.push_back(defOp);
+        current = castOp.getSource();
+        continue;
+      }
+
+      return mlir::failure();
+    }
+
+    if (mlir::Value loopAlias = resolveLoopCarriedAlias(current, knowledge); loopAlias && loopAlias != current) {
+      current = loopAlias;
+      continue;
+    }
+
+    auto weightIndex = resolveWeightIndex(weightOwner, current);
+    if (!weightIndex)
+      return mlir::failure();
+
+    if (auto coreOp = mlir::dyn_cast_or_null<pim::PimCoreOp>(weightOwner)) {
+      current = coreOp.getWeights()[*weightIndex];
+      continue;
+    }
+    if (auto coreBatchOp = mlir::dyn_cast_or_null<pim::PimCoreBatchOp>(weightOwner)) {
+      current = coreBatchOp.getWeights()[*weightIndex];
+      continue;
+    }
+    return mlir::failure();
+  }
+}
+
 } // namespace onnx_mlir
@@ -1,15 +1,34 @@
 #pragma once

-#include "mlir/IR/Operation.h"
+#include "mlir/Dialect/MemRef/IR/MemRef.h"
+#include "mlir/IR/BuiltinOps.h"
 #include "mlir/IR/Value.h"

+#include "llvm/ADT/STLExtras.h"
 #include "llvm/ADT/STLFunctionalExtras.h"
+#include "llvm/ADT/SmallVector.h"
 #include "llvm/ADT/StringRef.h"

+#include <optional>
+
+#include "src/Accelerators/PIM/Common/IR/AddressAnalysis.hpp"
+#include "src/Accelerators/PIM/Dialect/Pim/PimOps.hpp"
+
 inline constexpr llvm::StringRef PimWeightAlwaysAttrName = "weightAlways";

 namespace onnx_mlir {

+struct ResolvedWeightView {
+  mlir::memref::GlobalOp globalOp;
+  llvm::SmallVector<int64_t> shape;
+  llvm::SmallVector<int64_t> strides;
+  int64_t offset = 0;
+
+  bool operator==(const ResolvedWeightView& other) const {
+    return globalOp == other.globalOp && shape == other.shape && strides == other.strides && offset == other.offset;
+  }
+};
+
 bool hasWeightAlways(mlir::Operation* op);

 /// Tags an op as producing a value that should stay materialized as a reusable
@@ -26,4 +45,20 @@ bool hasOnlySpatialMvmVmmWeightUses(mlir::Value value);
 /// passes can identify globals that must remain weight-backed.
 void walkPimMvmVmmWeightUses(mlir::Operation* root, llvm::function_ref<void(mlir::OpOperand&)> callback);

+std::optional<unsigned> resolveWeightIndex(mlir::Operation* weightOwner, mlir::Value weight);
+llvm::FailureOr<ResolvedWeightView>
+resolveWeightView(mlir::Operation* weightOwner, mlir::Value weight, const StaticValueKnowledge& knowledge = {});
+
+template <typename CoreLikeOpTy>
+llvm::SmallVector<unsigned, 8> getUsedWeightIndices(CoreLikeOpTy coreLikeOp) {
+  llvm::SmallVector<unsigned, 8> indices;
+  coreLikeOp.walk([&](pim::PimVMMOp vmmOp) {
+    auto weightIndex = resolveWeightIndex(coreLikeOp.getOperation(), vmmOp.getWeight());
+    if (weightIndex && !llvm::is_contained(indices, *weightIndex))
+      indices.push_back(*weightIndex);
+  });
+  llvm::sort(indices);
+  return indices;
+}
+
 } // namespace onnx_mlir
@@ -12,6 +12,7 @@
 #include "llvm/ADT/StringRef.h"

 #include "src/Accelerators/PIM/Common/IR/AddressAnalysis.hpp"
+#include "src/Accelerators/PIM/Common/IR/ConstantUtils.hpp"
 #include "src/Accelerators/PIM/Common/IR/CoreBlockUtils.hpp"
 #include "src/Accelerators/PIM/Common/IR/EntryPointUtils.hpp"
 #include "src/Accelerators/PIM/Common/IR/ShapeUtils.hpp"
@@ -0,0 +1,222 @@
+#include "llvm/Support/ErrorHandling.h"
+#include "llvm/Support/raw_ostream.h"
+
+#include "CheckedArithmetic.hpp"
+#include "src/Accelerators/PIM/Common/PimCommon.hpp"
+
+using namespace mlir;
+
+namespace onnx_mlir::pim {
+
+namespace {
+
+static void emitCrashMessage(llvm::StringRef fieldName, llvm::StringRef message) {
+  llvm::errs() << "PIM " << fieldName << " " << message << "\n";
+}
+
+template <typename To, typename From>
+static FailureOr<To> checkedCastAtLocation(From value, Location loc, llvm::StringRef fieldName) {
+  static_assert(std::is_integral_v<To> && std::is_integral_v<From>, "checkedCastAtLocation requires integral types");
+
+  using ToLimits = std::numeric_limits<To>;
+
+  if constexpr (std::is_signed_v<From> == std::is_signed_v<To>) {
+    if (value < static_cast<From>(ToLimits::min()) || value > static_cast<From>(ToLimits::max())) {
+      emitCheckedArithmeticError(loc, fieldName, "is outside representable range");
+      return failure();
+    }
+  }
+  else if constexpr (std::is_signed_v<From>) {
+    using UnsignedFrom = std::make_unsigned_t<From>;
+    using UnsignedTo = std::make_unsigned_t<To>;
+    if (value < 0 || static_cast<UnsignedFrom>(value) > static_cast<UnsignedTo>(ToLimits::max())) {
+      emitCheckedArithmeticError(loc, fieldName, "is outside representable range");
+      return failure();
+    }
+  }
+  else {
+    using UnsignedFrom = std::make_unsigned_t<From>;
+    using UnsignedTo = std::conditional_t<std::is_signed_v<To>, std::make_unsigned_t<To>, To>;
+    if (static_cast<UnsignedFrom>(value) > static_cast<UnsignedTo>(ToLimits::max())) {
+      emitCheckedArithmeticError(loc, fieldName, "is outside representable range");
+      return failure();
+    }
+  }
+
+  return static_cast<To>(value);
+}
+
+template <typename UInt>
+FailureOr<UInt> checkedMulAtLocation(UInt lhs, UInt rhs, Location loc, llvm::StringRef fieldName) {
+  static_assert(std::is_integral_v<UInt> && std::is_unsigned_v<UInt>,
+                "checkedMulAtLocation requires unsigned integral types");
+  if (lhs != 0 && rhs > std::numeric_limits<UInt>::max() / lhs) {
+    emitCheckedArithmeticError(loc, fieldName, "multiplication overflow");
+    return failure();
+  }
+  return lhs * rhs;
+}
+
+} // namespace
+
+InFlightDiagnostic emitCheckedArithmeticError(Operation* anchor, llvm::StringRef fieldName, llvm::StringRef message) {
+  assert(anchor && "expected arithmetic diagnostics to have an anchor op");
+  return anchor->emitOpError() << fieldName << " " << message;
+}
+
+InFlightDiagnostic emitCheckedArithmeticError(Location loc, llvm::StringRef fieldName, llvm::StringRef message) {
+  return emitError(loc) << "PIM " << fieldName << " " << message;
+}
+
+FailureOr<int32_t> checkedI32(int64_t value, Operation* anchor, llvm::StringRef fieldName) {
+  return checkedCast<int32_t>(value, anchor, fieldName);
+}
+
+FailureOr<int32_t> checkedI32(uint64_t value, Operation* anchor, llvm::StringRef fieldName) {
+  return checkedCast<int32_t>(value, anchor, fieldName);
+}
+
+FailureOr<uint8_t> checkedU8(uint64_t value, Operation* anchor, llvm::StringRef fieldName) {
+  return checkedCast<uint8_t>(value, anchor, fieldName);
+}
+
+FailureOr<size_t> checkedSize(int64_t value, Operation* anchor, llvm::StringRef fieldName) {
+  return checkedCast<size_t>(value, anchor, fieldName);
+}
+
+FailureOr<IntegerAttr>
+getCheckedI32Attr(Builder& builder, Operation* anchor, int64_t value, llvm::StringRef fieldName) {
+  assert(anchor && "checked op-based attrs require a non-null diagnostic anchor");
+  auto checkedValue = checkedI32(value, anchor, fieldName);
+  if (failed(checkedValue))
+    return failure();
+  return builder.getI32IntegerAttr(*checkedValue);
+}
+
+FailureOr<IntegerAttr>
+getCheckedI32Attr(Builder& builder, Operation* anchor, uint64_t value, llvm::StringRef fieldName) {
+  assert(anchor && "checked op-based attrs require a non-null diagnostic anchor");
+  auto checkedValue = checkedI32(value, anchor, fieldName);
+  if (failed(checkedValue))
+    return failure();
+  return builder.getI32IntegerAttr(*checkedValue);
+}
+
+FailureOr<IntegerAttr> getCheckedI32Attr(Builder& builder, Location loc, int64_t value, llvm::StringRef fieldName) {
+  auto checkedValue = checkedCastAtLocation<int32_t>(value, loc, fieldName);
+  if (failed(checkedValue))
+    return failure();
+  return builder.getI32IntegerAttr(*checkedValue);
+}
+
+FailureOr<IntegerAttr> getCheckedI32Attr(Builder& builder, Location loc, uint64_t value, llvm::StringRef fieldName) {
+  auto checkedValue = checkedCastAtLocation<int32_t>(value, loc, fieldName);
+  if (failed(checkedValue))
+    return failure();
+  return builder.getI32IntegerAttr(*checkedValue);
+}
+
+FailureOr<uint64_t> getCheckedShapedTypeSizeInBytes(ShapedType type, Operation* anchor, llvm::StringRef fieldName) {
+  assert(anchor && "checked op-based size helpers require a non-null diagnostic anchor");
+  if (!type.hasStaticShape()) {
+    emitCheckedArithmeticError(anchor, fieldName, "requires static shaped type");
+    return failure();
+  }
+  if (!hasByteSizedElementType(type.getElementType())) {
+    emitCheckedArithmeticError(anchor, fieldName, "requires byte-sized element type");
+    return failure();
+  }
+
+  uint64_t elements = 1;
+  for (int64_t dim : type.getShape()) {
+    if (dim < 0) {
+      emitCheckedArithmeticError(anchor, fieldName, "requires nonnegative dimensions");
+      return failure();
+    }
+
+    auto nextElements = checkedMul(elements, static_cast<uint64_t>(dim), anchor, fieldName);
+    if (failed(nextElements))
+      return failure();
+    elements = *nextElements;
+  }
+
+  return checkedMul(
+    elements, static_cast<uint64_t>(getElementTypeSizeInBytes(type.getElementType())), anchor, fieldName);
+}
+
+FailureOr<uint64_t> getCheckedShapedTypeSizeInBytes(ShapedType type, Location loc, llvm::StringRef fieldName) {
+  if (!type.hasStaticShape()) {
+    emitCheckedArithmeticError(loc, fieldName, "requires static shaped type");
+    return failure();
+  }
+  if (!hasByteSizedElementType(type.getElementType())) {
+    emitCheckedArithmeticError(loc, fieldName, "requires byte-sized element type");
+    return failure();
+  }
+
+  uint64_t elements = 1;
+  for (int64_t dim : type.getShape()) {
+    if (dim < 0) {
+      emitCheckedArithmeticError(loc, fieldName, "requires nonnegative dimensions");
+      return failure();
+    }
+
+    auto nextElements = checkedMulAtLocation(elements, static_cast<uint64_t>(dim), loc, fieldName);
+    if (failed(nextElements))
+      return failure();
+    elements = *nextElements;
+  }
+
+  return checkedMulAtLocation(
+    elements, static_cast<uint64_t>(getElementTypeSizeInBytes(type.getElementType())), loc, fieldName);
+}
+
+int32_t checkedI32OrCrash(int64_t value, llvm::StringRef fieldName) {
+  if (value < std::numeric_limits<int32_t>::min() || value > std::numeric_limits<int32_t>::max()) {
+    emitCrashMessage(fieldName, "is outside representable range");
+    llvm_unreachable("PIM checked arithmetic failure");
+  }
+  return static_cast<int32_t>(value);
+}
+
+int32_t checkedI32OrCrash(uint64_t value, llvm::StringRef fieldName) {
+  if (value > static_cast<uint64_t>(std::numeric_limits<int32_t>::max())) {
+    emitCrashMessage(fieldName, "is outside representable range");
+    llvm_unreachable("PIM checked arithmetic failure");
+  }
+  return static_cast<int32_t>(value);
+}
+
+uint8_t checkedU8OrCrash(uint64_t value, llvm::StringRef fieldName) {
+  if (value > static_cast<uint64_t>(std::numeric_limits<uint8_t>::max())) {
+    emitCrashMessage(fieldName, "is outside representable range");
+    llvm_unreachable("PIM checked arithmetic failure");
+  }
+  return static_cast<uint8_t>(value);
+}
+
+size_t checkedSizeOrCrash(int64_t value, llvm::StringRef fieldName) {
+  if (value < 0) {
+    emitCrashMessage(fieldName, "is outside representable range");
+    llvm_unreachable("PIM checked arithmetic failure");
+  }
+  return static_cast<size_t>(value);
+}
+
+size_t checkedAddOrCrash(size_t lhs, size_t rhs, llvm::StringRef fieldName) {
+  if (rhs > std::numeric_limits<size_t>::max() - lhs) {
+    emitCrashMessage(fieldName, "addition overflow");
+    llvm_unreachable("PIM checked arithmetic failure");
+  }
+  return lhs + rhs;
+}
+
+size_t checkedMulOrCrash(size_t lhs, size_t rhs, llvm::StringRef fieldName) {
+  if (lhs != 0 && rhs > std::numeric_limits<size_t>::max() / lhs) {
+    emitCrashMessage(fieldName, "multiplication overflow");
+    llvm_unreachable("PIM checked arithmetic failure");
+  }
+  return lhs * rhs;
+}
+
+} // namespace onnx_mlir::pim
@@ -0,0 +1,107 @@
+#pragma once
+
+#include "mlir/IR/Builders.h"
+#include "mlir/IR/BuiltinTypeInterfaces.h"
+#include "mlir/IR/Operation.h"
+#include "mlir/Support/LogicalResult.h"
+
+#include "llvm/ADT/StringRef.h"
+
+#include <cstddef>
+#include <cstdint>
+#include <limits>
+#include <type_traits>
+
+namespace onnx_mlir::pim {
+
+mlir::InFlightDiagnostic
+emitCheckedArithmeticError(mlir::Operation* anchor, llvm::StringRef fieldName, llvm::StringRef message);
+
+mlir::InFlightDiagnostic
+emitCheckedArithmeticError(mlir::Location loc, llvm::StringRef fieldName, llvm::StringRef message);
+
+template <typename To, typename From>
+mlir::FailureOr<To> checkedCast(From value, mlir::Operation* anchor, llvm::StringRef fieldName) {
+  static_assert(std::is_integral_v<To> && std::is_integral_v<From>, "checkedCast requires integral types");
+
+  using ToLimits = std::numeric_limits<To>;
+
+  if constexpr (std::is_signed_v<From> == std::is_signed_v<To>) {
+    if (value < static_cast<From>(ToLimits::min()) || value > static_cast<From>(ToLimits::max())) {
+      emitCheckedArithmeticError(anchor, fieldName, "is outside representable range");
+      return mlir::failure();
+    }
+  }
+  else if constexpr (std::is_signed_v<From>) {
+    using UnsignedFrom = std::make_unsigned_t<From>;
+    using UnsignedTo = std::make_unsigned_t<To>;
+    if (value < 0 || static_cast<UnsignedFrom>(value) > static_cast<UnsignedTo>(ToLimits::max())) {
+      emitCheckedArithmeticError(anchor, fieldName, "is outside representable range");
+      return mlir::failure();
+    }
+  }
+  else {
+    using UnsignedFrom = std::make_unsigned_t<From>;
+    using UnsignedTo = std::conditional_t<std::is_signed_v<To>, std::make_unsigned_t<To>, To>;
+    if (static_cast<UnsignedFrom>(value) > static_cast<UnsignedTo>(ToLimits::max())) {
+      emitCheckedArithmeticError(anchor, fieldName, "is outside representable range");
+      return mlir::failure();
+    }
+  }
+
+  return static_cast<To>(value);
+}
+
+template <typename UInt>
+mlir::FailureOr<UInt> checkedAdd(UInt lhs, UInt rhs, mlir::Operation* anchor, llvm::StringRef fieldName) {
+  static_assert(std::is_integral_v<UInt> && std::is_unsigned_v<UInt>, "checkedAdd requires unsigned integral types");
+  if (rhs > std::numeric_limits<UInt>::max() - lhs) {
+    emitCheckedArithmeticError(anchor, fieldName, "addition overflow");
+    return mlir::failure();
+  }
+  return lhs + rhs;
+}
+
+template <typename UInt>
+mlir::FailureOr<UInt> checkedMul(UInt lhs, UInt rhs, mlir::Operation* anchor, llvm::StringRef fieldName) {
+  static_assert(std::is_integral_v<UInt> && std::is_unsigned_v<UInt>, "checkedMul requires unsigned integral types");
+  if (lhs != 0 && rhs > std::numeric_limits<UInt>::max() / lhs) {
+    emitCheckedArithmeticError(anchor, fieldName, "multiplication overflow");
+    return mlir::failure();
+  }
+  return lhs * rhs;
+}
+
+mlir::FailureOr<int32_t> checkedI32(int64_t value, mlir::Operation* anchor, llvm::StringRef fieldName);
+mlir::FailureOr<int32_t> checkedI32(uint64_t value, mlir::Operation* anchor, llvm::StringRef fieldName);
+
+mlir::FailureOr<uint8_t> checkedU8(uint64_t value, mlir::Operation* anchor, llvm::StringRef fieldName);
+
+mlir::FailureOr<size_t> checkedSize(int64_t value, mlir::Operation* anchor, llvm::StringRef fieldName);
+
+mlir::FailureOr<mlir::IntegerAttr>
+getCheckedI32Attr(mlir::Builder& builder, mlir::Operation* anchor, int64_t value, llvm::StringRef fieldName);
+
+mlir::FailureOr<mlir::IntegerAttr>
+getCheckedI32Attr(mlir::Builder& builder, mlir::Operation* anchor, uint64_t value, llvm::StringRef fieldName);
+
+mlir::FailureOr<mlir::IntegerAttr>
+getCheckedI32Attr(mlir::Builder& builder, mlir::Location loc, int64_t value, llvm::StringRef fieldName);
+
+mlir::FailureOr<mlir::IntegerAttr>
+getCheckedI32Attr(mlir::Builder& builder, mlir::Location loc, uint64_t value, llvm::StringRef fieldName);
+
+mlir::FailureOr<uint64_t>
+getCheckedShapedTypeSizeInBytes(mlir::ShapedType type, mlir::Operation* anchor, llvm::StringRef fieldName);
+
+mlir::FailureOr<uint64_t>
+getCheckedShapedTypeSizeInBytes(mlir::ShapedType type, mlir::Location loc, llvm::StringRef fieldName);
+
+int32_t checkedI32OrCrash(int64_t value, llvm::StringRef fieldName);
+int32_t checkedI32OrCrash(uint64_t value, llvm::StringRef fieldName);
+uint8_t checkedU8OrCrash(uint64_t value, llvm::StringRef fieldName);
+size_t checkedSizeOrCrash(int64_t value, llvm::StringRef fieldName);
+size_t checkedAddOrCrash(size_t lhs, size_t rhs, llvm::StringRef fieldName);
+size_t checkedMulOrCrash(size_t lhs, size_t rhs, llvm::StringRef fieldName);
+
+} // namespace onnx_mlir::pim
@@ -18,7 +18,7 @@ void dumpModule(mlir::ModuleOp moduleOp, const std::string& name) {
  std::fstream file(dialectsDir + "/" + name + ".mlir", std::ios::out);
  llvm::raw_os_ostream os(file);
  mlir::OpPrintingFlags flags;
-  flags.elideLargeElementsAttrs();
+  flags.elideLargeElementsAttrs().enableDebugInfo(true, false);
  moduleOp.print(os, flags);
  os.flush();
  file.close();
@@ -7,10 +7,36 @@
 #include "llvm/ADT/ArrayRef.h"
 #include "llvm/ADT/StringRef.h"

+#include <cstdint>
 #include <system_error>

 namespace onnx_mlir::pim {

+struct CappedDiagnosticReporter {
+  explicit CappedDiagnosticReporter(int64_t maxReportedFailures = 8)
+  : maxReportedFailures(maxReportedFailures) {}
+
+  template <typename EmitFn>
+  void report(mlir::Operation* op, EmitFn&& emit) {
+    numFailures++;
+    if (numFailures <= maxReportedFailures)
+      emit(op);
+  }
+
+  void emitSuppressedSummary(mlir::Operation* op, llvm::StringRef failureDescription) const {
+    if (numFailures > maxReportedFailures)
+      op->emitError() << "suppressed " << (numFailures - maxReportedFailures) << " additional " << failureDescription;
+  }
+
+  void noteFailures(int64_t count) { numFailures += count; }
+
+  bool hasFailure() const { return numFailures != 0; }
+
+private:
+  int64_t maxReportedFailures;
+  int64_t numFailures = 0;
+};
+
 /// Emits a consistent diagnostic for target paths that require static shapes.
 mlir::InFlightDiagnostic emitUnsupportedStaticShapeDiagnostic(mlir::Operation* op, llvm::StringRef valueDescription);

@@ -1,21 +1,22 @@
-#include "src/Accelerators/PIM/Common/Support/ReportUtils.hpp"
-
 #include "llvm/Support/Format.h"

 #include "src/Accelerators/PIM/Common/Support/FileSystemUtils.hpp"
+#include "src/Accelerators/PIM/Common/Support/ReportUtils.hpp"

 namespace onnx_mlir {

-std::fstream openReportFile(const std::string& name) {
+std::fstream openReportFileWithExtension(const std::string& name, llvm::StringRef extension) {
  std::string outputDir = getOutputDir();
  if (outputDir.empty())
    return {};

  std::string reportsDir = outputDir + "/reports";
  createDirectory(reportsDir);
-  return std::fstream(reportsDir + "/" + name + ".txt", std::ios::out);
+  return std::fstream(reportsDir + "/" + name + "." + extension.str(), std::ios::out);
 }

+std::fstream openReportFile(const std::string& name) { return openReportFileWithExtension(name, "txt"); }
+
 std::string formatReportMemory(uint64_t bytes) {
  const char* units[] = {"B", "KB", "MB", "GB", "TB", "PB", "EB"};
  int i = 0;
@@ -1,10 +1,9 @@
 #pragma once

-#include "llvm/ADT/STLExtras.h"
 #include "llvm/ADT/ArrayRef.h"
+#include "llvm/ADT/STLExtras.h"
 #include "llvm/Support/raw_ostream.h"

-#include <cstdint>
 #include <fstream>
 #include <limits>
 #include <string>
@@ -12,6 +11,7 @@
 namespace onnx_mlir {

 std::fstream openReportFile(const std::string& name);
+std::fstream openReportFileWithExtension(const std::string& name, llvm::StringRef extension);
 std::string formatReportMemory(uint64_t bytes);

 struct ReportField {
@@ -16,8 +16,8 @@ add_pim_library(OMPimCompilerOptions
 add_pim_library(OMPimCompilerUtils
  PimCompilerUtils.cpp
  PimArtifactWriter.cpp
-  PimBatchEmission.cpp
  PimCodeGen.cpp
+  PimMemoryLiveness.cpp
  PimWeightEmitter.cpp

  EXCLUDE_FROM_OM_LIBS
@@ -29,7 +29,9 @@ add_pim_library(OMPimCompilerUtils
  OMPimCompilerOptions
  OMPimCommon
  OMPimBufferization
-  OMPimStaticMemoryCoalescing
+  OMPimMemoryCoalescing
+  OMPimHostConstantFolding
+  OMPimVerification
  OMPimPasses
  OMONNXToSpatial
  OMSpatialToPim
@@ -20,38 +20,6 @@ using namespace mlir;

 namespace onnx_mlir {

-OnnxMlirCompilerErrorCodes writeHostCoreArtifacts(StringRef outputDirPath) {
-  std::error_code errorCode;
-  std::string outputHostCorePath = outputDirPath.str() + "/core_0.pim";
-  raw_fd_ostream hostFileStream(outputHostCorePath, errorCode, sys::fs::OF_None);
-  if (errorCode) {
-    errs() << "Error while opening host core file `" << outputHostCorePath << "`: " << errorCode.message() << '\n';
-    return InvalidOutputFileAccess;
-  }
-
-  pim_binary::writeHeader(hostFileStream);
-  pim_binary::InstructionRecord noop;
-  noop.opcode = pim_binary::Opcode::sldi;
-  pim_binary::writeInstructionRecord(hostFileStream, noop);
-  pim_binary::writeInstructionRecord(hostFileStream, noop);
-  pim_binary::patchInstructionCount(hostFileStream, 2);
-  hostFileStream.close();
-
-  if (pimEmitJson.getValue()) {
-    std::string outputHostJsonPath = outputDirPath.str() + "/core_0.json";
-    raw_fd_ostream hostJsonStream(outputHostJsonPath, errorCode);
-    if (errorCode) {
-      errs() << "Error while opening host core json file `" << outputHostJsonPath << "`: " << errorCode.message()
-             << '\n';
-      return InvalidOutputFileAccess;
-    }
-    // The host core json contains two no-op-like instructions to satisfy pimsim-nn
-    hostJsonStream << "[{\"imm\":0,\"op\":\"sldi\",\"rd\":0},{\"imm\":0,\"op\":\"sldi\",\"rd\":0}]";
-    hostJsonStream.close();
-  }
-  return CompilerSuccess;
-}
-
 OnnxMlirCompilerErrorCodes
 writeMemoryBinary(ModuleOp moduleOp, func::FuncOp funcOp, PimAcceleratorMemory& memory, StringRef outputDirPath) {
  auto memoryFilePath = (outputDirPath + "/memory.bin").str();
@@ -80,7 +48,7 @@ writeMemoryBinary(ModuleOp moduleOp, func::FuncOp funcOp, PimAcceleratorMemory&
    if (!denseAttr)
      return;

-    MemEntry memEntry = memory.hostMem.getMemEntry(getGlobalOp.getResult());
+    MemEntry memEntry = memory.hostMem.getMemEntry({getGlobalOp.getResult(), std::nullopt});
    ArrayRef<char> rawData = denseAttr.getRawData();
    char* dst = memoryBuffer.data() + memEntry.address;

@@ -109,9 +77,6 @@ OnnxMlirCompilerErrorCodes writeConfigJson(func::FuncOp funcOp,
  json::Object configJson;

  configJson["core_cnt"] = maxCoreId + 1;
-  configJson["adc_count"] = 16;
-  configJson["cell_precision"] = 2;
-  configJson["xbar_array_count"] = crossbarCountInCore.getValue();
  configJson["xbar_size"] = {crossbarSize.getValue(), crossbarSize.getValue()};
  configJson["array_group_map"] = std::move(xbarsPerArrayGroup);

@@ -12,7 +12,6 @@ namespace onnx_mlir {

 class PimAcceleratorMemory;

-OnnxMlirCompilerErrorCodes writeHostCoreArtifacts(llvm::StringRef outputDirPath);
 OnnxMlirCompilerErrorCodes writeMemoryBinary(mlir::ModuleOp moduleOp,
                                             mlir::func::FuncOp funcOp,
                                             PimAcceleratorMemory& memory,
@@ -1,136 +0,0 @@
-#include "mlir/IR/Builders.h"
-#include "mlir/IR/BuiltinOps.h"
-#include "mlir/IR/IRMapping.h"
-
-#include "src/Accelerators/PIM/Common/PimCommon.hpp"
-#include "src/Accelerators/PIM/Compiler/PimBatchEmission.hpp"
-
-using namespace mlir;
-
-namespace onnx_mlir {
-namespace {
-
-static SmallVector<int32_t> getBatchCoreIds(pim::PimCoreBatchOp coreBatchOp) {
-  auto coreIdsAttr = coreBatchOp->getAttrOfType<DenseI32ArrayAttr>(onnx_mlir::kCoreIdsAttrName);
-  assert(coreIdsAttr && "pim.core_batch requires coreIds array attribute");
-  return SmallVector<int32_t>(coreIdsAttr.asArrayRef().begin(), coreIdsAttr.asArrayRef().end());
-}
-
-static SmallVector<int32_t> getLaneChunkCoreIds(ArrayRef<int32_t> coreIds, size_t laneCount, unsigned lane) {
-  SmallVector<int32_t> laneCoreIds;
-  laneCoreIds.reserve(coreIds.size() / laneCount);
-  for (size_t chunkIndex = 0; chunkIndex < coreIds.size() / laneCount; ++chunkIndex)
-    laneCoreIds.push_back(coreIds[chunkIndex * laneCount + lane]);
-  return laneCoreIds;
-}
-
-static void scalarizeBatchOpsInCore(pim::PimCoreOp scalarCore, size_t laneCount, unsigned lane) {
-  IRRewriter rewriter(scalarCore.getContext());
-  SmallVector<Operation*> batchOps;
-  scalarCore.walk([&](Operation* op) {
-    if (isa<pim::PimSendBatchOp,
-            pim::PimSendTensorBatchOp,
-            pim::PimReceiveBatchOp,
-            pim::PimReceiveTensorBatchOp,
-            pim::PimMemCopyHostToDevBatchOp>(op)) {
-      batchOps.push_back(op);
-    }
-  });
-
-  for (Operation* op : batchOps) {
-    rewriter.setInsertionPoint(op);
-
-    if (auto sendBatchOp = dyn_cast<pim::PimSendBatchOp>(op)) {
-      pim::PimSendOp::create(rewriter,
-                             sendBatchOp.getLoc(),
-                             sendBatchOp.getInput(),
-                             sendBatchOp.getSizeAttr(),
-                             rewriter.getI32IntegerAttr(sendBatchOp.getTargetCoreIds()[lane]));
-      rewriter.eraseOp(op);
-      continue;
-    }
-
-    if (auto sendTensorBatchOp = dyn_cast<pim::PimSendTensorBatchOp>(op)) {
-      pim::PimSendTensorOp::create(
-        rewriter,
-        sendTensorBatchOp.getLoc(),
-        sendTensorBatchOp.getInput(),
-        rewriter.getDenseI32ArrayAttr(getLaneChunkCoreIds(sendTensorBatchOp.getTargetCoreIds(), laneCount, lane)));
-      rewriter.eraseOp(op);
-      continue;
-    }
-
-    if (auto receiveBatchOp = dyn_cast<pim::PimReceiveBatchOp>(op)) {
-      auto scalarReceive =
-        pim::PimReceiveOp::create(rewriter,
-                                  receiveBatchOp.getLoc(),
-                                  receiveBatchOp.getOutput().getType(),
-                                  receiveBatchOp.getOutputBuffer(),
-                                  receiveBatchOp.getSizeAttr(),
-                                  rewriter.getI32IntegerAttr(receiveBatchOp.getSourceCoreIds()[lane]));
-      rewriter.replaceOp(op, scalarReceive->getResults());
-      continue;
-    }
-
-    if (auto receiveTensorBatchOp = dyn_cast<pim::PimReceiveTensorBatchOp>(op)) {
-      auto scalarReceive = pim::PimReceiveTensorOp::create(
-        rewriter,
-        receiveTensorBatchOp.getLoc(),
-        receiveTensorBatchOp.getOutput().getType(),
-        receiveTensorBatchOp.getOutputBuffer(),
-        rewriter.getDenseI32ArrayAttr(getLaneChunkCoreIds(receiveTensorBatchOp.getSourceCoreIds(), laneCount, lane)));
-      rewriter.replaceOp(op, scalarReceive->getResults());
-      continue;
-    }
-
-    auto memcpBatchOp = cast<pim::PimMemCopyHostToDevBatchOp>(op);
-    auto scalarCopy = pim::PimMemCopyHostToDevOp::create(rewriter,
-                                                         memcpBatchOp.getLoc(),
-                                                         memcpBatchOp.getOutput().getType(),
-                                                         memcpBatchOp.getDeviceTarget(),
-                                                         memcpBatchOp.getHostSource(),
-                                                         memcpBatchOp.getDeviceTargetOffsetAttr(),
-                                                         memcpBatchOp.getHostSourceOffsetAttr(),
-                                                         memcpBatchOp.getSizeAttr());
-    rewriter.replaceOp(op, scalarCopy->getResults());
-  }
-}
-
-} // namespace
-
-LogicalResult withScalarCoreFromBatchLane(pim::PimCoreBatchOp coreBatchOp,
-                                          unsigned lane,
-                                          llvm::function_ref<LogicalResult(pim::PimCoreOp)> callback) {
-  OwningOpRef<ModuleOp> scratchModule = ModuleOp::create(coreBatchOp.getLoc());
-  OpBuilder builder(scratchModule->getContext());
-  builder.setInsertionPointToStart(scratchModule->getBody());
-
-  size_t laneCount = static_cast<size_t>(coreBatchOp.getLaneCount());
-  size_t weightsPerLane = coreBatchOp.getWeights().size() / laneCount;
-  SmallVector<Value> laneWeights;
-  laneWeights.reserve(weightsPerLane);
-  for (size_t weightIndex = 0; weightIndex < weightsPerLane; ++weightIndex)
-    laneWeights.push_back(coreBatchOp.getWeights()[lane * weightsPerLane + weightIndex]);
-
-  auto coreIds = getBatchCoreIds(coreBatchOp);
-  auto scalarCore = pim::PimCoreOp::create(
-    builder, coreBatchOp.getLoc(), ValueRange(laneWeights), builder.getI32IntegerAttr(coreIds[lane]));
-  Block* block = builder.createBlock(&scalarCore.getBody(), scalarCore.getBody().end());
-  IRMapping mapper;
-  if (coreBatchOp.getBody().front().getNumArguments() == 1)
-    mapper.map(coreBatchOp.getBody().front().getArgument(0), coreBatchOp.getInputs()[lane]);
-
-  builder.setInsertionPointToEnd(block);
-  for (Operation& op : coreBatchOp.getBody().front()) {
-    Operation* cloned = builder.clone(op, mapper);
-    for (auto [originalResult, clonedResult] : llvm::zip(op.getResults(), cloned->getResults()))
-      mapper.map(originalResult, clonedResult);
-  }
-
-  if (block->empty() || !isa<pim::PimHaltOp>(block->back()))
-    pim::PimHaltOp::create(builder, coreBatchOp.getLoc());
-  scalarizeBatchOpsInCore(scalarCore, laneCount, lane);
-  return callback(scalarCore);
-}
-
-} // namespace onnx_mlir
@@ -1,13 +0,0 @@
-#pragma once
-
-#include "llvm/ADT/STLFunctionalExtras.h"
-
-#include "src/Accelerators/PIM/Dialect/Pim/PimOps.hpp"
-
-namespace onnx_mlir {
-
-mlir::LogicalResult withScalarCoreFromBatchLane(pim::PimCoreBatchOp coreBatchOp,
-                                                unsigned lane,
-                                                llvm::function_ref<mlir::LogicalResult(pim::PimCoreOp)> callback);
-
-} // namespace onnx_mlir
@@ -6,8 +6,8 @@
 #include "llvm/Support/raw_ostream.h"

 #include <array>
-#include <cassert>
-#include <limits>
+
+#include "src/Accelerators/PIM/Common/Support/CheckedArithmetic.hpp"

 namespace onnx_mlir::pim_binary {

@@ -70,9 +70,7 @@ inline void writeUint32LE(llvm::raw_ostream& os, uint32_t value) {
  os.write(bytes.data(), bytes.size());
 }

-inline void writeInt32LE(llvm::raw_ostream& os, int32_t value) {
-  writeUint32LE(os, static_cast<uint32_t>(value));
-}
+inline void writeInt32LE(llvm::raw_ostream& os, int32_t value) { writeUint32LE(os, static_cast<uint32_t>(value)); }

 inline void writeHeader(llvm::raw_ostream& os) {
  os.write(kMagic, sizeof(kMagic));
@@ -97,15 +95,10 @@ inline void writeInstructionRecord(llvm::raw_ostream& os, const InstructionRecor
  writeInt32LE(os, record.generic3);
 }

-inline int32_t toI32(int64_t value) {
-  assert(value >= std::numeric_limits<int32_t>::min() && value <= std::numeric_limits<int32_t>::max()
-         && "PIM binary field out of int32 range");
-  return static_cast<int32_t>(value);
-}
+inline int32_t toI32(int64_t value) { return onnx_mlir::pim::checkedI32OrCrash(value, "binary field"); }

 inline uint8_t toU8(int64_t value) {
-  assert(value >= 0 && value <= std::numeric_limits<uint8_t>::max() && "PIM binary field out of uint8 range");
-  return static_cast<uint8_t>(value);
+  return onnx_mlir::pim::checkedU8OrCrash(static_cast<uint64_t>(value), "binary field");
 }

 inline int32_t getOptionalInt(const llvm::json::Object& object, llvm::StringRef key, int32_t defaultValue = 0) {
@@ -186,39 +179,39 @@ inline Opcode opcodeFromString(llvm::StringRef opName) {

 inline llvm::StringRef opcodeToString(Opcode opcode) {
  switch (opcode) {
-  case Opcode::nop: return "nop";
-  case Opcode::sldi: return "sldi";
-  case Opcode::sld: return "sld";
-  case Opcode::sadd: return "sadd";
-  case Opcode::ssub: return "ssub";
-  case Opcode::smul: return "smul";
-  case Opcode::saddi: return "saddi";
-  case Opcode::smuli: return "smuli";
-  case Opcode::setbw: return "setbw";
-  case Opcode::mvmul: return "mvmul";
-  case Opcode::vvadd: return "vvadd";
-  case Opcode::vvsub: return "vvsub";
-  case Opcode::vvmul: return "vvmul";
-  case Opcode::vvdmul: return "vvdmul";
-  case Opcode::vvmax: return "vvmax";
-  case Opcode::vvsll: return "vvsll";
-  case Opcode::vvsra: return "vvsra";
-  case Opcode::vavg: return "vavg";
-  case Opcode::vrelu: return "vrelu";
-  case Opcode::vtanh: return "vtanh";
-  case Opcode::vsigm: return "vsigm";
+  case Opcode::nop:      return "nop";
+  case Opcode::sldi:     return "sldi";
+  case Opcode::sld:      return "sld";
+  case Opcode::sadd:     return "sadd";
+  case Opcode::ssub:     return "ssub";
+  case Opcode::smul:     return "smul";
+  case Opcode::saddi:    return "saddi";
+  case Opcode::smuli:    return "smuli";
+  case Opcode::setbw:    return "setbw";
+  case Opcode::mvmul:    return "mvmul";
+  case Opcode::vvadd:    return "vvadd";
+  case Opcode::vvsub:    return "vvsub";
+  case Opcode::vvmul:    return "vvmul";
+  case Opcode::vvdmul:   return "vvdmul";
+  case Opcode::vvmax:    return "vvmax";
+  case Opcode::vvsll:    return "vvsll";
+  case Opcode::vvsra:    return "vvsra";
+  case Opcode::vavg:     return "vavg";
+  case Opcode::vrelu:    return "vrelu";
+  case Opcode::vtanh:    return "vtanh";
+  case Opcode::vsigm:    return "vsigm";
  case Opcode::vsoftmax: return "vsoftmax";
-  case Opcode::vmv: return "vmv";
-  case Opcode::vrsu: return "vrsu";
-  case Opcode::vrsl: return "vrsl";
-  case Opcode::ld: return "ld";
-  case Opcode::st: return "st";
-  case Opcode::lldi: return "lldi";
-  case Opcode::lmv: return "lmv";
-  case Opcode::send: return "send";
-  case Opcode::recv: return "recv";
-  case Opcode::wait: return "wait";
-  case Opcode::sync: return "sync";
+  case Opcode::vmv:      return "vmv";
+  case Opcode::vrsu:     return "vrsu";
+  case Opcode::vrsl:     return "vrsl";
+  case Opcode::ld:       return "ld";
+  case Opcode::st:       return "st";
+  case Opcode::lldi:     return "lldi";
+  case Opcode::lmv:      return "lmv";
+  case Opcode::send:     return "send";
+  case Opcode::recv:     return "recv";
+  case Opcode::wait:     return "wait";
+  case Opcode::sync:     return "sync";
  }
  llvm_unreachable("Unsupported PIM binary opcode");
 }
@@ -235,9 +228,7 @@ inline InstructionRecord makeInstructionRecord(const llvm::json::Object& instruc
  case Opcode::sldi:
  case Opcode::saddi:
  case Opcode::smuli:
-  case Opcode::lldi:
-    record.r2OrImm = getOptionalInt(instruction, "imm");
-    break;
+  case Opcode::lldi:  record.r2OrImm = getOptionalInt(instruction, "imm"); break;
  case Opcode::mvmul:
    record.r2OrImm = getOptionalInt(instruction, "mbiw");
    record.generic1 = getOptionalInt(instruction, "relu");
@@ -252,9 +243,7 @@ inline InstructionRecord makeInstructionRecord(const llvm::json::Object& instruc
    record.r2OrImm = getOptionalInt(instruction, "core");
    record.generic3 = getOptionalInt(instruction, "size");
    break;
-  default:
-    record.r2OrImm = getOptionalInt(instruction, "rs2");
-    break;
+  default: record.r2OrImm = getOptionalInt(instruction, "rs2"); break;
  }

  if (record.opcode != Opcode::mvmul && record.opcode != Opcode::setbw) {
@@ -371,8 +360,7 @@ inline llvm::json::Object makeInstructionJson(const InstructionRecord& record) {
    break;
  case Opcode::wait:
  case Opcode::sync:
-  case Opcode::nop:
-    break;
+  case Opcode::nop:  break;
  }

  return instruction;
@@ -4,13 +4,18 @@

 #include "llvm-project/clang/include/clang/Basic/LLVM.h"
 #include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/Hashing.h"
+#include "llvm/ADT/SmallVector.h"
 #include "llvm/Support/JSON.h"
 #include "llvm/Support/raw_os_ostream.h"

 #include <fstream>
+#include <limits>
 #include <optional>
+#include <string>

 #include "onnx-mlir/Compiler/OMCompilerTypes.h"
+#include "src/Accelerators/PIM/Common/IR/AddressAnalysis.hpp"
 #include "src/Accelerators/PIM/Common/PimCommon.hpp"
 #include "src/Accelerators/PIM/Common/Support/ReportUtils.hpp"
 #include "src/Accelerators/PIM/Compiler/PimBinaryFormat.hpp"
@@ -23,6 +28,23 @@ struct MemEntry {
  size_t size;
 };

+struct PhysicalSlotInfo {
+  size_t id = 0;
+  size_t address = 0;
+  size_t size = 0;
+};
+
+struct MemoryPlanArtifacts {
+  std::string textReport;
+};
+
+struct MemoryValueKey {
+  mlir::Value value;
+  std::optional<unsigned> lane;
+
+  bool operator==(const MemoryValueKey& other) const { return value == other.value && lane == other.lane; }
+};
+
 struct MemoryReportRow {
  uint64_t numAlloca = 0;
  uint64_t sizeAlloca = 0;
@@ -35,6 +57,19 @@ struct MemoryReportRow {
  }
 };

+enum class MemoryReportKind {
+  None,
+  Alloca,
+  Global,
+  Input
+};
+
+struct PendingMemEntry {
+  MemEntry memEntry;
+  MemoryValueKey key;
+  MemoryReportKind reportKind = MemoryReportKind::None;
+};
+
 struct MemoryReportEntry {
  enum class Kind {
    Core,
@@ -50,33 +85,39 @@ struct MemoryReportEntry {
 };

 class PimMemory {
-  llvm::SmallVector<std::pair<MemEntry, mlir::Value>, 32> memEntries;
-  llvm::SmallDenseMap<mlir::Value, MemEntry, 32>& globalMemEntriesMap;
-  llvm::SmallDenseMap<mlir::Value, MemEntry, 32> ownedMemEntriesMap;
+  llvm::SmallVector<PendingMemEntry, 32> memEntries;
+  llvm::SmallVector<PhysicalSlotInfo, 32> localPhysicalSlots;
+  llvm::SmallDenseMap<MemoryValueKey, MemEntry, 32>& globalMemEntriesMap;
+  llvm::SmallDenseMap<MemoryValueKey, MemEntry, 32> ownedMemEntriesMap;
+  MemoryReportRow reportRow;
+  MemoryPlanArtifacts livenessArtifacts;

  size_t minAlignment = 4;
  size_t firstAvailableAddress = 0;
+  size_t nextPhysicalSlotId = 0;

-  MemEntry* gatherMemEntry(mlir::Value value);
+  MemEntry* gatherMemEntry(mlir::Value value, std::optional<unsigned> lane = std::nullopt);
  void allocateGatheredMemory();
-  void allocateMemoryForValue(mlir::Value value, MemEntry& memEntry);
+  void allocateMemoryForValue(const MemoryValueKey& key, MemEntry& memEntry, MemoryReportKind reportKind);
+  PhysicalSlotInfo allocatePhysicalSlot(size_t slotSize, const MemoryValueKey& key);

 public:
-  PimMemory(llvm::SmallDenseMap<mlir::Value, MemEntry, 32>& globalMemEntriesMap)
+  PimMemory(llvm::SmallDenseMap<MemoryValueKey, MemEntry, 32>& globalMemEntriesMap)
  : globalMemEntriesMap(globalMemEntriesMap) {}

  void allocateHost(mlir::ModuleOp moduleOp, mlir::func::FuncOp funcOp);
-  void allocateCore(mlir::Operation* op);
+  void allocateCore(mlir::Operation* op, std::optional<unsigned> lane = std::nullopt);
  MemoryReportRow getReportRow() const;
+  const MemoryPlanArtifacts& getLivenessArtifacts() const { return livenessArtifacts; }
  void remove(mlir::Value val);

  size_t getFirstAvailableAddress() const { return firstAvailableAddress; }
-  MemEntry getMemEntry(mlir::Value value) const;
+  MemEntry getMemEntry(const MemoryValueKey& key) const;
 };

 class PimAcceleratorMemory {
 public:
-  llvm::SmallDenseMap<mlir::Value, MemEntry, 32> memEntriesMap;
+  llvm::SmallDenseMap<MemoryValueKey, MemEntry, 32> memEntriesMap;
  PimMemory hostMem;

 private:
@@ -84,14 +125,24 @@ private:
  std::fstream fileReport;
  std::optional<MemoryReportRow> hostReportRow;
  llvm::SmallVector<MemoryReportEntry, 32> reportEntries;
+  uint64_t totalWeightBytes = 0;
+  mutable llvm::DenseMap<mlir::Value, CompiledIndexExpr> compiledIndexExprs;
+  mutable llvm::DenseMap<mlir::Value, CompiledAddressExpr> compiledAddressExprs;

 public:
  PimAcceleratorMemory()
  : hostMem(memEntriesMap), fileReport(openReportFile("memory_report")) {}
+  PimAcceleratorMemory(const llvm::SmallDenseMap<MemoryValueKey, MemEntry, 32>& initialMemEntries, bool enableReport)
+  : memEntriesMap(initialMemEntries),
+    hostMem(memEntriesMap),
+    fileReport(enableReport ? openReportFile("memory_report") : std::fstream()) {}

  PimMemory& getOrCreateDeviceMem(size_t id);

-  size_t getValueAddress(mlir::Value value, const StaticValueKnowledge& knowledge = {}) const;
+  size_t getValueAddress(mlir::Value value,
+                         const StaticValueKnowledge& knowledge = {},
+                         std::optional<unsigned> lane = std::nullopt) const;
+  llvm::FailureOr<int64_t> getIndexValue(mlir::Value value, const StaticValueKnowledge& knowledge = {}) const;
  void reportHost();
  void recordCoreReport(size_t coreId, const MemoryReportRow& row);
  void recordBatchReport(uint64_t batchId,
@@ -99,19 +150,29 @@ public:
                         const MemoryReportRow& perCoreRow,
                         uint64_t totalAllocaCount,
                         uint64_t totalAllocaBytes);
+  void setTotalWeightBytes(uint64_t bytes) { totalWeightBytes = bytes; }
  void flushReport();
  void clean(mlir::Operation* op);
 };

+struct CoreEmissionJob {
+  mlir::Operation* coreLikeOp = nullptr;
+  size_t originalCoreId = 0;
+  size_t emittedCoreId = 0;
+  llvm::SmallVector<unsigned, 4> lanes;
+  std::optional<uint64_t> batchReportId;
+};
+
 class PimCodeGen {
  PimAcceleratorMemory& memory;
  llvm::raw_fd_ostream& coreBinaryStream;
  llvm::raw_fd_ostream* coreJsonStream;
  const llvm::DenseMap<size_t, size_t>& emittedCoreIds;
+  std::optional<unsigned> batchLane;
  mutable uint32_t emittedInstructionCount = 0;

  size_t addressOf(mlir::Value value, const StaticValueKnowledge& knowledge) const {
-    return memory.getValueAddress(value, knowledge);
+    return memory.getValueAddress(value, knowledge, batchLane);
  }
  size_t remapCoreId(size_t coreId) const;

@@ -141,15 +202,17 @@ public:
  : memory(memory), coreBinaryStream(coreBinary), coreJsonStream(coreJson), emittedCoreIds(emittedCoreIds) {}

  uint32_t getEmittedInstructionCount() const { return emittedInstructionCount; }
+  void setBatchLane(std::optional<unsigned> lane) { batchLane = lane; }
+  llvm::FailureOr<int64_t> indexOf(mlir::Value value, const StaticValueKnowledge& knowledge) const {
+    return memory.getIndexValue(value, knowledge);
+  }

  void codeGenLoadOp(pim::PimMemCopyHostToDevOp loadOp, const StaticValueKnowledge& knowledge) const;
  void codeGenStoreOp(pim::PimMemCopyDevToHostOp storeOp, const StaticValueKnowledge& knowledge) const;
  void codeGenLmvOp(pim::PimMemCopyOp lmvOp, const StaticValueKnowledge& knowledge) const;

  void codeGenReceiveOp(pim::PimReceiveOp receiveOp, const StaticValueKnowledge& knowledge) const;
-  void codeGenReceiveTensorOp(pim::PimReceiveTensorOp receiveTensorOp, const StaticValueKnowledge& knowledge) const;
  void codeGenSendOp(pim::PimSendOp sendOp, const StaticValueKnowledge& knowledge) const;
-  void codeGenSendTensorOp(pim::PimSendTensorOp sendTensorOp, const StaticValueKnowledge& knowledge) const;
  void codeGenConcatOp(pim::PimConcatOp concatOp, const StaticValueKnowledge& knowledge) const;

  template <typename MVMTy>
@@ -172,3 +235,20 @@ public:
 OnnxMlirCompilerErrorCodes compileToPimCode(mlir::ModuleOp& moduleOpRef, std::string& outputDirName);

 } // namespace onnx_mlir
+
+namespace llvm {
+
+template <>
+struct DenseMapInfo<onnx_mlir::MemoryValueKey> {
+  static onnx_mlir::MemoryValueKey getEmptyKey() { return {DenseMapInfo<mlir::Value>::getEmptyKey(), 0}; }
+
+  static onnx_mlir::MemoryValueKey getTombstoneKey() { return {DenseMapInfo<mlir::Value>::getTombstoneKey(), 0}; }
+
+  static unsigned getHashValue(const onnx_mlir::MemoryValueKey& key) {
+    return hash_combine(key.value, key.lane.value_or(std::numeric_limits<unsigned>::max()));
+  }
+
+  static bool isEqual(const onnx_mlir::MemoryValueKey& lhs, const onnx_mlir::MemoryValueKey& rhs) { return lhs == rhs; }
+};
+
+} // namespace llvm
@@ -1,3 +1,5 @@
+#include "llvm/Support/ErrorHandling.h"
+
 #include "src/Accelerators/PIM/Compiler/PimCompilerOptions.hpp"

 #define DEBUG_TYPE "PimCompilerOptions"
@@ -13,12 +15,35 @@ llvm::cl::opt<PimEmissionTargetType> pimEmissionTarget(
  llvm::cl::init(EmitPimCodegen),
  llvm::cl::cat(OnnxMlirOptions));

+llvm::cl::opt<PimMergeSchedulerType>
+  pimMergeScheduler("pim-merge-scheduler",
+                    llvm::cl::desc("Scheduler used by the Spatial merge-compute-nodes pass"),
+                    llvm::cl::values(clEnumValN(MergeSchedulerPeft, "peft", "Use PEFT scheduling")),
+                    llvm::cl::init(MergeSchedulerPeft),
+                    llvm::cl::cat(OnnxMlirOptions));
+
+llvm::cl::opt<PimMemoryReportLevel> pimMemoryReport(
+  "pim-memory-report",
+  llvm::cl::desc("Emit a human-readable PIM memory planning report"),
+  llvm::cl::values(clEnumValN(PimMemoryReportNone, "none", "Do not emit any PIM memory planning report")),
+  llvm::cl::values(
+    clEnumValN(PimMemoryReportSummary, "summary", "Emit a concise slot reuse report with key offenders")),
+  llvm::cl::values(clEnumValN(PimMemoryReportFull, "full", "Emit the full detailed PIM memory planning report")),
+  llvm::cl::init(PimMemoryReportNone),
+  llvm::cl::cat(OnnxMlirOptions));
+
 llvm::cl::opt<bool>
  pimOnlyCodegen("pim-only-codegen",
                 llvm::cl::desc("Only generate code for PIM (assume input is already in bufferized PIM IR)"),
                 llvm::cl::init(false),
                 llvm::cl::cat(OnnxMlirOptions));

+llvm::cl::opt<bool>
+  pimDisableMemoryCoalescing("pim-disable-memory-coalescing",
+                             llvm::cl::desc("Skip the PIM memory coalescing pass (developer diagnostic option)"),
+                             llvm::cl::init(false),
+                             llvm::cl::cat(OnnxMlirOptions));
+
 llvm::cl::opt<bool> useExperimentalConvImpl("use-experimental-conv-impl",
                                            llvm::cl::desc("Use experimental implementation for convolution"),
                                            llvm::cl::init(false),
@@ -30,24 +55,27 @@ llvm::cl::opt<bool> pimEmitJson("pim-emit-json",
                                llvm::cl::cat(OnnxMlirOptions));

 llvm::cl::opt<size_t>
-  crossbarSize("crossbar-size", llvm::cl::desc("Width and heigth of a single crossbar"), llvm::cl::init(2));
+  crossbarSize("crossbar-size", llvm::cl::desc("Width and height of a single crossbar"), llvm::cl::init(2));

 llvm::cl::opt<size_t>
  crossbarCountInCore("crossbar-count", llvm::cl::desc("Number of crossbars in each core"), llvm::cl::init(256));

 llvm::cl::opt<long> coresCount("core-count",
-                               llvm::cl::desc("Number of cores in the chip. `-1` to use the minimum amount of cores."),
+                               llvm::cl::desc("Number of cores in the chip. Required for PIM compilation."),
                               llvm::cl::init(-1));

-llvm::cl::opt<size_t> dcpCriticalWindowSize(
-  "dcp-critical-window-size",
-  llvm::cl::desc("Number of lowest-slack virtual nodes considered by each DCP coarsening iteration. "
-                 "Use 0 to run the legacy full-graph DCP analysis."),
-  llvm::cl::init(4000));
-
 llvm::cl::opt<bool>
  ignoreConcatError("ignore-concat-error",
                    llvm::cl::desc("Ignore ConcatOp corner case: do not assert and do a simplification"),
                    llvm::cl::init(false));

+bool hasExplicitPimCoreCount() { return coresCount.getNumOccurrences() != 0; }
+
+void verifyExplicitPimCoreCount() {
+  if (!hasExplicitPimCoreCount())
+    llvm::report_fatal_error("PIM compilation requires an explicit --core-count=<positive integer>");
+  if (coresCount.getValue() <= 0)
+    llvm::report_fatal_error("PIM compilation requires --core-count to be a positive integer");
+}
+
 } // namespace onnx_mlir
@@ -20,17 +20,32 @@ typedef enum {
  EmitPimCodegen = 3
 } PimEmissionTargetType;

+typedef enum {
+  MergeSchedulerPeft = 0,
+} PimMergeSchedulerType;
+
+typedef enum {
+  PimMemoryReportNone = 0,
+  PimMemoryReportSummary = 1,
+  PimMemoryReportFull = 2,
+} PimMemoryReportLevel;
+
 extern llvm::cl::OptionCategory OnnxMlirOptions;
 extern llvm::cl::opt<PimEmissionTargetType> pimEmissionTarget;
+extern llvm::cl::opt<PimMergeSchedulerType> pimMergeScheduler;
+extern llvm::cl::opt<PimMemoryReportLevel> pimMemoryReport;

 extern llvm::cl::opt<bool> pimOnlyCodegen;
+extern llvm::cl::opt<bool> pimDisableMemoryCoalescing;
 extern llvm::cl::opt<bool> useExperimentalConvImpl;
 extern llvm::cl::opt<bool> pimEmitJson;

 extern llvm::cl::opt<size_t> crossbarSize;
 extern llvm::cl::opt<size_t> crossbarCountInCore;
 extern llvm::cl::opt<long> coresCount;
-extern llvm::cl::opt<size_t> dcpCriticalWindowSize;
+
+bool hasExplicitPimCoreCount();
+void verifyExplicitPimCoreCount();

 // This option, by default set to false, will ignore an error when resolving a
 // specific tiles of the operands of a concat. This specific case is when the
@@ -17,6 +17,7 @@ void addPassesPim(OwningOpRef<ModuleOp>& module,
                  PassManager& pm,
                  EmissionTargetType& emissionTarget,
                  std::string outputNameNoExt) {
+  verifyExplicitPimCoreCount();

  if (pimOnlyCodegen) {
    // Skip all the lowering passes and directly generate code for PIM.
@@ -29,31 +30,27 @@ void addPassesPim(OwningOpRef<ModuleOp>& module,
  if (pimEmissionTarget >= EmitSpatial) {
    pm.addPass(createONNXToSpatialPass());
    pm.addPass(createMergeComputeNodesPass());
-    // pm.addPass(createCountInstructionPass());
    pm.addPass(createMessagePass("Onnx lowered to Spatial"));
  }

  if (pimEmissionTarget >= EmitPim) {
    pm.addPass(createSpatialToPimPass());
-    // pm.addPass(createCountInstructionPass());
    pm.addPass(createMessagePass("Spatial lowered to Pim"));
  }

  if (pimEmissionTarget >= EmitPimBufferized) {
    pm.addPass(createPimBufferizationPass());
-    pm.addPass(createPimStaticMemoryCoalescingPass());
-    // pm.addPass(createCountInstructionPass());
    pm.addPass(createMessagePass("Pim bufferized"));
  }

  if (pimEmissionTarget >= EmitPimCodegen) {
    pm.addPass(createPimHostConstantFoldingPass());
    pm.addPass(createMessagePass("Pim host constants folded"));
-    pm.addPass(createPimMaterializeHostConstantsPass());
+    if (!pimDisableMemoryCoalescing)
+      pm.addPass(createPimMemoryCoalescingPass());
    pm.addPass(createPimVerificationPass());
    pm.addPass(createMessagePass("Pim verified"));
    pm.addPass(createEmitPimCodePass());
-    // pm.addPass(createCountInstructionPass());
    pm.addPass(createMessagePass("Pim code emitted"));
  }
 }
@@ -0,0 +1,733 @@
+#include "mlir/Dialect/Bufferization/IR/BufferizableOpInterface.h"
+#include "mlir/Dialect/MemRef/IR/MemRef.h"
+#include "mlir/Dialect/SCF/IR/SCF.h"
+#include "mlir/IR/Value.h"
+#include "mlir/Interfaces/DestinationStyleOpInterface.h"
+
+#include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/STLExtras.h"
+#include "llvm/ADT/SmallPtrSet.h"
+#include "llvm/Support/raw_ostream.h"
+
+#include <numeric>
+#include <string>
+#include <tuple>
+#include <utility>
+
+#include "Common/Support/CheckedArithmetic.hpp"
+#include "Common/Support/ReportUtils.hpp"
+#include "src/Accelerators/PIM/Common/PimCommon.hpp"
+#include "src/Accelerators/PIM/Compiler/PimMemoryLiveness.hpp"
+#include "src/Accelerators/PIM/Dialect/Pim/PimOps.hpp"
+
+using namespace llvm;
+using namespace mlir;
+using namespace onnx_mlir;
+
+namespace {
+
+static std::optional<unsigned> getLaneForMemoryValue(mlir::Value value, std::optional<unsigned> lane) {
+  if (!lane)
+    return std::nullopt;
+  auto allocOp = value.getDefiningOp<memref::AllocOp>();
+  if (!allocOp || !allocOp->getParentOfType<pim::PimCoreBatchOp>())
+    return std::nullopt;
+  return lane;
+}
+
+static MemoryValueKey getMemoryValueKey(mlir::Value value, std::optional<unsigned> lane = std::nullopt) {
+  return {value, getLaneForMemoryValue(value, lane)};
+}
+
+struct MemoryTouchInterval {
+  uint64_t start = 0;
+  uint64_t end = 0;
+  Operation* startOp = nullptr;
+  Operation* endOp = nullptr;
+  Operation* firstTouchOp = nullptr;
+  Operation* lastTouchOp = nullptr;
+  uint64_t firstTouchPosition = 0;
+  uint64_t lastTouchPosition = 0;
+  bool hasRuntimeUse = false;
+  bool startUsedAllocFallback = false;
+  bool endUsedFallback = false;
+  bool escapesLoop = false;
+  std::string fallbackReason;
+  llvm::SmallVector<std::string, 8> aliasesFollowed;
+};
+
+struct OperationOrdering {
+  llvm::DenseMap<Operation*, uint64_t> position;
+  llvm::DenseMap<Operation*, uint64_t> subtreeEnd;
+  uint64_t nextPosition = 0;
+};
+
+static std::string printValueToString(mlir::Value value) {
+  std::string text;
+  llvm::raw_string_ostream os(text);
+  value.print(os);
+  os.flush();
+  return text;
+}
+
+static std::string printOperationToString(Operation* op) {
+  if (!op)
+    return "<none>";
+  std::string text;
+  llvm::raw_string_ostream os(text);
+  op->print(os);
+  os.flush();
+  return text;
+}
+
+static std::string printLocationToString(Location loc) {
+  std::string text;
+  llvm::raw_string_ostream os(text);
+  loc.print(os);
+  os.flush();
+  return text;
+}
+
+static std::string collapseWhitespace(StringRef text) {
+  std::string out;
+  out.reserve(text.size());
+  bool lastWasSpace = false;
+  for (char c : text) {
+    bool isSpace = c == ' ' || c == '\n' || c == '\t' || c == '\r';
+    if (isSpace) {
+      if (!lastWasSpace && !out.empty())
+        out.push_back(' ');
+      lastWasSpace = true;
+      continue;
+    }
+    out.push_back(c);
+    lastWasSpace = false;
+  }
+  return out;
+}
+
+static std::string abbreviate(StringRef text, size_t maxLen) {
+  if (text.size() <= maxLen)
+    return text.str();
+  return (text.take_front(maxLen - 3) + "...").str();
+}
+
+static std::string summarizeValue(mlir::Value value, size_t maxLen = 72) {
+  return abbreviate(collapseWhitespace(printValueToString(value)), maxLen);
+}
+
+static std::string summarizeOperation(Operation* op, size_t maxLen = 96) {
+  if (!op)
+    return "<none>";
+  std::string prefix = op->getName().getStringRef().str();
+  std::string full = collapseWhitespace(printOperationToString(op));
+  if (full == prefix)
+    return prefix;
+  return abbreviate(prefix + " :: " + full, maxLen);
+}
+
+static std::string summarizeLocation(Location loc, size_t maxLen = 88) {
+  return abbreviate(collapseWhitespace(printLocationToString(loc)), maxLen);
+}
+
+static void assignOperationOrdering(Operation* op, OperationOrdering& ordering) {
+  uint64_t position = ordering.nextPosition++;
+  ordering.position[op] = position;
+  uint64_t end = position;
+  for (Region& region : op->getRegions())
+    for (Block& block : region)
+      for (Operation& nestedOp : block) {
+        assignOperationOrdering(&nestedOp, ordering);
+        end = std::max(end, ordering.subtreeEnd.lookup(&nestedOp));
+      }
+  ordering.subtreeEnd[op] = end;
+}
+
+static OperationOrdering buildOperationOrdering(Operation* coreLikeOp) {
+  OperationOrdering ordering;
+  if (!coreLikeOp || coreLikeOp->getNumRegions() != 1 || coreLikeOp->getRegion(0).empty())
+    return ordering;
+
+  for (Operation& op : coreLikeOp->getRegion(0).front())
+    assignOperationOrdering(&op, ordering);
+  return ordering;
+}
+
+static bool isSupportedAliasOp(Operation* op) {
+  return isa<memref::SubViewOp, memref::CastOp, memref::CollapseShapeOp, memref::ExpandShapeOp>(op);
+}
+
+static bool isRuntimeMemoryTouchOp(Operation* op) {
+  return isa<pim::PimMemCopyHostToDevOp,
+             pim::PimMemCopyDevToHostOp,
+             pim::PimMemCopyOp,
+             pim::PimReceiveOp,
+             pim::PimSendOp,
+             pim::PimConcatOp,
+             pim::PimVMMOp,
+             pim::PimTransposeOp,
+             pim::PimVVAddOp,
+             pim::PimVVSubOp,
+             pim::PimVVMulOp,
+             pim::PimVVMaxOp,
+             pim::PimVVDMulOp,
+             pim::PimVAvgOp,
+             pim::PimVReluOp,
+             pim::PimVTanhOp,
+             pim::PimVSigmOp,
+             pim::PimVSoftmaxOp>(op);
+}
+
+static bool isIgnoredLivenessUser(Operation* op) {
+  return isSupportedAliasOp(op) || isa<scf::ForOp, scf::YieldOp, memref::DeallocOp>(op) || isCoreStaticAddressOp(op);
+}
+
+static bool isWithin(mlir::Value value, Region* region) {
+  if (!region)
+    return false;
+  if (auto blockArg = dyn_cast<BlockArgument>(value))
+    return blockArg.getOwner()->getParent() == region;
+  if (Operation* definingOp = value.getDefiningOp())
+    return definingOp->getParentRegion() == region || region->isAncestor(definingOp->getParentRegion());
+  return false;
+}
+
+static bool isNestedAllocation(Operation* coreLikeOp, memref::AllocOp allocOp) {
+  if (!coreLikeOp || coreLikeOp->getNumRegions() != 1 || coreLikeOp->getRegion(0).empty())
+    return false;
+  return allocOp->getBlock() != &coreLikeOp->getRegion(0).front();
+}
+
+static void addFallbackReason(std::string& reason, StringRef newReason) {
+  if (newReason.empty())
+    return;
+  if (!reason.empty())
+    reason += "; ";
+  reason += newReason.str();
+}
+
+static void appendAliasDescription(llvm::SmallVectorImpl<std::string>& aliases, mlir::Value value) {
+  std::string text = printValueToString(value);
+  if (!llvm::is_contained(aliases, text))
+    aliases.push_back(std::move(text));
+}
+
+struct OrderedTouchRange {
+  uint64_t start = 0;
+  uint64_t end = 0;
+  Operation* startOp = nullptr;
+  Operation* endOp = nullptr;
+  bool escapedLoop = false;
+};
+
+static OrderedTouchRange
+getEffectiveTouchRange(mlir::Value definingValue, Operation* user, const OperationOrdering& ordering) {
+  OrderedTouchRange range {ordering.position.lookup(user), ordering.position.lookup(user), user, user, false};
+  for (Operation* current = user; current; current = current->getParentOp()) {
+    auto forOp = dyn_cast<scf::ForOp>(current);
+    if (!forOp || isWithin(definingValue, &forOp.getRegion()))
+      continue;
+    range.start = std::min(range.start, ordering.position.lookup(forOp));
+    range.end = std::max(range.end, ordering.subtreeEnd.lookup(forOp));
+    range.startOp = forOp;
+    range.endOp = forOp;
+    range.escapedLoop = true;
+  }
+  return range;
+}
+
+static MemoryTouchInterval
+computeMemoryTouchInterval(memref::AllocOp allocOp, const OperationOrdering& ordering, uint64_t fallbackEnd) {
+  MemoryTouchInterval interval;
+  interval.start = ordering.position.lookup(allocOp);
+  interval.end = interval.start;
+  interval.startOp = allocOp;
+  interval.endOp = allocOp;
+
+  SmallPtrSet<mlir::Value, 16> visitedValues;
+  SmallPtrSet<Operation*, 32> visitedUsers;
+  SmallVector<mlir::Value> pendingValues;
+  pendingValues.push_back(allocOp.getResult());
+  auto parentLoop = allocOp->getParentOfType<scf::ForOp>();
+
+  while (!pendingValues.empty()) {
+    mlir::Value value = pendingValues.pop_back_val();
+    if (!visitedValues.insert(value).second)
+      continue;
+
+    for (Operation* user : value.getUsers()) {
+      if (!visitedUsers.insert(user).second)
+        continue;
+
+      if (isSupportedAliasOp(user)) {
+        for (mlir::Value result : user->getResults()) {
+          pendingValues.push_back(result);
+          appendAliasDescription(interval.aliasesFollowed, result);
+        }
+      }
+
+      if (auto dpsOp = dyn_cast<DestinationStyleOpInterface>(user)) {
+        for (OpResult result : user->getResults()) {
+          OpOperand* tiedOperand = dpsOp.getTiedOpOperand(result);
+          if (!tiedOperand || tiedOperand->get() != value)
+            continue;
+          pendingValues.push_back(result);
+          appendAliasDescription(interval.aliasesFollowed, result);
+        }
+      }
+
+      if (auto forOp = dyn_cast<scf::ForOp>(user)) {
+        for (auto [index, initArg] : llvm::enumerate(forOp.getInitArgs())) {
+          if (initArg != value)
+            continue;
+          pendingValues.push_back(forOp.getRegionIterArgs()[index]);
+          pendingValues.push_back(forOp.getResult(index));
+          appendAliasDescription(interval.aliasesFollowed, forOp.getRegionIterArgs()[index]);
+          appendAliasDescription(interval.aliasesFollowed, forOp.getResult(index));
+          if (parentLoop && forOp != parentLoop)
+            interval.escapesLoop = true;
+        }
+      }
+
+      if (auto yieldOp = dyn_cast<scf::YieldOp>(user)) {
+        auto forOp = dyn_cast<scf::ForOp>(yieldOp->getParentOp());
+        if (!forOp) {
+          addFallbackReason(interval.fallbackReason, "yield without scf.for parent");
+        }
+        else {
+          for (auto [index, operand] : llvm::enumerate(yieldOp.getOperands())) {
+            if (operand != value)
+              continue;
+            pendingValues.push_back(forOp.getResult(index));
+            appendAliasDescription(interval.aliasesFollowed, forOp.getResult(index));
+            if (parentLoop && forOp == parentLoop)
+              interval.escapesLoop = true;
+          }
+        }
+      }
+
+      if (isRuntimeMemoryTouchOp(user)) {
+        uint64_t touchPosition = ordering.position.lookup(user);
+        if (!interval.hasRuntimeUse || touchPosition < interval.firstTouchPosition) {
+          interval.firstTouchPosition = touchPosition;
+          interval.firstTouchOp = user;
+        }
+        if (!interval.hasRuntimeUse || touchPosition > interval.lastTouchPosition) {
+          interval.lastTouchPosition = touchPosition;
+          interval.lastTouchOp = user;
+        }
+
+        OrderedTouchRange range = getEffectiveTouchRange(allocOp.getResult(), user, ordering);
+        interval.escapesLoop |= range.escapedLoop;
+        if (!interval.hasRuntimeUse) {
+          interval.start = range.start;
+          interval.end = range.end;
+          interval.startOp = range.startOp;
+          interval.endOp = range.endOp;
+          interval.hasRuntimeUse = true;
+        }
+        else {
+          if (range.start < interval.start) {
+            interval.start = range.start;
+            interval.startOp = range.startOp;
+          }
+          if (range.end > interval.end) {
+            interval.end = range.end;
+            interval.endOp = range.endOp;
+          }
+        }
+        continue;
+      }
+
+      if (isIgnoredLivenessUser(user))
+        continue;
+
+      addFallbackReason(interval.fallbackReason, "unhandled user op");
+      interval.endUsedFallback = true;
+    }
+  }
+
+  if (!interval.hasRuntimeUse) {
+    interval.startUsedAllocFallback = true;
+    interval.endUsedFallback = true;
+    interval.start = ordering.position.lookup(allocOp);
+    interval.end = fallbackEnd;
+    interval.startOp = allocOp;
+    interval.endOp = allocOp->getParentOp();
+    interval.firstTouchPosition = interval.start;
+    interval.lastTouchPosition = interval.end;
+    addFallbackReason(interval.fallbackReason, "no runtime memory touch");
+    return interval;
+  }
+
+  if (interval.endUsedFallback) {
+    interval.end = std::max(interval.end, fallbackEnd);
+    interval.endOp = allocOp->getParentOp();
+  }
+
+  return interval;
+}
+
+static FailureOr<size_t> getAllocSizeBytes(memref::AllocOp allocOp) {
+  auto type = dyn_cast<ShapedType>(allocOp.getType());
+  if (!type)
+    return failure();
+  auto checkedBytes = pim::getCheckedShapedTypeSizeInBytes(type, allocOp, "memory allocation byte size");
+  if (failed(checkedBytes))
+    return failure();
+  return pim::checkedSize(*checkedBytes, allocOp, "memory allocation byte size");
+}
+
+static bool intervalsOverlap(const LocalAllocInterval& lhs, const LocalAllocInterval& rhs) {
+  return !(lhs.end < rhs.start || rhs.end < lhs.start);
+}
+
+static uint64_t getSlotLogicalBytes(const PlannedPhysicalSlot& slot, ArrayRef<LocalAllocInterval> intervals) {
+  uint64_t slotLogicalBytes = 0;
+  for (size_t intervalIndex : slot.intervalIndices)
+    slotLogicalBytes += intervals[intervalIndex].size;
+  return slotLogicalBytes;
+}
+
+} // namespace
+
+SmallVector<LocalAllocInterval, 0> onnx_mlir::buildLocalAllocIntervals(Operation* coreLikeOp,
+                                                                       std::optional<unsigned> lane) {
+  SmallVector<LocalAllocInterval, 0> intervals;
+  OperationOrdering ordering = buildOperationOrdering(coreLikeOp);
+  if (ordering.position.empty())
+    return intervals;
+
+  uint64_t fallbackEnd = ordering.nextPosition == 0 ? 0 : ordering.nextPosition - 1;
+  size_t nextIntervalId = 0;
+  coreLikeOp->walk([&](memref::AllocOp allocOp) {
+    auto checkedSize = getAllocSizeBytes(allocOp);
+    if (failed(checkedSize)) {
+      llvm::errs() << "Failed to compute local allocation size for value: ";
+      allocOp.getResult().print(llvm::errs());
+      llvm::errs() << "\n";
+      llvm_unreachable("Failed to compute local allocation size");
+    }
+
+    MemoryTouchInterval touchInterval = computeMemoryTouchInterval(allocOp, ordering, fallbackEnd);
+    LocalAllocInterval interval;
+    interval.id = nextIntervalId++;
+    interval.alloc = allocOp;
+    interval.key = getMemoryValueKey(allocOp.getResult(), lane);
+    interval.start = touchInterval.start;
+    interval.end = touchInterval.end;
+    interval.size = *checkedSize;
+    interval.startOp = touchInterval.startOp;
+    interval.endOp = touchInterval.endOp;
+    interval.firstTouchOp = touchInterval.firstTouchOp;
+    interval.lastTouchOp = touchInterval.lastTouchOp;
+    interval.firstTouchPosition = touchInterval.firstTouchPosition;
+    interval.lastTouchPosition = touchInterval.lastTouchPosition;
+    interval.startUsedAllocFallback = touchInterval.startUsedAllocFallback;
+    interval.endUsedFallback = touchInterval.endUsedFallback;
+    interval.hasRuntimeUse = touchInterval.hasRuntimeUse;
+    interval.insideNestedRegion = isNestedAllocation(coreLikeOp, allocOp);
+    interval.escapesLoop = touchInterval.escapesLoop;
+    interval.fallbackReason = std::move(touchInterval.fallbackReason);
+    interval.aliasesFollowed = std::move(touchInterval.aliasesFollowed);
+    intervals.push_back(std::move(interval));
+  });
+
+  return intervals;
+}
+
+SmallVector<PlannedPhysicalSlot, 0> onnx_mlir::planPhysicalSlots(MutableArrayRef<LocalAllocInterval> intervals) {
+  SmallVector<PlannedPhysicalSlot, 0> slots;
+  SmallVector<size_t> intervalOrder(intervals.size());
+  std::iota(intervalOrder.begin(), intervalOrder.end(), 0);
+  llvm::stable_sort(intervalOrder, [&](size_t lhsIndex, size_t rhsIndex) {
+    const LocalAllocInterval& lhs = intervals[lhsIndex];
+    const LocalAllocInterval& rhs = intervals[rhsIndex];
+    if (lhs.size != rhs.size)
+      return lhs.size > rhs.size;
+    if (lhs.start != rhs.start)
+      return lhs.start < rhs.start;
+    if (lhs.end != rhs.end)
+      return lhs.end < rhs.end;
+    return lhs.id < rhs.id;
+  });
+
+  for (size_t intervalIndex : intervalOrder) {
+    LocalAllocInterval& interval = intervals[intervalIndex];
+    PlannedPhysicalSlot* bestSlot = nullptr;
+    auto bestKey = std::tuple<size_t, size_t, size_t, size_t>(std::numeric_limits<size_t>::max(),
+                                                              std::numeric_limits<size_t>::max(),
+                                                              std::numeric_limits<size_t>::max(),
+                                                              std::numeric_limits<size_t>::max());
+
+    for (size_t slotIndex = 0; slotIndex < slots.size(); ++slotIndex) {
+      PlannedPhysicalSlot& slot = slots[slotIndex];
+      bool compatible = true;
+      for (size_t otherIndex : slot.intervalIndices) {
+        if (intervalsOverlap(interval, intervals[otherIndex])) {
+          compatible = false;
+          break;
+        }
+      }
+      if (!compatible)
+        continue;
+
+      size_t resultingSize = std::max(slot.requiredSize, interval.size);
+      size_t growth = resultingSize - slot.requiredSize;
+      auto candidateKey =
+        std::tuple<size_t, size_t, size_t, size_t>(growth, resultingSize, slot.intervalIndices.size(), slot.id);
+      if (candidateKey < bestKey) {
+        bestKey = candidateKey;
+        bestSlot = &slot;
+      }
+    }
+
+    if (!bestSlot) {
+      slots.push_back({slots.size(), interval.size, interval.size, 0, {intervalIndex}});
+      interval.slotPlanIndex = slots.size() - 1;
+      interval.physicalSlotId = slots.back().id;
+      interval.physicalSlotSize = slots.back().requiredSize;
+      continue;
+    }
+
+    bestSlot->requiredSize = std::max(bestSlot->requiredSize, interval.size);
+    bestSlot->size = bestSlot->requiredSize;
+    bestSlot->intervalIndices.push_back(intervalIndex);
+    interval.slotPlanIndex = static_cast<size_t>(bestSlot - slots.data());
+    interval.physicalSlotId = bestSlot->id;
+    interval.physicalSlotSize = bestSlot->requiredSize;
+  }
+
+  return slots;
+}
+
+MemoryPlanArtifacts onnx_mlir::buildMemoryPlanArtifacts(Operation* coreLikeOp,
+                                                        std::optional<unsigned> lane,
+                                                        ArrayRef<LocalAllocInterval> intervals,
+                                                        ArrayRef<PlannedPhysicalSlot> slots,
+                                                        size_t addressLimit,
+                                                        PimMemoryReportLevel reportLevel) {
+  MemoryPlanArtifacts artifacts;
+
+  uint64_t totalLogicalBytes = 0;
+  uint64_t totalPhysicalBytes = 0;
+  uint64_t fallbackIntervals = 0;
+  uint64_t noRuntimeTouchIntervals = 0;
+  uint64_t reusedAllocations = 0;
+  uint64_t nestedIntervals = 0;
+  uint64_t loopEscapingIntervals = 0;
+  size_t largestLogicalAllocation = 0;
+  size_t largestPhysicalSlot = 0;
+  size_t maximumAssignedAddress = 0;
+
+  for (const LocalAllocInterval& interval : intervals) {
+    totalLogicalBytes += interval.size;
+    largestLogicalAllocation = std::max(largestLogicalAllocation, interval.size);
+    maximumAssignedAddress = std::max(maximumAssignedAddress, interval.assignedAddress + interval.physicalSlotSize);
+    if (interval.startUsedAllocFallback || interval.endUsedFallback)
+      ++fallbackIntervals;
+    if (!interval.hasRuntimeUse)
+      ++noRuntimeTouchIntervals;
+    if (interval.insideNestedRegion)
+      ++nestedIntervals;
+    if (interval.escapesLoop)
+      ++loopEscapingIntervals;
+  }
+  for (const PlannedPhysicalSlot& slot : slots) {
+    totalPhysicalBytes += slot.size;
+    largestPhysicalSlot = std::max(largestPhysicalSlot, slot.size);
+    if (slot.intervalIndices.size() > 1)
+      reusedAllocations += slot.intervalIndices.size() - 1;
+  }
+
+  uint64_t savedBytes = totalLogicalBytes >= totalPhysicalBytes ? totalLogicalBytes - totalPhysicalBytes : 0;
+  double savedPercent =
+    totalLogicalBytes == 0 ? 0.0 : 100.0 * static_cast<double>(savedBytes) / static_cast<double>(totalLogicalBytes);
+
+  raw_string_ostream os(artifacts.textReport);
+  os << "=== PIM Memory Liveness Report ===\n";
+  os << "Op: " << coreLikeOp->getName() << "\n";
+  if (lane)
+    os << "Lane: " << *lane << "\n";
+  os << "Summary:\n";
+  os << "  logical allocation bytes:  " << formatReportMemory(totalLogicalBytes) << " (" << totalLogicalBytes << ")\n";
+  os << "  physical allocation bytes: " << formatReportMemory(totalPhysicalBytes) << " (" << totalPhysicalBytes
+     << ")\n";
+  os << "  saved bytes:               " << formatReportMemory(savedBytes) << " (" << savedBytes << ")\n";
+  os << "  saved percent: " << format("%.2f%%", savedPercent) << "\n";
+  os << "  intervals: " << intervals.size() << "\n";
+  os << "  physical slots: " << slots.size() << "\n";
+  os << "  reused allocations: " << reusedAllocations << "\n";
+  os << "  fallback intervals: " << fallbackIntervals << "\n";
+  os << "  intervals with no runtime memory touch: " << noRuntimeTouchIntervals << "\n";
+  os << "  nested allocations: " << nestedIntervals << "\n";
+  os << "  loop-escaping allocations: " << loopEscapingIntervals << "\n";
+  os << "  largest logical allocation: " << largestLogicalAllocation << "\n";
+  os << "  largest physical slot: " << largestPhysicalSlot << "\n";
+  os << "  address limit: " << addressLimit << "\n";
+  os << "  peak physical memory: " << formatReportMemory(maximumAssignedAddress) << " (" << maximumAssignedAddress
+     << ")\n";
+  os << "  maximum assigned address: " << maximumAssignedAddress << "\n";
+
+  os << "\nHow To Read:\n";
+  os << "  `summary` only shows the strongest reuse cases and the worst offenders.\n";
+  os << "  Use `--pim-memory-report=full` when you need the complete slot-by-slot and interval-by-interval dump.\n";
+  os << "  Large single-use slots, fallback intervals, and nested single-use allocations are the best places\n";
+  os << "  to inspect if allocations should be moved, sunk, or made easier to coalesce earlier in the pipeline.\n";
+
+  SmallVector<const PlannedPhysicalSlot*> reusedSlots;
+  SmallVector<const PlannedPhysicalSlot*> singleUseSlots;
+  for (const PlannedPhysicalSlot& slot : slots)
+    if (slot.intervalIndices.size() > 1)
+      reusedSlots.push_back(&slot);
+    else
+      singleUseSlots.push_back(&slot);
+
+  llvm::stable_sort(reusedSlots, [&](const PlannedPhysicalSlot* lhs, const PlannedPhysicalSlot* rhs) {
+    uint64_t lhsLogicalBytes = getSlotLogicalBytes(*lhs, intervals);
+    uint64_t rhsLogicalBytes = getSlotLogicalBytes(*rhs, intervals);
+    if (lhs->intervalIndices.size() != rhs->intervalIndices.size())
+      return lhs->intervalIndices.size() > rhs->intervalIndices.size();
+    if (lhsLogicalBytes != rhsLogicalBytes)
+      return lhsLogicalBytes > rhsLogicalBytes;
+    if (lhs->size != rhs->size)
+      return lhs->size > rhs->size;
+    return lhs->id < rhs->id;
+  });
+  llvm::stable_sort(singleUseSlots, [&](const PlannedPhysicalSlot* lhs, const PlannedPhysicalSlot* rhs) {
+    if (lhs->size != rhs->size)
+      return lhs->size > rhs->size;
+    return lhs->id < rhs->id;
+  });
+
+  constexpr size_t kSummaryReuseLimit = 6;
+  constexpr size_t kSummaryOffenderLimit = 10;
+
+  os << "\nBest Reuse:\n";
+  if (reusedSlots.empty()) {
+    os << "  no slots were shared by multiple intervals\n";
+  }
+  else {
+    for (const PlannedPhysicalSlot* slot : ArrayRef(reusedSlots).take_front(kSummaryReuseLimit)) {
+      uint64_t slotLogicalBytes = getSlotLogicalBytes(*slot, intervals);
+      os << "  slot #" << slot->id << "  addr=" << slot->address << "  size=" << formatReportMemory(slot->size)
+         << "  intervals=" << slot->intervalIndices.size() << "  logical_sum=" << formatReportMemory(slotLogicalBytes)
+         << "\n";
+      for (size_t intervalIndex : slot->intervalIndices) {
+        const LocalAllocInterval& interval = intervals[intervalIndex];
+        os << "    #" << interval.id << "  [" << interval.start << "," << interval.end << "]"
+           << "  logical=" << formatReportMemory(interval.size)
+           << "  first=" << summarizeOperation(interval.firstTouchOp, 40)
+           << "  last=" << summarizeOperation(interval.lastTouchOp, 40) << "\n";
+      }
+    }
+  }
+
+  os << "\nTop Offenders:\n";
+  bool printedAttention = false;
+  for (const PlannedPhysicalSlot* slot : ArrayRef(singleUseSlots).take_front(kSummaryOffenderLimit)) {
+    const LocalAllocInterval& interval = intervals[slot->intervalIndices.front()];
+    printedAttention = true;
+    os << "  slot #" << slot->id << " is single-use"
+       << "  size=" << formatReportMemory(slot->size) << "  interval=#" << interval.id
+       << "  value=" << summarizeValue(interval.key.value, 56) << "\n";
+    os << "    first=" << summarizeOperation(interval.firstTouchOp, 40)
+       << "  last=" << summarizeOperation(interval.lastTouchOp, 40)
+       << "  nested=" << (interval.insideNestedRegion ? "yes" : "no")
+       << "  escapes_loop=" << (interval.escapesLoop ? "yes" : "no") << "\n";
+  }
+  size_t fallbackPrinted = 0;
+  for (const LocalAllocInterval& interval : intervals) {
+    if (!(interval.startUsedAllocFallback || interval.endUsedFallback) || fallbackPrinted >= kSummaryOffenderLimit)
+      continue;
+    printedAttention = true;
+    ++fallbackPrinted;
+    os << "  fallback interval #" << interval.id << "  size=" << formatReportMemory(interval.size)
+       << "  value=" << summarizeValue(interval.key.value, 56) << "\n";
+    os << "    reason: " << (interval.fallbackReason.empty() ? "<none>" : interval.fallbackReason) << "\n";
+  }
+  size_t nestedPrinted = 0;
+  for (const LocalAllocInterval& interval : intervals) {
+    if (nestedPrinted >= kSummaryOffenderLimit)
+      break;
+    if (!(interval.insideNestedRegion && slots[interval.slotPlanIndex].intervalIndices.size() == 1))
+      continue;
+    printedAttention = true;
+    ++nestedPrinted;
+    os << "  nested single-use interval #" << interval.id << "  slot #" << interval.physicalSlotId
+       << "  size=" << formatReportMemory(interval.size) << "  value=" << summarizeValue(interval.key.value, 56)
+       << "\n";
+    os << "    hint: move or sink this alloc inside the nested region if the IR allows it.\n";
+  }
+  if (!printedAttention)
+    os << "  no obvious blockers detected in this core\n";
+
+  if (reportLevel == PimMemoryReportFull) {
+    os << "\nSlot Reuse:\n";
+    for (const PlannedPhysicalSlot& slot : slots) {
+      uint64_t slotLogicalBytes = getSlotLogicalBytes(slot, intervals);
+      os << "  slot #" << slot.id << "  addr=" << slot.address << "  size=" << formatReportMemory(slot.size) << " ("
+         << slot.size << ")"
+         << "  intervals=" << slot.intervalIndices.size() << "  logical_sum=" << formatReportMemory(slotLogicalBytes)
+         << "\n";
+      for (size_t intervalIndex : slot.intervalIndices) {
+        const LocalAllocInterval& interval = intervals[intervalIndex];
+        mlir::Value allocValue = interval.key.value;
+        os << "    [" << interval.start << "," << interval.end << "]"
+           << "  #" << interval.id << "  logical=" << formatReportMemory(interval.size)
+           << "  nested=" << (interval.insideNestedRegion ? "yes" : "no")
+           << "  escapes_loop=" << (interval.escapesLoop ? "yes" : "no")
+           << "  first=" << summarizeOperation(interval.firstTouchOp, 48)
+           << "  last=" << summarizeOperation(interval.lastTouchOp, 48) << "\n";
+        os << "      value=" << summarizeValue(allocValue) << "\n";
+      }
+    }
+  }
+
+  if (reportLevel == PimMemoryReportFull) {
+    os << "\nInterval Details:\n";
+    for (const LocalAllocInterval& interval : intervals) {
+      const PlannedPhysicalSlot& slot = slots[interval.slotPlanIndex];
+      mlir::Value allocValue = interval.key.value;
+      Operation* definingOp = allocValue.getDefiningOp();
+      os << "  #" << interval.id << "  slot=" << slot.id << "  live=[" << interval.start << "," << interval.end << "]"
+         << "  logical=" << formatReportMemory(interval.size)
+         << "  slot_size=" << formatReportMemory(interval.physicalSlotSize) << "  addr=" << interval.assignedAddress
+         << "\n";
+      os << "    value=" << summarizeValue(allocValue, 88) << "\n";
+      os << "    type=" << allocValue.getType() << "\n";
+      os << "    loc="
+         << summarizeLocation(definingOp ? definingOp->getLoc() : UnknownLoc::get(coreLikeOp->getContext())) << "\n";
+      os << "    nested=" << (interval.insideNestedRegion ? "yes" : "no")
+         << "  escapes_loop=" << (interval.escapesLoop ? "yes" : "no")
+         << "  start_fallback=" << (interval.startUsedAllocFallback ? "yes" : "no")
+         << "  end_fallback=" << (interval.endUsedFallback ? "yes" : "no") << "\n";
+      os << "    first_use=" << summarizeOperation(interval.firstTouchOp) << " @" << interval.firstTouchPosition
+         << "\n";
+      os << "    last_use=" << summarizeOperation(interval.lastTouchOp) << " @" << interval.lastTouchPosition << "\n";
+      os << "    slot_peers=";
+      bool first = true;
+      for (size_t otherIndex : slot.intervalIndices) {
+        if (intervals[otherIndex].id == interval.id)
+          continue;
+        if (!first)
+          os << ", ";
+        os << "#" << intervals[otherIndex].id;
+        first = false;
+      }
+      if (first)
+        os << "<none>";
+      os << "\n";
+      if (!interval.fallbackReason.empty())
+        os << "    fallback_reason=" << interval.fallbackReason << "\n";
+      if (!interval.aliasesFollowed.empty()) {
+        os << "    aliases_followed=" << interval.aliasesFollowed.size() << "\n";
+        for (const std::string& alias : interval.aliasesFollowed)
+          os << "      - " << abbreviate(collapseWhitespace(alias), 108) << "\n";
+      }
+    }
+  }
+  os.flush();
+
+  return artifacts;
+}
@@ -0,0 +1,63 @@
+#pragma once
+
+#include "mlir/Dialect/MemRef/IR/MemRef.h"
+
+#include "llvm/ADT/ArrayRef.h"
+#include "llvm/ADT/SmallVector.h"
+
+#include <limits>
+#include <optional>
+#include <string>
+
+#include "src/Accelerators/PIM/Compiler/PimCodeGen.hpp"
+#include "src/Accelerators/PIM/Compiler/PimCompilerOptions.hpp"
+
+namespace onnx_mlir {
+
+struct LocalAllocInterval {
+  size_t id = 0;
+  mlir::memref::AllocOp alloc;
+  MemoryValueKey key;
+  uint64_t start = 0;
+  uint64_t end = 0;
+  size_t size = 0;
+  mlir::Operation* startOp = nullptr;
+  mlir::Operation* endOp = nullptr;
+  mlir::Operation* firstTouchOp = nullptr;
+  mlir::Operation* lastTouchOp = nullptr;
+  uint64_t firstTouchPosition = 0;
+  uint64_t lastTouchPosition = 0;
+  bool startUsedAllocFallback = false;
+  bool endUsedFallback = false;
+  bool hasRuntimeUse = false;
+  bool insideNestedRegion = false;
+  bool escapesLoop = false;
+  std::string fallbackReason;
+  llvm::SmallVector<std::string, 8> aliasesFollowed;
+  size_t slotPlanIndex = std::numeric_limits<size_t>::max();
+  size_t physicalSlotId = std::numeric_limits<size_t>::max();
+  size_t assignedAddress = 0;
+  size_t physicalSlotSize = 0;
+};
+
+struct PlannedPhysicalSlot {
+  size_t id = std::numeric_limits<size_t>::max();
+  size_t requiredSize = 0;
+  size_t size = 0;
+  size_t address = 0;
+  llvm::SmallVector<size_t, 8> intervalIndices;
+};
+
+llvm::SmallVector<LocalAllocInterval, 0> buildLocalAllocIntervals(mlir::Operation* coreLikeOp,
+                                                                  std::optional<unsigned> lane);
+
+llvm::SmallVector<PlannedPhysicalSlot, 0> planPhysicalSlots(llvm::MutableArrayRef<LocalAllocInterval> intervals);
+
+MemoryPlanArtifacts buildMemoryPlanArtifacts(mlir::Operation* coreLikeOp,
+                                             std::optional<unsigned> lane,
+                                             llvm::ArrayRef<LocalAllocInterval> intervals,
+                                             llvm::ArrayRef<PlannedPhysicalSlot> slots,
+                                             size_t addressLimit,
+                                             PimMemoryReportLevel reportLevel);
+
+} // namespace onnx_mlir
@@ -1,209 +1,93 @@
 #include "mlir/Dialect/MemRef/IR/MemRef.h"
 #include "mlir/IR/BuiltinAttributes.h"
-#include "mlir/IR/BuiltinOps.h"
 #include "mlir/IR/BuiltinTypes.h"

-#include "llvm/ADT/STLExtras.h"
 #include "llvm/Support/FileSystem.h"
 #include "llvm/Support/raw_ostream.h"

 #include <cassert>

+#include "Common/Support/CheckedArithmetic.hpp"
 #include "Conversion/ONNXToSpatial/Common/Common.hpp"
 #include "src/Accelerators/PIM/Common/IR/ShapeUtils.hpp"
-#include "src/Accelerators/PIM/Common/IR/SubviewUtils.hpp"
-#include "src/Accelerators/PIM/Common/IR/WeightUtils.hpp"
-#include "src/Accelerators/PIM/Compiler/PimBatchEmission.hpp"
-#include "src/Accelerators/PIM/Compiler/PimCodeGen.hpp"
 #include "src/Accelerators/PIM/Compiler/PimCompilerOptions.hpp"
 #include "src/Accelerators/PIM/Compiler/PimWeightEmitter.hpp"
-#include "src/Accelerators/PIM/Dialect/Pim/PimOps.hpp"

 using namespace llvm;
 using namespace mlir;

 namespace onnx_mlir {
-namespace {
+namespace {} // namespace

-struct DenseWeightView {
-  DenseElementsAttr denseAttr;
-  SmallVector<int64_t> shape;
-  SmallVector<int64_t> strides;
-  int64_t offset = 0;
-};
-
-FailureOr<DenseWeightView> resolveDenseWeightView(ModuleOp moduleOp, mlir::Value weight) {
-  SmallVector<memref::SubViewOp> subviews;
-  mlir::Value current = weight;
-  memref::GetGlobalOp getGlobalOp;
-
-  while (true) {
-    Operation* defOp = current.getDefiningOp();
-    if (!defOp)
-      return failure();
-    if ((getGlobalOp = dyn_cast<memref::GetGlobalOp>(defOp)))
-      break;
-    if (auto subview = dyn_cast<memref::SubViewOp>(defOp)) {
-      if (!hasAllStaticSubviewParts(subview))
-        return failure();
-      subviews.push_back(subview);
-      current = subview.getSource();
-      continue;
-    }
-    if (auto cast = dyn_cast<memref::CastOp>(defOp)) {
-      current = cast.getSource();
-      continue;
-    }
-    return failure();
-  }
-
-  auto globalOp = lookupGlobalForGetGlobal(moduleOp, getGlobalOp);
-  if (!globalOp || !globalOp.getInitialValue())
-    return failure();
-
-  auto denseAttr = dyn_cast<DenseElementsAttr>(*globalOp.getInitialValue());
-  if (!denseAttr)
-    return failure();
-
-  DenseWeightView view;
-  view.denseAttr = denseAttr;
-  view.shape.assign(denseAttr.getType().getShape().begin(), denseAttr.getType().getShape().end());
-  view.strides = computeRowMajorStrides(view.shape);
-
-  for (memref::SubViewOp subview : llvm::reverse(subviews)) {
-    SmallVector<int64_t> nextStrides;
-    nextStrides.reserve(subview.getStaticStrides().size());
-    for (auto [offset, stride, sourceStride] :
-         llvm::zip_equal(subview.getStaticOffsets(), subview.getStaticStrides(), view.strides)) {
-      view.offset += offset * sourceStride;
-      nextStrides.push_back(stride * sourceStride);
-    }
-    view.shape.assign(subview.getStaticSizes().begin(), subview.getStaticSizes().end());
-    view.strides = std::move(nextStrides);
-  }
-
-  return view;
-}
-
-SmallVector<unsigned, 8> getUsedWeightIndices(Block& block) {
-  SmallVector<unsigned, 8> indices;
-  auto addIndex = [&](unsigned weightIndex) {
-    if (!llvm::is_contained(indices, weightIndex))
-      indices.push_back(weightIndex);
-  };
-
-  block.walk([&](pim::PimVMMOp vmmOp) { addIndex(vmmOp.getWeightIndex()); });
-  llvm::sort(indices);
-  return indices;
-}
-
-SmallVector<unsigned, 8> getUsedWeightIndices(pim::PimCoreOp coreOp) {
-  return getUsedWeightIndices(coreOp.getBody().front());
-}
-
-SmallVector<Operation*> collectTopLevelCoreLikeOps(func::FuncOp funcOp) {
-  SmallVector<Operation*> coreLikeOps;
-  for (Operation& op : funcOp.getBody().front())
-    if (dyn_cast<pim::PimCoreOp>(&op) || dyn_cast<pim::PimCoreBatchOp>(&op))
-      coreLikeOps.push_back(&op);
-  return coreLikeOps;
-}
-
-} // namespace
-
-llvm::DenseMap<size_t, llvm::DenseMap<mlir::Value, std::string>>
-createAndPopulateWeightFolder(func::FuncOp funcOp, StringRef outputDirPath) {
-  ModuleOp moduleOp = funcOp->getParentOfType<ModuleOp>();
+WeightEmissionResult createAndPopulateWeightFolder(ArrayRef<WeightFileRequest> requests, StringRef outputDirPath) {
  auto coreWeightsDirPath = outputDirPath + "/weights";
  auto error = sys::fs::create_directory(coreWeightsDirPath);
  assert(!error && "Error creating weights directory");
  size_t indexFileName = 0;

  int64_t xbarSize = crossbarSize.getValue();
-  llvm::DenseMap<size_t, llvm::DenseMap<mlir::Value, std::string>> mapCoreWeightToFileName;
-  llvm::DenseMap<memref::GlobalOp, std::string> mapGlobalOpToFileName;
+  WeightEmissionResult result;
+  llvm::SmallVector<std::pair<ResolvedWeightView, std::string>, 16> materializedWeights;

-  SmallVector<Operation*> coreLikeOps = collectTopLevelCoreLikeOps(funcOp);
+  auto materializeWeight = [&](const ResolvedWeightView& weightView) -> std::string {
+    if (auto it = llvm::find_if(materializedWeights, [&](const auto& entry) { return entry.first == weightView; });
+        it != materializedWeights.end())
+      return it->second;

-  for (Operation* op : coreLikeOps) {
-    auto processCore = [&](pim::PimCoreOp coreOp) {
-      size_t coreId = static_cast<size_t>(coreOp.getCoreId());
-      for (unsigned index : getUsedWeightIndices(coreOp)) {
-        if (index >= coreOp.getWeights().size()) {
-          coreOp.emitWarning("Weight index " + std::to_string(index) + " is out of range");
-          assert(index < coreOp.getWeights().size() && "Weight index is out of range");
-        }
-        mlir::Value weight = coreOp.getWeights()[index];
+    auto globalOp = weightView.globalOp;
+    auto denseAttr = mlir::dyn_cast<DenseElementsAttr>(*globalOp.getInitialValue());
+    assert(denseAttr && "Weight global must have dense initial value");

-        auto weightView = resolveDenseWeightView(moduleOp, weight);
-        if (failed(weightView)) {
-          coreOp.emitWarning("Weight is not from a memref.get_global at index " + std::to_string(index));
-          assert(succeeded(weightView) && "Weight is not from a dense memref.global view");
-        }
+    ArrayRef<int64_t> shape = weightView.shape;
+    assert(isMatrixShape(shape) && "Weight matrix must be 2-dimensional");
+    int64_t numRows = shape[0];
+    int64_t numCols = shape[1];
+    assert(numRows <= xbarSize && numCols <= xbarSize && "Weight dimensions must not exceed crossbar size");

-        if (mapCoreWeightToFileName[coreId].contains(weight))
-          continue;
+    size_t elementByteWidth = getElementTypeSizeInBytes(denseAttr.getElementType());

-        auto getGlobalOp = weight.getDefiningOp<memref::GetGlobalOp>();
-        auto globalOp = getGlobalOp ? lookupGlobalForGetGlobal(moduleOp, getGlobalOp) : memref::GlobalOp {};
-        if (globalOp && mapGlobalOpToFileName.contains(globalOp)) {
-          auto& fileName = mapGlobalOpToFileName[globalOp];
-          mapCoreWeightToFileName[coreId].insert({weight, fileName});
-          continue;
-        }
-
-        DenseElementsAttr denseAttr = weightView->denseAttr;
-        ArrayRef<int64_t> shape = weightView->shape;
-        assert(isMatrixShape(shape) && "Weight matrix must be 2-dimensional");
-        int64_t numRows = shape[0];
-        int64_t numCols = shape[1];
-        assert(numRows <= xbarSize && numCols <= xbarSize && "Weight dimensions must not exceed crossbar size");
-
-        size_t elementByteWidth = denseAttr.getElementType().getIntOrFloatBitWidth() / 8;
-
-        std::string newFileName = "crossbar_" + std::to_string(indexFileName++) + ".bin";
-        auto weightFilePath = (coreWeightsDirPath + "/" + newFileName).str();
-        std::error_code errorCode;
-        raw_fd_ostream weightFileStream(weightFilePath, errorCode, sys::fs::OF_None);
-        if (errorCode) {
-          errs() << "Error while opening weight file `" << weightFilePath << "`: " << errorCode.message() << '\n';
-          assert(errorCode);
-        }
-
-        uint64_t zero = 0;
-        for (int64_t row = 0; row < xbarSize; row++) {
-          for (int64_t col = 0; col < xbarSize; col++) {
-            if (row < numRows && col < numCols) {
-              int64_t elementIndex = weightView->offset + row * weightView->strides[0] + col * weightView->strides[1];
-              APInt bits = denseAttr.getValues<APFloat>()[elementIndex].bitcastToAPInt();
-              uint64_t word = bits.getZExtValue();
-              weightFileStream.write(reinterpret_cast<const char*>(&word), elementByteWidth);
-            }
-            else {
-              weightFileStream.write(reinterpret_cast<const char*>(&zero), elementByteWidth);
-            }
-          }
-        }
-
-        weightFileStream.close();
-        if (globalOp)
-          mapGlobalOpToFileName.insert({globalOp, newFileName});
-        mapCoreWeightToFileName[coreId].insert({weight, newFileName});
-      }
-      return success();
-    };
-
-    if (auto coreOp = dyn_cast<pim::PimCoreOp>(op)) {
-      (void) processCore(coreOp);
-      continue;
+    std::string newFileName = "crossbar_" + std::to_string(indexFileName++) + ".bin";
+    auto weightFilePath = (coreWeightsDirPath + "/" + newFileName).str();
+    std::error_code errorCode;
+    raw_fd_ostream weightFileStream(weightFilePath, errorCode, sys::fs::OF_None);
+    if (errorCode) {
+      errs() << "Error while opening weight file `" << weightFilePath << "`: " << errorCode.message() << '\n';
+      assert(errorCode);
    }

-    auto coreBatchOp = cast<pim::PimCoreBatchOp>(op);
-    for (unsigned lane = 0; lane < static_cast<unsigned>(coreBatchOp.getLaneCount()); ++lane)
-      if (failed(withScalarCoreFromBatchLane(coreBatchOp, lane, processCore)))
-        return mapCoreWeightToFileName;
+    uint64_t zero = 0;
+    for (int64_t row = 0; row < xbarSize; row++) {
+      for (int64_t col = 0; col < xbarSize; col++) {
+        if (row < numRows && col < numCols) {
+          int64_t elementIndex = weightView.offset + row * weightView.strides[0] + col * weightView.strides[1];
+          APInt bits = denseAttr.getValues<APFloat>()[elementIndex].bitcastToAPInt();
+          uint64_t word = bits.getZExtValue();
+          weightFileStream.write(reinterpret_cast<const char*>(&word), elementByteWidth);
+        }
+        else {
+          weightFileStream.write(reinterpret_cast<const char*>(&zero), elementByteWidth);
+        }
+      }
+    }
+
+    weightFileStream.close();
+    materializedWeights.push_back({weightView, newFileName});
+    uint64_t weightBytes = pim::checkedMulOrCrash(
+      pim::checkedMulOrCrash(static_cast<size_t>(xbarSize), static_cast<size_t>(xbarSize), "weight element count"),
+      elementByteWidth,
+      "weight byte size");
+    result.totalWeightBytes = pim::checkedAddOrCrash(result.totalWeightBytes, weightBytes, "total weight bytes");
+    return newFileName;
+  };
+
+  for (const WeightFileRequest& request : requests) {
+    auto& coreFiles = result.mapCoreWeightToFileName[request.coreId];
+    coreFiles.reserve(request.weights.size());
+    for (const ResolvedWeightView& weight : request.weights)
+      coreFiles.push_back(materializeWeight(weight));
  }
-  return mapCoreWeightToFileName;
+
+  return result;
 }

 } // namespace onnx_mlir
@@ -1,16 +1,29 @@
 #pragma once

 #include "mlir/Dialect/Func/IR/FuncOps.h"
-#include "mlir/IR/Value.h"

 #include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/SmallVector.h"
 #include "llvm/ADT/StringRef.h"

+#include <cstdint>
 #include <string>

+#include "src/Accelerators/PIM/Common/IR/WeightUtils.hpp"
+
 namespace onnx_mlir {

-llvm::DenseMap<size_t, llvm::DenseMap<mlir::Value, std::string>>
-createAndPopulateWeightFolder(mlir::func::FuncOp funcOp, llvm::StringRef outputDirPath);
+struct WeightFileRequest {
+  size_t coreId = 0;
+  llvm::SmallVector<ResolvedWeightView, 8> weights;
+};
+
+struct WeightEmissionResult {
+  llvm::DenseMap<size_t, llvm::SmallVector<std::string, 8>> mapCoreWeightToFileName;
+  uint64_t totalWeightBytes = 0;
+};
+
+WeightEmissionResult createAndPopulateWeightFolder(llvm::ArrayRef<WeightFileRequest> requests,
+                                                   llvm::StringRef outputDirPath);

 } // namespace onnx_mlir
@@ -3,11 +3,12 @@ mlir_tablegen(ONNXToSpatial.hpp.inc -gen-rewriters "-I${ONNX_MLIR_SRC_ROOT}")
 add_public_tablegen_target(ONNXToSpatialIncGen)

 add_pim_library(OMONNXToSpatial
-  ConversionPatterns.cpp
-  HostFoldability.cpp
-  HostLegality.cpp
-  PrePatterns.cpp
-  PostPatterns.cpp
+  Patterns.cpp
+  CompileTime.cpp
+  ONNXToSpatialVerifier.cpp
+  Patterns/Pre.cpp
+  Patterns/Post.cpp
+  Patterns/GeneratedConversion.cpp
  Patterns/Math/Conv.cpp
  Patterns/Math/Elementwise.cpp
  Patterns/Math/Gemm.cpp
@@ -21,9 +22,13 @@ add_pim_library(OMONNXToSpatial
  Patterns/Tensor/Gather.cpp
  Patterns/Tensor/Resize.cpp
  Patterns/Tensor/Reshape.cpp
+  Patterns/Tensor/Slice.cpp
  Patterns/Tensor/Split.cpp
+  Patterns/Tensor/Transpose.cpp
  ONNXToSpatialPass.cpp
+  Common/AttributeUtils.cpp
  Common/ComputeRegionBuilder.cpp
+  Common/IndexingUtils.cpp
  Common/ShapeTilingUtils.cpp
  Common/WeightMaterialization.cpp

@@ -33,6 +38,7 @@ add_pim_library(OMONNXToSpatial
  ONNXToSpatialIncGen

  LINK_LIBS PUBLIC
+  MLIRLinalgDialect
  MLIRSCFDialect
  MLIRTosaDialect
  OMCompilerOptions
@@ -0,0 +1,23 @@
+#include "mlir/IR/BuiltinAttributes.h"
+
+#include "AttributeUtils.hpp"
+
+using namespace mlir;
+
+namespace onnx_mlir {
+
+int64_t getI64Attr(ArrayAttr attr, size_t index) { return cast<IntegerAttr>(attr[index]).getInt(); }
+
+int64_t getOptionalI64Attr(std::optional<ArrayAttr> attr, size_t index, int64_t defaultValue) {
+  return attr ? getI64Attr(*attr, index) : defaultValue;
+}
+
+llvm::SmallVector<int64_t> getI64ArrayAttrValues(ArrayAttr attr) {
+  llvm::SmallVector<int64_t> values;
+  values.reserve(attr.size());
+  for (Attribute value : attr)
+    values.push_back(cast<IntegerAttr>(value).getInt());
+  return values;
+}
+
+} // namespace onnx_mlir
@@ -0,0 +1,18 @@
+#pragma once
+
+#include "mlir/IR/BuiltinAttributes.h"
+
+#include "llvm/ADT/SmallVector.h"
+
+#include <cstddef>
+#include <optional>
+
+namespace onnx_mlir {
+
+int64_t getI64Attr(mlir::ArrayAttr attr, size_t index);
+
+int64_t getOptionalI64Attr(std::optional<mlir::ArrayAttr> attr, size_t index, int64_t defaultValue);
+
+llvm::SmallVector<int64_t> getI64ArrayAttrValues(mlir::ArrayAttr attr);
+
+} // namespace onnx_mlir
@@ -1,6 +1,8 @@
 #pragma once

+#include "AttributeUtils.hpp"
 #include "ComputeRegionBuilder.hpp"
+#include "IndexingUtils.hpp"
 #include "ShapeTilingUtils.hpp"
 #include "WeightMaterialization.hpp"
 #include "src/Accelerators/PIM/Common/PimCommon.hpp"
@@ -1,5 +1,6 @@
 #pragma once

+#include "mlir/Dialect/Tensor/IR/Tensor.h"
 #include "mlir/IR/Block.h"
 #include "mlir/IR/BuiltinTypes.h"
 #include "mlir/IR/ValueRange.h"
@@ -7,9 +8,12 @@

 #include <cassert>
 #include <cstddef>
+#include <limits>
 #include <type_traits>
 #include <utility>

+#include "src/Accelerators/PIM/Common/Support/CheckedArithmetic.hpp"
+#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/CompileTime.hpp"
 #include "src/Accelerators/PIM/Dialect/Spatial/SpatialOps.hpp"

 namespace onnx_mlir {
@@ -18,13 +22,17 @@ namespace detail {

 inline mlir::ValueRange getBlockArgs(mlir::Block* block) { return mlir::ValueRange(block->getArguments()); }

+inline mlir::ValueRange getInputBlockArgs(mlir::Block* block, size_t weightCount) {
+  return mlir::ValueRange(block->getArguments()).drop_front(weightCount);
+}
+
 template <typename Fn, size_t... Is>
 decltype(auto) invokeWithBlockArgs(Fn&& fn, mlir::Block* block, std::index_sequence<Is...>) {
  return std::forward<Fn>(fn)(block->getArgument(Is)...);
 }

 template <typename Fn, size_t... Is>
-decltype(auto) invokeWithValues(Fn&& fn, mlir::ArrayRef<mlir::Value> values, std::index_sequence<Is...>) {
+decltype(auto) invokeWithValues(Fn&& fn, mlir::ValueRange values, std::index_sequence<Is...>) {
  return std::forward<Fn>(fn)(values[Is]...);
 }

@@ -45,6 +53,13 @@ using InvokeWithBlockArgsResultT = typename InvokeWithBlockArgsResult<Fn, Seq>::
 template <typename Fn>
 using InvokeWithValueRangeResultT = std::invoke_result_t<Fn, mlir::ValueRange>;

+struct SpatComputeBatchBodyArgs {
+  mlir::Value lane;
+  mlir::ValueRange weights;
+  mlir::ValueRange inputs;
+  mlir::ValueRange outputs;
+};
+
 } // namespace detail

 template <typename RewriterT>
@@ -85,6 +100,8 @@ auto createSpatCompute(RewriterT& rewriter,
  auto computeOp = spatial::SpatCompute::create(rewriter, loc, resultTypes, weights, inputs);

  auto* block = new mlir::Block();
+  for (mlir::Value weight : weights)
+    block->addArgument(weight.getType(), loc);
  for (mlir::Value input : inputs)
    block->addArgument(input.getType(), loc);

@@ -93,14 +110,17 @@ auto createSpatCompute(RewriterT& rewriter,

  using BodyResult = detail::InvokeWithBlockArgsResultT<std::decay_t<BodyFn>, std::make_index_sequence<NumInputs>>;
  if constexpr (std::is_same_v<BodyResult, void>) {
-    detail::invokeWithBlockArgs(std::forward<BodyFn>(body), block, std::make_index_sequence<NumInputs> {});
+    detail::invokeWithValues(std::forward<BodyFn>(body),
+                             detail::getInputBlockArgs(block, weights.size()),
+                             std::make_index_sequence<NumInputs> {});

    rewriter.setInsertionPointAfter(computeOp);
    return computeOp;
  }
  else {
-    auto bodyResult =
-      detail::invokeWithBlockArgs(std::forward<BodyFn>(body), block, std::make_index_sequence<NumInputs> {});
+    auto bodyResult = detail::invokeWithValues(std::forward<BodyFn>(body),
+                                               detail::getInputBlockArgs(block, weights.size()),
+                                               std::make_index_sequence<NumInputs> {});
    if (mlir::failed(bodyResult)) {
      rewriter.setInsertionPointAfter(computeOp);
      rewriter.eraseOp(computeOp);
@@ -123,6 +143,8 @@ auto createSpatCompute(RewriterT& rewriter,
  auto computeOp = spatial::SpatCompute::create(rewriter, loc, resultTypes, weights, inputs);

  auto* block = new mlir::Block();
+  for (mlir::Value weight : weights)
+    block->addArgument(weight.getType(), loc);
  for (mlir::Value input : inputs)
    block->addArgument(input.getType(), loc);

@@ -131,13 +153,13 @@ auto createSpatCompute(RewriterT& rewriter,

  using BodyResult = detail::InvokeWithValueRangeResultT<std::decay_t<BodyFn>>;
  if constexpr (std::is_same_v<BodyResult, void>) {
-    std::forward<BodyFn>(body)(detail::getBlockArgs(block));
+    std::forward<BodyFn>(body)(detail::getInputBlockArgs(block, weights.size()));

    rewriter.setInsertionPointAfter(computeOp);
    return computeOp;
  }
  else {
-    auto bodyResult = std::forward<BodyFn>(body)(detail::getBlockArgs(block));
+    auto bodyResult = std::forward<BodyFn>(body)(detail::getInputBlockArgs(block, weights.size()));
    if (mlir::failed(bodyResult)) {
      rewriter.setInsertionPointAfter(computeOp);
      rewriter.eraseOp(computeOp);
@@ -148,6 +170,98 @@ auto createSpatCompute(RewriterT& rewriter,
  }
 }

+template <typename RewriterT, typename BodyFn>
+auto createSpatComputeBatch(RewriterT& rewriter,
+                            mlir::Location loc,
+                            mlir::TypeRange resultTypes,
+                            int64_t laneCount,
+                            mlir::ValueRange weights,
+                            mlir::ValueRange inputs,
+                            BodyFn&& body) {
+  if (laneCount <= 0 || laneCount > std::numeric_limits<int32_t>::max())
+    return mlir::FailureOr<spatial::SpatComputeBatch>(mlir::failure());
+
+  auto laneCountAttr = pim::getCheckedI32Attr(rewriter, loc, laneCount, "spatial compute_batch lane count");
+  if (mlir::failed(laneCountAttr))
+    return mlir::FailureOr<spatial::SpatComputeBatch>(mlir::failure());
+
+  auto batchOp = spatial::SpatComputeBatch::create(rewriter, loc, resultTypes, *laneCountAttr, weights, inputs);
+
+  mlir::SmallVector<mlir::Type> blockArgTypes {rewriter.getIndexType()};
+  mlir::SmallVector<mlir::Location> blockArgLocs {loc};
+  blockArgTypes.reserve(1 + weights.size() + inputs.size() + resultTypes.size());
+  blockArgLocs.reserve(1 + weights.size() + inputs.size() + resultTypes.size());
+  for (mlir::Value weight : weights) {
+    blockArgTypes.push_back(weight.getType());
+    blockArgLocs.push_back(weight.getLoc());
+  }
+  for (mlir::Value input : inputs) {
+    blockArgTypes.push_back(input.getType());
+    blockArgLocs.push_back(input.getLoc());
+  }
+  for (mlir::Type resultType : resultTypes) {
+    blockArgTypes.push_back(resultType);
+    blockArgLocs.push_back(loc);
+  }
+
+  auto* block =
+    rewriter.createBlock(&batchOp.getBody(), batchOp.getBody().end(), mlir::TypeRange(blockArgTypes), blockArgLocs);
+  rewriter.setInsertionPointToStart(block);
+
+  detail::SpatComputeBatchBodyArgs args {
+    block->getArgument(0),
+    mlir::ValueRange(block->getArguments()).slice(1, weights.size()),
+    mlir::ValueRange(block->getArguments()).slice(1 + weights.size(), inputs.size()),
+    mlir::ValueRange(block->getArguments()).drop_front(1 + weights.size() + inputs.size())};
+
+  using BodyResult = std::invoke_result_t<BodyFn, detail::SpatComputeBatchBodyArgs>;
+  if constexpr (std::is_same_v<BodyResult, void>) {
+    std::forward<BodyFn>(body)(args);
+    rewriter.setInsertionPointAfter(batchOp);
+    return mlir::FailureOr<spatial::SpatComputeBatch>(batchOp);
+  }
+  else {
+    auto bodyResult = std::forward<BodyFn>(body)(args);
+    if (mlir::failed(bodyResult)) {
+      rewriter.setInsertionPointAfter(batchOp);
+      rewriter.eraseOp(batchOp);
+      return mlir::FailureOr<spatial::SpatComputeBatch>(mlir::failure());
+    }
+    rewriter.setInsertionPointAfter(batchOp);
+    return mlir::FailureOr<spatial::SpatComputeBatch>(batchOp);
+  }
+}
+
+inline void createParallelInsertSliceIntoBatchOutput(mlir::PatternRewriter& rewriter,
+                                                     mlir::Location loc,
+                                                     mlir::Value source,
+                                                     mlir::Value dest,
+                                                     mlir::ArrayRef<mlir::OpFoldResult> offsets,
+                                                     mlir::ArrayRef<mlir::OpFoldResult> sizes,
+                                                     mlir::ArrayRef<mlir::OpFoldResult> strides) {
+  auto inParallelOp = spatial::SpatInParallelOp::create(rewriter, loc);
+  rewriter.setInsertionPointToStart(&inParallelOp.getRegion().front());
+  mlir::tensor::ParallelInsertSliceOp::create(rewriter, loc, source, dest, offsets, sizes, strides);
+}
+
+template <typename BodyFn>
+mlir::Value materializeOrComputeUnary(mlir::Value input,
+                                      mlir::RankedTensorType resultType,
+                                      mlir::PatternRewriter& rewriter,
+                                      mlir::Location loc,
+                                      BodyFn&& build) {
+  auto&& buildFn = build;
+  if (isCompileTimeComputable(input))
+    return buildFn(input);
+
+  auto computeOp = createSpatCompute<1>(
+    rewriter, loc, mlir::TypeRange {resultType}, {}, mlir::ValueRange {input}, [&](mlir::Value computeInput) {
+      mlir::Value result = buildFn(computeInput);
+      spatial::SpatYieldOp::create(rewriter, loc, result);
+    });
+  return computeOp.getResult(0);
+}
+
 mlir::Value sumTensors(mlir::ArrayRef<mlir::Value> tensors, mlir::ConversionPatternRewriter& rewriter);

 } // namespace onnx_mlir
@@ -0,0 +1,45 @@
+#include <algorithm>
+
+#include "IndexingUtils.hpp"
+
+using namespace mlir;
+
+namespace onnx_mlir {
+
+int64_t normalizeAxis(int64_t axis, int64_t rank) { return axis >= 0 ? axis : rank + axis; }
+
+FailureOr<int64_t> normalizeAxisChecked(int64_t axis, int64_t rank) {
+  int64_t normalizedAxis = normalizeAxis(axis, rank);
+  if (normalizedAxis < 0 || normalizedAxis >= rank)
+    return failure();
+  return normalizedAxis;
+}
+
+int64_t normalizeIndex(int64_t index, int64_t dimSize) { return index >= 0 ? index : dimSize + index; }
+
+static SmallVector<int64_t> normalizeAxesImpl(std::optional<ArrayAttr> axesAttr, int64_t rank) {
+  SmallVector<int64_t> normalizedAxes;
+  if (!axesAttr) {
+    normalizedAxes.reserve(rank);
+    for (int64_t axis = 0; axis < rank; ++axis)
+      normalizedAxes.push_back(axis);
+  }
+  else {
+    normalizedAxes.reserve(axesAttr->size());
+    for (Attribute attr : *axesAttr)
+      normalizedAxes.push_back(normalizeAxis(cast<IntegerAttr>(attr).getInt(), rank));
+    llvm::sort(normalizedAxes);
+    normalizedAxes.erase(std::unique(normalizedAxes.begin(), normalizedAxes.end()), normalizedAxes.end());
+  }
+  return normalizedAxes;
+}
+
+FailureOr<SmallVector<int64_t>> normalizeAxesChecked(std::optional<ArrayAttr> axesAttr, int64_t rank) {
+  SmallVector<int64_t> normalizedAxes = normalizeAxesImpl(axesAttr, rank);
+  for (int64_t axis : normalizedAxes)
+    if (axis < 0 || axis >= rank)
+      return failure();
+  return normalizedAxes;
+}
+
+} // namespace onnx_mlir
@@ -0,0 +1,20 @@
+#pragma once
+
+#include "mlir/IR/BuiltinAttributes.h"
+#include "mlir/Support/LogicalResult.h"
+
+#include "llvm/ADT/SmallVector.h"
+
+#include <optional>
+
+namespace onnx_mlir {
+
+int64_t normalizeAxis(int64_t axis, int64_t rank);
+
+mlir::FailureOr<int64_t> normalizeAxisChecked(int64_t axis, int64_t rank);
+
+int64_t normalizeIndex(int64_t index, int64_t dimSize);
+
+mlir::FailureOr<llvm::SmallVector<int64_t>> normalizeAxesChecked(std::optional<mlir::ArrayAttr> axesAttr, int64_t rank);
+
+} // namespace onnx_mlir
@@ -3,26 +3,93 @@

 #include "llvm/ADT/SmallVector.h"

+#include <functional>
+
+#include "IndexingUtils.hpp"
 #include "ShapeTilingUtils.hpp"
+#include "src/Accelerators/PIM/Common/IR/ConstantUtils.hpp"
 #include "src/Accelerators/PIM/Compiler/PimCompilerOptions.hpp"
 #include "src/Accelerators/PIM/Conversion/ONNXToSpatial/Common/Common.hpp"
-#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/HostFoldability.hpp"

 using namespace mlir;

 namespace onnx_mlir {

+bool hasStaticPositiveShape(ArrayRef<int64_t> shape) {
+  return llvm::all_of(shape, [](int64_t dim) { return dim > 0; });
+}
+
+bool hasStaticPositiveShape(RankedTensorType type) {
+  return type.hasStaticShape() && hasStaticPositiveShape(type.getShape());
+}
+
+int64_t getStaticShapeElementCount(ArrayRef<int64_t> shape) {
+  return std::accumulate(shape.begin(), shape.end(), int64_t {1}, std::multiplies<int64_t> {});
+}
+
+SmallVector<int64_t> permuteShape(ArrayRef<int64_t> shape, ArrayRef<int64_t> permutation) {
+  SmallVector<int64_t> permutedShape;
+  permutedShape.reserve(permutation.size());
+  for (int64_t axis : permutation)
+    permutedShape.push_back(shape[axis]);
+  return permutedShape;
+}
+
+SmallVector<int64_t> invertPermutation(ArrayRef<int64_t> permutation) {
+  SmallVector<int64_t> inversePermutation(permutation.size());
+  for (auto [newIndex, oldIndex] : llvm::enumerate(permutation))
+    inversePermutation[oldIndex] = static_cast<int64_t>(newIndex);
+  return inversePermutation;
+}
+
+FailureOr<SmallVector<int64_t>> getTransposePermutationChecked(std::optional<ArrayAttr> permAttr, int64_t rank) {
+  SmallVector<int64_t> permutation;
+  if (!permAttr) {
+    permutation.reserve(rank);
+    for (int64_t dim = rank - 1; dim >= 0; --dim)
+      permutation.push_back(dim);
+    return permutation;
+  }
+
+  if (static_cast<int64_t>(permAttr->size()) != rank)
+    return failure();
+
+  permutation.reserve(permAttr->size());
+  SmallVector<bool> seen(rank, false);
+  for (IntegerAttr attr : permAttr->getAsRange<IntegerAttr>()) {
+    int64_t axis = attr.getInt();
+    if (axis < 0 || axis >= rank || seen[axis])
+      return failure();
+    seen[axis] = true;
+    permutation.push_back(axis);
+  }
+  return permutation;
+}
+
+SmallVector<OpFoldResult> getUnitStrides(PatternRewriter& rewriter, int64_t rank) {
+  return SmallVector<OpFoldResult>(rank, rewriter.getIndexAttr(1));
+}
+
+SmallVector<OpFoldResult> getZeroOffsets(PatternRewriter& rewriter, int64_t rank) {
+  return SmallVector<OpFoldResult>(rank, rewriter.getIndexAttr(0));
+}
+
+SmallVector<OpFoldResult> getStaticSizes(PatternRewriter& rewriter, ArrayRef<int64_t> shape) {
+  SmallVector<OpFoldResult> sizes;
+  sizes.reserve(shape.size());
+  for (int64_t dim : shape)
+    sizes.push_back(rewriter.getIndexAttr(dim));
+  return sizes;
+}
+
 SmallVector<Value> sliceTensor(
  const Value& tensorToSlice, size_t axis, int64_t sliceSize, ConversionPatternRewriter& rewriter, Location loc) {
  ArrayRef<long> shape = getTensorShape(tensorToSlice);
  assert("Invalid axis" && axis < shape.size());

  SmallVector<OpFoldResult> strides(shape.size(), rewriter.getIndexAttr(1));
-  SmallVector<OpFoldResult> offsets(shape.size(), rewriter.getIndexAttr(0));
-  SmallVector<OpFoldResult> sizes;
-  sizes.reserve(shape.size());
-  for (const auto size : shape)
-    sizes.push_back(rewriter.getIndexAttr(size));
+  SmallVector<OpFoldResult> offsets = getZeroOffsets(rewriter, shape.size());
+  SmallVector<OpFoldResult> sizes = getStaticSizes(rewriter, shape);
  sizes[axis] = rewriter.getIndexAttr(sliceSize);

  long length = shape[axis];
@@ -44,7 +111,7 @@ SmallVector<Value> sliceTensor(
      RankedTensorType::get(sliceShape, cast<RankedTensorType>(tensorToSlice.getType()).getElementType());

    Value slice;
-    if (isHostFoldableValue(tensorToSlice)) {
+    if (isCompileTimeComputable(tensorToSlice)) {
      slice = tensor::ExtractSliceOp::create(rewriter, loc, tensorToSlice, offsets, sizes, strides);
    }
    else {
@@ -80,38 +147,33 @@ sliceVectorPerCrossbarPerCore(const Value& vectorToSlice, ConversionPatternRewri
  return slicesPerCore;
 }

-DenseMap<HSliceId, DenseMap<CoreId, SmallVector<Value>>> tileMatrix(
-  Value& matrixToTile, int64_t hSliceSize, int64_t vSliceSize, ConversionPatternRewriter& rewriter, Location& loc) {
-  assert("Not a matrix" && isMatrixShape(getTensorShape(matrixToTile)));
+Value extractAxisSlice(
+  PatternRewriter& rewriter, Location loc, Value source, int64_t axis, int64_t offset, int64_t size) {
+  auto sourceType = cast<RankedTensorType>(source.getType());
+  SmallVector<int64_t> resultShape(sourceType.getShape());
+  resultShape[axis] = size;
+  auto resultType = RankedTensorType::get(resultShape, sourceType.getElementType(), sourceType.getEncoding());

-  DenseMap<HSliceId, DenseMap<CoreId, SmallVector<Value>>> tiles;
-
-  SmallVector<Value> hSlices = sliceTensor(matrixToTile, 1, hSliceSize, rewriter, loc);
-  size_t numHSlices = hSlices.size();
-  for (size_t hSliceId = 0; hSliceId < numHSlices; hSliceId++) {
-    Value hSlice = hSlices[hSliceId];
-    SmallVector<Value> vSlices = sliceTensor(hSlice, 0, vSliceSize, rewriter, loc);
-    for (size_t vSliceId = 0; vSliceId < vSlices.size(); vSliceId++) {
-      size_t coreId = vSliceId / crossbarCountInCore;
-      Value vSlice = vSlices[vSliceId];
-      tiles[hSliceId][coreId].push_back(vSlice);
-    }
-  }
-  return tiles;
+  SmallVector<OpFoldResult> offsets = getZeroOffsets(rewriter, sourceType.getRank());
+  SmallVector<OpFoldResult> sizes = getStaticSizes(rewriter, sourceType.getShape());
+  offsets[axis] = rewriter.getIndexAttr(offset);
+  sizes[axis] = rewriter.getIndexAttr(size);
+  return tensor::ExtractSliceOp::create(
+           rewriter, loc, resultType, source, offsets, sizes, getUnitStrides(rewriter, sourceType.getRank()))
+    .getResult();
 }

-tensor::SplatOp
-broadcastToVector(Value scalarToBroadcast, int64_t length, ConversionPatternRewriter& rewriter, Location loc) {
-  auto oldType = cast<RankedTensorType>(scalarToBroadcast.getType());
-  Type elementType = oldType.getElementType();
-  int64_t shape[2] = {1, length};
-  Type type = oldType.cloneWith(ArrayRef(shape), elementType);
-
-  auto zero = arith::ConstantIndexOp::create(rewriter, loc, 0).getResult();
-  SmallVector<Value> index(oldType.getRank(), zero);
-  auto elementValue = tensor::ExtractOp::create(rewriter, loc, scalarToBroadcast, index).getResult();
-
-  return tensor::SplatOp::create(rewriter, loc, type, elementValue);
+Value insertStaticSlice(
+  PatternRewriter& rewriter, Location loc, Value source, Value dest, ArrayRef<OpFoldResult> offsets) {
+  auto sourceType = cast<RankedTensorType>(source.getType());
+  return tensor::InsertSliceOp::create(rewriter,
+                                       loc,
+                                       source,
+                                       dest,
+                                       offsets,
+                                       getStaticSizes(rewriter, sourceType.getShape()),
+                                       getUnitStrides(rewriter, sourceType.getRank()))
+    .getResult();
 }

 } // namespace onnx_mlir
@@ -3,6 +3,7 @@
 #include "mlir/Dialect/Tensor/IR/Tensor.h"
 #include "mlir/IR/BuiltinTypes.h"
 #include "mlir/IR/Value.h"
+#include "mlir/IR/ValueRange.h"
 #include "mlir/Transforms/DialectConversion.h"

 #include "llvm/ADT/ArrayRef.h"
@@ -11,46 +12,12 @@

 #include <cassert>
 #include <cstddef>
+#include <optional>
 #include <type_traits>
 #include <utility>

 namespace onnx_mlir {

-template <class ShapedType>
-inline auto getImageWidth(const ShapedType& shapedType) {
-  return shapedType.getDimSize(2);
-}
-
-template <class ShapedType>
-inline auto getImageHeight(const ShapedType& shapedType) {
-  return shapedType.getDimSize(3);
-}
-
-template <class ShapedType>
-inline auto getImageChannel(const ShapedType& shapedType) {
-  return shapedType.getDimSize(1);
-}
-
-template <class ShapedType>
-inline auto getImageN(const ShapedType& shapedType) {
-  return shapedType.getDimSize(0);
-}
-
-template <class ShapedType>
-inline auto getKernelWidth(const ShapedType& shapedType) {
-  return shapedType.getDimSize(2);
-}
-
-template <class ShapedType>
-inline auto getKernelHeight(const ShapedType& shapedType) {
-  return shapedType.getDimSize(3);
-}
-
-template <class ShapedType>
-inline auto getFilterCount(const ShapedType& shapedType) {
-  return shapedType.getDimSize(0);
-}
-
 using HSliceId = size_t;
 using CoreId = size_t;

@@ -87,17 +54,6 @@ bool isHVectorShape(mlir::ArrayRef<T> shape) {
  return shape.size() == 2 && shape[0] == 1;
 }

-template <class T>
-bool isVVectorShape(mlir::ArrayRef<T> shape) {
-  return shape.size() == 2 && shape[1] == 1;
-}
-
-template <class T>
-T getVectorLength(mlir::ArrayRef<T> shape) {
-  assert(isVectorShape(shape));
-  return shape[0] != 1 ? shape[0] : shape[1];
-}
-
 inline auto getTensorShape(mlir::Value tensor) {
  return mlir::cast<mlir::RankedTensorType>(tensor.getType()).getShape();
 }
@@ -109,6 +65,25 @@ inline bool haveSameStaticShape(mlir::Value lhs, mlir::Value rhs) {
      && lhsType.getShape() == rhsType.getShape();
 }

+bool hasStaticPositiveShape(mlir::ArrayRef<int64_t> shape);
+
+bool hasStaticPositiveShape(mlir::RankedTensorType type);
+
+int64_t getStaticShapeElementCount(mlir::ArrayRef<int64_t> shape);
+
+llvm::SmallVector<int64_t> permuteShape(mlir::ArrayRef<int64_t> shape, mlir::ArrayRef<int64_t> permutation);
+
+llvm::SmallVector<int64_t> invertPermutation(mlir::ArrayRef<int64_t> permutation);
+
+mlir::FailureOr<llvm::SmallVector<int64_t>> getTransposePermutationChecked(std::optional<mlir::ArrayAttr> permAttr,
+                                                                           int64_t rank);
+
+llvm::SmallVector<mlir::OpFoldResult> getUnitStrides(mlir::PatternRewriter& rewriter, int64_t rank);
+
+llvm::SmallVector<mlir::OpFoldResult> getZeroOffsets(mlir::PatternRewriter& rewriter, int64_t rank);
+
+llvm::SmallVector<mlir::OpFoldResult> getStaticSizes(mlir::PatternRewriter& rewriter, mlir::ArrayRef<int64_t> shape);
+
 /// Slices a statically shaped tensor along one axis into contiguous pieces of
 /// at most `sliceSize` elements.
 llvm::SmallVector<mlir::Value> sliceTensor(const mlir::Value& tensorToSlice,
@@ -127,18 +102,13 @@ llvm::SmallVector<mlir::Value> sliceVector(const mlir::Value& vectorToSlice,
 llvm::DenseMap<CoreId, llvm::SmallVector<mlir::Value>> sliceVectorPerCrossbarPerCore(
  const mlir::Value& vectorToSlice, mlir::ConversionPatternRewriter& rewriter, mlir::Location loc);

-/// Tiles a matrix first across output columns and then across input rows so it
-/// can be assigned to crossbars grouped by core.
-llvm::DenseMap<HSliceId, llvm::DenseMap<CoreId, llvm::SmallVector<mlir::Value>>>
-tileMatrix(mlir::Value& matrixToTile,
-           int64_t hSliceSize,
-           int64_t vSliceSize,
-           mlir::ConversionPatternRewriter& rewriter,
-           mlir::Location& loc);
+mlir::Value extractAxisSlice(
+  mlir::PatternRewriter& rewriter, mlir::Location loc, mlir::Value source, int64_t axis, int64_t offset, int64_t size);

-mlir::tensor::SplatOp broadcastToVector(mlir::Value scalarToBroadcast,
-                                        int64_t length,
-                                        mlir::ConversionPatternRewriter& rewriter,
-                                        mlir::Location loc);
+mlir::Value insertStaticSlice(mlir::PatternRewriter& rewriter,
+                              mlir::Location loc,
+                              mlir::Value source,
+                              mlir::Value dest,
+                              llvm::ArrayRef<mlir::OpFoldResult> offsets);

 } // namespace onnx_mlir
@@ -1,4 +1,5 @@
 #include "mlir/Dialect/Arith/IR/Arith.h"
+#include "mlir/Dialect/Linalg/IR/Linalg.h"
 #include "mlir/Dialect/Tensor/IR/Tensor.h"
 #include "mlir/IR/BuiltinTypes.h"
 #include "mlir/IR/IRMapping.h"
@@ -43,8 +44,8 @@ bool isWeightLikeComputeOperand(Value value) {
      value = collapseShapeOp.getSrc();
      continue;
    }
-    if (auto transposeOp = dyn_cast<ONNXTransposeOp>(definingOp)) {
-      value = transposeOp.getData();
+    if (auto transposeOp = dyn_cast<linalg::TransposeOp>(definingOp)) {
+      value = transposeOp.getInput();
      continue;
    }

@@ -80,7 +81,7 @@ FailureOr<Value> materializeWeightLikeValueInBlock(Value value, IRRewriter& rewr
    return referencedValue.getResult();
  }

-  if (!isa<tensor::ExtractSliceOp, tensor::ExpandShapeOp, tensor::CollapseShapeOp, ONNXTransposeOp>(definingOp))
+  if (!isa<tensor::ExtractSliceOp, tensor::ExpandShapeOp, tensor::CollapseShapeOp, linalg::TransposeOp>(definingOp))
    return failure();

  IRMapping localMapper;
@@ -0,0 +1,299 @@
+#include "mlir/Dialect/Arith/IR/Arith.h"
+#include "mlir/Dialect/Linalg/IR/Linalg.h"
+#include "mlir/Dialect/Tensor/IR/Tensor.h"
+#include "mlir/IR/BuiltinAttributes.h"
+#include "mlir/IR/BuiltinTypes.h"
+
+#include "llvm/ADT/STLExtras.h"
+#include "llvm/ADT/SmallBitVector.h"
+#include "llvm/ADT/SmallPtrSet.h"
+#include "llvm/ADT/SmallVector.h"
+
+#include <utility>
+
+#include "src/Accelerators/PIM/Common/IR/ConstantUtils.hpp"
+#include "src/Accelerators/PIM/Common/IR/ShapeUtils.hpp"
+#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/CompileTime.hpp"
+#include "src/Accelerators/PIM/Dialect/Spatial/SpatialOps.hpp"
+#include "src/Dialect/ONNX/ONNXOps.hpp"
+
+using namespace mlir;
+
+namespace onnx_mlir {
+
+namespace {
+
+static bool hasStaticUnitStrides(tensor::ExtractSliceOp extractSliceOp) {
+  return llvm::all_of(extractSliceOp.getStaticStrides(), [](int64_t stride) { return stride == 1; });
+}
+
+static bool hasConstantIndices(tensor::ExtractOp extractOp) {
+  return llvm::all_of(extractOp.getIndices(), [](Value index) { return matchConstantIndexValue(index).has_value(); });
+}
+
+static bool isStaticTensorResult(Operation* op) {
+  return llvm::all_of(op->getResultTypes(), [](Type type) {
+    auto shapedType = dyn_cast<ShapedType>(type);
+    return shapedType && shapedType.hasStaticShape();
+  });
+}
+
+static FailureOr<DenseElementsAttr> transposeDenseElements(DenseElementsAttr denseAttr, ArrayRef<int64_t> perms) {
+  auto tensorType = dyn_cast<RankedTensorType>(denseAttr.getType());
+  if (!tensorType)
+    return failure();
+
+  int64_t rank = tensorType.getRank();
+  if (static_cast<int64_t>(perms.size()) != rank)
+    return failure();
+
+  llvm::SmallBitVector seen(rank);
+  SmallVector<int64_t> transposedShape;
+  transposedShape.reserve(rank);
+  for (int64_t perm : perms) {
+    if (perm < 0 || perm >= rank || seen.test(perm))
+      return failure();
+    seen.set(perm);
+    transposedShape.push_back(tensorType.getShape()[perm]);
+  }
+
+  auto transposedType = RankedTensorType::get(transposedShape, tensorType.getElementType(), tensorType.getEncoding());
+  if (denseAttr.isSplat())
+    return DenseElementsAttr::get(transposedType, denseAttr.getSplatValue<Attribute>());
+
+  SmallVector<Attribute> originalValues(denseAttr.getValues<Attribute>());
+  SmallVector<Attribute> transposedValues(originalValues.size());
+  SmallVector<int64_t> originalStrides = computeRowMajorStrides(tensorType.getShape());
+  SmallVector<int64_t> transposedStrides = computeRowMajorStrides(transposedShape);
+  SmallVector<int64_t> originalIndices(rank);
+
+  for (auto [linearIndex, value] : llvm::enumerate(originalValues)) {
+    int64_t remaining = static_cast<int64_t>(linearIndex);
+    for (int64_t dim = 0; dim < rank; ++dim) {
+      originalIndices[dim] = remaining / originalStrides[dim];
+      remaining %= originalStrides[dim];
+    }
+
+    int64_t transposedLinearIndex = 0;
+    for (int64_t dim = 0; dim < rank; ++dim)
+      transposedLinearIndex += originalIndices[perms[dim]] * transposedStrides[dim];
+
+    transposedValues[transposedLinearIndex] = value;
+  }
+
+  return DenseElementsAttr::get(transposedType, transposedValues);
+}
+
+static FailureOr<DenseElementsAttr> reshapeDenseElements(DenseElementsAttr denseAttr, RankedTensorType resultType) {
+  auto sourceType = dyn_cast<RankedTensorType>(denseAttr.getType());
+  if (!sourceType || !resultType || sourceType.getNumElements() != resultType.getNumElements())
+    return failure();
+
+  if (denseAttr.isSplat())
+    return DenseElementsAttr::get(resultType, denseAttr.getSplatValue<Attribute>());
+
+  SmallVector<Attribute> values(denseAttr.getValues<Attribute>());
+  return DenseElementsAttr::get(resultType, values);
+}
+
+static FailureOr<DenseElementsAttr> extractSliceDenseElements(DenseElementsAttr denseAttr,
+                                                              tensor::ExtractSliceOp extractSliceOp) {
+  auto sourceType = dyn_cast<RankedTensorType>(denseAttr.getType());
+  auto resultType = dyn_cast<RankedTensorType>(extractSliceOp.getType());
+  if (!sourceType || !resultType || !sourceType.hasStaticShape() || !resultType.hasStaticShape())
+    return failure();
+
+  ArrayRef<int64_t> offsets = extractSliceOp.getStaticOffsets();
+  ArrayRef<int64_t> sizes = extractSliceOp.getStaticSizes();
+  ArrayRef<int64_t> strides = extractSliceOp.getStaticStrides();
+  if (llvm::any_of(offsets, [](int64_t value) { return ShapedType::isDynamic(value); })
+      || llvm::any_of(sizes, [](int64_t value) { return ShapedType::isDynamic(value); })
+      || llvm::any_of(strides, [](int64_t stride) { return ShapedType::isDynamic(stride) || stride != 1; }))
+    return failure();
+
+  if (denseAttr.isSplat())
+    return DenseElementsAttr::get(resultType, denseAttr.getSplatValue<Attribute>());
+
+  SmallVector<Attribute> sourceValues(denseAttr.getValues<Attribute>());
+  SmallVector<int64_t> sourceStrides = computeRowMajorStrides(sourceType.getShape());
+  SmallVector<int64_t> resultStrides = computeRowMajorStrides(resultType.getShape());
+  SmallVector<Attribute> resultValues;
+  resultValues.reserve(resultType.getNumElements());
+
+  for (int64_t linearIndex = 0; linearIndex < resultType.getNumElements(); ++linearIndex) {
+    int64_t remaining = linearIndex;
+    int64_t sourceLinearIndex = 0;
+    for (int64_t dim = 0; dim < resultType.getRank(); ++dim) {
+      const int64_t resultIndex = resultStrides.empty() ? 0 : remaining / resultStrides[dim];
+      remaining = resultStrides.empty() ? 0 : remaining % resultStrides[dim];
+      sourceLinearIndex += (offsets[dim] + resultIndex) * sourceStrides[dim];
+    }
+    resultValues.push_back(sourceValues[sourceLinearIndex]);
+  }
+
+  return DenseElementsAttr::get(resultType, resultValues);
+}
+
+static DenseElementsAttr getDirectDenseConstantAttr(Value value) {
+  if (auto constantOp = value.getDefiningOp<arith::ConstantOp>())
+    return dyn_cast<DenseElementsAttr>(constantOp.getValue());
+  if (auto constantOp = value.getDefiningOp<ONNXConstantOp>())
+    return dyn_cast_or_null<DenseElementsAttr>(constantOp.getValueAttr());
+  return nullptr;
+}
+
+static DenseElementsAttr getHostConstantDenseElementsAttrImpl(Value value, llvm::SmallPtrSetImpl<Operation*>& visited) {
+  auto* definingOp = value.getDefiningOp();
+  if (!definingOp || !visited.insert(definingOp).second)
+    return nullptr;
+
+  // Rebuild dense attributes through view-only host-foldable chains so later
+  // lowering stages can still recognize grouped/sliced constants.
+  if (auto denseAttr = getDirectDenseConstantAttr(value))
+    return denseAttr;
+
+  if (auto transposeOp = dyn_cast<ONNXTransposeOp>(definingOp)) {
+    auto inputAttr = getHostConstantDenseElementsAttrImpl(transposeOp.getData(), visited);
+    if (!inputAttr)
+      return nullptr;
+
+    SmallVector<int64_t> perm;
+    perm.reserve(transposeOp.getPermAttr().size());
+    for (IntegerAttr attr : transposeOp.getPermAttr().getAsRange<IntegerAttr>())
+      perm.push_back(attr.getInt());
+    auto transposedAttr = transposeDenseElements(inputAttr, perm);
+    return succeeded(transposedAttr) ? *transposedAttr : nullptr;
+  }
+
+  if (auto transposeOp = dyn_cast<linalg::TransposeOp>(definingOp)) {
+    auto inputAttr = getHostConstantDenseElementsAttrImpl(transposeOp.getInput(), visited);
+    if (!inputAttr)
+      return nullptr;
+
+    SmallVector<int64_t> perm(transposeOp.getPermutation().begin(), transposeOp.getPermutation().end());
+    auto transposedAttr = transposeDenseElements(inputAttr, perm);
+    return succeeded(transposedAttr) ? *transposedAttr : nullptr;
+  }
+
+  if (auto collapseShapeOp = dyn_cast<tensor::CollapseShapeOp>(definingOp)) {
+    auto inputAttr = getHostConstantDenseElementsAttrImpl(collapseShapeOp.getSrc(), visited);
+    if (!inputAttr)
+      return nullptr;
+    auto reshapedAttr = reshapeDenseElements(inputAttr, cast<RankedTensorType>(collapseShapeOp.getType()));
+    return succeeded(reshapedAttr) ? *reshapedAttr : nullptr;
+  }
+
+  if (auto expandShapeOp = dyn_cast<tensor::ExpandShapeOp>(definingOp)) {
+    auto inputAttr = getHostConstantDenseElementsAttrImpl(expandShapeOp.getSrc(), visited);
+    if (!inputAttr)
+      return nullptr;
+    auto reshapedAttr = reshapeDenseElements(inputAttr, cast<RankedTensorType>(expandShapeOp.getType()));
+    return succeeded(reshapedAttr) ? *reshapedAttr : nullptr;
+  }
+
+  if (auto extractSliceOp = dyn_cast<tensor::ExtractSliceOp>(definingOp)) {
+    auto inputAttr = getHostConstantDenseElementsAttrImpl(extractSliceOp.getSource(), visited);
+    if (!inputAttr)
+      return nullptr;
+    auto slicedAttr = extractSliceDenseElements(inputAttr, extractSliceOp);
+    return succeeded(slicedAttr) ? *slicedAttr : nullptr;
+  }
+
+  return nullptr;
+}
+
+static std::optional<CompileTimeSource>
+getCompileTimeSourceImpl(Operation* op, llvm::SmallPtrSetImpl<Operation*>& visited, size_t chainLength = 0) {
+  if (!op)
+    return std::nullopt;
+
+  if (!visited.insert(op).second)
+    return {
+      {op, chainLength}
+    };
+
+  if (isa<arith::ConstantOp, ONNXConstantOp, ONNXNoneOp>(op))
+    return {
+      {op, chainLength}
+    };
+
+  chainLength += 1;
+
+  if (auto extractOp = dyn_cast<tensor::ExtractOp>(op))
+    return hasConstantIndices(extractOp)
+           ? getCompileTimeSourceImpl(extractOp.getTensor().getDefiningOp(), visited, chainLength)
+           : std::nullopt;
+
+  if (!isStaticTensorResult(op))
+    return std::nullopt;
+
+  if (auto transposeOp = dyn_cast<ONNXTransposeOp>(op))
+    return getCompileTimeSourceImpl(transposeOp.getData().getDefiningOp(), visited, chainLength);
+
+  if (auto transposeOp = dyn_cast<linalg::TransposeOp>(op))
+    return getCompileTimeSourceImpl(transposeOp.getInput().getDefiningOp(), visited, chainLength);
+
+  if (auto collapseShapeOp = dyn_cast<tensor::CollapseShapeOp>(op))
+    return getCompileTimeSourceImpl(collapseShapeOp.getSrc().getDefiningOp(), visited, chainLength);
+
+  if (auto expandShapeOp = dyn_cast<tensor::ExpandShapeOp>(op))
+    return getCompileTimeSourceImpl(expandShapeOp.getSrc().getDefiningOp(), visited, chainLength);
+
+  if (auto extractSliceOp = dyn_cast<tensor::ExtractSliceOp>(op))
+    return hasStaticUnitStrides(extractSliceOp)
+           ? getCompileTimeSourceImpl(extractSliceOp.getSource().getDefiningOp(), visited, chainLength)
+           : std::nullopt;
+
+  if (auto splatOp = dyn_cast<tensor::SplatOp>(op))
+    return getCompileTimeSourceImpl(splatOp.getInput().getDefiningOp(), visited, chainLength);
+
+  if (auto extractRowsOp = dyn_cast<spatial::SpatExtractRowsOp>(op))
+    return getCompileTimeSourceImpl(extractRowsOp.getInput().getDefiningOp(), visited, chainLength);
+
+  if (auto concatOp = dyn_cast<spatial::SpatConcatOp>(op)) {
+    std::optional<CompileTimeSource> res = {};
+    for (auto operandValue : concatOp.getOperands()) {
+      auto partialRes = getCompileTimeSourceImpl(operandValue.getDefiningOp(), visited, chainLength);
+      if (!partialRes)
+        return std::nullopt;
+
+      if (!res) {
+        res = partialRes;
+        continue;
+      }
+      if (res->chainLength < partialRes->chainLength)
+        res = partialRes;
+    }
+    return res;
+  }
+
+  return std::nullopt;
+}
+
+} // namespace
+
+std::optional<CompileTimeSource> getCompileTimeSource(Operation* op) {
+  llvm::SmallPtrSet<Operation*, 8> visited;
+  return getCompileTimeSourceImpl(op, visited);
+}
+
+bool isCompileTimeComputable(Value value) {
+  auto* definingOp = value.getDefiningOp();
+  if (!definingOp)
+    return false;
+
+  llvm::SmallPtrSet<Operation*, 8> visited;
+  return getCompileTimeSourceImpl(definingOp, visited).has_value();
+}
+
+bool isCompileTimeOp(Operation* op) {
+  llvm::SmallPtrSet<Operation*, 8> visited;
+  return getCompileTimeSourceImpl(op, visited).has_value();
+}
+
+DenseElementsAttr getHostConstDenseElementsAttr(Value value) {
+  llvm::SmallPtrSet<Operation*, 8> visited;
+  return getHostConstantDenseElementsAttrImpl(value, visited);
+}
+
+} // namespace onnx_mlir
@@ -0,0 +1,22 @@
+#pragma once
+
+#include "mlir/IR/BuiltinAttributes.h"
+#include "mlir/IR/Operation.h"
+#include "mlir/IR/Value.h"
+
+namespace onnx_mlir {
+
+struct CompileTimeSource {
+  mlir::Operation* source;
+  size_t chainLength;
+};
+
+std::optional<CompileTimeSource> getCompileTimeSource(mlir::Operation* op);
+
+bool isCompileTimeComputable(mlir::Value value);
+
+bool isCompileTimeOp(mlir::Operation* op);
+
+mlir::DenseElementsAttr getHostConstDenseElementsAttr(mlir::Value value);
+
+} // namespace onnx_mlir
@@ -1,75 +0,0 @@
-#include "mlir/Dialect/Arith/IR/Arith.h"
-#include "mlir/Dialect/Tensor/IR/Tensor.h"
-#include "mlir/IR/BuiltinTypes.h"
-
-#include "llvm/ADT/SmallPtrSet.h"
-
-#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/HostFoldability.hpp"
-#include "src/Accelerators/PIM/Dialect/Spatial/SpatialOps.hpp"
-#include "src/Dialect/ONNX/ONNXOps.hpp"
-
-using namespace mlir;
-
-namespace onnx_mlir {
-
-namespace {
-
-static bool hasStaticUnitStrides(tensor::ExtractSliceOp extractSliceOp) {
-  return llvm::all_of(extractSliceOp.getStaticStrides(), [](int64_t stride) { return stride == 1; });
-}
-
-static bool isStaticTensorResult(Operation* op) {
-  return llvm::all_of(op->getResultTypes(), [](Type type) {
-    auto shapedType = dyn_cast<ShapedType>(type);
-    return shapedType && shapedType.hasStaticShape();
-  });
-}
-
-static bool isHostFoldableOpImpl(Operation* op, llvm::SmallPtrSetImpl<Operation*>& visited) {
-  if (!op || !visited.insert(op).second)
-    return false;
-
-  if (isa<arith::ConstantOp, ONNXConstantOp, ONNXNoneOp>(op))
-    return true;
-
-  if (!isStaticTensorResult(op))
-    return false;
-
-  if (auto transposeOp = dyn_cast<ONNXTransposeOp>(op))
-    return isHostFoldableValue(transposeOp.getData());
-
-  if (auto collapseShapeOp = dyn_cast<tensor::CollapseShapeOp>(op))
-    return isHostFoldableValue(collapseShapeOp.getSrc());
-
-  if (auto expandShapeOp = dyn_cast<tensor::ExpandShapeOp>(op))
-    return isHostFoldableValue(expandShapeOp.getSrc());
-
-  if (auto extractSliceOp = dyn_cast<tensor::ExtractSliceOp>(op))
-    return hasStaticUnitStrides(extractSliceOp) && isHostFoldableValue(extractSliceOp.getSource());
-
-  if (auto extractRowsOp = dyn_cast<spatial::SpatExtractRowsOp>(op))
-    return isHostFoldableValue(extractRowsOp.getInput());
-
-  if (auto concatOp = dyn_cast<spatial::SpatConcatOp>(op))
-    return llvm::all_of(concatOp.getInputs(), isHostFoldableValue);
-
-  return false;
-}
-
-} // namespace
-
-bool isHostFoldableValue(Value value) {
-  auto* definingOp = value.getDefiningOp();
-  if (!definingOp)
-    return false;
-
-  llvm::SmallPtrSet<Operation*, 8> visited;
-  return isHostFoldableOpImpl(definingOp, visited);
-}
-
-bool isHostFoldableOp(Operation* op) {
-  llvm::SmallPtrSet<Operation*, 8> visited;
-  return isHostFoldableOpImpl(op, visited);
-}
-
-} // namespace onnx_mlir
@@ -1,12 +0,0 @@
-#pragma once
-
-#include "mlir/IR/Operation.h"
-#include "mlir/IR/Value.h"
-
-namespace onnx_mlir {
-
-bool isHostFoldableValue(mlir::Value value);
-
-bool isHostFoldableOp(mlir::Operation* op);
-
-} // namespace onnx_mlir
@@ -1,29 +0,0 @@
-#include "mlir/Dialect/Arith/IR/Arith.h"
-#include "mlir/Dialect/Func/IR/FuncOps.h"
-#include "mlir/Dialect/Tensor/IR/Tensor.h"
-
-#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/HostFoldability.hpp"
-#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/HostLegality.hpp"
-#include "src/Accelerators/PIM/Dialect/Spatial/SpatialOps.hpp"
-
-using namespace mlir;
-
-namespace onnx_mlir {
-
-LogicalResult verifyONNXToSpatialHostLegality(func::FuncOp funcOp) {
-  bool hasFailure = false;
-
-  for (Operation& op : funcOp.getFunctionBody().front()) {
-    if (isa<func::ReturnOp, spatial::SpatCompute, spatial::SpatComputeBatch>(&op))
-      continue;
-    if (isHostFoldableOp(&op))
-      continue;
-
-    op.emitOpError("non-foldable top-level runtime op remains after ONNX-to-Spatial; lower it inside spat.compute");
-    hasFailure = true;
-  }
-
-  return success(!hasFailure);
-}
-
-} // namespace onnx_mlir
@@ -1,10 +0,0 @@
-#pragma once
-
-#include "mlir/Dialect/Func/IR/FuncOps.h"
-#include "mlir/Support/LogicalResult.h"
-
-namespace onnx_mlir {
-
-mlir::LogicalResult verifyONNXToSpatialHostLegality(mlir::func::FuncOp funcOp);
-
-} // namespace onnx_mlir
@@ -1,26 +1,22 @@
+#include "mlir/Dialect/Affine/IR/AffineOps.h"
 #include "mlir/Dialect/Arith/IR/Arith.h"
 #include "mlir/Dialect/Func/IR/FuncOps.h"
+#include "mlir/Dialect/Linalg/IR/Linalg.h"
 #include "mlir/Dialect/SCF/IR/SCF.h"
 #include "mlir/Dialect/Tensor/IR/Tensor.h"
 #include "mlir/IR/IRMapping.h"
 #include "mlir/Pass/Pass.h"
 #include "mlir/Pass/PassManager.h"
-#include "mlir/Transforms/GreedyPatternRewriteDriver.h"
 #include "mlir/Transforms/Passes.h"

 #include "llvm/ADT/SmallVector.h"
-#include "llvm/Support/Debug.h"

 #include "Common/Common.hpp"
 #include "Common/PimCommon.hpp"
-#include "src/Accelerators/PIM/Common/PimCommon.hpp"
-#include "src/Accelerators/PIM/Compiler/PimCompilerOptions.hpp"
-#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/ConversionPatterns.hpp"
-#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/HostLegality.hpp"
-#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/PostPatterns.hpp"
-#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/PrePatterns.hpp"
+#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/CompileTime.hpp"
+#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/ONNXToSpatialVerifier.hpp"
+#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/Patterns.hpp"
 #include "src/Accelerators/PIM/Dialect/Spatial/SpatialOps.hpp"
-#include "src/Compiler/CompilerOptions.hpp"
 #include "src/Dialect/ONNX/ONNXOps.hpp"

 using namespace mlir;
@@ -46,7 +42,8 @@ static void populateEmptyFunction(func::FuncOp funcOp) {
  IRRewriter rewriter(funcOp.getContext());
  IRMapping mapper;
  SmallVector<spatial::SpatCompute> computes(funcOp.getOps<spatial::SpatCompute>());
-  if (!computes.empty())
+  SmallVector<spatial::SpatComputeBatch> computeBatches(funcOp.getOps<spatial::SpatComputeBatch>());
+  if (!computes.empty() || !computeBatches.empty())
    return;

  auto returnOp = cast<func::ReturnOp>(funcOp.getFunctionBody().front().getTerminator());
@@ -91,13 +88,27 @@ void ONNXToSpatialPass::runOnOperation() {
  ModuleOp moduleOp = getOperation();
  MLIRContext* ctx = &getContext();

+  ConversionTarget preTarget(*ctx);
+  preTarget.addLegalDialect<spatial::SpatialDialect,
+                            ONNXDialect,
+                            linalg::LinalgDialect,
+                            tensor::TensorDialect,
+                            affine::AffineDialect,
+                            arith::ArithDialect,
+                            scf::SCFDialect>();
+  preTarget.addIllegalOp<ONNXConstantOp, ONNXFlattenOp>();
+
  RewritePatternSet prePatterns(ctx);
  populatePrePatterns(prePatterns, ctx);
-  if (failed(applyPatternsGreedily(moduleOp, std::move(prePatterns))))
-    moduleOp.emitWarning("failed to apply ONNX-to-Spatial pre-patterns; continuing");
+  if (failed(applyPartialConversion(moduleOp, preTarget, std::move(prePatterns)))) {
+    moduleOp.emitError("failed to apply ONNX-to-Spatial pre-rewrites");
+    signalPassFailure();
+    return;
+  }

  auto entryFunc = getPimEntryFunc(moduleOp);
  if (failed(entryFunc)) {
+    moduleOp.emitError("failed to locate the PIM entry function during ONNX-to-Spatial lowering");
    signalPassFailure();
    return;
  }
@@ -105,11 +116,15 @@ void ONNXToSpatialPass::runOnOperation() {
  ConversionTarget target(*ctx);
  target.addLegalDialect<spatial::SpatialDialect,
                         ONNXDialect,
+                         linalg::LinalgDialect,
                         tensor::TensorDialect,
+                         affine::AffineDialect,
                         arith::ArithDialect,
                         scf::SCFDialect>();
  target.addIllegalOp<ONNXMatMulOp>();
+  target.addIllegalOp<ONNXTransposeOp>();
  target.addIllegalOp<ONNXAddOp>();
+  target.addIllegalOp<ONNXSubOp>();
  target.addIllegalOp<ONNXDivOp>();
  target.addIllegalOp<ONNXMulOp>();
  target.addIllegalOp<ONNXGemmOp>();
@@ -123,37 +138,28 @@ void ONNXToSpatialPass::runOnOperation() {
  target.addIllegalOp<ONNXGatherOp>();
  target.addIllegalOp<ONNXReshapeOp>();
  target.addIllegalOp<ONNXResizeOp>();
+  target.addIllegalOp<ONNXSliceOp>();
  target.addIllegalOp<ONNXLRNOp>();
+  target.addIllegalOp<ONNXReduceMeanOp>();
  target.addIllegalOp<ONNXReduceMeanV13Op>();
  target.addIllegalOp<ONNXSplitOp>();

  RewritePatternSet conversionPatterns(ctx);
  populateConversionPatterns(conversionPatterns, ctx);
  if (failed(applyPartialConversion(moduleOp, target, std::move(conversionPatterns)))) {
+    moduleOp.emitError("failed to convert required ONNX ops to Spatial ops");
    signalPassFailure();
    return;
  }

-  RewritePatternSet earlyPostPatterns(ctx);
-  populateEarlyPostPatterns(earlyPostPatterns, ctx);
-  if (failed(applyPatternsGreedily(*entryFunc, std::move(earlyPostPatterns)))) {
-    signalPassFailure();
-    return;
-  }
-
-  if (coresCount != -1) {
-    int computeOpsCount = 0;
-    for (Operation& op : entryFunc->getFunctionBody().front().getOperations())
-      if (isa<spatial::SpatCompute>(op))
-        computeOpsCount++;
-
-    if (computeOpsCount > coresCount) {
-      entryFunc->emitError() << "number of compute ops (" << computeOpsCount << ") exceeds the core count ("
-                             << coresCount << ")";
-      signalPassFailure();
-      return;
-    }
-  }
+  ConversionTarget earlyPostTarget(*ctx);
+  earlyPostTarget.addLegalDialect<spatial::SpatialDialect,
+                                  ONNXDialect,
+                                  linalg::LinalgDialect,
+                                  tensor::TensorDialect,
+                                  affine::AffineDialect,
+                                  arith::ArithDialect,
+                                  scf::SCFDialect>();

  PassManager cleanupPM(ctx);
  cleanupPM.addPass(createCanonicalizerPass());
@@ -162,14 +168,23 @@ void ONNXToSpatialPass::runOnOperation() {

  annotateWeightsConstants(*entryFunc);

+  ConversionTarget postTarget(*ctx);
+  postTarget.addLegalDialect<spatial::SpatialDialect,
+                             ONNXDialect,
+                             linalg::LinalgDialect,
+                             tensor::TensorDialect,
+                             affine::AffineDialect,
+                             arith::ArithDialect,
+                             scf::SCFDialect>();
+  postTarget.addDynamicallyLegalOp<spatial::SpatCompute>(
+    [](spatial::SpatCompute computeOp) { return !requiresPostRewrite(computeOp); });
+  postTarget.addDynamicallyLegalOp<spatial::SpatComputeBatch>(
+    [](spatial::SpatComputeBatch computeOp) { return !requiresPostRewrite(computeOp); });
+
  RewritePatternSet postPatterns(ctx);
  populatePostPatterns(postPatterns, ctx);
-  if (failed(applyPatternsGreedily(*entryFunc, std::move(postPatterns)))) {
-    signalPassFailure();
-    return;
-  }
-
-  if (failed(verifyONNXToSpatialHostLegality(*entryFunc))) {
+  if (failed(applyPartialConversion(*entryFunc, postTarget, std::move(postPatterns)))) {
+    moduleOp.emitError("failed to normalize weight-like Spatial compute operands before Spatial-to-PIM lowering");
    signalPassFailure();
    return;
  }
@@ -177,6 +192,11 @@ void ONNXToSpatialPass::runOnOperation() {
  populateEmptyFunction(*entryFunc);

  dumpModule(moduleOp, "spatial0");
+
+  if (failed(verifyONNXToSpatial(*entryFunc))) {
+    moduleOp.emitError("ONNX-to-Spatial host legality verification failed");
+    signalPassFailure();
+  }
 }

 std::unique_ptr<Pass> createONNXToSpatialPass() { return std::make_unique<ONNXToSpatialPass>(); }
@@ -0,0 +1,157 @@
+#include "mlir/Dialect/Func/IR/FuncOps.h"
+#include "mlir/Support/LLVM.h"
+
+#include "Common/IR/WeightUtils.hpp"
+#include "src/Accelerators/PIM/Common/Support/Diagnostics.hpp"
+#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/CompileTime.hpp"
+#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/ONNXToSpatialVerifier.hpp"
+#include "src/Accelerators/PIM/Dialect/Spatial/SpatialOps.hpp"
+
+using namespace mlir;
+
+namespace onnx_mlir {
+
+namespace {
+
+void checkWeightUseChains(func::FuncOp func, pim::CappedDiagnosticReporter& diagnostics) {
+  func.walk([&](Operation* op) {
+    if (!hasWeightAlways(op))
+      return;
+
+    for (Value result : op->getResults()) {
+      if (hasOnlySpatialMvmVmmWeightUses(result))
+        continue;
+
+      diagnostics.report(op, [&](Operation* illegalOp) {
+        illegalOp->emitOpError(
+          "weight-marked values may only flow through static view/slice helper chains into Spatial VMM weights");
+      });
+      return;
+    }
+  });
+}
+
+Region* getParentRegion(Value value) {
+  if (auto blockArg = dyn_cast<BlockArgument>(value))
+    return blockArg.getOwner()->getParent();
+  if (Operation* definingOp = value.getDefiningOp())
+    return definingOp->getParentRegion();
+  return nullptr;
+}
+
+bool isDefinedInsideRegion(Value value, Region& region) {
+  Region* parentRegion = getParentRegion(value);
+  return parentRegion && (&region == parentRegion || region.isAncestor(parentRegion));
+}
+
+bool isLegalHostBackedValue(Value value) {
+  Operation* definingOp = value.getDefiningOp();
+  if (!definingOp)
+    return isa<BlockArgument>(value);
+
+  if (isa<spatial::SpatChannelReceiveOp>(definingOp))
+    return false;
+
+  return definingOp->getDialect()->getNamespace() != "spat";
+}
+
+LogicalResult verifyComputeLikeInputs(Operation* computeLikeOp,
+                                      ValueRange inputs,
+                                      bool allowChannelReceiveInputs,
+                                      StringRef kind,
+                                      pim::CappedDiagnosticReporter& diagnostics) {
+  for (auto [inputIndex, input] : llvm::enumerate(inputs)) {
+    unsigned currentInputIndex = inputIndex;
+    Operation* definingOp = input.getDefiningOp();
+    if (allowChannelReceiveInputs && isa_and_nonnull<spatial::SpatChannelReceiveOp>(definingOp))
+      continue;
+    if (isLegalHostBackedValue(input))
+      continue;
+
+    diagnostics.report(computeLikeOp, [&](Operation* illegalOp) {
+      InFlightDiagnostic diagnostic = illegalOp->emitOpError()
+                                   << kind << " input #" << currentInputIndex
+                                   << (allowChannelReceiveInputs ? " must come from the host or an explicit "
+                                                                   "spat.channel_receive"
+                                                                 : " must come from the host");
+      if (definingOp)
+        diagnostic.attachNote(definingOp->getLoc()) << "illegal Spatial producer is " << definingOp->getName();
+    });
+    return failure();
+  }
+  return success();
+}
+
+void verifyNoExternalTensorCaptures(Operation* ownerOp,
+                                    Region& region,
+                                    StringRef kind,
+                                    pim::CappedDiagnosticReporter& diagnostics) {
+  region.walk([&](Operation* op) {
+    for (OpOperand& operand : op->getOpOperands()) {
+      Value value = operand.get();
+      if (!isa<TensorType>(value.getType()))
+        continue;
+      if (isDefinedInsideRegion(value, region) || isa<BlockArgument>(value))
+        continue;
+
+      Operation* definingOp = value.getDefiningOp();
+      if (definingOp && definingOp->hasTrait<OpTrait::ConstantLike>())
+        continue;
+
+      diagnostics.report(ownerOp, [&](Operation* illegalOp) {
+        InFlightDiagnostic diagnostic = illegalOp->emitOpError() << kind << " body may not capture external tensor "
+                                                                 << "values";
+        diagnostic.attachNote(op->getLoc())
+          << "tensor operand #" << operand.getOperandNumber() << " is defined outside the compute body by "
+          << (definingOp ? definingOp->getName().getStringRef() : StringRef("<block argument>"));
+      });
+    }
+  });
+}
+
+} // namespace
+
+LogicalResult verifyONNXToSpatial(func::FuncOp funcOp) {
+  pim::CappedDiagnosticReporter diagnostics;
+
+  for (Operation& op : funcOp.getOps()) {
+    if (isa<func::ReturnOp, spatial::SpatCompute, spatial::SpatComputeBatch>(&op))
+      continue;
+    if (isCompileTimeOp(&op))
+      continue;
+
+    diagnostics.report(&op, [](Operation* illegalOp) {
+      illegalOp->emitOpError(
+        "non-foldable top-level runtime op remains after ONNX-to-Spatial; lower it inside spat.compute");
+    });
+  }
+  checkWeightUseChains(funcOp, diagnostics);
+  diagnostics.emitSuppressedSummary(funcOp, "ONNX-to-Spatial verification failed");
+
+  return success(!diagnostics.hasFailure());
+}
+
+LogicalResult verifySpatialCommunicationInvariants(func::FuncOp funcOp) {
+  pim::CappedDiagnosticReporter diagnostics;
+
+  for (auto computeOp : funcOp.getOps<spatial::SpatCompute>()) {
+    (void) verifyComputeLikeInputs(
+      computeOp.getOperation(), computeOp.getInputs(), /*allowChannelReceiveInputs=*/true, "spat.compute", diagnostics);
+    verifyNoExternalTensorCaptures(computeOp.getOperation(), computeOp.getBody(), "spat.compute", diagnostics);
+  }
+
+  for (auto computeBatchOp : funcOp.getOps<spatial::SpatComputeBatch>()) {
+    (void) verifyComputeLikeInputs(computeBatchOp.getOperation(),
+                                   computeBatchOp.getInputs(),
+                                   /*allowChannelReceiveInputs=*/false,
+                                   "spat.compute_batch",
+                                   diagnostics);
+    verifyNoExternalTensorCaptures(
+      computeBatchOp.getOperation(), computeBatchOp.getBody(), "spat.compute_batch", diagnostics);
+  }
+
+  diagnostics.emitSuppressedSummary(funcOp, "Spatial communication invariant verification failed");
+  return success(!diagnostics.hasFailure());
+}
+
+} // namespace onnx_mlir
@@ -0,0 +1,11 @@
+#pragma once
+
+#include "mlir/Dialect/Func/IR/FuncOps.h"
+#include "mlir/Support/LogicalResult.h"
+
+namespace onnx_mlir {
+
+mlir::LogicalResult verifyONNXToSpatial(mlir::func::FuncOp funcOp);
+mlir::LogicalResult verifySpatialCommunicationInvariants(mlir::func::FuncOp funcOp);
+
+} // namespace onnx_mlir
@@ -1,20 +1,16 @@
 #include "src/Accelerators/PIM/Conversion/ONNXToSpatial/Common/Common.hpp"
-#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/ConversionPatterns.hpp"
+#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/Patterns.hpp"

 using namespace mlir;

 namespace onnx_mlir {

-namespace {
-
-#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/ONNXToSpatial.hpp.inc"
-
-} // namespace
-
-void populateConversionPatterns(mlir::RewritePatternSet& patterns, mlir::MLIRContext* ctx) {
-  patterns.add<removeLRN>(ctx);
+void populatePrePatterns(RewritePatternSet& patterns, MLIRContext* ctx) { populateGeneratedPrePatterns(patterns, ctx); }

+void populateConversionPatterns(RewritePatternSet& patterns, MLIRContext* ctx) {
+  populateGeneratedConversionPatterns(patterns, ctx);
  populateElementwisePatterns(patterns, ctx);
+  populateMatMulRewritePatterns(patterns, ctx);
  populateGemmPatterns(patterns, ctx);
  populateConvPatterns(patterns, ctx);
  populatePoolPatterns(patterns, ctx);
@@ -26,7 +22,13 @@ void populateConversionPatterns(mlir::RewritePatternSet& patterns, mlir::MLIRCon
  populateGatherPatterns(patterns, ctx);
  populateResizePatterns(patterns, ctx);
  populateReshapePatterns(patterns, ctx);
+  populateSlicePatterns(patterns, ctx);
  populateSplitPatterns(patterns, ctx);
+  populateTransposePatterns(patterns, ctx);
+}
+
+void populatePostPatterns(RewritePatternSet& patterns, MLIRContext* ctx) {
+  populateWeightPromotionPatterns(patterns, ctx);
 }

 } // namespace onnx_mlir
@@ -1,38 +1,40 @@
 #pragma once

+#include "mlir/Dialect/Func/IR/FuncOps.h"
 #include "mlir/IR/MLIRContext.h"
 #include "mlir/Transforms/DialectConversion.h"

+#include "src/Accelerators/PIM/Dialect/Spatial/SpatialOps.hpp"
+
 namespace onnx_mlir {

+void populatePrePatterns(mlir::RewritePatternSet& patterns, mlir::MLIRContext* ctx);
 void populateConversionPatterns(mlir::RewritePatternSet& patterns, mlir::MLIRContext* ctx);
+void populatePostPatterns(mlir::RewritePatternSet& patterns, mlir::MLIRContext* ctx);
+
+void populateGeneratedPrePatterns(mlir::RewritePatternSet& patterns, mlir::MLIRContext* ctx);
+void populateGeneratedConversionPatterns(mlir::RewritePatternSet& patterns, mlir::MLIRContext* ctx);
+void populateWeightPromotionPatterns(mlir::RewritePatternSet& patterns, mlir::MLIRContext* ctx);

 void populateConvPatterns(mlir::RewritePatternSet& patterns, mlir::MLIRContext* ctx);
-
 void populateElementwisePatterns(mlir::RewritePatternSet& patterns, mlir::MLIRContext* ctx);
-
 void populateGemmPatterns(mlir::RewritePatternSet& patterns, mlir::MLIRContext* ctx);
-
 void populateMatMulRewritePatterns(mlir::RewritePatternSet& patterns, mlir::MLIRContext* ctx);
-
 void populatePoolPatterns(mlir::RewritePatternSet& patterns, mlir::MLIRContext* ctx);
-
 void populateReduceMeanPatterns(mlir::RewritePatternSet& patterns, mlir::MLIRContext* ctx);
-
 void populateReluPatterns(mlir::RewritePatternSet& patterns, mlir::MLIRContext* ctx);
-
 void populateSigmoidPatterns(mlir::RewritePatternSet& patterns, mlir::MLIRContext* ctx);
-
 void populateSoftmaxPatterns(mlir::RewritePatternSet& patterns, mlir::MLIRContext* ctx);
-
 void populateConcatPatterns(mlir::RewritePatternSet& patterns, mlir::MLIRContext* ctx);
-
 void populateGatherPatterns(mlir::RewritePatternSet& patterns, mlir::MLIRContext* ctx);
-
 void populateResizePatterns(mlir::RewritePatternSet& patterns, mlir::MLIRContext* ctx);
-
 void populateReshapePatterns(mlir::RewritePatternSet& patterns, mlir::MLIRContext* ctx);
-
+void populateSlicePatterns(mlir::RewritePatternSet& patterns, mlir::MLIRContext* ctx);
 void populateSplitPatterns(mlir::RewritePatternSet& patterns, mlir::MLIRContext* ctx);
+void populateTransposePatterns(mlir::RewritePatternSet& patterns, mlir::MLIRContext* ctx);
+
+bool requiresPostRewrite(spatial::SpatCompute computeOp);
+bool requiresPostRewrite(spatial::SpatComputeBatch computeOp);
+void annotateWeightsConstants(mlir::func::FuncOp funcOp);

 } // namespace onnx_mlir
@@ -0,0 +1,18 @@
+#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/Common/Common.hpp"
+#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/Patterns.hpp"
+
+using namespace mlir;
+
+namespace onnx_mlir {
+
+namespace {
+
+#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/ONNXToSpatial.hpp.inc"
+
+} // namespace
+
+void populateGeneratedConversionPatterns(RewritePatternSet& patterns, MLIRContext* ctx) {
+  patterns.add<removeLRN>(ctx);
+}
+
+} // namespace onnx_mlir
@@ -7,7 +7,7 @@

 #include "src/Accelerators/PIM/Common/IR/ShapeUtils.hpp"
 #include "src/Accelerators/PIM/Conversion/ONNXToSpatial/Common/Common.hpp"
-#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/ConversionPatterns.hpp"
+#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/Patterns.hpp"
 #include "src/Accelerators/PIM/Dialect/Spatial/SpatialOps.hpp"
 #include "src/Dialect/ONNX/ONNXOps.hpp"

@@ -83,7 +83,7 @@ static FailureOr<Value> materializeBroadcastedConstantTensor(Value value,
  }

  auto broadcastedAttr = DenseElementsAttr::get(resultType, resultValues);
-  return arith::ConstantOp::create(rewriter, loc, resultType, broadcastedAttr).getResult();
+  return getOrCreateConstant(rewriter, rewriter.getInsertionBlock()->getParentOp(), broadcastedAttr, resultType);
 }

 static FailureOr<Value>
@@ -121,7 +121,7 @@ static FailureOr<Value> materializeReciprocalTensor(Value value,
  }

  auto reciprocalAttr = DenseFPElementsAttr::get(resultType, reciprocalValues);
-  return arith::ConstantOp::create(rewriter, loc, resultType, reciprocalAttr).getResult();
+  return getOrCreateConstant(rewriter, rewriter.getInsertionBlock()->getParentOp(), reciprocalAttr, resultType);
 }

 template <typename OnnxOp, typename SpatialOp>
@@ -189,6 +189,7 @@ struct DivToSpatialCompute : OpConversionPattern<ONNXDivOp> {

 void populateElementwisePatterns(RewritePatternSet& patterns, MLIRContext* ctx) {
  patterns.add<BinaryElementwiseToSpatialCompute<ONNXAddOp, spatial::SpatVAddOp>>(ctx);
+  patterns.add<BinaryElementwiseToSpatialCompute<ONNXSubOp, spatial::SpatVSubOp>>(ctx);
  patterns.add<BinaryElementwiseToSpatialCompute<ONNXMulOp, spatial::SpatVMulOp>>(ctx);
  patterns.add<DivToSpatialCompute>(ctx);
 }
@@ -1,13 +1,18 @@
+#include "mlir/Dialect/Affine/IR/AffineOps.h"
 #include "mlir/Dialect/Tensor/IR/Tensor.h"
 #include "mlir/Transforms/DialectConversion.h"

 #include "llvm/ADT/SmallVector.h"

 #include <algorithm>
+#include <numeric>
+#include <optional>
+#include <type_traits>

+#include "src/Accelerators/PIM/Common/IR/AffineUtils.hpp"
 #include "src/Accelerators/PIM/Conversion/ONNXToSpatial/Common/Common.hpp"
-#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/ConversionPatterns.hpp"
-#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/HostFoldability.hpp"
+#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/CompileTime.hpp"
+#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/Patterns.hpp"
 #include "src/Accelerators/PIM/Dialect/Spatial/SpatialOps.hpp"
 #include "src/Dialect/ONNX/ONNXOps.hpp"

@@ -16,26 +21,85 @@ using namespace mlir;
 namespace onnx_mlir {
 namespace {

-static SmallVector<int64_t> normalizeAxes(ArrayAttr axesAttr, int64_t rank) {
+struct ReduceMeanSemantics {
+  SmallVector<int64_t> axes;
+  int64_t keepdims = 1;
+  bool isIdentity = false;
+};
+
+static bool isNoneValueLike(Value value) { return isa_and_nonnull<ONNXNoneOp>(value.getDefiningOp()); }
+
+static FailureOr<SmallVector<int64_t>> getConstantIntValues(Value value) {
+  auto denseAttr = dyn_cast_or_null<DenseIntElementsAttr>(getHostConstDenseElementsAttr(value));
+  if (!denseAttr)
+    return failure();
+  return SmallVector<int64_t>(denseAttr.getValues<int64_t>().begin(), denseAttr.getValues<int64_t>().end());
+}
+
+static FailureOr<SmallVector<int64_t>> normalizeAxesChecked(ArrayRef<int64_t> axes, int64_t rank) {
  SmallVector<int64_t> normalizedAxes;
-  if (!axesAttr) {
-    normalizedAxes.reserve(rank);
-    for (int64_t axis = 0; axis < rank; axis++)
-      normalizedAxes.push_back(axis);
-    return normalizedAxes;
+  normalizedAxes.reserve(axes.size());
+  for (int64_t axis : axes) {
+    auto normalizedAxis = normalizeAxisChecked(axis, rank);
+    if (failed(normalizedAxis))
+      return failure();
+    normalizedAxes.push_back(*normalizedAxis);
  }
-
-  normalizedAxes.reserve(axesAttr.size());
-  for (Attribute attr : axesAttr) {
-    int64_t axis = cast<IntegerAttr>(attr).getInt();
-    normalizedAxes.push_back(axis >= 0 ? axis : rank + axis);
-  }
-
  llvm::sort(normalizedAxes);
  normalizedAxes.erase(std::unique(normalizedAxes.begin(), normalizedAxes.end()), normalizedAxes.end());
  return normalizedAxes;
 }

+template <typename ReduceMeanOp, typename ReduceMeanOpAdaptor>
+static FailureOr<ReduceMeanSemantics>
+getReduceMeanSemantics(ReduceMeanOp reduceMeanOp, ReduceMeanOpAdaptor adaptor, int64_t inputRank) {
+  ReduceMeanSemantics semantics;
+  semantics.keepdims = reduceMeanOp.getKeepdims();
+
+  if constexpr (std::is_same_v<ReduceMeanOp, ONNXReduceMeanV13Op>) {
+    auto axes = onnx_mlir::normalizeAxesChecked(std::optional<ArrayAttr>(reduceMeanOp.getAxesAttr()), inputRank);
+    if (failed(axes))
+      return failure();
+    semantics.axes = std::move(*axes);
+    return semantics;
+  }
+  else {
+    if (isNoneValueLike(adaptor.getAxes())) {
+      if (reduceMeanOp.getNoopWithEmptyAxes() != 0) {
+        semantics.isIdentity = true;
+        return semantics;
+      }
+
+      semantics.axes.reserve(inputRank);
+      for (int64_t axis = 0; axis < inputRank; ++axis)
+        semantics.axes.push_back(axis);
+      return semantics;
+    }
+
+    auto axes = getConstantIntValues(adaptor.getAxes());
+    if (failed(axes))
+      return failure();
+
+    if (axes->empty()) {
+      if (reduceMeanOp.getNoopWithEmptyAxes() != 0) {
+        semantics.isIdentity = true;
+        return semantics;
+      }
+
+      semantics.axes.reserve(inputRank);
+      for (int64_t axis = 0; axis < inputRank; ++axis)
+        semantics.axes.push_back(axis);
+      return semantics;
+    }
+
+    auto normalizedAxes = normalizeAxesChecked(*axes, inputRank);
+    if (failed(normalizedAxes))
+      return failure();
+    semantics.axes = std::move(*normalizedAxes);
+    return semantics;
+  }
+}
+
 static SmallVector<bool> buildReducedAxesMask(ArrayRef<int64_t> axes, int64_t rank) {
  SmallVector<bool> reducedAxes(rank, false);
  for (int64_t axis : axes) {
@@ -50,6 +114,184 @@ static RankedTensorType getAllOnesType(RankedTensorType inputType, Type elementT
  return RankedTensorType::get(SmallVector<int64_t>(inputType.getRank(), 1), elementType);
 }

+static RankedTensorType getKeepdimsType(RankedTensorType inputType, Type elementType, ArrayRef<bool> reducedAxes) {
+  SmallVector<int64_t> shape;
+  shape.reserve(inputType.getRank());
+  for (auto [dim, isReduced] : llvm::zip_equal(inputType.getShape(), reducedAxes))
+    shape.push_back(isReduced ? 1 : dim);
+  return RankedTensorType::get(shape, elementType, inputType.getEncoding());
+}
+
+static RankedTensorType getCompactKeptType(RankedTensorType inputType, Type elementType, ArrayRef<bool> reducedAxes) {
+  SmallVector<int64_t> shape;
+  for (auto [dim, isReduced] : llvm::zip_equal(inputType.getShape(), reducedAxes))
+    if (!isReduced)
+      shape.push_back(dim);
+  return RankedTensorType::get(shape, elementType, inputType.getEncoding());
+}
+
+static RankedTensorType getReducedSliceType(RankedTensorType inputType, ArrayRef<bool> reducedAxes) {
+  SmallVector<int64_t> shape;
+  shape.reserve(inputType.getRank());
+  for (auto [dim, isReduced] : llvm::zip_equal(inputType.getShape(), reducedAxes))
+    shape.push_back(isReduced ? dim : 1);
+  return RankedTensorType::get(shape, inputType.getElementType(), inputType.getEncoding());
+}
+
+static RankedTensorType getLanePackedKeepdimsType(int64_t laneCount, RankedTensorType leafType) {
+  SmallVector<int64_t> shape(leafType.getShape().begin(), leafType.getShape().end());
+  shape.front() = laneCount;
+  return RankedTensorType::get(shape, leafType.getElementType(), leafType.getEncoding());
+}
+
+static SmallVector<int64_t> getKeptAxes(ArrayRef<bool> reducedAxes) {
+  SmallVector<int64_t> keptAxes;
+  for (auto [axis, isReduced] : llvm::enumerate(reducedAxes))
+    if (!isReduced)
+      keptAxes.push_back(static_cast<int64_t>(axis));
+  return keptAxes;
+}
+
+static Value
+computeLaneIndex(Value lane, int64_t stride, int64_t dimSize, ConversionPatternRewriter& rewriter, Location loc) {
+  if (dimSize == 1)
+    return getOrCreateIndexConstant(rewriter, rewriter.getInsertionBlock()->getParentOp(), 0);
+
+  MLIRContext* context = rewriter.getContext();
+  AffineExpr d0 = getAffineDimExpr(0, context);
+  AffineExpr expr = d0;
+  if (stride != 1)
+    expr = expr.floorDiv(stride);
+  if (dimSize != 1)
+    expr = expr % dimSize;
+  return createOrFoldAffineApply(rewriter, loc, expr, ValueRange {lane}, rewriter.getInsertionBlock()->getParentOp());
+}
+
+static FailureOr<Value> buildReduceMeanKeepdimsBatch(Value input,
+                                                     ArrayRef<bool> reducedAxes,
+                                                     RankedTensorType batchType,
+                                                     RankedTensorType leafType,
+                                                     ConversionPatternRewriter& rewriter,
+                                                     Location loc) {
+  auto inputType = cast<RankedTensorType>(input.getType());
+  auto sliceType = getReducedSliceType(inputType, reducedAxes);
+  SmallVector<int64_t> keptAxes = getKeptAxes(reducedAxes);
+
+  int64_t laneCount = 1;
+  SmallVector<int64_t> keptAxisStrides(keptAxes.size(), 1);
+  for (int64_t index = static_cast<int64_t>(keptAxes.size()) - 1; index >= 0; --index) {
+    keptAxisStrides[index] = laneCount;
+    int64_t dimSize = inputType.getDimSize(keptAxes[index]);
+    if (dimSize <= 0)
+      return failure();
+    if (laneCount > std::numeric_limits<int32_t>::max() / dimSize)
+      return failure();
+    laneCount *= dimSize;
+  }
+
+  SmallVector<OpFoldResult> sliceOffsets;
+  SmallVector<OpFoldResult> sliceSizes;
+  SmallVector<OpFoldResult> insertOffsets;
+  SmallVector<OpFoldResult> insertSizes(inputType.getRank(), rewriter.getIndexAttr(1));
+  SmallVector<OpFoldResult> unitStrides = getUnitStrides(rewriter, inputType.getRank());
+  sliceOffsets.reserve(inputType.getRank());
+  sliceSizes.reserve(inputType.getRank());
+  insertOffsets.reserve(inputType.getRank());
+
+  auto batchOp =
+    createSpatComputeBatch(rewriter,
+                           loc,
+                           TypeRange {batchType},
+                           laneCount,
+                           {},
+                           ValueRange {input},
+                           [&](detail::SpatComputeBatchBodyArgs args) {
+                             size_t keptAxisIndex = 0;
+                             sliceOffsets.clear();
+                             sliceSizes.clear();
+                             insertOffsets.clear();
+                             for (auto [axis, isReduced] : llvm::enumerate(reducedAxes)) {
+                               if (isReduced) {
+                                 sliceOffsets.push_back(rewriter.getIndexAttr(0));
+                                 sliceSizes.push_back(rewriter.getIndexAttr(inputType.getDimSize(axis)));
+                                 continue;
+                               }
+
+                               Value axisIndex = computeLaneIndex(
+                                 args.lane, keptAxisStrides[keptAxisIndex], inputType.getDimSize(axis), rewriter, loc);
+                               ++keptAxisIndex;
+                               sliceOffsets.push_back(axisIndex);
+                               sliceSizes.push_back(rewriter.getIndexAttr(1));
+                             }
+
+                             insertOffsets.push_back(args.lane);
+                             insertOffsets.append(inputType.getRank() - 1, rewriter.getIndexAttr(0));
+
+                             Value slice = tensor::ExtractSliceOp::create(
+                               rewriter, loc, sliceType, args.inputs.front(), sliceOffsets, sliceSizes, unitStrides);
+                             Value reduced = spatial::SpatVAvgOp::create(rewriter, loc, leafType, slice).getResult();
+                             createParallelInsertSliceIntoBatchOutput(
+                               rewriter, loc, reduced, args.outputs.front(), insertOffsets, insertSizes, unitStrides);
+                           });
+  if (failed(batchOp))
+    return failure();
+  return (*batchOp).getResult(0);
+}
+
+static Value buildKeepdimsFromLanePackedBatch(Value batchValue,
+                                              RankedTensorType keepdimsType,
+                                              RankedTensorType compactKeptType,
+                                              ArrayRef<bool> reducedAxes,
+                                              ConversionPatternRewriter& rewriter,
+                                              Location loc) {
+  auto batchType = cast<RankedTensorType>(batchValue.getType());
+  if (batchType == keepdimsType)
+    return batchValue;
+
+  SmallVector<ReassociationIndices> collapseToFlat {{}};
+  for (int64_t axis = 0; axis < batchType.getRank(); ++axis)
+    collapseToFlat.front().push_back(axis);
+
+  SmallVector<ReassociationIndices> expandFlatToCompact(1);
+  for (int64_t axis = 0; axis < compactKeptType.getRank(); ++axis)
+    expandFlatToCompact.front().push_back(axis);
+
+  SmallVector<ReassociationIndices> expandCompactToKeepdims;
+  ReassociationIndices pendingLeadingReducedAxes;
+  for (auto [axis, isReduced] : llvm::enumerate(reducedAxes)) {
+    if (isReduced) {
+      if (expandCompactToKeepdims.empty())
+        pendingLeadingReducedAxes.push_back(axis);
+      else
+        expandCompactToKeepdims.back().push_back(axis);
+      continue;
+    }
+
+    expandCompactToKeepdims.emplace_back();
+    auto& group = expandCompactToKeepdims.back();
+    group.append(pendingLeadingReducedAxes.begin(), pendingLeadingReducedAxes.end());
+    pendingLeadingReducedAxes.clear();
+    group.push_back(axis);
+  }
+  if (!pendingLeadingReducedAxes.empty())
+    expandCompactToKeepdims.back().append(pendingLeadingReducedAxes.begin(), pendingLeadingReducedAxes.end());
+
+  auto reshapeCompute =
+    createSpatCompute<1>(rewriter, loc, TypeRange {keepdimsType}, {}, ValueRange {batchValue}, [&](Value input) {
+      auto flatType =
+        RankedTensorType::get({batchType.getDimSize(0)}, batchType.getElementType(), batchType.getEncoding());
+      Value flat = tensor::CollapseShapeOp::create(rewriter, loc, flatType, input, collapseToFlat);
+      Value compact = flat;
+      if (compactKeptType != flatType)
+        compact = tensor::ExpandShapeOp::create(rewriter, loc, compactKeptType, flat, expandFlatToCompact);
+      Value keepdims = compact;
+      if (keepdimsType != compactKeptType)
+        keepdims = tensor::ExpandShapeOp::create(rewriter, loc, keepdimsType, compact, expandCompactToKeepdims);
+      spatial::SpatYieldOp::create(rewriter, loc, keepdims);
+    });
+  return reshapeCompute.getResult(0);
+}
+
 static SmallVector<ReassociationIndices> buildCollapseReassociation(ArrayRef<bool> reducedAxes) {
  SmallVector<ReassociationIndices> reassociation;
  ReassociationIndices currentGroup;
@@ -72,70 +314,14 @@ static SmallVector<ReassociationIndices> buildCollapseReassociation(ArrayRef<boo
  return reassociation;
 }

-static Value
-createAverageCompute(Value input, RankedTensorType resultType, ConversionPatternRewriter& rewriter, Location loc) {
-  constexpr size_t numInputs = 1;
-  auto computeOp = createSpatCompute<numInputs>(rewriter, loc, resultType, {}, ValueRange {input}, [&](Value x) {
-    auto avgOp = spatial::SpatVAvgOp::create(rewriter, loc, resultType, x);
-    spatial::SpatYieldOp::create(rewriter, loc, avgOp.getResult());
-  });
-  return computeOp.getResult(0);
-}
-
-static Value concatValues(ValueRange inputs, int64_t axis, ConversionPatternRewriter& rewriter, Location loc) {
-  auto firstType = cast<RankedTensorType>(inputs.front().getType());
-  SmallVector<int64_t> outputShape(firstType.getShape().begin(), firstType.getShape().end());
-  int64_t concatDimSize = 0;
-  for (Value input : inputs)
-    concatDimSize += cast<RankedTensorType>(input.getType()).getDimSize(axis);
-  outputShape[axis] = concatDimSize;
-  auto resultType = RankedTensorType::get(outputShape, firstType.getElementType(), firstType.getEncoding());
-
-  if (llvm::all_of(inputs, isHostFoldableValue))
-    return createSpatConcat(rewriter, loc, axis, inputs);
-
-  auto concatCompute = createSpatCompute(rewriter, loc, TypeRange {resultType}, {}, inputs, [&](ValueRange args) {
-    spatial::SpatYieldOp::create(rewriter, loc, createSpatConcat(rewriter, loc, axis, args));
-  });
-  return concatCompute.getResult(0);
-}
-
-static Value buildReduceMeanKeepdims(Value input,
-                                     ArrayRef<bool> reducedAxes,
-                                     int64_t axis,
-                                     RankedTensorType leafType,
-                                     ConversionPatternRewriter& rewriter,
-                                     Location loc) {
-  int64_t rank = cast<RankedTensorType>(input.getType()).getRank();
-  if (axis == rank)
-    return createAverageCompute(input, leafType, rewriter, loc);
-
-  if (reducedAxes[axis])
-    return buildReduceMeanKeepdims(input, reducedAxes, axis + 1, leafType, rewriter, loc);
-
-  SmallVector<Value> slices = sliceTensor(input, axis, /*sliceSize=*/1, rewriter, loc);
-  SmallVector<Value> reducedSlices;
-  reducedSlices.reserve(slices.size());
-  for (Value slice : slices)
-    reducedSlices.push_back(buildReduceMeanKeepdims(slice, reducedAxes, axis + 1, leafType, rewriter, loc));
-
-  return concatValues(reducedSlices, axis, rewriter, loc);
-}
-
 static Value squeezeReducedAxes(Value keepdimsValue,
                                RankedTensorType resultType,
                                ArrayRef<bool> reducedAxes,
                                ConversionPatternRewriter& rewriter,
                                Location loc) {
-  if (resultType.getRank() == 0) {
-    SmallVector<Value> indices(cast<RankedTensorType>(keepdimsValue.getType()).getRank(),
-                               arith::ConstantIndexOp::create(rewriter, loc, 0));
-    Value element = tensor::ExtractOp::create(rewriter, loc, keepdimsValue, indices);
-    return tensor::FromElementsOp::create(rewriter, loc, resultType, ValueRange {element});
-  }
-
-  auto reassociation = buildCollapseReassociation(reducedAxes);
-  if (isHostFoldableValue(keepdimsValue))
+  SmallVector<ReassociationIndices> reassociation =
+    resultType.getRank() == 0 ? SmallVector<ReassociationIndices> {} : buildCollapseReassociation(reducedAxes);
+  if (isCompileTimeComputable(keepdimsValue))
    return tensor::CollapseShapeOp::create(rewriter, loc, resultType, keepdimsValue, reassociation).getResult();

  auto squeezeCompute =
@@ -146,28 +332,55 @@ static Value squeezeReducedAxes(Value keepdimsValue,
  return squeezeCompute.getResult(0);
 }

-struct ReduceMeanToSpatialCompute : OpConversionPattern<ONNXReduceMeanV13Op> {
-  using OpConversionPattern::OpConversionPattern;
+template <typename ReduceMeanOp>
+struct ReduceMeanToSpatialCompute : OpConversionPattern<ReduceMeanOp> {
+  using OpConversionPattern<ReduceMeanOp>::OpConversionPattern;
+  using Adaptor = typename ReduceMeanOp::Adaptor;

-  LogicalResult matchAndRewrite(ONNXReduceMeanV13Op reduceMeanOp,
-                                ONNXReduceMeanV13OpAdaptor adaptor,
+  LogicalResult matchAndRewrite(ReduceMeanOp reduceMeanOp,
+                                Adaptor adaptor,
                                ConversionPatternRewriter& rewriter) const override {
    auto inputType = dyn_cast<RankedTensorType>(adaptor.getData().getType());
    auto resultType = dyn_cast<RankedTensorType>(reduceMeanOp.getReduced().getType());
    if (!inputType || !resultType || !inputType.hasStaticShape() || !resultType.hasStaticShape())
      return failure();
+    if (inputType.getRank() == 0) {
+      rewriter.replaceOp(reduceMeanOp, adaptor.getData());
+      return success();
+    }

-    SmallVector<int64_t> axes = normalizeAxes(reduceMeanOp.getAxesAttr(), inputType.getRank());
-    SmallVector<bool> reducedAxes = buildReducedAxesMask(axes, inputType.getRank());
+    auto semantics = getReduceMeanSemantics(reduceMeanOp, adaptor, inputType.getRank());
+    if (failed(semantics))
+      return rewriter.notifyMatchFailure(reduceMeanOp, "requires compile-time constant, in-range ReduceMean axes");
+    if (semantics->isIdentity) {
+      if (inputType != resultType)
+        return rewriter.notifyMatchFailure(
+          reduceMeanOp, "noop_with_empty_axes identity requires the result type to match the input type");
+      rewriter.replaceOp(reduceMeanOp, adaptor.getData());
+      return success();
+    }
+
+    SmallVector<bool> reducedAxes = buildReducedAxesMask(semantics->axes, inputType.getRank());
    if (reducedAxes.empty() && inputType.getRank() != 0)
      return failure();

    Location loc = reduceMeanOp.getLoc();
    RankedTensorType leafType = getAllOnesType(inputType, resultType.getElementType());
-    Value reducedKeepdims =
-      buildReduceMeanKeepdims(adaptor.getData(), reducedAxes, /*axis=*/0, leafType, rewriter, loc);
+    RankedTensorType compactKeptType = getCompactKeptType(inputType, resultType.getElementType(), reducedAxes);
+    RankedTensorType keepdimsType = getKeepdimsType(inputType, resultType.getElementType(), reducedAxes);
+    int64_t laneCount = 1;
+    for (int64_t dim : compactKeptType.getShape())
+      laneCount *= dim;
+    RankedTensorType batchType = getLanePackedKeepdimsType(laneCount, leafType);

-    if (reduceMeanOp.getKeepdims() != 0) {
+    auto lanePackedKeepdims =
+      buildReduceMeanKeepdimsBatch(adaptor.getData(), reducedAxes, batchType, leafType, rewriter, loc);
+    if (failed(lanePackedKeepdims))
+      return failure();
+    Value reducedKeepdims =
+      buildKeepdimsFromLanePackedBatch(*lanePackedKeepdims, keepdimsType, compactKeptType, reducedAxes, rewriter, loc);
+
+    if (semantics->keepdims != 0) {
      rewriter.replaceOp(reduceMeanOp, reducedKeepdims);
      return success();
    }
@@ -181,7 +394,7 @@ struct ReduceMeanToSpatialCompute : OpConversionPattern<ONNXReduceMeanV13Op> {
 } // namespace

 void populateReduceMeanPatterns(RewritePatternSet& patterns, MLIRContext* ctx) {
-  patterns.add<ReduceMeanToSpatialCompute>(ctx);
+  patterns.add<ReduceMeanToSpatialCompute<ONNXReduceMeanV13Op>, ReduceMeanToSpatialCompute<ONNXReduceMeanOp>>(ctx);
 }

 } // namespace onnx_mlir
@@ -12,6 +12,7 @@
 #include <optional>
 #include <type_traits>

+#include "src/Accelerators/PIM/Common/IR/LoopUtils.hpp"
 #include "src/Accelerators/PIM/Common/PimCommon.hpp"
 #include "src/Accelerators/PIM/Compiler/PimCompilerOptions.hpp"
 #include "src/Accelerators/PIM/Conversion/ONNXToSpatial/Common/Common.hpp"
@@ -23,43 +24,26 @@ using namespace mlir;
 namespace onnx_mlir {
 namespace {

-template <typename ArrayAttrT>
-static int64_t getI64(ArrayAttrT arrayAttr, size_t index) {
-  return cast<IntegerAttr>(arrayAttr[index]).getInt();
-}
-
-template <typename ArrayAttrT>
-static int64_t getOptionalI64(std::optional<ArrayAttrT> arrayAttr, size_t index, int64_t defaultValue) {
-  return arrayAttr ? getI64(*arrayAttr, index) : defaultValue;
-}
-
-static Value materializeContiguousTile(ConversionPatternRewriter& rewriter, Location loc, Value tile) {
+static Value materializeTileTensor(ConversionPatternRewriter& rewriter, Location loc, Value tile) {
  auto tileType = cast<RankedTensorType>(tile.getType());
  Value empty = tensor::EmptyOp::create(rewriter, loc, tileType.getShape(), tileType.getElementType());
-
-  SmallVector<OpFoldResult> offsets(tileType.getRank(), rewriter.getIndexAttr(0));
-  SmallVector<OpFoldResult> sizes;
-  sizes.reserve(tileType.getRank());
-  for (int64_t dimSize : tileType.getShape())
-    sizes.push_back(rewriter.getIndexAttr(dimSize));
-  SmallVector<OpFoldResult> strides(tileType.getRank(), rewriter.getIndexAttr(1));
-
-  return tensor::InsertSliceOp::create(rewriter, loc, tile, empty, offsets, sizes, strides);
+  return insertStaticSlice(rewriter, loc, tile, empty, getZeroOffsets(rewriter, tileType.getRank()));
 }

 static Value
 createPoolFillElement(ConversionPatternRewriter& rewriter, Location loc, Type elementType, bool useMinimumValue) {
+  Operation* anchorOp = rewriter.getInsertionBlock()->getParentOp();
  if (!useMinimumValue)
-    return arith::ConstantOp::create(rewriter, loc, elementType, rewriter.getZeroAttr(elementType));
+    return getOrCreateConstant(rewriter, anchorOp, rewriter.getZeroAttr(elementType), elementType);

  if (auto floatType = dyn_cast<FloatType>(elementType)) {
    auto minValue = llvm::APFloat::getInf(floatType.getFloatSemantics(), /*Negative=*/true);
-    return arith::ConstantOp::create(rewriter, loc, elementType, rewriter.getFloatAttr(floatType, minValue));
+    return getOrCreateConstant(rewriter, anchorOp, rewriter.getFloatAttr(floatType, minValue), elementType);
  }

  if (auto integerType = dyn_cast<IntegerType>(elementType)) {
    auto minValue = llvm::APInt::getSignedMinValue(integerType.getWidth());
-    return arith::ConstantOp::create(rewriter, loc, elementType, rewriter.getIntegerAttr(integerType, minValue));
+    return getOrCreateConstant(rewriter, anchorOp, rewriter.getIntegerAttr(integerType, minValue), elementType);
  }

  llvm_unreachable("unsupported pool element type");
@@ -166,7 +150,7 @@ static FailureOr<Value> createAverageScaleTensor(ConversionPatternRewriter& rewr
  }

  auto scaleAttr = DenseElementsAttr::get(scaleType, scaleValues);
-  return arith::ConstantOp::create(rewriter, loc, scaleType, scaleAttr).getResult();
+  return getOrCreateConstant(rewriter, rewriter.getInsertionBlock()->getParentOp(), scaleAttr, scaleType);
 }

 template <typename PoolOp>
@@ -197,12 +181,12 @@ struct PoolToSpatialComputeBase : public OpConversionPattern<PoolOp> {
    const int64_t inputWidth = xType.getDimSize(3);
    const int64_t outputHeight = outType.getDimSize(2);
    const int64_t outputWidth = outType.getDimSize(3);
-    const int64_t kernelHeight = getI64(kernelAttr, 0);
-    const int64_t kernelWidth = getI64(kernelAttr, 1);
-    const int64_t strideHeight = getOptionalI64(poolOp.getStrides(), 0, 1);
-    const int64_t strideWidth = getOptionalI64(poolOp.getStrides(), 1, 1);
-    const int64_t dilationHeight = getOptionalI64(poolOp.getDilations(), 0, 1);
-    const int64_t dilationWidth = getOptionalI64(poolOp.getDilations(), 1, 1);
+    const int64_t kernelHeight = getI64Attr(kernelAttr, 0);
+    const int64_t kernelWidth = getI64Attr(kernelAttr, 1);
+    const int64_t strideHeight = getOptionalI64Attr(poolOp.getStrides(), 0, 1);
+    const int64_t strideWidth = getOptionalI64Attr(poolOp.getStrides(), 1, 1);
+    const int64_t dilationHeight = getOptionalI64Attr(poolOp.getDilations(), 0, 1);
+    const int64_t dilationWidth = getOptionalI64Attr(poolOp.getDilations(), 1, 1);

    int64_t padTop = 0;
    int64_t padLeft = 0;
@@ -212,10 +196,10 @@ struct PoolToSpatialComputeBase : public OpConversionPattern<PoolOp> {
    if (auto padsAttr = poolOp.getPads()) {
      if (padsAttr->size() != 4)
        return rewriter.notifyMatchFailure(poolOp, "pads must have four elements.");
-      padTop = getI64(*padsAttr, 0);
-      padLeft = getI64(*padsAttr, 1);
-      padBottom = getI64(*padsAttr, 2);
-      padRight = getI64(*padsAttr, 3);
+      padTop = getI64Attr(*padsAttr, 0);
+      padLeft = getI64Attr(*padsAttr, 1);
+      padBottom = getI64Attr(*padsAttr, 2);
+      padRight = getI64Attr(*padsAttr, 3);
    }
    else {
      StringRef autoPad = poolOp.getAutoPad();
@@ -283,94 +267,111 @@ struct PoolToSpatialComputeBase : public OpConversionPattern<PoolOp> {
          createPaddedPoolInput(rewriter, loc, poolOp, xArg, xType, padTop, padLeft, padBottom, padRight);
        Value pooledOutputInit = tensor::EmptyOp::create(rewriter, loc, outType.getShape(), outType.getElementType());

-        Value c0 = arith::ConstantIndexOp::create(rewriter, loc, 0);
-        Value c1 = arith::ConstantIndexOp::create(rewriter, loc, 1);
-        Value cOutputPatchCount = arith::ConstantIndexOp::create(rewriter, loc, outputPatchCount);
-        Value cOutputPixelsPerBatch = arith::ConstantIndexOp::create(rewriter, loc, outputHeight * outputWidth);
-        Value cOutputWidth = arith::ConstantIndexOp::create(rewriter, loc, outputWidth);
-        Value cStrideHeight = arith::ConstantIndexOp::create(rewriter, loc, strideHeight);
-        Value cStrideWidth = arith::ConstantIndexOp::create(rewriter, loc, strideWidth);
+        Operation* anchorOp = rewriter.getInsertionBlock()->getParentOp();
+        Value c0 = getOrCreateIndexConstant(rewriter, anchorOp, 0);
+        Value c1 = getOrCreateIndexConstant(rewriter, anchorOp, 1);
+        Value cOutputPatchCount = getOrCreateIndexConstant(rewriter, anchorOp, outputPatchCount);
+        Value cOutputPixelsPerBatch = getOrCreateIndexConstant(rewriter, anchorOp, outputHeight * outputWidth);
+        Value cOutputWidth = getOrCreateIndexConstant(rewriter, anchorOp, outputWidth);
+        Value cStrideHeight = getOrCreateIndexConstant(rewriter, anchorOp, strideHeight);
+        Value cStrideWidth = getOrCreateIndexConstant(rewriter, anchorOp, strideWidth);

-        auto outputLoop = scf::ForOp::create(rewriter, loc, c0, cOutputPatchCount, c1, ValueRange {pooledOutputInit});
-        rewriter.setInsertionPointToStart(outputLoop.getBody());
+        auto outputLoop = buildNormalizedScfFor(
+          rewriter,
+          loc,
+          c0,
+          cOutputPatchCount,
+          c1,
+          ValueRange {pooledOutputInit},
+          [&](OpBuilder&,
+              Location nestedLoc,
+              Value outputPatchIndex,
+              ValueRange iterArgs,
+              SmallVectorImpl<Value>& yielded) {
+            Value pooledOutputAcc = iterArgs.front();
+            Value batchIndex = arith::DivUIOp::create(rewriter, nestedLoc, outputPatchIndex, cOutputPixelsPerBatch);
+            Value batchPatchIndex =
+              arith::RemUIOp::create(rewriter, nestedLoc, outputPatchIndex, cOutputPixelsPerBatch);
+            Value outHeightIndex = arith::DivUIOp::create(rewriter, nestedLoc, batchPatchIndex, cOutputWidth);
+            Value outWidthIndex = arith::RemUIOp::create(rewriter, nestedLoc, batchPatchIndex, cOutputWidth);
+            Value windowBaseH = arith::MulIOp::create(rewriter, nestedLoc, outHeightIndex, cStrideHeight);
+            Value windowBaseW = arith::MulIOp::create(rewriter, nestedLoc, outWidthIndex, cStrideWidth);

-        Value outputPatchIndex = outputLoop.getInductionVar();
-        Value pooledOutputAcc = outputLoop.getRegionIterArgs().front();
+            Value updatedOutput = pooledOutputAcc;
+            for (int64_t channelTile = 0; channelTile < channelTileCount; ++channelTile) {
+              const int64_t tileChannels = std::min<int64_t>(xbarSize, channels - channelTile * xbarSize);
+              auto tileType = RankedTensorType::get({1, tileChannels, 1, 1}, outType.getElementType());
+              Value reducedWindow =
+                createPoolFillTensor(rewriter, nestedLoc, tileType, std::is_same_v<PoolOp, ONNXMaxPoolSingleOutOp>);

-        Value batchIndex = arith::DivUIOp::create(rewriter, loc, outputPatchIndex, cOutputPixelsPerBatch);
-        Value batchPatchIndex = arith::RemUIOp::create(rewriter, loc, outputPatchIndex, cOutputPixelsPerBatch);
-        Value outHeightIndex = arith::DivUIOp::create(rewriter, loc, batchPatchIndex, cOutputWidth);
-        Value outWidthIndex = arith::RemUIOp::create(rewriter, loc, batchPatchIndex, cOutputWidth);
-        Value windowBaseH = arith::MulIOp::create(rewriter, loc, outHeightIndex, cStrideHeight);
-        Value windowBaseW = arith::MulIOp::create(rewriter, loc, outWidthIndex, cStrideWidth);
+              for (int64_t kernelH = 0; kernelH < kernelHeight; ++kernelH) {
+                Value paddedInH = windowBaseH;
+                if (kernelH * dilationHeight != 0) {
+                  Value kernelHOffset = getOrCreateIndexConstant(rewriter, anchorOp, kernelH * dilationHeight);
+                  paddedInH = arith::AddIOp::create(rewriter, nestedLoc, paddedInH, kernelHOffset);
+                }

-        Value updatedOutput = pooledOutputAcc;
-        for (int64_t channelTile = 0; channelTile < channelTileCount; ++channelTile) {
-          const int64_t tileChannels = std::min<int64_t>(xbarSize, channels - channelTile * xbarSize);
-          auto tileType = RankedTensorType::get({1, tileChannels, 1, 1}, outType.getElementType());
-          Value reducedWindow =
-            createPoolFillTensor(rewriter, loc, tileType, std::is_same_v<PoolOp, ONNXMaxPoolSingleOutOp>);
+                for (int64_t kernelW = 0; kernelW < kernelWidth; ++kernelW) {
+                  Value paddedInW = windowBaseW;
+                  if (kernelW * dilationWidth != 0) {
+                    Value kernelWOffset = getOrCreateIndexConstant(rewriter, anchorOp, kernelW * dilationWidth);
+                    paddedInW = arith::AddIOp::create(rewriter, nestedLoc, paddedInW, kernelWOffset);
+                  }

-          for (int64_t kernelH = 0; kernelH < kernelHeight; ++kernelH) {
-            Value paddedInH = windowBaseH;
-            if (kernelH * dilationHeight != 0) {
-              Value kernelHOffset = arith::ConstantIndexOp::create(rewriter, loc, kernelH * dilationHeight);
-              paddedInH = arith::AddIOp::create(rewriter, loc, paddedInH, kernelHOffset);
-            }
-
-            for (int64_t kernelW = 0; kernelW < kernelWidth; ++kernelW) {
-              Value paddedInW = windowBaseW;
-              if (kernelW * dilationWidth != 0) {
-                Value kernelWOffset = arith::ConstantIndexOp::create(rewriter, loc, kernelW * dilationWidth);
-                paddedInW = arith::AddIOp::create(rewriter, loc, paddedInW, kernelWOffset);
+                  SmallVector<OpFoldResult> offsets = {
+                    batchIndex, rewriter.getIndexAttr(channelTile * xbarSize), paddedInH, paddedInW};
+                  SmallVector<OpFoldResult> sizes = {rewriter.getIndexAttr(1),
+                                                     rewriter.getIndexAttr(tileChannels),
+                                                     rewriter.getIndexAttr(1),
+                                                     rewriter.getIndexAttr(1)};
+                  SmallVector<OpFoldResult> strides = {rewriter.getIndexAttr(1),
+                                                       rewriter.getIndexAttr(1),
+                                                       rewriter.getIndexAttr(1),
+                                                       rewriter.getIndexAttr(1)};
+                  Value windowValue =
+                    tensor::ExtractSliceOp::create(rewriter, nestedLoc, tileType, paddedInput, offsets, sizes, strides);
+                  windowValue = materializeTileTensor(rewriter, nestedLoc, windowValue);
+                  reducedWindow = ReduceOp::create(rewriter, nestedLoc, tileType, reducedWindow, windowValue);
+                }
              }

-              SmallVector<OpFoldResult> offsets = {
-                batchIndex, rewriter.getIndexAttr(channelTile * xbarSize), paddedInH, paddedInW};
-              SmallVector<OpFoldResult> sizes = {rewriter.getIndexAttr(1),
-                                                 rewriter.getIndexAttr(tileChannels),
-                                                 rewriter.getIndexAttr(1),
-                                                 rewriter.getIndexAttr(1)};
-              SmallVector<OpFoldResult> strides = {
+              if constexpr (std::is_same_v<PoolOp, ONNXAveragePoolOp>) {
+                SmallVector<OpFoldResult> scaleOffsets = {rewriter.getIndexAttr(0),
+                                                          rewriter.getIndexAttr(channelTile * xbarSize),
+                                                          outHeightIndex,
+                                                          outWidthIndex};
+                SmallVector<OpFoldResult> scaleSizes = {rewriter.getIndexAttr(1),
+                                                        rewriter.getIndexAttr(tileChannels),
+                                                        rewriter.getIndexAttr(1),
+                                                        rewriter.getIndexAttr(1)};
+                SmallVector<OpFoldResult> scaleStrides = {rewriter.getIndexAttr(1),
+                                                          rewriter.getIndexAttr(1),
+                                                          rewriter.getIndexAttr(1),
+                                                          rewriter.getIndexAttr(1)};
+                Value scaleSlice = tensor::ExtractSliceOp::create(
+                  rewriter, nestedLoc, tileType, averageScaleTensor, scaleOffsets, scaleSizes, scaleStrides);
+                scaleSlice = materializeTileTensor(rewriter, nestedLoc, scaleSlice);
+                reducedWindow = spatial::SpatVMulOp::create(rewriter, nestedLoc, tileType, reducedWindow, scaleSlice);
+              }
+
+              SmallVector<OpFoldResult> outputOffsets = {
+                batchIndex, rewriter.getIndexAttr(channelTile * xbarSize), outHeightIndex, outWidthIndex};
+              SmallVector<OpFoldResult> outputSizes = {rewriter.getIndexAttr(1),
+                                                       rewriter.getIndexAttr(tileChannels),
+                                                       rewriter.getIndexAttr(1),
+                                                       rewriter.getIndexAttr(1)};
+              SmallVector<OpFoldResult> outputStrides = {
                rewriter.getIndexAttr(1), rewriter.getIndexAttr(1), rewriter.getIndexAttr(1), rewriter.getIndexAttr(1)};
-              Value windowValue =
-                tensor::ExtractSliceOp::create(rewriter, loc, tileType, paddedInput, offsets, sizes, strides);
-              windowValue = materializeContiguousTile(rewriter, loc, windowValue);
-              reducedWindow = ReduceOp::create(rewriter, loc, tileType, reducedWindow, windowValue);
+              updatedOutput = tensor::InsertSliceOp::create(
+                rewriter, nestedLoc, reducedWindow, updatedOutput, outputOffsets, outputSizes, outputStrides);
            }
-          }
+            yielded.push_back(updatedOutput);
+            return success();
+          });
+        if (failed(outputLoop))
+          return failure();

-          if constexpr (std::is_same_v<PoolOp, ONNXAveragePoolOp>) {
-            SmallVector<OpFoldResult> scaleOffsets = {
-              rewriter.getIndexAttr(0), rewriter.getIndexAttr(channelTile * xbarSize), outHeightIndex, outWidthIndex};
-            SmallVector<OpFoldResult> scaleSizes = {rewriter.getIndexAttr(1),
-                                                    rewriter.getIndexAttr(tileChannels),
-                                                    rewriter.getIndexAttr(1),
-                                                    rewriter.getIndexAttr(1)};
-            SmallVector<OpFoldResult> scaleStrides = {
-              rewriter.getIndexAttr(1), rewriter.getIndexAttr(1), rewriter.getIndexAttr(1), rewriter.getIndexAttr(1)};
-            Value scaleSlice = tensor::ExtractSliceOp::create(
-              rewriter, loc, tileType, averageScaleTensor, scaleOffsets, scaleSizes, scaleStrides);
-            scaleSlice = materializeContiguousTile(rewriter, loc, scaleSlice);
-            reducedWindow = spatial::SpatVMulOp::create(rewriter, loc, tileType, reducedWindow, scaleSlice);
-          }
-
-          SmallVector<OpFoldResult> outputOffsets = {
-            batchIndex, rewriter.getIndexAttr(channelTile * xbarSize), outHeightIndex, outWidthIndex};
-          SmallVector<OpFoldResult> outputSizes = {rewriter.getIndexAttr(1),
-                                                   rewriter.getIndexAttr(tileChannels),
-                                                   rewriter.getIndexAttr(1),
-                                                   rewriter.getIndexAttr(1)};
-          SmallVector<OpFoldResult> outputStrides = {
-            rewriter.getIndexAttr(1), rewriter.getIndexAttr(1), rewriter.getIndexAttr(1), rewriter.getIndexAttr(1)};
-          updatedOutput = tensor::InsertSliceOp::create(
-            rewriter, loc, reducedWindow, updatedOutput, outputOffsets, outputSizes, outputStrides);
-        }
-
-        scf::YieldOp::create(rewriter, loc, updatedOutput);
-
-        rewriter.setInsertionPointAfter(outputLoop);
-        spatial::SpatYieldOp::create(rewriter, loc, outputLoop.getResult(0));
+        spatial::SpatYieldOp::create(rewriter, loc, outputLoop->results.front());
        return success();
      });
    if (failed(computeOp))
@@ -1,9 +1,11 @@
+#include "mlir/Dialect/Arith/IR/Arith.h"
+#include "mlir/Dialect/SCF/IR/SCF.h"
 #include "mlir/Dialect/Tensor/IR/Tensor.h"
 #include "mlir/Transforms/DialectConversion.h"

+#include "src/Accelerators/PIM/Common/IR/LoopUtils.hpp"
 #include "src/Accelerators/PIM/Conversion/ONNXToSpatial/Common/Common.hpp"
-#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/ConversionPatterns.hpp"
-#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/HostFoldability.hpp"
+#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/Patterns.hpp"
 #include "src/Accelerators/PIM/Dialect/Spatial/SpatialOps.hpp"
 #include "src/Dialect/ONNX/ONNXOps.hpp"

@@ -12,61 +14,94 @@ using namespace mlir;
 namespace onnx_mlir {
 namespace {

-static int64_t normalizeAxis(int64_t axis, int64_t rank) { return axis >= 0 ? axis : rank + axis; }
+static Value buildLoopSoftmaxSlice(Value input,
+                                   Value accumulator,
+                                   RankedTensorType inputType,
+                                   ArrayRef<Value> outerIndices,
+                                   ConversionPatternRewriter& rewriter,
+                                   Location loc) {
+  int64_t rank = inputType.getRank();
+  SmallVector<int64_t> sliceShape(static_cast<size_t>(rank - 1), 1);
+  sliceShape.push_back(inputType.getDimSize(rank - 1));
+  auto sliceType = RankedTensorType::get(sliceShape, inputType.getElementType(), inputType.getEncoding());

-static SmallVector<int64_t> permuteShape(ArrayRef<int64_t> shape, ArrayRef<int64_t> permutation) {
-  SmallVector<int64_t> permutedShape;
-  permutedShape.reserve(permutation.size());
-  for (int64_t axis : permutation)
-    permutedShape.push_back(shape[axis]);
-  return permutedShape;
+  SmallVector<OpFoldResult> offsets;
+  SmallVector<OpFoldResult> sizes;
+  SmallVector<OpFoldResult> strides = getUnitStrides(rewriter, rank);
+  offsets.reserve(rank);
+  sizes.reserve(rank);
+
+  for (Value outerIndex : outerIndices) {
+    offsets.push_back(outerIndex);
+    sizes.push_back(rewriter.getIndexAttr(1));
+  }
+  offsets.push_back(rewriter.getIndexAttr(0));
+  sizes.push_back(rewriter.getIndexAttr(inputType.getDimSize(rank - 1)));
+
+  Value inputSlice = tensor::ExtractSliceOp::create(rewriter, loc, sliceType, input, offsets, sizes, strides);
+  Value softmaxSlice = spatial::SpatSoftmaxOp::create(rewriter, loc, sliceType, inputSlice).getResult();
+  return tensor::InsertSliceOp::create(rewriter, loc, softmaxSlice, accumulator, offsets, sizes, strides);
 }

-static Value createSoftmaxCompute(Value input, ConversionPatternRewriter& rewriter, Location loc) {
+static FailureOr<Value> buildLoopSoftmaxNest(Value input,
+                                             Value accumulator,
+                                             RankedTensorType inputType,
+                                             int64_t axis,
+                                             SmallVectorImpl<Value>& outerIndices,
+                                             ConversionPatternRewriter& rewriter,
+                                             Location loc) {
+  if (axis == inputType.getRank() - 1)
+    return buildLoopSoftmaxSlice(input, accumulator, inputType, outerIndices, rewriter, loc);
+
+  Operation* anchorOp = rewriter.getInsertionBlock()->getParentOp();
+  Value c0 = getOrCreateIndexConstant(rewriter, anchorOp, 0);
+  Value c1 = getOrCreateIndexConstant(rewriter, anchorOp, 1);
+  Value cUpper = getOrCreateIndexConstant(rewriter, anchorOp, inputType.getDimSize(axis));
+
+  auto loop = buildNormalizedScfFor(
+    rewriter,
+    loc,
+    c0,
+    cUpper,
+    c1,
+    ValueRange {accumulator},
+    [&](OpBuilder& builder, Location nestedLoc, Value loopIndex, ValueRange iterArgs, SmallVectorImpl<Value>& yielded) {
+      outerIndices.push_back(loopIndex);
+      auto updatedAccumulator =
+        buildLoopSoftmaxNest(input, iterArgs.front(), inputType, axis + 1, outerIndices, rewriter, nestedLoc);
+      outerIndices.pop_back();
+      if (failed(updatedAccumulator))
+        return failure();
+      yielded.push_back(*updatedAccumulator);
+      return success();
+    });
+  if (failed(loop))
+    return failure();
+  return loop->results.front();
+}
+
+static FailureOr<Value> createLoopSoftmaxCompute(Value input, ConversionPatternRewriter& rewriter, Location loc) {
  auto inputType = cast<RankedTensorType>(input.getType());
  constexpr size_t numInputs = 1;
-  auto computeOp =
-    createSpatCompute<numInputs>(rewriter, loc, TypeRange {inputType}, {}, ValueRange {input}, [&](Value x) {
-      auto softmaxOp = spatial::SpatSoftmaxOp::create(rewriter, loc, inputType, x);
-      spatial::SpatYieldOp::create(rewriter, loc, softmaxOp.getResult());
+  auto computeOp = createSpatCompute<numInputs>(
+    rewriter, loc, TypeRange {inputType}, {}, ValueRange {input}, [&](Value x) -> LogicalResult {
+      if (inputType.getRank() == 1) {
+        Value softmax = spatial::SpatSoftmaxOp::create(rewriter, loc, inputType, x).getResult();
+        spatial::SpatYieldOp::create(rewriter, loc, softmax);
+        return success();
+      }
+
+      Value outputInit = tensor::EmptyOp::create(rewriter, loc, inputType.getShape(), inputType.getElementType());
+      SmallVector<Value> outerIndices;
+      auto result = buildLoopSoftmaxNest(x, outputInit, inputType, /*axis=*/0, outerIndices, rewriter, loc);
+      if (failed(result))
+        return failure();
+      spatial::SpatYieldOp::create(rewriter, loc, *result);
+      return success();
    });
-  return computeOp.getResult(0);
-}
-
-static Value concatValues(ValueRange inputs, int64_t axis, ConversionPatternRewriter& rewriter, Location loc) {
-  auto firstType = cast<RankedTensorType>(inputs.front().getType());
-  SmallVector<int64_t> outputShape(firstType.getShape().begin(), firstType.getShape().end());
-  int64_t concatDimSize = 0;
-  for (Value input : inputs)
-    concatDimSize += cast<RankedTensorType>(input.getType()).getDimSize(axis);
-  outputShape[axis] = concatDimSize;
-  auto resultType = RankedTensorType::get(outputShape, firstType.getElementType(), firstType.getEncoding());
-
-  if (llvm::all_of(inputs, isHostFoldableValue))
-    return createSpatConcat(rewriter, loc, axis, inputs);
-
-  auto concatCompute = createSpatCompute(rewriter, loc, TypeRange {resultType}, {}, inputs, [&](ValueRange args) {
-    spatial::SpatYieldOp::create(rewriter, loc, createSpatConcat(rewriter, loc, axis, args));
-  });
-  return concatCompute.getResult(0);
-}
-
-static Value
-buildSoftmax(Value input, int64_t softmaxAxis, int64_t axis, ConversionPatternRewriter& rewriter, Location loc) {
-  auto inputType = cast<RankedTensorType>(input.getType());
-  if (axis == inputType.getRank())
-    return createSoftmaxCompute(input, rewriter, loc);
-
-  if (axis == softmaxAxis)
-    return buildSoftmax(input, softmaxAxis, axis + 1, rewriter, loc);
-
-  SmallVector<Value> slices = sliceTensor(input, axis, /*sliceSize=*/1, rewriter, loc);
-  SmallVector<Value> rebuiltSlices;
-  rebuiltSlices.reserve(slices.size());
-  for (Value slice : slices)
-    rebuiltSlices.push_back(buildSoftmax(slice, softmaxAxis, axis + 1, rewriter, loc));
-
-  return concatValues(rebuiltSlices, axis, rewriter, loc);
+  if (failed(computeOp))
+    return failure();
+  return computeOp->getResult(0);
 }

 struct SoftmaxToSpatialCompute : OpConversionPattern<ONNXSoftmaxOp> {
@@ -79,45 +114,40 @@ struct SoftmaxToSpatialCompute : OpConversionPattern<ONNXSoftmaxOp> {
    if (!inputType || !inputType.hasStaticShape())
      return failure();

-    int64_t axis = normalizeAxis(softmaxOp.getAxis(), inputType.getRank());
-    if (axis < 0 || axis >= inputType.getRank())
+    auto axis = normalizeAxisChecked(softmaxOp.getAxis(), inputType.getRank());
+    if (failed(axis))
      return failure();

    Value input = adaptor.getInput();
    Value result;
-    if (axis == inputType.getRank() - 1) {
-      result = buildSoftmax(input, axis, /*axis=*/0, rewriter, softmaxOp.getLoc());
+    if (*axis == inputType.getRank() - 1) {
+      auto computed = createLoopSoftmaxCompute(input, rewriter, softmaxOp.getLoc());
+      if (failed(computed))
+        return failure();
+      result = *computed;
    }
    else {
      SmallVector<int64_t> permutation;
      permutation.reserve(inputType.getRank());
      for (int64_t dim = 0; dim < inputType.getRank(); ++dim)
-        if (dim != axis)
+        if (dim != *axis)
          permutation.push_back(dim);
-      permutation.push_back(axis);
-
-      SmallVector<int64_t> inversePermutation(inputType.getRank());
-      for (auto [newIndex, oldIndex] : llvm::enumerate(permutation))
-        inversePermutation[oldIndex] = static_cast<int64_t>(newIndex);
+      permutation.push_back(*axis);
+      SmallVector<int64_t> inversePermutation = invertPermutation(permutation);

      auto transposedType = RankedTensorType::get(
        permuteShape(inputType.getShape(), permutation), inputType.getElementType(), inputType.getEncoding());
-      auto preTransposeCompute =
-        createSpatCompute<1>(rewriter, softmaxOp.getLoc(), TypeRange {transposedType}, {}, input, [&](Value x) {
-          Value transposed = ONNXTransposeOp::create(
-            rewriter, softmaxOp.getLoc(), transposedType, x, rewriter.getI64ArrayAttr(permutation));
-          spatial::SpatYieldOp::create(rewriter, softmaxOp.getLoc(), transposed);
-        });
-      Value transposedInput = preTransposeCompute.getResult(0);
-      Value transposedResult = buildSoftmax(
-        transposedInput, /*softmaxAxis=*/inputType.getRank() - 1, /*axis=*/0, rewriter, softmaxOp.getLoc());
-      auto postTransposeCompute =
-        createSpatCompute<1>(rewriter, softmaxOp.getLoc(), TypeRange {inputType}, {}, transposedResult, [&](Value x) {
-          Value transposed = ONNXTransposeOp::create(
-            rewriter, softmaxOp.getLoc(), inputType, x, rewriter.getI64ArrayAttr(inversePermutation));
-          spatial::SpatYieldOp::create(rewriter, softmaxOp.getLoc(), transposed);
-        });
-      result = postTransposeCompute.getResult(0);
+      Value transposedInput =
+        ONNXTransposeOp::create(
+          rewriter, softmaxOp.getLoc(), transposedType, input, rewriter.getI64ArrayAttr(permutation))
+          .getResult();
+      auto transposedResult = createLoopSoftmaxCompute(transposedInput, rewriter, softmaxOp.getLoc());
+      if (failed(transposedResult))
+        return failure();
+      result =
+        ONNXTransposeOp::create(
+          rewriter, softmaxOp.getLoc(), inputType, *transposedResult, rewriter.getI64ArrayAttr(inversePermutation))
+          .getResult();
    }

    rewriter.replaceOp(softmaxOp, result);
@@ -0,0 +1,288 @@
+#include "mlir/Dialect/Arith/IR/Arith.h"
+#include "mlir/Dialect/Func/IR/FuncOps.h"
+#include "mlir/Dialect/Linalg/IR/Linalg.h"
+#include "mlir/Dialect/Tensor/IR/Tensor.h"
+#include "mlir/IR/IRMapping.h"
+#include "mlir/IR/PatternMatch.h"
+
+#include "llvm/ADT/STLExtras.h"
+#include "llvm/ADT/SmallVector.h"
+
+#include "src/Accelerators/PIM/Common/IR/WeightUtils.hpp"
+#include "src/Accelerators/PIM/Common/Support/CheckedArithmetic.hpp"
+#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/Common/WeightMaterialization.hpp"
+#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/Patterns.hpp"
+#include "src/Accelerators/PIM/Dialect/Spatial/SpatialOps.hpp"
+#include "src/Dialect/ONNX/ONNXOps.hpp"
+
+using namespace mlir;
+
+namespace onnx_mlir {
+
+namespace {
+
+static bool isWeightMaterializationHelperUser(Operation* op) {
+  return isa<tensor::ExtractSliceOp, tensor::ExpandShapeOp, tensor::CollapseShapeOp, linalg::TransposeOp>(op);
+}
+
+static bool canPromoteInputBlockArgument(BlockArgument arg) {
+  return !arg.use_empty() && llvm::all_of(arg.getUsers(), isWeightMaterializationHelperUser);
+}
+
+static bool canPromoteInputBlockArgument(std::optional<BlockArgument> arg) {
+  return arg && canPromoteInputBlockArgument(*arg);
+}
+
+static bool isDirectConstantValue(Value value) {
+  return isa_and_nonnull<arith::ConstantOp, ONNXConstantOp>(value.getDefiningOp());
+}
+
+struct PromotedOperands {
+  SmallVector<bool> promoteInput;
+  SmallVector<Value> newWeights;
+  SmallVector<Value> newInputs;
+  SmallVector<Type> newInputTypes;
+  SmallVector<Location> newInputLocs;
+};
+
+template <typename ComputeOpTy>
+static bool hasPromotableWeightLikeInputs(ComputeOpTy compute) {
+  for (auto [inputIdx, input] : llvm::enumerate(compute.getInputs())) {
+    if (!isWeightLikeComputeOperand(input))
+      continue;
+    if (isDirectConstantValue(input) && !canPromoteInputBlockArgument(compute.getInputArgument(inputIdx)))
+      continue;
+    return true;
+  }
+  return false;
+}
+
+template <typename ComputeOpTy>
+static FailureOr<PromotedOperands> computePromotedOperands(ComputeOpTy compute) {
+  PromotedOperands promoted;
+  promoted.promoteInput.assign(compute.getInputs().size(), false);
+  promoted.newWeights.append(compute.getWeights().begin(), compute.getWeights().end());
+  promoted.newWeights.reserve(compute.getWeights().size() + compute.getInputs().size());
+  promoted.newInputs.reserve(compute.getInputs().size());
+  promoted.newInputTypes.reserve(compute.getInputs().size());
+  promoted.newInputLocs.reserve(compute.getInputs().size());
+
+  bool needsRewrite = false;
+  for (auto [inputIdx, input] : llvm::enumerate(compute.getInputs())) {
+    if (!isWeightLikeComputeOperand(input))
+      goto keep_input;
+    if (isDirectConstantValue(input) && !canPromoteInputBlockArgument(compute.getInputArgument(inputIdx)))
+      goto keep_input;
+    promoted.promoteInput[inputIdx] = true;
+    promoted.newWeights.push_back(input);
+    needsRewrite = true;
+    continue;
+
+keep_input:
+    promoted.newInputs.push_back(input);
+    promoted.newInputTypes.push_back(input.getType());
+    promoted.newInputLocs.push_back(input.getLoc());
+  }
+
+  if (!needsRewrite)
+    return failure();
+  return promoted;
+}
+
+template <typename ComputeOpTy>
+static LogicalResult mapPromotedInputArguments(ComputeOpTy compute,
+                                               const PromotedOperands& promoted,
+                                               IRRewriter& bodyRewriter,
+                                               IRMapping& mapper,
+                                               std::function<std::optional<BlockArgument>(size_t)> getNewInputArg,
+                                               PatternRewriter& rewriter) {
+  size_t newInputIdx = 0;
+  for (auto [oldInputIdx, input] : llvm::enumerate(compute.getInputs())) {
+    auto oldArg = compute.getInputArgument(oldInputIdx);
+    if (!oldArg)
+      return rewriter.notifyMatchFailure(compute, "missing input block argument during rewrite");
+    if (!promoted.promoteInput[oldInputIdx]) {
+      auto newInputArg = getNewInputArg(newInputIdx++);
+      if (!newInputArg)
+        return rewriter.notifyMatchFailure(compute, "missing rewritten input block argument");
+      mapper.map(*oldArg, *newInputArg);
+      continue;
+    }
+
+    auto clonedValue = materializeWeightLikeValueInBlock(input, bodyRewriter, mapper);
+    if (failed(clonedValue))
+      return rewriter.notifyMatchFailure(compute, "failed to materialize promoted weight-like operand");
+    mapper.map(*oldArg, *clonedValue);
+  }
+  return success();
+}
+
+// Promotes foldable helper chains from runtime inputs to weights to avoid artificial compute inputs.
+struct PromoteWeightLikeComputeInputsPattern : OpRewritePattern<spatial::SpatCompute> {
+  using OpRewritePattern<spatial::SpatCompute>::OpRewritePattern;
+
+  LogicalResult matchAndRewrite(spatial::SpatCompute compute, PatternRewriter& rewriter) const override {
+    auto promoted = computePromotedOperands(compute);
+    if (failed(promoted))
+      return rewriter.notifyMatchFailure(compute, "no weight-like inputs to promote");
+    Block& oldBlock = compute.getBody().front();
+
+    rewriter.setInsertionPointAfter(compute);
+    auto newCompute = spatial::SpatCompute::create(
+      rewriter, compute.getLoc(), compute.getResultTypes(), promoted->newWeights, promoted->newInputs);
+    SmallVector<Type> newBlockArgTypes;
+    SmallVector<Location> newBlockArgLocs;
+    for (Value weight : promoted->newWeights) {
+      newBlockArgTypes.push_back(weight.getType());
+      newBlockArgLocs.push_back(weight.getLoc());
+    }
+    llvm::append_range(newBlockArgTypes, promoted->newInputTypes);
+    llvm::append_range(newBlockArgLocs, promoted->newInputLocs);
+    auto* newBlock = rewriter.createBlock(
+      &newCompute.getBody(), newCompute.getBody().end(), TypeRange(newBlockArgTypes), newBlockArgLocs);
+    newCompute.getProperties().setOperandSegmentSizes(
+      {static_cast<int>(promoted->newWeights.size()), static_cast<int>(promoted->newInputs.size())});
+    rewriter.setInsertionPointToStart(newBlock);
+
+    IRRewriter bodyRewriter(rewriter.getContext());
+    bodyRewriter.setInsertionPointToStart(newBlock);
+
+    IRMapping mapper;
+    for (auto [weightIndex, weight] : llvm::enumerate(compute.getWeights())) {
+      auto oldWeightArg = compute.getWeightArgument(weightIndex);
+      auto newWeightArg = newCompute.getWeightArgument(weightIndex);
+      if (!oldWeightArg || !newWeightArg)
+        return rewriter.notifyMatchFailure(compute, "missing compute weight block argument during rewrite");
+      mapper.map(*oldWeightArg, *newWeightArg);
+    }
+    if (failed(mapPromotedInputArguments(
+          compute,
+          *promoted,
+          bodyRewriter,
+          mapper,
+          [&](size_t index) { return newCompute.getInputArgument(index); },
+          rewriter)))
+      return failure();
+
+    for (Operation& op : oldBlock.without_terminator())
+      rewriter.clone(op, mapper);
+
+    auto oldYield = cast<spatial::SpatYieldOp>(oldBlock.getTerminator());
+    SmallVector<Value> newYieldOperands;
+    newYieldOperands.reserve(oldYield.getOutputs().size());
+    for (Value operand : oldYield.getOutputs()) {
+      auto mapped = mapper.lookupOrNull(operand);
+      newYieldOperands.push_back(mapped ? cast<Value>(mapped) : operand);
+    }
+    spatial::SpatYieldOp::create(rewriter, oldYield.getLoc(), newYieldOperands);
+
+    rewriter.replaceOp(compute, newCompute.getResults());
+    return success();
+  }
+};
+
+// Promotes foldable batch helper chains to weights while preserving compact compute_batch IR.
+struct PromoteWeightLikeComputeBatchInputsPattern : OpRewritePattern<spatial::SpatComputeBatch> {
+  using OpRewritePattern<spatial::SpatComputeBatch>::OpRewritePattern;
+
+  LogicalResult matchAndRewrite(spatial::SpatComputeBatch compute, PatternRewriter& rewriter) const override {
+    auto promoted = computePromotedOperands(compute);
+    if (failed(promoted))
+      return rewriter.notifyMatchFailure(compute, "no weight-like batch inputs to promote");
+    Block& oldBlock = compute.getBody().front();
+
+    rewriter.setInsertionPointAfter(compute);
+
+    auto laneCountAttr = pim::getCheckedI32Attr(
+      rewriter, compute, static_cast<uint64_t>(compute.getLaneCount()), "promoted compute_batch lane count");
+    if (failed(laneCountAttr))
+      return failure();
+    auto newCompute = spatial::SpatComputeBatch::create(
+      rewriter, compute.getLoc(), compute.getResultTypes(), *laneCountAttr, promoted->newWeights, promoted->newInputs);
+    auto laneArg = compute.getLaneArgument();
+    if (!laneArg)
+      return rewriter.notifyMatchFailure(compute, "missing compute_batch lane block argument");
+    SmallVector<Type> newBlockArgTypes;
+    SmallVector<Location> newBlockArgLocs;
+    newBlockArgTypes.reserve(1 + promoted->newWeights.size() + promoted->newInputTypes.size()
+                             + compute.getNumResults());
+    newBlockArgLocs.reserve(1 + promoted->newWeights.size() + promoted->newInputLocs.size() + compute.getNumResults());
+    newBlockArgTypes.push_back(laneArg->getType());
+    newBlockArgLocs.push_back(laneArg->getLoc());
+    for (Value weight : promoted->newWeights) {
+      newBlockArgTypes.push_back(weight.getType());
+      newBlockArgLocs.push_back(weight.getLoc());
+    }
+    llvm::append_range(newBlockArgTypes, promoted->newInputTypes);
+    llvm::append_range(newBlockArgLocs, promoted->newInputLocs);
+    for (auto [resultIndex, resultType] : llvm::enumerate(compute.getResultTypes())) {
+      auto outputArg = compute.getOutputArgument(resultIndex);
+      if (!outputArg)
+        return rewriter.notifyMatchFailure(compute, "missing compute_batch output block argument");
+      newBlockArgTypes.push_back(resultType);
+      newBlockArgLocs.push_back(outputArg->getLoc());
+    }
+
+    auto* newBlock = rewriter.createBlock(
+      &newCompute.getBody(), newCompute.getBody().end(), TypeRange(newBlockArgTypes), newBlockArgLocs);
+    newCompute.getProperties().setOperandSegmentSizes(
+      {static_cast<int>(promoted->newWeights.size()), static_cast<int>(promoted->newInputs.size())});
+    rewriter.setInsertionPointToStart(newBlock);
+
+    IRRewriter bodyRewriter(rewriter.getContext());
+    bodyRewriter.setInsertionPointToStart(newBlock);
+
+    IRMapping mapper;
+    auto newLaneArg = newCompute.getLaneArgument();
+    if (!newLaneArg)
+      return rewriter.notifyMatchFailure(compute, "missing rewritten compute_batch lane block argument");
+    mapper.map(*laneArg, *newLaneArg);
+    for (auto [weightIndex, weight] : llvm::enumerate(compute.getWeights())) {
+      auto oldWeightArg = compute.getWeightArgument(weightIndex);
+      auto newWeightArg = newCompute.getWeightArgument(weightIndex);
+      if (!oldWeightArg || !newWeightArg)
+        return rewriter.notifyMatchFailure(compute, "missing compute_batch weight block argument during rewrite");
+      mapper.map(*oldWeightArg, *newWeightArg);
+    }
+    if (failed(mapPromotedInputArguments(
+          compute,
+          *promoted,
+          bodyRewriter,
+          mapper,
+          [&](size_t index) { return newCompute.getInputArgument(index); },
+          rewriter)))
+      return failure();
+    for (auto resultIndex : llvm::seq<size_t>(0, compute.getNumResults())) {
+      auto outputArg = compute.getOutputArgument(resultIndex);
+      if (!outputArg)
+        return rewriter.notifyMatchFailure(compute, "missing compute_batch output block argument during rewrite");
+      mapper.map(*outputArg,
+                 newBlock->getArgument(1 + promoted->newWeights.size() + promoted->newInputs.size() + resultIndex));
+    }
+
+    for (Operation& op : oldBlock)
+      rewriter.clone(op, mapper);
+
+    rewriter.replaceOp(compute, newCompute.getResults());
+    return success();
+  }
+};
+
+} // namespace
+
+void populateWeightPromotionPatterns(RewritePatternSet& patterns, MLIRContext* ctx) {
+  patterns.add<PromoteWeightLikeComputeInputsPattern, PromoteWeightLikeComputeBatchInputsPattern>(ctx);
+}
+
+void annotateWeightsConstants(func::FuncOp funcOp) {
+  funcOp.walk([&](arith::ConstantOp constantOp) {
+    if (hasOnlySpatialMvmVmmWeightUses(constantOp.getResult()))
+      markWeightAlways(constantOp);
+  });
+}
+
+bool requiresPostRewrite(spatial::SpatCompute computeOp) { return hasPromotableWeightLikeInputs(computeOp); }
+
+bool requiresPostRewrite(spatial::SpatComputeBatch computeOp) { return hasPromotableWeightLikeInputs(computeOp); }
+
+} // namespace onnx_mlir
@@ -1,6 +1,5 @@
 #include "src/Accelerators/PIM/Conversion/ONNXToSpatial/Common/Common.hpp"
-#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/ConversionPatterns.hpp"
-#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/PrePatterns.hpp"
+#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/Patterns.hpp"

 using namespace mlir;

@@ -12,14 +11,12 @@ namespace {

 } // namespace

-void populatePrePatterns(mlir::RewritePatternSet& patterns, mlir::MLIRContext* ctx) {
+void populateGeneratedPrePatterns(mlir::RewritePatternSet& patterns, mlir::MLIRContext* ctx) {
  patterns.add<onnxToArithConstant>(ctx);
  patterns.add<convAddToConvWithBiasLeft>(ctx);
  patterns.add<convAddToConvWithBiasRight>(ctx);
  patterns.add<matMulAddToGemm>(ctx);
-  patterns.add<matMulToGemm>(ctx);
  patterns.add<removeFlattenSameShape>(ctx);
-  populateMatMulRewritePatterns(patterns, ctx);
 }

 } // namespace onnx_mlir
@@ -3,7 +3,7 @@

 #include "src/Accelerators/PIM/Conversion/ONNXToSpatial/Common/Common.hpp"
 #include "src/Accelerators/PIM/Conversion/ONNXToSpatial/Common/ComputeRegionBuilder.hpp"
-#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/HostFoldability.hpp"
+#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/CompileTime.hpp"
 #include "src/Accelerators/PIM/Dialect/Spatial/SpatialOps.hpp"
 #include "src/Dialect/ONNX/ONNXOps.hpp"

@@ -20,7 +20,7 @@ struct Concat : public OpConversionPattern<ONNXConcatOp> {
    auto inputs = adaptor.getInputs();
    int64_t axis = adaptor.getAxis();

-    if (llvm::all_of(inputs, isHostFoldableValue)) {
+    if (llvm::all_of(inputs, isCompileTimeComputable)) {
      rewriter.replaceOp(maxpoolOp, createSpatConcat(rewriter, maxpoolOp.getLoc(), axis, inputs));
      return success();
    }
@@ -6,7 +6,7 @@
 #include "llvm/ADT/SmallVector.h"

 #include "src/Accelerators/PIM/Conversion/ONNXToSpatial/Common/Common.hpp"
-#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/ConversionPatterns.hpp"
+#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/Patterns.hpp"
 #include "src/Accelerators/PIM/Dialect/Spatial/SpatialOps.hpp"
 #include "src/Dialect/ONNX/ONNXOps.hpp"

@@ -15,24 +15,6 @@ using namespace mlir;
 namespace onnx_mlir {
 namespace {

-static int64_t normalizeAxis(int64_t axis, int64_t rank) { return axis >= 0 ? axis : rank + axis; }
-
-static int64_t normalizeIndex(int64_t index, int64_t dimSize) { return index >= 0 ? index : dimSize + index; }
-
-static Value
-extractSliceAt(Value input, int64_t axis, int64_t offset, ConversionPatternRewriter& rewriter, Location loc) {
-  auto inputType = cast<RankedTensorType>(input.getType());
-  SmallVector<OpFoldResult> offsets(inputType.getRank(), rewriter.getIndexAttr(0));
-  SmallVector<OpFoldResult> sizes;
-  SmallVector<OpFoldResult> strides(inputType.getRank(), rewriter.getIndexAttr(1));
-  sizes.reserve(inputType.getRank());
-  for (int64_t dim : inputType.getShape())
-    sizes.push_back(rewriter.getIndexAttr(dim));
-  offsets[axis] = rewriter.getIndexAttr(offset);
-  sizes[axis] = rewriter.getIndexAttr(1);
-  return tensor::ExtractSliceOp::create(rewriter, loc, input, offsets, sizes, strides);
-}
-
 static Value concatGatherSlices(Value data,
                                int64_t axis,
                                ArrayRef<int64_t> indices,
@@ -45,7 +27,7 @@ static Value concatGatherSlices(Value data,
    int64_t normalizedIndex = normalizeIndex(index, axisDim);
    if (normalizedIndex < 0 || normalizedIndex >= axisDim)
      return {};
-    slices.push_back(extractSliceAt(data, axis, normalizedIndex, rewriter, loc));
+    slices.push_back(extractAxisSlice(rewriter, loc, data, axis, normalizedIndex, /*size=*/1));
  }
  if (slices.empty())
    return {};
@@ -96,11 +78,11 @@ struct Gather : OpConversionPattern<ONNXGatherOp> {
      return failure();

    int64_t rank = dataType.getRank();
-    int64_t axis = normalizeAxis(gatherOp.getAxis(), rank);
-    if (axis < 0 || axis >= rank)
+    auto axis = normalizeAxisChecked(gatherOp.getAxis(), rank);
+    if (failed(axis))
      return failure();

-    int64_t axisDim = dataType.getShape()[axis];
+    int64_t axisDim = dataType.getShape()[*axis];
    if (axisDim <= 0)
      return failure();

@@ -116,7 +98,7 @@ struct Gather : OpConversionPattern<ONNXGatherOp> {
                           [&](Value data) -> LogicalResult {
                             Value result;
                             if (indicesType.getRank() == 1) {
-                               result = concatGatherSlices(data, axis, flatIndices, axisDim, rewriter, loc);
+                               result = concatGatherSlices(data, *axis, flatIndices, axisDim, rewriter, loc);
                             }
                             else if (indicesType.getRank() == 2) {
                               int64_t rowCount = indicesType.getShape()[0];
@@ -125,12 +107,13 @@ struct Gather : OpConversionPattern<ONNXGatherOp> {
                               rows.reserve(rowCount);
                               for (int64_t row = 0; row < rowCount; ++row) {
                                 ArrayRef<int64_t> rowIndices(flatIndices.data() + row * rowWidth, rowWidth);
-                                 Value gatheredRow = concatGatherSlices(data, axis, rowIndices, axisDim, rewriter, loc);
+                                 Value gatheredRow =
+                                   concatGatherSlices(data, *axis, rowIndices, axisDim, rewriter, loc);
                                 if (!gatheredRow)
                                   return failure();
-                                 rows.push_back(addLeadingGatherDim(gatheredRow, axis, rewriter, loc));
+                                 rows.push_back(addLeadingGatherDim(gatheredRow, *axis, rewriter, loc));
                               }
-                               result = createSpatConcat(rewriter, loc, /*axis=*/axis, rows);
+                               result = createSpatConcat(rewriter, loc, /*axis=*/*axis, rows);
                             }
                             else {
                               return failure();
@@ -4,8 +4,8 @@
 #include "llvm/ADT/SmallVector.h"

 #include "src/Accelerators/PIM/Conversion/ONNXToSpatial/Common/Common.hpp"
-#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/ConversionPatterns.hpp"
-#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/HostFoldability.hpp"
+#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/CompileTime.hpp"
+#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/Patterns.hpp"
 #include "src/Accelerators/PIM/Dialect/Spatial/SpatialOps.hpp"
 #include "src/Dialect/ONNX/ONNXOps.hpp"

@@ -14,10 +14,6 @@ using namespace mlir;
 namespace onnx_mlir {
 namespace {

-static bool haveStaticPositiveShape(ArrayRef<int64_t> shape) {
-  return llvm::all_of(shape, [](int64_t dim) { return dim > 0; });
-}
-
 static bool inferCollapseReassociation(ArrayRef<int64_t> sourceShape,
                                       ArrayRef<int64_t> resultShape,
                                       SmallVector<ReassociationIndices>& reassociation) {
@@ -80,6 +76,22 @@ static bool inferExpandReassociation(ArrayRef<int64_t> sourceShape,
  return sourceIdx == sourceShape.size() && resultIdx == resultShape.size();
 }

+static SmallVector<ReassociationIndices> getCollapseTo1DReassociation(size_t rank) {
+  SmallVector<ReassociationIndices> reassociation(1);
+  reassociation.front().reserve(rank);
+  for (size_t dim = 0; dim < rank; ++dim)
+    reassociation.front().push_back(static_cast<int64_t>(dim));
+  return reassociation;
+}
+
+static SmallVector<ReassociationIndices> getExpandFrom1DReassociation(size_t rank) {
+  SmallVector<ReassociationIndices> reassociation(1);
+  reassociation.front().reserve(rank);
+  for (size_t dim = 0; dim < rank; ++dim)
+    reassociation.front().push_back(static_cast<int64_t>(dim));
+  return reassociation;
+}
+
 struct Reshape : OpConversionPattern<ONNXReshapeOp> {
  using OpConversionPattern::OpConversionPattern;

@@ -90,7 +102,7 @@ struct Reshape : OpConversionPattern<ONNXReshapeOp> {
    auto resultType = dyn_cast<RankedTensorType>(reshapeOp.getReshaped().getType());
    if (!sourceType || !resultType || !sourceType.hasStaticShape() || !resultType.hasStaticShape())
      return failure();
-    if (!haveStaticPositiveShape(sourceType.getShape()) || !haveStaticPositiveShape(resultType.getShape()))
+    if (!hasStaticPositiveShape(sourceType) || !hasStaticPositiveShape(resultType))
      return failure();

    if (sourceType == resultType) {
@@ -99,17 +111,9 @@ struct Reshape : OpConversionPattern<ONNXReshapeOp> {
    }

    auto replaceWithReshape = [&](auto buildReshape) -> LogicalResult {
-      if (isHostFoldableValue(adaptor.getData())) {
-        rewriter.replaceOp(reshapeOp, buildReshape(adaptor.getData()));
-        return success();
-      }
-
-      auto computeOp = createSpatCompute<1>(
-        rewriter, reshapeOp.getLoc(), TypeRange {resultType}, {}, adaptor.getData(), [&](Value data) {
-          Value reshaped = buildReshape(data);
-          spatial::SpatYieldOp::create(rewriter, reshapeOp.getLoc(), reshaped);
-        });
-      rewriter.replaceOp(reshapeOp, computeOp.getResults());
+      Value reshaped =
+        materializeOrComputeUnary(adaptor.getData(), resultType, rewriter, reshapeOp.getLoc(), buildReshape);
+      rewriter.replaceOp(reshapeOp, reshaped);
      return success();
    };

@@ -126,6 +130,23 @@ struct Reshape : OpConversionPattern<ONNXReshapeOp> {
        return tensor::ExpandShapeOp::create(rewriter, reshapeOp.getLoc(), resultType, data, reassociation);
      });

+    if (sourceType.getNumElements() != resultType.getNumElements())
+      return failure();
+
+    return replaceWithReshape([&](Value data) -> Value {
+      Value reshaped = data;
+      if (sourceType.getRank() != 1) {
+        auto flatType = RankedTensorType::get({sourceType.getNumElements()}, sourceType.getElementType());
+        reshaped = tensor::CollapseShapeOp::create(
+          rewriter, reshapeOp.getLoc(), flatType, reshaped, getCollapseTo1DReassociation(sourceType.getRank()));
+      }
+      if (resultType.getRank() == 1)
+        return reshaped;
+      return tensor::ExpandShapeOp::create(
+               rewriter, reshapeOp.getLoc(), resultType, reshaped, getExpandFrom1DReassociation(resultType.getRank()))
+        .getResult();
+    });
+
    return failure();
  }
 };
@@ -1,12 +1,13 @@
+#include "mlir/Dialect/Arith/IR/Arith.h"
+#include "mlir/Dialect/SCF/IR/SCF.h"
 #include "mlir/Dialect/Tensor/IR/Tensor.h"
 #include "mlir/Transforms/DialectConversion.h"

 #include "llvm/ADT/STLExtras.h"

-#include <algorithm>
-
+#include "src/Accelerators/PIM/Common/IR/LoopUtils.hpp"
 #include "src/Accelerators/PIM/Conversion/ONNXToSpatial/Common/Common.hpp"
-#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/ConversionPatterns.hpp"
+#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/Patterns.hpp"
 #include "src/Accelerators/PIM/Dialect/Spatial/SpatialOps.hpp"
 #include "src/Dialect/ONNX/ONNXOps.hpp"

@@ -15,42 +16,127 @@ using namespace mlir;
 namespace onnx_mlir {
 namespace {

-static Value
-extractSliceAt(Value input, int64_t axis, int64_t offset, ConversionPatternRewriter& rewriter, Location loc) {
-  auto inputType = cast<RankedTensorType>(input.getType());
-  SmallVector<OpFoldResult> offsets(inputType.getRank(), rewriter.getIndexAttr(0));
-  SmallVector<OpFoldResult> sizes;
-  SmallVector<OpFoldResult> strides(inputType.getRank(), rewriter.getIndexAttr(1));
-  sizes.reserve(inputType.getRank());
-  for (int64_t dim : inputType.getShape())
-    sizes.push_back(rewriter.getIndexAttr(dim));
-  offsets[axis] = rewriter.getIndexAttr(offset);
-  sizes[axis] = rewriter.getIndexAttr(1);
-  return tensor::ExtractSliceOp::create(rewriter, loc, input, offsets, sizes, strides);
+static Value buildNearestAsymmetricIndex(
+  Value outputIndex, int64_t inputDim, int64_t outputDim, ConversionPatternRewriter& rewriter, Location loc) {
+  Operation* anchorOp = rewriter.getInsertionBlock()->getParentOp();
+  Value cInputDim = getOrCreateIndexConstant(rewriter, anchorOp, inputDim);
+  Value cOutputDim = getOrCreateIndexConstant(rewriter, anchorOp, outputDim);
+  Value cInputDimLast = getOrCreateIndexConstant(rewriter, anchorOp, inputDim - 1);
+  Value scaledIndex = arith::MulIOp::create(rewriter, loc, outputIndex, cInputDim);
+  Value inputIndex = arith::DivUIOp::create(rewriter, loc, scaledIndex, cOutputDim);
+  return arith::MinUIOp::create(rewriter, loc, inputIndex, cInputDimLast);
 }

-static int64_t nearestAsymmetricIndex(int64_t outputIndex, int64_t inputDim, int64_t outputDim) {
-  return std::min<int64_t>((outputIndex * inputDim) / outputDim, inputDim - 1);
-}
+static FailureOr<Value> buildNearestResizeLoop(Value input,
+                                               RankedTensorType inputType,
+                                               RankedTensorType resultType,
+                                               ConversionPatternRewriter& rewriter,
+                                               Location loc) {
+  auto elemType = resultType.getElementType();
+  SmallVector<int64_t> unitShape(resultType.getRank(), 1);
+  auto unitTensorType = RankedTensorType::get(unitShape, elemType);

-static Value buildNearestResize(Value input,
-                                ArrayRef<int64_t> inputShape,
-                                ArrayRef<int64_t> outputShape,
-                                int64_t axis,
-                                ConversionPatternRewriter& rewriter,
-                                Location loc) {
-  if (axis == static_cast<int64_t>(outputShape.size()))
-    return input;
+  SmallVector<OpFoldResult> unitSizes(resultType.getRank(), rewriter.getIndexAttr(1));
+  SmallVector<OpFoldResult> unitStrides(resultType.getRank(), rewriter.getIndexAttr(1));

-  SmallVector<Value> slices;
-  slices.reserve(outputShape[axis]);
-  for (int64_t outputIndex = 0; outputIndex < outputShape[axis]; ++outputIndex) {
-    int64_t inputIndex = nearestAsymmetricIndex(outputIndex, inputShape[axis], outputShape[axis]);
-    Value slice = extractSliceAt(input, axis, inputIndex, rewriter, loc);
-    slices.push_back(buildNearestResize(slice, inputShape, outputShape, axis + 1, rewriter, loc));
-  }
+  Operation* anchorOp = rewriter.getInsertionBlock()->getParentOp();
+  Value c0 = getOrCreateIndexConstant(rewriter, anchorOp, 0);
+  Value c1 = getOrCreateIndexConstant(rewriter, anchorOp, 1);
+  Value cOutputN = getOrCreateIndexConstant(rewriter, anchorOp, resultType.getDimSize(0));
+  Value cOutputC = getOrCreateIndexConstant(rewriter, anchorOp, resultType.getDimSize(1));
+  Value cOutputH = getOrCreateIndexConstant(rewriter, anchorOp, resultType.getDimSize(2));
+  Value cOutputW = getOrCreateIndexConstant(rewriter, anchorOp, resultType.getDimSize(3));

-  return createSpatConcat(rewriter, loc, axis, slices);
+  Value outputInit = tensor::EmptyOp::create(rewriter, loc, resultType.getShape(), elemType);
+
+  auto batchLoop = buildNormalizedScfFor(
+    rewriter,
+    loc,
+    c0,
+    cOutputN,
+    c1,
+    ValueRange {outputInit},
+    [&](OpBuilder&, Location nestedLoc, Value outputN, ValueRange batchIterArgs, SmallVectorImpl<Value>& batchYielded) {
+      Value outputBatchAcc = batchIterArgs.front();
+      Value inputN =
+        buildNearestAsymmetricIndex(outputN, inputType.getDimSize(0), resultType.getDimSize(0), rewriter, nestedLoc);
+
+      auto channelLoop = buildNormalizedScfFor(
+        rewriter,
+        nestedLoc,
+        c0,
+        cOutputC,
+        c1,
+        ValueRange {outputBatchAcc},
+        [&](OpBuilder&,
+            Location channelLoc,
+            Value outputC,
+            ValueRange channelIterArgs,
+            SmallVectorImpl<Value>& channelYielded) {
+          Value outputChannelAcc = channelIterArgs.front();
+          Value inputC = buildNearestAsymmetricIndex(
+            outputC, inputType.getDimSize(1), resultType.getDimSize(1), rewriter, channelLoc);
+
+          auto heightLoop = buildNormalizedScfFor(
+            rewriter,
+            channelLoc,
+            c0,
+            cOutputH,
+            c1,
+            ValueRange {outputChannelAcc},
+            [&](OpBuilder&,
+                Location heightLoc,
+                Value outputH,
+                ValueRange heightIterArgs,
+                SmallVectorImpl<Value>& heightYielded) {
+              Value outputHeightAcc = heightIterArgs.front();
+              Value inputH = buildNearestAsymmetricIndex(
+                outputH, inputType.getDimSize(2), resultType.getDimSize(2), rewriter, heightLoc);
+
+              auto widthLoop = buildNormalizedScfFor(
+                rewriter,
+                heightLoc,
+                c0,
+                cOutputW,
+                c1,
+                ValueRange {outputHeightAcc},
+                [&](OpBuilder&,
+                    Location widthLoc,
+                    Value outputW,
+                    ValueRange widthIterArgs,
+                    SmallVectorImpl<Value>& widthYielded) {
+                  Value outputWidthAcc = widthIterArgs.front();
+                  Value inputW = buildNearestAsymmetricIndex(
+                    outputW, inputType.getDimSize(3), resultType.getDimSize(3), rewriter, widthLoc);
+
+                  SmallVector<OpFoldResult> inputOffsets = {inputN, inputC, inputH, inputW};
+                  Value inputSlice = tensor::ExtractSliceOp::create(
+                    rewriter, widthLoc, unitTensorType, input, inputOffsets, unitSizes, unitStrides);
+
+                  SmallVector<OpFoldResult> outputOffsets = {outputN, outputC, outputH, outputW};
+                  Value updatedOutput = tensor::InsertSliceOp::create(
+                    rewriter, widthLoc, inputSlice, outputWidthAcc, outputOffsets, unitSizes, unitStrides);
+                  widthYielded.push_back(updatedOutput);
+                  return success();
+                });
+              if (failed(widthLoop))
+                return failure();
+              heightYielded.push_back(widthLoop->results.front());
+              return success();
+            });
+          if (failed(heightLoop))
+            return failure();
+          channelYielded.push_back(heightLoop->results.front());
+          return success();
+        });
+      if (failed(channelLoop))
+        return failure();
+      batchYielded.push_back(channelLoop->results.front());
+      return success();
+    });
+  if (failed(batchLoop))
+    return failure();
+  return batchLoop->results.front();
 }

 struct Resize : OpConversionPattern<ONNXResizeOp> {
@@ -62,23 +148,30 @@ struct Resize : OpConversionPattern<ONNXResizeOp> {
    auto inputType = dyn_cast<RankedTensorType>(adaptor.getX().getType());
    auto resultType = dyn_cast<RankedTensorType>(resizeOp.getY().getType());
    if (!inputType || !resultType || !inputType.hasStaticShape() || !resultType.hasStaticShape())
-      return failure();
+      return rewriter.notifyMatchFailure(resizeOp, "resize lowering requires static ranked tensor types.");
+    if (inputType.getRank() != 4 || resultType.getRank() != 4)
+      return rewriter.notifyMatchFailure(resizeOp, "resize lowering currently supports only rank-4 NCHW tensors.");

    if (resizeOp.getMode() != "nearest" || resizeOp.getCoordinateTransformationMode() != "asymmetric"
        || resizeOp.getNearestMode() != "floor")
-      return failure();
+      return rewriter.notifyMatchFailure(resizeOp,
+                                         "resize lowering currently supports only nearest + asymmetric + floor.");

    if (llvm::any_of(inputType.getShape(), [](int64_t dim) { return dim <= 0; })
        || llvm::any_of(resultType.getShape(), [](int64_t dim) { return dim <= 0; }))
-      return failure();
+      return rewriter.notifyMatchFailure(resizeOp, "resize lowering requires positive static dimensions.");

-    auto computeOp =
-      createSpatCompute<1>(rewriter, resizeOp.getLoc(), TypeRange {resultType}, {}, adaptor.getX(), [&](Value x) {
-        Value result =
-          buildNearestResize(x, inputType.getShape(), resultType.getShape(), /*axis=*/0, rewriter, resizeOp.getLoc());
-        spatial::SpatYieldOp::create(rewriter, resizeOp.getLoc(), result);
+    auto computeOp = createSpatCompute<1>(
+      rewriter, resizeOp.getLoc(), TypeRange {resultType}, {}, adaptor.getX(), [&](Value x) -> LogicalResult {
+        auto result = buildNearestResizeLoop(x, inputType, resultType, rewriter, resizeOp.getLoc());
+        if (failed(result))
+          return failure();
+        spatial::SpatYieldOp::create(rewriter, resizeOp.getLoc(), *result);
+        return success();
      });
-    rewriter.replaceOp(resizeOp, computeOp.getResults());
+    if (failed(computeOp))
+      return failure();
+    rewriter.replaceOp(resizeOp, computeOp->getResults());
    return success();
  }
 };
@@ -0,0 +1,189 @@
+#include "mlir/Dialect/Tensor/IR/Tensor.h"
+#include "mlir/IR/BuiltinAttributes.h"
+#include "mlir/Transforms/DialectConversion.h"
+
+#include "llvm/ADT/SmallVector.h"
+
+#include <algorithm>
+#include <optional>
+
+#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/Common/Common.hpp"
+#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/CompileTime.hpp"
+#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/Patterns.hpp"
+#include "src/Accelerators/PIM/Dialect/Spatial/SpatialOps.hpp"
+#include "src/Dialect/ONNX/ONNXOps.hpp"
+
+using namespace mlir;
+
+namespace onnx_mlir {
+namespace {
+
+static FailureOr<SmallVector<int64_t>> getConstantIntValues(Value value) {
+  auto denseAttr = dyn_cast_or_null<DenseIntElementsAttr>(getHostConstDenseElementsAttr(value));
+  if (!denseAttr)
+    return failure();
+  return SmallVector<int64_t>(denseAttr.getValues<int64_t>().begin(), denseAttr.getValues<int64_t>().end());
+}
+
+static bool isNoneValueLike(Value value) { return isa_and_nonnull<ONNXNoneOp>(value.getDefiningOp()); }
+
+static FailureOr<Value> buildSlice(Value data,
+                                   RankedTensorType dataType,
+                                   RankedTensorType resultType,
+                                   ArrayRef<int64_t> starts,
+                                   ArrayRef<int64_t> ends,
+                                   std::optional<ArrayRef<int64_t>> axes,
+                                   std::optional<ArrayRef<int64_t>> steps,
+                                   ConversionPatternRewriter& rewriter,
+                                   Location loc) {
+  int64_t rank = dataType.getRank();
+  if (!dataType.hasStaticShape() || !resultType.hasStaticShape() || resultType.getRank() != rank)
+    return failure();
+
+  if (starts.size() != ends.size())
+    return failure();
+  if (axes && axes->size() != starts.size())
+    return failure();
+  if (steps && steps->size() != starts.size())
+    return failure();
+
+  SmallVector<int64_t> normalizedAxes;
+  if (axes) {
+    SmallVector<bool> seenAxes(rank, false);
+    normalizedAxes.reserve(axes->size());
+    for (int64_t axis : *axes) {
+      auto normalizedAxis = normalizeAxisChecked(axis, rank);
+      if (failed(normalizedAxis))
+        return failure();
+      if (seenAxes[*normalizedAxis])
+        return failure();
+      seenAxes[*normalizedAxis] = true;
+      normalizedAxes.push_back(*normalizedAxis);
+    }
+  }
+  else {
+    if (starts.size() > static_cast<size_t>(rank))
+      return failure();
+    normalizedAxes.reserve(starts.size());
+    for (size_t i = 0; i < starts.size(); ++i)
+      normalizedAxes.push_back(static_cast<int64_t>(i));
+  }
+
+  SmallVector<int64_t> normalizedSteps;
+  if (steps)
+    normalizedSteps.assign(steps->begin(), steps->end());
+  else
+    normalizedSteps.assign(starts.size(), 1);
+
+  SmallVector<int64_t> computedShape(dataType.getShape().begin(), dataType.getShape().end());
+  SmallVector<OpFoldResult> offsets = getZeroOffsets(rewriter, rank);
+  SmallVector<OpFoldResult> sizes = getStaticSizes(rewriter, dataType.getShape());
+  SmallVector<OpFoldResult> strides = getUnitStrides(rewriter, rank);
+
+  for (auto [sliceIndex, axis] : llvm::enumerate(normalizedAxes)) {
+    int64_t step = normalizedSteps[sliceIndex];
+    if (step <= 0)
+      return failure();
+
+    int64_t dimSize = dataType.getShape()[axis];
+    int64_t start = starts[sliceIndex];
+    int64_t end = ends[sliceIndex];
+
+    start = normalizeIndex(start, dimSize);
+    end = normalizeIndex(end, dimSize);
+
+    start = std::clamp(start, int64_t {0}, dimSize);
+    end = std::clamp(end, int64_t {0}, dimSize);
+
+    int64_t extent = std::max(end - start, int64_t {0});
+    int64_t size = (extent + step - 1) / step;
+
+    offsets[axis] = rewriter.getIndexAttr(start);
+    sizes[axis] = rewriter.getIndexAttr(size);
+    strides[axis] = rewriter.getIndexAttr(step);
+    computedShape[axis] = size;
+  }
+
+  if (llvm::ArrayRef(computedShape) != resultType.getShape())
+    return failure();
+
+  return tensor::ExtractSliceOp::create(rewriter, loc, resultType, data, offsets, sizes, strides).getResult();
+}
+
+struct Slice final : OpConversionPattern<ONNXSliceOp> {
+  using OpConversionPattern::OpConversionPattern;
+
+  LogicalResult matchAndRewrite(ONNXSliceOp sliceOp,
+                                ONNXSliceOpAdaptor adaptor,
+                                ConversionPatternRewriter& rewriter) const override {
+    auto dataType = dyn_cast<RankedTensorType>(adaptor.getData().getType());
+    auto resultType = dyn_cast<RankedTensorType>(sliceOp.getResult().getType());
+    if (!dataType || !resultType || !dataType.hasStaticShape() || !resultType.hasStaticShape())
+      return failure();
+
+    auto starts = getConstantIntValues(adaptor.getStarts());
+    auto ends = getConstantIntValues(adaptor.getEnds());
+    if (failed(starts))
+      return rewriter.notifyMatchFailure(sliceOp, "requires compile-time constant starts");
+    if (failed(ends))
+      return rewriter.notifyMatchFailure(sliceOp, "requires compile-time constant ends");
+
+    std::optional<SmallVector<int64_t>> axes;
+    if (!isNoneValueLike(adaptor.getAxes())) {
+      auto parsedAxes = getConstantIntValues(adaptor.getAxes());
+      if (failed(parsedAxes))
+        return rewriter.notifyMatchFailure(sliceOp, "requires compile-time constant axes when present");
+      axes = std::move(*parsedAxes);
+    }
+
+    std::optional<SmallVector<int64_t>> steps;
+    if (!isNoneValueLike(adaptor.getSteps())) {
+      auto parsedSteps = getConstantIntValues(adaptor.getSteps());
+      if (failed(parsedSteps))
+        return rewriter.notifyMatchFailure(sliceOp, "requires compile-time constant steps when present");
+      steps = std::move(*parsedSteps);
+      if (llvm::any_of(*steps, [](int64_t step) { return step <= 0; }))
+        return rewriter.notifyMatchFailure(sliceOp, "supports only positive constant steps");
+    }
+
+    ArrayRef<int64_t> startsRef = *starts;
+    ArrayRef<int64_t> endsRef = *ends;
+    std::optional<ArrayRef<int64_t>> axesRef = axes ? std::optional<ArrayRef<int64_t>>(ArrayRef<int64_t>(*axes))
+                                                    : std::nullopt;
+    std::optional<ArrayRef<int64_t>> stepsRef = steps ? std::optional<ArrayRef<int64_t>>(ArrayRef<int64_t>(*steps))
+                                                       : std::nullopt;
+
+    Location loc = sliceOp.getLoc();
+    auto tryBuildSlice = [&](Value data) {
+      return buildSlice(data, dataType, resultType, startsRef, endsRef, axesRef, stepsRef, rewriter, loc);
+    };
+
+    if (isCompileTimeComputable(adaptor.getData())) {
+      auto sliced = tryBuildSlice(adaptor.getData());
+      if (failed(sliced))
+        return rewriter.notifyMatchFailure(sliceOp, "failed to normalize static slice parameters");
+      rewriter.replaceOp(sliceOp, *sliced);
+      return success();
+    }
+
+    auto computeOp =
+      createSpatCompute<1>(rewriter, loc, TypeRange {resultType}, {}, adaptor.getData(), [&](Value data) {
+        auto sliced = tryBuildSlice(data);
+        if (failed(sliced))
+          return failure();
+        spatial::SpatYieldOp::create(rewriter, loc, *sliced);
+        return success();
+      });
+    if (failed(computeOp))
+      return rewriter.notifyMatchFailure(sliceOp, "failed to build runtime tensor.extract_slice lowering");
+
+    rewriter.replaceOp(sliceOp, computeOp->getResults());
+    return success();
+  }
+};
+
+} // namespace
+
+void populateSlicePatterns(RewritePatternSet& patterns, MLIRContext* ctx) { patterns.add<Slice>(ctx); }
+
+} // namespace onnx_mlir
@@ -2,8 +2,8 @@
 #include "mlir/Transforms/DialectConversion.h"

 #include "src/Accelerators/PIM/Conversion/ONNXToSpatial/Common/Common.hpp"
-#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/ConversionPatterns.hpp"
-#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/HostFoldability.hpp"
+#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/CompileTime.hpp"
+#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/Patterns.hpp"
 #include "src/Accelerators/PIM/Dialect/Spatial/SpatialOps.hpp"
 #include "src/Dialect/ONNX/ONNXOps.hpp"

@@ -12,25 +12,6 @@ using namespace mlir;
 namespace onnx_mlir {
 namespace {

-static int64_t normalizeAxis(int64_t axis, int64_t rank) { return axis >= 0 ? axis : rank + axis; }
-
-static Value extractSliceAt(
-  Value input, int64_t axis, int64_t offset, int64_t size, ConversionPatternRewriter& rewriter, Location loc) {
-  auto inputType = cast<RankedTensorType>(input.getType());
-  SmallVector<OpFoldResult> offsets(inputType.getRank(), rewriter.getIndexAttr(0));
-  SmallVector<OpFoldResult> sizes;
-  SmallVector<OpFoldResult> strides(inputType.getRank(), rewriter.getIndexAttr(1));
-  sizes.reserve(inputType.getRank());
-  for (int64_t dim : inputType.getShape())
-    sizes.push_back(rewriter.getIndexAttr(dim));
-  offsets[axis] = rewriter.getIndexAttr(offset);
-  sizes[axis] = rewriter.getIndexAttr(size);
-  SmallVector<int64_t> resultShape(inputType.getShape());
-  resultShape[axis] = size;
-  auto resultType = RankedTensorType::get(resultShape, inputType.getElementType());
-  return tensor::ExtractSliceOp::create(rewriter, loc, resultType, input, offsets, sizes, strides);
-}
-
 struct Split : OpConversionPattern<ONNXSplitOp> {
  using OpConversionPattern::OpConversionPattern;

@@ -41,8 +22,8 @@ struct Split : OpConversionPattern<ONNXSplitOp> {
      return failure();

    int64_t rank = inputType.getRank();
-    int64_t axis = normalizeAxis(splitOp.getAxis(), rank);
-    if (axis < 0 || axis >= rank)
+    auto axis = normalizeAxisChecked(splitOp.getAxis(), rank);
+    if (failed(axis))
      return failure();

    SmallVector<Value> outputs;
@@ -58,12 +39,12 @@ struct Split : OpConversionPattern<ONNXSplitOp> {
      if (!resultType || !resultType.hasStaticShape())
        return failure();
      resultTypes.push_back(resultType);
-      sliceSizes.push_back(resultType.getShape()[axis]);
+      sliceSizes.push_back(resultType.getShape()[*axis]);
    }

-    if (isHostFoldableValue(adaptor.getInput())) {
+    if (isCompileTimeComputable(adaptor.getInput())) {
      for (int64_t sliceSize : sliceSizes) {
-        outputs.push_back(extractSliceAt(adaptor.getInput(), axis, offset, sliceSize, rewriter, splitOp.getLoc()));
+        outputs.push_back(extractAxisSlice(rewriter, splitOp.getLoc(), adaptor.getInput(), *axis, offset, sliceSize));
        offset += sliceSize;
      }
      rewriter.replaceOp(splitOp, outputs);
@@ -76,7 +57,8 @@ struct Split : OpConversionPattern<ONNXSplitOp> {
        runtimeOutputs.reserve(resultTypes.size());
        int64_t runtimeOffset = 0;
        for (int64_t sliceSize : sliceSizes) {
-          runtimeOutputs.push_back(extractSliceAt(input, axis, runtimeOffset, sliceSize, rewriter, splitOp.getLoc()));
+          runtimeOutputs.push_back(
+            extractAxisSlice(rewriter, splitOp.getLoc(), input, *axis, runtimeOffset, sliceSize));
          runtimeOffset += sliceSize;
        }
        spatial::SpatYieldOp::create(rewriter, splitOp.getLoc(), runtimeOutputs);
@@ -0,0 +1,135 @@
+#include "mlir/Dialect/Arith/IR/Arith.h"
+#include "mlir/Dialect/Linalg/IR/Linalg.h"
+#include "mlir/Dialect/Tensor/IR/Tensor.h"
+#include "mlir/Transforms/DialectConversion.h"
+
+#include "llvm/ADT/SmallVector.h"
+
+#include "src/Accelerators/PIM/Common/IR/ShapeUtils.hpp"
+#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/Common/Common.hpp"
+#include "src/Accelerators/PIM/Conversion/ONNXToSpatial/Patterns.hpp"
+#include "src/Dialect/ONNX/ONNXOps.hpp"
+
+using namespace mlir;
+
+namespace onnx_mlir {
+namespace {
+
+static bool isInsideSpatialComputeRegion(Operation* op) {
+  return op->getParentOfType<spatial::SpatCompute>() || op->getParentOfType<spatial::SpatComputeBatch>();
+}
+
+static Value createTransposeInit(Value input,
+                                 RankedTensorType resultType,
+                                 ArrayRef<int64_t> permutation,
+                                 ConversionPatternRewriter& rewriter,
+                                 Location loc) {
+  SmallVector<OpFoldResult> sizes;
+  sizes.reserve(resultType.getRank());
+  for (auto [resultDim, sourceDim] : llvm::zip_equal(resultType.getShape(), permutation)) {
+    if (!ShapedType::isDynamic(resultDim)) {
+      sizes.push_back(rewriter.getIndexAttr(resultDim));
+      continue;
+    }
+    sizes.push_back(tensor::DimOp::create(rewriter, loc, input, sourceDim).getResult());
+  }
+  return tensor::EmptyOp::create(rewriter, loc, sizes, resultType.getElementType()).getResult();
+}
+
+static FailureOr<Value> materializeTransposedConstant(Value input,
+                                                      RankedTensorType resultType,
+                                                      ArrayRef<int64_t> permutation,
+                                                      ConversionPatternRewriter& rewriter,
+                                                      Location loc) {
+  auto denseAttr = getHostConstDenseElementsAttr(input);
+  if (!denseAttr)
+    return failure();
+
+  auto inputType = dyn_cast<RankedTensorType>(denseAttr.getType());
+  if (!inputType || !inputType.hasStaticShape() || !resultType.hasStaticShape()
+      || inputType.getRank() != resultType.getRank()
+      || static_cast<int64_t>(permutation.size()) != inputType.getRank()) {
+    return failure();
+  }
+
+  if (denseAttr.isSplat())
+    return getOrCreateConstant(rewriter,
+                               rewriter.getInsertionBlock()->getParentOp(),
+                               DenseElementsAttr::get(resultType, denseAttr.getSplatValue<Attribute>()),
+                               resultType);
+
+  SmallVector<Attribute> inputValues(denseAttr.getValues<Attribute>());
+  SmallVector<Attribute> resultValues(inputValues.size());
+  SmallVector<int64_t> inputStrides = computeRowMajorStrides(inputType.getShape());
+  SmallVector<int64_t> resultStrides = computeRowMajorStrides(resultType.getShape());
+  SmallVector<int64_t> inputIndices(inputType.getRank(), 0);
+
+  for (auto [linearIndex, value] : llvm::enumerate(inputValues)) {
+    int64_t remaining = static_cast<int64_t>(linearIndex);
+    for (int64_t dim = 0; dim < inputType.getRank(); ++dim) {
+      inputIndices[dim] = inputStrides.empty() ? 0 : remaining / inputStrides[dim];
+      remaining = inputStrides.empty() ? 0 : remaining % inputStrides[dim];
+    }
+
+    int64_t resultLinearIndex = 0;
+    for (int64_t dim = 0; dim < resultType.getRank(); ++dim)
+      resultLinearIndex += inputIndices[permutation[dim]] * resultStrides[dim];
+
+    resultValues[resultLinearIndex] = value;
+  }
+
+  return getOrCreateConstant(rewriter,
+                             rewriter.getInsertionBlock()->getParentOp(),
+                             DenseElementsAttr::get(resultType, resultValues),
+                             resultType);
+}
+
+struct TransposeToLinalgTranspose : OpConversionPattern<ONNXTransposeOp> {
+  using OpConversionPattern::OpConversionPattern;
+
+  LogicalResult matchAndRewrite(ONNXTransposeOp transposeOp,
+                                ONNXTransposeOpAdaptor adaptor,
+                                ConversionPatternRewriter& rewriter) const override {
+    auto inputType = dyn_cast<RankedTensorType>(adaptor.getData().getType());
+    auto resultType = dyn_cast<RankedTensorType>(transposeOp.getResult().getType());
+    if (!inputType || !resultType)
+      return failure();
+
+    auto permutation = getTransposePermutationChecked(transposeOp.getPermAttr(), inputType.getRank());
+    if (failed(permutation))
+      return failure();
+    if (isCompileTimeComputable(adaptor.getData())) {
+      auto constantTranspose =
+        materializeTransposedConstant(adaptor.getData(), resultType, *permutation, rewriter, transposeOp.getLoc());
+      if (succeeded(constantTranspose)) {
+        rewriter.replaceOp(transposeOp, *constantTranspose);
+        return success();
+      }
+    }
+
+    auto buildTranspose = [&](Value input) -> Value {
+      Value init = createTransposeInit(input, resultType, *permutation, rewriter, transposeOp.getLoc());
+      return linalg::TransposeOp::create(rewriter, transposeOp.getLoc(), input, init, *permutation).getResult()[0];
+    };
+
+    if (isInsideSpatialComputeRegion(transposeOp.getOperation())) {
+      rewriter.replaceOp(transposeOp, buildTranspose(adaptor.getData()));
+      return success();
+    }
+
+    auto computeOp = createSpatCompute<1>(
+      rewriter, transposeOp.getLoc(), TypeRange {resultType}, {}, ValueRange {adaptor.getData()}, [&](Value input) {
+        spatial::SpatYieldOp::create(rewriter, transposeOp.getLoc(), buildTranspose(input));
+      });
+    rewriter.replaceOp(transposeOp, computeOp.getResult(0));
+    return success();
+  }
+};
+
+} // namespace
+
+void populateTransposePatterns(RewritePatternSet& patterns, MLIRContext* ctx) {
+  patterns.add<TransposeToLinalgTranspose>(ctx);
+}
+
+} // namespace onnx_mlir
--- a/Show More
+++ b/Show More
Author	SHA1	Message	Date
ilgeco	852bef7605	ReduceMean + resnet Validate Operations / validate-operations (push) Has been cancelled Details	2026-06-10 14:30:10 +02:00
ilgeco	237654dadf	Fix direct import Validate Operations / validate-operations (push) Has been cancelled Details	2026-06-10 12:14:20 +02:00
ilgeco	6d69600bc1	Yolo Image Validator + new accept rule Validate Operations / validate-operations (push) Has been cancelled Details	2026-06-10 11:59:43 +02:00
NiccoloN	aec80529ca	much faster MaterializeMergeSchedule.cpp Validate Operations / validate-operations (push) Has been cancelled Details	2026-06-05 18:22:59 +02:00
ilgeco	8ddbbcecfa	Added support for SliceOp Validate Operations / validate-operations (push) Has been cancelled Details	2026-06-05 17:36:51 +02:00
ilgeco	90c4339808	SpatialSubOp Validate Operations / validate-operations (push) Has been cancelled Details	2026-06-05 17:12:16 +02:00
ilgeco	08870de1a6	Merge branch 'refactorone' of chef.heaplab.deib.polimi.it:nnicolosi/Raptor into refactorone	2026-06-05 16:43:50 +02:00
NiccoloN	a34ac223c0	fix remaining failing tests Validate Operations / validate-operations (push) Has been cancelled Details remove unsupported tests	2026-06-05 15:27:11 +02:00
NiccoloN	0fa10b4074	better Conv.cpp and fixed broken conv op validation test Validate Operations / validate-operations (push) Has been cancelled Details	2026-06-05 13:35:27 +02:00
NiccoloN	e166ff7e1d	better AGENTS.md Validate Operations / validate-operations (push) Has been cancelled Details	2026-06-05 11:36:01 +02:00
ilgeco	a70a8f77cf	Merge branch 'refactorone' of chef.heaplab.deib.polimi.it:nnicolosi/Raptor into refactorone	2026-06-05 10:20:09 +02:00
ilgeco	800c0c4316	Python peft and new summary report	2026-06-05 10:20:02 +02:00
NiccoloN	1e9e61f5a9	remove useless MaterializeHostConstantsPass.cpp and fix lowering before instead Validate Operations / validate-operations (push) Has been cancelled Details avoid spammy pim codegen diagnostics	2026-06-05 10:06:28 +02:00
ilgeco	27410207c4	New corner case test Validate Operations / validate-operations (push) Has been cancelled Details	2026-06-04 16:00:48 +02:00
NiccoloN	cbc9808229	more generalized MaterializeMergeSchedule.cpp for better memory usage after materialization Validate Operations / validate-operations (push) Has been cancelled Details	2026-06-04 12:44:57 +02:00
NiccoloN	69021d56aa	automatic code reformat Validate Operations / validate-operations (push) Has been cancelled Details	2026-06-03 19:43:56 +02:00
NiccoloN	dc5edd032c	Merge branch 'refactorone' of chef.heaplab.deib.polimi.it:nnicolosi/Raptor into refactorone	2026-06-03 19:40:53 +02:00
NiccoloN	e33f517221	faster scheduling: split batches into numCores tasks before scheduling instead of numLanes tasks	2026-06-03 19:40:34 +02:00
ilgeco	f94b3d1020	Merge branch 'refactorone' of chef.heaplab.deib.polimi.it:nnicolosi/Raptor into refactorone Validate Operations / validate-operations (push) Has been cancelled Details	2026-06-03 18:15:33 +02:00
ilgeco	20cf40c9ba	Memory Liveness	2026-06-03 18:15:30 +02:00
NiccoloN	37a59054a5	better loop compaction in MaterializeMergeSchedule.cpp Validate Operations / validate-operations (push) Has been cancelled Details	2026-06-03 16:01:19 +02:00
ilgeco	2a8faf9c6b	Merge branch 'refactorone' of chef.heaplab.deib.polimi.it:nnicolosi/Raptor into refactorone	2026-06-03 13:49:42 +02:00
ilgeco	01b9d03fc6	Early warning on memory address	2026-06-03 13:49:39 +02:00
NiccoloN	501e6c76f3	better memory report Validate Operations / validate-operations (push) Has been cancelled Details capped vector allocations at u32::MAX in rust simulator	2026-06-03 13:48:42 +02:00
ilgeco	3c2667f11e	Fix memory bug Validate Operations / validate-operations (push) Has been cancelled Details	2026-06-03 12:59:58 +02:00
NiccoloN	0a5e73c3ea	better transpose pattern and cleanup Validate Operations / validate-operations (push) Has been cancelled Details	2026-06-03 12:26:31 +02:00
NiccoloN	636310d0cb	add shared loop creation helpers Validate Operations / validate-operations (push) Has been cancelled Details add shared checked arithmetic helpers refactor pim passes into Pim/Transforms more robust memory coalescing pass	2026-06-01 16:49:06 +02:00
NiccoloN	356be6ccc2	uniquify constants produced by affine lowering Validate Operations / validate-operations (push) Has been cancelled Details	2026-06-01 10:52:25 +02:00
NiccoloN	b678e55d3c	compact memory contiguity with for loops Validate Operations / validate-operations (push) Has been cancelled Details	2026-05-31 18:47:59 +02:00
NiccoloN	ab63498f3f	normalize affine arithmetic helpers Validate Operations / validate-operations (push) Has been cancelled Details	2026-05-30 16:37:28 +02:00
NiccoloN	7c3943bd06	Merge remote-tracking branch 'origin/refactorone' into refactorone Validate Operations / validate-operations (push) Has been cancelled Details # Conflicts: # src/PIM/Dialect/Pim/Transforms/Bufferization/PimBufferizationPass.cpp	2026-05-30 16:12:42 +02:00
NiccoloN	c0238c0d06	fix high memory usage caused by MaterializeMergeSchedule.cpp with more robust code	2026-05-30 16:12:06 +02:00
NiccoloN	ff36729140	centralize logic for materializing contiguous memory into bufferization fix codegen symlinks overwrite remove deprecated pim memcp_hd_batch op	2026-05-30 16:09:58 +02:00
NiccoloN	cf93caecd5	centralize logic for materializing contiguous memory into bufferization Validate Operations / validate-operations (push) Has been cancelled Details fix codegen symlinks overwrite remove deprecated pim memcp_hd_batch op	2026-05-30 15:54:24 +02:00
NiccoloN	2d5b03c08f	automatic code reformat Validate Operations / validate-operations (push) Has been cancelled Details	2026-05-29 19:21:37 +02:00
NiccoloN	a41f694cf0	batched matmul pattern Validate Operations / validate-operations (push) Has been cancelled Details add conv helpers new validation tests for matmul	2026-05-29 19:09:48 +02:00
NiccoloN	8bb0babf1b	finish helper refactoring Validate Operations / validate-operations (push) Has been cancelled Details use uniqued constant helpers everywhere materialize transposed constants directly	2026-05-29 17:05:45 +02:00
ilgeco	819d8af0f7	Refactor + ReduceMean batched Validate Operations / validate-operations (push) Has been cancelled Details	2026-05-29 15:57:13 +02:00
ilgeco	832bd7f1f7	Transpose and Refactor of Patterns Validate Operations / validate-operations (push) Has been cancelled Details	2026-05-29 13:23:31 +02:00
ilgeco	82b44a6387	New Onnx test gemm model	2026-05-29 11:41:30 +02:00
ilgeco	7fcc765d6e	New Onnx Test model	2026-05-29 11:37:17 +02:00
ilgeco	f34698a2b6	Validate new option for compile only Validate Operations / validate-operations (push) Has been cancelled Details	2026-05-28 22:59:26 +02:00
ilgeco	1ab489fe0a	Dynamic gemm/conv	2026-05-28 18:00:14 +02:00
ilgeco	cbf7b235f1	pim-simulator now support usize addresses Validate Operations / validate-operations (push) Has been cancelled Details	2026-05-28 17:03:19 +02:00
NiccoloN	00414dd1d9	add verification of communication invariants at the end of spatial Validate Operations / validate-operations (push) Has been cancelled Details remove dead logic	2026-05-27 19:17:48 +02:00
NiccoloN	783dffe553	fix scheduling cost model Validate Operations / validate-operations (push) Has been cancelled Details	2026-05-27 17:14:19 +02:00
NiccoloN	874a2f53e6	automatic code reformat Validate Operations / validate-operations (push) Has been cancelled Details	2026-05-27 16:39:56 +02:00
NiccoloN	4bdaa57656	simplify affine maps to constants where possible Validate Operations / validate-operations (push) Has been cancelled Details	2026-05-27 16:39:27 +02:00
NiccoloN	1a5d7d2a3f	fix bufferization and weight emission after new gemm patterns Validate Operations / validate-operations (push) Has been cancelled Details	2026-05-27 16:15:10 +02:00
ilgeco	013ae0ac2a	Update README and AGENTS Validate Operations / validate-operations (push) Has been cancelled Details	2026-05-27 15:09:30 +02:00
ilgeco	c6b02af7a9	Merge branch 'refactorone' of chef.heaplab.deib.polimi.it:nnicolosi/Raptor into refactorone	2026-05-27 14:32:51 +02:00
ilgeco	d2048bd394	Add to gitignore	2026-05-27 14:32:47 +02:00
NiccoloN	158f0f0c54	update AGENTS.md Validate Operations / validate-operations (push) Has been cancelled Details	2026-05-27 14:32:04 +02:00
NiccoloN	532cac8246	commit AGENTS.md Validate Operations / validate-operations (push) Has been cancelled Details	2026-05-27 14:07:34 +02:00
NiccoloN	d609e84054	teh only weight (WIP) Validate Operations / validate-operations (push) Has been cancelled Details	2026-05-26 18:42:14 +02:00
NiccoloN	addfc8a86e	remove other dead logic Validate Operations / validate-operations (push) Has been cancelled Details	2026-05-25 21:22:08 +02:00
NiccoloN	0f240af271	cleanup unused channel operations and related logic Validate Operations / validate-operations (push) Has been cancelled Details	2026-05-25 20:58:51 +02:00
ilgeco	bdc4ca33f3	No extract no more Validate Operations / validate-operations (push) Has been cancelled Details	2026-05-25 18:19:43 +02:00
ilgeco	b79c333c6c	Merge branch 'refactorone' of chef.heaplab.deib.polimi.it:nnicolosi/Raptor into refactorone	2026-05-25 15:44:40 +02:00
ilgeco	eea9261c7b	Bye Bye DCP	2026-05-25 15:44:30 +02:00
NiccoloN	e8a08f6dd0	faster pim VerificationPass.cpp and pim code emission Validate Operations / validate-operations (push) Has been cancelled Details	2026-05-25 15:24:12 +02:00
NiccoloN	4855a2e105	add verification of static weights in spatial Validate Operations / validate-operations (push) Has been cancelled Details	2026-05-24 12:00:42 +02:00
NiccoloN	3a7a832198	MaterializeMergeSchedule.cpp fix for yolo11_depth_18	2026-05-24 11:54:00 +02:00
NiccoloN	48ca6bd28d	speed fix with a simple cache Validate Operations / validate-operations (push) Has been cancelled Details	2026-05-24 10:52:28 +02:00
NiccoloN	f595cc6ffd	fix high memory usage in IR	2026-05-24 10:41:47 +02:00
NiccoloN	c734f1b37e	better MaterializeMergeSchedule.cpp that emits much more compact IR Validate Operations / validate-operations (push) Has been cancelled Details add support for other constant-time arith ops in codegen	2026-05-24 10:10:24 +02:00
NiccoloN	b79ce8eeaa	use affine dialect to express simple constant progressions Validate Operations / validate-operations (push) Has been cancelled Details run dce at the end of MaterializeMergeSchedule to get rid of unused constants	2026-05-23 14:25:34 +02:00
NiccoloN	76a37e198f	better MaterializeMergeSchedule.cpp with both send and receive compaction in for loops Validate Operations / validate-operations (push) Has been cancelled Details	2026-05-23 11:17:36 +02:00
NiccoloN	7f3c7464b4	update cost model of batch lanes to consider only a slice of the shared batch input Validate Operations / validate-operations (push) Has been cancelled Details	2026-05-22 22:16:19 +02:00
NiccoloN	c77ffa9c56	better MaterializeMergeSchedule.cpp with %lane indexed batch computes support for tensors of index values	2026-05-22 21:52:28 +02:00
NiccoloN	495186503c	fix cmake magic once again	2026-05-22 19:21:56 +02:00
NiccoloN	2c1da813b5	fix much stuff	2026-05-22 18:53:38 +02:00
NiccoloN	8337a11ce9	automatic code reformat	2026-05-22 15:23:48 +02:00
ilgeco	d136136d22	Fix add of input in random order for compute_batch Validate Operations / validate-operations (push) Has been cancelled Details	2026-05-22 15:21:02 +02:00
NiccoloN	074eb183c7	saner SpatialToPimPass architecture Validate Operations / validate-operations (push) Has been cancelled Details	2026-05-22 07:27:54 +02:00
NiccoloN	43ed3914b8	better MaterializeMergeSchedule.cpp (something still broken downstream) Validate Operations / validate-operations (push) Has been cancelled Details	2026-05-22 06:56:39 +02:00
ilgeco	6aaf1c0870	Merge branch 'refactorone' of chef.heaplab.deib.polimi.it:nnicolosi/Raptor into refactorone Validate Operations / validate-operations (push) Has been cancelled Details	2026-05-21 14:44:19 +02:00
ilgeco	fe35b3ed43	Equivalent Class but broken	2026-05-21 14:43:59 +02:00
NiccoloN	90a9339686	better cmake to keep IDEs analyses happy Validate Operations / validate-operations (push) Has been cancelled Details	2026-05-21 14:13:54 +02:00
NiccoloN	a50e77ff38	refactorone Validate Operations / validate-operations (push) Has been cancelled Details	2026-05-20 19:06:41 +02:00
NiccoloN	f56c4159b5	Merge branch 'main' of chef.heaplab.deib.polimi.it:nnicolosi/Raptor	2026-05-19 15:01:26 +02:00
ilgeco	5637c861b4	Merge branch 'main' of chef.heaplab.deib.polimi.it:nnicolosi/Raptor Validate Operations / validate-operations (push) Has been cancelled Details	2026-05-19 15:00:11 +02:00
ilgeco	94157a8404	Very big timeout	2026-05-19 14:53:34 +02:00
ilgeco	68a3521978	Perft topological fix	2026-05-19 14:52:54 +02:00
NiccoloN	a103ba328b	remove dead logic	2026-05-19 12:23:01 +02:00
NiccoloN	e263e05f56	remove dead logic Validate Operations / validate-operations (push) Has been cancelled Details	2026-05-18 18:32:40 +02:00
ilgeco	34c29fdec4	Materialize modification Validate Operations / validate-operations (push) Has been cancelled Details	2026-05-18 17:22:13 +02:00
ilgeco	aa088e2ba5	Verify fix Validate Operations / validate-operations (push) Has been cancelled Details	2026-05-18 17:20:40 +02:00
NiccoloN	2836e759ab	remove useless file Validate Operations / validate-operations (push) Has been cancelled Details	2026-05-18 14:51:03 +02:00
NiccoloN	8071ebab0b	faster refactored merge pass Validate Operations / validate-operations (push) Has been cancelled Details	2026-05-18 14:50:19 +02:00
NiccoloN	f1602c0550	add peft scheduling Validate Operations / validate-operations (push) Has been cancelled Details better deadlock report by pim simulator	2026-05-18 12:09:27 +02:00
NiccoloN	de0a2f4561	remove useless guard in gemm lowering Validate Operations / validate-operations (push) Has been cancelled Details	2026-05-15 18:22:13 +02:00
NiccoloN	1c4a5bde76	compact softmax op lowering Validate Operations / validate-operations (push) Has been cancelled Details	2026-05-15 18:14:59 +02:00
NiccoloN	78242e2887	compact resize op lowering Validate Operations / validate-operations (push) Has been cancelled Details	2026-05-15 17:36:12 +02:00
NiccoloN	fe244d5aa1	new ops tests for matmul, grouped conv, concat and reshape Validate Operations / validate-operations (push) Has been cancelled Details related fixes	2026-05-14 15:54:06 +02:00
NiccoloN	d09e76c8f9	fix matmul rewriting/lowering Validate Operations / validate-operations (push) Has been cancelled Details fix reshape lowering add support for grouped-convolution lowering quieter verifier with capped error messages	2026-05-14 14:09:30 +02:00
NiccoloN	c5e608fa5b	replace greedy pattern rewrites with partial conversions Validate Operations / validate-operations (push) Has been cancelled Details better failure messages	2026-05-14 11:48:16 +02:00
ilgeco	43f3ccdd21	new yolo nodes with 100% more statics Validate Operations / validate-operations (push) Has been cancelled Details	2026-05-14 10:47:31 +02:00
NiccoloN	8d95c604a6	automatic code formatting Validate Operations / validate-operations (push) Has been cancelled Details	2026-05-13 21:51:19 +02:00
NiccoloN	55eda487dc	use seed in validate.py for deterministic tests	2026-05-13 21:49:36 +02:00
NiccoloN	061139aefb	fix wrong send/receive reordering in post dcp merge instructions compaction	2026-05-13 21:48:49 +02:00
NiccoloN	ea61540e08	fix failing validations after last commit Validate Operations / validate-operations (push) Has been cancelled Details	2026-05-13 17:46:19 +02:00
NiccoloN	324178cba8	fix instructions explosion in pim host constant folding pass Validate Operations / validate-operations (push) Has been cancelled Details	2026-05-13 17:31:05 +02:00
NiccoloN	e71ba07cd5	fix pim-simulator stale tests Validate Operations / validate-operations (push) Has been cancelled Details	2026-05-13 16:59:53 +02:00
NiccoloN	64a3805619	fix pim-simulator stale tests Validate Operations / validate-operations (push) Has been cancelled Details	2026-05-13 16:59:43 +02:00
NiccoloN	9f9e7c0892	Merge remote-tracking branch 'origin/main' Validate Operations / validate-operations (push) Has been cancelled Details	2026-05-13 16:38:33 +02:00
NiccoloN	03eab42971	remove host core generation strip config.json emitted by raptor add actual pimsim-nn configs in validation pimsim-configs	2026-05-13 16:31:01 +02:00