JIT Compilation

Shape’s JIT compilation system operates at three levels: scoped per-function JIT, tiered compilation with background promotion, and cross-function optimization with inlining and constant propagation. The content-addressed bytecode architecture makes each of these levels natural — every function blob is an independent compilation unit with a stable identity.

Scoped Per-Function JIT

Because every function lives in its own FunctionBlob, JIT compilation is naturally scoped to individual functions. Each blob is independently assessed for JIT compatibility via a per-blob preflight check before any code generation occurs.

JIT-compatible functions — those containing only supported operations (arithmetic, comparisons, local variable access, direct calls, control flow) — are compiled via Cranelift to native machine code.

JIT-incompatible functions — those using async operations, unsupported builtins, or complex runtime features — remain interpreted by the VM. There is no penalty; the interpreter handles them exactly as before.

MixedFunctionTable

The function table after selective JIT contains three entry types that coexist in a single lookup structure:

enum FunctionEntry {
    /// JIT-compiled function pointer — call directly via native ABI
    Native(*const u8),

    /// VM interpreter fallback — execute via bytecode dispatch (function index)
    Interpreted(u16),

    /// Awaiting background compilation — currently interpreted, will promote
    Pending(u16),
}

Native(*const u8) holds a raw pointer to JIT-compiled machine code. The VM calls through this pointer directly, bypassing the interpreter entirely.
Interpreted(u16) holds a function index into the linked program. The VM dispatches these through its normal bytecode loop.
Pending(u16) marks a function that has been submitted for background JIT compilation but has not yet completed. It behaves as Interpreted until the compiled result is ready.

Function Table (after selective JIT):
  [0] train_model    → Native    (numeric, hot, JIT-compatible)
  [1] parse_config   → Interpreted (complex object ops)
  [2] compute_signal → Native    (inner loop, numeric)
  [3] format_output  → Interpreted (strings, objects)
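
The lookup behaviour described above can be sketched as follows. This is a minimal illustration, not the Shape implementation: the `MixedFunctionTable` wrapper, the `CallRoute` enum, and the `route` method are hypothetical names introduced here, and only `FunctionEntry` and `promote_to_native` come from this document.

```rust
/// Entry kinds as described in this section.
#[derive(Debug, Clone, Copy, PartialEq)]
enum FunctionEntry {
    Native(*const u8),
    Interpreted(u16),
    Pending(u16),
}

/// Hypothetical routing decision (illustrative, not a Shape type).
#[derive(Debug, PartialEq)]
enum CallRoute {
    NativeCall(*const u8),
    Bytecode(u16),
}

/// Hypothetical wrapper around the table (an assumption of this sketch).
struct MixedFunctionTable {
    entries: Vec<FunctionEntry>,
}

impl MixedFunctionTable {
    /// Pending is routed exactly like Interpreted until it is promoted.
    fn route(&self, id: u16) -> Option<CallRoute> {
        match *self.entries.get(id as usize)? {
            FunctionEntry::Native(ptr) => Some(CallRoute::NativeCall(ptr)),
            FunctionEntry::Interpreted(idx) | FunctionEntry::Pending(idx) => {
                Some(CallRoute::Bytecode(idx))
            }
        }
    }

    /// Replace an entry with Native once compiled code is available.
    /// Returns false if the id is out of range.
    fn promote_to_native(&mut self, id: u16, ptr: *const u8) -> bool {
        match self.entries.get_mut(id as usize) {
            Some(slot) => {
                *slot = FunctionEntry::Native(ptr);
                true
            }
            None => false,
        }
    }
}
```

Keeping all three kinds in one enum means a call site needs a single lookup to decide between a native call and bytecode dispatch.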

VM Fallback Trampoline

When JIT-compiled code calls a function that is Interpreted or Pending, the runtime uses a fallback trampoline to bridge the two execution modes:

The trampoline reads the function_id from the call stub embedded in the JIT code.
It marshals arguments from the JIT (native) stack layout to the VM stack layout.
It invokes the VM interpreter for that function.
When the interpreter returns, the trampoline marshals the result back to the JIT calling convention and returns control to the native caller.

This trampoline is transparent to both the JIT and interpreted sides — mixed call chains work seamlessly regardless of which functions are native and which are interpreted.
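
The four trampoline steps can be sketched roughly as below. Everything here is an assumption made for illustration: the `CallStub` and `Vm` types, the use of `u64` for stack slots, and the `demo_interpreter` stand-in are hypothetical and do not reflect Shape's actual stack layouts or calling convention.

```rust
/// Hypothetical call stub embedded in JIT code (field names are illustrative).
struct CallStub {
    function_id: u16,
    argc: u8,
}

/// Hypothetical, minimal VM: an operand stack plus an interpreter entry point.
struct Vm {
    stack: Vec<u64>,
    interpret: fn(u16, &[u64]) -> u64,
}

/// Bridge a call from native code into the interpreter.
/// Returns None if the native side supplied fewer arguments than the stub expects.
fn fallback_trampoline(vm: &mut Vm, stub: &CallStub, native_args: &[u64]) -> Option<u64> {
    // 1. Read the function id from the call stub.
    let function_id = stub.function_id;
    // 2. Marshal arguments from the native layout onto the VM stack.
    let args = native_args.get(..stub.argc as usize)?;
    let base = vm.stack.len();
    vm.stack.extend_from_slice(args);
    // 3. Invoke the interpreter for that function.
    let result = (vm.interpret)(function_id, &vm.stack[base..]);
    // 4. Pop the marshaled frame and hand the result back to the native caller.
    vm.stack.truncate(base);
    Some(result)
}

/// Stand-in interpreter used only to make the sketch observable.
fn demo_interpreter(function_id: u16, args: &[u64]) -> u64 {
    function_id as u64 * 1000 + args.iter().sum::<u64>()
}
```

The VM stack is restored to its previous depth before returning, so the native caller never observes interpreter state.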

JIT Dispatch Table

The VM maintains a dispatch table that maps function IDs to JIT-compiled native code pointers:

pub type JitFnPtr = unsafe extern "C" fn(*mut u8, *const u8) -> u64;

// On VirtualMachine:
jit_dispatch_table: HashMap<u16, JitFnPtr>

External code (e.g., the shape-jit crate) registers compiled functions via vm.register_jit_function(function_id, ptr). When the VM’s Call opcode handler encounters a function with a dispatch table entry, it attempts JIT dispatch. If the marshaling bridge is not yet implemented for a particular calling convention, the VM falls through to bytecode interpretation — registered JIT entries never cause hard errors.
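
A minimal sketch of registration and fall-through dispatch, assuming a trimmed-down `VirtualMachine` that holds only the dispatch table. The `call` method, the `bridge_ready` flag, the `CallOutcome` enum, and `demo_native` are hypothetical stand-ins introduced for this example; only `JitFnPtr`, `jit_dispatch_table`, and `register_jit_function` are named in this document.

```rust
use std::collections::HashMap;

pub type JitFnPtr = unsafe extern "C" fn(*mut u8, *const u8) -> u64;

/// Trimmed-down VM holding only the dispatch table (an assumption of this sketch).
pub struct VirtualMachine {
    jit_dispatch_table: HashMap<u16, JitFnPtr>,
}

/// Illustrative outcome of a Call; not a Shape type.
#[derive(Debug, PartialEq)]
pub enum CallOutcome {
    Jit(u64),
    Interpreted,
}

impl VirtualMachine {
    pub fn new() -> Self {
        Self { jit_dispatch_table: HashMap::new() }
    }

    pub fn register_jit_function(&mut self, function_id: u16, ptr: JitFnPtr) {
        self.jit_dispatch_table.insert(function_id, ptr);
    }

    /// `bridge_ready` stands in for "a marshaling bridge exists for this
    /// calling convention". Without it the call falls through to bytecode.
    pub fn call(&mut self, function_id: u16, bridge_ready: bool) -> CallOutcome {
        match self.jit_dispatch_table.get(&function_id) {
            Some(&native) if bridge_ready => {
                // SAFETY: only sound for entries that tolerate null arguments,
                // such as the demo stub below, which ignores both pointers.
                let ret = unsafe { native(std::ptr::null_mut(), std::ptr::null()) };
                CallOutcome::Jit(ret)
            }
            // No entry, or no bridge: interpret instead of raising an error.
            _ => CallOutcome::Interpreted,
        }
    }
}

/// Stand-in for JIT-compiled code; it ignores its arguments.
unsafe extern "C" fn demo_native(_ctx: *mut u8, _args: *const u8) -> u64 {
    42
}
```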

Content-Addressed JIT Cache

JIT output is cached by blob content hash. JitCodeCache (crates/shape-jit/src/jit_cache.rs) keeps one entry per function hash, carrying the native code pointer plus enough metadata to invalidate it:

pub struct CacheEntry {
    /// Native code pointer.
    pub code_ptr: *const u8,
    /// Content hash of the function blob.
    pub function_hash: FunctionHash,
    /// Schema version at compilation time (for shape guard invalidation).
    pub schema_version: u32,
    /// Feedback epoch at compilation time (for speculation invalidation).
    pub feedback_epoch: u32,
    /// Hashes of functions this compiled code depends on (inlined callees).
    pub dependencies: Vec<FunctionHash>,
    /// Tier 2 cache key, present for optimizing-compiler output.
    pub tier2_key: Option<Tier2CacheKey>,
}

Same blob hash means same native code. If the same utility function appears in ten different programs, baseline (Tier 1) code for it is JIT-compiled exactly once and reused everywhere — Tier 1 carries no speculation, so its output is stable for a given content hash. Tier 2 entries embed speculative shape guards; they are invalidated via invalidate_by_dependency() when an inlined callee changes or when the schema version / feedback epoch advances.
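
The caching rule can be illustrated with a small sketch. It assumes `FunctionHash` is a 32-byte array and uses a plain `usize` in place of the native code pointer; the `get_or_compile` helper and the exact signature of `invalidate_by_dependency` are assumptions of this example, not the API of `jit_cache.rs`.

```rust
use std::collections::HashMap;

/// Assumed representation of a content hash (32 bytes).
type FunctionHash = [u8; 32];

/// Trimmed-down entry: a code address stand-in plus inlined-callee hashes.
struct CacheEntry {
    code_addr: usize,
    dependencies: Vec<FunctionHash>,
}

#[derive(Default)]
struct JitCodeCache {
    entries: HashMap<FunctionHash, CacheEntry>,
}

impl JitCodeCache {
    /// Hypothetical helper: compile only when the hash is not cached yet.
    fn get_or_compile(
        &mut self,
        hash: FunctionHash,
        dependencies: Vec<FunctionHash>,
        compile: impl FnOnce() -> usize,
    ) -> usize {
        self.entries
            .entry(hash)
            .or_insert_with(|| CacheEntry { code_addr: compile(), dependencies })
            .code_addr
    }

    /// Drop every entry that inlined the changed function.
    /// Returns how many entries were removed.
    fn invalidate_by_dependency(&mut self, changed: &FunctionHash) -> usize {
        let before = self.entries.len();
        self.entries.retain(|_, entry| !entry.dependencies.contains(changed));
        before - self.entries.len()
    }
}
```

An entry with no dependencies, such as a baseline compilation of a leaf utility, survives invalidation of other functions untouched.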

Tiered Compilation

Shape uses a three-tier compilation strategy. Every function starts interpreted and is promoted to higher tiers based on observed call frequency.

Tier Definitions

Tier	Name	Threshold	Description
0	`Interpreted`	0 calls	All functions start here. Full bytecode interpretation.
1	`BaselineJit`	100 calls	Per-function JIT compilation. No cross-function optimization.
2	`OptimizingJit`	10,000 calls	Feedback-guided inlining and constant propagation. (Devirtualization is a planned v0.4 addition.)

Promotion thresholds are checked at function entry. When a function’s call count crosses a tier boundary, a compilation request is submitted for the next tier.

Per-Function Tier State

Each function tracks its own compilation state:

struct FunctionTierState {
    /// Current execution tier
    tier: Tier,

    /// Cumulative call count since program start
    call_count: u32,

    /// Whether a compilation request is already in flight
    compilation_pending: bool,
}

call_count is a u32: its maximum of roughly 4.3 billion calls is far above any realistic single-function call count, and a 32-bit counter keeps FunctionTierState compact for the per-function dispatch path.

The compilation_pending flag prevents duplicate submissions. Once a compilation completes, the flag is cleared and the function’s tier is updated atomically.
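
A minimal sketch of the threshold check at function entry, using the thresholds from the tier table. The `record_call` and `complete` methods and the saturating increment are assumptions of this example; the document only specifies the struct fields and the 100 / 10,000 thresholds.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Tier {
    Interpreted,
    BaselineJit,
    OptimizingJit,
}

const TIER1_THRESHOLD: u32 = 100;
const TIER2_THRESHOLD: u32 = 10_000;

struct FunctionTierState {
    tier: Tier,
    call_count: u32,
    compilation_pending: bool,
}

impl FunctionTierState {
    fn new() -> Self {
        Self { tier: Tier::Interpreted, call_count: 0, compilation_pending: false }
    }

    /// Called at function entry. Returns the tier to request, if any.
    fn record_call(&mut self) -> Option<Tier> {
        // Saturating add is a choice made for this sketch.
        self.call_count = self.call_count.saturating_add(1);
        if self.compilation_pending {
            return None; // a request is already in flight
        }
        let target = match self.tier {
            Tier::Interpreted if self.call_count >= TIER1_THRESHOLD => Tier::BaselineJit,
            Tier::BaselineJit if self.call_count >= TIER2_THRESHOLD => Tier::OptimizingJit,
            _ => return None,
        };
        self.compilation_pending = true;
        Some(target)
    }

    /// Called when a compilation finishes: update the tier and clear the flag.
    fn complete(&mut self, reached: Tier) {
        self.tier = reached;
        self.compilation_pending = false;
    }
}
```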

Background Compilation

Compilation happens off the hot path, on a dedicated background thread:

When a function crosses a tier threshold, the VM creates a CompilationRequest containing the function blob, target tier, and any profiling data collected so far.
This request is sent via an mpsc channel to the background compilation thread.
The background thread owns the JIT compiler instance. It processes requests sequentially, producing a CompilationResult with the native code pointer (or an error if compilation fails).
The result is sent back via a second mpsc channel.
The VM checks try_recv() at safe points — function entry and loop back-edges — to pick up completed compilations without blocking.
On receiving a successful result, the VM calls promote_to_native(id, ptr) to atomically swap the function table entry from Pending (or Interpreted) to Native.

VM hot path                    Background thread
    │                               │
    ├─ call_count hits 100 ─────►   │
    │   CompilationRequest          │
    │                               ├─ Cranelift compile
    │   (function continues         │
    │    interpreted)               │
    │                               ├─ CompilationResult ────►
    │                               │
    ├─ try_recv() at safe point     │
    │   promote_to_native(id, ptr)  │
    │                               │
    ├─ next call → Native           │

Functions continue executing as interpreted while compilation proceeds in the background. There is no stop-the-world pause. The transition from interpreted to native is atomic and takes effect on the next call to that function.
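
The request/result flow can be sketched with the standard library's `mpsc` channels. This is an illustration under stated assumptions: the field layouts of `CompilationRequest` and `CompilationResult`, the `spawn_compiler` and `poll_results` helpers, and the placeholder "compiler" (which just fabricates an address) are all hypothetical.

```rust
use std::sync::mpsc::{channel, Receiver, Sender, TryRecvError};
use std::thread::{self, JoinHandle};

/// Trimmed-down request: the real one also carries the blob and profiling data.
struct CompilationRequest {
    function_id: u16,
    target_tier: u8,
}

/// Result with a code address stand-in, or an error message.
struct CompilationResult {
    function_id: u16,
    code_addr: Result<usize, String>,
}

/// Spawn the worker that owns the (placeholder) compiler.
fn spawn_compiler() -> (Sender<CompilationRequest>, Receiver<CompilationResult>, JoinHandle<()>) {
    let (req_tx, req_rx) = channel::<CompilationRequest>();
    let (res_tx, res_rx) = channel::<CompilationResult>();
    let handle = thread::spawn(move || {
        // Requests are processed sequentially until every sender is dropped.
        for req in req_rx {
            let code_addr = if req.target_tier > 2 {
                Err(format!("unknown tier {}", req.target_tier))
            } else {
                Ok(0x1000 + req.function_id as usize) // placeholder for real codegen
            };
            let result = CompilationResult { function_id: req.function_id, code_addr };
            if res_tx.send(result).is_err() {
                break; // the VM side went away
            }
        }
    });
    (req_tx, res_rx, handle)
}

/// Called at safe points: drain finished compilations without blocking.
fn poll_results(res_rx: &Receiver<CompilationResult>, promoted: &mut Vec<(u16, usize)>) {
    loop {
        match res_rx.try_recv() {
            Ok(CompilationResult { function_id, code_addr: Ok(addr) }) => {
                promoted.push((function_id, addr)); // stands in for promote_to_native
            }
            Ok(_) => {} // failed compile: the function stays interpreted
            Err(TryRecvError::Empty) | Err(TryRecvError::Disconnected) => break,
        }
    }
}
```

Because `try_recv` never blocks, a safe point with no finished work costs only a channel check.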

`--mode jit` semantics

The CLI --mode jit flag (default) requests JIT compilation for the toplevel script and every reachable function. The semantics are:

Toplevel script + functions attempt JIT compile when the bytecode is JIT-compatible (passes compile_program_selective’s per-function and main-code preflight).
On JIT-compile failure, the executor falls through to the bytecode interpreter for the whole program. The fall-through does not drop output silently: the interpreter re-runs the same parsed Program and produces the same observable result a --mode vm invocation would.
A one-line diagnostic is emitted to stderr when fall-through fires:
```
[jit-fallback] function main failed JIT compile: <reason>; running under interpreter
```
The diagnostic is always visible (uses eprintln!, no subscriber required). Verbose JIT pipeline tracing is gated behind --trace-jit=shape_jit=debug (replaces the legacy SHAPE_JIT_DEBUG env-var per closure-wave-F migration).
Tier-up promotion is preserved on hot functions per the T1@100 / T2@10k thresholds. Fall-through only fires when the entire program cannot be JIT-compiled at all (e.g. toplevel main code contains an opcode that the JIT preflight rejects, such as AllocSharedModuleBinding). Programs that JIT-compile successfully run the JIT path; tier promotion happens transparently on functions that cross the call-count thresholds.

The fall-through path is implemented in JITExecutor::execute_program (crates/shape-jit/src/executor.rs). It catches every Err from the JIT sub-pipeline — preflight rejection, Cranelift codegen failure, FFI linking failure, JIT runtime signal, RETURN_TAG_NANBOXED surface-and-stop — and re-dispatches to BytecodeExecutor::execute_program with the same Program.

Verifying fall-through behavior

The corrected smoke harness captures stdout and the exit code separately. This avoids the tail | echo EXIT=$? pattern, which reports the exit status of tail rather than of shape and masked silent-no-output failures pre-W12:

raw=$(timeout 30 ./target/release/shape run --mode "$mode" "$file" 2>/dev/null)
ec=$?   # exit status of timeout/shape, captured before any pipe to tail
out=$(printf '%s\n' "$raw" | tail -1)
echo "$mode/$name: $out (exit=$ec)"

VM and JIT should produce identical stdout for any program that runs without runtime error in either mode; [jit-fallback] appears on stderr only when the JIT path could not compile the program at all.

Cross-Function Optimization (Tier 2)

Tier 2 compilation is shipped: compile_optimizing_function (crates/shape-jit/src/worker.rs) runs feedback-guided optimizing compilation when a function crosses the 10,000-call threshold. Inlining is driven by CallPathPlan (crates/shape-jit/src/optimizer/call_path.rs) and the HOF-inline / call-LICM analysis passes. Tier 2 output is cached by Tier2CacheKey (crates/shape-jit/src/optimizer/cross_function.rs), and shape-guard deoptimization is tracked by DeoptTracker (crates/shape-vm/src/deopt.rs).

IC devirtualization is not shipped — it is a v0.4 candidate (§Q25.C.6 of the round-2 budget, see crates/shape-vm/src/compiler/trait_object_emission.rs). There is no DevirtAnalysis type, no CallGraph type in shape-jit, and no SpecializedCallee type in the source today; those names below are design sketches, labelled inline where they appear.

Tier 2 compilation goes beyond per-function code generation. It uses per-function feedback to specialise call sites, inlines hot callees through the CallPathPlan and HOF-inline passes, and (planned for v0.4) devirtualizes indirect calls.

Inlining Policy

Inlining is governed by a per-program CallPathPlan produced during the JIT optimizer’s call-path analysis phase:

pub struct CallPathPlan {
    /// Call instruction indices that should prefer direct-call lowering.
    pub prefer_direct_call_sites: HashSet<usize>,
    /// Per-call-site parameter local slots that must be restored after a
    /// direct-call argument write into ctx.locals[0..argc).
    pub restore_param_slots_by_call_site: HashMap<usize, Vec<u16>>,
    /// Depth guard for nested inlining.
    pub inline_depth_limit: u8,
}

analyze_call_path (crates/shape-jit/src/optimizer/call_path.rs) walks every Call instruction and decides per call site:

A call site is added to prefer_direct_call_sites when its argument count is ≤ 4 or when it sits inside a hot loop body (a loop the loop-lowering pass marked with an unroll factor greater than 1).
inline_depth_limit defaults to 4, capping how deep the inliner will recurse from any root call site. The pass bumps the limit to 6 when the whole program has ≤ 8 call instructions — small programs can afford a deeper inline budget.

There is no separate Tier 1 vs Tier 2 instruction budget — the JIT consults the same CallPathPlan regardless of tier, and the depth guard is the only hard ceiling. There is no stand-alone InlinePolicy type; the heuristics above live entirely in the call_path analysis pass.
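
The two heuristics can be sketched as below. The `CallSite` summary type is hypothetical (the real pass walks bytecode instructions), and `restore_param_slots_by_call_site` is omitted to keep the example short; the thresholds are the ones stated above.

```rust
use std::collections::HashSet;

/// Hypothetical summary of one Call instruction.
struct CallSite {
    instr_index: usize,
    argc: usize,
    /// Unroll factor of the enclosing loop, if the call sits in one.
    loop_unroll_factor: Option<u32>,
}

/// Trimmed-down plan: only the fields this sketch computes.
struct CallPathPlan {
    prefer_direct_call_sites: HashSet<usize>,
    inline_depth_limit: u8,
}

fn analyze_call_path(sites: &[CallSite]) -> CallPathPlan {
    // Prefer direct calls for small argument lists or calls in hot loops.
    let prefer_direct_call_sites = sites
        .iter()
        .filter(|s| s.argc <= 4 || matches!(s.loop_unroll_factor, Some(f) if f > 1))
        .map(|s| s.instr_index)
        .collect();
    // Small programs (at most 8 calls) get a deeper inline budget.
    let inline_depth_limit = if sites.len() <= 8 { 6 } else { 4 };
    CallPathPlan { prefer_direct_call_sites, inline_depth_limit }
}
```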

Tier 2 Cache Key

Because Tier 2 compilation includes inlined callees, the cache key must account for the full compilation scope — not just the root function:

pub struct Tier2CacheKey {
    /// Hash of the root function blob.
    pub root_hash: [u8; 32],
    /// Sorted hashes of all inlined callee blobs.
    pub inlined_hashes: Vec<[u8; 32]>,
    /// Compiler version for invalidation.
    pub compiler_version: u32,
    /// Schema version at compilation time — bumped when object shapes
    /// change, staling any code that embedded shape guards.
    pub schema_version: u32,
    /// Feedback epoch at compilation time — bumped when a speculation
    /// assumption (e.g. a type guard) is invalidated.
    pub feedback_epoch: u32,
}

The combined_hash() method produces a single SHA-256 digest from these fields, used as the cache lookup key. If the root function or any inlined callee changes — or the schema version or feedback epoch advances — the combined hash changes and the cached output is invalidated.
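
The sketch below illustrates why the inlined hashes are kept sorted and how any field change alters the key. Note the substitution: it uses the standard library's 64-bit `DefaultHasher` as a stand-in, not SHA-256, so it shows the structure of the key and not the real digest. The `new` constructor and the `combined_hash_sketch` name are assumptions of this example.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

#[derive(Hash)]
pub struct Tier2CacheKey {
    pub root_hash: [u8; 32],
    pub inlined_hashes: Vec<[u8; 32]>,
    pub compiler_version: u32,
    pub schema_version: u32,
    pub feedback_epoch: u32,
}

impl Tier2CacheKey {
    /// Hypothetical constructor: sorting makes the key independent of the
    /// order in which callees were inlined.
    pub fn new(
        root_hash: [u8; 32],
        mut inlined_hashes: Vec<[u8; 32]>,
        compiler_version: u32,
        schema_version: u32,
        feedback_epoch: u32,
    ) -> Self {
        inlined_hashes.sort_unstable();
        Self { root_hash, inlined_hashes, compiler_version, schema_version, feedback_epoch }
    }

    /// Stand-in digest over all fields (64-bit std hash, NOT SHA-256).
    pub fn combined_hash_sketch(&self) -> u64 {
        let mut hasher = DefaultHasher::new();
        self.hash(&mut hasher);
        hasher.finish()
    }
}
```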

Constant Propagation

When the Tier 2 compiler inlines a callee, arguments that are compile-time constants at the call site (PushConst instructions) are propagated into the inlined body. Parameter reads become the known constant value, which exposes further optimization opportunities in the inlined region — dead branch elimination, strength reduction, and constant folding. Cranelift’s own constant-folding and dead-code passes then run over the merged IR.

This happens as part of the optimizing-compilation path (compile_optimizing_function) and is keyed by Tier2CacheKey, so the same root-plus-inlined-callees scope is compiled at most once.
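
To make the effect concrete, here is a toy illustration on a tiny stack-based instruction set. It is not Shape's bytecode or Cranelift IR: the `Instr` and `Arg` types and both functions are hypothetical, and in the real pipeline the folding step is performed by the optimizing compiler and Cranelift rather than by a peephole pass like this one.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Instr {
    PushConst(i64),
    LoadParam(usize),
    Add,
    Mul,
}

/// Caller-side view of an argument at the call site.
enum Arg {
    Const(i64),
    Dynamic,
}

/// Replace parameter reads with the constant passed at the call site.
fn propagate_constants(body: &[Instr], args: &[Arg]) -> Vec<Instr> {
    body.iter()
        .map(|ins| match ins {
            Instr::LoadParam(i) => match args.get(*i) {
                Some(Arg::Const(v)) => Instr::PushConst(*v),
                _ => ins.clone(),
            },
            other => other.clone(),
        })
        .collect()
}

/// Peephole fold: two constants followed by Add/Mul become one constant.
fn fold_constants(body: &[Instr]) -> Vec<Instr> {
    let mut out: Vec<Instr> = Vec::new();
    for ins in body {
        let n = out.len();
        // wrapping_sub yields an out-of-range index when fewer than two
        // instructions are buffered, so `get` returns None in that case.
        let folded = match (ins, out.get(n.wrapping_sub(2)), out.get(n.wrapping_sub(1))) {
            (Instr::Add, Some(Instr::PushConst(a)), Some(Instr::PushConst(b))) => {
                Some(a.wrapping_add(*b))
            }
            (Instr::Mul, Some(Instr::PushConst(a)), Some(Instr::PushConst(b))) => {
                Some(a.wrapping_mul(*b))
            }
            _ => None,
        };
        match folded {
            Some(v) => {
                out.truncate(n - 2);
                out.push(Instr::PushConst(v));
            }
            None => out.push(ins.clone()),
        }
    }
    out
}
```

For a callee body computing `p0 * 2 + p1`, passing the constant 21 for `p0` lets the multiplication fold away while the read of `p1` stays dynamic.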

Devirtualization (planned — v0.4)

When the bytecode contains CallValue (an indirect call through a variable), a future Tier 2 pass could resolve the target statically and rewrite the call. This is not implemented in v0.3 — IC devirtualization is a v0.4 candidate (§Q25.C.6 of the round-2 budget). The sketch below describes the intended shape; no DevirtAnalysis or DevirtResult type exists in the source today:

DevirtAnalysis (planned, v0.4):
  - inspect a CallValue site and trace its target binding
  - Direct      → target traces to a single known function;
                  rewrite CallValue as a direct Call
  - Polymorphic → target traces to a small set of functions;
                  emit an inline cache that checks common targets first
  - Unknown     → target cannot be resolved; leave as indirect call

Until devirtualization lands, indirect calls through CallValue are lowered as indirect dispatch and are not inlining candidates.

Deoptimization

Tier 2 optimizations are speculative — feedback-guided compilation embeds shape guards so that inline-cached object-property accesses can run as direct loads. If an object shape transitions at runtime (for example a HashMap gains a property), any compiled code that guarded on the old shape must be invalidated.

DeoptTracker (crates/shape-vm/src/deopt.rs) is the index that makes this possible — it maps function IDs to the shape IDs they depend on, and keeps a reverse index from shape ID back to the dependent functions:

pub struct DeoptTracker {
    /// function_id → set of ShapeIds it depends on
    dependencies: HashMap<u16, HashSet<ShapeId>>,
    /// shape_id → set of function_ids that depend on it
    shape_dependents: HashMap<ShapeId, HashSet<u16>>,
}

After a successful Tier 2 compilation, register(function_id, shape_ids) records the shape_guards reported in the CompilationResult. When a shape transition occurs, invalidate_shape(shape_id) is called:

The DeoptTracker looks up the shape ID in shape_dependents.
It returns the list of dependent function IDs and clears their dependency entries (including the reverse mappings for any other shapes they guarded).
The caller removes those functions’ native code so the next call falls back to interpreted execution.
Each affected function’s tier state is reset to allow re-promotion once execution stabilizes on the new shape.

This guarantees correctness: optimized code that guarded on a shape is never executed after that shape transitions. The cost is a one-time recompilation if the function remains hot.
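
A minimal sketch of the two-way index and the invalidation steps above. It assumes `ShapeId` is a `u32` alias and that `register` takes a slice; the real signatures in `deopt.rs` may differ, and removing native code and resetting tier state are left to the caller, as in the description.

```rust
use std::collections::{HashMap, HashSet};

/// Assumed representation of a shape identifier.
type ShapeId = u32;

#[derive(Default)]
struct DeoptTracker {
    /// function_id -> shapes it guards on
    dependencies: HashMap<u16, HashSet<ShapeId>>,
    /// shape_id -> functions that guard on it
    shape_dependents: HashMap<ShapeId, HashSet<u16>>,
}

impl DeoptTracker {
    /// Record the shape guards reported by a Tier 2 compilation.
    fn register(&mut self, function_id: u16, shape_ids: &[ShapeId]) {
        for &sid in shape_ids {
            self.dependencies.entry(function_id).or_default().insert(sid);
            self.shape_dependents.entry(sid).or_default().insert(function_id);
        }
    }

    /// Return the functions to deoptimize and clear all of their entries,
    /// including reverse mappings for other shapes they guarded.
    fn invalidate_shape(&mut self, shape_id: ShapeId) -> Vec<u16> {
        let dependents = self.shape_dependents.remove(&shape_id).unwrap_or_default();
        for fid in &dependents {
            if let Some(shapes) = self.dependencies.remove(fid) {
                for sid in shapes {
                    let now_empty = match self.shape_dependents.get_mut(&sid) {
                        Some(set) => {
                            set.remove(fid);
                            set.is_empty()
                        }
                        None => false,
                    };
                    if now_empty {
                        self.shape_dependents.remove(&sid);
                    }
                }
            }
        }
        let mut ids: Vec<u16> = dependents.into_iter().collect();
        ids.sort_unstable(); // deterministic order for callers
        ids
    }
}
```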

Performance Characteristics

Tier	Throughput	Notes
Tier 0 (`Interpreted`)	~100ns/instruction (illustrative; awaiting v0.4 benchmark anchor)	Full bytecode interpretation with dispatch overhead
Tier 1 (`BaselineJit`)	Near-native for numeric code	Function call overhead reduced; no cross-function optimization
Tier 2 (`OptimizingJit`)	Native-class	Cross-function inlining eliminates call overhead; constant folding reduces work

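Content-addressed caching amplifies the benefit across programs: the same blob hash produces the same baseline native code, so Tier 1 code for a function is compiled at most once and then reused. Shared utility functions that appear in many programs are compiled on first encounter and served from cache for every subsequent load. Tier 2 entries are also cached, but can be invalidated and recompiled when a dependency, the schema version, or the feedback epoch changes.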

Typical promotion timeline for a hot inner loop:

First 100 calls: interpreted (Tier 0).
Calls 100-10,000: Tier 1 native code (compiled in background, available within milliseconds of crossing the threshold).
Beyond 10,000 calls: Tier 2 optimized code with inlined callees and propagated constants.

Fully Typed Native Values

The runtime is fully typed and zero-tag: every value has a compile-time-determined type, and there are no runtime type tags or tag-bit dispatch. The opcode encodes the type; the JIT generates code accordingly.

How values are represented

Values are native machine types. Scalars are raw f64 in XMM registers, raw i64/i32/i8/bool in GPR registers, and typed pointers to heap objects. The opcode carries the type — there is no runtime type classification.

Arrays are typed contiguous buffers. Array<number> maps to TypedArray<f64> — a contiguous f64 buffer with a refcounted header. Element access is a single load instruction: movsd xmm0, [data + i*8]. No per-element type checking.

Structs are C-compatible fixed layouts. type Point { x: number, y: number } produces a #[repr(C)] layout with field offsets computed at compile time. point.x compiles to load f64 [ptr + 8] — no schema lookup, no field name resolution.

FFI uses typed signatures. The JIT-to-runtime FFI functions are monomorphized per type rather than passing untyped words — for example jit_v2_struct_get_f64(ptr: *const u8, offset: u32) -> f64 (crates/shape-jit/src/ffi/v2_struct.rs).

Heap objects share a unified header. All heap-allocated objects start with an 8-byte HeapHeader containing an AtomicU32 refcount at offset 0 for single-cycle access. Clone is atomic_add 1; drop is atomic_sub 1.
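
The layout claims above can be checked with a small sketch. It makes several assumptions: the second four bytes of `HeapHeader` are modelled as a `_reserved` field because the document does not say what they hold, `PointObject` and `struct_get_f64` are hypothetical names (the real getter is `jit_v2_struct_get_f64`, whose offset convention is not specified here), and the memory orderings are a common choice for refcounts, not taken from the source.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// 8-byte header with the refcount at offset 0.
#[repr(C)]
struct HeapHeader {
    refcount: AtomicU32,
    _reserved: u32, // assumption: contents of the remaining bytes are unspecified
}

/// Hypothetical heap layout for `type Point { x: number, y: number }`.
#[repr(C)]
struct PointObject {
    header: HeapHeader,
    x: f64,
    y: f64,
}

impl HeapHeader {
    /// Clone: atomic increment. Returns the new count.
    fn retain(&self) -> u32 {
        self.refcount.fetch_add(1, Ordering::Relaxed) + 1
    }

    /// Drop: atomic decrement. Returns true when the last reference is gone.
    fn release(&self) -> bool {
        self.refcount.fetch_sub(1, Ordering::AcqRel) == 1
    }
}

/// Typed field read at a compile-time offset, with no schema lookup.
///
/// # Safety
/// `ptr + offset` must point to a valid, 8-byte-aligned f64.
unsafe fn struct_get_f64(ptr: *const u8, offset: u32) -> f64 {
    unsafe { ptr.add(offset as usize).cast::<f64>().read() }
}
```

With this layout, `x` lands at byte offset 8, which matches the `load f64 [ptr + 8]` lowering described for `point.x`.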

For the authoritative description of the typed runtime, see the runtime v2 spec.

Generics

Generics are monomorphized. Array<number> and Array<i32> are different types with different TypedArray instantiations, different opcodes, and different JIT code paths. There is no type erasure or boxing at generic boundaries.