name: stack-spoofing-dev
description: "Auth/lab dev: Windows call-stack research; unwind metadata, synthetic frames, NtContinue, thread-pool traces, gadget constraints."
license: MIT
compatibility: "x86-64 Windows 10 1809 through Windows 11 24H2 / Server 2022+; Classical thresholds assume Win10; Win11 22H2+ requires empirical re-tuning (see frame-math reference)."
metadata:
  author: AeonDave
  version: "1.0"
  category: evasion
  language: c,cpp,rust,go,asm
Stack Spoofing — Windows x64
Produce a spoofed call stack that survives unwinder-based inspection (ETW-TI, EDR stack walkers, StackWalk64). Each frame must have a legitimate .pdata entry, an unwind description that matches the planted frame size, and a return address that points inside a known module's .text.
This skill assumes you already understand .pdata / UNWIND_INFO at the level described in windows-internals/references/exception-unwind.md. It focuses on implementing the spoofer, not on teaching the format.
When to activate
- Implementing or reviewing Draugr / SilentMoonwalk / NtContinue / YouMayPasser / VulcanRaven / Unwinder spoofers in C, C++, Rust, Go, or Plan9 ASM
- Choosing between spoof strategies for a specific Windows build or thread context (main, TP worker, alertable, console-attached)
- Debugging `spoof_init: FAIL jmp_rbx` or unwinder-reported frame-size mismatches
- Adjusting `MinJmpRbxFrameSize` / `MinAddRspX` thresholds after `.pdata` inventory changes across Windows builds
- Integrating a spoofer with an indirect syscall dispatcher (RecycleGate / Hell's / FreshyCalls)
- Hardening a pre-existing spoofer against modern EDR correlation (Eclipse, SAVE_NONVOL safety, backed-vs-unbacked caller)
- Porting a spoofer between languages without breaking the ASM/context-struct contract
If the question is "what does UNWIND_INFO look like" → wrong skill, read windows-internals/references/exception-unwind.md. If the question is "how do I make NtWriteVirtualMemory appear to come from RtlUserThreadStart" → right skill.
The three strategies, side by side
| Property | Draugr | SilentMoonwalk DESYNC | NtContinue (context-replay) |
|---|---|---|---|
| Frames planted | 3 | 4 | 0 (kernel replays CONTEXT) |
| Gadgets required | 1× JMP [RBX] | 1× JMP [RBX] + 1× ADD RSP,X;RET | none (only a syscall;ret) |
| UNWIND_INFO frames needed | 2 (BaseThreadInitThunk, RtlUserThreadStart) | 2 (UWOP_SET_FPREG + UWOP_PUSH_NONVOL rbp) | 2 synthetic retaddrs planted in target CONTEXT.Rsp |
| Eclipse-validated? | No | Optional (cascade: wininet → user32 → kernelbase) | N/A |
| Callstack walker sees | syscall;ret → JMP [RBX] → BaseThreadInitThunk → RtlUserThreadStart → 0 | syscall;ret → AddRspX → JmpRbx → SecondFrame(rbp) → FirstFrame(setfpreg) → 0 | syscall;ret → BaseThreadInitThunk → RtlUserThreadStart → 0 |
| Safe on TP worker threads | No (root RSP wrong) | Yes | Yes |
| Safe with console attached | Yes | Yes | No (NtContinue races console I/O) |
| Go runtime friendly | Yes (uses pre-allocated heap buffer as fake RSP) | Yes | Risky (CONTEXT replay confuses goroutine scheduler) |
| Complexity (LOC) | ~300 + ASM | ~600 + ASM | ~150 + ASM |
Default choice: Draugr if you control the thread (main thread of an EXE, or explicit CreateThread with known root). SilentMoonwalk if you run on thread pool workers or need .pdata-coherent frames all the way down.
Notable implementations and variants
The strategies above are conceptual; below are the public PoCs/implementations you will encounter in the wild. Each is a concrete realization (or precursor) of one of the three strategies, with its own quirks.
YouMayPasser (Waldo-irc) — return-address-only minimalist baseline
64-bit weaponization of Gargoyle that extends namazso's original Return Address Spoofing PoC. Targets Cobalt Strike beacons. Spoofs only the immediate return address of the calling function, not a full multi-frame chain, so it is the cheapest stack-masquerading primitive.
- Strategy mapping: precursor / strict subset of Draugr (1 frame, 1 gadget).
- When to pick: BOF or short-lived primitive where one syscall's caller-frame must be hidden and a full Draugr chain is overkill.
- Caveat: hardcoded gadget offsets per Windows build. You must re-tune per build, as the Win11 22H2+ field measurements in the frame-math section warn for any `JMP [RBX]` consumer. Walks the same gadget cliffs as Draugr.
VulcanRaven — template-based stack mimicry with VEH cleanup
Spoofs the call stack by mirroring a real captured stack from telemetry (SysMon ProcessAccess on lsass), shipping with three example profiles selected via --wmi, --rpc, --svchost. Each profile is a captured frame chain of a legitimate Windows service path; the spoofer reproduces it byte-for-byte before issuing NtOpenProcess.
- Strategy mapping: orthogonal to Draugr/SilentMoonwalk. Instead of computing a generic plausible chain, it copies a specific real one. Because the chain came from real telemetry, it gives a correlating EDR fewer anomalies to flag.
- VEH twist: registers a vectored exception handler before resuming the spoofed thread; on access violation it redirects to `RtlExitUserThread` so the thread terminates cleanly rather than crashing the host. Adopt this pattern any time you mutate `CONTEXT.Rsp` and cannot guarantee the planted chain unwinds correctly.
- When to pick: targeted credential-access flows where you want the call chain to match a known-good svchost/RPC/WMI invocation rather than merely look generic.
- Limit: the captured chain ages — re-collect SysMon templates after major Windows feature updates or you start mimicking a chain that no longer exists in production.
Unwinder (Kudaes) — Rust weaponization of SilentMoonwalk
Rust crate (unwinder on crates.io) implementing full SilentMoonwalk DESYNC with stable, idiomatic Rust ergonomics. Supports calling arbitrary functions or indirect syscalls with up to 11 parameters, retrieves return values, and the spoof can be chained any number of times without growing the call stack (frames are recycled per call).
- Strategy mapping: SilentMoonwalk DESYNC (4 frames, JMP[RBX] + ADD RSP,X), Rust-native.
- When to pick: Rust implants where you want SilentMoonwalk without rolling your own `global_asm!` trampoline. Treat it as the canonical Rust answer to the lang-c-rust-go reference's SilentMoonwalk slot.
- Caveat: still subject to all the Win11 22H2+ gadget-population limits. Cascade module ordering is internal to the crate, so read its source before assuming `wininet → user32 → kernelbase` is wired the way you want.
Decision tree
Implant runs on...
│
├── Main thread of a dedicated loader EXE?
│ └── Draugr (simplest, fewest gadgets, zero Eclipse concerns)
│
├── Thread pool worker (TpWorkCallback, timer, TP_IO)?
│ └── SilentMoonwalk DESYNC — only strategy with .pdata-coherent frames
│ beyond BaseThreadInitThunk/RtlUserThreadStart
│
├── Beacon in a module-stomped host (rundll32, legitimate PE)?
│ └── SilentMoonwalk DESYNC or NtContinue — Draugr's assumption
│ "this thread was started by RtlUserThreadStart" does not hold
│
├── Single one-shot syscall with console attached?
│ └── Indirect syscall only (skip spoofing) — NtContinue races console
│
└── Need template-based mimicry of a real process's stack (e.g. svchost/RPC/WMI)?
└── VulcanRaven — synthetic stack mirroring a captured SysMon profile, VEH-based cleanup
Frame math — the numbers you actually need
These are the non-negotiable sizing rules. Full derivation in references/frame-math.md.
Minimum frame sizes for JMP [RBX] gadget
The trampoline frame must hold the shadow area (0x20) plus all stack args of the syscall you are spoofing. For NT syscalls:
| Syscall arg count | Stack args (after RCX/RDX/R8/R9) | Shadow + stack args | Minimum frame |
|---|---|---|---|
| ≤ 4 | 0 | 0x20 | 0x28 |
| 5 | 1 | 0x28 | 0x30 |
| 11 (NtCreateThreadEx) | 7 | 0x58 | 0x60 |
| 18 (max practical) | 14 | 0x90 | 0x98 |
Classical Draugr literature uses 0xD8 as a "safe for everything" floor. This is wrong on Windows 11 22H2+: kernelbase.dll has had its FF 23 gadget population drastically reduced and often exposes no gadget with frame ≥ 0xD8. Use the real minimum for your specific syscall.
Rule: compute shadow (0x20) + args_on_stack * 8 + padding (0x08) and use that as your min_frame. For the common NtCreateThreadEx(11) path, 0x60 is correct.
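The sizing rule above is plain arithmetic, so it can be checked in isolation. The sketch below is a minimal illustration, not code from the reference files: the helper name `min_frame_for_args`, the macro names, and the sanity bound of 32 arguments are all assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

#define REG_ARGS     4u     /* args passed in RCX, RDX, R8, R9 */
#define SHADOW_SPACE 0x20u  /* home space reserved for the register args */
#define FRAME_PAD    0x08u  /* padding term from the rule above */

/* Hypothetical helper: minimum frame size for a call with arg_count
 * arguments, following shadow + args_on_stack * 8 + padding. */
static uint32_t min_frame_for_args(uint32_t arg_count)
{
    assert(arg_count <= 32u); /* sanity bound chosen for this sketch only */
    uint32_t stack_args = arg_count > REG_ARGS ? arg_count - REG_ARGS : 0u;
    return SHADOW_SPACE + stack_args * 8u + FRAME_PAD;
}
```

With 11 arguments this yields 0x60 and with 18 it yields 0x98, matching the table rows above.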
Windows 11 22H2+ field measurements (kernelbase.dll)
| Metric | Value |
|---|---|
| Total FF 23 in kernelbase .text | ~14 |
| Max .pdata-validated frame size | 0x70 |
| CALL-preceded candidates (Eclipse) | 0 |
| Candidates rejected by SAVE_NONVOL filter | ~8 (of 14) |
| Candidates passing frame ≥ 0x60 | ~1 |
Implication: hardcoded 0xD8 breaks. Eclipse from kernelbase alone is infeasible. Cascade wininet → user32 → kernelbase is the correct strategy; or accept the lower threshold and drop Eclipse.
Minimum ADD RSP,X;RET (SilentMoonwalk only)
X must be larger than the JMP [RBX] trampoline's frame size, so arg slots placed at [SP+0x28..SP+0x90] within the AddRspX frame never collide with the JmpRbxGadget word written at [SP + 8 + X].
Rule: min_x = max(jmp_rbx_frame_size, MIN_FLOOR) where MIN_FLOOR = 0x60 on Win11 22H2+ (was 0xB0 on Win10).
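As a small worked form of the rule above, the following sketch picks the floor by build and applies the `max`. The function name `min_add_rsp_x`, the macro names, and the boolean build switch are assumptions for illustration; only the two floor values come from the text.

```c
#include <assert.h>
#include <stdint.h>

#define MIN_FLOOR_WIN10      0xB0u /* floor quoted for Win10 */
#define MIN_FLOOR_WIN11_22H2 0x60u /* floor quoted for Win11 22H2+ */

/* Hypothetical helper: min_x = max(jmp_rbx_frame_size, MIN_FLOOR). */
static uint32_t min_add_rsp_x(uint32_t jmp_rbx_frame_size, int win11_22h2_or_later)
{
    /* A frame size of 0 means the candidate was already rejected upstream. */
    assert(jmp_rbx_frame_size != 0u);
    uint32_t floor = win11_22h2_or_later ? MIN_FLOOR_WIN11_22H2 : MIN_FLOOR_WIN10;
    return jmp_rbx_frame_size > floor ? jmp_rbx_frame_size : floor;
}
```

Note that the prose asks for X to be larger than the trampoline frame, while the rule's `max` permits equality; this sketch follows the rule as written.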
UNWIND_INFO safety filters
Reject any candidate where calc_frame_size returns 0. Causes:
- No `.pdata` entry (leaf function)
- `UWOP_SAVE_NONVOL` / `UWOP_SAVE_NONVOL_FAR` with `save_offset >= total_alloc` → writes past frame → stack corruption when used as spoof frame
- `UWOP_SAVE_XMM128` present: the spoof does not preserve XMM regs; executing the real unwinder on this function causes a #UD when unwinding saved XMM
See references/frame-math.md for the full calc_frame_size algorithm including chained unwind info (UNW_FLAG_CHAININFO) handling.
Gadget scanner — non-negotiable rules
- Scan `.text` of the target module only. Never scan `.rdata`; byte sequence `FF 23` occurs in data.
- Match `byte[i] == 0xFF && byte[i+1] == 0x23` for `JMP [RBX]`. This is a 2-byte opcode with no REX prefix.
- For each hit, compute `frame_size` via `.pdata` binary search. Reject if 0.
- If Eclipse required: check `byte[gadget - 5] == 0xE8` (CALL rel32). Do not check `0x41 FF D_` or other CALL variants; callsite validation in Eclipse papers specifically relies on the 5-byte `E8` displacement CALL.
- Deterministic selection: pick the largest `frame_size` that passes filters. Random selection makes failure modes unreproducible.
- Emit diagnostic counters on failure (`FF23_total`, `fs_zero`, `below_min`, `eclipse_fail`, `best_belowmin_fs/addr`). Without these, kernelbase-has-no-gadgets failures look identical to bad-threshold failures.
Full scanner pseudocode + instrumentation patterns in references/frame-math.md.
Trampoline contract (all languages)
Every spoofer expresses the same contract between a high-level caller and a small ASM trampoline:
Caller (C / Rust / Go):
1. Resolve: module bases, function retaddrs, gadget(s), frame sizes
2. Populate a fixed-layout SpoofContext struct
3. Pre-allocate a spoofing buffer (heap-safe; see below)
4. Call ASM trampoline: (ssn, syscall_ret_addr, &ctx, args...)
ASM trampoline:
1. Save callee-saved (RBX, RBP, R12–R15, XMM6–15 if used)
2. Anchor the real RSP in a non-volatile reg (R12 is canonical)
3. Switch SP to the pre-allocated buffer (top-aligned to 16)
4. Plant synthetic frames bottom-up (sentinel 0 → outermost → innermost)
5. Load SSN into EAX, set MOV R10, RCX (syscall ABI)
6. JMP/CALL into syscall;ret gadget (never embed bare `syscall` — leaves your .text as source)
7. After return: restore SP from R12, pop callee-saved, RET
Buffer rule: never allocate the fake stack in a local variable of the ASM trampoline's frame. You are about to rewrite RSP; any local temporaries die. Pre-allocate a heap buffer (or a stable static) in the high-level caller, pass in bufPin + fakeStackTop, and use R12 to anchor the real RSP for fixup.
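The buffer's size and its aligned top are again simple arithmetic. The sketch below assumes the sizing formula given later in the diagnostic workflow (`8 + frame2 + frame1 + trampoline_frame + args*8 + 0x100`, aligned to 16); the function and macro names are hypothetical, and `args` is taken at face value from that formula.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SENTINEL_SLOT 8u     /* slot for the terminating 0 return address */
#define SAFETY_PAD    0x100u /* padding term from the sizing formula */

/* Hypothetical helper: buffer size in bytes, rounded up to 16. */
static size_t spoof_buffer_size(uint32_t frame1, uint32_t frame2,
                                uint32_t trampoline_frame, uint32_t args)
{
    size_t total = (size_t)SENTINEL_SLOT + frame2 + frame1 + trampoline_frame
                 + (size_t)args * 8u + SAFETY_PAD;
    return (total + 15u) & ~(size_t)15u;
}

/* Hypothetical helper: highest 16-byte-aligned address inside the buffer. */
static uintptr_t fake_stack_top(const uint8_t *buf, size_t len)
{
    assert(buf != NULL && len >= 16u);
    return ((uintptr_t)buf + len) & ~(uintptr_t)15u;
}
```

Rounding the size up and the top address down keeps the aligned top inside the allocation even when the allocator returns a base that is only 8-byte aligned.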
Why the buffer matters in Go
Go's runtime grows goroutine stacks dynamically. A large SUB SP, imm inside the trampoline can overflow stack.lo, or worse, produce a valid stack that the GC scanner then tries to walk — finding planted return addresses, treating them as Go frames, and crashing with "runtime: unreachable". The pre-allocated heap buffer sidesteps both issues:
// Pre-allocate once at Init; pin through GC via unsafe.Pointer arg
total := 8 + f2 + f1 + trampoline + 256
total = (total + 15) &^ 15
buf := make([]byte, total)
bufPin := unsafe.Pointer(&buf[0])
fakeStackTop := (uintptr(bufPin) + uintptr(len(buf))) &^ 15
Pass bufPin as an explicit arg so the GC keeps it alive for the syscall duration.
Why the buffer matters in Rust / C
- Rust: `#[naked]` / `global_asm!` with local `sub rsp, imm` blows through canaries and `-Z stack-check` instrumentation. Use a `Box<[u8; N]>` allocated in the caller and passed via `rdi`/`rsi`.
- C (mingw-w64): `__attribute__((naked))` + inline AT&T asm; use a file-scope `static __thread uint8_t buf[N]` (TLS-backed) or a heap buffer allocated once in `spoof_init`. `alloca` is unsafe here because it uses `_chkstk`, which generates CFG indirect calls.
The five implementation rules
These apply across C, C++, Rust, Go, and raw ASM.
R1. Resolve frame sizes at runtime. Hardcoding BaseThreadInitThunk+0x14 and RtlUserThreadStart+0x21 is fine (those offsets are stable since Win10 1809); hardcoding Frame1Size = 0x30 is not (it changed between 20H1 and 22H2). Always parse .pdata.
R2. Cascade gadget search across modules. Never commit to a single module. Order: wininet → user32 → kernelbase → ntdll (for SM); kernelbase → ntdll (for Draugr). Emit a log line on each fallback so you know which module won at runtime.
R3. Instrument the scanner in debug builds. Zero-match failures are ambiguous without counters. See the debug pattern in references/frame-math.md §Scanner Instrumentation.
R4. Invalidate the spoof context on init failure. Do not leave partial state; downstream callers must be able to check a single SPOOF_READY flag and fall back to unspoofed dispatch. Never "partially succeed".
R5. Strip the spoofer from release builds when you do not need it. A 500-line SilentMoonwalk with 4 frames and cascade logic is a strong detection target by itself — string constants, control-flow patterns, and .pdata scans are all observable. If the binary can run backed-on-disk in a legitimate PE, skip the spoof. See the minimalism principle in edr-evasion.
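Rule R4 (all-or-nothing initialization behind one readiness flag) can be sketched independently of any scanner or trampoline. Everything in the block below is hypothetical: the `spoof_state_t` layout, the function names, and the choice of which fields must be non-zero are assumptions for illustration, not the layout used in the reference files.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical state record; `ready` plays the role of SPOOF_READY. */
typedef struct {
    uintptr_t gadget;      /* resolved gadget address, 0 if unresolved */
    uint32_t  frame1_size; /* runtime-resolved frame sizes */
    uint32_t  frame2_size;
    bool      ready;
} spoof_state_t;

typedef enum { DISPATCH_PLAIN, DISPATCH_SPOOFED } dispatch_path_t;

/* Commit the resolved values only if every one of them is valid;
 * otherwise wipe the whole record so no partial state survives. */
static bool spoof_state_commit(spoof_state_t *st, uintptr_t gadget,
                               uint32_t frame1, uint32_t frame2)
{
    assert(st != NULL);
    if (gadget == 0 || frame1 == 0 || frame2 == 0) {
        memset(st, 0, sizeof *st);
        return false;
    }
    st->gadget = gadget;
    st->frame1_size = frame1;
    st->frame2_size = frame2;
    st->ready = true;
    return true;
}

/* Callers check the single flag and fall back to unspoofed dispatch. */
static dispatch_path_t choose_dispatch(const spoof_state_t *st)
{
    return (st != NULL && st->ready) ? DISPATCH_SPOOFED : DISPATCH_PLAIN;
}
```

Wiping the record on failure means a stale gadget address from an earlier successful init can never be paired with a cleared flag.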
Languages — what changes
C / C++ (mingw-w64 or MSVC)
- `__attribute__((naked))` function with AT&T inline asm (GNU) or `.code` block (MASM with MSVC)
- Context struct: `#pragma pack(push, 8)` → fixed field order; offsets referenced in asm as `0(%rdi)`, `8(%rdi)`, …
- Prefer mingw-w64 over MSVC for spoofers: no `_chkstk` injection on large stack frames, predictable codegen
- Link with `-nostdlib -fno-ident -fno-asynchronous-unwind-tables` so your own `.pdata` does not confuse investigators reversing your loader
See references/lang-c-rust-go.md for a ready-to-compile Draugr trampoline in mingw-w64 AT&T syntax.
Rust
- `#[naked]` (stable as of Rust 1.88) or `global_asm!` for the trampoline
- `#[repr(C)]` on the context struct, never `#[repr(Rust)]`
- `no_std + no_main` for implant builds; link with `-C link-args=/NODEFAULTLIB`
- Caveat: LLVM aggressively allocates RBX across inline asm blocks. Always list `rbx` in clobbers, or use `options(noreturn)` + a tail call to the next phase.
See references/lang-c-rust-go.md.
Go
- Plan 9 ASM syntax (`.s` files), one per architecture. See `draugr_spoof_x64.s` template in the reference file.
- Frame size `$0-N`: always `$0` (no local frame). `N` = size of args passed from Go (sum of typed-arg sizes rounded to 8).
- `BYTE $0x90` NOP scattered between instructions: not decorative. Plan 9 ASM's Go assembler reorders "optimizable" sequences; the NOPs are padding to keep the assembler from merging or eliminating instructions that look redundant to it but are necessary for the spoof.
- Never touch `g` (`GS:0x30` on Windows) in the trampoline. The Go runtime's thread-local lookup needs it intact for goroutine scheduling on return.
See references/lang-c-rust-go.md.
Integration with indirect syscall dispatchers
A stack spoofer does not resolve SSNs or find syscall;ret gadgets — that is the indirect-syscall skill's job. The integration point is a small interface:
spoof_trampoline(ssn: u16, syscall_ret_addr: *const u8, ctx: *const SpoofContext, args...) -> NTSTATUS
Where the caller resolves (ssn, syscall_ret_addr) via RecycleGate / Hell's / FreshyCalls, and the spoof trampoline dispatches the actual syscall;ret through the spoofed stack. Loading one skill does not require the other, but production loaders combine both. The layering is:
high-level wrapper
└─ indirect_syscall.execute(ssn, gadget_addr, args…)
└─ if (spoof_ctx != 0 && spoof_dispatch != NULL):
spoof_dispatch(ssn, gadget_addr, spoof_ctx, args…) ← spoof trampoline
else:
direct_indirect_syscall(ssn, gadget_addr, args…) ← plain trampoline
See indirect-syscall/SKILL.md for the SSN side of this interface.
Diagnostic workflow (when init fails)
The failure diagnosis sequence, from most common to least:
1. `FF23_total == 0` → target module has been stripped of gadgets (Win11 24H2 kernel32.dll). Add another module to the cascade.
2. `fs_zero` dominates → SAVE_NONVOL filter is rejecting the scanner's inventory. Verify `UWOP_SAVE_NONVOL` handling: `max_save_offset >= total_alloc` is the rejection criterion; off-by-one here eats half the population.
3. `below_min` dominates, `best_belowmin_fs == 0x70` → threshold too high. Compute actual required frame for your syscall's arg count; lower `MIN_JMP_RBX` accordingly.
4. `eclipse_fail == FF23_total` → no `E8` byte at `gadget - 5`. On Win11 22H2+ this is expected for kernelbase. Cascade through `wininet` / `user32` first, then drop Eclipse for kernelbase last-resort.
5. Init succeeds but runtime crash → buffer too small. Recompute: `8 + frame2 + frame1 + trampoline_frame + args*8 + 0x100` padding, align to 16.
6. Unwinder sees "broken" stack → frame sizes mismatch between your plant and the real UNWIND_INFO. Re-read `.pdata` for the retaddr, not the function entry.
Full diagnostic script + instrumentation pattern in references/frame-math.md §Diagnosing Init Failures.
Resources
- references/frame-math.md: `calc_frame_size` algorithm, SAVE_NONVOL safety filter, gadget scanner with instrumentation, Win11 22H2+ empirical inventory, diagnosing init failures
- references/lang-c-rust-go.md: per-language trampoline patterns (mingw-w64 AT&T asm, Rust `global_asm!`, Go Plan 9), context-struct layout rules, buffer-management patterns, interop caveats
- Start with `references/frame-math.md`; bad unwind math invalidates every language-specific trampoline.