esp32-s3-pie-simd-optimization - SKILL.md Agent Skill

name: ESP32-S3 PIE SIMD Optimization description: > ESP32-S3 Processor Instruction Extensions (PIE) SIMD reference for accelerating C code. Use when converting scalar C loops into PIE SIMD assembly (EE.* instructions). Covers 128-bit QR registers, vector arithmetic, MAC, non-aligned data, FFT, and common patterns. Target chip: ESP32-S3 (Xtensa LX7 with TIE extensions).

ESP32-S3 PIE SIMD Quick Reference

REGISTERS

General Purpose

Register	Count	Width	Use
AR (a0-a15)	16	32-bit	Address/data, a1=sp, a2-a7=args, a8-a15=addr operands
QR (q0-q7)	8	128-bit	SIMD vectors: 16×8-bit, 8×16-bit, or 4×32-bit lanes
FR (f0-f15)	16	32-bit	IEEE754 float

Special Registers (PIE-specific)

Register	Width	Access via	Purpose
SAR	6-bit	RSR/WUR.SAR	Mul result right-shift amount (bits)
SAR_BYTE	4-bit	RUR/WUR.SAR_BYTE	Byte shift amount for SRC.Q
ACCX	40-bit	RUR/WUR ACCX_0,ACCX_1	Scalar accumulator (sum of all lane products)
QACC_H	160-bit	RUR/WUR QACC_H_0..4	Per-lane accumulator high (8×20-bit or 4×40-bit)
QACC_L	160-bit	RUR/WUR QACC_L_0..4	Per-lane accumulator low
FFT_BIT_WIDTH	4-bit	WUR.FFT_BIT_WIDTH	Bit-reversal width (0→3bit ... 7→10bit)
UA_STATE	128-bit	RUR/WUR UA_STATE	Unaligned load persistent state

Special Register Access

WSR.SAR aN          ; write SAR
RSR.SAR aN          ; read SAR
WUR.ACCX_0 aN       ; write ACCX[31:0]
WUR.ACCX_1 aN       ; write ACCX[39:32]  (lower 8 bits of aN used)
RUR.ACCX_0 aN       ; read ACCX[31:0]
RUR.ACCX_1 aN       ; read ACCX[39:32] → aN[7:0]
WUR.SAR_BYTE aN      ; write SAR_BYTE
RUR.SAR_BYTE aN      ; read SAR_BYTE
WUR.QACC_H_0 aN      ; write QACC_H segment 0 (bits 31:0)
RUR.QACC_H_0 aN      ; read QACC_H segment 0 (bits 31:0)
; QACC_H segments: _0(31:0) _1(63:32) _2(95:64) _3(127:96) _4(159:128)
; QACC_L same segment layout
EE.ZERO.ACCX         ; ACCX = 0
EE.ZERO.QACC         ; QACC_H = 0, QACC_L = 0
EE.ZERO.Q qN         ; qN = 0

ALIGNMENT RULES (CRITICAL)

PIE loads/stores force address low bits to 0:

128-bit access: low 4 bits forced to 0 (16-byte aligned)
64-bit access: low 3 bits forced to 0 (8-byte aligned)
32-bit access: low 2 bits forced to 0 (4-byte aligned)
16-bit access: low 1 bit forced to 0 (2-byte aligned)

Always declare buffers with __attribute__((aligned(16))) or use heap_caps_aligned_alloc(16, size, MALLOC_CAP_DEFAULT).

Unaligned Load Pattern (3 instructions)

EE.LD.128.USAR.IP  q0, aN, 16    ; load aligned chunk, save offset→SAR_BYTE, aN+=16
EE.VLD.128.IP      q1, aN, 16    ; load next aligned chunk, aN+=16
EE.SRC.Q           q2, q0, q1    ; {q1,q0} >> SAR_BYTE*8 → q2 (lower 128 bits)

INSTRUCTION CLASSIFICATION BY OPERATION

1. LOAD (Memory → QR / QACC)

EE.VLD.128.IP       qu, as, imm16    ; qu=load128(aligned(as)), as+=sext(imm16)*16
EE.VLD.128.XP       qu, as, ad       ; qu=load128(aligned(as)), as+=ad
EE.LD.128.USAR.IP   qu, as, imm16    ; SAR_BYTE=as[3:0], qu=load128(aligned(as)), as+=imm16
EE.LD.128.USAR.XP   qu, as, ad       ; SAR_BYTE=as[3:0], qu=load128(aligned(as)), as+=ad
EE.VLD.H.64.IP      qu, as, imm8     ; qu[127:64]=load64(aligned(as)), as+=imm8*8
EE.VLD.L.64.IP      qu, as, imm8     ; qu[63:0]=load64(aligned(as)), as+=imm8*8
EE.VLDBC.8          qu, as           ; qu = broadcast(load8(as)), as+=1
EE.VLDBC.16         qu, as           ; qu = broadcast(load16(aligned(as))), as+=2
EE.VLDBC.32         qu, as           ; qu = broadcast(load32(aligned(as))), as+=4
EE.LDQA.S8.128.IP   as, imm16        ; sign-extend 16×8bit→16×20bit into QACC_H+L, as+=imm16
EE.LDQA.S16.128.IP  as, imm16        ; sign-extend 8×16bit→8×40bit into QACC_H+L, as+=imm16
; IP = Immediate Post-increment, XP = register post-increment
; imm16 for 128-bit: step = sext(imm16[7:0])*16, range -2048..2032

2. STORE (QR → Memory)

EE.VST.128.IP       qv, as, imm16    ; store128(qv, aligned(as)), as+=sext(imm16)*16
EE.VST.128.XP       qv, as, ad       ; store128(qv, aligned(as)), as+=ad
EE.VST.H.64.IP      qv, as, imm8     ; store64(qv[127:64], aligned(as)), as+=imm8*8
EE.VST.L.64.IP      qv, as, imm8     ; store64(qv[63:0], aligned(as)), as+=imm8*8

3. VECTOR ARITHMETIC (lane-wise, saturated)

EE.VADDS.S8         qa, qx, qy       ; qa[i]=saturate_s8(qx[i]+qy[i])  for i in 0..15
EE.VADDS.S16        qa, qx, qy       ; qa[i]=saturate_s16(qx[i]+qy[i]) for i in 0..7
EE.VADDS.S32        qa, qx, qy       ; qa[i]=saturate_s32(qx[i]+qy[i]) for i in 0..3
EE.VSUBS.S8         qa, qx, qy       ; qa[i]=saturate_s8(qx[i]-qy[i])
EE.VSUBS.S16        qa, qx, qy       ; qa[i]=saturate_s16(qx[i]-qy[i])
EE.VSUBS.S32        qa, qx, qy       ; qa[i]=saturate_s32(qx[i]-qy[i])
; Combined Load+Arithmetic (qu=load result, qa=arithmetic result)
EE.VADDS.S8.LD.INCP  qu, as, qa, qx, qy  ; as+=16
EE.VSUBS.S8.LD.INCP  qu, as, qa, qx, qy
; Combined Store+Arithmetic (store qv, compute qa)
EE.VADDS.S8.ST.INCP  qv, as, qa, qx, qy  ; store qv→aligned(as), as+=16

4. VECTOR MULTIPLY (lane-wise, SAR right-shift)

EE.VMUL.S8          qz, qx, qy       ; qz[i]=(s8*qx[i] * s8*qy[i]) >> SAR[5:0]  for i=0..15
EE.VMUL.S16         qz, qx, qy       ; qz[i]=(s16*qx[i] * s16*qy[i]) >> SAR[5:0] for i=0..7
EE.VMUL.U8          qz, qx, qy       ; unsigned variant
EE.VMUL.U16         qz, qx, qy       ; unsigned variant
; SET SAR BEFORE MULTIPLY:
;   movi aN, shift_amount
;   wsr.sar aN

5. COMPLEX MULTIPLY (16-bit, 4 selectable 32-bit complex pairs from qy)

EE.CMUL.S16         qz, qx, qy, sel4
; sel4 selects which 32-bit segment of qy: 0→[31:0], 1→[63:32], 2→[95:64], 3→[127:96]
; qz[2*i+1:2*i] = complex_mul(qx[2*i+1:2*i], qy[sel4*32+31:sel4*32]) >> SAR

6. MULTIPLY-ACCUMULATE

ACCX mode (scalar sum of all lane products):

EE.VMULAS.S8.ACCX    qx, qy    ; ACCX += Σ(qx[i] * qy[i]) for i=0..15, saturated to 40-bit
EE.VMULAS.S16.ACCX   qx, qy    ; ACCX += Σ(qx[i] * qy[i]) for i=0..7
EE.VMULAS.U8.ACCX    qx, qy    ; unsigned
EE.VMULAS.U16.ACCX   qx, qy    ; unsigned
; Read raw result: RUR.ACCX_0 aN (low 32), RUR.ACCX_1 aN (high 8)
EE.SRS.ACCX rd, rs, sel        ; arithmetic right shift ACCX by rs[5:0], write 40-bit
                               ;   shifted result back to ACCX, and write it saturated
                               ;   to a 32-bit signed value min(max(ACCX>>rs,-2^31),2^31-1)
                               ;   into general register rd (sel selects shift source, 0 in esp-dl)

QACC mode (per-lane accumulation, 20-bit or 40-bit per lane):

EE.VMULAS.S8.QACC    qx, qy    ; QACC[i] += qx[i]*qy[i], 20-bit signed saturated per lane
EE.VMULAS.S16.QACC   qx, qy    ; QACC[i] += qx[i]*qy[i], 40-bit signed saturated per lane
EE.VMULAS.U8.QACC    qx, qy    ; unsigned
EE.VMULAS.U16.QACC   qx, qy    ; unsigned
; Extract QACC→QR with right-shift+sat:
EE.SRCMB.S8.QACC     qu, as, 0 ; 16×20-bit→16×8-bit, R-shift SAR, saturate to s8→qu
EE.SRCMB.S16.QACC    qu, as, 0 ; 8×40-bit→8×16-bit, R-shift SAR, saturate to s16→qu

7. COMPARISON (result = 0xFF on true, 0 on false)

EE.VMAX.S8 / S16 / S32   qa, qx, qy    ; qa[i] = max(qx[i], qy[i])
EE.VMIN.S8 / S16 / S32   qa, qx, qy    ; qa[i] = min(qx[i], qy[i])
EE.VCMP.EQ.S8 / S16 / S32  qa, qx, qy  ; qa[i] = (qx[i]==qy[i]) ? 0xFF : 0
EE.VCMP.LT.S8 / S16 / S32  qa, qx, qy  ; qa[i] = (qx[i] < qy[i]) ? 0xFF : 0
EE.VCMP.GT.S8 / S16 / S32  qa, qx, qy  ; qa[i] = (qx[i] > qy[i]) ? 0xFF : 0
; Use ANDQ with mask to implement conditional selection (e.g. ReLU)

8. BITWISE

EE.ANDQ             qa, qx, qy    ; qa = qx & qy  (128-bit)
EE.ORQ              qa, qx, qy    ; qa = qx | qy
EE.XORQ             qa, qx, qy    ; qa = qx ^ qy
EE.NOTQ             qa, qx       ; qa = ~qx

9. SHIFT

EE.VSL.32           qa, qs        ; qa[i] = qs[i] << SAR[5:0]  (4×32-bit logical left)
EE.VSR.32           qa, qs        ; qa[i] = qs[i] >> SAR[5:0]  (4×32-bit arithmetic right)
EE.SRC.Q            qa, qs0, qs1  ; qa = ({qs1,qs0} >> SAR_BYTE*8)[127:0]
; SRC.Q used with LD.128.USAR for unaligned access

10. DATA MOVEMENT

MV.QR               qu, qs        ; qu = qs
EE.ZERO.Q           qa            ; qa = 0
EE.MOVI.32.A        qs, au, sel4  ; au = qs[sel4*32+31:sel4*32]  (extract 32-bit lane)
EE.MOVI.32.Q        qu, as, sel4  ; qu[sel4*32+31:sel4*32] = as  (insert 32-bit lane)

11. ACTIVATION

EE.VRELU.S8         qs, ax, ay    ; qs[i] = (qs[i]<=0) ? (qs[i]*ax[7:0])>>ay[4:0] : qs[i]
EE.VRELU.S16        qs, ax, ay    ; qs[i] = (qs[i]<=0) ? (qs[i]*ax[15:0])>>ay[5:0] : qs[i]
EE.VPRELU.S8        qz, qx, qy, ay  ; qz[i]=(qx[i]<=0) ? (qx[i]*qy[i])>>ay[4:0] : qx[i]
EE.VPRELU.S16       qz, qx, qy, ay  ; qz[i]=(qx[i]<=0) ? (qx[i]*qy[i])>>ay[5:0] : qx[i]

12. ZIP/UNZIP (interleave/deinterleave)

EE.VZIP.8 / 16 / 32   qs0, qs1    ; interleave qs0 and qs1 elements in-place
EE.VUNZIP.8 / 16 / 32  qs0, qs1   ; deinterleave qs0 and qs1 elements in-place

13. FFT-SPECIFIC

EE.FFT.R2BF.S16                qa0, qa1, qx, qy, sel2  ; radix-2 butterfly
EE.FFT.R2BF.S16.ST.INCP        qa0, qx, qy, as, sar4   ; butterfly + store sums
EE.FFT.CMUL.S16.LD.XP          qu, as, ad, qz, qx, qy, sel8  ; complex mul + load
EE.FFT.CMUL.S16.ST.XP          qx, qy, qv, as, ad, sel8, upd4, sar4  ; complex mul + store
EE.FFT.AMS.S16.LD.INCP         qu, as, qz, qz1, qx, qy, qm, sel2  ; FFT AMS + load
EE.FFT.AMS.S16.LD.INCP.UAUP    qu, as, qz, qz1, qx, qy, qm, sel2  ; AMS+unaligned load
EE.FFT.AMS.S16.LD.R32.DECP     qu, as, qz, qz1, qx, qy, qm, sel2  ; AMS+reversed32 load
EE.FFT.AMS.S16.ST.INCP         qv, qz1, at, as, qx, qy, qm, sel2  ; AMS + store
EE.FFT.VST.R32.DECP            qv, as, sar2           ; reversed32 store, as-=16
EE.BITREV                      qa, as                 ; bit-reverse (needs FFT_BIT_WIDTH)

C → PIE SIMD CONVERSION PATTERNS

Pattern A: Element-wise Arithmetic (Add/Sub/Mul)

// C scalar: int16_t out[i] = in0[i] + in1[i] for i in 0..N-1

→

; a2=out, a3=in0, a4=in1, a5=N
    srai    a5, a5, 3            ; N/8 (8 elements per 128-bit load)
    loopgtz a5, 0f
    ee.vld.128.ip   q0, a3, 16   ; load 8×int16_t
    ee.vld.128.ip   q1, a4, 16
    ee.vadds.s16    q2, q0, q1   ; 8 lane adds with saturation
    ee.vst.128.ip   q2, a2, 16
0:

Pattern B: Dot Product (Reduce Sum)

// C scalar: sum += a[i] * b[i]  (int8_t)

→

; Use ACCX for scalar accumulation
    ee.zero.accx
    srai    a4, a4, 4            ; N/16 (16 elements per load)
    loopgtz a4, 0f
    ee.vld.128.ip   q0, a2, 16
    ee.vld.128.ip   q1, a3, 16
    ee.vmulas.s8.accx  q0, q1    ; ACCX += Σall_lanes(q0[i]*q1[i])
0:
    rur.accx_0  a2               ; result low 32
    rur.accx_1  a3               ; result high 8

Pattern C: Multiply with Scale (Weighted Sum)

// C: out[i] = (in[i] * weight[i]) >> shift

→

    movi    a6, shift_amount
    wsr.sar a6                   ; SAR controls right-shift
    srai    a5, a5, 3
    loopgtz a5, 0f
    ee.vld.128.ip   q0, a3, 16
    ee.vld.128.ip   q1, a4, 16
    ee.vmul.s16     q2, q0, q1   ; multiply, >>SAR
    ee.vst.128.ip   q2, a2, 16
0:

Pattern D: Multiply-Accumulate per Lane (e.g. Quantized Conv)

// C: accum[i] += input[i] * weight[i]  (per-lane accumulation)

→

; Use QACC for per-lane accumulation
    ee.zero.qacc
    movi    a6, shift_amount
    wsr.sar a6
    srai    a5, a5, 4            ; 16×int8 per load
    loopgtz a5, 0f
    ee.vld.128.ip     q0, a3, 16
    ee.vld.128.ip     q1, a4, 16
    ee.vmulas.s8.qacc q0, q1     ; QACC[i] += q0[i]*q1[i] for i=0..15
0:
    ee.srcmb.s8.qacc  q2, a2, 0  ; extract: 20-bit→8-bit with >>SAR, saturate

Pattern E: ReLU Activation

// C: out[i] = in[i] > 0 ? in[i] : 0  (int16_t)

→

    ee.zero.q   q3               ; zeros vector
    srai    a5, a5, 3
    loopgtz a5, 0f
    ee.vld.128.ip    q0, a3, 16
    ee.vcmp.gt.s16   q1, q0, q3  ; mask = (q0 > 0) → 0xFFFF or 0
    ee.andq          q2, q0, q1  ; apply mask
    ee.vst.128.ip    q2, a2, 16
0:

Alternative using VRELU (faster, fewer ops):

; EE.VRELU.S16 qs, ax, ay: if qs[i]<=0 then qs[i]=(qs[i]*ax)>>ay else qs[i]=qs[i]
; For standard ReLU: ax=0 (multiply by 0), ay=0 (shift 0)
    movi    a6, 0
    srai    a5, a5, 3
    loopgtz a5, 0f
    ee.vld.128.ip    q0, a3, 16
    ee.vrelu.s16     q0, a6, a6  ; in-place ReLU: negative→0
    ee.vst.128.ip    q0, a2, 16
0:

Pattern F: PReLU / LeakyReLU

// C: out[i] = in[i] > 0 ? in[i] : in[i] * slope  (int16_t, slope quantized)

→

; Load slope once, broadcast to all lanes
    ee.vldbc.16      q3, a5      ; broadcast slope to 8 lanes
    srai    a6, a6, 3
    loopgtz a6, 0f
    ee.vld.128.ip    q0, a3, 16
    ee.vprelu.s16    q2, q0, q3, a7  ; q2[i]=(q0[i]<=0)?(q0[i]*q3[i])>>a7:q0[i]
    ee.vst.128.ip    q2, a2, 16
0:

Pattern G: Loop with Unaligned Load

// C: process int16_t array at arbitrary (non-16-byte-aligned) address

→

    ee.ld.128.usar.ip  q0, a3, 16    ; first unaligned load, offset→SAR_BYTE
    ee.vld.128.ip      q1, a3, 16    ; second aligned load
    srai    a5, a5, 3
    addi    a5, a5, -1               ; one iteration already consumed
    loopgtz a5, 0f
    ee.src.q           q2, q0, q1    ; align: q2 = ({q1,q0}>>SAR_BYTE*8)[127:0]
    ee.vld.128.ip      q0, a3, 16    ; next aligned chunk
    ; ... process q2 ...
    ; swap q0↔q1 for next iteration:
    ee.src.q           q2, q1, q0    ; use the newly loaded data
    ee.vld.128.ip      q1, a3, 16
    ; ... process q2 ...
0:

Pattern H: Bias Addition (broadcast scalar to vector)

// C: out[i] = in[i] + bias  (same bias for all elements)

→

    ee.vldbc.16      q3, a4      ; broadcast bias to all 8 lanes
    srai    a5, a5, 3
    loopgtz a5, 0f
    ee.vld.128.ip    q0, a3, 16
    ee.vadds.s16     q2, q0, q3
    ee.vst.128.ip    q2, a2, 16
0:

Pattern I: Max Pooling (2x1)

// C: out[i] = max(in[i*2], in[i*2+1])

→

    srai    a5, a5, 4            ; output has N/8 elements, input has N/4
    loopgtz a5, 0f
    ee.vld.128.ip   q0, a3, 16   ; load 8 pairs = 16 elements
    ee.vld.128.ip   q1, a3, 16
    ; Use VUNZIP to separate evens and odds, then VMAX
    ee.vunzip.16    q0, q1       ; q0=evens, q1=odds
    ee.vmax.s16     q2, q0, q1
    ee.vst.128.ip   q2, a2, 16
0:

LOOP STRUCTURES

Label Naming Convention (REQUIRED)

For branch/loop targets, use local labels: either a plain number (0:, 1:, referenced as 0f/0b) or a .L-prefixed name (.Lloop, .Lremainder). Do not use full descriptive labels like loop_start: / end_label:.

Numeric local labels keep tight inner loops compact and avoid name clashes.
.L labels stay local to the file (not emitted into the symbol table) and clearly mark internal jump targets.

; GOOD — numeric local label
    loopgtz a6, 0f
        ee.vmin.s16.ld.incp q0, a3, q2, q0, q1
        ee.vld.128.ip       q1, a4, 16
        ee.vst.128.ip       q2, a2, 16
    0:

; GOOD — .L local label
    bgez  a9, .Lleft_shift
.Lright_shift_loop:
    ; ...
    bnez  a5, .Lright_shift_loop
.Lleft_shift:

; BAD — full descriptive global-style labels
right_shift_loop:
    ; ...
    bnez a5, right_shift_loop

Zero-Overhead Loop (preferred)

; LOOPGTZ: enter loop body if aN > 0; decrements LCOUNT each iteration
    srai    a5, a5, 3            ; convert element count to iteration count
    loopgtz a5, 0f
    ; ... loop body (max ~56 instructions) ...
0:

Manual Loop (when zero-overhead loop constraints violated)

    srai    a5, a5, 3
    beqz    a5, 1f
.Lloop:
    ; ... loop body ...
    addi    a5, a5, -1
    bnez    a5, .Lloop
1:

PIPELINE & SCHEDULING

Data Dependencies (insert NOPs or reorder to avoid stalls)

QR load → use: 2 cycle latency (result ready in W stage, used in R stage)
QR multiply → use: result ready in E stage (same as arithmetic), 0-1 cycle
SAR write → use: requires ISYNC or at least 1 instruction gap

Prefer LD+OP fused instructions when available

Instead of separate load + compute:

; Sub-optimal: 3 instructions
ee.vld.128.ip  q0, a3, 16
ee.vld.128.ip  q1, a4, 16
ee.vadds.s16   q2, q0, q1

; Optimal when loading new data for next iteration:
ee.vadds.s16.ld.incp  q0, a3, q2, q4, q5  ; computes q2 while loading q0 for next iter

KEY CONSTRAINTS & GOTCHAS

SAR must be set before VMUL/VMULAS/CMUL — forgotten SAR causes wrong scaling.
Only 8 QR registers (q0-q7) — register pressure is real. Plan allocation carefully.
VLD/VST force address alignment — unaligned addresses silently read/write wrong data.
No 32-bit multiply — only 8-bit and 16-bit vector multiply. Use multiple 16-bit ops for 32-bit.
Zero-overhead loop limit — max ~56 instructions in loop body, no nested zero-overhead loops.
ACCX is 40-bit — saturated to [-2^39, 2^39-1]. Read with RUR ACCX_0 + ACCX_1.
QACC is per-lane — 16 lanes of 20-bit for S8/U8, 8 lanes of 40-bit for S16/U16.
VADDS/VSUBS saturate — use when overflow is a concern; VMUL shifts and truncates (no sat).
EE.VRELU modifies input register in-place — qs is both source and destination.
Don't mix EE. and Xtensa native load/store on same address* without MEMW/DSYNC if caching matters.

QUICK DECISION TABLE: Which Instruction to Use

C Operation	Data Type	Best PIE Instruction(s)
`c = a + b` (element-wise)	int8/int16/int32	EE.VADDS.S8/S16/S32
`c = a - b` (element-wise)	int8/int16/int32	EE.VSUBS.S8/S16/S32
`c = a * b >> shift`	int8/int16	EE.VMUL.S8/S16 (set SAR first)
`sum += a[i] * b[i]` (scalar result)	int8/int16	EE.VMULAS.S8/S16.ACCX
`accum[i] += a[i] * b[i]` (per-lane)	int8/int16	EE.VMULAS.S8/S16.QACC → EE.SRCMB
`c = max(a, b)`	int8/int16/int32	EE.VMAX.S8/S16/S32
`c = min(a, b)`	int8/int16/int32	EE.VMIN.S8/S16/S32
`c = a > 0 ? a : 0` (ReLU)	int16	EE.VRELU.S16 qs, zero, zero
`c = a > 0 ? a : a*slope` (PReLU)	int8/int16	EE.VPRELU.S8/S16
`c = a > b ? 0xFF : 0` (mask)	int8/int16	EE.VCMP.GT.S8/S16
`c = a & b`	any	EE.ANDQ
`c = a	b`	any
`c = broadcast(scalar)`	int8/int16/int32	EE.VLDBC.8/16/32
unaligned load	any	EE.LD.128.USAR.IP + EE.SRC.Q
`c = a << n` (per 32-bit lane)	int32	EE.VSL.32 (set SAR first)
`c = a >> n` (per 32-bit lane)	int32	EE.VSR.32 (set SAR first)
clear accumulator	-	EE.ZERO.ACCX / EE.ZERO.QACC
complex multiply	int16	EE.CMUL.S16
Radix-2 FFT butterfly	int16	EE.FFT.R2BF.S16

FUNCTION TEMPLATE

    .align  4
    .text
    .global my_kernel_name
    .type   my_kernel_name, @function
my_kernel_name:
    # a2: output pointer (int16_t*, 16-byte aligned)
    # a3: input0 pointer  (int16_t*, 16-byte aligned)
    # a4: input1 pointer  (int16_t*, 16-byte aligned)
    # a5: element count

    entry   sp, 64              # allocate stack frame

    # Pre-loop setup
    ; set SAR if multiplying
    ; zero accumulators if accumulating
    ; broadcast scalars if needed

    srai    a5, a5, 3           # count → iterations (8 × int16 per 128-bit)
    loopgtz a5, 0f

    ee.vld.128.ip   q0, a3, 16  # load input0
    ee.vld.128.ip   q1, a4, 16  # load input1
    ee.vadds.s16    q2, q0, q1  # compute
    ee.vst.128.ip   q2, a2, 16  # store output
0:
    retw

SUMMARY: Conversion Workflow

Identify the C loop operation and data type (int8/int16/int32).
Check alignment — ensure buffers are 16-byte aligned; use unaligned load pattern if not.
Select the right PIE instruction from the decision table above.
Set SAR if the instruction uses it (multiply, shift).
Zero accumulators if using ACCX or QACC.
Compute iteration count: N / (128/bits_per_element). E.g., int16 → N/8.
Write the zero-overhead loop body using loopgtz.
Extract accumulator results (RUR.*) after the loop if needed.
Watch register pressure — you only have q0-q7, plan which hold constants, which are temporaries.

REFERENCE DOCUMENTS

When you need to query complete instruction encodings, all instruction variants, detailed bit fields of special registers, exceptions/interrupts, or Xtensa native instructions, refer to the following reference files:

Document	Path	Content
PIE Instruction Set Complete Reference	references/pie_instruction_set.md	All EE.* instructions with encoding, pseudocode, SAR/QACC/ACCX special register access, unaligned data handling, pipeline hazards
Xtensa ISA Reference	references/xtensa_isa.md	AR/FR/BR registers, all Xtensa native instructions (load/store/branch/call/loop/float/cache/TLB), exception vectors, special register map

Reading Strategy: This skill file provides quick lookup tables and conversion templates. Please read the corresponding reference files when you need the following information:

Exact bit encoding of instructions (instruction word encoding)
Complete operation pseudocode (e.g., detailed formulas for all 16 lanes)
Full segment access list for special registers (e.g., QACC_H_0~~4, QACC_L_0~~4)
Xtensa native instructions (instructions without EE.* prefix)
Exception handling, TLB, cache management