name: esp32p4-simd
description: ESP32-P4 PIE (Processor Instruction Extensions) SIMD instruction set reference and optimization guide. Use when working with ESP32-P4 custom AI/DSP SIMD instructions in assembly or intrinsic form, converting scalar code to vectorized SIMD code, or implementing neural network operators for esp-dl. Covers read/write, data exchange, arithmetic, comparison, bitwise logical, shift, and FFT-dedicated instructions with 128-bit QR vector registers.
ESP32-P4 SIMD (PIE) Instruction Set
Architecture Overview
The ESP32-P4 HP core includes a custom PIE SIMD extension supporting 128-bit vector operations on 8-bit, 16-bit, and 32-bit data elements. It integrates data transfer into arithmetic instructions and supports non-aligned 128-bit vector data access.
Key Features
- 128-bit general-purpose vector registers (8 QR registers)
- 16 x 8-bit multipliers, 8 x 16-bit multipliers
- Two 256-bit accumulators (QACC_H, QACC_L) plus a 40-bit accumulator (XACC)
- Fused load-arithmetic and arithmetic-store instructions
- Configurable rounding and saturation modes
- Hardware misaligned access support
Registers
General-Purpose Registers (AR)
Only 16 of the 32 RISC-V general-purpose registers can be used as operands of PIE instructions:
| Registers | Description |
|---|---|
| x8-x15 (s0-s1, a0-a5) | Callee-saved and argument registers |
| x24-x31 (s8-s11, t3-t6) | Additional saved and temporary registers |
| x0-x7, x16-x23 | NOT available for PIE instructions |
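The register restriction above can be checked mechanically, for example in a code generator or a review script. The following is a minimal sketch in C; the helper name `pie_reg_usable` is hypothetical and only encodes the two ranges listed in the table.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper (not part of any SDK): returns true when RISC-V
 * register x<n> falls in one of the ranges the table marks as usable by
 * PIE instructions, i.e. x8-x15 or x24-x31. */
static bool pie_reg_usable(int n)
{
    assert(n >= 0 && n <= 31);   /* only x0..x31 exist */
    return (n >= 8 && n <= 15) || (n >= 24 && n <= 31);
}
```

For instance, `pie_reg_usable(10)` (a0) is true, while `pie_reg_usable(16)` (a6) is false, so a6 can only hold values consumed by ordinary scalar instructions.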
Vector Registers (QR)
Eight 128-bit vector registers q0-q7. Each can hold:
- 16 x 8-bit elements
- 8 x 16-bit elements
- 4 x 32-bit elements
| Register | Bits | Access | Usage |
|---|---|---|---|
| q0-q7 | 128 | R/W | Vector operands and results |
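To make the three lane layouts concrete, a QR register can be modelled on a host machine as a 16-byte union. This is an illustrative sketch only: `qr_model_t` and `qr_lanes` are hypothetical names, not types from esp-dl or the SDK.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-side model of one 128-bit QR register. */
typedef union {
    int8_t  s8[16];    /* 16 x 8-bit lanes  */
    int16_t s16[8];    /*  8 x 16-bit lanes */
    int32_t s32[4];    /*  4 x 32-bit lanes */
    uint8_t bytes[16]; /* raw byte view     */
} qr_model_t;

/* Number of lanes a 128-bit register holds for a given element width. */
static int qr_lanes(int elem_bits)
{
    assert(elem_bits == 8 || elem_bits == 16 || elem_bits == 32);
    return 128 / elem_bits;
}
```

The same lane counts (16, 8, 4) determine how many elements one vector instruction processes.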
Special Registers
| Register | Bits | Access | Purpose |
|---|---|---|---|
| SAR | 6 | R/W | Shift amount for multiply-shift and vector shift instructions |
| SAR_BYTE | 4 | R/W | Byte shift amount for non-aligned data handling |
| QACC_H | 256 | R/W | High 256-bit accumulator (8 x 32-bit for 8b MAC, 4 x 64-bit for 16b MAC) |
| QACC_L | 256 | R/W | Low 256-bit accumulator |
| XACC | 40 | R/W | 40-bit scalar accumulator for dot-product style accumulation |
| FFT_BIT_WIDTH | 4 | R/W | Bit width configuration for ESP.BITREV (range 3-10 bits) |
| UA_STATE | 128 | R/W | Unaligned state register for FFT instructions |
| CFG | 32 | R/W | Configuration register (rounding mode, saturation enable, misaligned access) |
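FFT_BIT_WIDTH selects how many index bits take part in bit reversal. The sketch below shows the conventional bit-reversal permutation for a configurable width in the 3-10 range given in the table. It is an assumption that ESP.BITREV follows this textbook definition; the function name `bitrev_model` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the lowest `width` bits of `index` (textbook FFT bit reversal).
 * Assumed, not verified, to match what ESP.BITREV computes per element. */
static uint16_t bitrev_model(uint16_t index, unsigned width)
{
    assert(width >= 3 && width <= 10);   /* range stated for FFT_BIT_WIDTH */
    uint16_t out = 0;
    for (unsigned i = 0; i < width; i++) {
        /* Shift the result left and append bit i of the input. */
        out = (uint16_t)((out << 1) | ((index >> i) & 1u));
    }
    return out;
}
```

With a width of 3 (an 8-point FFT), index 1 maps to 4 and index 6 maps to 3.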
CFG Register Fields
| Field | Bits | Description |
|---|---|---|
| vxsat_en | CFG[9] | Enable saturation status |
| vxrm | CFG[8:7:4:3] | 4-bit rounding mode (0=FLOOR, 1=CEILING, 2=UP, 3=DOWN, 4=HALF_UP, 5=HALF_DOWN, 6=HALF_EVEN, 7=UNNECESSARY) |
| rm_exc | CFG[2] | Exception status for UNNECESSARY mode (RO) |
| vxsat | CFG[1] | Saturation status (RO, cleared on CFG read) |
| mis_ld | CFG[7] | Enable hardware handling of misaligned loads |
| mis_st | CFG[3] | Enable hardware handling of misaligned stores |
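The vxrm mode names match the usual rounding-mode vocabulary (FLOOR, CEILING, HALF_EVEN, and so on). As a host-side reference for checking expected results, the sketch below rounds a value divided by a power of two under each mode. The semantics are assumed from the mode names rather than taken from the hardware manual, and `vxrm_model_t` and `round_shift_model` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mode numbers follow the vxrm encoding listed in the table above. */
typedef enum {
    RM_FLOOR = 0, RM_CEILING = 1, RM_UP = 2, RM_DOWN = 3,
    RM_HALF_UP = 4, RM_HALF_DOWN = 5, RM_HALF_EVEN = 6, RM_UNNECESSARY = 7
} vxrm_model_t;

/* Divide v by 2^shift and round according to `mode` (assumed semantics).
 * *inexact is set when non-zero bits were discarded, which is the case an
 * UNNECESSARY-style mode would report as an exception. */
static int64_t round_shift_model(int64_t v, unsigned shift,
                                 vxrm_model_t mode, bool *inexact)
{
    assert(shift < 62);
    const int64_t d = (int64_t)1 << shift;
    const int64_t q = v / d;                 /* truncates toward zero */
    const int64_t r = v % d;                 /* same sign as v */
    const int64_t away = (v < 0) ? -1 : 1;   /* step away from zero */
    const int64_t twice = 2 * (r < 0 ? -r : r);
    if (inexact) *inexact = (r != 0);
    if (r == 0) return q;
    switch (mode) {
    case RM_FLOOR:     return (r < 0) ? q - 1 : q;
    case RM_CEILING:   return (r > 0) ? q + 1 : q;
    case RM_UP:        return q + away;
    case RM_DOWN:      return q;
    case RM_HALF_UP:   return (twice >= d) ? q + away : q;
    case RM_HALF_DOWN: return (twice > d) ? q + away : q;
    case RM_HALF_EVEN:
        if (twice != d) return (twice > d) ? q + away : q;
        return (q % 2 != 0) ? q + away : q;  /* tie: pick the even result */
    case RM_UNNECESSARY:
    default:           return q;             /* caller inspects *inexact */
    }
}
```

Comparing SIMD output against such a model for a handful of tie cases (for example 5 / 2 and 7 / 2) is a quick way to confirm which mode a kernel actually ran with.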
SAR Usage Constraints
- Vector shifts (ESP.VSR.32, ESP.VSL.32): Uses lower 5 bits as shift amount
- Multiplications (ESP.VMUL.*, ESP.CMUL.*, ESP.FFT.AMS.*): Uses full SAR value for right-shift of intermediate results
- Set SAR via: esp.movx.w.sar or esp.movx.w.cfg with appropriate vxrm
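The effect of SAR on a vector multiply can be illustrated with a scalar reference model. The sketch below assumes each 16-bit lane is multiplied at full precision, shifted right by SAR with floor rounding, and saturated to the int16 range. The rounding and saturation behaviour are assumptions (the real result depends on the CFG settings), and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define LANES_S16 8   /* 128 bits / 16 bits per lane */

/* Arithmetic right shift with floor semantics, written so that it does not
 * rely on implementation-defined shifts of negative values. */
static int64_t asr64(int64_t v, unsigned n)
{
    return (v >= 0) ? (v >> n) : ~((~v) >> n);
}

/* Clamp to the int16 range (assumed saturating behaviour). */
static int16_t sat_s16(int64_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Scalar model of a lane-wise 16-bit multiply followed by a SAR shift. */
static void vmul_s16_model(int16_t qz[LANES_S16],
                           const int16_t qx[LANES_S16],
                           const int16_t qy[LANES_S16],
                           unsigned sar)
{
    assert(sar < 64);   /* SAR is a 6-bit register */
    for (int i = 0; i < LANES_S16; i++) {
        qz[i] = sat_s16(asr64((int64_t)qx[i] * qy[i], sar));
    }
}
```

With SAR = 4, the lane product 100 * 200 = 20000 becomes 1250; with SAR = 0, 32767 * 32767 clamps to 32767 in this model.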
Instruction Categories
Instructions are organized into these categories. See references/instructions.md for the complete reference:
- Read Instructions - Load 128-bit/64-bit/broadcast/unaligned data from memory to QR registers
- Write Instructions - Store QR/accumulator data to memory
- Data Exchange Instructions - Move data between AR/QR registers, zip/unzip, sign/zero extend
- Arithmetic Instructions - Vector add/sub/mul, MAC operations, complex multiply, ReLU, clamping
- Comparison Instructions - Vector min/max, compare equal/less-than/greater-than, saturation
- Bitwise Logical Instructions - AND/OR/XOR/NOT on 128-bit QR registers
- Shift Instructions - Vector shifts, spliced shifts, immediate/register-controlled shifts
- FFT Dedicated Instructions - Radix-2 butterfly, complex multiply, bit-reverse, real FFT
Quick Instruction Reference
Most Common Instructions (from esp-dl patterns)
| Instruction | Description |
|---|---|
| esp.vld.128.ip qN, rs, imm | Load 128-bit, addr += imm |
| esp.vld.128.xp qN, rs1, rs2 | Load 128-bit, addr += rs2 |
| esp.vldbc.16.ip qN, rs, imm | Broadcast load 16-bit to 128-bit |
| esp.vst.128.ip qN, rs, imm | Store 128-bit, addr += imm |
| esp.vadd.s16 qz, qx, qy | Vector add 16-bit |
| esp.vadd.s8 qz, qx, qy | Vector add 8-bit |
| esp.vsub.s16 qz, qx, qy | Vector subtract 16-bit |
| esp.vmul.s16 qz, qx, qy | Vector multiply 16-bit (with SAR shift) |
| esp.vmul.s8 qz, qx, qy | Vector multiply 8-bit (with SAR shift) |
| esp.vmulas.s16.qacc qx, qy | Vector MAC 16-bit to QACC |
| esp.vmulas.s8.qacc qx, qy | Vector MAC 8-bit to QACC |
| esp.vmulas.s16.xacc qx, qy | Vector MAC 16-bit to XACC (dot product) |
| esp.vsmulas.s16.qacc qx, qy, sel | Scalar-vector MAC 16-bit to QACC |
| esp.vmax.s16 qz, qx, qy | Vector max 16-bit |
| esp.vmin.s16 qz, qx, qy | Vector min 16-bit |
| esp.vcmp.eq.s16 qz, qx, qy | Vector compare equal 16-bit |
| esp.orq qz, qx, qy | Bitwise OR 128-bit |
| esp.andq qz, qx, qy | Bitwise AND 128-bit |
| esp.srcmb.s16.qacc qx, shift | Shift QACC right and move to QR |
| esp.zero.qacc | Clear QACC_H and QACC_L |
| esp.zero.xacc | Clear XACC |
| esp.zero.q qN | Clear QR register |
| esp.movx.w.sar rs | Write SAR register |
| esp.movx.r.sar.bytes rs | Read SAR_BYTE |
Optimization Workflow
When converting scalar functions to SIMD:
- Check data alignment: Use 16-byte aligned data when possible (faster). Handle unaligned with esp.ld.128.usar.ip + esp.src.q / esp.src.q.qup
- Set SAR before multiply instructions: esp.movx.w.sar rs to configure output shift
- Process in 128-bit chunks: Loop count = total_elements / elements_per_128b (8 for 16-bit, 16 for 8-bit, 4 for 32-bit)
- Use fused load-arithmetic instructions where possible to reduce instruction count:
  - esp.vadd.s16.ld.incp qz, qx, qy, rs, imm - add and load next
  - esp.vmul.s16.ld.incp qz, qx, qy, rs, imm - multiply and load next
- Use QACC/XACC for accumulation chains: Initialize with esp.zero.qacc, accumulate with esp.vmulas.*.qacc, extract with esp.srcmb.*.qacc
- Use broadcast loads for scalar operands: esp.vldbc.16.ip qN, rs, 0
- Handle remainders: Process full 128-bit blocks in loop, handle tail elements separately
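The chunking and remainder steps above can be sketched as a scalar reference in C, which is also useful as a golden model when testing a hand-written kernel. This is a minimal sketch, not esp-dl code: the 64-bit variable only stands in for the XACC accumulation chain (the hardware accumulator is 40 bits wide), and `dot_s16` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES_S16 8   /* elements per 128-bit block for 16-bit data */

/* Dot product structured like the SIMD workflow: full 128-bit blocks in
 * the main loop, leftover elements in a scalar tail. */
static int64_t dot_s16(const int16_t *a, const int16_t *b, size_t n)
{
    assert(a != NULL && b != NULL);
    int64_t acc = 0;                        /* plays the role of a zeroed accumulator */
    const size_t blocks = n / LANES_S16;    /* loop count for the vector part */
    size_t i = 0;

    for (size_t blk = 0; blk < blocks; blk++) {
        /* One iteration corresponds to one vector MAC over 8 lanes. */
        for (int lane = 0; lane < LANES_S16; lane++, i++) {
            acc += (int32_t)a[i] * b[i];
        }
    }
    /* Tail: fewer than 8 elements remain and are handled one by one. */
    for (; i < n; i++) {
        acc += (int32_t)a[i] * b[i];
    }
    return acc;
}
```

For n = 11 the main loop runs once (8 elements) and the tail handles the remaining 3.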
Data Alignment Handling
Aligned Access (16-byte boundary)
esp.vld.128.ip q0, a1, 16 # load and advance by 16
esp.vst.128.ip q0, a0, 16 # store and advance by 16
Unaligned Access Pattern
# Get SAR_BYTE for output pointer
esp.ld.128.usar.ip q5, a0, 0
esp.movx.r.sar.bytes a5 # save output sar_byte
# Load unaligned data from input
esp.ld.128.usar.ip q0, a1, 16
esp.ld.128.usar.ip q1, a1, 16
# Extract aligned data from two consecutive loads
esp.src.q q2, q0, q1 # q2 = properly aligned 128-bit data
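The unaligned pattern above can be pictured as two aligned 16-byte loads followed by a byte-wise extraction. The C model below assumes that the unaligned-state load reads from the address rounded down to a 16-byte boundary and records the low four address bits in SAR_BYTE, and that esp.src.q takes 16 bytes starting at that offset from the concatenation of the two registers. These semantics are assumptions for illustration, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of the extraction step: treat lo:hi as 32 consecutive bytes and
 * copy 16 of them starting at byte offset sar_byte. */
static void src_q_model(uint8_t out[16], const uint8_t lo[16],
                        const uint8_t hi[16], unsigned sar_byte)
{
    assert(sar_byte < 16);   /* SAR_BYTE is a 4-bit register */
    for (unsigned i = 0; i < 16; i++) {
        unsigned k = i + sar_byte;
        out[i] = (k < 16) ? lo[k] : hi[k - 16];
    }
}

/* Model of a full unaligned 128-bit load from `mem` at byte `offset`,
 * where `mem` itself is taken to start on a 16-byte boundary. */
static void load_unaligned_model(uint8_t out[16], const uint8_t *mem,
                                 size_t offset)
{
    size_t base = offset & ~(size_t)15;           /* aligned-down address */
    unsigned sar_byte = (unsigned)(offset & 15);  /* value held in SAR_BYTE */
    src_q_model(out, mem + base, mem + base + 16, sar_byte);
}
```

Note that the model always touches the 16 bytes after the aligned base as well, which mirrors why the assembly pattern issues two consecutive loads before extracting.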
Instruction Naming Convention
ESP.<operation>.<datatype>[.<variant>]
- operation: vadd, vsub, vmul, vmulas, vsadds, etc.
- datatype: s8, s16, s32 (signed); u8, u16 (unsigned)
- variant: ld.incp (load + addr++), st.incp (store + addr++), ld.xp (load + addr+=reg), etc.
Label Naming Convention (REQUIRED)
For branch/loop targets, use local labels: either a plain number (0:, 1:, referenced as 0f/0b) or a .L-prefixed name (.Lloop, .Lremainder). Do not use full descriptive labels like loop_start: / end_label:.
- Numeric local labels keep tight inner loops compact and avoid name clashes.
- .L labels stay local to the file (not emitted into the symbol table) and clearly mark internal jump targets.
# GOOD: numeric local label
loopgtz a6, 0f
esp.vmin.s16.ld.incp q0, a3, q2, q0, q1
esp.vld.128.ip q1, a4, 16
esp.vst.128.ip q2, a2, 16
0:
# GOOD: .L local label
bgez a9, .Lleft_shift
.Lright_shift_loop:
# ...
bnez a5, .Lright_shift_loop
.Lleft_shift:
# BAD: full descriptive global-style labels
right_shift_loop:
# ...
bnez a5, right_shift_loop
References
- Full instruction listing: See references/instructions.md for all instructions organized by category with syntax and semantics
- Code examples: See references/examples.md for patterns from esp-dl (conv2d, elementwise ops, depthwise conv, etc.)