
It's so dumb that Zen 1 and Zen 2 have giga slow microcoded implementations of the pdep and pext instructions

So slow (~140 cycles latency) that manually implementing their behavior via handfuls of other instructions is actually faster

On Intel they have 3 cycles of latency and a throughput of one per cycle...

Comments
  • 0
    holdup now im curious.

    ```wat
    TEMP := SRC1;
    MASK := SRC2;
    DEST := 0;
    m := 0, k := 0;
    DO WHILE m < OperandSize
        IF MASK[m] = 1 THEN
            DEST[k] := TEMP[m];
            k := k + 1;
        FI
        m := m + 1;
    OD
    ```

    (^pseudo from https://felixcloutier.com/x86/pext/)

    say you write a branchless version of that, just straight up pasting the same block with a rept to unroll the loop. and to keep it simple let's say we're not actually benchmarking, just going by the ops/latency found in uncle agner's tables (https://agner.org/optimize/...).

    just how much faster can it be? it's a stupid problem and a very good one.
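
    for reference, a minimal sketch in C of what the branchless version could look like, assuming 64-bit operands. the names `pext_ref` and `pext_branchless` are made up for illustration, and whether the compiler unrolls the loop the way a rept would is not guaranteed. this only shows the logic and says nothing about cycle counts.

    ```c
    #include <stdint.h>

    /* Reference version: direct transcription of the pseudocode loop,
       with a conditional on every mask bit. */
    uint64_t pext_ref(uint64_t src, uint64_t mask) {
        uint64_t dest = 0;
        unsigned k = 0;
        for (unsigned m = 0; m < 64; m++) {
            if ((mask >> m) & 1) {
                dest |= ((src >> m) & 1) << k;
                k++;
            }
        }
        return dest;
    }

    /* Branchless version: the mask bit is used as data instead of as a
       condition. A cleared mask bit contributes 0 to dest and adds 0 to k. */
    uint64_t pext_branchless(uint64_t src, uint64_t mask) {
        uint64_t dest = 0;
        unsigned k = 0;
        for (unsigned m = 0; m < 64; m++) {
            uint64_t mbit = (mask >> m) & 1;          /* 0 or 1 */
            dest |= (((src >> m) & 1) & mbit) << k;   /* deposit only if selected */
            k += (unsigned)mbit;                      /* advance only if selected */
        }
        return dest;
    }
    ```

    each of the 64 steps depends on the previous k, so the unrolled body is one long dependency chain of shifts, ands and adds. that chain is what you'd add up from the tables.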