AVX-512

AVX-512 is a set of 512-bit extensions to the 256-bit Advanced Vector Extensions (AVX) SIMD instructions for the x86 instruction set architecture (ISA). Intel proposed it in July 2013 and implemented it in the Xeon Phi x200 (Knights Landing)^[1] and Skylake-X CPUs; the latter include the Core-X series (excluding the Core i5-7640X and Core i7-7740X), as well as the Xeon Scalable Processor Family and the Xeon D-2100 Embedded Series.^[2]

AVX-512 is not the first 512-bit SIMD instruction set that Intel has introduced in processors: the earlier 512-bit SIMD instructions used in the first generation Xeon Phi coprocessors, derived from Intel's Larrabee project, are similar but not binary compatible and only partially source compatible.^[1]

AVX-512 consists of multiple extensions that may be implemented independently. This policy is a departure from the historical requirement of implementing the entire instruction block. Only the core extension AVX-512F (AVX-512 Foundation) is required by all AVX-512 implementations.

Instruction set

The AVX-512 instruction set consists of several separate sets, each with its own CPUID feature bit; however, they are typically grouped by the processor generation that implements them.
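
As a rough illustration of how these feature bits are read, the sketch below decodes the subsets reported in the EBX register of CPUID leaf 7 (subleaf 0). The function name `decode_avx512_ebx` is hypothetical, only the earliest subsets are covered (later ones are reported in other registers), and the raw register value is assumed to have been obtained by other means, since Python has no standard CPUID access.

```python
# Bit positions in CPUID.(EAX=7, ECX=0):EBX for the earliest AVX-512 subsets.
AVX512_EBX_BITS = {
    16: "F",
    17: "DQ",
    21: "IFMA",
    26: "PF",
    27: "ER",
    28: "CD",
    30: "BW",
    31: "VL",
}


def decode_avx512_ebx(ebx):
    """Return the AVX-512 subsets whose feature bit is set in an EBX value.

    `ebx` is assumed to be the 32-bit EBX result of CPUID leaf 7, subleaf 0.
    """
    return [name for bit, name in sorted(AVX512_EBX_BITS.items())
            if (ebx >> bit) & 1]


# Example: a value with the F, PF, ER and CD bits set (the Knights Landing group).
knl_like = (1 << 16) | (1 << 26) | (1 << 27) | (1 << 28)
print(decode_avx512_ebx(knl_like))
```

Because every subset has its own bit, software is expected to test each subset it uses rather than infer support from the processor generation.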

F, CD, ER, PF

Introduced with Xeon Phi x200 (Knights Landing) and Xeon Gold/Platinum (Skylake SP "Purley"), with the last two (ER and PF) being specific to Knights Landing.

AVX-512 Foundation (F) – expands most 32-bit and 64-bit based AVX instructions with the EVEX coding scheme to support 512-bit registers, operation masks, parameter broadcasting, and embedded rounding and exception control, implemented by Knights Landing and Skylake Xeon
AVX-512 Conflict Detection Instructions (CD) – efficient conflict detection to allow more loops to be vectorized, implemented by Knights Landing^[1] and Skylake X
AVX-512 Exponential and Reciprocal Instructions (ER) – exponential and reciprocal operations designed to help implement transcendental operations, implemented by Knights Landing^[1]
AVX-512 Prefetch Instructions (PF) – new prefetch capabilities, implemented by Knights Landing^[1]

VL, DQ, BW

Introduced with Skylake X and Cannon Lake.

AVX-512 Vector Length Extensions (VL) – extends most AVX-512 operations to also operate on XMM (128-bit) and YMM (256-bit) registers^[3]
AVX-512 Doubleword and Quadword Instructions (DQ) – adds new 32-bit and 64-bit AVX-512 instructions^[3]
AVX-512 Byte and Word Instructions (BW) – extends AVX-512 to cover 8-bit and 16-bit integer operations^[3]

IFMA, VBMI

Introduced with Cannon Lake.^[4]

AVX-512 Integer Fused Multiply Add (IFMA) – fused multiply add of integers using 52-bit precision.
AVX-512 Vector Byte Manipulation Instructions (VBMI) – adds vector byte permutation instructions which were not present in AVX-512BW.

4VNNIW, 4FMAPS

Introduced with Knights Mill.^[5]^[6]

AVX-512 Vector Neural Network Instructions Word variable precision (4VNNIW) - vector instructions for deep learning, enhanced word, variable precision.
AVX-512 Fused Multiply Accumulation Packed Single precision (4FMAPS) - vector instructions for deep learning, floating point, single precision.

VPOPCNTDQ

Vector population count instruction. Introduced with Knights Mill and Ice Lake.^[7]

VNNI, VBMI2, BITALG

Introduced with Ice Lake.^[7]

AVX-512 Vector Neural Network Instructions (VNNI) - vector instructions for deep learning.
AVX-512 Vector Byte Manipulation Instructions 2 (VBMI2) - byte/word load, store and concatenation with shift.
AVX-512 Bit Algorithms (BITALG) - byte/word bit manipulation instructions expanding VPOPCNTDQ.

VP2INTERSECT

Introduced with Tiger Lake.

AVX-512 Vector Pair Intersection to a Pair of Mask Registers (VP2INTERSECT).

GFNI, VPCLMULQDQ, VAES

Introduced with Ice Lake.^[7]

These are not AVX-512 features per se. Together with AVX-512, they enable EVEX encoded versions of GFNI, PCLMULQDQ, and AES instructions.

Encoding and features

The VEX prefix used by AVX and AVX2, while flexible, did not leave enough room for the features Intel wanted to add to AVX-512. Intel therefore defined a new prefix called EVEX.

Compared to VEX, EVEX adds the following benefits:^[6]

Expanded register encoding allowing 32 512-bit registers.
Adds 8 new opmask registers for masking most AVX-512 instructions.
Adds a new scalar memory mode that automatically performs a broadcast.
Adds room for explicit rounding control in each instruction.
Adds a new compressed displacement memory addressing mode.

The extended registers, SIMD width bit, and opmask registers of AVX-512 are mandatory and all require support from the OS.

SIMD modes

The AVX-512 instructions are designed to mix with 128/256-bit AVX/AVX2 instructions without a performance penalty. In addition, the AVX-512VL extension allows AVX-512 instructions to be used on the 128/256-bit registers XMM/YMM, so most SSE and AVX/AVX2 instructions have new AVX-512 versions encoded with the EVEX prefix, which give access to new features such as opmasks and the additional registers. Unlike AVX-256, the new instructions do not have new mnemonics but share a namespace with AVX, which makes the distinction between VEX and EVEX encoded versions of an instruction ambiguous in source code. Since AVX-512F only works on 32- and 64-bit values, SSE and AVX/AVX2 instructions that operate on bytes or words are available only with the AVX-512BW extension (byte and word support).^[6]

Name	Extension sets	Registers	Types
Legacy SSE	SSE-SSE4.2	xmm0-xmm15	single floats. From SSE2: bytes, words, doublewords, quadwords and double floats.
AVX-128 (VEX)	AVX, AVX2	xmm0-xmm15	bytes, words, doublewords, quadwords, single floats and double floats.
AVX-256 (VEX)	AVX, AVX2	ymm0-ymm15	single float and double float. From AVX2: bytes, words, doublewords, quadwords
AVX-128 (EVEX)	AVX-512VL	xmm0-xmm31 (k1-k7)	doublewords, quadwords, single float and double float. With AVX512BW: bytes and words
AVX-256 (EVEX)	AVX-512VL	ymm0-ymm31 (k1-k7)	doublewords, quadwords, single float and double float. With AVX512BW: bytes and words
AVX-512 (EVEX)	AVX-512F	zmm0-zmm31 (k1-k7)	doublewords, quadwords, single float and double float. With AVX512BW: bytes and words

Extended registers

x64 AVX-512 register scheme as extension from the x64 AVX (YMM0-YMM15) and x64 SSE (XMM0-XMM15) registers
Bits 511–256	Bits 255–128	Bits 127–0
ZMM0	YMM0	XMM0
ZMM1	YMM1	XMM1
ZMM2	YMM2	XMM2
ZMM3	YMM3	XMM3
ZMM4	YMM4	XMM4
ZMM5	YMM5	XMM5
ZMM6	YMM6	XMM6
ZMM7	YMM7	XMM7
ZMM8	YMM8	XMM8
ZMM9	YMM9	XMM9
ZMM10	YMM10	XMM10
ZMM11	YMM11	XMM11
ZMM12	YMM12	XMM12
ZMM13	YMM13	XMM13
ZMM14	YMM14	XMM14
ZMM15	YMM15	XMM15
ZMM16	YMM16	XMM16
ZMM17	YMM17	XMM17
ZMM18	YMM18	XMM18
ZMM19	YMM19	XMM19
ZMM20	YMM20	XMM20
ZMM21	YMM21	XMM21
ZMM22	YMM22	XMM22
ZMM23	YMM23	XMM23
ZMM24	YMM24	XMM24
ZMM25	YMM25	XMM25
ZMM26	YMM26	XMM26
ZMM27	YMM27	XMM27
ZMM28	YMM28	XMM28
ZMM29	YMM29	XMM29
ZMM30	YMM30	XMM30
ZMM31	YMM31	XMM31

The width of the SIMD register file is increased from 256 bits to 512 bits, and the number of registers is expanded from 16 to a total of 32 (ZMM0-ZMM31). These registers can be addressed as 256-bit YMM registers from the AVX extensions and 128-bit XMM registers from Streaming SIMD Extensions, and legacy AVX and SSE instructions can be extended to operate on the 16 additional registers XMM16-XMM31 and YMM16-YMM31 when using the EVEX encoded form.

Opmask registers

Most AVX-512 instructions may specify one of 8 opmask registers (k0–k7). For instructions which use a mask register as an opmask, register 'k0' is special: it acts as a hardcoded constant indicating an unmasked operation. For other operations, such as those that write to an opmask register or perform arithmetic or logical operations, 'k0' is a functioning, valid register. In most instructions, the opmask controls which values are written to the destination. A flag controls the opmask behavior, which can be either "zero", which zeros everything not selected by the mask, or "merge", which leaves everything not selected untouched. The merge behavior is identical to that of the blend instructions.
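
The difference between the two masking modes can be sketched with a small software model. The helper `masked_op` below is hypothetical; it treats a vector as a Python list and the opmask as an integer whose bit i controls element i.

```python
def masked_op(op, dest, a, b, mask, zeroing=False):
    """Apply op element-wise to a and b under an opmask.

    Elements whose mask bit is set receive op(a[i], b[i]).
    Unselected elements are zeroed ("zero" mode) or keep the previous
    destination value ("merge" mode).
    """
    result = []
    for i, (old, x, y) in enumerate(zip(dest, a, b)):
        if (mask >> i) & 1:
            result.append(op(x, y))
        elif zeroing:
            result.append(0)
        else:
            result.append(old)
    return result


dest = [9, 9, 9, 9]
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
add = lambda x, y: x + y

print(masked_op(add, dest, a, b, 0b0101))                # merge mode
print(masked_op(add, dest, a, b, 0b0101, zeroing=True))  # zero mode
```

In this model, merge mode makes the destination an extra input of the operation, while zero mode does not depend on its previous contents.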

The opmask registers are normally 16 bits wide, but can be up to 64 bits with the AVX-512BW extension.^[6] How many of the bits are actually used, though, depends on the vector type of the instructions masked. For the 32-bit single float or double words, 16 bits are used to mask the 16 elements in a 512-bit register. For double float and quad words, at most 8 mask bits are used.
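
The number of mask bits in use follows directly from the vector width and the element width, as this small sketch (with a hypothetical helper name) shows:

```python
def mask_bits_used(vector_bits, element_bits):
    """Number of opmask bits needed: one bit per vector element."""
    return vector_bits // element_bits


print(mask_bits_used(512, 32))  # single floats or doublewords in a ZMM register
print(mask_bits_used(512, 64))  # double floats or quadwords in a ZMM register
print(mask_bits_used(512, 8))   # bytes in a ZMM register (needs AVX-512BW)
```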

The opmask register is the reason why several bitwise instructions which naturally have no element widths had them added in AVX-512. For instance, bitwise AND, OR or 128-bit shuffle now exist in both double-word and quad-word variants with the only difference being in the final masking.

New opmask instructions

The opmask registers have a new mini extension of instructions operating directly on them. Unlike the rest of the AVX-512 instructions, these instructions are all VEX encoded. The initial opmask instructions are all 16-bit (Word) versions. With AVX-512DQ 8-bit (Byte) versions were added to better match the needs of masking 8 64-bit values, and with AVX-512BW 32-bit (Double) and 64-bit (Quad) versions were added so they can mask up to 64 8-bit values. The instructions KORTEST and KTEST can be used to set the x86 flags based on mask registers, so that they may be used together with non-SIMD x86 branch and conditional instructions.

Instruction	Extension set	Description
`KAND`	F	Bitwise logical AND Masks
`KANDN`	F	Bitwise logical AND NOT Masks
`KMOV`	F	Move from and to Mask Registers or General Purpose Registers
`KUNPCK`	F	Unpack for Mask Registers
`KNOT`	F	NOT Mask Register
`KOR`	F	Bitwise logical OR Masks
`KORTEST`	F	OR Masks And Set Flags
`KSHIFTL`	F	Shift Left Mask Registers
`KSHIFTR`	F	Shift Right Mask Registers
`KXNOR`	F	Bitwise logical XNOR Masks
`KXOR`	F	Bitwise logical XOR Masks
`KADD`	BW/DQ	Add Two Masks
`KTEST`	BW/DQ	Bitwise comparison and set flags
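
A minimal software model of `KORTEST` is sketched below. It assumes the usual flag convention in which ZF is set when the OR of the two masks is all zeros and CF is set when it is all ones within the mask width being tested. The function name and the dictionary return value are illustrative only.

```python
def kortest(k1, k2, width=16):
    """Model of KORTEST: OR two masks and derive ZF and CF from the result."""
    full = (1 << width) - 1
    combined = (k1 | k2) & full
    return {"ZF": int(combined == 0), "CF": int(combined == full)}


print(kortest(0x0000, 0x0000))  # no bit set in either mask
print(kortest(0xFF00, 0x00FF))  # together the masks cover all 16 bits
print(kortest(0x0001, 0x0000))  # some, but not all, bits set
```

Testing a mask against itself in this way lets ordinary conditional jumps react to "no element selected" (ZF) or "every element selected" (CF).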

New instructions in AVX-512 foundation

Many AVX-512 instructions are simply EVEX versions of old SSE or AVX instructions. There are, however, several new instructions, as well as old instructions that have been replaced with new AVX-512 versions. The new or substantially reworked instructions are listed below. These foundation instructions also include the extensions from AVX-512VL and AVX-512BW, since those extensions merely add new versions of these instructions rather than new instructions.

Blend using mask

There are no EVEX-prefixed versions of the blend instructions from SSE4; instead, AVX-512 has a new set of blending instructions using mask registers as selectors. Together with the general compare into mask instructions below, these may be used to implement generic ternary operations or cmov, similar to XOP's VPCMOV.

Since blending is an integral part of the EVEX encoding, these instructions may also be considered basic move instructions. Using the zeroing blend mode, they can also be used as masking instructions.

Instruction	Extension set	Description
`VBLENDMPD`	F	Blend float64 vectors using opmask control
`VBLENDMPS`	F	Blend float32 vectors using opmask control
`VPBLENDMD`	F	Blend int32 vectors using opmask control
`VPBLENDMQ`	F	Blend int64 vectors using opmask control
`VPBLENDMB`	BW	Blend byte integer vectors using opmask control
`VPBLENDMW`	BW	Blend word integer vectors using opmask control
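
The "compare, then blend" pattern can be modelled in a few lines. In this sketch the helper names are hypothetical, vectors are Python lists, and a set mask bit is assumed to select the element from the second source, with a clear bit selecting the first source.

```python
def blend_mask(mask, a, b):
    """Per element: take b[i] where mask bit i is set, otherwise a[i]."""
    return [y if (mask >> i) & 1 else x
            for i, (x, y) in enumerate(zip(a, b))]


def less_than_mask(a, b):
    """Build a mask with bit i set where a[i] < b[i]."""
    return sum(1 << i for i, (x, y) in enumerate(zip(a, b)) if x < y)


# Element-wise "a < b ? b : a", that is, a vector maximum built from a
# compare into a mask followed by a blend.
a = [1, 5, 3, 7]
b = [4, 2, 6, 0]
mask = less_than_mask(a, b)
print(bin(mask), blend_mask(mask, a, b))
```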

Compare into mask

AVX-512F has four new compare instructions. Like their XOP counterparts they use the immediate field to select between 8 different comparisons. Unlike their XOP inspiration, however, they save the result to a mask register and initially only support doubleword and quadword comparisons. The AVX-512BW extension provides the byte and word versions. Note that two mask registers may be specified for the instructions, one to write to and one to declare regular masking.^[6]

Immediate	Comparison	Description
0	EQ	Equal
1	LT	Less than
2	LE	Less than or equal
3	FALSE	Set to zero
4	NEQ	Not equal
5	NLT	Greater than or equal
6	NLE	Greater than
7	TRUE	Set to one

Instruction	Extension set	Description
`VPCMPD` `VPCMPUD`	F	Compare signed/unsigned doublewords into mask
`VPCMPQ` `VPCMPUQ`	F	Compare signed/unsigned quadwords into mask
`VPCMPB` `VPCMPUB`	BW	Compare signed/unsigned bytes into mask
`VPCMPW` `VPCMPUW`	BW	Compare signed/unsigned words into mask
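
The immediate encoding in the table above can be illustrated with a small model. The function `vpcmp` is a hypothetical stand-in for the whole instruction family; it assumes the caller passes element values already interpreted as signed or unsigned, and it treats the optional write mask as zeroing the result bits of disabled elements.

```python
import operator

# Immediate value -> predicate, following the comparison table.
PREDICATES = {
    0: operator.eq,           # EQ
    1: operator.lt,           # LT
    2: operator.le,           # LE
    3: lambda x, y: False,    # FALSE
    4: operator.ne,           # NEQ
    5: operator.ge,           # NLT (not less than)
    6: operator.gt,           # NLE (not less than or equal)
    7: lambda x, y: True,     # TRUE
}


def vpcmp(a, b, imm, write_mask=None):
    """Compare a[i] with b[i] and return the result as a bit mask."""
    predicate = PREDICATES[imm & 7]
    result = 0
    for i, (x, y) in enumerate(zip(a, b)):
        enabled = write_mask is None or (write_mask >> i) & 1
        if enabled and predicate(x, y):
            result |= 1 << i
    return result


a = [1, 5, 3, 7]
b = [4, 2, 3, 0]
print(bin(vpcmp(a, b, 2)))                     # LE
print(bin(vpcmp(a, b, 7, write_mask=0b0011)))  # TRUE under a write mask
```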

Logical set mask

The final way to set masks is using Logical Set Mask. These instructions perform either AND or NAND, and then set the destination opmask based on the result values being zero or non-zero. Note that like the comparison instructions, these take two opmask registers, one as destination and one as a regular opmask.

Instruction	Extension set	Description
`VPTESTMD`, `VPTESTMQ`	F	Logical AND and set mask for 32 or 64 bit integers.
`VPTESTNMD`, `VPTESTNMQ`	F	Logical NAND and set mask for 32 or 64 bit integers.
`VPTESTMB`, `VPTESTMW`	BW	Logical AND and set mask for 8 or 16 bit integers.
`VPTESTNMB`, `VPTESTNMW`	BW	Logical NAND and set mask for 8 or 16 bit integers.
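
A minimal sketch of the two variants follows, with hypothetical function names and Python lists standing in for vectors. It assumes the AND form sets a mask bit when the element-wise AND is non-zero and the NAND form sets it when the AND is zero.

```python
def vptestm(a, b, write_mask=None):
    """Set mask bit i when (a[i] & b[i]) is non-zero."""
    result = 0
    for i, (x, y) in enumerate(zip(a, b)):
        enabled = write_mask is None or (write_mask >> i) & 1
        if enabled and (x & y) != 0:
            result |= 1 << i
    return result


def vptestnm(a, b, write_mask=None):
    """Set mask bit i when (a[i] & b[i]) is zero."""
    result = 0
    for i, (x, y) in enumerate(zip(a, b)):
        enabled = write_mask is None or (write_mask >> i) & 1
        if enabled and (x & y) == 0:
            result |= 1 << i
    return result


a = [0b1100, 0b0011, 0x00, 0xFF]
b = [0b0100, 0b1100, 0xFF, 0x0F]
print(bin(vptestm(a, b)), bin(vptestnm(a, b)))
```

Testing a vector against itself with the AND form yields a mask of its non-zero elements.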

Compress and expand

The compress and expand instructions match the APL operations of the same name. They use the opmask in a slightly different way from other AVX-512 instructions. Compress only saves the values marked in the mask, but saves them compacted by skipping and not reserving space for unmarked values. Expand operates in the opposite way, by loading as many values as indicated in the mask and then spreading them to the selected positions.

Instruction	Description
`VCOMPRESSPD`, `VCOMPRESSPS`	Store sparse packed double/single-precision floating-point values into dense memory
`VPCOMPRESSD`, `VPCOMPRESSQ`	Store sparse packed doubleword/quadword integer values into dense memory/register
`VEXPANDPD`, `VEXPANDPS`	Load sparse packed double/single-precision floating-point values from dense memory
`VPEXPANDD`, `VPEXPANDQ`	Load sparse packed doubleword/quadword integer values from dense memory/register
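
The two operations can be sketched as follows. The helper names are hypothetical, and the model covers only the memory-style behaviour described above; the handling of the remaining destination elements in the register forms is simplified to "keep whatever `dest` held".

```python
def compress(src, mask):
    """Keep only the elements selected by the mask, packed together."""
    return [x for i, x in enumerate(src) if (mask >> i) & 1]


def expand(packed, mask, dest):
    """Spread consecutive packed values to the positions selected by the mask."""
    out = list(dest)
    values = iter(packed)
    for i in range(len(out)):
        if (mask >> i) & 1:
            out[i] = next(values)
    return out


src = [10, 20, 30, 40]
mask = 0b1010
packed = compress(src, mask)
print(packed)
print(expand(packed, mask, [0, 0, 0, 0]))
```

Compressing with a comparison mask is a common way to filter a stream: only the elements that passed the test are stored, with no gaps.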

Permute

A new set of permute instructions has been added for full two-input permutations. They all take three arguments, two source registers and one index register; the result is written by overwriting either the first source register or the index register. AVX-512BW extends the instructions to also include 16-bit (word) versions, and the AVX-512_VBMI extension defines the byte versions of the instructions.

Instruction	Extension set	Description
`VPERMB`	VBMI	Permute packed byte elements.
`VPERMW`	BW	Permute packed word elements.
`VPERMT2B`	VBMI	Full byte permute overwriting first source.
`VPERMT2W`	BW	Full word permute overwriting first source.
`VPERMI2PD`, `VPERMI2PS`	F	Full double/single floating point permute overwriting the index.
`VPERMI2D`, `VPERMI2Q`	F	Full doubleword/quadword permute overwriting the index.
`VPERMI2B`	VBMI	Full byte permute overwriting the index.
`VPERMI2W`	BW	Full word permute overwriting the index.
`VPERMT2PS`, `VPERMT2PD`	F	Full single/double floating point permute overwriting first source.
`VPERMT2D`, `VPERMT2Q`	F	Full doubleword/quadword permute overwriting first source.
`VSHUFF32x4`, `VSHUFF64x2`, `VSHUFI32x4`, `VSHUFI64x2`	F	Shuffle four packed 128-bit lanes.
`VPMULTISHIFTQB`	VBMI	Select packed unaligned bytes from quadword sources.
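
A two-input permute can be modelled as a lookup into two tables at once. The sketch below uses a hypothetical `permute2` helper and assumes that, for vectors of n elements (n a power of two), the low bits of each index pick the element, the next bit picks the source table, and higher bits are ignored. Which register receives the result (first source or index) is not modelled, since it does not change the values produced.

```python
def permute2(table_a, table_b, indices):
    """Select each output element from one of two source vectors by index."""
    n = len(table_a)  # assumed to be a power of two
    out = []
    for idx in indices:
        source = table_b if idx & n else table_a
        out.append(source[idx & (n - 1)])
    return out


a = [0, 1, 2, 3]
b = [10, 11, 12, 13]
print(permute2(a, b, [4, 0, 7, 3]))
```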

Bitwise ternary logic

Two new instructions can implement every possible bitwise operation between three inputs. They take three registers as input and an 8-bit immediate field. Each bit in the output is generated by using the three corresponding input bits as an index that selects one of the 8 positions in the 8-bit immediate. Since three bits allow only 8 combinations, this allows all possible three-input bitwise operations to be performed.^[6] These are the only bitwise vector instructions in AVX-512F; EVEX versions of the two source SSE and AVX bitwise vector instructions AND, ANDN, OR and XOR were added in AVX-512DQ.

The difference in the doubleword and quadword versions is only the application of the opmask.

Instruction	Description
`VPTERNLOGD`, `VPTERNLOGQ`	Bitwise Ternary Logic

Truth table:

A0	A1	A2	Double AND (0x80)	Double OR (0xFE)	Bitwise blend (0xCA)
0	0	0	0	0	0
0	0	1	0	1	1
0	1	0	0	1	0
0	1	1	0	1	1
1	0	0	0	1	0
1	0	1	0	1	0
1	1	0	0	1	1
1	1	1	1	1	1
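
The lookup described above can be reproduced in software. This sketch, with a hypothetical `ternlog` function, follows the truth table: A0 supplies the most significant index bit and A2 the least significant. Masking of the result is not modelled.

```python
def ternlog(a, b, c, imm8, width=32):
    """Bitwise ternary logic: imm8 is an 8-entry truth table indexed by (a, b, c)."""
    result = 0
    for bit in range(width):
        index = ((((a >> bit) & 1) << 2)
                 | (((b >> bit) & 1) << 1)
                 | ((c >> bit) & 1))
        result |= ((imm8 >> index) & 1) << bit
    return result


a, b, c = 0xF0F0, 0xCCCC, 0xAAAA
print(hex(ternlog(a, b, c, 0x80, width=16)))  # a & b & c
print(hex(ternlog(a, b, c, 0xFE, width=16)))  # a | b | c
print(hex(ternlog(a, b, c, 0xCA, width=16)))  # a ? b : c, bit by bit
```

With the inputs 0xF0, 0xCC and 0xAA in the low byte, the low byte of the result equals the immediate itself, which is a convenient way to derive the immediate for any desired expression.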

Conversions

A number of conversion or move instructions were added; these complete the set of conversion instructions available from SSE2.

Instruction	Extension set	Description
`VPMOVQD`, `VPMOVSQD`, `VPMOVUSQD`, `VPMOVQW`, `VPMOVSQW`,`VPMOVUSQW`, `VPMOVQB`, `VPMOVSQB`, `VPMOVUSQB`, `VPMOVDW`, `VPMOVSDW`, `VPMOVUSDW`, `VPMOVDB`, `VPMOVSDB`, `VPMOVUSDB`	F	Down convert quadword or doubleword to doubleword, word or byte; unsaturated, saturated or saturated unsigned. The reverse of the sign/zero extend instructions from SSE4.1.
`VPMOVWB`, `VPMOVSWB`, `VPMOVUSWB`	BW	Down convert word to byte; unsaturated, saturated or saturated unsigned.
`VCVTPS2UDQ`, `VCVTPD2UDQ`, `VCVTTPS2UDQ`, `VCVTTPD2UDQ`	F	Convert with or without truncation, packed single or double-precision floating point to packed unsigned doubleword integers.
`VCVTSS2USI`, `VCVTSD2USI`, `VCVTTSS2USI`, `VCVTTSD2USI`	F	Convert with or without truncation, scalar single or double-precision floating point to unsigned doubleword integer.
`VCVTPS2QQ`, `VCVTPD2QQ`, `VCVTPS2UQQ`, `VCVTPD2UQQ`, `VCVTTPS2QQ`, `VCVTTPD2QQ`, `VCVTTPS2UQQ`, `VCVTTPD2UQQ`	DQ	Convert with or without truncation, packed single or double-precision floating point to packed signed or unsigned quadword integers.
`VCVTUDQ2PS`, `VCVTUDQ2PD`	F	Convert packed unsigned doubleword integers to packed single or double-precision floating point.
`VCVTUSI2SS`, `VCVTUSI2SD`	F	Convert scalar unsigned integers to single or double-precision floating point.
`VCVTUQQ2PS`, `VCVTUQQ2PD`	DQ	Convert packed unsigned quadword integers to packed single or double-precision floating point.
`VCVTQQ2PD`, `VCVTQQ2PS`	DQ	Convert packed signed quadword integers to packed single or double-precision floating point.
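
The three flavours of down conversion differ only in how out-of-range values are handled. The sketch below uses a hypothetical `narrow` helper on a single element; for the unsigned-saturating mode the input is assumed to be the unsigned interpretation of the source element.

```python
def narrow(value, bits, mode="truncate"):
    """Down convert one integer element to `bits` bits.

    truncate          keep the low bits (result shown as a raw bit pattern)
    signed_saturate   clamp to the signed range of the target width
    unsigned_saturate clamp to the unsigned range (value assumed non-negative)
    """
    if mode == "truncate":
        return value & ((1 << bits) - 1)
    if mode == "signed_saturate":
        low, high = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
        return max(low, min(high, value))
    if mode == "unsigned_saturate":
        return min((1 << bits) - 1, value)
    raise ValueError(mode)


q = 0x123456789  # a quadword value that does not fit in 32 bits
print(hex(narrow(q, 32)))
print(hex(narrow(q, 32, "signed_saturate")))
print(hex(narrow(q, 32, "unsigned_saturate")))
```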

Floating point decomposition

Among the unique new features in AVX-512F are instructions to decompose floating-point values and handle special floating-point values. Since these methods are completely new, they also exist in scalar versions.

Instruction	Description
`VGETEXPPD`, `VGETEXPPS`	Convert exponents of packed fp values into fp values
`VGETEXPSD`, `VGETEXPSS`	Convert exponent of scalar fp value into fp value
`VGETMANTPD`, `VGETMANTPS`	Extract vector of normalized mantissas from float64/float32 vector
`VGETMANTSD`, `VGETMANTSS`	Extract float64/float32 of normalized mantissa from float64/float32 scalar
`VFIXUPIMMPD`, `VFIXUPIMMPS`	Fix up special packed float64/float32 values
`VFIXUPIMMSD`, `VFIXUPIMMSS`	Fix up special scalar float64/float32 value
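
The decomposition performed by the exponent and mantissa instructions can be approximated with the standard library. This is a sketch under simplifying assumptions: inputs are finite and non-zero, the mantissa is normalized to the interval [1, 2), the sign is dropped, and the special-value handling and immediate-controlled options of the real instructions are not modelled. The function names are hypothetical.

```python
import math


def getexp(x):
    """Unbiased exponent of x as a floating-point value: floor(log2(|x|))."""
    _, exponent = math.frexp(abs(x))  # |x| = m * 2**exponent with 0.5 <= m < 1
    return float(exponent - 1)


def getmant(x):
    """Mantissa of |x| normalized to the interval [1, 2)."""
    mantissa, _ = math.frexp(abs(x))
    return mantissa * 2.0


x = 10.0
print(getexp(x), getmant(x))
# The two parts recombine into the original magnitude.
print(getmant(x) * 2.0 ** getexp(x))
```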

Floating point arithmetic

This is the second set of new floating-point methods, which includes new scaling instructions and approximate calculation of the reciprocal and of the reciprocal square root. The approximate reciprocal instructions are guaranteed to have a relative error of at most 2⁻¹⁴.^[6]

Instruction	Description
`VRCP14PD`, `VRCP14PS`	Compute approximate reciprocals of packed float64/float32 values
`VRCP14SD`, `VRCP14SS`	Compute approximate reciprocals of scalar float64/float32 value
`VRNDSCALEPS`, `VRNDSCALEPD`	Round packed float32/float64 values to include a given number of fraction bits
`VRNDSCALESS`, `VRNDSCALESD`	Round scalar float32/float64 value to include a given number of fraction bits
`VRSQRT14PD`, `VRSQRT14PS`	Compute approximate reciprocals of square roots of packed float64/float32 values
`VRSQRT14SD`, `VRSQRT14SS`	Compute approximate reciprocal of square root of scalar float64/float32 value
`VSCALEFPS`, `VSCALEFPD`	Scale packed float32/float64 values with float32/float64 values
`VSCALEFSS`, `VSCALEFSD`	Scale scalar float32/float64 value with float32/float64 value
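
The rounding and scaling operations in the table can be sketched on single values. Both helpers are hypothetical models: `rndscale` assumes round-to-nearest-even (the real instructions take the rounding mode from an immediate or from MXCSR), and `scalef` assumes the scale factor is 2 raised to the floor of the second operand. Special values are not handled.

```python
import math


def rndscale(x, fraction_bits):
    """Round x so that it keeps only `fraction_bits` binary fraction bits."""
    scale = 2.0 ** fraction_bits
    # Python's round() rounds halfway cases to the nearest even integer.
    return round(x * scale) / scale


def scalef(x, y):
    """Scale x by a power of two taken from y: x * 2**floor(y)."""
    return math.ldexp(x, math.floor(y))


print(rndscale(5.625, 1))   # nearest multiple of 0.5
print(rndscale(3.14159, 2)) # nearest multiple of 0.25
print(scalef(3.0, 2.7))     # 3 * 2**2
```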

Broadcast

Instruction	Extension set	Description
`VBROADCASTSS`, `VBROADCASTSD`	F, VL	Broadcast single/double floating point value
`VPBROADCASTB`, `VPBROADCASTW`, `VPBROADCASTD`, `VPBROADCASTQ`	F, VL, DQ, BW	Broadcast a byte/word/doubleword/quadword integer value
`VBROADCASTI32X2`, `VBROADCASTI64X2`, `VBROADCASTI32X4`, `VBROADCASTI32X8`, `VBROADCASTI64X4`	F, VL, DQ, BW	Broadcast two or four doubleword/quadword integer values

Miscellaneous

Instruction	Extension set	Description
`VALIGND`, `VALIGNQ`	F, VL	Align doubleword or quadword vectors
`VDBPSADBW`	BW	Double block packed sum-absolute-differences (SAD) on unsigned bytes
`VPABSQ`	F	Packed absolute value quadword
`VPMAXSQ`, `VPMAXUQ`	F	Maximum of packed signed/unsigned quadword
`VPMINSQ`, `VPMINUQ`	F	Minimum of packed signed/unsigned quadword
`VPROLD`, `VPROLVD`, `VPROLQ`, `VPROLVQ`, `VPRORD`, `VPRORVD`, `VPRORQ`, `VPRORVQ`	F	Bit rotate left or right
`VPSCATTERDD`, `VPSCATTERDQ`, `VPSCATTERQD`, `VPSCATTERQQ`	F	Scatter packed doubleword/quadword with signed doubleword and quadword indices
`VSCATTERDPS`, `VSCATTERDPD`, `VSCATTERQPS`, `VSCATTERQPD`	F	Scatter packed float32/float64 with signed doubleword and quadword indices

New instructions by sets

Conflict detection

The instructions in AVX-512 conflict detection (AVX-512CD) are designed to help efficiently calculate conflict-free subsets of elements in loops that normally could not be safely vectorized.^[8]

Instruction	Name	Description
`VPCONFLICTD`, `VPCONFLICTQ`	Detect conflicts within a vector of packed doubleword or quadword values.	Compares each element of the source with all elements in earlier (less significant) positions and forms a bit vector of the results.
`VPLZCNTD`, `VPLZCNTQ`	Count the number of leading zero bits for packed doubleword or quadword values.	Vectorized `LZCNT` instruction.
`VPBROADCASTMB2Q`,`VPBROADCASTMW2D`	Broadcast mask to vector register.	Either 8-bit mask to quadword vector, or 16-bit mask to doubleword vector.
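
Conflict detection can be illustrated with a short model. The hypothetical `vpconflict` function below assumes that each element is compared against the elements in lower-numbered positions and that bit j of result element i is set when element j equals element i.

```python
def vpconflict(src):
    """For each element, return a bit vector of earlier positions holding the same value."""
    out = []
    for i, x in enumerate(src):
        bits = 0
        for j in range(i):
            if src[j] == x:
                bits |= 1 << j
        out.append(bits)
    return out


# Indices used by a scatter or histogram update; positions 0, 2 and 3 collide.
indices = [1, 2, 1, 1]
print([bin(v) for v in vpconflict(indices)])
```

A result of zero means the element has no earlier duplicate, so a loop such as a histogram update can process all zero-result elements in one vector step and handle the rest in later passes.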

Exponential and reciprocal

AVX-512 exponential and reciprocal instructions contain more accurate approximate reciprocal instructions than those in the AVX-512 foundation; relative error is at most 2⁻²⁸. They also contain two new exponential functions that have a relative error of at most 2⁻²³.^[6]

Instruction	Description
`VEXP2PD`, `VEXP2PS`	Compute approximate exponential 2^x of packed single or double-precision floating point values
`VRCP28PD`, `VRCP28PS`	Compute approximate reciprocals of packed single or double-precision floating point values
`VRCP28SD`, `VRCP28SS`	Compute approximate reciprocal of scalar single or double-precision floating point value
`VRSQRT28PD`, `VRSQRT28PS`	Compute approximate reciprocals of square roots of packed single or double-precision floating point values
`VRSQRT28SD`, `VRSQRT28SS`	Compute approximate reciprocal of square root of scalar single or double-precision floating point value

Prefetch

AVX-512 prefetch instructions contain new prefetch operations for the new scatter and gather functionality introduced in AVX2 and AVX-512. T0 prefetch means prefetching into level 1 cache and T1 means prefetching into level 2 cache.

Instruction	Description
`VGATHERPF0DPS`, `VGATHERPF0QPS`, `VGATHERPF0DPD`, `VGATHERPF0QPD`	Using signed dword/qword indices, prefetch sparse byte memory locations containing single/double-precision data using opmask k1 and T0 hint.
`VGATHERPF1DPS`, `VGATHERPF1QPS`, `VGATHERPF1DPD`, `VGATHERPF1QPD`	Using signed dword/qword indices, prefetch sparse byte memory locations containing single/double-precision data using opmask k1 and T1 hint.
`VSCATTERPF0DPS`, `VSCATTERPF0QPS`, `VSCATTERPF0DPD`, `VSCATTERPF0QPD`	Using signed dword/qword indices, prefetch sparse byte memory locations containing single/double-precision data using writemask k1 and T0 hint with intent to write.
`VSCATTERPF1DPS`, `VSCATTERPF1QPS`, `VSCATTERPF1DPD`, `VSCATTERPF1QPD`	Using signed dword/qword indices, prefetch sparse byte memory locations containing single/double precision data using writemask k1 and T1 hint with intent to write.

4FMAPS and 4VNNIW

The two sets of instructions perform multiple iterations of processing. They are generally only found in Xeon Phi products.

Instruction	Extension set	Description
`V4FMADDPS`, `V4FMADDSS`	4FMAPS	Packed/scalar single-precision floating point fused multiply-add (4-iterations)
`V4FNMADDPS`, `V4FNMADDSS`	4FMAPS	Packed/scalar single-precision floating point fused multiply-add and negate (4-iterations)
`VP4DPWSSD`	4VNNIW	Dot product of signed words with double word accumulation (4-iterations)
`VP4DPWSSDS`	4VNNIW	Dot product of signed words with double word accumulation and saturation (4-iterations)

BW, DQ and VBMI

AVX-512DQ adds new doubleword and quadword instructions. AVX-512BW adds byte and word versions of the same instructions, and adds byte and word versions of the doubleword/quadword instructions in AVX-512F. A few instructions which get only word forms with AVX-512BW acquire byte forms with the AVX-512_VBMI extension (VPERMB, VPERMI2B, VPERMT2B, VPMULTISHIFTQB).

Two new instructions were added to the mask instructions set: KADD and KTEST (B and W forms with AVX-512DQ, D and Q with AVX-512BW). The rest of mask instructions, which had only word forms, got byte forms with AVX-512DQ and doubleword/quadword forms with AVX-512BW. KUNPCKBW was extended to KUNPCKWD and KUNPCKDQ by AVX-512BW.

Among the instructions added by AVX-512DQ are several SSE and AVX instructions that did not get AVX-512 versions with AVX-512F, including all the two-input bitwise instructions and the extract/insert integer instructions.

Instructions that are completely new are covered below.

Floating point instructions

Three new floating-point operations are introduced. Because they are new operations rather than EVEX versions of existing ones, they have both packed (SIMD) and scalar versions.

The VFPCLASS instructions test whether a floating-point value belongs to one of eight special categories; the immediate field selects which of the eight categories set a bit in the output mask register. The VRANGE instructions perform minimum or maximum operations depending on the value of the immediate field, which also controls whether the operation works on absolute values and, separately, how the sign is handled. The VREDUCE instructions operate on a single source and subtract from it the source value rounded to the number of fraction bits specified in the immediate field.

Instruction	Extension set	Description
`VFPCLASSPS`, `VFPCLASSPD`	DQ	Test types of packed single and double precision floating point values.
`VFPCLASSSS`, `VFPCLASSSD`	DQ	Test types of scalar single and double precision floating point values.
`VRANGEPS`, `VRANGEPD`	DQ	Range restriction calculation for packed floating point values.
`VRANGESS`, `VRANGESD`	DQ	Range restriction calculation for scalar floating point values.
`VREDUCEPS`, `VREDUCEPD`	DQ	Perform reduction transformation on packed floating point values.
`VREDUCESS`, `VREDUCESD`	DQ	Perform reduction transformation on scalar floating point values.
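
The reduction transformation can be sketched on a single value. The hypothetical `vreduce` helper below assumes round-to-nearest-even for the rounding step (the real instructions let the immediate choose the rounding mode) and ignores special values.

```python
def vreduce(x, fraction_bits):
    """Return x minus x rounded to `fraction_bits` binary fraction bits."""
    scale = 2.0 ** fraction_bits
    rounded = round(x * scale) / scale  # round() uses round-half-to-even
    return x - rounded


print(vreduce(5.625, 1))  # 5.625 - 5.5
print(vreduce(3.75, 0))   # 3.75 - 4.0
```

The result is the small remainder left after removing the coarse part of the value, which is the typical first step of argument reduction in transcendental function implementations.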

Other instructions

Instruction	Extension set	Description
`VPMOVM2D`, `VPMOVM2Q`	DQ	Convert mask register to double- or quad-word vector register.
`VPMOVM2B`, `VPMOVM2W`	BW	Convert mask register to byte or word vector register.
`VPMOVD2M`, `VPMOVQ2M`	DQ	Convert double- or quad-word vector register to mask register.
`VPMOVB2M`, `VPMOVW2M`	BW	Convert byte or word vector register to mask register.
`VPMULLQ`	DQ	Multiply packed quadword store low result. A quadword version of VPMULLD.

VBMI2

VBMI2 extends VPCOMPRESS and VPEXPAND with byte and word variants and adds new concatenate-and-shift instructions.

Instruction	Description
`VPCOMPRESSB`, `VPCOMPRESSW`	Store sparse packed byte/word integer values into dense memory/register
`VPEXPANDB`, `VPEXPANDW`	Load sparse packed byte/word integer values from dense memory/register
`VPSHLD`	Concatenate and shift packed data left logical
`VPSHLDV`	Concatenate and variable shift packed data left logical
`VPSHRD`	Concatenate and shift packed data right logical
`VPSHRDV`	Concatenate and variable shift packed data right logical
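
The concatenate-and-shift operations can be modelled on one element pair. In this sketch the function names are hypothetical and the operand roles are an assumption: for the left shift, `a` forms the upper half of the double-width value and the upper half of the shifted result is returned; for the right shift, `a` forms the lower half and the lower half of the result is returned. The shift count is taken modulo the element width.

```python
def shld(a, b, count, width=32):
    """Concatenate a:b, shift left, return the upper `width` bits."""
    mask = (1 << width) - 1
    count &= width - 1
    combined = ((a & mask) << width) | (b & mask)
    return ((combined << count) >> width) & mask


def shrd(a, b, count, width=32):
    """Concatenate b:a, shift right, return the lower `width` bits."""
    mask = (1 << width) - 1
    count &= width - 1
    combined = ((b & mask) << width) | (a & mask)
    return (combined >> count) & mask


print(hex(shld(0x12345678, 0x9ABCDEF0, 8)))
print(hex(shrd(0x12345678, 0x9ABCDEF0, 8)))
# Using the same value for both inputs turns the operation into a rotate.
print(hex(shld(0x12345678, 0x12345678, 8)))
```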

VNNI

Vector Neural Network Instructions. AVX512-VNNI adds EVEX-coded instructions described below. With AVX-512F, these instructions can operate on 512-bit vectors, and AVX-512VL further adds support for 128- and 256-bit vectors.

A later AVX-VNNI extension adds VEX encodings of these instructions, which can only operate on 128- or 256-bit vectors. AVX-VNNI is not part of the AVX-512 suite; it does not require AVX-512F and can be implemented independently.

Instruction	Description
`VPDPBUSD`	Multiply and add unsigned and signed bytes
`VPDPBUSDS`	Multiply and add unsigned and signed bytes with saturation
`VPDPWSSD`	Multiply and add signed word integers
`VPDPWSSDS`	Multiply and add signed word integers with saturation
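
The byte dot-product instructions can be sketched for a single 32-bit lane. The helper names are hypothetical; the model assumes four unsigned bytes are multiplied by four signed bytes, the products are summed and added to a signed 32-bit accumulator, and the result either wraps around or saturates.

```python
INT32_MIN, INT32_MAX = -(1 << 31), (1 << 31) - 1


def vpdpbusd(acc, unsigned_bytes, signed_bytes):
    """Dot product of 4 unsigned and 4 signed bytes added to acc, wrapping to 32 bits."""
    total = acc + sum(u * s for u, s in zip(unsigned_bytes, signed_bytes))
    total &= 0xFFFFFFFF
    return total - (1 << 32) if total & 0x80000000 else total


def vpdpbusds(acc, unsigned_bytes, signed_bytes):
    """Same as vpdpbusd, but the result saturates to the signed 32-bit range."""
    total = acc + sum(u * s for u, s in zip(unsigned_bytes, signed_bytes))
    return max(INT32_MIN, min(INT32_MAX, total))


print(vpdpbusd(10, [1, 2, 3, 255], [1, -1, 2, -2]))
print(vpdpbusds(INT32_MAX, [255] * 4, [127] * 4))
```

One such instruction replaces the multiply, widen and add sequence that integer dot products in quantized neural network inference otherwise need.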

IFMA

Instruction	Extension set	Description
`VPMADD52LUQ`	IFMA	Packed multiply of unsigned 52-bit integers and add the low 52 bits of the products to quadword accumulators
`VPMADD52HUQ`	IFMA	Packed multiply of unsigned 52-bit integers and add the high 52 bits of the products to quadword accumulators
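
A single-lane model of the two instructions is sketched below. The function names are hypothetical; the model assumes that only the low 52 bits of each multiplicand are used, that the 104-bit product is split into a low and a high 52-bit half, and that the selected half is added to a 64-bit accumulator with wrap-around.

```python
MASK52 = (1 << 52) - 1
MASK64 = (1 << 64) - 1


def madd52lo(acc, a, b):
    """Add the low 52 bits of the 52x52-bit product to a 64-bit accumulator."""
    product = (a & MASK52) * (b & MASK52)
    return (acc + (product & MASK52)) & MASK64


def madd52hi(acc, a, b):
    """Add the high 52 bits of the 52x52-bit product to a 64-bit accumulator."""
    product = (a & MASK52) * (b & MASK52)
    return (acc + (product >> 52)) & MASK64


a, b = MASK52, 2  # product is 2**53 - 2
print(madd52lo(0, a, b), madd52hi(5, a, b))
```

Keeping limbs at 52 bits leaves 12 spare bits in each 64-bit accumulator, so several partial products can be summed before carries have to be propagated, which is useful for big-integer arithmetic.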

VPOPCNTDQ and BITALG

Instruction	Extension set	Description
`VPOPCNTD`, `VPOPCNTQ`	VPOPCNTDQ	Return the number of bits set to 1 in doubleword/quadword
`VPOPCNTB`, `VPOPCNTW`	BITALG	Return the number of bits set to 1 in byte/word
`VPSHUFBITQMB`	BITALG	Shuffle bits from quadword elements using byte indexes into mask

VP2INTERSECT

Instruction	Extension set	Description
`VP2INTERSECTD`, `VP2INTERSECTQ`	VP2INTERSECT	Compute intersection between doublewords/quadwords to a pair of mask registers
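
The pair of masks produced by the intersection can be modelled as follows. The hypothetical `vp2intersect` helper assumes that bit i of the first mask is set when element i of the first vector occurs anywhere in the second vector, and that bit j of the second mask is set when element j of the second vector occurs anywhere in the first.

```python
def vp2intersect(a, b):
    """Return two masks marking the elements of a found in b, and of b found in a."""
    mask_a = sum(1 << i for i, x in enumerate(a) if x in b)
    mask_b = sum(1 << j for j, y in enumerate(b) if y in a)
    return mask_a, mask_b


mask_a, mask_b = vp2intersect([1, 2, 3, 4], [3, 1, 9, 9])
print(bin(mask_a), bin(mask_b))
```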

GFNI

EVEX-encoded Galois field new instructions:

Instruction	Description
`VGF2P8AFFINEINVQB`	Galois field affine transformation inverse
`VGF2P8AFFINEQB`	Galois field affine transformation
`VGF2P8MULB`	Galois field multiply bytes
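
The byte multiplication can be illustrated with a bit-serial model. The sketch assumes the field GF(2⁸) is defined by the reduction polynomial x⁸ + x⁴ + x³ + x + 1 (0x11B), the same polynomial used by AES; the function name is hypothetical and the code handles one byte pair, whereas the instruction processes every byte of the vector.

```python
def gf2p8mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a      # addition in GF(2^8) is XOR
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B       # reduce when the degree reaches 8
    return result


# Worked example from the AES specification: {57} * {83} = {c1}.
print(hex(gf2p8mul(0x57, 0x83)))
```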

VPCLMULQDQ

VPCLMULQDQ with AVX-512F adds an EVEX-encoded 512-bit version of the PCLMULQDQ instruction. With AVX-512VL, it adds EVEX-encoded 256- and 128-bit versions. VPCLMULQDQ alone (that is, on non-AVX-512 CPUs) adds only the VEX-encoded 256-bit version. (Availability of the VEX-encoded 128-bit version is indicated by different CPUID bits: PCLMULQDQ and AVX.) The wider than 128-bit variations of the instruction perform the same operation on each 128-bit portion of the input registers, but they do not extend it to select quadwords from different 128-bit fields: the imm8 operand keeps its meaning and selects either the low or the high quadword of each 128-bit field.

Instruction	Description
`VPCLMULQDQ`	Carry-less multiplication quadword
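
Carry-less multiplication is ordinary binary multiplication with XOR in place of addition. The sketch below models one 128-bit field; the function names are hypothetical, and it assumes bit 0 of imm8 selects the quadword of the first source and bit 4 selects the quadword of the second source (0 for low, 1 for high).

```python
MASK64 = (1 << 64) - 1


def clmul64(x, y):
    """Carry-less (polynomial over GF(2)) product of two 64-bit values."""
    result = 0
    for i in range(64):
        if (y >> i) & 1:
            result ^= x << i
    return result


def pclmulqdq(a, b, imm8):
    """Model one 128-bit field: select a quadword from each source, then multiply."""
    qa = (a >> 64) if imm8 & 0x01 else a
    qb = (b >> 64) if imm8 & 0x10 else b
    return clmul64(qa & MASK64, qb & MASK64)


a = (7 << 64) | 3  # high quadword 7, low quadword 3
b = (2 << 64) | 3  # high quadword 2, low quadword 3
print(pclmulqdq(a, b, 0x00))  # low * low:   (x + 1)^2 = x^2 + 1
print(pclmulqdq(a, b, 0x11))  # high * high: (x^2 + x + 1) * x
```

A wider version simply repeats this computation for each 128-bit field with the same imm8.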

VAES

VEX- and EVEX-encoded AES instructions. The wider than 128-bit variations of the instruction perform the same operation on each 128-bit portion of input registers. The VEX versions can be used without AVX-512 support.

Instruction	Description
`VAESDEC`	Perform one round of an AES decryption flow
`VAESDECLAST`	Perform last round of an AES decryption flow
`VAESENC`	Perform one round of an AES encryption flow
`VAESENCLAST`	Perform last round of an AES encryption flow

BF16

AI acceleration instructions operating on the Bfloat16 numbers.

Instruction	Description
`VCVTNE2PS2BF16`	Convert two packed single precision numbers to one packed Bfloat16 number
`VCVTNEPS2BF16`	Convert one packed single precision number to one packed Bfloat16 number
`VDPBF16PS`	Calculate dot product of two Bfloat16 pairs and accumulate the result into one packed single precision number
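
A Bfloat16 number is the upper 16 bits of a single-precision number, so the conversion amounts to rounding away the lower 16 bits. The sketch below assumes round-to-nearest-even, which is what the "NE" in the conversion mnemonics stands for; the function names are hypothetical, and NaN and denormal handling is not modelled.

```python
import struct


def fp32_to_bf16_bits(x):
    """Round a single-precision value to Bfloat16 and return its 16-bit pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Adding 0x7FFF plus the lowest kept bit implements round-half-to-even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bf16_bits_to_fp32(h):
    """Widen a Bfloat16 bit pattern back to single precision (exact)."""
    return struct.unpack("<f", struct.pack("<I", (h & 0xFFFF) << 16))[0]


print(hex(fp32_to_bf16_bits(1.0)))
print(hex(fp32_to_bf16_bits(1.00390625)))  # halfway case, rounds to even
print(bf16_bits_to_fp32(0x3F80))
```

Because the exponent field is unchanged, Bfloat16 keeps the dynamic range of single precision and gives up only mantissa bits.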

FP16

An extension of the earlier F16C instruction set, adding comprehensive support for the binary16 floating-point numbers (also known as FP16, float16 or half-precision floating-point numbers). The new instructions implement most operations that were previously available for single and double-precision floating-point numbers and also introduce new operations. Scalar and packed operations are supported.

Unlike the single and double-precision format instructions, the half-precision operands are neither conditionally flushed to zero (FTZ) nor conditionally treated as zero (DAZ) based on MXCSR settings. Denormal values are processed at full speed by hardware to facilitate using the full dynamic range of the FP16 numbers. Instructions that create FP32 and FP64 numbers still respect the MXCSR.FTZ bit.^[9]

Arithmetic instructions[]

Instruction	Description
`VADDPH`, `VADDSH`	Add packed/scalar FP16 numbers.
`VSUBPH`, `VSUBSH`	Subtract packed/scalar FP16 numbers.
`VMULPH`, `VMULSH`	Multiply packed/scalar FP16 numbers.
`VDIVPH`, `VDIVSH`	Divide packed/scalar FP16 numbers.
`VSQRTPH`, `VSQRTSH`	Compute square root of packed/scalar FP16 numbers.
`VFMADD{132, 213, 231}PH`, `VFMADD{132, 213, 231}SH`	Multiply-add packed/scalar FP16 numbers.
`VFNMADD{132, 213, 231}PH`, `VFNMADD{132, 213, 231}SH`	Negated multiply-add packed/scalar FP16 numbers.
`VFMSUB{132, 213, 231}PH`, `VFMSUB{132, 213, 231}SH`	Multiply-subtract packed/scalar FP16 numbers.
`VFNMSUB{132, 213, 231}PH`, `VFNMSUB{132, 213, 231}SH`	Negated multiply-subtract packed/scalar FP16 numbers.
`VFMADDSUB{132, 213, 231}PH`	Multiply-add (odd vector elements) or multiply-subtract (even vector elements) packed FP16 numbers.
`VFMSUBADD{132, 213, 231}PH`	Multiply-subtract (odd vector elements) or multiply-add (even vector elements) packed FP16 numbers.
`VREDUCEPH`, `VREDUCESH`	Perform reduction transformation of the packed/scalar FP16 numbers.
`VRNDSCALEPH`, `VRNDSCALESH`	Round packed/scalar FP16 numbers to a given number of fraction bits.
`VSCALEFPH`, `VSCALEFSH`	Scale packed/scalar FP16 numbers by multiplying them by a power of two.
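
The operand-order suffixes and the alternating add/subtract forms in the table can be illustrated with a small Python sketch. The function names are hypothetical, and the sketch works on ordinary Python numbers, so it shows only which operands are combined; it does not model FP16 precision or the single rounding step of a fused operation. It assumes the usual FMA3 reading of the suffix: the first two digits name the operands that are multiplied, the third names the operand that is added.

```python
def fma_form(form, op1, op2, op3):
    """Return the value written to op1 for the given operand-order suffix."""
    if form == "132":
        return op1 * op3 + op2
    if form == "213":
        return op2 * op1 + op3
    if form == "231":
        return op2 * op3 + op1
    raise ValueError("unknown form: " + form)


def fmaddsub(a, b, c):
    """Multiply-add on odd elements, multiply-subtract on even elements."""
    return [x * y + z if i % 2 == 1 else x * y - z
            for i, (x, y, z) in enumerate(zip(a, b, c))]


def fmsubadd(a, b, c):
    """Multiply-subtract on odd elements, multiply-add on even elements."""
    return [x * y - z if i % 2 == 1 else x * y + z
            for i, (x, y, z) in enumerate(zip(a, b, c))]
```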

Complex arithmetic instructions[]

Instruction	Description
`VFMULCPH`, `VFMULCSH`	Multiply packed/scalar complex FP16 numbers.
`VFCMULCPH`, `VFCMULCSH`	Multiply packed/scalar complex FP16 numbers. Complex conjugate form of the operation.
`VFMADDCPH`, `VFMADDCSH`	Multiply-add packed/scalar complex FP16 numbers.
`VFCMADDCPH`, `VFCMADDCSH`	Multiply-add packed/scalar complex FP16 numbers. Complex conjugate form of the operation.
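
A minimal Python sketch of the complex forms follows. It assumes that each complex number occupies two adjacent elements (real part in the even element, imaginary part in the odd one) and that the conjugate form conjugates the second operand; the function names are hypothetical, and FP16 rounding is not modelled.

```python
def _pairs(v):
    """Group an interleaved [re0, im0, re1, im1, ...] list into (re, im) pairs."""
    return [(v[i], v[i + 1]) for i in range(0, len(v), 2)]


def fmulc(a, b, conjugate=False):
    """Complex multiply of interleaved vectors; optionally conjugate b first."""
    out = []
    for (ar, ai), (br, bi) in zip(_pairs(a), _pairs(b)):
        if conjugate:
            bi = -bi  # assumption: the conjugate is taken of the second operand
        out += [ar * br - ai * bi, ar * bi + ai * br]
    return out


def fmaddc(acc, a, b, conjugate=False):
    """Complex multiply-add: acc + a * b, element by element."""
    return [s + p for s, p in zip(acc, fmulc(a, b, conjugate))]
```

For (1 + 2i) and (3 + 4i), the plain form gives -5 + 10i, while the conjugate form gives 11 + 2i.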

Approximate reciprocal instructions[]

Instruction	Description
`VRCPPH`, `VRCPSH`	Compute approximate reciprocal of the packed/scalar FP16 numbers. The maximum relative error of the approximation is less than 2^-11+2^-14.
`VRSQRTPH`, `VRSQRTSH`	Compute approximate reciprocal square root of the packed/scalar FP16 numbers. The maximum relative error of the approximation is less than 2^-14.

Comparison instructions[]

Instruction	Description
`VCMPPH`, `VCMPSH`	Compare the packed/scalar FP16 numbers and store the result in a mask register.
`VCOMISH`	Compare the scalar FP16 numbers and store the result in the flags register. Signals an exception if a source operand is QNaN or SNaN.
`VUCOMISH`	Compare the scalar FP16 numbers and store the result in the flags register. Signals an exception only if a source operand is SNaN.
`VMAXPH`, `VMAXSH`	Select the maximum of each vertical pair of the source packed/scalar FP16 numbers.
`VMINPH`, `VMINSH`	Select the minimum of each vertical pair of the source packed/scalar FP16 numbers.
`VFPCLASSPH`, `VFPCLASSSH`	Test packed/scalar FP16 numbers for special categories (NaN, infinity, negative zero, etc.) and store the result in a mask register.
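
The categories that these instructions distinguish can be read directly from the binary16 bit pattern (1 sign bit, 5 exponent bits, 10 fraction bits). The following Python sketch is an illustration only: the function name `classify_fp16` and the category labels are hypothetical, and the real classification instructions take an immediate operand that selects which categories to test, which is not modelled here.

```python
def classify_fp16(bits):
    """Name the category of a 16-bit binary16 pattern."""
    sign = (bits >> 15) & 1
    exponent = (bits >> 10) & 0x1F
    fraction = bits & 0x3FF
    if exponent == 0x1F:
        if fraction == 0:
            return "-inf" if sign else "+inf"
        # The top fraction bit distinguishes quiet from signalling NaNs.
        return "qnan" if fraction & 0x200 else "snan"
    if exponent == 0:
        if fraction == 0:
            return "-zero" if sign else "+zero"
        return "denormal"
    return "normal"
```

The quiet/signalling distinction is what separates `VCOMISH` from `VUCOMISH`: only the former signals an exception for a `"qnan"` input.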

Conversion instructions[]

Instruction	Description
`VCVTW2PH`	Convert packed signed 16-bit integers to FP16 numbers.
`VCVTUW2PH`	Convert packed unsigned 16-bit integers to FP16 numbers.
`VCVTDQ2PH`	Convert packed signed 32-bit integers to FP16 numbers.
`VCVTUDQ2PH`	Convert packed unsigned 32-bit integers to FP16 numbers.
`VCVTQQ2PH`	Convert packed signed 64-bit integers to FP16 numbers.
`VCVTUQQ2PH`	Convert packed unsigned 64-bit integers to FP16 numbers.
`VCVTPS2PHX`	Convert packed FP32 numbers to FP16 numbers. Unlike `VCVTPS2PH` from F16C, `VCVTPS2PHX` has a different encoding that also supports broadcasting.
`VCVTPD2PH`	Convert packed FP64 numbers to FP16 numbers.
`VCVTSI2SH`	Convert a scalar signed 32-bit or 64-bit integer to an FP16 number.
`VCVTUSI2SH`	Convert a scalar unsigned 32-bit or 64-bit integer to an FP16 number.
`VCVTSS2SH`	Convert a scalar FP32 number to an FP16 number.
`VCVTSD2SH`	Convert a scalar FP64 number to an FP16 number.
`VCVTPH2W`, `VCVTTPH2W`	Convert packed FP16 numbers to signed 16-bit integers. `VCVTPH2W` rounds the value according to the `MXCSR` register. `VCVTTPH2W` rounds toward zero.
`VCVTPH2UW`, `VCVTTPH2UW`	Convert packed FP16 numbers to unsigned 16-bit integers. `VCVTPH2UW` rounds the value according to the `MXCSR` register. `VCVTTPH2UW` rounds toward zero.
`VCVTPH2DQ`, `VCVTTPH2DQ`	Convert packed FP16 numbers to signed 32-bit integers. `VCVTPH2DQ` rounds the value according to the `MXCSR` register. `VCVTTPH2DQ` rounds toward zero.
`VCVTPH2UDQ`, `VCVTTPH2UDQ`	Convert packed FP16 numbers to unsigned 32-bit integers. `VCVTPH2UDQ` rounds the value according to the `MXCSR` register. `VCVTTPH2UDQ` rounds toward zero.
`VCVTPH2QQ`, `VCVTTPH2QQ`	Convert packed FP16 numbers to signed 64-bit integers. `VCVTPH2QQ` rounds the value according to the `MXCSR` register. `VCVTTPH2QQ` rounds toward zero.
`VCVTPH2UQQ`, `VCVTTPH2UQQ`	Convert packed FP16 numbers to unsigned 64-bit integers. `VCVTPH2UQQ` rounds the value according to the `MXCSR` register. `VCVTTPH2UQQ` rounds toward zero.
`VCVTPH2PSX`	Convert packed FP16 numbers to FP32 numbers. Unlike `VCVTPH2PS` from F16C, `VCVTPH2PSX` has a different encoding that also supports broadcasting.
`VCVTPH2PD`	Convert packed FP16 numbers to FP64 numbers.
`VCVTSH2SI`, `VCVTTSH2SI`	Convert a scalar FP16 number to a signed 32-bit or 64-bit integer. `VCVTSH2SI` rounds the value according to the `MXCSR` register. `VCVTTSH2SI` rounds toward zero.
`VCVTSH2USI`, `VCVTTSH2USI`	Convert a scalar FP16 number to an unsigned 32-bit or 64-bit integer. `VCVTSH2USI` rounds the value according to the `MXCSR` register. `VCVTTSH2USI` rounds toward zero.
`VCVTSH2SS`	Convert a scalar FP16 number to an FP32 number.
`VCVTSH2SD`	Convert a scalar FP16 number to an FP64 number.
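
The difference between the `VCVT…` and `VCVTT…` pairs in the table is only the rounding rule. A small Python sketch, assuming the `MXCSR` rounding mode is left at its default (round to nearest, ties to even) and ignoring out-of-range and NaN inputs; the function names are hypothetical:

```python
import math


def cvt_mxcsr_default(x):
    """Round to nearest, ties to even (Python's round() uses the same rule)."""
    return round(x)


def cvt_truncate(x):
    """Round toward zero, as the VCVTT... forms do."""
    return math.trunc(x)
```

The two rules differ for values such as 1.5 (2 versus 1) and -1.5 (-2 versus -1), and agree for 2.5 only because the tie goes to the even neighbour.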

Decomposition instructions[]

Instruction	Description
`VGETEXPPH`, `VGETEXPSH`	Extract exponent components of packed/scalar FP16 numbers as FP16 numbers.
`VGETMANTPH`, `VGETMANTSH`	Extract mantissa components of packed/scalar FP16 numbers as FP16 numbers.
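
The decomposition can be sketched with Python's `math.frexp`. This is a simplified model under stated assumptions: the mantissa is normalized to the interval [1, 2) and keeps the sign of the source, whereas the real mantissa-extraction instructions take an immediate that selects the interval and the sign handling. Zero, infinity, NaN, and FP16 rounding are not modelled, and the function names are hypothetical.

```python
import math


def getexp(x):
    """Unbiased exponent of x, returned as a float: floor(log2(|x|))."""
    _, e = math.frexp(abs(x))   # frexp yields a mantissa in [0.5, 1)
    return float(e - 1)


def getmant(x):
    """Mantissa of x normalized to [1, 2), keeping the sign of x."""
    m, _ = math.frexp(abs(x))
    return math.copysign(2.0 * m, x)
```

For finite non-zero inputs, `getmant(x) * 2.0 ** getexp(x)` reconstructs `x`.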

Move instructions[]

Instruction	Description
`VMOVSH`	Move scalar FP16 number to/from memory or between vector registers.
`VMOVW`	Move scalar FP16 number to/from memory or general purpose register.

Legacy instructions upgraded with EVEX encoded versions[]

Legacy encoding (SSE, SSE2, MMX)	Legacy encoding (AVX, SSE3, SSE4.1)	Legacy encoding (AVX2, FMA)	Group	Instructions	AVX-512 extensions
Yes	Yes	No	VADD	`VADDPD`, `VADDPS`, `VADDSD`, `VADDSS`	F, VL
			VAND	`VANDPD`, `VANDPS`, `VANDNPD`, `VANDNPS`	VL, DQ
			VCMP	`VCMPPD`, `VCMPPS`, `VCMPSD`, `VCMPSS`	F
			VCOM	`VCOMISD`, `VCOMISS`	F
			VDIV	`VDIVPD`, `VDIVPS`, `VDIVSD`, `VDIVSS`	F, VL
			VCVT	`VCVTDQ2PD`, `VCVTDQ2PS`, `VCVTPD2DQ`, `VCVTPD2PS`, `VCVTPH2PS`, `VCVTPS2PH`, `VCVTPS2DQ`, `VCVTPS2PD`, `VCVTSD2SI`, `VCVTSD2SS`, `VCVTSI2SD`, `VCVTSI2SS`, `VCVTSS2SD`, `VCVTSS2SI`, `VCVTTPD2DQ`, `VCVTTPS2DQ`, `VCVTTSD2SI`, `VCVTTSS2SI`	F, VL
			VMAX	`VMAXPD`, `VMAXPS`, `VMAXSD`, `VMAXSS`	F, VL
			VMIN	`VMINPD`, `VMINPS`, `VMINSD`, `VMINSS`	F
			VMOV	`VMOVAPD`, `VMOVAPS`, `VMOVD`, `VMOVQ`, `VMOVDDUP`, `VMOVHLPS`, `VMOVHPD`, `VMOVHPS`, `VMOVLHPS`, `VMOVLPD`, `VMOVLPS`, `VMOVNTDQA`, `VMOVNTDQ`, `VMOVNTPD`, `VMOVNTPS`, `VMOVSD`, `VMOVSHDUP`, `VMOVSLDUP`, `VMOVSS`, `VMOVUPD`, `VMOVUPS`, `VMOVDQA32`, `VMOVDQA64`, `VMOVDQU8`, `VMOVDQU16`, `VMOVDQU32`, `VMOVDQU64`	F, VL, BW
			VMUL	`VMULPD`, `VMULPS`, `VMULSD`, `VMULSS`	F, VL
			VOR	`VORPD`, `VORPS`	VL, DQ
			VSQRT	`VSQRTPD`, `VSQRTPS`, `VSQRTSD`, `VSQRTSS`	F, VL
			VSUB	`VSUBPD`, `VSUBPS`, `VSUBSD`, `VSUBSS`	F, VL
			VUCOMI	`VUCOMISD`, `VUCOMISS`	F
			VUNPCK	`VUNPCKHPD`, `VUNPCKHPS`, `VUNPCKLPD`, `VUNPCKLPS`	F, VL
			VXOR	`VXORPD`, `VXORPS`	VL, DQ
No	Yes	No	VEXTRACTPS	`VEXTRACTPS`	F
			VINSERTPS	`VINSERTPS`	F
			VPALIGNR	`VPALIGNR`	VL, BW
			VPEXTR	`VPEXTRB`, `VPEXTRW`, `VPEXTRD`, `VPEXTRQ`	BW, DQ
			VPINSR	`VPINSRB`, `VPINSRW`, `VPINSRD`, `VPINSRQ`	BW, DQ
Yes	Yes	Yes	VPACK	`VPACKSSWB`, `VPACKSSDW`, `VPACKUSDW`, `VPACKUSWB`	VL, BW
			VPADD	`VPADDB`, `VPADDW`, `VPADDD`, `VPADDQ`, `VPADDSB`, `VPADDSW`, `VPADDUSB`, `VPADDUSW`	F, VL, BW
			VPAND	`VPANDD`, `VPANDQ`, `VPANDND`, `VPANDNQ`	F, VL
			VPAVG	`VPAVGB`, `VPAVGW`	VL, BW
			VPCMPEQ	`VPCMPEQB`, `VPCMPEQW`, `VPCMPEQD`, `VPCMPEQQ`	F, VL, BW
			VPCMPGT	`VPCMPGTB`, `VPCMPGTW`, `VPCMPGTD`, `VPCMPGTQ`	F, VL, BW
			VPMAX	`VPMAXSB`, `VPMAXSW`, `VPMAXSD`, `VPMAXSQ`, `VPMAXUB`, `VPMAXUW`, `VPMAXUD`, `VPMAXUQ`	F, VL, BW
			VPMIN	`VPMINSB`, `VPMINSW`, `VPMINSD`, `VPMINSQ`, `VPMINUB`, `VPMINUW`, `VPMINUD`, `VPMINUQ`	F, VL, BW
			VPMOV	`VPMOVSXBW`, `VPMOVSXBD`, `VPMOVSXBQ`, `VPMOVSXWD`, `VPMOVSXWQ`, `VPMOVSXDQ`, `VPMOVZXBW`, `VPMOVZXBD`, `VPMOVZXBQ`, `VPMOVZXWD`, `VPMOVZXWQ`, `VPMOVZXDQ`	F, VL, BW
			VPMUL	`VPMULDQ`, `VPMULUDQ`, `VPMULHRSW`, `VPMULHUW`, `VPMULHW`, `VPMULLD`, `VPMULLQ`, `VPMULLW`	F, VL, BW
			VPOR	`VPORD`, `VPORQ`	F, VL
			VPSUB	`VPSUBB`, `VPSUBW`, `VPSUBD`, `VPSUBQ`, `VPSUBSB`, `VPSUBSW`, `VPSUBUSB`, `VPSUBUSW`	F, VL, BW
			VPUNPCK	`VPUNPCKHBW`, `VPUNPCKHWD`, `VPUNPCKHDQ`, `VPUNPCKHQDQ`, `VPUNPCKLBW`, `VPUNPCKLWD`, `VPUNPCKLDQ`, `VPUNPCKLQDQ`	F, VL, BW
			VPXOR	`VPXORD`, `VPXORQ`	F, VL
			VPSADBW	`VPSADBW`	VL, BW
			VPSHUF	`VPSHUFB`, `VPSHUFHW`, `VPSHUFLW`, `VPSHUFD`, `VPSLLDQ`, `VPSLLW`, `VPSLLD`, `VPSLLQ`, `VPSRAW`, `VPSRAD`, `VPSRAQ`, `VPSRLDQ`, `VPSRLW`, `VPSRLD`, `VPSRLQ`, `VPSLLVW`, `VPSLLVD`, `VPSLLVQ`, `VPSRLVW`, `VPSRLVD`, `VPSRLVQ`, `VPSHUFPD`, `VPSHUFPS`	F, VL, BW
No	Yes	Yes	VEXTRACT	`VEXTRACTF32X4`, `VEXTRACTF64X2`, `VEXTRACTF32X8`, `VEXTRACTF64X4`, `VEXTRACTI32X4`, `VEXTRACTI64X2`, `VEXTRACTI32X8`, `VEXTRACTI64X4`	F, VL, DQ
			VINSERT	`VINSERTF32X4`, `VINSERTF64X2`, `VINSERTF32X8`, `VINSERTF64X4`, `VINSERTI32X4`, `VINSERTI64X2`, `VINSERTI32X8`, `VINSERTI64X4`	F, VL, DQ
			VPABS	`VPABSB`, `VPABSW`, `VPABSD`, `VPABSQ`	F, VL, BW
			VPERM	`VPERMD`, `VPERMILPD`, `VPERMILPS`, `VPERMPD`, `VPERMPS`, `VPERMQ`	F, VL
			VPMADD	`VPMADDUBSW`, `VPMADDWD`	VL, BW
No	No	Yes	VFMADD	`VFMADD132PD`, `VFMADD213PD`, `VFMADD231PD`, `VFMADD132PS`, `VFMADD213PS`, `VFMADD231PS`, `VFMADD132SD`, `VFMADD213SD`, `VFMADD231SD`, `VFMADD132SS`, `VFMADD213SS`, `VFMADD231SS`	F, VL
			VFMADDSUB	`VFMADDSUB132PD`, `VFMADDSUB213PD`, `VFMADDSUB231PD`, `VFMADDSUB132PS`, `VFMADDSUB213PS`, `VFMADDSUB231PS`	F, VL
			VFMSUBADD	`VFMSUBADD132PD`, `VFMSUBADD213PD`, `VFMSUBADD231PD`, `VFMSUBADD132PS`, `VFMSUBADD213PS`, `VFMSUBADD231PS`	F, VL
			VFMSUB	`VFMSUB132PD`, `VFMSUB213PD`, `VFMSUB231PD`, `VFMSUB132PS`, `VFMSUB213PS`, `VFMSUB231PS`, `VFMSUB132SD`, `VFMSUB213SD`, `VFMSUB231SD`, `VFMSUB132SS`, `VFMSUB213SS`, `VFMSUB231SS`	F, VL
			VFNMADD	`VFNMADD132PD`, `VFNMADD213PD`, `VFNMADD231PD`, `VFNMADD132PS`, `VFNMADD213PS`, `VFNMADD231PS`, `VFNMADD132SD`, `VFNMADD213SD`, `VFNMADD231SD`, `VFNMADD132SS`, `VFNMADD213SS`, `VFNMADD231SS`	F, VL
			VFNMSUB	`VFNMSUB132PD`, `VFNMSUB213PD`, `VFNMSUB231PD`, `VFNMSUB132PS`, `VFNMSUB213PS`, `VFNMSUB231PS`, `VFNMSUB132SD`, `VFNMSUB213SD`, `VFNMSUB231SD`, `VFNMSUB132SS`, `VFNMSUB213SS`, `VFNMSUB231SS`	F, VL
			VGATHER	`VGATHERDPS`, `VGATHERDPD`, `VGATHERQPS`, `VGATHERQPD`	F, VL
			VPGATHER	`VPGATHERDD`, `VPGATHERDQ`, `VPGATHERQD`, `VPGATHERQQ`	F, VL
			VPSRAV	`VPSRAVW`, `VPSRAVD`, `VPSRAVQ`	F, VL, BW

CPUs with AVX-512[]

Intel
- Knights Landing (Xeon Phi x200):^[1]^[10] AVX-512 F, CD, ER, PF
- Knights Mill (Xeon Phi x205):^[7] AVX-512 F, CD, ER, PF, 4FMAPS, 4VNNIW, VPOPCNTDQ
- Skylake-SP, Skylake-X:^[11]^[12]^[13] AVX-512 F, CD, VL, DQ, BW
- Cannon Lake:^[7] AVX-512 F, CD, VL, DQ, BW, IFMA, VBMI
- Cascade Lake: AVX-512 F, CD, VL, DQ, BW, VNNI
- Cooper Lake: AVX-512 F, CD, VL, DQ, BW, VNNI, BF16
- Ice Lake,^[7] Rocket Lake:^[14]^[15] AVX-512 F, CD, VL, DQ, BW, IFMA, VBMI, VBMI2, VPOPCNTDQ, BITALG, VNNI, VPCLMULQDQ, GFNI, VAES
- Tiger Lake (except Pentium and Celeron, although some reviewers have published CPU-Z screenshots of the Celeron 6305 showing AVX-512 support^[16]^[17]):^[18] AVX-512 F, CD, VL, DQ, BW, IFMA, VBMI, VBMI2, VPOPCNTDQ, BITALG, VNNI, VPCLMULQDQ, GFNI, VAES, VP2INTERSECT
- Sapphire Rapids: BF16, FP16^[9]
Centaur Technology
- "CNS" core (8c/8t):^[19]^[20] AVX-512 F, CD, VL, BW, DQ, IFMA, VBMI
AMD
- Zen 4: BF16^[21]^[22]

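Subset	F	ER	4FMAPS	VPOPCNTDQ	VL	IFMA	VNNI	BF16	VBMI2	VP2INTERSECT
Knights Landing (Xeon Phi x200, 2016)	Yes	Yes	No	No	No	No	No	No	No	No
Knights Mill (Xeon Phi x205, 2017)	Yes	Yes	Yes	Yes	No	No	No	No	No	No
Skylake-SP, Skylake-X (2017)	Yes	No	No	No	Yes	No	No	No	No	No
Cannon Lake (2018)	Yes	No	No	No	Yes	Yes	No	No	No	No
Cascade Lake (2019)	Yes	No	No	No	Yes	No	Yes	No	No	No
Cooper Lake (2020)	Yes	No	No	No	Yes	No	Yes	Yes	No	No
Ice Lake (2019)	Yes	No	No	Yes	Yes	Yes	Yes	No	Yes	No
Tiger Lake (2020)	Yes	No	No	Yes	Yes	Yes	Yes	No	Yes	Yes
Rocket Lake (2021)	Yes	No	No	Yes	Yes	Yes	Yes	No	Yes	No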
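
On a running system, the supported subsets can be read from the CPU feature flags rather than looked up by processor generation. The Python sketch below extracts AVX-512 subset names from a flags string. It assumes the flag spelling used by Linux in `/proc/cpuinfo` (for example `avx512f`, `avx512bw`, `avx512_vnni`); the function name and the sample string are hypothetical.

```python
def avx512_subsets(flags_line):
    """Return the sorted AVX-512 subset names found in a CPU flags string."""
    subsets = []
    for flag in flags_line.split():
        if flag.startswith("avx512"):
            # "avx512_vnni" -> "VNNI", "avx512f" -> "F"
            subsets.append(flag[len("avx512"):].lstrip("_").upper())
    return sorted(subsets)


# Hypothetical sample, shaped like a "flags" line from /proc/cpuinfo.
sample_flags = "fpu sse2 avx2 avx512f avx512cd avx512bw avx512_vnni"
print(avx512_subsets(sample_flags))
```

Extensions such as VAES, GFNI and VPCLMULQDQ are reported under their own flag names and are therefore not picked up by the `avx512` prefix test.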

QEMU supports emulating AVX-512 in its TCG.

Alder Lake and similar desktop-grade hybrid cores have the silicon units for AVX-512, but the feature is disabled.^[23]^[24]

Performance[]

Intel Vectorization Advisor (starting from version 2016 Update 3) supports native AVX-512 performance and vector code quality analysis for the 2nd generation Intel Xeon Phi (codenamed Knights Landing) processor. Along with the traditional hotspots profile, Advisor Recommendations, and integration of Intel Compiler vectorization diagnostics, the Advisor Survey analysis also provides AVX-512 ISA metrics and new AVX-512-specific "traits", e.g. Scatter, Compress/Expand, and mask utilization.^[25]^[26]

On some processors, AVX-512 instructions cause frequency throttling even greater than that caused by their predecessors, which penalizes mixed workloads. The additional downclocking is triggered by the 512-bit width of the vectors and depends on the nature of the instructions being executed; using the 128-bit or 256-bit part of AVX-512 (AVX-512VL) does not trigger it. As a result, gcc and clang default to preferring 256-bit vectors.^[27]

References[]

[1] James Reinders (23 July 2013). "AVX-512 Instructions". Intel. Retrieved 20 August 2013.
[2] "Advanced Intelligence for High-Density Edge Solutions". Intel. Retrieved 8 February 2018.
[3] James Reinders (17 July 2014). "Additional AVX-512 instructions". Intel. Retrieved 3 August 2014.
[4] Anton Shilov. "Intel 'Skylake' processors for PCs will not support AVX-512 instructions". Kitguru.net. Retrieved 2015-03-17.
[5] "Intel will add deep-learning instructions to its processors".
[6] "Intel Architecture Instruction Set Extensions Programming Reference" (PDF). Intel. Retrieved 2014-01-29.
[7] "Intel Architecture Instruction Set Extensions and Future Features Programming Reference". Intel. Retrieved 2017-10-16.
[8] "AVX-512 Architecture/Demikhovsky Poster" (PDF). Intel. Retrieved 25 February 2014.
[9] "Intel® AVX512-FP16 Architecture Specification, June 2021, Revision 1.0, Ref. 347407-001US" (PDF). Intel. 2021-06-30. Retrieved 2021-07-04.
[10] "Intel Xeon Phi Processor product brief". Intel. Retrieved 12 October 2016.
[11] "Intel unveils X-series platform: Up to 18 cores and 36 threads, from $242 to $2,000". Ars Technica. Retrieved 2017-05-30.
[12] "Intel Advanced Vector Extensions 2015/2016 : Support in GNU Compiler Collection" (PDF). Gcc.gnu.org. Retrieved 2016-10-20.
[13] Patrizio, Andy (21 September 2015). "Intel's Xeon roadmap for 2016 leaks". Itworld.org. Retrieved 2016-10-20.
[14] "Intel Core i9-11900K Review - World's Fastest Gaming Processor?". www.techpowerup.com.
[15] "«Add rocketlake to gcc» commit". gcc.gnu.org.
[16] "Intel Celeron 6305 Processor (4M Cache, 1.80 GHz, with IPU) Product Specifications". ark.intel.com. Retrieved 2020-11-10.
[17] Laptop Murah Kinerja Boleh Diadu | HP 14S DQ2518TU, retrieved 2021-08-08
[18] "Using the GNU Compiler Collection (GCC): x86 Options". GNU. Retrieved 2019-10-14.
[19] https://centtech.com/ai-technology/
[20] "x86, x64 Instruction Latency, Memory Latency and CPUID dumps (instlatx64)". users.atw.hu.
[21] "AMD Zen 4 Based Ryzen CPUs May Feature Up to 24 Cores, Support for AVX512 Vectors". Hardware Times. 2021-05-23. Retrieved 2021-09-02.
[22] Hagedoorn, Hilbert. "AMD working on a prodigious 96-core EPYC processor". Guru3D.com. Retrieved 2021-05-25.
[23] Cutress, Ian; Frumusanu, Andrei (2021-08-19). "Intel Architecture Day 2021: Alder Lake, Golden Cove, and Gracemont Detailed". AnandTech. Retrieved 2021-08-25.
[24] Alcorn, Paul (2021-08-19). "Intel Architecture Day 2021: Alder Lake Chips, Golden Cove and Gracemont Cores". Tom's Hardware. Retrieved 2021-08-21.
[25] "Intel Advisor XE 2016 Update 3 What's new - Intel Software". Software.intel.com. Retrieved 2016-10-20.
[26] "Intel Advisor - Intel Software". Software.intel.com. Retrieved 2016-10-20.
[27] Cordes, Peter. "SIMD instructions lowering CPU frequency". Stack Overflow.
