Streaming SIMD Extensions
In computing, Streaming SIMD Extensions (SSE) is a single instruction, multiple data (SIMD) instruction set extension to the x86 architecture, designed by Intel and introduced in 1999 in its Pentium III series of central processing units (CPUs), shortly after the appearance of Advanced Micro Devices' (AMD's) 3DNow!. SSE contains 70 new instructions, most of which work on single-precision floating-point data. SIMD instructions can greatly increase performance when exactly the same operations are to be performed on multiple data objects. Typical applications are digital signal processing and graphics processing.
Intel's first IA-32 SIMD effort was the MMX instruction set. MMX had two main problems: it reused the existing x87 floating-point registers, making the CPU unable to work on floating-point and SIMD data at the same time, and it worked only on integers. SSE floating-point instructions operate on a new, independent register set, the XMM registers, and SSE adds a few integer instructions that work on the MMX registers.
SSE was subsequently expanded by Intel to SSE2, SSE3, SSSE3, and SSE4. Because it supports floating-point math, SSE had wider applications than MMX and became more popular. The addition of integer support in SSE2 made MMX largely redundant, though further performance increases can be attained in some situations by using MMX in parallel with SSE operations.
SSE was originally called Katmai New Instructions, Katmai being the code name for the first Pentium III core revision. During the Katmai project Intel sought to distinguish it from their earlier product line, particularly their flagship Pentium II. It was later renamed Internet Streaming SIMD Extensions, then SSE. AMD eventually added support for SSE instructions, starting with its Athlon XP and Duron processors.
Registers
SSE originally added eight new 128-bit registers known as XMM0 through XMM7. The AMD64 extensions from AMD added a further eight registers, XMM8 through XMM15, and this extension is duplicated in the Intel 64 architecture. There is also a new 32-bit control/status register, MXCSR. The registers XMM8 through XMM15 are accessible only in 64-bit operating mode.

SSE used only a single data type for XMM registers:
- four 32-bit single-precision floating point numbers

SSE2 later expanded the usage of the XMM registers to include:
- two 64-bit double-precision floating point numbers or
- two 64-bit integers or
- four 32-bit integers or
- eight 16-bit short integers or
- sixteen 8-bit bytes or characters.
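The register layouts listed above can be pictured as different views of the same 16 bytes. The following is only an illustrative sketch: the `xmm_view` union and the `xmm_lanes` helper are hypothetical names, not types defined by Intel or by any compiler, and the sketch assumes the usual 4-byte `float` and 8-byte `double`.

```c
#include <stdint.h>

/* Hypothetical type for illustration only: the six ways the 128 bits of
   one XMM register can be interpreted. */
typedef union {
    float   f32[4];   /* four 32-bit single-precision floats (original SSE) */
    double  f64[2];   /* two 64-bit double-precision floats */
    int64_t i64[2];   /* two 64-bit integers */
    int32_t i32[4];   /* four 32-bit integers */
    int16_t i16[8];   /* eight 16-bit short integers */
    int8_t  i8[16];   /* sixteen 8-bit bytes or characters */
} xmm_view;

/* Number of elements ("lanes") of the given size that fit in one register. */
static unsigned xmm_lanes(unsigned element_bytes)
{
    return (unsigned)sizeof(xmm_view) / element_bytes;
}
```

For example, `xmm_lanes(sizeof(float))` gives the lane count for single-precision data, which is the number of values one packed SSE instruction processes at a time.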
Because these 128-bit registers are additional machine state that the operating system must preserve across task switches, they are disabled by default until the operating system explicitly enables them. This means that the OS must know how to use the FXSAVE and FXRSTOR instructions, the extended pair of instructions that can save and restore all x87 and SSE register state at once. This support was quickly added to all major IA-32 operating systems.

The first CPU to support SSE, the Pentium III, shared execution resources between SSE and the floating point unit. While a compiled application can interleave FPU and SSE instructions side-by-side, the Pentium III will not issue an FPU and an SSE instruction in the same clock cycle. This limitation reduces the effectiveness of pipelining, but the separate XMM registers do allow SIMD and scalar floating point operations to be mixed without the performance hit from explicit MMX/floating point mode switching.
SSE instructions
SSE introduced both scalar and packed floating point instructions.
Floating point instructions
- Memory-to-register/register-to-memory/register-to-register data movement
  - Scalar – MOVSS
  - Packed – MOVAPS, MOVUPS, MOVLPS, MOVHPS, MOVLHPS, MOVHLPS, MOVMSKPS
- Arithmetic
  - Scalar – ADDSS, SUBSS, MULSS, DIVSS, RCPSS, SQRTSS, MAXSS, MINSS, RSQRTSS
  - Packed – ADDPS, SUBPS, MULPS, DIVPS, RCPPS, SQRTPS, MAXPS, MINPS, RSQRTPS
- Compare
  - Scalar – CMPSS, COMISS, UCOMISS
  - Packed – CMPPS
- Data shuffle and unpacking
  - Packed – SHUFPS, UNPCKHPS, UNPCKLPS
- Data-type conversion
  - Scalar – CVTSI2SS, CVTSS2SI, CVTTSS2SI
  - Packed – CVTPI2PS, CVTPS2PI, CVTTPS2PI
- Bitwise logical operations
  - Packed – ANDPS, ORPS, XORPS, ANDNPS
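The difference between the scalar and packed forms can be seen with the compiler intrinsics from `<xmmintrin.h>`. The sketch below is illustrative rather than definitive: the function names `add_scalar_lane` and `add_packed` are made up for this example, and although `_mm_add_ss` and `_mm_add_ps` correspond to ADDSS and ADDPS, a compiler is free to choose different instructions.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Scalar form (ADDSS): only the lowest element is added; the other three
   elements of the result are copied unchanged from a. */
static void add_scalar_lane(const float a[4], const float b[4], float out[4])
{
    __m128 va = _mm_loadu_ps(a);   /* unaligned load, as with MOVUPS */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ss(va, vb));
}

/* Packed form (ADDPS): all four elements are added in one operation. */
static void add_packed(const float a[4], const float b[4], float out[4])
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}
```

With inputs `{1, 2, 3, 4}` and `{10, 20, 30, 40}`, the scalar form changes only the first element, while the packed form adds every pair.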
Integer instructions
- Arithmetic
  - PMULHUW, PSADBW, PAVGB, PAVGW, PMAXUB, PMINUB, PMAXSW, PMINSW
- Data movement
  - PEXTRW, PINSRW
- Other
  - PMOVMSKB, PSHUFW
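To make two of the integer arithmetic instructions above more concrete, the following is a plain-C reference model of what PAVGB and the 64-bit form of PSADBW compute. It is a scalar sketch for understanding only, not how the hardware is implemented, and the function names are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

/* Model of one PAVGB element: the average of two unsigned bytes,
   rounded up when the sum is odd. */
static uint8_t pavgb_element(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + (unsigned)b + 1u) >> 1);
}

/* Model of PSADBW on a 64-bit MMX register: the sum of the absolute
   differences of eight unsigned bytes. The maximum, 8 * 255, fits in 16 bits. */
static uint16_t psadbw_model(const uint8_t a[8], const uint8_t b[8])
{
    unsigned sum = 0;
    for (int i = 0; i < 8; i++)
        sum += (unsigned)abs((int)a[i] - (int)b[i]);
    return (uint16_t)sum;
}
```

The real instructions perform these computations on all elements of a register at once, which is why they are useful for tasks such as image averaging and motion estimation.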
Other instructions
- MXCSR management
  - LDMXCSR, STMXCSR
- Cache and memory management
  - MOVNTQ, MOVNTPS, MASKMOVQ, PREFETCHT0, PREFETCHT1, PREFETCHT2, PREFETCHNTA, SFENCE
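The MXCSR management instructions are reachable from C through `_mm_getcsr` and `_mm_setcsr` in `<xmmintrin.h>`. As a minimal sketch, the hypothetical helper below temporarily changes the MXCSR rounding-control field, performs a float-to-integer conversion with `_mm_cvtss_si32` (which corresponds to CVTSS2SI and honors that field), and then restores the previous register value. It assumes the compiler does not move the conversion across the MXCSR accesses.

```c
#include <xmmintrin.h>

/* Convert value to int under the given rounding mode, which should be one of
   _MM_ROUND_NEAREST, _MM_ROUND_DOWN, _MM_ROUND_UP or _MM_ROUND_TOWARD_ZERO. */
static int convert_with_rounding(float value, unsigned rounding_mode)
{
    unsigned saved = _mm_getcsr();                    /* read MXCSR (STMXCSR) */
    _MM_SET_ROUNDING_MODE(rounding_mode);             /* write MXCSR (LDMXCSR) */
    int result = _mm_cvtss_si32(_mm_set_ss(value));   /* rounds per MXCSR */
    _mm_setcsr(saved);                                /* restore caller's state */
    return result;
}
```

Saving and restoring the register matters because MXCSR is per-thread state that other floating-point code in the same thread also depends on.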
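Example
The following simple example demonstrates the advantage of using SSE. Consider adding two four-component single-precision vectors, an operation used very often in computer graphics: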
vec_res.x = v1.x + v2.x;
vec_res.y = v1.y + v2.y;
vec_res.z = v1.z + v2.z;
vec_res.w = v1.w + v2.w;
This corresponds to four x86 FADD instructions in the object code. On the other hand, as the following pseudo-code shows, a single 128-bit 'packed-add' instruction can replace the four scalar addition instructions.
movaps xmm0, [v1]       ;xmm0 = v1.w | v1.z | v1.y | v1.x
addps xmm0, [v2]        ;xmm0 = v1.w+v2.w | v1.z+v2.z | v1.y+v2.y | v1.x+v2.x
movaps [vec_res], xmm0  ;vec_res = xmm0
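The same packed addition can be written in C with compiler intrinsics instead of assembly. This is a sketch under stated assumptions: the `vec4` type and the `vec4_add` function are hypothetical, the anonymous struct and `_Alignas` require C11, and the 16-byte alignment is what allows the aligned load and store intrinsics (the counterparts of MOVAPS) to be used.

```c
#include <xmmintrin.h>

/* Hypothetical four-component vector, 16-byte aligned so that aligned
   loads and stores are valid. */
typedef union {
    struct { float x, y, z, w; };
    _Alignas(16) float lanes[4];
} vec4;

static vec4 vec4_add(const vec4 *v1, const vec4 *v2)
{
    vec4 vec_res;
    __m128 a = _mm_load_ps(v1->lanes);       /* movaps xmm0, [v1] */
    __m128 b = _mm_load_ps(v2->lanes);
    __m128 sum = _mm_add_ps(a, b);           /* addps xmm0, [v2] */
    _mm_store_ps(vec_res.lanes, sum);        /* movaps [vec_res], xmm0 */
    return vec_res;
}
```

If the data cannot be guaranteed to be 16-byte aligned, `_mm_loadu_ps` and `_mm_storeu_ps` can be substituted for the aligned variants.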
Later versions
- SSE2, Willamette New Instructions, introduced with the Pentium 4, is a major enhancement to SSE. SSE2 adds two major features: double-precision floating point for all SSE operations, and MMX integer operations on 128-bit XMM registers. In the original SSE instruction set, conversion to and from integers placed the integer data in the 64-bit MMX registers. SSE2 enables the programmer to perform SIMD math on any data type entirely with the XMM vector-register file, without the need to use the legacy MMX or FPU registers. It offers an orthogonal set of instructions for dealing with common data types.
- SSE3, also called Prescott New Instructions, is an incremental upgrade to SSE2, adding a handful of DSP-oriented mathematics instructions and some process (thread) management instructions. It also allowed two numbers stored in the same register to be added or subtracted, which was not possible in SSE2 and earlier. This capability, known as horizontal in Intel terminology, was the major addition to the SSE3 instruction set. AMD's 3DNow! extension could perform horizontal additions too.
- SSSE3, Merom New Instructions, is an upgrade to SSE3, adding 16 new instructions which include permuting the bytes in a word, multiplying 16-bit fixed-point numbers with correct rounding, and within-word accumulate instructions. SSSE3 is often mistaken for SSE4 as this term was used during the development of the Core microarchitecture.
- SSE4, Penryn New Instructions, is another major enhancement, adding a dot product instruction, additional integer instructions, a POPCNT (population count) instruction, and more.
- XOP, FMA4 and CVT16 are extensions from AMD, announced in August 2007 (originally under the name SSE5) and revised in May 2009.
- Advanced Vector Extensions, Gesher New Instructions, is an advanced version of SSE announced by Intel featuring a widened data path from 128 bits to 256 bits and 3-operand instructions. Intel released processors in early 2011 with AVX support. AVX requires support from the operating system.
- AVX2 is an expansion of the AVX instruction set. It was first supported by Intel's Haswell and AMD's Carrizo processors, and most later x86 CPUs support it.
- AVX-512 is a set of 512-bit extensions to the 256-bit Advanced Vector Extensions SIMD instructions for the x86 instruction set architecture.
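To illustrate what "horizontal" means in the SSE3 entry above, the sketch below sums the four elements of a single register using only original SSE intrinsics (the counterparts of MOVHLPS, SHUFPS, ADDPS and ADDSS). The function name `hsum_ps` is hypothetical. SSE3's HADDPS performs pairwise additions within a register in one instruction, which shortens sequences of this kind.

```c
#include <xmmintrin.h>

/* Sum the four floats held in v = [v0, v1, v2, v3] (lowest element first). */
static float hsum_ps(__m128 v)
{
    __m128 high  = _mm_movehl_ps(v, v);      /* [v2, v3, v2, v3] */
    __m128 pairs = _mm_add_ps(v, high);      /* [v0+v2, v1+v3, ...] */
    __m128 odd   = _mm_shuffle_ps(pairs, pairs, _MM_SHUFFLE(1, 1, 1, 1));
                                             /* element 1 copied to all lanes */
    __m128 total = _mm_add_ss(pairs, odd);   /* lowest = v0+v2+v1+v3 */
    return _mm_cvtss_f32(total);             /* extract the lowest element */
}
```

Ordinary packed instructions work "vertically", combining element i of one register with element i of another, so a reduction like this needs the extra data movement steps shown.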
Software and hardware issues
- Intel and AMD offer applications to detect what extensions a CPU supports.
- The CPUID instruction reports processor identification and feature information on the x86 architecture. Intel introduced it in 1993 with the Pentium and SL-enhanced 486 processors.
Providing multiple versions of an application, one for each of the many available sets of extensions, is the simplest way around the x86 extension optimization problem. Software libraries and some applications have begun to support multiple extension types, hinting that full use of available x86 instructions may finally become common some 5 to 15 years after the instructions were initially introduced.
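A common way to support several extension sets within one program is to query CPUID once and then select an implementation at run time. The sketch below is one possible arrangement, not a prescribed method: it relies on the `<cpuid.h>` helper header shipped with GCC and Clang (other compilers offer different interfaces), all function names are hypothetical, and it assumes the file is compiled with SSE code generation enabled, as is the default on x86-64.

```c
#include <cpuid.h>      /* GCC/Clang-specific helper; an assumption here */
#include <xmmintrin.h>

/* Returns nonzero if CPUID leaf 1 reports SSE support (EDX bit 25). */
static int cpu_has_sse(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                         /* leaf 1 not available */
    return (int)((edx >> 25) & 1u);
}

/* Portable fallback. */
static void add4_scalar(const float *a, const float *b, float *out)
{
    for (int i = 0; i < 4; i++)
        out[i] = a[i] + b[i];
}

/* SSE version of the same operation. */
static void add4_sse(const float *a, const float *b, float *out)
{
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

typedef void (*add4_fn)(const float *, const float *, float *);

/* Pick the best available implementation once, then call through the pointer. */
static add4_fn select_add4(void)
{
    return cpu_has_sse() ? add4_sse : add4_scalar;
}
```

The same pattern extends to later extensions by testing further feature bits, for example EDX bit 26 for SSE2 or ECX bit 0 for SSE3, and adding one implementation per extension set.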