Death::Cpu namespace
Compile-time and runtime CPU instruction set detection and dispatch.
This namespace provides tags for x86, ARM and WebAssembly instruction sets, which can be used for either system introspection or for choosing a particular implementation based on the available instruction set. These tags build on top of the DEATH_
Usage
The Cpu namespace contains tags such as Cpu::
The most advanced base CPU instruction set enabled at compile time is then exposed through the Cpu::constexpr
variable, it's usable in a compile-time context.
Dispatching on available CPU instruction set at compile time
The main purpose of these tags, however, is to provide means for a compile-time overload resolution. In other words, picking the best candidate among a set of functions implemented with various instruction sets. As an example, let's say you have three different implementations of a certain algorithm transforming numeric data. One is using AVX2 instructions, another is a slower variant using just SSE 4.2 and as a fallback there's one with just regular scalar code. To distinguish them, the functions have the same name, but use a different tag type.
Then you can either call a particular implementation directly — for example to test it — or you can pass Cpu::
- If the user code was compiled with AVX2 or higher enabled, the Cpu::Avx2 overload will be picked.
- Otherwise, if just AVX, SSE 4.2 or anything else that includes SSE 4.2 was enabled, the Cpu::Sse42 overload will be picked.
- Otherwise (for example when compiling for generic x86-64 that has just the SSE2 feature set), the Cpu::Scalar overload will be picked. If you didn't provide this overload, compilation would fail for such a target, which is useful for example to enforce a certain CPU feature set to be enabled in order to use a certain API.
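The selection rules above can be illustrated with a minimal, self-contained sketch. It does not use the library: the `sketch` namespace, its tag types and the `transform()` function are hypothetical stand-ins, and deriving each tag from the previous one is just one common way to get "best viable overload wins" behavior, not necessarily how the Cpu tags are implemented.

```cpp
#include <cassert>
#include <string>

namespace sketch {

// Hypothetical stand-ins for the tag types. Each tag derives from the
// previous one, so a "higher" tag converts to any "lower" tag and overload
// resolution prefers the nearest base class.
struct ScalarT {};
struct Sse2T: ScalarT {};
struct Sse42T: Sse2T {};
struct AvxT: Sse42T {};
struct Avx2T: AvxT {};

// Three variants of a made-up algorithm, distinguished only by the tag type.
// They return their name so the selected overload is observable.
inline std::string transform(ScalarT) { return "scalar"; }
inline std::string transform(Sse42T) { return "sse4.2"; }
inline std::string transform(Avx2T) { return "avx2"; }

}

// The tag passed in stands for the instruction set enabled at compile time.
inline void compileTimeDispatchExample() {
    assert(sketch::transform(sketch::Avx2T{}) == "avx2");
    assert(sketch::transform(sketch::AvxT{}) == "sse4.2");  // no AVX variant, falls back
    assert(sketch::transform(sketch::Sse2T{}) == "scalar"); // only the scalar one is viable
}
```

Removing the `ScalarT` overload from the sketch makes the last call fail to compile, which mirrors the enforcement use case mentioned in the last bullet.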
Runtime detection and manual dispatch
So far that was all compile-time detection, which is useful mainly when a binary can be optimized directly for the machine it will run on. Such an approach is not practical, however, when shipping to a heterogeneous set of devices. Instead, the usual workflow is that the majority of code uses the lowest common denominator (such as SSE2 on x86), with the most demanding functions having alternative implementations, picked at runtime, that make use of more advanced instructions for better performance.
Runtime detection is exposed through Cpu::
While such an approach gives you the most control, manually managing the dispatch branches is error-prone and the argument passthrough may also add nontrivial overhead. See below for an efficient automatic runtime dispatch.
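As a rough illustration of what manual dispatch involves, here is a self-contained sketch. Everything in the `sketch` namespace is hypothetical: the feature bits stand in for a detected feature set, and the SSE 4.2 and AVX2 variants are placeholders that simply forward to the scalar loop.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {

// Hypothetical feature bits; a real detector would fill them in from CPUID.
enum Feature: std::uint32_t { Sse42 = 1u << 0, Avx2 = 1u << 1 };

enum class Variant { Scalar, Sse42, Avx2 };

// Picks the most advanced variant the detected features allow.
inline Variant pickVariant(std::uint32_t features) {
    if(features & Avx2) return Variant::Avx2;
    if(features & Sse42) return Variant::Sse42;
    return Variant::Scalar;
}

inline int sumScalar(const int* data, int count) {
    int sum = 0;
    for(int i = 0; i != count; ++i) sum += data[i];
    return sum;
}

// Placeholders: real variants would use SSE 4.2 / AVX2 intrinsics here.
inline int sumSse42(const int* data, int count) { return sumScalar(data, count); }
inline int sumAvx2(const int* data, int count) { return sumScalar(data, count); }

// Manual dispatch: one branch per variant, arguments passed through by hand
// on every call.
inline int sum(std::uint32_t features, const int* data, int count) {
    switch(pickVariant(features)) {
        case Variant::Avx2: return sumAvx2(data, count);
        case Variant::Sse42: return sumSse42(data, count);
        case Variant::Scalar: break;
    }
    return sumScalar(data, count);
}

}
```

Every new variant needs another branch in `sum()`, and every call pays for the branching and the argument forwarding, which is the maintenance and overhead cost described above.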
Usage with extra instruction sets
Besides the base instruction set, which on x86 is Sse2 through Avx512f, with each tag being a superset of the previous one, there are extra instruction sets such as Popcnt or AvxFma.
The process of defining and dispatching to function variants that include extra instruction sets gets moderately more complex, however. As shown on the diagram below, those are instruction sets that neither fit into the hierarchy nor are unambiguously included in a later instruction set. For example, some CPUs are known to have Avx and just AvxFma, some Avx and just AvxF16c and there are even CPUs with Avx2 but no AvxFma.
While there's no possibility of having a total ordering between all possible combinations for dispatching, the following approach is chosen:
- The base instruction set has the main priority. For example, if both an Avx2 and a Sse2 variant are viable candidates, the Avx2 variant gets picked, even if the Sse2 variant uses extra instruction sets that the Avx2 doesn't.
- After that, the variant with the most extra instruction sets is chosen. For example, an Avx + AvxFma variant is chosen over plain Avx.
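The two priority rules can be written down as a small selection routine. This is only a model of the ordering described above, under the assumption that a variant is viable when neither its base nor any of its extra instruction sets exceed what is available. The `Variant` struct, the `Base` ranks and the `pick()` function are hypothetical and not part of the library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch {

// Hypothetical base ranks (higher is more advanced) and extra-set bits.
enum Base { Scalar, Sse2, Avx, Avx2 };
enum Extra: unsigned { Popcnt = 1u << 0, Lzcnt = 1u << 1, AvxFma = 1u << 2 };

struct Variant { Base base; unsigned extra; };

inline int bitCount(unsigned value) {
    int count = 0;
    for(; value; value &= value - 1) ++count;
    return count;
}

// Returns the index of the preferred viable variant, or -1 if none is viable.
// Rule 1: the higher base wins. Rule 2: on a tie, more extra sets win.
inline int pick(const std::vector<Variant>& variants, Base availableBase, unsigned availableExtra) {
    int best = -1;
    for(std::size_t i = 0; i != variants.size(); ++i) {
        const Variant& v = variants[i];
        if(v.base > availableBase || (v.extra & ~availableExtra)) continue;
        if(best == -1 || v.base > variants[best].base ||
           (v.base == variants[best].base &&
            bitCount(v.extra) > bitCount(variants[best].extra)))
            best = int(i);
    }
    return best;
}

inline std::vector<Variant> exampleVariants() {
    return {{Sse2, Popcnt | Lzcnt}, {Avx2, 0u}, {Avx, 0u}, {Avx, AvxFma}};
}

}
```

With these four variants, a machine offering Avx2 and all extras gets the plain Avx2 variant (index 1) even though the Sse2 one uses more extras, while a machine with Avx and AvxFma gets the Avx + AvxFma variant (index 3).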
On the declaration side, the desired base instruction set gets ORed with as many extra instruction sets as needed, and then wrapped in a DEATH_
And a concrete overload gets picked at compile-time by passing a desired combination of CPU tags as well — or Default for the set of features enabled at compile time — this time wrapped in a DEATH_
Enabling instruction sets for particular functions
On GCC and Clang, a machine target has to be enabled in order to use a particular CPU instruction set or its intrinsics. While it's possible to do that for the whole compilation unit by passing for example -mavx2 to the compiler, it would force you to create dedicated files for every architecture variant you want to support. Instead, it's possible to equip particular functions with target attributes defined by DEATH_
In contrast, MSVC doesn't restrict intrinsics usage in any way, so you can freely call e.g. AVX2 intrinsics even if the whole file is compiled with just SSE2 enabled. The DEATH_
For developer convenience, the DEATH_#ifdef
your variants to be compiled only where it makes sense, or even guard intrinsics includes with them to avoid including potentially heavy headers you won't use anyway. In comparison, using the DEATH_-m
or /arch:
option passed to the compiler.
Finally, the DEATH_
Definitions of the lookup() function variants from above would then look like below with the target attributes added. The extra instruction sets get explicitly enabled as well; in contrast, a scalar variant would have no target-specific annotations at all.
Automatic runtime dispatch
Similarly to how the best-matching function variant can be picked at compile time, there's a possibility to do the same at runtime without maintaining a custom dispatch code for each case as was shown above. To avoid having to dispatch on every call and to remove the argument passthrough overhead, all variants need to have the same function signature, separate from the CPU tags. That's achievable by putting them into lambdas with a common signature, and returning that lambda from a wrapper function that contains the CPU tag. After that, a runtime dispatcher function that is created with the DEATH_
The macro creates an overload of the same name, but taking Features instead, and internally dispatches to one of the overloads using the same rules as in the compile-time dispatch. Which means you can now call it with e.g. runtimeFeatures(), get a function pointer back and then call it with the actual arguments.
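The pattern can be sketched by hand without the macro. In the following self-contained example all names (`sketch`, `sumImplementation()`, `SumFn`, the feature bits) are hypothetical, and the feature-taking overload is written manually to show the kind of function the dispatcher macro is described as generating. The `Result::variant` field exists only to make the chosen variant observable.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

namespace sketch {

enum Feature: std::uint32_t { Sse42 = 1u << 0, Avx2 = 1u << 1 };
struct ScalarT {};
struct Sse42T {};
struct Avx2T {};

struct Result { int sum; const char* variant; };

// Common signature shared by all variants, free of any CPU tag.
using SumFn = Result(*)(const int* data, int count);

inline int plainSum(const int* data, int count) {
    int sum = 0;
    for(int i = 0; i != count; ++i) sum += data[i];
    return sum;
}

// Each wrapper takes a tag and returns a captureless lambda, which converts
// to the common function pointer type. Real variants would differ in body.
inline SumFn sumImplementation(ScalarT) {
    return [](const int* data, int count) { return Result{plainSum(data, count), "scalar"}; };
}
inline SumFn sumImplementation(Sse42T) {
    return [](const int* data, int count) { return Result{plainSum(data, count), "sse4.2"}; };
}
inline SumFn sumImplementation(Avx2T) {
    return [](const int* data, int count) { return Result{plainSum(data, count), "avx2"}; };
}

// Hand-written dispatcher overload taking runtime feature bits. It runs once;
// the returned pointer can then be called repeatedly without branching.
inline SumFn sumImplementation(std::uint32_t features) {
    if(features & Avx2) return sumImplementation(Avx2T{});
    if(features & Sse42) return sumImplementation(Sse42T{});
    return sumImplementation(ScalarT{});
}

}
```

A caller would resolve the pointer once, for example `sketch::SumFn fn = sketch::sumImplementation(detectedBits);`, and then call `fn(data, count)` in the hot loop.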
Automatic runtime dispatch with extra instruction sets
If the variants are tagged with extra instruction sets instead of just the base instruction set like in the lookup()
case shown above, you'll use the DEATH_
If some extra instruction sets are always used together (like it is above with Popcnt and Lzcnt), you can reduce the amount of tested combinations by specifying them as a single ORed argument instead. On the call side, there's no difference compared to using just the base instruction sets. The created dispatcher function takes Features as well.
Automatic cached dispatch
Ultimately, the dispatch can be performed implicitly, exposing only the final function or a function pointer, with no additional steps needed from the user side. There are three possible scenarios with varying performance tradeoffs. Continuing from the lookupImplementation() example above:
- On Linux and Android with API 30+ it's possible to use the GNU IFUNC mechanism, where the dynamic linker performs a dispatch during the early startup. This is the fastest variant of runtime dispatch, as it results in an equivalent of a regular dynamic library function call. Assuming a dispatcher was created using either DEATH_CPU_DISPATCHER() or DEATH_CPU_DISPATCHER_BASE(), it's implemented using the DEATH_CPU_DISPATCHED_IFUNC() macro.
- On platforms where IFUNC isn't available, a function pointer can be used for runtime dispatch instead. It's one additional indirection, which may have a visible effect if the dispatched-to code is relatively tiny and is called from within a tight loop. Assuming a dispatcher was created using either DEATH_CPU_DISPATCHER() or DEATH_CPU_DISPATCHER_BASE(), it's implemented using the DEATH_CPU_DISPATCHED_POINTER() macro.
- For the least amount of overhead, the compile-time dispatch can be used, with arguments passed through by hand. Similarly to IFUNC, this will also result in a regular function, but without the indirect overhead. Furthermore, since it's a direct call to the lambda inside, compiler optimizations will fully inline its contents, removing any remaining overhead and allowing LTO and other inter-procedural optimizations that wouldn't be possible with the indirect calls. This option is best suited for scenarios where it's possible to build and optimize code for a single target platform. In this case it calls directly to the original variants, so no macro is needed and DEATH_CPU_DISPATCHER() / DEATH_CPU_DISPATCHER_BASE() is not needed either.
With all three cases, you end up with either a function or a function pointer. The macro signatures are deliberately similar to each other and to the direct function declaration to make it possible to unify them under a single wrapper macro in case a practical use case needs to handle more than one variant.
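The function pointer scenario can be approximated by hand as follows. This is a sketch under stated assumptions and not what the library macros expand to: the `sketch` names are hypothetical, the cached pointer lives in a function-local static instead of a global, the AVX2 variant is a placeholder, and detection uses the GCC/Clang `__builtin_cpu_supports()` builtin where available, falling back to the scalar variant elsewhere.

```cpp
#include <cassert>

namespace sketch {

using SumFn = int(*)(const int* data, int count);

inline int sumScalar(const int* data, int count) {
    int sum = 0;
    for(int i = 0; i != count; ++i) sum += data[i];
    return sum;
}

// Placeholder for a SIMD variant; a real one would use AVX2 intrinsics.
inline int sumAvx2Placeholder(const int* data, int count) {
    return sumScalar(data, count);
}

// Counts resolver runs so the caching is observable.
inline int& resolveCount() {
    static int count = 0;
    return count;
}

// Picks a variant based on what the CPU reports at runtime.
inline SumFn resolveSum() {
    ++resolveCount();
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    if(__builtin_cpu_supports("avx2")) return sumAvx2Placeholder;
#endif
    return sumScalar;
}

// Public entry point: the resolver runs on the first call only, every later
// call is a single indirect call through the cached pointer.
inline int sum(const int* data, int count) {
    static const SumFn cached = resolveSum();
    return cached(data, count);
}

}
```

The user-facing `sum()` has the same signature as a plain function, which is the property that makes the three scenarios interchangeable behind one declaration.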
Classes
- struct Avx2T
- AVX2 tag type.
- struct Avx512fT
- AVX-512 Foundation tag type.
- struct AvxF16cT
- AVX F16C tag type.
- struct AvxFmaT
- AVX FMA tag type.
- struct AvxT
- AVX tag type.
- struct Bmi1T
- BMI1 tag type.
- struct Bmi2T
- BMI2 tag type. Available only on x86. See the Cpu namespace and the Bmi2 tag for more information.
- class Features
- Feature set.
- struct LzcntT
- LZCNT tag type.
- struct NeonFmaT
- NEON FMA tag type.
- struct NeonFp16T
- NEON FP16 tag type.
- struct NeonT
- NEON tag type.
- struct PopcntT
- POPCNT tag type.
- struct ScalarT
- Scalar tag type.
- struct Simd128T
- SIMD128 tag type.
- struct Sse2T
- SSE2 tag type.
- struct Sse3T
- SSE3 tag type.
- struct Sse41T
- SSE4.1 tag type.
- struct Sse42T
- SSE4.2 tag type.
- struct Ssse3T
- SSSE3 tag type.
-
template<class T>struct TypeTraits
- Traits class for CPU detection tag types.
Typedefs
- using DefaultBaseT = ScalarT
- Default base tag type.
- using DefaultExtraT = Implementation::Tags<0>
- Default extra tag type.
- using DefaultT = Implementation::Tags<static_cast<unsigned int>(TypeTraits<DefaultBaseT>::Index)|DefaultExtraT::Value>
- Default tag type.
Functions
-
template<class T>auto tag() -> T constexpr
- Tag for a tag type.
-
template<class T>auto features() -> Features constexpr
- Feature set for a tag type.
-
template<class T, class U>auto operator|(T, U) -> Implementation::Tags<static_cast<unsigned int>(TypeTraits<T>::Index)|TypeTraits<U>::Index> constexpr
-
template<class T, unsigned int value>auto operator|(T, Implementation::Tags<value>) -> Implementation::Tags<TypeTraits<T>::Index|value> constexpr
-
template<class T, class U>auto operator&(T, U) -> Implementation::Tags<static_cast<unsigned int>(TypeTraits<T>::Index)&TypeTraits<U>::Index> constexpr
-
template<class T, unsigned int value>auto operator&(T, Implementation::Tags<value>) -> Implementation::Tags<TypeTraits<T>::Index&value> constexpr
-
template<class T, class U>auto operator^(T, U) -> Implementation::Tags<static_cast<unsigned int>(TypeTraits<T>::Index) ^ TypeTraits<U>::Index> constexpr
-
template<class T, unsigned int value>auto operator^(T, Implementation::Tags<value>) -> Implementation::Tags<TypeTraits<T>::Index^ value> constexpr
- auto compiledFeatures() -> Features constexpr
- CPU instruction sets enabled at compile time.
- auto runtimeFeatures() -> Features constexpr
- Detect available CPU instruction sets at runtime.
-
template<class T, class = decltype(TypeTraits<T>::Index)>auto operator==(T a, Features b) -> bool constexpr
- Equality comparison of a tag and a feature set.
-
template<class T, class U, class = decltype(TypeTraits<T>::Index), class = decltype(TypeTraits<U>::Index)>auto operator==(T, U) -> bool constexpr
- Equality comparison of two tags. Same as Features::operator==(). Needs to be present to avoid ambiguity in C++20.
-
template<class T, class = decltype(TypeTraits<T>::Index)>auto operator!=(T a, Features b) -> bool constexpr
- Non-equality comparison of a tag and a feature set.
-
template<class T, class U, class = decltype(TypeTraits<T>::Index), class = decltype(TypeTraits<U>::Index)>auto operator!=(T, U) -> bool constexpr
- Non-equality comparison of two tags. Same as Features::operator!=(). Needs to be present to avoid ambiguity in C++20.
-
template<class T, class = decltype(TypeTraits<T>::Index)>auto operator>=(T a, Features b) -> bool constexpr
- Whether a is a superset of b (a ⊇ b)
-
template<class T, class = decltype(TypeTraits<T>::Index)>auto operator<=(T a, Features b) -> bool constexpr
- Whether a is a subset of b (a ⊆ b)
-
template<class T, class = decltype(TypeTraits<T>::Index)>auto operator|(T a, Features b) -> Features constexpr
- Union of two feature sets.
-
template<class T, class = decltype(TypeTraits<T>::Index)>auto operator&(T a, Features b) -> Features constexpr
- Intersection of two feature sets.
-
template<class T, class = decltype(TypeTraits<T>::Index)>auto operator^(T a, Features b) -> Features constexpr
- XOR of two feature sets.
-
template<class T>auto operator~(T) -> Implementation::Tags<~TypeTraits<T>::Index> constexpr
- Feature set complement.
Variables
- ScalarT Scalar constexpr
- Scalar tag.
- Sse2T Sse2 constexpr
- SSE2 tag.
- Sse3T Sse3 constexpr
- SSE3 tag.
- Ssse3T Ssse3 constexpr
- SSSE3 tag.
- Sse41T Sse41 constexpr
- SSE4.1 tag.
- Sse42T Sse42 constexpr
- SSE4.2 tag.
- PopcntT Popcnt constexpr
- POPCNT tag.
- LzcntT Lzcnt constexpr
- LZCNT tag.
- Bmi1T Bmi1 constexpr
- BMI1 tag.
- Bmi2T Bmi2 constexpr
- BMI2 tag. BMI2 instructions. Available only on x86. This instruction set is treated as an extra, i.e. is neither a superset of nor implied by any other instruction set. See Usage with extra instruction sets for more information.
- AvxT Avx constexpr
- AVX tag.
- AvxF16cT AvxF16c constexpr
- AVX F16C tag.
- AvxFmaT AvxFma constexpr
- AVX FMA tag.
- Avx2T Avx2 constexpr
- AVX2 tag.
- Avx512fT Avx512f constexpr
- AVX-512 Foundation tag.
- NeonT Neon constexpr
- NEON tag.
- NeonFmaT NeonFma constexpr
- NEON FMA tag.
- NeonFp16T NeonFp16 constexpr
- NEON FP16 tag.
- Simd128T Simd128 constexpr
- SIMD128 tag.
- DefaultBaseT DefaultBase constexpr
- Default base tag.
- DefaultExtraT DefaultExtra constexpr
- Default extra tags.
- DefaultT Default constexpr
- Default tags.
Typedef documentation
typedef ScalarT Death:: Cpu:: DefaultBaseT
#include <Cpu.h>
Default base tag type.
See the DefaultBase tag for more information.
typedef Implementation::Tags<0> Death:: Cpu:: DefaultExtraT
#include <Cpu.h>
Default extra tag type.
See the DefaultExtra tag for more information.
typedef Implementation::Tags<static_cast<unsigned int>(TypeTraits<DefaultBaseT>::Index)|DefaultExtraT::Value> Death:: Cpu:: DefaultT
#include <Cpu.h>
Default tag type.
See the Default tag for more information.
Function documentation
#include <Cpu.h>
template<class T>
Features Death:: Cpu:: features() constexpr
Feature set for a tag type.
Returns Features with a tag corresponding to tag type T, avoiding a need to form the tag value in order to pass it to Features::
#include <Cpu.h>
template<class T, class U>
Implementation::Tags<static_cast<unsigned int>(TypeTraits<T>::Index)|TypeTraits<U>::Index> Death:: Cpu:: operator|(T,
U) constexpr
#include <Cpu.h>
template<class T, unsigned int value>
Implementation::Tags<TypeTraits<T>::Index|value> Death:: Cpu:: operator|(T,
Implementation::Tags<value>) constexpr
#include <Cpu.h>
template<class T, class U>
Implementation::Tags<static_cast<unsigned int>(TypeTraits<T>::Index)&TypeTraits<U>::Index> Death:: Cpu:: operator&(T,
U) constexpr
#include <Cpu.h>
template<class T, unsigned int value>
Implementation::Tags<TypeTraits<T>::Index&value> Death:: Cpu:: operator&(T,
Implementation::Tags<value>) constexpr
#include <Cpu.h>
template<class T, class U>
Implementation::Tags<static_cast<unsigned int>(TypeTraits<T>::Index) ^ TypeTraits<U>::Index> Death:: Cpu:: operator^(T,
U) constexpr
#include <Cpu.h>
template<class T, unsigned int value>
Implementation::Tags<TypeTraits<T>::Index^ value> Death:: Cpu:: operator^(T,
Implementation::Tags<value>) constexpr
Features Death:: Cpu:: compiledFeatures() constexpr
#include <Cpu.h>
CPU instruction sets enabled at compile time.
On x86 returns a combination of Sse2, Sse3, Ssse3, Sse41, Sse42, Popcnt, Lzcnt, Bmi1, Bmi2, Avx, AvxF16c, AvxFma, Avx2 and Avx512f based on what all DEATH_
On ARM, returns a combination of Neon, NeonFma and NeonFp16 based on what all DEATH_
On WebAssembly, returns Simd128 based on whether the DEATH_
On other platforms or if no known CPU instruction set is enabled, the returned value is equal to Scalar, which in turn is equivalent to empty (or default-constructed) Features.
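The relation between Scalar and an empty feature set, as well as the superset and subset comparisons listed among the functions above, can be modeled with a plain bitmask. The class below is a hypothetical stand-in written for illustration; the bit assignments and the implementation are assumptions and do not reflect the actual Features class.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {

// Minimal bitmask model of a feature set.
class Features {
    public:
        constexpr Features(): _bits{0} {}
        constexpr explicit Features(std::uint32_t bits): _bits{bits} {}

        constexpr bool operator==(Features other) const { return _bits == other._bits; }
        // Superset: every bit of other is also set here.
        constexpr bool operator>=(Features other) const { return (_bits & other._bits) == other._bits; }
        // Subset: every bit set here is also set in other.
        constexpr bool operator<=(Features other) const { return (_bits & other._bits) == _bits; }
        constexpr Features operator|(Features other) const { return Features{_bits | other._bits}; }
        constexpr Features operator&(Features other) const { return Features{_bits & other._bits}; }

    private:
        std::uint32_t _bits;
};

// Scalar carries no bits, so it equals a default-constructed set and is a
// subset of everything else.
constexpr Features Scalar{};
constexpr Features Sse2{1u << 0};
constexpr Features Sse3{1u << 1};

static_assert(Scalar == Features{}, "Scalar is the empty set");
static_assert((Sse2 | Sse3) >= Sse2, "a union is a superset of its parts");
static_assert(Scalar <= Sse2, "the empty set is a subset of any set");

}
```

In this model, checking whether a detected feature set allows a given variant boils down to `detected >= required`.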
Features Death:: Cpu:: runtimeFeatures() constexpr
#include <Cpu.h>
Detect available CPU instruction sets at runtime.
On x86 and GCC, Clang or MSVC uses the CPUID builtin to check for the Sse2, Sse3, Ssse3, Sse41, Sse42, Popcnt, Lzcnt, Bmi1, Bmi2, Avx, AvxF16c, AvxFma, Avx2 and Avx512f runtime features. Avx needs OS support as well; if it's not present, no following flags, including Bmi1 and Bmi2, are checked either. On compilers other than GCC, Clang and MSVC the function is constexpr and delegates into compiledFeatures().
On ARM and Linux or Android API level 18+ uses getauxval(), or on ARM macOS and iOS uses sysctlbyname() to check for Neon, NeonFma and NeonFp16. Neon and NeonFma are implicitly supported on ARM64. On other platforms the function is constexpr and delegates into compiledFeatures().
On WebAssembly an attempt to use SIMD instructions without runtime support results in a WebAssembly compilation error and thus runtime detection is largely meaningless. While this may change once the feature detection proposal is implemented, at the moment the function is constexpr and delegates into compiledFeatures().
On other platforms or if no known CPU instruction set is detected, the returned value is equal to Scalar, which in turn is equivalent to empty (or default-constructed) Features.
Variable documentation
ScalarT Death:: Cpu:: Scalar constexpr
#include <Cpu.h>
Scalar tag.
Code that isn't explicitly optimized with any advanced CPU instruction set. Fallback if no other CPU instruction set is chosen or available. The next most widely supported instruction sets are Sse2 on x86, Neon on ARM and Simd128 on WebAssembly.
Ssse3T Death:: Cpu:: Ssse3 constexpr
#include <Cpu.h>
SSSE3 tag.
Supplemental Streaming SIMD Extensions 3. Available only on x86. Superset of Sse3, implied by Sse41.
Note that certain older AMD processors have SSE4a but neither SSSE3 nor SSE4.1. Both can, however, be treated as a subset of SSE4.1 to a large extent, and it's recommended to use Sse41 to handle those.
Sse41T Death:: Cpu:: Sse41 constexpr
#include <Cpu.h>
SSE4.1 tag.
Streaming SIMD Extensions 4.1. Available only on x86. Superset of Ssse3, implied by Sse42.
Note that certain older AMD processors have SSE4a but neither SSSE3 nor SSE4.1. Both can, however, be treated as a subset of SSE4.1 to a large extent, and it's recommended to use Sse41 to handle those.
LzcntT Death:: Cpu:: Lzcnt constexpr
#include <Cpu.h>
LZCNT tag.
LZCNT instructions. Available only on x86. This instruction set is treated as an extra, i.e. is neither a superset of nor implied by any other instruction set. See Usage with extra instruction sets for more information.
Note that this instruction has encoding compatible with an earlier BSR instruction which has a slightly different behavior. To avoid wrong results if it isn't available, prefer to always detect its presence with runtimeFeatures() instead of a compile-time check.
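To make the difference concrete: BSR yields the index of the highest set bit, while LZCNT yields the number of leading zero bits, so on a CPU that decodes the LZCNT encoding as BSR the result is silently wrong rather than a crash. The portable emulations below are illustrative helpers with hypothetical names, not library functions, and cover 32-bit operands only.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {

// What BSR computes for a nonzero input: the index of the highest set bit.
// For a zero input the real instruction leaves the destination undefined,
// so callers of this emulation should pass nonzero values only.
inline int bsrEmulated(std::uint32_t value) {
    int index = 0;
    while(value >>= 1) ++index;
    return index;
}

// What LZCNT computes: the number of leading zero bits, with a well-defined
// result of 32 for a zero input.
inline int lzcntEmulated(std::uint32_t value) {
    int count = 0;
    for(std::uint32_t mask = 0x80000000u; mask && !(value & mask); mask >>= 1)
        ++count;
    return count;
}

}
```

For any nonzero 32-bit input the two are related by `lzcnt(x) == 31 - bsr(x)`, so for example an input of 1 gives 31 with LZCNT but 0 with BSR.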
Bmi1T Death:: Cpu:: Bmi1 constexpr
#include <Cpu.h>
BMI1 tag.
BMI1 instructions, including TZCNT. Available only on x86. This instruction set is treated as an extra, i.e. is neither a superset of nor implied by any other instruction set. See Usage with extra instruction sets for more information.
Note that the TZCNT instruction has encoding compatible with an earlier BSF instruction which has a slightly different behavior. To avoid wrong results if it isn't available, prefer to always detect its presence with runtimeFeatures() instead of a compile-time check.
AvxFmaT Death:: Cpu:: AvxFma constexpr
#include <Cpu.h>
AVX FMA tag.
FMA3 instruction set. Available only on x86. This instruction set is treated as an extra, i.e. is neither a superset of nor implied by any other instruction set. See Usage with extra instruction sets for more information.
Simd128T Death:: Cpu:: Simd128 constexpr
#include <Cpu.h>
SIMD128 tag.
128-bit WebAssembly SIMD. Available only on WebAssembly. Superset of Scalar.
DefaultBaseT Death:: Cpu:: DefaultBase constexpr
#include <Cpu.h>
Default base tag.
Highest base instruction set available on given architecture with current compiler flags. Ordered by priority, on DEATH_
- Avx512f if DEATH_TARGET_AVX512F is defined
- Avx2 if DEATH_TARGET_AVX2 is defined
- Avx if DEATH_TARGET_AVX is defined
- Sse42 if DEATH_TARGET_SSE42 is defined
- Sse41 if DEATH_TARGET_SSE41 is defined
- Ssse3 if DEATH_TARGET_SSSE3 is defined
- Sse3 if DEATH_TARGET_SSE3 is defined
- Sse2 if DEATH_TARGET_SSE2 is defined
- Scalar otherwise
On DEATH_
- NeonFp16 if DEATH_TARGET_NEON_FP16 is defined
- NeonFma if DEATH_TARGET_NEON_FMA is defined
- Neon if DEATH_TARGET_NEON is defined
- Scalar otherwise
On DEATH_
- Simd128 if DEATH_TARGET_SIMD128 is defined
- Scalar otherwise
In addition to the above, DefaultExtra contains a combination of extra instruction sets available together with the base instruction set, and Default is a combination of both. See also compiledFeatures() which returns a combination of base tags instead of just the highest available, together with the extra instruction sets, and runtimeFeatures() which is capable of detecting the available CPU feature set at runtime.
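The priority ordering for the base instruction set amounts to a preprocessor chain that stops at the first match. The sketch below reproduces the idea using the compilers' own predefined macros (`__AVX2__`, `__SSE4_2__` and so on) as stand-ins; how exactly the library derives its DEATH_TARGET_* macros from them is an assumption here, the NEON FMA and FP16 cases are left out for brevity, and `defaultBaseName()` is a hypothetical helper.

```cpp
#include <cassert>
#include <string>

namespace sketch {

// Returns the name of the highest base instruction set enabled by the
// current compiler flags, checked from the most to the least advanced.
inline const char* defaultBaseName() {
#if defined(__AVX512F__)
    return "Avx512f";
#elif defined(__AVX2__)
    return "Avx2";
#elif defined(__AVX__)
    return "Avx";
#elif defined(__SSE4_2__)
    return "Sse42";
#elif defined(__SSE4_1__)
    return "Sse41";
#elif defined(__SSSE3__)
    return "Ssse3";
#elif defined(__SSE3__)
    return "Sse3";
#elif defined(__SSE2__) || defined(_M_X64)
    // MSVC doesn't define __SSE2__, but SSE2 is part of the x86-64 baseline
    return "Sse2";
#elif defined(__ARM_NEON)
    return "Neon";
#elif defined(__wasm_simd128__)
    return "Simd128";
#else
    return "Scalar";
#endif
}

}
```

Compiling the same file with, for example, `-mavx2` versus no extra flags on x86-64 changes the returned name from Avx2 to Sse2, which is the compile-time counterpart of what runtime detection reports for the machine the binary runs on.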
DefaultExtraT Death:: Cpu:: DefaultExtra constexpr
#include <Cpu.h>
Default extra tags.
Instruction sets available in addition to DefaultBase on given architecture with current compiler flags. On DEATH_
- Popcnt if DEATH_TARGET_POPCNT is defined
- Lzcnt if DEATH_TARGET_LZCNT is defined
- Bmi1 if DEATH_TARGET_BMI1 is defined
- Bmi2 if DEATH_TARGET_BMI2 is defined
- AvxFma if DEATH_TARGET_AVX_FMA is defined
- AvxF16c if DEATH_TARGET_AVX_F16C is defined
No extra instruction sets are currently defined for DEATH_
In addition to the above, Default is a combination of both DefaultBase and the extra instruction sets. See also compiledFeatures() which returns these together with a combination of all base instruction sets available, and runtimeFeatures() which is capable of detecting the available CPU feature set at runtime.
DefaultT Death:: Cpu:: Default constexpr
#include <Cpu.h>
Default tags.
A combination of DefaultBase and DefaultExtra, see their documentation for more information.