Death::Cpu namespace

Compile-time and runtime CPU instruction set detection and dispatch.

This namespace provides tags for x86, ARM and WebAssembly instruction sets, which can be used for either system introspection or for choosing a particular implementation based on the available instruction set. These tags build on top of the DEATH_TARGET_SSE2, DEATH_TARGET_SSE3 etc. preprocessor macros and provide a runtime feature detection as well.

Usage

The Cpu namespace contains tags such as Cpu::Avx2, Cpu::Sse2, Cpu::Neon or Cpu::Simd128. These tags behave similarly to enum values and combining them results in Cpu::Features.

The most advanced base CPU instruction set enabled at compile time is then exposed through the Cpu::DefaultBase variable, which is an alias to one of those tags, and it matches the architecture-specific DEATH_TARGET_SSE2 etc. macros. Since it's a constexpr variable, it's usable in a compile-time context:

if constexpr (Cpu::DefaultBase >= Cpu::Avx2) {
    // AVX2 code
} else {
    // Scalar code
}

Dispatching on available CPU instruction set at compile time

The main purpose of these tags, however, is to provide means for a compile-time overload resolution. In other words, picking the best candidate among a set of functions implemented with various instruction sets. As an example, let's say you have three different implementations of a certain algorithm transforming numeric data. One is using AVX2 instructions, another is a slower variant using just SSE 4.2 and as a fallback there's one with just regular scalar code. To distinguish them, the functions have the same name, but use a different tag type:

void transform(Cpu::ScalarT, Containers::ArrayView<float> data);
void transform(Cpu::Sse42T, Containers::ArrayView<float> data);
void transform(Cpu::Avx2T, Containers::ArrayView<float> data);

Then you can either call a particular implementation directly — for example to test it — or you can pass Cpu::DefaultBase, and it'll pick the best overload candidate for the set of CPU instruction features enabled at compile time:

transform(Cpu::DefaultBase, data);

If the user code was compiled with AVX2 or higher enabled, the Cpu::Avx2 overload will be picked.
Otherwise, if just AVX, SSE 4.2 or anything else that includes SSE 4.2 was enabled, the Cpu::Sse42 overload will be picked.
Otherwise (for example when compiling for generic x86-64 that has just the SSE2 feature set), the Cpu::Scalar overload will be picked. If you wouldn't provide this overload, the compilation would fail for such a target — which is useful for example to enforce a certain CPU feature set to be enabled in order to use a certain API.
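
One way such a pick can work is plain overload resolution over an inheritance chain of tag types. The following is a minimal self-contained sketch under that assumption. The tag structs are hypothetical stand-ins rather than the actual Cpu:: definitions, and the integer return values only identify which overload got chosen:

```cpp
// Hypothetical stand-in tags: each one derives from the previous, weaker one
struct ScalarT {};
struct Sse2T: ScalarT {};
struct Sse42T: Sse2T {};
struct AvxT: Sse42T {};
struct Avx2T: AvxT {};

// Three variants, distinguished only by the tag type they accept
constexpr int transform(ScalarT) { return 0; }  // scalar fallback
constexpr int transform(Sse42T) { return 42; }  // SSE 4.2 variant
constexpr int transform(Avx2T) { return 2; }    // AVX2 variant

// A derived-to-base conversion to a nearer base ranks better, so a tag
// without its own overload falls back to the closest weaker variant
static_assert(transform(Avx2T{}) == 2, "exact match");
static_assert(transform(AvxT{}) == 42, "AVX falls back to SSE 4.2");
static_assert(transform(Sse2T{}) == 0, "SSE2 falls back to scalar");
```

Removing the ScalarT overload from this sketch makes the last assertion fail to compile, which mirrors the enforcement use case described above.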

Runtime detection and manual dispatch

So far that was all compile-time detection, which is useful mainly when a binary can be optimized directly for the machine it will run on. But such an approach is not practical when shipping to a heterogeneous set of devices. Instead, the usual workflow is that the majority of code uses the lowest common denominator (such as SSE2 on x86), with the most demanding functions having alternative implementations, picked at runtime, that make use of more advanced instructions for better performance.

Runtime detection is exposed through Cpu::runtimeFeatures(). It will detect CPU features on platforms that support it, and fall back to Cpu::compiledFeatures() on platforms that don't. You can then match the returned Cpu::Features against particular tags to decide which variant to use:

Cpu::Features features = Cpu::runtimeFeatures();
if(features & Cpu::Avx2)
    transform(Cpu::Avx2, data);
else if(features & Cpu::Sse42)
    transform(Cpu::Sse42, data);
else
    transform(Cpu::Scalar, data);

While such an approach gives you the most control, manually managing the dispatch branches is error-prone and the argument passthrough may also add nontrivial overhead. See below for an efficient automatic runtime dispatch.
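
To illustrate the detect-then-branch flow independently of this library, here is a minimal sketch that uses the GCC/Clang `__builtin_cpu_supports()` builtin instead of Cpu::runtimeFeatures(). The feature bits, the `Variant` enum and both function names are hypothetical, and on other compilers or architectures the sketch simply reports no features:

```cpp
#include <cstdint>

// Hypothetical feature bits for this sketch only
enum: std::uint32_t { Sse42Bit = 1u << 0, Avx2Bit = 1u << 1 };

enum class Variant { Scalar, Sse42, Avx2 };

// Returns the detected bits, or 0 where no detection is available here
std::uint32_t detectFeatures() {
    std::uint32_t features = 0;
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if(__builtin_cpu_supports("sse4.2")) features |= Sse42Bit;
    if(__builtin_cpu_supports("avx2")) features |= Avx2Bit;
#endif
    return features;
}

// Maps detected bits to a variant, most capable first
constexpr Variant pickVariant(std::uint32_t features) {
    return (features & Avx2Bit) ? Variant::Avx2
         : (features & Sse42Bit) ? Variant::Sse42
         : Variant::Scalar;
}
```

Calling `pickVariant(detectFeatures())` once and branching on the result corresponds to the manual dispatch shown above.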

Usage with extra instruction sets

Besides the base instruction set, which on x86 is Sse2 through Avx512f, with each tag being a superset of the previous one, there are extra instruction sets such as Popcnt or AvxFma. Basic compile-time detection for these is still straightforward, only now using Default instead of DefaultBase:

if constexpr (Cpu::Default >= (Cpu::Avx2 | Cpu::AvxFma)) {
    // AVX2+FMA code
} else {
    // Scalar code
}
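
Note that `>=` here is a superset test on feature sets rather than a numeric ordering. A minimal sketch of that semantics, using a hypothetical flat bitmask type (the real Features class additionally knows that a base tag implies the weaker base tags, which this sketch leaves out):

```cpp
#include <cstdint>

// Minimal stand-in for a feature set, not the library's Features class
struct Features {
    std::uint32_t bits;
};

// Hypothetical feature constants; the real library assigns its own values
constexpr Features Avx2{1u << 0}, AvxFma{1u << 1}, Popcnt{1u << 2};

// Union of two feature sets
constexpr Features operator|(Features a, Features b) {
    return {a.bits | b.bits};
}
// a >= b: a contains every feature present in b
constexpr bool operator>=(Features a, Features b) {
    return (a.bits & b.bits) == b.bits;
}
// a <= b: every feature of a is present in b
constexpr bool operator<=(Features a, Features b) {
    return b >= a;
}

static_assert((Avx2 | AvxFma | Popcnt) >= (Avx2 | AvxFma), "superset");
static_assert(!(Avx2 >= (Avx2 | AvxFma)), "FMA missing, so not a superset");
```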

The process of defining and dispatching to function variants that include extra instruction sets gets moderately more complex, however. As shown on the diagram below, those are instruction sets that neither fit into the hierarchy nor are unambiguously included in a later instruction set. For example, some CPUs are known to have Avx and just AvxFma, some Avx and just AvxF16c and there are even CPUs with Avx2 but no AvxFma.

While there's no possibility of having a total ordering between all possible combinations for dispatching, the following approach is chosen:

The base instruction set has the main priority. For example, if both an Avx2 and a Sse2 variant are viable candidates, the Avx2 variant gets picked, even if the Sse2 variant uses extra instruction sets that the Avx2 doesn't.
After that, the variant with the most extra instruction sets is chosen. For example, an Avx + AvxFma variant is chosen over plain Avx.
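
The two rules above can be sketched as a small selection function. This is only an illustration of the ordering, not how the library implements it; the `Variant` encoding, the level numbers and all names are hypothetical:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical encoding of a variant: a position in the base hierarchy
// plus a bitmask of extra instruction sets
struct Variant {
    unsigned base;
    std::uint32_t extras;
};

constexpr int countBits(std::uint32_t value) {
    int count = 0;
    for(; value; value >>= 1) count += int(value & 1u);
    return count;
}

// A variant is usable if the CPU has its base level and all its extras
constexpr bool isViable(Variant v, Variant cpu) {
    return v.base <= cpu.base && (v.extras & ~cpu.extras) == 0;
}

// Base instruction set decides first, number of extras second
constexpr bool isBetter(Variant a, Variant b) {
    return a.base != b.base ? a.base > b.base
                            : countBits(a.extras) > countBits(b.extras);
}

// Returns the index of the preferred viable variant, or -1 if none fits
template<std::size_t size>
constexpr int pickBest(const Variant(&variants)[size], Variant cpu) {
    int best = -1;
    for(std::size_t i = 0; i != size; ++i) {
        if(!isViable(variants[i], cpu)) continue;
        if(best < 0 || isBetter(variants[i], variants[best])) best = int(i);
    }
    return best;
}

// Hypothetical base levels and extra bits
constexpr unsigned Sse2 = 1, Avx = 2, Avx2 = 3;
constexpr std::uint32_t Popcnt = 1u << 0, Lzcnt = 1u << 1, AvxFma = 1u << 2;

constexpr Variant Candidates[] = {
    {Sse2, Popcnt | Lzcnt},  // 0
    {Avx, 0},                // 1
    {Avx, AvxFma},           // 2
    {Avx2, 0}                // 3
};
```

With these candidates, a CPU that has Avx2 and all three extras gets index 3, because the plain Avx2 variant outranks both the Sse2 variant with two extras and the Avx variant with FMA. Two viable variants with the same base and the same number of extras would tie in this sketch and the first one would win, whereas in the library such a tie is an ambiguity.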

On the declaration side, the desired base instruction set gets ORed with as many extra instruction sets as needed, and then wrapped in a DEATH_CPU_DECLARE() macro. For example, a lookup algorithm may have a Sse41 implementation which however also relies on Popcnt and Lzcnt, and a fallback Sse2 implementation that uses neither:

int lookup(DEATH_CPU_DECLARE(Cpu::Sse2), …);
int lookup(DEATH_CPU_DECLARE(Cpu::Sse41 | Cpu::Popcnt | Cpu::Lzcnt), …);

And a concrete overload gets picked at compile-time by passing a desired combination of CPU tags as well — or Default for the set of features enabled at compile time — this time wrapped in a DEATH_CPU_SELECT():

int found = lookup(DEATH_CPU_SELECT(Cpu::Default), …);

Resolving overload ambiguity

Because the best overload is picked based on the count of extra instruction sets used, it may happen that two different variants get assigned the same priority, causing an ambiguity. For example, the two variants below would be ambiguous for a CPU with Sse41 and both Popcnt and Lzcnt present:

int lookup(DEATH_CPU_DECLARE(Cpu::Sse41 | Cpu::Popcnt), …);
int lookup(DEATH_CPU_DECLARE(Cpu::Sse41 | Cpu::Lzcnt), …);

It's not desirable for this library to arbitrarily decide which instruction set should be preferred — only the implementation itself can know that. Thus, to resolve such potential conflict, provide an overload with both extra tags and delegate from there:

int lookup(DEATH_CPU_DECLARE(Cpu::Sse41 | Cpu::Popcnt | Cpu::Lzcnt), …) {
    // Or the other variant, or a custom third implementation ...
    return lookup(DEATH_CPU_SELECT(Cpu::Sse41 | Cpu::Lzcnt), …);
}

Enabling instruction sets for particular functions

On GCC and Clang, a machine target has to be enabled in order to use a particular CPU instruction set or its intrinsics. While it's possible to do that for the whole compilation unit by passing for example -mavx2 to the compiler, it would force you to create dedicated files for every architecture variant you want to support. Instead, it's possible to equip particular functions with target attributes defined by DEATH_ENABLE_SSE2 and related macros, which then enable a particular instruction set for the given function.

In contrast, MSVC doesn't restrict intrinsics usage in any way, so you can freely call e.g. AVX2 intrinsics even if the whole file is compiled with just SSE2 enabled. The DEATH_ENABLE_SSE2 and related macros are thus defined to be empty on this compiler.

For developer convenience, the DEATH_ENABLE_SSE2 etc. macros are defined only on matching architectures, and generally only if the compiler itself has the given feature set implemented and usable. This means you can easily use them to #ifdef your variants so they're compiled only where it makes sense, or even guard intrinsics includes with them to avoid including potentially heavy headers you won't use anyway. In comparison, using the DEATH_TARGET_SSE2 etc. macros would only make the variant available if the whole compilation unit has a corresponding -m or /arch: option passed to the compiler.

Finally, the DEATH_ENABLE() function macro allows multiple instruction sets to be enabled at the same time in a more concise way and consistently on both GCC and Clang.

Definitions of the lookup() function variants from above would then look like below with the target attributes added. The extra instruction sets get explicitly enabled as well. In contrast, a scalar variant would have no target-specific annotations at all:

int lookup(DEATH_CPU_DECLARE(Cpu::Scalar), …) {
    …
}
#if defined(DEATH_ENABLE_SSE2)
DEATH_ENABLE_SSE2 int lookup(DEATH_CPU_DECLARE(Cpu::Sse2), …) {
    …
}
#endif
#if defined(DEATH_ENABLE_SSE41) && \
    defined(DEATH_ENABLE_POPCNT) && \
    defined(DEATH_ENABLE_LZCNT)
DEATH_ENABLE(SSE41,POPCNT,LZCNT) int lookup(
    DEATH_CPU_DECLARE(Cpu::Sse41|Cpu::Popcnt|Cpu::Lzcnt), …) {
    …
}
#endif
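
For readers unfamiliar with per-function targets, the following self-contained sketch shows the underlying compiler mechanism without this library. `SKETCH_ENABLE_SSE2` is a hypothetical macro that only assumes what such an enabling macro could expand to: the GCC/Clang `target` attribute on x86, nothing on MSVC, and left undefined elsewhere so the variant can be guarded with #ifdef:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical macro, modeled on what an SSE2-enabling macro could expand to
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    #include <emmintrin.h>
    #define SKETCH_ENABLE_SSE2 __attribute__((__target__("sse2")))
#elif defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    #include <emmintrin.h>
    #define SKETCH_ENABLE_SSE2 /* MSVC needs no per-function target */
#endif

// Scalar variant: no target-specific annotation at all
float sumScalar(const float* data, std::size_t count) {
    assert(data || !count);
    float sum = 0.0f;
    for(std::size_t i = 0; i != count; ++i) sum += data[i];
    return sum;
}

// SSE2 variant: compiled only where the macro is defined
#ifdef SKETCH_ENABLE_SSE2
SKETCH_ENABLE_SSE2 float sumSse2(const float* data, std::size_t count) {
    assert(data || !count);
    __m128 acc = _mm_setzero_ps();
    std::size_t i = 0;
    // Four floats per iteration
    for(; i + 4 <= count; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(data + i));
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    // Remaining elements
    for(; i != count; ++i) sum += data[i];
    return sum;
}
#endif
```

The vectorized variant adds the elements in a different order than the scalar one, so for general floating-point input the two results may differ in the last bits.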

Automatic runtime dispatch

Similarly to how the best-matching function variant can be picked at compile time, it's possible to do the same at runtime without maintaining custom dispatch code for each case as was shown above. To avoid having to dispatch on every call and to remove the argument passthrough overhead, all variants need to have the same function signature, separate from the CPU tags. That's achievable by putting them into lambdas with a common signature, and returning that lambda from a wrapper function that contains the CPU tag. After that, a runtime dispatcher function is created with the DEATH_CPU_DISPATCHER_BASE() macro. The transform() variants from above would then look like this instead:

using TransformT = void(*)(Containers::ArrayView<float>);

TransformT transformImplementation(Cpu::ScalarT) {
    return [](Containers::ArrayView<float> data) { … };
}
TransformT transformImplementation(Cpu::Sse42T) {
    return [](Containers::ArrayView<float> data) { … };
}
TransformT transformImplementation(Cpu::Avx2T) {
    return [](Containers::ArrayView<float> data) { … };
}

DEATH_CPU_DISPATCHER_BASE(transformImplementation)

The macro creates an overload of the same name, but taking Features instead, and internally dispatches to one of the overloads using the same rules as in the compile-time dispatch. This means you can now call it with e.g. runtimeFeatures(), get a function pointer back and then call it with the actual arguments:

// Dispatch once and cache the function pointer
TransformT transform = transformImplementation(Cpu::runtimeFeatures());

// Call many times
transform(data);
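
Conceptually, the generated dispatcher is one more overload that inspects the feature set and forwards to the tagged wrappers. The sketch below writes such an overload by hand. It assumes nothing about what the macro really expands to, the tags and feature bits are hypothetical stand-ins, and the lambdas return marker values only so the chosen variant is observable:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the library's tags and feature bits
struct ScalarT {};
struct Sse42T {};
struct Avx2T {};
enum Feature: std::uint32_t { Sse42 = 1u << 0, Avx2 = 1u << 1 };

// Common signature shared by all variants
using TransformT = int(*)(int);

// Tagged wrappers, each returning a captureless lambda
TransformT transformImplementation(ScalarT) {
    return [](int x) { return x; };
}
TransformT transformImplementation(Sse42T) {
    return [](int x) { return x + 100; };
}
TransformT transformImplementation(Avx2T) {
    return [](int x) { return x + 200; };
}

// Hand-written dispatcher overload: most capable variant first
TransformT transformImplementation(std::uint32_t features) {
    if(features & Avx2) return transformImplementation(Avx2T{});
    if(features & Sse42) return transformImplementation(Sse42T{});
    return transformImplementation(ScalarT{});
}

// Dispatch once, then call through the cached pointer
int runTwice(std::uint32_t features) {
    TransformT transform = transformImplementation(features);
    assert(transform);  // the dispatcher always returns a variant
    return transform(transform(0));
}
```

Because the lambdas capture nothing, they convert to plain function pointers, which is what makes the common `TransformT` signature possible.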

Instruction enabling macros and lambdas

An important difference with the DEATH_ENABLE_* macros is that they now also have to go directly next to the lambda, as GCC currently doesn't propagate the attributes from the wrapper function to the nested lambda. To make matters worse, older versions of Clang suffer from the inverse problem and ignore lambda attributes, so you have to specify them on both the lambda and the wrapper function. GCC 9.1 to 9.3 also have a bug where they can't parse attributes on lambdas with a trailing return type. The preferable solution is to not use a trailing return type, at the cost of potentially more verbose return statements. Alternatively you can require version 8, 9.4 or 10 instead, but note that 9.3 is the default compiler on Ubuntu 20.04.

All things considered, the above AVX2 variant would look like this with the relevant macros added:

#ifdef DEATH_ENABLE_AVX2
DEATH_ENABLE_AVX2 TransformT transformImplementation(Cpu::Avx2T) {
    return [](Containers::ArrayView<float> data) DEATH_ENABLE_AVX2 { … };
}
#endif

Automatic runtime dispatch with extra instruction sets

If the variants are tagged with extra instruction sets instead of just the base instruction set like in the lookup() case shown above, you'll use the DEATH_CPU_DISPATCHER() macro instead. There, to avoid a combinatorial explosion of cases to check, you're expected to list the actual extra tags the overloads use, which is usually just one or two out of the whole set:

using LookupT = int(*)(…);

LookupT lookupImplementation(DEATH_CPU_DECLARE(Cpu::Scalar)) {
    …
}
LookupT lookupImplementation(DEATH_CPU_DECLARE(Cpu::Sse2)) {
    …
}
LookupT lookupImplementation(DEATH_CPU_DECLARE(Cpu::Sse41 | Cpu::Popcnt | Cpu::Lzcnt)) {
    …
}

DEATH_CPU_DISPATCHER(lookupImplementation, Cpu::Popcnt, Cpu::Lzcnt)

If some extra instruction sets are always used together (like it is above with Popcnt and Lzcnt), you can reduce the amount of tested combinations by specifying them as a single ORed argument instead:

DEATH_CPU_DISPATCHER(lookupImplementation, Cpu::Popcnt | Cpu::Lzcnt)

On the call side, there's no difference compared to using just the base instruction sets. The created dispatcher function takes Features as well.

Automatic cached dispatch

Ultimately, the dispatch can be performed implicitly, exposing only the final function or a function pointer, with no additional steps needed from the user side. There are three possible scenarios with varying performance tradeoffs. Continuing from the lookupImplementation() example above:

On Linux and Android with API 30+ it's possible to use the GNU IFUNC mechanism, where the dynamic linker performs a dispatch during the early startup. This is the fastest variant of runtime dispatch, as it results in an equivalent of a regular dynamic library function call. Assuming a dispatcher was created using either DEATH_CPU_DISPATCHER() or DEATH_CPU_DISPATCHER_BASE(), it's implemented using the DEATH_CPU_DISPATCHED_IFUNC() macro:
```
DEATH_CPU_DISPATCHED_IFUNC(lookupImplementation, int lookup(…))
```
On platforms where IFUNC isn't available, a function pointer can be used for runtime dispatch instead. It's one additional indirection, which may have a visible effect if the dispatched-to code is relatively tiny and is called from within a tight loop. Assuming a dispatcher was created using either DEATH_CPU_DISPATCHER() or DEATH_CPU_DISPATCHER_BASE(), it's implemented using the DEATH_CPU_DISPATCHED_POINTER() macro:
```
DEATH_CPU_DISPATCHED_POINTER(lookupImplementation, int(*lookup)(…))
```
For the least amount of overhead, the compile-time dispatch can be used, with arguments passed through by hand. Similarly to IFUNC, this will also result in a regular function, but without the indirect overhead. Furthermore, since it's a direct call to the lambda inside, compiler optimizations will fully inline its contents, removing any remaining overhead and allowing LTO and other inter-procedural optimizations that wouldn't be possible with the indirect calls. This option is best suited for scenarios where it's possible to build and optimize code for a single target platform. In this case it calls directly to the original variants, so no macro is needed and DEATH_CPU_DISPATCHER() / DEATH_CPU_DISPATCHER_BASE() is not needed either:
```
int lookup(…) {
    return lookupImplementation(DEATH_CPU_SELECT(Cpu::Default))(…);
}
```
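
To make the function pointer scenario more concrete, here is a minimal library-independent sketch of a pointer that is resolved once during static initialization and then called like a regular function. The variants and the `optimizedVariantUsable()` hook are hypothetical. A real dispatcher would query the CPU there instead of returning a constant:

```cpp
#include <cassert>

// Hypothetical variants sharing one signature
int lookupScalar(int value) { return value * 2; }
int lookupOptimized(int value) { return value + value; }  // stand-in for a SIMD variant

// Hypothetical detection hook; a real one would query the CPU at runtime
bool optimizedVariantUsable() { return false; }

// Resolved once during static initialization, afterwards used like a function
int(*lookup)(int) = optimizedVariantUsable() ? lookupOptimized : lookupScalar;

// Callers pay one pointer indirection per call
int lookupSum(const int* values, int count) {
    assert(lookup);
    int sum = 0;
    for(int i = 0; i != count; ++i) sum += lookup(values[i]);
    return sum;
}
```

The per-call indirection in `lookupSum()` is the overhead the second scenario above refers to. With IFUNC the dynamic linker resolves the target instead, and with compile-time dispatch there is no pointer at all.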

With all three cases, you end up with either a function or a function pointer. The macro signatures are deliberately similar to each other and to the direct function declaration to make it possible to unify them under a single wrapper macro in case a practical use case needs to handle more than one variant. The mechanism can be selected automatically according to the build configuration using the DEATH_CPU_DISPATCHED macro:

DEATH_CPU_DISPATCHED(lookupImplementation, int DEATH_CPU_DISPATCHED_DECLARATION(lookup)(…))({
    return lookupImplementation(DEATH_CPU_SELECT(Cpu::Default))(…);
})

Classes

struct Avx2T: AVX2 tag type.
struct Avx512fT: AVX-512 Foundation tag type.
struct AvxF16cT: AVX F16C tag type.
struct AvxFmaT: AVX FMA tag type.
struct AvxT: AVX tag type.
struct Bmi1T: BMI1 tag type.
struct Bmi2T: BMI2 tag type.
class Features: Feature set.
struct LzcntT: LZCNT tag type.
struct NeonFmaT: NEON FMA tag type.
struct NeonFp16T: NEON FP16 tag type.
struct NeonT: NEON tag type.
struct PopcntT: POPCNT tag type.
struct ScalarT: Scalar tag type.
struct Simd128T: SIMD128 tag type.
struct Sse2T: SSE2 tag type.
struct Sse3T: SSE3 tag type.
struct Sse41T: SSE4.1 tag type.
struct Sse42T: SSE4.2 tag type.
struct Ssse3T: SSSE3 tag type.
template<class T> struct TypeTraits: Traits class for CPU detection tag types.

Typedefs

using DefaultBaseT = ScalarT: Default base tag type.
using DefaultExtraT = Implementation::Tags<0>: Default extra tag type.
using DefaultT = Implementation::Tags<static_cast<unsigned int>(TypeTraits<DefaultBaseT>::Index)|DefaultExtraT::Value>: Default tag type.

Functions

template<class T> auto tag() -> T constexpr: Tag for a tag type.
template<class T> auto features() -> Features constexpr: Feature set for a tag type.
auto compiledFeatures() -> Features constexpr: CPU instruction sets enabled at compile time.
auto runtimeFeatures() -> Features constexpr: Detect available CPU instruction sets at runtime.
template<class T, class = decltype(TypeTraits<T>::Index)> auto operator==(T a, Features b) -> bool constexpr: Equality comparison of a tag and a feature set.
template<class T, class U, class = decltype(TypeTraits<T>::Index), class = decltype(TypeTraits<U>::Index)> auto operator==(T, U) -> bool constexpr: Equality comparison of two tags.
template<class T, class = decltype(TypeTraits<T>::Index)> auto operator!=(T a, Features b) -> bool constexpr: Non-equality comparison of a tag and a feature set.
template<class T, class U, class = decltype(TypeTraits<T>::Index), class = decltype(TypeTraits<U>::Index)> auto operator!=(T, U) -> bool constexpr: Non-equality comparison of two tags.
template<class T, class = decltype(TypeTraits<T>::Index)> auto operator>=(T a, Features b) -> bool constexpr: Whether a is a superset of b (a ⊇ b).
template<class T, class = decltype(TypeTraits<T>::Index)> auto operator<=(T a, Features b) -> bool constexpr: Whether a is a subset of b (a ⊆ b).
template<class T, class = decltype(TypeTraits<T>::Index)> auto operator|(T a, Features b) -> Features constexpr: Union of two feature sets.
template<class T, class = decltype(TypeTraits<T>::Index)> auto operator&(T a, Features b) -> Features constexpr: Intersection of two feature sets.
template<class T, class = decltype(TypeTraits<T>::Index)> auto operator^(T a, Features b) -> Features constexpr: XOR of two feature sets.
template<class T> auto operator~(T a) -> Features constexpr: Feature set complement.

Variables

ScalarT Scalar constexpr: Scalar tag.
Sse2T Sse2 constexpr: SSE2 tag.
Sse3T Sse3 constexpr: SSE3 tag.
Ssse3T Ssse3 constexpr: SSSE3 tag.
Sse41T Sse41 constexpr: SSE4.1 tag.
Sse42T Sse42 constexpr: SSE4.2 tag.
PopcntT Popcnt constexpr: POPCNT tag.
LzcntT Lzcnt constexpr: LZCNT tag.
Bmi1T Bmi1 constexpr: BMI1 tag.
Bmi2T Bmi2 constexpr: BMI2 tag.
AvxT Avx constexpr: AVX tag.
AvxF16cT AvxF16c constexpr: AVX F16C tag.
AvxFmaT AvxFma constexpr: AVX FMA tag.
Avx2T Avx2 constexpr: AVX2 tag.
Avx512fT Avx512f constexpr: AVX-512 Foundation tag.
NeonT Neon constexpr: NEON tag.
NeonFmaT NeonFma constexpr: NEON FMA tag.
NeonFp16T NeonFp16 constexpr: NEON FP16 tag.
Simd128T Simd128 constexpr: SIMD128 tag.
DefaultBaseT DefaultBase constexpr: Default base tag.
DefaultExtraT DefaultExtra constexpr: Default extra tags.
DefaultT Default constexpr: Default tags.

Typedef documentation

typedef ScalarT Death::Cpu::DefaultBaseT
#include <Cpu.h>

Default base tag type.

See the DefaultBase tag for more information.

typedef Implementation::Tags<0> Death::Cpu::DefaultExtraT
#include <Cpu.h>

Default extra tag type.

See the DefaultExtra tag for more information.

typedef Implementation::Tags<static_cast<unsigned int>(TypeTraits<DefaultBaseT>::Index)|DefaultExtraT::Value> Death::Cpu::DefaultT
#include <Cpu.h>

Default tag type.

See the Default tag for more information.

Function documentation

#include <Cpu.h>

template<class T>
T Death::Cpu::tag() constexpr

Tag for a tag type.

Returns a tag corresponding to tag type T.

#include <Cpu.h>

template<class T>
Features Death::Cpu::features() constexpr

Feature set for a tag type.

Returns Features with a tag corresponding to tag type T, avoiding a need to form the tag value in order to pass it to Features::Features(T).

Features Death::Cpu::compiledFeatures() constexpr
#include <Cpu.h>

CPU instruction sets enabled at compile time.

On x86, returns a combination of Sse2, Sse3, Ssse3, Sse41, Sse42, Popcnt, Lzcnt, Bmi1, Bmi2, Avx, AvxF16c, AvxFma, Avx2 and Avx512f based on which of the DEATH_TARGET_SSE2 etc. preprocessor macros are defined.

On ARM, returns a combination of Neon, NeonFma and NeonFp16 based on which of the DEATH_TARGET_NEON etc. preprocessor macros are defined.

On WebAssembly, returns Simd128 based on whether the DEATH_TARGET_SIMD128 preprocessor variable is defined.

On other platforms or if no known CPU instruction set is enabled, the returned value is equal to Scalar, which in turn is equivalent to empty (or default-constructed) Features.

Features Death::Cpu::runtimeFeatures() constexpr
#include <Cpu.h>

Detect available CPU instruction sets at runtime.

On x86 with GCC, Clang or MSVC, uses the CPUID builtin to check for the Sse2, Sse3, Ssse3, Sse41, Sse42, Popcnt, Lzcnt, Bmi1, Bmi2, Avx, AvxF16c, AvxFma, Avx2 and Avx512f runtime features. Avx needs OS support as well; if it's not present, no following flags, including Bmi1 and Bmi2, are checked either. On compilers other than GCC, Clang and MSVC the function is constexpr and delegates into compiledFeatures().

On ARM with Linux or Android API level 18+, uses getauxval(), and on ARM macOS and iOS uses sysctlbyname(), to check for Neon, NeonFma and NeonFp16. Neon and NeonFma are implicitly supported on ARM64. On other platforms the function is constexpr and delegates into compiledFeatures().

On WebAssembly an attempt to use SIMD instructions without runtime support results in a WebAssembly compilation error and thus runtime detection is largely meaningless. While this may change once the feature detection proposal is implemented, at the moment the function is constexpr and delegates into compiledFeatures().

On other platforms or if no known CPU instruction set is detected, the returned value is equal to Scalar, which in turn is equivalent to empty (or default-constructed) Features.

Variable documentation

ScalarT Death::Cpu::Scalar constexpr
#include <Cpu.h>

Scalar tag.

Code that isn't explicitly optimized with any advanced CPU instruction set. Fallback if no other CPU instruction set is chosen or available. The next most widely supported instruction sets are Sse2 on x86, Neon on ARM and Simd128 on WebAssembly.

Sse2T Death::Cpu::Sse2 constexpr
#include <Cpu.h>

SSE2 tag.

Streaming SIMD Extensions 2. Available only on x86, supported by all 64-bit x86 processors and present on the majority of contemporary 32-bit x86 processors as well. Superset of Scalar, implied by Sse3.

Sse3T Death::Cpu::Sse3 constexpr
#include <Cpu.h>

SSE3 tag.

Streaming SIMD Extensions 3. Available only on x86. Superset of Sse2, implied by Ssse3.

Ssse3T Death::Cpu::Ssse3 constexpr
#include <Cpu.h>

SSSE3 tag.

Supplemental Streaming SIMD Extensions 3. Available only on x86. Superset of Sse3, implied by Sse41.

Note that certain older AMD processors have SSE4a but neither SSSE3 nor SSE4.1. Both can, however, be treated as a subset of SSE4.1 to a large extent, and it's recommended to use Sse41 to handle those.

Sse41T Death::Cpu::Sse41 constexpr
#include <Cpu.h>

SSE4.1 tag.

Streaming SIMD Extensions 4.1. Available only on x86. Superset of Ssse3, implied by Sse42.

Note that certain older AMD processors have SSE4a but neither SSSE3 nor SSE4.1. Both can, however, be treated as a subset of SSE4.1 to a large extent, and it's recommended to use Sse41 to handle those.

Sse42T Death::Cpu::Sse42 constexpr
#include <Cpu.h>

SSE4.2 tag.

Streaming SIMD Extensions 4.2. Available only on x86. Superset of Sse41, implied by Avx.

PopcntT Death::Cpu::Popcnt constexpr
#include <Cpu.h>

POPCNT tag.

POPCNT instructions. Available only on x86. This instruction set is treated as an extra, i.e. is neither a superset of nor implied by any other instruction set. See Usage with extra instruction sets for more information.

LzcntT Death::Cpu::Lzcnt constexpr
#include <Cpu.h>

LZCNT tag.

LZCNT instructions. Available only on x86. This instruction set is treated as an extra, i.e. is neither a superset of nor implied by any other instruction set. See Usage with extra instruction sets for more information.

Note that this instruction has encoding compatible with an earlier BSR instruction which has a slightly different behavior. To avoid wrong results if it isn't available, prefer to always detect its presence with runtimeFeatures() instead of a compile-time check.

Bmi1T Death::Cpu::Bmi1 constexpr
#include <Cpu.h>

BMI1 tag.

BMI1 instructions, including TZCNT. Available only on x86. This instruction set is treated as an extra, i.e. is neither a superset of nor implied by any other instruction set. See Usage with extra instruction sets for more information.

Note that the TZCNT instruction has encoding compatible with an earlier BSF instruction which has a slightly different behavior. To avoid wrong results if it isn't available, prefer to always detect its presence with runtimeFeatures() instead of a compile-time check.

Bmi2T Death::Cpu::Bmi2 constexpr
#include <Cpu.h>

BMI2 tag.

BMI2 instructions. Available only on x86. This instruction set is treated as an extra, i.e. is neither a superset of nor implied by any other instruction set. See Usage with extra instruction sets for more information.

AvxT Death::Cpu::Avx constexpr
#include <Cpu.h>

AVX tag.

Advanced Vector Extensions. Available only on x86. Superset of Sse42, implied by Avx2.

AvxF16cT Death::Cpu::AvxF16c constexpr
#include <Cpu.h>

AVX F16C tag.

F16C instructions. Available only on x86. This instruction set is treated as an extra, i.e. is neither a superset of nor implied by any other instruction set. See Usage with extra instruction sets for more information.

AvxFmaT Death::Cpu::AvxFma constexpr
#include <Cpu.h>

AVX FMA tag.

FMA3 instruction set. Available only on x86. This instruction set is treated as an extra, i.e. is neither a superset of nor implied by any other instruction set. See Usage with extra instruction sets for more information.

Avx2T Death::Cpu::Avx2 constexpr
#include <Cpu.h>

AVX2 tag.

Advanced Vector Extensions 2. Available only on x86. Superset of Avx, implied by Avx512f.

Avx512fT Death::Cpu::Avx512f constexpr
#include <Cpu.h>

AVX-512 Foundation tag.

AVX-512 Foundation. Available only on x86. Superset of Avx2.

NeonT Death::Cpu::Neon constexpr
#include <Cpu.h>

NEON tag.

ARM NEON. Available only on ARM. Superset of Scalar, implied by NeonFma.

NeonFmaT Death::Cpu::NeonFma constexpr
#include <Cpu.h>

NEON FMA tag.

ARM NEON with FMA instructions. Available only on ARM. Superset of Neon, implied by NeonFp16.

NeonFp16T Death::Cpu::NeonFp16 constexpr
#include <Cpu.h>

NEON FP16 tag.

ARM NEON with ARMv8.2-a FP16 vector arithmetic. Available only on ARM. Superset of NeonFma.

Simd128T Death::Cpu::Simd128 constexpr
#include <Cpu.h>

SIMD128 tag.

128-bit WebAssembly SIMD. Available only on WebAssembly. Superset of Scalar.

DefaultBaseT Death::Cpu::DefaultBase constexpr
#include <Cpu.h>

Default base tag.

Highest base instruction set available on given architecture with current compiler flags. Ordered by priority, on DEATH_TARGET_X86 it's one of these:

Avx512f if DEATH_TARGET_AVX512F is defined
Avx2 if DEATH_TARGET_AVX2 is defined
Avx if DEATH_TARGET_AVX is defined
Sse42 if DEATH_TARGET_SSE42 is defined
Sse41 if DEATH_TARGET_SSE41 is defined
Ssse3 if DEATH_TARGET_SSSE3 is defined
Sse3 if DEATH_TARGET_SSE3 is defined
Sse2 if DEATH_TARGET_SSE2 is defined
Scalar otherwise

On DEATH_TARGET_ARM it's one of these:

NeonFp16 if DEATH_TARGET_NEON_FP16 is defined
NeonFma if DEATH_TARGET_NEON_FMA is defined
Neon if DEATH_TARGET_NEON is defined
Scalar otherwise

On DEATH_TARGET_WASM it's one of these:

Simd128 if DEATH_TARGET_SIMD128 is defined
Scalar otherwise

In addition to the above, DefaultExtra contains a combination of extra instruction sets available together with the base instruction set, and Default is a combination of both. See also compiledFeatures() which returns a combination of base tags instead of just the highest available, together with the extra instruction sets, and runtimeFeatures() which is capable of detecting the available CPU feature set at runtime.

DefaultExtraT Death::Cpu::DefaultExtra constexpr
#include <Cpu.h>

Default extra tags.

Instruction sets available in addition to DefaultBase on given architecture with current compiler flags. On DEATH_TARGET_X86 it's a combination of these:

Popcnt if DEATH_TARGET_POPCNT is defined
Lzcnt if DEATH_TARGET_LZCNT is defined
Bmi1 if DEATH_TARGET_BMI1 is defined
Bmi2 if DEATH_TARGET_BMI2 is defined
AvxFma if DEATH_TARGET_AVX_FMA is defined
AvxF16c if DEATH_TARGET_AVX_F16C is defined

No extra instruction sets are currently defined for DEATH_TARGET_ARM or DEATH_TARGET_WASM.

In addition to the above, Default is a combination of both DefaultBase and the extra instruction sets. See also compiledFeatures() which returns these together with a combination of all base instruction sets available, and runtimeFeatures() which is capable of detecting the available CPU feature set at runtime.

DefaultT Death::Cpu::Default constexpr
#include <Cpu.h>

Default tags.

A combination of DefaultBase and DefaultExtra, see their documentation for more information.
