ARMv8 floating point output inline assembly

For adding two integers, I write:

int sum;
asm volatile("add %0, x3, x4" : "=r"(sum) : :);

How can I do this with two floats? I tried:

float sum;
asm volatile("fadd %0, s3, s4" : "=r"(sum) : :);

But it gives me an error:

Error: operand 1 should be a SIMD vector register -- `fadd x0,s3,s4'

Any ideas?

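Because FP/SIMD registers have multiple names in AArch64 (v0, b0, h0, s0, d0 all refer to the same register), you need the "=w" constraint plus an operand modifier in the template string to select which name gets printed: %s0 for the single-precision name, %d0 for the double-precision name.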

On Godbolt

float foo()
{
    float sum;
    asm volatile("fadd %s0, s3, s4" : "=w"(sum) : :);
    return sum;
}

double dsum()
{
    double sum;
    asm volatile("fadd %d0, d3, d4" : "=w"(sum) : :);
    return sum;
}

This produces:

foo:
        fadd s0, s3, s4 // sum
        ret     
dsum:
        fadd d0, d3, d4 // sum
        ret  

"=r" is the constraint for GP integer registers.

The GCC manual claims that "=w" is the constraint for an FP / SIMD register on AArch64. But if you try that with a plain %0, you get v0, not s0, which won't assemble. I don't know a workaround here; you should probably report on the gcc bugzilla that the constraint documented in the manual doesn't work for scalar FP.

On Godbolt I tried this source:

float foo()
{
    float sum;
#ifdef __aarch64__
    asm volatile("fadd %0, s3, s4" : "=w"(sum) : :);   // AArch64
#else
    asm volatile("fadds %0, s3, s4" : "=t"(sum) : :);  // ARM32
#endif
    return sum;
}

double dsum()
{
    double sum;
#ifdef __aarch64__
    asm volatile("fadd %0, d3, d4" : "=w"(sum) : :);   // AArch64
#else
    asm volatile("faddd %0, d3, d4" : "=w"(sum) : :);  // ARM32
#endif
    return sum;
}

clang7.0 (with its built-in assembler) requires the asm to be actually valid. But for gcc we're only compiling to asm, and Godbolt doesn't have a "binary mode" for non-x86.

# AArch64 gcc 8.2  -xc -O3 -fverbose-asm -Wall
# INVALID ASM, errors if you try to actually assemble it.
foo:
    fadd v0, s3, s4 // sum
    ret     
dsum:
    fadd v0, d3, d4 // sum
    ret

clang produces the same asm, and its built-in assembler errors with:

<source>:5:18: error: invalid operand for instruction
    asm volatile("fadd %0, s3, s4" : "=w"(sum) : :);
                 ^
<inline asm>:1:11: note: instantiated into assembly here
        fadd v0, s3, s4
             ^

On 32-bit ARM, "=t" for single works, but "=w" (which the manual says you should use for double-precision) also gives you s0 with gcc. It works with clang, though. You have to use -mfloat-abi=hard and a -mcpu= option that names a CPU with an FPU, e.g. -mcpu=cortex-a15.

# clang7.0 -xc -O3 -Wall --target=arm -mcpu=cortex-a15 -mfloat-abi=hard
# valid asm for ARM 32
foo:
        vadd.f32        s0, s3, s4
        bx      lr
dsum:
        vadd.f64        d0, d3, d4
        bx      lr

But gcc fails:

# ARM gcc 8.2  -xc -O3 -fverbose-asm -Wall -mfloat-abi=hard -mcpu=cortex-a15
foo:
        fadds s0, s3, s4        @ sum
        bx      lr  @
dsum:
        faddd s0, d3, d4        @ sum    @@@ INVALID
        bx      lr  @

So you can use "=t" for single just fine with gcc, but for double you presumably need a %something0 modifier to print the register name as d0 instead of s0, with a "=w" output.


Obviously these asm statements would only be useful for anything beyond learning the syntax if you add constraints to specify the input operands as well, instead of reading whatever happened to be sitting in s3 and s4.
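As a minimal sketch of what that could look like on AArch64, the version below passes both inputs through "w" constraints and applies the %s / %d modifiers from the other answer to every operand. The function names add_floats and add_doubles are made up for illustration, and the plain-C fallback branch is only there so the snippet also builds on non-AArch64 targets:

```c
/* Sketch only: add_floats / add_doubles are illustrative names. */
static inline float add_floats(float a, float b)
{
#ifdef __aarch64__
    float sum;
    /* "w" asks for an FP/SIMD register; %sN prints its s-register name. */
    __asm__("fadd %s0, %s1, %s2" : "=w"(sum) : "w"(a), "w"(b));
    return sum;
#else
    return a + b;   /* fallback so the sketch compiles on other targets */
#endif
}

static inline double add_doubles(double a, double b)
{
#ifdef __aarch64__
    double sum;
    /* Same idea, with %dN selecting the d-register name. */
    __asm__("fadd %d0, %d1, %d2" : "=w"(sum) : "w"(a), "w"(b));
    return sum;
#else
    return a + b;   /* fallback so the sketch compiles on other targets */
#endif
}
```

With real input operands the result depends only on a and b, so volatile is no longer needed and the compiler is free to pick the registers and to optimize away unused results.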

See also https://stackoverflow.com/tags/inline-assembly/info

Comments
  • Nice, I figured there was probably a modifier while writing my answer, but I didn't see it in the GCC manual. Is this documented anywhere? I only know of the modifiers for x86 registers being in the gcc manual, at the bottom of the Extended asm section: gcc.gnu.org/onlinedocs/gcc/…
  • @PeterCordes The problem is that the gcc people are reluctant to document these things. If they're documented, then they're not allowed to change them (which they probably don't do much anyway). One could argue that there are already enough people using them that changing them would cause wide-spread consternation, but that has not yet proven to be a good enough argument to overcome the inertia (but feel free to try!). x86 got doc'ed cuz I was changing the asm docs and added them, and no one felt strongly enough about it to argue with me. I only did x86 cuz that's what I know. Sorry.
  • As an aside to James and OP: Be aware that since these modifiers AREN'T doc'ed, using them is unsupported. As with any undocumented feature, gcc can change them at any time. They probably won't, but they can.
  • If you wanted to document the subset of modifiers that we shouldn’t change, I’ll approve the patch on list. Some (these, %w0 for printing the w rather than x name) should just be documented and fixed. Especially where behavior is consistent between GCC and Clang. To my great shame, the best documentation we have is at github.com/gcc-mirror/gcc/blob/master/gcc/config/aarch64/…
  • @PeterCordes - Is there any way you can take James up on his offer here? I'd love to help, but I know almost zero about arm, so selecting the subset is beyond what I can offer. I could help with the texinfo, but that's probably the least challenging part of this.
  • @DavidWohlferd: I meant that without input constraints, this toy inline asm is only useful for learning the syntax, not doing anything useful. To go beyond that, you need input constraints like "w" (input1).
  • @DavidWohlferd: I think a valid parsing of my sentence is that change is required for the statements to be useful for anything other than learning/testing the syntax. In my last edit I changed the wording of the rest of the sentence to try to make that clearer, but feel free to edit if you're still convinced it's confusing. Probably you aren't the only one that parsed it differently from how I intended.
  • Opened a bug for the 32-bit problem at: gcc.gnu.org/bugzilla/show_bug.cgi?id=89482