:lower16, :upper16 for aarch64; absolute address into register;

I need to put a 32-bit absolute address into a register on AArch64. (e.g. an MMIO address, not PC-relative).

On ARM32 it was possible to use :lower16: and :upper16: to load an address into a register:

movw    r0, #:lower16:my_addr
movt    r0, #:upper16:my_addr

Is there a way to do a similar thing on AArch64 using movk?

If the code is relocated, I still want the same absolute address, so adr is not suitable.

ldr from a nearby literal pool would work, but I'd rather avoid that.

If your address is an assemble-time constant, not link-time, this is super easy. It's just an integer, and you can split it up manually.

I asked gcc and clang to compile unsigned abs_addr() { return 0x12345678; } (Godbolt)

// gcc8.2 -O3
abs_addr():
    mov     w0, 0x5678               // low half
    movk    w0, 0x1234, lsl 16       // high half
    ret

(Writing w0 implicitly zero-extends into 64-bit x0, same as x86-64).
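The split gcc chose can be reproduced by hand. A minimal sketch in Python (just the arithmetic behind the mov/movk pair, not assembly):

```python
addr = 0x12345678

low  = addr & 0xFFFF           # immediate for: mov  w0, #0x5678
high = (addr >> 16) & 0xFFFF   # immediate for: movk w0, #0x1234, lsl 16

# Reassembling the two halves gives back the original constant.
assert (high << 16) | low == addr
```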


Or if your constant is only a link-time constant and you need to generate relocations in the .o for the linker to fill in, the GAS manual documents what you can do, in the AArch64 machine-specific section:

Relocations for ‘MOVZ’ and ‘MOVK’ instructions can be generated by prefixing the label with #:abs_g2: etc. For example to load the 48-bit absolute address of foo into x0:

    movz x0, #:abs_g2:foo     // bits 32-47, overflow check
    movk x0, #:abs_g1_nc:foo  // bits 16-31, no overflow check
    movk x0, #:abs_g0_nc:foo  // bits  0-15, no overflow check

The GAS manual's example is sub-optimal; going low to high is more efficient on at least some AArch64 CPUs (see below). For a 32-bit constant, follow the same pattern that gcc used for a numeric literal.

 movz x0, #:abs_g0_nc:foo           // bits  0-15, no overflow check
 movk x0, #:abs_g1:foo              // bits 16-31, overflow check

The value of #:abs_g1:foo is known to have its possibly-set bits in the 16-31 range, so the assembler knows to use a lsl 16 when encoding the movk. You should not use an explicit lsl 16 here.
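The g0/g1/g2 selectors just pick aligned 16-bit chunks of the value. A small Python model of which bits each group covers (the helper name abs_g is made up for illustration):

```python
def abs_g(value, group):
    """Return the 16-bit chunk that an abs_g<group> relocation selects (group 0-3)."""
    return (value >> (16 * group)) & 0xFFFF

foo = 0x123456789ABC  # an example 48-bit address

g2 = abs_g(foo, 2)    # bits 32-47 -> movz x0, #:abs_g2:foo
g1 = abs_g(foo, 1)    # bits 16-31 -> movk x0, #:abs_g1_nc:foo
g0 = abs_g(foo, 0)    # bits  0-15 -> movk x0, #:abs_g0_nc:foo

# The three chunks reassemble into the full 48-bit address.
assert (g2 << 32) | (g1 << 16) | g0 == foo
```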

I chose x0 instead of w0 because that's what gcc does for unsigned long long. Probably performance is identical on all CPUs, and code size is identical.

.text
func:
   // efficient
     movz x0, #:abs_g0_nc:foo           // bits  0-15, no overflow check
     movk x0, #:abs_g1:foo              // bits 16-31, overflow check

   // inefficient but does assemble + link
   //  movz x1, #:abs_g1:foo              // bits 16-31, overflow check
   //  movk x1, #:abs_g0_nc:foo           // bits  0-15, no overflow check

.data
foo: .word 123       // .data will be in a different page than .text

With GCC: aarch64-linux-gnu-gcc -nostdlib aarch-reloc.s to build and link (just to prove we can; it will just crash if you actually run it), and then aarch64-linux-gnu-objdump -drwC a.out:

a.out:     file format elf64-littleaarch64


Disassembly of section .text:

000000000040010c <func>:
  40010c:       d2802280        mov     x0, #0x114                      // #276
  400110:       f2a00820        movk    x0, #0x41, lsl #16

Clang appears to have a bug here, making it unusable: it only assembles #:abs_g1_nc:foo (no check for the high half) and #:abs_g0:foo (overflow check for the low half). This is backwards, and results in a linker error (g0 overflow) when foo has a 32-bit address. I'm using clang version 7.0.1 on x86-64 Arch Linux.

$ clang -target aarch64 -c aarch-reloc.s
aarch-reloc.s:5:15: error: immediate must be an integer in range [0, 65535].
     movz x0, #:abs_g0_nc:foo
              ^

As a workaround, g1_nc instead of g1 is fine: you can live without overflow checks. But you need g0_nc, unless you have a linker where checking can be disabled. (Or maybe some clang installs come with a linker that's bug-compatible with the relocations clang emits?) I was testing with GNU ld (GNU Binutils) 2.31.1 and GNU gold (GNU Binutils 2.31.1) 1.16.

$ aarch64-linux-gnu-ld.bfd aarch-reloc.o 
aarch64-linux-gnu-ld.bfd: warning: cannot find entry symbol _start; defaulting to 00000000004000b0
aarch64-linux-gnu-ld.bfd: aarch-reloc.o: in function `func':
(.text+0x0): relocation truncated to fit: R_AARCH64_MOVW_UABS_G0 against `.data'

$ aarch64-linux-gnu-ld.gold aarch-reloc.o 
aarch-reloc.o(.text+0x0): error: relocation overflow in R_AARCH64_MOVW_UABS_G0

MOVZ vs. MOVK vs. MOVN

movz = move-zero puts a 16-bit immediate into a register with a left-shift of 0, 16, 32 or 48 (and clears the rest of the bits). You always want to start a sequence like this with a movz, and then movk the rest of the bits. (movk = move-keep. Move 16-bit immediate into register, keeping other bits unchanged.)

mov is sort of a pseudo-instruction that can pick movz, but I just tested with GNU binutils and clang, and you need an explicit movz (not mov) with an immediate like #:abs_g0:foo. Apparently the assembler won't infer that it needs movz there, unlike with a numeric literal.

For a narrow immediate, e.g. 0xFF000, which has non-zero bits spanning two aligned 16-bit chunks of the value, mov w0, #0xFF000 would pick the bitmask-immediate form of mov, which is actually an alias for ORR-immediate with the zero register. AArch64 bitmask immediates use a powerful encoding scheme for repeated patterns of bit-ranges. (So e.g. and x0, x1, 0x5555555555555555 (keep only the even bits) can be encoded in a single 32-bit-wide instruction, great for bit-hacks.)
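The "repeated pattern" property can be illustrated in Python. Note this sketch only shows the repetition; the real bitmask-immediate encoding additionally requires each element to be a rotated run of contiguous ones:

```python
def repeats_with_period(value, period, width=64):
    """True if `value` is its low `period` bits tiled across `width` bits."""
    chunk = value & ((1 << period) - 1)
    tiled = 0
    for shift in range(0, width, period):
        tiled |= chunk << shift
    return tiled == value

print(repeats_with_period(0x5555555555555555, 2))   # True: the 2-bit element 0b01, tiled
print(repeats_with_period(0x12345678, 2))           # False: no 2-bit repeating pattern
```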

There's also movn (move not) which flips the bits. This is useful for negative values, allowing you to have all the upper bits set to 1. There's even a relocation for it, according to AArch64 relocation prefixes.
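The effect of all three instructions on the destination register can be sketched as plain bit arithmetic (a Python model of the semantics, not of the encodings):

```python
MASK64 = (1 << 64) - 1

def movz(imm16, shift=0):
    # Write the immediate at the shifted position; all other bits become zero.
    return (imm16 << shift) & MASK64

def movk(reg, imm16, shift=0):
    # Replace one 16-bit chunk; keep the rest of the register unchanged.
    return (reg & ~(0xFFFF << shift) & MASK64) | (imm16 << shift)

def movn(imm16, shift=0):
    # Write the bitwise NOT of the shifted immediate: one instruction for -1,
    # or any value whose other bits are all ones.
    return ~(imm16 << shift) & MASK64

x0 = movk(movz(0x5678), 0x1234, 16)   # the movz/movk pair from above
minus_one = movn(0)                   # all 64 bits set, i.e. -1
```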


Performance: movz low16; movk high16 in that order

The Cortex A57 optimization manual

4.14 Fast literal generation

Cortex-A57 r1p0 and later revisions support optimized literal generation for 32- and 64-bit code

    MOV wX, #bottom_16_bits
    MOVK wX, #top_16_bits, lsl #16

[and other examples]

... If any of these sequences appear sequentially and in the described order in program code, the two instructions can be executed at lower latency and higher bandwidth than if they do not appear sequentially in the program code, enabling 32-bit literals to be generated in a single cycle and 64-bit literals to be generated in two cycles.

The sequences include movz low16 + movk high16 into x or w registers, in that order. (And also back-to-back movk to set the high 32, again in low, high order.) According to the manual, both instructions have to use w, or both have to use x registers.

Without special support, the movk would have to wait for the movz result to be ready as an input for an ALU operation to replace that 16-bit chunk. Presumably at some point in the pipeline, the 2 instructions merge into a single 32-bit immediate movz or movk, removing the dependency chain.


Why not just

ldr w0, =my_addr

This should expand to optimal code for whatever microarchitecture you are programming for.


Assuming that Peter Cordes' edits to your post reflect your actual intent, you can use the MOVL pseudo-instruction to load an absolute address into a register without using the LDR instruction. For example:

    MOVL x0, my_addr

The MOVL instruction has the advantage of working both with externally defined symbols and locally defined constants. The pseudo-instruction will expand to two or four instructions, depending on whether the destination is a 32-bit or 64-bit register: usually a MOV instruction followed by one or three MOVK instructions.

However it's not obvious why the LDR instruction, specifically the LDR pseudo-instruction, wouldn't also work. This normally results in a PC relative load from a literal pool that the assembler will place in the same section (area) as your code.

For example:

    LDR x0, =my_addr

would be assembled to something like:

    LDR x0, literal_pool   ; LDR (PC-relative literal)
    ; ...
literal_pool:
    .quad my_addr

Since literal_pool is part of the same code section as the PC-relative LDR instruction that references it, the offset between the instruction and the symbol never changes, making the code relocatable. You can place your trampoline code in its own section and/or use the LTORG directive to ensure that the literal pool gets placed in a close and easily predictable location.
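Why this stays relocatable can be shown with a little address arithmetic (the 0x10/0x40 section offsets are made up for illustration; Python just does the sums):

```python
def literal_offset(load_base):
    """Offset encoded in a PC-relative LDR, for a code section loaded at load_base."""
    ldr_site     = load_base + 0x10   # address of the LDR instruction
    literal_pool = load_base + 0x40   # the .quad my_addr slot, in the same section
    return literal_pool - ldr_site    # load_base cancels out

# The encoded offset is identical wherever the section ends up in memory.
assert literal_offset(0x400000) == literal_offset(0x7F0000) == 0x30
```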





Comments
  • Relative memory reading via LDR and ADR is relocatable code. On the other hand, your ARM32 example code isn't relocatable. Also note that :lower16: and :upper16: wouldn't be sufficient for 64-bit ARM code because addresses are 64-bit.
  • Nope, ldr & adr are not relocatable in my case, since the memory region they reference cannot be copied to a new location.
  • LDR and ADR are PC relative so work even if the program is relocated.
  • alright, mate. I need to load an absolute address without usage of LDR & ADR instructions.
  • @ElliotAlderson: pretty sure we're talking about whatever instructions the assembler chooses to use for a ldr w0, =0x12345678 pseudo-instruction. Which could be mov/movk.
  • abs_g* is what I was looking for. Thanks!
  • Just to note, instead of movz x0, #:abs_g2:foo, I used mov x0, #0 + movk x0, #:abs_g2_nc:foo. For some reason the linker could not handle the first version and returned "unrecognized reloc 267".
  • @user3124812: I don't think you need to zero the register first and then merge with movk. mov x0, #:abs_g2_nc:foo should be able to put a 16-bit immediate left-shifted to any position in a register (and zero the rest of the bits). BTW, isn't g2 for bits 32-47? That would make your address 4GB aligned, and 48 bits, not 32.
  • mov x0, #:abs_g2_nc:foo does not compile with Clang; GCC, on the other hand, handles it properly. I checked that on godbolt.org as well. In fact the address fits in 32 bits, but it's not a big deal to have one extra 'explicit' instruction after all and not worry about it in the future.
  • @user3124812: I was just experimenting with that, it seems you need an explicit movz, not just mov. Working on an update.
  • As I mentioned in comments above, the code might be executed from different locations, and the corresponding literal pool would also have to be moved along with the code. That makes handling much more complicated.
  • @user3124812 If a literal pool is generated, it is accessed through a PC-relative addressing mode. Thus, if the code is moved around, the access still goes to the right place.
  • That's right, the code is PC-relative, which means the literal pool has to be moved around as well so that the relative memory read can still reach it. I was looking for a way to avoid that extra work by putting the absolute address straight into a register.