What does multicore assembly language look like?

Once upon a time, to write x86 assembler, for example, you would have instructions stating "load the EDX register with the value 5", "increment the EDX register", etc.

With modern CPUs that have 4 cores (or even more), at the machine code level does it just look like there are 4 separate CPUs (i.e. are there just 4 distinct "EDX" registers) ? If so, when you say "increment the EDX register", what determines which CPU's EDX register is incremented? Is there a "CPU context" or "thread" concept in x86 assembler now?

How does communication/synchronization between the cores work?

If you were writing an operating system, what mechanism is exposed via hardware to allow you to schedule execution on different cores? Is it some special privileged instruction(s)?

If you were writing an optimizing compiler/bytecode VM for a multicore CPU, what would you need to know specifically about, say, x86 to make it generate code that runs efficiently across all the cores?

What changes have been made to x86 machine code to support multi-core functionality?

This isn't a direct answer to the question, but it's an answer to a question that appears in the comments. Essentially, the question is what support the hardware gives to multi-threaded operation.

Nicholas Flynt had it right, at least regarding x86. In a multi-threaded environment (Hyper-Threading, multi-core, or multi-processor), the bootstrap thread (usually thread 0 in core 0 in processor 0) starts up fetching code from address 0xFFFFFFF0. All the other threads start up in a special sleep state called Wait-for-SIPI (WFS). As part of its initialization, the primary thread sends a special inter-processor interrupt (IPI) over the APIC called a SIPI (Startup IPI) to each thread that is in WFS. The SIPI contains the address from which that thread should start fetching code.

This mechanism allows each thread to execute code from a different address. All that's needed is software support for each thread to set up its own tables and messaging queues. The OS uses those to do the actual multi-threaded scheduling.

As far as the actual assembly is concerned, as Nicholas wrote, there's no difference between the assemblies for a single threaded or multi threaded application. Each logical thread has its own register set, so writing:

mov edx, 0

will only update EDX for the currently running thread. There's no way to modify EDX on another processor using a single assembly instruction. You need some sort of system call to ask the OS to tell another thread to run code that will update its own EDX.

As I understand it, each "core" is a complete processor, with its own register set. Basically, the BIOS starts you off with one core running, and then the operating system can "start" other cores by initializing them and pointing them at the code to run, etc.

Synchronization is done by the OS. Generally, each processor is running a different process for the OS, so the multi-threading functionality of the operating system is in charge of deciding which process gets to touch which memory, and what to do in the case of a memory collision.

The Unofficial SMP FAQ


Once upon a time, to write x86 assembler, for example, you would have instructions stating "load the EDX register with the value 5", "increment the EDX register", etc. With modern CPUs that have 4 cores (or even more), at the machine code level does it just look like there are 4 separate CPUs (i.e. are there just 4 distinct "EDX" registers) ?

Exactly. There are 4 sets of registers, including 4 separate instruction pointers.

If so, when you say "increment the EDX register", what determines which CPU's EDX register is incremented?

The CPU that executed that instruction, naturally. Think of it as 4 entirely different microprocessors that are simply sharing the same memory.

Is there a "CPU context" or "thread" concept in x86 assembler now?

No. The assembler just translates instructions like it always did. No changes there.

How does communication/synchronization between the cores work?

Since they share the same memory, it's mostly a matter of program logic. Although there now is an inter-processor interrupt mechanism, it's not necessary and was not originally present in the first dual-CPU x86 systems.

If you were writing an operating system, what mechanism is exposed via hardware to allow you to schedule execution on different cores?

The scheduler actually doesn't change, except that it has to be slightly more careful about critical sections and the types of locks used. Before SMP, kernel code would eventually call the scheduler, which would look at the run queue and pick a process to run as the next thread. (Processes look a lot like threads to the kernel.) The SMP kernel runs the exact same code, one thread at a time; it's just that critical-section locking now needs to be SMP-safe, so that two cores can't accidentally pick the same PID.

Is it some special privileged instruction(s)?

No. The cores are just all running in the same memory with the same old instructions.

If you were writing an optimizing compiler/bytecode VM for a multicore CPU, what would you need to know specifically about, say, x86 to make it generate code that runs efficiently across all the cores?

You run the same code as before. It's the Unix or Windows kernel that needed to change.

You could summarize my question as "What changes have been made to x86 machine code to support multi-core functionality?"

Nothing was necessary. The first SMP systems used the exact same instruction set as uniprocessors. Now, there has been a great deal of x86 architecture evolution and zillions of new instructions to make things go faster, but none were necessary for SMP.

For more information, see the Intel Multiprocessor Specification.


Update: all the follow-up questions can be answered by just completely accepting that an n-way multicore CPU is almost [1] exactly the same thing as n separate processors that just share the same memory. [2] There was an important question not asked: how is a program written to run on more than one core for more performance? And the answer is: it is written using a thread library like Pthreads. Some thread libraries use "green threads" that are not visible to the OS, and those won't get separate cores, but as long as the thread library uses kernel thread features then your threaded program will automatically be multicore.

[1] For backwards compatibility, only the first core starts up at reset, and a few driver-type things need to be done to fire up the remaining ones.
[2] They also share all the peripherals, naturally.

If you were writing an optimizing compiler/bytecode VM for a multicore CPU, what would you need to know specifically about, say, x86 to make it generate code that runs efficiently across all the cores?

As someone who writes optimizing compiler/bytecode VMs I may be able to help you here.

You do not need to know anything specifically about x86 to make it generate code that runs efficiently across all the cores.

However, you may need to know about cmpxchg and friends in order to write code that runs correctly across all the cores. Multicore programming requires the use of synchronisation and communication between threads of execution.

You may need to know something about x86 to make it generate code that runs efficiently on x86 in general.

There are other things it would be useful for you to learn:

You should learn about the facilities the OS (Linux or Windows or OSX) provides to allow you to run multiple threads. You should learn about parallelization APIs such as OpenMP and Threading Building Blocks, or OSX 10.6 "Snow Leopard"'s forthcoming "Grand Central".

You should consider if your compiler should be auto-parallelising, or if the author of the applications compiled by your compiler needs to add special syntax or API calls into his program to take advantage of the multiple cores.

Comments
  • There's a similar (though not identical) question here: stackoverflow.com/questions/714905/…
  • Thanks for filling the gap in Nicholas' answer. Have marked yours as the accepted answer now.... gives the specific details I was interested in... although it would be better if there was a single answer that had your information and Nicholas' all combined.
  • This doesn't answer the question of where the threads come from. Cores and processors is a hardware thing, but somehow threads must be created in software. How does the primary thread know where to send the SIPI? Or does the SIPI itself create a new thread?
  • @richremer: It seems like you're confusing HW threads and SW threads. The HW thread always exists. Sometimes it's asleep. The SIPI itself wakes the HW thread and allows it to run SW. It is up to the OS and BIOS to decide which HW threads run, and which processes and SW threads run on each HW thread.
  • Lots of good and concise info here, but this is a big topic - so questions can linger. There are a few examples of complete "bare bones" kernels in the wild that boot from USB drives or "floppy" disks - here's an x86_32 version written in assembler using the old TSS descriptors that can actually run multi-threaded C code (github.com/duanev/oz-x86-32-asm-003) but there is no standard library support. Quite a bit more than you asked for but it can maybe answer some of those lingering questions.
  • What assembler do you use to compile your example? GAS doesn't seem to like your #include (takes it as a comment), NASM, FASM, YASM don't know AT&T syntax so it can't be them... so what is it?
  • @Ruslan gcc, #include comes from the C preprocessor. Use the Makefile provided as explained in the getting started section: github.com/cirosantilli/x86-bare-metal-examples/blob/… If that does not work, do open a GitHub issue.
  • On x86, what happens if a core realizes there are no more processes ready to run in the queue? (Which might happen from time to time on an idle system.) Does the core spinlock on the shared memory structure until there is a new task? (Probably not good, as it would use a lot of power.) Does it call something like HLT to sleep until there is an interrupt? (In that case, who is responsible for waking that core up?)
  • @tigrou not sure, but I find it extremely likely that the Linux implementation will put it in a power state until the next (likely timer) interrupt, especially on ARM where power is key. I would try to see if that can be observed concretely with an instruction trace of a simulator running Linux; it might be: github.com/cirosantilli/linux-kernel-module-cheat/tree/…
  • Some information (specific to x86 / Windows) can be found here (see "Idle Thread"). TL;DR: when no runnable thread exists on a CPU, the CPU is dispatched to an idle thread. Along with some other tasks, it will ultimately call the registered power-management processor idle routine (via a driver provided by the CPU vendor, e.g. Intel). This might transition the CPU to some deeper C-state (e.g. C0 -> C3) in order to reduce power consumption.