Fig 0 Apple Overall scheduling design
Background
Traditionally, the scheduler (reservation station) has been designed as a queue to preserve age information, allowing the oldest instruction to be selected when multiple instructions are to be issued. However, a queue structure consumes significant power and involves complex wiring. Each time a new instruction enters the queue, it has to collaPSe to fill the gap left by filled entries (see Fig. 1 below).
Apple's approach utilizes an age matrix to track relative age state information, as described in the 2015 patent (https://patents.google.com/patent/US20170024205A1) for a non-shifting reservation station. The idea is that each entry in the scheduler includes additional state information to indicate its relative age compared to other entries (1 for older, 0 for younger). For brevity, I will use C-syntax, where M[X][Y] denotes the Yth bit of the X entry in the matrix. For example, if instruction A enters scheduler entry 0 and there are no other instructions yet, its state would be M[0]=11110 (assuming a pool of 5 entries), since it is older than all other unallocated entries (see Fig. 2 below).
Fig 1 Many writes have to be generated when "collapsing".
This structure is easy to update. Consider what happens to the relative age between entries when an instruction enters and exits the scheduler:
Fig 2 Each entry has N bits filed to keep track of the relative age, where N is the size of the scheduler
When an instruction first enters the scheduler, you allocate an entry in the array and set its age vector accordingly. Since the dispatch stage is in order, all instructions that have been inside the scheduler are older than the current instruction, so those bits are set to 0. For unallocated entries, we set their bits to 1 because when they are allocated, the current instruction must be older than them.
When an instruction is issued from the scheduler, the corresponding entry is freed, and all bits indicating relative age to that entry (and allocated) should be set to 1. This is because the next instruction allocated in that entry must be younger than all current instructions in the scheduler.
Now that we have this extra state information, how can we use it? We can refer to a 2016 patent (https://patents.google.com/patent/US10120690B1) on reservation station early age indicator generation. The scheme Apple provided not only consumes less power and is easier to wire but also results in a shorter critical path and higher frequency. The goal is to find the oldest ready instruction to issue into the execution unit using just the matrix states. This can be envisioned as a "knockout round" based on the age of the instructions (see Fig. 3).
Fig 3 perform knockout round between instructions
Using the matrix, performing age comparisons between any two scheduler entries is strAIghtforward:
First, ensure that the current entry is valid (i.e., allocated).
Next, check that the instruction is ready for execution.
Finally, compare their age fields in the matrix: if M[X][Y] > 0, then X is older than Y.
Apple has implemented this process with a few tweaks to shorten the critical path. Nowadays, a modern CPU has around a 20-24 gate delay for one cycle operation. The tree height is log2(Entries); for a scheduler with 32 entries, only 5 passes of comparison are needed.
Fig 4 Overview of apple's approach
Let's make a few assumptions:
One instruction is dispatched to the scheduler per cycle (which holds true for Apple due to their dispatch buffer design).
One instruction is dequeued from the scheduler to the execution unit per cycle.
Apple separates the comparison into two pipeline stages. The first pass selects the oldest instruction from adjacent pairs of entries, storing the selected entries in registers. In the following cycle, the remaining passes are executed on the instructions selected from the previous cycle. This creates a potential issue: the selected instructions are based on information from the previous cycle rather than the current one.
To illustrate the problem:
In cycle 0, one instruction is selected to execute (it will dequeue in the next cycle), but the first pass does not use this information, as the scheduler only updates on the transition to the next cycle.
In cycle 1, the selected instruction continues down the pipeline, selecting the same instruction that was previously chosen. This occurs because it’s based on outdated information; the reason that instruction was selected is that it was the oldest in the earlier cycle. Meanwhile, the first pass stage selects instructions based on the after-dequeue information.
(Note: This is my personal interpretation, not explicitly stated in the 2016 patent.)
To avoid this repetition issue, we calibrate the previously selected entries with the current cycle information. In the second stage of the pipeline, we recognize that an instruction has dequeued from the scheduler, so we can simply remove that instruction from the selected list, keeping the information up-to-date. When a new instruction enters the scheduler, it must be the youngest compared to all instructions already present. Thus, the only situation where it will be selected is if no instructions have been chosen from the previous pass. It’s an easy yet elegant solution, isn’t it?
Taking a step back, this is essentially incremental adjustment! Since a maximum of only two instructions will be inserted or deleted from the array in any given cycle, it’s feasible to select from the previous sample with minimal adjustments, avoiding delays. This is far superior to redoing the entire selection process, which could lead to longer dependency paths.
Thanks to Maynard Handley for pointing out the patents in his M1 exploration PDF. If you haven’t read them, you should definitely check them out. |