Title: On GPGPU core sharing and preemption
Author: phsdljpl67  Time: 2024-8-13 10:54

First, some terminology:
warp: analogous to a SIMD thread on a CPU.
threadblock: a collection of warps that is guaranteed to run on the same GPU core.
SM: the minimum unit of execution, with its own scheduler and execution units.
GPU core: a collection of SMs (assumed to be 4 here).
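To make these terms concrete, here is a minimal CUDA sketch (the kernel name show_hierarchy and the 2-block, 64-thread launch are made up for illustration, and reading %smid via inline PTX is NVIDIA-specific) that prints, for each warp, which thread block it belongs to and which SM the hardware placed it on.

#include <cstdio>

__global__ void show_hierarchy() {
    unsigned smid;
    // Read the hardware SM id through inline PTX (NVIDIA-specific).
    asm("mov.u32 %0, %%smid;" : "=r"(smid));
    int warp_in_block = threadIdx.x / warpSize;   // which warp inside this thread block
    int lane          = threadIdx.x % warpSize;   // position inside the warp
    if (lane == 0)                                // one print per warp
        printf("block %d, warp %d, on SM %u\n", (int)blockIdx.x, warp_in_block, smid);
}

int main() {
    show_hierarchy<<<2, 64>>>();   // 2 thread blocks, each made of two 32-thread warps
    cudaDeviceSynchronize();
    return 0;
}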
Current GPU core designs all converge on using 4 SMs inside a GPU core. The SMs have independent schedulers and execution units, but they all share the same local data cache. At a high level, the top-level scheduler breaks a kernel down into multiple thread blocks, and each thread block is guaranteed to be scheduled onto a single GPU core. Once a thread block is emitted into a GPU core, it is up to the core scheduler to decide how to map its warps to the SMs. Nowadays, warps from different thread blocks of different kernels can all be scheduled onto the same GPU core to achieve maximum utilization. So what the core scheduler does is dispatch as many warps as possible to the SMs (to hide latency). That being said, the number of warps that can be scheduled onto an SM depends on the resources that must be divided among the resident threads, e.g. register file size, branch reconvergence stack, etc.
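As a hedged, NVIDIA-flavored illustration of that resource accounting, the sketch below asks the CUDA runtime how many registers the compiler assigned per thread to a made-up kernel (worker) and how many of its 128-thread blocks can be resident on one SM at once; the kernel and the block size are assumptions, not something taken from the references.

#include <cstdio>

__global__ void worker(float* out) {
    out[blockIdx.x * blockDim.x + threadIdx.x] = threadIdx.x * 2.0f;
}

int main() {
    // Per-thread register count (and static shared memory) are the resources the
    // core scheduler has to divide among resident warps.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, worker);

    // How many 128-thread blocks of this kernel fit on one SM given those limits.
    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, worker, 128, 0);

    printf("registers per thread: %d\n", attr.numRegs);
    printf("resident 128-thread blocks per SM: %d\n", blocks_per_sm);
    return 0;
}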
Here is an example: two thread blocks, each consisting of 32-thread warps, are dispatched to the core scheduler. At each cycle the core scheduler checks how many threads are executing in the SMs; once there is space for the per-thread state (register space, convergence stack, etc.) and enough free local shared memory, a warp is dispatched. Inside the SM there is one more scheduler that actively chooses a warp to execute from the active set of warps (this active set is a subset of the larger set of resident warps); each cycle one warp is selected and its instruction is fetched and decoded. Decoded instructions then land in another, small queue. Each cycle, yet another scheduler scans that queue, checks each warp instruction for outstanding dependences, and tries to find one that can be issued into the execution pipeline. Finally, once a warp has finished execution, its registers and branch stack are released, which allows another waiting warp to enter execution in its place.
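Below is a host-side toy model of that two-stage selection, not real hardware behavior: the data structures, the round-robin pick, and the operands_ready toggle are all invented, purely to show one warp being fetched and decoded per cycle while a second scheduler issues only dependence-free instructions from the small queue.

#include <cstdio>
#include <deque>
#include <vector>

struct WarpInst { int warp_id; bool operands_ready; };

int main() {
    std::vector<int> active_warps = {0, 1, 2, 3};   // warps resident on this SM
    std::deque<WarpInst> issue_queue;               // the small queue after decode
    size_t next = 0;

    for (int cycle = 0; cycle < 8; ++cycle) {
        // Stage 1: round-robin pick one active warp, fetch and decode its next instruction.
        int w = active_warps[next];
        next = (next + 1) % active_warps.size();
        issue_queue.push_back({w, /*operands_ready=*/cycle % 2 == 0});

        // Stage 2: scan the queue for an instruction whose dependences are satisfied
        // and issue it to the execution pipeline.
        for (auto it = issue_queue.begin(); it != issue_queue.end(); ++it) {
            if (it->operands_ready) {
                printf("cycle %d: issue warp %d\n", cycle, it->warp_id);
                issue_queue.erase(it);
                break;
            }
        }
    }
    return 0;
}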
Preemption
Now we allow concurrent execution of warps from different kernels (or even different processes) on a GPU core, but we still lack a time-sharing feature. For example, imagine you have a long-running background compute task but you still want to refresh your display every 1/60 s. You can handle this at the thread-block level: move the thread blocks of the render kernel to the front of the core scheduler's waitlist, so that as soon as the preceding thread blocks finish you dispatch warps from the render thread blocks rather than the compute ones. In case the running thread blocks take too long to execute, you can preempt at the instruction level instead; this is costly because it requires saving all of the current register context and other state, then swapping in the thread blocks of the render kernel. Once the rendering threads have finished executing, you restore the register state and resume execution of the previous threads as usual.
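On NVIDIA hardware the closest programmer-visible handle for this render-versus-compute case is stream priorities; the sketch below uses two made-up kernels (long_compute and render_frame) and gives the render stream a higher priority, so its pending thread blocks are preferred when slots free up. This corresponds to the thread-block-level scheme above, not to the costly instruction-level context save.

#include <cstdio>

__global__ void long_compute(float* data) { if (threadIdx.x == 0) data[blockIdx.x] = 0.f; }
__global__ void render_frame(float* fb)   { if (threadIdx.x == 0) fb[blockIdx.x]   = 1.f; }

int main() {
    // Lower numbers mean higher priority in CUDA's convention.
    int least_prio, greatest_prio;
    cudaDeviceGetStreamPriorityRange(&least_prio, &greatest_prio);

    cudaStream_t compute_stream, render_stream;
    cudaStreamCreateWithPriority(&compute_stream, cudaStreamNonBlocking, least_prio);
    cudaStreamCreateWithPriority(&render_stream,  cudaStreamNonBlocking, greatest_prio);

    float *data = nullptr, *fb = nullptr;
    cudaMalloc(&data, 1 << 20);
    cudaMalloc(&fb,   1 << 20);

    // The background kernel is queued first; when SM slots free up, pending thread
    // blocks of the higher-priority render kernel are dispatched ahead of the
    // remaining compute blocks.
    long_compute<<<1024, 256, 0, compute_stream>>>(data);
    render_frame<<<64, 256, 0, render_stream>>>(fb);

    cudaDeviceSynchronize();
    return 0;
}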
Apple gives 64 KB of scratchpad memory to each GPU core, and a thread block can use up to 32 KB of it. I think the purpose of this design is that at any cycle the core can keep at least two thread blocks active (if other resources are not the constraint). So when execution reaches the boundary of one thread block, the core scheduler can start dispatching threads from the next thread block if there are free slots in the SMs.
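In CUDA terms (shared memory standing in for the scratchpad; the 32 KB and 64 KB figures come from the paragraph above, not from Apple documentation), a kernel that statically claims 32 KB limits how many of its blocks can be co-resident, and the occupancy API reports that limit:

#include <cstdio>

// A kernel that statically claims 32 KB of on-chip scratchpad (shared memory).
__global__ void uses_32kb_scratchpad(float* out) {
    __shared__ float tile[32 * 1024 / sizeof(float)];   // 32 KB per thread block
    tile[threadIdx.x] = threadIdx.x;
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = tile[threadIdx.x];
}

int main() {
    // With a 64 KB scratchpad per core and 32 KB claimed per block, at most
    // 64 / 32 = 2 such blocks can be resident at once (ignoring other limits).
    int resident = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&resident, uses_32kb_scratchpad, 256, 0);
    printf("co-resident blocks with 32 KB of scratchpad each: %d\n", resident);
    return 0;
}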
So basically things happen like this: there is a waitlist inside each GPU core, and the top-level scheduler dispatches thread blocks to each GPU core. On every cycle the core scheduler checks whether it can dispatch a thread block from the waitlist (i.e. it checks scratchpad usage and free slots in the SMs). If all the resources can be reserved, it dispatches the block's warps into the SMs; otherwise it keeps waiting for slots to free up.
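Here is a toy host-side sketch of that waitlist loop; the data structures and numbers are invented (32 KB per block and a 64 KB core budget echo the figures above, the 8 free warp slots are an arbitrary assumption). Each simulated cycle, the head of the waitlist is dispatched only if both the remaining scratchpad and the free warp slots can cover it.

#include <cstdio>
#include <deque>

struct ThreadBlock { int id; int scratchpad_bytes; int warps; };

int main() {
    std::deque<ThreadBlock> waitlist = { {0, 32 * 1024, 2}, {1, 32 * 1024, 2}, {2, 32 * 1024, 2} };
    int free_scratchpad = 64 * 1024;   // per-core scratchpad budget
    int free_warp_slots = 8;           // assumed free warp slots across the core's SMs

    for (int cycle = 0; cycle < 4 && !waitlist.empty(); ++cycle) {
        ThreadBlock& tb = waitlist.front();
        if (tb.scratchpad_bytes <= free_scratchpad && tb.warps <= free_warp_slots) {
            // All resources can be reserved: dispatch the block's warps into the SMs.
            free_scratchpad -= tb.scratchpad_bytes;
            free_warp_slots -= tb.warps;
            printf("cycle %d: dispatched block %d\n", cycle, tb.id);
            waitlist.pop_front();
        } else {
            // Otherwise keep waiting until earlier blocks finish and release resources.
            printf("cycle %d: block %d waits for resources\n", cycle, tb.id);
        }
    }
    return 0;
}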
Author: phsdljpl67  Time: 2024-8-13 10:55

References:
1) Enabling Preemptive Multiprogramming on GPUs. https://upcommons.upc.edu/bitstream/handle/2117/26093/isca2014.pdf
2) NVIDIA's Fermi: The First Complete GPU Computing Architecture. https://www.nvidia.com/content/pdf/fermi_white_papers/p.glaskowsky_nvidia%27s_fermi-the_first_complete_gpu_architecture.pdf
3) NVIDIA Kepler GK110/GK210 Architecture Whitepaper. https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/tesla-product-literature/NVIDIA-Kepler-GK110-GK210-Architecture-Whitepaper.pdf
4) AnandTech: The NVIDIA GeForce GTX 1080 and 1070 Founders Edition Review. https://www.anandtech.com/show/10325/the-nvidia-geforce-gtx-1080-and-1070-founders-edition-review/10

Author: rzywipru79  Time: 2024-8-13 10:55
。◕‿◕。

Author: AoshuaFab  Time: 2024-8-13 10:55
13 just won't show up in threads like this.