On GPGPU core sharing and preemption

First, some terminology:
warp: similar to a SIMD thread on a CPU
threadblock: a collection of warps that is guaranteed to run on the same GPU core (its warps may be spread across that core's SMs)
SM: the minimum unit of execution.
GPU core: a collection of SMs (assumed to be 4 here).
Current GPU core designs have all converged on 4 SMs per GPU core. The SMs have independent schedulers and execution units, but they all share the same local data cache. At a high level, the top-level scheduler breaks a kernel down into multiple thread blocks, and each thread block is guaranteed to be scheduled onto a single GPU core. Once a thread block has been issued to a GPU core, it is up to the core scheduler to decide how its warps are distributed to the SMs. Nowadays, warps from different thread blocks of different kernels can all be scheduled onto the same GPU core to maximize utilization. So the core scheduler's job is to dispatch as many warps as possible to the SMs (to hide latency). That being said, the number of warps that can be scheduled onto an SM depends on the resources that have to be divided among the resident threads, e.g. register file size, branch convergence stack, etc.
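To make the resource limit concrete, here is a rough Python sketch of how the number of resident warps per SM could be estimated. All the numbers (register file size, convergence stack slots) and the names `SMLimits` and `max_resident_warps` are made up for illustration; they are not the specs of any real GPU.

```python
from dataclasses import dataclass


@dataclass
class SMLimits:
    # Illustrative assumptions only, not real hardware figures.
    register_file_entries: int = 16384  # 32-bit registers available in one SM
    convergence_stack_slots: int = 16   # assume one slot per resident warp
    warp_size: int = 32


def max_resident_warps(limits: SMLimits, regs_per_thread: int) -> int:
    """Return how many warps fit in one SM for a given per-thread register use."""
    regs_per_warp = regs_per_thread * limits.warp_size
    by_registers = limits.register_file_entries // regs_per_warp
    # The scarcest resource decides how many warps can be resident.
    return min(by_registers, limits.convergence_stack_slots)


limits = SMLimits()
light = max_resident_warps(limits, regs_per_thread=16)   # limited by stack slots
heavy = max_resident_warps(limits, regs_per_thread=128)  # limited by registers
print(light, heavy)  # 16 4
```

The point is just that a kernel using more registers per thread leaves room for fewer warps, so there is less latency hiding available.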
Here is an example: we have 2 thread blocks, each with a warp size of 32, dispatched to the core scheduler. Each cycle, the core scheduler checks how many threads are executing in each SM. Once an SM has room (register space, convergence stack, and the other state required to run a thread) and there is enough space in local shared memory, the warps are dispatched. Inside the SM there is one more scheduler that picks a warp to execute from the active set of warps (this active set is a subset of the larger set of running warps); each cycle, one warp is dispatched and decoded. Decoded instructions then enter another, smaller queue. Each cycle, a further scheduler checks the warp instructions in that queue for dependences and tries to find one to issue into the execution pipeline. Finally, once a warp has finished executing, its registers and branch stack are released, which allows another warp to take its place.
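The dependence check in that last small queue can be sketched roughly as follows. This is a simplified model, not how any particular GPU implements it: `WarpInstr` and `pick_issuable` are hypothetical names, registers are tracked per warp as `(warp_id, register)` pairs, and the queue is assumed to hold at most one instruction per warp so no instruction can overtake an older one from the same warp.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WarpInstr:
    warp_id: int
    dst: str      # destination register
    srcs: tuple   # source registers


def pick_issuable(queue: list, pending_writes: set) -> Optional[WarpInstr]:
    """Return the oldest queued instruction with no outstanding dependence."""
    for instr in queue:
        # Read-after-write: a source register is still being produced.
        reads_pending = any((instr.warp_id, r) in pending_writes for r in instr.srcs)
        # Write-after-write: the destination is still being written.
        writes_pending = (instr.warp_id, instr.dst) in pending_writes
        if not reads_pending and not writes_pending:
            return instr
    return None  # nothing can issue this cycle


# Warp 0 is still waiting for a load into r1, so warp 1 issues instead.
pending = {(0, "r1")}
queue = [WarpInstr(0, "r2", ("r1", "r3")), WarpInstr(1, "r2", ("r1", "r3"))]
chosen = pick_issuable(queue, pending)
print(chosen.warp_id)  # 1
```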
So far we allow warps from different kernels (or even processes) to execute concurrently in a GPU core, but we still lack time sharing. For example, imagine you have a long-running background computation task but would still like to refresh your display every 1/60 s. You can do this at the thread block level: move the thread blocks of the render kernel to the front of the core scheduler's waitlist, so that the next time the preceding thread blocks finish executing, the core dispatches warps from the render thread blocks rather than the computation ones. If the running thread blocks take too long to finish, you can preempt at the instruction level instead. This is costly, because it requires saving all the current register context and other state before swapping in the thread blocks of the render kernel. Once the render threads have finished executing, you restore the register state and resume the preempted threads as usual.
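Both options can be sketched in a few lines of Python. `enqueue` models the thread-block-level approach (urgent blocks jump to the front of the waitlist), and `context_bytes` gives a rough idea of how much register state instruction-level preemption has to save. The function names, the 4-byte register width, and the example numbers are my own assumptions for illustration.

```python
from collections import deque


def enqueue(waitlist: deque, block: str, urgent: bool = False) -> None:
    """Urgent (e.g. render) blocks go to the front; everything else to the back."""
    if urgent:
        waitlist.appendleft(block)
    else:
        waitlist.append(block)


def context_bytes(resident_warps: int, regs_per_thread: int, warp_size: int = 32) -> int:
    """Register state to save on instruction-level preemption (4-byte registers assumed)."""
    return resident_warps * warp_size * regs_per_thread * 4


waitlist = deque()
enqueue(waitlist, "compute-0")
enqueue(waitlist, "compute-1")
enqueue(waitlist, "render-0", urgent=True)
order = list(waitlist)
print(order)  # ['render-0', 'compute-0', 'compute-1']

# 16 resident warps with 32 registers per thread: 64 KiB of registers to save,
# before counting convergence stacks or scratchpad contents.
saved = context_bytes(resident_warps=16, regs_per_thread=32)
print(saved)  # 65536
```

Reordering the waitlist costs almost nothing but only takes effect when a running block finishes, while the save/restore path reacts immediately but has to move all of that state.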
Apple gives each GPU core 64 KB of scratchpad memory, and a thread block can use up to 32 KB of it. I think the purpose of this design is that the core can keep at least two thread blocks active at any cycle (if other resources are not a constraint). So when execution reaches the end of one thread block, the core scheduler can start dispatching threads from the next thread block if there are free slots in the SMs.
So basically things happen like this: there is a waitlist inside each GPU core, and the top-level scheduler dispatches thread blocks to each GPU core. Every cycle, the core scheduler checks whether it can dispatch thread blocks from the waitlist (i.e. it checks scratchpad usage and free slots in the SMs). If all the required resources are available, it dispatches the warps to the SMs; otherwise it keeps waiting for free slots.
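The per-cycle check can be sketched like this. It is a simplification: a block is admitted as a whole (real hardware may dispatch its warps piecemeal), and `ThreadBlock`, `dispatch_cycle`, and the 16 free warp slots are assumptions for the example. Only the 64 KB and 32 KB scratchpad figures come from the discussion above.

```python
from collections import deque
from dataclasses import dataclass

SCRATCHPAD_BYTES = 64 * 1024  # per-core scratchpad size quoted above


@dataclass
class ThreadBlock:
    name: str
    scratchpad: int  # bytes of scratchpad requested
    warps: int       # number of warps in the block


def dispatch_cycle(waitlist: deque, free_scratchpad: int, free_warp_slots: int):
    """Try to dispatch the block at the head of the waitlist for one cycle.

    Returns (dispatched block or None, remaining scratchpad, remaining warp slots).
    """
    if not waitlist:
        return None, free_scratchpad, free_warp_slots
    head = waitlist[0]
    if head.scratchpad <= free_scratchpad and head.warps <= free_warp_slots:
        waitlist.popleft()
        return head, free_scratchpad - head.scratchpad, free_warp_slots - head.warps
    return None, free_scratchpad, free_warp_slots  # keep waiting


# Three blocks that each want the 32 KB maximum: only two fit at once.
waitlist = deque(ThreadBlock(n, 32 * 1024, 4) for n in "ABC")
scratch, slots = SCRATCHPAD_BYTES, 16
dispatched = []
for _ in range(3):
    block, scratch, slots = dispatch_cycle(waitlist, scratch, slots)
    if block:
        dispatched.append(block.name)
print(dispatched, scratch, slots)  # ['A', 'B'] 0 8
```

Block C stays on the waitlist until A or B finishes and gives its scratchpad back, which matches the "at least two thread blocks active" reading of the 64 KB / 32 KB split.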



phsdljpl67 (original poster) 2024-8-13 10:55:14
1) https://upcommons.upc.edu/bitstream/handle/2117/26093/isca2014.pdf Enabling Preemptive Multiprogramming on GPUs
2) https://www.nvidia.com/content/pdf/fermi_white_papers/p.glaskowsky_nvidia%27s_fermi-the_first_complete_gpu_architecture.pdf NVIDIA’s Fermi: The First Complete GPU Computing Architecture
3) https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/tesla-product-literature/NVIDIA-Kepler-GK110-GK210-Architecture-Whitepaper.pdf NVIDIA Kepler GK110 Architecture Whitepaper
4) https://www.anandtech.com/show/10325/the-nvidia-geforce-gtx-1080-and-1070-founders-edition-review/10

AoshuaFab 2024-8-13 10:55:57
[Image: On GPGPU core sharing and preemption-1.png]

Thomasdus 2024-8-13 10:56:43
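SM (Streaming Multiprocessor): the minimum unit of execution.
Current GPU core designs tend toward 4 SMs per GPU core. These SMs have independent schedulers and execution units, but they share the same local data cache. At a high level, the top-level scheduler breaks a kernel down into multiple thread blocks, and these thread blocks are assigned to a GPU core for scheduling. Once a thread block has been sent to a GPU core, that core's scheduler decides how to distribute the warps to the SMs. Nowadays, different warps from different thread blocks of different kernels can all be scheduled onto the same GPU core to achieve maximum utilization. The core scheduler's job is therefore to dispatch as many warps as possible to the SMs (to hide latency). However, the number of warps that can be scheduled onto one SM depends on the resources that can be shared by different threads, such as register file size, the branch convergence stack, and so on.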