GPU Computing and Beyond
(This blog is based on a talk delivered at IIT Bombay by Prof. Adwait Jog, an associate professor at the School of Engineering and Applied Sciences (SEAS), University of Virginia)
Most of us have heard the term GPU in some way or other. Some might have used this technology for editing, high-end graphics, crypto, gaming, Deep Learning, AI/ML, cloud computing, etc. Can we describe the fields in which GPUs are used in a more general way? That is, is there something common to all these applications? Let’s explore it.
GPUs, or Graphics Processing Units, are used in domains that involve some sort of “parallelism”. Broadly, there are 2 levels of parallelism -
- Data level parallelism : a single instruction is processed across many different data elements simultaneously (a minimal code sketch follows this list)
- Task level parallelism : different instructions/tasks are run concurrently (as in a GPU)
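Here is a minimal CUDA sketch of data-level parallelism: the kernel body below is a single instruction stream, but it is executed by thousands of threads, each on a different array element, at the same time. The kernel and array names are purely illustrative.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread applies the same operation to a different element of the array.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique element index per thread
    if (i < n)
        data[i] *= factor;                          // same instruction, different data
}

int main() {
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);    // launch ~1M threads in parallel
    cudaDeviceSynchronize();
    cudaFree(d);
    printf("done\n");
    return 0;
}
```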
GPUs are used in applications that have abundant data-level along with task-level parallelism. You might have noticed a fascinating gain in efficiency if you have ever used a GPU for such applications; common examples are playing a graphics-heavy game or training a neural network.
Ever wondered what is so special about GPUs compared to CPUs?
Because of their energy efficiency and high throughput, the demand for desktop GPUs in the market is increasing at a much higher rate than for CPUs. Following this trend, companies like Apple are relying more on the GPU, devoting more chip area to it and increasing the number of GPU cores.
If GPU > CPU, why don’t we replace all CPUs with GPUs?
There are always some trade-offs, and such is the case with GPUs as well. Primarily, GPUs face these challenges:
Scalability
We know that GPUs are quite efficient at computation, but the memory side has limitations of its own. GPUs can raise throughput to a great extent, but latency is a critical point, and memory bandwidth is not able to scale to the extent needed to support the compute, leading to a “compute-memory gap”. This is why we study caches and prefetching techniques to support memory management. Is there any solution to this problem?
- Approximate/low-precision computing : by reducing the precision of operations, we can perform more work for AI applications that may not require exact precision, although this only works up to a tolerable level of error (a small sketch follows this list)
- Specialized GPUs : designed for specific domains
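As a rough illustration of the low-precision idea, the sketch below stores data as 16-bit floats so that each element moves half the bytes through memory, while the arithmetic is still done in 32-bit. This assumes a GPU and CUDA toolkit with `__half` support (cuda_fp16.h); the kernel and its names are illustrative, not taken from any specific system.

```cuda
#include <cstdio>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// y = a*x + y with half-precision storage: half the memory traffic per element,
// at the cost of reduced precision in the stored values.
__global__ void axpy_fp16(const __half *x, __half *y, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float yi = a * __half2float(x[i]) + __half2float(y[i]);  // compute in FP32
        y[i] = __float2half(yi);                                 // store back in FP16
    }
}

int main() {
    const int n = 1 << 20;
    __half *x, *y;
    cudaMalloc(&x, n * sizeof(__half));
    cudaMalloc(&y, n * sizeof(__half));
    cudaMemset(x, 0, n * sizeof(__half));
    cudaMemset(y, 0, n * sizeof(__half));
    axpy_fp16<<<(n + 255) / 256, 256>>>(x, y, 2.0f, n);
    cudaDeviceSynchronize();
    cudaFree(x); cudaFree(y);
    printf("fp16 axpy done\n");
    return 0;
}
```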
Another scalability problem is the die-size limitation in GPU manufacturing.
Security & Reliability at Low Overhead
GPUs are prone to attacks and can leak information while processing. We will go through one such type of attack shortly. We will focus on this challenge in detail, starting with a “Hello World” introduction to GPU architecture -
GPU Architecture
Let’s begin with a comparison to the CPU. Each core in a CPU is optimized for latency, and therefore CPUs are great for sequential code. A GPU, on the other hand, has many cores that are optimized for throughput.
A GPU focuses more on the total amount of work done than on any single thread
CPUs use a memory hierarchy of caches to improve latency: frequently accessed data is stored in the caches so it can be accessed faster.
Do you think GPUs need caches as much as CPUs do?
As we have seen, a GPU does not focus on a single thread, so speeding up one thread does not help much. Also, because of the parallelism, there are many threads running together, which can hide the latency of a slow memory access. Hence, caches are not as necessary in GPUs as in CPUs; they are still used, but mainly because they help with the memory bandwidth problem (not latency) by saving bandwidth when data can be reused by multiple threads.
Okay, and what about prefetching?
Prefetching refers to fetching additional addresses into the cache along with the current one, so that if they are accessed in the future the data is already there. But in a GPU, since the bandwidth is already saturated, prefetching does not bring much benefit. Moreover, issues like poor accuracy and cache pollution can waste even more bandwidth, and a prefetcher is also quite expensive in terms of silicon area on the chip.
I wonder, then: what are the significant features and optimizations used in GPUs?
A GPU has many Streaming Multiprocessors (SMs) interconnected with each other, and its memory is also different from a CPU’s: GPU memory needs high bandwidth so that every running thread can use it. Since there are many threads, they are grouped into warps by a warp scheduler for processing, and caches are present at different levels. As in CPUs, there is a special structure called the MSHR (Miss-Status Holding Register) whose responsibility is to cut down redundant requests going from multiple threads to memory, for example when many threads ask for the same data/cache location.
Suppose a warp generates 4 memory requests and two of them refer to the exact same memory block. The coalescing logic sees the common block and merges those 2 requests together, so the number of requests going to memory drops from 4 to 3. That is why the MSHR and coalescing unit are present, and this technique is referred to as Memory Access Coalescing in GPUs.
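To make the idea concrete, here is a small CUDA sketch (not the hardware mechanism itself) contrasting an access pattern that coalesces well with one that does not. In the first kernel, the 32 threads of a warp read 32 consecutive words, so the hardware can merge them into very few memory requests; in the second, neighbouring threads jump `stride` elements apart, so far fewer requests can be merged and bandwidth is wasted. The kernel names and the stride value are illustrative.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Neighbouring threads touch neighbouring addresses: requests coalesce well.
__global__ void coalesced_copy(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Neighbouring threads touch scattered addresses: little coalescing, more requests.
__global__ void strided_copy(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int j = (i * stride) % n;
        out[j] = in[j];
    }
}

int main() {
    const int n = 1 << 22;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    coalesced_copy<<<(n + 255) / 256, 256>>>(in, out, n);
    strided_copy<<<(n + 255) / 256, 256>>>(in, out, n, 32);
    cudaDeviceSynchronize();
    cudaFree(in); cudaFree(out);
    printf("copies done\n");
    return 0;
}
```

Profiling these two kernels (for example, with the memory-transaction counters in a GPU profiler) would show many more memory requests for the strided version, which is exactly the effect coalescing is designed to avoid.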
Although coalescing saves memory bandwidth, it introduces an interesting security vulnerability… 😯
Correlation Timing Attack Model
Consider a setting where you are an attacker sending plaintext to a remote server (which uses a GPU). The server encrypts the data using a secret key and returns it to you. You also know that it uses the coalescing mechanism to save time. Is it possible for you to figure out the secret key?
Well, the answer is yes! The attack uses last-round execution information to recover the AES key.
The encryption kernel runs in parallel on several threads. The attacker collects many execution-time measurements across threads and makes guesses for the key bytes. For each guess, the attacker calculates the number of coalesced accesses that guess would generate, while the measured execution times (produced by the real key) are already known. The attacker then computes the correlation between the two, and it turns out that the correlation is highest for the real key! In this way, you can recover the correct key through timing analysis, simply because the GPU performs this bandwidth-saving optimization.
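The statistical core of the attack can be sketched in a few lines of host-side code: for each candidate key byte, predict how many coalesced accesses it would generate for each observed ciphertext, compute the correlation between those predictions and the measured execution times, and pick the guess with the highest correlation. Everything below, in particular the predicted_accesses() model and the toy measurements, is illustrative and not the actual attack code.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>

// Hypothetical leakage model: how many memory accesses a guessed key byte
// would cause for a given ciphertext byte (a stand-in for the real AES model).
static int predicted_accesses(unsigned char guess, unsigned char ciphertext) {
    return __builtin_popcount(guess ^ ciphertext) + 1;
}

// Pearson correlation coefficient between two equal-length series.
static double pearson(const std::vector<double> &x, const std::vector<double> &y) {
    double n = (double)x.size(), sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (size_t i = 0; i < x.size(); ++i) {
        sx += x[i]; sy += y[i];
        sxx += x[i] * x[i]; syy += y[i] * y[i]; sxy += x[i] * y[i];
    }
    double cov = sxy - sx * sy / n;
    double vx  = sxx - sx * sx / n, vy = syy - sy * sy / n;
    return cov / std::sqrt(vx * vy);
}

int main() {
    // Toy (ciphertext byte, execution time) pairs collected by the attacker.
    std::vector<unsigned char> ct = { 0x3a, 0x91, 0x5c, 0xe7, 0x08, 0x4d };
    std::vector<double> time_us  = { 11.2, 13.9, 12.4, 14.8, 10.7, 12.1 };

    int best_guess = 0;
    double best_corr = -2.0;
    for (int guess = 0; guess < 256; ++guess) {
        std::vector<double> pred;
        for (unsigned char c : ct)
            pred.push_back(predicted_accesses((unsigned char)guess, c));
        double r = pearson(pred, time_us);
        if (r > best_corr) { best_corr = r; best_guess = guess; }
    }
    printf("best key-byte guess: 0x%02x (correlation %.3f)\n", best_guess, best_corr);
    return 0;
}
```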
Seems like quite a threat, right? Can we mitigate these timing attacks on GPUs?
- Naive solution : just remove the whole concept of coalescing from the GPU. It’s simple, you see. As soon as this optimization is removed, the correlation drops to ~0 and it is no longer possible to distinguish the correct key from incorrect ones. But a trade-off is now established: disabling coalescing leads to up to 178% performance degradation :(
- State-of-the-art solutions : so the naive solution is not feasible; a 178% performance loss is horrible. Other solutions include randomized coalescing (which incurs high performance overhead and is still vulnerable through the caches and MSHRs) and software solutions (which address leakage only for the AES application). We need a performance-efficient, general solution; these aren’t enough yet.
Key insight : Redundant Data Management => Bucketing for Coalesced Accesses!
BCoal : Bucketing-based Memory Coalescing for Efficient and Secure GPUs
This solution was proposed by Prof. Adwait Jog and his team at HPCA 2020. You can read about it in detail here. In this solution, we try to reduce the correlation between coalesced accesses and execution time by reducing the variance. If the variance is high, information can be leaked by relating the number of requests sent to the time taken, and the attacker can exploit that correlation. If, however, we always send a constant number of requests to memory, the execution time also becomes constant, which removes the variance as well as the correlation. That is, we define a set of quantized bucket sizes for the number of accesses and pad each warp’s memory accesses up to the nearest bucket.
Using fewer buckets is best for security, but it leads to performance overhead due to more padding. On the other hand, more buckets have lower performance overhead but weaker security, because the remaining differences in timing still create some variance. So, have we arrived at yet another trade-off?
Well, selecting the bucket configuration is an important task for deciding the balance between performance and security. A single bucket of size 16 (known as BCoal(16)) degrades performance but provides the best security. On the other hand, results have shown that using two buckets of sizes 1 and 16 (known as BCoal(1,16)) provides good security along with appreciable performance at the same time. It generates less DRAM traffic, which leads to a near-optimal number of coalesced accesses, and it also secures the MSHRs and caches. Thus, BCoal is performance efficient.
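A rough sketch of the bucketing policy is shown below: whatever number of coalesced accesses a warp really needs, the memory system only ever observes one of a few fixed bucket sizes, with padded (dummy) accesses making up the difference. The bucket sets mirror the BCoal(16) and BCoal(1,16) configurations mentioned above; the code is purely illustrative, not the hardware implementation from the paper.

```cuda
#include <cstdio>
#include <vector>

// Round a warp's real access count up to the nearest allowed bucket size.
static int bucketize(int real_accesses, const std::vector<int> &buckets) {
    for (int b : buckets)          // buckets are assumed sorted in ascending order
        if (real_accesses <= b)
            return b;              // issue b requests: the real ones plus padding
    return real_accesses;          // larger than every bucket: illustrative fallback
}

int main() {
    std::vector<int> bcoal16   = { 16 };     // single bucket: best security, most padding
    std::vector<int> bcoal1_16 = { 1, 16 };  // two buckets: good security, less padding

    int warp_accesses[] = { 1, 3, 7, 12, 16 };
    for (int a : warp_accesses) {
        int b16  = bucketize(a, bcoal16);
        int b116 = bucketize(a, bcoal1_16);
        printf("real=%2d  BCoal(16) issues %2d (pad %2d)   BCoal(1,16) issues %2d (pad %2d)\n",
               a, b16, b16 - a, b116, b116 - a);
    }
    return 0;
}
```

With BCoal(16), every warp looks identical to the memory system; with BCoal(1,16), fully coalesced warps (a single access) skip the padding entirely, which is where much of the performance comes back.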
Reliability Research on GPUs
GPUs are associated with various errors as well. These include soft errors, such as bit flips caused by high-energy radioactive particles, as well as permanent faults. Their impact can be critical, leading to system crashes, Silent Data Corruption (SDC), or incorrect output. In critical applications, the outcome of such faults can be dangerous, and those applications can be severely impacted.
Some of the protection mechanisms in use include Error Checking and Correction (ECC), duplication/triplication, selective re-computation, checkpointing, etc. ECC can handle single/double-bit error correction and also protects register files and memory. However, protection against multi-bit errors is prohibitively expensive in GPUs, and current protection techniques are either not enough (they still lead to high error rates) or too expensive :(
Key insight : Redundant Data Management : Not all memory faults are the same!
A solution to this problem, based on controlled data replication and called “Data-centric reliability management in GPUs”, was published by Prof. Adwait and his team. Instead of protecting each and every chunk of memory, we protect only a fraction of it. The important questions that arise are: which fraction of memory, and how do we protect it?
Highly Accessed => Hot Memory 🥵 <= Highly Shared
When we analyze application access patterns, we find that a small fraction of memory is highly accessed and, similarly, another fraction is highly shared. Together, we term these regions “hot memory”. If there is any fault in the application, its impact is much higher in these regions than in the rest of memory. In other words, the output is highly sensitive to faults in hot memory! Identifying such regions is therefore important, and it is done through source-code profiling, i.e. analyzing the data objects in the source code and finding the hot ones. This can be automated or performed offline once. On identifying them, we notice that hot memory has a very small memory footprint.
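A rough sketch of this profiling step, assuming we have an access trace telling us which data object each access touched and from which thread, might look like this. The object names and thresholds are made up for illustration; the actual work profiles real source-level data objects.

```cuda
#include <cstdio>
#include <map>
#include <set>
#include <string>
#include <vector>

struct Access { std::string object; int thread_id; };

int main() {
    // Toy access trace: which object was touched, and by which thread.
    std::vector<Access> trace = {
        {"weights", 0}, {"weights", 1}, {"weights", 2}, {"weights", 3},
        {"weights", 0}, {"bias", 0}, {"scratch", 5}, {"weights", 7},
    };

    std::map<std::string, int> count;              // how often each object is accessed
    std::map<std::string, std::set<int>> sharers;  // how many distinct threads touch it
    for (const auto &a : trace) {
        count[a.object]++;
        sharers[a.object].insert(a.thread_id);
    }

    const int access_threshold = 4, sharer_threshold = 4;  // illustrative cut-offs
    for (const auto &kv : count) {
        bool hot = kv.second >= access_threshold ||
                   (int)sharers[kv.first].size() >= sharer_threshold;
        printf("%-8s accesses=%d sharers=%zu -> %s\n", kv.first.c_str(),
               kv.second, sharers[kv.first].size(), hot ? "HOT (protect)" : "cold");
    }
    return 0;
}
```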
Protection of Hot memory and its Evaluation
The idea is a simple but rather effective replication mechanism in which hot memory is replicated for detection or correction purposes. This approach is adopted because the GPU’s latency tolerance helps keep the replication overheads low. You can read more details about the protection of hot memory here. Results show that protecting hot memory yields a 98.97% drop in SDC outcomes, and that too at a low performance overhead from the data replication.
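A minimal CUDA sketch of duplication-for-detection on hot data is shown below: hot data is written twice (primary plus replica), and every read compares the two copies so that a bit flip in either one is at least detected. This is an illustration of the general idea, not the mechanism from the paper, and all names are made up.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Write hot data into both the primary copy and the replica.
__global__ void write_replicated(float *primary, float *replica, const float *src, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        primary[i] = src[i];
        replica[i] = src[i];
    }
}

// Read hot data and flag a fault if the two copies disagree.
__global__ void read_checked(const float *primary, const float *replica,
                             float *dst, int *fault_flag, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float a = primary[i], b = replica[i];
        if (a != b)
            atomicExch(fault_flag, 1);   // mismatch: a soft error corrupted one copy
        dst[i] = a;
    }
}

int main() {
    const int n = 1 << 16;
    float *src, *pri, *rep, *dst; int *flag;
    cudaMalloc(&src, n * sizeof(float)); cudaMalloc(&pri, n * sizeof(float));
    cudaMalloc(&rep, n * sizeof(float)); cudaMalloc(&dst, n * sizeof(float));
    cudaMalloc(&flag, sizeof(int));
    cudaMemset(src, 0, n * sizeof(float));
    cudaMemset(flag, 0, sizeof(int));

    write_replicated<<<(n + 255) / 256, 256>>>(pri, rep, src, n);
    read_checked<<<(n + 255) / 256, 256>>>(pri, rep, dst, flag, n);

    int h_flag = 0;
    cudaMemcpy(&h_flag, flag, sizeof(int), cudaMemcpyDeviceToHost);
    printf("fault detected: %s\n", h_flag ? "yes" : "no");
    cudaFree(src); cudaFree(pri); cudaFree(rep); cudaFree(dst); cudaFree(flag);
    return 0;
}
```

Because GPUs already tolerate memory latency well, the extra writes for the replica largely overlap with other work, which is why protecting only the small hot region keeps the overhead low.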
Key Takeaway Points
- Redundant data management helps in improving security and reliability at low overhead. We saw two examples demonstrating its application: the correlation timing attack (on the security side) and memory faults (on the reliability side)
- Although GPUs are better in performance than CPUs, they face scalability challenges
- Along with scalability, security & reliability issues are also critical and need proper attention
- Prof. Adwait’s group focuses on a “Holistic” approach to improve efficiency as well as reliability & security at the same time
A lot of innovative research is happening in this domain of GPUs, and future work includes going beyond the GPU. Is there even something beyond this technology? Yes: consider other accelerators, such as quantum and superconducting architectures, which have the potential to bring about significant changes to Computer Science!