What chip does ChatGPT need?

Generative models, led by ChatGPT, have recently become the new hotspot in AI. In Silicon Valley, Microsoft and Google have invested heavily in the technology (Microsoft has put $10 billion into OpenAI, the company behind ChatGPT, and Google recently released its own Bard model), while Chinese internet companies such as Baidu have announced that they are developing similar models and will launch them in the near future.

Generative models such as ChatGPT share a common trait: they are pre-trained on massive amounts of data and are usually paired with a fairly powerful language model. The language model's main job is to learn from a huge existing corpus; once trained, it can understand a user's instructions in natural language and, going further, generate relevant text output in response to those instructions.

Generative models can be roughly divided into two categories: language generation models and image generation models. ChatGPT is the flagship language generation model. As noted above, after training on massive data its language model can both understand the meaning of a user's instruction (for example, "write a poem in the style of Li Bai") and generate the corresponding text (in this example, an actual poem in Li Bai's style). This means ChatGPT needs a large language model (LLM) big enough both to understand the user's input and to produce high-quality language output: the model has to know how to write poetry in general, and how to write poetry in Li Bai's style in particular. It also means the LLM needs an enormous number of parameters to carry out such complex learning and retain so much information. ChatGPT's underlying model has 175 billion parameters (which occupy 700 GB when stored as standard floating-point numbers), which shows just how "large" these language models are.
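
As a quick sanity check on that 700 GB figure, a few lines of Python show how weight storage scales with parameter count and numeric precision. The 175-billion parameter count is from the text; the precision options are illustrative:

```python
# Back-of-the-envelope weight storage for a 175B-parameter model.
# Parameter count is from the text; precisions are illustrative choices.
PARAMS = 175e9

for precision, bytes_per_param in [("FP32", 4), ("FP16/BF16", 2), ("INT8", 1)]:
    gigabytes = PARAMS * bytes_per_param / 1e9
    print(f"{precision:9s}: {gigabytes:,.0f} GB for the weights alone")

# FP32     : 700 GB for the weights alone
# FP16/BF16: 350 GB for the weights alone
# INT8     : 175 GB for the weights alone
```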

The other category is image generation models, represented by diffusion models. Typical examples include DALL-E from OpenAI, Imagen from Google, and the hugely popular Stable Diffusion (a collaboration between CompVis, Runway, and Stability AI). These models also use a language model to understand the user's instruction and then generate a high-quality image from it. Unlike in the language case, the language model here only needs to understand the user's input, not generate language output, so it can be far smaller (on the order of hundreds of millions of parameters). The image diffusion model itself is also comparatively small, generally a few billion parameters, but its compute load is anything but small, because the generated images or videos can have very high resolution.

Trained on massive data, generative models can produce output of unprecedented quality. They already have several clear application markets, including search, chatbots, and image generation and editing, and more applications are expected to follow, which in turn places new requirements on the underlying chips.

What generative models require from chips
As mentioned earlier, generative models like ChatGPT must learn from massive training data to achieve high-quality output. To support efficient training and inference, they place their own demands on the chips that run them.

The first is distributed computing. Language generation models like ChatGPT have hundreds of billions of parameters, so training and inference on a single machine are essentially impossible; large-scale distributed computing is mandatory. Distributed workloads of this kind place heavy demands on the interconnect bandwidth between machines and between compute chips (hence technologies such as RDMA), because the bottleneck is often not the computation itself but data movement. At this scale, how efficiently a chip supports distributed computing becomes the key.
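
To make the data-movement problem concrete, here is a minimal sketch estimating the per-step gradient traffic of data-parallel training under ring all-reduce. The model size, worker count, and link bandwidth are illustrative assumptions, not measurements:

```python
# Rough estimate of interconnect traffic for one data-parallel training step.
# Ring all-reduce moves ~2*(N-1)/N * model_bytes per worker per step.
# Model size, worker count, and link bandwidth are illustrative assumptions.

def allreduce_seconds(param_count, bytes_per_param, workers, link_gbytes_per_s):
    model_bytes = param_count * bytes_per_param
    traffic = 2 * (workers - 1) / workers * model_bytes  # bytes moved per worker
    return traffic / (link_gbytes_per_s * 1e9)

# 175B parameters with FP16 gradients, 64 workers, 25 GB/s effective RDMA link
t = allreduce_seconds(175e9, 2, 64, 25)
print(f"gradient all-reduce alone: ~{t:.1f} s per step")  # ~27.6 s per step
```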

The second is memory capacity and bandwidth. Even though distributed training and inference are unavoidable for language generation models, each chip's local memory capacity and bandwidth still largely determine its execution efficiency, because every chip's memory is pushed to the limit. An image generation model (roughly 20 GB) can currently fit into a single chip's memory, but as these models evolve, their memory demands are likely to grow further. From this standpoint, ultra-high-bandwidth memory, represented by HBM, becomes the natural choice for the relevant accelerator chips, and generative models will in turn accelerate the push toward HBM with larger capacity and higher bandwidth. Beyond HBM, emerging technologies such as CXL, together with software optimization, will also increase local memory capacity and effective performance in these applications, and the rise of generative models should drive their industrial adoption.
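
A rough way to see why bandwidth dominates: in autoregressive generation, every new token must stream essentially all of the weights from memory once, so per-token latency on a single device (hypothetically, if the model fit) is bounded below by model size divided by memory bandwidth. The sketch below uses the 175-billion-parameter figure from the text; the bandwidth numbers are illustrative assumptions, not vendor specifications:

```python
# Lower bound on single-device latency per generated token:
# each token reads ~all weights once, so t >= model_bytes / bandwidth.
# Bandwidth figures are illustrative assumptions, not vendor specifications.
MODEL_BYTES = 175e9 * 2  # 175B parameters stored as FP16

for device, tb_per_s in [("HBM-class (2 TB/s)", 2.0), ("GDDR-class (0.5 TB/s)", 0.5)]:
    ms_per_token = MODEL_BYTES / (tb_per_s * 1e12) * 1e3
    print(f"{device:22s}: >= {ms_per_token:.0f} ms per token")

# HBM-class (2 TB/s)    : >= 175 ms per token
# GDDR-class (0.5 TB/s) : >= 700 ms per token
```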

Finally, compute. Both language and image generation models need enormous computing power, and for image generation models the demand may grow sharply as output resolution rises and the field moves toward video. Today's mainstream image generation models need on the order of 20 TFLOPs of computation; as they move to higher resolutions and video, 100-1000 TFLOPs is likely to become the norm.
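
The scaling is easy to illustrate: convolution FLOPs grow roughly with pixel count, and a diffusion model runs its denoising network once per step. The sketch below reads the 20 TFLOPs figure quoted above as the cost of generating one 512x512 image; that interpretation, and the split into per-step cost and step count, are assumptions:

```python
# How diffusion compute scales with resolution: per-step FLOPs grow roughly
# with pixel count, multiplied by the number of denoising steps.
# The 0.4 TFLOPs/step x 50 steps baseline (= 20 TFLOPs at 512x512) is assumed.
BASE_RES = 512
BASE_TFLOPS_PER_STEP = 0.4
STEPS = 50

for res in (512, 1024, 2048):
    pixel_ratio = (res / BASE_RES) ** 2
    total = BASE_TFLOPS_PER_STEP * pixel_ratio * STEPS
    print(f"{res}x{res}: ~{total:,.0f} TFLOPs per image")

# 512x512:   ~20 TFLOPs per image
# 1024x1024: ~80 TFLOPs per image
# 2048x2048: ~320 TFLOPs per image
```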

To sum up, we believe generative models' requirements on chips span distributed computing, memory, and compute, touching virtually every aspect of chip design. More important still is how to combine these requirements in a balanced way so that no single aspect becomes the bottleneck, which makes this a problem of chip design as systems engineering.

ChatGPT and new AI chips: who has the bigger opportunity?

Generative models create new demand for chips. Between GPUs (represented by Nvidia and AMD) and the new AI chips (represented by Habana and Graphcore), who is better placed to capture this new demand and market?

First, consider language generation models. Given their enormous parameter counts and the need for strong distributed-computing support, GPU vendors with a mature ecosystem in this area hold the advantage. This is a systems-engineering problem that demands a complete hardware-plus-software solution. Here Nvidia has paired its GPUs with its Triton solution: Triton supports distributed inference and can split a model into multiple parts that run on different GPUs, solving the problem that a single GPU's memory cannot hold all the parameters. Going forward, whether teams use Triton directly or build further on top of it, a GPU platform with a complete ecosystem will make this far easier. On the compute side, the core workload of language generation models is matrix computation, which is precisely the GPU's strength, so from this angle the new AI chips hold no obvious advantage over GPUs.
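
The model-splitting idea itself is simple to sketch in plain PyTorch: put half of a transformer stack on each of two GPUs and pass activations between them. This is an illustrative pipeline split assuming two CUDA devices and arbitrary layer sizes, not Nvidia's actual Triton implementation:

```python
# Illustrative pipeline split: half the layers on each of two GPUs, with
# activations crossing the interconnect between stages. Assumes two CUDA
# devices; layer sizes are arbitrary. Not Nvidia's Triton implementation.
import torch
import torch.nn as nn

D_MODEL, LAYERS_PER_GPU = 1024, 12

def block():
    return nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=16, batch_first=True)

stage0 = nn.Sequential(*[block() for _ in range(LAYERS_PER_GPU)]).to("cuda:0")
stage1 = nn.Sequential(*[block() for _ in range(LAYERS_PER_GPU)]).to("cuda:1")

def forward(tokens):
    hidden = stage0(tokens.to("cuda:0"))
    return stage1(hidden.to("cuda:1"))  # activations hop across the interconnect

out = forward(torch.randn(1, 128, D_MODEL))  # (batch, sequence, d_model)
print(out.shape, out.device)
```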

Image generation models also have large parameter counts, but one to two orders of magnitude smaller than those of language generation models, and their computation relies heavily on convolution. AI chips may therefore have a real opportunity in inference applications, provided they are optimized very well for them. The optimization includes enough on-chip storage to hold the parameters and intermediate results, plus efficient support for both convolution and matrix operations.
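
To give a sense of the convolution load these chips must handle, here is the FLOP count of a single 3x3 convolution layer at high resolution; the layer shape is an illustrative assumption:

```python
# FLOPs of one 3x3 convolution layer at high resolution (2 FLOPs per MAC).
# Feature-map size and channel counts are illustrative assumptions.
h, w = 1024, 1024   # feature-map resolution
c_in = c_out = 256  # input/output channels
k = 3               # kernel size

flops = 2 * h * w * c_in * c_out * k * k
print(f"one {k}x{k} conv at {h}x{w}: ~{flops / 1e12:.1f} TFLOPs")  # ~1.2 TFLOPs
```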

Overall, today's AI chips were mainly designed for much smaller models (around 100 million parameters and roughly 1 TOPS of computation), while generative models demand far more than those original design targets. GPUs trade some efficiency for flexibility in their design; AI chips go the other way and pursue efficiency on their target applications. We therefore expect GPUs to keep the lead in accelerating generative models over the next year or two. But as generative model architectures stabilize and AI chip designs have time to catch up with the models' iteration, AI chips will have the opportunity to surpass GPUs in this field on efficiency.