What role do chip processors play in deep learning?

Last Update Time: 2019-07-16 10:47:30

With the wide application of AI, deep learning has become the mainstream approach in AI research and applications. Faced with massively parallel computation over huge volumes of data, AI's demand for computing power keeps growing, placing ever higher requirements on the computing speed and power consumption of hardware.

At present, in addition to the general-purpose CPU, hardware-accelerated chip processors such as the GPU, NPU, and FPGA each play to their own strengths in different deep learning applications. But how do they compare?

Taking face recognition as an example, the basic processing pipeline and the computing power required by each functional module are summarized in the figure below:

[Figure: basic face recognition flow and the computing requirements of each functional module]

This is why the NPU, a dedicated chip with small size, low power consumption, high computing performance, and high computing efficiency, was born.

 

NPU

NPU (Neural network Processing Unit): the neural network processing unit. The NPU works by simulating human neurons and synapses at the circuit level and processing large numbers of neurons and synapses directly with a deep learning instruction set, in which a single instruction completes the processing of a group of neurons. Compared with the CPU and GPU, the NPU integrates storage and computation through synaptic weights, which improves operational efficiency.

 

The NPU is built by imitating biological neural networks. Where the CPU and GPU need thousands of instructions to process a set of neurons, the NPU can do it with one or a few, so its advantage in deep learning processing efficiency is obvious.

 

Experimental results show that, at the same power consumption, the NPU delivers 118 times the performance of the GPU.

 

Like the GPU, the NPU still needs the CPU's cooperation to complete specific tasks. Below, let's look at how the GPU and the NPU work together with the CPU.

 

GPU acceleration

At present, the GPU only performs the simple, highly parallel matrix multiply-and-add operations; constructing the neural network model and managing the flow of data are still done on the CPU.

 

The CPU loads the weight data and constructs the neural network model according to the code; it then hands the matrix operation of each layer to the GPU through a library interface such as CUDA or OpenCL for parallel computation and receives the result. The CPU next schedules the matrix computation for the following layer of neurons, and so on, until the output layer of the neural network has been computed and the final result is obtained.
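
As a minimal sketch of this division of labor (not code from the article), the CUDA program below has the CPU build a toy multi-layer network and drive the layer-by-layer loop, while the GPU only executes each layer's matrix multiply-and-add. The layer sizes, batch size, weight values, and the ReLU activation are illustrative assumptions.

#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// GPU side: each thread computes one element of C = relu(A * B).
// This is the "simple parallel matrix multiply-and-add" that the CPU
// offloads for each layer.
__global__ void matmul_relu(const float* A, const float* B, float* C,
                            int M, int K, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];
        C[row * N + col] = fmaxf(acc, 0.0f);  // ReLU activation (illustrative)
    }
}

int main() {
    // CPU side: construct a toy network (layer sizes are assumptions).
    const int batch = 32;
    std::vector<int> dims = {784, 256, 128, 10};

    // CPU loads the weight data (constant values here for brevity).
    std::vector<std::vector<float>> weights;
    for (size_t l = 0; l + 1 < dims.size(); ++l)
        weights.emplace_back((size_t)dims[l] * dims[l + 1], 0.01f);
    std::vector<float> input((size_t)batch * dims[0], 1.0f);

    // Copy the input activations to the GPU once.
    float *d_act, *d_next, *d_w;
    cudaMalloc(&d_act, batch * dims[0] * sizeof(float));
    cudaMemcpy(d_act, input.data(), batch * dims[0] * sizeof(float),
               cudaMemcpyHostToDevice);

    // CPU schedules each layer in turn, handing that layer's matrix
    // operation to the GPU, then moving on to the next layer.
    for (size_t l = 0; l + 1 < dims.size(); ++l) {
        int K = dims[l], N = dims[l + 1];
        cudaMalloc(&d_w, K * N * sizeof(float));
        cudaMalloc(&d_next, batch * N * sizeof(float));
        cudaMemcpy(d_w, weights[l].data(), K * N * sizeof(float),
                   cudaMemcpyHostToDevice);

        dim3 block(16, 16);
        dim3 grid((N + 15) / 16, (batch + 15) / 16);
        matmul_relu<<<grid, block>>>(d_act, d_w, d_next, batch, K, N);
        cudaDeviceSynchronize();

        cudaFree(d_act);
        cudaFree(d_w);
        d_act = d_next;  // this layer's output is the next layer's input
    }

    // CPU reads back the result of the output layer.
    std::vector<float> output((size_t)batch * dims.back());
    cudaMemcpy(output.data(), d_act, output.size() * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_act);
    printf("first output value: %f\n", output[0]);
    return 0;
}

Note that the activations stay on the GPU between layers and only the final output is copied back, so the CPU's role is limited to scheduling and data management, which is exactly the split described above.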