Deep Net Performance - Ep. 24 (Deep Learning SIMPLIFIED)

Training a large-scale deep net is a computationally expensive process, and common CPUs are generally insufficient for the task. GPUs are a great tool for speeding up training, but there are several other options available.

Deep Learning TV on
Facebook: https://www.facebook.com/DeepLearningTV/
Twitter: https://twitter.com/deeplearningtv

A CPU is a versatile tool that can be used across many domains of computation. However, the cost of this versatility is a dependence on sophisticated control mechanisms to manage the flow of tasks. CPUs also execute tasks serially, offering only a limited number of cores with which to build in parallelism. Even though CPU speeds and memory limits have increased over the years, a CPU is still an impractical choice for training large deep nets.

Vectorized implementations can be used to speed up the deep net training process. Generally, parallelism comes in the form of both parallel processing and parallel programming. Parallel processing can either involve shared resources on a single computer, or distributed computing across a cluster of nodes.
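
To make vectorization concrete, here is a minimal Python/NumPy sketch (an illustration of mine, not from the episode) contrasting a serial loop with a single vectorized call that hands the whole computation to optimized native code:

    import numpy as np

    x = np.random.randn(1_000_000)
    w = np.random.randn(1_000_000)

    # Serial version: one multiply-add per Python loop iteration
    total = 0.0
    for i in range(len(x)):
        total += x[i] * w[i]

    # Vectorized version: the same dot product in one call,
    # executed by optimized native code
    total_vec = np.dot(x, w)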

The GPU is a common tool for parallel processing. Unlike a CPU, a GPU tends to hold a large number of cores – anywhere from hundreds to thousands. Each of these cores is capable of general-purpose computing, and the core structure allows for large amounts of parallelism. As a result, GPUs are a popular choice for training large deep nets. The Deep Learning community provides GPU support through various libraries and implementations, along with a vibrant ecosystem fostered by NVIDIA. The main downside of a GPU is the amount of power required to run one relative to the alternatives.
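
As a rough sketch of how a deep learning library hands work to a GPU (assuming TensorFlow 2.x is installed, ideally with GPU support; this example is mine, not from the episode):

    import tensorflow as tf

    # Use the first GPU if one is present; otherwise fall back to the CPU
    device = '/GPU:0' if tf.config.list_physical_devices('GPU') else '/CPU:0'
    with tf.device(device):
        a = tf.random.normal([2048, 2048])
        b = tf.random.normal([2048, 2048])
        c = tf.matmul(a, b)  # hundreds to thousands of cores share this work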

The “Field Programmable Gate Array”, or FPGA, is another choice for training a deep net. FPGAs were originally used by electrical engineers to prototype different computer chip designs without having to custom-build a chip for each solution. With an FPGA, chip function can be programmed at the lowest level – the logic gate. With this flexibility, an FPGA can be tailored for deep nets so as to require less power than a GPU. Aside from speeding up the training process, FPGAs can also be used to run the resulting models. For example, an FPGA would be useful for running a complex convolutional net over thousands of images every second. The downside of an FPGA is the specialized knowledge required during design, setup, and configuration.

Another option is the “Application Specific Integrated Circuit”, or ASIC. ASICs are highly specialized, with their design fixed at the hardware and integrated circuit level. Once built, they perform very well at the task they were designed for, but are generally unusable for any other task. Compared to GPUs and FPGAs, ASICs tend to have the lowest power consumption. There are several Deep Learning ASICs, such as the Google Tensor Processing Unit (TPU) and the chip being built by Nervana Systems.

Distributed computing offers a few parallelism options, such as data parallelism, model parallelism, and pipeline parallelism. With data parallelism, different nodes train on different subsets of the data in parallel during each training pass, after which the parameters are averaged and replaced across the cluster (see the sketch below). Libraries like TensorFlow support model parallelism, where different portions of the model are trained on different devices in parallel. With pipeline parallelism, workers are dedicated to specific tasks, as in an assembly line. The main idea is to keep each worker well-utilized: a worker starts the next job as soon as the current one is complete, a strategy that minimizes the total amount of wasted time.
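
To make data parallelism concrete, here is a toy Python/NumPy sketch of parameter averaging (my own illustration with hypothetical names; a real cluster would run the local updates on separate nodes):

    import numpy as np

    def local_update(params, shard, lr=0.1):
        # One toy gradient step for a linear model y = X @ params
        X, y = shard
        grad = X.T @ (X @ params - y) / len(y)
        return params - lr * grad

    X = np.random.randn(400, 5)
    y = X @ np.random.randn(5)
    shards = [(X[i::4], y[i::4]) for i in range(4)]  # one data subset per worker

    params = np.zeros(5)
    for _ in range(100):  # training passes
        # Each worker trains on its own shard (in parallel on a real cluster)
        local = [local_update(params, s) for s in shards]
        # Average the results and replace the parameters across the cluster
        params = np.mean(local, axis=0)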

Parallel programming research has been active for decades, and many advanced techniques have been developed. Generally, algorithms should be designed with parallelism in mind in order to take full advantage of the hardware. One way to do this is to decompose the data into independent chunks, each handled by one instance of a task. Another is to group tasks by their dependencies, so that each group is completely independent of the others. In addition, you can use threads or processes to handle the different task groups. Threads can be used as a standalone solution, but provide the biggest speed improvements when combined with the grouping method. To learn more about this topic, follow this link to the openHPI Massive Open Online Course (MOOC) on parallel programming - https://open.hpi.de/courses/parprog2014.
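
As a small illustration of the decompose-and-group idea (my sketch, not from the course): split the work into independent chunks and let a pool of workers pull the next chunk as soon as they finish the current one. For CPU-bound Python code a process pool sidesteps the interpreter lock; threads are the better fit for I/O-bound tasks.

    from concurrent.futures import ProcessPoolExecutor

    def process_chunk(chunk):
        # Chunks are fully independent, so no coordination is needed
        return sum(n * n for n in chunk)

    if __name__ == '__main__':
        # Decompose the work into independent chunks
        chunks = [range(i, i + 100_000) for i in range(0, 1_000_000, 100_000)]
        with ProcessPoolExecutor() as pool:
            # Each worker starts its next chunk as soon as the current one is done
            results = pool.map(process_chunk, chunks)
            print(sum(results))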

Credits
Nickey Pickorita (YouTube art) -
https://www.upwork.com/freelan....cers/~0147b8991909b2
Isabel Descutner (Voice) -
https://www.youtube.com/user/IsabelDescutner
Dan Partynski (Copy Editing) -
https://www.linkedin.com/in/danielpartynski
Marek Scibior (Prezi creator, Illustrator) -
http://brawuroweprezentacje.pl/
Jagannath Rajagopal (Creator, Producer and Director) -
https://ca.linkedin.com/in/jagannathrajagopal
