# High Performance Parameter Server Design

## Motivation

This design doc focuses on implementing a high performance parameter server (PS). For the functionality of the PS, please refer to this [design doc](https://github.com/sql-machine-learning/elasticdl/blob/develop/docs/designs/parameter_server.md).

The PS receives gradients from workers, applies gradients to parameters, and sends the latest parameters to workers. Receiving gradients and sending parameters are the primary I/O workloads of the PS, while updating parameters costs CPU resources. Since one PS instance could receive gradients from more than one worker, both the I/O workload and the CPU workload could be heavy.

The current PS is implemented in Python. Due to Python's [GIL](https://wiki.python.org/moin/GlobalInterpreterLock), gradients are applied to parameters sequentially on only one CPU core. As a result, the gradient-receiving service is also blocked, waiting for the current gradients to be consumed. We want to remove this bottleneck and fully utilize multiple CPU cores.

Usually, the first thing that comes to mind is reimplementing a high performance parameter server in C++. But we have some concerns about the development efficiency of C++. Go is another potential choice. In this doc, we go through the key points of implementing a high performance parameter server to see whether Go is competent for the job and could substitute for C++ entirely or in part.

## Communication

The PS provides services to workers with the gRPC library. Both C++ and Go are well supported by gRPC, and Go has better development efficiency than C++.

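To make the service surface concrete, below is a rough Go-side sketch, for illustration only. The method and message names are hypothetical placeholders; the real RPC and message definitions would come from the PS proto files generated by protoc.

```go
// A sketch of the Go-side service surface, for illustration only. The method
// and message names are hypothetical; real definitions come from the protos.
package pserver

import "context"

// Placeholder messages standing in for protoc-generated types.
type PushGradientRequest struct{}  // would carry serialized gradient tensors
type PushGradientResponse struct{} // would carry an accept/reject flag
type PullVariableRequest struct{}  // would carry the requested variable names
type PullVariableResponse struct{} // would carry serialized parameter tensors

// PServer is the interface a worker talks to over gRPC.
type PServer interface {
	// PushGradient receives gradients computed by a worker.
	PushGradient(ctx context.Context, req *PushGradientRequest) (*PushGradientResponse, error)
	// PullVariable sends the latest parameters back to a worker.
	PullVariable(ctx context.Context, req *PullVariableRequest) (*PullVariableResponse, error)
}
```
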
## Computation

The gradients and parameters on the PS are represented as tensors. Applying gradients to parameters, which is also called optimization, is essentially math operations on tensors.

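As a minimal illustration, a vanilla SGD step on a dense parameter tensor is just an element-wise operation; stateful optimizers follow the same pattern with additional state tensors.

```go
// A minimal sketch: a vanilla SGD step (param -= lr * grad) on a dense
// float32 tensor.
package main

import "fmt"

func sgdUpdate(param, grad []float32, lr float32) {
	for i := range param {
		param[i] -= lr * grad[i]
	}
}

func main() {
	param := []float32{1.0, 2.0, 3.0}
	grad := []float32{0.5, 0.5, 0.5}
	sgdUpdate(param, grad, 0.1)
	fmt.Println(param) // [0.95 1.95 2.95]
}
```
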
### Tensor

We have to support both dense tensors and sparse tensors. Besides, different element data types are also needed, such as int8/int32/float16/float32/float64. Int8 and float16 are used for quantization during training.

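For illustration, one possible (hypothetical) in-memory layout is sketched below: a dense tensor is a flat buffer plus shape and element type, while a sparse tensor, such as the gradient of an embedding table, stores only the touched rows and their indices.

```go
// A hypothetical in-memory layout for the tensors the PS handles.
package tensor

// DType enumerates the element types the PS needs to support.
type DType int

const (
	Int8 DType = iota
	Int32
	Float16
	Float32
	Float64
)

// DenseTensor is a flat value buffer interpreted according to Dtype and Dims.
type DenseTensor struct {
	Dtype  DType
	Dims   []int64
	Buffer []byte
}

// SparseTensor stores a row-sparse slice of a larger parameter: Indices says
// which rows are present, and Values holds those rows as a dense tensor.
type SparseTensor struct {
	Indices []int64
	Values  DenseTensor
}
```
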
Each tensor operator has to support all of these data types. C++ supports generics through template programming, while Go does not support generics directly.

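Without generics, a common workaround in Go is a type switch on the element buffer, which means every operator carries one branch per supported dtype. The sketch below shows the resulting boilerplate; it is not a proposed API.

```go
// One operator, one branch per dtype: the cost of missing generics.
package tensor

import "fmt"

// scale multiplies every element of the buffer by alpha, in place.
func scale(buffer interface{}, alpha float64) error {
	switch b := buffer.(type) {
	case []float32:
		for i := range b {
			b[i] *= float32(alpha)
		}
	case []float64:
		for i := range b {
			b[i] *= alpha
		}
	case []int32:
		for i := range b {
			b[i] = int32(float64(b[i]) * alpha)
		}
	default:
		return fmt.Errorf("unsupported element type %T", buffer)
	}
	return nil
}
```
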
### Math library

There are different kinds of optimizers, and they need various tensor operations. There are many mature math libraries developed in C++. For example, [eigen](https://gitlab.com/libeigen/eigen) is used in TensorFlow and Paddle, and [aten](https://github.com/pytorch/pytorch/tree/master/aten) is used in PyTorch. These math libraries provide abundant tensor operators and support both CPU and GPU. Besides, they can call state-of-the-art BLAS libraries internally, such as MKL and cuBLAS. With these math libraries, the operators in optimizers can be implemented easily and efficiently.

It seems that there are few math libraries in Go. [Gosl](https://github.com/cpmech/gosl) is no longer active, and [gonum](https://github.com/gonum/gonum) does not support MKL. Generally, the math library ecosystem of Go is far from competitive with that of C++. We also have some concerns about the performance of math libraries in Go.

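For comparison, here is a sketch of the same SGD step written against gonum, assuming its `floats` package. Note that `floats` only operates on `[]float64`, which reflects the limited dtype coverage discussed above.

```go
// An SGD step written with gonum's floats package (float64 only).
package main

import (
	"fmt"

	"gonum.org/v1/gonum/floats"
)

func main() {
	param := []float64{1.0, 2.0, 3.0}
	grad := []float64{0.5, 0.5, 0.5}
	lr := 0.1

	// AddScaled computes param = param + (-lr) * grad, i.e. a vanilla SGD step.
	floats.AddScaled(param, -lr, grad)
	fmt.Println(param) // [0.95 1.95 2.95]
}
```
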
## Scheduling

In C++, we use thread-based scheduling. Threads are scheduled by the operating system. Usually, we implement one thread pool for computation and another thread pool for I/O. The parameter optimization is processed by the computation thread pool in parallel. Furthermore, to reduce the overhead of context switching, we could bind a thread to a certain CPU core by setting the thread's CPU affinity, which increases the CPU cache hit rate.

In Go, there is no concept of a thread; we use goroutines instead. Goroutines are scheduled by the Go runtime, and goroutine scheduling is not preemptive. There are four classes of events that occur in Go programs and allow the scheduler to make scheduling decisions. This does not mean scheduling will always happen at one of these events; it means the scheduler gets the opportunity:

- The use of the keyword `go`
- Garbage collection
- System calls
- Synchronization and orchestration

Go supports concurrent programming well with its first-class concepts: goroutines and channels.

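As a sketch of how this could map onto the PS workload, gradient-apply jobs could be sent on a channel and consumed by a fixed pool of goroutines, roughly one per CPU core. The job shape and pool size below are illustrative assumptions.

```go
// A channel-fed pool of goroutines applying gradient updates in parallel.
package main

import (
	"fmt"
	"runtime"
	"sync"
)

type gradJob struct {
	param, grad []float32
	lr          float32
}

func main() {
	jobs := make(chan gradJob, 128)
	var wg sync.WaitGroup

	// One worker goroutine per logical CPU.
	for i := 0; i < runtime.NumCPU(); i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range jobs {
				for k := range j.param {
					j.param[k] -= j.lr * j.grad[k]
				}
			}
		}()
	}

	params := []float32{1, 2, 3}
	jobs <- gradJob{param: params, grad: []float32{0.5, 0.5, 0.5}, lr: 0.1}
	close(jobs)
	wg.Wait()
	fmt.Println(params) // [0.95 1.95 2.95]
}
```
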
## Conclusion

Considering the tradeoff between development efficiency and program performance, we plan to implement the communication and scheduling parts in Go and the computation part in C++.

[Cgo](https://golang.org/cmd/cgo/) enables the creation of Go packages that call C code, and its overhead is slight. The optimization operators will be implemented in C++, wrapped with a C interface, and exposed to Go.

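A minimal, self-contained sketch of the cgo pattern follows. The C function here is a toy SGD step defined directly in the cgo preamble; in the real system it would be a C wrapper around the C++ optimizer built on a math library such as Eigen.

```go
// The cgo pattern: Go passes raw buffers to a C function for the update.
package main

/*
void sgd_update(float* param, float* grad, float lr, int n) {
    for (int i = 0; i < n; i++) {
        param[i] -= lr * grad[i];
    }
}
*/
import "C"

import (
	"fmt"
	"unsafe"
)

func main() {
	param := []float32{1.0, 2.0, 3.0}
	grad := []float32{0.5, 0.5, 0.5}

	// Pass the slice data to C by pointer; []float32 data matches float* layout.
	C.sgd_update(
		(*C.float)(unsafe.Pointer(&param[0])),
		(*C.float)(unsafe.Pointer(&grad[0])),
		C.float(0.1),
		C.int(len(param)),
	)
	fmt.Println(param) // [0.95 1.95 2.95]
}
```

Running `go build` compiles and links this through the system C compiler, with no extra wiring beyond the cgo preamble.
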
The services of receiving gradients and sending parameters are implemented in Go. Once gradients are received from a worker, a goroutine is launched to do the optimization.

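Below is a sketch of that gradient-receiving path, with hypothetical names: the handler hands the received gradients to a new goroutine, which applies them to the parameters under a lock. In the actual implementation the inner loop would be the cgo call into the C++ optimizer shown above.

```go
// Launching a goroutine per received gradient push, so I/O is not blocked.
package pserver

import "sync"

type ParameterServer struct {
	mu     sync.Mutex
	params map[string][]float32
}

// applyGradients applies one set of gradients with vanilla SGD; the lock
// keeps concurrent updates to the same parameters consistent.
func (s *ParameterServer) applyGradients(grads map[string][]float32, lr float32) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for name, g := range grads {
		p := s.params[name]
		for i := range p {
			p[i] -= lr * g[i]
		}
	}
}

// HandlePushGradient is called by the gRPC service for every push from a
// worker; it launches a goroutine so the I/O path is not blocked by the
// optimization.
func (s *ParameterServer) HandlePushGradient(grads map[string][]float32, lr float32) {
	go s.applyGradients(grads, lr)
}
```
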
## Reference

- https://gitlab.com/libeigen/eigen
- https://github.com/cpmech/gosl
- https://github.com/gonum/gonum
- https://www.ardanlabs.com/blog/2018/08/scheduling-in-go-part2.html