Use sync.Cond with and without the sync.Mutex
This post is the third in a series about common tips I learned from building rueidis, a high-performance Golang Redis client library. I think these tips are worth sharing because they can also be useful in daily Golang programming:
In the previous part 2, we achieved better throughput for pipelining request/response communication with our custom ring queue than with the double-channel approach.
However, there are two places where the custom ring queue uses busy waiting:
- EnqueueRequest uses a busy loop to wait for the slot to be available.
- The writing goroutine calls NextRequestToSend in a busy loop because it doesn't have the blocking behavior that a Golang channel has.
In the following sections, I will cover:
- What problems do these busy loops have?
- Remove the bad busy loop in EnqueueRequest with the sync.Cond.
- Minimize the bad busy loop in the writing goroutine with two atomic counters.
Golang is known for making concurrent programming easy, and its runtime scheduler does a great job of scheduling goroutines onto the threads of the operating system. But the real concurrency of a Go program is still limited by the CPU cores you have.
A busy loop basically occupies one CPU core. Furthermore, if the loop takes an uncertain amount of time to complete, that core can hardly do other useful work during that time, which leads to bad overall performance.
That is the case with our custom ring queue:
EnqueueRequest will loop until the ring slot is available, but the ring slot becomes available only after we have processed its previous response. That is, we must have already sent out the previous request and received its response from the server, and most importantly, how long that takes is unknown.
Similarly, our writing goroutine just keeps looping and calling NextRequestToSend, but when the user will make requests is also unknown.
The writing goroutine will keep occupying one of your CPU cores, and EnqueueRequest, in the worst case, will occupy all of them.
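The two busy loops can be sketched roughly like this. All names here (slot, ring, the mark states) are my assumptions for illustration based on part 2, not the actual rueidis code:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// slot and ring are simplified stand-ins for the ring queue from part 2.
type slot struct {
	mark atomic.Uint32 // 0: free, 1: being written, 2: ready to send
	req  string
}

type ring struct {
	slots [4]slot
	write atomic.Uint64 // next slot index for EnqueueRequest
	read  uint64        // next slot index for the writing goroutine
}

// EnqueueRequest busy-waits until its slot is free: the core stays occupied
// for an unknown time, because a slot is recycled only after the previous
// response for it has been processed.
func (r *ring) EnqueueRequest(req string) {
	s := &r.slots[(r.write.Add(1)-1)%uint64(len(r.slots))]
	for !s.mark.CompareAndSwap(0, 1) {
		// busy loop
	}
	s.req = req
	s.mark.Store(2)
}

// NextRequestToSend is non-blocking, so the writing goroutine has to call
// it in another busy loop; a channel receive would block instead.
func (r *ring) NextRequestToSend() (string, bool) {
	s := &r.slots[r.read%uint64(len(r.slots))]
	if s.mark.Load() != 2 {
		return "", false
	}
	r.read++
	req := s.req
	s.mark.Store(0) // in the real queue, recycling happens after the response arrives
	return req, true
}

func main() {
	r := &ring{}
	r.EnqueueRequest("GET k1")
	req, ok := r.NextRequestToSend()
	fmt.Println(req, ok) // GET k1 true
}
```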
We can confirm the performance degradation by benchmarking the custom ring queue with higher parallelism settings.
As you can tell from the result, the custom ring queue performs worse than the channel approach as the parallelism goes up. That is because the contention among goroutines for acquiring the OS threads also becomes heavier.
For EnqueueRequest, we need the ability to put a goroutine to sleep when the slot is not available and wake it up once the slot becomes available.
This ability is like what a semaphore provides in other programming languages.
There are two recommended ways to get semaphore-like synchronization in Golang:
- golang.org/x/sync/semaphore
- sync.Cond
The former provides a complex weighted semaphore mechanism, implemented with a sync.Mutex and a linked list of channels, allowing users to Release multiple signals at once. I put more about this in the appendix.
The latter provides a much simpler interface with Wait and Signal methods, but requires users to prepare a sync.Locker to avoid racing on the condition.
Since our waiting condition is based on the slot state, which is external to the semaphore, it is a better fit to use the sync.Cond in this case.
We can add a sync.Cond into our slot and initialize it with a sync.Mutex:
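For example, assuming the slot layout sketched in part 2 (the field names are my guesses, not the actual rueidis layout):

```go
package main

import (
	"fmt"
	"sync"
)

// A sketch of embedding a sync.Cond in each ring slot.
type slot struct {
	cond *sync.Cond
	mark uint32 // slot state, guarded by cond.L
	req  string
}

func newSlot() *slot {
	// each slot gets its own sync.Mutex as the Cond's Locker
	return &slot{cond: sync.NewCond(&sync.Mutex{})}
}

func main() {
	s := newSlot()
	fmt.Println(s.cond.L != nil) // true
}
```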
And rewrite our custom ring queue with it:
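A sketch of the rewritten enqueue path, under the same assumed names: EnqueueRequest now sleeps on the slot's cond instead of spinning, and whoever recycles the slot wakes exactly one waiter.

```go
package main

import (
	"fmt"
	"sync"
)

type slot struct {
	cond *sync.Cond
	mark uint32 // 0: free, 1: pending request; guarded by cond.L
	req  string
}

type ring struct {
	slots [4]*slot
	write uint64
	mu    sync.Mutex // guards write in this sketch
}

func newRing() *ring {
	r := &ring{}
	for i := range r.slots {
		r.slots[i] = &slot{cond: sync.NewCond(&sync.Mutex{})}
	}
	return r
}

func (r *ring) EnqueueRequest(req string) {
	r.mu.Lock()
	s := r.slots[r.write%uint64(len(r.slots))]
	r.write++
	r.mu.Unlock()

	s.cond.L.Lock()
	for s.mark != 0 {
		s.cond.Wait() // sleep instead of busy looping
	}
	s.mark = 1
	s.req = req
	s.cond.L.Unlock()
}

// recycle is called after the previous response for this slot is delivered.
func (r *ring) recycle(s *slot) {
	s.cond.L.Lock()
	s.mark = 0
	s.cond.L.Unlock()
	s.cond.Signal() // wake just one waiting EnqueueRequest
}

func main() {
	r := newRing()
	r.EnqueueRequest("GET k1")
	fmt.Println(r.slots[0].req) // GET k1
}
```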
We now put EnqueueRequest to sleep if the slot is not available, and wake just one goroutine up with cond.Signal() when the previous slot's response is delivered.
Now the benchmark result is better than the channel approach:
Then, we are going to deal with the busy loop in our writing goroutine.
In our new EnqueueRequest, there will be lock contention on a slot only if the ring is always recycling.
But if we used the sync.Cond in the same way for our writing goroutine, every EnqueueRequest call would need to access the writing goroutine's sync.Cond to check whether it is necessary to wake it up. That would undoubtedly cause lots of lock contention.
Fortunately, we don't need a real sync.Locker in this case. We can slightly relax the sleeping condition of our writing goroutine and still keep it from occupying a CPU core.
We let our writing goroutine go to sleep only if there are no more in-flight EnqueueRequest calls, and we only wake it up in the next makeRequest call.
To do that, we use two atomic counters:
We use the sleep counter to mark when the writing goroutine is going to sleep and when it is awakened.
We increase the waits counter before entering EnqueueRequest and decrease it after leaving. If the waits counter is 1 after we increase it, we then try to wake the writing goroutine up with cond.Broadcast().
It is important that we access the waits counter first and the sleep counter second in our makeRequest function, while, on the other hand, we access the sleep counter first and the waits counter second in the writing goroutine.
These reversed access sequences can guarantee we will not miss the chance to wake the goroutine up.
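The technique can be sketched as follows. All names here (pipe, makeRequest, writingLoop, and the buffered channel standing in for the ring) are assumptions for illustration; the point is the reversed access order of the two counters:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// makeRequest touches waits first and then sleep, while writingLoop touches
// sleep first and then waits, so a wake-up can never be missed.
type pipe struct {
	waits atomic.Int32 // number of in-flight makeRequest calls
	sleep atomic.Int32 // 1 while the writing goroutine is sleeping
	cond  *sync.Cond
	queue chan string // stand-in for the ring queue
	sent  chan string // requests the writing goroutine has picked up
}

func newPipe() *pipe {
	return &pipe{
		cond:  sync.NewCond(&sync.Mutex{}),
		queue: make(chan string, 64),
		sent:  make(chan string),
	}
}

func (p *pipe) makeRequest(req string) {
	if p.waits.Add(1) == 1 { // 1) touch waits first
		for p.sleep.Load() == 1 { // 2) then check sleep
			p.cond.Broadcast() // short busy loop: ends as soon as the writer wakes
		}
	}
	p.queue <- req
	p.waits.Add(-1)
}

func (p *pipe) writingLoop() {
	for {
		select {
		case req := <-p.queue:
			p.sent <- req // in real code: write the request to the server
		default:
			p.cond.L.Lock()
			p.sleep.Store(1) // 1) touch sleep first
			// 2) then check waits (and the queue) before actually sleeping
			if p.waits.Load() == 0 && len(p.queue) == 0 {
				p.cond.Wait()
			}
			p.sleep.Store(0)
			p.cond.L.Unlock()
		}
	}
}

func main() {
	p := newPipe()
	go p.writingLoop()
	p.makeRequest("GET k1")
	fmt.Println(<-p.sent) // GET k1
}
```

Because the writer re-checks waits after setting sleep, either the maker sees sleep == 1 and broadcasts, or the writer sees waits > 0 and skips Wait; there is no interleaving in which both miss each other.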
We still use a busy loop to wake up the writing goroutine in our
makeRequest function, but this busy loop is better than the previous one because we know it will complete shortly.
Putting the writing goroutine to sleep does add some overhead. However, it will no longer occupy and waste one of your CPU cores while there is no request to send.
The final piece of a thread-safe client library is probably the problem of how to close it. In the final post, I will share how rueidis handles that gracefully.