Edge of the Stack: Improve Performance of Python Programs by Restricting Them to a Single CPU
January 13, 2014
Articles and tutorials in the “Edge of the Stack” series cover fundamental programming issues and concerns that might not come up when dealing with OpenStack directly, but are certainly relevant to the OpenStack ecosystem. For example, drivers are often written in C rather than Python, leading to concerns about memory leaks and other C-specific issues. The “Edge of the Stack” series is meant to provide information on these peripheral, but still important areas.
Sometimes the combination of CPython, the global interpreter lock (GIL), and the operating system (OS) scheduler can decrease the performance of a multithreaded program: together they cause extra context switches and migrate threads across all available CPU cores. In this article, we will explain this problem and show how easy it can be to improve performance in some cases.
We will begin by showing an example of a simple Python TCP server and client. The server creates a thread pool and then waits for the client to connect. When the connection is established, the server passes it to the pool for processing. The processing function reads one line from the socket and simulates a CPU load. After receiving ‘bye\n’, the server closes the connection. The client then creates N connections and generates a fixed-size load.
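The original code listing isn’t reproduced here, but a minimal sketch of such a server and client might look like the following (Python 3; the names `busy_work`, `serve`, and `run_client`, and the loop sizes, are illustrative, not from the original):

```python
import socket
import threading
from concurrent.futures import ThreadPoolExecutor

def busy_work(n=20000):
    # Simulate the CPU-bound part of processing one request
    total = 0
    for i in range(n):
        total += i * i
    return total

def handle(conn):
    # Read lines until the client sends 'bye', doing CPU work per line
    f = conn.makefile("rwb")
    for line in f:
        if line == b"bye\n":
            break
        busy_work()
        f.write(b"ok\n")
        f.flush()
    conn.close()

def serve(host="127.0.0.1", workers=4):
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((host, 0))          # pick a free port
    srv.listen(16)
    pool = ThreadPoolExecutor(max_workers=workers)

    def accept_loop():
        while True:
            try:
                conn, _ = srv.accept()
            except OSError:      # server socket was closed
                break
            pool.submit(handle, conn)

    threading.Thread(target=accept_loop, daemon=True).start()
    return srv, srv.getsockname()[1]

def run_client(port, lines=10):
    # Send `lines` requests, then 'bye'; return how many 'ok' replies came back
    sock = socket.create_connection(("127.0.0.1", port))
    f = sock.makefile("rwb")
    ok = 0
    for _ in range(lines):
        f.write(b"work\n")
        f.flush()
        if f.readline() == b"ok\n":
            ok += 1
    f.write(b"bye\n")
    f.flush()
    sock.close()
    return ok
```

Starting several `run_client` threads against one `serve()` instance reproduces the N-connection, fixed-size load described above.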
The following are the code timings for the four-thread case:
Here we have the same timings, but with the program launched via ‘taskset 0x00000001’, which restricts the OS to running all of the Python threads on a single core (of the eight available):
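For reference, the taskset invocation looks like this (assuming a Linux system with util-linux installed; `server.py` stands in for the example server script):

```shell
# Run the server with all of its threads pinned to CPU 0 (affinity mask 0x00000001)
taskset 0x00000001 python server.py

# The same restriction, specifying the core by index instead of a bitmask
taskset -c 0 python server.py
```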
The results are counterintuitive: the four-thread program is faster when executed on a single core instead of eight. The first thing to understand is the GIL. It doesn’t matter how many Python threads are ready to run; only one of them is allowed to execute Python code at any particular moment.
So we could never have expected a performance improvement just from running this code on a multi-core computer. Pure-Python programs are concurrent, but not parallel, and their performance mostly does not depend on how many cores you have. This explains why performance doesn’t degrade after processor affinity is turned on. But why does it improve? There are two main reasons:
Let’s take a closer look at what happens with two threads. While the first thread is processing some data, the second is waiting for data from its socket. The second thread’s socket then receives data, and the thread becomes ready to continue execution. The OS scans the available CPU cores, discovers that the first core is busy with the first thread, and schedules the second thread onto the second core. The second thread starts and immediately tries to acquire the GIL. It fails, because the GIL is held by the first thread, so it goes back to sleep, waiting for the GIL to be released.
As a result, the OS, which is clueless about GIL semantics, is doing a lot of extra work. Part of this work is being done by the second CPU core and should not really slow down the first thread, but it does create extra load for the memory bus and the CPU cache. If the processor has HyperThreading (HT), the situation may be even worse.
Meanwhile, the real problem is that the second thread is now scheduled for execution on the second core. When the GIL is released, the OS does not migrate this thread back to the first core, because it is aware of the CPU caches and tries not to move a thread away from its current core without a good reason. As a result, all of the Python threads, which in sum can produce a 100-percent load on a single core, instead produce a 12.5-percent load on each of the eight available cores.
In this situation, Python threads are continuously jumping across all the cores. Data is moved in and out of the L1/L2 caches and LLC/RAM, and each cache miss can cost thousands of CPU cycles for memory access.
By restricting the OS to scheduling all of the server’s Python threads on a single core, we eliminate most of the context switches. Other threads would also (mostly) not be scheduled onto this core, which further decreases the frequency of cache misses.
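On Linux, the same restriction can be applied from inside the program itself via `os.sched_setaffinity` (available since Python 3.3), with no taskset wrapper needed; a minimal sketch:

```python
import os

# Pin the current process (pid 0 means "self") and all of its
# threads to CPU core 0. Linux-only: on platforms without
# sched_setaffinity this attribute does not exist.
os.sched_setaffinity(0, {0})

# Verify the new affinity mask
print(os.sched_getaffinity(0))  # -> {0}
```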
All measurements were taken on a Core i7-2630QM @ 2.90 GHz, Python 2.7.5, x64, Ubuntu 13.10, and averaged over seven runs. (NOTE: Python 3.3 shows the same behavior.) To eliminate the influence of Turbo Boost, the CPU clock was fixed at 800 MHz. Here are the raw results:
Running the program in VTune shows that when processor affinity is turned on, the number of cache misses drops by a factor of five and the number of context switches declines by a factor of 40. During the experiments I found another interesting thing: if the program is restricted to just one core, Turbo Boost can be left enabled, and it speeds up the program further by raising that particular core’s clock.
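VTune isn’t required to observe the context-switch side of this effect. On Unix, Python’s `resource` module exposes the process’s context-switch counters, so a rough before/after measurement takes only a few lines (the thread count and loop size here are arbitrary):

```python
import resource
import threading

def spin(n=1_000_000):
    # Pure-Python CPU-bound loop; holds the GIL while running
    x = 0
    for i in range(n):
        x += i

before = resource.getrusage(resource.RUSAGE_SELF)

threads = [threading.Thread(target=spin) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

after = resource.getrusage(resource.RUSAGE_SELF)
print("voluntary context switches:  ", after.ru_nvcsw - before.ru_nvcsw)
print("involuntary context switches:", after.ru_nivcsw - before.ru_nivcsw)
```

Running this with and without affinity restriction should show the involuntary context-switch count falling sharply in the pinned case.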
When can we expect such speed increases? Our example is a CPU-bound program, because data arrives faster than it can be processed. For I/O-bound programs, the speed increase would be smaller. In short, the higher the CPU load, the bigger the boost.
We’d expect that processor affinity would slow down the program in the following cases:
For further consideration
If you run a multithreaded Python program using CPython, it’s worth trying to restrict the available cores to one or two and looking at the results.
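A quick way to run that experiment yourself is sketched below (Linux-only, since it relies on `os.sched_setaffinity`; the `spin` workload is a stand-in for your real code):

```python
import os
import threading
import time

def spin(n=2_000_000):
    # Pure-Python CPU-bound loop; holds the GIL while running
    x = 0
    for i in range(n):
        x += i

def timed_run(threads=4):
    # Time `threads` concurrent CPU-bound threads
    ts = [threading.Thread(target=spin) for _ in range(threads)]
    start = time.perf_counter()
    for t in ts:
        t.start()
    for t in ts:
        t.join()
    return time.perf_counter() - start

if __name__ == "__main__":
    all_cores = timed_run()
    os.sched_setaffinity(0, {0})  # restrict the whole process to core 0
    one_core = timed_run()
    print(f"all cores: {all_cores:.3f}s, one core: {one_core:.3f}s")
```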
For further information, please check out the following resources: