
Edge of the Stack: Improve Performance of Python Programs by Restricting Them to a Single CPU

on January 13, 2014

Articles and tutorials in the “Edge of the Stack” series cover fundamental programming issues and concerns that might not come up when dealing with OpenStack directly, but are certainly relevant to the OpenStack ecosystem. For example, drivers are often written in C rather than Python, leading to concerns about memory leaks and other C-specific issues. The “Edge of the Stack” series is meant to provide information on these peripheral, but still important areas.

Sometimes the combination of CPython, the global interpreter lock (GIL), and the operating system (OS) scheduler can decrease the performance of a multithreaded program: together they cause extra context switches and migrate threads across all available CPU cores. In this article, we will explain the problem and show how easy it can be to improve performance in some cases.

We will begin with an example of a simple Python TCP server and client. The server creates a thread pool and then waits for clients to connect. When a connection is established, the server passes it to the pool for processing. The processing function reads one line from the socket and simulates a CPU load. After receiving ‘bye\n’, the server closes the connection. The client creates N connections and generates a fixed-size load.

python:
	import random
	import socket
	import time

	# host_port and NUM_PACKETS are module-level settings of mt_test.py;
	# they are not shown in this listing (see the entry-point sketch below).

	# Each message is a fixed-size, newline-terminated payload.
	data = ' ' * 100 + '\x0A'

	def client(th_count):
	    # Open th_count connections, retrying a few times in case the
	    # server is not listening yet.
	    sockets = []
	    for i in range(th_count):
	        sock = socket.socket()

	        for cnt in range(3):
	            try:
	                sock.connect(host_port)
	                break
	            except socket.error:
	                if cnt == 2:
	                    raise
	                time.sleep(0.1)

	        sockets.append(sock)

	    # Spread a fixed number of messages randomly across the connections.
	    for i in range(NUM_PACKETS):
	        sock = random.choice(sockets)
	        sock.send(data)

	    # Tell the server to close each connection.
	    for sock in sockets:
	        sock.send('bye\x0A')

python:
	import socket
	# On Python 2, ThreadPoolExecutor comes from the 'futures' backport.
	from concurrent.futures import ThreadPoolExecutor

	def server(th_count):
	    # serv_tout (the amount of simulated CPU work per message) and
	    # host_port are module-level settings of mt_test.py.
	    def process_client(sock):
	        num = 0
	        while True:
	            # Read one newline-terminated message, a byte at a time.
	            msg = ""
	            while not msg.endswith('\n'):
	                msg += sock.recv(1)

	            if msg == 'bye\n':
	                break

	            # Simulate a CPU-bound processing step.
	            for i in range(serv_tout):
	                pass

	            num += 1

	    s = socket.socket()
	    s.bind(host_port)
	    s.listen(5)

	    # Accept one connection per worker thread and process them in the pool.
	    with ThreadPoolExecutor(max_workers=th_count) as pool:
	        fts = []

	        for i in range(th_count):
	            sock, _ = s.accept()
	            fts.append(pool.submit(process_client, sock))

	        for ft in fts:
	            ft.result()
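
The command lines below assume that mt_test.py also contains a small entry point that dispatches to client() or server(). The original listings do not show it; a minimal sketch, with assumed example values for host_port, NUM_PACKETS, and serv_tout, might look like this:

python:
	import sys

	# Assumed example settings; the original post does not give these values.
	host_port = ('127.0.0.1', 5555)   # address the server listens on
	NUM_PACKETS = 100000              # total messages sent by the client
	serv_tout = 10000                 # iterations of simulated work per message

	if __name__ == '__main__':
	    role, th_count = sys.argv[1], int(sys.argv[2])
	    if role == 'client':
	        client(th_count)
	    else:
	        server(th_count)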

The following are the code timings for the four-thread case:

shell:
	$ python mt_test.py client 4 & time python mt_test.py server 4

	real	0m8.342s
	user	0m7.789s
	sys 	0m6.587s

Here are the same timings, but now with the server run under ‘taskset 0x00000001’, which tells the OS to run all of the Python threads on a single core (out of the eight available):

shell:
	$ python mt_test.py client 4 & time taskset 0x00000001 python mt_test.py server 4

	real	0m4.663s
	user	0m3.186s
	sys 	0m0.762s

The results don’t exactly jump out at you: a four-thread program is faster when executed on a single core instead of eight. The first thing to keep in mind is the GIL. It doesn’t matter how many Python threads are ready to run; only one of them is allowed to execute Python code at any particular moment.

So we couldn’t have expected a performance improvement just from running this code on a multi-core computer. All pure-Python programs are concurrent, but not parallel, and their performance mostly does not depend on how many cores you have (the short sketch after the following list illustrates this). That explains why performance doesn’t degrade when processor affinity is turned on. Why does it improve? There are two main reasons:

  • When you have more than one core, you get extra thread context switches.

  • Threads cannot use the CPU cache simultaneously, so they continuously battle with other threads for it.
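
As a quick illustration of the “concurrent, but not parallel” point (this snippet is not from the original post), the following sketch times a CPU-bound function run twice sequentially and then in two threads. Under CPython the threaded version is not expected to be faster, because the GIL lets only one thread execute Python code at a time:

python:
	import time
	from threading import Thread

	def burn(n=5 * 10**6):
	    # Pure-Python CPU-bound loop; it holds the GIL while it runs.
	    total = 0
	    i = 0
	    while i < n:
	        total += i
	        i += 1
	    return total

	start = time.time()
	burn(); burn()
	print('sequential: %.2fs' % (time.time() - start))

	start = time.time()
	threads = [Thread(target=burn) for _ in range(2)]
	for t in threads:
	    t.start()
	for t in threads:
	    t.join()
	print('two threads: %.2fs' % (time.time() - start))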

Let’s take a closer look at what happens in a two-thread case. While the first thread is processing some data, the second is waiting for data from the socket. The second thread’s socket then receives some data, and the thread becomes ready to continue execution. The OS scans the available CPU cores and sees that the first core is busy running the first thread, so it schedules the second thread on the second core. The second thread starts and first tries to acquire the GIL. It fails, because the GIL is held by the first thread, so it goes back to sleep, waiting for the GIL to be released.

As a result, the OS, which is clueless about GIL semantics, is doing a lot of extra work. Part of this work is done by the second CPU core and should not really slow down the first thread, but it does create extra load on the memory bus and the CPU cache. If the processor has Hyper-Threading (HT), the situation may be even worse.

Meanwhile, the real problem is that the second thread is now scheduled for execution on the second core. When the GIL is released, the OS does not migrate this thread back to the first core, because it knows about the caches and tries not to move a thread away from its core without a good reason. As a result, all of the Python threads, which in sum can produce a 100 percent load on a single core, end up producing a 12.5 percent load on each of the eight available cores.

In this situation, the Python threads continuously jump between cores. Data is moved in and out of the L1/L2 caches and LLC/RAM, and each cache miss can cost thousands of CPU cycles for memory access.

By restricting the OS to scheduling all of the server’s Python threads on a single core, we eliminate most of these context switches. Other threads will also (mostly) not be scheduled onto this core, which further decreases the frequency of cache misses.
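
Besides running the whole program under ‘taskset’, the same restriction can be applied from inside the program. As a minimal sketch (not from the original post): on Linux with Python 3.3 or later, the standard library exposes the scheduler affinity API directly (on Python 2, the third-party psutil package offers a similar cpu_affinity() call):

python:
	import os

	# Linux, Python 3.3+: pin the current process (and all of its threads)
	# to CPU core 0, the equivalent of 'taskset 0x00000001'.
	os.sched_setaffinity(0, {0})

	print(os.sched_getaffinity(0))   # -> {0}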

All measurements were taken on a Core i7-2630QM @ 2.90 GHz, Python 2.7.5 x64, Ubuntu 13.10, and averaged over seven runs. (Note: Python 3.3 shows the same behavior.) To eliminate the influence of Turbo Boost, the CPU clock was decreased to 800 MHz. Here are the raw results (all times in seconds):

	Threads   SUM    SUM_AF   %DIFF   SYS    SYS_AF   %DIFF   USR    USR_AF   %DIFF
	1         3.35   3.55     -5      0.54   0.52      4      2.78   3.03     -8
	2         7.26   4.63     36      4.91   0.67     86      5.10   2.95     42
	4         8.28   4.90     41      6.58   0.76     88      7.37   3.14     57
	8         7.96   5.00     37      6.49   0.84     87      7.32   3.15     57
	16        9.77   5.88     40      6.53   0.73     89      7.01   3.15     55
	32        9.73   6.84     30      6.54   0.81     88      7.06   3.04     57

Where,

  • SUM is the total execution time, as shown in the ‘real’ field of the ‘time’ utility’s output.

  • SYS is the OS (kernel) time, as shown in the ‘sys’ field.

  • USR is the user time, as shown in the ‘user’ field.

  • XXX_AF is XXX with CPU affinity turned on. (In other words, SUM_AF, SYS_AF, and USR_AF are SUM, SYS, and USR with CPU affinity turned on.)

  • %DIFF is the relative difference between XXX and XXX_AF (a positive value means that affinity increased the speed).
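
As a concrete example (the exact formula is an inference; the post does not state it), %DIFF appears to be the relative improvement computed from the two timings. Checking the SUM column of the two-thread row:

python:
	# (SUM - SUM_AF) / SUM * 100 for the two-thread row
	print(int(round((7.26 - 4.63) / 7.26 * 100)))   # -> 36, as in the table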

Running the program under VTune shows that, with processor affinity turned on, the number of cache misses drops roughly fivefold and the number of context switches declines by a factor of 40. During the experiments I found another interesting thing: if we restrict the program to just one core, we can also re-enable the Turbo Boost feature, which speeds the program up further by increasing the clock of that particular core.

When can we expect such speed increases? Our example is a CPU-bound program: data arrives faster than it can be processed. For I/O-bound programs the speed increase would be smaller, so the higher the CPU load, the bigger the boost from affinity.

We’d expect processor affinity to slow the program down in the following cases:

  • If some of the calculations are done in a C/C++ library that releases the GIL during execution.

  • If you use Jython or IronPython.

  • If you use multiprocessing/ProcessPoolExecutor, which uses processes instead of threads and allows Python to side-step the GIL. Since processor affinity is inherited by child processes, you should either give each child process its own dedicated CPU core or forgo affinity for the children (see the sketch after this list).

  • In some single-threaded programs, such as those that use gevent.
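
For the multiprocessing case, a minimal sketch (not from the original post, assuming Linux and a recent Python 3): each pool worker resets its inherited affinity, here by claiming a core of its own with a simple pid-based heuristic; alternatively, the initializer could hand every worker the full set of cores to drop the restriction entirely.

python:
	import os
	from multiprocessing import Pool

	def _pin_worker():
	    # Pool workers inherit the parent's CPU affinity. Give each worker
	    # its own core (pid-based heuristic, not guaranteed to be unique);
	    # use os.sched_setaffinity(0, range(os.cpu_count())) to drop the
	    # restriction instead.
	    core = os.getpid() % os.cpu_count()
	    os.sched_setaffinity(0, {core})

	def work(x):
	    # A small CPU-bound task.
	    return sum(i * i for i in range(x))

	if __name__ == '__main__':
	    with Pool(processes=4, initializer=_pin_worker) as pool:
	        print(pool.map(work, [10**5] * 8))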

For further consideration

If you run a multithreaded Python program under CPython, it is worth trying to restrict the available cores to one or two and looking at the results.

For further information, please check out the following resources:
