问题描述
我注意到 Python 的内置 sum 函数在对 1 000 000 个整数列表求和时比 for 循环快大约 3 倍:
I noticed that Python's built-in sum function is roughly 3x faster than a for loop when summing a list of 1 000 000 integers:
import timeit def sum1(): s = 0 for i in range(1000000): s += i return s def sum2(): return sum(range(1000000)) print 'For Loop Sum:', timeit.timeit(sum1, number=10) print 'Built-in Sum:', timeit.timeit(sum2, number=10) # Prints: # For Loop Sum: 0.751425027847 # Built-in Sum: 0.266746997833
这是为什么呢?sum是如何实现的?
Why is that? How is sum implemented?
推荐答案
速度差异实际上大于 3 倍,但是您通过首先创建一个包含 100 万个整数的巨大内存列表来减慢任一版本的速度.将其与计时赛分开:
The speed difference is actually greater than 3 times, but you slow down either version by first creating a huge in-memory list of 1 million integers. Separate that out of the time trials:
>>> import timeit >>> def sum1(lst): ... s = 0 ... for i in lst: ... s += i ... return s ... >>> def sum2(lst): ... return sum(lst) ... >>> values = range(1000000) >>> timeit.timeit('f(lst)', 'from __main__ import sum1 as f, values as lst', number=100) 3.457869052886963 >>> timeit.timeit('f(lst)', 'from __main__ import sum2 as f, values as lst', number=100) 0.6696369647979736
现在速度差已经上升到5倍以上了.
The speed difference has risen to over 5 times now.
for 循环作为解释的 Python 字节码执行.sum() 完全在 C 代码中循环.解释字节码和 C 代码之间的速度差异很大.
A for loop is executed as interpreted Python bytecode. sum() loops entirely in C code. The speed difference between interpreted bytecode and C code is large.
此外,如果 C 代码可以将总和保留在 C 类型中,则确保不会创建新的 Python 对象;这适用于 int 和 float 结果.
In addition, the C code makes sure not to create new Python objects if it can keep the sum in C types instead; this works for int and float results.
反汇编的 Python 版本是这样做的:
The Python version, disassembled, does this:
>>> import dis >>> def sum1(): ... s = 0 ... for i in range(1000000): ... s += i ... return s ... >>> dis.dis(sum1) 2 0 LOAD_CONST 1 (0) 3 STORE_FAST 0 (s) 3 6 SETUP_LOOP 30 (to 39) 9 LOAD_GLOBAL 0 (range) 12 LOAD_CONST 2 (1000000) 15 CALL_FUNCTION 1 18 GET_ITER >> 19 FOR_ITER 16 (to 38) 22 STORE_FAST 1 (i) 4 25 LOAD_FAST 0 (s) 28 LOAD_FAST 1 (i) 31 INPLACE_ADD 32 STORE_FAST 0 (s) 35 JUMP_ABSOLUTE 19 >> 38 POP_BLOCK 5 >> 39 LOAD_FAST 0 (s) 42 RETURN_VALUE
除了解释器循环比 C 慢之外,INPLACE_ADD 将创建一个新的整数对象(超过 255,CPython 将小的 int 对象缓存为单例).
Apart from the interpreter loop being slower than C, the INPLACE_ADD will create a new integer object (past 255, CPython caches small int objects as singletons).
您可以在Python mercurial 代码存储库,但它在评论中明确指出:
You can see the C implementation in the Python mercurial code repository, but it explicitly states in the comments:
/* Fast addition by keeping temporary sums in C instead of new Python objects. Assumes all inputs are the same type. If the assumption fails, default to the more general routine. */