How do you accurately measure the running time of PyTorch GPU code?

2024-02-23 PyTorch

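As with any other CUDA program, GPU computation in PyTorch runs asynchronously: a call returns to Python once the work has been queued, usually before the GPU has finished it. Wrapping GPU code in time.time() calls from the time module therefore does not measure how long the computation actually takes.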

The correct approach is to use the Event objects and the synchronize() function provided by the torch.cuda module.

For example, suppose we want to measure how long Multi-head Attention takes in PyTorch. With time.time(), we get an inaccurate result:

import torch
import torch.nn as nn
import time

attn = nn.MultiheadAttention(1024, 8).to('cuda:0')
x = torch.rand(size=(32, 1024, 1024), device='cuda:0')

# Time with time.time
start = time.time()

out, _ = attn(x, x, x)

end = time.time()
print(f'Elapsed: {(end - start) * 1000:.1f}ms')  # Elapsed: 1.9ms

If we use torch.cuda.Event and torch.cuda.synchronize() instead, we get a more accurate result:

# Time with torch.cuda.Event
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
start_event.record()

out, _ = attn(x, x, x)

end_event.record()
torch.cuda.synchronize()  # Wait for the events to be recorded!
elapsed_time_ms = start_event.elapsed_time(end_event)
print(f'Elapsed: {elapsed_time_ms:.1f}ms')  # Elapsed: 109.9ms

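The latter is clearly much closer to the real running time: the two measurements differ by a factor of more than 50 (109.9 ms versus 1.9 ms).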
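
The Event-based pattern can be wrapped in a small reusable helper. The following is a minimal sketch, not a definitive implementation: the name time_gpu_ms, the warm-up runs, the averaging over several iterations, and the CPU fallback are all assumptions added for illustration and are not part of the original measurement. The warm-up is there because the first call to a GPU operation often includes one-time costs, such as CUDA initialization, that would otherwise inflate the result.

```python
import time

import torch


def time_gpu_ms(fn, warmup=3, iters=10):
    """Return the mean time per call of fn() in milliseconds.

    Hypothetical helper: the name, warm-up count, and iteration count are
    illustrative choices, not part of PyTorch.
    """
    for _ in range(warmup):
        fn()  # warm-up runs absorb one-time costs such as CUDA initialization

    if not torch.cuda.is_available():
        # CPU fallback (an assumption for machines without a GPU):
        # CPU execution is synchronous, so a host timer is sufficient.
        start = time.perf_counter()
        for _ in range(iters):
            fn()
        return (time.perf_counter() - start) * 1000 / iters

    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()  # drain any work queued before the measurement
    start_event.record()
    for _ in range(iters):
        fn()
    end_event.record()
    torch.cuda.synchronize()  # wait until end_event has actually been recorded
    return start_event.elapsed_time(end_event) / iters


device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
a = torch.rand(256, 256, device=device)  # arbitrary size, chosen for the demo
mean_ms = time_gpu_ms(lambda: a @ a)
print(f'Mean: {mean_ms:.3f}ms')
```

To time the attention layer from the example above, the call would look like time_gpu_ms(lambda: attn(x, x, x)).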

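In fact, some PyTorch operations synchronize implicitly, such as copying a tensor from the GPU to the CPU with Tensor.to('cpu'). If the timed code contains such an operation, the result from time.time() will be closer to the real running time: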

# Time with time.time
start = time.time()

out, _ = attn(x, x, x)
out = out.to('cpu')  # copy from GPU to CPU, which synchronizes implicitly

end = time.time()
print(f'Elapsed: {(end - start) * 1000:.1f}ms')  # Elapsed: 223.6ms

# Time with torch.cuda.Event
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
start_event.record()

out, _ = attn(x, x, x)
out = out.to('cpu')  # copy from GPU to CPU, which synchronizes implicitly

end_event.record()
torch.cuda.synchronize()  # Wait for the events to be recorded!
elapsed_time_ms = start_event.elapsed_time(end_event)
print(f'Elapsed: {elapsed_time_ms:.1f}ms')  # Elapsed: 212.1ms
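
Since an implicit synchronization makes a host timer usable, the same effect can be obtained on purpose by calling torch.cuda.synchronize() explicitly before reading the clock. The sketch below illustrates this under stated assumptions: the helper name timed_ms, the use of time.perf_counter() in place of time.time(), the matrix size, and the CPU fallback are illustrative choices, not taken from the examples above.

```python
import time

import torch


def timed_ms(fn, sync):
    """Time one call of fn() with a host clock and return (ms, result).

    Hypothetical helper. With sync=True, the GPU is synchronized before the
    clock starts and again before it stops.
    """
    use_sync = sync and torch.cuda.is_available()
    if use_sync:
        torch.cuda.synchronize()  # make sure earlier GPU work has finished
    start = time.perf_counter()
    result = fn()
    if use_sync:
        torch.cuda.synchronize()  # wait for the kernels launched by fn()
    return (time.perf_counter() - start) * 1000, result


device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
a = torch.rand(512, 512, device=device)  # arbitrary size, chosen for the demo
_ = a @ a  # warm-up call so one-time setup is not timed

launch_ms, _ = timed_ms(lambda: a @ a, sync=False)  # on a GPU: mostly launch time
synced_ms, out = timed_ms(lambda: a @ a, sync=True)  # includes the GPU work
print(f'Without sync: {launch_ms:.3f}ms, with sync: {synced_ms:.3f}ms')
```

The two approaches measure slightly different things: a synchronized host timer includes the CPU-side launch overhead, while torch.cuda.Event measures the time elapsed on the GPU stream between the two recorded events.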

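Summary: when measuring the running time of PyTorch code, pay particular attention to the asynchronous nature of GPU computation. Timing a code snippet directly with a host-side timer gives wrong results; the correct approach is to time it with torch.cuda.Event and torch.cuda.synchronize().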