How to Check Whether a Tesla V100 Has ECC Errors

Use nvidia-smi to quickly inspect the ECC status of a Tesla V100 and determine which error counters should be 0 or N/A.

If you have a Tesla V100 on hand and want to do a basic health check first, ECC status is one of the most useful things to look at.

The most direct method is to inspect the card’s detailed information with nvidia-smi.

nvidia-smi -q
# Query GPU 0 only
nvidia-smi -q -i 0

Focus on the ECC Errors section.
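
If the full report is too long to scan, `nvidia-smi -q` accepts a `-d` display filter, and `-d ECC` limits the output to the ECC-related sections. The sketch below wraps that call; the function name `ecc_report` is made up for this example, and it assumes `nvidia-smi` is on your PATH.

```shell
# ecc_report [GPU_INDEX]
# Hypothetical helper (not part of nvidia-smi): prints only the ECC part
# of the detailed report for one GPU, defaulting to GPU 0.
ecc_report() {
    gpu="${1:-0}"
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found in PATH" >&2
        return 127
    fi
    nvidia-smi -q -d ECC -i "$gpu"
}
```

Calling `ecc_report` with no argument inspects GPU 0; `ecc_report 1` inspects the second card.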

On a healthy card, the four groups of counters under ECC Errors (Volatile Single Bit, Volatile Double Bit, Aggregate Single Bit, and Aggregate Double Bit) should all read 0 or N/A. A non-zero value means the card has already recorded that type of ECC error, and you should evaluate further whether it is still fit for continued use.
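
To avoid checking every counter by eye, the report can be filtered so that only non-zero ECC counters are printed. This is a rough sketch, not an official tool: `ecc_nonzero` is a hypothetical name, and the parsing assumes the 4-space indentation and the group headings that `nvidia-smi -q` prints for this card. Other driver versions may label the groups differently, which the script simply passes through.

```shell
# ecc_nonzero: read "nvidia-smi -q" output on stdin and print every
# counter in the ECC Errors section whose value is a number above 0.
ecc_nonzero() {
    awk '
    # Leading-space count gives the nesting depth of a line.
    function indent(s) { match(s, /^ */); return RLENGTH }
    /^ *ECC Errors *$/ { in_ecc = 1; base = indent($0); next }
    in_ecc && NF == 0 { next }
    # A line at the same depth as "ECC Errors" starts the next section.
    in_ecc && indent($0) <= base { in_ecc = 0 }
    !in_ecc { next }
    # Header lines (no colon) name the scope, then the error kind.
    index($0, ":") == 0 {
        name = $0; gsub(/^ +| +$/, "", name)
        if (indent($0) == base + 4) scope = name; else kind = name
        next
    }
    {
        split($0, kv, ":")
        key = kv[1]; val = kv[2]
        gsub(/^ +| +$/, "", key); gsub(/^ +| +$/, "", val)
        # N/A and other non-numeric values are skipped.
        if (val ~ /^[0-9]+$/ && val + 0 > 0)
            print scope " / " kind " / " key " = " val
    }'
}
```

Run it as `nvidia-smi -q -i 0 | ecc_nonzero`. No output means every ECC counter for that GPU is 0 or N/A.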

Reference output:

nvidia-smi -q
    ECC Mode
        Current                          : Enabled
        Pending                          : Enabled
    ECC Errors
        Volatile
            Single Bit
                Device Memory            : 0
                Register File            : 0
                L1 Cache                 : 0
                L2 Cache                 : 0
                Texture Memory           : N/A
                Texture Shared           : N/A
                CBU                      : N/A
                Total                    : 0
            Double Bit
                Device Memory            : 0
                Register File            : 0
                L1 Cache                 : 0
                L2 Cache                 : 0
                Texture Memory           : N/A
                Texture Shared           : N/A
                CBU                      : 0
                Total                    : 0
        Aggregate
            Single Bit
                Device Memory            : 0
                Register File            : 0
                L1 Cache                 : 0
                L2 Cache                 : 0
                Texture Memory           : N/A
                Texture Shared           : N/A
                CBU                      : N/A
                Total                    : 0
            Double Bit
                Device Memory            : 0
                Register File            : 0
                L1 Cache                 : 0
                L2 Cache                 : 0
                Texture Memory           : N/A
                Texture Shared           : N/A
                CBU                      : 0
                Total                    : 0
    Retired Pages

You can think of it like this:

  • Volatile is the error count since the driver was last loaded, which in practice usually means since the last reboot
  • Aggregate is the lifetime accumulated error count, kept across reboots
  • Single Bit means correctable errors
  • Double Bit means uncorrectable errors, which are more serious
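
The same four groups can also be read as one CSV row per GPU through `--query-gpu`, where "corrected" corresponds to Single Bit and "uncorrected" to Double Bit. The sketch below builds the field list from those two axes. The helper names are hypothetical, and the field names are the ones listed by `nvidia-smi --help-query-gpu`, so it is worth confirming them against your driver version.

```shell
# ecc_query_fields: hypothetical helper that builds the --query-gpu field
# list for the four totals (volatile/aggregate x corrected/uncorrected).
ecc_query_fields() {
    fields="index,ecc.mode.current"
    for scope in volatile aggregate; do
        for kind in corrected uncorrected; do
            fields="$fields,ecc.errors.$kind.$scope.total"
        done
    done
    printf '%s\n' "$fields"
}

# ecc_totals: print a CSV header plus one row per GPU with those fields.
# Assumes nvidia-smi is on PATH.
ecc_totals() {
    nvidia-smi --query-gpu="$(ecc_query_fields)" --format=csv
}
```

`ecc_totals` is convenient on machines with several cards, since each GPU becomes a single line that is easy to compare or log.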

If you only want a quick screening rule, remember this:

  • Most items should be 0
  • N/A is normal for some not-applicable entries
  • If any Double Bit counter or any Total is not 0, do not rely on a seller’s verbal description alone; follow up with fuller stress testing and stability checks
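
The screening rule can be sketched as a small filter over CSV totals. Everything here is an illustration under stated assumptions: `ecc_screen` is a hypothetical name, the input is expected to be five columns (GPU index, then corrected and uncorrected volatile totals, then corrected and uncorrected aggregate totals), and any non-numeric value such as `[N/A]` is treated as "nothing recorded".

```shell
# ecc_screen: read CSV rows "index, corr_vol, uncorr_vol, corr_agg, uncorr_agg"
# on stdin and print a one-line first-pass verdict per GPU.
ecc_screen() {
    awk -F', *' '
    # Treat anything that is not a plain number (for example [N/A]) as 0.
    function num(s) { return (s ~ /^[0-9]+$/) ? s + 0 : 0 }
    NF >= 5 {
        corrected   = (num($2) > 0 || num($4) > 0)
        uncorrected = (num($3) > 0 || num($5) > 0)
        if (uncorrected)
            verdict = "uncorrectable errors recorded, stress test before trusting it"
        else if (corrected)
            verdict = "correctable errors recorded, stress test recommended"
        else
            verdict = "no ECC errors reported"
        print "GPU " $1 ": " verdict
    }'
}
```

One way to feed it, assuming the query fields exist on your driver version:

`nvidia-smi --query-gpu=index,ecc.errors.corrected.volatile.total,ecc.errors.uncorrected.volatile.total,ecc.errors.corrected.aggregate.total,ecc.errors.uncorrected.aggregate.total --format=csv,noheader | ecc_screen`

A row of all N/A values also yields "no ECC errors reported", so check that ECC Mode is Enabled before reading that as a clean result.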

This does not replace a complete inspection, but it is enough for a first round of checks after a V100 arrives.
