If you have a Tesla V100 on hand and want to do a basic health check first, ECC status is one of the most useful things to look at.
The most direct method is to inspect the card’s detailed information with nvidia-smi.
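One way to pull just the ECC part of the detailed report is `nvidia-smi -q -d ECC`. The sketch below wraps that call in Python, assuming the NVIDIA driver is installed and `nvidia-smi` is on the `PATH`. The helper names `ecc_query_command` and `read_ecc_report` are made up for illustration.

```python
import shutil
import subprocess


def ecc_query_command(gpu_index=None):
    """Build the nvidia-smi command that prints only the ECC section."""
    cmd = ["nvidia-smi", "-q", "-d", "ECC"]
    if gpu_index is not None:
        # -i restricts the query to a single GPU, selected by index.
        cmd += ["-i", str(gpu_index)]
    return cmd


def read_ecc_report(gpu_index=None):
    """Return the report text, or None if nvidia-smi is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ecc_query_command(gpu_index),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout if result.returncode == 0 else None


if __name__ == "__main__":
    report = read_ecc_report(0)
    print(report if report is not None else "nvidia-smi not available or query failed")
```

Running plain `nvidia-smi -q` prints the full report instead; the ECC section is the same, just surrounded by everything else.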
Focus on the ECC Errors section.
On a healthy card, all four groups of counters under ECC Errors (Volatile and Aggregate, each split into Single Bit and Double Bit) should read 0 or N/A. A non-zero value in any of them means the card has already logged that type of ECC error, and you should evaluate further whether it is still fit for continued use.
Reference output: [listing not preserved]
You can think of it like this:
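- `Volatile` is the error count since the driver was last loaded, which in practice means the current power cycle
- `Aggregate` is the lifetime accumulated error count, which persists across reboots
- `Single Bit` means correctable errors
- `Double Bit` means uncorrectable errors, which are more serious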
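To make the Volatile/Aggregate and Single Bit/Double Bit nesting concrete, here is a minimal parsing sketch. It assumes the indented text layout that `nvidia-smi -q` prints for Volta-class cards and a report covering a single GPU. `SAMPLE` is a hand-written illustration of that layout, not captured output; the exact counter names vary with GPU and driver version. The function name `parse_ecc_errors` is hypothetical.

```python
# Hand-written sample of the layout (an assumption, not real captured output).
SAMPLE = """\
    Ecc Mode
        Current                           : Enabled
    ECC Errors
        Volatile
            Single Bit
                Device Memory             : 0
                Texture Memory            : N/A
                Total                     : 0
            Double Bit
                Device Memory             : 0
                Total                     : 0
        Aggregate
            Single Bit
                Device Memory             : 3
                Total                     : 3
            Double Bit
                Device Memory             : 0
                Total                     : 0
"""


def parse_ecc_errors(report):
    """Return {scope: {kind: {location: int or None}}}; None stands for N/A."""
    counters = {}
    section_indent = None  # indentation of the "ECC Errors" header once seen
    scope = kind = None
    for line in report.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        indent = len(line) - len(line.lstrip())
        if stripped == "ECC Errors":
            section_indent, scope, kind = indent, None, None
            continue
        if section_indent is None:
            continue  # not inside the ECC Errors section
        if indent <= section_indent:
            section_indent = None  # a sibling section started
            continue
        if stripped in ("Volatile", "Aggregate"):
            scope, kind = stripped, None
        elif stripped in ("Single Bit", "Double Bit"):
            kind = stripped
        elif ":" in stripped and scope and kind:
            name, _, value = stripped.partition(":")
            value = value.strip()
            parsed = int(value) if value.isdigit() else None
            counters.setdefault(scope, {}).setdefault(kind, {})[name.strip()] = parsed
    return counters


if __name__ == "__main__":
    for scope, kinds in parse_ecc_errors(SAMPLE).items():
        for kind, locations in kinds.items():
            print(scope, kind, locations)
```

In this sample the only non-zero group is Aggregate / Single Bit, which would mean correctable errors were logged at some point in the card's life but not in the current session.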
If you only want a quick screening rule, remember this:
- Most items should be `0`
- `N/A` is normal for some not-applicable entries
- If `Double Bit` or the total count is not `0`, do not rely only on a seller’s verbal description; it is better to continue with fuller stress testing and stability checks
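The screening rule above can be sketched as a small function. It assumes the counters have already been collected into a nested dictionary of the form `{scope: {kind: {location: value}}}`, with `None` standing for `N/A`. The function name `screen_ecc` and the verdict labels are illustrative choices, not anything `nvidia-smi` reports.

```python
def screen_ecc(counters):
    """Apply the quick rule: every counter should be 0 or N/A.

    Returns (verdict, reasons), where reasons lists each non-zero counter.
    """
    reasons = []
    uncorrectable = False
    for scope, kinds in counters.items():
        for kind, locations in kinds.items():
            for location, value in locations.items():
                if value:  # None (N/A) and 0 both pass the check
                    reasons.append(f"{scope} / {kind} / {location} = {value}")
                    if kind == "Double Bit":
                        uncorrectable = True
    if not reasons:
        return "clean", reasons
    if uncorrectable:
        return "uncorrectable-errors-seen", reasons
    return "correctable-errors-seen", reasons


if __name__ == "__main__":
    example = {
        "Volatile": {"Single Bit": {"Total": 0}, "Double Bit": {"Total": 0}},
        "Aggregate": {"Single Bit": {"Total": 3}, "Double Bit": {"Total": 0}},
    }
    verdict, reasons = screen_ecc(example)
    print(verdict)
    for reason in reasons:
        print("  ", reason)
```

Any verdict other than `clean` is a cue to move on to fuller stress testing rather than a final judgment on the card.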
This does not replace a complete inspection, but it is enough for a first round of checks after a V100 arrives.