For more information, take a look at the NVIDIA® Data Center GPU Manager (DCGM) documentation. According to the documentation:
The NVIDIA® Data Center GPU Manager (DCGM) simplifies administration of NVIDIA Datacenter (previously “Tesla”) GPUs in cluster and datacenter environments. At its heart, DCGM is an intelligent, lightweight user space library/agent that performs a variety of functions on each host system:
- GPU behavior monitoring
- GPU configuration management
- GPU policy oversight
- GPU health and diagnostics
- GPU accounting and process statistics
- NVSwitch configuration and monitoring
This functionality is accessible programmatically through public APIs and interactively through CLI tools. It is designed to be run either as a standalone entity or as an embedded library within management tools. This document provides an overview of DCGM’s main goals and features and is intended for system administrators, ISV developers, and individual users managing groups of NVIDIA GPUs.
Installation
Assuming you are using a RHEL derivative such as Rocky Linux 8, installation is a breeze:
# dnf config-manager --add-repo http://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
# dnf install -y datacenter-gpu-manager
Enable the DCGM systemd service so it starts on boot, and start it immediately (the --now flag enables and starts it in one step):
# systemctl --now enable nvidia-dcgm
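To confirm the host engine came up cleanly before moving on, a quick check of the service status is enough (plain systemd usage, nothing DCGM-specific):
# systemctl status nvidia-dcgm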
Basic Usage – Discovery
To list the GPUs (and any NVSwitches) that DCGM can see, along with their IDs:
# dcgmi discovery -l

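Once you know the GPU IDs, you can also stream live metrics with dcgmi dmon, which is the quickest way to exercise the “GPU behavior monitoring” function listed above. The field IDs below are just an illustration (in current DCGM releases 150 is GPU temperature, 155 is power draw, and 203 is GPU utilization); run dcgmi dmon --help to see the options and field IDs available on your version:
# dcgmi dmon -e 150,155,203 -c 10
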
Basic Usage – Diagnostic
To run a diagnostic test, use dcgmi diag and pick a run level: -r 1 is a quick suite of sanity checks, -r 2 a medium-length run, and -r 3 the long, most thorough run. For example, to run the medium suite:
# dcgmi diag -r 2

For a more comprehensive (and longer-running) diagnostic, use -r 3:
# dcgmi diag -r 3

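Conversely, if you only need a quick sanity check (for example, right after a driver update), run level 1 finishes in seconds:
# dcgmi diag -r 1
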
Basic Usage – NVLink Status
To see the NVLink link status for the GPUs in the system:
# dcgmi nvlink -s

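If the links are up but you still suspect problems, newer DCGM releases can also report per-link NVLink error counters for a specific GPU. The GPU ID below is just an example, and since the exact flags can vary between versions it is worth confirming them with dcgmi nvlink --help first:
# dcgmi nvlink -e -g 0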