How to Use the Command 'nvidia-smi' (with examples)
The nvidia-smi command is a powerful utility provided by NVIDIA that assists in the management and monitoring of NVIDIA GPU devices. It offers insights into GPU status, memory usage, GPU utilization, thermals, and running processes, among other details. This tool is pivotal for developers, system administrators, and enthusiasts who need to keep track of their GPU’s performance and efficiency. Whether you are maintaining a large-scale data center or optimizing GPU performance on a personal workstation, nvidia-smi is essential.
Use case 1: Display information on all available GPUs and processes using them
Code:
nvidia-smi
Motivation:
Using this command without any additional arguments provides a comprehensive view of all available NVIDIA GPUs in your system. It lists essential information such as GPU index, utilization, temperature, memory usage, and active processes utilizing each GPU. This basic command is a great starting point for anyone looking to quickly assess the status and health of their GPUs. It is especially useful for system administrators managing servers with multiple GPUs or when diagnosing GPU-related issues.
Explanation:
nvidia-smi
: This is the base command, which, when executed without additional arguments, provides a summary table displaying key information about all available NVIDIA GPUs and any processes that are currently using these GPUs.
Example Output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.74 Driver Version: 470.74 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 00000000:00:1A.0 Off | 0 |
| N/A 56C P0 69W / 149W | 10462MiB / 11441MiB | 47% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 53768 C /usr/bin/python3 10408MiB |
+-----------------------------------------------------------------------------+
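If you only need part of this information in a machine-readable form, nvidia-smi also exposes targeted queries. The following is a minimal sketch, assuming a reasonably recent driver (the available fields can be listed with nvidia-smi --help-query-compute-apps):

# List the installed GPUs with their UUIDs
nvidia-smi -L

# List only the compute processes from the table above, as CSV
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv

The CSV output is easy to feed into scripts or log collectors, which is often more convenient than parsing the full summary table.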
Use case 2: Display more detailed GPU information
Code:
nvidia-smi --query
Motivation:
Sometimes, a general overview may not provide enough detail for specific troubleshooting or analysis tasks. The --query flag allows users to delve deeper into the specifics of each GPU, unveiling additional parameters such as individual GPU temperatures, total power utilization, memory usage, and many other metrics. This level of detail is indispensable when optimizing the performance of data-intensive applications or ensuring the reliable operation of GPUs in demanding environments, such as production servers.
Explanation:
nvidia-smi --query
: The --query argument extends the functionality of the base command by printing a detailed, plain-text report of the configuration and real-time metrics of the available GPUs. This is particularly useful for generating comprehensive performance reports or scripting automated monitoring solutions.
Example Output:
==============NVSMI LOG==============

Timestamp                                 : ...
Driver Version                            : 470.74
CUDA Version                              : 11.4

Attached GPUs                             : 1
GPU 00000000:00:1A.0
    Product Name                          : Tesla K80
    Persistence Mode                      : Enabled
    FB Memory Usage
        Total                             : 11441 MiB
        Used                              : 10462 MiB
        Free                              : 979 MiB
    Temperature
        GPU Current Temp                  : 56 C
    Power Readings
        Power Draw                        : 69.12 W
        Power Limit                       : 149.00 W
    ...
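The full --query report can be lengthy. Below is a minimal sketch of two common refinements, assuming a recent driver (field names for the CSV form can be listed with nvidia-smi --help-query-gpu):

# Limit the detailed report to selected sections
nvidia-smi -q -d MEMORY,UTILIZATION,TEMPERATURE

# Emit chosen metrics as CSV, convenient for scripted monitoring
nvidia-smi --query-gpu=name,temperature.gpu,utilization.gpu,memory.used,memory.total --format=csv,noheader

Note that -q is the short form of --query, and -d/--display restricts the report to the named sections.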
Use case 3: Monitor overall GPU usage with 1-second update interval
Code:
nvidia-smi dmon
Motivation:
In scenarios where continuous, real-time monitoring is essential, such as when assessing the impact of newly deployed software or monitoring performance during high-computation tasks, the dmon feature of nvidia-smi offers unparalleled live feedback. By providing constant updates at regular intervals, users can track variations in GPU metrics and understand performance trends over time. This mode is particularly advantageous for engineers conducting performance testing or administrators managing GPUs in high-availability environments.
Explanation:
nvidia-smi dmon
: The dmon subcommand starts device monitoring, printing a simplified yet highly informative line of GPU statistics for each sampling interval. It refreshes continuously at the default period of one second, making it an excellent choice for real-time monitoring.
Example Output:
# gpu pwr gtemp mtemp SM MEM ENC DEC mclk pclk
# Idx W C C % % % % MHz MHz
0 60 56 - 48 10 0 0 2506 758
0 61 56 - 50 12 0 0 2506 860
0 58 56 - 46 15 0 0 2506 833
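dmon accepts a few options to tailor the sampling. The following is a minimal sketch, with the option letters as documented in nvidia-smi dmon -h:

# Sample power/temperature (p), utilization (u), clocks (c) and memory (m)
# every 5 seconds and stop after 12 samples
nvidia-smi dmon -s pucm -d 5 -c 12

For the full summary table instead of the compact dmon columns, nvidia-smi --loop=1 re-prints the table every second, while watch -n 1 nvidia-smi refreshes it in place.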
Conclusion:
The nvidia-smi command is an indispensable tool for anyone working with NVIDIA GPUs, providing critical insights into GPU utilization, performance metrics, and running processes that are crucial for effective management and performance tuning. From simple status overviews to detailed queries and real-time monitoring, the examples provided showcase the versatility and utility of the nvidia-smi command in various scenarios. Whether you’re dealing with a single workstation or scaling GPU management across data centers, this tool is a key resource for ensuring efficient and reliable GPU operations.