How to Use the Command 'nvidia-smi' (with examples)

The nvidia-smi command is a powerful utility provided by NVIDIA that assists in the management and monitoring of NVIDIA GPU devices. It offers insights into GPU status, memory usage, GPU utilization, thermals, and running processes, among other details. This tool is pivotal for developers, system administrators, and enthusiasts who need to keep track of their GPU’s performance and efficiency. Whether you are maintaining a large-scale data center or optimizing GPU performance on a personal workstation, nvidia-smi is essential.

Use case 1: Display information on all available GPUs and processes using them

Code:

nvidia-smi

Motivation:

Using this command without any additional arguments provides a comprehensive view of all available NVIDIA GPUs in your system. It lists essential information such as GPU index, utilization, temperature, memory usage, and active processes utilizing each GPU. This basic command is a great starting point for anyone looking to quickly assess the status and health of their GPUs. It is especially useful for system administrators managing servers with multiple GPUs or when diagnosing GPU-related issues.

Explanation:

  • nvidia-smi: This is the base command, which, when executed without additional arguments, provides a summary table displaying key information about all available NVIDIA GPUs and any processes that are currently using these GPUs.

Example Output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.74       Driver Version: 470.74       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:1A.0 Off |                    0 |
| N/A   56C    P0    69W / 149W |  10462MiB / 11441MiB |     47%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     53768      C   /usr/bin/python3                 10408MiB |
+-----------------------------------------------------------------------------+
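
On systems with multiple GPUs, it is often convenient to narrow this same summary to a single device or to refresh it periodically. The lines below are a minimal sketch using the -L, -i, and -l flags; the GPU index 0 and the 5-second interval are arbitrary values chosen for illustration.

# List the GPUs installed in the system, with their indices and UUIDs
nvidia-smi -L

# Show the same summary table, restricted to the GPU at index 0
nvidia-smi -i 0

# Re-print the summary every 5 seconds until interrupted with Ctrl+C
nvidia-smi -l 5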

Use case 2: Display more detailed GPU information

Code:

nvidia-smi --query

Motivation:

Sometimes, a general overview may not provide enough detail for specific troubleshooting or analysis tasks. The --query flag allows users to delve deeper into the specifics of each GPU, unveiling additional parameters such as individual GPU temperatures, total power utilization, memory usage, and many other metrics. This level of detail is indispensable when optimizing the performance of data-intensive applications or ensuring the reliable operation of GPUs in demanding environments, such as production servers.

Explanation:

  • nvidia-smi --query: The --query argument extends the base command by retrieving a far more detailed, per-GPU report of configuration and real-time metrics, printed as a key-value log rather than the summary table. This is particularly useful for generating comprehensive performance reports or scripting automated monitoring solutions (a scripting-oriented variant is sketched after the example output below).

Example Output:

==============NVSMI LOG==============

Driver Version                            : 470.74
CUDA Version                              : 11.4

Attached GPUs                             : 1
GPU 00000000:00:1A.0
    Product Name                          : Tesla K80
    Persistence Mode                      : On
    Performance State                     : P0
    FB Memory Usage
        Total                             : 11441 MiB
        Used                              : 1036 MiB
    Utilization
        Gpu                               : 51 %
    Temperature
        GPU Current Temp                  : 62 C
    Power Readings
        Power Draw                        : 123.00 W
        Power Limit                       : 150.00 W
    ...
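
For scripted or automated monitoring, it is often easier to request only specific fields in machine-readable form. The lines below are a minimal sketch assuming the --query-gpu and --format options; the particular field list, the 10-second loop interval, and the file name gpu_log.csv are illustrative choices, not part of the example above.

# Print selected per-GPU fields as comma-separated values
nvidia-smi --query-gpu=index,name,temperature.gpu,utilization.gpu,memory.used,memory.total --format=csv

# Append a headerless, unit-free sample to a log file every 10 seconds (gpu_log.csv is a hypothetical path)
nvidia-smi --query-gpu=timestamp,index,utilization.gpu,memory.used --format=csv,noheader,nounits -l 10 >> gpu_log.csv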

Use case 3: Monitor overall GPU usage with 1-second update interval

Code:

nvidia-smi dmon

Motivation:

In scenarios where continuous, real-time monitoring is essential, such as when assessing the impact of newly deployed software or monitoring performance during high-computation tasks, the dmon feature of nvidia-smi offers unparalleled live feedback. By providing constant updates at regular intervals, users can track variations in GPU metrics and understand performance trends over time. This use is particularly advantageous for engineers conducting performance testing or administrators managing GPUs in high-availability environments.

Explanation:

  • nvidia-smi dmon: This argument starts the device-monitoring mode (dmon), which prints a compact but highly informative scrolling view of GPU statistics, refreshing at the default interval of one second until interrupted. This makes it an excellent choice for real-time monitoring; a variant with a custom metric set and sampling interval is sketched after the example output below.

Example Output:

# gpu   pwr gtemp mtemp    SM   MEM    ENC    DEC  mclk  pclk
# Idx     W     C     C     %     %     %     %   MHz   MHz
    0    60    56     -    48    10     0     0   2506  758
    0    61    56     -    50    12     0     0   2506  860
    0    58    56     -    46    15     0     0   2506  833
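
The device, metric set, and sampling interval can all be adjusted. The lines below are a minimal sketch assuming dmon's -i, -s, -d, and -c options; the chosen GPU index, 5-second delay, and 12-sample count are illustrative values.

# Monitor only GPU 0, showing the power/temperature (p) and utilization (u) metric groups
nvidia-smi dmon -i 0 -s pu

# Sample every 5 seconds and stop automatically after 12 samples
nvidia-smi dmon -d 5 -c 12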

Conclusion:

The nvidia-smi command is an indispensable tool for anyone working with NVIDIA GPUs, providing critical insights into GPU utilization, performance metrics, and running processes that are essential for effective management and performance tuning. From simple status overviews to detailed queries and real-time monitoring, the examples above showcase the versatility of nvidia-smi in a variety of scenarios. Whether you're dealing with a single workstation or scaling GPU management across data centers, this tool is a key resource for ensuring efficient and reliable GPU operations.
