nvidia-smi - NVIDIA System Management Interface program
nvidia-smi [OPTION1 [ARG1]] [OPTION2 [ARG2]] ...
nvidia-smi (also NVSMI) provides monitoring and management
capabilities for each of NVIDIA's Tesla, Quadro, GRID and GeForce devices
from the Fermi and higher architecture families. GeForce Titan series devices
are supported for most functions, with very limited information provided for
the rest of the GeForce brand. NVSMI is a cross-platform tool that
supports all standard NVIDIA driver-supported Linux distros, as well as
64-bit versions of Windows starting with Windows Server 2008 R2. Metrics can
be consumed directly by users via stdout, or written to a file in CSV or
XML format for scripting purposes.
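For example (a minimal sketch; output details vary by driver version and device):
    # human-readable summary of all GPUs
    nvidia-smi
    # full query output for all GPUs, in XML
    nvidia-smi -q -x
    # machine-readable CSV via the query interface
    nvidia-smi --query-gpu=pci.bus_id,persistence_mode --format=csv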
Note that much of the functionality of NVSMI is provided by the
underlying NVML C-based library. See the NVIDIA developer website link below
for more information about NVML. NVML-based python bindings are also
available.
The output of NVSMI is not guaranteed to be backwards compatible.
However, both NVML and the Python bindings are backwards compatible, and
should be the first choice when writing any tools that must be maintained
across NVIDIA driver releases.
NVML SDK:
https://docs.nvidia.com/deploy/nvml-api/index.html
Python bindings:
http://pypi.python.org/pypi/nvidia-ml-py/
Print usage information and exit.
Print version information and exit.
List each of the NVIDIA GPUs in the system, along with their
UUIDs.
List each of the excluded NVIDIA GPUs in the system, along with
their UUIDs.
-i, --id=ID
Target a specific GPU.
-f FILE, --filename=FILE
Log to the specified file, rather than to stdout.
-l SEC, --loop=SEC
Probe until Ctrl+C at specified second interval.
Display GPU or Unit info. Displayed info includes all data listed
in the (GPU ATTRIBUTES) or (UNIT ATTRIBUTES) sections of this
document. Some devices and/or environments don't support all possible
information. Any unsupported data is indicated by "N/A" in the
output. By default, information for all available GPUs or Units is displayed.
Use the -i option to restrict the output to a single GPU or Unit.
-u, --unit
Display Unit data instead of GPU data. Unit data is only available
for NVIDIA S-class Tesla enclosures.
-i, --id=ID
Display data for a single specified GPU or Unit. The specified id
may be the GPU/Unit's 0-based index in the natural enumeration returned by
the driver, the GPU's board serial number, the GPU's UUID, or the GPU's PCI
bus ID (as domain:bus:device.function in hex). It is recommended that users
desiring consistency use either UUID or PCI bus ID, since device enumeration
ordering is not guaranteed to be consistent between reboots and board serial
number might be shared between multiple GPUs on the same board.
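For example, each of the following targets a single GPU (the UUID and bus ID shown are placeholders; list the real values with nvidia-smi -L):
    nvidia-smi -q -i 0                    # by 0-based index
    nvidia-smi -q -i GPU-xxxxxxxx         # by UUID (placeholder)
    nvidia-smi -q -i 0000:01:00.0         # by PCI bus ID (domain:bus:device.function)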
-f FILE, --filename=FILE
Redirect query output to the specified file in place of the
default stdout. The specified file will be overwritten.
Produce XML output in place of the default human-readable format.
Both GPU and Unit query outputs conform to corresponding DTDs. These are
available via the --dtd flag.
Use with -x. Embed the DTD in the XML output.
Produces an encrypted debug log for use in submission of bugs back
to NVIDIA.
Display only selected information: MEMORY, UTILIZATION, ECC,
TEMPERATURE, POWER, CLOCK, COMPUTE, PIDS, PERFORMANCE, SUPPORTED_CLOCKS,
PAGE_RETIREMENT, ACCOUNTING, ENCODER_STATS, SUPPORTED_GPU_TARGET_TEMP,
VOLTAGE, FBC_STATS, ROW_REMAPPER, RESET_STATUS, GSP_FIRMWARE_VERSION,
POWER_SMOOTHING, POWER_PROFILES. Flags can be combined with a comma, e.g.
"MEMORY,ECC". Sampling data with max, min and avg is also returned
for the POWER, UTILIZATION and CLOCK display types. Doesn't work with -u/--unit
or -x/--xml-format flags.
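For example (a sketch; display types not supported by the device are reported as "N/A"):
    # memory and ECC sections only, for GPU 0
    nvidia-smi -q -i 0 -d MEMORY,ECC
    # power sampling data (max/min/avg) alongside the power section
    nvidia-smi -q -d POWER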
-l SEC, --loop=SEC
Continuously report query data at the specified interval, rather
than the default of just once. The application will sleep in-between
queries. Note that on Linux ECC error or Xid error events will print out
during the sleep period if the -x flag was not specified. Pressing Ctrl+C at
any time will abort the loop, which will otherwise run indefinitely. If no
argument is specified for the -l form a default interval of 5 seconds
is used.
-lms ms, --loop-ms=ms
Same as -l,--loop but in milliseconds.
Allows the caller to pass an explicit list of properties to
query.
Information about GPU. Pass comma separated list of properties you
want to query. e.g. --query-gpu=pci.bus_id,persistence_mode. Call
--help-query-gpu for more info.
List of supported clocks. Call --help-query-supported-clocks for
more info.
List of currently active compute processes. Call
--help-query-compute-apps for more info.
List of accounted compute processes. Call
--help-query-accounted-apps for more info. This query is not supported on
vGPU host.
List of GPU device memory pages that have been retired. Call
--help-query-retired-pages for more info.
Information about remapped rows. Call --help-query-remapped-rows
for more info.
Comma separated list of format options:
- csv - comma separated values (MANDATORY)
- noheader - skip first line with column headers
- nounits - don't print units for numerical values
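For example, a minimal scripting-oriented sketch (only pci.bus_id and persistence_mode are taken from the text above; verify any additional field names with --help-query-gpu):
    nvidia-smi --query-gpu=pci.bus_id,persistence_mode --format=csv
    # same query every 5 seconds, without headers or units
    nvidia-smi --query-gpu=pci.bus_id,persistence_mode --format=csv,noheader,nounits -l 5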
-i, --id=ID
Display data for a single specified GPU. The specified id may be
the GPU's 0-based index in the natural enumeration returned by the driver,
the GPU's board serial number, the GPU's UUID, or the GPU's PCI bus ID (as
domain:bus:device.function in hex). It is recommended that users desiring
consistency use either UUID or PCI bus ID, since device enumeration ordering
is not guaranteed to be consistent between reboots and board serial number
might be shared between multiple GPUs on the same board.
-f FILE, --filename=FILE
Redirect query output to the specified file in place of the
default stdout. The specified file will be overwritten.
-l SEC, --loop=SEC
Continuously report query data at the specified interval, rather
than the default of just once. The application will sleep in-between
queries. Note that on Linux ECC error or Xid error events will print out
during the sleep period if the -x flag was not specified. Pressing Ctrl+C at
any time will abort the loop, which will otherwise run indefinitely. If no
argument is specified for the -l form a default interval of 5 seconds
is used.
-lms ms, --loop-ms=ms
Same as -l,--loop but in milliseconds.
Set the persistence mode for the target GPUs. See the (GPU
ATTRIBUTES) section for a description of persistence mode. Requires
root. Will impact all GPUs unless a single GPU is specified using the -i
argument. The effect of this operation is immediate. However, it does not
persist across reboots. After each reboot persistence mode will default to
"Disabled". Available on Linux only.
Set the ECC mode for the target GPUs. See the (GPU
ATTRIBUTES) section for a description of ECC mode. Requires root. Will
impact all GPUs unless a single GPU is specified using the -i argument. This
setting takes effect after the next reboot and is persistent.
Reset the ECC error counters for the target GPUs. See the (GPU
ATTRIBUTES) section for a description of ECC error counter types.
Available arguments are 0\|VOLATILE or 1\|AGGREGATE. Requires root. Will
impact all GPUs unless a single GPU is specified using the -i argument. The
effect of this operation is immediate. Clearing aggregate counts is not
supported on Ampere+.
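A minimal sketch, assuming the conventional -e/--ecc-config and -p/--reset-ecc-errors switches for the two options above:
    # enable ECC on GPU 0; takes effect after the next reboot (requires root)
    sudo nvidia-smi -i 0 -e 1
    # clear the volatile ECC error counters on GPU 0 (requires root)
    sudo nvidia-smi -i 0 -p 0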
Set the compute mode for the target GPUs. See the (GPU
ATTRIBUTES) section for a description of compute mode. Requires root.
Will impact all GPUs unless a single GPU is specified using the -i argument.
The effect of this operation is immediate. However, it does not persist
across reboots. After each reboot compute mode will reset to
"DEFAULT".
Enable or disable TCC driver model. For Windows only. Requires
administrator privileges. -dm will fail if a display is attached, but -fdm
will force the driver model to change. Will impact all GPUs unless a single
GPU is specified using the -i argument. A reboot is required for the change
to take place. See Driver Model for more information on Windows
driver models. An error message indicates that retrieving the field
failed.
Set GPU Operation Mode: 0/ALL_ON, 1/COMPUTE, 2/LOW_DP. Supported on
GK110 M-class and X-class Tesla products from the Kepler family. Not
supported on Quadro and Tesla C-class products. LOW_DP and ALL_ON are the
only modes supported on GeForce Titan devices. Requires administrator
privileges. See GPU Operation Mode for more information about GOM.
GOM changes take effect after reboot. The reboot requirement might be
removed in the future. Compute-only GOMs don't support WDDM (Windows Display
Driver Model).
Trigger a reset of one or more GPUs. Can be used to clear GPU HW
and SW state in situations that would otherwise require a machine reboot.
Typically useful if a double bit ECC error has occurred. Optional -i switch
can be used to target one or more specific devices. Without this option, all
GPUs are reset. Requires root. There can't be any applications using these
devices (e.g. CUDA application, graphics application like X server,
monitoring application like other instance of nvidia-smi). There also can't
be any compute applications running on any other GPU in the system if
individual GPU reset is not feasible.
Starting with the NVIDIA Ampere architecture, GPUs with NVLink
connections can be individually reset. On Ampere NVSwitch systems, Fabric
Manager is required to facilitate reset. On Hopper and later NVSwitch
systems, the dependency on Fabric Manager to facilitate reset is
removed.
If Fabric Manager is not running, or if any of the GPUs being
reset are based on an architecture preceding the NVIDIA Ampere architecture,
any GPUs with NVLink connections to a GPU being reset must also be reset in
the same command. This can be done either by omitting the -i switch, or
using the -i switch to specify the GPUs to be reset. If the -i option does
not specify a complete set of NVLink GPUs to reset, this command will issue
an error identifying the additional GPUs that must be included in the reset
command.
GPU reset is not guaranteed to work in all cases. It is not
recommended for production environments at this time. In some situations
there may be HW components on the board that fail to revert back to an
initial state following the reset request. This is more likely to be seen on
Fermi-generation products vs. Kepler, and more likely to be seen if the
reset is being performed on a hung GPU.
Following a reset, it is recommended that the health of each reset
GPU be verified before further use. If any GPU is not healthy a complete
reset should be instigated by power cycling the node.
GPU reset operation will not be supported on MIG enabled vGPU
guests.
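A minimal sketch using the -r/--gpu-reset form referenced elsewhere in this document (stop all processes using the device first):
    # reset GPU 0 only (requires root; NVLink peers may need to be included)
    sudo nvidia-smi -r -i 0
    # reset all GPUs in the system
    sudo nvidia-smi -r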
Visit http://developer.nvidia.com/gpu-deployment-kit to
download the GDK.
Switch GPU Virtualization Mode. Sets GPU virtualization mode to
3/VGPU or 4/VSGA. Virtualization mode of a GPU can only be set when it is
running on a hypervisor.
Specifies <minGpuClock,maxGpuClock> clocks as a pair (e.g.
1500,1500) that defines the closest desired locked GPU clock speed in MHz.
Input can also be a singular desired clock value (e.g. <GpuClockValue>).
Optionally, --mode can be supplied to specify the clock locking mode.
Supported on Volta+. Requires root.
- --mode=0 (Default)
- This mode is the default clock locking mode and provides the highest
possible frequency accuracy supported by the hardware.
- --mode=1
- The clock locking algorithm leverages closed-loop controllers to achieve
frequency accuracy with improved perf per watt for certain classes of
applications. Due to the convergence latency of closed-loop controllers, the
frequency accuracy may be slightly lower than in the default mode 0.
Specifies <minMemClock,maxMemClock> clocks as a pair (e.g.
5100,5100) that defines the range of desired locked Memory clock speed in
MHz. Input can also be a singular desired clock value (e.g.
<MemClockValue>).
Resets the GPU clocks to the default value. Supported on Volta+.
Requires root.
Resets the memory clocks to the default value. Supported on
Volta+. Requires root.
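A minimal sketch, assuming the conventional -lgc/-rgc and -lmc/-rmc switches for the lock/reset options above (the clock values are placeholders; see SUPPORTED_CLOCKS for valid values):
    # lock GPU clocks to 1500 MHz using the default locking mode (requires root, Volta+)
    sudo nvidia-smi -i 0 -lgc 1500,1500 --mode=0
    # lock memory clocks to a single value
    sudo nvidia-smi -i 0 -lmc 5100
    # restore default GPU and memory clocks
    sudo nvidia-smi -i 0 -rgc
    sudo nvidia-smi -i 0 -rmc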
Specifies maximum <memory,graphics> clocks as a pair (e.g.
2000,800) that defines GPU's speed while running applications on a GPU.
Supported on Maxwell-based GeForce and from the Kepler+ family in
Tesla/Quadro/Titan devices. Requires root.
Resets the applications clocks to the default value. Supported on
Maxwell-based GeForce and from the Kepler+ family in Tesla/Quadro/Titan
devices. Requires root.
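A minimal sketch using the -ac/-rac switches referenced later in this document (the memory,graphics pair is a placeholder; pick one reported under SUPPORTED_CLOCKS):
    # list the valid memory,graphics combinations
    nvidia-smi -q -d SUPPORTED_CLOCKS
    # set applications clocks to a supported pair (requires root)
    sudo nvidia-smi -i 0 -ac 2000,800
    # reset applications clocks to their defaults
    sudo nvidia-smi -i 0 -rac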
Specifies the memory clock that defines the closest desired Memory
Clock in MHz. The memory clock takes effect the next time the GPU is
initialized. This can be guaranteed by unloading and reloading the kernel
module. Requires root.
Resets the memory clock to default value. Driver unload and reload
is required for this to take effect. This can be done by unloading and
reloading the kernel module. Requires root.
Specifies maximum power limit in watts. Accepts integer and
floating point numbers. It takes an optional argument, --scope. Only on
supported devices from Kepler family. Requires administrator privileges.
Value needs to be between Min and Max Power Limit as reported by
nvidia-smi.
Specifies the scope of the power limit. The options are:
0/GPU: changes the power limit for the GPU only. 1/Module: changes the
power limit for the module containing multiple components, e.g. GPU and CPU.
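A minimal sketch using the -pl/--power-limit switch referenced later in this document (the wattage is a placeholder; it must fall between the reported Min and Max Power Limit):
    # check the allowed range first
    nvidia-smi -q -i 0 -d POWER
    # set a 250 W limit on GPU 0 (requires administrator privileges)
    sudo nvidia-smi -i 0 -pl 250
    # optionally scope the limit to the GPU only (scope 0)
    sudo nvidia-smi -i 0 -pl 250 --scope=0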
Overrides or restores default CUDA clocks. Available arguments are
0\|RESTORE_DEFAULT or 1\|OVERRIDE.
Enables or disables GPU Accounting. With GPU Accounting one can
keep track of usage of resources throughout lifespan of a single process.
Only on supported devices from Kepler family. Requires administrator
privileges. Available arguments are 0\|DISABLED or 1\|ENABLED.
Clears all processes accounted so far. Only on supported devices
from Kepler family. Requires administrator privileges.
Set the default auto boost policy to 0/DISABLED or 1/ENABLED,
enforcing the change only after the last boost client has exited. Only on
certain Tesla devices from the Kepler+ family and Maxwell-based GeForce
devices. Requires root.
Allow non-admin/root control over auto boost mode. Available
arguments are 0\|UNRESTRICTED, 1\|RESTRICTED. Only on certain Tesla devices
from the Kepler+ family and Maxwell-based GeForce devices. Requires
root.
Enables or disables Multi Instance GPU mode. Only supported on
devices based on the NVIDIA Ampere architecture. Requires root. Available
arguments are 0\|DISABLED or 1\|ENABLED.
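A minimal sketch, assuming the conventional -mig switch for this option:
    # enable MIG mode on GPU 0 (requires root, Ampere or newer)
    sudo nvidia-smi -i 0 -mig 1
    # disable it again
    sudo nvidia-smi -i 0 -mig 0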
Set the GPU Target Temperature for a GPU in degrees Celsius. Requires
administrator privileges. The target temperature should be within the limits
supported by the GPU. These limits can be retrieved by using the query option
with SUPPORTED_GPU_TARGET_TEMP.
-i, --id=ID
Modify a single specified GPU. The specified id may be the
GPU/Unit's 0-based index in the natural enumeration returned by the driver,
the GPU's board serial number, the GPU's UUID, or the GPU's PCI bus ID (as
domain:bus:device.function in hex). It is recommended that users desiring
consistency use either UUID or PCI bus ID, since device enumeration ordering
is not guaranteed to be consistent between reboots and board serial number
might be shared between multiple GPUs on the same board.
Return a non-zero error for warnings.
Set the LED indicator state on the front and back of the unit to
the specified color. See the (UNIT ATTRIBUTES) section for a
description of the LED states. Allowed colors are 0\|GREEN and 1\|AMBER.
Requires root.
-i, --id=ID
Modify a single specified Unit. The specified id is the Unit's
0-based index in the natural enumeration returned by the driver.
Display Device or Unit DTD.
-f FILE, --filename=FILE
Redirect query output to the specified file in place of the
default stdout. The specified file will be overwritten.
-u, --unit
Display Unit DTD instead of device DTD.
Display topology information about the system. Use
"nvidia-smi topo -h" for more information. Linux only. Shows all
GPUs NVML is able to detect but CPU and NUMA node affinity information will
only be shown for GPUs with Kepler or newer architectures. Note: GPU
enumeration is the same as NVML.
Display and modify the GPU drain states. A drain state is one in
which the GPU is no longer accepting new clients, and is used while
preparing to power down the GPU. Use "nvidia-smi drain -h" for
more information. Linux only.
Display nvlink information. Use "nvidia-smi nvlink -h"
for more information.
Query and control clocking behavior. Use "nvidia-smi clocks
--help" for more information.
Display information on GRID virtual GPUs. Use "nvidia-smi
vgpu -h" for more information.
Provides controls for MIG management. Use "nvidia-smi mig
-h" for more information.
Provides controls for boost sliders management. Use "nvidia-smi
boost-slider -h" for more information.
Provides queries for power hint. Use "nvidia-smi power-hint
-h" for more information.
Provides control and queries for confidential compute. Use
"nvidia-smi conf-compute -h" for more information.
Provides controls and information for power smoothing. Use
"nvidia-smi power-smoothing -h" for more information.
Provides controls and information for workload power profiles. Use
"nvidia-smi power-profiles -h" for more information.
Display Encoder Sessions information. Use "nvidia-smi
encodersessions -h" for more information.
The return code reflects whether the operation succeeded or failed and,
in case of failure, the reason for it.
- Return code 0 - Success
- Return code 2 - A supplied argument or flag is invalid
- Return code 3 - The requested operation is not available on target
device
- Return code 4 - The current user does not have permission to access this
device or perform this operation
- Return code 6 - A query to find an object was unsuccessful
- Return code 8 - A device's external power cables are not properly
attached
- Return code 9 - NVIDIA driver is not loaded
- Return code 10 - NVIDIA Kernel detected an interrupt issue with a GPU
- Return code 12 - NVML Shared Library couldn't be found or loaded
- Return code 13 - Local version of NVML doesn't implement this
function
- Return code 14 - infoROM is corrupted
- Return code 15 - The GPU has fallen off the bus or has otherwise become
inaccessible
- Return code 255 - Other error or internal driver error occurred
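The return code can be checked from a shell script, for example (a minimal sketch):
    nvidia-smi -L > /dev/null 2>&1
    rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "nvidia-smi failed with return code $rc"
    fi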
The following list describes all possible data returned by the
-q device query option. Unless otherwise noted all numerical results
are base 10 and unitless.
The current system timestamp at the time nvidia-smi was invoked.
Format is "Day-of-week Month Day HH:MM:SS Year".
Driver Version
The version of the installed NVIDIA display driver. This is an
alphanumeric string.
The version of the CUDA toolkit installed on the system. This is
an alphanumeric string.
Attached GPUs
The number of NVIDIA GPUs in the system.
Product Name
The official product name of the GPU. This is an alphanumeric
string. For all products.
The official brand of the GPU. This is an alphanumeric string. For
all products.
The official architecture name of the GPU. This is an alphanumeric
string. For all products.
This field is deprecated, and will be removed in a future
release.
A flag that indicates whether a physical display (e.g. monitor) is
currently connected to any of the GPU's connectors. "Yes"
indicates an attached display. "No" indicates otherwise.
A flag that indicates whether a display is initialized on the
GPU (e.g. memory is allocated on the device for display). A display can be
active even when no monitor is physically attached. "Enabled"
indicates an active display. "Disabled" indicates otherwise.
A flag that indicates whether persistence mode is enabled for the
GPU. Value is either "Enabled" or "Disabled". When
persistence mode is enabled the NVIDIA driver remains loaded even when no
active clients, such as X11 or nvidia-smi, exist. This minimizes the driver
load latency associated with running dependent apps, such as CUDA programs.
For all CUDA-capable products. Linux only.
A field that indicates which addressing mode is currently active.
The value is "ATS" or "HMM" or "None". When
the mode is "ATS", system allocated memory like malloc is
addressable from the GPU via Address Translation Services. This means there
is effectively a single set of page tables used by both the CPU and the GPU.
When the mode is "HMM", system allocated memory like malloc is
addressable from the GPU via software-based mirroring of the CPU's page
tables, on the GPU. When the mode is "None", neither ATS nor HMM
is active. Linux only.
MIG Mode configuration status
- Current
- MIG mode currently in use - NA/Enabled/Disabled
- Pending
- Pending configuration of MIG Mode - Enabled/Disabled
A flag that indicates whether accounting mode is enabled for the
GPU. Value is either "Enabled" or "Disabled". When
accounting is enabled statistics are calculated for each compute process
running on the GPU. Statistics can be queried during the lifetime or after
termination of the process. The execution time of a process is reported as 0
while the process is in the running state and is updated to the actual
execution time after the process has terminated. See --help-query-accounted-apps for more
info.
Returns the size of the circular buffer that holds the list of
processes that can be queried for accounting stats. This is the maximum
number of processes that accounting information will be stored for before
information about the oldest processes is overwritten by information about
new processes.
On Windows, the TCC and WDDM driver models are supported. The
driver model can be changed with the (-dm) or (-fdm) flags. The TCC driver
model is optimized for compute applications, i.e. kernel launch times will
be quicker with TCC. The WDDM driver model is designed for graphics
applications and is not recommended for compute applications. Linux does not
support multiple driver models, and will always have the value of
"N/A".
- Current
- The driver model currently in use. Always "N/A" on Linux.
- Pending
- The driver model that will be used on the next reboot. Always
"N/A" on Linux.
This number matches the serial number physically printed on each
board. It is a globally unique immutable alphanumeric value.
This value is the globally unique immutable alphanumeric
identifier of the GPU. It does not correspond to any physical label on the
board.
The minor number for the device is such that the Nvidia device
node file for each GPU will have the form /dev/nvidia[minor number].
Available only on Linux platform.
The BIOS of the GPU board.
Whether or not this GPU is part of a multiGPU board.
The unique board ID assigned by the driver. If two or more GPUs
have the same board ID and the above "MultiGPU" field is true then
the GPUs are on the same board.
The unique part number of the GPU's board.
The unique part number of the GPU.
The unique FRU part number of the GPU.
Platform Information is compute-tray-platform-specific information:
the GPU's positional index and platform-identifying
information.
Chassis Serial Number
Serial Number of the chassis containing this GPU.
Slot Number
The slot number in the chassis containing this GPU (includes
switches).
Tray Index
The tray index within the compute slots in the chassis containing
this GPU (does not include switches).
Host ID
Index of the node within the slot containing this GPU.
Peer Type
Platform indicated NVLink-peer type (e.g. switch present or
not).
Module Id
ID of this GPU within the node.
GPU Fabric GUID
Fabric ID for this GPU.
Version numbers for each object in the GPU board's inforom
storage. The inforom is a small, persistent store of configuration and state
data for the GPU. All inforom version fields are numerical. It can be useful
to know these version numbers because some GPU features are only available
with inforoms of a certain version or higher.
If any of the fields below returns Unknown Error, an additional Inforom
verification check is performed and an appropriate warning message is
displayed.
- Image
Version
- Global version of the infoROM image. Image version just like VBIOS version
uniquely describes the exact version of the infoROM flashed on the board
in contrast to infoROM object version which is only an indicator of
supported features.
- OEM Object
- Version for the OEM configuration data.
- ECC Object
- Version for the ECC recording data.
- Power Management
Object
- Version for the power management data.
- Inforom checksum
validation
- Inforom checksum validation ("valid", "invalid",
"N/A"). Only available via --query-gpu
inforom.checksum_validation.
Information about flushing of the blackbox data to the inforom
storage.
- Latest
Timestamp
- The timestamp of the latest flush of the BBX Object during the current
run.
- Latest
Duration
- The duration of the latest flush of the BBX Object during the current
run.
GOM allows one to reduce power usage and optimize GPU throughput
by disabling GPU features.
Each GOM is designed to meet specific user needs.
In "All On" mode everything is enabled and running at
full speed.
The "Compute" mode is designed for running only compute
tasks. Graphics operations are not allowed.
The "Low Double Precision" mode is designed for running
graphics applications that don't require high bandwidth double
precision.
GOM can be changed with the (--gom) flag.
Supported on GK110 M-class and X-class Tesla products from the
Kepler family. Not supported on Quadro and Tesla C-class products. Low
Double Precision and All On modes are the only modes available for supported
GeForce Titan products.
- Current
- The GOM currently in use.
- Pending
- The GOM that will be used on the next reboot.
Action to take to clear a fault that previously occurred. It is not
intended for determining which fault triggered the recovery action.
Possible values: None, Reset, Reboot, Drain P2P, Drain and Reset
None
No recovery action needed
Reset
Example scenario - Uncontained HBM/SRAM UCE
The GPU has encountered a fault that requires a reset to recover.
Terminate all GPU processes, reset the GPU using 'nvidia-smi -r', and the GPU
can be used again by starting new GPU processes.
Reboot
Example scenario - UVM fatal error
The GPU has encountered a fault that may have left the OS in an inconsistent state.
Reboot the operating system to restore it to a consistent state.
Node reboot required.
Application cannot restart without node reboot
OS warm reboot is sufficient (no need for AC/DC cycle)
Drain P2P
Example scenario - N/A
The GPU has encountered a fault that requires all peer-to-peer traffic to be
quiesced.
Terminate all GPU processes that conduct peer-to-peer traffic and disable UVM
persistence mode.
Disable job scheduling (no new jobs), stop all applications when convenient,
and, if persistence mode is enabled, disable it.
Once all peer-to-peer traffic is drained, query
NVML_FI_DEV_GET_GPU_RECOVERY_ACTION again, which will return one of the
other actions.
If it still returns DRAIN_P2P, reset the GPU.
Drain and Reset
Example scenario - Contained HBM UCE
Reset Recommended.
The GPU has encountered a fault that causes the GPU to temporarily operate at
reduced capacity, such as part of its frame buffer memory being offlined
or some of its MIG partitions being unavailable.
No new work should be scheduled on the GPU, but existing work that
was not affected is safe to continue until it finishes or reaches a
good checkpoint.
Safe to restart applications (memory capacity will be reduced due to dynamic
page offlining), but the GPU eventually needs to be reset (to get row remaps).
Asserted only for UCE row remaps.
After all existing work has drained, reset the GPU to regain its full
capacity.
Firmware version of GSP. This is an alphanumeric string.
Basic PCI info for the device. Some of this information may change
whenever cards are added/removed/moved in a system. For all products.
- Bus
- PCI bus number, in hex
- Device
- PCI device number, in hex
- Domain
- PCI domain number, in hex
- Base Classcode
- PCI Base classcode, in hex
- Sub Classcode
- PCI Sub classcode, in hex
- Device
Id
- PCI vendor device id, in hex
- Sub System
Id
- PCI Sub System id, in hex
- Bus Id
- PCI bus id as "domain:bus:device.function", in hex
The PCIe link generation and bus width
- Current
- The current link generation and width. These may be reduced when the GPU
is not in use.
- Max
- The maximum link generation and width possible with this GPU and system
configuration. For example, if the GPU supports a higher PCIe generation
than the system supports then this reports the system PCIe
generation.
Information related to Bridge Chip on the device. The bridge chip
firmware is only present on certain boards and may display "N/A"
for some newer multiGPU boards.
- Type
- The type of bridge chip. Reported as N/A if doesn't exist.
- Firmware
Version
- The firmware version of the bridge chip. Reported as N/A if doesn't
exist.
The number of PCIe replays since reset.
The number of PCIe replay number rollovers since reset. A replay
number rollover occurs after 4 consecutive replays and results in retraining
the link.
The GPU-centric transmission throughput across the PCIe bus in
MB/s over the past 20ms. Only supported on Maxwell architectures and
newer.
The GPU-centric receive throughput across the PCIe bus in MB/s
over the past 20ms. Only supported on Maxwell architectures and newer.
The PCIe atomic capabilities of outbound/inbound operations of the
GPU.
The fan speed value is the percent of the product's maximum noise
tolerance fan speed that the device's fan is currently intended to run at.
This value may exceed 100% in certain cases. Note: The reported speed is the
intended fan speed. If the fan is physically blocked and unable to spin,
this output will not match the actual fan speed. Many parts do not report
fan speeds because they rely on cooling via fans in the surrounding
enclosure. For all discrete products with dedicated fans.
The current performance state for the GPU. States range from P0
(maximum performance) to P12 (minimum performance).
Retrieves information about factors that are reducing the
frequency of clocks.
If all event reasons are returned as "Not Active" it
means that clocks are running as high as possible.
- Idle
- Nothing is running on the GPU and the clocks are dropping to Idle state.
This limiter may be removed in a later release.
- Application
Clocks Setting
- GPU clocks are limited by applications clocks setting. E.g. can be changed
using nvidia-smi --applications-clocks=
- SW Power Cap
- SW Power Scaling algorithm is reducing the clocks below requested clocks
because the GPU is consuming too much power. E.g. SW power cap limit can
be changed with nvidia-smi --power-limit=
- HW Slowdown
- HW Slowdown (reducing the core clocks by a factor of 2 or more) is
engaged. HW Thermal Slowdown and HW Power Brake will be displayed on
Pascal+.
This is an indicator of:
- Temperature being too high (HW Thermal Slowdown)
- External Power Brake Assertion being triggered (e.g. by the system power
supply) (HW Power Brake Slowdown)
- Power draw being too high and Fast Trigger protection reducing the
clocks
- SW Thermal
Slowdown
- SW Thermal capping algorithm is reducing clocks below requested clocks
because GPU temperature is higher than Max Operating Temp
Counters, in microseconds, for the amount of time factors have
been reducing the frequency of clocks
- SW Power
Capping
- Amount of time SW Power Scaling algorithm has reduced the clocks below
requested clocks because the GPU was consuming too much power.
- Sync Boost
Group
- Amount of time the clock frequency of this GPU was reduced to match the
minimum possible clock across the sync boost group.
- SW Thermal
Slowdown
- Amount of time SW Thermal capping algorithm has reduced clocks below
requested clocks because GPU temperature was higher than Max Operating
Temp.
- HW Thermal
Slowdown
- Amount of time HW Thermal Slowdown was engaged, reducing the core clocks
by a factor of 2 or more, due to temperature being too high.
- HW Power
Braking
- Amount of time External Power Brake Assertion was triggered (e.g. by the
system power supply).
A flag that indicates whether sparse operation mode is enabled for
the GPU. Value is either "Enabled" or "Disabled".
Reported as "N/A" if not supported.
On-board frame buffer memory information. Reported total memory
can be affected by ECC state. If ECC does affect the total available memory,
memory is decreased by several percent, due to the requisite parity bits.
The driver may also reserve a small amount of memory for internal use, even
without active work on the GPU. On systems where GPUs are NUMA nodes, the
accuracy of FB memory utilization provided by nvidia-smi depends on the
memory accounting of the operating system. This is because FB memory is
managed by the operating system instead of the NVIDIA GPU driver. Typically,
pages allocated from FB memory are not released even after the process
terminates to enhance performance. In scenarios where the operating system
is under memory pressure, it may resort to utilizing FB memory. Such actions
can result in discrepancies in the accuracy of memory reporting. For all
products.
- Total
- Total size of FB memory.
- Reserved
- Reserved size of FB memory.
- Used
- Used size of FB memory.
- Free
- Available size of FB memory.
BAR1 is used to map the FB (device memory) so that it can be
directly accessed by the CPU or by 3rd party devices (peer-to-peer on the
PCIe bus).
- Total
- Total size of BAR1 memory.
- Used
- Used size of BAR1 memory.
- Free
- Available size of BAR1 memory.
The compute mode flag indicates whether individual or multiple
compute applications may run on the GPU.
"Default" means multiple contexts are allowed per
device.
"Exclusive Process" means only one context is allowed
per device, usable from multiple threads at a time.
"Prohibited" means no contexts are allowed per device
(no compute apps).
"EXCLUSIVE_PROCESS" was added in CUDA 4.0. Prior CUDA
releases supported only one exclusive mode, which is equivalent to
"EXCLUSIVE_THREAD" in CUDA 4.0 and beyond.
For all CUDA-capable products.
Utilization rates report how busy each GPU is over time, and can
be used to determine how much an application is using the GPUs in the
system. Note: On MIG-enabled GPUs, querying the utilization of encoder,
decoder, jpeg, ofa, gpu, and memory is not currently supported.
Note: During driver initialization, when ECC is enabled, one can see
high GPU and Memory Utilization readings. This is caused by the ECC Memory
Scrubbing mechanism that is performed during driver initialization.
- GPU
- Percent of time over the past sample period during which one or more
kernels was executing on the GPU. The sample period may be between 1
second and 1/6 second depending on the product.
- Memory
- Percent of time over the past sample period during which global (device)
memory was being read or written. The sample period may be between 1
second and 1/6 second depending on the product.
- Encoder
- Percent of time over the past sample period during which the GPU's video
encoder was being used. The sampling rate is variable and can be obtained
directly via the nvmlDeviceGetEncoderUtilization() API
- Decoder
- Percent of time over the past sample period during which the GPU's video
decoder was being used. The sampling rate is variable and can be obtained
directly via the nvmlDeviceGetDecoderUtilization() API
- JPEG
- Percent of time over the past sample period during which the GPU's JPEG
decoder was being used. The sampling rate is variable and can be obtained
directly via the nvmlDeviceGetJpgUtilization() API
- OFA
- Percent of time over the past sample period during which the GPU's OFA
(Optical Flow Accelerator) was being used. The sampling rate is variable
and can be obtained directly via the nvmlDeviceGetOfaUtilization()
API
Encoder Stats report the count of active encoder sessions, along
with the average Frames Per Second (FPS) and average latency (in
microseconds) for all these active sessions on this device.
- Active
Sessions
- The total number of active encoder sessions on this device.
- Average
FPS
- The average Frames Per Second (FPS) of all active encoder sessions on this
device.
- Average
Latency
- The average latency in microseconds of all active encoder sessions on this
device.
A flag that indicates whether DRAM Encryption support is enabled.
May be either "Enabled" or "Disabled". Changes to DRAM
Encryption mode require a reboot. Requires Inforom ECC object.
- Current
- The DRAM Encryption mode that the GPU is currently operating under.
- Pending
- The DRAM Encryption mode that the GPU will operate under after the next
reboot.
A flag that indicates whether ECC support is enabled. May be
either "Enabled" or "Disabled". Changes to ECC mode
require a reboot. Requires Inforom ECC object version 1.0 or higher.
- Current
- The ECC mode that the GPU is currently operating under.
- Pending
- The ECC mode that the GPU will operate under after the next reboot.
NVIDIA GPUs can provide error counts for various types of ECC
errors. Some ECC errors are either single or double bit, where single bit
errors are corrected and double bit errors are uncorrectable. Texture memory
errors may be correctable via resend or uncorrectable if the resend fails.
These errors are available across two timescales (volatile and aggregate).
Single bit ECC errors are automatically corrected by the HW and do not
result in data corruption. Double bit errors are detected but not corrected.
Please see the ECC documents on the web for information on compute
application behavior when double bit errors occur. Volatile error counters
track the number of errors detected since the last driver load. Aggregate
error counts persist indefinitely and thus act as a lifetime counter.
A note about volatile counts: On Windows this is once per boot. On
Linux this can be more frequent. On Linux the driver unloads when no active
clients exist. Hence, if persistence mode is enabled or there is always a
driver client active (e.g. X11), then Linux also sees per-boot behavior. If
not, volatile counts are reset each time a compute app is run.
Pre-Volta Tesla and Quadro products can display total ECC error
counts, as well as a breakdown of errors based on location on the chip. The
locations are described below. Location-based data for aggregate error
counts requires Inforom ECC object version 2.0. All other ECC counts require
ECC object version 1.0.
- Device
Memory
- Errors detected in global device memory.
- Register
File
- Errors detected in register file memory.
- L1 Cache
- Errors detected in the L1 cache.
- L2 Cache
- Errors detected in the L2 cache.
- Texture
Memory
- Parity errors detected in texture memory.
- Total
- Total errors detected across entire chip. Sum of Device Memory,
Register File, L1 Cache, L2 Cache and Texture
Memory.
On Turing, the output is as follows:
- SRAM
Correctable
- Number of correctable errors detected in any of the SRAMs
- SRAM
Uncorrectable
- Number of uncorrectable errors detected in any of the SRAMs
- DRAM
Correctable
- Number of correctable errors detected in the DRAM
- DRAM
Uncorrectable
- Number of uncorrectable errors detected in the DRAM
On Ampere+, the categorization of SRAM errors has been expanded
upon. SRAM errors are now categorized as either parity or SEC-DED (single
error correctable/double error detectable) depending on which unit hit the
error. A histogram has been added that categorizes what unit hit the SRAM
error. Additionally a flag has been added that indicates if the threshold
for the specific SRAM has been exceeded.
- SRAM Uncorrectable
Parity
- Number of uncorrectable errors detected in SRAMs that are parity
protected
- SRAM Uncorrectable
SEC-DED
- Number of uncorrectable errors detected in SRAMs that are SEC-DED
protected
- Aggregate
Uncorrectable SRAM Sources
- SRAM L2
- Errors that occurred in the L2 cache
- SRAM SM
- Errors that occurred in the SM
- SRAM
Microcontroller
- Errors that occurred in a microcontroller (PMU/GSP etc...)
- SRAM PCIE
- Errors that occurred in any PCIE related unit
- SRAM Other
- Errors occurring in anything else not covered above
NVIDIA GPUs can retire pages of GPU device memory when they become
unreliable. This can happen when multiple single bit ECC errors occur for
the same page, or on a double bit ECC error. When a page is retired, the
NVIDIA driver will hide it such that no driver or application memory
allocations can access it.
Double Bit ECC The number of GPU device memory pages that
have been retired due to a double bit ECC error.
Single Bit ECC The number of GPU device memory pages that
have been retired due to multiple single bit ECC errors.
Pending Checks if any GPU device memory pages are pending
blacklist on the next reboot. Pages that are retired but not yet blacklisted
can still be allocated, and may cause further reliability issues.
NVIDIA GPUs can remap rows of GPU device memory when they become
unreliable. This can happen when a single uncorrectable ECC error or
multiple correctable ECC errors occur on the same row. When a row is
remapped, the NVIDIA driver will remap the faulty row to a reserved row. All
future accesses to the row will access the reserved row instead of the
faulty row. This feature is available on Ampere+
Correctable Error The number of rows that have been
remapped due to correctable ECC errors.
Uncorrectable Error The number of rows that have been
remapped due to uncorrectable ECC errors.
Pending Indicates whether or not a row is pending
remapping. The GPU must be reset for the remapping to go into effect.
Remapping Failure Occurred Indicates whether or not a row
remapping has failed in the past.
Bank Remap Availability Histogram Each memory bank has a
fixed number of reserved rows that can be used for row remapping. The
histogram will classify the remap availability of each bank into Maximum,
High, Partial, Low and None. Maximum availability means that all reserved
rows are available for remapping while None means that no reserved rows are
available. Correctable row remappings don't count towards the availability
histogram, since row remappings due to correctable errors can be
evicted by an uncorrectable row remapping.
Readings from temperature sensors on the board. All readings are
in degrees C. Not all products support all reading types. In particular,
products in module form factors that rely on case fans or passive cooling do
not usually provide temperature readings. See below for restrictions.
T.Limit: The T.Limit sensor measures the current margin in degrees
Celsius to the maximum operating temperature. As such, it is not an absolute
temperature reading but rather a relative measurement.
Not all products support T.Limit sensor readings.
When supported, nvidia-smi reports the current T.Limit temperature
as a signed value that counts down. A T.Limit temperature of 0 C or lower
indicates that the GPU may optimize its clock based on thermal conditions.
Further, when the T.Limit sensor is supported, available temperature
thresholds are also reported relative to T.Limit (see below) instead of
absolute measurements.
- GPU
- Core GPU temperature. For all discrete and S-class products.
- T.Limit Temp
- Current margin in degrees Celsius from the maximum GPU operating
temperature.
- Shutdown
Temp
- The temperature at which a GPU will shutdown.
- Shutdown
T.Limit Temp
- The T.Limit temperature below which a GPU may shutdown. Since shutdown can
only be triggered by the maximum GPU temperature, it is possible for the
current T.Limit to be more negative than this threshold.
- Slowdown
Temp
- The temperature at which a GPU HW will begin optimizing clocks due to
thermal conditions, in order to cool.
- Slowdown
T.Limit Temp
- The T.Limit temperature below which a GPU HW may optimize its clocks for
thermal conditions. Since this clock adjustment can only be triggered by the
maximum GPU temperature, it is possible for the current T.Limit to be more
negative than this threshold.
- Max Operating
Temp
- The temperature at which GPU SW will optimize its clock for thermal
conditions.
- Max Operating T.Limit
Temp
- The T.Limit temperature below which GPU SW will optimize its clock for
thermal conditions.
Power readings help to shed light on the current power usage of
the GPU, and the factors that affect that usage. When power management is
enabled the GPU limits power draw under load to fit within a predefined
power envelope by manipulating the current performance state. See below for
limits of availability. Please note that power readings are not applicable
for Pascal and higher GPUs with BA sensor boards.
- Power
State
- Power State is deprecated and has been renamed to Performance State in
2.285. To maintain XML compatibility, in XML format Performance State is
listed in both places.
- Power
Management
- A flag that indicates whether power management is enabled. Either
"Supported" or "N/A". Requires Inforom PWR object
version 3.0 or higher or Kepler device.
- Instantaneous
Power Draw
- The last measured power draw for the entire board, in watts. Only
available if power management is supported.
- Average Power
Draw
- The average power draw for the entire board for the last second, in watts.
Only supported on Ampere (except GA100) or newer devices. Only available
if power management is supported.
- Power
Limit
- The software power limit, in watts. Set by software such as nvidia-smi.
Only available if power management is supported. Requires Inforom PWR
object version 3.0 or higher or Kepler device. On Kepler devices Power
Limit can be adjusted using -pl,--power-limit= switches.
- Enforced Power
Limit
- The power management algorithm's power ceiling, in watts. Total board
power draw is manipulated by the power management algorithm such that it
stays under this value. This limit is the minimum of various limits such
as the software limit listed above. Only available if power management is
supported. Requires a Kepler device. Please note that for boards without
INA sensors, it is the GPU power draw that is being manipulated.
- Default Power
Limit
- The default power management algorithm's power ceiling, in watts. Power
Limit will be set back to Default Power Limit after driver unload. Only on
supported devices from Kepler family.
- Min Power Limit
- The minimum value in watts that power limit can be set to. Only on
supported devices from Kepler family.
- Max Power
Limit
- The maximum value in watts that power limit can be set to. Only on
supported devices from Kepler family.
Power Smoothing
Power Smoothing related definitions and currently set values. This
feature allows users to tune power parameters to minimize power fluctuations
in large datacenter environments.
- Enabled
- Value is "Yes" if the feature is enabled and "No" if
the feature is not enabled.
- Privilege
Level
- The current privilege for the user. Value is 0, 1 or 2. Note that the
higher the privilege level, the more information the user will have access
to.
- Immediate Ramp
Down
- Values are "Enabled" or "Disabled". Indicates if ramp
down hysteresis value will be honored (when enabled) or ignored (when
disabled).
- Current
TMP
- The last read value of the Total Module Power, in watts.
- Current TMP
Floor
- The last read value of the Total Module Power floor, in watts. This value
is calculated as TMP Ceiling * (% TMP Floor value).
- Max % TMP
Floor
- The highest percentage value for which the Percent TMP Floor can be
set.
- Min % TMP
Floor
- The lowest percentage value for which the Percent TMP Floor can be
set.
- HW Lifetime %
Remaining
- As this feature is used, the circuitry which drives the feature wears
down. This value gives the percentage of the remaining lifetime of this
hardware.
- Number of Preset
Profiles
- This value is the total number of Preset Profiles supported.
Values for the currently active power smoothing preset
profile.
- **% TMP Floor**
- The percentage of the TMP Ceiling, which is used to set the TMP floor, for
the currently active preset profile. For example, if max TMP is 1000 W,
and the % TMP floor is 50%, then the min TMP value will be 500 W. This
value is in the range [Min % TMP Floor, Max % TMP Floor].
- Ramp Up Rate
- The ramp up rate, measured in mW/s, for the currently active preset
profile.
- Ramp Down
Rate
- The ramp down rate, measured in mW/s, for the currently active preset
profile.
- Ramp Down
Hysteresis
- The ramp down hysteresis value, in ms, for the currently active preset
profile.
- Active Preset
Profile Number
- The number of the active preset profile.
Admin overrides allow users with sufficient permissions to preempt
the values of the currently active preset profile. If an admin override is
set for one of the fields, then this value will be used instead of any other
configured value.
- **% TMP Floor**
- The admin override value for % TMP Floor. This value is in the range [Min
% TMP Floor, Max % TMP Floor].
- Ramp Up
Rate
- The admin override value for ramp up rate, measured in mW/s.
- Ramp Down
Rate
- The admin override value for ramp down rate, measured in mW/s.
- Ramp Down
Hysteresis
- The admin override value for ramp down hysteresis value, in ms.
Pre-tuned GPU profiles help to provide immediate, optimized
configurations for datacenter use cases. This section includes information
about the currently requested and enforced power profiles.
- Requested
Profiles
- The list of user requested profiles.
- Enforced
Profiles
- Since many of the profiles have conflicting goals, some configurations of
requested profiles are incompatible. This is the list of the requested
profiles which are currently enforced.
Current frequency at which parts of the GPU are running. All
readings are in MHz. Note that it is possible for clocks to report a lower
frequency than the lowest frequency that can be set by SW, due to HW
optimizations in certain scenarios.
- Graphics
- Current frequency of graphics (shader) clock.
- SM
- Current frequency of SM (Streaming Multiprocessor) clock.
- Memory
- Current frequency of memory clock.
- Video
- Current frequency of video (encoder + decoder) clocks.
User-specified frequency at which applications will run.
Can be changed with the [-ac \| --applications-clocks] switches.
- Graphics
- User specified frequency of graphics (shader) clock.
- Memory
- User specified frequency of memory clock.
Default frequency at which applications will run.
Application clocks can be changed with [-ac \| --applications-clocks]
switches. Application clocks can be set to default using [-rac \|
--reset-applications-clocks] switches.
- Graphics
- Default frequency of applications graphics (shader) clock.
- Memory
- Default frequency of applications memory clock.
Maximum frequency at which parts of the GPU are designed to run. All
readings are in MHz.
On GPUs from the Fermi family, current P0 clocks (reported in the Clocks
section) can differ from max clocks by a few MHz.
- Graphics
- Maximum frequency of graphics (shader) clock.
- SM
- Maximum frequency of SM (Streaming Multiprocessor) clock.
- Memory
- Maximum frequency of memory clock.
- Video
- Maximum frequency of video (encoder + decoder) clock.
User-specified settings for automated clocking changes such as
auto boost.
- Auto Boost
- Indicates whether auto boost mode is currently enabled for this GPU (On)
or disabled for this GPU (Off). Shows (N/A) if boost is not supported.
Auto boost allows dynamic GPU clocking based on power, thermal and
utilization. When auto boost is disabled the GPU will attempt to maintain
clocks at precisely the Current Application Clocks settings (whenever a
CUDA context is active). With auto boost enabled the GPU will still
attempt to maintain this floor, but will opportunistically boost to higher
clocks when power, thermal and utilization headroom allow. This setting
persists for the life of the CUDA context for which it was requested. Apps
can request a particular mode either via an NVML call (see NVML SDK) or by
setting the CUDA environment variable CUDA_AUTO_BOOST.
- Auto Boost
Default
- Indicates the default setting for auto boost mode, either enabled (On) or
disabled (Off). Shows (N/A) if boost is not supported. Apps will run in
the default mode if they have not explicitly requested a particular mode.
Note: Auto Boost settings can only be modified if "Persistence
Mode" is enabled, which it is not by default.
List of possible memory and graphics clocks combinations that the
GPU can operate on (not taking into account HW brake reduced clocks). These
are the only clock combinations that can be passed to --applications-clocks
flag. Supported Clocks are listed only when -q -d SUPPORTED_CLOCKS switches
are provided or in XML format.
Current voltage reported by the GPU. All units are in mV.
- Graphics
- Current voltage of the graphics unit. This field is deprecated and always
displays "N/A". Voltage will be removed in a future
release.
GPU Fabric information
State
Indicates the state of the GPU's handshake with the
nvidia-fabricmanager (a.k.a. GPU fabric probe)
Possible values: Completed, In Progress, Not Started, Not supported
Status
Status of the GPU fabric probe response from the
nvidia-fabricmanager.
Possible values: NVML_SUCCESS or one of the failure codes.
Clique ID
A clique is a set of GPUs that can communicate to each other over
NVLink.
The GPUs belonging to the same clique share the same clique ID.
Clique ID will only be valid for NVLink multi-node systems.
Cluster UUID
UUID of an NVLink multi-node cluster to which this GPU belongs.
Cluster UUID will be zero for NVLink single-node systems.
Health
Bandwidth - whether the GPU NVLink bandwidth is degraded
<True/False>
Route Recovery in progress - whether NVLink route recovery is in progress
<True/False>
Route Unhealthy - whether NVLink route recovery failed or was aborted
<True/False>
Access Timeout Recovery - whether NVLink access timeout recovery is in progress
<True/False>
List of processes having Compute or Graphics Context on the
device. Compute processes are reported on all the fully supported products.
Reporting for Graphics processes is limited to the supported products
starting with Kepler architecture.
- Each Entry is of format
"<GPU Index> <PID> <Type> <Process Name>
<GPU Memory Usage>"
- GPU Index
- Represents NVML Index of the device.
- GPU Instance
Index
- Represents GPU Instance Index of the MIG device (if enabled).
- Compute Instance
Index
- Represents Compute Instance Index of the MIG device (if enabled).
- PID
- Represents Process ID corresponding to the active Compute or Graphics
context.
- Type
- Displayed as "C" for Compute Process, "G" for Graphics
Process, "M" for MPS ("Multi-Process Service") Compute
Process, and "C+G" or "M+C" for the process having
both Compute and Graphics or MPS Compute and Compute contexts.
- Process
Name
- Represents process name for the Compute or Graphics process.
- GPU Memory
Usage
- Amount of memory used on the device by the context. Not available on
Windows when running in WDDM mode, because the Windows KMD manages all the
memory, not the NVIDIA driver.
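The same per-process information can be pulled in CSV form via the query interface documented above (a sketch; confirm field names with --help-query-compute-apps):
    nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv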
The "nvidia-smi dmon" command-line is used to monitor
one or more GPUs (up to 16 devices) plugged into the system. This tool
allows the user to see one line of monitoring data per monitoring cycle. The
output is in concise format and easy to interpret in interactive mode. The
output data per line is limited by the terminal size. It is supported on
Tesla, GRID, Quadro and limited GeForce products for Kepler or newer GPUs
under bare metal 64 bits Linux. By default, the monitoring data includes
Power Usage, Temperature, SM clocks, Memory clocks and Utilization values
for SM, Memory, Encoder, Decoder, JPEG and OFA. It can also be configured to
report other metrics such as frame buffer memory usage, bar1 memory usage,
power/thermal violations and aggregate single/double bit ECC errors. If any
of the metrics is not supported on the device, or any other error occurs while
fetching a metric, it is reported as "-" in the output data. The user can
also configure the monitoring frequency and the number of monitoring iterations
for each run. There is also an option to include date and time on each line.
All the supported options are exclusive and can be used together in any
order. Note: On MIG-enabled GPUs, querying the utilization of encoder,
decoder, jpeg, ofa, gpu, and memory is not currently supported.
Usage:
- 1) Default with no arguments
- nvidia-smi
dmon
- Monitors default
metrics for up to 16 supported devices under natural enumeration (starting
with GPU index 0) at a frequency of 1 sec. Runs until terminated with
^C.
- 2) Select one or more devices
- nvidia-smi
dmon -i <device1,device2, .. , deviceN>
- Reports default metrics
for the devices selected by comma separated device list. The tool picks up
to 16 supported devices from the list under natural enumeration (starting
with GPU index 0).
- 3) Select metrics to be displayed
- nvidia-smi
dmon -s <metric_group>
- <metric_group> can be one or more from the following:
- p - Power Usage (in Watts) and
GPU/Memory Temperature (in C) if supported
- u - Utilization (SM, Memory,
Encoder, Decoder, JPEG and OFA Utilization in %)
- c - Proc and Mem Clocks (in
MHz)
- v - Power Violations (in %) and
Thermal Violations (as a boolean flag)
- m - Frame Buffer, Bar1 and
Confidential Compute protected memory usage (in MB)
- e - ECC (Number of aggregated single
bit, double bit ecc errors) and PCIe Replay errors
- t - PCIe Rx and Tx Throughput in
MB/s (Maxwell and above)
- 4) Configure monitoring iterations
- nvidia-smi
dmon -c <number of samples>
- Displays data for
the specified number of samples and exits.
- 5) Configure monitoring frequency
- nvidia-smi
dmon -d <time in secs>
- Collects and displays
data at every specified monitoring interval until terminated with
^C.
- 6) Display date
- nvidia-smi
dmon -o D
- Prepends monitoring
data with date in YYYYMMDD format.
- 7) Display time
- nvidia-smi
dmon -o T
- Prepends
monitoring data with time in HH:MM:SS format.
- 8) Select GPM metrics to be displayed
- nvidia-smi
dmon --gpm-metrics <gpmMetric1,gpmMetric2,...,gpmMetricN>
- <gpmMetricX> Refer to the documentation for nvmlGpmMetricId_t in the
NVML header file
- 9) Select which level of GPM metrics to be displayed
- nvidia-smi
dmon --gpm-options <gpmMode>
- <gpmMode> can be one of the following:
- d - Display Device Level GPM
metrics
- m - Display MIG Level GPM
metrics
- dm - Display Device and MIG Level
GPM metrics
- md - Display Device and MIG Level
GPM metrics, same as 'dm'
- 10) Modify output format
- nvidia-smi
dmon --format <formatSpecifier>
- <formatSpecifier> can be any comma separated combination of the
following:
- csv - Format dmon output as
CSV
- nounit - Remove unit line
from dmon output
- 11) Help Information
- nvidia-smi
dmon -h
- Displays help
information for using the command line.
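For illustration only, the documented dmon options can be combined in a single invocation; the device indexes, metric letters, interval and sample count below are placeholders, and the date and time selectors are assumed to be combinable as "DT":
# Monitor power/temperature, utilization, clocks and memory for GPUs 0 and 1,
# sampling every 5 seconds for 12 samples, with date and time prepended
nvidia-smi dmon -i 0,1 -s pucm -d 5 -c 12 -o DT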
The "nvidia-smi daemon" starts a background process to
monitor one or more GPUs plugged in to the system. It monitors the requested
GPUs every monitoring cycle and logs the file in compressed format at the
user provided path or the default location at /var/log/nvstats/. The log
file is created with the system's date appended to it, in the format
nvstats-YYYYMMDD. The flush operation to the log file is done every
alternate monitoring cycle. The daemon also logs its own PID at
/var/run/nvsmi.pid. By default, the monitoring data to persist includes
Power Usage, Temperature, SM clocks, Memory clocks and Utilization values
for SM, Memory, Encoder, Decoder, JPEG and OFA. The daemon tools can also be
configured to record other metrics such as frame buffer memory usage, bar1
memory usage, power/thermal violations and aggregate single/double bit ecc
errors. The default monitoring cycle is set to 10 secs and can be configured
via command-line. It is supported on Tesla, GRID, Quadro and GeForce
products for Kepler or newer GPUs under bare metal 64 bits Linux. The daemon
requires root privileges to run, and only supports running a single instance
on the system. All of the supported options are exclusive and can be used
together in any order. Note: On MIG-enabled GPUs, querying the utilization
of encoder, decoder, jpeg, ofa, gpu, and memory is not currently supported.
Usage:
- 1) Default with no arguments
- nvidia-smi
daemon
- Runs in the background to
monitor default metrics for up to 16 supported devices under natural
enumeration (starting with GPU index 0) at a frequency of 10 sec. The date
stamped log file is created at /var/log/nvstats/.
- 2) Select one or more devices
- nvidia-smi
daemon -i <device1,device2, .. , deviceN>
- Runs in the background to
monitor default metrics for the devices selected by comma separated device
list. The tool picks up to 16 supported devices from the list under natural
enumeration (starting with GPU index 0).
- 3) Select metrics to be monitored
- nvidia-smi
daemon -s <metric_group>
- <metric_group> can be one or more from the following:
- p - Power Usage (in Watts) and
GPU/Memory Temperature (in C) if supported
- u - Utilization (SM, Memory,
Encoder, Decoder, JPEG and OFA Utilization in %)
- c - Proc and Mem Clocks (in
MHz)
- v - Power Violations (in %) and
Thermal Violations (as a boolean flag)
- m - Frame Buffer, Bar1 and
Confidential Compute protected memory usage (in MB)
- e - ECC (Number of aggregated
single bit, double bit ecc errors) and PCIe Replay errors
- t - PCIe Rx and Tx Throughput in
MB/s (Maxwell and above)
- 4) Configure monitoring frequency
- nvidia-smi
daemon -d <time in secs>
- Collects data at
every specified monitoring interval until terminated.
- 5) Configure log directory
- nvidia-smi
daemon -p <path of directory>
- The log files are created at the
specified directory.
- 6) Configure log file name
- nvidia-smi
daemon -j <string to append log file name>
- The command-line appends the
user provided string to the log file name.
- 7) Terminate the daemon
- nvidia-smi
daemon -t
- This command-line uses the
stored PID (at /var/run/nvsmi.pid) to terminate the daemon. It makes the
best effort to stop the daemon and offers no guarantees for its
termination. In case the daemon is not terminated, the user can
manually terminate it by sending a kill signal to the daemon. Performing a GPU
reset operation (via nvidia-smi) requires all GPU processes to be exited,
including the daemon. Users who have the daemon open will see an error to
the effect that the GPU is busy.
- 8) Help Information
- nvidia-smi
daemon -h
- Displays help
information for using the command line.
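For illustration only, a possible daemon invocation combining the documented options (must be run as root; the device index, metric letters, interval, log directory and file-name suffix are placeholders):
# Log power/temperature and utilization for GPU 0 every 30 seconds under
# /tmp/nvstats, appending "test" to the log file name
nvidia-smi daemon -i 0 -s pu -d 30 -p /tmp/nvstats -j test
# Later, stop the daemon using the PID stored at /var/run/nvsmi.pid
nvidia-smi daemon -t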
The "nvidia-smi replay" command-line is used to
extract/replay all or parts of the log file generated by the daemon. By default,
the tool tries to pull the metrics such as Power Usage, Temperature, SM
clocks, Memory clocks and Utilization values for SM, Memory, Encoder,
Decoder, JPEG and OFA. The replay tool can also fetch other metrics such as
frame buffer memory usage, bar1 memory usage, power/thermal violations and
aggregate single/double bit ecc errors. There is an option to select a set
of metrics to replay. If any of the requested metrics is not maintained or
is logged as not-supported, then it is shown as "-" in the output. The
format of data produced by this mode is the same as if the user were running
the device monitoring utility interactively. The command line requires the
mandatory option "-f" to specify the complete path of the log filename; all the
other supported options are exclusive and can be used together in any order.
Note: On MIG-enabled GPUs, querying the utilization of encoder, decoder,
jpeg, ofa, gpu, and memory is not currently supported. Usage:
- 1) Specify log file to be replayed
- nvidia-smi
replay -f <log file name>
- Fetches monitoring data
from the compressed log file and allows the user to see one line of
monitoring data (default metrics with time-stamp) for each monitoring
iteration stored in the log file. A new line of monitoring data is replayed
every other second irrespective of the actual monitoring frequency
maintained at the time of collection. It is displayed till the end of the file
or until terminated by ^C.
- 2) Filter metrics to be replayed
- nvidia-smi
replay -f <path to log file> -s <metric_group>
- <metric_group> can be one or more from the following:
- p - Power Usage (in Watts) and
GPU/Memory Temperature (in C) if supported
- u - Utilization (SM, Memory,
Encoder, Decoder, JPEG and OFA Utilization in %)
- c - Proc and Mem Clocks (in
MHz)
- v - Power Violations (in %) and
Thermal Violations (as a boolean flag)
- m - Frame Buffer, Bar1 and
Confidential Compute protected memory usage (in MB)
- e - ECC (Number of aggregated
single bit, double bit ecc errors) and PCIe Replay errors
- t - PCIe Rx and Tx Throughput in
MB/s (Maxwell and above)
- 3) Limit replay to one or more devices
- nvidia-smi
replay -f <log file> -i <device1,device2, .. ,
deviceN>
- Limits reporting of the
metrics to the set of devices selected by comma separated device list. The
tool skips any of the devices not maintained in the log file.
- 4) Restrict the time frame between which data is reported
- nvidia-smi
replay -f <log file> -b <start time in HH:MM:SS format> -e
<end time in HH:MM:SS format>
- This option allows the
data to be limited to the specified time range. Specifying the time as 0
with the -b or -e option implies the start or end of the file, respectively.
- 5) Redirect replay information to a log file
- nvidia-smi
replay -f <log file> -r <output file name>
- This option takes log file
as an input and extracts the information related to default metrics in the
specified output file.
- 6) Help Information
- nvidia-smi
replay -h
- Displays help
information for using the command line.
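For illustration only, possible replay invocations using the documented options (the log file name, metric letters, device index, time window and output file are placeholders):
# Replay power/temperature and utilization for GPU 0, limited to the
# 09:00:00 to 09:30:00 window of the log
nvidia-smi replay -f /var/log/nvstats/nvstats-YYYYMMDD -s pu -i 0 -b 09:00:00 -e 09:30:00
# Extract the default metrics from the same log file into replay.out
nvidia-smi replay -f /var/log/nvstats/nvstats-YYYYMMDD -r replay.out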
The "nvidia-smi pmon" command-line is used to monitor
compute and graphics processes running on one or more GPUs (up to 16
devices) plugged into the system. This tool allows the user to see the
statistics for all the running processes on each device at every monitoring
cycle. The output is in concise format and easy to interpret in interactive
mode. The output data per line is limited by the terminal size. It is
supported on Tesla, GRID, Quadro and limited GeForce products for Kepler or
newer GPUs under bare metal 64 bits Linux. By default, the monitoring data
for each process includes the pid, command name and average utilization
values for SM, Memory, Encoder and Decoder since the last monitoring cycle.
It can also be configured to report frame buffer memory usage for each
process. If there is no process running for the device, then all the metrics
are reported as "-" for the device. If any of the metrics is not
supported on the device, or any other error occurs while fetching it, it is also
reported as "-" in the output data. The user can also configure
monitoring frequency and the number of monitoring iterations for each run.
There is also an option to include date and time at each line. All the
supported options are exclusive and can be used together in any order. Note:
On MIG-enabled GPUs, querying the utilization of encoder, decoder, jpeg,
ofa, gpu, and memory is not currently supported.
Usage:
- 1) Default with no arguments
- nvidia-smi
pmon
- Monitors all the
processes running on each device for up to 16 supported devices under
natural enumeration (starting with GPU index 0) at a frequency of 1 sec.
Runs until terminated with ^C.
- 2) Select one or more devices
- nvidia-smi
pmon -i <device1,device2, .. , deviceN>
- Reports statistics
for all the processes running on the devices selected by comma separated
device list. The tool picks up to 16 supported devices from the list under
natural enumeration (starting with GPU index 0).
- 3) Select metrics to be displayed
- nvidia-smi
pmon -s <metric_group>
- <metric_group> can be one or more from the following:
- u - Utilization (SM, Memory,
Encoder, Decoder, JPEG, and OFA Utilization for the process in %). Reports
average utilization since last monitoring cycle.
- m - Frame Buffer and
Confidential Compute protected memory usage (in MB). Reports instantaneous
value for memory usage.
- 4) Configure monitoring iterations
- nvidia-smi
pmon -c <number of samples>
- Displays data for
the specified number of samples and exits.
- 5) Configure monitoring frequency
- nvidia-smi
pmon -d <time in secs>
- Collects and
displays data at every specified monitoring interval until terminated with
^C. The monitoring frequency must be between 1 and 10 secs.
- 6) Display date
- nvidia-smi
pmon -o D
- Prepends
monitoring data with date in YYYYMMDD format.
- 7) Display time
- nvidia-smi
pmon -o T
- Prepends
monitoring data with time in HH:MM:SS format.
- 8) Help Information
- nvidia-smi
pmon -h
- Displays help
information for using the command line.
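For illustration only, a possible pmon invocation combining the documented options (the device index, metric letters, interval and sample count are placeholders):
# Report per-process utilization and frame buffer usage on GPU 0,
# sampling every 5 seconds for 20 samples, with date and time prepended
nvidia-smi pmon -i 0 -s um -d 5 -c 20 -o DT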
The "nvidia-smi nvlink" command-line is used to manage
the GPU's Nvlinks. It provides options to set and query Nvlink
information.
Usage:
- 1) Display help menu
- nvidia-smi
nvlink -h
- Displays help menu
for using the command-line.
- 2) List one or more GPUs
- nvidia-smi
nvlink -i <GPU IDs>
- nvidia-smi
nvlink --id <GPU IDs>
- Selects one or more GPUs
using the given comma-separated GPU indexes, PCI bus IDs or UUIDs. If not
used, the given command-line option applies to all of the supported
GPUs.
- 3) Select a specific NvLink
- nvidia-smi
nvlink -l <GPU Nvlink Id>
- nvidia-smi
nvlink --list <GPU Nvlink Id>
- Selects a specific
Nvlink of the GPU for the given command, if valid. If not used, the given
command-line option applies to all of the GPU's Nvlinks.
- 4) Query Nvlink Status
- nvidia-smi
nvlink -s
- nvidia-smi
nvlink --status
- Get the status of the GPU's
Nvlinks.
- If Active, the Bandwidth of the
links will be displayed.
- If the link is present but Not
Active, it will show the link as Inactive.
- If the link is in Sleep state,
it will show as Sleep.
- 5) Query Nvlink capabilities
- nvidia-smi
nvlink -c
- nvidia-smi
nvlink --capabilities
- Get the GPU's Nvlink
capabilities.
- 6) Query the Nvlink's remote node PCI bus
- nvidia-smi
nvlink -p
- nvidia-smi
nvlink --pcibusid
- Get the Nvlink's remote node
PCI bus ID.
- 7) Query the Nvlink's remote link info
- nvidia-smi
nvlink -R
- nvidia-smi
nvlink --remotelinkinfo
- Get the remote device PCI
bus ID and NvLink ID for a link.
- 8) Set Nvlink Counter Control is DEPRECATED
- 9) Get Nvlink Counter Control is DEPRECATED
- 10) Get Nvlink Counters is DEPRECATED, -gt/--getthroughput should be
used instead
- 11) Reset Nvlink counters is DEPRECATED
- 12) Query Nvlink Error Counters
- nvidia-smi
nvlink -e
- nvidia-smi
nvlink --errorcounters
- Get the Nvlink error
counters.
- For NVLink 4
- Replay Errors - count the
number of replay 'events' that occurred
- Recovery Errors -
count the number of link recovery events
- CRC Errors - count the number of
CRC errors in received packets
- For NVLink 5
- Tx packets - Total Tx packets on
the link
- Tx bytes - Total Tx bytes on
the link
- Rx packets - Total Rx packets on
the link
- Rx bytes - Total Rx bytes on
the link
- Malformed packet
Errors - Number of packets Rx on a link where packets are malformed
- Buffer overrun Errors -
Number of packets that were discarded on Rx due to buffer overrun
- Rx Errors - Total number of
packets with errors Rx on a link
- Rx remote Errors - Total
number of packets Rx - stomp/EBP marker
- Rx General Errors - Total
number of packets Rx with header mismatch
- Local link integrity Errors
- Total number of times that the count of local errors exceeded a
threshold
- Tx discards - Total number of
tx error packets that were discarded
- Link recovery successful
events - Number of times link went from Up to recovery, succeeded and link
came back up
- Link recovery failed
events - Number of times link went from Up to recovery, failed and link was
declared down
- Total link recovery
events - Number of times link went from Up to recovery, irrespective of the
result
- Effective Errors -
Sum of the number of errors in each Nvlink packet
- Effective BER -
BER for effective errors
- Symbol Errors - Number of
errors in rx symbols
- Symbol BER - BER for
symbol errors
- FEC Errors - [0-15] - count of
symbol errors that are corrected
- 13) Query Nvlink CRC error counters
- nvidia-smi
nvlink -ec
- nvidia-smi
nvlink --crcerrorcounters
- Get the Nvlink per-lane
CRC/ECC error counters.
- CRC - NVLink 4 and before -
Total Rx CRC errors on an NVLink Lane
- ECC - NVLink 4 - Total Rx
ECC errors on an NVLink Lane
- Deprecated NVLink
5 onwards
- 14) Reset Nvlink Error Counters
- nvidia-smi
nvlink -re
- nvidia-smi
nvlink --reseterrorcounters
- Reset all Nvlink error
counters to zero.
- NvLink 5 NOT
SUPPORTED
- 15) Query Nvlink throughput counters
- nvidia-smi
nvlink -gt <Data Type>
- nvidia-smi
nvlink --getthroughput <Data Type>
- <Data Type> can be one of the following:
- d - Tx and Rx data payload in
KiB.
- r - Tx and Rx raw payload and
protocol overhead in KiB.
- 16) Set Nvlink Low Power thresholds
- nvidia-smi
nvlink -sLowPwrThres <Threshold>
- nvidia-smi
nvlink --setLowPowerThreshold <Threshold>
- Set the Nvlink Low Power
Threshold, before the links go into Low Power Mode.
- Threshold ranges and
units can be found using -gLowPwrInfo.
- 17) Get Nvlink Low Power Info
- nvidia-smi
nvlink -gLowPwrInfo
- nvidia-smi
nvlink --getLowPowerInfo
- Query the Nvlink's Low Power
Info.
- 18) Set Nvlink Bandwidth mode
- nvidia-smi
nvlink -sBwMode <Bandwidth Mode>
- nvidia-smi
nvlink --setBandwidthMode <Bandwidth Mode>
- Set the Nvlink Bandwidth
mode for all GPUs. This is DEPRECATED for Blackwell+.
- The options are:
- FULL - All links are at max
Bandwidth.
- OFF - Bandwidth is not used. P2P
is via PCIe bus.
- MIN - Bandwidth is at minimum
speed.
- HALF - Bandwidth is at around
half of FULL speed.
- 3QUARTER - Bandwidth is at around 75% of FULL speed.
- 19) Get Nvlink Bandwidth mode
- nvidia-smi
nvlink -gBwMode
- nvidia-smi
nvlink --getBandwidthMode
- Get the Nvlink Bandwidth
mode for all GPUs. This is DEPRECATED for Blackwell+.
- 20) Query for Nvlink Bridge
- nvidia-smi
nvlink -cBridge
- nvidia-smi
nvlink --checkBridge
- Query for Nvlink Bridge
presence.
- 21) Set the GPU's Nvlink Width
- nvidia-smi
nvlink -sLWidth <Link Width>
- nvidia-smi
nvlink --setLinkWidth <Link Width>
- Set the GPU's Nvlink width,
which will keep that number of links Active and put the rest to
sleep.
- <Link Width> can be one of the following:
- values - List possible
Link Widths to be set.
- The numerical value from the
above option.
- 22) Get the GPU's Nvlink Width
- nvidia-smi
nvlink -gLWidth
- nvidia-smi
nvlink --getLinkWidth
- Query the GPU's Nvlink
Width.
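For illustration only, a few nvlink queries combining the documented options (GPU index 0 and link 0 are placeholders):
# Status of all Nvlinks on GPU 0
nvidia-smi nvlink -s -i 0
# Error counters for Nvlink 0 of GPU 0
nvidia-smi nvlink -e -i 0 -l 0
# Tx/Rx data payload throughput counters for all Nvlinks of GPU 0
nvidia-smi nvlink -gt d -i 0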
The "nvidia-smi vgpu" command reports on GRID vGPUs
executing on supported GPUs and hypervisors (refer to driver release notes
for supported platforms). Summary reporting provides basic information about
vGPUs currently executing on the system. Additional options provide detailed
reporting of vGPU properties, per-vGPU reporting of SM, Memory, Encoder,
Decoder, Jpeg, and OFA utilization, and per-GPU reporting of supported and
creatable vGPUs. Periodic reports can be automatically generated by
specifying a configurable loop frequency to any command. Note: On
MIG-enabled GPUs, querying the utilization of encoder, decoder, jpeg, ofa,
gpu, and memory is not currently supported.
Usage:
- 1) Help Information
- nvidia-smi
vgpu -h
- Displays help
information for using the command line.
- 2) Default with no arguments
- nvidia-smi
vgpu
- Reports summary of
all the vGPUs currently active on each device.
- 3) Display detailed info on currently active vGPUs
- nvidia-smi
vgpu -q
- Collects and
displays information on currently active vGPUs on each device, including
driver version, utilization, and other information.
- 4) Select one or more devices
- nvidia-smi
vgpu -i <device1,device2, .. , deviceN>
- Reports summary for
all the vGPUs currently active on the devices selected by comma-separated
device list.
- 5) Display supported vGPUs
- nvidia-smi
vgpu -s
- Displays vGPU
types supported on each device. Use the -v / --verbose option to show
detailed info on each vGPU type.
- 6) Display creatable vGPUs
- nvidia-smi
vgpu -c
- Displays vGPU
types creatable on each device. This varies dynamically, depending on the
vGPUs already active on the device. Use the -v / --verbose option to show
detailed info on each vGPU type.
- 7) Report utilization for currently active vGPUs.
- nvidia-smi
vgpu -u
- Reports average
utilization (SM, Memory, Encoder, Decoder, Jpeg, and OFA) for each active
vGPU since last monitoring cycle. The default cycle time is 1 second, and
the command runs until terminated with ^C. If a device has no active vGPUs,
its metrics are reported as "-".
- 8) Configure loop frequency
- nvidia-smi
vgpu [-s -c -q -u] -l <time in secs>
- Collects and
displays data at a specified loop interval until terminated with ^C. The
loop frequency must be between 1 and 10 secs. When no time is specified, the
loop frequency defaults to 5 secs.
- 9) Display GPU engine usage
- nvidia-smi
vgpu -p
- Display GPU engine
usage of currently active processes running in the vGPU VMs.
- 10) Display migration capabilities.
- nvidia-smi
vgpu -m
- Display pGPU's
migration/suspend/resume capability.
- 11) Display the vGPU Software scheduler state.
- nvidia-smi
vgpu -ss
- Display the
information about vGPU Software scheduler state.
- 12) Display the vGPU Software scheduler capabilities.
- nvidia-smi
vgpu -sc
- Display the list of
supported vGPU scheduler policies, returned along with the other capability
values, if the engine is of Graphics type. For other engine types, the policy
is BEST EFFORT and the other capabilities will be zero. If ARR is supported
and enabled, scheduling frequency and averaging factor are applicable;
otherwise timeSlice is applicable.
- 13) Display the vGPU Software scheduler logs.
- nvidia-smi
vgpu -sl
- Display the vGPU
Software scheduler runlist logs.
- nvidia-smi
--query-vgpu-scheduler-logs=[input parameters]
- Display the vGPU
Software scheduler runlist logs in CSV format.
- 14) Set the vGPU Software scheduler state.
- nvidia-smi
vgpu --set-vgpu-scheduler-state [options]
- Set the vGPU Software
scheduler policy and states.
- 15) Display Nvidia Encoder session info.
- nvidia-smi
vgpu -es
- Display the
information about encoder sessions for currently running vGPUs.
- 16) Display accounting statistics.
- nvidia-smi
vgpu --query-accounted-apps=[input parameters]
- Display accounting
stats for compute/graphics processes.
- To find the list of properties
which can be queried, run - 'nvidia-smi
--help-query-accounted-apps'.
- 17) Display Nvidia Frame Buffer Capture session info.
- nvidia-smi
vgpu -fs
- Display the
information about FBC sessions for currently running vGPUs.
- Note: Horizontal
resolution, vertical resolution, average FPS and average latency data for an
FBC session may be zero if there are no new frames captured since the
session started.
- 18) Set vGPU heterogeneous mode.
- nvidia-smi
vgpu -shm
- Set vGPU heterogeneous mode
of the device for timesliced vGPUs with different framebuffer
sizes.
- 19) Set vGPU MIG timeslice mode.
- nvidia-smi
vgpu -smts
- Set vGPU MIG timeslice mode
of the device.
- 20) Display the currently creatable vGPU types on the user provided GPU
Instance
- nvidia-smi
vgpu -c -gi <GPU instance IDs> -i <GPU IDs>
- nvidia-smi
vgpu -c --gpu-instance-id <GPU instance IDs> --id <GPU
IDs>
- Provide comma separated
values for more than one GPU instance. The target GPU index is MANDATORY for
the given GPU instance.
- 21) Display detailed information of the currently active vGPU instances
on the user provided GPU Instance
- nvidia-smi
vgpu -q -gi <GPU instance IDs> -i <GPU IDs>
- nvidia-smi
vgpu -q --gpu-instance-id <GPU instance IDs> --id <GPU
IDs>
- Provide comma
separated values for more than one GPU instance. The target GPU index
is MANDATORY for the given GPU instance.
- 22) Display the vGPU scheduler state on the user provided GPU
Instance
- nvidia-smi
vgpu -ss -gi <GPU instance IDs> -i <GPU IDs>
- nvidia-smi
vgpu -ss --gpu-instance-id <GPU instance IDs> --id <GPU
IDs>
- Provide comma
separated values for more than one GPU instance. The target GPU index
is MANDATORY for the given GPU instance.
- 23) Get the vGPU heterogeneous mode on the user provided GPU
Instance
- nvidia-smi
vgpu -ghm -gi <GPU instance IDs> -i <GPU IDs>
- nvidia-smi
vgpu -ghm --gpu-instance-id <GPU instance IDs> --id <GPU
IDs>
- Provide comma
separated values for more than one GPU instance. The target GPU index
is MANDATORY for the given GPU instance. If not used, the given command-line
option applies to all of the GPU instances.
- 24) Set the vGPU heterogeneous mode on the user provided GPU
Instance
- nvidia-smi
vgpu -shm -gi <GPU instance IDs> -i <GPU IDs>
- nvidia-smi
vgpu -shm --gpu-instance-id <GPU instance IDs> --id <GPU
IDs>
- Provide comma
separated values for more than one GPU instance. The target GPU index
is MANDATORY for the given GPU instance.
- 25) Set the vGPU Software scheduler state on the user provided GPU
Instance.
- nvidia-smi
vgpu set-vgpu-scheduler-state [options] -gi <GPU instance IDs> -i
<GPU IDs>
- nvidia-smi
vgpu set-vgpu-scheduler-state [options] --gpu-instance-id <GPU instance
IDs> --id <GPU IDs>
- Provide comma
separated values for more than one GPU instance. The target GPU index
is MANDATORY for the given GPU instance.
- 26) Display the vGPU scheduler logs on the user provided GPU
Instance
- nvidia-smi
vgpu -sl -gi <GPU instance IDs> -i <GPU IDs>
- nvidia-smi
vgpu -sl --gpu-instance-id <GPU instance IDs> --id <GPU
IDs>
- Provide comma
separated values for more than one GPU instance. The target GPU index
is MANDATORY for the given GPU instance.
- nvidia-smi
vgpu --query-gpu-instance-vgpu-scheduler-logs=[input parameters] -gi <GPU
instance IDs> -i <GPU IDs>
- Display the vGPU
Software scheduler logs in CSV format on the user provided GPU
Instance.
- 27) Display detailed information of the currently creatable vGPU types
on the user provided GPU Instance
- nvidia-smi
vgpu -c -v -gi <GPU instance IDs> -i <GPU IDs>
- nvidia-smi
vgpu -c -v --gpu-instance-id <GPU instance IDs> --id <GPU
IDs>
- Provide comma
separated values for more than one GPU instance. The target GPU index
is MANDATORY for the given GPU instance.
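For illustration only, a few vgpu invocations combining the documented options (GPU index 0 and the loop interval are placeholders):
# Detailed information on the vGPUs currently active on GPU 0
nvidia-smi vgpu -q -i 0
# Creatable vGPU types on GPU 0, with verbose details
nvidia-smi vgpu -c -v -i 0
# Per-vGPU utilization, refreshed every 5 seconds until terminated with ^C
nvidia-smi vgpu -u -l 5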
The privileged "nvidia-smi mig" command-line is used to
manage MIG-enabled GPUs. It provides options to create, list and destroy GPU
instances and compute instances.
Usage:
- 1) Display help menu
- nvidia-smi
mig -h
- Displays help
menu for using the command-line.
- 2) Select one or more GPUs
- nvidia-smi
mig -i <GPU IDs>
- nvidia-smi
mig --id <GPU IDs>
- Selects one or more
GPUs using the given comma-separated GPU indexes, PCI bus IDs or UUIDs. If
not used, the given command-line option applies to all of the supported
GPUs.
- 3) Select one or more GPU instances
- nvidia-smi
mig -gi <GPU instance IDs>
- nvidia-smi
mig --gpu-instance-id <GPU instance IDs>
- Selects one or more
GPU instances using the given comma-separated GPU instance IDs. If not used,
the given command-line option applies to all of the GPU instances.
- 4) Select one or more compute instances
- nvidia-smi
mig -ci <compute instance IDs>
- nvidia-smi
mig --compute-instance-id <compute instance IDs>
- Selects one or more
compute instances using the given comma-separated compute instance IDs. If
not used, the given command-line option applies to all of the compute
instances.
- 5) List GPU instance profiles
- nvidia-smi
mig -lgip -i <GPU IDs>
- nvidia-smi
mig --list-gpu-instance-profiles --id <GPU IDs>
- Lists GPU instance profiles,
their availability and IDs. Profiles describe the supported types of GPU
instances, including all of the GPU resources they exclusively
control.
- 6) List GPU instance possible placements
- nvidia-smi
mig -lgipp -i <GPU IDs>
- nvidia-smi
mig --list-gpu-instance-possible-placements --id <GPU
IDs>
- Lists GPU instance
possible placements. Possible placements describe the locations of the
supported types of GPU instances within the GPU.
- 7) Create GPU instance
- nvidia-smi
mig -cgi <GPU instance specifiers> -i <GPU IDs>
- nvidia-smi
mig --create-gpu-instance <GPU instance specifiers> --id <GPU
IDs>
- Creates GPU instances
for the given GPU instance specifiers. A GPU instance specifier comprises a
GPU instance profile name or ID and an optional placement specifier
consisting of a colon and a placement start index. The command fails if the
GPU resources required to allocate the requested GPU instances are not
available, or if the placement index is not valid for the given
profile.
- 8) Create a GPU instance along with the default compute
instance
- nvidia-smi
mig -cgi <GPU instance profile IDs or names> -i <GPU IDs>
-C
- nvidia-smi
mig --create-gpu-instance <GPU instance profile IDs or names> --id
<GPU IDs> --default-compute-instance
- 9) List GPU instances
- nvidia-smi
mig -lgi -i <GPU IDs>
- nvidia-smi
mig --list-gpu-instances --id <GPU IDs>
- Lists GPU instances and
their IDs.
- 10) Destroy GPU instance
- nvidia-smi
mig -dgi -gi <GPU instance IDs> -i <GPU IDs>
- nvidia-smi
mig --destroy-gpu-instances --gpu-instance-id <GPU instance IDs> --id
<GPU IDs>
- Destroys GPU
instances. The command fails if the requested GPU instance is in use by an
application.
- 11) List compute instance profiles
- nvidia-smi
mig -lcip -gi <GPU instance IDs> -i <GPU IDs>
- nvidia-smi
mig --list-compute-instance-profiles --gpu-instance-id <GPU instance
IDs> --id <GPU IDs>
- Lists compute instance
profiles, their availability and IDs. Profiles describe the supported types
of compute instances, including all of the GPU resources they share or
exclusively control.
- 12) List compute instance possible placements
- nvidia-smi
mig -lcipp -gi <GPU instance IDs> -i <GPU IDs>
- nvidia-smi
mig --list-compute-instance-possible-placements --gpu-instance-id <GPU
instance IDs> --id <GPU IDs>
- Lists compute instance
possible placements. Possible placements describe the locations of the
supported types of compute instances within the GPU instance.
- 13) Create compute instance
- nvidia-smi
mig -cci <compute instance profile IDs or names> -gi <GPU instance
IDs> -i <GPU IDs>
- nvidia-smi
mig --create-compute-instance <compute instance profile IDs or names>
--gpu-instance-id <GPU instance IDs> --id <GPU IDs>
- Creates compute
instances for the given compute instance specifiers. A compute instance
specifier comprises a compute instance profile name or ID and an optional
placement specifier consisting of a colon and a placement start index. The
command fails if the GPU resources required to allocate the requested
compute instances are not available, or if the placement index is not valid
for the given profile.
- 14) List compute instances
- nvidia-smi
mig -lci -gi <GPU instance IDs> -i <GPU IDs>
- nvidia-smi
mig --list-compute-instances --gpu-instance-id <GPU instance IDs> --id
<GPU IDs>
- Lists compute instances
and their IDs.
- 15) Destroy compute instance
- nvidia-smi
mig -dci -ci <compute instance IDs> -gi <GPU instance IDs> -i
<GPU IDs>
- nvidia-smi
mig --destroy-compute-instance --compute-instance-id <compute instance
IDs> --gpu-instance-id <GPU instance IDs> --id <GPU
IDs>
- Destroys compute
instances. The command fails if the requested compute instance is in use by
an application.
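For illustration only, a possible end-to-end MIG sequence built from the documented options (GPU index 0 is a placeholder; profile ID 19 is taken from the examples elsewhere in this document and may not exist on a given GPU):
# List GPU instance profiles and their possible placements on GPU 0
nvidia-smi mig -lgip -i 0
nvidia-smi mig -lgipp -i 0
# Create a GPU instance from profile ID 19 along with its default compute instance
nvidia-smi mig -cgi 19 -i 0 -C
# List the resulting GPU instances and compute instances
nvidia-smi mig -lgi -i 0
nvidia-smi mig -lci -i 0
# Destroy the compute instances, then the GPU instances, on GPU 0
nvidia-smi mig -dci -i 0
nvidia-smi mig -dgi -i 0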
The following list describes all possible data returned by the
-q -u unit query option. Unless otherwise noted all numerical results
are base 10 and unitless.
The current system timestamp at the time nvidia-smi was invoked.
Format is "Day-of-week Month Day HH:MM:SS Year".
Driver Version
The version of the installed NVIDIA display driver. Format is
"Major-Number.Minor-Number".
Information about any Host Interface Cards (HIC) that are
installed in the system.
- Firmware
Version
- The version of the firmware running on the HIC.
The number of attached Units in the system.
Product Name
The official product name of the unit. This is an alphanumeric
value. For all S-class products.
The product identifier for the unit. This is an alphanumeric value
of the form "part1-part2-part3". For all S-class products.
The immutable globally unique identifier for the unit. This is an
alphanumeric value. For all S-class products.
The version of the firmware running on the unit. Format is
"Major-Number.Minor-Number". For all S-class products.
The LED indicator is used to flag systems with potential problems.
An LED color of AMBER indicates an issue. For all S-class products.
- Color
- The color of the LED indicator. Either "GREEN" or
"AMBER".
- Cause
- The reason for the current LED color. The cause may be listed as any
combination of "Unknown", "Set to AMBER by host
system", "Thermal sensor failure", "Fan failure"
and "Temperature exceeds critical limit".
Temperature readings for important components of the Unit. All
readings are in degrees C. Not all readings may be available. For all
S-class products.
- Intake
- Air temperature at the unit intake.
- Exhaust
- Air temperature at the unit exhaust point.
- Board
- Air temperature across the unit board.
Readings for the unit power supply. For all S-class products.
- State
- Operating state of the PSU. The power supply state can be any of the
following: "Normal", "Abnormal", "High
voltage", "Fan failure", "Heatsink temperature",
"Current limit", "Voltage below UV alarm threshold",
"Low-voltage", "I2C remote off command",
"MOD_DISABLE input" or "Short pin transition".
- Voltage
- PSU voltage setting, in volts.
- Current
- PSU current draw, in amps.
Fan readings for the unit. A reading is provided for each fan, of
which there can be many. For all S-class products.
- State
- The state of the fan, either "NORMAL" or
"FAILED".
- Speed
- For a healthy fan, the fan's speed in RPM.
Attached GPUs
A list of PCI bus ids that correspond to each of the GPUs attached
to the unit. The bus ids have the form
"domain:bus:device.function", in hex. For all S-class
products.
On Linux, NVIDIA device files may be modified by nvidia-smi if run
as root. Please see the relevant section of the driver README file.
The -a and -g arguments are now deprecated in favor
of -q and -i, respectively. However, the old arguments still
work for this release.
Query attributes for all GPUs once, and display in plain text to
stdout.
Query UUID and persistence mode of all GPUs in the system.
Query ECC errors and power consumption for GPU 0 at a frequency of
10 seconds, indefinitely, and record to the file out.log.
Set the compute mode to "PROHIBITED" for GPU with UUID
"GPU-b2f5f1b745e3d23d-65a3a26d-097db358-7303e0b6-149642ff3d219f8587cde3a8".
Query attributes for all Units once, and display in XML format
with embedded DTD to stdout.
Write the Unit DTD to nvsmi_unit.dtd.
Display supported clocks of all GPUs.
Set applications clocks to 2500 MHz memory, and 745 MHz
graphics.
Create a MIG GPU instance on profile ID 19.
Create a MIG GPU instance on profile ID 19 at placement start
index 2.
List all boost sliders for all GPUs.
Set vboost to value 1 for all GPUs.
List clock range, temperature range and supported profiles of
power hint.
Query power hint with graphics clock at 1350MHz, temperature at
60C and profile ID at 0.
Query power hint with graphics clock at 1350MHz, memory clock at
1216MHz, temperature at -5C and profile ID at 1.
- On systems where GPUs are NUMA nodes, the accuracy of FB memory
utilization provided by nvidia-smi depends on the memory accounting of the
operating system. This is because FB memory is managed by the operating
system instead of the NVIDIA GPU driver. Typically, pages allocated from
FB memory are not released even after the process terminates to enhance
performance. In scenarios where the operating system is under memory
pressure, it may resort to utilizing FB memory. Such actions can result in
discrepancies in the accuracy of memory reporting.
- On Linux GPU Reset can't be triggered when there is pending GOM
change.
- On Linux GPU Reset may not successfully change pending ECC mode. A full
reboot may be required to enable the mode change.
- On Linux platforms that configure NVIDIA GPUs as NUMA nodes, enabling
persistence mode or resetting GPUs may print 'Warning: persistence mode is
disabled on device' if nvidia-persistenced is not running, or if
nvidia-persistenced cannot access files in the NVIDIA driver's procfs
directory for the device (/proc/driver/nvidia/gpus/<PCI address>/). During
GPU reset and driver reload, this directory
will be deleted and recreated, and outstanding references to the deleted
directory, such as mounts or shells, can prevent processes from accessing
files in the new directory.
- There might be a slight discrepancy between volatile/aggregate ECC
counters if recovery action was not taken.
- Added new --query-gpu option inforom.checksum_validation to check the
inforom checksum validation (nvidia-smi --query-gpu
inforom.checksum_validation)
- Updated 'nvidia-smi -q' to print both 'Instantaneous Power Draw' and
'Average Power Draw' in all cases where 'Power Draw' used to be
printed.
- Added support to nvidia-smi c2c -e to display C2C Link Errors
- Added support to nvidia-smi c2c -gLowPwrInfo to display C2C Link Power
state
- Added new fields for Clock Event Reason Counters which can be queried with
'nvidia-smi -q' or with the 'nvidia-smi -q -d PERFORMANCE' display
flag.
- Added new query GPU options for Clock Event Reason Counters: 'nvidia-smi
--query-gpu=clocks_event_reasons_counters.{sw_power_cap,sw_thermal_slowdown,sync_boost,hw_thermal_slowdown,hw_power_brake_slowdown}'
- Added new fields for MIG timeslicing which can be queried with 'nvidia-smi
-q'
- Added a new cmdline option '-smts' to 'nvidia-smi vgpu' to set vGPU MIG
timeslice mode
- Added a new sub-option '-gi' to 'nvidia-smi vgpu -c' to query the
currently creatable vGPU types on the user provided GPU Instance
- Added a new sub-option '-gi' to 'nvidia-smi vgpu -q' to query detailed
information of the currently active vGPU instances on the user provided
GPU Instance
- Added a new cmdline option '-ghm' to 'nvidia-smi vgpu' to get vGPU
heterogeneous mode on the user provided GPU Instance
- Added a new sub-option '-gi' to 'nvidia-smi vgpu -shm' to set the vGPU
heterogeneous mode on the user provided GPU Instance
- Added new field for max instances per GPU Instance which can be queried
with 'nvidia-smi vgpu -s -v'
- Added a new sub-option '-gi' to 'nvidia-smi vgpu -ss' to query the vGPU
software scheduler state on the user provided GPU Instance
- Added a new sub-option '-gi' to 'nvidia-smi vgpu -sl' to query the vGPU
software scheduler logs on the user provided GPU Instance
- Added a new sub-option '-gi' to 'nvidia-smi vgpu set-scheduler-state' to
set the vGPU software scheduler state on the user provided GPU
Instance.
- Added a new sub-option '-gi' to 'nvidia-smi vgpu -c -v' to query detailed
information of the creatable vGPU types on the user provided GPU
Instance
- Added a new cmdline option '--query-gpu-instance-vgpu-scheduler-logs' to
'nvidia-smi vgpu' to get the vGPU software scheduler logs on the user
provided GPU Instance in CSV format. See nvidia-smi vgpu
--help-gpu-instance-vgpu-query-scheduler-logs for details.
- Added new cmdline option '-sLWidth' and '-gLWidth' to 'nvidia-smi
nvlink'
- Added new ability to display Nvlink sleep state with 'nvidia-smi nvlink
-s' for Blackwell and onward generations
- Added new query GPU options for average/instant module power draw:
'nvidia-smi --query-gpu=module.power.draw.{average,instant}'
- Added new query GPU options for default/max/min module power limits:
'nvidia-smi
--query-gpu=module.power.{default_limit,max_limit,min_limit}'
- Added new query GPU options for module power limits: 'nvidia-smi
--query-gpu=module.power.limit'
- Added new query GPU options for enforced module power limits: 'nvidia-smi
--query-gpu=module.enforced.power.limit'
- Added new query GPU aliases for GPU Power options
- Added a new command to get confidential compute info: 'nvidia-smi
conf-compute -q'
- Added new Power Profiles section in nvidia-smi -q and corresponding -d
display flag POWER_PROFILES
- Added new Power Profiles option 'nvidia-smi power-profiles' to get/set
power profiles related information.
- Added the platform information query to 'nvidia-smi -q'
- Added the platform information query to 'nvidia-smi --query-gpu
platform'
- Added new Power Smoothing option 'nvidia-smi power-smoothing' to set power
smoothing related values.
- Added new Power Smoothing section in nvidia-smi -q and corresponding -d
display flag POWER_SMOOTHING
- Deprecated graphics voltage value from Voltage section of nvidia-smi -q.
Voltage now always displays as 'N/A' and will be removed in a future
release.
- Added new topo option nvidia-smi topo -nvme to display GPUs vs NVMes
connecting path.
- Changed help string for the command 'nvidia-smi topo -p2p -p' from 'prop'
to 'pcie' to better describe the p2p capability.
- Added new command 'nvidia-smi pci -gCnt' to query PCIe RX/TX Bytes.
- Added EGM capability display under new Capabilities section in nvidia-smi
-q command.
- Added multiGpuMode display via 'nvidia-smi conf-compute
--get-multigpu-mode' or 'nvidia-smi conf-compute -mgm'
- GPU Reset Status in nvidia-smi -q has been deprecated. GPU Recovery action
provides all the necessary actions
- nvidia-smi -q will now display Dram encryption state
- nvidia-smi -den/--dram-encryption 0/1 to disable/enable dram
encryption
- Added new status to nvidia fabric health. nvidia-smi -q will display 3 new
fields in Fabric Health - Route Recovery in progress, Route Unhealthy and
Access Timeout Recovery
- In nvidia-smi -q Platform Info - RACK GUID is changed to Platform Info -
RACK Serial Number
- In nvidia-smi --query-gpu new option for gpu_recovery_action is added
- Added new counters for Nvlink5 in nvidia-smi nvlink -e:
- Effective Errors to get sum of the number of errors in each Nvlink
packet
- Effective BER to get Effective BER for effective errors
- FEC Errors - 0 to 15 to get count of symbol errors that are corrected
- Added a new output field called 'GPU Fabric GUID' to the 'nvidia-smi -q'
output
- Added a new property called 'platform.gpu_fabric_guid' to 'nvidia-smi
--query-gpu'
- Updated 'nvidia-smi nvlink -gLowPwrInfo' command to display the Power
Threshold Range and Units
- Added the reporting of vGPU homogeneous mode to 'nvidia-smi -q'.
- Added the reporting of homogeneous vGPU placements to 'nvidia-smi vgpu -s
-v', complementing the existing reporting of heterogeneous vGPU
placements.
- Added 'Atomic Caps Inbound' in the PCI section of 'nvidia-smi -q'.
- Updated ECC and row remapper output for options '--query-gpu' and
'--query-remapped-rows'.
- Added support for events including ECC single-bit error storm, DRAM
retirement, DRAM retirement failure, contained/nonfatal poison and
uncontained/fatal poison.
- Added support in 'nvidia-smi nvlink -e' to display NVLink5 error
counters
- Added a new cmdline option to print out version information:
--version
- Added ability to print out only the GSP firmware version with 'nvidia-smi
-q -d'. Example commandline: nvidia-smi -q -d GSP_FIRMWARE_VERSION
- Added support to query pci.baseClass and pci.subClass. See nvidia-smi
--help-query-gpu for details.
- Added PCI base and sub classcodes to 'nvidia-smi -q' output.
- Added new cmdline option '--format' to 'nvidia-smi dmon' to support 'csv',
'nounit' and 'noheader' format specifiers
- Added a new cmdline option '--gpm-options' to 'nvidia-smi dmon' to support
GPM metrics report in MIG mode
- Added the NVJPG and NVOFA utilization report to 'nvidia-smi pmon'
- Added the NVJPG and NVOFA utilization report to 'nvidia-smi -q -d
utilization'
- Added the NVJPG and NVOFA utilization report to 'nvidia-smi vgpu -q' to
report NVJPG/NVOFA utilization on active vgpus
- Added the NVJPG and NVOFA utilization report to 'nvidia-smi vgpu -u' to
periodically report NVJPG/NVOFA utilization on active vgpus
- Added the NVJPG and NVOFA utilization report to 'nvidia-smi vgpu -p' to
periodically report NVJPG/NVOFA utilization on running processes of active
vgpus
- Added a new cmdline option '-shm' to 'nvidia-smi vgpu' to set vGPU
heterogeneous mode
- Added the reporting of vGPU heterogeneous mode in 'nvidia-smi -q'
- Added ability to call 'nvidia-smi mig -lgip' and 'nvidia-smi mig -lgipp'
to work without requiring MIG being enabled
- Added support to query confidential compute key rotation threshold
info.
- Added support to set confidential compute key rotation max attacker
advantage.
- Added a new cmdline option '--sparse-operation-mode' to 'nvidia-smi
clocks' to set the sparse operation mode
- Added the reporting of sparse operation mode to 'nvidia-smi -q -d
PERFORMANCE'
- Added support to query the timestamp and duration of the latest flush of
the BBX object to the inforom storage.
- Added support for reporting out GPU Memory power usage.
- Updated the SRAM error status reported in the ECC query 'nvidia-smi -q -d
ECC'
- Added support to query and report the GPU JPEG and OFA (Optical Flow
Accelerator) utilizations.
- Removed deprecated 'stats' command.
- Added support to set the vGPU software scheduler state.
- Renamed counter collection unit to gpu performance monitoring.
- Added new C2C Mode reporting to device query.
- Added back clock_throttle_reasons to --query-gpu to not break backwards
compatibility
- Added support to get confidential compute CPU capability and GPUs
capability.
- Added support to set confidential compute unprotected memory and GPU ready
state.
- Added support to get confidential compute memory info and GPU ready
state.
- Added support to display confidential compute devtools mode, environment
and feature status.
- Added support to query power.draw.average and power.draw.instant. See
nvidia-smi --help-query-gpu for details.
- Added support to get the vGPU software scheduler state.
- Added support to get the vGPU software scheduler logs.
- Added support to get the vGPU software scheduler capabilities.
- Renamed Clock Throttle Reasons to Clock Event Reasons.
- Added support to query and set counter collection unit stream state.
- Add new 'Reserved' memory reporting to the FB memory output
- Added support to query power hint
- Removed support for -acp,--application-clock-permissions option
- Add option to specify placement when creating a MIG GPU instance.
- Added support to query and control boost slider
- Added --lock-memory-clock and --reset-memory-clock command to lock to
closest min/max Memory clock provided and ability to reset Memory
clock
- Allow fan speeds greater than 100% to be reported
- Added topo support to display NUMA node affinity for GPU devices
- Added support to create MIG instances using profile names
- Added support to create the default compute instance while creating a GPU
instance
- Added support to query and disable MIG mode on Windows
- Removed support of GPU reset(-r) command on MIG enabled vGPU guests
- Added support for Multi Instance GPU (MIG)
- Added support to individually reset NVLink-capable GPUs based on the
NVIDIA Ampere architecture
- Support for Volta and Turing architectures, bug fixes, performance
improvements, and new features
- Added nvlink support to expose the publicly available NVLINK NVML
APIs
- Added clocks sub-command with synchronized boost support
- Updated nvidia-smi stats to report GPU temperature metric
- Updated nvidia-smi dmon to support PCIe throughput
- Updated nvidia-smi daemon/replay to support PCIe throughput
- Updated nvidia-smi dmon, daemon and replay to support PCIe Replay
Errors
- Added GPU part numbers in nvidia-smi -q
- Removed support for exclusive thread compute mode
- Added Video (encoder/decode) clocks to the Clocks and Max Clocks display
of nvidia-smi -q
- Added memory temperature output to nvidia-smi dmon
- Added --lock-gpu-clock and --reset-gpu-clock command to lock to closest
min/max GPU clock provided and reset clock
- Added --cuda-clocks to override or restore default CUDA clocks
- Added topo support to display affinities per GPU
- Added topo support to display neighboring GPUs for a given level
- Added topo support to show pathway between two given GPUs
- Added 'nvidia-smi pmon' command-line for process monitoring in scrolling
format
- Added '--debug' option to produce an encrypted debug log for use in
submission of bugs back to NVIDIA
- Fixed reporting of Used/Free memory under Windows WDDM mode
- The accounting stats are updated to include both running and terminated
processes. The execution time of a running process is reported as 0 and
updated to the actual value when the process is terminated.
- Added reporting of PCIe replay counters
- Added support for reporting Graphics processes via nvidia-smi
- Added reporting of PCIe utilization
- Added dmon command-line for device monitoring in scrolling format
- Added daemon command-line to run in background and monitor devices as a
daemon process. Generates dated log files at /var/log/nvstats/
- Added replay command-line to replay/extract the stat files generated by
the daemon tool
- Added reporting of temperature threshold information.
- Added reporting of brand information (e.g. Tesla, Quadro, etc.)
- Added support for K40d and K80.
- Added reporting of max, min and avg for samples (power, utilization, clock
changes). Example commandline: nvidia-smi -q -d power,utilization,
clock
- Added nvidia-smi stats interface to collect statistics such as power,
utilization, clock changes, xid events and perf capping counters with a
notion of time attached to each sample. Example commandline: nvidia-smi
stats
- Added support for collectively reporting metrics on more than one GPU.
Used with comma separated with '-i' option. Example: nvidia-smi -i
0,1,2
- Added support for displaying the GPU encoder and decoder utilizations
- Added nvidia-smi topo interface to display the GPUDirect communication
matrix (EXPERIMENTAL)
- Added support for displaying the GPU board ID and whether or not it is a
multiGPU board
- Removed user-defined throttle reason from XML output
- Added reporting of minor number.
- Added reporting BAR1 memory size.
- Added reporting of bridge chip firmware.
Changes between nvidia-smi v4.319 Production and v4.319
Update
- Added new --applications-clocks-permission switch to change permission
requirements for setting and resetting applications clocks.
Changes between nvidia-smi v4.304 and v4.319 Production
- Added reporting of Display Active state and updated documentation to
clarify how it differs from Display Mode and Display Active state
- For consistency on multi-GPU boards nvidia-smi -L always displays UUID
instead of serial number
- Added machine readable selective reporting. See SELECTIVE QUERY OPTIONS
section of nvidia-smi -h
- Added queries for page retirement information. See
--help-query-retired-pages and -d PAGE_RETIREMENT
- Renamed Clock Throttle Reason User Defined Clocks to Applications Clocks
Setting
- On error, return codes have distinct non zero values for each error class.
See RETURN VALUE section
- nvidia-smi -i can now query information from healthy GPU when there is a
problem with other GPU in the system
- All messages that point to a problem with a GPU print pci bus id of a GPU
at fault
- New flag --loop-ms for querying information at higher rates than once a
second (can have negative impact on system performance)
- Added queries for accounting processes. See --help-query-accounted-apps and
-d ACCOUNTING
- Added the enforced power limit to the query output
Changes between nvidia-smi v4.304 RC and v4.304 Production
- Added reporting of GPU Operation Mode (GOM)
- Added new --gom switch to set GPU Operation Mode
- Reformatted non-verbose output due to user feedback. Removed pending
information from table.
- Print out helpful message if initialization fails due to kernel module not
receiving interrupts
- Better error handling when NVML shared library is not present in the
system
- Added new --applications-clocks switch
- Added new filter to --display switch. Run with -d SUPPORTED_CLOCKS to list
possible clocks on a GPU
- When reporting free memory, calculate it from the rounded total and used
memory so that values add up
- Added reporting of power management limit constraints and default
limit
- Added new --power-limit switch
- Added reporting of texture memory ECC errors
- Added reporting of Clock Throttle Reasons
Changes between nvidia-smi v2.285 and v3.295
- Clearer error reporting for running commands (like changing compute
mode)
- When running commands on multiple GPUs at once N/A errors are treated as
warnings.
- nvidia-smi -i now also supports UUID
- UUID format changed to match UUID standard and will report a different
value.
Changes between nvidia-smi v2.0 and v2.285
- Report VBIOS version.
- Added -d/--display flag to filter parts of data
- Added reporting of PCI Sub System ID
- Updated docs to indicate we support M2075 and C2075
- Report HIC HWBC firmware version with -u switch
- Report max(P0) clocks next to current clocks
- Added --dtd flag to print the device or unit DTD
- Added message when NVIDIA driver is not running
- Added reporting of PCIe link generation (max and current), and link width
(max and current).
- Getting pending driver model works on non-admin
- Added support for running nvidia-smi on Windows Guest accounts
- Running nvidia-smi without -q command will output non verbose version of
-q instead of help
- Fixed parsing of -l/--loop= argument (default value, 0, too big a value)
- Changed format of pciBusId (to XXXX:XX:XX.X - this change was visible in
280)
- Parsing of busId for -i command is less restrictive. You can pass 0:2:0.0
or 0000:02:00 and other variations
- Changed versioning scheme to also include 'driver version'
- XML format always conforms to DTD, even when error conditions occur
- Added support for single and double bit ECC events and XID errors (enabled
by default with -l flag disabled for -x flag)
- Added device reset -r --gpu-reset flags
- Added listing of compute running processes
- Renamed power state to performance state. Deprecated support exists in XML
output only.
- Updated DTD version number to 2.0 to match the updated XML output
On Linux, the driver README is installed as
/usr/share/doc/NVIDIA_GLX-1.0/README.txt
Copyright 2011-2025 NVIDIA Corporation