nvprof --version (return code: 0)
nvprof: NVIDIA (R) Cuda command line profiler
Copyright (c) 2012 - 2016 NVIDIA Corporation
Release version 8.0.44 (21)
nvprof --help (return code: 0)
Usage: nvprof [options] [application] [application-arguments]
Options:
--aggregate-mode <on|off>
This option turns on/off aggregate mode for events and metrics
specified by subsequent "--events" and "--metrics" options.
Those event/metric values will be collected for each domain
instance, instead of the whole device. Allowed values:
on - turn on aggregate mode (default)
off - turn off aggregate mode
--analysis-metrics
Collect profiling data that can be imported to Visual Profiler's
"analysis" mode. Note: Use "--export-profile" to specify
an export file.
--concurrent-kernels <on|off>
Turn on/off concurrent kernel execution. If concurrent kernel
execution is off, all kernels running on one device will
be serialized. Allowed values:
on - turn on concurrent kernel execution (default)
off - turn off concurrent kernel execution
--continuous-sampling-interval <interval>
Set the continuous mode sampling interval in milliseconds.
Minimum is 1 ms. Default is 2 ms.
--cpu-thread-tracing <on|off>
Collect information about CPU thread API activity.
Allowed values:
on - turn on CPU thread API tracing
off - turn off CPU thread API tracing (default)
--dependency-analysis
Generate event dependency graph for host and device activities
and run dependency analysis.
--device-buffer-size <size in MBs>
The device memory size (in MBs) reserved for storing profiling
data for non-CDP operations for each buffer on a context.
The default value is 8MB. The size should be a positive
integer.
--device-cdp-buffer-size <size in MBs>
The device memory size (in MBs) reserved for storing profiling
data for CDP operations for each buffer on a context. The
default value is 8MB. The size should be a positive integer.
--devices <device ids>
This option changes the scope of subsequent "--events", "--metrics",
"--query-events" and "--query-metrics" options.
Allowed values:
all - change scope to all valid devices
comma-separated device IDs - change scope to specified
devices
--event-collection-mode <mode>
Choose event collection mode for all events/metrics. Allowed
values:
kernel - events/metrics are collected only for durations
of kernel executions (default)
continuous - events/metrics are collected for duration
of application. This is not applicable for non-Tesla devices.
This mode is compatible only with NVLink events/metrics.
This is incompatible with "--profile-all-processes" or "--profile-child-processes"
or "--replay-mode kernel" or "--replay-mode application"
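The compatibility rules listed for continuous mode can be checked before launching a run. The following is a minimal Python sketch, not part of nvprof: the function name `continuous_mode_conflicts` is hypothetical, and it assumes the whole command line is available as a list of strings. It scans the list naively, so application arguments that happen to look like nvprof options would also be picked up.

```python
def continuous_mode_conflicts(argv):
    """Return the options in argv that the help text lists as incompatible
    with "--event-collection-mode continuous".

    Returns an empty list when continuous mode is not requested.
    """
    def value_of(option):
        # Return the argument following `option`, or None if it is absent.
        if option in argv:
            i = argv.index(option)
            if i + 1 < len(argv):
                return argv[i + 1]
        return None

    if value_of("--event-collection-mode") != "continuous":
        return []

    conflicts = [
        opt
        for opt in ("--profile-all-processes", "--profile-child-processes")
        if opt in argv
    ]
    replay = value_of("--replay-mode")
    if replay in ("kernel", "application"):
        conflicts.append("--replay-mode " + replay)
    return conflicts


# ./my_app is a hypothetical application used only for illustration.
print(continuous_mode_conflicts([
    "nvprof", "--event-collection-mode", "continuous",
    "--profile-child-processes", "./my_app",
]))
```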
-e, --events <event names>
Specify the events to be profiled on certain device(s). Multiple
event names separated by comma can be specified. Which device(s)
are profiled is controlled by the "--devices" option. Otherwise
events will be collected on all devices.
For a list of available events, use "--query-events".
Use "--events all" to profile all events available for each
device.
Use "--devices" and "--kernels" to select a specific kernel
invocation.
--kernels <kernel path syntax>
This option changes the scope of subsequent "--events", "--metrics"
options. The syntax is as follows:
<kernel name>
or
<context id/name>:<stream id/name>:<kernel name>:<invocation>
The context/stream IDs, names, kernel name and invocation
can be regular expressions. Empty string matches any number
of characters. If <context id/name> or <stream id/name>
is a positive number, it's strictly matched against the
CUDA context/stream ID. Otherwise it's treated as a regular
expression and matched against the context/stream name specified
by the NVTX library. If the invocation count is a positive
number, it's strictly matched against the invocation of
the kernel. Otherwise it's treated as a regular expression.
Example: --kernels "1:foo:bar:2" - profile any kernel whose
name contains "bar" and was the 2nd instance on context
1 and on stream named "foo".
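The matching rules described for "--kernels" can be made concrete with a small Python sketch. It only emulates the rules as the help text states them and is not nvprof's implementation. The helper names (`parse_kernel_spec`, `kernel_matches`) are hypothetical, regular expressions are assumed to use search semantics (the example above says the name "contains" the pattern), and the naive split on ":" would not handle kernel names that themselves contain colons.

```python
import re


def parse_kernel_spec(spec):
    """Split a --kernels value into (context, stream, kernel, invocation)."""
    parts = spec.split(":")
    if len(parts) == 1:
        # Short form: only a kernel name; every other field matches anything.
        return ("", "", parts[0], "")
    if len(parts) != 4:
        raise ValueError("expected <kernel> or <ctx>:<stream>:<kernel>:<invocation>")
    return tuple(parts)


def _field_matches(pattern, numeric_id, name):
    if pattern == "":
        return True  # empty string matches anything
    if re.fullmatch(r"[0-9]+", pattern) and int(pattern) > 0:
        return int(pattern) == numeric_id  # positive number: strict ID match
    return re.search(pattern, name) is not None  # otherwise: regular expression


def kernel_matches(spec, ctx_id, ctx_name, stream_id, stream_name,
                   kernel_name, invocation):
    """Check one kernel launch against a --kernels specification."""
    ctx, stream, kernel, inv = parse_kernel_spec(spec)
    return (
        _field_matches(ctx, ctx_id, ctx_name)
        and _field_matches(stream, stream_id, stream_name)
        and (kernel == "" or re.search(kernel, kernel_name) is not None)
        and _field_matches(inv, invocation, str(invocation))
    )


# Second launch of a kernel whose name contains "bar",
# on context 1 and on the stream named "foo".
print(kernel_matches("1:foo:bar:2", 1, "", 7, "foo", "bar_kernel", 2))
```

Here the context field "1" is compared with the context ID, the stream field "foo" is matched against the stream name, and the invocation field "2" is compared with the invocation number.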
-m, --metrics <metric names>
Specify the metrics to be profiled on certain device(s).
Multiple metric names separated by comma can be specified.
Which device(s) are profiled is controlled by the "--devices"
option. Otherwise metrics will be collected on all devices.
For a list of available metrics, use "--query-metrics".
Use "--metrics all" to profile all metrics available for
each device.
Use "--devices" and "--kernels" to select a specific kernel
invocation.
Note: "--metrics all" does not include some metrics which
are needed for Visual Profiler's source level analysis.
Use "--analysis-metrics".
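Because "--devices" and "--kernels" change the scope of subsequent "--events" and "--metrics" options, the order of options on the command line matters. The Python sketch below assembles a command line with the scope options placed first. The function name `build_nvprof_command` and the application `./my_app` are hypothetical, and the metric names are placeholders that should be replaced with names reported by "--query-metrics" for the target device.

```python
import shlex


def build_nvprof_command(app, metrics, devices=None, kernels=None):
    """Assemble an nvprof argv with scope options ahead of --metrics."""
    argv = ["nvprof"]
    if devices is not None:
        # Scope option: must precede the --metrics it applies to.
        argv += ["--devices", ",".join(str(d) for d in devices)]
    if kernels is not None:
        # Scope option: restricts the following --metrics to matching kernels.
        argv += ["--kernels", kernels]
    argv += ["--metrics", ",".join(metrics)]
    argv += list(app)
    return argv


cmd = build_nvprof_command(
    app=["./my_app"],                       # hypothetical application
    metrics=["achieved_occupancy", "ipc"],  # placeholder metric names
    devices=[0],
    kernels="1:foo:bar:2",
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` directly; `shlex.join` (Python 3.8 or later) is used only to display it as a single shell-style line.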
--profile-all-processes
Profile all processes launched by the same user who launched
this nvprof instance. Note: Only one instance of nvprof
can run with this option at the same time. Under this mode,
there's no need to specify an application to run.
--profile-api-trace <none|runtime|driver|all>
Turn on/off CUDA runtime/driver API tracing. Allowed values:
none - turn off API tracing
runtime - only turn on CUDA runtime API tracing
driver - only turn on CUDA driver API tracing
all - turn on all API tracing (default)
--profile-child-processes
Profile the application and all child processes launched
by it.
--profile-from-start <on|off>
Enable/disable profiling from the start of the application.
If it's disabled, the application can use {cu,cuda}Profiler{Start,Stop}
to turn on/off profiling. Allowed values:
on - enable profiling from start (default)
off - disable profiling from start
--query-events
List all the events available on the device(s). Device(s)
queried can be controlled by the "--devices" option.
--query-metrics
List all the metrics available on the device(s). Device(s)
queried can be controlled by the "--devices" option.
--replay-mode <mode>
Choose replay mode used when not all events/metrics can be
collected in a single run. Allowed values:
disabled - replay is disabled; events/metrics that couldn't
be profiled will be dropped
kernel - each kernel invocation is replayed (default)
application - the entire application is replayed.
This is incompatible with "--profile-all-processes" or "--profile-child-processes"
--system-profiling <on|off>
Turn on/off power, clock, and thermal profiling. Allowed
values:
on - turn on system profiling
off - turn off system profiling (default)
-t, --timeout <seconds>
Set an execution timeout (in seconds) for the CUDA application.
Note: Timeout starts counting from the moment the CUDA driver
is initialized. If the application doesn't call any CUDA
APIs, timeout won't be triggered.
--unified-memory-profiling <per-process-device|off>
Options for unified memory profiling. Allowed values:
per-process-device - collect counts for each process
and each device (default)
off - turn off unified memory profiling
--cpu-profiling <on|off>
Turn on CPU profiling. Note: CPU profiling is not supported
in multi-process mode.
--cpu-profiling-frequency <frequency>
Set the CPU profiling frequency in samples per second. Default
is 100Hz. Maximum is 500Hz.
--cpu-profiling-max-depth <depth>
Set the maximum depth of each call stack. Zero means no limit.
Default is zero.
--cpu-profiling-mode <mode>
Set the output mode of CPU profiling. Allowed values:
"flat" - Show flat profile
"top-down" - Show parent functions at the top
"bottom-up" - Show parent functions at the bottom
(default)
--cpu-profiling-percentage-threshold <threshold>
Filter out the entries that are below the set percentage
threshold. The limit should be an integer between 0 and
100, inclusive. Zero means no limit. Default is zero.
--cpu-profiling-scope <scope>
Choose the profiling scope. Allowed values:
"function" - Each level in the stack trace represents
a distinct function (default)
"instruction" - Each level in the stack trace represents
a distinct instruction address
--cpu-profiling-show-ccff <on|off>
Whether to print Common Compiler Feedback Format (CCFF) messages
embedded in the binary. Note: this option implies "--cpu-profiling-scope
instruction".
--cpu-profiling-show-library <on|off>
Whether to print the library name for each sample.
--cpu-profiling-thread-mode <mode>
Set the thread mode of CPU profiling. Allowed values:
"separated" - Show separate profile for each thread
"aggregated" - Aggregate data from all threads
--openacc-profiling <on|off>
Turn on recording information from OpenACC profiling interface.
Note: whether the OpenACC profiling interface is available depends
on the OpenACC runtime. Default is on.
--context-name <name>
Name of the CUDA context.
"%i" in the context name string is replaced with
the ID of the context.
"%p" in the context name string is replaced with
the process ID of the application being profiled.
"%q{<ENV>}" in the context name string is replaced
with the value of the environment variable "<ENV>". If the
environment variable is not set it's an error.
"%h" in the context name string is replaced with
the hostname of the system.
"%%" in the context name string is replaced with
"%". Any other character following "%" is illegal.
--csv
Use comma-separated values in the output.
--demangling <on|off>
Turn on/off C++ name demangling of function names. Allowed
values:
on - turn on demangling (default)
off - turn off demangling
-u, --normalized-time-unit <s|ms|us|ns|col|auto>
Specify the unit of time that will be used in the output.
Allowed values:
s - second, ms - millisecond, us - microsecond,
ns - nanosecond
col - a fixed unit for each column
auto (default) - the scale is chosen for each value
based on its length.
--openacc-summary-mode <mode>
Set how durations are computed in the OpenACC summary. Allowed
values:
exclusive: show exclusive times (default)
inclusive: show inclusive times
--print-api-summary
Print a summary of CUDA runtime/driver API calls.
--print-api-trace
Print CUDA runtime/driver API trace.
--print-dependency-analysis-trace
Print dependency analysis trace.
--print-gpu-summary
Print a summary of the activities on the GPU (including CUDA
kernels and memcpy's/memset's).
--print-gpu-trace
Print individual kernel invocations (including CUDA memcpy's/memset's)
and sort them in chronological order. In event/metric profiling
mode, show events/metrics for each kernel invocation.
--print-openacc-constructs
Include parent construct names in OpenACC profile.
--print-openacc-summary
Print a summary of the OpenACC profile.
--print-openacc-trace
Print a trace of the OpenACC profile.
-s, --print-summary
Print a summary of the profiling result on screen. Note:
This is the default unless "--export-profile" or other print
options are used.
--print-summary-per-gpu
Print a summary of the profiling result for each GPU.
--process-name <name>
Name of the process.
"%p" in the process name string is replaced with
the process ID of the application being profiled.
"%q{<ENV>}" in the process name string is replaced
with the value of the environment variable "<ENV>". If the
environment variable is not set it's an error.
"%h" in the process name string is replaced with
the hostname of the system.
"%%" in the process name string is replaced with
"%". Any other character following "%" is illegal.
--quiet
Suppress all nvprof output.
--stream-name <name>
Name of the CUDA stream.
"%i" in the stream name string is replaced with the
ID of the stream.
"%p" in the stream name string is replaced with
the process ID of the application being profiled.
"%q{<ENV>}" in the stream name string is replaced
with the value of the environment variable "<ENV>". If the
environment variable is not set it's an error.
"%h" in the stream name string is replaced with
the hostname of the system.
"%%" in the stream name string is replaced with
"%". Any other character following "%" is illegal.
-o, --export-profile <filename>
Export the result file which can be imported later or opened
by the NVIDIA Visual Profiler.
"%p" in the file name string is replaced with the
process ID of the application being profiled.
"%q{<ENV>}" in the file name string is replaced
with the value of the environment variable "<ENV>". If the
environment variable is not set it's an error.
"%h" in the file name string is replaced with the
hostname of the system.
"%%" in the file name string is replaced with "%".
Any other character following "%" is illegal.
By default, this option disables the summary output. Note:
If the application being profiled creates child processes,
or if '--profile-all-processes' is used, the "%p" format
is needed to get correct export files for each process.
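The "%" substitutions described for the export file name can be emulated to preview the name a given template would produce. The Python sketch below follows the rules exactly as listed in the help text and is not nvprof's implementation. The function name `expand_filename` and its parameters are hypothetical.

```python
import os
import re
import socket


def expand_filename(template, pid, hostname=None, env=None):
    """Expand %p, %h, %q{ENV} and %% in a file name template."""
    hostname = socket.gethostname() if hostname is None else hostname
    env = os.environ if env is None else env

    def replace(match):
        token = match.group(1)
        if token == "p":
            return str(pid)   # process ID of the profiled application
        if token == "h":
            return hostname   # hostname of the system
        if token == "%":
            return "%"        # literal percent sign
        if token.startswith("q{") and token.endswith("}"):
            name = token[2:-1]
            if name not in env:
                raise ValueError("environment variable %r is not set" % name)
            return env[name]
        # Any other character following "%" is illegal.
        raise ValueError("illegal sequence: %" + token)

    # Match "%" followed by q{...}, by any single character, or by end of string.
    return re.sub(r"%(q\{[^}]*\}|.|$)", replace, template, flags=re.S)


print(expand_filename("out.%p.%h.nvprof", pid=1234, hostname="node1"))
```

With a template such as "out.%p.nvprof", each profiled process gets its own export file, which is what the note above asks for when child processes are involved.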
-f, --force-overwrite
Force overwriting all output files (any existing files will
be overwritten).
-i, --import-profile <filename>
Import a result profile from a previous run.
--log-file <filename>
Make nvprof send all its output to the specified file, or
one of the standard channels. The file will be overwritten.
If the file doesn't exist, a new one will be created.
"%1" as the whole file name indicates standard output
channel (stdout).
"%2" as the whole file name indicates standard error
channel (stderr). Note: This is the default.
"%p" in the file name string is replaced with the
process ID of the application being profiled.
"%q{<ENV>}" in the file name string is replaced
with the value of the environment variable "<ENV>". If the
environment variable is not set it's an error.
"%h" in the file name string is replaced with the
hostname of the system.
"%%" in the file name is replaced with "%".
Any other character following "%" is illegal.
--print-nvlink-topology
Print NVLink topology.
-h, --help
Print this help information.
-V, --version
Print version information of this tool.