ROCm Support(1)
===============

NAME
----
criu-amdgpu-plugin - A plugin extension to CRIU to support checkpoint/restore in
userspace for AMD GPUs.


CURRENT SUPPORT
---------------
- Single- and multi-GPU systems (Gfx9)
- Checkpoint / Restore on a different system
- Checkpoint / Restore inside a Docker container
- PyTorch
- TensorFlow
- Using CRIU Image Streamer

DESCRIPTION
-----------
Although *criu* can checkpoint and restore many running applications, it
cannot by itself handle applications that have device files open. To support
*ROCm* based workloads, criu's core functionality has to be augmented through
a plugin based extension mechanism. *criu-amdgpu-plugin* provides the support
criu needs to Checkpoint / Restore ROCm applications.


DEPENDENCIES
------------
*amdkfd support*::
    To snapshot *VRAM* and other *GPU* device state, an updated version of
    the amdkfd (amdgpu) driver is required.

OPTIONS
-------
Optional parameters can be passed as environment variables set before
executing the criu command.
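
As a minimal sketch of passing these options from a script, the following
Python helper builds the environment and argument list for a `criu restore`
run. The helper name `build_restore_invocation` and the image directory are
hypothetical; only the variable names come from this page, and `-D` is criu's
image-directory option. Nothing is executed until the result is handed to
`subprocess.run`.

```python
import os


def build_restore_invocation(images_dir, overrides, base_env=None):
    """Return (argv, env) for a criu restore with plugin options applied.

    `overrides` maps option names from this page (e.g. "KFD_FW_VER_CHECK")
    to values; values are converted to strings because environment values
    must be strings.
    """
    env = dict(os.environ if base_env is None else base_env)
    for name, value in overrides.items():
        # Guard against typos: every option on this page starts with KFD_.
        if not name.startswith("KFD_"):
            raise ValueError(f"not a plugin option: {name}")
        env[name] = str(value)
    argv = ["criu", "restore", "-D", images_dir]
    return argv, env


if __name__ == "__main__":
    # Hypothetical image directory; adjust to your setup.
    argv, env = build_restore_invocation(
        "/tmp/criu-images", {"KFD_FW_VER_CHECK": 0}
    )
    print(argv, env["KFD_FW_VER_CHECK"])
    # To actually restore (requires criu, the plugin, and root privileges):
    #   import subprocess
    #   subprocess.run(argv, env=env, check=True)
```

Building the environment as a copy keeps the overrides scoped to the criu
process instead of changing the calling script's own environment.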

*KFD_FW_VER_CHECK*::
    Enable or disable the firmware version check.
    If enabled, the firmware version on the restored GPU must be greater than
    or equal to the firmware version on the checkpointed GPU. Default: Enabled

    E.g.:
    KFD_FW_VER_CHECK=0

*KFD_SDMA_FW_VER_CHECK*::
    Enable or disable the SDMA firmware version check.
    If enabled, the SDMA firmware version on the restored GPU must be greater
    than or equal to the SDMA firmware version on the checkpointed GPU.
    Default: Enabled

    E.g.:
    KFD_SDMA_FW_VER_CHECK=0

*KFD_CACHES_COUNT_CHECK*::
    Enable or disable the caches count check. If enabled, the caches count on
    the restored GPU must be greater than or equal to the caches count on the
    checkpointed GPU. Default: Enabled

    E.g.:
    KFD_CACHES_COUNT_CHECK=0

*KFD_NUM_GWS_CHECK*::
    Enable or disable the num_gws check. If enabled, num_gws on the
    restored GPU must be greater than or equal to num_gws on the checkpointed
    GPU. Default: Enabled

    E.g.:
    KFD_NUM_GWS_CHECK=0

*KFD_VRAM_SIZE_CHECK*::
    Enable or disable the VRAM size check. If enabled, the VRAM size on the
    restored GPU must be greater than or equal to the VRAM size on the
    checkpointed GPU. Default: Enabled

    E.g.:
    KFD_VRAM_SIZE_CHECK=0

*KFD_NUMA_CHECK*::
    Enable or disable the NUMA CPU region check. If enabled, GPUs that belong
    to the same CPU NUMA region on the checkpointed system are restored to
    GPUs that belong to the same CPU NUMA region.
    Default: Enabled

    E.g.:
    KFD_NUMA_CHECK=1

*KFD_CAPABILITY_CHECK*::
    Enable or disable the capability check. If enabled, the capability on
    the restored GPU must be equal to the capability on the checkpointed GPU.
    Default: Enabled

    E.g.:
    KFD_CAPABILITY_CHECK=1

*KFD_MAX_BUFFER_SIZE*::
    On some systems, VRAM size may exceed RAM size, so the buffers used for
    dumping and restoring VRAM may not fit in memory. Set to a nonzero size
    (in bytes; the example below uses a G suffix) to limit the plugin's
    memory usage.
    Default: 0 (Disabled)

    E.g.:
    KFD_MAX_BUFFER_SIZE="2G"
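
The example value `2G` uses a size suffix. A minimal sketch of how such a
value could be converted to a byte count, assuming binary multipliers
(K = 2^10, M = 2^20, G = 2^30). The function name `parse_size` is
hypothetical, and this is an illustration rather than the plugin's actual
parser.

```python
import re

# Assumed binary multipliers for the optional suffix.
_MULTIPLIERS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(text):
    """Convert a string such as "2G" or "4096" to a number of bytes.

    A result of 0 corresponds to the documented default (limit disabled).
    """
    match = re.fullmatch(r"\s*(\d+)\s*([KMG]?)\s*", text.upper())
    if match is None:
        raise ValueError(f"invalid size: {text!r}")
    number, suffix = match.groups()
    return int(number) * _MULTIPLIERS[suffix]


if __name__ == "__main__":
    print(parse_size("2G"))
```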

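The "greater than or equal" checks described above (firmware version, SDMA
firmware version, caches count, num_gws, VRAM size) share one pattern, which
can be sketched as follows. The function names are hypothetical, and the
sketch assumes that `0` disables a check and `1` enables it, as in the
examples on this page. It illustrates the documented behaviour and is not
the plugin's code.

```python
def check_enabled(env, name):
    """Return True unless the option is explicitly set to "0".

    Assumption: "0" disables and "1" enables; an unset variable keeps the
    documented default (enabled).
    """
    value = env.get(name)
    if value is None or value == "1":
        return True
    if value == "0":
        return False
    raise ValueError(f"unexpected value for {name}: {value!r}")


def minimum_satisfied(env, name, checkpointed, restored):
    """Apply a "greater than or equal" check, or skip it when disabled."""
    if not check_enabled(env, name):
        return True  # check disabled: any restored value is accepted
    return restored >= checkpointed


if __name__ == "__main__":
    # Restored GPU has older firmware, but the check is switched off.
    env = {"KFD_FW_VER_CHECK": "0"}
    print(minimum_satisfied(env, "KFD_FW_VER_CHECK", 12, 10))
```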

AUTHOR
------
The AMDKFD team.


COPYRIGHT
---------
Copyright \(C) 2020-2021, Advanced Micro Devices, Inc. (AMD)
