
Getting Started with GDB Debugging

Preface

While writing a MindSpore operator, my own carelessness, or sometimes a plain misunderstanding, led to errors at compile time and at run time. Simple syntax errors are caught by the syntax checker, but some errors only show up at run time, and those are much harder to track down: after the crash the program only reports a core dump. The printf approach can tell you on which line the core dump happened, but working out what actually caused it still requires a debugger. This is where GDB comes in. I had long known GDB by reputation, but this was the first time I used it on a real problem, so the process deserves to be written down.

About the Operator

Before getting into the debugging itself, a quick introduction to the operator. It is called Sample Distorted Bounding Box V2, and it roughly does this: for an image, it generates a bounding box of random size whose aspect ratio and covered area satisfy the given constraints. The box can then be used to crop the image or randomly distort part of it, so the operator is commonly used for training-data augmentation.

The competitor's product only implements a CPU version of this operator, while MindSpore requires a GPU version. To take advantage of CUDA's parallelism, my implementation idea is as follows: use CUDA to randomly generate N candidate boxes, copy the box data from the device to the host, run the selection logic on the host to pick one box that satisfies the constraints, and finally copy the result back to the device as the operator's output. Shuttling data back and forth between host and device may cost a lot of time, but getting it working comes first; optimization can come later.
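To make that flow concrete, here is a minimal sketch of the idea; GenerateBoxesKernel, IsAcceptable, SampleOneBox and kNumCandidates are illustrative names, not part of the actual MindSpore code:

#include <cuda_runtime.h>
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical kernel: each thread writes one candidate box (4 ints) into `boxes`.
__global__ void GenerateBoxesKernel(int32_t *boxes, int num_candidates) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= num_candidates) return;
  boxes[i * 4 + 0] = i;       // x_min (dummy values, for illustration only)
  boxes[i * 4 + 1] = i;       // y_min
  boxes[i * 4 + 2] = i + 10;  // x_max
  boxes[i * 4 + 3] = i + 10;  // y_max
}

// Hypothetical host-side predicate standing in for the real constraints.
bool IsAcceptable(const int32_t *box) { return box[2] - box[0] >= 5; }

void SampleOneBox(int32_t *dev_result, cudaStream_t stream) {
  constexpr int kNumCandidates = 100;
  std::vector<int32_t> host_boxes(kNumCandidates * 4);
  int32_t *dev_boxes = nullptr;
  cudaMalloc(&dev_boxes, host_boxes.size() * sizeof(int32_t));

  // 1. Device: generate candidate boxes in parallel.
  GenerateBoxesKernel<<<(kNumCandidates + 255) / 256, 256, 0, stream>>>(dev_boxes, kNumCandidates);

  // 2. Device -> Host: copy the candidates out and wait for the copy to finish.
  cudaMemcpyAsync(host_boxes.data(), dev_boxes, host_boxes.size() * sizeof(int32_t),
                  cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);

  // 3. Host: sequentially pick the first acceptable candidate.
  int32_t chosen[4] = {0, 0, 0, 0};
  for (int i = 0; i < kNumCandidates; ++i) {
    if (IsAcceptable(&host_boxes[i * 4])) {
      std::copy_n(&host_boxes[i * 4], 4, chosen);
      break;
    }
  }

  // 4. Host -> Device: copy the chosen box back as the operator's result.
  cudaMemcpyAsync(dev_result, chosen, sizeof(chosen), cudaMemcpyHostToDevice, stream);
  cudaFree(dev_boxes);
}

Only the candidate generation is parallelized; the selection step stays sequential on the host, which is tolerable here because the number of candidates is small.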

Since the operator front end was already in place, only the GPU kernel implementation was needed. It is split across the following four source files; the code is fairly long, so I have put it in the last section of this post:

  • sample_distorted_bounding_box_v2_gpu_kernel.h: header for the GPU operator; defines the concrete operator class by inheriting from the GPU kernel base class.
  • sample_distorted_bounding_box_v2_gpu_kernel.cc: implementation of the GPU operator, typically covering argument checking, memory allocation, operator initialization, and operator registration.
  • sample_distorted_bounding_box_v2_impl.cuh: header for the CUDA implementation; declares the entry point of the CUDA kernel.
  • sample_distorted_bounding_box_v2_impl.cu: the actual CUDA kernel implementation.

After coding it up along these lines, I built the project into a Python package, installed it, and ran a test. The first error was undefined symbol. This usually means the parameter types of a function call and its definition do not line up: c++filt demangles the symbol in the error message into the function being called, and comparing its parameters against the definition in the source shows which types are mismatched; aligning them fixes the error.

Undefined symbol error

c++filt
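As a minimal illustration of how such a mismatch produces an undefined symbol (the function Launch and the file layout here are hypothetical, not the MindSpore code):

// launch.h: declaration the caller compiles against.
void Launch(int64_t max_attempts);

// launch.cc: definition that actually ends up in the shared library.
// The parameter type differs (int instead of int64_t), so the mangled names
// differ and loading the library fails with: undefined symbol: _Z6Launchl
void Launch(int max_attempts) { /* ... */ }

// Demangling the symbol from the error message shows the signature the
// caller expected, which can then be compared against the definition:
//   $ c++filt _Z6Launchl
//   Launch(long)

On a 64-bit Linux build int64_t is long, which is why the expected signature demangles to Launch(long).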

With that error resolved, the next one was a core dump, which is where GDB comes in.

Debugging with GDB

First, switch to the project root directory. GDB is launched relative to the current directory, and starting it from the project root makes it easier to set breakpoints on source files in a moment.

Then start GDB, passing it the command to debug. In my case the program under test is run with Python, so everything after --args is simply the Python command that runs the script.

gdb --args python ~/pyProject/mindspore-dev/SampleDistortedBoundingBox/test_mindspore.py

Start GDB

Once the debugger is up, find the source file and line you want to break on and set the breakpoint. For example, I want to check, inside the operator's LaunchKernel function and right before the CUDA kernel is invoked, whether the arguments passed to the CUDA kernel are valid, so I set a breakpoint just before the call to SampleDistortedBoundingBoxV2.

template <typename T>
bool SampleDistortedBoundingBoxV2GpuKernelMod::LaunchKernel(const std::vector<AddressPtr> &inputs,
                                                            const std::vector<AddressPtr> &workspace,
                                                            const std::vector<AddressPtr> &outputs) {
  if (is_null_input_) {
    return true;
  }
  T *image_size_addr = GetDeviceAddress<T>(inputs, kIndex0);
  int32_t *workspace_addr = GetDeviceAddress<int32_t>(workspace, kIndex0);
  void *curandState_addr = GetDeviceAddress<void *>(workspace, kIndex1);
  curandState *devStates = reinterpret_cast<curandState *>(curandState_addr);
  SampleDistortedBoundingBoxV2(image_size_addr, seed_, seed2_,
                               aspect_ratio_range_[kIndex0], aspect_ratio_range_[kIndex1],
                               area_range_[kIndex0], area_range_[kIndex1],
                               max_attempts_, use_image_if_no_bounding_boxes_,
                               workspace_addr, devStates,
                               reinterpret_cast<cudaStream_t>(cuda_stream_));
  SelectBox<T>(inputs, workspace, outputs);
  return true;
}

The call to SampleDistortedBoundingBoxV2 sits at line 365 of the source file, so set a breakpoint at line 365.

b mindspore/ccsrc/plugin/device/gpu/kernel/other/sample_distorted_bounding_box_v2_gpu_kernel.cc:365

GDB will complain that this file is not part of the program's source and ask whether to make the breakpoint pending on a future shared library load; answer y. The program has not started running yet: the operator we wrote is compiled into a shared library that has not been loaded, so GDB cannot tell whether the breakpoint will ever be hit. If a shared library containing that source file is loaded later, GDB installs the breakpoint and stops when execution reaches it.

After setting the breakpoint, info br shows its details; because the shared library has not been loaded, the Address column is still <PENDING>.

Viewing the breakpoint

Next, type r to run the program. Execution stops when it reaches the breakpoint, and you can inspect the state there: info args lists the arguments of the current stack frame, print xxx prints the value of a variable, and backtrace shows the call stack.

Viewing variable values

While inspecting image_size_addr, I found that the value at this address was always 0, so the input image size could never be read correctly. The reason is that image_size_addr is a device-side address: the data actually lives in GPU memory, and dereferencing it directly on the host is an error.

But that raises the next question: isn't SampleDistortedBoundingBoxV2 written in a .cu file? Isn't reading data through a device-side address exactly what CUDA code is supposed to do? Time to look at the function's source.

template <typename T>
void SampleDistortedBoundingBoxV2(T *image_size, int seed, int seed2, float ms_min_sample_aspect_ratio,
                                  float ms_max_sample_aspect_ratio, float min_area_range, float max_area_range,
                                  int max_attempts, bool use_image_if_no_bounding_boxes, int *output,
                                  curandState *globalState, cudaStream_t cuda_stream) {
  int ms_original_height = static_cast<int>(image_size[0]);
  int ms_original_width = static_cast<int>(image_size[1]);

  int RNG_seed = 0;
  std::random_device rd;
  if (seed2 != 0) {
    RNG_seed = seed2;
  } else if (seed != 0) {
    RNG_seed = seed;
  } else {
    RNG_seed = static_cast<int>(rd());
  }
  GenerateRandomCropKernel<<<GET_BLOCKS(max_attempts), GET_THREADS, 0, cuda_stream>>>(
    RNG_seed, globalState, ms_original_height, ms_original_width, min_area_range, max_area_range,
    ms_min_sample_aspect_ratio, ms_max_sample_aspect_ratio, output, max_attempts);
  return;
}

At first glance nothing looks wrong. In fact, though, a function or variable only runs on (or lives in) the device when its definition carries the __device__ or __global__ qualifier. So although SampleDistortedBoundingBoxV2 is written in a .cu file, it still executes on the host, and reading image_size through a device pointer there is an illegal memory access, which is exactly what caused the core dump.
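A minimal sketch of that distinction, assuming a single int sitting in device memory; ReadOnDevice, ReadOnHostWrong and ReadOnHostRight are illustrative names, not part of the operator's source:

#include <cuda_runtime.h>
#include <cstdio>

// Marked __global__: this runs on the device, so dereferencing a device pointer is fine.
__global__ void ReadOnDevice(const int *dev_val) {
  printf("device sees %d\n", *dev_val);
}

// No __device__/__global__ qualifier: even though it may live in a .cu file,
// this is ordinary host code, and *dev_val is an illegal host-side dereference.
void ReadOnHostWrong(const int *dev_val) {
  printf("host sees %d\n", *dev_val);  // undefined behaviour / crash
}

// The host must copy the value out of device memory before touching it.
void ReadOnHostRight(const int *dev_val) {
  int host_val = 0;
  cudaMemcpy(&host_val, dev_val, sizeof(int), cudaMemcpyDeviceToHost);
  printf("host sees %d\n", host_val);
}

int main() {
  int h = 42;
  int *dev_val = nullptr;
  cudaMalloc(&dev_val, sizeof(int));
  cudaMemcpy(dev_val, &h, sizeof(int), cudaMemcpyHostToDevice);
  ReadOnDevice<<<1, 1>>>(dev_val);
  cudaDeviceSynchronize();
  ReadOnHostRight(dev_val);
  // ReadOnHostWrong(dev_val);  // would fail, just like the original LaunchKernel
  cudaFree(dev_val);
  return 0;
}

The fix below follows the ReadOnHostRight pattern: copy the data to the host first, then use it.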

Once the cause of the core dump is clear, fixing it is easy: copy image_size from the device to the host before using it. The modified LaunchKernel looks like this:

template <typename T>
bool SampleDistortedBoundingBoxV2GpuKernelMod::LaunchKernel(const std::vector<AddressPtr> &inputs,
                                                            const std::vector<AddressPtr> &workspace,
                                                            const std::vector<AddressPtr> &outputs) {
  if (is_null_input_) {
    return true;
  }
  cudaStream_t stream = reinterpret_cast<cudaStream_t>(cuda_stream_);
  T *image_size_addr = GetDeviceAddress<T>(inputs, kIndex0);
  T image_size[kShapeSize3];
  CHECK_CUDA_RET_WITH_EXCEPT_NOTRACE(
    cudaMemcpyAsync(&image_size, image_size_addr, kShapeSize3 * unit_dtype_size_, cudaMemcpyDeviceToHost, stream),
    "cudaMemcpy failed.");

  int32_t *workspace_addr = GetDeviceAddress<int32_t>(workspace, kIndex0);
  void *curandState_addr = GetDeviceAddress<void *>(workspace, kIndex1);
  curandState *devStates = reinterpret_cast<curandState *>(curandState_addr);
  SampleDistortedBoundingBoxV2(image_size, seed_, seed2_,
                               aspect_ratio_range_[kIndex0], aspect_ratio_range_[kIndex1],
                               area_range_[kIndex0], area_range_[kIndex1],
                               max_attempts_, use_image_if_no_bounding_boxes_,
                               workspace_addr, devStates,
                               reinterpret_cast<cudaStream_t>(cuda_stream_));
  SelectBox<T>(inputs, workspace, outputs);
  return true;
}
Rebuild and reinstall the mindspore-gpu package, start GDB again, and set the breakpoint in the same place. When the program stops there, inspecting image_size shows that GDB now prints the correct values, and the variable's address is a proper host-side address.

After the fix

Summary

Before this exercise I naively thought GDB could only debug C/C++ programs, and only standalone executables at that. This time I learned that GDB is actually a general-purpose debugger and extremely powerful. I had used GDB before, but only on toy C/C++ programs like Hello World, so when a large project needed debugging I had no idea where to start. If the printf approach hadn't completely failed me this time, I would still be dragging my feet about learning GDB and putting it to real use. Having actually used it to solve a problem, I now think GDB is fantastic.

As for how GDB works internally, I still have no idea; next I need to spend some time learning how GDB actually does its job, to deepen my understanding of it.

Finally, a GDB command cheat sheet, linked here.

Source Code

  1. sample_distorted_bounding_box_v2_gpu_kernel.h
/**
* Copyright 2022 Huawei Technologies Co., Ltd
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#ifndef MINDSPORE_CCSRC_BACKEND_KERNEL_COMPILER_GPU_SAMPLE_DISTORTED_BOUNDING_BOX_V2_H_
#define MINDSPORE_CCSRC_BACKEND_KERNEL_COMPILER_GPU_SAMPLE_DISTORTED_BOUNDING_BOX_V2_H_

#include <curand_kernel.h>
#include <cuda_runtime_api.h>
#include <cstdint>
#include <vector>
#include "kernel/common_utils.h"
#include "mindapi/base/type_id.h"
#include "plugin/device/gpu/kernel/gpu_kernel.h"
#include "plugin/device/gpu/kernel/gpu_kernel_factory.h"

namespace mindspore {
namespace kernel {

class Region {
public:
Region() { SetPoint(0, 0, 0, 0); }
Region(int xmin, int ymin, int xmax, int ymax) { SetPoint(xmin, ymin, xmax, ymax); }

void SetPoint(int xmin, int ymin, int xmax, int ymax) {
min_x_ = xmin;
min_y_ = ymin;
max_x_ = xmax;
max_y_ = ymax;
}

float Area() const { return static_cast<float>((max_x_ - min_x_) * (max_y_ - min_y_)); }

Region Intersect(const Region &r) const {
const int pmin_x = std::max(min_x_, r.min_x_);
const int pmin_y = std::max(min_y_, r.min_y_);
const int pmax_x = std::min(max_x_, r.max_x_);
const int pmax_y = std::min(max_y_, r.max_y_);
if (pmin_x > pmax_x || pmin_y > pmax_y) {
return Region();
} else {
return Region(pmin_x, pmin_y, pmax_x, pmax_y);
}
}
int min_x_;
int min_y_;
int max_x_;
int max_y_;
};

class SampleDistortedBoundingBoxV2GpuKernelMod : public NativeGpuKernelMod, public MatchKernelHelper<SampleDistortedBoundingBoxV2GpuKernelMod> {
public:
SampleDistortedBoundingBoxV2GpuKernelMod() = default;
~SampleDistortedBoundingBoxV2GpuKernelMod() override = default;

bool Launch(const std::vector<AddressPtr> &inputs, const std::vector<AddressPtr> &workspace,
const std::vector<AddressPtr> &outputs, void *cuda_stream) override {
if (is_null_input_) {
return true;
}
cuda_stream_ = cuda_stream;
return kernel_func_(this, inputs, workspace, outputs);
}

bool Init(const BaseOperatorPtr &base_operator, const std::vector<KernelTensorPtr> &inputs,
const std::vector<KernelTensorPtr> &outputs) override;

int Resize(const BaseOperatorPtr &base_operator, const std::vector<KernelTensorPtr> &inputs,
const std::vector<KernelTensorPtr> &outputs, const std::map<uint32_t, tensor::TensorPtr> &) override;

const std::vector<std::pair<KernelAttr, KernelRunFunc>> &GetFuncList() const override;

protected:
void ResetResource() noexcept {
bounding_boxes_elements_ = 0;
is_null_input_ = false;
input_size_list_.clear();
output_size_list_.clear();
workspace_size_list_.clear();
}

std::vector<KernelAttr> GetOpSupport() override { return OpSupport(); }

private:
bool SatisfiesOverlapConstraints(const Region &crop, float minimum_object_covered,
const std::vector<Region> &bounding_boxes);

template <typename T>
void SelectBox(const std::vector<kernel::AddressPtr> &inputs, const std::vector<AddressPtr> &workspace,
const std::vector<kernel::AddressPtr> &outputs);

template <typename T>
bool LaunchKernel(const std::vector<kernel::AddressPtr> &inputs, const std::vector<AddressPtr> &workspace,
const std::vector<kernel::AddressPtr> &outputs);

TypeId dtype_{kTypeUnknown};
int64_t unit_dtype_size_{0};
int64_t bounding_boxes_elements_{0};
bool is_null_input_{false};
void *cuda_stream_{nullptr};

int64_t seed_{0};
int64_t seed2_{0};
std::vector<float> aspect_ratio_range_;
std::vector<float> area_range_;
int64_t max_attempts_{100};
bool use_image_if_no_bounding_boxes_{false};
};

} // namespace kernel
} // namespace mindspore
#endif // MINDSPORE_CCSRC_BACKEND_KERNEL_COMPILER_GPU_SAMPLE_DISTORTED_BOUNDING_BOX_V2_H_
  2. sample_distorted_bounding_box_v2_gpu_kernel.cc
/**
* Copyright 2022 Huawei Technologies Co., Ltd
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include "kernel/kernel.h"
#include "ops/sample_distorted_bounding_box_v2.h"
#include "plugin/device/gpu/hal/device/gpu_common.h"
#include "plugin/device/gpu/kernel/other/sample_distorted_bounding_box_v2_gpu_kernel.h"
#include "targets/x86_64-linux/include/cuda_runtime_api.h"
#include "targets/x86_64-linux/include/curand_kernel.h"
#include "targets/x86_64-linux/include/driver_types.h"
#include "plugin/device/gpu/kernel/cuda_impl/cuda_ops/sample_distorted_bounding_box_v2_impl.cuh"

namespace mindspore {
namespace kernel {
namespace {
using KernelRunFunc = SampleDistortedBoundingBoxV2GpuKernelMod::KernelRunFunc;
#define ADD_KERNEL(image_size_dtype, kernel_type) \
{ \
KernelAttr() \
.AddInputAttr(image_size_dtype) \
.AddInputAttr(kNumberTypeFloat32) \
.AddInputAttr(kNumberTypeFloat32) \
.AddOutputAttr(image_size_dtype) \
.AddOutputAttr(image_size_dtype) \
.AddOutputAttr(kNumberTypeFloat32), \
&SampleDistortedBoundingBoxV2GpuKernelMod::LaunchKernel<kernel_type> \
}

constexpr size_t kOutputSize = 3;
constexpr size_t kInputSize = 3;
constexpr size_t kIndex0 = 0;
constexpr size_t kIndex1 = 1;
constexpr size_t kIndex2 = 2;
constexpr size_t kIndex3 = 3;
constexpr size_t kBBoxesDimension = 3;
constexpr size_t kShapeSize1 = 1;
constexpr size_t kShapeSize2 = 2;
constexpr size_t kShapeSize3 = 3;
constexpr size_t kShapeSize4 = 4;
constexpr size_t kNumber0 = 0;
constexpr float kFloatNum0 = 0.0;
constexpr float kFloatNum1 = 1.0;
}

bool SampleDistortedBoundingBoxV2GpuKernelMod::Init(const BaseOperatorPtr &base_operator,
const std::vector<KernelTensorPtr> &inputs,
const std::vector<KernelTensorPtr> &outputs) {
MS_EXCEPTION_IF_NULL(base_operator);
kernel_name_ = base_operator->name();

if (!MatchKernelFunc(base_operator, inputs, outputs)) {
return false;
}

auto kernel_attr = GetKernelAttrFromTensors(inputs, outputs);
dtype_ = kernel_attr.GetInputAttr(kIndex0).first;
unit_dtype_size_ = abstract::TypeIdSize(dtype_);

auto kernel_ptr = std::make_shared<ops::SampleDistortedBoundingBoxV2>(base_operator->GetPrim());
seed_ = static_cast<int64_t>(GetValue<int64_t>(kernel_ptr->GetAttr("seed")));
seed2_ = static_cast<int64_t>(GetValue<int64_t>(kernel_ptr->GetAttr("seed2")));
aspect_ratio_range_ = GetValue<std::vector<float>>(kernel_ptr->GetAttr("aspect_ratio_range"));
area_range_ = GetValue<std::vector<float>>(kernel_ptr->GetAttr("area_range"));
max_attempts_ = static_cast<int64_t>(GetValue<int64_t>(kernel_ptr->GetAttr("max_attempts")));
use_image_if_no_bounding_boxes_ = static_cast<bool>(GetValue<bool>(kernel_ptr->GetAttr("use_image_if_no_bounding_boxes")));

std::vector<int64_t> shape_image_size = std::vector<int64_t>(inputs.at(kIndex0)->GetDeviceShapeAdaptively().begin(),
inputs.at(kIndex0)->GetDeviceShapeAdaptively().end());
std::vector<int64_t> shape_bounding_boxes = std::vector<int64_t>(inputs.at(kIndex1)->GetDeviceShapeAdaptively().begin(),
inputs.at(kIndex1)->GetDeviceShapeAdaptively().end());
size_t shape_dim_image_size = shape_image_size.size();
size_t shape_dim_bounding_boxes = shape_bounding_boxes.size();
if (shape_dim_image_size != kShapeSize1) {
MS_LOG(EXCEPTION) << "For '" << kernel_name_ << "', image_size must be 1-dimensional, got: ["
<< shape_dim_image_size << "].";
}
if (shape_image_size[kIndex0] != kShapeSize3) {
MS_LOG(EXCEPTION) << "For '" << kernel_name_ << "', image_size must contain 3 elements, got: ["
<< shape_image_size[kIndex0] << "].";
}
if (shape_dim_bounding_boxes != kBBoxesDimension) {
MS_LOG(EXCEPTION) << "For '" << kernel_name_ << "', bounding_boxes must be 3-dimensional"
<< " [batch, num_boxes, coords], got: [" << shape_dim_bounding_boxes << "].";
}
if (shape_bounding_boxes[shape_dim_bounding_boxes - 1] != kShapeSize4) {
MS_LOG(EXCEPTION) << "For '" << kernel_name_ << "', bounding_boxes must have shape [4], got: ["
<< shape_bounding_boxes[shape_dim_bounding_boxes - 1] << "].";
}

if (max_attempts_ <= SizeToLong(kNumber0)) {
MS_LOG(EXCEPTION) << "For '" << kernel_name_ << "', max_attempts must be positive: [" << max_attempts_ << "].";
}
if (aspect_ratio_range_[kIndex1] <= kFloatNum0 || aspect_ratio_range_[kIndex0] <= kFloatNum0) {
MS_LOG(EXCEPTION) << "For '" << kernel_name_ << "', aspect_ratio_range must be positive: ["
<< aspect_ratio_range_[kIndex0] << "], [" << aspect_ratio_range_[kIndex1] << "].";
}
if (area_range_[kIndex1] <= kFloatNum0 || area_range_[kIndex0] <= kFloatNum0) {
MS_LOG(EXCEPTION) << "For '" << kernel_name_ << "', area_range must be positive: [" << area_range_[kIndex0] << "], ["
<< area_range_[kIndex1] << "].";
}
if (area_range_[kIndex1] > kFloatNum1 || area_range_[kIndex0] > kFloatNum1) {
MS_LOG(EXCEPTION) << "For '" << kernel_name_ << "', area_range must be less then or equal to 1.0: ["
<< area_range_[kIndex0] << "], [" << area_range_[kIndex1] << "].";
}
if (aspect_ratio_range_.size() != kShapeSize2) {
MS_LOG(EXCEPTION) << "For '" << kernel_name_ << "', aspect_ratio_range field must specify 2 dimensions.";
}
if (area_range_.size() != kShapeSize2) {
MS_LOG(EXCEPTION) << "For '" << kernel_name_ << "', area_range field must specify 2 dimensions.";
}
return true;
}

int SampleDistortedBoundingBoxV2GpuKernelMod::Resize(const BaseOperatorPtr &base_operator,
const std::vector<KernelTensorPtr> &inputs,
const std::vector<KernelTensorPtr> &outputs,
const std::map<uint32_t, tensor::TensorPtr> &) {
for (auto input : inputs) {
auto input_shape = input->GetShapeVector();
if (!IsValidShape(input_shape)) {
return KRET_UNKNOWN_SHAPE;
}
}
ResetResource();
std::vector<int64_t> shape_bounding_boxes = std::vector<int64_t>(inputs.at(kIndex1)->GetDeviceShapeAdaptively().begin(),
inputs.at(kIndex1)->GetDeviceShapeAdaptively().end());
bounding_boxes_elements_ = std::accumulate(shape_bounding_boxes.begin(), shape_bounding_boxes.end(), 1, std::multiplies<int64_t>());
input_size_list_.emplace_back(kShapeSize3 * unit_dtype_size_); // input image size
input_size_list_.emplace_back(bounding_boxes_elements_ * abstract::TypeIdSize(kNumberTypeFloat32)); // input bboxes
input_size_list_.emplace_back(kShapeSize1 * abstract::TypeIdSize(kNumberTypeFloat32)); // input min object covered
output_size_list_.emplace_back(kShapeSize3 * unit_dtype_size_); // output begin
output_size_list_.emplace_back(kShapeSize3 * unit_dtype_size_); // output size
output_size_list_.emplace_back(kShapeSize1 * kShapeSize1 * kShapeSize4 * abstract::TypeIdSize(kNumberTypeFloat32)); // output bboxes
workspace_size_list_.push_back(kShapeSize1 * kShapeSize1 * kShapeSize4 * max_attempts_ * sizeof(int32_t)); // number of boxes
workspace_size_list_.push_back(kShapeSize1 * kShapeSize1 * kShapeSize4 * max_attempts_ * sizeof(curandState)); // number of boxes
return KRET_OK;
}

bool SampleDistortedBoundingBoxV2GpuKernelMod::SatisfiesOverlapConstraints(const Region &crop,
float minimum_object_covered,
const std::vector<Region> &bounding_boxes) {
const float kMinArea = 1.0;
if (crop.Area() < kMinArea) {
return false;
}

bool is_object_covered = false;
for (const auto &bbox : bounding_boxes) {
const float object_area = bbox.Area();
if (object_area < kMinArea) {
continue;
}

const float object_covered = object_area != 0 ? crop.Intersect(bbox).Area() / object_area : 0;
if (object_covered >= minimum_object_covered) {
is_object_covered = true;
break;
}
}
return is_object_covered;
}

template <typename T>
void SampleDistortedBoundingBoxV2GpuKernelMod::SelectBox(const std::vector<AddressPtr> &inputs,
const std::vector<AddressPtr> &workspace,
const std::vector<AddressPtr> &outputs) {
cudaStream_t stream = reinterpret_cast<cudaStream_t>(cuda_stream_);

T *image_size_addr = GetDeviceAddress<T>(inputs, kIndex0);
T image_size[kShapeSize3];
CHECK_CUDA_RET_WITH_EXCEPT_NOTRACE(
cudaMemcpyAsync(&image_size, image_size_addr, kShapeSize3 * unit_dtype_size_, cudaMemcpyDeviceToHost, stream),
"cudaMemcpy failed.");
float *bounding_boxes_addr = GetDeviceAddress<float>(inputs, kIndex1);
float bounding_boxes[bounding_boxes_elements_];
CHECK_CUDA_RET_WITH_EXCEPT_NOTRACE(
cudaMemcpyAsync(&bounding_boxes, bounding_boxes_addr, bounding_boxes_elements_ * abstract::TypeIdSize(kNumberTypeFloat32), cudaMemcpyDeviceToHost, stream),
"cudaMemcpy failed.");
float *min_object_covered_addr = GetDeviceAddress<float>(inputs, kIndex2);
float min_object_covered[kShapeSize1];
CHECK_CUDA_RET_WITH_EXCEPT_NOTRACE(
cudaMemcpyAsync(&min_object_covered, min_object_covered_addr, kShapeSize1 * abstract::TypeIdSize(kNumberTypeFloat32), cudaMemcpyDeviceToHost, stream),
"cudaMemcpy failed.");
//size_t boxes_elements_count = kShapeSize1 * kShapeSize1 * kShapeSize4 * max_attempts_;

int32_t *workspace_ptr = GetDeviceAddress<int32_t>(workspace, kIndex0);
size_t boxes_elements_count = kShapeSize1 * kShapeSize1 * kShapeSize4 * max_attempts_;
int32_t generated_boxes[boxes_elements_count];
CHECK_CUDA_RET_WITH_EXCEPT_NOTRACE(
cudaMemcpyAsync(&generated_boxes, workspace_ptr, boxes_elements_count * sizeof(int32_t), cudaMemcpyDeviceToHost, stream),
"cudaMemcpy failed.");

T *begin_addr = GetDeviceAddress<T>(outputs, kIndex0);
T begin[kShapeSize3];
CHECK_CUDA_RET_WITH_EXCEPT_NOTRACE(
cudaMemcpyAsync(&begin, begin_addr, kShapeSize3 * unit_dtype_size_, cudaMemcpyDeviceToHost, stream),
"cudaMemcpy failed.");
T *size_addr = GetDeviceAddress<T>(outputs, kIndex1);
T size[kShapeSize3];
CHECK_CUDA_RET_WITH_EXCEPT_NOTRACE(
cudaMemcpyAsync(&size, size_addr, kShapeSize3 * unit_dtype_size_, cudaMemcpyDeviceToHost, stream),
"cudaMemcpy failed.");
float *bboxes_addr = GetDeviceAddress<float>(outputs, kIndex2);
size_t bboxes_elements_count = kShapeSize1 * kShapeSize1 * kShapeSize4;
float bboxes[bboxes_elements_count];
CHECK_CUDA_RET_WITH_EXCEPT_NOTRACE(
cudaMemcpyAsync(&bboxes, bboxes_addr, bboxes_elements_count * sizeof(float), cudaMemcpyDeviceToHost, stream),
"cudaMemcpy failed.");

const int32_t height = static_cast<int32_t>(image_size[kIndex0]);
const int32_t width = static_cast<int32_t>(image_size[kIndex1]);
if (!(height > 0 && width > 0)) {
MS_LOG(EXCEPTION) << "For '" << kernel_name_ << "', image height and width must be positive, got: [" << height
<< "] and [" << width << "].";
}

float min_object_covered_val = 0.0;
min_object_covered_val = *min_object_covered;
if (min_object_covered_val < 0.0 || min_object_covered_val > 1.0) {
MS_LOG(EXCEPTION) << "For '" << kernel_name_ << "', min_object_covered must be in [0.0, 1.0], got: ["
<< min_object_covered_val << "].";
}

std::vector<Region> boxes;
size_t size_bounding_boxes = inputs[kIndex1]->size / sizeof(float);
for (size_t b = 0; b < size_bounding_boxes / kShapeSize4; ++b) {
for (size_t i = 0; i < kShapeSize4; ++i) {
if (bounding_boxes[b * kShapeSize4 + i] < 0.0 || bounding_boxes[b * kShapeSize4 + i] > 1.0) {
MS_LOG(EXCEPTION) << "For '" << kernel_name_ << "', all bounding box coordinates must in [0.0, 1.0], got: ["
<< bounding_boxes[b * kShapeSize4 + i] << "].";
}
}
if (!(bounding_boxes[b * kShapeSize4 + kIndex1] < bounding_boxes[b * kShapeSize4 + kIndex3])) {
MS_LOG(EXCEPTION) << "For '" << kernel_name_ << "', x_min of bounding box must be less than x_max, got: ["
<< bounding_boxes[b * kShapeSize4 + kIndex1] << "] and ["
<< bounding_boxes[b * kShapeSize4 + kIndex3] << "].";
}
if (!(bounding_boxes[b * kShapeSize4 + kIndex0] < bounding_boxes[b * kShapeSize4 + kIndex2])) {
MS_LOG(EXCEPTION) << "For '" << kernel_name_ << "', y_min of bounding box must be less than y_max, got: ["
<< bounding_boxes[b * kShapeSize4 + kIndex0] << "] and ["
<< bounding_boxes[b * kShapeSize4 + kIndex2] << "].";
}
const int32_t x_min = static_cast<int32_t>(bounding_boxes[b * kShapeSize4 + 1] * width );
const int32_t y_min = static_cast<int32_t>(bounding_boxes[b * kShapeSize4 + 0] * height );
const int32_t x_max = static_cast<int32_t>(bounding_boxes[b * kShapeSize4 + 3] * width );
const int32_t y_max = static_cast<int32_t>(bounding_boxes[b * kShapeSize4 + 2] * height);
boxes.push_back(Region(x_min, y_min, x_max, y_max));
}

const Region ms_image_rect(0, 0, width, height);
if (boxes.empty()) {
if (!use_image_if_no_bounding_boxes_) {
MS_LOG(EXCEPTION) << "For '" << kernel_name_
<< "', no bounding boxes provided as input. One must enable use_image_if_no_bounding_boxes "
"if you wish to not provide any bounding boxes.";
}

boxes.push_back(ms_image_rect);
}

Region ms_crop_rect;
bool ms_sample_generated = false;
for (size_t i = 0; i < LongToSize(max_attempts_); ++i) {
const int32_t x_min = generated_boxes[i * kShapeSize4 + 0] ;
const int32_t y_min = generated_boxes[i * kShapeSize4 + 1];
const int32_t x_max = generated_boxes[i * kShapeSize4 + 2];
const int32_t y_max = generated_boxes[i * kShapeSize4 + 3];
ms_crop_rect.SetPoint(x_min, y_min, x_max, y_max);
if (SatisfiesOverlapConstraints(ms_crop_rect, min_object_covered_val, boxes)) {
ms_sample_generated = true;
break;
}
}

if (!ms_sample_generated) {
ms_crop_rect = ms_image_rect;
}


const int target_width = ms_crop_rect.max_x_ - ms_crop_rect.min_x_;
const int target_height = ms_crop_rect.max_y_ - ms_crop_rect.min_y_;
const int offset_width = ms_crop_rect.min_x_;
const int offset_height = ms_crop_rect.min_y_;

if (width < target_width + offset_width) {
MS_LOG(EXCEPTION) << "For '" << kernel_name_ << "', width must be >= target_width + offset_width: [" << width
<< "] vs [" << target_width << "] + [" << offset_width << "]";
}

if (height < target_height + offset_height) {
MS_LOG(EXCEPTION) << "For '" << kernel_name_ << "', height must be >= target_height + offset_height: [" << height
<< "] vs [" << target_height << "] + [" << offset_height << "]";
}

begin[kIndex0] = static_cast<T>(offset_height);
size[kIndex0] = static_cast<T>(target_height);
begin[kIndex1] = static_cast<T>(offset_width);
size[kIndex1] = static_cast<T>(target_width);

bboxes[kIndex0] = static_cast<float>(ms_crop_rect.min_y_) / static_cast<float>(height);
bboxes[kIndex1] = static_cast<float>(ms_crop_rect.min_x_) / static_cast<float>(width);
bboxes[kIndex2] = static_cast<float>(ms_crop_rect.max_y_) / static_cast<float>(height);
bboxes[kIndex3] = static_cast<float>(ms_crop_rect.max_x_) / static_cast<float>(width);


begin[kIndex2] = static_cast<T>(0);
size[kIndex2] = static_cast<T>(-1);


CHECK_CUDA_RET_WITH_EXCEPT_NOTRACE(
cudaMemcpyAsync(begin_addr, &begin, kShapeSize3 * unit_dtype_size_, cudaMemcpyHostToDevice, stream),
"cudaMemcpy failed.");
CHECK_CUDA_RET_WITH_EXCEPT_NOTRACE(
cudaMemcpyAsync(size_addr, &size, kShapeSize3 * unit_dtype_size_, cudaMemcpyHostToDevice, stream),
"cudaMemcpy failed.");
CHECK_CUDA_RET_WITH_EXCEPT_NOTRACE(
cudaMemcpyAsync(bboxes_addr, &bboxes, bboxes_elements_count * sizeof(float), cudaMemcpyHostToDevice, stream),
"cudaMemcpy failed.");
return;
}

template <typename T>
bool SampleDistortedBoundingBoxV2GpuKernelMod::LaunchKernel(const std::vector<AddressPtr> &inputs,
const std::vector<AddressPtr> &workspace,
const std::vector<AddressPtr> &outputs) {
if (is_null_input_) {
return true;
}
cudaStream_t stream = reinterpret_cast<cudaStream_t>(cuda_stream_);
T *image_size_addr = GetDeviceAddress<T>(inputs, kIndex0);
T image_size[kShapeSize3];
CHECK_CUDA_RET_WITH_EXCEPT_NOTRACE(
cudaMemcpyAsync(&image_size, image_size_addr, kShapeSize3 * unit_dtype_size_, cudaMemcpyDeviceToHost, stream),
"cudaMemcpy failed.");
int32_t* workspace_addr = GetDeviceAddress<int32_t>(workspace, kIndex0);

void *curandState_addr = GetDeviceAddress<void *>(workspace, kIndex1);
curandState *devStates = reinterpret_cast<curandState *>(curandState_addr);
SampleDistortedBoundingBoxV2(image_size, seed_, seed2_, aspect_ratio_range_[kIndex0],
aspect_ratio_range_[kIndex1],
area_range_[kIndex0],area_range_[kIndex1] ,max_attempts_, use_image_if_no_bounding_boxes_,workspace_addr,
devStates, reinterpret_cast<cudaStream_t>(cuda_stream_));
SelectBox<T>(inputs, workspace, outputs);
return true;
}

const std::vector<std::pair<KernelAttr, KernelRunFunc>> &SampleDistortedBoundingBoxV2GpuKernelMod::GetFuncList() const {
static const std::vector<std::pair<KernelAttr, KernelRunFunc>> func_list = {
ADD_KERNEL(kNumberTypeUInt8, uint8_t),
ADD_KERNEL(kNumberTypeInt8, int8_t),
ADD_KERNEL(kNumberTypeInt16, int16_t),
ADD_KERNEL(kNumberTypeInt32, int32_t),
ADD_KERNEL(kNumberTypeInt64, int64_t),
};
return func_list;
}
MS_KERNEL_FACTORY_REG(NativeGpuKernelMod, SampleDistortedBoundingBoxV2, SampleDistortedBoundingBoxV2GpuKernelMod);
}
}
  3. sample_distorted_bounding_box_v2_impl.cuh
/**
* Copyright 2020 Huawei Technologies Co., Ltd
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#ifndef MINDSPORE_CCSRC_PLUGIN_DEVICE_GPU_KERNEL_CUDA_IMPL_CUDA_OPS_SAMPLE_DISTORTED_BOUNDING_BOX_V2_IMPL_CUH_
#define MINDSPORE_CCSRC_PLUGIN_DEVICE_GPU_KERNEL_CUDA_IMPL_CUDA_OPS_SAMPLE_DISTORTED_BOUNDING_BOX_V2_IMPL_CUH_

#include <curand_kernel.h>
#include <random>
#include <cuda_runtime.h>
#include "plugin/device/gpu/kernel/cuda_impl/cuda_ops/cuda_device_info.h"
#include "plugin/device/gpu/kernel/cuda_impl/cuda_ops/cuda_common.h"


template <typename T>
void SampleDistortedBoundingBoxV2(T* image_size , int seed, int seed2, float ms_min_sample_aspect_ratio,
float ms_max_sample_aspect_ratio,float min_area_range,float max_area_range ,int max_attempts,
bool use_image_if_no_bounding_boxes, int *output,curandState *globalState,cudaStream_t cuda_stream);
#endif // MINDSPORE_CCSRC_PLUGIN_DEVICE_GPU_KERNEL_CUDA_IMPL_CUDA_OPS_SAMPLE_DISTORTED_BOUNDING_BOX_V2_IMPL_CUH_
  4. sample_distorted_bounding_box_v2_impl.cu
 /**
* Copyright 2020 Huawei Technologies Co., Ltd
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/


#include "plugin/device/gpu/kernel/cuda_impl/cuda_ops/util.cuh"
#include "plugin/device/gpu/kernel/cuda_impl/cuda_ops/sample_distorted_bounding_box_v2_impl.cuh"

__device__ bool dev_error_res = false;
__global__ void GenerateRandomCropKernel(int seed, curandState *globalState, int ms_original_height,int ms_original_width,
float min_area_range, float max_area_range,float ms_min_sample_aspect_ratio, float ms_max_sample_aspect_ratio, int* output, int max_attempt)
{
const float ms_bias = 0.5;
const float ms_min_area = min_area_range * ms_original_width * ms_original_height;
const float ms_max_area = max_area_range * ms_original_width * ms_original_height;
for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < (max_attempt); i += blockDim.x * gridDim.x) {
curand_init(seed, i, 0, &globalState[i]);
float sample_aspect_ratio = curand_uniform(&globalState[i])* (ms_max_sample_aspect_ratio - ms_min_sample_aspect_ratio) + ms_min_sample_aspect_ratio;

int min_height =static_cast<int>(ms_original_height*min_area_range);
int max_height =static_cast<int>(ms_original_height*max_area_range);

int max_width = static_cast<int>(max_height*sample_aspect_ratio);
if(max_width > ms_original_width){
const float kEps = 0.0000001;
max_height = static_cast<int>((ms_original_width + ms_bias - kEps) / sample_aspect_ratio);
if ((max_height * sample_aspect_ratio) > ms_original_width) {
max_height -= 1;
}
}

max_height = min(max_height, ms_original_height);
min_height = min(min_height, max_height);

if (min_height < max_height) {
min_height += static_cast<int>(curand_uniform(&globalState[i])*(max_height-min_height));
}
int width = static_cast<int>(min_height*sample_aspect_ratio);
float area = static_cast<float>(width * min_height);

if (area < ms_min_area) {
min_height += 1;
width = static_cast<int>(min_height * sample_aspect_ratio);
area = width * min_height;
}

if (area > ms_max_area) {
min_height -= 1;
width = static_cast<int>(min_height * sample_aspect_ratio);
area = width * min_height;
}
if (area < ms_min_area || area > ms_max_area || width > ms_original_width || min_height > ms_original_height ||
width <= 0 || min_height <= 0) {
return;
}


int y = 0;
if (min_height < ms_original_height) {
y = static_cast<int>(curand_uniform(&globalState[i])*(ms_original_height-min_height));
}
int x = 0;
if (width < ms_original_width) {
x = static_cast<int>(curand_uniform(&globalState[i])*(ms_original_width-width));
}

const int left_x = i*4;
const int left_y = i*4+1;
const int right_x = i*4+2;
const int right_y = i*4+3;

output[left_x] = x;
output[left_y] = y;
output[right_x] = x+width;
output[right_y] = y+min_height;
}
dev_error_res = true;
return;
}

template<typename T>
void SampleDistortedBoundingBoxV2(T* image_size, int seed, int seed2, float ms_min_sample_aspect_ratio,
float ms_max_sample_aspect_ratio,float min_area_range,float max_area_range ,int max_attempts,
bool use_image_if_no_bounding_boxes,int* output,curandState *globalState,cudaStream_t cuda_stream)
{
int ms_original_height = static_cast<int>(image_size[0]);
int ms_original_width = static_cast<int>(image_size[1]);
int RNG_seed = 0;
std::random_device rd;
if (seed2 != 0) {
RNG_seed = seed2;
} else if (seed != 0) {
RNG_seed = seed;
} else {
RNG_seed = static_cast<int>(rd());
}
GenerateRandomCropKernel<<<GET_BLOCKS(max_attempts), GET_THREADS, 0, cuda_stream>>>
(RNG_seed, globalState, ms_original_height, ms_original_width,min_area_range, max_area_range,ms_min_sample_aspect_ratio,ms_max_sample_aspect_ratio, output, max_attempts);
return;
}

template CUDA_LIB_EXPORT void SampleDistortedBoundingBoxV2<uint8_t>(uint8_t* image_size, int seed, int seed2, float ms_min_sample_aspect_ratio,
float ms_max_sample_aspect_ratio,float min_area_range,float max_area_range ,int max_attempts,
bool use_image_if_no_bounding_boxes,int* output,curandState *globalState,cudaStream_t cuda_stream);
template CUDA_LIB_EXPORT void SampleDistortedBoundingBoxV2<int8_t>(int8_t* image_size, int seed, int seed2, float ms_min_sample_aspect_ratio,
float ms_max_sample_aspect_ratio,float min_area_range,float max_area_range ,int max_attempts,
bool use_image_if_no_bounding_boxes,int* output,curandState *globalState,cudaStream_t cuda_stream);
template CUDA_LIB_EXPORT void SampleDistortedBoundingBoxV2<int16_t>(int16_t* image_size, int seed, int seed2, float ms_min_sample_aspect_ratio,
float ms_max_sample_aspect_ratio,float min_area_range,float max_area_range ,int max_attempts,
bool use_image_if_no_bounding_boxes,int* output,curandState *globalState,cudaStream_t cuda_stream);
template CUDA_LIB_EXPORT void SampleDistortedBoundingBoxV2<int32_t>(int32_t* image_size, int seed, int seed2, float ms_min_sample_aspect_ratio,
float ms_max_sample_aspect_ratio,float min_area_range,float max_area_range ,int max_attempts,
bool use_image_if_no_bounding_boxes,int* output,curandState *globalState,cudaStream_t cuda_stream);
template CUDA_LIB_EXPORT void SampleDistortedBoundingBoxV2<int64_t>(int64_t* image_size, int seed, int seed2, float ms_min_sample_aspect_ratio,
float ms_max_sample_aspect_ratio,float min_area_range,float max_area_range ,int max_attempts,
bool use_image_if_no_bounding_boxes,int* output,curandState *globalState,cudaStream_t cuda_stream);
  5. test.py
import mindspore as ms
from mindspore import Tensor
import mindspore.ops as ops

ms.set_context(device_target="GPU")

image_size = Tensor([640,480,3], ms.int32)
bounding_boxes = Tensor([[[0.38, 0.17, 0.95, 0.40]]], ms.float32)
min_object_covered = Tensor([0.8], ms.float32)
sample_distorted_bounding_box_v2 = ops.operations.other_ops.SampleDistortedBoundingBoxV2(seed=10, seed2=20,
aspect_ratio_range=(0.9, 1.1), area_range=(0.1,1.0), max_attempts=100,
use_image_if_no_bounding_boxes=False)
output = sample_distorted_bounding_box_v2(image_size, bounding_boxes, min_object_covered)
begin, size, bboxes = output[0], output[1], output[2]
print(begin)
print(size)
print(bboxes)