
Getting Started with GDB Debugging

Preface

While writing a MindSpore operator, my own carelessness, or sometimes a plain misunderstanding, led to errors at compile time and at run time. Simple syntax errors are caught by the syntax checker, but some errors only show up at run time, and those are much harder to track down: after the crash the program only reports a core dump. The printf approach can tell you on which line the core dump happened, but working out what actually caused it still requires a debugger. This is where GDB comes in. I had long known GDB by reputation, but this was the first time I used it on a real problem, so the process deserves to be written down.

About the Operator

Before getting into the debugging itself, a quick introduction to the operator. It is called Sample Distorted Bounding Box V2, and it roughly does this: for an image, it generates a bounding box of random size whose aspect ratio and covered area satisfy the given constraints. The box can then be used to crop the image or randomly distort part of it, so the operator is commonly used for training-data augmentation.

The competitor's product only implements a CPU version of this operator, while MindSpore requires a GPU version. To take advantage of CUDA's parallelism, my implementation idea is as follows: use CUDA to randomly generate N candidate boxes, copy the box data from the device to the host, run the selection logic on the host to pick one box that satisfies the constraints, and finally copy the result back to the device as the operator's output. Shuttling data back and forth between host and device may cost a lot of time, but getting it working comes first; optimization can come later.
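To make that flow concrete, here is a minimal sketch of the idea; GenerateBoxesKernel, IsAcceptable, SampleOneBox and kNumCandidates are illustrative names, not part of the actual MindSpore code:

#include <cuda_runtime.h>
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical kernel: each thread writes one candidate box (4 ints) into `boxes`.
__global__ void GenerateBoxesKernel(int32_t *boxes, int num_candidates) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= num_candidates) return;
  boxes[i * 4 + 0] = i;       // x_min (dummy values, for illustration only)
  boxes[i * 4 + 1] = i;       // y_min
  boxes[i * 4 + 2] = i + 10;  // x_max
  boxes[i * 4 + 3] = i + 10;  // y_max
}

// Hypothetical host-side predicate standing in for the real constraints.
bool IsAcceptable(const int32_t *box) { return box[2] - box[0] >= 5; }

void SampleOneBox(int32_t *dev_result, cudaStream_t stream) {
  constexpr int kNumCandidates = 100;
  std::vector<int32_t> host_boxes(kNumCandidates * 4);
  int32_t *dev_boxes = nullptr;
  cudaMalloc(&dev_boxes, host_boxes.size() * sizeof(int32_t));

  // 1. Device: generate candidate boxes in parallel.
  GenerateBoxesKernel<<<(kNumCandidates + 255) / 256, 256, 0, stream>>>(dev_boxes, kNumCandidates);

  // 2. Device -> Host: copy the candidates out and wait for the copy to finish.
  cudaMemcpyAsync(host_boxes.data(), dev_boxes, host_boxes.size() * sizeof(int32_t),
                  cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);

  // 3. Host: sequentially pick the first acceptable candidate.
  int32_t chosen[4] = {0, 0, 0, 0};
  for (int i = 0; i < kNumCandidates; ++i) {
    if (IsAcceptable(&host_boxes[i * 4])) {
      std::copy_n(&host_boxes[i * 4], 4, chosen);
      break;
    }
  }

  // 4. Host -> Device: copy the chosen box back as the operator's result.
  cudaMemcpyAsync(dev_result, chosen, sizeof(chosen), cudaMemcpyHostToDevice, stream);
  cudaFree(dev_boxes);
}

Only the candidate generation is parallelized; the selection step stays sequential on the host, which is tolerable here because the number of candidates is small.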

Since the operator front end was already in place, only the GPU kernel implementation was needed. It is split across the following four source files; the code is fairly long, so I have put it in the last section of this post:

  • sample_distorted_bounding_box_v2_gpu_kernel.h: header for the GPU operator; defines the concrete operator class by inheriting from the GPU kernel base class.
  • sample_distorted_bounding_box_v2_gpu_kernel.cc: implementation of the GPU operator, typically covering argument checking, memory allocation, operator initialization, and operator registration.
  • sample_distorted_bounding_box_v2_impl.cuh: header for the CUDA implementation; declares the entry point of the CUDA kernel.
  • sample_distorted_bounding_box_v2_impl.cu: the actual CUDA kernel implementation.

After coding it up along these lines, I built the project into a Python package, installed it, and ran a test. The first error was undefined symbol. This usually means the parameter types of a function call and its definition do not line up: c++filt demangles the symbol in the error message into the function being called, and comparing its parameters against the definition in the source shows which types are mismatched; aligning them fixes the error.

Undefined symbol error

c++filt
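As a minimal illustration of how such a mismatch produces an undefined symbol (the function Launch and the file layout here are hypothetical, not the MindSpore code):

// launch.h: declaration the caller compiles against.
void Launch(int64_t max_attempts);

// launch.cc: definition that actually ends up in the shared library.
// The parameter type differs (int instead of int64_t), so the mangled names
// differ and loading the library fails with: undefined symbol: _Z6Launchl
void Launch(int max_attempts) { /* ... */ }

// Demangling the symbol from the error message shows the signature the
// caller expected, which can then be compared against the definition:
//   $ c++filt _Z6Launchl
//   Launch(long)

On a 64-bit Linux build int64_t is long, which is why the expected signature demangles to Launch(long).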

With that error resolved, the next one was a core dump, which is where GDB comes in.

Debugging with GDB

First, switch to the project root directory. GDB is launched relative to the current directory, and starting it from the project root makes it easier to set breakpoints on source files in a moment.

Then start GDB, passing it the command to debug. In my case the program under test is run with Python, so everything after --args is simply the Python command that runs the script.

gdb --args python ~/pyProject/mindspore-dev/SampleDistortedBoundingBox/test_mindspore.py

Start GDB

Once the debugger is up, find the source file and line you want to break on and set the breakpoint. For example, I want to check, inside the operator's LaunchKernel function and right before the CUDA kernel is invoked, whether the arguments passed to the CUDA kernel are valid, so I set a breakpoint just before the call to SampleDistortedBoundingBoxV2.

template <typename T>
bool SampleDistortedBoundingBoxV2GpuKernelMod::LaunchKernel(const std::vector<AddressPtr> &inputs,
                                                            const std::vector<AddressPtr> &workspace,
                                                            const std::vector<AddressPtr> &outputs) {
  if (is_null_input_) {
    return true;
  }
  T *image_size_addr = GetDeviceAddress<T>(inputs, kIndex0);
  int32_t *workspace_addr = GetDeviceAddress<int32_t>(workspace, kIndex0);
  void *curandState_addr = GetDeviceAddress<void *>(workspace, kIndex1);
  curandState *devStates = reinterpret_cast<curandState *>(curandState_addr);
  SampleDistortedBoundingBoxV2(image_size_addr, seed_, seed2_,
                               aspect_ratio_range_[kIndex0], aspect_ratio_range_[kIndex1],
                               area_range_[kIndex0], area_range_[kIndex1],
                               max_attempts_, use_image_if_no_bounding_boxes_,
                               workspace_addr, devStates,
                               reinterpret_cast<cudaStream_t>(cuda_stream_));
  SelectBox<T>(inputs, workspace, outputs);
  return true;
}

The call to SampleDistortedBoundingBoxV2 sits at line 365 of the source file, so set a breakpoint at line 365.

b mindspore/ccsrc/plugin/device/gpu/kernel/other/sample_distorted_bounding_box_v2_gpu_kernel.cc:365

GDB will complain that this file is not part of the program's source and ask whether to make the breakpoint pending on a future shared library load; answer y. The program has not started running yet: the operator we wrote is compiled into a shared library that has not been loaded, so GDB cannot tell whether the breakpoint will ever be hit. If a shared library containing that source file is loaded later, GDB installs the breakpoint and stops when execution reaches it.

After setting the breakpoint, info br shows its details; because the shared library has not been loaded, the Address column is still <PENDING>.

Viewing the breakpoint

Next, type r to run the program. Execution stops when it reaches the breakpoint, and you can inspect the state there: info args lists the arguments of the current stack frame, print xxx prints the value of a variable, and backtrace shows the call stack.

Viewing variable values

While inspecting image_size_addr, I found that the value at this address was always 0, so the input image size could never be read correctly. The reason is that image_size_addr is a device-side address: the data actually lives in GPU memory, and dereferencing it directly on the host is an error.

But that raises the next question: isn't SampleDistortedBoundingBoxV2 written in a .cu file? Isn't reading data through a device-side address exactly what CUDA code is supposed to do? Time to look at the function's source.

template <typename T>
void SampleDistortedBoundingBoxV2(T *image_size, int seed, int seed2, float ms_min_sample_aspect_ratio,
                                  float ms_max_sample_aspect_ratio, float min_area_range, float max_area_range,
                                  int max_attempts, bool use_image_if_no_bounding_boxes, int *output,
                                  curandState *globalState, cudaStream_t cuda_stream) {
  int ms_original_height = static_cast<int>(image_size[0]);
  int ms_original_width = static_cast<int>(image_size[1]);

  int RNG_seed = 0;
  std::random_device rd;
  if (seed2 != 0) {
    RNG_seed = seed2;
  } else if (seed != 0) {
    RNG_seed = seed;
  } else {
    RNG_seed = static_cast<int>(rd());
  }
  GenerateRandomCropKernel<<<GET_BLOCKS(max_attempts), GET_THREADS, 0, cuda_stream>>>(
    RNG_seed, globalState, ms_original_height, ms_original_width, min_area_range, max_area_range,
    ms_min_sample_aspect_ratio, ms_max_sample_aspect_ratio, output, max_attempts);
  return;
}

At first glance nothing looks wrong. In fact, though, a function or variable only runs on (or lives in) the device when its definition carries the __device__ or __global__ qualifier. So although SampleDistortedBoundingBoxV2 is written in a .cu file, it still executes on the host, and reading image_size through a device pointer there is an illegal memory access, which is exactly what caused the core dump.
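A minimal sketch of that distinction, assuming a single int sitting in device memory; ReadOnDevice, ReadOnHostWrong and ReadOnHostRight are illustrative names, not part of the operator's source:

#include <cuda_runtime.h>
#include <cstdio>

// Marked __global__: this runs on the device, so dereferencing a device pointer is fine.
__global__ void ReadOnDevice(const int *dev_val) {
  printf("device sees %d\n", *dev_val);
}

// No __device__/__global__ qualifier: even though it may live in a .cu file,
// this is ordinary host code, and *dev_val is an illegal host-side dereference.
void ReadOnHostWrong(const int *dev_val) {
  printf("host sees %d\n", *dev_val);  // undefined behaviour / crash
}

// The host must copy the value out of device memory before touching it.
void ReadOnHostRight(const int *dev_val) {
  int host_val = 0;
  cudaMemcpy(&host_val, dev_val, sizeof(int), cudaMemcpyDeviceToHost);
  printf("host sees %d\n", host_val);
}

int main() {
  int h = 42;
  int *dev_val = nullptr;
  cudaMalloc(&dev_val, sizeof(int));
  cudaMemcpy(dev_val, &h, sizeof(int), cudaMemcpyHostToDevice);
  ReadOnDevice<<<1, 1>>>(dev_val);
  cudaDeviceSynchronize();
  ReadOnHostRight(dev_val);
  // ReadOnHostWrong(dev_val);  // would fail, just like the original LaunchKernel
  cudaFree(dev_val);
  return 0;
}

The fix below follows the ReadOnHostRight pattern: copy the data to the host first, then use it.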

Once the cause of the core dump is clear, fixing it is easy: copy image_size from the device to the host before using it. The modified LaunchKernel looks like this:

template <typename T>
bool SampleDistortedBoundingBoxV2GpuKernelMod::LaunchKernel(const std::vector<AddressPtr> &inputs,
                                                            const std::vector<AddressPtr> &workspace,
                                                            const std::vector<AddressPtr> &outputs) {
  if (is_null_input_) {
    return true;
  }
  cudaStream_t stream = reinterpret_cast<cudaStream_t>(cuda_stream_);
  T *image_size_addr = GetDeviceAddress<T>(inputs, kIndex0);
  T image_size[kShapeSize3];
  CHECK_CUDA_RET_WITH_EXCEPT_NOTRACE(
    cudaMemcpyAsync(&image_size, image_size_addr, kShapeSize3 * unit_dtype_size_, cudaMemcpyDeviceToHost, stream),
    "cudaMemcpy failed.");

  int32_t *workspace_addr = GetDeviceAddress<int32_t>(workspace, kIndex0);
  void *curandState_addr = GetDeviceAddress<void *>(workspace, kIndex1);
  curandState *devStates = reinterpret_cast<curandState *>(curandState_addr);
  SampleDistortedBoundingBoxV2(image_size, seed_, seed2_,
                               aspect_ratio_range_[kIndex0], aspect_ratio_range_[kIndex1],
                               area_range_[kIndex0], area_range_[kIndex1],
                               max_attempts_, use_image_if_no_bounding_boxes_,
                               workspace_addr, devStates,
                               reinterpret_cast<cudaStream_t>(cuda_stream_));
  SelectBox<T>(inputs, workspace, outputs);
  return true;
}
Rebuild and reinstall the mindspore-gpu package, start GDB again, and set the breakpoint in the same place. When the program stops there, inspecting image_size shows that GDB now prints the correct values, and the variable's address is a proper host-side address.

After the fix

Summary

Before this exercise I naively thought GDB could only debug C/C++ programs, and only standalone executables at that. This time I learned that GDB is actually a general-purpose debugger and extremely powerful. I had used GDB before, but only on toy C/C++ programs like Hello World, so when a large project needed debugging I had no idea where to start. If the printf approach hadn't completely failed me this time, I would still be dragging my feet about learning GDB and putting it to real use. Having actually used it to solve a problem, I now think GDB is fantastic.

As for how GDB works internally, I still have no idea; next I need to spend some time learning how GDB actually does its job, to deepen my understanding of it.

Finally, a GDB command cheat sheet, linked here.

Source Code

  1. sample_distorted_bounding_box_v2_gpu_kernel.h
/**
* Copyright 2022 Huawei Technologies Co., Ltd
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#ifndef MINDSPORE_CCSRC_BACKEND_KERNEL_COMPILER_GPU_SAMPLE_DISTORTED_BOUNDING_BOX_V2_H_
#define MINDSPORE_CCSRC_BACKEND_KERNEL_COMPILER_GPU_SAMPLE_DISTORTED_BOUNDING_BOX_V2_H_

#include <curand_kernel.h>
#include <cuda_runtime_api.h>
#include <cstdint>
#include <vector>
#include "kernel/common_utils.h"
#include "mindapi/base/type_id.h"
#include "plugin/device/gpu/kernel/gpu_kernel.h"
#include "plugin/device/gpu/kernel/gpu_kernel_factory.h"

namespace mindspore {
namespace kernel {

class Region {
public:
Region() { SetPoint(0, 0, 0, 0); }
Region(int xmin, int ymin, int xmax, int ymax) { SetPoint(xmin, ymin, xmax, ymax); }

void SetPoint(int xmin, int ymin, int xmax, int ymax) {
min_x_ = xmin;
min_y_ = ymin;
max_x_ = xmax;
max_y_ = ymax;
}

float Area() const { return static_cast<float>((max_x_ - min_x_) * (max_y_ - min_y_)); }

Region Intersect(const Region &r) const {
const int pmin_x = std::max(min_x_, r.min_x_);
const int pmin_y = std::max(min_y_, r.min_y_);
const int pmax_x = std::min(max_x_, r.max_x_);
const int pmax_y = std::min(max_y_, r.max_y_);
if (pmin_x > pmax_x || pmin_y > pmax_y) {
return Region();
} else {
return Region(pmin_x, pmin_y, pmax_x, pmax_y);
}
}
int min_x_;
int min_y_;
int max_x_;
int max_y_;
};

class SampleDistortedBoundingBoxV2GpuKernelMod : public NativeGpuKernelMod, public MatchKernelHelper<SampleDistortedBoundingBoxV2GpuKernelMod> {
public:
SampleDistortedBoundingBoxV2GpuKernelMod() = default;
~SampleDistortedBoundingBoxV2GpuKernelMod() override = default;

bool Launch(const std::vector<AddressPtr> &inputs, const std::vector<AddressPtr> &workspace,
const std::vector<AddressPtr> &outputs, void *cuda_stream) override {
if (is_null_input_) {
return true;
}
cuda_stream_ = cuda_stream;
return kernel_func_(this, inputs, workspace, outputs);
}

bool Init(const BaseOperatorPtr &base_operator, const std::vector<KernelTensorPtr> &inputs,
const std::vector<KernelTensorPtr> &outputs) override;

int Resize(const BaseOperatorPtr &base_operator, const std::vector<KernelTensorPtr> &inputs,
const std::vector<KernelTensorPtr> &outputs, const std::map<uint32_t, tensor::TensorPtr> &) override;

const std::vector<std::pair<KernelAttr, KernelRunFunc>> &GetFuncList() const override;

protected:
void ResetResource() noexcept {
bounding_boxes_elements_ = 0;
is_null_input_ = false;
input_size_list_.clear();
output_size_list_.clear();
workspace_size_list_.clear();
}

std::vector<KernelAttr> GetOpSupport() override { return OpSupport(); }

private:
bool SatisfiesOverlapConstraints(const Region &crop, float minimum_object_covered,
const std::vector<Region> &bounding_boxes);

template <typename T>
void SelectBox(const std::vector<kernel::AddressPtr> &inputs, const std::vector<AddressPtr> &workspace,
const std::vector<kernel::AddressPtr> &outputs);

template <typename T>
bool LaunchKernel(const std::vector<kernel::AddressPtr> &inputs, const std::vector<AddressPtr> &workspace,
const std::vector<kernel::AddressPtr> &outputs);

TypeId dtype_{kTypeUnknown};
int64_t unit_dtype_size_{0};
int64_t bounding_boxes_elements_{0};
bool is_null_input_{false};
void *cuda_stream_{nullptr};

int64_t seed_{0};
int64_t seed2_{0};
std::vector<float> aspect_ratio_range_;
std::vector<float> area_range_;
int64_t max_attempts_{100};
bool use_image_if_no_bounding_boxes_{false};
};

} // namespace kernel
} // namespace mindspore
#endif // MINDSPORE_CCSRC_BACKEND_KERNEL_COMPILER_GPU_SAMPLE_DISTORTED_BOUNDING_BOX_V2_H_
  2. sample_distorted_bounding_box_v2_gpu_kernel.cc
/**
* Copyright 2022 Huawei Technologies Co., Ltd
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include "kernel/kernel.h"
#include "ops/sample_distorted_bounding_box_v2.h"
#include "plugin/device/gpu/hal/device/gpu_common.h"
#include "plugin/device/gpu/kernel/other/sample_distorted_bounding_box_v2_gpu_kernel.h"
#include "targets/x86_64-linux/include/cuda_runtime_api.h"
#include "targets/x86_64-linux/include/curand_kernel.h"
#include "targets/x86_64-linux/include/driver_types.h"
#include "plugin/device/gpu/kernel/cuda_impl/cuda_ops/sample_distorted_bounding_box_v2_impl.cuh"

namespace mindspore {
namespace kernel {
namespace {
using KernelRunFunc = SampleDistortedBoundingBoxV2GpuKernelMod::KernelRunFunc;
#define ADD_KERNEL(image_size_dtype, kernel_type) \
{ \
KernelAttr() \
.AddInputAttr(image_size_dtype) \
.AddInputAttr(kNumberTypeFloat32) \
.AddInputAttr(kNumberTypeFloat32) \
.AddOutputAttr(image_size_dtype) \
.AddOutputAttr(image_size_dtype) \
.AddOutputAttr(kNumberTypeFloat32), \
&SampleDistortedBoundingBoxV2GpuKernelMod::LaunchKernel<kernel_type> \
}

constexpr size_t kOutputSize = 3;
constexpr size_t kInputSize = 3;
constexpr size_t kIndex0 = 0;
constexpr size_t kIndex1 = 1;
constexpr size_t kIndex2 = 2;
constexpr size_t kIndex3 = 3;
constexpr size_t kBBoxesDimension = 3;
constexpr size_t kShapeSize1 = 1;
constexpr size_t kShapeSize2 = 2;
constexpr size_t kShapeSize3 = 3;
constexpr size_t kShapeSize4 = 4;
constexpr size_t kNumber0 = 0;
constexpr float kFloatNum0 = 0.0;
constexpr float kFloatNum1 = 1.0;
}

bool SampleDistortedBoundingBoxV2GpuKernelMod::Init(const BaseOperatorPtr &base_operator,
const std::vector<KernelTensorPtr> &inputs,
const std::vector<KernelTensorPtr> &outputs) {
MS_EXCEPTION_IF_NULL(base_operator);
kernel_name_ = base_operator->name();

if (!MatchKernelFunc(base_operator, inputs, outputs)) {
return false;
}

auto kernel_attr = GetKernelAttrFromTensors(inputs, outputs);
dtype_ = kernel_attr.GetInputAttr(kIndex0).first;
unit_dtype_size_ = abstract::TypeIdSize(dtype_);

auto kernel_ptr = std::make_shared<ops::SampleDistortedBoundingBoxV2>(base_operator->GetPrim());
seed_ = static_cast<int64_t>(GetValue<int64_t>(kernel_ptr->GetAttr("seed")));
seed2_ = static_cast<int64_t>(GetValue<int64_t>(kernel_ptr->GetAttr("seed2")));
aspect_ratio_range_ = GetValue<std::vector<float>>(kernel_ptr->GetAttr("aspect_ratio_range"));
area_range_ = GetValue<std::vector<float>>(kernel_ptr->GetAttr("area_range"));
max_attempts_ = static_cast<int64_t>(GetValue<int64_t>(kernel_ptr->GetAttr("max_attempts")));
use_image_if_no_bounding_boxes_ = static_cast<bool>(GetValue<bool>(kernel_ptr->GetAttr("use_image_if_no_bounding_boxes")));

std::vector<int64_t> shape_image_size = std::vector<int64_t>(inputs.at(kIndex0)->GetDeviceShapeAdaptively().begin(),
inputs.at(kIndex0)->GetDeviceShapeAdaptively().end());
std::vector<int64_t> shape_bounding_boxes = std::vector<int64_t>(inputs.at(kIndex1)->GetDeviceShapeAdaptively().begin(),
inputs.at(kIndex1)->GetDeviceShapeAdaptively().end());
size_t shape_dim_image_size = shape_image_size.size();
size_t shape_dim_bounding_boxes = shape_bounding_boxes.size();
if (shape_dim_image_size != kShapeSize1) {
MS_LOG(EXCEPTION) << "For '" << kernel_name_ << "', image_size must be 1-dimensional, got: ["
<< shape_dim_image_size << "].";
}
if (shape_image_size[kIndex0] != kShapeSize3) {
MS_LOG(EXCEPTION) << "For '" << kernel_name_ << "', image_size must contain 3 elements, got: ["
<< shape_image_size[kIndex0] << "].";
}
if (shape_dim_bounding_boxes != kBBoxesDimension) {
MS_LOG(EXCEPTION) << "For '" << kernel_name_ << "', bounding_boxes must be 3-dimensional"
<< " [batch, num_boxes, coords], got: [" << shape_dim_bounding_boxes << "].";
}
if (shape_bounding_boxes[shape_dim_bounding_boxes - 1] != kShapeSize4) {
MS_LOG(EXCEPTION) << "For '" << kernel_name_ << "', bounding_boxes must have shape [4], got: ["
<< shape_bounding_boxes[shape_dim_bounding_boxes - 1] << "].";
}

if (max_attempts_ <= SizeToLong(kNumber0)) {
MS_LOG(EXCEPTION) << "For '" << kernel_name_ << "', max_attempts must be positive: [" << max_attempts_ << "].";
}
if (aspect_ratio_range_[kIndex1] <= kFloatNum0 || aspect_ratio_range_[kIndex0] <= kFloatNum0) {
MS_LOG(EXCEPTION) << "For '" << kernel_name_ << "', aspect_ratio_range must be positive: ["
<< aspect_ratio_range_[kIndex0] << "], [" << aspect_ratio_range_[kIndex1] << "].";
}
if (area_range_[kIndex1] <= kFloatNum0 || area_range_[kIndex0] <= kFloatNum0) {
MS_LOG(EXCEPTION) << "For '" << kernel_name_ << "', area_range must be positive: [" << area_range_[kIndex0] << "], ["
<< area_range_[kIndex1] << "].";
}
if (area_range_[kIndex1] > kFloatNum1 || area_range_[kIndex0] > kFloatNum1) {
MS_LOG(EXCEPTION) << "For '" << kernel_name_ << "', area_range must be less then or equal to 1.0: ["
<< area_range_[kIndex0] << "], [" << area_range_[kIndex1] << "].";
}
if (aspect_ratio_range_.size() != kShapeSize2) {
MS_LOG(EXCEPTION) << "For '" << kernel_name_ << "', aspect_ratio_range field must specify 2 dimensions.";
}
if (area_range_.size() != kShapeSize2) {
MS_LOG(EXCEPTION) << "For '" << kernel_name_ << "', area_range field must specify 2 dimensions.";
}
return true;
}

int SampleDistortedBoundingBoxV2GpuKernelMod::Resize(const BaseOperatorPtr &base_operator,
const std::vector<KernelTensorPtr> &inputs,
const std::vector<KernelTensorPtr> &outputs,
const std::map<uint32_t, tensor::TensorPtr> &) {
for (auto input : inputs) {
auto input_shape = input->GetShapeVector();
if (!IsValidShape(input_shape)) {
return KRET_UNKNOWN_SHAPE;
}
}
ResetResource();
std::vector<int64_t> shape_bounding_boxes = std::vector<int64_t>(inputs.at(kIndex1)->GetDeviceShapeAdaptively().begin(),
inputs.at(kIndex1)->GetDeviceShapeAdaptively().end());
bounding_boxes_elements_ = std::accumulate(shape_bounding_boxes.begin(), shape_bounding_boxes.end(), 1, std::multiplies<int64_t>());
input_size_list_.emplace_back(kShapeSize3 * unit_dtype_size_); // input image size
input_size_list_.emplace_back(bounding_boxes_elements_ * abstract::TypeIdSize(kNumberTypeFloat32)); // input bboxes
input_size_list_.emplace_back(kShapeSize1 * abstract::TypeIdSize(kNumberTypeFloat32)); // input min object covered
output_size_list_.emplace_back(kShapeSize3 * unit_dtype_size_); // output begin
output_size_list_.emplace_back(kShapeSize3 * unit_dtype_size_); // output size
output_size_list_.emplace_back(kShapeSize1 * kShapeSize1 * kShapeSize4 * abstract::TypeIdSize(kNumberTypeFloat32)); // output bboxes
workspace_size_list_.push_back(kShapeSize1 * kShapeSize1 * kShapeSize4 * max_attempts_ * sizeof(int32_t)); // number of boxes
workspace_size_list_.push_back(kShapeSize1 * kShapeSize1 * kShapeSize4 * max_attempts_ * sizeof(curandState)); // number of boxes
return KRET_OK;
}

bool SampleDistortedBoundingBoxV2GpuKernelMod::SatisfiesOverlapConstraints(const Region &crop,
float minimum_object_covered,
const std::vector<Region> &bounding_boxes) {
const float kMinArea = 1.0;
if (crop.Area() < kMinArea) {
return false;
}

bool is_object_covered = false;
for (const auto &bbox : bounding_boxes) {
const float object_area = bbox.Area();
if (object_area < kMinArea) {
continue;
}

const float object_covered = object_area != 0 ? crop.Intersect(bbox).Area() / object_area : 0;
if (object_covered >= minimum_object_covered) {
is_object_covered = true;
break;
}
}
return is_object_covered;
}

template <typename T>
void SampleDistortedBoundingBoxV2GpuKernelMod::SelectBox(const std::vector<AddressPtr> &inputs,
const std::vector<AddressPtr> &workspace,
const std::vector<AddressPtr> &outputs) {
cudaStream_t stream = reinterpret_cast<cudaStream_t>(cuda_stream_);

T *image_size_addr = GetDeviceAddress<T>(inputs, kIndex0);
T image_size[kShapeSize3];
CHECK_CUDA_RET_WITH_EXCEPT_NOTRACE(
cudaMemcpyAsync(&image_size, image_size_addr, kShapeSize3 * unit_dtype_size_, cudaMemcpyDeviceToHost, stream),
"cudaMemcpy failed.");
float *bounding_boxes_addr = GetDeviceAddress<float>(inputs, kIndex1);
float bounding_boxes[bounding_boxes_elements_];
CHECK_CUDA_RET_WITH_EXCEPT_NOTRACE(
cudaMemcpyAsync(&bounding_boxes, bounding_boxes_addr, bounding_boxes_elements_ * abstract::TypeIdSize(kNumberTypeFloat32), cudaMemcpyDeviceToHost, stream),
"cudaMemcpy failed.");
float *min_object_covered_addr = GetDeviceAddress<float>(inputs, kIndex2);
float min_object_covered[kShapeSize1];
CHECK_CUDA_RET_WITH_EXCEPT_NOTRACE(
cudaMemcpyAsync(&min_object_covered, min_object_covered_addr, kShapeSize1 * abstract::TypeIdSize(kNumberTypeFloat32), cudaMemcpyDeviceToHost, stream),
"cudaMemcpy failed.");
//size_t boxes_elements_count = kShapeSize1 * kShapeSize1 * kShapeSize4 * max_attempts_;

int32_t *workspace_ptr = GetDeviceAddress<int32_t>(workspace, kIndex0);
size_t boxes_elements_count = kShapeSize1 * kShapeSize1 * kShapeSize4 * max_attempts_;
int32_t generated_boxes[boxes_elements_count];
CHECK_CUDA_RET_WITH_EXCEPT_NOTRACE(
cudaMemcpyAsync(&generated_boxes, workspace_ptr, boxes_elements_count * sizeof(int32_t), cudaMemcpyDeviceToHost, stream),
"cudaMemcpy failed.");

T *begin_addr = GetDeviceAddress<T>(outputs, kIndex0);
T begin[kShapeSize3];
CHECK_CUDA_RET_WITH_EXCEPT_NOTRACE(
cudaMemcpyAsync(&begin, begin_addr, kShapeSize3 * unit_dtype_size_, cudaMemcpyDeviceToHost, stream),
"cudaMemcpy failed.");
T *size_addr = GetDeviceAddress<T>(outputs, kIndex1);
T size[kShapeSize3];
CHECK_CUDA_RET_WITH_EXCEPT_NOTRACE(
cudaMemcpyAsync(&size, size_addr, kShapeSize3 * unit_dtype_size_, cudaMemcpyDeviceToHost, stream),
"cudaMemcpy failed.");
float *bboxes_addr = GetDeviceAddress<float>(outputs, kIndex2);
size_t bboxes_elements_count = kShapeSize1 * kShapeSize1 * kShapeSize4;
float bboxes[bboxes_elements_count];
CHECK_CUDA_RET_WITH_EXCEPT_NOTRACE(
cudaMemcpyAsync(&bboxes, bboxes_addr, bboxes_elements_count * sizeof(float), cudaMemcpyDeviceToHost, stream),
"cudaMemcpy failed.");

const int32_t height = static_cast<int32_t>(image_size[kIndex0]);
const int32_t width = static_cast<int32_t>(image_size[kIndex1]);
if (!(height > 0 && width > 0)) {
MS_LOG(EXCEPTION) << "For '" << kernel_name_ << "', image height and width must be positive, got: [" << height
<< "] and [" << width << "].";
}

float min_object_covered_val = 0.0;
min_object_covered_val = *min_object_covered;
if (min_object_covered_val < 0.0 || min_object_covered_val > 1.0) {
MS_LOG(EXCEPTION) << "For '" << kernel_name_ << "', min_object_covered must be in [0.0, 1.0], got: ["
<< min_object_covered_val << "].";
}

std::vector<Region> boxes;
size_t size_bounding_boxes = inputs[kIndex1]->size / sizeof(float);
for (size_t b = 0; b < size_bounding_boxes / kShapeSize4; ++b) {
for (size_t i = 0; i < kShapeSize4; ++i) {
if (bounding_boxes[b * kShapeSize4 + i] < 0.0 || bounding_boxes[b * kShapeSize4 + i] > 1.0) {
MS_LOG(EXCEPTION) << "For '" << kernel_name_ << "', all bounding box coordinates must in [0.0, 1.0], got: ["
<< bounding_boxes[b * kShapeSize4 + i] << "].";
}
}
if (!(bounding_boxes[b * kShapeSize4 + kIndex1] < bounding_boxes[b * kShapeSize4 + kIndex3])) {
MS_LOG(EXCEPTION) << "For '" << kernel_name_ << "', x_min of bounding box must be less than x_max, got: ["
<< bounding_boxes[b * kShapeSize4 + kIndex1] << "] and ["
<< bounding_boxes[b * kShapeSize4 + kIndex3] << "].";
}
if (!(bounding_boxes[b * kShapeSize4 + kIndex0] < bounding_boxes[b * kShapeSize4 + kIndex2])) {
MS_LOG(EXCEPTION) << "For '" << kernel_name_ << "', y_min of bounding box must be less than y_max, got: ["
<< bounding_boxes[b * kShapeSize4 + kIndex0] << "] and ["
<< bounding_boxes[b * kShapeSize4 + kIndex2] << "].";
}
const int32_t x_min = static_cast<int32_t>(bounding_boxes[b * kShapeSize4 + 1] * width );
const int32_t y_min = static_cast<int32_t>(bounding_boxes[b * kShapeSize4 + 0] * height );
const int32_t x_max = static_cast<int32_t>(bounding_boxes[b * kShapeSize4 + 3] * width );
const int32_t y_max = static_cast<int32_t>(bounding_boxes[b * kShapeSize4 + 2] * height);
boxes.push_back(Region(x_min, y_min, x_max, y_max));
}

const Region ms_image_rect(0, 0, width, height);
if (boxes.empty()) {
if (!use_image_if_no_bounding_boxes_) {
MS_LOG(EXCEPTION) << "For '" << kernel_name_
<< "', no bounding boxes provided as input. One must enable use_image_if_no_bounding_boxes "
"if you wish to not provide any bounding boxes.";
}

boxes.push_back(ms_image_rect);
}

Region ms_crop_rect;
bool ms_sample_generated = false;
for (size_t i = 0; i < LongToSize(max_attempts_); ++i) {
const int32_t x_min = generated_boxes[i * kShapeSize4 + 0] ;
const int32_t y_min = generated_boxes[i * kShapeSize4 + 1];
const int32_t x_max = generated_boxes[i * kShapeSize4 + 2];
const int32_t y_max = generated_boxes[i * kShapeSize4 + 3];
ms_crop_rect.SetPoint(x_min, y_min, x_max, y_max);
if (SatisfiesOverlapConstraints(ms_crop_rect, min_object_covered_val, boxes)) {
ms_sample_generated = true;
break;
}
}

if (!ms_sample_generated) {
ms_crop_rect = ms_image_rect;
}


const int target_width = ms_crop_rect.max_x_ - ms_crop_rect.min_x_;
const int target_height = ms_crop_rect.max_y_ - ms_crop_rect.min_y_;
const int offset_width = ms_crop_rect.min_x_;
const int offset_height = ms_crop_rect.min_y_;

if (width < target_width + offset_width) {
MS_LOG(EXCEPTION) << "For '" << kernel_name_ << "', width must be >= target_width + offset_width: [" << width
<< "] vs [" << target_width << "] + [" << offset_width << "]";
}

if (height < target_height + offset_height) {
MS_LOG(EXCEPTION) << "For '" << kernel_name_ << "', height must be >= target_height + offset_height: [" << height
<< "] vs [" << target_height << "] + [" << offset_height << "]";
}

begin[kIndex0] = static_cast<T>(offset_height);
size[kIndex0] = static_cast<T>(target_height);
begin[kIndex1] = static_cast<T>(offset_width);
size[kIndex1] = static_cast<T>(target_width);

bboxes[kIndex0] = static_cast<float>(ms_crop_rect.min_y_) / static_cast<float>(height);
bboxes[kIndex1] = static_cast<float>(ms_crop_rect.min_x_) / static_cast<float>(width);
bboxes[kIndex2] = static_cast<float>(ms_crop_rect.max_y_) / static_cast<float>(height);
bboxes[kIndex3] = static_cast<float>(ms_crop_rect.max_x_) / static_cast<float>(width);


begin[kIndex2] = static_cast<T>(0);
size[kIndex2] = static_cast<T>(-1);


CHECK_CUDA_RET_WITH_EXCEPT_NOTRACE(
cudaMemcpyAsync(begin_addr, &begin, kShapeSize3 * unit_dtype_size_, cudaMemcpyHostToDevice, stream),
"cudaMemcpy failed.");
CHECK_CUDA_RET_WITH_EXCEPT_NOTRACE(
cudaMemcpyAsync(size_addr, &size, kShapeSize3 * unit_dtype_size_, cudaMemcpyHostToDevice, stream),
"cudaMemcpy failed.");
CHECK_CUDA_RET_WITH_EXCEPT_NOTRACE(
cudaMemcpyAsync(bboxes_addr, &bboxes, bboxes_elements_count * sizeof(float), cudaMemcpyHostToDevice, stream),
"cudaMemcpy failed.");
return;
}

template <typename T>
bool SampleDistortedBoundingBoxV2GpuKernelMod::LaunchKernel(const std::vector<AddressPtr> &inputs,
const std::vector<AddressPtr> &workspace,
const std::vector<AddressPtr> &outputs) {
if (is_null_input_) {
return true;
}
cudaStream_t stream = reinterpret_cast<cudaStream_t>(cuda_stream_);
T *image_size_addr = GetDeviceAddress<T>(inputs, kIndex0);
T image_size[kShapeSize3];
CHECK_CUDA_RET_WITH_EXCEPT_NOTRACE(
cudaMemcpyAsync(&image_size, image_size_addr, kShapeSize3 * unit_dtype_size_, cudaMemcpyDeviceToHost, stream),
"cudaMemcpy failed.");
int32_t* workspace_addr = GetDeviceAddress<int32_t>(workspace, kIndex0);

void *curandState_addr = GetDeviceAddress<void *>(workspace, kIndex1);
curandState *devStates = reinterpret_cast<curandState *>(curandState_addr);
SampleDistortedBoundingBoxV2(image_size, seed_, seed2_, aspect_ratio_range_[kIndex0],
aspect_ratio_range_[kIndex1],
area_range_[kIndex0],area_range_[kIndex1] ,max_attempts_, use_image_if_no_bounding_boxes_,workspace_addr,
devStates, reinterpret_cast<cudaStream_t>(cuda_stream_));
SelectBox<T>(inputs, workspace, outputs);
return true;
}

const std::vector<std::pair<KernelAttr, KernelRunFunc>> &SampleDistortedBoundingBoxV2GpuKernelMod::GetFuncList() const {
static const std::vector<std::pair<KernelAttr, KernelRunFunc>> func_list = {
ADD_KERNEL(kNumberTypeUInt8, uint8_t),
ADD_KERNEL(kNumberTypeInt8, int8_t),
ADD_KERNEL(kNumberTypeInt16, int16_t),
ADD_KERNEL(kNumberTypeInt32, int32_t),
ADD_KERNEL(kNumberTypeInt64, int64_t),
};
return func_list;
}
MS_KERNEL_FACTORY_REG(NativeGpuKernelMod, SampleDistortedBoundingBoxV2, SampleDistortedBoundingBoxV2GpuKernelMod);
}
}
  3. sample_distorted_bounding_box_v2_impl.cuh
/**
* Copyright 2020 Huawei Technologies Co., Ltd
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#ifndef MINDSPORE_CCSRC_PLUGIN_DEVICE_GPU_KERNEL_CUDA_IMPL_CUDA_OPS_SAMPLE_DISTORTED_BOUNDING_BOX_V2_IMPL_CUH_
#define MINDSPORE_CCSRC_PLUGIN_DEVICE_GPU_KERNEL_CUDA_IMPL_CUDA_OPS_SAMPLE_DISTORTED_BOUNDING_BOX_V2_IMPL_CUH_

#include <curand_kernel.h>
#include <random>
#include <cuda_runtime.h>
#include "plugin/device/gpu/kernel/cuda_impl/cuda_ops/cuda_device_info.h"
#include "plugin/device/gpu/kernel/cuda_impl/cuda_ops/cuda_common.h"


template <typename T>
void SampleDistortedBoundingBoxV2(T* image_size , int seed, int seed2, float ms_min_sample_aspect_ratio,
float ms_max_sample_aspect_ratio,float min_area_range,float max_area_range ,int max_attempts,
bool use_image_if_no_bounding_boxes, int *output,curandState *globalState,cudaStream_t cuda_stream);
#endif // MINDSPORE_CCSRC_PLUGIN_DEVICE_GPU_KERNEL_CUDA_IMPL_CUDA_OPS_SAMPLE_DISTORTED_BOUNDING_BOX_V2_IMPL_CUH_
  4. sample_distorted_bounding_box_v2_impl.cu
 /**
* Copyright 2020 Huawei Technologies Co., Ltd
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/


#include "plugin/device/gpu/kernel/cuda_impl/cuda_ops/util.cuh"
#include "plugin/device/gpu/kernel/cuda_impl/cuda_ops/sample_distorted_bounding_box_v2_impl.cuh"

__device__ bool dev_error_res = false;
__global__ void GenerateRandomCropKernel(int seed, curandState *globalState, int ms_original_height,int ms_original_width,
float min_area_range, float max_area_range,float ms_min_sample_aspect_ratio, float ms_max_sample_aspect_ratio, int* output, int max_attempt)
{
const float ms_bias = 0.5;
const float ms_min_area = min_area_range * ms_original_width * ms_original_height;
const float ms_max_area = max_area_range * ms_original_width * ms_original_height;
for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < (max_attempt); i += blockDim.x * gridDim.x) {
curand_init(seed, i, 0, &globalState[i]);
float sample_aspect_ratio = curand_uniform(&globalState[i])* (ms_max_sample_aspect_ratio - ms_min_sample_aspect_ratio) + ms_min_sample_aspect_ratio;

int min_height =static_cast<int>(ms_original_height*min_area_range);
int max_height =static_cast<int>(ms_original_height*max_area_range);

int max_width = static_cast<int>(max_height*sample_aspect_ratio);
if(max_width > ms_original_width){
const float kEps = 0.0000001;
max_height = static_cast<int>((ms_original_width + ms_bias - kEps) / sample_aspect_ratio);
if ((max_height * sample_aspect_ratio) > ms_original_width) {
max_height -= 1;
}
}

max_height = min(max_height, ms_original_height);
min_height = min(min_height, max_height);

if (min_height < max_height) {
min_height += static_cast<int>(curand_uniform(&globalState[i])*(max_height-min_height));
}
int width = static_cast<int>(min_height*sample_aspect_ratio);
float area = static_cast<float>(width * min_height);

if (area < ms_min_area) {
min_height += 1;
width = static_cast<int>(min_height * sample_aspect_ratio);
area = width * min_height;
}

if (area > ms_max_area) {
min_height -= 1;
width = static_cast<int>(min_height * sample_aspect_ratio);
area = width * min_height;
}
if (area < ms_min_area || area > ms_max_area || width > ms_original_width || min_height > ms_original_height ||
width <= 0 || min_height <= 0) {
return;
}


int y = 0;
if (min_height < ms_original_height) {
y = static_cast<int>(curand_uniform(&globalState[i])*(ms_original_height-min_height));
}
int x = 0;
if (width < ms_original_width) {
x = static_cast<int>(curand_uniform(&globalState[i])*(ms_original_width-width));
}

const int left_x = i*4;
const int left_y = i*4+1;
const int right_x = i*4+2;
const int right_y = i*4+3;

output[left_x] = x;
output[left_y] = y;
output[right_x] = x+width;
output[right_y] = y+min_height;
}
dev_error_res = true;
return;
}

template<typename T>
void SampleDistortedBoundingBoxV2(T* image_size, int seed, int seed2, float ms_min_sample_aspect_ratio,
float ms_max_sample_aspect_ratio,float min_area_range,float max_area_range ,int max_attempts,
bool use_image_if_no_bounding_boxes,int* output,curandState *globalState,cudaStream_t cuda_stream)
{
int ms_original_height = static_cast<int>(image_size[0]);
int ms_original_width = static_cast<int>(image_size[1]);
int RNG_seed = 0;
std::random_device rd;
if (seed2 != 0) {
RNG_seed = seed2;
} else if (seed != 0) {
RNG_seed = seed;
} else {
RNG_seed = static_cast<int>(rd());
}
GenerateRandomCropKernel<<<GET_BLOCKS(max_attempts), GET_THREADS, 0, cuda_stream>>>
(RNG_seed, globalState, ms_original_height, ms_original_width,min_area_range, max_area_range,ms_min_sample_aspect_ratio,ms_max_sample_aspect_ratio, output, max_attempts);
return;
}

template CUDA_LIB_EXPORT void SampleDistortedBoundingBoxV2<uint8_t>(uint8_t* image_size, int seed, int seed2, float ms_min_sample_aspect_ratio,
float ms_max_sample_aspect_ratio,float min_area_range,float max_area_range ,int max_attempts,
bool use_image_if_no_bounding_boxes,int* output,curandState *globalState,cudaStream_t cuda_stream);
template CUDA_LIB_EXPORT void SampleDistortedBoundingBoxV2<int8_t>(int8_t* image_size, int seed, int seed2, float ms_min_sample_aspect_ratio,
float ms_max_sample_aspect_ratio,float min_area_range,float max_area_range ,int max_attempts,
bool use_image_if_no_bounding_boxes,int* output,curandState *globalState,cudaStream_t cuda_stream);
template CUDA_LIB_EXPORT void SampleDistortedBoundingBoxV2<int16_t>(int16_t* image_size, int seed, int seed2, float ms_min_sample_aspect_ratio,
float ms_max_sample_aspect_ratio,float min_area_range,float max_area_range ,int max_attempts,
bool use_image_if_no_bounding_boxes,int* output,curandState *globalState,cudaStream_t cuda_stream);
template CUDA_LIB_EXPORT void SampleDistortedBoundingBoxV2<int32_t>(int32_t* image_size, int seed, int seed2, float ms_min_sample_aspect_ratio,
float ms_max_sample_aspect_ratio,float min_area_range,float max_area_range ,int max_attempts,
bool use_image_if_no_bounding_boxes,int* output,curandState *globalState,cudaStream_t cuda_stream);
template CUDA_LIB_EXPORT void SampleDistortedBoundingBoxV2<int64_t>(int64_t* image_size, int seed, int seed2, float ms_min_sample_aspect_ratio,
float ms_max_sample_aspect_ratio,float min_area_range,float max_area_range ,int max_attempts,
bool use_image_if_no_bounding_boxes,int* output,curandState *globalState,cudaStream_t cuda_stream);
  5. test.py
import mindspore as ms
from mindspore import Tensor
import mindspore.ops as ops

ms.set_context(device_target="GPU")

image_size = Tensor([640,480,3], ms.int32)
bounding_boxes = Tensor([[[0.38, 0.17, 0.95, 0.40]]], ms.float32)
min_object_covered = Tensor([0.8], ms.float32)
sample_distorted_bounding_box_v2 = ops.operations.other_ops.SampleDistortedBoundingBoxV2(seed=10, seed2=20,
aspect_ratio_range=(0.9, 1.1), area_range=(0.1,1.0), max_attempts=100,
use_image_if_no_bounding_boxes=False)
output = sample_distorted_bounding_box_v2(image_size, bounding_boxes, min_object_covered)
begin, size, bboxes = output[0], output[1], output[2]
print(begin)
print(size)
print(bboxes)