Unable to get working on Broadwell i7 5600U, or missing steps? #20

Closed
lissyx opened this issue Feb 19, 2018 · 97 comments


lissyx commented Feb 19, 2018

I have built and installed without any error (following exactly the steps from https://github.com/intel/compute-runtime#building) on my laptop (Ubuntu 17.10, Broadwell i7-5600U), but clinfo reports 0 platforms found. Checking clinfo's calls with strace shows it is properly picking up /opt/intel/opencl/libigdrcl.so.

Is Broadwell supported right now? The README states it is (https://github.com/intel/compute-runtime#supported-platforms), but the "GenX" naming is unclear to me, and https://ark.intel.com/products/85215/Intel-Core-i7-5600U-Processor-4M-Cache-up-to-3_20-GHz reports "5th gen". And the GPU is Intel HD Graphics 5500, whose naming would be consistent with 5th gen, but it's not listed on https://www.intel.com/content/www/us/en/architecture-and-technology/visual-technology/graphics-overview.html.

So, am I trying to get this working on an unsupported platform, or did I miss anything? Thanks!
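
For reference, the checks described above can be sketched as shell commands (the library path is the one mentioned in this thread; locations may differ on your system):

```shell
# Verify whether any OpenCL platform is visible (expect > 0 once
# the driver loads correctly):
clinfo | grep -i 'number of platforms'

# Confirm which ICD library clinfo actually opens (the path below is
# the install location mentioned in this thread):
strace -f -e trace=open,openat clinfo 2>&1 | grep igdrcl
```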

@lissyx lissyx changed the title Unable to get working on Broadwell, or missing steps? Unable to get working on Broadwell i7 5600U, or missing steps? Feb 19, 2018

lissyx commented Feb 19, 2018

The GPU seems to be identified as Broadwell in the source as well, under the define IBDW_GT2_ULT_MOBL_DEVICE_F0_ID: https://github.com/intel/intel-graphics-compiler/blob/e42b674bce86b372c92ba08812cecccc71debafc/inc/common/igfxfmid.h#L354


pwilma commented Feb 19, 2018

Broadwell is supported by the Neo driver. Your issue looks similar to a problem already reported in #9. Could you please check it? To operate correctly, the Neo driver requires the Khronos ICD loader: https://github.com/KhronosGroup/OpenCL-ICD-Loader. It is a known limitation that we are trying to solve.
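
As later comments in this thread describe, the workaround is to clone and build the Khronos loader by hand next to the Neo sources; a sketch under that assumed directory layout (not prescribed by the project docs):

```shell
# Build the Khronos ICD loader next to the neo checkout (layout
# assumed from later comments in this thread):
git clone https://github.com/KhronosGroup/OpenCL-ICD-Loader
cd OpenCL-ICD-Loader
mkdir -p build && cd build
cmake .. && make

# Then re-run Neo's CMake; on success it reports a line like:
#   -- Taking ICD library from .../OpenCL-ICD-Loader/build/lib
```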


lissyx commented Feb 19, 2018

@pwilma Thanks! I did search in the issues before, but missed #9. I also saw the note about the Khronos ICD loader, but I mistakenly understood that this was provided by the ocl-icd-libopencl1 package on Debian/Ubuntu. I'll verify all of that, thanks!


lissyx commented Feb 19, 2018

Ok, OCL_ICD_ASSUME_ICD_EXTENSION=1 clinfo gives cool stuff, so it might just be the ICD loader setup that I messed up :)
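
For context, the ocl-icd loader discovers drivers through .icd files; a hedged sketch of how the Neo library would be registered (the library path is the one from this thread, and OCL_ICD_ASSUME_ICD_EXTENSION tells ocl-icd to assume the driver exposes the cl_khr_icd extension):

```shell
# Register the Neo driver with the ocl-icd loader: the .icd file just
# contains the path to the vendor library (path from this thread).
echo /opt/intel/opencl/libigdrcl.so | sudo tee /etc/OpenCL/vendors/intel.icd

# Probe again, telling ocl-icd to assume the cl_khr_icd extension:
OCL_ICD_ASSUME_ICD_EXTENSION=1 clinfo
```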


lissyx commented Feb 19, 2018

Okay, after adding a clone of https://github.com/KhronosGroup/OpenCL-ICD-Loader into the set of sources for the Neo driver, building it by hand, and re-running CMake, it was properly picked up, as this line shows:
-- Taking ICD library from /home/alex/codaz/Mozilla/DeepSpeech/Intel-Neo/neo/../OpenCL-ICD-Loader/build/lib

Then, it's being properly packaged, and clinfo returns proper status, as well as does computecpp_info. Now, my TensorFlow code is failing in neo/runtime/os_interface/linux/drm_buffer_object.cpp at line 183, but that might be something expected.


lissyx commented Feb 19, 2018

@pwilma I'm unable to get any debug output, even forcing PrintDebugMessages=1 in the env, and I could not find that mentioned in any doc. How am I supposed to get those messages? Should I rebuild with something different than Release?


pwilma commented Feb 19, 2018

Yes, debug variables won't work in Release builds. Please try building a Debug version.
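
A minimal sketch of the reconfigure step, assuming the standard CMake build-type switch applies to the Neo tree:

```shell
# Rebuild with a Debug build type so debug variables are compiled in
# (standard CMake flag; exact Neo cache options may differ):
cmake -DCMAKE_BUILD_TYPE=Debug ..
make -j"$(nproc)"
```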


lissyx commented Feb 19, 2018

@pwilma Thanks, I'm doing that right now. BTW, one step I had to take but I'm not completely sure about: I git-cloned the Khronos ICD Loader next to the neo driver and built it by hand. Was building by hand required, or should your CMake build system have handled that for me automagically?


pwilma commented Feb 19, 2018

You did it correctly. A manual build of the Khronos ICD Loader is required; the Neo CMake build system won't build it automatically.


lissyx commented Feb 19, 2018

@pwilma The Debug-enabled build is failing: igc/IGC/AdaptorCommon/customApi.cpp:555:41: error: 'struct SRegKeysList' has no member named 'EnableDxbcDump'; did you mean 'EnableCosDump'?


lissyx commented Feb 19, 2018

@pwilma I've removed the nonexistent member, built and installed the Debug build, but I was still unable to get any debug messages :(. Update: it works, but only through igdrcl.config :)

2018-02-19 14:44:04.441003: I ./tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
WARNING: Failed to request OCL Turbo Boost
hwInfo: {24, 168}: (8, 2, 6)
computeUnitsUsedForScratch: 336
2018-02-19 14:44:04.443499: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:70] Found following OpenCL devices:
2018-02-19 14:44:04.443527: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 0, type: GPU, name: Intel(R) Gen8 HD Graphics NEO, vendor: Intel(R) Corporation, profile: FULL_PROFILE
DIM:1	GWS:(4, 1, 1)	ELWS:(0, 0, 0)	Offset:(0, 0, 0)	AGWS:(4, 1, 1)	LWS:(4, 1, 1)	TWGS:(1, 1, 1)	NWGS:(1, 1, 1)	SWGS:(0, 0, 0)
DIM:1	GWS:(4, 1, 1)	ELWS:(0, 0, 0)	Offset:(0, 0, 0)	AGWS:(4, 1, 1)	LWS:(4, 1, 1)	TWGS:(1, 1, 1)	NWGS:(1, 1, 1)	SWGS:(0, 0, 0)
DIM:1	GWS:(512, 1, 1)	ELWS:(0, 0, 0)	Offset:(0, 0, 0)	AGWS:(512, 1, 1)	LWS:(256, 1, 1)	TWGS:(2, 1, 1)	NWGS:(2, 1, 1)	SWGS:(0, 0, 0)
DIM:1	GWS:(4, 1, 1)	ELWS:(0, 0, 0)	Offset:(0, 0, 0)	AGWS:(4, 1, 1)	LWS:(4, 1, 1)	TWGS:(1, 1, 1)	NWGS:(1, 1, 1)	SWGS:(0, 0, 0)
DIM:1	GWS:(512, 1, 1)	ELWS:(0, 0, 0)	Offset:(0, 0, 0)	AGWS:(512, 1, 1)	LWS:(256, 1, 1)	TWGS:(2, 1, 1)	NWGS:(2, 1, 1)	SWGS:(0, 0, 0)
DIM:1	GWS:(4, 1, 1)	ELWS:(0, 0, 0)	Offset:(0, 0, 0)	AGWS:(4, 1, 1)	LWS:(4, 1, 1)	TWGS:(1, 1, 1)	NWGS:(1, 1, 1)	SWGS:(0, 0, 0)
DIM:1	GWS:(52, 1, 1)	ELWS:(0, 0, 0)	Offset:(0, 0, 0)	AGWS:(52, 1, 1)	LWS:(52, 1, 1)	TWGS:(1, 1, 1)	NWGS:(1, 1, 1)	SWGS:(0, 0, 0)
DIM:1	GWS:(4, 1, 1)	ELWS:(0, 0, 0)	Offset:(0, 0, 0)	AGWS:(4, 1, 1)	LWS:(4, 1, 1)	TWGS:(1, 1, 1)	NWGS:(1, 1, 1)	SWGS:(0, 0, 0)
DIM:1	GWS:(512, 1, 1)	ELWS:(0, 0, 0)	Offset:(0, 0, 0)	AGWS:(512, 1, 1)	LWS:(256, 1, 1)	TWGS:(2, 1, 1)	NWGS:(2, 1, 1)	SWGS:(0, 0, 0)
DIM:1	GWS:(2048, 1, 1)	ELWS:(0, 0, 0)	Offset:(0, 0, 0)	AGWS:(2048, 1, 1)	LWS:(256, 1, 1)	TWGS:(8, 1, 1)	NWGS:(8, 1, 1)	SWGS:(0, 0, 0)
ioctl(I915_GEM_EXECBUFFER2) failed with -1. errno=14(Bad address)


lissyx commented Feb 19, 2018

Not sure yet, but there's a return -EFAULT here in i915_gem_execbuffer2(): http://kernel.ubuntu.com/git/ubuntu/ubuntu-artful.git/tree/drivers/gpu/drm/i915/i915_gem_execbuffer.c#n2487


lissyx commented Feb 19, 2018

Enabling DRM's debug in /sys/module/drm/parameters/debug, I get this for the PID of my process:

# grep '8911' dmesg.txt 
[423537.069035] [drm:drm_open [drm]] pid = 8911, minor = 128
[423537.069103] [drm:drm_ioctl [drm]] pid=8911, dev=0xe280, auth=1, DRM_IOCTL_VERSION
[423537.069111] [drm:drm_ioctl [drm]] pid=8911, dev=0xe280, auth=1, I915_GETPARAM
[423537.069117] [drm:drm_ioctl [drm]] pid=8911, dev=0xe280, auth=1, I915_GETPARAM
[423537.069123] [drm:drm_ioctl [drm]] pid=8911, dev=0xe280, auth=1, I915_GETPARAM
[423537.069128] [drm:drm_ioctl [drm]] pid=8911, dev=0xe280, auth=1, I915_GEM_CONTEXT_SETPARAM
[423537.069164] [drm:drm_ioctl [drm]] pid=8911, dev=0xe280, auth=1, I915_GETPARAM
[423537.069169] [drm:drm_ioctl [drm]] pid=8911, dev=0xe280, auth=1, I915_GETPARAM
[423537.069174] [drm:drm_ioctl [drm]] pid=8911, dev=0xe280, auth=1, I915_GETPARAM
[423537.069179] [drm:drm_ioctl [drm]] pid=8911, dev=0xe280, auth=1, I915_GETPARAM
[423537.069184] [drm:drm_ioctl [drm]] pid=8911, dev=0xe280, auth=1, I915_GETPARAM
[423537.069210] [drm:drm_ioctl [drm]] pid=8911, dev=0xe280, auth=1, I915_GETPARAM
[423537.069216] [drm:drm_ioctl [drm]] pid=8911, dev=0xe280, auth=1, I915_GETPARAM
[423537.072558] [drm:drm_ioctl [drm]] pid=8911, dev=0xe280, auth=1, I915_GEM_USERPTR
[423537.072643] [drm:drm_ioctl [drm]] pid=8911, dev=0xe280, auth=1, I915_GEM_USERPTR
[423537.072660] [drm:drm_ioctl [drm]] pid=8911, dev=0xe280, auth=1, I915_REG_READ
[423537.072699] [drm:drm_ioctl [drm]] pid=8911, dev=0xe280, auth=1, I915_GEM_GET_APERTURE
[423537.334023] [drm:drm_release [drm]] pid = 8911, device = 0xe280, open_count = 6
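
For readers following along, DRM verbosity is controlled by a bitmask written to that sysfs file; a sketch (the 0x1f mask covering core and driver messages is an assumption, check the drm.debug documentation for your kernel):

```shell
# Enable verbose DRM debugging (bitmask assumed; see drm.debug docs):
echo 0x1f | sudo tee /sys/module/drm/parameters/debug

# Reproduce the failure, then filter kernel messages by the PID:
dmesg | grep 'pid=8911'

# Disable again afterwards, since this is very noisy:
echo 0 | sudo tee /sys/module/drm/parameters/debug
```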


lissyx commented Feb 19, 2018

@pwilma Here is most of the debug info I could gather so far. Is it possible it is failing because the driver does not yet support enough of the required OpenCL calls for proper operation with TensorFlow / ComputeCpp 0.5.1? Or do I need a newer kernel (Ubuntu 17.10 is on 4.13)?


lissyx commented Feb 19, 2018

@pwilma I've also collected some debug files, if it can be of help, tell me what would be the best way to share them:

$ ls -ah CopyBufferToBuffer*
CopyBufferToBufferLeftLeftover.txt			       CopyBufferToBufferMiddle_arg_2_immediate_size_4_flags_0.bin	     CopyBufferToBufferRightLeftover_arg_1_buffer_size_32_flags_9.bin
CopyBufferToBufferMiddle_arg_1_buffer_size_128_flags_9.bin     CopyBufferToBufferMiddle_arg_3_immediate_size_4_flags_0.bin	     CopyBufferToBufferRightLeftover_arg_2_immediate_size_4_flags_0.bin
CopyBufferToBufferMiddle_arg_1_buffer_size_193664_flags_9.bin  CopyBufferToBufferMiddle.txt					     CopyBufferToBufferRightLeftover_arg_3_immediate_size_4_flags_0.bin
CopyBufferToBufferMiddle_arg_1_buffer_size_32768_flags_9.bin   CopyBufferToBufferRightLeftover_arg_1_buffer_size_128_flags_9.bin     CopyBufferToBufferRightLeftover.txt
CopyBufferToBufferMiddle_arg_1_buffer_size_8192_flags_9.bin    CopyBufferToBufferRightLeftover_arg_1_buffer_size_193664_flags_9.bin

@MichalMrozek
Contributor

Useful stuff for debugging:

  1. Open the file vpg-compute-neo\runtime\os_interface\DebugVariables.def; there are various debug flags available there.
  2. A good starting point is to set EnableDebugBreak to true.
  3. Here the problem looks like a failure in the exec call, which probably means the inputs are incorrect. To isolate the faulty kernel, please set MakeEachEnqueueBlocking; that will tell you which clEnqueueNDRangeKernel call is failing.
  4. Knowing which kernel fails, look at its arguments: are there any buffers created with the CL_MEM_USE_HOST_PTR flag? If so, can you rule out duplicates (CL_MEM_USE_HOST_PTR buffers created from the same host_ptr)?
  5. Please turn on logging and provide dumps:
    DumpKernels/DumpKernelArgs/LogApiCalls/LogPatchTokens/LogMemoryObject


lissyx commented Feb 19, 2018

@MichalMrozek Thanks! I have the following in igdrcl.config, is that okay? It seems to provide debug info at least:

PrintDebugMessages = 1
EnableDebugBreak = true
MakeEachEnqueueBlocking = 1
DumpKernels = 1
DumpKernelArgs = 1
LogApiCalls = 1
LogPatchTokens = 1
LogMemoryObject = 1


lissyx commented Feb 19, 2018

I cannot find anything referring to clEnqueueNDRangeKernel in igdrcl.log. @MichalMrozek you should have received it by mail.

@MichalMrozek
Contributor

Can you try setting this flag?
DoCpuCopyOnWriteBuffer = 1


lissyx commented Feb 19, 2018

@MichalMrozek Magic! It's not failing anymore; it seems to be doing actual stuff :)

@MichalMrozek
Contributor

Looks like there may be a bug in our clEnqueueWriteBuffer implementation.
We will try to reproduce it based on the logs you sent.
If we fail, we will contact you for a more robust reproducer. Thanks for reporting this.


lissyx commented Feb 19, 2018

@MichalMrozek I can share the binaries to reproduce as well, if you need them. It's DeepSpeech + ComputeCpp 0.5.1.

Now, so far, it started the run, but it's not completing anything yet, the output is "blocked" on this:

Kernel Name: SYCL_struct_Eigen__TensorSycl__ExecExprFunctorKernel_const_class_Eigen__TensorAssignOp_class_Eigen__TensorMap_class_Eigen__Tensor_struct_std__complex_double___1__1__long___16__MakePointer___const_class_Eigen__TensorConversionOp_struct_std__complex_double___const_class_Eigen__TensorMap_class_Eigen__Tensor_const_long_long__1__1__long___16__MakePointer_______struct_utility__tuple__Tuple_struct_utility__tuple__Tuple_struct_Eigen__DSizes_long__1_____struct_utility__tuple__Tuple_struct_utility__tuple__Tuple_struct_Eigen__DSizes_long__1_________struct_utility__tuple__Tuple_struct_Eigen__RangeAccess_cl__sycl__access__mode__read_write___struct_Eigen__RangeAccess_cl__sycl__access__mode__read_____false_
Gen7 Kernel Binary Header:
	CheckSum = 66b54f10
	ShaderHashCode = 104d0de50e756486
	KernelNameSize = 704
	PatchListSize = 1308
	KernelHeapSize = 2688
	GeneralStateHeapSize = 0
	DynamicStateHeapSize = 32
	SurfaceStateHeapSize = 136
Program Binary Header:
	Magic = 494e5443
	Version = 1049
	Device = 11
	GPUPointerSizeInBytes = 8
	NumberOfKernels = 22
	SteppingId = 9
	PatchListSize = 0

The deepspeech process seems to be consuming as much CPU as it can, so maybe it's just slow, given the model is quite big and it's a debug build. After some time, it unblocked. I'll rebuild a non-debug version with just the DoCpuCopyOnWriteBuffer flag flipped :)


lissyx commented Feb 19, 2018

@MichalMrozek Building release and changing the value of DoCpuCopyOnWriteBuffer to true in runtime/os_interface/DebugVariables.def fails like this:

/home/alex/codaz/Mozilla/DeepSpeech/Intel-Neo/neo/unit_tests/command_queue/enqueue_write_buffer_tests.cpp:453: Failure
Value of: retVal
  Actual: 0
Expected: -5

/home/alex/codaz/Mozilla/DeepSpeech/Intel-Neo/neo/unit_tests/memory_leak_listener.cpp:55: Failure
Value of: (int)MemoryManagement::failingAllocation
  Actual: -1
Expected: leak
Which is: 135
[  FAILED  ] NegativeFailAllocationTest.givenEnqueueWriteBufferWhenHostPtrAllocationCreationFailsThenReturnOutOfResource

/home/alex/codaz/Mozilla/DeepSpeech/Intel-Neo/neo/unit_tests/command_queue/ooq_task_tests.cpp:67: Failure
Expected: (previousTaskCount) < (commandStreamReceiver.peekTaskCount()), actual: 0 vs 0
*** WARNING: Leaks found but dumping disabled during test failure ***

/home/alex/codaz/Mozilla/DeepSpeech/Intel-Neo/neo/unit_tests/memory_leak_listener.cpp:55: Failure
Value of: (int)MemoryManagement::failingAllocation
  Actual: -1
Expected: leak
Which is: 178
[  FAILED  ] OOQ/OOQTaskTypedTests/6.changesTaskCount

/home/alex/codaz/Mozilla/DeepSpeech/Intel-Neo/neo/unit_tests/command_queue/ooq_task_tests.cpp:53: Failure
Value of: this->pCmdQ->taskLevel
  Actual: 0
Expected: previousTaskLevel + taskLevelClosed
Which is: 1
*** WARNING: Leaks found but dumping disabled during test failure ***

/home/alex/codaz/Mozilla/DeepSpeech/Intel-Neo/neo/unit_tests/memory_leak_listener.cpp:55: Failure
Value of: (int)MemoryManagement::failingAllocation
  Actual: -1
Expected: leak
Which is: 168
[  FAILED  ] OOQ/OOQTaskTypedTests/6.doesntChangeTaskLevel

/home/alex/codaz/Mozilla/DeepSpeech/Intel-Neo/neo/unit_tests/command_queue/enqueue_thread_tests.cpp:67: Failure
Expected: (toFree.size()) > (0u), actual: 0 vs 0
*** WARNING: Leaks found but dumping disabled during test failure ***

/home/alex/codaz/Mozilla/DeepSpeech/Intel-Neo/neo/unit_tests/memory_leak_listener.cpp:55: Failure
Value of: (int)MemoryManagement::failingAllocation
  Actual: -1
Expected: leak
Which is: 79
[  FAILED  ] EnqueueThreading.enqueueWriteBuffer

/home/alex/codaz/Mozilla/DeepSpeech/Intel-Neo/neo/unit_tests/command_queue/get_size_required_buffer_tests.cpp:358: Failure
Value of: usedAfterCS - usedBeforeCS
  Actual: 0
Expected: expectedSizeCS
Which is: 192
*** WARNING: Leaks found but dumping disabled during test failure ***

/home/alex/codaz/Mozilla/DeepSpeech/Intel-Neo/neo/unit_tests/memory_leak_listener.cpp:55: Failure
Value of: (int)MemoryManagement::failingAllocation
  Actual: -1
Expected: leak
Which is: 338
[  FAILED  ] GetSizeRequiredBufferTest.enqueueWriteBufferNonBlocking

/home/alex/codaz/Mozilla/DeepSpeech/Intel-Neo/neo/unit_tests/command_queue/get_size_required_buffer_tests.cpp:413: Failure
Value of: usedAfterCS - usedBeforeCS
  Actual: 0
Expected: expectedSizeCS
Which is: 192
*** WARNING: Leaks found but dumping disabled during test failure ***

/home/alex/codaz/Mozilla/DeepSpeech/Intel-Neo/neo/unit_tests/memory_leak_listener.cpp:55: Failure
Value of: (int)MemoryManagement::failingAllocation
  Actual: -1
Expected: leak
Which is: 338
[  FAILED  ] GetSizeRequiredBufferTest.enqueueWriteBufferBlocking
Tests timeout on: CommandStreamReceiverFlushTaskTests.GivenBlockedKernelNotRequiringDCFlushWhenUnblockedThenDCFlushIsNotAdded
Aborted (core dumped)
unit_tests/CMakeFiles/run_skl_unit_tests.dir/build.make:61: recipe for target 'run_skl_unit_tests' failed
make[2]: *** [run_skl_unit_tests] Error 134
CMakeFiles/Makefile2:5095: recipe for target 'unit_tests/CMakeFiles/run_skl_unit_tests.dir/all' failed
make[1]: *** [unit_tests/CMakeFiles/run_skl_unit_tests.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
Tests timeout on: CommandStreamReceiverFlushTaskTests.GivenBlockedKernelNotRequiringDCFlushWhenUnblockedThenDCFlushIsNotAdded
Tests timeout on: CommandStreamReceiverFlushTaskTests.GivenBlockedKernelNotRequiringDCFlushWhenUnblockedThenDCFlushIsNotAdded
Aborted (core dumped)
unit_tests/CMakeFiles/run_bdw_unit_tests.dir/build.make:61: recipe for target 'run_bdw_unit_tests' failed
make[2]: *** [run_bdw_unit_tests] Error 134
CMakeFiles/Makefile2:5199: recipe for target 'unit_tests/CMakeFiles/run_bdw_unit_tests.dir/all' failed
make[1]: *** [unit_tests/CMakeFiles/run_bdw_unit_tests.dir/all] Error 2
Tests timeout on: CommandStreamReceiverFlushTaskTests.GivenBlockedKernelNotRequiringDCFlushWhenUnblockedThenDCFlushIsNotAdded
Aborted (core dumped)
unit_tests/CMakeFiles/run_kbl_unit_tests.dir/build.make:61: recipe for target 'run_kbl_unit_tests' failed
make[2]: *** [run_kbl_unit_tests] Error 134
CMakeFiles/Makefile2:5447: recipe for target 'unit_tests/CMakeFiles/run_kbl_unit_tests.dir/all' failed
make[1]: *** [unit_tests/CMakeFiles/run_kbl_unit_tests.dir/all] Error 2
Aborted (core dumped)
unit_tests/CMakeFiles/run_bxt_unit_tests.dir/build.make:61: recipe for target 'run_bxt_unit_tests' failed
make[2]: *** [run_bxt_unit_tests] Error 134
CMakeFiles/Makefile2:5836: recipe for target 'unit_tests/CMakeFiles/run_bxt_unit_tests.dir/all' failed
make[1]: *** [unit_tests/CMakeFiles/run_bxt_unit_tests.dir/all] Error 2
Makefile:151: recipe for target 'all' failed
make: *** [all] Error 2


lissyx commented Feb 19, 2018

@MichalMrozek Okay, performance is very bad, but it executed successfully:

$ time ./deepspeech ~/codaz/Mozilla/DeepSpeech/deepspeech-kdavis/tf14.frozen.494_e120.LSTM.ldc93s1.pb ../models/alphabet.txt ~/codaz/Mozilla/DeepSpeech/deepspeech-kdavis/data/smoke_test/LDC93S1.wav -t
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2018-02-19 18:54:33.238904: I ./tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-02-19 18:54:33.243317: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:70] Found following OpenCL devices:
2018-02-19 18:54:33.243427: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 0, type: GPU, name: Intel(R) Gen8 HD Graphics NEO, vendor: Intel(R) Corporation, profile: FULL_PROFILE
shehadyourdarksuitingreasywwsh wtteral year
cpu_time_overall=835.32753 cpu_time_mfcc=0.00770 cpu_time_infer=835.31983

real	7m14,412s
user	9m31,262s
sys	4m24,331s

For reference, the same run on pure CPU takes 4-6 seconds instead of 835. But for an experimental driver, and with the flag we had to flip, that's probably not surprising. The good news is that it works!

@MichalMrozek
Contributor

It is expected that some tests will fail with non-default debug variable settings.
Those variables are meant to be used for debugging/experiments.
Glad to hear that stuff is working.

If you are interested in analyzing why execution is slow, I suggest the following project:
https://github.com/intel/opencl-intercept-layer

A good start would be setting ContextHintLevel to 0xff. This will analyze the application for potential pitfalls.

If you want to find the GPU execution times, I recommend setting DevicePerformanceTiming to 1.


lissyx commented Feb 19, 2018

@MichalMrozek Thanks for those hints, I'll take a look. I pushed the testing further, comparing the same model:

alex@portable-alex:~/tmp/deepspeech/sycl$ time ./deepspeech ../models/prod.rw.pbmm ../models/alphabet.txt ../audio/2830-3980-0043.wav -t
2018-02-19 19:03:28.432828: I ./tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-02-19 19:03:28.439545: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:70] Found following OpenCL devices:
2018-02-19 19:03:28.439639: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 0, type: GPU, name: Intel(R) Gen8 HD Graphics NEO, vendor: Intel(R) Corporation, profile: FULL_PROFILE
experience proves tis
cpu_time_overall=1006.05650 cpu_time_mfcc=0.00631 cpu_time_infer=1006.05019

real	6m32,310s
user	12m39,560s
sys	4m6,630s

And on pure CPU:

alex@portable-alex:~/tmp/deepspeech/cpu$ time ./deepspeech ../models/prod.rw.pbmm ../models/alphabet.txt ../audio/2830-3980-0043.wav -t
2018-02-19 19:24:52.936272: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
experience proves tis
cpu_time_overall=8.89595 cpu_time_mfcc=0.00451 cpu_time_infer=8.89144

real	0m4,623s
user	0m8,848s
sys	0m0,117s

Given the current state of your driver (and the flag flipped), are those times expected, or is there something valuable to investigate?

@MichalMrozek
Contributor

This flag only changes how the writeBuffer operation is handled: instead of going to the GPU, it uses the CPU for the transfer. It is unlikely that it affects performance so significantly, so I think the problem is somewhere else. Your investigation may provide valuable feedback.


lissyx commented Mar 2, 2018

@bashbaug Yes I did, I was calling with those env variables: LC_ALL=C LD_LIBRARY_PATH=/home/alex/codaz/Mozilla/DeepSpeech/Intel-Neo/cli/_bin CLI_DllName=/usr/lib/x86_64-linux-gnu/libOpenCL.so CLI_LogToFile=1 CLI_CallLogging=1 CLI_DevicePerformanceTiming=1 CLI_HostPerformanceTiming=1 CLI_ContextHintLevel=255 CLI_ContextCallbackLogging=1


bashbaug commented Mar 2, 2018

@lissyx strange, should be working then. To confirm, you're seeing other output in your log (such as OpenCL API calls, from CallLogging)? Since you've also set LogToFile, your log will be in:

~/CLIntercept_Dump/(your process name).

If you're seeing other output in your log, but not the performance hints, please file an issue against the Intercept Layer and we can discuss further there - thanks!


lissyx commented Mar 2, 2018

@bashbaug Oh, my bad, I was expecting that on stderr, but looking at the log file, it's all there. Sorry for the noise then.


lissyx commented Mar 5, 2018

@MichalMrozek Okay, so it's confirmed that the (unaligned) allocations are done on the libComputeCpp side, so it's out of my hacking hands. Fortunately, we should have some help from Codeplay's side sooner rather than later, but no timeframe yet :)

@HoppeMateusz
Contributor

Hi @lissyx,
The problem with read-only memory passed as a user ptr to clEnqueueWrite calls, or as host_ptr to clCreateBuffer, should be fixed with the latest commits (service read_only memory passed to CreateBuffer, Fix indexing of allocated bos in populateOsHandles, Service read only memory passed as host_ptr).

You may now try to run the app without DoCpuCopyOnWriteBuffer set to 1.


lissyx commented Mar 17, 2018

@mejch Thanks for the heads-up, I'm syncing and rebuilding :)


lissyx commented Mar 17, 2018

@mejch Quick test shows some progress :)

alex@portable-alex:~/tmp/deepspeech/sycl$ time ./deepspeech ../models/output_graph.pbmm ../models/alphabet.txt ../audio/ -t
2018-03-17 01:23:41.516974: I ./tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-03-17 01:23:41.521251: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:70] Found following OpenCL devices:
2018-03-17 01:23:41.521506: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 0, type: GPU, name: Intel(R) Gen8 HD Graphics NEO, vendor: Intel(R) Corporation, profile: FULL_PROFILE
Running on directory ../audio/
> ../audio//2830-3980-0043.wav
experience proves tis
cpu_time_overall=7.06919 cpu_time_mfcc=0.01302 cpu_time_infer=7.05618
> ../audio//4507-16021-0012.wav
Abort was called at 180 line in file:
/home/alex/codaz/Mozilla/DeepSpeech/Intel-Neo/neo/runtime/os_interface/linux/drm_buffer_object.cpp
Abandon (core dumped)

real	0m6,784s
user	0m3,400s
sys	0m3,770s
alex@portable-alex:~/tmp/deepspeech/sycl$ time ./deepspeech ../models/output_graph.pb ../models/alphabet.txt ../audio/ -t
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2018-03-17 01:23:52.349984: I ./tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-03-17 01:23:52.351873: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:70] Found following OpenCL devices:
2018-03-17 01:23:52.351903: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 0, type: GPU, name: Intel(R) Gen8 HD Graphics NEO, vendor: Intel(R) Corporation, profile: FULL_PROFILE
Running on directory ../audio/
> ../audio//2830-3980-0043.wav
experience proves les
cpu_time_overall=23.86346 cpu_time_mfcc=0.00571 cpu_time_infer=23.85775
> ../audio//4507-16021-0012.wav
why should one halt on the way
cpu_time_overall=9.45240 cpu_time_mfcc=0.00719 cpu_time_infer=9.44521
> ../audio//8455-210777-0068.wav
your power is sufficient i said
cpu_time_overall=10.29218 cpu_time_mfcc=0.00686 cpu_time_infer=10.28532

real	0m33,835s
user	0m20,264s
sys	0m26,509s

So, with the mmap()-enabled protobuf, the first inference now works, but it fails at the second in a row :)

@MichalMrozek
Contributor

Thanks for checking. Can you share repro steps for the multi-file tests?
Sharing the input files would also help.


lissyx commented Mar 18, 2018

@MichalMrozek Sure, I'm sending those by email to you.

@HoppeMateusz
Contributor

Hi, I am running the first command line provided with the new input files:
./deepspeech ../models/output_graph.pb ../models/alphabet.txt ../audio/ -t

And I get a SIGABRT after some mmap operation fails (from strace) that does not come from libigdrcl.so:

mmap(NULL, 33558528, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fc6b7fee000
mmap(NULL, 201330688, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
brk(0x5652819bf000) = 0x5652759b8000
mmap(NULL, 201461760, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(0x7fc744000000, 67108864, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7fc697fea000
munmap(0x7fc697fea000, 67108864) = 0
mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7fc693fea000
munmap(0x7fc693fea000, 90112) = 0
munmap(0x7fc698000000, 67018752) = 0
mprotect(0x7fc694000000, 135168, PROT_READ|PROT_WRITE) = 0
mmap(NULL, 201330688, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
futex(0x7fc862f951a0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
write(2, "terminate called after throwing "..., 48terminate called after throwing an instance of ') = 48
write(2, "std::bad_alloc", 14std::bad_alloc) = 14
write(2, "'\n", 2'
) = 2
write(2, " what(): ", 11 what(): ) = 11
write(2, "std::bad_alloc", 14std::bad_alloc) = 14
write(2, "\n", 1
) = 1
rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
tgkill(32196, 32196, SIGABRT) = 0
--- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, si_pid=32196, si_uid=0} ---

As I understand it, the above command line is working for you?
How much RAM do you have installed? I'm testing on a 3GB BDW.

I've also run the second command line (the mmap version):
./deepspeech ../models/output_graph.pbmm ../models/alphabet.txt ../audio/ -t

  • this one finishes without problems.

I'll try more times to see if I get the same error as you.

Thanks,
Mateusz


lissyx commented Mar 20, 2018

@mejch Good catch, it does require some memory: valgrind --tool=massif measures an allocation peak around ~4.5GB on my desktop with that model. I can share a smaller model with you, if it helps.
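
The measurement mentioned above can be reproduced roughly like this (model and audio paths are the ones used earlier in this thread):

```shell
# Measure peak heap usage of the run; massif writes massif.out.<pid>:
valgrind --tool=massif ./deepspeech ../models/output_graph.pbmm \
    ../models/alphabet.txt ../audio/ -t

# Summarize; the peak snapshot appears near the top of the output:
ms_print massif.out.*
```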

@HoppeMateusz
Contributor

That would be very helpful for further debugging.


lissyx commented Mar 20, 2018

@mejch It should have been shared with your colleague @MichalMrozek; tell me if you run into any other issues :-)

@HoppeMateusz
Contributor

Thanks a lot,
with the smaller model (test.frozen.494_e50_master.LSTM.ldc93s1.pb2) I have a PASS.

But I was not able to reproduce the problem with output_graph.pbmm.
I pushed one fix (Free allocated BOs for read-only user pointer). Can you check if it resolves the problem?

From the log above, I see that abort was called:
Abort was called at 180 line in file:
/home/alex/codaz/Mozilla/DeepSpeech/Intel-Neo/neo/runtime/os_interface/linux/drm_buffer_object.cpp
Abandon (core dumped)

If the fix does not resolve the issue, can you share the callstack and errno code when the abort is called? It is now at line 172 in drm_buffer_object.cpp, in BufferObject::exec():

UNRECOVERABLE_IF(true);

That will help us understand what leads to this failure in exec.

Thanks,
Mateusz


lissyx commented Mar 21, 2018

@mejch Interesting. Unfortunately, I won't be able to test that soon, for personal reasons, but as soon as I can, I'll keep you posted.

@HoppeMateusz
Contributor

Ok, no problem. I'll try on a different machine and let you know if I was able to reproduce and fix this.

@BartoszDunajski
Contributor

@lissyx I added a few additional improvements for the Kmd Notify mechanism (0e41bc7).
Performance should be improved significantly.

@HoppeMateusz
Contributor

Hi @lissyx,
I managed to reproduce the SIGABRT and pushed a fix:

User pointer read-only memory fix - aa088da
With this fix, deepspeech finished successfully.


lissyx commented Mar 29, 2018

Awesome, I'll try to find some time to test that ASAP, but I can't promise any ETA :)

@HoppeMateusz
Contributor

No problem, I hope your issues are fixed now.


lissyx commented Apr 5, 2018

@mejch Okay, I was able to test the updated code on my new laptop (i7-8650U), and it seems the issue is fixed :). All inferences can be run with both .pb and .pbmm.

@BartoszDunajski I'll have to retest on my previous laptop, because I'm getting weird results: it doesn't seem much faster on the i7-8650U compared to the i7-5600U.

@MichalMrozek
Contributor

In terms of GPU performance, we are comparing Intel® HD Graphics 5500 and Intel® UHD Graphics 620.
Both have the same number of execution units: 24.

Intel® HD Graphics 5500 has a theoretical compute power of ~364 GFLOPS, while Intel® UHD Graphics 620 has ~384-400.

The bottom line is that for compute-bound workloads the performance delta may follow the theoretical compute power, which is not significantly higher for Intel® UHD Graphics 620.

@MichalMrozek
Contributor

@lissyx It looks like all the problems are resolved; can we close this issue?


lissyx commented Apr 14, 2018

@MichalMrozek It's fine by me. Is it an issue if I post new comments to report on my latest retries? What would be the best communication channel for further discussion about improvements, if needed? Filing a new issue?

@MichalMrozek
Contributor

I like the 1 problem/suggestion == 1 issue approach.
If all the problems mentioned here are fixed, then I suggest closing this issue.
If you have any new problems or improvement suggestions, I suggest creating a new issue.
Frankly, this started as an ICD problem, which has been fixed for a long time; for the sake of the reader, it is better to encapsulate problems with a proper description.


lissyx commented Apr 14, 2018

@MichalMrozek I totally agree with you; let's close this, but I'll keep the work going :). Thanks for your help, we made a lot of progress :)

@lissyx lissyx closed this as completed Apr 14, 2018

lissyx commented May 3, 2018

@BartoszDunajski I played a bit with those on my new laptop, and it looks like specific tuning is not required anymore! Thanks!
