Unable to get working on Broadwell i7 5600U, or missing steps? #20

Closed
lissyx opened this issue Feb 19, 2018 · 97 comments


lissyx commented Feb 19, 2018

I have built and installed without any error (following exactly the steps from https://github.com/intel/compute-runtime#building) on my laptop (Ubuntu 17.10, Broadwell i7-5600U), but clinfo reports 0 platforms found. Checking clinfo's calls with strace shows it is properly picking up /opt/intel/opencl/libigdrcl.so.

Is Broadwell supported right now? The README states it is (https://github.com/intel/compute-runtime#supported-platforms), but the "GenX" naming is unclear to me, and https://ark.intel.com/products/85215/Intel-Core-i7-5600U-Processor-4M-Cache-up-to-3_20-GHz reports "5th gen". And the GPU is Intel HD Graphics 5500, whose naming would be consistent with 5th gen, but it's not listed on https://www.intel.com/content/www/us/en/architecture-and-technology/visual-technology/graphics-overview.html.

So, am I trying to get this working on an unsupported platform, or did I miss anything? Thanks!
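
For reference, the checks described above can be sketched as shell commands (the library path is the one mentioned in this thread; locations may differ on your system):

```shell
# Verify whether any OpenCL platform is visible (expect > 0 once
# the driver loads correctly):
clinfo | grep -i 'number of platforms'

# Confirm which ICD library clinfo actually opens (the path below is
# the install location mentioned in this thread):
strace -f -e trace=open,openat clinfo 2>&1 | grep igdrcl
```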

@lissyx lissyx changed the title Unable to get working on Broadwell, or missing steps? Unable to get working on Broadwell i7 5600U, or missing steps? Feb 19, 2018

lissyx commented Feb 19, 2018

The GPU seems to be identified as Broadwell in the source as well, under the define IBDW_GT2_ULT_MOBL_DEVICE_F0_ID: https://github.com/intel/intel-graphics-compiler/blob/e42b674bce86b372c92ba08812cecccc71debafc/inc/common/igfxfmid.h#L354


pwilma commented Feb 19, 2018

Broadwell is supported by the Neo driver. Your issue looks similar to a problem already reported in #9. Could you please check it? To operate correctly, the Neo driver requires the Khronos ICD loader: https://github.com/KhronosGroup/OpenCL-ICD-Loader. It is a known limitation that we are trying to solve.
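
As later comments in this thread describe, the workaround is to clone and build the Khronos loader by hand next to the Neo sources; a sketch under that assumed directory layout (not prescribed by the project docs):

```shell
# Build the Khronos ICD loader next to the neo checkout (layout
# assumed from later comments in this thread):
git clone https://github.com/KhronosGroup/OpenCL-ICD-Loader
cd OpenCL-ICD-Loader
mkdir -p build && cd build
cmake .. && make

# Then re-run Neo's CMake; on success it reports a line like:
#   -- Taking ICD library from .../OpenCL-ICD-Loader/build/lib
```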


lissyx commented Feb 19, 2018

@pwilma Thanks! I did search in the issues before, but missed #9. I also saw the note about the Khronos ICD loader, but I mistakenly understood that this was provided by the ocl-icd-libopencl1 package on Debian/Ubuntu. I'll verify all of that, thanks!


lissyx commented Feb 19, 2018

Ok, OCL_ICD_ASSUME_ICD_EXTENSION=1 clinfo gives cool stuff, so it might just be the ICD loader setup that I messed up :)
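
For context, the ocl-icd loader discovers drivers through .icd files; a hedged sketch of how the Neo library would be registered (the library path is the one from this thread, and OCL_ICD_ASSUME_ICD_EXTENSION tells ocl-icd to assume the driver exposes the cl_khr_icd extension):

```shell
# Register the Neo driver with the ocl-icd loader: the .icd file just
# contains the path to the vendor library (path from this thread).
echo /opt/intel/opencl/libigdrcl.so | sudo tee /etc/OpenCL/vendors/intel.icd

# Probe again, telling ocl-icd to assume the cl_khr_icd extension:
OCL_ICD_ASSUME_ICD_EXTENSION=1 clinfo
```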


lissyx commented Feb 19, 2018

Okay, after adding a clone of https://github.com/KhronosGroup/OpenCL-ICD-Loader into the set of sources for the Neo driver, building it by hand, and re-running CMake, it was properly picked up, as this line shows:
-- Taking ICD library from /home/alex/codaz/Mozilla/DeepSpeech/Intel-Neo/neo/../OpenCL-ICD-Loader/build/lib

Then, it's being properly packaged, and clinfo returns proper status, as well as does computecpp_info. Now, my TensorFlow code is failing in neo/runtime/os_interface/linux/drm_buffer_object.cpp at line 183, but that might be something expected.


lissyx commented Feb 19, 2018

@pwilma I'm unable to get any debug output, even forcing PrintDebugMessages=1 in the env, and I could not find that mentioned in any doc. How am I supposed to get those messages? Should I rebuild with something different than Release?


pwilma commented Feb 19, 2018

Yes, debug variables won't work in Release builds. Please try building a Debug version.
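
A minimal sketch of the reconfigure step, assuming the standard CMake build-type switch applies to the Neo tree:

```shell
# Rebuild with a Debug build type so debug variables are compiled in
# (standard CMake flag; exact Neo cache options may differ):
cmake -DCMAKE_BUILD_TYPE=Debug ..
make -j"$(nproc)"
```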


lissyx commented Feb 19, 2018

@pwilma Thanks, I'm doing that right now. BTW, one step I had to take but I'm not completely sure about: I git-cloned the Khronos ICD Loader next to the neo driver and built it by hand. Was building by hand required, or should your CMake build system have handled that for me automagically?


pwilma commented Feb 19, 2018

You did it correctly. A manual build of the Khronos ICD Loader is required; the Neo CMake build system won't build it automatically.


lissyx commented Feb 19, 2018

@pwilma The Debug-enabled build is failing: igc/IGC/AdaptorCommon/customApi.cpp:555:41: error: 'struct SRegKeysList' has no member named 'EnableDxbcDump'; did you mean 'EnableCosDump'?


lissyx commented Feb 19, 2018

@pwilma I've removed the nonexistent member, built and installed the Debug build, but I was still unable to get any debug messages :(. Update: it works, but only through igdrcl.config :)

2018-02-19 14:44:04.441003: I ./tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
WARNING: Failed to request OCL Turbo Boost
hwInfo: {24, 168}: (8, 2, 6)
computeUnitsUsedForScratch: 336
2018-02-19 14:44:04.443499: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:70] Found following OpenCL devices:
2018-02-19 14:44:04.443527: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 0, type: GPU, name: Intel(R) Gen8 HD Graphics NEO, vendor: Intel(R) Corporation, profile: FULL_PROFILE
DIM:1	GWS:(4, 1, 1)	ELWS:(0, 0, 0)	Offset:(0, 0, 0)	AGWS:(4, 1, 1)	LWS:(4, 1, 1)	TWGS:(1, 1, 1)	NWGS:(1, 1, 1)	SWGS:(0, 0, 0)
DIM:1	GWS:(4, 1, 1)	ELWS:(0, 0, 0)	Offset:(0, 0, 0)	AGWS:(4, 1, 1)	LWS:(4, 1, 1)	TWGS:(1, 1, 1)	NWGS:(1, 1, 1)	SWGS:(0, 0, 0)
DIM:1	GWS:(512, 1, 1)	ELWS:(0, 0, 0)	Offset:(0, 0, 0)	AGWS:(512, 1, 1)	LWS:(256, 1, 1)	TWGS:(2, 1, 1)	NWGS:(2, 1, 1)	SWGS:(0, 0, 0)
DIM:1	GWS:(4, 1, 1)	ELWS:(0, 0, 0)	Offset:(0, 0, 0)	AGWS:(4, 1, 1)	LWS:(4, 1, 1)	TWGS:(1, 1, 1)	NWGS:(1, 1, 1)	SWGS:(0, 0, 0)
DIM:1	GWS:(512, 1, 1)	ELWS:(0, 0, 0)	Offset:(0, 0, 0)	AGWS:(512, 1, 1)	LWS:(256, 1, 1)	TWGS:(2, 1, 1)	NWGS:(2, 1, 1)	SWGS:(0, 0, 0)
DIM:1	GWS:(4, 1, 1)	ELWS:(0, 0, 0)	Offset:(0, 0, 0)	AGWS:(4, 1, 1)	LWS:(4, 1, 1)	TWGS:(1, 1, 1)	NWGS:(1, 1, 1)	SWGS:(0, 0, 0)
DIM:1	GWS:(52, 1, 1)	ELWS:(0, 0, 0)	Offset:(0, 0, 0)	AGWS:(52, 1, 1)	LWS:(52, 1, 1)	TWGS:(1, 1, 1)	NWGS:(1, 1, 1)	SWGS:(0, 0, 0)
DIM:1	GWS:(4, 1, 1)	ELWS:(0, 0, 0)	Offset:(0, 0, 0)	AGWS:(4, 1, 1)	LWS:(4, 1, 1)	TWGS:(1, 1, 1)	NWGS:(1, 1, 1)	SWGS:(0, 0, 0)
DIM:1	GWS:(512, 1, 1)	ELWS:(0, 0, 0)	Offset:(0, 0, 0)	AGWS:(512, 1, 1)	LWS:(256, 1, 1)	TWGS:(2, 1, 1)	NWGS:(2, 1, 1)	SWGS:(0, 0, 0)
DIM:1	GWS:(2048, 1, 1)	ELWS:(0, 0, 0)	Offset:(0, 0, 0)	AGWS:(2048, 1, 1)	LWS:(256, 1, 1)	TWGS:(8, 1, 1)	NWGS:(8, 1, 1)	SWGS:(0, 0, 0)
ioctl(I915_GEM_EXECBUFFER2) failed with -1. errno=14(Bad address)


lissyx commented Feb 19, 2018

Not sure yet, but there's a return -EFAULT here in i915_gem_execbuffer2(): http://kernel.ubuntu.com/git/ubuntu/ubuntu-artful.git/tree/drivers/gpu/drm/i915/i915_gem_execbuffer.c#n2487


lissyx commented Feb 19, 2018

Enabling DRM's debug in /sys/module/drm/parameters/debug, I get this for the PID of my process:

# grep '8911' dmesg.txt 
[423537.069035] [drm:drm_open [drm]] pid = 8911, minor = 128
[423537.069103] [drm:drm_ioctl [drm]] pid=8911, dev=0xe280, auth=1, DRM_IOCTL_VERSION
[423537.069111] [drm:drm_ioctl [drm]] pid=8911, dev=0xe280, auth=1, I915_GETPARAM
[423537.069117] [drm:drm_ioctl [drm]] pid=8911, dev=0xe280, auth=1, I915_GETPARAM
[423537.069123] [drm:drm_ioctl [drm]] pid=8911, dev=0xe280, auth=1, I915_GETPARAM
[423537.069128] [drm:drm_ioctl [drm]] pid=8911, dev=0xe280, auth=1, I915_GEM_CONTEXT_SETPARAM
[423537.069164] [drm:drm_ioctl [drm]] pid=8911, dev=0xe280, auth=1, I915_GETPARAM
[423537.069169] [drm:drm_ioctl [drm]] pid=8911, dev=0xe280, auth=1, I915_GETPARAM
[423537.069174] [drm:drm_ioctl [drm]] pid=8911, dev=0xe280, auth=1, I915_GETPARAM
[423537.069179] [drm:drm_ioctl [drm]] pid=8911, dev=0xe280, auth=1, I915_GETPARAM
[423537.069184] [drm:drm_ioctl [drm]] pid=8911, dev=0xe280, auth=1, I915_GETPARAM
[423537.069210] [drm:drm_ioctl [drm]] pid=8911, dev=0xe280, auth=1, I915_GETPARAM
[423537.069216] [drm:drm_ioctl [drm]] pid=8911, dev=0xe280, auth=1, I915_GETPARAM
[423537.072558] [drm:drm_ioctl [drm]] pid=8911, dev=0xe280, auth=1, I915_GEM_USERPTR
[423537.072643] [drm:drm_ioctl [drm]] pid=8911, dev=0xe280, auth=1, I915_GEM_USERPTR
[423537.072660] [drm:drm_ioctl [drm]] pid=8911, dev=0xe280, auth=1, I915_REG_READ
[423537.072699] [drm:drm_ioctl [drm]] pid=8911, dev=0xe280, auth=1, I915_GEM_GET_APERTURE
[423537.334023] [drm:drm_release [drm]] pid = 8911, device = 0xe280, open_count = 6
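
For readers following along, DRM verbosity is controlled by a bitmask written to that sysfs file; a sketch (the 0x1f mask covering core and driver messages is an assumption, check the drm.debug documentation for your kernel):

```shell
# Enable verbose DRM debugging (bitmask assumed; see drm.debug docs):
echo 0x1f | sudo tee /sys/module/drm/parameters/debug

# Reproduce the failure, then filter kernel messages by the PID:
dmesg | grep 'pid=8911'

# Disable again afterwards, since this is very noisy:
echo 0 | sudo tee /sys/module/drm/parameters/debug
```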


lissyx commented Feb 19, 2018

@pwilma Here is most of the debug info I could gather so far. Is it possible it is failing because the driver does not yet support enough of the required OpenCL calls for proper operation with TensorFlow / ComputeCpp 0.5.1? Or do I need a newer kernel (Ubuntu 17.10 is on 4.13)?


lissyx commented Feb 19, 2018

@pwilma I've also collected some debug files, if it can be of help, tell me what would be the best way to share them:

$ ls -ah CopyBufferToBuffer*
CopyBufferToBufferLeftLeftover.txt			       CopyBufferToBufferMiddle_arg_2_immediate_size_4_flags_0.bin	     CopyBufferToBufferRightLeftover_arg_1_buffer_size_32_flags_9.bin
CopyBufferToBufferMiddle_arg_1_buffer_size_128_flags_9.bin     CopyBufferToBufferMiddle_arg_3_immediate_size_4_flags_0.bin	     CopyBufferToBufferRightLeftover_arg_2_immediate_size_4_flags_0.bin
CopyBufferToBufferMiddle_arg_1_buffer_size_193664_flags_9.bin  CopyBufferToBufferMiddle.txt					     CopyBufferToBufferRightLeftover_arg_3_immediate_size_4_flags_0.bin
CopyBufferToBufferMiddle_arg_1_buffer_size_32768_flags_9.bin   CopyBufferToBufferRightLeftover_arg_1_buffer_size_128_flags_9.bin     CopyBufferToBufferRightLeftover.txt
CopyBufferToBufferMiddle_arg_1_buffer_size_8192_flags_9.bin    CopyBufferToBufferRightLeftover_arg_1_buffer_size_193664_flags_9.bin

@MichalMrozek
Contributor

Useful stuff for debugging:

  1. Open the file vpg-compute-neo\runtime\os_interface\DebugVariables.def; there are various debug flags available there.
  2. A good starting point is to set EnableDebugBreak to true.
  3. Here the problem looks like a failure in the exec call, which probably means the inputs are incorrect. To isolate the faulty kernel, please set MakeEachEnqueueBlocking; that will tell you which clEnqueueNDRangeKernel call is failing.
  4. Knowing which kernel fails, look at its arguments: are there any buffers created with the CL_MEM_USE_HOST_PTR flag? If so, can you rule out duplicates (CL_MEM_USE_HOST_PTR buffers created from the same host_ptr)?
  5. Please turn on logging and provide dumps:
    DumpKernels/DumpKernelArgs/LogApiCalls/LogPatchTokens/LogMemoryObject


lissyx commented Feb 19, 2018

@MichalMrozek Thanks! I have the following in igdrcl.config, is that okay? It seems to provide debug info at least:

PrintDebugMessages = 1
EnableDebugBreak = true
MakeEachEnqueueBlocking = 1
DumpKernels = 1
DumpKernelArgs = 1
LogApiCalls = 1
LogPatchTokens = 1
LogMemoryObject = 1


lissyx commented Feb 19, 2018

I cannot find anything referring to clEnqueueNDRangeKernel in igdrcl.log. @MichalMrozek you should have received it by mail.

@MichalMrozek
Contributor

Can you try setting this flag?
DoCpuCopyOnWriteBuffer = 1


lissyx commented Feb 19, 2018

@MichalMrozek Magic! It's not failing anymore; it seems to be doing actual stuff :)

@MichalMrozek
Contributor

Looks like there may be a bug in our clEnqueueWriteBuffer implementation.
We will try to reproduce it based on the logs you sent.
If we fail, we will contact you for a more robust reproducer. Thanks for reporting this.


lissyx commented Feb 19, 2018

@MichalMrozek I can share the binaries to reproduce as well, if you need them. It's DeepSpeech + ComputeCpp 0.5.1.

Now, so far, it started the run, but it's not completing anything yet, the output is "blocked" on this:

Kernel Name: SYCL_struct_Eigen__TensorSycl__ExecExprFunctorKernel_const_class_Eigen__TensorAssignOp_class_Eigen__TensorMap_class_Eigen__Tensor_struct_std__complex_double___1__1__long___16__MakePointer___const_class_Eigen__TensorConversionOp_struct_std__complex_double___const_class_Eigen__TensorMap_class_Eigen__Tensor_const_long_long__1__1__long___16__MakePointer_______struct_utility__tuple__Tuple_struct_utility__tuple__Tuple_struct_Eigen__DSizes_long__1_____struct_utility__tuple__Tuple_struct_utility__tuple__Tuple_struct_Eigen__DSizes_long__1_________struct_utility__tuple__Tuple_struct_Eigen__RangeAccess_cl__sycl__access__mode__read_write___struct_Eigen__RangeAccess_cl__sycl__access__mode__read_____false_
Gen7 Kernel Binary Header:
	CheckSum = 66b54f10
	ShaderHashCode = 104d0de50e756486
	KernelNameSize = 704
	PatchListSize = 1308
	KernelHeapSize = 2688
	GeneralStateHeapSize = 0
	DynamicStateHeapSize = 32
	SurfaceStateHeapSize = 136
Program Binary Header:
	Magic = 494e5443
	Version = 1049
	Device = 11
	GPUPointerSizeInBytes = 8
	NumberOfKernels = 22
	SteppingId = 9
	PatchListSize = 0

The deepspeech process seems to be consuming as much CPU as it can, so maybe it's just slow, given the model is quite big and it's a debug build. After some time, it unblocked. I'll rebuild a non-debug version with just the DoCpuCopyOnWriteBuffer flag flipped :)


lissyx commented Feb 19, 2018

@MichalMrozek Building release and changing the value of DoCpuCopyOnWriteBuffer to true in runtime/os_interface/DebugVariables.def fails like this:

/home/alex/codaz/Mozilla/DeepSpeech/Intel-Neo/neo/unit_tests/command_queue/enqueue_write_buffer_tests.cpp:453: Failure
Value of: retVal
  Actual: 0
Expected: -5

/home/alex/codaz/Mozilla/DeepSpeech/Intel-Neo/neo/unit_tests/memory_leak_listener.cpp:55: Failure
Value of: (int)MemoryManagement::failingAllocation
  Actual: -1
Expected: leak
Which is: 135
[  FAILED  ] NegativeFailAllocationTest.givenEnqueueWriteBufferWhenHostPtrAllocationCreationFailsThenReturnOutOfResource

/home/alex/codaz/Mozilla/DeepSpeech/Intel-Neo/neo/unit_tests/command_queue/ooq_task_tests.cpp:67: Failure
Expected: (previousTaskCount) < (commandStreamReceiver.peekTaskCount()), actual: 0 vs 0
*** WARNING: Leaks found but dumping disabled during test failure ***

/home/alex/codaz/Mozilla/DeepSpeech/Intel-Neo/neo/unit_tests/memory_leak_listener.cpp:55: Failure
Value of: (int)MemoryManagement::failingAllocation
  Actual: -1
Expected: leak
Which is: 178
[  FAILED  ] OOQ/OOQTaskTypedTests/6.changesTaskCount

/home/alex/codaz/Mozilla/DeepSpeech/Intel-Neo/neo/unit_tests/command_queue/ooq_task_tests.cpp:53: Failure
Value of: this->pCmdQ->taskLevel
  Actual: 0
Expected: previousTaskLevel + taskLevelClosed
Which is: 1
*** WARNING: Leaks found but dumping disabled during test failure ***

/home/alex/codaz/Mozilla/DeepSpeech/Intel-Neo/neo/unit_tests/memory_leak_listener.cpp:55: Failure
Value of: (int)MemoryManagement::failingAllocation
  Actual: -1
Expected: leak
Which is: 168
[  FAILED  ] OOQ/OOQTaskTypedTests/6.doesntChangeTaskLevel

/home/alex/codaz/Mozilla/DeepSpeech/Intel-Neo/neo/unit_tests/command_queue/enqueue_thread_tests.cpp:67: Failure
Expected: (toFree.size()) > (0u), actual: 0 vs 0
*** WARNING: Leaks found but dumping disabled during test failure ***

/home/alex/codaz/Mozilla/DeepSpeech/Intel-Neo/neo/unit_tests/memory_leak_listener.cpp:55: Failure
Value of: (int)MemoryManagement::failingAllocation
  Actual: -1
Expected: leak
Which is: 79
[  FAILED  ] EnqueueThreading.enqueueWriteBuffer

/home/alex/codaz/Mozilla/DeepSpeech/Intel-Neo/neo/unit_tests/command_queue/get_size_required_buffer_tests.cpp:358: Failure
Value of: usedAfterCS - usedBeforeCS
  Actual: 0
Expected: expectedSizeCS
Which is: 192
*** WARNING: Leaks found but dumping disabled during test failure ***

/home/alex/codaz/Mozilla/DeepSpeech/Intel-Neo/neo/unit_tests/memory_leak_listener.cpp:55: Failure
Value of: (int)MemoryManagement::failingAllocation
  Actual: -1
Expected: leak
Which is: 338
[  FAILED  ] GetSizeRequiredBufferTest.enqueueWriteBufferNonBlocking

/home/alex/codaz/Mozilla/DeepSpeech/Intel-Neo/neo/unit_tests/command_queue/get_size_required_buffer_tests.cpp:413: Failure
Value of: usedAfterCS - usedBeforeCS
  Actual: 0
Expected: expectedSizeCS
Which is: 192
*** WARNING: Leaks found but dumping disabled during test failure ***

/home/alex/codaz/Mozilla/DeepSpeech/Intel-Neo/neo/unit_tests/memory_leak_listener.cpp:55: Failure
Value of: (int)MemoryManagement::failingAllocation
  Actual: -1
Expected: leak
Which is: 338
[  FAILED  ] GetSizeRequiredBufferTest.enqueueWriteBufferBlocking
Tests timeout on: CommandStreamReceiverFlushTaskTests.GivenBlockedKernelNotRequiringDCFlushWhenUnblockedThenDCFlushIsNotAdded
Aborted (core dumped)
unit_tests/CMakeFiles/run_skl_unit_tests.dir/build.make:61: recipe for target 'run_skl_unit_tests' failed
make[2]: *** [run_skl_unit_tests] Error 134
CMakeFiles/Makefile2:5095: recipe for target 'unit_tests/CMakeFiles/run_skl_unit_tests.dir/all' failed
make[1]: *** [unit_tests/CMakeFiles/run_skl_unit_tests.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
Tests timeout on: CommandStreamReceiverFlushTaskTests.GivenBlockedKernelNotRequiringDCFlushWhenUnblockedThenDCFlushIsNotAdded
Tests timeout on: CommandStreamReceiverFlushTaskTests.GivenBlockedKernelNotRequiringDCFlushWhenUnblockedThenDCFlushIsNotAdded
Aborted (core dumped)
unit_tests/CMakeFiles/run_bdw_unit_tests.dir/build.make:61: recipe for target 'run_bdw_unit_tests' failed
make[2]: *** [run_bdw_unit_tests] Error 134
CMakeFiles/Makefile2:5199: recipe for target 'unit_tests/CMakeFiles/run_bdw_unit_tests.dir/all' failed
make[1]: *** [unit_tests/CMakeFiles/run_bdw_unit_tests.dir/all] Error 2
Tests timeout on: CommandStreamReceiverFlushTaskTests.GivenBlockedKernelNotRequiringDCFlushWhenUnblockedThenDCFlushIsNotAdded
Aborted (core dumped)
unit_tests/CMakeFiles/run_kbl_unit_tests.dir/build.make:61: recipe for target 'run_kbl_unit_tests' failed
make[2]: *** [run_kbl_unit_tests] Error 134
CMakeFiles/Makefile2:5447: recipe for target 'unit_tests/CMakeFiles/run_kbl_unit_tests.dir/all' failed
make[1]: *** [unit_tests/CMakeFiles/run_kbl_unit_tests.dir/all] Error 2
Aborted (core dumped)
unit_tests/CMakeFiles/run_bxt_unit_tests.dir/build.make:61: recipe for target 'run_bxt_unit_tests' failed
make[2]: *** [run_bxt_unit_tests] Error 134
CMakeFiles/Makefile2:5836: recipe for target 'unit_tests/CMakeFiles/run_bxt_unit_tests.dir/all' failed
make[1]: *** [unit_tests/CMakeFiles/run_bxt_unit_tests.dir/all] Error 2
Makefile:151: recipe for target 'all' failed
make: *** [all] Error 2


lissyx commented Feb 19, 2018

@MichalMrozek Okay, performance is very bad, but it executed successfully:

$ time ./deepspeech ~/codaz/Mozilla/DeepSpeech/deepspeech-kdavis/tf14.frozen.494_e120.LSTM.ldc93s1.pb ../models/alphabet.txt ~/codaz/Mozilla/DeepSpeech/deepspeech-kdavis/data/smoke_test/LDC93S1.wav -t
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2018-02-19 18:54:33.238904: I ./tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-02-19 18:54:33.243317: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:70] Found following OpenCL devices:
2018-02-19 18:54:33.243427: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 0, type: GPU, name: Intel(R) Gen8 HD Graphics NEO, vendor: Intel(R) Corporation, profile: FULL_PROFILE
shehadyourdarksuitingreasywwsh wtteral year
cpu_time_overall=835.32753 cpu_time_mfcc=0.00770 cpu_time_infer=835.31983

real	7m14,412s
user	9m31,262s
sys	4m24,331s

For reference, the same run on pure CPU takes 4-6 seconds instead of 835. But for an experimental driver, and with the flag we had to flip, that's probably not surprising. The good news is that it works!

@MichalMrozek
Contributor

It is expected that some tests will fail with non-default debug variable settings.
Those variables are meant to be used for debugging/experiments.
Glad to hear that stuff is working.

If you are interested in analyzing why execution is slow, I suggest the following project:
https://github.com/intel/opencl-intercept-layer

A good start would be setting ContextHintLevel to 0xff. This will analyze the application for potential pitfalls.

If you want to find the GPU execution times, I recommend setting DevicePerformanceTiming to 1.


lissyx commented Feb 19, 2018

@MichalMrozek Thanks for those hints, I'll take a look. I pushed the testing further, comparing the same model:

alex@portable-alex:~/tmp/deepspeech/sycl$ time ./deepspeech ../models/prod.rw.pbmm ../models/alphabet.txt ../audio/2830-3980-0043.wav -t
2018-02-19 19:03:28.432828: I ./tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-02-19 19:03:28.439545: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:70] Found following OpenCL devices:
2018-02-19 19:03:28.439639: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 0, type: GPU, name: Intel(R) Gen8 HD Graphics NEO, vendor: Intel(R) Corporation, profile: FULL_PROFILE
experience proves tis
cpu_time_overall=1006.05650 cpu_time_mfcc=0.00631 cpu_time_infer=1006.05019

real	6m32,310s
user	12m39,560s
sys	4m6,630s

And on pure CPU:

alex@portable-alex:~/tmp/deepspeech/cpu$ time ./deepspeech ../models/prod.rw.pbmm ../models/alphabet.txt ../audio/2830-3980-0043.wav -t
2018-02-19 19:24:52.936272: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
experience proves tis
cpu_time_overall=8.89595 cpu_time_mfcc=0.00451 cpu_time_infer=8.89144

real	0m4,623s
user	0m8,848s
sys	0m0,117s

Given the current state of your driver (and the flag flipped), are those times expected, or is there something valuable to investigate?

@MichalMrozek
Contributor

This flag only changes how the writeBuffer operation is handled: instead of going to the GPU, it uses the CPU for the transfer. It is unlikely that it affects performance so significantly, so I think the problem is somewhere else. Your investigation may provide valuable feedback.


lissyx commented Mar 2, 2018

@bashbaug Yes I did, I was calling with those env variables: LC_ALL=C LD_LIBRARY_PATH=/home/alex/codaz/Mozilla/DeepSpeech/Intel-Neo/cli/_bin CLI_DllName=/usr/lib/x86_64-linux-gnu/libOpenCL.so CLI_LogToFile=1 CLI_CallLogging=1 CLI_DevicePerformanceTiming=1 CLI_HostPerformanceTiming=1 CLI_ContextHintLevel=255 CLI_ContextCallbackLogging=1


bashbaug commented Mar 2, 2018

@lissyx strange, should be working then. To confirm, you're seeing other output in your log (such as OpenCL API calls, from CallLogging)? Since you've also set LogToFile, your log will be in:

~/CLIntercept_Dump/(your process name).

If you're seeing other output in your log, but not the performance hints, please file an issue against the Intercept Layer and we can discuss further there - thanks!


lissyx commented Mar 2, 2018

@bashbaug Oh, my bad, I was expecting that on stderr, but looking at the log file, it's all there. Sorry for the noise then.


lissyx commented Mar 5, 2018

@MichalMrozek Okay, so it's confirmed that the (unaligned) allocations are done on the libComputeCpp side, so it's out of my hacking hands. Fortunately, we should have some help from Codeplay's side sooner rather than later, but no timeframe yet :)

@HoppeMateusz
Contributor

Hi @lissyx,
The problem with read-only memory passed as a user ptr to clEnqueueWrite calls, or as host_ptr to clCreateBuffer, should be fixed with the latest commits (service read_only memory passed to CreateBuffer, Fix indexing of allocated bos in populateOsHandles, Service read only memory passed as host_ptr).

You may now try to run the app without DoCpuCopyOnWriteBuffer set to 1.


lissyx commented Mar 17, 2018

@mejch Thanks for the heads-up, I'm syncing and rebuilding :)


lissyx commented Mar 17, 2018

@mejch Quick test shows some progress :)

alex@portable-alex:~/tmp/deepspeech/sycl$ time ./deepspeech ../models/output_graph.pbmm ../models/alphabet.txt ../audio/ -t
2018-03-17 01:23:41.516974: I ./tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-03-17 01:23:41.521251: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:70] Found following OpenCL devices:
2018-03-17 01:23:41.521506: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 0, type: GPU, name: Intel(R) Gen8 HD Graphics NEO, vendor: Intel(R) Corporation, profile: FULL_PROFILE
Running on directory ../audio/
> ../audio//2830-3980-0043.wav
experience proves tis
cpu_time_overall=7.06919 cpu_time_mfcc=0.01302 cpu_time_infer=7.05618
> ../audio//4507-16021-0012.wav
Abort was called at 180 line in file:
/home/alex/codaz/Mozilla/DeepSpeech/Intel-Neo/neo/runtime/os_interface/linux/drm_buffer_object.cpp
Abandon (core dumped)

real	0m6,784s
user	0m3,400s
sys	0m3,770s
alex@portable-alex:~/tmp/deepspeech/sycl$ time ./deepspeech ../models/output_graph.pb ../models/alphabet.txt ../audio/ -t
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2018-03-17 01:23:52.349984: I ./tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-03-17 01:23:52.351873: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:70] Found following OpenCL devices:
2018-03-17 01:23:52.351903: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 0, type: GPU, name: Intel(R) Gen8 HD Graphics NEO, vendor: Intel(R) Corporation, profile: FULL_PROFILE
Running on directory ../audio/
> ../audio//2830-3980-0043.wav
experience proves les
cpu_time_overall=23.86346 cpu_time_mfcc=0.00571 cpu_time_infer=23.85775
> ../audio//4507-16021-0012.wav
why should one halt on the way
cpu_time_overall=9.45240 cpu_time_mfcc=0.00719 cpu_time_infer=9.44521
> ../audio//8455-210777-0068.wav
your power is sufficient i said
cpu_time_overall=10.29218 cpu_time_mfcc=0.00686 cpu_time_infer=10.28532

real	0m33,835s
user	0m20,264s
sys	0m26,509s

So, with the mmap()-enabled protobuf, the first inference now works, but it fails at the second in a row :)

@MichalMrozek
Contributor

Thanks for checking. Can you share repro steps for the multi-file tests?
Sharing the input files would also help.


lissyx commented Mar 18, 2018

@MichalMrozek Sure, I'm sending those by email to you.

@HoppeMateusz
Contributor

Hi, I am running the first command line provided with the new input files:
./deepspeech ../models/output_graph.pb ../models/alphabet.txt ../audio/ -t

And I get a SIGABRT after some mmap operation fails (from strace) that does not come from libigdrcl.so:

mmap(NULL, 33558528, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fc6b7fee000
mmap(NULL, 201330688, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
brk(0x5652819bf000) = 0x5652759b8000
mmap(NULL, 201461760, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(0x7fc744000000, 67108864, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7fc697fea000
munmap(0x7fc697fea000, 67108864) = 0
mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7fc693fea000
munmap(0x7fc693fea000, 90112) = 0
munmap(0x7fc698000000, 67018752) = 0
mprotect(0x7fc694000000, 135168, PROT_READ|PROT_WRITE) = 0
mmap(NULL, 201330688, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
futex(0x7fc862f951a0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
write(2, "terminate called after throwing "..., 48terminate called after throwing an instance of ') = 48
write(2, "std::bad_alloc", 14std::bad_alloc) = 14
write(2, "'\n", 2'
) = 2
write(2, " what(): ", 11 what(): ) = 11
write(2, "std::bad_alloc", 14std::bad_alloc) = 14
write(2, "\n", 1
) = 1
rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
tgkill(32196, 32196, SIGABRT) = 0
--- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, si_pid=32196, si_uid=0} ---

As I understand it, the above command line is working for you?
How much RAM do you have installed? I'm testing on a 3GB BDW.

I've also run the second command line (the mmap version):
./deepspeech ../models/output_graph.pbmm ../models/alphabet.txt ../audio/ -t

  • this one finishes without problems.

I'll try more times to see if I get the same error as you.

Thanks,
Mateusz


lissyx commented Mar 20, 2018

@mejch Good catch, it does require some memory: valgrind --tool=massif measures an allocation peak around ~4.5GB on my desktop with that model. I can share a smaller model with you, if it helps.
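
The measurement mentioned above can be reproduced roughly like this (model and audio paths are the ones used earlier in this thread):

```shell
# Measure peak heap usage of the run; massif writes massif.out.<pid>:
valgrind --tool=massif ./deepspeech ../models/output_graph.pbmm \
    ../models/alphabet.txt ../audio/ -t

# Summarize; the peak snapshot appears near the top of the output:
ms_print massif.out.*
```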

@HoppeMateusz
Contributor

That would be very helpful for further debugging.


lissyx commented Mar 20, 2018

@mejch It should have been shared with your colleague @MichalMrozek; tell me if you run into any other issues :-)

@HoppeMateusz
Contributor

Thanks a lot,
with the smaller model (test.frozen.494_e50_master.LSTM.ldc93s1.pb2) I have a PASS.

But I was not able to reproduce the problem with output_graph.pbmm.
I pushed one fix (Free allocated BOs for read-only user pointer). Can you check if it resolves the problem?

From the log above, I see that abort was called:
Abort was called at 180 line in file:
/home/alex/codaz/Mozilla/DeepSpeech/Intel-Neo/neo/runtime/os_interface/linux/drm_buffer_object.cpp
Abandon (core dumped)

If the fix does not resolve the issue, can you share the callstack and errno code when the abort is called? It is now at line 172 in drm_buffer_object.cpp, in BufferObject::exec():

UNRECOVERABLE_IF(true);

That will help us understand what leads to this failure in exec.

Thanks,
Mateusz


lissyx commented Mar 21, 2018

@mejch Interesting. Unfortunately, I won't be able to test that soon, for personal reasons, but as soon as I can, I'll keep you posted.

@HoppeMateusz
Contributor

Ok, no problem. I'll try on a different machine and let you know if I was able to reproduce and fix this.

@BartoszDunajski
Contributor

@lissyx I added a few additional improvements for the Kmd Notify mechanism (0e41bc7).
Performance should be improved significantly.

@HoppeMateusz
Contributor

Hi @lissyx,
I managed to reproduce the SIGABRT and pushed a fix:

User pointer read-only memory fix - aa088da
With this fix, deepspeech finished successfully.


lissyx commented Mar 29, 2018

Awesome, I'll try to find some time to test that ASAP, but I can't promise any ETA :)

@HoppeMateusz
Contributor

No problem, I hope your issues are fixed now.


lissyx commented Apr 5, 2018

@mejch Okay, I was able to test the updated code on my new laptop (i7-8650U), and it seems the issue is fixed :). All inferences can be run with both .pb and .pbmm.

@BartoszDunajski I'll have to retest on my previous laptop, because I'm getting weird results: it doesn't seem much faster on the i7-8650U compared to the i7-5600U.

@MichalMrozek
Contributor

In terms of GPU performance, we are comparing Intel® HD Graphics 5500 and Intel® UHD Graphics 620.
Both have the same number of execution units: 24.

Intel® HD Graphics 5500 has a theoretical compute power of ~364 GFLOPS, while Intel® UHD Graphics 620 has ~384-400.

The bottom line is that for compute-bound workloads the performance delta may follow the theoretical compute power, which is not significantly higher for Intel® UHD Graphics 620.

@MichalMrozek
Contributor

@lissyx It looks like all the problems are resolved; can we close this issue?


lissyx commented Apr 14, 2018

@MichalMrozek It's fine by me. Is it an issue if I post new comments to report on my latest retries? What would be the best communication channel for further discussion about improvements, if needed? Filing a new issue?

@MichalMrozek
Contributor

I like the 1 problem/suggestion == 1 issue approach.
If all the problems mentioned here are fixed, then I suggest closing this issue.
If you have any new problems or improvement suggestions, I suggest creating a new issue.
Frankly, this started as an ICD problem, which has been fixed for a long time; for the sake of the reader, it is better to encapsulate problems with a proper description.


lissyx commented Apr 14, 2018

@MichalMrozek I totally agree with you; let's close this, but I'll keep the work going :). Thanks for your help, we made a lot of progress :)

@lissyx lissyx closed this as completed Apr 14, 2018

lissyx commented May 3, 2018

@BartoszDunajski I played a bit with those on my new laptop, and it looks like specific tuning is not required anymore! Thanks!
