ARM cross-compilation (tl;dr: use proper SPIR target) #117
Comments
For context, I'm using current https://github.com/lukeiwanski/tensorflow/tree/integration/1.8 |
Okay, first mistake on my side, it looks like I should |
Using |
Okay, so it seems Also, proper value for |
Looks like clang sees different things when cross-compiling. Currently stuck with those errors on protobuf, never had any issue cross-compiling for armv7 and armv8 with gcc 4.9.4 and gcc 7.2.1:
|
To sum up:
|
Hello lissyx, Our cross-compilation is still experimental and is indeed not documented anywhere yet. Thank you for your interest in it. You seem to be really close to getting it working. Here are a few tips that should help you:
On a side note if you are using the tip of dev/amd_gpu you will need to update to ComputeCpp CE 0.8.0 but that is not needed for cross-compiling. Hope this helps! |
Thanks @Rbiessy for your feedback. So it means with my current setup, it should work up to the link stage (because currently my libComputeCpp.so is amd64 and not armv7). I did pass I'm targeting ARMv7 and not ARMv8 so I guess I need to use |
BTW i'm using |
Yes it should fail to link. |
Would you be able to share which toolchain you are using on your side @Rbiessy ? And which ARM target ? |
@Rbiessy Well, when I say not being picked by configure I mean I run it with those in the env:
But then they are not written into I'll give a try to ComputeCpp 0.8.0, but I'm not sure. It feels like there's some mess in the include directories. I'm trying that with GCC 4.9.4 from Linaro: https://releases.linaro.org/components/toolchain/binaries/4.9-2017.01/arm-linux-gnueabihf/gcc-linaro-4.9.4-2017.01-x86_64_arm-linux-gnueabihf.tar.xz |
I tried the cross-compilation with I see, if I remember correctly it was because we didn't want to introduce even more questions specific to ComputeCpp in the configure. Finally it is pretty much guaranteed that you will have more issues specific to 32bit architectures. We plan to work on that at some point. |
Thanks! We make no use of Python, so we should be good on that side. I'll verify build with armv8 as well, to check if I'm doing something wrong or if that is just the current status of the build support :) |
As expected: using 0.8.0 does not help :-). I'll check targeting ARM64. |
@Rbiessy So, using 0.8.0-ubuntu-14.04-x86_64, gcc 6.3.1 aarch64, I'm facing another build error, with this head:
|
I'm getting the same with |
I think this is because integration/1.8 uses an old version of Eigen. I'd suggest you bump the Eigen version to the commit 410527dff31d (have a look at lukeiwanski/tensorflow@0c833af to see how to do that). |
Right, thanks, I totally understand it might be broken. But so far, it seems like updating eigen does the trick :-). So at least now I'm mostly pretty sure I'm doing it right, and I know it's more than expected to fail. I'll try to move forward on ARM build for the RPi3, hopefully getting something in the end :-). EDIT: After making sure that I replaced |
Successful build as well with:
|
Same setup, it's failing:
|
So the Aarch64 build works, but the arm32 fails? Just to clarify that's what's happening. |
@DuncanMcBain That's exactly that, aarch64 works, but arm32 does not. I just re-verified, on a tensorflow r1.8 branch with cross-compilation for RPi3 using GCC 4.9.4, and it also does build properly. I have a feeling that the clang in the middle is doing funny things ? |
Possible! Apologies if you've posted it, but do you have a build log for the 32-bit failure? Scanning upthread I only saw the Aarch64 one (which I assume is now fixed). |
Yeah, there was one above, but it's not really meaningful anymore. I'm making a clean one, and attaching that. |
Here is an up-to-date log, @DuncanMcBain: build.log |
Thanks! Immediately I can see that there are errors relating to the size of size_t and some pointers - could you try setting the TF_BITCODE_TARGET to spir, not spir64? It should make a difference, I hope... |
Oh, nice catch. I should have thought about that. Build is going much further now, let's see how much :-) |
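For readers following along, the spir versus spir64 choice discussed above comes down to the pointer width of the target. The following is only a sketch of that rule of thumb: `spir_target_for` is a hypothetical helper, not part of TensorFlow or ComputeCpp, and the only values taken from the thread are the two target names.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a TensorFlow or ComputeCpp API): maps a target's
// pointer width in bytes to the SPIR flavour named in this thread.
inline std::string spir_target_for(std::size_t pointer_bytes) {
    switch (pointer_bytes) {
        case 4: return "spir";    // 32-bit targets such as ARMv7 / arm32
        case 8: return "spir64";  // 64-bit targets such as aarch64
        default: return "";       // unknown width: no suggestion
    }
}
```

Calling `spir_target_for(sizeof(void*))` in a program compiled for the target would give the value that matches that target's size_t and pointer sizes.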
Okay, further there's a narrowing error:
Now, those seem to be Eigen-specific. Given this is already a patched version of Eigen, and this is related to sizes that look like 64 bits, are those just side effects of the current focus on aarch64? Adding |
Update: I could finish the build, and a first attempt would fail as expected because vc4 requires root access for now:
Forcing that under sudo, it's going further:
Now waiting to see the output. The |
Wow that's great! I've seen the C++11 narrowing thing before, I think Eigen doesn't really care about 32-bit builds (which is a shame). I'm also surprised, it looks like Werror is turned on (normally I'd not do that for cross-compile builds which will always tend to be a bit more warny). Here's hoping it works! |
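The exact Eigen error is not reproduced in this thread, so the following is only a generic illustration of the C++11 narrowing rule mentioned above: a 64-bit value placed inside braces that initialize a 32-bit member is ill-formed unless the conversion is explicit. `checked_narrow`, `Extent` and `make_extent` are hypothetical names and are not taken from Eigen.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>

// Hypothetical helper (not part of Eigen): converts a 64-bit size to a
// 32-bit index, throwing instead of silently truncating.
inline std::int32_t checked_narrow(std::int64_t value) {
    if (value < std::numeric_limits<std::int32_t>::min() ||
        value > std::numeric_limits<std::int32_t>::max()) {
        throw std::range_error("value does not fit in 32 bits");
    }
    return static_cast<std::int32_t>(value);
}

// Illustrative aggregate with 32-bit members.
struct Extent {
    std::int32_t rows;
    std::int32_t cols;
};

inline Extent make_extent(std::int64_t rows, std::int64_t cols) {
    // Extent e{rows, cols};  // ill-formed in C++11: narrowing inside braces
    return Extent{checked_narrow(rows), checked_narrow(cols)};
}
```

An explicit, range-checked conversion like this keeps the build free of narrowing diagnostics without hiding a value that genuinely does not fit.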
Running for two hours, nothing :'(. Not even error or anything. |
Oh. Presumably you'd have expected a failure or at least some kind of output by now? I'll keep an eye on the issue you've made in the VC4CL repo, if nothing else I have a Pi at home that doesn't do a lot! Would be cool to put a project on there :) |
Yes. I let it run during the night, and nothing. As you just said, I'd expect some output, or some error. Even enabling more TensorFlow logging reveals no further activity. (log attached) |
@DuncanMcBain Do we have way to dump OpenCL code somehow? This might help doe300 to identify what's going on. |
It's... tricky. SYCL as an API doesn't really provide a way to do this right now. It's not really OpenCL C either, it's an intermediate format, which might be the problem? Perhaps VC4CL can deal with it directly, I don't know. I think the problem might be that the kernels are large. That said, if nothing happened overnight... that does seem rather extreme. I will put together some instructions that hopefully will let you crosscompile the SDK so you can test with that (it's much, much smaller, so might be easier to reproduce and debug if it goes wrong too). I'll try to get those to you this afternoon. |
@DuncanMcBain Not sure if you saw the latest developments on the issue, but it looks like VC4CL has nothing more to do. I don't really know how I should get more debug toggled for that. Can we dump the kernels that are being passed to |
It'd be rather tricky to do so from TensorFlow, I'm afraid. I see it looks like you might have managed to reproduce this hanging issue purely from the Vc4cl side of things, right? It would be easier to get our hands on the kernels from the SDK, largely because intercepting the inner workings of the TensorFlow build process is challenging. It's much easier with CMake. We have a script called "spir_extract" which will emit the SPIR code for a given integration header. You can then disassemble it or whatever you need to do. It's in the tools directory of the SDK. If you'd like to crosscompile the SDK and look at the kernels in there, let me know. |
It feels more like something is locked up. Does this gdb stack help? I got it by running deepspeech under gdb, waiting for the long wait-condition-infinite, and then hitting CTRL+C:
|
Okay, so, I've somehow got a place inside VC4's Any attempt to 'cont' then CTRL+C gives me the same backtrace as above. Does that shed any light @DuncanMcBain ? |
Yes, kind of! As far as I can tell, TensorFlow is successfully submitting the copy to the device queue, then Eigen calls synchronize() which I believe calls down into cl::sycl::queue::wait_and_throw(). I think, but can't be sure, that this is the prelude to the backtrace you are seeing. This might be helpful to the developer of VC4CL, as it looks like it hangs at the very first copy-on. You've done a great job here, excellent investigation! FWIW, wait_and_throw() will not return until all events on the queue have finished, so it looks like somehow the CL implementation is never completing the copy. Maybe the VC4CL developer can help from here? |
So we enqueue commands, and then, there's a flush. This is a link to the current implementation of flush: https://github.com/doe300/VC4CL/blob/7d5d906c8e2e69ff94ae605cdbfe1f7a32c87833/src/CommandQueue.cpp#L103-L107
|
I'm not sure that's the issue - as far as I can tell, VC4CL has an empty flush method because commands are worked on immediately when submitted to the queue, so there's nothing to flush. |
Right. I'm not making a lot of progress, except that it seems there's an event that reaches vc4cl, and then it's being injected. I'm not sure yet if it's being properly used or not. |
If there's any further information we can provide for the developer of VC4CL, please let us know! This same code runs on all other OpenCL platforms we've tried, and is quite intense, so it is possible that some unusual code paths are being hit. |
@DuncanMcBain Okay, after fighting with some deadlocks and other issues in vc4cl, I've luckily managed to get something running. It failed this way (after removing the noise from my own output):
Rebuilding with |
That's odd, why wasn't snapshot included before? Do we not build it as part of our SYCL port or something? |
What strikes me the most is that this operation is not even registered for the CPU according to your log. It should be registered for both CPU and SYCL, I have no idea what could have caused that. |
No, don't worry about that @DuncanMcBain and @Rbiessy, when we build libdeepspeech, we manually select the kernels to avoid useless-space-taking code :-). |
Ohh ok gotcha so that's why you said
|
So, the update is that now I have to deal with compilation issues when running kernels. Which is good. |
Ok, I don't think there's anything needed here. The remainder of the discussion is specific to the vc4 driver so far, and it is happening on their repo. I'll file new issues if moving forward with the driver reveals issues on the ComputeCpp side. |
As mentioned in mozilla/DeepSpeech#1346 I'm currently investigating how much we can rely on the OpenCL VC4CL [https://github.com/doe300/VC4CL#opencl-support] driver to leverage RPi3's GPU.
So far, I have successfully built the driver with a Linaro cross-compiler, and vc4c's test suite somehow works. I could also verify that computecpp_info can at least see things.
Now I am facing a dumb issue: how to cross-compile for ARM from the SYCL branches. We have a setup to cross-compile for ARM and ARMv8 on https://github.com/mozilla/tensorflow, so I blindly did a configure step referencing the ARM version of ComputeCpp:
And then, I built it:
This does build an ARM lib, linked with libComputeCpp.so. But at runtime, it does not seem to run OpenCL.
Now, I've also stumbled upon TF_SYCL_CROSS_TOOLCHAIN and TF_SYCL_CROSS_TOOLCHAIN_NAME, but they lack documentation, and trying to use them fails:
which seems expected, given
ComputeCpp-CE-0.7.0-Ubuntu-14.04-ARM_32/bin/compute++: ELF 32-bit LSB executable, ARM, EABI5 version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-armhf.so.3, for GNU/Linux 2.6.32, BuildID[sha1]=df02ab122bb64fc87724de838f7d5a45b8e3f1a5, not stripped
So, what step am I missing to be able to cross-compile ?
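The `file` output above shows that this compute++ binary is itself an ARM executable, so it cannot run on an x86_64 build host. As a minimal sketch of how that information is encoded, the function below reads the machine field from the first bytes of an ELF header. `elf_machine` is a hypothetical helper, and only three machine codes are covered.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: names the machine an ELF file targets, based on the
// e_ident and e_machine fields at the start of the ELF header.
inline std::string elf_machine(const std::vector<std::uint8_t>& header) {
    if (header.size() < 20 || header[0] != 0x7f || header[1] != 'E' ||
        header[2] != 'L' || header[3] != 'F') {
        return "not ELF";
    }
    const bool little_endian = (header[5] == 1);  // EI_DATA: 1 = LSB, 2 = MSB
    // e_machine is a 16-bit field at byte offset 18.
    const std::uint16_t machine = little_endian
        ? static_cast<std::uint16_t>(header[18] | (header[19] << 8))
        : static_cast<std::uint16_t>((header[18] << 8) | header[19]);
    switch (machine) {
        case 40:  return "ARM";      // EM_ARM
        case 62:  return "x86-64";   // EM_X86_64
        case 183: return "AArch64";  // EM_AARCH64
        default:  return "other";
    }
}
```

Feeding it the first 20 bytes of a compiler binary is a quick way to check that a tool meant to run on the build host was not built for the target instead.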