Random disconnect during transmission over WiFi #114

@osrf-migration

Original report (archived issue) by Bart Cox (Bitbucket: bcox_pv).


Description

When we use ignition transport over WiFi, we experience long delays on (asynchronous) service calls and disconnects on pub/sub traffic. These seem to be accompanied by frequent disconnect and connect events detected in the discovery layer. Interestingly, the delays seem to affect a few nodes (but not all) at once and seem to resolve at the same time as well. We have been able to rule out deadlock-like situations, because our nodes still accept and process service requests from nodes that are not affected by the delay. Once the delay resolves, the delayed messages seem to arrive all at once.

We tested this problem with the basic example code from the source. When running the basic examples publisher.cc and subscriber.cc over WiFi, random disconnection callbacks are fired while both machines are still connected to the same network. The publisher/subscriber example shows similar communication problems: it disconnects within a few minutes, and in severe cases within seconds.

To rule out relevant external factors, we used an isolated network with no other active clients on a professional-grade router and access point, but that seemed to have no influence on the robustness of the connections. During our tests we were also able to exclude the Ubuntu version (16.04/18.04), the client hardware/architecture, and the ignition-transport version (5.xx - 7.xx) as causes.

When we run the same tests on the same machines over a wired network, no long delays or disconnects occur; the connection is stable.

Steps to Reproduce

  • Use Ubuntu 18.04
  • Install dependencies
sudo apt-get update \
          && sudo apt-get -y install \
            gnupg lsb-release \
            cmake pkg-config cppcheck git mercurial build-essential curl \
            libprotobuf-dev protobuf-compiler libprotoc-dev libzmq3-dev uuid-dev \
            doxygen ruby-ronn libsqlite3-dev g++-8 \
          && sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-8 800 --slave /usr/bin/g++ g++ /usr/bin/g++-8 --slave /usr/bin/gcov gcov /usr/bin/gcov-8
# Add the OSRF package repositories and their signing key (writing these files requires root)
echo "deb http://packages.osrfoundation.org/gazebo/ubuntu-stable $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/gazebo-stable.list
echo "deb http://packages.osrfoundation.org/gazebo/ubuntu-prerelease $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/gazebo-prerelease.list
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys D2486D2DD83DB69272AFE98867170598AF249743
  • Install ignition libraries
sudo apt-get update \
          && sudo apt-get -y install \
            libignition-cmake2-dev \
            libignition-math6-dev \
            libignition-msgs4-dev \
            libignition-tools-dev
  • Build ignition-transport 7.7.0 from source

hg clone 'https://osrf-migration.github.io/ignition-gh-pages/#!/ignitionrobotics/ign-transport/' ign-transport \
&& cd ign-transport \
&& hg up ignition-transport7_7.0.0 \
&& mkdir -p build \
&& cd build \
&& cmake ../ \
&& make -j4 \
&& sudo make install -j4 \
&& cd ../example \
&& mkdir -p build \
&& cd build \
&& cmake .. \
&& make -j4
  • Run the publisher example from the source code on machine A

export IGN_PARTITION=transmission_test 
export IGN_VERBOSE=1 
export IGN_IP=${OWN_IP} 
./build/publisher
  • Run the subscriber example from the source code on machine B

export IGN_PARTITION=transmission_test 
export IGN_VERBOSE=1 
export IGN_IP=${OWN_IP} 
./build/subscriber

Expected behavior:

No disconnection callbacks when the machine is connected to the (wireless) network

Actual behavior:

After 2 minutes the subscriber gets a disconnect callback and stops receiving messages. The publisher keeps sending messages.

Reproduces how often:

Periodically.

Versions

  • Ubuntu 18.04
  • source install
  • ignition-transport 7.7.0

Additional context

Our first assumption was that UDP multicast traffic carrying discovery information might get lost over a WiFi connection. We therefore experimented with different parameter sets in the discovery layer, such as a lower heartbeat interval and a higher silence interval. Only a longer silence interval improved the behavior in our tests, and only at large values of 20 seconds or more.
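The effect of the silence interval can be illustrated with a small model. This is only a sketch under stated assumptions, not the discovery implementation: it assumes a peer is declared disconnected once no heartbeat has arrived for longer than the silence interval, and the 1 s heartbeat period, the 5 s loss burst and the function names are illustrative values chosen here.

```python
def received_heartbeats(interval_ms, horizon_ms, outages):
    """Times (ms) of heartbeats that arrive, given loss windows [start, end)."""
    times = []
    for t in range(0, horizon_ms + 1, interval_ms):
        lost = any(start <= t < end for start, end in outages)
        if not lost:
            times.append(t)
    return times


def longest_silence(times):
    """Longest gap (ms) between two consecutive received heartbeats."""
    return max((b - a for a, b in zip(times, times[1:])), default=0)


def declared_disconnected(times, silence_ms):
    """True if some gap exceeds the silence interval (assumed timeout rule)."""
    return longest_silence(times) > silence_ms


# Illustrative values only: 1 s heartbeats for 30 s, one 5 s burst of loss.
times = received_heartbeats(1000, 30000, [(10500, 15500)])
print(longest_silence(times))               # 6000
print(declared_disconnected(times, 3000))   # True
print(declared_disconnected(times, 20000))  # False
```

Under this model a short silence interval turns every multi-second loss burst into a disconnect, while a 20 s interval rides it out. That would be consistent with what we saw, but it does not explain where the bursts come from.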

We have further tried forcing all the traffic over unicast by modifying the relay functionality such that all discovery-related messages are sent over unicast within the same network (but not relayed). We were hoping that this would lead to more stable connections, but we did not see any significant improvement.
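To check the multicast-loss assumption independently of ignition-transport, a sequence-numbered probe along the following lines could be used. It is a minimal sketch: the group address, port and function names are hypothetical test values chosen for illustration and are not the ones used by the discovery layer.

```python
import socket
import struct
import time

GROUP = "239.255.0.1"  # hypothetical test group, not ignition-transport's
PORT = 50000           # hypothetical test port


def send_probes(count, interval_s=0.1, group=GROUP, port=PORT):
    """Send `count` sequence-numbered datagrams to the multicast group."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
    try:
        for seq in range(count):
            sock.sendto(struct.pack("!I", seq), (group, port))
            time.sleep(interval_s)
    finally:
        sock.close()


def receive_probes(duration_s, group=GROUP, port=PORT):
    """Join the group and collect sequence numbers for `duration_s` seconds."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    # ip_mreq: multicast group followed by the local interface (any)
    mreq = struct.pack("4s4s", socket.inet_aton(group),
                       socket.inet_aton("0.0.0.0"))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    sock.settimeout(0.5)
    seen = []
    deadline = time.monotonic() + duration_s
    try:
        while time.monotonic() < deadline:
            try:
                data, _ = sock.recvfrom(64)
            except socket.timeout:
                continue
            if len(data) == 4:
                seen.append(struct.unpack("!I", data)[0])
    finally:
        sock.close()
    return seen


def loss_stats(seen, sent_count):
    """Return (lost_count, longest_run_of_consecutive_losses)."""
    got = set(seen)
    lost = run = longest = 0
    for seq in range(sent_count):
        if seq in got:
            run = 0
        else:
            lost += 1
            run += 1
            longest = max(longest, run)
    return lost, longest
```

Running `receive_probes` on machine B while machine A calls `send_probes`, then passing the result to `loss_stats`, gives both the total loss and the longest burst. On hosts with several network interfaces the join may need to name the WiFi interface address instead of `0.0.0.0`.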
