Could not modify bound address for processes created for spark master #434

@truongle1501

I have an on-premise cluster of EC2 machines with 1 head node and N worker nodes. Each node has its own internal IP address, and the nodes can communicate with each other over TCP on multiple ports.

spark version: 3.5.4
raydp version: 1.6.2
ray version: 2.40.0

  • The Ray head node has the custom resource {"spark_master": 1}
  • Each Ray worker node has the custom resource {"spark_executor": ncpus}
  • My cluster is behind NAT, but I don't think this is the root cause.

On the head node, I connected to the Ray cluster at 0.0.0.0:6379 and initialized Spark on Ray as follows:

import ray
import os

os.environ['SPARK_HOME'] = '/opt/spark'
os.environ['JAVA_HOME'] = "/usr/lib/jvm/java-11-openjdk-amd64"
runtime_env = {
    "env_vars": {
        "SPARK_HOME": "/opt/spark",
        "JAVA_HOME": "/usr/lib/jvm/java-11-openjdk-amd64"
    }
}
ray.init(runtime_env=runtime_env)

import raydp
spark = raydp.init_spark(
    app_name="RaySpark",
    num_executors=8,
    executor_cores=1,
    executor_memory="2GB",
    configs={
        'spark.driver.host': '0.0.0.0',
        'spark.driver.bindAddress': "0.0.0.0",
        'spark.driver.port': '18001',
        'spark.blockManager.port': '18002',
        "spark.ui.port": "32000",
        'spark.ray.raydp_spark_master.actor.resource.CPU': 0,
        'spark.ray.raydp_spark_master.actor.resource.spark_master': 1,
        'spark.ray.raydp_spark_executor.actor.resource.spark_executor': 1,
        'spark.ray.raydp_spark_executor.actor.resource.cpu': 1,
    }
)
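One thing that may be worth checking: per the Spark configuration docs, spark.driver.host is the address the driver advertises to executors, while spark.driver.bindAddress is the address its listening sockets bind to. Advertising 0.0.0.0 gives executors nothing routable to connect back to. The sketch below is an untested idea, not a confirmed fix for this issue: it guesses the node's internal IPv4 address and builds a variant of the two driver settings. The helper name guess_internal_ip, the probe address, and the driver_configs dict are hypothetical.

```python
import socket


def guess_internal_ip(probe_addr: str = "10.255.255.255") -> str:
    """Best-effort guess of the IPv4 address this host uses for outbound traffic."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # connect() on a UDP socket sends no packets; it only makes the
        # kernel pick a route, and therefore a local source address.
        sock.connect((probe_addr, 1))
        return sock.getsockname()[0]
    except OSError:
        # No usable route (e.g. an isolated host): fall back to loopback.
        return "127.0.0.1"
    finally:
        sock.close()


driver_ip = guess_internal_ip()

# Hypothetical variant of the two driver settings from the configs above:
# advertise a routable address, but keep binding on all interfaces.
driver_configs = {
    "spark.driver.host": driver_ip,
    "spark.driver.bindAddress": "0.0.0.0",
}
```

On a multi-homed or NAT-ed host the guessed address may not be the one the workers can reach, so hard-coding the known 172.31.X.Y address of the head node is the safer choice.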

The processes created for the Spark master were bound to localhost (e.g. 127.0.0.1:<port>), so the Ray executor actors on the worker nodes could not connect to them over 172.31.X.Y:<port>. The relevant lsof entries are marked with "-- this":

java      181389            root  265u  IPv6 765097      0t0  TCP 127.0.0.1:42067 (LISTEN) -- this
java      181389            root  283u  IPv6 761288      0t0  TCP *:10045 (LISTEN)
java      181504            root  287u  IPv6 760549      0t0  TCP *:10046 (LISTEN)
java      181504            root  350u  IPv6 761627      0t0  TCP 127.0.0.1:40337 (LISTEN) -- this
java      181718            root  268u  IPv6 767153      0t0  TCP 127.0.0.1:44859 (LISTEN) -- this
java      181718            root  321u  IPv6 767171      0t0  TCP *:18001 (LISTEN)
java      181718            root  322u  IPv6 770289      0t0  TCP *:32000 (LISTEN)
java      181718            root  389u  IPv6 771265      0t0  TCP *:18002 (LISTEN)
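To make the pattern in the listing easier to re-check after each config change, here is a small sketch that scans lsof-style text and reports the (PID, port) pairs listening on loopback only. The function name, the regular expression, and the embedded sample (the listing above with the annotations removed) are illustrative assumptions, not part of RayDP or lsof.

```python
import re

# Sample copied from the listing above, without the "-- this" annotations.
LSOF_OUTPUT = """\
java      181389            root  265u  IPv6 765097      0t0  TCP 127.0.0.1:42067 (LISTEN)
java      181389            root  283u  IPv6 761288      0t0  TCP *:10045 (LISTEN)
java      181504            root  287u  IPv6 760549      0t0  TCP *:10046 (LISTEN)
java      181504            root  350u  IPv6 761627      0t0  TCP 127.0.0.1:40337 (LISTEN)
java      181718            root  268u  IPv6 767153      0t0  TCP 127.0.0.1:44859 (LISTEN)
java      181718            root  321u  IPv6 767171      0t0  TCP *:18001 (LISTEN)
java      181718            root  322u  IPv6 770289      0t0  TCP *:32000 (LISTEN)
java      181718            root  389u  IPv6 771265      0t0  TCP *:18002 (LISTEN)
"""

# command, PID, ..., "TCP host:port (LISTEN)"
LISTEN_RE = re.compile(r"^(\S+)\s+(\d+)\s+.*\bTCP\s+(\S+):(\d+)\s+\(LISTEN\)")
LOOPBACK_HOSTS = {"127.0.0.1", "localhost", "[::1]"}


def loopback_listeners(text: str) -> list:
    """Return (pid, port) for every listener bound to a loopback address."""
    hits = []
    for line in text.splitlines():
        match = LISTEN_RE.match(line.strip())
        if match and match.group(3) in LOOPBACK_HOSTS:
            hits.append((int(match.group(2)), int(match.group(4))))
    return hits


for pid, port in loopback_listeners(LSOF_OUTPUT):
    print(f"pid {pid} listens on loopback port {port}")
```

Each of the three Java processes has exactly one loopback-only listener on an ephemeral port, while the explicitly configured ports (18001, 18002, 32000) are bound to all interfaces.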

Is there a way to force these processes to bind to 0.0.0.0 (shown as * in the lsof output)?
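For reference, the difference between the two bind targets can be reproduced with plain sockets, independent of Spark or Ray. This is only an illustration of the behavior being asked about; the helper name bound_address is hypothetical.

```python
import socket


def bound_address(bind_host: str) -> tuple:
    """Open a TCP listener on an ephemeral port and return its (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind((bind_host, 0))  # port 0 lets the OS pick a free port
        srv.listen(1)
        return srv.getsockname()


loopback = bound_address("127.0.0.1")  # reachable only from the same host
wildcard = bound_address("0.0.0.0")    # reachable on every local interface

print(loopback[0], wildcard[0])  # prints "127.0.0.1 0.0.0.0"
```

A listener of the first kind appears in lsof as 127.0.0.1:<port> and refuses connections addressed to 172.31.X.Y; a listener of the second kind appears as *:<port>.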

Any suggestions or tweaks would be highly appreciated.
