Description
I have an on-premise cluster of EC2 machines with 1 head node and N worker nodes. Each node has its own internal IP address, and the nodes can communicate with each other over TCP on multiple ports.
spark version: 3.5.4
raydp version: 1.6.2
ray version: 2.40.0
- The Ray head node has resources {"spark_master": 1}
- Each Ray worker node has resources {"spark_executor": ncpus}
- My cluster is behind NAT, but I don't think this is the root cause.
On the head node, I connected to the Ray cluster at 0.0.0.0:6379 and initialized Spark on Ray as follows:
import os

import ray
import raydp

os.environ['SPARK_HOME'] = '/opt/spark'
os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-11-openjdk-amd64'

# Pass the same Spark/Java locations to the Ray workers.
runtime_env = {
    "env_vars": {
        "SPARK_HOME": "/opt/spark",
        "JAVA_HOME": "/usr/lib/jvm/java-11-openjdk-amd64",
    }
}
ray.init(runtime_env=runtime_env)

spark = raydp.init_spark(
    app_name="RaySpark",
    num_executors=8,
    executor_cores=1,
    executor_memory="2GB",
    configs={
        'spark.driver.host': '0.0.0.0',
        'spark.driver.bindAddress': '0.0.0.0',
        'spark.driver.port': '18001',
        'spark.blockManager.port': '18002',
        'spark.ui.port': '32000',
        'spark.ray.raydp_spark_master.actor.resource.CPU': 0,
        'spark.ray.raydp_spark_master.actor.resource.spark_master': 1,
        'spark.ray.raydp_spark_executor.actor.resource.spark_executor': 1,
        'spark.ray.raydp_spark_executor.actor.resource.cpu': 1,
    },
)
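One thing worth ruling out in the configs above: in Spark, spark.driver.host is the address the driver advertises to executors, while spark.driver.bindAddress is the address it listens on. Since 0.0.0.0 is not something a remote node can connect to, a variant that advertises the head node's internal IP may behave differently. The sketch below only builds that part of the configs dict; the helper name build_driver_configs and the address 172.31.0.10 are hypothetical placeholders, and I have not confirmed that this changes the loopback-bound listeners described below.

```python
import ipaddress


def build_driver_configs(advertised_host, driver_port=18001, block_manager_port=18002):
    """Return Spark driver network settings with distinct bind and advertised addresses.

    Hypothetical helper for illustration only. `advertised_host` must be an IP
    address that worker nodes can route to (for example the head node's
    internal IP); hostnames are not handled in this sketch.
    """
    ip = ipaddress.ip_address(advertised_host)  # raises ValueError if malformed
    if ip.is_unspecified or ip.is_loopback:
        # 0.0.0.0 and 127.0.0.1 cannot be reached from another machine.
        raise ValueError(f"{advertised_host} is not reachable from other nodes")
    return {
        "spark.driver.host": advertised_host,    # address advertised to executors
        "spark.driver.bindAddress": "0.0.0.0",   # listen on all interfaces
        "spark.driver.port": str(driver_port),
        "spark.blockManager.port": str(block_manager_port),
    }


# Placeholder address; substitute the head node's real internal IP.
driver_configs = build_driver_configs("172.31.0.10")
```

The resulting dict could be merged into the configs argument of raydp.init_spark alongside the spark.ray.* resource settings.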
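Some of the processes created for the Spark master were bound to localhost (e.g. 127.0.0.1:), so the Ray executor actors on the worker nodes were unable to connect over 172.31.X.Y:<port>. The listening sockets on the head node are shown below; the loopback-bound ones are marked "-- this":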
java 181389 root 265u IPv6 765097 0t0 TCP 127.0.0.1:42067 (LISTEN) -- this
java 181389 root 283u IPv6 761288 0t0 TCP *:10045 (LISTEN)
java 181504 root 287u IPv6 760549 0t0 TCP *:10046 (LISTEN)
java 181504 root 350u IPv6 761627 0t0 TCP 127.0.0.1:40337 (LISTEN) -- this
java 181718 root 268u IPv6 767153 0t0 TCP 127.0.0.1:44859 (LISTEN) -- this
java 181718 root 321u IPv6 767171 0t0 TCP *:18001 (LISTEN)
java 181718 root 322u IPv6 770289 0t0 TCP *:32000 (LISTEN)
java 181718 root 389u IPv6 771265 0t0 TCP *:18002 (LISTEN)
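To make the check repeatable, a small script along these lines can separate loopback-bound listeners from wildcard ones. It is a minimal sketch that assumes the listing is `lsof -i`-style text with the PID in the second column; the function name classify_listeners is my own and not part of any tool.

```python
import re

# Matches the address and port of a listening TCP socket, e.g. "TCP 127.0.0.1:42067 (LISTEN)".
LISTEN_RE = re.compile(r"TCP\s+(?P<addr>\S+):(?P<port>\d+)\s+\(LISTEN\)")

LOOPBACK_ADDRS = {"127.0.0.1", "localhost", "[::1]"}


def classify_listeners(lsof_output):
    """Group (pid, port) pairs by bind address type: loopback, wildcard, or other."""
    result = {"loopback": [], "wildcard": [], "other": []}
    for line in lsof_output.splitlines():
        match = LISTEN_RE.search(line)
        fields = line.split()
        # Skip lines that are not listeners or lack a numeric PID column.
        if not match or len(fields) < 2 or not fields[1].isdigit():
            continue
        entry = (int(fields[1]), int(match.group("port")))
        addr = match.group("addr")
        if addr in LOOPBACK_ADDRS:
            result["loopback"].append(entry)
        elif addr == "*":
            result["wildcard"].append(entry)
        else:
            result["other"].append(entry)
    return result


# Three lines copied from the listing above.
SAMPLE = """\
java 181389 root 265u IPv6 765097 0t0 TCP 127.0.0.1:42067 (LISTEN) -- this
java 181389 root 283u IPv6 761288 0t0 TCP *:10045 (LISTEN)
java 181718 root 268u IPv6 767153 0t0 TCP 127.0.0.1:44859 (LISTEN) -- this
"""

listeners = classify_listeners(SAMPLE)
for pid, port in listeners["loopback"]:
    print(f"pid {pid} listens on loopback only, port {port}")
```

Any entry reported under "loopback" is unreachable from the worker nodes, which is the symptom described here.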
Is there a way to force these processes to bind to 0.0.0.0 (shown as * in the listing above) instead of 127.0.0.1?
Any suggestions or tweaks would be highly appreciated.