CREATE_FAILED with 2.9.1 #2053

rems0 · 2020-09-16T20:57:31Z

Hi all,

I am trying to create a new cluster using ParallelCluster 2.9.1, reusing resources already in use by a previous cluster created with 2.7.0.
The reason for reusing the VPC, subnet and SG is to be able to mount the same EFS while staying in the same AZ.
The new cluster should have 2 queues, one with spot nodes and the other with on-demand nodes, both using the c5a.12xlarge and c5.12xlarge instance types.
cloud-init.log shows the script /var/lib/cloud/instance/scripts/part-002 failing with an error. See lines 714 to 728 in cloud-init.log.

Environment:

AWS ParallelCluster 2.9.1
Configuration file: attached as config.txt

How to reproduce:

(apc-2.9.1) [centos@ip-172-31-6-186 ~]$ pcluster create -nr -c /home/centos/.parallelcluster/config-2.9.1-v1 HR-11
Beginning cluster creation for cluster: HR-11
Creating stack named: parallelcluster-HR-11
Status: parallelcluster-HR-11 - CREATE_FAILED                                   
Cluster creation failed.  Failed events:
  - AWS::CloudFormation::Stack parallelcluster-HR-11 The following resource(s) failed to create: [MasterServerSubstack]. 
  - AWS::CloudFormation::Stack MasterServerSubstack Embedded stack arn:aws:cloudformation:eu-west-1:434355056914:stack/parallelcluster-HR-11-MasterServerSubstack-3HGXLXUYW8KU/3b2d3af0-f855-11ea-ba71-0adb030b5f58 was not successfully created: The following resource(s) failed to create: [MasterServer]. 
    - AWS::CloudFormation::Stack parallelcluster-HR-11-MasterServerSubstack-3HGXLXUYW8KU The following resource(s) failed to create: [MasterServer]. 
    - AWS::EC2::Instance MasterServer Received FAILURE signal with UniqueId i-0c9bb67914cb297d8

Attached log files:
messages.txt
cloud-init.log

Kind regards,
Richard


demartinofra · 2020-09-17T08:00:02Z

Hi Richard,

can you please provide /var/log/chef-client.log and /var/log/cfn-init.log from the head node? They are also available in CloudWatch if you prefer downloading them from there.

Francesco

rems0 · 2020-09-17T15:45:43Z

Hi Francesco,

cfn-init.log

chef-client.log is not there.

[centos@ip-10-0-5-109 ~]$ ls -l /var/log
total 560
drwxr-xr-x. 2 root root 37 Sep 14 22:55 amazon
drwxr-xr-x. 2 root root 6 Sep 14 23:10 anaconda
drwx------. 2 root root 23 Sep 16 19:47 audit
-rw------- 1 root utmp 0 Sep 16 19:47 btmp
-rw-r--r-- 1 root root 0 Sep 16 19:48 cfn-init-cmd.log
-rw-r--r-- 1 root root 4400 Sep 16 19:54 cfn-init.log
-rw-r--r-- 1 root root 467 Sep 16 19:54 cfn-wire.log
drwxr-xr-x. 2 chrony chrony 6 Aug 8 2019 chrony
-rw------- 1 root root 231211 Sep 17 15:40 cloud-init.log
-rw------- 1 root root 2101 Sep 17 15:40 cron
drwxr-xr-x. 2 lp sys 57 Sep 16 19:48 cups
drwxrwxrwt. 2 dcv dcv 6 Aug 21 15:20 dcv
-rw-r--r-- 1 root root 31435 Sep 17 15:40 dmesg
-rw-r--r-- 1 root root 31626 Sep 16 19:47 dmesg.old
drwx--x--x. 2 root gdm 6 Apr 1 02:41 gdm
drwxr-xr-x. 2 root root 6 Apr 2 13:17 glusterfs
drwx------. 2 root root 6 Apr 2 13:14 httpd
-rw-r--r--. 1 root root 292292 Sep 17 15:41 lastlog
drwx------. 3 root root 18 Sep 14 23:07 libvirt
drwxr-xr-x. 2 root root 6 Sep 14 23:10 mail
-rw------- 1 root root 378 Sep 17 15:40 maillog
-rw-r--r-- 1 root root 181039 Sep 17 15:41 messages
drwx------. 2 munge root 6 Sep 14 22:56 munge
drwxr-xr-x. 3 root root 33 Sep 14 22:48 nvidia
drwxr-xr-x. 3 root root 18 Sep 14 23:07 pluto
drwxr-xr-x. 2 root root 6 Aug 8 2019 qemu-ga
drwxr-xr-x. 2 root root 6 Apr 22 09:10 rhsm
drwx------. 3 root root 17 Sep 14 23:06 samba
-rw------- 1 root root 9413 Sep 17 15:41 secure
drwx------. 2 root root 6 Jun 10 2014 speech-dispatcher
drwxr-xr-x. 3 root root 21 Sep 14 23:07 swtpm
drwxr-xr-x. 2 root root 23 Sep 16 19:48 tuned
-rw-rw-r-- 1 root utmp 6912 Sep 17 15:41 wtmp

Regards,
Richard

enrico-usai · 2020-09-21T09:34:27Z

Hi Richard,

Cluster creation issue

From the log files, your instance is not able to contact the AWS CloudFormation endpoint.
The main reason could be that you're specifying use_public_ips = false and the subnets you're specifying probably don't have an internet gateway or a NAT gateway, so the instances are not able to retrieve packages and data from the web.

As described in the networking documentation:

When use_public_ips is set to false, the VPC must be correctly set up to use the Proxy for all traffic. Web access is required for both head and compute nodes.

Is the master_subnet_id a public subnet (with an internet gateway) or a private one (with a NAT gateway)?
I'd retry after removing the use_public_ips configuration parameter.
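
One way to answer the public/private question is to inspect the routes of the route table associated with master_subnet_id. The following is a minimal sketch, not part of ParallelCluster: classify_route_table is a hypothetical helper, and it assumes the route dicts have the shape returned by boto3's EC2 describe_route_tables call (DestinationCidrBlock, GatewayId, NatGatewayId, State). A subnet without an explicit association uses the VPC's main route table, which this sketch does not look up.

```python
from typing import Any, Dict, List

DEFAULT_ROUTE = "0.0.0.0/0"


def classify_route_table(routes: List[Dict[str, Any]]) -> str:
    """Classify a route table by where its default IPv4 route points.

    Returns "public" (internet gateway), "private-nat" (NAT gateway),
    or "isolated" (no usable default route found).
    """
    for route in routes:
        # Only the default route decides outbound internet access.
        if route.get("DestinationCidrBlock") != DEFAULT_ROUTE:
            continue
        # A blackhole route points at a target that no longer exists.
        if route.get("State") == "blackhole":
            continue
        if route.get("GatewayId", "").startswith("igw-"):
            return "public"
        if route.get("NatGatewayId", "").startswith("nat-"):
            return "private-nat"
    return "isolated"
```

The routes could be fetched with boto3.client("ec2").describe_route_tables(Filters=[{"Name": "association.subnet-id", "Values": [subnet_id]}]) and read from RouteTables[0]["Routes"]. A result of "isolated" would be consistent with the head node being unable to reach the CloudFormation endpoint when use_public_ips = false.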

Reusing resources created by another cluster

I see you're using a VPC security group from the old cluster (vpc_security_group_id).
This security group was probably created by the old cluster.

You can reuse the same VPC and subnets, but we don't recommend sharing resources like security groups if they were automatically created during cluster creation, because once they are associated with another cluster you won't be able to remove the old cluster's resources.

If you need to access the EFS from another subnet you can add another EFS mount target in the availability zone/subnet you want to use for the new cluster.

The same applies to EFS resources.
If the EFS was created with your previous cluster, it will be a problem when you try to delete the old cluster.

How to share resources between clusters

The best way to share resources is to create them before cluster creation and pass them as configuration parameters.
E.g.: create a new EFS from the UI, back up your old EFS content into the new one, and then associate it with the new cluster using efs_fs_id.
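
To see which pre-created resources a cluster configuration passes in, the config file can be read with Python's standard configparser. This is only a sketch under assumptions: SAMPLE_CONFIG is a hypothetical ParallelCluster 2.x configuration with placeholder IDs (it is not the config.txt attached to this issue), and preexisting_resources is an illustrative helper, not a ParallelCluster API.

```python
import configparser
from typing import Dict

# Hypothetical ParallelCluster 2.x config; every ID below is a placeholder.
SAMPLE_CONFIG = """
[cluster default]
vpc_settings = public
efs_settings = shared

[vpc public]
vpc_id = vpc-11111111
master_subnet_id = subnet-22222222
vpc_security_group_id = sg-33333333
use_public_ips = false

[efs shared]
shared_dir = efs
efs_fs_id = fs-44444444
"""

# Parameters that point at resources created outside the cluster stack.
REUSE_KEYS = ("vpc_id", "master_subnet_id", "vpc_security_group_id", "efs_fs_id")


def preexisting_resources(config_text: str) -> Dict[str, str]:
    """Return the externally created resource IDs referenced by a config."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    found: Dict[str, str] = {}
    for section in parser.sections():
        for key in REUSE_KEYS:
            if parser.has_option(section, key):
                found[key] = parser.get(section, key)
    return found


if __name__ == "__main__":
    for name, value in preexisting_resources(SAMPLE_CONFIG).items():
        print(f"{name} = {value}")
```

Any ID listed this way should belong to a resource created independently of a cluster, so that deleting either cluster does not affect the other.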

Let us know if it helps.

rems0 · 2020-09-21T21:38:35Z

Hi Enrico,

your answer helped a lot and our problem is solved.
Removing use_public_ips did it. I had wrongly left it unset on the previous cluster, thus getting compute nodes with public IPs that I did not want to have. On this new cluster I set it to false, but missed the corresponding gateway settings.

I have now created a new SG and added it to the EFS, so that's fine now.

I think I've created the EFS by hand last time, but not sure. Where could I check that? In the old stack events or resources?

Many thanks for your help,
Richard

enrico-usai · 2020-09-22T08:52:19Z

Hi Richard,
you can open the CloudFormation console and check the value of the EFSOptions input parameter of your old stack.

EFSOptions is a comma-separated list of EFS-related options, and the second one corresponds to the value of the efs_fs_id parameter you used to create the cluster.

If it has a value (fs-xxx), it means you created the cluster by passing an existing EFS file system id.
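
The check described above can be sketched in a few lines of Python. The helper name efs_id_from_options and the example strings are hypothetical; the only facts taken from this thread are that EFSOptions is comma-separated and that its second field holds efs_fs_id. Treating any second field that does not start with "fs-" as "no existing file system was passed" is an assumption of this sketch.

```python
from typing import Optional


def efs_id_from_options(efs_options: str) -> Optional[str]:
    """Return the file system id from an EFSOptions string, if one was passed.

    The second comma-separated field corresponds to efs_fs_id. Anything
    that does not look like an EFS id (fs-...) is treated as unset.
    """
    fields = [field.strip() for field in efs_options.split(",")]
    if len(fields) < 2:
        return None
    fs_id = fields[1]
    return fs_id if fs_id.startswith("fs-") else None
```

For example, a hypothetical value such as "efs,fs-0123abcd,generalPurpose" would yield "fs-0123abcd", meaning the EFS was created outside the cluster stack.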

rems0 · 2020-09-22T14:40:04Z

Hi Enrico,
Perfect. Found it. The fs-xxx value is there, so I created the EFS myself. Good.

Thanks again for all your help, closing this now.

Kind regards,
Richard

enrico-usai added the help wanted label Sep 21, 2020

rems0 closed this as completed Sep 22, 2020
