Revision as of 19:34, 23 February 2022

AWS Parallel Cluster version 3

Install AWS-CLI: https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html
Parallel Cluster version 3: https://docs.aws.amazon.com/parallelcluster/latest/ug/parallelcluster-version-3.html
Install Pcluster 3: https://docs.aws.amazon.com/parallelcluster/latest/ug/install.html

Setting up a multi-instance cluster

You can now configure pcluster to have multiple queues and multiple compute instance types in a single cluster. This is useful because it lets you choose instance types based on your job (e.g., memory-optimized vs. compute-optimized) or based on spot availability.
To decide which instances to put into your cluster, use the Amazon Spot Instance Advisor (https://aws.amazon.com/ec2/spot/instance-advisor/) to find instances with a low frequency of interruption (<5%). We found that instances with a >5% frequency of interruption don't perform well as spot instances: they either take too long to get into the queue or get kicked out of the queue too often. Use the Spot Instance Advisor to find several available instances that meet your needs within a region of interest, and pair your instance search with the EC2 spot pricing table (https://aws.amazon.com/ec2/spot/pricing/) to find less expensive options.
After finding the region and instances that will meet your needs, build your pcluster config file. See the LADCO modeling instance example below.
Pcluster 3 YAML configuration file documentation: https://docs.aws.amazon.com/parallelcluster/latest/ug/cluster-configuration-file-v3.html
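
A multi-queue configuration might look roughly like the sketch below. This is not the LADCO production configuration: the queue names, instance types, counts, file name, cluster name, and the subnet and key placeholders are all illustrative assumptions. The field names follow the ParallelCluster 3 YAML schema as we understand it, so check them against the configuration file documentation before use.

```shell
#!/bin/bash
# Sketch only: write a hypothetical two-queue ParallelCluster 3 config file.
# Every name, instance type, count, and ID below is a placeholder (assumption).
config_file="${TMPDIR:-/tmp}/ladco-multiqueue.yaml"

cat > "$config_file" <<'EOF'
Region: us-east-2
Image:
  Os: alinux2
HeadNode:
  InstanceType: c5.large
  Networking:
    SubnetId: subnet-xxxxxxxx
  Ssh:
    KeyName: my-key-name
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    # Memory-optimized spot queue
    - Name: memory
      CapacityType: SPOT
      ComputeResources:
        - Name: r5-16xl
          InstanceType: r5.16xlarge
          MinCount: 0
          MaxCount: 4
      Networking:
        SubnetIds:
          - subnet-xxxxxxxx
    # Compute-optimized spot queue
    - Name: compute
      CapacityType: SPOT
      ComputeResources:
        - Name: c5-18xl
          InstanceType: c5.18xlarge
          MinCount: 0
          MaxCount: 4
      Networking:
        SubnetIds:
          - subnet-xxxxxxxx
EOF

echo "Wrote $config_file"
# To launch (requires pcluster 3 and AWS credentials), something like:
# pcluster create-cluster --cluster-name ladco-multi --cluster-configuration "$config_file"
```

With a layout like this, each queue should appear as a Slurm partition, so a job can be sent to a particular instance type with, e.g., sbatch -p memory job.sh.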

AWS Parallel Cluster version 2

Install AWS-CLI: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html
Install Pcluster
Don't set the spot price. When you do not set a spot price, AWS gives you the spot market price, capped at the on-demand price. Setting a spot price makes the instance more prone to being reclaimed, which terminates your job. Because EC2 instances currently have no checkpoint/restart functionality enabled automatically, losing an instance is a show-stopper for WRF production runs. For WRF MPI applications, it's not worth playing the spot market if the tradeoff is instance reliability.
Use the base alinux AMI for your version of pcluster; e.g., for v2.4.0: https://github.com/aws/aws-parallelcluster/blob/v2.4.0/amis.txt
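
Picking the AMI for your OS and region out of amis.txt can be scripted. The helper below is a minimal sketch: find_base_ami is a hypothetical name (not part of pcluster), and it assumes the file is laid out as "# <os>" header lines followed by "<region>: <ami-id>" lines, which should be confirmed against the file for your pcluster version.

```shell
#!/bin/bash
# Sketch: print the AMI ID for one OS and region from amis.txt-style text on stdin.
# ASSUMPTION: "# <os>" header lines, each followed by "<region>: <ami-id>" lines.
# find_base_ami is a hypothetical helper, not a pcluster command.
find_base_ami() {
  local os="$1" region="$2"
  awk -v os="$os" -v region="$region" '
    /^#/ { current = $2; next }                          # "# alinux" opens a new OS block
    current == os && $1 == (region ":") { print $2; exit }
  '
}

# Example (the raw-file URL is an assumption based on the GitHub link above):
# curl -s https://raw.githubusercontent.com/aws/aws-parallelcluster/v2.4.0/amis.txt \
#   | find_base_ami alinux us-east-2
```

The printed ID is what would go in the custom_ami setting of the cluster section.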

Configure the cluster with a config file:

spot pcluster config

[aws]
aws_region_name = us-east-2
 
[cluster ladcospot]
vpc_settings = public
ebs_settings = data,input,apps
scheduler = slurm
master_instance_type = c4.large
compute_instance_type = r4.16xlarge
master_root_volume_size = 40
cluster_type = spot
base_os = alinux2
key_name = ******
s3_read_resource = arn:aws:s3:::ladco-backup/*
s3_read_write_resource = arn:aws:s3:::ladco-wrf/*
post_install = s3://ladco-backup/post_install_users.sh
disable_hyperthreading = true
custom_ami = ami-0c283443a1ebb5c17
max_queue_size = 100
 
# Create a 10 TB cold storage (sc1) I/O directory
[ebs data]
shared_dir = data
volume_type = sc1
volume_size = 10000
volume_iops = 1500
encrypted = false

# Attach an "apps" volume with pre-loaded software
[ebs apps]
shared_dir = ladco
ebs_volume_id = vol-*****

# Attach an "input" data volume with pre-loaded input data
[ebs input]
shared_dir = input
ebs_volume_id = vol-*****

[vpc public]
master_subnet_id = subnet-******
vpc_id = vpc-******

[global]
update_check = true
sanity_check = true
cluster_template = ladcospot

[aliases]
ssh = ssh -Y {CFN_USER}@{MASTER_IP} {ARGS}

on demand pcluster config

[aws]
aws_region_name = us-east-2
 
[cluster ladcowrf]
vpc_settings = public
ebs_settings = ladcowrf
scheduler = sge
master_instance_type = m4.large
compute_instance_type = m5a.4xlarge
placement = cluster
placement_group = DYNAMIC
master_root_volume_size = 40
cluster_type = ondemand
base_os = alinux
key_name = *****
min_vcpus = 0
max_vcpus = 64
desired_vcpus = 0
# Base AMI for pcluster v2.1.0
custom_ami = ami-0381cb7486cdc973f
 
[ebs ladcowrf]
shared_dir = data
volume_type = gp2
volume_size = 10000
volume_iops = 1500
encrypted = false

[vpc public]
master_subnet_id = subnet-******
vpc_id = vpc-******

[global]
update_check = true
sanity_check = true
cluster_template = ladcowrf

[aliases]
ssh = ssh -Y {CFN_USER}@{MASTER_IP} {ARGS}
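
In these version 2 config files, each name listed in vpc_settings and ebs_settings must match a [vpc NAME] or [ebs NAME] section, and a mismatch is easy to miss. The check below is a minimal sketch of how to catch one before creating a cluster; check_pcluster_refs is a hypothetical helper name, not a pcluster command, and it only covers these two settings.

```shell
#!/bin/bash
# Sketch: verify that vpc_settings / ebs_settings names have matching sections.
# check_pcluster_refs is a hypothetical helper, not part of pcluster.
check_pcluster_refs() {
  local cfg="$1" status=0 kind name
  for kind in vpc ebs; do
    # Split e.g. "ebs_settings = data,input,apps" into individual names.
    for name in $(sed -n "s/^${kind}_settings[[:space:]]*=[[:space:]]*//p" "$cfg" | tr ',' ' '); do
      if ! grep -q "^\[${kind} ${name}][[:space:]]*$" "$cfg"; then
        echo "missing section: [${kind} ${name}]"
        status=1
      fi
    done
  done
  return "$status"
}

# Example:
# check_pcluster_refs config.spot && pcluster create -c config.spot ladcospot
```

The function prints one line per missing section and returns non-zero if any reference is unresolved, so it can gate the create command.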

Cluster Access

Start the cluster

 pcluster create -c config.spot ladcospot

Log in to the cluster

pcluster ssh ladcospot -i {path to your AWS key file}

Fault Tolerance

Script to poll the EC2 instance metadata for a spot termination notice, and restart WRF.

#!/bin/bash
CASE=LADCO_2016_WRFv39_YNT_NAM
JSTART=2016095
wrk_dir=/data/apps/WRFV3.9.1/sims/LADCO_2016_WRFv39_YNT_NAM
while true
do
  # The termination-time endpoint returns a timestamp (matching .*T.*Z) only
  # once the instance has been marked for termination. Test grep's exit status
  # directly; grep -q prints nothing, so wrapping it in [ -z $(...) ] is always true.
  if curl -s http://169.254.169.254/latest/meta-data/spot/termination-time | grep -q '.*T.*Z'; then
     echo "terminated"
     break
  else
     echo "Still running fine"
     sleep 3
  fi
done
echo "Restarting WRF job"
#qsub -N WRF_rest $wrk_dir/wrapper_restart_wrf.csh $JSTART 6
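
On instances where the metadata service requires session tokens (IMDSv2), a plain curl to the metadata address is rejected. The variant below is a sketch under that assumption: it requests a token first and validates the response against a strict timestamp pattern. The helper names is_termination_time and fetch_termination_time are hypothetical, and the 2-second timeouts and 60-second token lifetime are arbitrary choices.

```shell
#!/bin/bash
# Sketch: spot termination check using IMDSv2 session tokens.
# is_termination_time and fetch_termination_time are hypothetical helper names.

# Succeeds only if the argument looks like a UTC timestamp, e.g. 2015-01-05T18:02:00Z.
is_termination_time() {
  [[ "$1" =~ ^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z$ ]]
}

# Prints the termination-time response body; prints nothing on an HTTP error or timeout.
fetch_termination_time() {
  local base="http://169.254.169.254/latest" token
  token=$(curl -s -m 2 -X PUT "$base/api/token" \
            -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
  curl -s -f -m 2 -H "X-aws-ec2-metadata-token: $token" \
       "$base/meta-data/spot/termination-time"
}

# Example polling loop (left commented out so sourcing this file has no side effects):
# until is_termination_time "$(fetch_termination_time)"; do sleep 3; done
# echo "terminated"
```

Keeping the pattern check separate from the network call makes it possible to test the matching logic off-cluster, where the metadata address is unreachable.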

Difference between revisions of "Working with AWS Parallel Cluster"
