Working with AWS Parallel Cluster
AWS Parallel Cluster version 3
Setting up a multi-instance cluster
- You can now configure pcluster to have multiple queues and compute instances in a single cluster. This is a nice feature because it lets you choose among different instance types based on your job's requirements (e.g., memory- vs. compute-optimized) or choose instances based on spot availability.
- To decide which instances to put into your cluster, use the Amazon Spot Instance Advisor to find instances with a low frequency of interruption (< 5%). We found that instances with a > 5% frequency of interruption don't perform well as spot instances: they either take too long to get into the queue, or they get kicked out of the queue too often. Use the Spot Instance Advisor to find several available instances that meet your needs within a region of interest, and pair your instance search with the EC2 spot pricing table to find less expensive options (see the CLI sketch after this list).
- After finding the region and instances that meet your needs, build your pcluster config file. See the LADCO modeling instance example below.
- Pcluster 3 YAML configuration file documentation
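A quick way to compare current spot prices for candidate instance types from the command line is a minimal sketch using the AWS CLI; the region and instance types here are just the ones from the example config below, so adjust them to your own search:

 aws ec2 describe-spot-price-history \
     --region us-east-2 \
     --instance-types r4.16xlarge m6g.16xlarge c5.24xlarge \
     --product-descriptions "Linux/UNIX" \
     --start-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
     --query 'SpotPriceHistory[*].[AvailabilityZone,InstanceType,SpotPrice]' \
     --output table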
Modeling spot cluster config
Region: us-east-2
Image:
  Os: alinux2                  # alinux2 | centos7 | ubuntu1804 | ubuntu2004
HeadNode:
  InstanceType: c4.large
  Networking:
    SubnetId: subnet-34d14e4e
    ElasticIp: false           # true|false|EIP-id
  DisableSimultaneousMultithreading: false
  Ssh:
    KeyName: LADCO WRF
  LocalStorage:
    RootVolume:
      Size: 40
      Encrypted: false
      DeleteOnTermination: true
  Iam:
    S3Access:
      - BucketName: ladco-backup
        EnableWriteAccess: true
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    ScaledownIdletime: 15
  SlurmQueues:
    - Name: slurmq1            # slurm directive: -p slurmq1
      CapacityType: SPOT
      Networking:
        SubnetIds:
          - subnet-34d14e4e
        PlacementGroup:
          Enabled: true
      ComputeResources:
        - Name: r416x          # slurm directive: -C "[r416x]"
          DisableSimultaneousMultithreading: false
          InstanceType: r4.16xlarge
          MaxCount: 50
        - Name: m6g16x         # slurm directive: -C "[m6g16x]"
          DisableSimultaneousMultithreading: false
          InstanceType: m6g.16xlarge
          MaxCount: 50
        - Name: c524x          # slurm directive: -C "[c524x]"
          DisableSimultaneousMultithreading: false
          InstanceType: c5.24xlarge
          MaxCount: 50
SharedStorage:
  - MountDir: /ladco
    StorageType: Ebs
    Name: appsvolume
    EbsSettings:
      Size: 120
      SnapshotId: snap-06867f5ffff3ed579
      DeletionPolicy: Snapshot # Delete | Retain | Snapshot
  - MountDir: /data/efs
    StorageType: Efs
    Name: dataefs
    EfsSettings:
      Encrypted: false                 # true
      PerformanceMode: generalPurpose  # generalPurpose | maxIO
      ThroughputMode: bursting         # bursting | provisioned
  - MountDir: /data/ebs
    StorageType: Ebs
    Name: dataebs
    EbsSettings:
      VolumeType: sc1          # gp2 | gp3 | io1 | io2 | sc1 | st1 | standard
      Size: 1000
      Encrypted: false         # true
      DeletionPolicy: Delete
  - MountDir: /data/lustre
    StorageType: FsxLustre
    Name: datalustre
    FsxLustreSettings:
      StorageCapacity: 2400            # 1200 | 2400 | 3600
      DeploymentType: SCRATCH_1        # PERSISTENT_1 | SCRATCH_1 | SCRATCH_2
      ExportPath: s3://ladco-backup/lustre
      ImportPath: s3://ladco-backup
      AutoImportPolicy: NEW_CHANGED    # NEW | NEW_CHANGED
      StorageType: SSD                 # HDD | SSD
Monitoring:
  DetailedMonitoring: false
  Logs:
    CloudWatch:
      Enabled: true
      RetentionInDays: 14
      DeletionPolicy: Delete
  Dashboards:
    CloudWatch:
      Enabled: true
CustomS3Bucket: ladco-backup
Tags:
  - Key: Name
    Value: Feb22Pcluster
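With multiple queues and compute resources defined, you select the instance type at submission time using the Slurm directives noted in the comments above: -p picks the queue and -C the compute resource. A minimal sketch (the job script run_wrf.sh is a hypothetical placeholder):

 sbatch -p slurmq1 -C "[r416x]" run_wrf.sh   # run on r4.16xlarge spot nodes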
AWS Parallel Cluster version 2
- Install AWS-CLI
- Install Pcluster (see the install sketch after this list)
- Don't set the spot price. When you do not set one, AWS charges the current spot market price, capped at the on-demand price. Setting a maximum spot price makes the instance more prone to being reclaimed and having your job terminated. As there is currently no checkpoint/restart functionality automatically enabled on EC2 instances, losing an instance is a show-stopper for WRF production runs. For WRF MPI applications it's not worth playing the spot market if the tradeoff is instance reliability.
- Use the base alinux AMI for your version of pcluster; e.g., for v2.4.0: https://github.com/aws/aws-parallelcluster/blob/v2.4.0/amis.txt
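Both tools install with pip. A minimal sketch, assuming Python 3 and pinning pcluster to v2.4.0 to match the AMI list above:

 pip3 install --upgrade awscli
 pip3 install aws-parallelcluster==2.4.0
 aws configure        # set credentials and default region
 pcluster version     # confirm the install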
Configure the cluster with a config file:
spot pcluster config
[aws]
aws_region_name = us-east-2

[cluster ladcospot]
vpc_settings = public
ebs_settings = data,input,apps
scheduler = slurm
master_instance_type = c4.large
compute_instance_type = r4.16xlarge
master_root_volume_size = 40
cluster_type = spot
base_os = alinux2
key_name = ******
s3_read_resource = arn:aws:s3:::ladco-backup/*
s3_read_write_resource = arn:aws:s3:::ladco-wrf/*
post_install = s3://ladco-backup/post_install_users.sh
disable_hyperthreading = true
custom_ami = ami-0c283443a1ebb5c17
max_queue_size = 100

# Create a 10 TB cold storage I/O directory
[ebs data]
shared_dir = data
volume_type = sc1
volume_size = 10000
volume_iops = 1500
encrypted = false

# Attach an "apps" volume with pre-loaded software
[ebs apps]
shared_dir = ladco
ebs_volume_id = vol-*****

# Attach an "input" data volume with pre-loaded input data
[ebs input]
shared_dir = input
ebs_volume_id = vol-*****

[vpc public]
master_subnet_id = subnet-******
vpc_id = vpc-******

[global]
update_check = true
sanity_check = true
cluster_template = ladcospot

[aliases]
ssh = ssh -Y {CFN_USER}@{MASTER_IP} {ARGS}
on demand pcluster config
[aws]
aws_region_name = us-east-2

[cluster ladcowrf]
vpc_settings = public
ebs_settings = ladcowrf
scheduler = sge
master_instance_type = m4.large
compute_instance_type = m5a.4xlarge
placement = cluster
placement_group = DYNAMIC
master_root_volume_size = 40
cluster_type = ondemand
base_os = alinux
key_name = *****
min_vcpus = 0
max_vcpus = 64
desired_vcpus = 0
# Base AMI for pcluster v2.1.0
custom_ami = ami-0381cb7486cdc973f

[ebs ladcowrf]
shared_dir = data
volume_type = gp2
volume_size = 10000
volume_iops = 1500
encrypted = false

[vpc public]
master_subnet_id = subnet-******
vpc_id = vpc-******

[global]
update_check = true
sanity_check = true
cluster_template = ladcowrf

[aliases]
ssh = ssh -Y {CFN_USER}@{MASTER_IP} {ARGS}
Cluster Access
Start the cluster
pcluster create -c config.spot ladcospot
Log in to the cluster
pcluster ssh ladcospot -i {path to your SSH private key file}
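When a run finishes, stop the compute fleet or delete the cluster to avoid idle charges (pcluster v2 commands, using the cluster name from the example above):

 pcluster stop ladcospot     # stop the compute fleet; the master instance keeps running
 pcluster delete ladcospot   # tear down the cluster and its CloudFormation stack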
Fault Tolerance
Script to poll the EC2 instance metadata service for a spot termination notice, and restart WRF.
#!/bin/bash
# WRF case to resubmit after a spot interruption
CASE=LADCO_2016_WRFv39_YNT_NAM
JSTART=2016095
wrk_dir=/data/apps/WRFV3.9.1/sims/${CASE}

# Poll the instance metadata service; a spot termination notice shows up
# as a timestamp (e.g., 2022-02-23T19:37:00Z) at this endpoint.
while true
do
  if curl -s http://169.254.169.254/latest/meta-data/spot/termination-time | grep -q 'T.*Z'; then
    echo "terminated"
    break
  else
    echo "Still running fine"
    sleep 3
  fi
done
echo "Restarting WRF job"
#qsub -N WRF_rest $wrk_dir/wrapper_restart_wrf.csh $JSTART 6
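A sketch of launching the watcher alongside a run; the filename watch_spot.sh is a hypothetical placeholder. Note that on AMIs that enforce IMDSv2, the metadata call needs a session token first (assumption: your instances allow IMDSv2 token requests):

 nohup ./watch_spot.sh > watch_spot.log 2>&1 &

 # IMDSv2 variant of the metadata check:
 TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
         -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
 curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
      http://169.254.169.254/latest/meta-data/spot/termination-time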