Working with AWS Parallel Cluster

AWS Parallel Cluster version 3

Setting up a multi-instance cluster

  • You can now configure pcluster to have multiple queues and compute instances in a single cluster. This is a nice feature because it allows you to choose from different types of instances based on your job (e.g., memory- vs compute-optimized) or to choose instances based on spot availability.
  • To decide which instances to put into your cluster, use the Amazon Spot Instance Advisor to find instances with a low frequency of interruption (< 5%). We found that instances with a >5% frequency of interruption don't perform well as spot instances: they either take too long to get into the queue, or they get kicked out of the queue too often. Use the Spot Instance Advisor to find several available instances that meet your needs within a region of interest, and pair your instance search with the EC2 spot pricing table to find less expensive options (see the example command after this list).
  • After finding the region and instances that will meet your needs, build your pcluster config file. See the LADCO modeling instance example below.
  • Pcluster 3 YAML configuration file documentation: https://docs.aws.amazon.com/parallelcluster/latest/ug/cluster-configuration-file-v3.html
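
One quick way to compare current spot prices from the command line is the AWS CLI (a sketch, assuming the AWS CLI is installed and configured; the region and instance types shown are just the ones used in the config below):

aws ec2 describe-spot-price-history \
  --region us-east-2 \
  --instance-types r4.16xlarge m6g.16xlarge c5.24xlarge \
  --product-descriptions "Linux/UNIX" \
  --start-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --query 'SpotPriceHistory[].[AvailabilityZone,InstanceType,SpotPrice]' \
  --output table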

Modeling spot cluster config

A few key features of this configuration:

  • It's configured to run in the us-east-2 region with the Amazon Linux 2 OS
  • The head node is a c4.large instance with 40 GB of disk space
  • The Slurm scheduler controls the MPI jobs on the cluster
  • Computing configuration (Scheduling -> ComputeResources)
    • There is 1 queue (slurmq1) with 3 compute instance options:
    • r4.16xlarge - memory-optimized instance with 64 vCPUs, 488 GB RAM, Intel Xeon E5-2686 v4 (Broadwell) processors
    • m6g.16xlarge - general-purpose instance with 64 vCPUs, 256 GB RAM, AWS Graviton2 processors
    • c5.24xlarge - compute-optimized instance with 96 vCPUs and 192 GB RAM, Intel Xeon Scalable (Cascade Lake) processors
    • To access these different instances for a simulation, use a Slurm directive in your job script (note the connection of the --constraint directive to the ComputeResources -> Name in the config file), as shown below; a complete example job script is sketched after this list:
#SBATCH --partition=slurmq1
#SBATCH --constraint="r416x"
  • Data storage configuration
    • A preconfigured "apps" disk is mounted to the cluster as the "/ladco" directory. This disk is based on a snapshot of an EBS volume that has scripts and software (all built on ALinux2) pre-loaded. We frequently save snapshots of our applications disk to preserve specific configurations, and new clusters are built from a recent or the newest apps snapshot.
    • Three types of "data" disks are mounted to the cluster under the "/data" directory:
    • /data/ebs - Elastic Block Store (EBS, https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-volumes.html) provisioned volume with 1 TB of space. Because this is a provisioned disk, you pay for the allocated storage regardless of how much is being used
    • /data/efs - Elastic File System (EFS, https://aws.amazon.com/efs/) flexible volume that gets provisioned dynamically. You pay for what you use
    • /data/lustre - FSx for Lustre (https://aws.amazon.com/fsx/lustre/) volume with a mount point directly to an S3 bucket
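
Putting the partition and constraint directives together, a minimal job script might look like the sketch below (the job name, node counts, and executable path are illustrative placeholders, not part of the LADCO configuration):

#!/bin/bash
#SBATCH --job-name=wrf_spot_test      # illustrative job name
#SBATCH --partition=slurmq1           # matches SlurmQueues -> Name in the config
#SBATCH --constraint="r416x"          # matches ComputeResources -> Name (r4.16xlarge)
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=64          # r4.16xlarge exposes 64 vCPUs
#SBATCH --exclusive

# Report which spot instances the job landed on, then launch the MPI executable
srun hostname
srun /ladco/bin/wrf.exe               # placeholder path on the /ladco apps volume

The full pcluster 3 configuration file for this spot cluster is below.
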
Region: us-east-2
Image:
  Os: alinux2  # alinux2 | centos7 | ubuntu1804 | ubuntu2004
HeadNode:
  InstanceType: c4.large
  Networking:
    SubnetId: subnet-????????
    ElasticIp: false  # true|false|EIP-id
  DisableSimultaneousMultithreading: false
  Ssh:
    KeyName: ????????
  LocalStorage:
    RootVolume:
      Size: 40
      Encrypted: false
      DeleteOnTermination: true
  Iam:
    S3Access:
      - BucketName: ????????
        EnableWriteAccess: true
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    ScaledownIdletime: 15
  SlurmQueues:
    - Name: slurmq1  # slurm directive: -p slurmq1
      CapacityType: SPOT
      Networking:
        SubnetIds:
          - subnet-????????
        PlacementGroup:
          Enabled: True
      ComputeResources:
        - Name: r416x  # slurm directive: -C "[r416x]"
          DisableSimultaneousMultithreading: false
          InstanceType: r4.16xlarge
          MaxCount: 50
        - Name: m6g16x  # slurm directive: -C "[m6g16x]"
          DisableSimultaneousMultithreading: false
          InstanceType: m6g.16xlarge
          MaxCount: 50
        - Name: c524x  # slurm directive: -C "[c524x]"
          DisableSimultaneousMultithreading: false
          InstanceType: c5.24xlarge
          MaxCount: 50
SharedStorage:
  - MountDir: /ladco
    StorageType: Ebs
    Name: appsvolume
    EbsSettings:
      Size: 120
      SnapshotId: snap-????????
      DeletionPolicy: Snapshot  # Delete | Retain | Snapshot
  - MountDir: /data/efs
    StorageType: Efs
    Name: dataefs
    EfsSettings:
      Encrypted: false  # true
      PerformanceMode: generalPurpose  # generalPurpose | maxIO
      ThroughputMode: bursting  # bursting | provisioned
  - MountDir: /data/ebs
    StorageType: Ebs
    Name: dataebs
    EbsSettings:
      VolumeType: sc1  # gp2 | gp3 | io1 | io2 | sc1 | st1 | standard
      Size: 1000
      Encrypted: false  # true
      DeletionPolicy: Delete
  - MountDir: /data/lustre
    StorageType: FsxLustre
    Name: datalustre
    FsxLustreSettings:
      StorageCapacity: 2400  # 1200 | 2400 | 3600
      DeploymentType: SCRATCH_1  # PERSISTENT_1 | SCRATCH_1 | SCRATCH_2
      ExportPath: s3://????????/lustre
      ImportPath: s3://????????
      AutoImportPolicy: NEW_CHANGED  # NEW | NEW_CHANGED
      StorageType: SSD  # HDD | SSD
Monitoring:
  DetailedMonitoring: false
  Logs:
    CloudWatch:
      Enabled: true
      RetentionInDays: 14
      DeletionPolicy: Delete
  Dashboards:
    CloudWatch:
      Enabled: true
CustomS3Bucket: ????????
Tags:
  - Key: Name
    Value: Feb22Pcluster

Starting the cluster

Get familiar with the pcluster commands (https://docs.aws.amazon.com/parallelcluster/latest/ug/pcluster-v3.html). Once your config file is set up, use this command to start a cluster called "my-cluster-name" in the us-west-1 region:

pcluster create-cluster -c config.spot_us-west-1.pcluster3.22Feb2022.yaml -r us-west-1 -n my-cluster-name

Check the status of the build using:

pcluster list-clusters -r us-west-1
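
For more detail on a single cluster (build status, head node information, failure reasons), pcluster 3 also provides a describe-cluster command; for example, using the same cluster name and region as above:

pcluster describe-cluster -n my-cluster-name -r us-west-1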

Connecting to the cluster

Once you get a successful build, connect to the cluster with the pcluster ssh command:

pcluster ssh -n my-cluster-name -i ~/.awskeys/mykey.pem

You can also log into the AWS Console, go to the EC2 service, and look up your instance. There is a Connect button that shows an ssh command if you prefer to use an SSH client like PuTTY or a traditional ssh command from a Linux command line.
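
A plain ssh command would look like the sketch below; the head node's public IP is a placeholder you can read from the EC2 console (or the describe-cluster output), and ec2-user is the default login for the alinux2 OS used here:

ssh -i ~/.awskeys/mykey.pem ec2-user@<head-node-public-ip>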

Running jobs

AWS Parallel Cluster version 2

Configure the cluster with a config file:

spot pcluster config

[aws]
aws_region_name = us-east-2
 
[cluster ladcospot]
vpc_settings = public
ebs_settings = data,input,apps
scheduler = slurm
master_instance_type = c4.large
compute_instance_type = r4.16xlarge
master_root_volume_size = 40
cluster_type = spot
base_os = alinux2
key_name = ******
s3_read_resource = arn:aws:s3:::ladco-backup/*
s3_read_write_resource = arn:aws:s3:::ladco-wrf/*
post_install = s3://ladco-backup/post_install_users.sh
disable_hyperthreading = true
custom_ami = ami-0c283443a1ebb5c17
max_queue_size = 100
 
# Create a 10 TB cold storage I/O directory
[ebs data]
shared_dir = data
volume_type = sc1
volume_size = 10000
volume_iops = 1500
encrypted = false

# Attach an "apps" volume with pre-loaded software
[ebs apps]
shared_dir = ladco
ebs_volume_id = vol-*****

# Attach an "input" data volume with pre-loaded input data
[ebs input]
shared_dir = input
ebs_volume_id = vol-*****

[vpc public]
master_subnet_id = subnet-******
vpc_id = vpc-******

[global]
update_check = true
sanity_check = true
cluster_template = ladcospot

[aliases]
ssh = ssh -Y {CFN_USER}@{MASTER_IP} {ARGS}

on demand pcluster config

[aws]
aws_region_name = us-east-2
 
[cluster ladcowrf]
vpc_settings = public
ebs_settings = ladcowrf
scheduler = sge
master_instance_type = m4.large
compute_instance_type = m5a.4xlarge
placement = cluster
placement_group = DYNAMIC
master_root_volume_size = 40
cluster_type = ondemand
base_os = alinux
key_name = *****
min_vcpus = 0
max_vcpus = 64
desired_vcpus = 0
# Base AMI for pcluster v2.1.0
custom_ami = ami-0381cb7486cdc973f
 
[ebs ladcowrf]
shared_dir = data
volume_type = gp2
volume_size = 10000
volume_iops = 1500
encrypted = false

[vpc public]
master_subnet_id = subnet-******
vpc_id = vpc-******

[global]
update_check = true
sanity_check = true
cluster_template = ladcowrf

[aliases]
ssh = ssh -Y {CFN_USER}@{MASTER_IP} {ARGS}

Cluster Access

Start the cluster

 pcluster create -c config.spot ladcospot

Log in to the cluster

pcluster ssh ladcospot -i {path to your AWS key file}
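
To check on a version 2 cluster after launching it, the v2 CLI also has list and status subcommands (shown here with the same cluster name; the region is taken from the [aws] section of the config):

pcluster list
pcluster status ladcospot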

Fault Tolerance

Script to poll the EC2 instance metadata for a spot termination notice and restart WRF:

#!/bin/bash
# Poll the EC2 instance metadata service for a spot termination notice.
# When a termination time (e.g., 2016-04-07T17:05:15Z) is posted, break out
# of the loop and resubmit the WRF restart job.
CASE=LADCO_2016_WRFv39_YNT_NAM
JSTART=2016095
wrk_dir=/data/apps/WRFV3.9.1/sims/${CASE}
while true
do
  if curl -s http://169.254.169.254/latest/meta-data/spot/termination-time | grep -q ".*T.*Z"; then
     echo "Termination notice received"
     break
  else
     echo "Still running fine"
     sleep 3
  fi
done
echo "Restarting WRF job"
#qsub -N WRF_rest $wrk_dir/wrapper_restart_wrf.csh $JSTART 6
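
One way to use the monitor (a sketch; the script and log file names are illustrative) is to start it in the background at the beginning of a run:

nohup ./monitor_spot_termination.sh > monitor_spot.log 2>&1 &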
