WRF on the Cloud

== AWS Parallel Cluster ==
 
* [https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html Install AWS-CLI]
 
* [https://aws.amazon.com/blogs/opensource/aws-parallelcluster/ Install Pcluster]
 
* Don't set a maximum spot price.  When you do not set a spot price, AWS gives you the [https://aws.amazon.com/blogs/compute/new-amazon-ec2-spot-pricing/ spot market price capped at the on-demand price]. Setting a spot price makes the instance more prone to being reclaimed and having your job terminated. Because there is currently no checkpoint/restart functionality automatically enabled on EC2 instances, losing an instance is a show-stopper for WRF production runs; for WRF MPI applications it is not worth playing the spot market if the tradeoff is instance reliability (see the spot price query example after this list).
 
* Use the base alinux AMI for your version of pcluster; e.g., for v2.4.0: https://github.com/aws/aws-parallelcluster/blob/v2.4.0/amis.txt
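
Before launching, you can sanity-check the current spot market price for the compute instance type with the AWS CLI; a minimal sketch (the availability zone and start date below are examples only):

 # Recent spot price history for c4.4xlarge Linux instances in one AZ
 aws ec2 describe-spot-price-history --instance-types c4.4xlarge --product-descriptions "Linux/UNIX" --availability-zone us-east-2a --start-time 2019-05-01T00:00:00 --max-items 5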
 
Configure the cluster with a config file:
 
 
=== spot pcluster config ===
 
 
 [aws]
 aws_region_name = us-east-2
 
 [cluster ladcospot]
 vpc_settings = public
 ebs_settings = ladcosc1
 scheduler = sge
 master_instance_type = c4.large
 compute_instance_type = c4.4xlarge
 placement = compute
 placement_group = DYNAMIC
 master_root_volume_size = 40
 cluster_type = spot
 #spot_price = 0.2
 base_os = alinux
 key_name = *****
 # Base AMI for pcluster v2.1.0
 custom_ami = ami-0381cb7486cdc973f
 
 # Create a cold storage I/O directory
 [ebs ladcosc1]
 shared_dir = data
 volume_type = sc1
 volume_size = 10000
 volume_iops = 1500
 encrypted = false
 
 [vpc public]
 master_subnet_id = subnet-******
 vpc_id = vpc-******
 
 [global]
 update_check = true
 sanity_check = true
 cluster_template = ladcospot
 
 [aliases]
 ssh = ssh -Y {CFN_USER}@{MASTER_IP} {ARGS}
 
 
=== on demand pcluster config ===
 
 
 [aws]
 aws_region_name = us-east-2
 
 [cluster ladcowrf]
 vpc_settings = public
 ebs_settings = ladcowrf
 scheduler = sge
 master_instance_type = m4.large
 compute_instance_type = m5a.4xlarge
 placement = cluster
 placement_group = DYNAMIC
 master_root_volume_size = 40
 cluster_type = ondemand
 base_os = alinux
 key_name = *****
 min_vcpus = 0
 max_vcpus = 64
 desired_vcpus = 0
 # Base AMI for pcluster v2.1.0
 custom_ami = ami-0381cb7486cdc973f
 
 [ebs ladcowrf]
 shared_dir = data
 volume_type = gp2
 volume_size = 10000
 volume_iops = 1500
 encrypted = false
 
 [vpc public]
 master_subnet_id = subnet-******
 vpc_id = vpc-******
 
 [global]
 update_check = true
 sanity_check = true
 cluster_template = ladcowrf
 
 [aliases]
 ssh = ssh -Y {CFN_USER}@{MASTER_IP} {ARGS}
 
 
=== Cluster Access ===
 
 
Start the cluster:

  pcluster create -c config.spot ladcospot

Log in to the cluster:

  pcluster ssh ladcospot -i {path to your AWS key file}
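
When a simulation segment is finished you can stop the compute fleet or tear the whole cluster down; a brief sketch of the corresponding ParallelCluster 2.x commands:

 # Check status and list clusters
 pcluster status ladcospot
 pcluster list
 # Stop the compute fleet (the master keeps running); resume later with "pcluster start"
 pcluster stop ladcospot
 # Delete the cluster and its CloudFormation stack when the project is done
 pcluster delete ladcospot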
 
 
=== Fault Tolerance ===
 
 
Script that polls the EC2 instance metadata service for a spot termination notice and, when one appears, resubmits the WRF job.

 #!/bin/bash
 # Watch for a spot termination notice and trigger a WRF restart submission.
 CASE=LADCO_2016_WRFv39_YNT_NAM
 JSTART=2016095
 wrk_dir=/data/apps/WRFV3.9.1/sims/LADCO_2016_WRFv39_YNT_NAM
 
 while true
 do
    # The termination-time endpoint only returns a timestamp (e.g., 2019-05-16T12:00:00Z)
    # after a termination notice has been issued; until then the grep fails.
    if curl -s http://169.254.169.254/latest/meta-data/spot/termination-time | grep -q '.*T.*Z'; then
       echo "Spot termination notice received"
       break
    else
       echo "Still running fine"
       sleep 3
    fi
 done
 
 echo "Restarting WRF job"
 #qsub -N WRF_rest $wrk_dir/wrapper_restart_wrf.csh $JSTART 6
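
The watchdog above is meant to run in the background on the master node for the life of each spot segment; a usage sketch (the script name monitor_spot.sh is hypothetical):

 # Start the watchdog on the master node and keep its output in a log file
 nohup ./monitor_spot.sh > monitor_spot.log 2>&1 &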
 
  
 

Ramboll Modeling on the Cloud Contract Results


WRF on AWS: LADCO User's Guide

Last Update: 16May2019. How to configure and optimize AWS for running WRF.

Summary

  • AWS pcluster v2.1.0 instance using ALinux
  • WRFv3.9.1 compiled with netCDF4 (compression)
  • PGI compiler 2018 with OpenMPI v3.1.3
  • NetCDF C 4.6.2, Fortran 4.2
  • Spot instances with sc1 cold storage volumes

The AWS Parallel Cluster package builds a computing cluster with a constantly running master node that is used to launch jobs on compute nodes. The compute nodes are only started when a job is initiated, and they shut down after a default of 10 minutes of idle time (the idle time can be changed through the pcluster config file, as sketched below). This system lets you choose different instance types for the master and compute nodes. We chose an inexpensive master instance (c4.large: 2 CPUs, 3.75 GB RAM) and compute-optimized compute instances (c4.4xlarge: 16 CPUs, 30 GB RAM). We attached a 10 TB EBS volume (sc1 Cold HDD) for storage.
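
For reference, ParallelCluster 2.x sets that idle timeout with a scaling section in the config file; a minimal sketch (the 30-minute value is only an example, not what was used here):

 # Referenced from the [cluster] section of the pcluster config
 [cluster ladcospot]
 scaling_settings = custom
 
 [scaling custom]
 # Minutes a compute node may sit idle before it is terminated (default is 10)
 scaledown_idletime = 30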

For the workflow, we run an operational script system that downloads MADIS obs and runs WPS, REAL, and WRF. It also runs a script to replace the NOAA SST with GLSEA SST. The simulation is run as multiple concurrent 32-CPU, 5.5-day segments.

After WRF completes, we run MCIP and WRFCAMx, and ingest the results into AMET. We archive the wrfout, MCIP, and WRFCAMx data to S3 Glacier.

Notes

  • No success with MPICH (v3.2.1) on AWS ALinux. Tried the GCC (7.2.1), PGI (2018), and Intel (xe 2019) compilers, and also WRFv4.0; nothing worked with MPICH. WRF compiles, but it was unstable and crashed consistently with segfaults after a seemingly random number of output timesteps.
  • Stable WRF was built with OpenMPI; we ultimately settled on PGI 2018 and OpenMPI v3.1.3.
  • Prototyping and testing were done with EC2 on-demand m5a.* instances (see the on demand pcluster config above) and gp2 EBS volumes.
  • Production runs were done with EC2 spot c4.* instances (see the spot pcluster config above) and sc1 volumes.


AWS Configuration

Packages/software installed:

  • PGI 2018
  • GCC and Gfortran 7.2.1
  • NetCDF C 4.6.2, Fortran 4.2
  • HDF5 1.10.1
  • JASPER 1.900.2
  • ZLIB 1.2.11
  • R 3.4.1
  • OpenMPI 3.1.3
  • yum -y install screen dstat htop strace perf pdsh ImageMagick
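
The WRF build expects the compiler, MPI, and netCDF/HDF5 libraries above to be findable at build and run time; a hedged environment sketch (every install prefix below is a placeholder and should match where the packages were actually installed):

 # Example environment for a PGI + OpenMPI + netCDF WRF build (paths are placeholders)
 export PGI=/opt/pgi
 export PATH=$PGI/linux86-64/2018/bin:/opt/openmpi-3.1.3/bin:$PATH
 export NETCDF=/opt/netcdf                 # netCDF C 4.6.2 and Fortran 4.2 under one prefix
 export HDF5=/opt/hdf5-1.10.1
 export JASPERLIB=/opt/jasper/lib          # used by WPS/ungrib for GRIB2
 export JASPERINC=/opt/jasper/include
 export LD_LIBRARY_PATH=$NETCDF/lib:$HDF5/lib:/opt/openmpi-3.1.3/lib:$LD_LIBRARY_PATH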

Misc Notes

Adding ssh users to a pcluster instance

AWS documentation for adding users with OpenLDAP

The post_install setting in the pcluster configuration file should invoke the post_install_users.sh script described in the AWS documentation linked above (see the config sketch below). Alternatively, if you already have an instance running, you can run the script as sudo to apply the settings:

  > sudo ./post_install_users.sh
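
For reference, a minimal sketch of the corresponding entry in the pcluster config; the S3 URL is hypothetical and the script must be readable by the cluster's instance role:

 # In the [cluster] section of the pcluster config (S3 URL is a placeholder)
 post_install = s3://ladco-pcluster/scripts/post_install_users.sh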

After running the script, use the add-user.sh and add-key.sh scripts described in the link above.

 #Usage (using zac as an example)
 > sudo ./add-user.sh zac {UID, e.g., 2001}
 > sudo ./add-key.sh zac ladco.key
 #Add the user to the /etc/passwd file
 > sudo getent passwd zac
 > sudo vi /etc/passwd
 #Change the user's primary group to ladco
 > sudo usermod zac -g ladco

Using AWS S3 for offline storage

Data are moved off of the compute servers to AWS Simple Storage Service (S3) for intermediate- to long-term storage. The AWS CLI is used to access and manage the data on S3.

View the S3 commands, with an example for the copy (cp) command

 > aws s3 help
 > aws s3 cp help

List the S3 buckets

 > aws s3 ls
 2019-02-06 21:18:09 ladco-wrf
 > aws s3 ls ladco-wrf/
 PRE 24Apr2019/
 PRE 24Jan2019/
 PRE LADCO_2016_WRFv39_APLX/
 PRE LADCO_2016_WRFv39_YNT/
 PRE LADCO_2016_WRFv39_YNT_GFS/
 PRE LADCO_2016_WRFv39_YNT_NAM/
 PRE aws-reports/

Copy a file from one of the S3 buckets to a location on the compute server:

 aws s3 cp s3://ladco-wrf/LADCO_2016_WRFv39_YNT_GFS/wrfout_d01_2016-06-10_00:00:00 /data2/wrf3.9.1/LADCO_2016_WRFv39_YNT_GFS/
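
To move an entire run directory in one step, aws s3 sync transfers only files that differ between source and destination; a sketch (the local path follows the cp example above, the bucket is the one listed above):

 # Push a whole output directory to S3 (only new or changed files are copied)
 aws s3 sync /data2/wrf3.9.1/LADCO_2016_WRFv39_YNT_GFS s3://ladco-wrf/LADCO_2016_WRFv39_YNT_GFS/
 # Pull it back down later
 aws s3 sync s3://ladco-wrf/LADCO_2016_WRFv39_YNT_GFS/ /data2/wrf3.9.1/LADCO_2016_WRFv39_YNT_GFS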


Increase size of in-use volume

Add New Volume to Running Instance

From the AWS Console

  • Go to Volumes
  • Create a new Volume
  • Under the Actions menu, select Attach Volume
  • Select the instance to which to attach the new volume

From the AWS Instance

  • Check that the volume is available:
 lsblk
  • Confirm that the volume is empty (assuming the volume is attached as /dev/xvdf):
 sudo file -s /dev/xvdf

If this command returns the following, the volume is empty:

 /dev/xvdf: data

  • Format the volume with an ext4 filesystem:
 sudo mkfs -t ext4 /dev/xvdf
  • Create a new directory and mount the volume:
 sudo mkdir /newdata
 sudo mount /dev/xvdf /newdata
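
To increase the size of a volume that is already in use (the case named in the section heading above), the volume can be grown from the CLI and the filesystem resized in place; a hedged sketch, assuming an unpartitioned ext4 data volume attached as /dev/xvdf (the volume ID and new size are placeholders):

 # Grow the EBS volume itself
 aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --size 12000
 # Watch the modification until it reaches the optimizing/completed state
 aws ec2 describe-volumes-modifications --volume-ids vol-0123456789abcdef0
 # Grow the ext4 filesystem to fill the larger volume (works while mounted)
 sudo resize2fs /dev/xvdf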

Copy output files from an EC2 volume to S3 Glacier

#!/bin/csh -f

set PROJECT = LADCO_2016_WRFv39_YNT_GFS
set NEW_YN = Y 

if ( $NEW_YN == Y ) then
   # Create a new storage vault
   aws glacier create-vault --vault-name $PROJECT --account-id -

   # Add tags to describe vault (10 tags max)
   aws glacier add-tags-to-vault --account-id - --vault-name $PROJECT --tags model="WRFv3.9.1",simyear=2016,stdate=20160610,endate=20160619,awsinst=ec2-ondemand,desc="LADCO YNT GFS Test Run"
endif

# Upload files to the storage vault
set datadir = /data/wrf3.9.1/${PROJECT}/output_full/wrf_out/2016
cd $datadir
foreach f ( *wrfout* )
   echo "Copying $f"
   aws glacier upload-archive --account-id - --vault-name $PROJECT --body $f
end

# Remove files on ec2 after the files are all uploaded
# rm -f $datadir/*wrfout*
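
Before deleting the wrfout files from the EC2 volume, it is prudent to confirm that the archives actually landed in the vault; a sketch of a vault inventory request (inventory jobs are asynchronous and can take hours; {JOB_ID} is a placeholder returned by initiate-job):

 # Request a vault inventory
 aws glacier initiate-job --account-id - --vault-name LADCO_2016_WRFv39_YNT_GFS --job-parameters '{"Type": "inventory-retrieval"}'
 # Check the job, then download the inventory listing once it completes
 aws glacier describe-job --account-id - --vault-name LADCO_2016_WRFv39_YNT_GFS --job-id {JOB_ID}
 aws glacier get-job-output --account-id - --vault-name LADCO_2016_WRFv39_YNT_GFS --job-id {JOB_ID} inventory.json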