WRF on the Cloud

  
= Objectives =

LADCO is seeking to understand best practices for submitting and managing multiprocessor computing jobs on a cloud computing platform. In particular, LADCO would like to develop a WRF production environment that uses cloud-based computing. The goal of this project is to prototype a WRF production environment on a public, on-demand high performance computing service in the cloud to create a WRF platform-as-a-service (PaaS) solution. The WRF PaaS must meet the following objectives:
 
 
* Configurable computing and storage that scale, as needed, to meet the needs of different WRF applications
 
* Configurable WRF options to enable changing grids, simulation periods, physics options, and input data
 
* Flexible cloud deployment from a command line interface to initiate computing clusters and spawn WRF jobs in the cloud
 
 
 
= Call Notes =
 
== November 28, 2018 ==
 
=== WRF Benchmarking ===
 
* Emulating the 2016 WRF 12/4/1.3-km grids

* Purpose: estimate costs for CPUs, RAM, and storage

* CPU: a 5.5-day run takes ~4 days of wall time on 8 cores and ~3 days on 24 cores

* RAM: ~22 GB per run (~2.5 GB/core)
 
* Storage

** Tested netCDF4 (with compression) against netCDF with no compression

** Compression saves a lot of space: output is about 1/3 the size of uncompressed netCDF (~70% reduction); see the sketch below

** Downstream programs must be linked against HDF5 and netCDF4 libraries built with compression support

** Estimated ~5.8 TB for the year with compression; this grows to ~16.9 TB without compression
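
As a rough illustration of the compression test above, here is a minimal sketch (the file name is a hypothetical wrfout example) using the standard netCDF nccopy utility to rewrite classic-format output as compressed netCDF-4 and compare sizes:

  # rewrite a classic-netCDF wrfout file as netCDF-4 with deflate level 4
  nccopy -k nc4 -d 4 wrfout_d01_2016-01-02_00:00:00 wrfout_d01_compressed.nc
  # compare on-disk sizes to estimate the compression ratio
  du -h wrfout_d01_2016-01-02_00:00:00 wrfout_d01_compressed.nc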
 
 
 
=== Conceptual Approach to WRF on the Cloud ===
 
* Cluster management would launch a head node and compute nodes
 
* 77 5.5-day chunks; at ~4 days of wall time per chunk, that works out to roughly 20 machines for 16 days (or 80 machines for 4 days)

* Head node running constantly

* Compute nodes running over the length of the project

* Memory-optimized machines performed better than compute-optimized machines for CAMx
 
 
 
=== Cost Analysis ===
 
* [https://www.ladco.org/wp-content/uploads/Projects/WRF-Cloud/WRF_cloud_computing_costs.pdf Analysis Spreadsheet]
 
 
 
=== Storage Analysis ===
 
* AWS

** Don't want to use local instance storage because the data would need to be moved/migrated

** Put the data on a storage service (S3) while running, and then push it off to longer-term storage (Glacier); see the sketch after this list

** Glacier is archival storage: retrieval requests are submitted through the console, with response times listed as 1-5 minutes

* Azure

** Faster and slower data lake storage tiers for offline data

** Managed disks for online storage
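
As a concrete sketch of the S3-then-Glacier pattern noted above (the bucket name and prefix are hypothetical), output can be pushed to S3 during the run and aged off to Glacier with a lifecycle rule:

  # copy finished output to S3 while the run is in progress
  aws s3 sync /data/wrfout/ s3://ladco-wrf-2016/wrfout/
  # lifecycle rule: transition objects under the prefix to Glacier after 30 days
  aws s3api put-bucket-lifecycle-configuration --bucket ladco-wrf-2016 \
    --lifecycle-configuration '{"Rules": [{"ID": "to-glacier", "Status": "Enabled", "Filter": {"Prefix": "wrfout/"}, "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}]}]}'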
 
 
 
=== Data Transfer Analysis ===
 
* Estimates based on ~5.8 TB of output

* AWS

** Internet transfer would cost ~$928 for ~5.5 TB

** Snowball: ~10 days to get the data off-site on a physical device; ~$200 for the entire WRF run (smallest device was 50 TB)
 
* Azure
 
** Online transfer
 
** Data Box option (similar to Snowball)
 
 
 
=== Cluster Management Tools (interface analysis) ===
 
* 3-4 of the tools evaluated seemed to work best across several cloud solutions

* Alces Flight (works on AWS and Azure): used it to bring up 40 nodes and set up a Torque queuing system; had trouble using an AMI, and you need to pay for an AMI with this solution; can use Docker if we want to use containers, but Ramboll is not positioned to use containers for this project

* CfnCluster: development had slowed, but it has been reincarnated as AWS ParallelCluster with improved tools and is available in the Python Package Index (it can be installed with pip); it lets you spin everything up from the command line and can be scripted

* Haven't yet explored AWS ParallelCluster/CfnCluster in detail; similar to experience with StarCluster; it seems to be the best solution because you can use your own custom AMI, and instance types are independent of the cluster management tool
 
 
 
=== Next Steps ===
 
* LADCO to create a WRF AMI on AWS: WRF 3.9.1, netCDF4 with compression, MPICH2, PGI compiler, AMET
 
* LADCO to create a login for Ramboll in our AWS organization
 
* Ramboll to explore AWS Parallel cluster and then prototype with LADCO WRF AMI
 
* Next call 12/5 @ 3 Central
 
 
 
= Recommendations =
 
== WRF ==
 
* Use netCDF4 with compression
 
* Use 8 cores per 5.5-day segment and submit all segments of the annual run to the cluster at once
 
== Cloud Service ==
 
* Costs are roughly equivalent between Azure and AWS, so use AWS because of familiarity

* Use one memory-optimized instance (EC2 r5.2xlarge: 8 cores, 64 GB RAM) for each segment

* Use S3 Standard storage for the lifetime of the project and migrate to S3 Infrequent Access or Glacier for long-term storage

* Use Snowball to transfer the completed project to the local site
 
 
 
== HPC Platforms ==
 
* Use AWS ParallelCluster (formerly CfnCluster)
 
** Provides a command line interface, allowing for Linux shell-script automation
 
** Allows for custom AMIs
 
** Provides a variety of schedulers: sge, torque, slurm, or awsbatch
 
** Is actively being developed and enhanced
 
** Additional investigation/testing with WRF/CAMx test cases is needed to verify tool integrity and performance

* Other HPC cluster management tools have demonstrated issues

** StarCluster: problematic auto-scaling; outdated and inactive

** Alces Flight: fee-based ability to use custom AMIs; problems with auto-scaling for large instance counts
 
 
 
= WRF on AWS: LADCO User's Guide =

Last Update: 16May2019

How to configure/optimize AWS for running WRF.

== Summary ==
* AWS pcluster v2.1.0 instance using ALinux
* WRFv3.9.1 compiled with netCDF4 (compression)
* PGI compiler 2018 with OpenMPI v3.1.3
* NetCDF C 4.6.2, Fortran 4.2
* Spot instances with sc1 cold storage volumes

AWS ParallelCluster builds a computing cluster with a constantly running master node that is used to launch jobs on compute nodes. The compute nodes are started only when a job is initiated, and they shut down after a default of 10 minutes of idle time (the idle timeout can be changed through the pcluster config file; see the sketch after the spot config below). The system lets you choose different instance types for the master and compute nodes. We chose an inexpensive master instance (c4.large: 2 CPU, 3.75 GB RAM) and compute-optimized compute instances (c4.4xlarge: 16 CPU, 30 GB RAM). We attached a 10 TB EBS volume (sc1 Cold HDD) for storage.

For the workflow, we run an operational script system that downloads MADIS observations and runs WPS, REAL, and WRF. It also runs a script that replaces the NOAA SST with GLSEA SST. The simulation is run as multiple 32-CPU, 5.5-day segments submitted to the cluster at the same time (see the sketch below).

After WRF completes, we run MCIP and WRFCAMx and ingest the results into AMET. We archive the wrfout, MCIP, and WRFCAMx data to S3 Glacier.
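
To make the concurrent segment submission concrete, here is a minimal, hypothetical sketch (the wrapper script name, start dates, and SGE parallel environment are placeholders, not the actual LADCO scripts) of submitting every 5.5-day segment to the SGE scheduler at once:

  #!/bin/bash
  # Submit each 5.5-day segment as its own SGE job so the segments run concurrently.
  # run_wrf_segment.csh is a hypothetical wrapper that runs WPS/REAL/WRF for one segment.
  wrk_dir=/data/apps/WRFV3.9.1/sims/LADCO_2016_WRFv39_YNT_NAM
  for jstart in 2016001 2016006 2016011; do     # Julian start dates, one per segment
      # -pe: request 32 slots from the cluster's MPI parallel environment (name may differ)
      qsub -N WRF_${jstart} -pe mpi 32 ${wrk_dir}/run_wrf_segment.csh ${jstart}
  done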
  
 
== Notes ==

* No success with MPICH (v3.2.1) on AWS ALinux. Tried the GCC (7.2.1), PGI (2018), and Intel (xe 2019) compilers, and also WRFv4.0; nothing worked with MPICH. WRF compiles, but it was unstable and crashed consistently with segfaults after a seemingly random number of output timesteps.
* A stable WRF was built with OpenMPI; we ultimately settled on PGI 2018 and OpenMPI v3.1.3 (see the build sketch below).
* Prototyping and testing were done with EC2 on-demand m5a.* instances (see the on-demand pcluster config below) and gp2 EBS volumes.
* Production was done with EC2 spot c4.* instances (see the spot pcluster config below) and sc1 volumes.
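
For reference, a minimal sketch of the build environment corresponding to the notes above (install paths are placeholders; the exact configure option depends on the WRF version and compiler):

  # build WRF 3.9.1 with PGI + OpenMPI and compressed netCDF-4 output
  export NETCDF=/opt/netcdf            # netCDF C 4.6.2 / Fortran 4.2 (placeholder path)
  export HDF5=/opt/hdf5                # HDF5 1.10.1 (placeholder path)
  export NETCDF4=1                     # enable netCDF-4/HDF5 compressed I/O
  export PATH=/opt/openmpi-3.1.3/bin:$PATH
  cd WRFV3
  ./configure                          # select the PGI (dmpar) option when prompted
  ./compile em_real >& compile.log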
  
 
== AWS Parallel Cluster ==

* [https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html Install AWS-CLI]
* [https://aws.amazon.com/blogs/opensource/aws-parallelcluster/ Install Pcluster] (see the install sketch below)
* Don't set the spot price. When you do not set a spot price, AWS gives you the [https://aws.amazon.com/blogs/compute/new-amazon-ec2-spot-pricing/ spot market price capped at the on-demand price]. Setting a spot price makes the instance more prone to being reclaimed and having your job terminated. As there is currently no checkpoint/restart functionality automatically enabled on EC2 instances, losing an instance is a show-stopper for WRF production runs. For WRF MPI applications it is not worth playing the spot market if the tradeoff is instance reliability.
* Use the base alinux AMI for your version of pcluster; e.g., for v2.4.0: https://github.com/aws/aws-parallelcluster/blob/v2.4.0/amis.txt
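
A minimal sketch of installing the command line tools referenced above (standard pip package names; optionally pin the pcluster version to match the base AMI you choose):

  # install the AWS CLI and AWS ParallelCluster command line tools
  pip install --upgrade --user awscli
  pip install --upgrade --user aws-parallelcluster
  aws configure          # set credentials and a default region (e.g., us-east-2)
  pcluster version       # confirm the installed pcluster version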
Configure the cluster with a config file:

=== spot pcluster config ===

  [aws]
  aws_region_name = us-east-2
  
  [cluster ladcospot]
  vpc_settings = public
  ebs_settings = ladcosc1
  scheduler = sge
  master_instance_type = c4.large
  compute_instance_type = c4.4xlarge
  placement = compute
  placement_group = DYNAMIC
  master_root_volume_size = 40
  cluster_type = spot
  #spot_price = 0.2
  base_os = alinux
  key_name = *****
  # Base AMI for pcluster v2.1.0
  custom_ami = ami-0381cb7486cdc973f
  
  # Create a cold storage I/O directory
  [ebs ladcosc1]
  shared_dir = data
  volume_type = sc1
  volume_size = 10000
  volume_iops = 1500
  encrypted = false
  
  [vpc public]
  master_subnet_id = subnet-******
  vpc_id = vpc-******
  
  [global]
  update_check = true
  sanity_check = true
  cluster_template = ladcospot
  
  [aliases]
  ssh = ssh -Y {CFN_USER}@{MASTER_IP} {ARGS}
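
The 10-minute compute-node idle shutdown mentioned in the Summary is controlled by the cluster's scaling settings; a hypothetical tweak (assuming the ParallelCluster 2.x scaling_settings/scaledown_idletime keys) added to config.spot would look like:

  # in the [cluster ladcospot] section add: scaling_settings = custom
  [scaling custom]
  scaledown_idletime = 20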
  
=== on-demand pcluster config ===

  [aws]
  aws_region_name = us-east-2
  
  [cluster ladcowrf]
  vpc_settings = public
  ebs_settings = ladcowrf
  scheduler = sge
  master_instance_type = m4.large
  compute_instance_type = m5a.4xlarge
  placement = cluster
  placement_group = DYNAMIC
  master_root_volume_size = 40
  cluster_type = ondemand
  base_os = alinux
  key_name = *****
  min_vcpus = 0
  max_vcpus = 64
  desired_vcpus = 0
  # Base AMI for pcluster v2.1.0
  custom_ami = ami-0381cb7486cdc973f
  
  [ebs ladcowrf]
  shared_dir = data
  volume_type = gp2
  volume_size = 10000
  volume_iops = 1500
  encrypted = false
  
  [vpc public]
  master_subnet_id = subnet-******
  vpc_id = vpc-******
  
  [global]
  update_check = true
  sanity_check = true
  cluster_template = ladcowrf
  
  [aliases]
  ssh = ssh -Y {CFN_USER}@{MASTER_IP} {ARGS}

=== Cluster Access ===

Start the cluster:

  pcluster create -c config.spot ladcospot

Log in to the cluster:

  pcluster ssh ladcospot -i {path to your AWS key file}
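
Other day-to-day pcluster commands (ParallelCluster 2.x) that are useful with the cluster defined above:

  pcluster status ladcospot     # check cluster/CloudFormation state
  pcluster list                 # list clusters in this account and region
  pcluster stop ladcospot       # stop the compute fleet (the master keeps running)
  pcluster delete ladcospot     # tear the cluster down when the work is done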

=== Fault Tolerance ===

Script to monitor the EC2 instance metadata for a spot termination notice and restart WRF. It polls the spot termination-time endpoint; once a termination timestamp appears, it breaks out of the loop and resubmits the WRF restart job (the qsub line is left commented out here as a placeholder).

  #!/bin/bash
  # Poll the EC2 metadata service for a spot termination notice and
  # resubmit the WRF restart job when one is detected.
  CASE=LADCO_2016_WRFv39_YNT_NAM
  JSTART=2016095
  wrk_dir=/data/apps/WRFV3.9.1/sims/${CASE}
  while true
  do
    # The termination-time endpoint only returns a timestamp (e.g., 2019-05-16T17:00:00Z)
    # when the instance has been marked for reclamation.
    if curl -s http://169.254.169.254/latest/meta-data/spot/termination-time | grep -q 'T.*Z'; then
       echo "Spot termination notice received"
       break
    else
       echo "Still running fine"
       sleep 3
    fi
  done
  echo "Restarting WRF job"
  #qsub -N WRF_rest $wrk_dir/wrapper_restart_wrf.csh $JSTART 6
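
One way to use the watcher (assuming it is saved as, e.g., spot_watch.sh, a hypothetical name, on the node running the WRF job) is to launch it in the background so it survives the login session:

  nohup ./spot_watch.sh > spot_watch.log 2>&1 &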
  
 
== AWS Configuration ==

Packages/software installed:

* PGI 2018
* GCC and Gfortran 7.2.1
* NetCDF C 4.6.2, Fortran 4.2
* HDF5 1.10.1
* JASPER 1.900.2
* ZLIB 1.2.11
* R 3.4.1
* OpenMPI 3.1.3
* yum -y install screen dstat htop strace perf pdsh ImageMagick

== Misc Notes ==

* [https://support.amimoto-ami.com/english/self-hosting-accounts/increasing-your-ec2-volume-size Instructions for adding space to your data volume on an in-use instance]
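
Related to the link above, a minimal sketch of growing an attached data volume in place (the volume ID, size, and device name are placeholders; use resize2fs for ext4 or xfs_growfs for XFS):

  # grow the EBS volume from the AWS CLI (size in GiB)
  aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --size 12000
  # once the modification completes, resize the filesystem on the shared data volume
  sudo resize2fs /dev/xvdb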
 
