WRF on the Cloud

From LADCO Wiki

Objectives

LADCO is seeking to understand the best practices for submitting and managing multiprocessor computing jobs on a cloud computing platform. In particular, LADCO would like to develop a WRF production environment that utilizes cloud-based computing. The goal of this project is to prototype a WRF production environment on a public, on-demand high performance computing service in the cloud to create a WRF platform-as-a-service (PaaS) solution. The WRF PaaS must meet the following objectives:

  • Configurable computing and storage to scale, as needed, to meet the needs of different WRF applications
  • Configurable WRF options to enable changing grids, simulation periods, physics options, and input data
  • Flexible cloud deployment from a command line interface to initiate computing clusters and spawn WRF jobs in the cloud

Call Notes

November 28, 2018

WRF Benchmarking

  • Emulating WRF 2016 12/4/1.3 grids
  • Purpose: estimate costs for CPUs, RAM, and storage
  • CPU: an 8-core run completes a 5.5-day segment in ~4 days of wall-clock time; 24 cores takes ~3 days
  • RAM: ~22 GB per run (~2.5 GB/core)
  • Storage
    • tested netCDF4 and classic netCDF with no compression
    • compression saves a lot of space: output is about 1/3 the size of the uncompressed netCDF (~70% reduction)
    • downstream programs need to be linked against HDF5 and netCDF4 libraries built with compression support (see the sketch after this list)
    • estimate is about 5.8 TB for the year with compression, rising to 16.9 TB without it
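
A minimal sketch of the two compression paths being referenced, assuming the NETCDF4 build-time switch documented for recent WRF releases and the standard nccopy utility; file names and the deflate level are illustrative only:

# Enable netCDF-4/HDF5 compressed output at WRF build time (set before ./configure)
setenv NETCDF4 1

# Or compress existing classic-netCDF output afterwards with nccopy
# (-k nc4 converts to netCDF-4, -d 4 sets the deflate level)
nccopy -k nc4 -d 4 wrfout_d01_2016-06-10_00:00:00 wrfout_d01_2016-06-10_00:00:00.nc4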

Conceptual Approach to WRF on the Cloud

  • Cluster management would launch a head node and compute nodes
  • 77 chunks of 5.5 days each; at ~4 wall-clock days per chunk that is ~308 machine-days, i.e., 20 machines for ~16 days (or 80 machines for ~4 days)
  • Head node running constantly
  • Compute nodes running over the length of the project
  • Memory optimized machines performed better than compute optimized for CAMx

Cost Analysis

Storage Analysis

  • AWS
    • Don't want to use instance-local storage because the data would need to be moved/migrated
    • Put the data on a storage service (S3) while running, and then push it off to longer-term storage (Glacier); see the lifecycle sketch after this list
    • Glacier is archival storage: access requests must be submitted through the console, with response times listed as 1-5 minutes
  • Azure
    • Fast and slower data lake storage tiers for offline storage
    • Managed disks for online storage
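
One way to automate the S3-to-Glacier hand-off described above is an S3 lifecycle rule; the bucket name, prefix, and 30-day transition below are placeholders rather than project settings:

# lifecycle.json (placeholder rule): transition objects under wrfout/ to Glacier after 30 days
#   {"Rules": [{"ID": "wrf-to-glacier", "Status": "Enabled",
#               "Filter": {"Prefix": "wrfout/"},
#               "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}]}]}

# Stage model output to S3 while the run is active, then let the rule archive it
aws s3 sync /data/wrfout s3://ladco-wrf-output/wrfout/
aws s3api put-bucket-lifecycle-configuration --bucket ladco-wrf-output \
    --lifecycle-configuration file://lifecycle.json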

Data Transfer Analysis

  • estimate based on ~5.8 TB of output
  • AWS
    • Internet transfer would cost ~$928 for ~5.5 TB
    • Snowball: ~10 days to get the data off a disk; costs ~$200 for the entire WRF run (smallest appliance was 50 TB)
  • Azure
    • Online transfer
    • Databox option (like snowball)

Cluster Management Tools (interface analysis)

  • Of the tools evaluated, 3-4 seemed to work best across several cloud solutions
  • Alces Flight (works on AWS and Azure): used to bring up 40 nodes with a Torque queuing system; had trouble using a custom AMI, and a paid AMI is required with this solution; Docker can be used for containers, but Ramboll is not positioned to use containers for this project
  • CfnCluster: development had slowed, but it has been reincarnated as AWS ParallelCluster, with improved tooling and availability in the Python Package Index (installable with pip); it lets you spin everything up from the command line, so it could be scripted (see the sketch after this list)
  • Haven't yet explored AWS ParallelCluster/CfnCluster in detail; similar to the experience with StarCluster; it seems to be the best solution because you can use your own custom AMI; instance types are independent of the cluster management tool
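
A minimal sketch of the command-line workflow described above, assuming the aws-parallelcluster package on PyPI and a config file like the ones shown later on this page (the key path is a placeholder):

# Install the ParallelCluster CLI and walk through an initial config interactively
pip install aws-parallelcluster
pcluster configure

# Bring up a cluster from a saved config, then log in and submit WRF jobs to the scheduler
pcluster create -c config.ladcowrf ladcowrf
pcluster ssh ladcowrf -i ~/.ssh/my-aws-key.pem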

Next Steps

  • LADCO to create a WRF AMI on AWS: WRF 3.9.1, netCDF4 with compression, MPICH2, PGI compiler, AMET
  • LADCO to create a login for Ramboll in our AWS organization
  • Ramboll to explore AWS Parallel cluster and then prototype with LADCO WRF AMI
  • Next call 12/5 @ 3 Central

Ramboll Recommendations

WRF

  • Use netCDF4 with compression
  • Use 8 cores per 5.5-day segment and submit all segments for the annual run to the cluster at once (see the job-script sketch below)
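
A minimal sketch of what submitting one segment to the cluster's SGE scheduler might look like; the job name, parallel-environment name, and log file are placeholders, not project settings:

#!/bin/csh -f
#$ -N wrf_seg01
#$ -pe mpi 8
#$ -cwd
# Run one 5.5-day WRF segment on 8 cores with OpenMPI
mpirun -np 8 ./wrf.exe >& wrf_seg01.log

Each of the 77 segments would then be submitted with its own qsub call (e.g., qsub run_wrf_seg01.csh) so the scheduler can pack them onto the compute fleet.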

Cloud Service

  • Costs are equivalent between Azure and AWS, so use AWS because of familiarity
  • Use one memory-optimized instance (EC2 r5.2xlarge: 8 cores, 64 GB RAM) for each segment
  • Use S3 Standard storage for the lifetime of the project and migrate to S3 Infrequent Access or Glacier for long-term storage
  • Use Snowball to transfer completed project to local site

HPC Platforms

  • Use AWS ParallelCluster (formerly CfnCluster)
    • Provides a command line interface, allowing for Linux shell-script automation
    • Allows for custom AMIs
    • Provides a variety of schedulers: sge, torque, slurm, or awsbatch
    • Is actively being developed and enhanced
    • Additional investigation/test of WRF/CAMx test cases needed to verify tool integrity and performance
  • Other HPC platforms have demonstrated issues
    • StarCluster: problematic auto-scaling; outdated and inactive
    • Alces Flight: custom AMIs require a fee; problems with auto-scaling at large instance counts

WRF on AWS: LADCO User's Guide

Last Update: 16 May 2019. How to configure/optimize AWS for running WRF.

Summary

  • AWS pcluster v2.1.0 instance using ALinux
  • WRFv3.9.1 compiled with netCDF4 (compression)
  • PGI compiler 2018 with OpenMPI v3.1.3
  • NetCDF C 4.6.2, Fortran 4.2
  • Spot instances with sc1 cold storage volumes

Notes

  • No success with MPICH (v3.2.1) on AWS ALinux. Tried the GCC (7.2.1), PGI (2018), and Intel (XE 2019) compilers, and also tried WRFv4.0; nothing worked with MPICH. WRF compiles, but the executable was unstable and crashed consistently with segfaults after a seemingly random number of output timesteps
  • A stable WRF was obtained with OpenMPI; ultimately settled on PGI 2018 with OpenMPI v3.1.3
  • Prototyping and testing were done with EC2 on-demand m5a.* instances (see the ondemand pcluster config below) and gp2 EBS volumes
  • Production was done with EC2 spot c4.* instances (see the spot pcluster config below) and sc1 volumes

AWS Parallel Cluster

Configure the cluster with a config file:

spot pcluster config

[aws]
aws_region_name = us-east-2
 
[cluster ladcospot]
vpc_settings = public
ebs_settings = ladcosc1
scheduler = sge
master_instance_type = c4.large
compute_instance_type = c4.4xlarge
placement = compute
placement_group = DYNAMIC
master_root_volume_size = 40
cluster_type = spot
spot_price = 0.2
base_os = alinux
key_name = *****
# Base AMI for pcluster v2.1.0
custom_ami = ami-0381cb7486cdc973f
 
# Create a cold storage I/O directory 
[ebs ladcosc1]
shared_dir = data
volume_type = sc1
volume_size = 10000
volume_iops = 1500
encrypted = false

[vpc public]
master_subnet_id = subnet-******
vpc_id = vpc-******

[global]
update_check = true
sanity_check = true
cluster_template = ladcospot

[aliases]
ssh = ssh -Y {CFN_USER}@{MASTER_IP} {ARGS}

ondemand pcluster config

[aws]
aws_region_name = us-east-2
 
[cluster ladcowrf]
vpc_settings = public
ebs_settings = ladcowrf
scheduler = sge
master_instance_type = m4.large
compute_instance_type = m5a.4xlarge
placement = cluster
placement_group = DYNAMIC
master_root_volume_size = 40
cluster_type = ondemand
base_os = alinux
key_name = *****
min_vcpus = 0
max_vcpus = 64
desired_vcpus = 0
# Base AMI for pcluster v2.1.0
custom_ami = ami-0381cb7486cdc973f
 
[ebs ladcowrf]
shared_dir = data
volume_type = gp2
volume_size = 10000
volume_iops = 1500
encrypted = false

[vpc public]
master_subnet_id = subnet-******
vpc_id = vpc-******

[global]
update_check = true
sanity_check = true
cluster_template = ladcowrf

[aliases]
ssh = ssh -Y {CFN_USER}@{MASTER_IP} {ARGS}

Start the cluster

 pcluster create -c config.ladcowrf ladcowrf

Log in to the cluster

pcluster ssh ladcowrf -i {path to your AWS key file}
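
When a segment or the whole project is finished, the same CLI can check on and tear down the cluster (a sketch, using the cluster name from the config above):

# Check cluster state, stop the compute fleet, or delete the cluster entirely
pcluster status ladcowrf
pcluster stop ladcowrf
pcluster delete ladcowrf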

AWS Configuration

Packages/software installed (a build-order sketch follows the list):

  • PGI 2018
  • GCC and Gfortran 7.2.1
  • NetCDF C 4.6.2, Fortran 4.2
  • HDF5 1.10.1
  • JASPER 1.900.2
  • ZLIB 1.2.11
  • R 3.4.1
  • OpenMPI 3.1.3
  • yum -y install screen dstat htop strace perf pdsh ImageMagick
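
A minimal sketch of the order and configure flags that might be used to build this I/O stack with the PGI compilers; the install prefix and source directory names are placeholders, and flags should be checked against each library's own documentation:

# Build zlib -> HDF5 -> netCDF-C -> netCDF-Fortran with PGI, all into one prefix
setenv LIBDIR /usr/local/wrf-libs
setenv CC pgcc
setenv FC pgfortran
setenv CXX pgc++

cd zlib-1.2.11 && ./configure --prefix=$LIBDIR && make install && cd ..
cd hdf5-1.10.1 && ./configure --prefix=$LIBDIR --with-zlib=$LIBDIR --enable-fortran && make install && cd ..
cd netcdf-c-4.6.2 && env CPPFLAGS=-I$LIBDIR/include LDFLAGS=-L$LIBDIR/lib ./configure --prefix=$LIBDIR --enable-netcdf-4 && make install && cd ..
cd netcdf-fortran-4.2 && env CPPFLAGS=-I$LIBDIR/include LDFLAGS=-L$LIBDIR/lib ./configure --prefix=$LIBDIR && make install && cd ..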

Misc Notes

Increase size of in-use volume
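
Increasing an attached, in-use volume is typically a two-step process: grow the EBS volume, then grow the partition and filesystem on the instance. A sketch with placeholder volume ID, device names, and size (confirm the device layout with lsblk first):

# Grow the EBS volume (can also be done from the console: Actions > Modify Volume)
aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --size 200

# On the instance: extend the partition (if the filesystem sits on one) and then the ext4 filesystem
lsblk
sudo growpart /dev/xvda 1
sudo resize2fs /dev/xvda1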

Add New Volume to Running Instance

From the AWS Console

  • Go to Volumes
  • Create a new Volume
  • Under the Actions menu, select Attach Volume
  • Select the instance to which to attach the new volume

From the AWS Instance

  • Check that the volume is available
lsblk
  • Confirm that the volume is empty (assuming the volume is attached as /dev/xvdf)
sudo file -s /dev/xvdf

If this command returns the following output, the volume is empty (it contains no filesystem) and can be formatted.

/dev/xvdf: data
  • Format the volume to an ext4 filesystem
sudo mkfs -t ext4 /dev/xvdf
  • Create a new directory and mount the volume
sudo mkdir /newdata
sudo mount /dev/xvdf /newdata
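
To have the new volume remount automatically after a reboot, an entry can be added to /etc/fstab (a sketch; the device name works, though the UUID reported by blkid is more robust):

# Append an fstab entry so /newdata is mounted at boot; nofail avoids boot failures if the volume is detached
echo '/dev/xvdf /newdata ext4 defaults,nofail 0 2' | sudo tee -a /etc/fstab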

Copy output files from EC2 volume to S3 Glacier

#!/bin/csh -f

set PROJECT = LADCO_2016_WRFv39_YNT_GFS

# Create a new storage vault
aws glacier create-vault --vault-name $PROJECT --account-id -

# Add tags to describe vault (10 tags max)
aws glacier add-tags-to-vault --account-id - --vault-name $PROJECT --tags model="WRFv3.9.1",simyear=2016,stdate=20160610,endate=20160619,awsinst=ec2-ondemand,desc="LADCO YNT GFS Test Run"
 
# Upload files to the storage vault
set datadir = /data/wrf3.9.1/${PROJECT}/output_full/wrf_out/2016
cd $datadir
foreach f ( *wrfout* )
   echo "Copying $f"
   aws glacier upload-archive --account-id - --vault-name $PROJECT --body $f
end

# Remove files on ec2 after the files are all uploaded
# rm -f $datadir/*wrfout*
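
Retrieving archives from Glacier later requires their archive IDs, so it may be worth capturing the JSON returned by each upload-archive call; a sketch of requesting the vault inventory, which lists all archive IDs, is:

# Request the vault inventory; fetch the result later with aws glacier get-job-output
aws glacier initiate-job --account-id - --vault-name $PROJECT \
    --job-parameters '{"Type": "inventory-retrieval"}'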