WRF on the Cloud

From LADCO Wiki

Objectives

LADCO is seeking to understand the best practices for submitting and managing multiprocessor computing jobs on a cloud computing platform. In particular, LADCO would like to develop a WRF production environment that utilizes cloud-based computing. The goal of this project is to prototype a WRF production environment on a public, on-demand high performance computing service in the cloud to create a WRF platform-as-a-service (PaaS) solution. The WRF PaaS must meet the following objectives:

  • Configurable computing and storage to scale, as needed, to meet the needs of different WRF applications
  • Configurable WRF options to enable changing grids, simulation periods, physics options, and input data
  • Flexible cloud deployment from a command line interface to initiate computing clusters and spawn WRF jobs in the cloud

Call Notes

November 28, 2018

WRF Benchmarking

  • Emulating WRF 2016 12/4/1.3 grids
  • Purpose: estimate costs for CPUs, RAM, and storage
  • CPU: an 8-core run completes a 5.5-day segment in ~4 days of wall-clock time; 24 cores takes ~3 days
  • RAM: ~22 GB per run (~2.5 GB/core)
  • Storage
    • tested netCDF4 and classic netCDF with no compression
    • compression saves a lot of space: output is about 1/3 the size of the uncompressed netCDF (~70% reduction)
    • downstream programs need to be linked against HDF5 and netCDF4 libraries built with compression support (see the sketch after this list)
    • estimate is about 5.8 TB for the year with compression, rising to 16.9 TB without it
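
A minimal sketch of the two compression paths being referenced, assuming the NETCDF4 build-time switch documented for recent WRF releases and the standard nccopy utility; file names and the deflate level are illustrative only:

# Enable netCDF-4/HDF5 compressed output at WRF build time (set before ./configure)
setenv NETCDF4 1

# Or compress existing classic-netCDF output afterwards with nccopy
# (-k nc4 converts to netCDF-4, -d 4 sets the deflate level)
nccopy -k nc4 -d 4 wrfout_d01_2016-06-10_00:00:00 wrfout_d01_2016-06-10_00:00:00.nc4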

Conceptual Approach to WRF on the Cloud

  • Cluster management would launch a head node and compute nodes
  • 77 chunks of 5.5 days each; at ~4 wall-clock days per chunk that is ~308 machine-days, i.e., 20 machines for ~16 days (or 80 machines for ~4 days)
  • Head node running constantly
  • Compute nodes running over the length of the project
  • Memory optimized machines performed better than compute optimized for CAMx

Cost Analysis

Storage Analysis

  • AWS
    • Don't want to use instance-local storage because the data would need to be moved/migrated
    • Put the data on a storage service (S3) while running, and then push it off to longer-term storage (Glacier); see the lifecycle sketch after this list
    • Glacier is archival storage: access requests must be submitted through the console, with response times listed as 1-5 minutes
  • Azure
    • Fast and slower data lake storage tiers for offline storage
    • Managed disks for online storage
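
One way to automate the S3-to-Glacier hand-off described above is an S3 lifecycle rule; the bucket name, prefix, and 30-day transition below are placeholders rather than project settings:

# lifecycle.json (placeholder rule): transition objects under wrfout/ to Glacier after 30 days
#   {"Rules": [{"ID": "wrf-to-glacier", "Status": "Enabled",
#               "Filter": {"Prefix": "wrfout/"},
#               "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}]}]}

# Stage model output to S3 while the run is active, then let the rule archive it
aws s3 sync /data/wrfout s3://ladco-wrf-output/wrfout/
aws s3api put-bucket-lifecycle-configuration --bucket ladco-wrf-output \
    --lifecycle-configuration file://lifecycle.json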

Data Transfer Analysis

  • estimate based on ~5.8 TB of output
  • AWS
    • Internet transfer would cost ~$928 for ~5.5 TB
    • Snowball: ~10 days to get the data off a disk; costs ~$200 for the entire WRF run (smallest appliance was 50 TB)
  • Azure
    • Online transfer
    • Databox option (like snowball)

Cluster Management Tools (interface analysis)

  • Of the tools evaluated, 3-4 seemed to work best across several cloud solutions
  • Alces Flight (works on AWS and Azure): used to bring up 40 nodes with a Torque queuing system; had trouble using a custom AMI, and a paid AMI is required with this solution; Docker can be used for containers, but Ramboll is not positioned to use containers for this project
  • CfnCluster: development had slowed, but it has been reincarnated as AWS ParallelCluster, with improved tooling and availability in the Python Package Index (installable with pip); it lets you spin everything up from the command line, so it could be scripted (see the sketch after this list)
  • Haven't yet explored AWS ParallelCluster/CfnCluster in detail; similar to the experience with StarCluster; it seems to be the best solution because you can use your own custom AMI; instance types are independent of the cluster management tool
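
A minimal sketch of the command-line workflow described above, assuming the aws-parallelcluster package on PyPI and a config file like the ones shown later on this page (the key path is a placeholder):

# Install the ParallelCluster CLI and walk through an initial config interactively
pip install aws-parallelcluster
pcluster configure

# Bring up a cluster from a saved config, then log in and submit WRF jobs to the scheduler
pcluster create -c config.ladcowrf ladcowrf
pcluster ssh ladcowrf -i ~/.ssh/my-aws-key.pem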

Next Steps

  • LADCO to create a WRF AMI on AWS: WRF 3.9.1, netCDF4 with compression, MPICH2, PGI compiler, AMET
  • LADCO to create a login for Ramboll in our AWS organization
  • Ramboll to explore AWS Parallel cluster and then prototype with LADCO WRF AMI
  • Next call 12/5 @ 3 Central

Ramboll Recommendations

WRF

  • Use netCDF4 with compression
  • Use 8 cores per 5.5-day segment and submit all segments for the annual run to the cluster at once (see the job-script sketch below)
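
A minimal sketch of what submitting one segment to the cluster's SGE scheduler might look like; the job name, parallel-environment name, and log file are placeholders, not project settings:

#!/bin/csh -f
#$ -N wrf_seg01
#$ -pe mpi 8
#$ -cwd
# Run one 5.5-day WRF segment on 8 cores with OpenMPI
mpirun -np 8 ./wrf.exe >& wrf_seg01.log

Each of the 77 segments would then be submitted with its own qsub call (e.g., qsub run_wrf_seg01.csh) so the scheduler can pack them onto the compute fleet.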

Cloud Service

  • Costs are equivalent between Azure and AWS, so use AWS because of familiarity
  • Use one memory-optimized instance (EC2 r5.2xlarge: 8 cores, 64 GB RAM) for each segment
  • Use S3 Standard storage for the lifetime of the project and migrate to S3 Infrequent Access or Glacier for long-term storage
  • Use Snowball to transfer completed project to local site

HPC Platforms

  • Use AWS ParallelCluster (formerly CfnCluster)
    • Provides a command line interface, allowing for Linux shell-script automation
    • Allows for custom AMIs
    • Provides a variety of schedulers: sge, torque, slurm, or awsbatch
    • Is actively being developed and enhanced
    • Additional investigation/test of WRF/CAMx test cases needed to verify tool integrity and performance
  • Other HPC platforms have demonstrated issues
    • StarCluster: problematic auto-scaling; outdated and inactive
    • Alces Flight: custom AMIs require a fee; problems with auto-scaling at large instance counts

WRF on AWS: LADCO User's Guide

Last Update: 16 May 2019. How to configure/optimize AWS for running WRF.

Summary

  • AWS pcluster v2.1.0 instance using ALinux
  • WRFv3.9.1 compiled with netCDF4 (compression)
  • PGI compiler 2018 with OpenMPI v3.1.3
  • NetCDF C 4.6.2, Fortran 4.2
  • Spot instances with sc1 cold storage volumes

Notes

  • No success with MPICH (v3.2.1) on AWS ALinux. Tried the GCC (7.2.1), PGI (2018), and Intel (XE 2019) compilers, and also tried WRFv4.0; nothing worked with MPICH. WRF compiles, but the executable was unstable and crashed consistently with segfaults after a seemingly random number of output timesteps
  • A stable WRF was obtained with OpenMPI; ultimately settled on PGI 2018 with OpenMPI v3.1.3
  • Prototyping and testing were done with EC2 on-demand m5a.* instances (see the ondemand pcluster config below) and gp2 EBS volumes
  • Production was done with EC2 spot c4.* instances (see the spot pcluster config below) and sc1 volumes

AWS Parallel Cluster

Configure the cluster with a config file:

spot pcluster config

[aws]
aws_region_name = us-east-2
 
[cluster ladcospot]
vpc_settings = public
ebs_settings = ladcosc1
scheduler = sge
master_instance_type = c4.large
compute_instance_type = c4.4xlarge
placement = compute
placement_group = DYNAMIC
master_root_volume_size = 40
cluster_type = spot
spot_price = 0.2
base_os = alinux
key_name = *****
# Base AMI for pcluster v2.1.0
custom_ami = ami-0381cb7486cdc973f
 
# Create a cold storage I/O directory 
[ebs ladcosc1]
shared_dir = data
volume_type = sc1
volume_size = 10000
volume_iops = 1500
encrypted = false

[vpc public]
master_subnet_id = subnet-******
vpc_id = vpc-******

[global]
update_check = true
sanity_check = true
cluster_template = ladcospot

[aliases]
ssh = ssh -Y {CFN_USER}@{MASTER_IP} {ARGS}

ondemand pcluster config

[aws]
aws_region_name = us-east-2
 
[cluster ladcowrf]
vpc_settings = public
ebs_settings = ladcowrf
scheduler = sge
master_instance_type = m4.large
compute_instance_type = m5a.4xlarge
placement = cluster
placement_group = DYNAMIC
master_root_volume_size = 40
cluster_type = ondemand
base_os = alinux
key_name = *****
min_vcpus = 0
max_vcpus = 64
desired_vcpus = 0
# Base AMI for pcluster v2.1.0
custom_ami = ami-0381cb7486cdc973f
 
[ebs ladcowrf]
shared_dir = data
volume_type = gp2
volume_size = 10000
volume_iops = 1500
encrypted = false

[vpc public]
master_subnet_id = subnet-******
vpc_id = vpc-******

[global]
update_check = true
sanity_check = true
cluster_template = ladcowrf

[aliases]
ssh = ssh -Y {CFN_USER}@{MASTER_IP} {ARGS}

Start the cluster

 pcluster create -c config.ladcowrf ladcowrf

Log in to the cluster

pcluster ssh ladcowrf -i {path to your AWS key file}
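
When a segment or the whole project is finished, the same CLI can check on and tear down the cluster (a sketch, using the cluster name from the config above):

# Check cluster state, stop the compute fleet, or delete the cluster entirely
pcluster status ladcowrf
pcluster stop ladcowrf
pcluster delete ladcowrf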

AWS Configuration

Packages/software installed (a build-order sketch follows the list):

  • PGI 2018
  • GCC and Gfortran 7.2.1
  • NetCDF C 4.6.2, Fortran 4.2
  • HDF5 1.10.1
  • JASPER 1.900.2
  • ZLIB 1.2.11
  • R 3.4.1
  • OpenMPI 3.1.3
  • yum -y install screen dstat htop strace perf pdsh ImageMagick
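
A minimal sketch of the order and configure flags that might be used to build this I/O stack with the PGI compilers; the install prefix and source directory names are placeholders, and flags should be checked against each library's own documentation:

# Build zlib -> HDF5 -> netCDF-C -> netCDF-Fortran with PGI, all into one prefix
setenv LIBDIR /usr/local/wrf-libs
setenv CC pgcc
setenv FC pgfortran
setenv CXX pgc++

cd zlib-1.2.11 && ./configure --prefix=$LIBDIR && make install && cd ..
cd hdf5-1.10.1 && ./configure --prefix=$LIBDIR --with-zlib=$LIBDIR --enable-fortran && make install && cd ..
cd netcdf-c-4.6.2 && env CPPFLAGS=-I$LIBDIR/include LDFLAGS=-L$LIBDIR/lib ./configure --prefix=$LIBDIR --enable-netcdf-4 && make install && cd ..
cd netcdf-fortran-4.2 && env CPPFLAGS=-I$LIBDIR/include LDFLAGS=-L$LIBDIR/lib ./configure --prefix=$LIBDIR && make install && cd ..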

Misc Notes

Increase size of in-use volume
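
Increasing an attached, in-use volume is typically a two-step process: grow the EBS volume, then grow the partition and filesystem on the instance. A sketch with placeholder volume ID, device names, and size (confirm the device layout with lsblk first):

# Grow the EBS volume (can also be done from the console: Actions > Modify Volume)
aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --size 200

# On the instance: extend the partition (if the filesystem sits on one) and then the ext4 filesystem
lsblk
sudo growpart /dev/xvda 1
sudo resize2fs /dev/xvda1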

Add New Volume to Running Instance

From the AWS Console

  • Go to Volumes
  • Create a new Volume
  • Under the Actions menu, select Attach Volume
  • Select the instance to which to attach the new volume

From the AWS Instance

  • Check that the volume is available
lsblk
  • Confirm that the volume is empty (assuming the volume is attached as /dev/xvdf)
sudo file -s /dev/xvdf

If this command returns the following output, the volume is empty (it contains no filesystem) and can be formatted.

/dev/xvdf: data
  • Format the volume to an ext4 filesystem
sudo mkfs -t ext4 /dev/xvdf
  • Create a new directory and mount the volume
sudo mkdir /newdata
sudo mount /dev/xvdf /newdata
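
To have the new volume remount automatically after a reboot, an entry can be added to /etc/fstab (a sketch; the device name works, though the UUID reported by blkid is more robust):

# Append an fstab entry so /newdata is mounted at boot; nofail avoids boot failures if the volume is detached
echo '/dev/xvdf /newdata ext4 defaults,nofail 0 2' | sudo tee -a /etc/fstab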

Copy output files from EC2 volume to S3 Glacier

#!/bin/csh -f

set PROJECT = LADCO_2016_WRFv39_YNT_GFS

# Create a new storage vault
aws glacier create-vault --vault-name $PROJECT --account-id -

# Add tags to describe vault (10 tags max)
aws glacier add-tags-to-vault --account-id - --vault-name $PROJECT --tags model="WRFv3.9.1",simyear=2016,stdate=20160610,endate=20160619,awsinst=ec2-ondemand,desc="LADCO YNT GFS Test Run"
 
# Upload files to the storage vault
set datadir = /data/wrf3.9.1/${PROJECT}/output_full/wrf_out/2016
cd $datadir
foreach f ( *wrfout* )
   echo "Copying $f"
   aws glacier upload-archive --account-id - --vault-name $PROJECT --body $f
end

# Remove files on ec2 after the files are all uploaded
# rm -f $datadir/*wrfout*
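
Retrieving archives from Glacier later requires their archive IDs, so it may be worth capturing the JSON returned by each upload-archive call; a sketch of requesting the vault inventory, which lists all archive IDs, is:

# Request the vault inventory; fetch the result later with aws glacier get-job-output
aws glacier initiate-job --account-id - --vault-name $PROJECT \
    --job-parameters '{"Type": "inventory-retrieval"}'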