WRF on the Cloud
Objectives
LADCO is seeking to understand best practices for submitting and managing multiprocessor computing jobs on a cloud computing platform. In particular, LADCO would like to develop a WRF production environment that uses cloud-based computing. The goal of this project is to prototype that environment on a public, on-demand high performance computing service in the cloud, creating a WRF platform-as-a-service (PaaS) solution. The WRF PaaS must meet the following objectives:
- Configurable computing and storage that scale, as needed, to meet the needs of different WRF applications
- Configurable WRF options to enable changing grids, simulation periods, physics options, and input data
- Flexible cloud deployment from a command line interface to initiate computing clusters and spawn WRF jobs in the cloud
Call Notes
November 28, 2018
WRF Benchmarking
- Emulating the 2016 WRF 12/4/1.3 km grids
- Purpose: estimating costs for CPU, RAM, and storage
- CPU: 8 cores completes a 5.5-day run in ~4 days of wall time; 24 cores in ~3 days
- RAM: ~22 GB/run (~2.5 GB/core)
- Storage
- Tested netCDF4 (compressed) against netCDF with no compression
- Compression saves a lot of space: output is about 1/3 the size of uncompressed netCDF (~70% compression)
- Downstream programs need to be linked against the HDF5 and netCDF4 libraries to read the compressed output
- Estimated ~5.8 TB for the year with compression; ~16.9 TB without (see the compression check sketch after this list)
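The compression savings can be spot-checked on a single output file with the nccopy utility that ships with netCDF; the sketch below assumes a typical wrfout file name, which is only a placeholder.

# Rewrite a classic-format wrfout file as compressed netCDF-4
# (-k 4 selects the netCDF-4 format, -d 4 sets deflate level 4, -s enables shuffling)
nccopy -k 4 -d 4 -s wrfout_d01_2016-01-01_00:00:00 wrfout_d01_2016-01-01_00:00:00.nc4

# Compare the file sizes before and after compression
ls -lh wrfout_d01_2016-01-01_00:00:00*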
Conceptual Approach to WRF on the Cloud
- Cluster management would launch a head node and compute nodes
- 77 5.5-day chunks: 20 compute nodes for ~16 days, or 80 nodes for ~4 days (see the submission sketch after this list)
- Head node runs constantly
- Compute nodes run only over the length of the project
- Memory-optimized machines performed better than compute-optimized machines for CAMx
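As a rough sketch of how the chunked annual run could be farmed out to the cluster's SGE scheduler, the loop below submits each segment as an independent 8-core job; the wrapper script name (run_wrf_segment.csh) and the parallel environment name (mpi) are placeholders, not part of the benchmark setup.

# Submit all 77 5.5-day segments to SGE at once, 8 cores each
# run_wrf_segment.csh is a hypothetical wrapper that stages the namelist
# and input files for one segment and launches wrf.exe under MPI
for SEG in $(seq -w 1 77); do
  qsub -pe mpi 8 -N wrf_seg${SEG} run_wrf_segment.csh ${SEG}
done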
Cost Analysis
Storage Analysis
- AWS
- Don't want to use local instance storage because the data would need to be moved/migrated
- Put the data on a storage appliance (S3) while running, then push it off to longer-term storage (Glacier); see the sketch after this list
- Glacier is archival storage; retrieval requests go through the console, with response times listed as 1-5 minutes
- Azure
- Fast and slower data lake storage tiers for offline storage
- Managed disks for online storage
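For the "S3 while running" piece, a minimal AWS CLI sketch is shown below; the bucket name is a placeholder, and /data is the shared EBS mount defined in the cluster configuration later on this page.

# Push completed wrfout files from the shared volume to S3 as segments finish
aws s3 sync /data/wrf/output s3://ladco-wrf-2016/output/ --exclude "*" --include "wrfout_*"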
Data Transfer Analysis
- Estimate based on the ~5.8 TB compressed annual output
- AWS
- Internet transfer would cost ~$928 for 5.8 TB
- Snowball: ~10 days to get the data off a disk; costs ~$200 for the entire WRF run (smallest appliance is 50 TB)
- Azure
- Online transfer
- Data Box option (similar to Snowball)
Cluster Management Tools (interface analysis)
- Three to four of the tools evaluated seemed to work best across several cloud solutions
- Alces Flight (works on AWS and Azure): used to bring up 40 nodes and set up a Torque queuing system; trouble using a custom AMI, and you need to pay to use a custom AMI with this solution; can use Docker if we want containers, but Ramboll is not positioned to use containers for this project
- CfnCluster: development had slowed, but it has been reincarnated as AWS ParallelCluster, with improved tools and distribution through the Python Package Index (can be installed with pip; see the one-liner after this list); lets you spin everything up from the command line and could be scripted
- Haven't yet explored AWS ParallelCluster/CfnCluster in detail; similar to the experience with StarCluster; it seems to be the best solution because you can use your own custom AMI; instance types are independent of the cluster management tool
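For reference, the ParallelCluster CLI installs from the Python Package Index with a single command; the version pin below simply matches the v2.1.0 release described in the user's guide further down this page.

# Install the AWS ParallelCluster command-line tool from PyPI
pip install aws-parallelcluster==2.1.0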
Next Steps
- LADCO to create a WRF AMI on AWS: WRF 3.9.1, netCDF4 with compression, MPICH2, PGI compiler, AMET
- LADCO to create a login for Ramboll in our AWS organization
- Ramboll to explore AWS Parallel cluster and then prototype with LADCO WRF AMI
- Next call 12/5 @ 3 Central
Recommendations
WRF
- Use netCDF4 with compression
- Use 8 cores per 5.5-day segment and submit all segments for the annual run to the cluster at once
Cloud Service
- Costs are equivalent between Azure and AWS, so use AWS because of familiarity
- Use one memory-optimized instance (EC2 r5.2xlarge: 8 cores, 64 GB RAM) for each segment
- Use S3 Standard storage for the lifetime of the project and migrate to S3 Infrequent Access or Glacier for long-term storage (a lifecycle-rule sketch follows this list)
- Use Snowball to transfer the completed project to the local site
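The Standard-to-Glacier migration can be automated with an S3 lifecycle rule instead of manual copies. The sketch below uses a placeholder bucket name, prefix, and 30-day threshold; adjust these to the project's retention needs.

# lifecycle.json: move project output to Glacier 30 days after upload
{
  "Rules": [
    {
      "ID": "wrf-output-to-glacier",
      "Filter": { "Prefix": "output/" },
      "Status": "Enabled",
      "Transitions": [ { "Days": 30, "StorageClass": "GLACIER" } ]
    }
  ]
}

# Apply the rule to the project bucket
aws s3api put-bucket-lifecycle-configuration --bucket ladco-wrf-2016 --lifecycle-configuration file://lifecycle.json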
HPC Platforms
- Use AWS ParallelCluster (formerly CfnCluster)
- Provides a CLI, allowing for Linux-script automation
- Allows for custom AMIs
- Provides a variety of schedulers: sge, torque, slurm, or awsbatch
- Is actively being developed and enhanced
- Additional investigation/testing of WRF/CAMx test cases is needed to verify tool integrity and performance
- Other HPC platforms have demonstrated issues
- StarCluster: Problematic auto-scaling; outdated and inactive
- AlcesFlight: Fee-based ability to use custom AMIs, problems with auto-scaling for large instance counts
WRF on AWS: User's Guide
How to configure/optimize AWS for running WRF.
Summary
- AWS ParallelCluster (pcluster) v2.1.0 cluster using Amazon Linux (alinux)
- WRF v3.9.1 compiled with netCDF4 (compression)
- PGI compiler 2018 with OpenMPI 3.1.3
- netCDF C 4.6.2, netCDF Fortran 4.2 (a build-environment sketch follows this list)
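A sketch of the build environment for this toolchain is below. The install paths are placeholders for wherever the PGI, OpenMPI, netCDF, and HDF5 libraries live on the AMI; setting NETCDF4=1 to enable compressed netCDF-4 output follows the standard WRF build documentation.

# Point the WRF build at the PGI/OpenMPI/netCDF stack (paths are placeholders)
export PATH=/opt/pgi/linux86-64/2018/bin:/usr/local/openmpi-3.1.3/bin:$PATH
export NETCDF=/usr/local/netcdf-pgi
export HDF5=/usr/local/hdf5-pgi
export NETCDF4=1    # build WRF with netCDF-4/HDF5 compression support

cd WRFV3
./configure                      # choose the PGI dmpar option when prompted
./compile em_real >& compile.log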
Cluster Configuration
On-demand pcluster config
[aws]
aws_region_name = us-east-2

[cluster ladcowrf]
vpc_settings = public
ebs_settings = ladcowrf
scheduler = sge
master_instance_type = m4.large
compute_instance_type = m5a.4xlarge
placement = cluster
placement_group = DYNAMIC
master_root_volume_size = 40
cluster_type = ondemand
base_os = alinux
key_name = *****
min_vcpus = 0
max_vcpus = 64
desired_vcpus = 0
# Base AMI for pcluster v2.1.0
custom_ami = ami-0381cb7486cdc973f

[ebs ladcowrf]
shared_dir = data
volume_type = gp2
volume_size = 1500
volume_iops = 1500
encrypted = false

[vpc public]
master_subnet_id = subnet-******
vpc_id = vpc-******

[global]
update_check = true
sanity_check = true
cluster_template = ladcowrf

[aliases]
ssh = ssh -Y {CFN_USER}@{MASTER_IP} {ARGS}
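With the file above saved as the ParallelCluster configuration (by default ~/.parallelcluster/config), the cluster can be driven entirely from the command line. A minimal sketch of the v2 workflow follows; the key file path is a placeholder.

# Create the cluster defined by the [cluster ladcowrf] template
pcluster create ladcowrf

# Log in to the master node (extra arguments pass through the ssh alias above)
pcluster ssh ladcowrf -i ~/.ssh/your-key.pem

# Check status and tear the cluster down when the runs are finished
pcluster status ladcowrf
pcluster delete ladcowrf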