<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://ladco.org/wiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Ramboll</id>
	<title>LADCO Wiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://ladco.org/wiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Ramboll"/>
	<link rel="alternate" type="text/html" href="https://ladco.org/wiki/index.php?title=Special:Contributions/Ramboll"/>
	<updated>2026-05-15T07:09:03Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.31.16</generator>
	<entry>
		<id>https://ladco.org/wiki/index.php?title=WRF_on_the_Cloud&amp;diff=59</id>
		<title>WRF on the Cloud</title>
		<link rel="alternate" type="text/html" href="https://ladco.org/wiki/index.php?title=WRF_on_the_Cloud&amp;diff=59"/>
		<updated>2018-12-13T21:56:51Z</updated>

		<summary type="html">&lt;p&gt;Ramboll: AWS HPC Platforms Update&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
= Objectives =&lt;br /&gt;
LADCO is seeking to understand the best practices for submitting and managing multiprocessor computing jobs on a cloud computing platform. In particular, LADCO would like to develop a WRF production environment that utilizes cloud-based computing. The goal of this project is to prototype a WRF production environment on a public, on-demand high performance computing service in the cloud to create a WRF platform-as-a-service (PaaS) solution. The WRF PaaS must meet the following objectives: &lt;br /&gt;
&lt;br /&gt;
* Configurable computing and storage to scale, as needed, to meet the needs of different WRF applications&lt;br /&gt;
* Configurable WRF options to enable changing grids, simulation periods, physics options, and input data&lt;br /&gt;
* Flexible cloud deployment from a command line interface to initiate computing clusters and spawn WRF jobs in the cloud&lt;br /&gt;
&lt;br /&gt;
= Call Notes =&lt;br /&gt;
== November 28, 2018 ==&lt;br /&gt;
=== WRF Benchmarking ===&lt;br /&gt;
* Emulating the WRF 2016 12/4/1.3-km nested grids&lt;br /&gt;
* Purpose: to estimate costs for CPU, RAM, and storage&lt;br /&gt;
* CPU: a 5.5-day run takes ~4 days on 8 cores and ~3 days on 24 cores&lt;br /&gt;
* RAM: ~22 GB/run (~2.5 GB/core)&lt;br /&gt;
* Storage&lt;br /&gt;
** Tested netCDF4 with compression against classic netCDF with no compression&lt;br /&gt;
** Compression reduces output to about 1/3 the size of uncompressed netCDF (~70% savings); see the recompression sketch below&lt;br /&gt;
** Downstream programs must be linked against HDF5 and netCDF4 libraries built with compression support&lt;br /&gt;
** Estimated ~5.8 TB for the year compressed, vs. ~16.9 TB uncompressed&lt;br /&gt;
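&lt;br /&gt;
A minimal sketch of the compression step, using the netCDF4-python library to rewrite an uncompressed WRF output file as compressed netCDF4/HDF5 (the file names and deflate level are illustrative assumptions, not project settings):&lt;br /&gt;
&lt;pre&gt;
# Recompress a WRF output file to netCDF4/HDF5 with zlib deflation.
# File names and complevel are hypothetical examples.
from netCDF4 import Dataset

src = Dataset("wrfout_d01_2016-01-01", "r")                  # uncompressed input
dst = Dataset("wrfout_d01_2016-01-01_nc4.nc", "w", format="NETCDF4")

# copy dimensions
for name, dim in src.dimensions.items():
    dst.createDimension(name, None if dim.isunlimited() else len(dim))

# copy variables with zlib compression enabled
for name, var in src.variables.items():
    out = dst.createVariable(name, var.datatype, var.dimensions,
                             zlib=True, complevel=4)
    out.setncatts({a: var.getncattr(a) for a in var.ncattrs()})
    out[:] = var[:]

dst.setncatts({a: src.getncattr(a) for a in src.ncattrs()})
src.close()
dst.close()
&lt;/pre&gt;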
&lt;br /&gt;
=== Conceptual Approach to WRF on the Cloud ===&lt;br /&gt;
* Cluster management launches a head node and compute nodes&lt;br /&gt;
* 77 5.5-day chunks: 20 instances for ~16 days, or 80 instances for ~4 days (checked in the sketch below)&lt;br /&gt;
* Head node runs constantly&lt;br /&gt;
* Compute nodes run over the length of the project&lt;br /&gt;
* Memory-optimized machines performed better than compute-optimized machines for CAMx&lt;br /&gt;
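&lt;br /&gt;
A back-of-envelope check of the chunking arithmetic above, using only numbers from these notes (77 segments at roughly 4 wall-clock days each on 8 cores):&lt;br /&gt;
&lt;pre&gt;
# Wall-clock estimate for running 77 segments across a fixed instance pool.
chunks = 77
days_per_chunk = 4.0     # a 5.5-day segment takes ~4 days on 8 cores

for instances in (20, 80):
    waves = -(-chunks // instances)    # ceiling division: sequential waves
    print(instances, "instances:", waves * days_per_chunk, "days wall clock")

# prints 16.0 days for 20 instances and 4.0 days for 80, matching the notes
&lt;/pre&gt;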
&lt;br /&gt;
=== Cost Analysis ===&lt;br /&gt;
* [https://www.ladco.org/wp-content/uploads/Projects/WRF-Cloud/WRF_cloud_computing_costs.pdf Analysis Spreadsheet]&lt;br /&gt;
&lt;br /&gt;
=== Storage Analysis ===&lt;br /&gt;
* AWS&lt;br /&gt;
** Don&amp;#039;t want to use local instance storage because the data would need to be moved/migrated&lt;br /&gt;
** Put the data in object storage (S3) while running, then push it off to longer-term storage (Glacier); see the lifecycle sketch below&lt;br /&gt;
** Glacier is archival storage: retrieval requests are submitted through the console, with response times listed as 1-5 minutes&lt;br /&gt;
* Azure&lt;br /&gt;
** Fast and slower Data Lake storage tiers for offline data&lt;br /&gt;
** Managed disks for online&lt;br /&gt;
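&lt;br /&gt;
A minimal sketch of the S3-to-Glacier approach above using boto3 (the bucket name, prefix, and 30-day window are hypothetical, not project settings): output lands in Standard S3 during the run and a lifecycle rule transitions it to Glacier automatically.&lt;br /&gt;
&lt;pre&gt;
# Lifecycle rule: keep run output in Standard S3, archive to Glacier later.
# Bucket, prefix, and days are illustrative assumptions.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="ladco-wrf-output",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-wrfout",
            "Filter": {"Prefix": "wrfout/"},
            "Status": "Enabled",
            # transition objects to Glacier 30 days after creation
            "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
        }]
    },
)
&lt;/pre&gt;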
&lt;br /&gt;
=== Data Transfer Analysis ===&lt;br /&gt;
* Estimates based on ~5.8 TB of compressed output&lt;br /&gt;
* AWS&lt;br /&gt;
** Internet transfer would cost ~$928 for the ~5.8 TB (the implied rate is checked in the sketch below)&lt;br /&gt;
** Snowball: ~10 days to get the data off and shipped; ~$200 for the entire WRF run (smallest appliance was 50 TB)&lt;br /&gt;
* Azure&lt;br /&gt;
** Online transfer&lt;br /&gt;
** Data Box option (similar to Snowball)&lt;br /&gt;
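&lt;br /&gt;
A quick consistency check on the egress figure above (the quoted cost and volume are from these notes; the per-GB rate is only what those two numbers imply, not an official price):&lt;br /&gt;
&lt;pre&gt;
# What per-GB internet egress rate does ~$928 for ~5.8 TB imply?
data_tb = 5.8
quoted_cost = 928.0
implied_rate = quoted_cost / (data_tb * 1024)
print(f"implied egress rate: ${implied_rate:.3f}/GB")   # ~$0.156/GB

# At that rate, the ~$200 Snowball option is clearly cheaper for this volume.
&lt;/pre&gt;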
&lt;br /&gt;
=== Cluster Management Tools (interface analysis) ===&lt;br /&gt;
* Three or four of the tools evaluated seemed to work best across several cloud platforms&lt;br /&gt;
* Alces Flight (works on AWS and Azure): used it to bring up 40 nodes and set up a Torque queuing system; had trouble using a custom AMI, since custom AMIs require payment with this solution; can use Docker for containers, but Ramboll is not positioned to use containers for this project&lt;br /&gt;
* CfnCluster: development had slowed, but it has been reincarnated as AWS ParallelCluster, with improved tools, published on the Python Package Index (installable with pip); lets you spin everything up from the command line, so it could be scripted (see the sketch below)&lt;br /&gt;
* Haven&amp;#039;t yet explored AWS ParallelCluster/CfnCluster in detail; similar to the StarCluster experience; seems to be the best solution because you can use your own custom AMI; instance types are independent of the cluster management tool&lt;br /&gt;
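&lt;br /&gt;
A minimal sketch of scripting AWS ParallelCluster from Python, illustrating the command-line automation noted above. The config keys follow the ParallelCluster 2.x INI format, but every value here (region, instance types, AMI id, VPC/subnet ids, key name) is a placeholder assumption:&lt;br /&gt;
&lt;pre&gt;
# Write a ParallelCluster 2.x-style config and launch a cluster via the CLI.
# All ids and names below are placeholders, not project values.
import pathlib
import subprocess
import textwrap

config = textwrap.dedent("""\
    [aws]
    aws_region_name = us-east-2

    [global]
    cluster_template = wrf

    [cluster wrf]
    key_name = my-keypair
    scheduler = slurm
    master_instance_type = t3.medium
    compute_instance_type = r5.2xlarge
    initial_queue_size = 20
    # custom_ami below would be the LADCO WRF AMI (placeholder id)
    custom_ami = ami-XXXXXXXX
    vpc_settings = public

    [vpc public]
    vpc_id = vpc-XXXXXXXX
    master_subnet_id = subnet-XXXXXXXX
    """)
pathlib.Path("pcluster.config").write_text(config)

# bring the cluster up from the command line; scriptable end to end
subprocess.run(["pcluster", "create", "--config", "pcluster.config", "wrf-run"],
               check=True)
&lt;/pre&gt;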
&lt;br /&gt;
=== Next Steps ===&lt;br /&gt;
* LADCO to create a WRF AMI on AWS: WRF 3.9.1, netCDF4 with compression, MPICH2, PGI compiler, AMET&lt;br /&gt;
* LADCO to create a login for Ramboll in our AWS organization&lt;br /&gt;
* Ramboll to explore AWS Parallel cluster and then prototype with LADCO WRF AMI&lt;br /&gt;
* Next call 12/5 at 3:00 Central&lt;br /&gt;
&lt;br /&gt;
= Recommendations =&lt;br /&gt;
== WRF ==&lt;br /&gt;
* Use netCDF4 with compression&lt;br /&gt;
* Use 8 cores per 5.5-day segment and submit all segments of the annual run to the cluster at once (see the sketch below)&lt;br /&gt;
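&lt;br /&gt;
A sketch of the submit-everything-at-once recommendation: one Slurm job per 5.5-day segment of an annual run. The wrapper script name, sbatch usage, and the 5-day step with half-day spin-up overlap are illustrative assumptions:&lt;br /&gt;
&lt;pre&gt;
# Queue one 8-core job per segment; the scheduler drains the queue across
# whatever compute nodes the cluster provides. Script name is hypothetical.
import subprocess
from datetime import datetime, timedelta

step = timedelta(days=5)                  # new segment every 5 days (assumed)
length = timedelta(days=5, hours=12)      # 5.5 days, incl. spin-up overlap

t = datetime(2016, 1, 1)
while t.year == 2016:
    end = t + length
    subprocess.run(["sbatch", "--ntasks=8", "run_wrf_segment.sh",
                    t.strftime("%Y-%m-%d_%H:%M:%S"),
                    end.strftime("%Y-%m-%d_%H:%M:%S")],
                   check=True)
    t += step
&lt;/pre&gt;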
== Cloud Service ==&lt;br /&gt;
* Costs are equivalent between Azure and AWS, so use AWS because of familiarity&lt;br /&gt;
* Use one memory-optimized instance (EC2 r5.2xlarge: 8 cores, 64 GB RAM) for each segment (see the rough cost check below)&lt;br /&gt;
* Use Standard S3 storage for the lifetime of the project and migrate to S3 Infrequent Access or Glacier for long-term storage&lt;br /&gt;
* Use Snowball to transfer the completed project data to the local site&lt;br /&gt;
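&lt;br /&gt;
A rough compute-cost check for this recommendation. The ~$0.50/hour on-demand rate for r5.2xlarge is an assumed list price, and spot or reserved pricing would lower it:&lt;br /&gt;
&lt;pre&gt;
# Rough on-demand compute cost for the annual run, from the notes' numbers.
segments = 77
days_per_segment = 4.0        # wall-clock days per 5.5-day segment on 8 cores
rate_per_hour = 0.504         # assumed r5.2xlarge on-demand price, USD/hour

hours = segments * days_per_segment * 24
print(round(hours * rate_per_hour))   # roughly 3700 USD for the compute
&lt;/pre&gt;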
&lt;br /&gt;
== HPC Platforms ==&lt;br /&gt;
* Use AWS ParallelCluster (formerly CfnCluster)&lt;br /&gt;
** Provides a CLI, allowing for Linux shell-script automation&lt;br /&gt;
** Allows for custom AMIs&lt;br /&gt;
** Provides a variety of schedulers: sge, torque, slurm, or awsbatch&lt;br /&gt;
** Is actively being developed and enhanced&lt;br /&gt;
** Additional investigation/test of WRF/CAMx test cases needed to verify tool integrity and performance&lt;br /&gt;
* Other HPC cluster tools demonstrated issues&lt;br /&gt;
** StarCluster: problematic auto-scaling; outdated and inactive&lt;br /&gt;
** Alces Flight: using custom AMIs requires a fee; problems with auto-scaling at large instance counts&lt;/div&gt;</summary>
		<author><name>Ramboll</name></author>
		
	</entry>
	<entry>
		<id>https://ladco.org/wiki/index.php?title=WRF_on_the_Cloud&amp;diff=55</id>
		<title>WRF on the Cloud</title>
		<link rel="alternate" type="text/html" href="https://ladco.org/wiki/index.php?title=WRF_on_the_Cloud&amp;diff=55"/>
		<updated>2018-12-07T22:18:18Z</updated>

		<summary type="html">&lt;p&gt;Ramboll: /* Cloud Service */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
= Objectives =&lt;br /&gt;
LADCO is seeking to understand the best practices for submitting and managing multiprocessor computing jobs on a cloud computing platform. In particular, LADCO would like to develop a WRF production environment that utilizes cloud-based computing. The goal of this project is to prototype a WRF production environment on a public, on-demand high performance computing service in the cloud to create a WRF platform-as-a-service (PaaS) solution. The WRF PaaS must meet the following objectives: &lt;br /&gt;
&lt;br /&gt;
* Configurable computing and storage to scale, as needed, to meet the needs of different WRF applications&lt;br /&gt;
* Configurable WRF options to enable changing grids, simulation periods, physics options, and input data&lt;br /&gt;
* Flexible cloud deployment from a command line interface to initiate computing clusters and spawn WRF jobs in the cloud&lt;br /&gt;
&lt;br /&gt;
= Call Notes =&lt;br /&gt;
== November 28, 2018 ==&lt;br /&gt;
=== WRF Benchmarking ===&lt;br /&gt;
* Emulating the WRF 2016 12/4/1.3-km nested grids&lt;br /&gt;
* Purpose: to estimate costs for CPU, RAM, and storage&lt;br /&gt;
* CPU: a 5.5-day run takes ~4 days on 8 cores and ~3 days on 24 cores&lt;br /&gt;
* RAM: ~22 GB/run (~2.5 GB/core)&lt;br /&gt;
* Storage&lt;br /&gt;
** Tested netCDF4 with compression against classic netCDF with no compression&lt;br /&gt;
** Compression reduces output to about 1/3 the size of uncompressed netCDF (~70% savings)&lt;br /&gt;
** Downstream programs must be linked against HDF5 and netCDF4 libraries built with compression support&lt;br /&gt;
** Estimated ~5.8 TB for the year compressed, vs. ~16.9 TB uncompressed&lt;br /&gt;
&lt;br /&gt;
=== Conceptual Approach to WRF on the Cloud ===&lt;br /&gt;
* Cluster management launches a head node and compute nodes&lt;br /&gt;
* 77 5.5-day chunks: 20 instances for ~16 days, or 80 instances for ~4 days&lt;br /&gt;
* Head node runs constantly&lt;br /&gt;
* Compute nodes run over the length of the project&lt;br /&gt;
* Memory-optimized machines performed better than compute-optimized machines for CAMx&lt;br /&gt;
&lt;br /&gt;
=== Cost Analysis ===&lt;br /&gt;
* [https://www.ladco.org/wp-content/uploads/Projects/WRF-Cloud/WRF_cloud_computing_costs.pdf Analysis Spreadsheet]&lt;br /&gt;
&lt;br /&gt;
=== Storage Analysis ===&lt;br /&gt;
* AWS&lt;br /&gt;
** Don&amp;#039;t want to use local instance storage because the data would need to be moved/migrated&lt;br /&gt;
** Put the data in object storage (S3) while running, then push it off to longer-term storage (Glacier)&lt;br /&gt;
** Glacier is archival storage: retrieval requests are submitted through the console, with response times listed as 1-5 minutes&lt;br /&gt;
* Azure&lt;br /&gt;
** Fast and slower Data Lake storage tiers for offline data&lt;br /&gt;
** Managed disks for online&lt;br /&gt;
&lt;br /&gt;
=== Data Transfer Analysis ===&lt;br /&gt;
* Estimates based on ~5.8 TB of compressed output&lt;br /&gt;
* AWS&lt;br /&gt;
** Internet transfer would cost ~$928 for the ~5.8 TB&lt;br /&gt;
** Snowball: ~10 days to get the data off and shipped; ~$200 for the entire WRF run (smallest appliance was 50 TB)&lt;br /&gt;
* Azure&lt;br /&gt;
** Online transfer&lt;br /&gt;
** Data Box option (similar to Snowball)&lt;br /&gt;
&lt;br /&gt;
=== Cluster Management Tools (interface analysis) ===&lt;br /&gt;
* Three or four of the tools evaluated seemed to work best across several cloud platforms&lt;br /&gt;
* Alces Flight (works on AWS and Azure): used it to bring up 40 nodes and set up a Torque queuing system; had trouble using a custom AMI, since custom AMIs require payment with this solution; can use Docker for containers, but Ramboll is not positioned to use containers for this project&lt;br /&gt;
* CfnCluster: development had slowed, but it has been reincarnated as AWS ParallelCluster, with improved tools, published on the Python Package Index (installable with pip); lets you spin everything up from the command line, so it could be scripted&lt;br /&gt;
* Haven&amp;#039;t yet explored AWS ParallelCluster/CfnCluster in detail; similar to the StarCluster experience; seems to be the best solution because you can use your own custom AMI; instance types are independent of the cluster management tool&lt;br /&gt;
&lt;br /&gt;
=== Next Steps ===&lt;br /&gt;
* LADCO to create a WRF AMI on AWS: WRF 3.9.1, netCDF4 with compression, MPICH2, PGI compiler, AMET&lt;br /&gt;
* LADCO to create a login for Ramboll in our AWS organization&lt;br /&gt;
* Ramboll to explore AWS Parallel cluster and then prototype with LADCO WRF AMI&lt;br /&gt;
* Next call 12/5 at 3:00 Central&lt;br /&gt;
&lt;br /&gt;
= Recommendations =&lt;br /&gt;
== WRF ==&lt;br /&gt;
* Use netCDF4 with compression&lt;br /&gt;
* Use 8 cores per 5.5-day segment and submit all segments of the annual run to the cluster at once&lt;br /&gt;
== Cloud Service ==&lt;br /&gt;
* Costs are equivalent between Azure and AWS, so use AWS because of familiarity&lt;br /&gt;
* Use one memory-optimized instance (EC2 r5.2xlarge: 8 cores, 64 GB RAM) for each segment&lt;br /&gt;
* Use Standard S3 storage for the lifetime of the project and migrate to S3 Infrequent Access or Glacier for long-term storage&lt;br /&gt;
* Use Snowball to transfer the completed project data to the local site&lt;/div&gt;</summary>
		<author><name>Ramboll</name></author>
		
	</entry>
	<entry>
		<id>https://ladco.org/wiki/index.php?title=WRF_on_the_Cloud&amp;diff=54</id>
		<title>WRF on the Cloud</title>
		<link rel="alternate" type="text/html" href="https://ladco.org/wiki/index.php?title=WRF_on_the_Cloud&amp;diff=54"/>
		<updated>2018-12-07T22:17:47Z</updated>

		<summary type="html">&lt;p&gt;Ramboll: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
= Objectives =&lt;br /&gt;
LADCO is seeking to understand the best practices for submitting and managing multiprocessor computing jobs on a cloud computing platform. In particular, LADCO would like to develop a WRF production environment that utilizes cloud-based computing. The goal of this project is to prototype a WRF production environment on a public, on-demand high performance computing service in the cloud to create a WRF platform-as-a-service (PaaS) solution. The WRF PaaS must meet the following objectives: &lt;br /&gt;
&lt;br /&gt;
* Configurable computing and storage to scale, as needed, to meet the needs of different WRF applications&lt;br /&gt;
* Configurable WRF options to enable changing grids, simulation periods, physics options, and input data&lt;br /&gt;
* Flexible cloud deployment from a command line interface to initiate computing clusters and spawn WRF jobs in the cloud&lt;br /&gt;
&lt;br /&gt;
= Call Notes =&lt;br /&gt;
== November 28, 2018 ==&lt;br /&gt;
=== WRF Benchmarking ===&lt;br /&gt;
* Emulating the WRF 2016 12/4/1.3-km nested grids&lt;br /&gt;
* Purpose: to estimate costs for CPU, RAM, and storage&lt;br /&gt;
* CPU: a 5.5-day run takes ~4 days on 8 cores and ~3 days on 24 cores&lt;br /&gt;
* RAM: ~22 GB/run (~2.5 GB/core)&lt;br /&gt;
* Storage&lt;br /&gt;
** Tested netCDF4 with compression against classic netCDF with no compression&lt;br /&gt;
** Compression reduces output to about 1/3 the size of uncompressed netCDF (~70% savings)&lt;br /&gt;
** Downstream programs must be linked against HDF5 and netCDF4 libraries built with compression support&lt;br /&gt;
** Estimated ~5.8 TB for the year compressed, vs. ~16.9 TB uncompressed&lt;br /&gt;
&lt;br /&gt;
=== Conceptual Approach to WRF on the Cloud ===&lt;br /&gt;
* Cluster management launches a head node and compute nodes&lt;br /&gt;
* 77 5.5-day chunks: 20 instances for ~16 days, or 80 instances for ~4 days&lt;br /&gt;
* Head node runs constantly&lt;br /&gt;
* Compute nodes run over the length of the project&lt;br /&gt;
* Memory-optimized machines performed better than compute-optimized machines for CAMx&lt;br /&gt;
&lt;br /&gt;
=== Cost Analysis ===&lt;br /&gt;
* [https://www.ladco.org/wp-content/uploads/Projects/WRF-Cloud/WRF_cloud_computing_costs.pdf Analysis Spreadsheet]&lt;br /&gt;
&lt;br /&gt;
=== Storage Analysis ===&lt;br /&gt;
* AWS&lt;br /&gt;
** Don&amp;#039;t want to use local instance storage because the data would need to be moved/migrated&lt;br /&gt;
** Put the data in object storage (S3) while running, then push it off to longer-term storage (Glacier)&lt;br /&gt;
** Glacier is archival storage: retrieval requests are submitted through the console, with response times listed as 1-5 minutes&lt;br /&gt;
* Azure&lt;br /&gt;
** Fast and slower Data Lake storage tiers for offline data&lt;br /&gt;
** Managed disks for online&lt;br /&gt;
&lt;br /&gt;
=== Data Transfer Analysis ===&lt;br /&gt;
* Estimates based on ~5.8 TB of compressed output&lt;br /&gt;
* AWS&lt;br /&gt;
** Internet transfer would cost ~$928 for the ~5.8 TB&lt;br /&gt;
** Snowball: ~10 days to get the data off and shipped; ~$200 for the entire WRF run (smallest appliance was 50 TB)&lt;br /&gt;
* Azure&lt;br /&gt;
** Online transfer&lt;br /&gt;
** Data Box option (similar to Snowball)&lt;br /&gt;
&lt;br /&gt;
=== Cluster Management Tools (interface analysis) ===&lt;br /&gt;
* Three or four of the tools evaluated seemed to work best across several cloud platforms&lt;br /&gt;
* Alces Flight (works on AWS and Azure): used it to bring up 40 nodes and set up a Torque queuing system; had trouble using a custom AMI, since custom AMIs require payment with this solution; can use Docker for containers, but Ramboll is not positioned to use containers for this project&lt;br /&gt;
* CfnCluster: development had slowed, but it has been reincarnated as AWS ParallelCluster, with improved tools, published on the Python Package Index (installable with pip); lets you spin everything up from the command line, so it could be scripted&lt;br /&gt;
* Haven&amp;#039;t yet explored AWS ParallelCluster/CfnCluster in detail; similar to the StarCluster experience; seems to be the best solution because you can use your own custom AMI; instance types are independent of the cluster management tool&lt;br /&gt;
&lt;br /&gt;
=== Next Steps ===&lt;br /&gt;
* LADCO to create a WRF AMI on AWS: WRF 3.9.1, netCDF4 with compression, MPICH2, PGI compiler, AMET&lt;br /&gt;
* LADCO to create a login for Ramboll in our AWS organization&lt;br /&gt;
* Ramboll to explore AWS Parallel cluster and then prototype with LADCO WRF AMI&lt;br /&gt;
* Next call 12/5 at 3:00 Central&lt;br /&gt;
&lt;br /&gt;
= Recommendations =&lt;br /&gt;
== WRF ==&lt;br /&gt;
* Use netCDF4 with compression&lt;br /&gt;
* Use 8 cores per 5.5-day segment and submit all segments of the annual run to the cluster at once&lt;br /&gt;
== Cloud Service ==&lt;br /&gt;
* Costs are equivalent between Azure and AWS, so use AWS because of familiarity&lt;br /&gt;
* Use one memory-optimized instance (EC2 r5.2xlarge: 8 cores, 64 GB RAM) for each segment&lt;br /&gt;
* Use Standard S3 storage for the lifetime of the project and migrate to S3 Infrequent Access or Glacier for long-term storage&lt;br /&gt;
* Use Snowball to transfer the completed project data to the local site&lt;/div&gt;</summary>
		<author><name>Ramboll</name></author>
		
	</entry>
	<entry>
		<id>https://ladco.org/wiki/index.php?title=WRF_on_the_Cloud&amp;diff=53</id>
		<title>WRF on the Cloud</title>
		<link rel="alternate" type="text/html" href="https://ladco.org/wiki/index.php?title=WRF_on_the_Cloud&amp;diff=53"/>
		<updated>2018-12-07T21:27:26Z</updated>

		<summary type="html">&lt;p&gt;Ramboll: /* WRF */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
= Objectives =&lt;br /&gt;
LADCO is seeking to understand the best practices for submitting and managing multiprocessor computing jobs on a cloud computing platform. In particular, LADCO would like to develop a WRF production environment that utilizes cloud-based computing. The goal of this project is to prototype a WRF production environment on a public, on-demand high performance computing service in the cloud to create a WRF platform-as-a-service (PaaS) solution. The WRF PaaS must meet the following objectives: &lt;br /&gt;
&lt;br /&gt;
* Configurable computing and storage to scale, as needed, to meet the needs of different WRF applications&lt;br /&gt;
* Configurable WRF options to enable changing grids, simulation periods, physics options, and input data&lt;br /&gt;
* Flexible cloud deployment from a command line interface to initiate computing clusters and spawn WRF jobs in the cloud&lt;br /&gt;
&lt;br /&gt;
= Call Notes =&lt;br /&gt;
== November 28, 2018 ==&lt;br /&gt;
=== WRF Benchmarking ===&lt;br /&gt;
* Emulating the WRF 2016 12/4/1.3-km nested grids&lt;br /&gt;
* Purpose: to estimate costs for CPU, RAM, and storage&lt;br /&gt;
* CPU: a 5.5-day run takes ~4 days on 8 cores and ~3 days on 24 cores&lt;br /&gt;
* RAM: ~22 GB/run (~2.5 GB/core)&lt;br /&gt;
* Storage&lt;br /&gt;
** Tested netCDF4 with compression against classic netCDF with no compression&lt;br /&gt;
** Compression reduces output to about 1/3 the size of uncompressed netCDF (~70% savings)&lt;br /&gt;
** Downstream programs must be linked against HDF5 and netCDF4 libraries built with compression support&lt;br /&gt;
** Estimated ~5.8 TB for the year compressed, vs. ~16.9 TB uncompressed&lt;br /&gt;
&lt;br /&gt;
=== Conceptual Approach to WRF on the Cloud ===&lt;br /&gt;
* Cluster management launches a head node and compute nodes&lt;br /&gt;
* 77 5.5-day chunks: 20 instances for ~16 days, or 80 instances for ~4 days&lt;br /&gt;
* Head node runs constantly&lt;br /&gt;
* Compute nodes run over the length of the project&lt;br /&gt;
* Memory-optimized machines performed better than compute-optimized machines for CAMx&lt;br /&gt;
&lt;br /&gt;
=== Cost Analysis ===&lt;br /&gt;
* [https://www.ladco.org/wp-content/uploads/Projects/WRF-Cloud/WRF_cloud_computing_costs.pdf Analysis Spreadsheet]&lt;br /&gt;
&lt;br /&gt;
=== Storage Analysis ===&lt;br /&gt;
* AWS&lt;br /&gt;
** Don&amp;#039;t want to use local instance storage because the data would need to be moved/migrated&lt;br /&gt;
** Put the data in object storage (S3) while running, then push it off to longer-term storage (Glacier)&lt;br /&gt;
** Glacier is archival storage: retrieval requests are submitted through the console, with response times listed as 1-5 minutes&lt;br /&gt;
* Azure&lt;br /&gt;
** Fast and slower Data Lake storage tiers for offline data&lt;br /&gt;
** Managed disks for online&lt;br /&gt;
&lt;br /&gt;
=== Data Transfer Analysis ===&lt;br /&gt;
* Estimates based on ~5.8 TB of compressed output&lt;br /&gt;
* AWS&lt;br /&gt;
** Internet transfer would cost ~$928 for the ~5.8 TB&lt;br /&gt;
** Snowball: ~10 days to get the data off and shipped; ~$200 for the entire WRF run (smallest appliance was 50 TB)&lt;br /&gt;
* Azure&lt;br /&gt;
** Online transfer&lt;br /&gt;
** Data Box option (similar to Snowball)&lt;br /&gt;
&lt;br /&gt;
=== Cluster Management Tools (interface analysis) ===&lt;br /&gt;
* Three or four of the tools evaluated seemed to work best across several cloud platforms&lt;br /&gt;
* Alces Flight (works on AWS and Azure): used it to bring up 40 nodes and set up a Torque queuing system; had trouble using a custom AMI, since custom AMIs require payment with this solution; can use Docker for containers, but Ramboll is not positioned to use containers for this project&lt;br /&gt;
* CfnCluster: development had slowed, but it has been reincarnated as AWS ParallelCluster, with improved tools, published on the Python Package Index (installable with pip); lets you spin everything up from the command line, so it could be scripted&lt;br /&gt;
* Haven&amp;#039;t yet explored AWS ParallelCluster/CfnCluster in detail; similar to the StarCluster experience; seems to be the best solution because you can use your own custom AMI; instance types are independent of the cluster management tool&lt;br /&gt;
&lt;br /&gt;
=== Next Steps ===&lt;br /&gt;
* LADCO to create a WRF AMI on AWS: WRF 3.9.1, netCDF4 with compression, MPICH2, PGI compiler, AMET&lt;br /&gt;
* LADCO to create a login for Ramboll in our AWS organization&lt;br /&gt;
* Ramboll to explore AWS Parallel cluster and then prototype with LADCO WRF AMI&lt;br /&gt;
* Next call 12/5 at 3:00 Central&lt;br /&gt;
&lt;br /&gt;
= Recommendations =&lt;br /&gt;
== WRF ==&lt;br /&gt;
* Use netCDF4 with compression&lt;br /&gt;
* Use 8 cores per 5.5-day segment and submit all segments of the annual run to the cluster at once&lt;/div&gt;</summary>
		<author><name>Ramboll</name></author>
		
	</entry>
	<entry>
		<id>https://ladco.org/wiki/index.php?title=WRF_on_the_Cloud&amp;diff=52</id>
		<title>WRF on the Cloud</title>
		<link rel="alternate" type="text/html" href="https://ladco.org/wiki/index.php?title=WRF_on_the_Cloud&amp;diff=52"/>
		<updated>2018-12-07T21:25:47Z</updated>

		<summary type="html">&lt;p&gt;Ramboll: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
= Objectives =&lt;br /&gt;
LADCO is seeking to understand the best practices for submitting and managing multiprocessor computing jobs on a cloud computing platform. In particular, LADCO would like to develop a WRF production environment that utilizes cloud-based computing. The goal of this project is to prototype a WRF production environment on a public, on-demand high performance computing service in the cloud to create a WRF platform-as-a-service (PaaS) solution. The WRF PaaS must meet the following objectives: &lt;br /&gt;
&lt;br /&gt;
* Configurable computing and storage to scale, as needed, to meet the needs of different WRF applications&lt;br /&gt;
* Configurable WRF options to enable changing grids, simulation periods, physics options, and input data&lt;br /&gt;
* Flexible cloud deployment from a command line interface to initiate computing clusters and spawn WRF jobs in the cloud&lt;br /&gt;
&lt;br /&gt;
= Call Notes =&lt;br /&gt;
== November 28, 2018 ==&lt;br /&gt;
=== WRF Benchmarking ===&lt;br /&gt;
* Emulating the WRF 2016 12/4/1.3-km nested grids&lt;br /&gt;
* Purpose: to estimate costs for CPU, RAM, and storage&lt;br /&gt;
* CPU: a 5.5-day run takes ~4 days on 8 cores and ~3 days on 24 cores&lt;br /&gt;
* RAM: ~22 GB/run (~2.5 GB/core)&lt;br /&gt;
* Storage&lt;br /&gt;
** Tested netCDF4 with compression against classic netCDF with no compression&lt;br /&gt;
** Compression reduces output to about 1/3 the size of uncompressed netCDF (~70% savings)&lt;br /&gt;
** Downstream programs must be linked against HDF5 and netCDF4 libraries built with compression support&lt;br /&gt;
** Estimated ~5.8 TB for the year compressed, vs. ~16.9 TB uncompressed&lt;br /&gt;
&lt;br /&gt;
=== Conceptual Approach to WRF on the Cloud ===&lt;br /&gt;
* Cluster management launches a head node and compute nodes&lt;br /&gt;
* 77 5.5-day chunks: 20 instances for ~16 days, or 80 instances for ~4 days&lt;br /&gt;
* Head node runs constantly&lt;br /&gt;
* Compute nodes run over the length of the project&lt;br /&gt;
* Memory-optimized machines performed better than compute-optimized machines for CAMx&lt;br /&gt;
&lt;br /&gt;
=== Cost Analysis ===&lt;br /&gt;
* [https://www.ladco.org/wp-content/uploads/Projects/WRF-Cloud/WRF_cloud_computing_costs.pdf Analysis Spreadsheet]&lt;br /&gt;
&lt;br /&gt;
=== Storage Analysis ===&lt;br /&gt;
* AWS&lt;br /&gt;
** Don&amp;#039;t want to use local instance storage because the data would need to be moved/migrated&lt;br /&gt;
** Put the data in object storage (S3) while running, then push it off to longer-term storage (Glacier)&lt;br /&gt;
** Glacier is archival storage: retrieval requests are submitted through the console, with response times listed as 1-5 minutes&lt;br /&gt;
* Azure&lt;br /&gt;
** Fast and slower Data Lake storage tiers for offline data&lt;br /&gt;
** Managed disks for online&lt;br /&gt;
&lt;br /&gt;
=== Data Transfer Analysis ===&lt;br /&gt;
* Estimates based on ~5.8 TB of compressed output&lt;br /&gt;
* AWS&lt;br /&gt;
** Internet transfer would cost ~$928 for the ~5.8 TB&lt;br /&gt;
** Snowball: ~10 days to get the data off and shipped; ~$200 for the entire WRF run (smallest appliance was 50 TB)&lt;br /&gt;
* Azure&lt;br /&gt;
** Online transfer&lt;br /&gt;
** Data Box option (similar to Snowball)&lt;br /&gt;
&lt;br /&gt;
=== Cluster Management Tools (interface analysis) ===&lt;br /&gt;
* Three or four of the tools evaluated seemed to work best across several cloud platforms&lt;br /&gt;
* Alces Flight (works on AWS and Azure): used it to bring up 40 nodes and set up a Torque queuing system; had trouble using a custom AMI, since custom AMIs require payment with this solution; can use Docker for containers, but Ramboll is not positioned to use containers for this project&lt;br /&gt;
* CfnCluster: development had slowed, but it has been reincarnated as AWS ParallelCluster, with improved tools, published on the Python Package Index (installable with pip); lets you spin everything up from the command line, so it could be scripted&lt;br /&gt;
* Haven&amp;#039;t yet explored AWS ParallelCluster/CfnCluster in detail; similar to the StarCluster experience; seems to be the best solution because you can use your own custom AMI; instance types are independent of the cluster management tool&lt;br /&gt;
&lt;br /&gt;
=== Next Steps ===&lt;br /&gt;
* LADCO to create a WRF AMI on AWS: WRF 3.9.1, netCDF4 with compression, MPICH2, PGI compiler, AMET&lt;br /&gt;
* LADCO to create a login for Ramboll in our AWS organization&lt;br /&gt;
* Ramboll to explore AWS Parallel cluster and then prototype with LADCO WRF AMI&lt;br /&gt;
* Next call 12/5 at 3:00 Central&lt;br /&gt;
&lt;br /&gt;
= Recommendations =&lt;br /&gt;
== WRF ==&lt;br /&gt;
* Use netCDF4 with compression&lt;br /&gt;
* Use 8 cores per 5.5-day segment and submit all segments to the cluster at once&lt;/div&gt;</summary>
		<author><name>Ramboll</name></author>
		
	</entry>
	<entry>
		<id>https://ladco.org/wiki/index.php?title=WRF_on_the_Cloud&amp;diff=21</id>
		<title>WRF on the Cloud</title>
		<link rel="alternate" type="text/html" href="https://ladco.org/wiki/index.php?title=WRF_on_the_Cloud&amp;diff=21"/>
		<updated>2018-11-26T20:44:56Z</updated>

		<summary type="html">&lt;p&gt;Ramboll: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;LADCO is seeking to understand the best practices for submitting and managing multiprocessor computing jobs on a cloud computing platform. In particular, LADCO would like to develop a WRF production environment that utilizes cloud-based computing. The goal of this project is to prototype a WRF production environment on a public, on-demand high performance computing service in the cloud to create a WRF platform-as-a-service (PaaS) solution. The WRF PaaS must meet the following objectives: &lt;br /&gt;
&lt;br /&gt;
* Configurable computing and storage to scale, as needed, to meet the needs of different WRF applications&lt;br /&gt;
* Configurable WRF options to enable changing grids, simulation periods, physics options, and input data&lt;br /&gt;
* Flexible cloud deployment from a command line interface to initiate computing clusters and spawn WRF jobs in the cloud&lt;/div&gt;</summary>
		<author><name>Ramboll</name></author>
		
	</entry>
	<entry>
		<id>https://ladco.org/wiki/index.php?title=WRF_on_the_Cloud&amp;diff=20</id>
		<title>WRF on the Cloud</title>
		<link rel="alternate" type="text/html" href="https://ladco.org/wiki/index.php?title=WRF_on_the_Cloud&amp;diff=20"/>
		<updated>2018-11-26T20:42:19Z</updated>

		<summary type="html">&lt;p&gt;Ramboll: Created page with &amp;quot;Test&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Test&lt;/div&gt;</summary>
		<author><name>Ramboll</name></author>
		
	</entry>
</feed>