Azure Grid Computing - Worker Roles as HPC Compute Nodes
- by JoshReuben
Overview
· With HPC Pack 2008 R2 SP1 you can add Windows Azure worker roles as compute nodes in a local Windows HPC Server cluster.
· The Windows Azure subscription is charged like any other Azure service - for the time that the role instances are available, as well as for the compute and storage services that are used on the nodes.
· Win-win? Azure charges per compute hour (according to VM size), amortized over the month - so you save on purchasing compute node hardware. Microsoft wins because you need to purchase HPC Server to have a local head node for managing this compute cluster grid distributed in the cloud.
· Blob storage is used to hold the input & output files of each job. I can see how Parametric Sweep HPC jobs can be supported (where the same job is run multiple times on each node against different input units), but not MPI.NET (where different HPC job instances function as coordinated agents and conduct master-slave inter-process communication), unless Azure is somehow tunneling MPI communication through inter-WorkerRole Azure Queues.
· This is not the end of the story for Azure grid computing. If MS requires you to purchase a local HPC license (and administrate it), what's to stop a 3rd party from doing this and exposing the HPC WCF Broker Service to you for managing compute nodes? If MS doesn't provide the head node as a service, someone else will!
Process
· Requires creation of a worker node template that specifies a connection to an existing subscription for Windows Azure + an availability policy for the worker nodes.
· After worker nodes are added to the cluster, you can start them, which provisions the Windows Azure role instances, and then bring them online to run HPC cluster jobs.
· A Windows Azure worker role instance runs an HPC-compatible Azure guest operating system, which runs on the VMs that host your service. The guest operating system is updated monthly. You can choose to upgrade the guest OS for your service automatically each time an update is released - all role instances defined by your service will run on the guest operating system version that you specify. See Windows Azure Guest OS Releases and SDK Compatibility Matrix (http://go.microsoft.com/fwlink/?LinkId=190549).
· Use the hpcpack command to upload file packages and install files to run on the worker nodes - see hpcpack (http://go.microsoft.com/fwlink/?LinkID=205514) and the sketch below.
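· A minimal sketch of staging a package, run from an elevated PowerShell prompt on the head node - the package and template names are my own placeholders, and the create / upload verbs and /nodetemplate: switch are per the hpcpack documentation linked above:
# bundle a folder of application binaries into a package
hpcpack create MyApp.zip C:\Apps\MyApp
# upload it to the storage account tied to a worker node template
hpcpack upload MyApp.zip /nodetemplate:"AzureWorkerTemplate"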
Requirements
· Assumes you have a Windows Azure subscription account and the HPC head node installed and configured.
· Install HPC Pack 2008 R2 SP 1 - see Microsoft HPC Pack 2008 R2 Service Pack 1 Release Notes (http://go.microsoft.com/fwlink/?LinkID=202812).
· Configure the head node to connect to the Internet - connectivity is provided by the connection of the head node to the enterprise network. You may need to configure a proxy client on the head node. Any cluster network topology (1-5) is supported.
· Configure the firewall - allow outbound TCP traffic on the following ports: 80, 443, 5901, 5902, 7998, 7999 (see the sketch after this list).
· Note: HPC Server uses Admin Mode (Elevated Privileges) in Windows Azure to give the service administrator of the subscription the necessary privileges to initialize HPC cluster services on the worker nodes.
· Obtain a Windows Azure subscription certificate - the Windows Azure subscription must be configured with a public subscription (API) certificate - a valid X.509 certificate with a key size of at least 2048 bits. Generate a self-signed certificate (see the sketch after this list) & upload the .cer file via the Windows Azure Portal Account page > Manage my API Certificates link. See Using the Windows Azure Service Management API (http://go.microsoft.com/fwlink/?LinkId=205526).
· Import the certificate, with its associated private key, into the trusted root store of the local computer account on the HPC cluster head node.
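· A sketch of the firewall and certificate steps, from an elevated PowerShell prompt on the head node (the rule name and certificate CN are my own placeholders; makecert ships with the Windows SDK):
# allow the outbound TCP ports HPC Server uses to reach Windows Azure
netsh advfirewall firewall add rule name="HPC Azure outbound" dir=out action=allow protocol=TCP remoteport="80,443,5901,5902,7998,7999"
# create a 2048-bit self-signed certificate with an exportable private key in the local machine store, emitting the .cer file to upload to the portal
makecert -r -pe -a sha1 -len 2048 -sky exchange -n "CN=AzureHpcApi" -ss My -sr LocalMachine AzureHpcApi.cer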
Obtain Windows Azure Connection Information for HPC Server
· required for each worker node template
· Copy from the Windows Azure Portal - navigation pane > Hosted Services > Storage Accounts & CDN.
· Subscription ID - a 32-char hex string in the form xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx. In Properties pane.
· Subscription certificate thumbprint - a 40-char hex string (you need to remove the spaces - see the one-liner after this list). In Management Certificates > Properties pane.
· Service name - the value of <ServiceName> configured in the public URL of the service (http://<ServiceName>.cloudapp.net). In Hosted Services > Properties pane.
· Blob Storage account name - the value of <StorageAccountName> configured in the public URL of the account (http://<StorageAccountName>.blob.core.windows.net). In Storage Accounts > Properties pane.
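· The portal displays the thumbprint with embedded spaces; a PowerShell one-liner to normalize whatever you paste (placeholder per the conventions above):
# strip the spaces from the 40-char thumbprint copied from the portal
$thumbprint = '<thumbprint as copied, with spaces>' -replace '\s', ''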
Import the Azure Subscription Certificate on the HPC Head Node
· Enables the services for Windows HPC Server to authenticate properly with the Windows Azure subscription.
· Use the Certificates MMC snap-in to import the certificate to the Trusted Root Certification Authorities store of the local computer account. The certificate must be in PFX format (.pfx or .p12 file) with a private key that is protected by a password. (A certutil alternative follows this section.)
· see Certificates (http://go.microsoft.com/fwlink/?LinkId=163918).
· To open the Certificates snap-in: Run > mmc, then File > Add/Remove Snap-in > Certificates > Computer account > Local computer.
· To import the certificate via wizard - Certificates > Trusted Root Certification Authorities > Certificates > All Tasks > Import
· After the certificate is imported, it appears in the details pane in the Certificates snap-in. You can open the certificate to check its status.
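· A scripted alternative to the wizard using the built-in certutil tool (file name and password are placeholders; the Root argument targets the Trusted Root Certification Authorities store of the local computer):
# import the password-protected PFX (certificate + private key) into the trusted root store
certutil -f -p <password> -importpfx Root AzureHpcApi.pfx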
Configure a Proxy Client on the HPC Head Node
· The following Windows HPC Server services must be able to communicate over the Internet (through the firewall) with the services for Windows Azure: HPCManagement, HPCScheduler, HPCBrokerWorker. (A WinHTTP proxy sketch follows.)
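· One option, assuming the head node reaches the Internet through a corporate proxy: set the machine-wide WinHTTP proxy that Windows services (including the HPC services above) use - the proxy address is a placeholder:
# point the WinHTTP stack at the proxy, bypassing local addresses
netsh winhttp set proxy proxy-server="http://<proxyserver>:8080" bypass-list="<local>"
# verify the setting
netsh winhttp show proxy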
Create a Windows Azure Worker Node Template
· Edit HPC node templates in HPC Node Template Editor.
· Specify: 1) Windows Azure subscription connection info (unique service name) for adding a set of worker nodes to the cluster + 2) worker node availability policy - rules for deploying / removing worker role instances in Windows Azure.
o HPC Cluster Manager > Configuration > Navigation Pane > Node Templates > Actions pane > New → Create Node Template Wizard, or Edit → Node Template Editor
o Choose Node Template Type page - Windows Azure worker node template
o Specify Template Name page – template name & description
o Provide Connection Information page – Azure Subscription ID (text) & Subscription certificate (browse)
o Provide Service Information page - Azure service name + blob storage account name (optionally click Retrieve Connection Information to get the list of those available from Azure - potentially a long-running task).
o Configure Azure Availability Policy page - how Windows Azure worker nodes start / stop (bring the worker role instances online / offline - add / remove) - manual / automatic
o For automatic - in the Configure Windows Azure Worker Availability Policy dialog, select the days and hours for worker nodes to start / stop.
· To validate the Windows Azure connection information, on the template's Connection Information tab > Validate connection information.
· You can upload a file package to the storage account that is specified in the template - e.g. application or service files that will run on the worker nodes. See hpcpack (http://go.microsoft.com/fwlink/?LinkID=205514).
Add Azure Worker Nodes to the HPC Cluster
· Use the Add Node Wizard - specify: 1) the worker node template, 2) the number of worker nodes (within the quota of role instances in the Azure subscription), and 3) the VM size of the worker nodes: ExtraSmall, Small, Medium, Large, or ExtraLarge.
· To add worker nodes of different sizes, you must run the Add Node Wizard separately for each size.
· All worker nodes that are added to the cluster by using a specific worker node template define a set of worker nodes that will be deployed and managed together in Windows Azure when you start the nodes. This includes worker nodes that you add later by using the worker node template and, if you choose, worker nodes of different sizes. You cannot start, stop, or delete individual worker nodes.
· To add Windows Azure worker nodes
o In HPC Cluster Manager: Node Management > Actions pane > Add Node → Add Node Wizard
o Select Deployment Method page - Add Azure Worker nodes
o Specify New Nodes page - select a worker node template, specify the number and size of the worker nodes
· After you add worker nodes to the cluster, they are in the Not-Deployed state, and they have a health state of Unapproved. Before you can use the worker nodes to run jobs, you must start them and then bring them online.
· Worker nodes are numbered consecutively in a naming series that begins with the root name AzureCN – this is non-configurable.
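· A quick check from HPC PowerShell that the new nodes landed in the expected state (the AzureCN* filter follows the naming series above; property names are as surfaced by Get-HpcNode):
# list the Azure worker nodes with their state and health
Get-HpcNode | Where-Object { $_.NetBiosName -like 'AzureCN*' } | Format-Table NetBiosName, NodeState, NodeHealth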
Deploying Windows Azure Worker Nodes
· To deploy the role instances in Windows Azure - start the worker nodes added to the HPC cluster and bring the nodes online so that they are available to run cluster jobs. This can be configured in the HPC Azure Worker Node Template – Azure Availability Policy - to be automatic or manual.
· The Start, Stop, and Delete actions take place on the set of worker nodes that are configured by a specific worker node template. You cannot perform one of these actions on a single worker node in a set. You also cannot perform a single action on two sets of worker nodes (specified by two different worker node templates).
· Starting a set of worker nodes deploys a set of worker role instances in Windows Azure, which can take some time to complete, depending on the number of worker nodes and the performance of Windows Azure.
· To start worker nodes manually and bring them online
o In HPC Node Management > Navigation Pane > Nodes > List / Heat Map view - select one or more worker nodes.
o Actions pane > Start – in the Start Azure Worker Nodes dialog, select a node template.
o The state of the worker nodes changes from Not Deployed; track the provisioning progress in the worker node Details Pane > Provisioning Log tab.
o If there were errors during the provisioning of one or more worker nodes, the state of those nodes is set to Unknown and the node health is set to Unapproved. To determine the reason for the failure, review the provisioning logs for the nodes.
o After a worker node starts successfully, the node state changes to Offline. To bring the nodes online, select the nodes that are in the Offline state > Bring Online.
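· A PowerShell sketch of the same manual steps, assuming the Azure node cmdlets introduced with SP1 (Start-HpcAzureNode / Set-HpcNodeState - verify the names and parameters with Get-Help on your cluster):
# provision the Windows Azure role instances for the worker nodes
Get-HpcNode | Where-Object { $_.NetBiosName -like 'AzureCN*' } | Start-HpcAzureNode
# once provisioning completes and the nodes show Offline, bring them online to accept jobs
Get-HpcNode | Where-Object { $_.NetBiosName -like 'AzureCN*' -and $_.NodeState -eq 'Offline' } | Set-HpcNodeState -State Online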
· Troubleshooting
o check node template.
o use telnet to test connectivity:
telnet <ServiceName>.cloudapp.net 7999
o check node status - deployment status information appears in the service account information in the Windows Azure Portal (HPC queries this); see the node status information for any failed nodes in HPC Node Management.
· When role instances are deployed, file packages that were previously uploaded to the storage account using the hpcpack command are automatically installed. You can also upload file packages to storage after the worker nodes are started, and then manually install them on the worker nodes - see hpcpack (http://go.microsoft.com/fwlink/?LinkID=205514) and the sketch at the end of this section.
· To remove a set of role instances in Windows Azure - stop the nodes by using HPC Cluster Manager (apply the Stop action). This deletes the role instances from the service and changes the state of the worker nodes in the HPC cluster to Not Deployed.
· Each time that you start a set of worker nodes, two proxy role instances (size Small) are configured in Windows Azure to facilitate communication between HPC Cluster Manager and the worker nodes. The proxy role instances are not listed in HPC Cluster Manager after the worker nodes are added. However, the instances appear in the Windows Azure Portal. The proxy role instances incur charges in Windows Azure along with the worker node instances, and they count toward the quota of role instances in the subscription.
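· A sketch of pushing a new package to nodes that are already running - hpcsync is the utility on the Azure nodes that pulls packages from storage, and the node group name here is an assumption (use whichever group contains your worker nodes):
# upload a fresh package to the template's storage account
hpcpack upload MyUpdate.zip /nodetemplate:"AzureWorkerTemplate"
# fan the install out across the running worker nodes
clusrun /nodegroup:AzureWorkerNodes hpcsync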