mongrel cluster - Page 19

protecting an application against hardware failure [on hold]

- by alex

I have an application that I want to protect against hardware and software (operating system) failure. A cluster seems OK, but the storage becomes the single point of failure, and I do not have a SAN. Can you please tell me if there are other ways to protect the application? The application is updated periodically, and changes should be replicated automatically to the second server.

Read the article

Windows DFS File System Clustering

- by tearman

We're attempting to set up a high-availability network for our file servers, and we want to build a DFS file system cluster using the same back-end storage (our back-end storage has its own clustering mechanisms that it manages itself). The questions are: A. how would one go about setting up DFS clustering, and B. how can we get Windows to cooperate with multiple servers accessing the same SAN volumes?

Read the article

Moving a single Windows 2008 box to a clustered implementation?

- by chris

I have a system that currently runs on a single box, Windows 2008 Enterprise. This is just used as a web server. What is involved in creating a cluster? Basically doing this for availability reasons - the load on the system will be pretty light.

Read the article

Azure Grid Computing - Worker Roles as HPC Compute Nodes

- by JoshReuben

Overview

· With HPC 2008 R2 SP1 you can add Azure worker roles as compute nodes in a local Windows HPC Server cluster.
· The subscription for Windows Azure is charged like any other Azure service: for the time that the role instances are available, as well as for the compute and storage services that are used on the nodes.
· Win-win? Azure charges the compute-hour cost (according to VM size) amortized over a month, so you save on purchasing compute node hardware. Microsoft wins because you need to purchase HPC to have a local head node for managing this compute cluster grid distributed in the cloud.
· Blob storage is used to hold input & output files of each job. I can see how Parametric Sweep HPC jobs can be supported (where the same job is run multiple times on each node against different input units), but not MPI.NET (where different HPC job instances function as coordinated agents and conduct master-slave inter-process communication), unless Azure is somehow tunneling MPI communication through inter-WorkerRole Azure Queues.
· This is not the end of the story for Azure Grid Computing. If MS requires you to purchase a local HPC license (and administer it), what's to stop a 3rd party from doing this and exposing the HPC WCF Broker Service to you for managing compute nodes? If MS doesn't provide a head node as a service, someone else will!

Process

· Requires creation of a worker node template that specifies a connection to an existing subscription for Windows Azure + an availability policy for the worker nodes.
· After worker nodes are added to the cluster, you can start them, which provisions the Windows Azure role instances, and then bring them online to run HPC cluster jobs.
· A Windows Azure worker role instance runs an HPC-compatible Azure guest operating system, which runs on the VMs that host your service. The guest operating system is updated monthly.
You can choose to upgrade the guest OS for your service automatically each time an update is released. All role instances defined by your service will run on the guest operating system version that you specify. See Windows Azure Guest OS Releases and SDK Compatibility Matrix (http://go.microsoft.com/fwlink/?LinkId=190549).
· Use the hpcpack command to upload file packages and install files to run on the worker nodes. See hpcpack (http://go.microsoft.com/fwlink/?LinkID=205514).

Requirements

· Assumes you have an Azure subscription account and the HPC head node installed and configured.
· Install HPC Pack 2008 R2 SP1 - see Microsoft HPC Pack 2008 R2 Service Pack 1 Release Notes (http://go.microsoft.com/fwlink/?LinkID=202812).
· Configure the head node to connect to the Internet - connectivity is provided by the connection of the head node to the enterprise network. You may need to configure a proxy client on the head node. Any cluster network topology (1-5) is supported.
· Configure the firewall - allow outbound TCP traffic on the following ports: 80, 443, 5901, 5902, 7998, 7999.
· Note: HPC Server uses Admin Mode (Elevated Privileges) in Windows Azure to give the service administrator of the subscription the necessary privileges to initialize HPC cluster services on the worker nodes.
· Obtain a Windows Azure subscription certificate - the Windows Azure subscription must be configured with a public subscription (API) certificate: a valid X.509 certificate with a key size of at least 2048 bits. Generate a self-signed certificate & upload a .cer file to the Windows Azure Portal Account page > Manage my API Certificates link. See Using the Windows Azure Service Management API (http://go.microsoft.com/fwlink/?LinkId=205526).
· Import the certificate with an associated private key on the HPC cluster head node - into the trusted root store of the local computer account.
Obtain Windows Azure Connection Information for HPC Server

· Required for each worker node template.
· Copy from the Azure portal - get from: navigation pane > Hosted Services > Storage Accounts & CDN.
· Subscription ID - a 32-char hex string in the form xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx. In Properties pane.
· Subscription certificate thumbprint - a 40-char hex string (you need to remove spaces). In Management Certificates > Properties pane.
· Service name - the value of <ServiceName> configured in the public URL of the service (http://<ServiceName>.cloudapp.net). In Hosted Services > Properties pane.
· Blob Storage account name - the value of <StorageAccountName> configured in the public URL of the account (http://<StorageAccountName>.blob.core.windows.net). In Storage Accounts > Properties pane.

Import the Azure Subscription Certificate on the HPC Head Node

· Enables the services for Windows HPC Server to authenticate properly with the Windows Azure subscription.
· Use the Certificates MMC snap-in to import the certificate to the Trusted Root Certification Authorities store of the local computer account. The certificate must be in PFX format (.pfx or .p12 file) with a private key that is protected by a password.
· See Certificates (http://go.microsoft.com/fwlink/?LinkId=163918).
· To open the Certificates snap-in: Run > mmc. File > Add/Remove Snap-in > Certificates > Computer account > Local Computer
· To import the certificate via wizard - Certificates > Trusted Root Certification Authorities > Certificates > All Tasks > Import
· After the certificate is imported, it appears in the details pane in the Certificates snap-in. You can open the certificate to check its status.

Configure a Proxy Client on the HPC Head Node

· The following Windows HPC Server services must be able to communicate over the Internet (through the firewall) with the services for Windows Azure: HPCManagement, HPCScheduler, HPCBrokerWorker.
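The subscription ID and certificate thumbprint formats described under "Obtain Windows Azure Connection Information" are easy to get wrong when copying from the portal. The following is a minimal Python sketch, not part of HPC Pack, that checks the two formats as the post states them (an 8-4-4-4-12 hex string, and 40 hex characters once spaces are removed). The function names are hypothetical, and the upper-casing of the thumbprint is only a choice made for this sketch.

```python
import re

# Formats as stated in the post; helper names are hypothetical.
SUBSCRIPTION_ID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)
THUMBPRINT_RE = re.compile(r"[0-9A-F]{40}")


def is_valid_subscription_id(value: str) -> bool:
    """Return True if value looks like xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx."""
    return SUBSCRIPTION_ID_RE.fullmatch(value.strip()) is not None


def normalize_thumbprint(value: str) -> str:
    """Remove all whitespace, upper-case the result, and verify 40 hex characters."""
    cleaned = "".join(value.split()).upper()
    if THUMBPRINT_RE.fullmatch(cleaned) is None:
        raise ValueError("expected 40 hexadecimal characters after removing spaces")
    return cleaned
```

For example, `normalize_thumbprint("a1 b2 c3 d4 e5 " * 4)` returns the 40-character string with the spaces stripped, while a value of the wrong length raises `ValueError` before it reaches the node template.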
Create a Windows Azure Worker Node Template

· Edit HPC node templates in HPC Node Template Editor.
· Specify: 1) Windows Azure subscription connection info (unique service name) for adding a set of worker nodes to the cluster + 2) worker node availability policy - rules for deploying / removing worker role instances in Windows Azure
  o HPC Cluster Manager > Configuration > Navigation Pane > Node Templates > Actions pane > New → Create Node Template Wizard or Edit → Node Template Editor
  o Choose Node Template Type page - Windows Azure worker node template
  o Specify Template Name page - template name & description
  o Provide Connection Information page - Azure Subscription ID (text) & Subscription certificate (browse)
  o Provide Service Information page - Azure service name + blob storage account name (optionally click Retrieve Connection Information to get list of available from azure - possible LRT).
  o Configure Azure Availability Policy page - how Windows Azure worker nodes start / stop (online / offline the worker role instance - add / remove) - manual / automatic
  o For automatic - in the Configure Windows Azure Worker Availability Policy dialog - select days and hours for worker nodes to start / stop.
· To validate the Windows Azure connection information, on the template's Connection Information tab > Validate connection information.
· You can upload a file package to the storage account that is specified in the template - e.g. upload application or service files that will run on the worker nodes. See hpcpack (http://go.microsoft.com/fwlink/?LinkID=205514).

Add Azure Worker Nodes to the HPC Cluster

· Use the Add Node Wizard - specify: 1) the worker node template, 2) the number of worker nodes (within the quota of role instances in the Azure subscription), and 3) the VM size of the worker nodes: ExtraSmall, Small, Medium, Large, or ExtraLarge.
· To add worker nodes of different sizes, you must run the Add Node Wizard separately for each size.
· All worker nodes that are added to the cluster by using a specific worker node template define a set of worker nodes that will be deployed and managed together in Windows Azure when you start the nodes. This includes worker nodes that you add later by using the worker node template and, if you choose, worker nodes of different sizes. You cannot start, stop, or delete individual worker nodes.
· To add Windows Azure worker nodes
  o In HPC Cluster Manager: Node Management > Actions pane > Add Node → Add Node Wizard
  o Select Deployment Method page - Add Azure Worker nodes
  o Specify New Nodes page - select a worker node template, specify the number and size of the worker nodes
· After you add worker nodes to the cluster, they are in the Not-Deployed state, and they have a health state of Unapproved. Before you can use the worker nodes to run jobs, you must start them and then bring them online.
· Worker nodes are numbered consecutively in a naming series that begins with the root name AzureCN - this is non-configurable.

Deploying Windows Azure Worker Nodes

· To deploy the role instances in Windows Azure - start the worker nodes added to the HPC cluster and bring the nodes online so that they are available to run cluster jobs. This can be configured in the HPC Azure Worker Node Template - Azure Availability Policy - to be automatic or manual.
· The Start, Stop, and Delete actions take place on the set of worker nodes that are configured by a specific worker node template. You cannot perform one of these actions on a single worker node in a set. You also cannot perform a single action on two sets of worker nodes (specified by two different worker node templates).
· Starting a set of worker nodes deploys a set of worker role instances in Windows Azure, which can take some time to complete, depending on the number of worker nodes and the performance of Windows Azure.
· To start worker nodes manually and bring them online
  o In HPC Node Management > Navigation Pane > Nodes > List / Heat Map view - select one or more worker nodes.
  o Actions pane > Start - in the Start Azure Worker Nodes dialog, select a node template.
  o The state of the worker nodes changes from Not Deployed to track the provisioning progress - worker node Details Pane > Provisioning Log tab.
  o If there were errors during the provisioning of one or more worker nodes, the state of those nodes is set to Unknown and the node health is set to Unapproved. To determine the reason for the failure, review the provisioning logs for the nodes.
  o After a worker node starts successfully, the node state changes to Offline. To bring the nodes online, select the nodes that are in the Offline state > Bring Online.
· Troubleshooting
  o Check the node template.
  o Use telnet to test connectivity: telnet <ServiceName>.cloudapp.net 7999
  o Check node status - deployment status information appears in the service account information in the Windows Azure Portal - HPC queries this - see node status information for any failed nodes in HPC Node Management.
· When role instances are deployed, file packages that were previously uploaded to the storage account using the hpcpack command are automatically installed. You can also upload file packages to storage after the worker nodes are started, and then manually install them on the worker nodes. See hpcpack (http://go.microsoft.com/fwlink/?LinkID=205514).
· To remove a set of role instances in Windows Azure - stop the nodes by using HPC Cluster Manager (apply the Stop action). This deletes the role instances from the service and changes the state of the worker nodes in the HPC cluster to Not Deployed.
· Each time that you start a set of worker nodes, two proxy role instances (size Small) are configured in Windows Azure to facilitate communication between HPC Cluster Manager and the worker nodes.
The proxy role instances are not listed in HPC Cluster Manager after the worker nodes are added. However, the instances appear in the Windows Azure Portal. The proxy role instances incur charges in Windows Azure along with the worker node instances, and they count toward the quota of role instances in the subscription.
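The telnet connectivity test from the troubleshooting steps can also be scripted. The sketch below is a rough Python equivalent of `telnet <ServiceName>.cloudapp.net 7999`: it only checks whether a TCP connection can be opened from the head node. The function names are hypothetical, and the list of ports to probe is an assumption; the post names 7999 for the telnet test and lists 80, 443, 5901, 5902, 7998 and 7999 as outbound firewall ports, but does not say that every one of them answers on the service endpoint.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_ports(host: str, ports, timeout: float = 5.0) -> dict:
    """Probe each port in turn and return a {port: reachable} mapping."""
    return {port: can_connect(host, port, timeout) for port in ports}
```

Calling `check_ports("<ServiceName>.cloudapp.net", [7999])` with your own service name substituted gives the same yes/no answer as the telnet test; a `False` result suggests looking at the firewall or proxy configuration described earlier.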

Read the article

jboss cache as hibernate 2nd level - cluster node doesn't persist replicated data

- by Sergey Grashchenko

I'm trying to build the architecture described in the user guide http://www.jboss.org/file-access/default/members/jbosscache/freezone/docs/3.2.1.GA/userguide_en/html/cache_loaders.html#d0e3090 (replicated caches with each cache having its own store), but with JBoss Cache configured as the Hibernate second-level cache. I've read the manual for several days and played with the settings but could not achieve the result: the data in memory (JBoss Cache) gets replicated across the hosts, but it is not persisted in the datasource/database of the target (not the originating) cluster host. I had hoped that a node might become persistent at eviction, so I wrote a cache listener and attached it to the @NodeEvicted event. I found that although I could adjust the eviction policy to fully control it, no persistence takes place. Then I thought I could try to modify the CacheLoader to set "passivate" to true, but I found that in my case (Hibernate second-level cache) I don't have a way to access a loader. I wonder whether replicated data persistence is possible at all through configuration tuning? If not, would it work for me to implement some manual persistence in a CacheListener (I could check whether the eviction event is local and, if not, persist it to the Hibernate datasource somehow)? I've used the mvcc-entity configuration with cacheMode modified to REPL_ASYNC. I've also played with the eviction policy configuration. The last thing to mention is that I've tested entity persistence and replication in a project generated with Seam. I guess it's not important, though.

Read the article

iptable CLUSTERIP won't work

- by Rad Akefirad

We have some requirements, explained below, which we have tried to satisfy without success. Here is the brief information.

Requirements:
1. High Availability
2. Load Balancing

Current configuration:
Server #1: one static (real) IP, 10.17.243.11
Server #2: one static (real) IP, 10.17.243.12
Cluster (virtual and shared among all servers) IP: 10.17.243.15

I tried to use CLUSTERIP to set up the cluster IP as follows.

On server #1:
iptables -I INPUT -i eth0 -d 10.17.243.15 -j CLUSTERIP --new --hashmode sourceip --clustermac 01:00:5E:00:00:20 --total-nodes 2 --local-node 1

On server #2:
iptables -I INPUT -i eth0 -d 10.17.243.15 -j CLUSTERIP --new --hashmode sourceip --clustermac 01:00:5E:00:00:20 --total-nodes 2 --local-node 2

When we try to ping 10.17.243.15 there is no reply, and the web service (Tomcat on port 8080) is not accessible either. However, we can see the packets arriving on both servers using tcpdump.

Some useful information:

iptables rules (iptables -L -n -v):
Chain INPUT (policy ACCEPT 21775 packets, 1470K bytes)
pkts bytes target prot opt in out source destination
0 0 CLUSTERIP all -- eth0 * 0.0.0.0/0 10.17.243.15 CLUSTERIP hashmode=sourceip clustermac=01:00:5E:00:00:20 total_nodes=2 local_node=1 hash_init=0
Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
pkts bytes target prot opt in out source destination
Chain OUTPUT (policy ACCEPT 14078 packets, 44M bytes)
pkts bytes target prot opt in out source destination

Log messages:
... kernel: [ 7.329017] e1000e: eth3 NIC Link is Up 100 Mbps Full Duplex, Flow Control: None
... kernel: [ 7.329133] e1000e 0000:05:00.0: eth3: 10/100 speed: disabling TSO
... kernel: [ 7.329567] ADDRCONF(NETDEV_CHANGE): eth3: link becomes ready
... kernel: [ 71.333285] ip_tables: (C) 2000-2006 Netfilter Core Team
... kernel: [ 71.341804] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
... kernel: [ 71.343168] ipt_CLUSTERIP: ClusterIP Version 0.8 loaded successfully
... kernel: [ 108.456043] device eth0 entered promiscuous mode
... kernel: [ 112.678859] device eth0 left promiscuous mode
... kernel: [ 117.916050] device eth0 entered promiscuous mode
... kernel: [ 140.168848] device eth0 left promiscuous mode

tcpdump while pinging:
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
12:11:55.335528 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto ICMP (1), length 84)
    10.17.243.1 > 10.17.243.15: ICMP echo request, id 16162, seq 2390, length 64
12:11:56.335778 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto ICMP (1), length 84)
    10.17.243.1 > 10.17.243.15: ICMP echo request, id 16162, seq 2391, length 64
12:11:57.336010 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto ICMP (1), length 84)
    10.17.243.1 > 10.17.243.15: ICMP echo request, id 16162, seq 2392, length 64
12:11:58.336287 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto ICMP (1), length 84)
    10.17.243.1 > 10.17.243.15: ICMP echo request, id 16162, seq 2393, length 64

And there is no ping reply, as I said. Does anyone know which part I missed? Thanks in advance.
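For readers unfamiliar with the options used in the rules above, here is a rough Python illustration of the idea behind `--hashmode sourceip --total-nodes 2 --local-node N`: every node receives the packets sent to the shared address (which matches the tcpdump capture), each node hashes the source IP, and only the node whose number matches is meant to handle the connection. This is a conceptual sketch only. `zlib.crc32` is a stand-in and is not the hash the kernel module uses, the function names are hypothetical, and the sketch does not explain why the ping in the question goes unanswered.

```python
import ipaddress
import zlib


def responsible_node(source_ip: str, total_nodes: int, hash_init: int = 0) -> int:
    """Map a source address to a node number in the range 1..total_nodes.

    zlib.crc32 is only a stand-in hash chosen for this illustration.
    """
    packed = ipaddress.ip_address(source_ip).packed
    return zlib.crc32(packed, hash_init) % total_nodes + 1


def node_accepts(source_ip: str, total_nodes: int, local_node: int) -> bool:
    """Return True if the node numbered local_node should handle this source."""
    return responsible_node(source_ip, total_nodes) == local_node


# With two nodes, exactly one of them claims any given client address.
for client in ("10.17.243.1", "10.17.243.2", "10.17.243.3"):
    owners = [node for node in (1, 2) if node_accepts(client, 2, node)]
    print(client, "-> node", owners[0])
```

The property worth noticing is that the decision needs no coordination between the servers: both run the same computation on the same packet, so each client is claimed by exactly one node.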

Read the article

Using WKA in Large Coherence Clusters (Disabling Multicast)

- by jpurdy

Disabling hardware multicast (by configuring well-known addresses, aka WKA) will place significant stress on the network. For messages that must be sent to multiple servers, rather than having a server send a single packet to the switch and having the switch broadcast that packet to the rest of the cluster, the server must send a packet to each of the other servers. While hardware varies significantly, consider that a server with a single gigabit connection can send at most ~70,000 packets per second. To continue with some concrete numbers, in a cluster with 500 members, that means that each server can send at most 140 cluster-wide messages per second. And if there are 10 cluster members on each physical machine, that number shrinks to 14 cluster-wide messages per second (or with only mild hyperbole, roughly zero).

It is also important to keep in mind that network I/O is not only expensive in terms of the network itself, but also in the CPU required to send (or receive) a message (due to things like copying the packet bytes, processing an interrupt, etc).

Fortunately, Coherence is designed to rely primarily on point-to-point messages, but there are some features that are inherently one-to-many:

- Announcing the arrival or departure of a member
- Updating partition assignment maps across the cluster
- Creating or destroying a NamedCache
- Invalidating a cache entry from a large number of client-side near caches
- Distributing a filter-based request across the full set of cache servers (e.g. queries, aggregators and entry processors)
- Invoking clear() on a NamedCache

The first few of these are operations that are primarily routed through a single senior member, and also occur infrequently, so they usually are not a primary consideration. There are cases, however, where the load from introducing new members can be substantial (to the point of destabilizing the cluster).
Consider the case where the cluster in the first paragraph grows from 500 members to 1000 members (holding the number of physical machines constant). During this period, there will be 500 new member introductions, each of which may consist of several cluster-wide operations (for the cluster membership itself as well as the partitioned cache services, replicated cache services, invocation services, management services, etc). Note that all of these introductions will route through that one senior member, which is sharing its network bandwidth with several other members (which will be communicating to a lesser degree with other members throughout this process). While each service may have a distinct senior member, there's a good chance during initial startup that a single member will be the senior for all services (if those services start on the senior before the second member joins the cluster). It's obvious that this could cause CPU and/or network starvation.

In the current release of Coherence (3.7.1.3 as of this writing), the pure unicast code path also has less sophisticated flow-control for cluster-wide messages (compared to the multicast-enabled code path), which may also result in significant heap consumption on the senior member's JVM (from the message backlog). This is almost never a problem in practice, but with sufficient CPU or network starvation, it could become critical.

For the non-operational concerns (near caches, queries, etc), the application itself will determine how much load is placed on the cluster. Applications intended for deployment in a pure unicast environment should be careful to avoid excessive dependence on these features. Even in an environment with multicast support, these operations may scale poorly since even with a constant request rate, the underlying workload will increase at roughly the same rate as the underlying resources are added.

Unless there is an infrastructural requirement to the contrary, multicast should be enabled.
If it can't be enabled, care should be taken to ensure the added overhead doesn't lead to performance or stability issues. This is particularly crucial in large clusters.
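The back-of-the-envelope numbers in this post can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, using the post's own approximation (dividing the packet budget by the member count, and then by the number of members sharing one network connection). The function name is hypothetical, and the last figure, for 1000 members on the same machines (20 members per machine), is derived here from the post's scenario rather than quoted from it.

```python
def cluster_wide_messages_per_second(packets_per_second: float,
                                     cluster_members: int,
                                     members_per_machine: int = 1) -> float:
    """Upper bound on cluster-wide messages one member can send over pure unicast.

    Each cluster-wide message costs roughly one packet per member, and members
    on the same machine share that machine's packet budget.
    """
    per_connection = packets_per_second / cluster_members
    return per_connection / members_per_machine


PACKETS_PER_SECOND = 70_000  # the post's estimate for a single gigabit connection

print(cluster_wide_messages_per_second(PACKETS_PER_SECOND, 500))       # 140.0
print(cluster_wide_messages_per_second(PACKETS_PER_SECOND, 500, 10))   # 14.0
print(cluster_wide_messages_per_second(PACKETS_PER_SECOND, 1000, 20))  # 3.5
```

The budget shrinks with the product of cluster size and members per machine, which is why doubling the member count on fixed hardware cuts the per-member figure to a quarter.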

Read the article

Which process is using my NAS?

- by sethu

I have a NAS connected to my cluster. The NAS holds all our home directories. When I ran a set of experiments last week, saving a 1 GB file to the NAS took around 30 seconds. Saving the same file to a local disk takes 18 seconds. But when I tried the same thing today, it took 150 seconds. I am unsure what the problem is. Can someone help me pinpoint the issue? Is it possible to find out which process is accessing the NAS, or how much NAS bandwidth is being used? Thanks for your help. -Sethu

Read the article

coordinating a script to run on only one of identical load-balanced servers

- by Amos Shapira

I have two identically configured CentOS 5 servers (possibly more in the future). I need to run a cron job on any one of them, and it must run on only one of them. I know about RedHat Cluster Suite (we use it on other servers), but it's too big a gun for this task, plus it doesn't really behave well with fewer than three nodes. Is there anything lightweight I can use for that? The servers can communicate with each other directly. I suppose I could develop something over ssh or nrpe (two services which are already installed on these servers), but I was wondering whether there is something already available.

Read the article

Linux HA - Best Heartbeat hardware solution

- by Martino Dino

Hi all, I would like to ask what the best layer 2 medium for heartbeat in Linux is, and how it's best configured. More precisely, I've been thinking about a dedicated NIC for that purpose, but then I thought that if a switch breaks I would lose the heartbeat connection for most of the cluster and STONITH 'BOOM'!!! I will probably lose my job after that :) Distributing the heartbeat onto the main NICs of every node through a vif sounds reasonable, but I'm not sure if this is the best option (at least the switches are redundant to some extent). Is it possible to use heartbeat over a bonded interface, and does that sound reasonable? Do you have any other tip/solution for that issue?

Read the article

diskpart on RDM's ...

- by karnash

Hi, we have an ESXi cluster attached to a CLARiiON CX4, with Windows 2008 R2 as the guest OS. Attached to this VM are 2 x 1.95T RDMs. In diskpart I run:

select disk 1
create partition primary size=1 (1MB)
list partition

Partition ### Type Size Offset
* Partition 1 Primary 1024 KB 1024 KB

Then I do the same for the other disk, and the offset is 1024 KB. I need to present a 4T disk to this VM, so I right-click on disk 1, convert it to a simple volume, then extend it by adding the second disk. Now when I do list partition, I see the offset is set to 31K. Can anyone please guide me? Thanks

Read the article

Upgrading drives on a MD3000

- by Anonymouse

Hello, our MD3000 array is getting full as our databases grow, and we need more space. Currently, we use an MD3000 with a two-server Windows 2003 cluster and 15x 73GB SAS drives. Disk groups are configured as RAID1 pairs of two drives. The approach we are currently investigating is simply swapping the existing SAS drives for bigger ones (300GB instead of 73GB), one at a time, and letting each RAID1 array rebuild. Is it a good approach? Will we be able to resize the array afterwards? Will we be able to resize the partitions afterwards? Can the Dell MD3000 management software do it, or will we have to bring the server offline and use some partitioning software to do it? Thanks in advance.

Read the article

How to make a DHCP server on virtual machine serves other virtual machines(on different physical machines)?

- by Tony

I'm building a virtual cluster with VirtualBox and openSUSE. I have 10 physical machines and need several VMs on each. The virtual machines are supposed to be in a "private" network, but still have internet access. I was asked to set up a virtual head node working as a DHCP server. I installed a DHCP server on the virtual head node and it seems to work. In VirtualBox I gave the head node 2 network adapters, one bridged adapter and one internal network adapter. One VM on the same physical machine has its NIC set to the internal network. That VM can get an IP address (so DHCP works) but can't access the internet. What should I do? Specifically, which network adapter type should I choose for the head node and the worker nodes in VirtualBox? What should I configure inside the virtual machines?

Read the article

Decent 1gb switch (16-24 port) for rack...

- by TomTom

Hello, for a rack containing a small number of servers (5 at the moment, and it will stay in this range), I am looking to replace the aging 100 Mbit switch with a 1 Gbit switch. This is for the backend between the servers. I expect some iSCSI traffic there, so a 10 Gbit option would be nice (preferably for two ports, as extension modules). I don't need management; this is the pure backend of an internal cluster. I do use VLANs, but there is no sensible management the switch can do there. I would like:
* 1U only, obviously
* preferably limited moving parts
* low price ;)
* enough power to run at least half the ports at full speed at the same time
Does anyone have any recommendations?

Read the article

How to broadcast a command on Windows

- by Xiao Jia

I am going to frequently deploy different versions of a program on a cluster of Windows machines (mostly Windows XP), so I would like to use a command-line broadcasting tool (either built-in or 3rd-party) to (1) download a file from some URL, and (2) execute the same command, on all the machines. I googled for a very long time but found nothing related to my goal. (Only pages about broadcasting a message, broadcasting ping, or programmatically broadcasting via TCP/IP, etc.) Is there any tool for this purpose? Or is it possible to do it pragmatically (without installing extra client programs on those machines)?

Read the article

Configuring MPI on 2 nodes

- by Wysek

I'm trying to create a really simple "cluster" from 2 multicore computers using openmpi. My problem is that I can't find any tutorials on the matter. I don't want to use torque because it's not necessary in my case; nevertheless, all tutorials give configuration details either for torque or for mpd (which doesn't exist in the openmpi implementation). Could you give me some tips or links to appropriate manuals? Steps I've already completed: - openmpi installation - network configuration (computers see each other) - ssh password-less login to the second computer. I tried using machinefiles without further configuration and with just 2 IPs in them, but jobs don't seem to start at all after the initialization part. (MPI seems to work because I'm able to scatter jobs on multiple cores of both computers without communication between them.)

Read the article

Mongrel_rails can't find memcache-client

- by tonisep

We started to use memcache-client in our rails app and it works just fine with "script/server" but "mongrel_rails start" fails with an error. In environment.rb we define "memcache-client" and version "1.8.1". Gem list shows that the gem is installed: memcache-client (1.8.1). If run with "script/server" everything works but with "mongrel_rails start" it fails with error: no such file to load -- memcache-client Any advice what could be wrong here? Is there something different in the way mongrel_rails loads the gems compared to script/server? Or is my setup just broken?

Read the article

Serving web application without Lighttpd/Apache

- by Shyam

Hi, as Rails applications run on port 3000 by default, would it be possible to start the application on port 80? Is it really required to have a fastcgi/mod_proxy-enabled web server in front? I won't have more than three users at a time. If it is possible, how would I do so? Thanks!

Read the article

HDFS some datanodes of cluster are suddenly disconnected while reducers are running

- by user1429825

I have 8 slave computers and 1 master computer running Hadoop (version 0.21). Some datanodes of the cluster suddenly disconnected while I was running MapReduce code on 10 GB of data. After all mappers had finished and around 80% of the reducers had been processed, one or more datanodes randomly disconnected from the network. The other datanodes then started to disappear from the network as well, even though I killed the MapReduce job as soon as I noticed that a datanode was disconnected. I have tried changing dfs.datanode.max.xcievers to 4096, turning off the firewalls on all computing nodes, disabling SELinux, and increasing the open-file limit to 20000, but none of it helped. Does anyone have an idea how to solve this problem?

The following is the error log from MapReduce:

12/06/01 12:31:29 INFO mapreduce.Job: Task Id : attempt_201206011227_0001_r_000006_0, Status : FAILED
java.io.IOException: Bad connect ack with firstBadLink as ***.***.***.148:20010
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:889)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:820)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427)

The following are logs from a datanode:

2012-06-01 13:01:01,118 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-5549263231281364844_3453 src: /*.*.*.147:56205 dest: /*.*.*.142:20010
2012-06-01 13:01:01,136 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(*.*.*.142:20010, storageID=DS-1534489105-*.*.*.142-20010-1337757934836, infoPort=20075, ipcPort=20020) Starting thread to transfer block blk_-3849519151985279385_5906 to *.*.*.147:20010
2012-06-01 13:01:19,135 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(*.*.*.142:20010, storageID=DS-1534489105-*.*.*.142-20010-1337757934836, infoPort=20075, ipcPort=20020):Failed to transfer blk_-5797481564121417802_3453 to *.*.*.146:20010 got java.net.ConnectException:
Connection timed out at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:701) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:373) at org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:1257) at java.lang.Thread.run(Thread.java:722) 2012-06-01 13:06:20,342 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification succeeded for blk_6674438989226364081_3453 2012-06-01 13:09:01,781 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(*.*.*.142:20010, storageID=DS-1534489105-*.*.*.142-20010-1337757934836, infoPort=20075, ipcPort=20020):Failed to transfer blk_-3849519151985279385_5906 to *.*.*.147:20010 got java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/*.*.*.142:60057 remote=/*.*.*.147:20010] at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246) at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:164) at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:203) at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:388) at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:476) at org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:1284) at java.lang.Thread.run(Thread.java:722) hdfs-site.xml <configuration> <property> <name>dfs.name.dir</name> <value>/home/hadoop/data/name</value> </property> <property> <name>dfs.data.dir</name> <value>/home/hadoop/data/hdfs1,/home/hadoop/data/hdfs2,/home/hadoop/data/hdfs3,/home/hadoop/data/hdfs4,/home/hadoop/data/hdfs5</value> </property> <property> <name>dfs.replication</name> <value>3</value> </property> <property> 
<name>dfs.datanode.max.xcievers</name> <value>4096</value> </property> <property> <name>dfs.http.address</name> <value>0.0.0.0:20070</value> <description>50070 The address and the base port where the dfs namenode web ui will listen on. If the port is 0 then the server will start on a free port. </description> </property> <property> <name>dfs.datanode.http.address</name> <value>0.0.0.0:20075</value> <description>50075 The datanode http server address and port. If the port is 0 then the server will start on a free port. </description> </property> <property> <name>dfs.secondary.http.address</name> <value>0.0.0.0:20090</value> <description>50090 The secondary namenode http server address and port. If the port is 0 then the server will start on a free port. </description> </property> <property> <name>dfs.datanode.address</name> <value>0.0.0.0:20010</value> <description>50010 The address where the datanode server will listen to. If the port is 0 then the server will start on a free port. </description> <property> <name>dfs.datanode.ipc.address</name> <value>0.0.0.0:20020</value> <description>50020 The datanode ipc server address and port. If the port is 0 then the server will start on a free port. 
</description> </property> <property> <name>dfs.datanode.https.address</name> <value>0.0.0.0:20475</value> </property> <property> <name>dfs.https.address</name> <value>0.0.0.0:20470</value> </property> </configuration> mapred-site.xml <configuration> <property> <name>mapred.job.tracker</name> <value>masternode:29001</value> </property> <property> <name>mapred.system.dir</name> <value>/home/hadoop/data/mapreduce/system</value> </property> <property> <name>mapred.local.dir</name> <value>/home/hadoop/data/mapreduce/local</value> </property> <property> <name>mapred.map.tasks</name> <value>32</value> <description> default number of map tasks per job.</description> </property> <property> <name>mapred.tasktracker.map.tasks.maximum</name> <value>4</value> </property> <property> <name>mapred.reduce.tasks</name> <value>8</value> <description> default number of reduce tasks per job.</description> </property> <property> <name>mapred.map.child.java.opts</name> <value>-Xmx2048M</value> </property> <property> <name>io.sort.mb</name> <value>500</value> </property> <property> <name>mapred.task.timeout</name> <value>1800000</value>  </property> <property> <name>mapred.job.tracker.http.address</name> <value>0.0.0.0:20030</value> <description> 50030 The job tracker http server address and port the server will listen on. If the port is 0 then the server will start on a free port. </description> </property> <property> <name>mapred.task.tracker.http.address</name> <value>0.0.0.0:20060</value> <description> 50060 </property> </configuration>
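
Since the datanode log reports "Connection timed out" on the data transfer port, one thing that might help narrow this down is checking, from each node, whether every other datanode's transfer port is still reachable while the job runs. The following is a minimal Python sketch of such a probe, not part of Hadoop itself. The port 20010 is taken from the dfs.datanode.address setting in the posted config; the function names and any host names passed in are hypothetical.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def probe_datanodes(hosts, port=20010, timeout=3.0):
    """Map each host name to whether its datanode transfer port accepted a connection.

    The default port 20010 is the value of dfs.datanode.address in the
    posted hdfs-site.xml; the host names are supplied by the caller.
    """
    return {host: can_connect(host, port, timeout) for host in hosts}
```

Running probe_datanodes with the list of slave host names from each machine in turn would show whether the failures are limited to particular pairs of nodes or affect the whole network.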

Read the article

PBS batch jobs - the qalter command

- by Ryan Budney

I've got a giant computation running on a Scientific Linux cluster. At present I have over 600 jobs parked in the queue, waiting for processor time, while a few are running. I'm trying to use the qalter command on some of the idle but scheduled jobs. I'd like to schedule them for a later time so that other users can jump part of the queue, as an act of politeness. Is this doable?

For example, JOBNAME 292399 is currently idle, scheduled to run whenever a spot in the queue opens up. But if I run qalter -a 10051000 292398 followed by qrerun 292398, I get: qrerun: Request invalid for state of job 292398.euler. From the qalter documentation, I thought 10051000 refers to tomorrow (Oct 5th, 10am), but perhaps I'm misunderstanding something?

If I'm going about this the wrong way, please let me know. The main thing I'm looking for is a command that's easily scriptable, so that I can modify when my queued tasks get run. qalter seems good for that purpose if I can get it working. I'd rather avoid running qdel and re-qsubbing the computations, because that means keeping track of which tasks to restart and which ones not to, and I want to avoid that kind of bookkeeping. From googling around I notice that some qalter commands use rather different date formats, but the one above appears to be correct, as far as I can tell from the man pages. Any help would be appreciated.
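
For the scripting side, the date argument can be generated rather than typed by hand. The sketch below assumes the -a argument follows the [[[[CC]YY]MM]DD]hhmm[.SS] form described in the PBS/Torque man pages, which matches the 10051000 value above (month 10, day 05, 10:00). The helper names are hypothetical, the year in the comments is arbitrary, and the code only builds the command line; it does not address the qrerun error.

```python
from datetime import datetime


def pbs_date_time(when, with_year=False):
    """Format a datetime as MMDDhhmm (or CCYYMMDDhhmm) for the -a option."""
    fmt = "%Y%m%d%H%M" if with_year else "%m%d%H%M"
    return when.strftime(fmt)


def qalter_argv(job_id, when):
    """Build the argument list for 'qalter -a <date_time> <job_id>'.

    The list can be handed to subprocess.run() on a host where qalter exists.
    """
    return ["qalter", "-a", pbs_date_time(when), job_id]


# Example: October 5th, 10:00 (the year is an arbitrary illustration).
# qalter_argv("292398", datetime(2012, 10, 5, 10, 0))
```

Looping over a list of job ids and calling qalter_argv for each one keeps the bookkeeping down to a single list of ids.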

Read the article

Pitfalls to using Gluster as a home/profile directory server?

- by Bart Silverstrim

I was asking recently about options for divvying up access to file servers, as we have a NAS solution that gets fairly bogged down when our users (especially those with giant profiles) all log in nearly simultaneously. I ran across Gluster, and it looks like it can cluster different physical storage media into a single virtual volume and share it out like a virtual NAS from the client perspective, and it supports CIFS. My question is whether something like this would be feasible to use for home and profile directories in an Active Directory environment. I am worried primarily about ACLs, as I don't think CIFS is fine-grained enough to support NTFS permissions, and it doesn't look like Gluster exports those permission levels, just the base permissions for basic file sharing. I got the impression that using Gluster would allow data to be redundant across multiple servers and would speed up access to the files under heavy load, while allowing us to dynamically boost storage capacity by just adding another server and telling Gluster's master node to add that server. Maybe my understanding of it is wrong, though. Does anyone else use it, or care to share how feasible this is?

Read the article

Built local glibc, broke system, how do I ssh without parsing the .bashrc?

- by Mikhail

The cluster I am on had really old build tools and I needed to use CUDA5. I'm a pretty clever dude and I planned on building the necessary tools. So, I built a local copy of gcc, binutils, and glibc. Everything a CUDA5 could want. All builds finished without error, and I tested gcc and binutils. Everything was wonderful and I built and ran a few of the programs. I set up the LD_LIBRARY_PATHs in the .bashrc and logged back in, expecting a productive night ahead. To my horror I realized that everything is dynamically linked. Now I can't run simple commands like ls:

[ex@uid377 ~]$ ls
ls: error while loading shared libraries: __vdso_time: invalid mode for dlopen(): Invalid argument

I also can't run commands to fix the problem, like rm or vim! Is there a way for me to ssh in while ignoring the .bashrc file? Any suggestions are much appreciated. This machine is obviously under-maintained and I don't know when I could get administrator support.

Read the article

Cannot increase Datastore

- by k4w4zz

Hello, we have an ESX 4.0 cluster with 2 hosts and EMC Clarion SAN storage with 10 LUNs. We have added 2 new 400 GB LUNs. All the LUNs are visible from both hosts. I have extended an existing 500 GB datastore with one of these 400 GB LUNs - the new datastore size is now 900 GB. I'd like to do the same operation with the second 400 GB LUN to extend another existing datastore, but I'm not able to do it. The LUN is available for creating a brand new datastore but is not visible when extending an existing one. I don't understand why everything was fine with the first LUN and why I can't do the exact same operation with this one. The result is the same on both hosts. The SAN admin has erased and re-created this LUN several times, and I have rescanned the HBA each time. Attached you can find the output of the esxcfg-mpath -l and fdisk -l commands on both servers. Does somebody have an idea, please?

Read the article

Managing an application across multiple servers, or PXE vs cfEngine/Chef/Puppet

- by matt

We have an application that is running on a few boxes (5 or so, and the number will grow). The hardware is identical in all the machines, and ideally the software would be as well. I have been managing them by hand up until now and don't want to anymore (static IP addresses, disabling all necessary services, installing required packages...). Can anyone balance the pros and cons of the following options, or suggest something more intelligent?

1: Individually install CentOS on all the boxes and manage the configs with chef/cfengine/puppet. This would be good, as I have wanted an excuse to learn to use one of these applications, but I don't know if this is actually the best solution.

2: Make one box perfect and image it. Serve the image over PXE, and whenever I want to make modifications, I can just reboot the boxes from a new image. How do cluster guys normally handle things like having MAC addresses in the /etc/sysconfig/network-scripts/ifcfg* files? We use InfiniBand as well, and it also refuses to start if the hwaddr is wrong. Can these be correctly generated at boot?

I'm leaning towards the PXE solution, but I think monitoring with munin or nagios will be a little more complicated with this. Does anyone have experience with this type of problem? All the servers have SSDs in them and are fast and powerful. Thanks, matt.
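
On the question of generating the hardware address at boot: one possible approach, sketched here under assumptions, is a small script run early in boot that reads the address the kernel reports and writes the ifcfg file from it. The sketch relies on the Linux sysfs file /sys/class/net/<device>/address and covers only an ordinary Ethernet device; whether the same approach satisfies the InfiniBand scripts is not verified here. The function names and the chosen set of ifcfg keys are illustrative, not a complete configuration.

```python
from pathlib import Path


def read_hwaddr(device, sysfs_root="/sys/class/net"):
    """Read the hardware address the kernel reports for a network device."""
    return Path(sysfs_root, device, "address").read_text().strip().upper()


def render_ifcfg(device, hwaddr, bootproto="dhcp"):
    """Render a minimal ifcfg-<device> file body with the given HWADDR.

    Only a handful of keys are emitted; a real file may need more
    (static addressing, netmask, gateway, and so on).
    """
    lines = [
        f"DEVICE={device}",
        f"HWADDR={hwaddr}",
        f"BOOTPROTO={bootproto}",
        "ONBOOT=yes",
    ]
    return "\n".join(lines) + "\n"


def write_ifcfg(device, target_dir="/etc/sysconfig/network-scripts"):
    """Write ifcfg-<device> using the address read from sysfs (needs root)."""
    body = render_ifcfg(device, read_hwaddr(device))
    Path(target_dir, f"ifcfg-{device}").write_text(body)
```

With something like this in the image, the same image can boot on every box without a MAC address baked into it.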

Read the article

Determining physical location of data on a disc

- by Synetech

Does anybody know of a way to find out where, physically, on a CD or DVD a given piece of data is located? I am trying to watch a DVD at the moment and am about half-way through, but it keeps dying at a specific spot in the film, presumably because of a scratch. I have a repair kit, but I don’t know where to focus my repair because there are several scuffs and scratches on the disc and I have no way of knowing which one is causing the issue. Obviously, cleaning all of them is inadvisable: it wastes the consumable materials in the kit, not all of them are a problem, and working on them may make some areas unreadable.

Moreover, just because I am half-way through the movie does not mean that the spot is half-way from the hub to the edge, for several reasons:

- Discs hold more data towards the outer edge than the inner edge (circles are more mathematically complicated than rectangles).
- The disc is not completely filled up (and even if it were, the movie itself would not be using it all; there are extras and such).
- Because in this particular case it is a commercial DVD, it is also dual-layer, which further complicates manual determination.

As such, I am trying to find a program that lets me identify a file (or part thereof), cluster, etc. and shows me a picture of where on the CD/DVD it is located. That way, I can look at the disc and fix any scratches that correspond to that distance from the hub. For example, the image below might indicate where on a disc a couple of files or a range of clusters would be located, so by looking for anomalies in those areas (rotating as necessary), the correct one can be identified. I’m sure it can be done, since at least one form of copy protection (DPM) uses it and DVD-lab Pro includes a “DVD Topology” feature to do this.
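
The first reason in the list can be made concrete with a rough estimate. Assuming data is written at constant linear density along the spiral, the amount of data inside radius r grows with the area of the ring between the inner edge of the data zone and r. The Python sketch below inverts that relationship for a single layer. The 24 mm and 58 mm radii are assumed nominal values for the DVD data zone, the function name is hypothetical, and the dual-layer case is not handled.

```python
import math


def radius_mm(fraction, r_inner=24.0, r_outer=58.0):
    """Estimate the radius (mm from the hub centre) at a fraction of one layer's capacity.

    fraction is the position within the layer's full data capacity (0.0 to 1.0),
    not the position within the movie. The default radii are assumed nominal
    values for a DVD data zone.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0.0 and 1.0")
    # Data stored up to radius r is proportional to (r^2 - r_inner^2).
    return math.sqrt(r_inner ** 2 + fraction * (r_outer ** 2 - r_inner ** 2))
```

Under these assumptions, the half-way point of a full layer lies at roughly 44 mm, which is noticeably further out than the 41 mm midpoint between the two radii.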

Search Results

Search found 1914 results on 77 pages for 'mongrel cluster'.
