Search Results

Search found 109 results on 5 pages for 'mountpoint'.

Page 5/5 | < Previous Page | 1 2 3 4 5 

  • RHCS: GFS2 in A/A cluster with common storage. Configuring GFS with rgmanager

    - by Pavel A
    I'm configuring a two node A/A cluster with a common storage attached via iSCSI, which uses GFS2 on top of clustered LVM. So far I have prepared a simple configuration, but am not sure which is the right way to configure gfs resource. Here is the rm section of /etc/cluster/cluster.conf: <rm> <failoverdomains> <failoverdomain name="node1" nofailback="0" ordered="0" restricted="1"> <failoverdomainnode name="rhc-n1"/> </failoverdomain> <failoverdomain name="node2" nofailback="0" ordered="0" restricted="1"> <failoverdomainnode name="rhc-n2"/> </failoverdomain> </failoverdomains> <resources> <script file="/etc/init.d/clvm" name="clvmd"/> <clusterfs name="gfs" fstype="gfs2" mountpoint="/mnt/gfs" device="/dev/vg-cs/lv-gfs"/> </resources> <service name="shared-storage-inst1" autostart="0" domain="node1" exclusive="0" recovery="restart"> <script ref="clvmd"> <clusterfs ref="gfs"/> </script> </service> <service name="shared-storage-inst2" autostart="0" domain="node2" exclusive="0" recovery="restart"> <script ref="clvmd"> <clusterfs ref="gfs"/> </script> </service> </rm> This is what I mean: when using clusterfs resource agent to handle GFS partition, it is not unmounted by default (unless force_unmount option is given). This way when I issue clusvcadm -s shared-storage-inst1 clvm is stopped, but GFS is not unmounted, so a node cannot alter LVM structure on shared storage anymore, but can still access data. And even though a node can do it quite safely (dlm is still running), this seems to be rather inappropriate to me, since clustat reports that the service on a particular node is stopped. Moreover if I later try to stop cman on that node, it will find a dlm locking, produced by GFS, and fail to stop. I could have simply added force_unmount="1", but I would like to know what is the reason behind the default behavior. Why is it not unmounted? Most of the examples out there silently use force_unmount="0", some don't, but none of them give any clue on how the decision was made. Apart from that I have found sample configurations, where people manage GFS partitions with gfs2 init script - https://alteeve.ca/w/2-Node_Red_Hat_KVM_Cluster_Tutorial#Defining_The_Resources or even as simply as just enabling services such as clvm and gfs2 to start automatically at boot (http://pbraun.nethence.com/doc/filesystems/gfs2.html), like: chkconfig gfs2 on If I understand the latest approach correctly, such cluster only controls whether nodes are still alive and can fence errant ones, but such cluster has no control over the status of its resources. I have some experience with Pacemaker and I'm used to that all resources are controlled by a cluster and an action can be taken when not only there are connectivity issues, but any of the resources misbehave. So, which is the right way for me to go: leave GFS partition mounted (any reasons to do so?) set force_unmount="1". Won't this break anything? Why this is not the default? use script resource <script file="/etc/init.d/gfs2" name="gfs"/> to manage GFS partition. start it at boot and don't include in cluster.conf (any reasons to do so?) This may be a sort of question that cannot be answered unambiguously, so it would be also of much value for me if you shared your experience or expressed your thoughts on the issue. How does for example /etc/cluster/cluster.conf look like when configuring gfs with Conga or ccs (they are not available to me since for now I have to use Ubuntu for the cluster)? Thanks you very much!

    Read the article

  • Setup Guide for updating local system and the repository with the incremental Solaris 11.1 SRU

    - by Gurubalan
    This guide covers the steps to implement the following setup. I. Updating the local system from Solaris 11.1 to Solaris 11.1 SRU 16.5II. Setting up local system as an IPS Repository Server (HTTP interface)III. Updating the local repository with the incremental Solaris 11.1 SRU 16.5I. Updating the local system from Solaris 11.1 to Solaris 11.1 SRU 16.5We assume that the local system is currently installed with Solaris 11.1 GA and the system doesn't have internet connectivity.What I have:1. Two parts of full repo iso files downloaded from http://www.oracle.com/technetwork/server-storage/solaris11/downloads/index.html. Both files are concatenated to a single file using the following command. $ cat sol-11_1-repo-full.iso-a sol-11_1-repo-full.iso-b > sol-11_1-repo-full.iso I suggest to verify the downloaded file against its md5checksum value [http://download.oracle.com/otn/solaris/11_1/md5sum.txt] using the following command digest -a md5 <file-name>  // the output of this command should match the original checksum value for that file.2. Incremental repo sol-11_1_16_5_0-incr-repo.iso downloaded from MOS [Patch 18269379: ORACLE SOLARIS 11.1.16.5.0 REPO ISO IMAGE (SPARC/X86 (64-BIT)]. You can get the checksum value of incremental repo iso by clicking the check box "show digest details" when you download the file.3. The local system IP is 192.168.10.10 & port 81 is reserved for repo serverPlease note that this repo file (either full or incremental) is common for both SPARC and X86(64BIT).Steps to update the local system: 1. #mounting s11.1 full repo iso to mnt        $ mount -F hsfs /soft/sol-11_1-repo-full.iso /mnt 2. Setting the pkg publisher to full repo source         $ pkg set-publisher -g file:///mnt/repo solaris 3. Perform the update of the packages.        $ pkg updateII. Setting up local system (Oracle Solaris 11.1) as an IPS Repository Server(HTTP interface):Please note that we have already mounted the full repo iso at /mnt    1. # copying /mnt permanently to the disk location at /s11.1        #zfs create -o atime=off -o mountpoint=/s11.1 rpool/s11.1        #rsync -aP /mnt/* /s11.1     2. #unmounting mnt         #umount /mnt3. To allow clients to access the local repository via HTTP, enable the application/pkg/server Service Management Facility (SMF) service.        svccfg -s application/pkg/server setprop pkg/inst_root=<data_source>/repo        eg: $svccfg -s application/pkg/server setprop pkg/inst_root=/s11.1/repo4. Setting port# to 81      svccfg -s application/pkg/server setprop pkg/port=<port_number>      eg: svccfg -s application/pkg/server setprop pkg/port="81"5a. Enable the pkg/server service (if the service is disabled)     $svcs pkg/server     STATE          STIME    FMRI     disabled        19:55:03 svc:/application/pkg/server:default      $svcadm enable pkg/server5b. Refresh/Restart the service, if it is already online       $svcadm refresh application/pkg/server       $svcadm restart application/pkg/server6. Setting pkg publisher on repo server and repo clients:      pkg set-publisher -G '*' -g http://<ip>:<port> solaris      eg: $pkg set-publisher -G '*' -g 'http://192.168.10.10:81' solaris7. Verify the Solaris 11.1 version from the repository         $pkgrepo list -s http://192.168.10.10:81 | grep entire         solaris   entire     0.5.11,5.11-0.175.1.0.0.24.2:20120919T190135Z You will have multiple row entries if the repository is setup with incremental SRUs.III. Updating the local repository with the incremental Solaris 11.1 SRU 16.51. #mounting s11.1 incremental SRU repo iso to mnt        $ mount -F hsfs <full_path_to>/sol-11_1_sruN_bldnum_respinnum-incr-repo.iso  /mnt        $ mount -F hsfs /soft/sol-11_1_16_5_0-incr-repo.iso /mnt2. Updating the local repository        $pkgrecv -s  /mnt/repo -d /s11.1/repo '*'3. Building a Search Index    $pkgrepo -s /s11.1/repo refresh     Initiating repository refresh.4. Refresh/Restart the service       $svcadm refresh svc:/application/pkg/server       $svcadm restart svc:/application/pkg/server5. Verify the repo has the incremental SRU as well.       # pkgrepo list -s http://192.168.10.10:81 | grep entire        solaris   entire      0.5.11,5.11-0.175.1.16.0.5.0:20140218T165248Z       solaris   entire      0.5.11,5.11-0.175.1.0.0.24.2:20120919T190135Z

    Read the article

  • Can connect to Samba server but cannot access shares?

    - by jlego
    I have setup a stand-alone box running Fedora 16 to use as a file-sharing and web development server. Needs to be able to share files with a PC running Windows 7 and a Mac running OSX Snow Leopard. I've setup Samba using the Samba configuration GUI tool. Added users to Fedora and connected them as Samba users (which are the same as the Windows and Mac usernames and passwords). The workgroup name is the same as the Windows workgroup. Authentication is set to User. I've allowed Samba and Samba client through the firewall and set the ethernet to a trusted port in the firewall. Both the Windows and Mac machines can connect to the server and view the shares, however when trying to access the shares, Windows throws error 0x80070035 " Windows cannot access \SERVERNAME\ShareName." Windows user is not prompted for a username or password when accessing the server (found under "Network Places"). This also happens when connecting with the IP rather than the server name. The Mac can also connect to the server and see the shares but when choosing a share gives the error "The original item for ShareName cannot be found." When connecting via IP, the Mac user is prompted for username and password, which when authenticated gives a list of shares, however when choosing a share to connect to, the error is displayed and the user cannot access the share. Since both machines are acting similarly when trying to access the shares, I assume it is an issue with how Samba is configured. smb.conf: [global] workgroup = workgroup server string = Server log file = /var/log/samba/log.%m max log size = 50 security = user load printers = yes cups options = raw printcap name = lpstat printing = cups [homes] comment = Home Directories browseable = no writable = yes [printers] comment = All Printers path = /var/spool/samba browseable = yes printable = yes [FileServ] comment = FileShare path = /media/FileServ read only = no browseable = yes valid users = user1, user2 [webdev] comment = Web development path = /var/www/html/webdev read only = no browseable = yes valid users = user1 How do I get samba sharing working? UPDATE: Before this box I had another box with the same version of fedora installed (16) and samba working for these same computers. I started up the old machine and copied the smb.conf file from the old machine to the new one (editing the share definitions for the new shares of course) and I still get the same errors on both client machines. The only difference in environment is the hardware and the router. On the old machine the router received a dynamic public IP and assigned dynamic private IPs to each device on the network while the new machine is connected to a router that has a static public IP (still dynamic internal IPs though.) Could either one of these be affecting Samba? UPDATE 2: As the directory I am trying to share is actually an entire internal disk, I have tried to things: 1.) changing the owner of the mounted disk from root to my user (which is the same username as on the Windows machine) 2.) made a share that only included one of the folders on the disk instead of the entire disk with my user again as the owner. Both tests failed giving me the same errors regarding the network address. UPDATE 3: Not sure exactly what I did, but now whenever I try to connect to the share on the Windows 7 client I am prompted for my username and password. When I enter the correct credentials I get an access denied message. However I did notice that under the login box "domain: WINDOWS-PC-NAME" is listed. I believe this could very well be the problem. Any suggestions? UPDATE 4: So I've completely reinstalled Fedora and Samba now. I've created a share on the first harddrive (one fedora is installed on) and I can access that fine from Windows. However when I try to share any data on the second disk, I am receiving the same error. This I believe is the problem. I think I need to change some things in fstab or fdisk or something. UPDATE 5: So in fstab I mapped the drive to automount in a folder which works correctly. I also added the samba_share_t SElinux label to the mountpoint directory which now allows me to access the shares on the Windows machine, however I cannot see any of the files in the directory on the windows machine. (They are there, I can see them in the fedora file browser locally) UPDATE 6: Figured it out. See answer below

    Read the article

  • I can connect to Samba server but cannot access shares.

    - by jlego
    I'm having trouble getting samba sharing working to access shares. I have setup a stand-alone box running Fedora 16 to use as a file-sharing and web development server. It needs to be able to share files with a Windows 7 PC and a Mac running OSX Snow Leopard. I've setup Samba using the Samba configuration GUI tool on Fedora. Added users to Fedora and connected them as Samba users (which are the same as the Windows and Mac usernames and passwords). The workgroup name is the same as the Windows workgroup. Authentication is set to User. I've allowed Samba and Samba client through the firewall and set the ethernet to a trusted port in the firewall. Both the Windows and Mac machines can connect to the server and view the shares, however when trying to access the shares, Windows throws error: 0x80070035 " Windows cannot access \\SERVERNAME\ShareName." Windows user is not prompted for a username or password when accessing the server (found under "Network Places"). This also happens when connecting with the IP rather than the server name. The Mac can also connect to the server and see the shares but when choosing a share gives the error: The original item for ShareName cannot be found. When connecting via IP, the Mac user is prompted for username and password, which when authenticated gives a list of shares, however when choosing a share to connect to, the error is displayed and the user cannot access the share. Since both machines are acting similarly when trying to access the shares, I assume it is an issue with how Samba is configured. smb.conf: [global] workgroup = workgroup server string = Server log file = /var/log/samba/log.%m max log size = 50 security = user load printers = yes cups options = raw printcap name = lpstat printing = cups [homes] comment = Home Directories browseable = no writable = yes [printers] comment = All Printers path = /var/spool/samba browseable = yes printable = yes [FileServ] comment = FileShare path = /media/FileServ read only = no browseable = yes valid users = user1, user2 [webdev] comment = Web development path = /var/www/html/webdev read only = no browseable = yes valid users = user1 How do I get samba sharing working? UPDATE: I Figured it out, it was because I was sharing a second hard drive. See checked answer below. Speculation 1: Before this box I had another box with the same version of fedora installed (16) and samba working for these same computers. I started up the old machine and copied the smb.conf file from the old machine to the new one (editing the share definitions for the new shares of course) and I still get the same errors on both client machines. The only difference in environment is the hardware and the router. On the old machine the router received a dynamic public IP and assigned dynamic private IPs to each device on the network while the new machine is connected to a router that has a static public IP (still dynamic internal IPs though.) Could either one of these be affecting Samba? Speculation 2: As the directory I am trying to share is actually an entire internal disk, I have tried these things: 1.) changing the owner of the mounted disk from root to my user (which is the same username as on the Windows machine) 2.) made a share that only included one of the folders on the disk instead of the entire disk with my user again as the owner. Both tests failed giving me the same errors regarding the network address. Speculation 3: Whenever I try to connect to the share on the Windows 7 client I am prompted for my username and password. When I enter the correct credentials I get an access denied message. However I did notice that under the login box "domain: WINDOWS-PC-NAME" is listed. I believe this could very well be the problem. Speculation 4: So I've completely reinstalled Fedora and Samba now. I've created a share on the first harddrive (one fedora is installed on) and I can access that fine from Windows. However when I try to share any data on the second disk, I am receiving the same error. This I believe is the problem. I think I need to change some things in fstab or fdisk or something. Speculation 5: So in fstab I mapped the drive to automount in a folder which works correctly. I also added the samba_share_t SElinux label to the mountpoint directory which now allows me to access the shares on the Windows machine, however I cannot see any of the files in the directory on the windows machine. (They are there, I can see them in the fedora file browser locally)

    Read the article

  • Quotas - Using quotas on ZFSSA shares and projects and users

    - by Steve Tunstall
    So you don't want your users to fill up your entire storage pool with their MP3 files, right? Good idea to make some quotas. There's some good tips and tricks here, including a helpful workflow (a script) that will allow you to set a default quota on all of the users of a share at once. Let's start with some basics. I mad a project called "small" and inside it I made a share called "Share1". You can set quotas on the project level, which will affect all of the shares in it, or you can do it on the share level like I am here. Go the the share's General property page. First, I'm using a Windows client, so I need to make sure I have my SMB mountpoint. Do you know this trick yet? Go to the Protocol page of the share. See the SMB section? It needs a resource name to make the UNC path for the SMB (Windows) users. You do NOT have to type this name in for every share you make! Do this at the Project level. Before you make any shares, go to the Protocol properties of the Project, and set the SMB Resource name to "On". This special code will automatically make the SMB resource name of every share in the project the same as the share name. Note the UNC path name I got below. Since I did this at the Project level, I didn't have to lift a finger for it to work on every share I make in this project. Simple. So I have now mapped my Windows "Z:" drive to this Share1. I logged in as the user "Joe". Note that my computer shows my Z: drive as 34GB, which is the entire size of my Pool that this share is in. Right now, Joe could fill this drive up and it would fill up my pool.  Now, go back to the General properties of Share1. In the "Space Usage" area, over on the right, click on the "Show All" text under the Users & Groups section. Sure enough, Joe and some other users are in here and have some data. Note this is also a handy window to use just to see how much space your users are using in any given share.  Ok, Joe owes us money from lunch last week, so we want to give him a quota of 100MB. Type his name in the Users box. Notice how it now shows you how much data he's currently using. Go ahead and give him a 100M quota and hit the Apply button. If I go back to "Show All", I can see that Joe now has a quota, and no one else does. Sure enough, as soon as I refresh my screen back on Joe's client, he sees that his Z: drive is now only 100MB, and he's more than half way full.  That was easy enough, but what if you wanted to make the whole share have a quota, so that the share itself, no matter who uses it, can only grow to a certain size? That's even easier. Just use the Quota box on the left hand side. Here, I use a Quota on the share of 300MB.  So now I log off as Joe, and log in as Steve. Even though Steve does NOT have a quota, it is showing my Z: drive as 300MB. This would effect anyone, INCLUDING the ROOT user, becuase you specified the Quota to be on the SHARE, not on a person.  Note that back in the Share, if you click the "Show All" text, the window does NOT show Steve, or anyone else, to have a quota of 300MB. Yet we do, because it's on the share itself, not on any user, so this panel does not see that. Ok, here is where it gets FUN.... Let's say you do NOT want a quota on the SHARE, because you want SOME people, like Root and yourself, to have FULL access to it and you want the ability to fill the whole thing up if you darn well feel like it. HOWEVER, you want to give the other users a quota. HOWEVER you have, say, 200 users, and you do NOT feel like typing in each of their names and giving them each a quota, and they are not all members of a AD global group you could use or anything like that.  Hmmmmmm.... No worries, mate. We have a handy-dandy script that can do this for us. Now, this script was written a few years back by Tim Graves, one of our ZFSSA engineers out of the UK. This is not my script. It is NOT supported by Oracle support in any way. It does work fine with the 2011.1.4 code as best as I can tell, but Oracle, and I, are NOT responsible for ANYTHING that you do with this script. Furthermore, I will NOT give you this script, so do not ask me for it. You need to get this from your local Oracle storage SC. I will give it to them. I want this only going to my fellow SCs, who can then work with you to have it and show you how it works.  Here's what it does...Once you add this workflow to the Maintenance-->Workflows section, you click it once to run it. Nothing seems to happen at this point, but something did.   Go back to any share or project. You will see that you now have four new, custom properties on the bottom.  Do NOT touch the bottom two properties, EVER. Only touch the top two. Here, I'm going to give my users a default quota of about 40MB each. The beauty of this script is that it will only effect users that do NOT already have any kind of personal quota. It will only change people who have no quota at all. It does not effect the Root user.  After I hit Apply on the Share screen. Nothing will happen until I go back and run the script again. The first time you run it, it creates the custom properties. The second and all subsequent times you run it, it checks the shares for any users, and applies your quota number to each one of them, UNLESS they already have one set. Notice in the readout below how it did NOT apply to my Joe user, since Joe had a quota set.  Sure enough, when I go back to the "Show All" in the share properties, all of the users who did not have a quota, now have one for 39.1MB. Hmmm... I did my math wrong, didn't I?    That's OK, I'll just change the number of the Custom Default quota again. Here, I am adding a zero on the end.  After I click Apply, and then run the script again, all of my users, except Joe, now have a quota of 391MB  You can customize a person at any time. Here, I took the Steve user, and specifically gave him a Quota of zero. Now when I run the script again, he is different from the rest, so he is no longer effected by the script. Under Show All, I see that Joe is at 100, and Steve has no Quota at all. I can do this all day long. es, you will have to re-run the script every time new users get added. The script only applies the default quota to users that are present at the time the script is ran. However, it would be a simple thing to schedule the script to run each night, or to make an alert to run the script when certain events occur.  For you power users, if you ever want to delete these custom properties and remove the script completely, you will find these properties under the "Schema" section under the Shares section. You can remove them here. There's no need to, however, they don't hurt a thing if you just don't use them.  I hope these tips have helped you out there. Quotas can be fun. 

    Read the article

  • NFS issue brings down entire vSphere ESX estate

    - by growse
    I experienced an odd issue this morning where an NFS issue appeared to have taken down the majority of my VMs hosted on a small vSphere 5.0 estate. The infrastructure itself is 4x IBM HS21 blades running around 20 VMs. The storage is provided by a single HP X1600 array with attached D2700 chassis running Solaris 11. There's a couple of storage pools on this which are exposed over NFS for the storage of the VM files, and some iSCSI LUNs for things like MSCS shared disks. Normally, this is pretty stable, but I appreciate the lack of resiliancy in having a single X1600 doing all the storage. This morning, in the logs of each ESX host, at around 0521 GMT I saw a lot of entries like this: 2011-11-30T05:21:54.161Z cpu2:2050)NFSLock: 608: Stop accessing fd 0x41000a4cf9a8 3 2011-11-30T05:21:54.161Z cpu2:2050)NFSLock: 608: Stop accessing fd 0x41000a4dc9e8 3 2011-11-30T05:21:54.161Z cpu2:2050)NFSLock: 608: Stop accessing fd 0x41000a4d3fa8 3 2011-11-30T05:21:54.161Z cpu2:2050)NFSLock: 608: Stop accessing fd 0x41000a4de0a8 3 [....] 2011-11-30T06:16:07.042Z cpu0:2058)WARNING: NFS: 283: Lost connection to the server 10.13.111.197 mount point /sastank/VMStorage, mounted as f0342e1c-19be66b5-0000-000000000000 ("SAStank") 2011-11-30T06:17:01.459Z cpu2:4011)NFS: 292: Restored connection to the server 10.13.111.197 mount point /sastank/VMStorage, mounted as f0342e1c-19be66b5-0000-000000000000 ("SAStank") 2011-11-30T06:25:17.887Z cpu3:2051)NFSLock: 608: Stop accessing fd 0x41000a4c2b28 3 2011-11-30T06:27:16.063Z cpu3:4011)NFSLock: 568: Start accessing fd 0x41000a4d8928 again 2011-11-30T06:35:30.827Z cpu1:2058)WARNING: NFS: 283: Lost connection to the server 10.13.111.197 mount point /tank/ISO, mounted as 5acdbb3e-410e56e3-0000-000000000000 ("ISO (1)") 2011-11-30T06:36:37.953Z cpu6:2054)NFS: 292: Restored connection to the server 10.13.111.197 mount point /tank/ISO, mounted as 5acdbb3e-410e56e3-0000-000000000000 ("ISO (1)") 2011-11-30T06:40:08.242Z cpu6:2054)NFSLock: 608: Stop accessing fd 0x41000a4c3e68 3 2011-11-30T06:40:34.647Z cpu3:2051)NFSLock: 568: Start accessing fd 0x41000a4d8928 again 2011-11-30T06:44:42.663Z cpu1:2058)WARNING: NFS: 283: Lost connection to the server 10.13.111.197 mount point /sastank/VMStorage, mounted as f0342e1c-19be66b5-0000-000000000000 ("SAStank") 2011-11-30T06:44:53.973Z cpu0:4011)NFS: 292: Restored connection to the server 10.13.111.197 mount point /sastank/VMStorage, mounted as f0342e1c-19be66b5-0000-000000000000 ("SAStank") 2011-11-30T06:51:28.296Z cpu5:2058)NFSLock: 608: Stop accessing fd 0x41000ae3c528 3 2011-11-30T06:51:44.024Z cpu4:2052)NFSLock: 568: Start accessing fd 0x41000ae3b8e8 again 2011-11-30T06:56:30.758Z cpu4:2058)WARNING: NFS: 283: Lost connection to the server 10.13.111.197 mount point /sastank/VMStorage, mounted as f0342e1c-19be66b5-0000-000000000000 ("SAStank") 2011-11-30T06:56:53.389Z cpu7:2055)NFS: 292: Restored connection to the server 10.13.111.197 mount point /sastank/VMStorage, mounted as f0342e1c-19be66b5-0000-000000000000 ("SAStank") 2011-11-30T07:01:50.350Z cpu6:2054)ScsiDeviceIO: 2316: Cmd(0x41240072bc80) 0x12, CmdSN 0x9803 to dev "naa.600508e000000000505c16815a36c50d" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0. 2011-11-30T07:03:48.449Z cpu3:2051)NFSLock: 608: Stop accessing fd 0x41000ae46b68 3 2011-11-30T07:03:57.318Z cpu4:4009)NFSLock: 568: Start accessing fd 0x41000ae48228 again (I've put a complete dump from one of the hosts on pastebin: http://pastebin.com/Vn60wgTt) When I got in the office at 9am, I saw various failures and alarms and troubleshooted the issue. It turned out that pretty much all of the VMs were inaccessible, and that the ESX hosts either were describing each VM as 'powered off', 'powered on', or 'unavailable'. The VMs described as 'powered on' where not in any way reachable or responding to pings, so this may be lies. There's absolutely no indication on the X1600 that anything was awry, and nothing on the switches to indicate any loss of connectivity. I only managed to resolve the issue by rebooting the ESX hosts in turn. I have a number of questions: What the hell happened? If this was a temporary NFS failure, why did it put the ESX hosts into a state from which a reboot was the only recovery? In the future, when the NFS server goes a little off-piste, what would be the best approach to add some resilience? I've been looking at budgeting for next year and potentially have budget to purchase another X1600/D2700/disks, would an identical mirrored disk setup help to mitigate these sorts of failures automatically? Edit (Added requested details) To expand with some details as requested: The X1600 has 12x 1TB disks lumped together in mirrored pairs as tank, and the D2700 (connected with a mini SAS cable) has 12x 300GB 10k SAS disks lumped together in mirrored pairs as sastank zpool status pool: rpool state: ONLINE scan: none requested config: NAME STATE READ WRITE CKSUM rpool ONLINE 0 0 0 c7t0d0s0 ONLINE 0 0 0 errors: No known data errors pool: sastank state: ONLINE scan: scrub repaired 0 in 74h21m with 0 errors on Wed Nov 30 02:51:58 2011 config: NAME STATE READ WRITE CKSUM sastank ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 c7t14d0 ONLINE 0 0 0 c7t15d0 ONLINE 0 0 0 mirror-1 ONLINE 0 0 0 c7t16d0 ONLINE 0 0 0 c7t17d0 ONLINE 0 0 0 mirror-2 ONLINE 0 0 0 c7t18d0 ONLINE 0 0 0 c7t19d0 ONLINE 0 0 0 mirror-3 ONLINE 0 0 0 c7t20d0 ONLINE 0 0 0 c7t21d0 ONLINE 0 0 0 mirror-4 ONLINE 0 0 0 c7t22d0 ONLINE 0 0 0 c7t23d0 ONLINE 0 0 0 mirror-5 ONLINE 0 0 0 c7t24d0 ONLINE 0 0 0 c7t25d0 ONLINE 0 0 0 errors: No known data errors pool: tank state: ONLINE scan: scrub repaired 0 in 17h28m with 0 errors on Mon Nov 28 17:58:19 2011 config: NAME STATE READ WRITE CKSUM tank ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 c7t1d0 ONLINE 0 0 0 c7t2d0 ONLINE 0 0 0 mirror-1 ONLINE 0 0 0 c7t3d0 ONLINE 0 0 0 c7t4d0 ONLINE 0 0 0 mirror-2 ONLINE 0 0 0 c7t5d0 ONLINE 0 0 0 c7t6d0 ONLINE 0 0 0 mirror-3 ONLINE 0 0 0 c7t8d0 ONLINE 0 0 0 c7t9d0 ONLINE 0 0 0 mirror-4 ONLINE 0 0 0 c7t10d0 ONLINE 0 0 0 c7t11d0 ONLINE 0 0 0 mirror-5 ONLINE 0 0 0 c7t12d0 ONLINE 0 0 0 c7t13d0 ONLINE 0 0 0 errors: No known data errors The filesystem exposed over NFS for the primary datastore is sastank/VMStorage zfs list NAME USED AVAIL REFER MOUNTPOINT rpool 45.1G 13.4G 92.5K /rpool rpool/ROOT 2.28G 13.4G 31K legacy rpool/ROOT/solaris 2.28G 13.4G 2.19G / rpool/dump 15.0G 13.4G 15.0G - rpool/export 11.9G 13.4G 32K /export rpool/export/home 11.9G 13.4G 32K /export/home rpool/export/home/andrew 11.9G 13.4G 11.9G /export/home/andrew rpool/swap 15.9G 29.2G 123M - sastank 1.08T 536G 33K /sastank sastank/VMStorage 1.01T 536G 1.01T /sastank/VMStorage sastank/comstar 71.7G 536G 31K /sastank/comstar sastank/comstar/sql_tempdb 6.31G 536G 6.31G - sastank/comstar/sql_tx_data 65.4G 536G 65.4G - tank 4.79T 578G 42K /tank tank/FTP 269G 578G 269G /tank/FTP tank/ISO 28.8G 578G 25.9G /tank/ISO tank/backupstage 2.64T 578G 2.49T /tank/backupstage tank/cifs 301G 578G 297G /tank/cifs tank/comstar 1.54T 578G 31K /tank/comstar tank/comstar/msdtc 1.07G 579G 32.8M - tank/comstar/quorum 577M 578G 47.9M - tank/comstar/sqldata 1.54T 886G 304G - tank/comstar/vsphere_lun 2.09G 580G 22.2M - tank/mcs-asset-repository 7.01M 578G 6.99M /tank/mcs-asset-repository tank/mscs-quorum 55K 578G 36K /tank/mscs-quorum tank/sccm 16.1G 578G 12.8G /tank/sccm As for the networking, all connections between the X1600, the Blades and the switch are either LACP or Etherchannel bonded 2x 1Gbit links. Switch is a single Cisco 3750. Storage traffic sits on its own VLAN segregated from VM machine traffic.

    Read the article

  • lxc containers hangs after upgrade to 13.10

    - by doug123
    I have 3 lxc containers. They were all working fine on 12.10 and I upgraded the containers with do-release-upgrade on the containers to 13.04 and 13.10 and that worked great. Then I upgraded the host to 13.04 and then 13.10 and now the 3 containers hang with this: >lxc-start -n as1 -l DEBUG -o $(tty) lxc-start 1383145786.513 INFO lxc_start_ui - using rcfile /var/lib/lxc/as1/config lxc-start 1383145786.513 WARN lxc_log - lxc_log_init called with log already initialized lxc-start 1383145786.513 INFO lxc_apparmor - aa_enabled set to 1 lxc-start 1383145786.514 DEBUG lxc_conf - allocated pty '/dev/pts/2' (5/6) lxc-start 1383145786.514 DEBUG lxc_conf - allocated pty '/dev/pts/13' (7/8) lxc-start 1383145786.514 DEBUG lxc_conf - allocated pty '/dev/pts/14' (9/10) lxc-start 1383145786.514 DEBUG lxc_conf - allocated pty '/dev/pts/15' (11/12) lxc-start 1383145786.514 DEBUG lxc_conf - allocated pty '/dev/pts/17' (13/14) lxc-start 1383145786.514 DEBUG lxc_conf - allocated pty '/dev/pts/18' (15/16) lxc-start 1383145786.514 DEBUG lxc_conf - allocated pty '/dev/pts/19' (17/18) lxc-start 1383145786.514 DEBUG lxc_conf - allocated pty '/dev/pts/20' (19/20) lxc-start 1383145786.514 INFO lxc_conf - tty's configured lxc-start 1383145786.514 DEBUG lxc_start - sigchild handler set lxc-start 1383145786.514 DEBUG lxc_console - opening /dev/tty for console peer lxc-start 1383145786.514 DEBUG lxc_console - using '/dev/tty' as console lxc-start 1383145786.514 DEBUG lxc_console - 6242 got SIGWINCH fd 25 lxc-start 1383145786.514 DEBUG lxc_console - set winsz dstfd:22 cols:177 rows:53 lxc-start 1383145786.514 INFO lxc_start - 'as1' is initialized lxc-start 1383145786.522 DEBUG lxc_start - Not dropping cap_sys_boot or watching utmp lxc-start 1383145786.524 DEBUG lxc_conf - mac address of host interface 'vethB4L35W' changed to private fe:7c:96:a0:ae:29 lxc-start 1383145786.525 DEBUG lxc_conf - instanciated veth 'vethB4L35W/vethVC61K2', index is '26' lxc-start 1383145786.529 DEBUG lxc_cgroup - cgroup 'memory.limit_in_bytes' set to '20G' lxc-start 1383145786.529 DEBUG lxc_cgroup - cgroup 'cpuset.cpus' set to '12-23' lxc-start 1383145786.529 INFO lxc_cgroup - cgroup has been setup lxc-start 1383145786.555 DEBUG lxc_conf - move 'eth0' to '6249' lxc-start 1383145786.555 INFO lxc_conf - 'as1' hostname has been setup lxc-start 1383145786.575 DEBUG lxc_conf - 'eth0' has been setup lxc-start 1383145786.575 INFO lxc_conf - network has been setup lxc-start 1383145786.575 INFO lxc_conf - looking at .44 42 252:0 / / rw,relatime - ext4 /dev/mapper/limitorderbook1-root rw,errors=remount-ro,data=ordered . lxc-start 1383145786.575 INFO lxc_conf - now p is . /. lxc-start 1383145786.575 INFO lxc_conf - looking at .52 44 0:5 / /dev rw,relatime - devtmpfs udev rw,size=32961632k,nr_inodes=8240408,mode=755 . lxc-start 1383145786.575 INFO lxc_conf - now p is . /dev. lxc-start 1383145786.575 INFO lxc_conf - looking at .61 52 0:11 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts rw,mode=600,ptmxmode=000 . lxc-start 1383145786.575 INFO lxc_conf - now p is . /dev/pts. lxc-start 1383145786.575 INFO lxc_conf - looking at .68 44 0:15 / /run rw,nosuid,noexec,relatime - tmpfs tmpfs rw,size=6594456k,mode=755 . lxc-start 1383145786.575 INFO lxc_conf - now p is . /run. lxc-start 1383145786.575 INFO lxc_conf - looking at .69 68 0:18 / /run/lock rw,nosuid,nodev,noexec,relatime - tmpfs none rw,size=5120k . lxc-start 1383145786.575 INFO lxc_conf - now p is . /run/lock. lxc-start 1383145786.575 INFO lxc_conf - looking at .72 68 0:19 / /run/shm rw,nosuid,nodev,relatime - tmpfs none rw . lxc-start 1383145786.575 INFO lxc_conf - now p is . /run/shm. lxc-start 1383145786.575 INFO lxc_conf - looking at .73 68 0:21 / /run/user rw,nosuid,nodev,noexec,relatime - tmpfs none rw,size=102400k,mode=755 . lxc-start 1383145786.575 INFO lxc_conf - now p is . /run/user. lxc-start 1383145786.575 INFO lxc_conf - looking at .76 44 0:14 / /sys rw,nosuid,nodev,noexec,relatime - sysfs sysfs rw . lxc-start 1383145786.575 INFO lxc_conf - now p is . /sys. lxc-start 1383145786.575 INFO lxc_conf - looking at .77 76 0:16 / /sys/fs/cgroup rw,relatime - tmpfs none rw,size=4k,mode=755 . lxc-start 1383145786.575 INFO lxc_conf - now p is . /sys/fs/cgroup. lxc-start 1383145786.575 INFO lxc_conf - looking at .78 77 0:20 / /sys/fs/cgroup/cpuset rw,relatime - cgroup cgroup rw,cpuset,clone_children . lxc-start 1383145786.575 INFO lxc_conf - now p is . /sys/fs/cgroup/cpuset. lxc-start 1383145786.575 INFO lxc_conf - looking at .79 77 0:23 / /sys/fs/cgroup/cpu rw,relatime - cgroup cgroup rw,cpu . lxc-start 1383145786.575 INFO lxc_conf - now p is . /sys/fs/cgroup/cpu. lxc-start 1383145786.575 INFO lxc_conf - looking at .80 77 0:24 / /sys/fs/cgroup/cpuacct rw,relatime - cgroup cgroup rw,cpuacct . lxc-start 1383145786.575 INFO lxc_conf - now p is . /sys/fs/cgroup/cpuacct. lxc-start 1383145786.575 INFO lxc_conf - looking at .81 77 0:25 / /sys/fs/cgroup/memory rw,relatime - cgroup cgroup rw,memory . lxc-start 1383145786.575 INFO lxc_conf - now p is . /sys/fs/cgroup/memory. lxc-start 1383145786.575 INFO lxc_conf - looking at .82 77 0:26 / /sys/fs/cgroup/devices rw,relatime - cgroup cgroup rw,devices . lxc-start 1383145786.575 INFO lxc_conf - now p is . /sys/fs/cgroup/devices. lxc-start 1383145786.575 INFO lxc_conf - looking at .83 77 0:27 / /sys/fs/cgroup/freezer rw,relatime - cgroup cgroup rw,freezer . lxc-start 1383145786.575 INFO lxc_conf - now p is . /sys/fs/cgroup/freezer. lxc-start 1383145786.575 INFO lxc_conf - looking at .84 77 0:28 / /sys/fs/cgroup/blkio rw,relatime - cgroup cgroup rw,blkio . lxc-start 1383145786.575 INFO lxc_conf - now p is . /sys/fs/cgroup/blkio. lxc-start 1383145786.575 INFO lxc_conf - looking at .85 77 0:29 / /sys/fs/cgroup/perf_event rw,relatime - cgroup cgroup rw,perf_event . lxc-start 1383145786.575 INFO lxc_conf - now p is . /sys/fs/cgroup/perf_event. lxc-start 1383145786.575 INFO lxc_conf - looking at .94 77 0:30 / /sys/fs/cgroup/hugetlb rw,relatime - cgroup cgroup rw,hugetlb . lxc-start 1383145786.575 INFO lxc_conf - now p is . /sys/fs/cgroup/hugetlb. lxc-start 1383145786.575 INFO lxc_conf - looking at .95 77 0:31 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime - cgroup systemd rw,name=systemd . lxc-start 1383145786.575 INFO lxc_conf - now p is . /sys/fs/cgroup/systemd. lxc-start 1383145786.575 INFO lxc_conf - looking at .96 76 0:17 / /sys/fs/fuse/connections rw,relatime - fusectl none rw . lxc-start 1383145786.575 INFO lxc_conf - now p is . /sys/fs/fuse/connections. lxc-start 1383145786.575 INFO lxc_conf - looking at .98 76 0:6 / /sys/kernel/debug rw,relatime - debugfs none rw . lxc-start 1383145786.575 INFO lxc_conf - now p is . /sys/kernel/debug. lxc-start 1383145786.575 INFO lxc_conf - looking at .101 76 0:10 / /sys/kernel/security rw,relatime - securityfs none rw . lxc-start 1383145786.575 INFO lxc_conf - now p is . /sys/kernel/security. lxc-start 1383145786.575 INFO lxc_conf - looking at .102 76 0:22 / /sys/fs/pstore rw,relatime - pstore none rw . lxc-start 1383145786.575 INFO lxc_conf - now p is . /sys/fs/pstore. lxc-start 1383145786.575 INFO lxc_conf - looking at .103 44 0:3 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw . lxc-start 1383145786.575 INFO lxc_conf - now p is . /proc. lxc-start 1383145786.575 INFO lxc_conf - looking at .104 44 9:2 / /data rw,relatime - ext4 /dev/md2 rw,errors=remount-ro,data=ordered . lxc-start 1383145786.575 INFO lxc_conf - now p is . /data. lxc-start 1383145786.575 INFO lxc_conf - looking at .105 44 8:1 / /boot rw,relatime - ext2 /dev/sda1 rw,errors=continue . lxc-start 1383145786.575 INFO lxc_conf - now p is . /boot. lxc-start 1383145786.576 DEBUG lxc_conf - mounted '/data/srv/lxc/as1' on '/usr/lib/x86_64-linux-gnu/lxc' lxc-start 1383145786.576 DEBUG lxc_conf - mounted 'none' on '/usr/lib/x86_64-linux-gnu/lxc//dev/pts', type 'devpts' lxc-start 1383145786.576 DEBUG lxc_conf - mounted 'none' on '/usr/lib/x86_64-linux-gnu/lxc//proc', type 'proc' lxc-start 1383145786.576 DEBUG lxc_conf - mounted 'none' on '/usr/lib/x86_64-linux-gnu/lxc//sys', type 'sysfs' lxc-start 1383145786.576 DEBUG lxc_conf - mounted 'none' on '/usr/lib/x86_64-linux-gnu/lxc//run', type 'tmpfs' lxc-start 1383145786.576 INFO lxc_conf - mount points have been setup lxc-start 1383145786.577 INFO lxc_conf - console has been setup lxc-start 1383145786.577 INFO lxc_conf - 8 tty(s) has been setup lxc-start 1383145786.577 INFO lxc_conf - rootfs path is ./data/srv/lxc/as1., mount is ./usr/lib/x86_64-linux-gnu/lxc. lxc-start 1383145786.577 INFO lxc_apparmor - I am 1, /proc/self points to 1 lxc-start 1383145786.577 DEBUG lxc_conf - created '/usr/lib/x86_64-linux-gnu/lxc/lxc_putold' directory lxc-start 1383145786.577 DEBUG lxc_conf - mountpoint for old rootfs is '/usr/lib/x86_64-linux-gnu/lxc/lxc_putold' lxc-start 1383145786.577 DEBUG lxc_conf - pivot_root syscall to '/usr/lib/x86_64-linux-gnu/lxc' successful lxc-start 1383145786.577 DEBUG lxc_conf - umounted '/lxc_putold/dev/pts' lxc-start 1383145786.577 DEBUG lxc_conf - umounted '/lxc_putold/run/lock' lxc-start 1383145786.577 DEBUG lxc_conf - umounted '/lxc_putold/run/shm' lxc-start 1383145786.577 DEBUG lxc_conf - umounted '/lxc_putold/run/user' lxc-start 1383145786.577 DEBUG lxc_conf - umounted '/lxc_putold/sys/fs/cgroup/cpuset' lxc-start 1383145786.577 DEBUG lxc_conf - umounted '/lxc_putold/sys/fs/cgroup/cpu' lxc-start 1383145786.577 DEBUG lxc_conf - umounted '/lxc_putold/sys/fs/cgroup/cpuacct' lxc-start 1383145786.577 DEBUG lxc_conf - umounted '/lxc_putold/sys/fs/cgroup/memory' lxc-start 1383145786.577 DEBUG lxc_conf - umounted '/lxc_putold/sys/fs/cgroup/devices' lxc-start 1383145786.577 DEBUG lxc_conf - umounted '/lxc_putold/sys/fs/cgroup/freezer' lxc-start 1383145786.577 DEBUG lxc_conf - umounted '/lxc_putold/sys/fs/cgroup/blkio' lxc-start 1383145786.577 DEBUG lxc_conf - umounted '/lxc_putold/sys/fs/cgroup/perf_event' lxc-start 1383145786.577 DEBUG lxc_conf - umounted '/lxc_putold/sys/fs/cgroup/hugetlb' lxc-start 1383145786.577 DEBUG lxc_conf - umounted '/lxc_putold/sys/fs/cgroup/systemd' lxc-start 1383145786.577 DEBUG lxc_conf - umounted '/lxc_putold/sys/fs/fuse/connections' lxc-start 1383145786.577 DEBUG lxc_conf - umounted '/lxc_putold/sys/kernel/debug' lxc-start 1383145786.577 DEBUG lxc_conf - umounted '/lxc_putold/sys/kernel/security' lxc-start 1383145786.577 DEBUG lxc_conf - umounted '/lxc_putold/sys/fs/pstore' lxc-start 1383145786.577 DEBUG lxc_conf - umounted '/lxc_putold/proc' lxc-start 1383145786.577 DEBUG lxc_conf - umounted '/lxc_putold/data' lxc-start 1383145786.577 DEBUG lxc_conf - umounted '/lxc_putold/boot' lxc-start 1383145786.577 DEBUG lxc_conf - umounted '/lxc_putold/dev' lxc-start 1383145786.577 DEBUG lxc_conf - umounted '/lxc_putold/run' lxc-start 1383145786.577 DEBUG lxc_conf - umounted '/lxc_putold/sys/fs/cgroup' lxc-start 1383145786.577 DEBUG lxc_conf - umounted '/lxc_putold/sys' lxc-start 1383145786.577 DEBUG lxc_conf - umounted '/lxc_putold' lxc-start 1383145786.577 INFO lxc_conf - created new pts instance lxc-start 1383145786.578 DEBUG lxc_conf - drop capability 'sys_boot' (22) lxc-start 1383145786.578 DEBUG lxc_conf - capabilities have been setup lxc-start 1383145786.578 NOTICE lxc_conf - 'as1' is setup. lxc-start 1383145786.578 DEBUG lxc_cgroup - cgroup 'memory.limit_in_bytes' set to '20G' lxc-start 1383145786.578 DEBUG lxc_cgroup - cgroup 'cpuset.cpus' set to '12-23' lxc-start 1383145786.578 INFO lxc_cgroup - cgroup has been setup lxc-start 1383145786.578 INFO lxc_apparmor - setting up apparmor lxc-start 1383145786.578 INFO lxc_apparmor - changed apparmor profile to lxc-container-default lxc-start 1383145786.578 NOTICE lxc_start - exec'ing '/sbin/init' lxc-start 1383145786.578 INFO lxc_conf - looking at .15 20 0:14 / /sys rw,nosuid,nodev,noexec,relatime - sysfs sysfs rw . lxc-start 1383145786.578 INFO lxc_conf - now p is . /sys. lxc-start 1383145786.578 INFO lxc_conf - looking at .16 20 0:3 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw . lxc-start 1383145786.578 INFO lxc_conf - now p is . /proc. lxc-start 1383145786.578 INFO lxc_conf - looking at .17 20 0:5 / /dev rw,relatime - devtmpfs udev rw,size=32961632k,nr_inodes=8240408,mode=755 . lxc-start 1383145786.578 INFO lxc_conf - now p is . /dev. lxc-start 1383145786.578 INFO lxc_conf - looking at .18 17 0:11 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts rw,mode=600,ptmxmode=000 . lxc-start 1383145786.578 INFO lxc_conf - now p is . /dev/pts. lxc-start 1383145786.578 INFO lxc_conf - looking at .19 20 0:15 / /run rw,nosuid,noexec,relatime - tmpfs tmpfs rw,size=6594456k,mode=755 . lxc-start 1383145786.578 INFO lxc_conf - now p is . /run. lxc-start 1383145786.578 INFO lxc_conf - looking at .20 1 252:0 / / rw,relatime - ext4 /dev/mapper/limitorderbook1-root rw,errors=remount-ro,data=ordered . lxc-start 1383145786.578 INFO lxc_conf - now p is . /. lxc-start 1383145786.578 INFO lxc_conf - looking at .22 15 0:16 / /sys/fs/cgroup rw,relatime - tmpfs none rw,size=4k,mode=755 . lxc-start 1383145786.578 INFO lxc_conf - now p is . /sys/fs/cgroup. lxc-start 1383145786.578 INFO lxc_conf - looking at .23 15 0:17 / /sys/fs/fuse/connections rw,relatime - fusectl none rw . lxc-start 1383145786.578 INFO lxc_conf - now p is . /sys/fs/fuse/connections. lxc-start 1383145786.578 INFO lxc_conf - looking at .24 15 0:6 / /sys/kernel/debug rw,relatime - debugfs none rw . lxc-start 1383145786.579 INFO lxc_conf - now p is . /sys/kernel/debug. lxc-start 1383145786.579 INFO lxc_conf - looking at .25 15 0:10 / /sys/kernel/security rw,relatime - securityfs none rw . lxc-start 1383145786.579 INFO lxc_conf - now p is . /sys/kernel/security. lxc-start 1383145786.579 INFO lxc_conf - looking at .26 19 0:18 / /run/lock rw,nosuid,nodev,noexec,relatime - tmpfs none rw,size=5120k . lxc-start 1383145786.579 INFO lxc_conf - now p is . /run/lock. lxc-start 1383145786.579 INFO lxc_conf - looking at .27 19 0:19 / /run/shm rw,nosuid,nodev,relatime - tmpfs none rw . lxc-start 1383145786.579 INFO lxc_conf - now p is . /run/shm. lxc-start 1383145786.579 INFO lxc_conf - looking at .28 22 0:20 / /sys/fs/cgroup/cpuset rw,relatime - cgroup cgroup rw,cpuset,clone_children . lxc-start 1383145786.579 INFO lxc_conf - now p is . /sys/fs/cgroup/cpuset. lxc-start 1383145786.579 INFO lxc_conf - looking at .29 19 0:21 / /run/user rw,nosuid,nodev,noexec,relatime - tmpfs none rw,size=102400k,mode=755 . lxc-start 1383145786.579 INFO lxc_conf - now p is . /run/user. lxc-start 1383145786.579 INFO lxc_conf - looking at .30 15 0:22 / /sys/fs/pstore rw,relatime - pstore none rw . lxc-start 1383145786.579 INFO lxc_conf - now p is . /sys/fs/pstore. lxc-start 1383145786.579 INFO lxc_conf - looking at .31 22 0:23 / /sys/fs/cgroup/cpu rw,relatime - cgroup cgroup rw,cpu . lxc-start 1383145786.579 INFO lxc_conf - now p is . /sys/fs/cgroup/cpu. lxc-start 1383145786.579 INFO lxc_conf - looking at .32 22 0:24 / /sys/fs/cgroup/cpuacct rw,relatime - cgroup cgroup rw,cpuacct . lxc-start 1383145786.579 INFO lxc_conf - now p is . /sys/fs/cgroup/cpuacct. lxc-start 1383145786.579 INFO lxc_conf - looking at .33 22 0:25 / /sys/fs/cgroup/memory rw,relatime - cgroup cgroup rw,memory . lxc-start 1383145786.579 INFO lxc_conf - now p is . /sys/fs/cgroup/memory. lxc-start 1383145786.579 INFO lxc_conf - looking at .34 22 0:26 / /sys/fs/cgroup/devices rw,relatime - cgroup cgroup rw,devices . lxc-start 1383145786.579 INFO lxc_conf - now p is . /sys/fs/cgroup/devices. lxc-start 1383145786.579 INFO lxc_conf - looking at .35 22 0:27 / /sys/fs/cgroup/freezer rw,relatime - cgroup cgroup rw,freezer . lxc-start 1383145786.579 INFO lxc_conf - now p is . /sys/fs/cgroup/freezer. lxc-start 1383145786.579 INFO lxc_conf - looking at .36 22 0:28 / /sys/fs/cgroup/blkio rw,relatime - cgroup cgroup rw,blkio . lxc-start 1383145786.579 INFO lxc_conf - now p is . /sys/fs/cgroup/blkio. lxc-start 1383145786.579 INFO lxc_conf - looking at .37 22 0:29 / /sys/fs/cgroup/perf_event rw,relatime - cgroup cgroup rw,perf_event . lxc-start 1383145786.579 INFO lxc_conf - now p is . /sys/fs/cgroup/perf_event. lxc-start 1383145786.579 INFO lxc_conf - looking at .38 22 0:30 / /sys/fs/cgroup/hugetlb rw,relatime - cgroup cgroup rw,hugetlb . lxc-start 1383145786.579 INFO lxc_conf - now p is . /sys/fs/cgroup/hugetlb. lxc-start 1383145786.579 INFO lxc_conf - looking at .39 20 9:2 / /data rw,relatime - ext4 /dev/md2 rw,errors=remount-ro,data=ordered . lxc-start 1383145786.579 INFO lxc_conf - now p is . /data. lxc-start 1383145786.579 INFO lxc_conf - looking at .40 20 8:1 / /boot rw,relatime - ext2 /dev/sda1 rw,errors=continue . lxc-start 1383145786.579 INFO lxc_conf - now p is . /boot. lxc-start 1383145786.579 INFO lxc_conf - looking at .41 22 0:31 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime - cgroup systemd rw,name=systemd . lxc-start 1383145786.579 INFO lxc_conf - now p is . /sys/fs/cgroup/systemd. lxc-start 1383145786.579 NOTICE lxc_start - '/sbin/init' started with pid '6249' lxc-start 1383145786.579 WARN lxc_start - invalid pid for SIGCHLD <4>init: ureadahead main process (7) terminated with status 5 <4>init: console-font main process (94) terminated with status 1 And it will just sit there like that for hours at least. The container becomes pingable but I can't ssh and if I try lxc-console -n as1 I get a blank screen. If I do lxc-stop -n as1 or ^C in the window where it has hung I get: ^CTERM environment variable not set. <4>init: plymouth-upstart-bridge main process (192) terminated with status 1 <4>init: hwclock-save main process (187) terminated with status 70 * Asking all remaining processes to terminate... ...done. * All processes ended within 1 seconds... ...done. * Deactivating swap... ...fail! mount: cannot mount block device /dev/md2 read-only * Will now restart But after 20 minutes it hasn't restarted. Any ideas why these containers are hanging?

    Read the article

  • Bacula & Multiple Tape Devices, and so on

    - by Tom O'Connor
    Bacula won't make use of 2 tape devices simultaneously. (Search for #-#-# for the TL;DR) A little background, perhaps. In the process of trying to get a decent working backup solution (backing up 20TB ain't cheap, or easy) at $dayjob, we bought a bunch of things to make it work. Firstly, there's a Spectra Logic T50e autochanger, 40 slots of LTO5 goodness, and that robot's got a pair of IBM HH5 Ultrium LTO5 drives, connected via FibreChannel Arbitrated Loop to our backup server. There's the backup server.. A Dell R715 with 2x 16 core AMD 62xx CPUs, and 32GB of RAM. Yummy. That server's got 2 Emulex FCe-12000E cards, and an Intel X520-SR dual port 10GE NIC. We were also sold Commvault Backup (non-NDMP). Here's where it gets really complicated. Spectra Logic and Commvault both sent respective engineers, who set up the library and the software. Commvault was running fine, in so far as the controller was working fine. The Dell server has Ubuntu 12.04 server, and runs the MediaAgent for CommVault, and mounts our BlueArc NAS as NFS to a few mountpoints, like /home, and some stuff in /mnt. When backing up from the NFS mountpoints, we were seeing ~= 290GB/hr throughput. That's CRAP, considering we've got 20-odd TB to get through, in a <48 hour backup window. The rated maximum on the BlueArc is 700MB/s (2460GB/hr), the rated maximum write speed on the tape devices is 140MB/s, per drive, so that's 492GB/hr (or double it, for the total throughput). So, the next step was to benchmark NFS performance with IOzone, and it turns out that we get epic write performance (across 20 threads), and it's like 1.5-2.5TB/hr write, but read performance is fecking hopeless. I couldn't ever get higher than 343GB/hr maximum. So let's assume that the 343GB/hr is a theoretical maximum for read performance on the NAS, then we should in theory be able to get that performance out of a) CommVault, and b) any other backup agent. Not the case. Commvault seems to only ever give me 200-250GB/hr throughput, and out of experimentation, I installed Bacula to see what the state of play there is. If, for example, Bacula gave consistently better performance and speeds than Commvault, then we'd be able to say "**$.$ Refunds Plz $.$**" #-#-# Alas, I found a different problem with Bacula. Commvault seems pretty happy to read from one part of the mountpoint with one thread, and stream that to a Tape device, whilst reading from some other directory with the other thread, and writing to the 2nd drive in the autochanger. I can't for the life of me get Bacula to mount and write to two tape drives simultaneously. Things I've tried: Setting Maximum Concurrent Jobs = 20 in the Director, File and Storage Daemons Setting Prefer Mounted Volumes = no in the Job Definition Setting multiple devices in the Autochanger resource. Documentation seems to be very single-drive centric, and we feel a little like we've strapped a rocket to a hamster, with this one. The majority of example Bacula configurations are for DDS4 drives, manual tape swapping, and FreeBSD or IRIX systems. I should probably add that I'm not too bothered if this isn't possible, but I'd be surprised. I basically want to use Bacula as proof to stick it to the software vendors that they're overpriced ;) I read somewhere that @KyleBrandt has done something similar with a modern Tape solution.. Configuration Files: *bacula-dir.conf* # # Default Bacula Director Configuration file Director { # define myself Name = backuphost-1-dir DIRport = 9101 # where we listen for UA connections QueryFile = "/etc/bacula/scripts/query.sql" WorkingDirectory = "/var/lib/bacula" PidDirectory = "/var/run/bacula" Maximum Concurrent Jobs = 20 Password = "yourekiddingright" # Console password Messages = Daemon DirAddress = 0.0.0.0 #DirAddress = 127.0.0.1 } JobDefs { Name = "DefaultFileJob" Type = Backup Level = Incremental Client = backuphost-1-fd FileSet = "Full Set" Schedule = "WeeklyCycle" Storage = File Messages = Standard Pool = File Priority = 10 Write Bootstrap = "/var/lib/bacula/%c.bsr" } JobDefs { Name = "DefaultTapeJob" Type = Backup Level = Incremental Client = backuphost-1-fd FileSet = "Full Set" Schedule = "WeeklyCycle" Storage = "SpectraLogic" Messages = Standard Pool = AllTapes Priority = 10 Write Bootstrap = "/var/lib/bacula/%c.bsr" Prefer Mounted Volumes = no } # # Define the main nightly save backup job # By default, this job will back up to disk in /nonexistant/path/to/file/archive/dir Job { Name = "BackupClient1" JobDefs = "DefaultFileJob" } Job { Name = "BackupThisVolume" JobDefs = "DefaultTapeJob" FileSet = "SpecialVolume" } #Job { # Name = "BackupClient2" # Client = backuphost-12-fd # JobDefs = "DefaultJob" #} # Backup the catalog database (after the nightly save) Job { Name = "BackupCatalog" JobDefs = "DefaultFileJob" Level = Full FileSet="Catalog" Schedule = "WeeklyCycleAfterBackup" # This creates an ASCII copy of the catalog # Arguments to make_catalog_backup.pl are: # make_catalog_backup.pl <catalog-name> RunBeforeJob = "/etc/bacula/scripts/make_catalog_backup.pl MyCatalog" # This deletes the copy of the catalog RunAfterJob = "/etc/bacula/scripts/delete_catalog_backup" Write Bootstrap = "/var/lib/bacula/%n.bsr" Priority = 11 # run after main backup } # # Standard Restore template, to be changed by Console program # Only one such job is needed for all Jobs/Clients/Storage ... # Job { Name = "RestoreFiles" Type = Restore Client=backuphost-1-fd FileSet="Full Set" Storage = File Pool = Default Messages = Standard Where = /srv/bacula/restore } FileSet { Name = "SpecialVolume" Include { Options { signature = MD5 } File = /mnt/SpecialVolume } Exclude { File = /var/lib/bacula File = /nonexistant/path/to/file/archive/dir File = /proc File = /tmp File = /.journal File = /.fsck } } # List of files to be backed up FileSet { Name = "Full Set" Include { Options { signature = MD5 } File = /usr/sbin } Exclude { File = /var/lib/bacula File = /nonexistant/path/to/file/archive/dir File = /proc File = /tmp File = /.journal File = /.fsck } } Schedule { Name = "WeeklyCycle" Run = Full 1st sun at 23:05 Run = Differential 2nd-5th sun at 23:05 Run = Incremental mon-sat at 23:05 } # This schedule does the catalog. It starts after the WeeklyCycle Schedule { Name = "WeeklyCycleAfterBackup" Run = Full sun-sat at 23:10 } # This is the backup of the catalog FileSet { Name = "Catalog" Include { Options { signature = MD5 } File = "/var/lib/bacula/bacula.sql" } } # Client (File Services) to backup Client { Name = backuphost-1-fd Address = localhost FDPort = 9102 Catalog = MyCatalog Password = "surelyyourejoking" # password for FileDaemon File Retention = 30 days # 30 days Job Retention = 6 months # six months AutoPrune = yes # Prune expired Jobs/Files } # # Second Client (File Services) to backup # You should change Name, Address, and Password before using # #Client { # Name = backuphost-12-fd # Address = localhost2 # FDPort = 9102 # Catalog = MyCatalog # Password = "i'mnotjokinganddontcallmeshirley" # password for FileDaemon 2 # File Retention = 30 days # 30 days # Job Retention = 6 months # six months # AutoPrune = yes # Prune expired Jobs/Files #} # Definition of file storage device Storage { Name = File # Do not use "localhost" here Address = localhost # N.B. Use a fully qualified name here SDPort = 9103 Password = "lalalalala" Device = FileStorage Media Type = File } Storage { Name = "SpectraLogic" Address = localhost SDPort = 9103 Password = "linkedinmakethebestpasswords" Device = Drive-1 Device = Drive-2 Media Type = LTO5 Autochanger = yes } # Generic catalog service Catalog { Name = MyCatalog # Uncomment the following line if you want the dbi driver # dbdriver = "dbi:sqlite3"; dbaddress = 127.0.0.1; dbport = dbname = "bacula"; DB Address = ""; dbuser = "bacula"; dbpassword = "bbmaster63" } # Reasonable message delivery -- send most everything to email address # and to the console Messages { Name = Standard mailcommand = "/usr/lib/bacula/bsmtp -h localhost -f \"\(Bacula\) \<%r\>\" -s \"Bacula: %t %e of %c %l\" %r" operatorcommand = "/usr/lib/bacula/bsmtp -h localhost -f \"\(Bacula\) \<%r\>\" -s \"Bacula: Intervention needed for %j\" %r" mail = root@localhost = all, !skipped operator = root@localhost = mount console = all, !skipped, !saved # # WARNING! the following will create a file that you must cycle from # time to time as it will grow indefinitely. However, it will # also keep all your messages if they scroll off the console. # append = "/var/lib/bacula/log" = all, !skipped catalog = all } # # Message delivery for daemon messages (no job). Messages { Name = Daemon mailcommand = "/usr/lib/bacula/bsmtp -h localhost -f \"\(Bacula\) \<%r\>\" -s \"Bacula daemon message\" %r" mail = root@localhost = all, !skipped console = all, !skipped, !saved append = "/var/lib/bacula/log" = all, !skipped } # Default pool definition Pool { Name = Default Pool Type = Backup Recycle = yes # Bacula can automatically recycle Volumes AutoPrune = yes # Prune expired volumes Volume Retention = 365 days # one year } # File Pool definition Pool { Name = File Pool Type = Backup Recycle = yes # Bacula can automatically recycle Volumes AutoPrune = yes # Prune expired volumes Volume Retention = 365 days # one year Maximum Volume Bytes = 50G # Limit Volume size to something reasonable Maximum Volumes = 100 # Limit number of Volumes in Pool } Pool { Name = AllTapes Pool Type = Backup Recycle = yes AutoPrune = yes # Prune expired volumes Volume Retention = 31 days # one Moth } # Scratch pool definition Pool { Name = Scratch Pool Type = Backup } # # Restricted console used by tray-monitor to get the status of the director # Console { Name = backuphost-1-mon Password = "LastFMalsostorePasswordsLikeThis" CommandACL = status, .status } bacula-sd.conf # # Default Bacula Storage Daemon Configuration file # Storage { # definition of myself Name = backuphost-1-sd SDPort = 9103 # Director's port WorkingDirectory = "/var/lib/bacula" Pid Directory = "/var/run/bacula" Maximum Concurrent Jobs = 20 SDAddress = 0.0.0.0 # SDAddress = 127.0.0.1 } # # List Directors who are permitted to contact Storage daemon # Director { Name = backuphost-1-dir Password = "passwordslinplaintext" } # # Restricted Director, used by tray-monitor to get the # status of the storage daemon # Director { Name = backuphost-1-mon Password = "totalinsecurityabound" Monitor = yes } Device { Name = FileStorage Media Type = File Archive Device = /srv/bacula/archive LabelMedia = yes; # lets Bacula label unlabeled media Random Access = Yes; AutomaticMount = yes; # when device opened, read it RemovableMedia = no; AlwaysOpen = no; } Autochanger { Name = SpectraLogic Device = Drive-1 Device = Drive-2 Changer Command = "/etc/bacula/scripts/mtx-changer %c %o %S %a %d" Changer Device = /dev/sg4 } Device { Name = Drive-1 Drive Index = 0 Archive Device = /dev/nst0 Changer Device = /dev/sg4 Media Type = LTO5 AutoChanger = yes RemovableMedia = yes; AutomaticMount = yes; AlwaysOpen = yes; RandomAccess = no; LabelMedia = yes } Device { Name = Drive-2 Drive Index = 1 Archive Device = /dev/nst1 Changer Device = /dev/sg4 Media Type = LTO5 AutoChanger = yes RemovableMedia = yes; AutomaticMount = yes; AlwaysOpen = yes; RandomAccess = no; LabelMedia = yes } # # Send all messages to the Director, # mount messages also are sent to the email address # Messages { Name = Standard director = backuphost-1-dir = all } bacula-fd.conf # # Default Bacula File Daemon Configuration file # # # List Directors who are permitted to contact this File daemon # Director { Name = backuphost-1-dir Password = "hahahahahaha" } # # Restricted Director, used by tray-monitor to get the # status of the file daemon # Director { Name = backuphost-1-mon Password = "hohohohohho" Monitor = yes } # # "Global" File daemon configuration specifications # FileDaemon { # this is me Name = backuphost-1-fd FDport = 9102 # where we listen for the director WorkingDirectory = /var/lib/bacula Pid Directory = /var/run/bacula Maximum Concurrent Jobs = 20 #FDAddress = 127.0.0.1 FDAddress = 0.0.0.0 } # Send all messages except skipped files back to Director Messages { Name = Standard director = backuphost-1-dir = all, !skipped, !restored }

    Read the article

  • Slow NFS and GFS2 performance

    - by Tiago
    Recently I've designed and configured a 4 node cluster for a webapp that does lots of file handling. The cluster have been broken down into 2 main roles, webserver and storage. Each role is replicated to a second server using drbd in active/passive mode. The webserver does a NFS mount of the data directory of the storage server and the latter also has a webserver running to serve files to browser clients. In the storage servers I've created a GFS2 FS to hold the data which is wired to drbd. I've chose GFS2 mainly because the announced performance and also because the volume size which has to be pretty high. Since we entered production I've been facing two problems that I think are deeply connected. First of all, the NFS mount on the webservers keeps hanging for a minute or so and then resumes normal operations. By analyzing the logs I've found out that NFS stops answering for a while and outputs the following log lines: Oct 15 18:15:42 <server hostname> kernel: nfs: server active.storage.vlan not responding, still trying Oct 15 18:15:44 <server hostname> kernel: nfs: server active.storage.vlan not responding, still trying Oct 15 18:15:46 <server hostname> kernel: nfs: server active.storage.vlan not responding, still trying Oct 15 18:15:47 <server hostname> kernel: nfs: server active.storage.vlan not responding, still trying Oct 15 18:15:47 <server hostname> kernel: nfs: server active.storage.vlan not responding, still trying Oct 15 18:15:47 <server hostname> kernel: nfs: server active.storage.vlan not responding, still trying Oct 15 18:15:48 <server hostname> kernel: nfs: server active.storage.vlan not responding, still trying Oct 15 18:15:48 <server hostname> kernel: nfs: server active.storage.vlan not responding, still trying Oct 15 18:15:51 <server hostname> kernel: nfs: server active.storage.vlan not responding, still trying Oct 15 18:15:52 <server hostname> kernel: nfs: server active.storage.vlan not responding, still trying Oct 15 18:15:52 <server hostname> kernel: nfs: server active.storage.vlan not responding, still trying Oct 15 18:15:55 <server hostname> kernel: nfs: server active.storage.vlan not responding, still trying Oct 15 18:15:55 <server hostname> kernel: nfs: server active.storage.vlan not responding, still trying Oct 15 18:15:58 <server hostname> kernel: nfs: server active.storage.vlan OK Oct 15 18:15:59 <server hostname> kernel: nfs: server active.storage.vlan OK Oct 15 18:15:59 <server hostname> kernel: nfs: server active.storage.vlan OK Oct 15 18:15:59 <server hostname> kernel: nfs: server active.storage.vlan OK Oct 15 18:15:59 <server hostname> kernel: nfs: server active.storage.vlan OK Oct 15 18:15:59 <server hostname> kernel: nfs: server active.storage.vlan OK Oct 15 18:15:59 <server hostname> kernel: nfs: server active.storage.vlan OK Oct 15 18:15:59 <server hostname> kernel: nfs: server active.storage.vlan OK Oct 15 18:15:59 <server hostname> kernel: nfs: server active.storage.vlan OK Oct 15 18:15:59 <server hostname> kernel: nfs: server active.storage.vlan OK Oct 15 18:15:59 <server hostname> kernel: nfs: server active.storage.vlan OK Oct 15 18:15:59 <server hostname> kernel: nfs: server active.storage.vlan OK Oct 15 18:15:59 <server hostname> kernel: nfs: server active.storage.vlan OK In this case, the hang lasted for 16 seconds but sometimes it takes 1 or 2 minutes to resume normal operations. My first guess was this was happening due to heavy load of the NFS mount and that by increasing RPCNFSDCOUNT to a higher value, this would become stable. I've increased it several times and apparently, after a while, the logs started appearing less times. The value is now on 32. After further investigating the issue, I've came across a different hang, despite the NFS messages still appear in the logs. Sometimes, the GFS2 FS simply hangs which causes both the NFS and the storage webserver to serve files. Both stay hang for a while and then they resume normal operations. This hangs leaves no trace on client side (also leaves no NFS ... not responding messages) and, on the storage side, the log system appears to be empty, even though the rsyslogd is running. The nodes connect themselves through a 10Gbps non-dedicated connection but I don't think this is an issue because the GFS2 hang is confirmed but connecting directly to the active storage server. I've been trying to solve this for a while now and I've tried different NFS configuration options, before I've found out the GFS2 FS is also hanging. The NFS mount is exported as such: /srv/data/ <ip_address>(rw,async,no_root_squash,no_all_squash,fsid=25) And the NFS client mounts with: mount -o "async,hard,intr,wsize=8192,rsize=8192" active.storage.vlan:/srv/data /srv/data After some tests, these were the configurations that yielded more performance to the cluster. I am desperate to find a solution for this as the cluster is already in production mode and I need to fix this so that this hangs won't happen in the future and I don't really know for sure what and how I should be benchmarking. What I can tell is that this is happening due to heavy loads as I have tested the cluster earlier and this problems weren't happening at all. Please tell me if you need me to provide configuration details of the cluster, and which do you want me to post. As last resort I can migrate the files to a different FS but I need some solid pointers on whether this will solve this problems as the volume size is extremely large at this point. The servers are being hosted by a third-party enterprise and I don't have physical access to them. Best regards. EDIT 1: The servers are physical servers and their specs are: Webservers: Intel Bi Xeon E5606 2x4 2.13GHz 24GB DDR3 Intel SSD 320 2 x 120GB Raid 1 Storage: Intel i5 3550 3.3GHz 16GB DDR3 12 x 2TB SATA Initially there was a VRack setup between the servers but we've upgraded one of the storage servers to have more RAM and it wasn't inside the VRack. They connect through a shared 10Gbps connection between them. Please note that it is the same connection that is used for public access. They use a single IP (using IP Failover) to connect between them and to allow for a graceful failover. NFS is therefore over a public connection and not under any private network (it was before the upgrade, were the problem still existed). The firewall was configured and tested thoroughly but I disabled it for a while to see if the problem still occurred, and it did. From my knowledge the hosting provider isn't blocking or limiting the connection between either the servers and the public domain (at least under a given bandwidth consumption threshold that hasn't been reached yet). Hope this helps figuring out the problem. EDIT 2: Relevant software versions: CentOS 2.6.32-279.9.1.el6.x86_64 nfs-utils-1.2.3-26.el6.x86_64 nfs-utils-lib-1.1.5-4.el6.x86_64 gfs2-utils-3.0.12.1-32.el6_3.1.x86_64 kmod-drbd84-8.4.2-1.el6_3.elrepo.x86_64 drbd84-utils-8.4.2-1.el6.elrepo.x86_64 DRBD configuration on storage servers: #/etc/drbd.d/storage.res resource storage { protocol C; on <server1 fqdn> { device /dev/drbd0; disk /dev/vg_storage/LV_replicated; address <server1 ip>:7788; meta-disk internal; } on <server2 fqdn> { device /dev/drbd0; disk /dev/vg_storage/LV_replicated; address <server2 ip>:7788; meta-disk internal; } } NFS Configuration in storage servers: #/etc/sysconfig/nfs RPCNFSDCOUNT=32 STATD_PORT=10002 STATD_OUTGOING_PORT=10003 MOUNTD_PORT=10004 RQUOTAD_PORT=10005 LOCKD_UDPPORT=30001 LOCKD_TCPPORT=30001 (can there be any conflict in using the same port for both LOCKD_UDPPORT and LOCKD_TCPPORT?) GFS2 configuration: # gfs2_tool gettune <mountpoint> incore_log_blocks = 1024 log_flush_secs = 60 quota_warn_period = 10 quota_quantum = 60 max_readahead = 262144 complain_secs = 10 statfs_slow = 0 quota_simul_sync = 64 statfs_quantum = 30 quota_scale = 1.0000 (1, 1) new_files_jdata = 0 Storage network environment: eth0 Link encap:Ethernet HWaddr <mac address> inet addr:<ip address> Bcast:<bcast address> Mask:<ip mask> inet6 addr: <ip address> Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:957025127 errors:0 dropped:0 overruns:0 frame:0 TX packets:1473338731 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:2630984979622 (2.3 TiB) TX bytes:1648430431523 (1.4 TiB) eth0:0 Link encap:Ethernet HWaddr <mac address> inet addr:<ip failover address> Bcast:<bcast address> Mask:<ip mask> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 The IP addresses are statically assigned with the given network configurations: DEVICE="eth0" BOOTPROTO="static" HWADDR=<mac address> ONBOOT="yes" TYPE="Ethernet" IPADDR=<ip address> NETMASK=<net mask> and DEVICE="eth0:0" BOOTPROTO="static" HWADDR=<mac address> IPADDR=<ip failover> NETMASK=<net mask> ONBOOT="yes" BROADCAST=<bcast address> Hosts file to allow for a graceful NFS failover in conjunction with NFS option fsid=25 set on both storage servers: #/etc/hosts <storage ip failover address> active.storage.vlan <webserver ip failover address> active.service.vlan As you can see, packet errors are down to 0. I've also ran ping for a long time without any packet loss. MTU size is the normal 1500. As there is no VLan by now, this is the MTU used to communicate between servers. The webservers' network environment is similar. One thing I forgot to mention is that the storage servers handle ~200GB of new files each day through the NFS connection, which is a key point for me to think this is some kind of heavy load problem with either NFS or GFS2. If you need further configuration details please tell me. EDIT 3: Earlier today we had a major filesystem crash on the storage server. I couldn't get the details of the crash right away because the server stop responding. After the reboot, I noticed the filesystem was extremely slow, and I was not being able to serve a single file through either NFS or httpd, perhaps due to cache warming or so. Nevertheless, I've been monitoring the server closely and the following error came up in dmesg. The source of the problem is clearly GFS, which is waiting for a lock and ends up starving after a while. INFO: task nfsd:3029 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. nfsd D 0000000000000000 0 3029 2 0x00000080 ffff8803814f79e0 0000000000000046 0000000000000000 ffffffff8109213f ffff880434c5e148 ffff880624508d88 ffff8803814f7960 ffffffffa037253f ffff8803815c1098 ffff8803814f7fd8 000000000000fb88 ffff8803815c1098 Call Trace: [<ffffffff8109213f>] ? wake_up_bit+0x2f/0x40 [<ffffffffa037253f>] ? gfs2_holder_wake+0x1f/0x30 [gfs2] [<ffffffff814ff42e>] __mutex_lock_slowpath+0x13e/0x180 [<ffffffff814ff2cb>] mutex_lock+0x2b/0x50 [<ffffffffa0379f21>] gfs2_log_reserve+0x51/0x190 [gfs2] [<ffffffffa0390da2>] gfs2_trans_begin+0x112/0x1d0 [gfs2] [<ffffffffa0369b05>] ? gfs2_dir_check+0x35/0xe0 [gfs2] [<ffffffffa0377943>] gfs2_createi+0x1a3/0xaa0 [gfs2] [<ffffffff8121aab1>] ? avc_has_perm+0x71/0x90 [<ffffffffa0383d1e>] gfs2_create+0x7e/0x1a0 [gfs2] [<ffffffffa037783f>] ? gfs2_createi+0x9f/0xaa0 [gfs2] [<ffffffff81188cf4>] vfs_create+0xb4/0xe0 [<ffffffffa04217d6>] nfsd_create_v3+0x366/0x4c0 [nfsd] [<ffffffffa0429703>] nfsd3_proc_create+0x123/0x1b0 [nfsd] [<ffffffffa041a43e>] nfsd_dispatch+0xfe/0x240 [nfsd] [<ffffffffa025a5d4>] svc_process_common+0x344/0x640 [sunrpc] [<ffffffff810602a0>] ? default_wake_function+0x0/0x20 [<ffffffffa025ac10>] svc_process+0x110/0x160 [sunrpc] [<ffffffffa041ab62>] nfsd+0xc2/0x160 [nfsd] [<ffffffffa041aaa0>] ? nfsd+0x0/0x160 [nfsd] [<ffffffff81091de6>] kthread+0x96/0xa0 [<ffffffff8100c14a>] child_rip+0xa/0x20 [<ffffffff81091d50>] ? kthread+0x0/0xa0 [<ffffffff8100c140>] ? child_rip+0x0/0x20

    Read the article

< Previous Page | 1 2 3 4 5