Lustre - issues with simple setup
- by ethrbunny
Issue: I'm trying to assess the (possible) use of Lustre for our group. To this end I've been trying to create a simple system to explore the nuances. I can't seem to get beyond the single-node 'llmount.sh' test to a working multi-node setup.
What I've done: Each system (throwaway PCs with a 70 GB HD and 2 GB RAM) gets a fresh install of CentOS 6.2. I then update everything, install the Lustre kernel from downloads.whamcloud.com, and add the matching lustre and e2fsprogs RPMs. Systems are rebooted and tested with 'llmount.sh' (and then cleaned up with 'llmountcleanup.sh'). All is well to this point.
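(For what it's worth, my assumption is that a basic per-node install check looks something like this; 'lctl list_nids' should print the node's NID once the LNet module is up:)
lsmod | grep lustre    # confirm the Lustre/LNet kernel modules are loaded
lctl list_nids         # print this node's NID, e.g. 10.127.24.42@tcp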
First I create an MDS/MDT system via:
/usr/sbin/mkfs.lustre --mgs --mdt --fsname=lustre --device-size=200000 --param sys.timeout=20 --mountfsoptions=errors=remount-ro,user_xattr,acl --param lov.stripesize=1048576 --param lov.stripecount=0 --param mdt.identity_upcall=/usr/sbin/l_getidentity --backfstype ldiskfs --reformat /tmp/lustre-mdt1
and then
mkdir -p /mnt/mds1
mount -t lustre -o loop,user_xattr,acl /tmp/lustre-mdt1 /mnt/mds1
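As a sanity check on the MDS at this point, I assume something like the following should show the MGS/MDT devices up and the target mounted:
lctl dl            # MGS, MGC and MDT devices should all show as UP
df -h /mnt/mds1    # confirm the loop-backed MDT is mounted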
Next I take 3 systems and create a roughly 200 MB loop-backed OST on each ('--device-size' is in KB) via:
/usr/sbin/mkfs.lustre --ost --fsname=lustre --device-size=200000 --param sys.timeout=20 --mgsnode=lustre_MDS0@tcp --backfstype ldiskfs --reformat /tmp/lustre-ost1
mkdir -p /mnt/ost1
mount -t lustre -o loop /tmp/lustre-ost1 /mnt/ost1
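Once each OST mounts, my understanding is that registration can be confirmed like this (the target_obd parameter name is my assumption for this Lustre 2.x release):
lctl dl    # on each OSS: the OST device should show as UP
lctl get_param lov.lustre-MDT0000-mdtlov.target_obd    # on the MDS: list the OSTs the MDT knows about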
The logs on the MDT box show the OSS boxes connecting up. All appears ok.
Lastly I create a client and attach it to the MDT box:
mkdir -p /mnt/lustre
mount -t lustre -o user_xattr,acl,flock lustre_MDS0@tcp:/lustre /mnt/lustre
Again, the log on the MDT box shows the client connection. Appears to be successful.
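One check I assume is worth doing here is raw LNet connectivity from the client to every server, not just the MDS (the OSS NIDs below are placeholders for my actual addresses):
lctl ping 10.127.24.42@tcp    # the MGS/MDS node
lctl ping <oss_NID>           # repeat for each of the 3 OSS nodes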
Here's where the issues (appear to) start. If I run 'df -h' on the client, it hangs after listing the local system drives. If I try to create files on the Lustre mount (via 'dd'), the session hangs and the job can't be killed; rebooting the client is the only way to recover.
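Before rebooting, I assume 'lfs check servers' would report which connection is stuck without hanging the way 'df' does:
lfs check servers    # prints active/inactive for the MDC and each OSC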
If I run 'lctl dl' from the client, it shows that only 2 of the 3 OST boxes are found and 'UP':
[root@lfsclient0 etc]# lctl dl
0 UP mgc MGC10.127.24.42@tcp 282d249f-fcb2-b90f-8c4e-2f1415485410 5
1 UP lov lustre-clilov-ffff880037e4d400 00fc176e-3156-0490-44e1-da911be9f9df 4
2 UP lmv lustre-clilmv-ffff880037e4d400 00fc176e-3156-0490-44e1-da911be9f9df 4
3 UP mdc lustre-MDT0000-mdc-ffff880037e4d400 00fc176e-3156-0490-44e1-da911be9f9df 5
4 UP osc lustre-OST0000-osc-ffff880037e4d400 00fc176e-3156-0490-44e1-da911be9f9df 5
5 UP osc lustre-OST0003-osc-ffff880037e4d400 00fc176e-3156-0490-44e1-da911be9f9df 5
Running 'lfs df' from the client shows:
[root@lfsclient0 etc]# lfs df
UUID 1K-blocks Used Available Use% Mounted on
lustre-MDT0000_UUID 149944 16900 123044 12% /mnt/lustre[MDT:0]
OST0000 : inactive device
OST0001 : Resource temporarily unavailable
OST0002 : Resource temporarily unavailable
lustre-OST0003_UUID 187464 24764 152636 14% /mnt/lustre[OST:3]
filesystem summary: 187464 24764 152636 14% /mnt/lustre
Given that each OSS box exports a ~200 MB (loop) OST, I would expect the filesystem summary to reflect the combined capacity of all three, not just one.
There are no errors on the MDS/MDT box to indicate that multiple OSS/OST boxes have been lost.
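For the inactive OST, my assumption is that the OSC could be poked back to life from the client along these lines (device 4 is the OST0000 osc in the 'lctl dl' listing above):
lctl --device 4 activate
lctl set_param osc.lustre-OST0000-osc-*.active=1    # alternative per-target form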
EDIT: each system has all the other systems defined in /etc/hosts, along with iptables entries to allow access between them.
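(The iptables entries are along these lines; Lustre's LNet uses TCP port 988 by default, and the subnet here is just an example:)
iptables -I INPUT -p tcp --dport 988 -s 10.127.24.0/24 -j ACCEPT    # allow LNet traffic from cluster nodes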
SO: I'm clearly making several mistakes. Any pointers as to where to start correcting them?