sge - Developer IT

SGE: invoking qmake raises "critical error: can't resolve group"

- by Pierre

I'm new to SGE an I'm trying to run qmake with the simple following Makefile with our very new cluster: merge.txt: job1.txt job2.txt job3.txt ... cat $^ > $@ job1.txt: sleep 1 echo "Hello From " $@ > $@ sleep 1 job2.txt: sleep 2 echo "Hello From " $@ > $@ sleep 2 job3.txt: (...) the following command raises an error: qmake -l arch=lx24-amd64 -cwd -v PATH -- -j 4 sleep 1 dynamic mode sleep 2 dynamic mode sleep 3 dynamic mode sleep 4 dynamic mode critical error: can't resolve group qmake: *** [job3.txt] Error 1 critical error: can't resolve group qmake: *** [job2.txt] Error 1 critical error: can't resolve group qmake: *** [job1.txt] Error 1 critical error: can't resolve group what's wrong ?

Read the article

Sun Grid Engine (SGE) / limiting simultaneous array job sub-tasks

- by wfaulk

I am installing a Sun Grid Engine environment and I have a scheduler limit that I can't quite figure out how to implement. My users will create array jobs that have hundreds of sub-tasks. I would like to be able to limit those jobs to only running a set number of tasks at the same time, independent of other jobs. Like I might have one array job that I want to run 20 tasks at a time, and another I want to run 50 tasks at a time, and yet another that I'm fine running without limit. It seems like this ought to be doable, but I can't figure it out. There's a max_aj_instances configuration option, but that appears to apply globally to all array jobs. I can't see any way to use consumable resources, as I'd need a "complex attribute" that is per-job, and that feature doesn't seem to exist. It didn't look like resource quotas would work, but now I'm not so sure of that. It says "A resource quota set defines a maximum resource quota for a particular job request", but it's unclear if an array job's sub-tasks' resource requests will be aggregated for the purposes of the resource quota. I'm going to play with this, but hopefully someone already knows outright.

Read the article

querying sge using job_name

- by user39212

Hello, qstat manual says -j [job_list] Prints either for all pending jobs or the jobs con- tained in job_list various information. The job_list can contain job_ids, job_names, or wildcard expression sge_types(1). i.e. qstat -j job_name should work. But whenver I try i get an error saying qstat -j test_20100203_01 ERROR: job-ID should be an integer Any idea if this should have worked. I can also use: qstat | grep test_20100203_01 but the problem is that qstat lists the shortened job name (something like test_2010020 and not test_20100203_01). Please advise.

Read the article

SGE: downtime planning

- by mousee

I need to plan a downtime for a maintenance of my environment (or some part of my environment) by means of Sun Grid Engine. Is it possible to somehow use backfilling information to tell the grid engine to plan only those jobs on cluster which are able to finish (i have backfilling information) till let's say 10 am next day? Can I then at 10 am rely on the fact that all compute nodes are clean, jobs are only queued, no job is planned and so that I can start maintenance? Thank you for your time. mousee

Read the article

Avoid generating empty STDOUT and STDERR files with Sun Grid Engine (SGE) and array jobs

- by vy32

I am running array jobs with Sun Grid Engine (SGE). My carefully scripted array job workers generate no stdout and no stderr when they function properly. Unfortunately, SGE insists on creating an empty stdout and stderr file for each run. Sun's manual states: STDOUT and STDERR of array job tasks will be written into dif- ferent files with the default location .['e'|'o']'.' In order to change this default, the -e and -o options (see above) can be used together with the pseudo-environment-vari- ables $HOME, $USER, $JOB_ID, $JOB_NAME, $HOSTNAME, and $SGE_TASK_ID. Note, that you can use the output redirection to divert the out- put of all tasks into the same file, but the result of this is undefined. I would like to have the output files suppressed if they are empty. Is there any way to do this?

Read the article

How to specify the equivalent of ppn (on PBS) in SGE queuing system ?

- by debugger

Hello All, Is there a way to specify the ppn ( or equivalent ) in SGE ? i don't want to use all cpus in one node so i will be able to have more memory per core. ( In PBS you would do -l nodes=16:ppn=2 for exemple) Thanks.

Read the article

How to specify the equivalent of ppn (on PBS) in SGE queuing system ?

- by debugger

Hello All, Is there a way to specify the ppn ( or equivalent ) in SGE ? i don't want to use all cpus in one node so i will be able to have more memory per core. ( In PBS you would do -l nodes=16:ppn=2 for exemple) Thanks.

Read the article

SGE - limit a user to a certain host, using resource quota configuration

- by pufferfish

Is it possible to limit a user to a particular host, using the Resource Quota Configuration option in qmon for Sun Grid Engine? I'm thinking of a line to the effect of: { ... limit users {john} to hostname=compute-1-1.local } The documentation mentions built in resource types: slots, arch, mem_total, num_proc, swap_total, and the ability to make custom types. Details: SGE 6.1u5 on Rocks update: The above rule seems to be valid, since using an unknown hostname mangling the resource name 'hostname' both cause errors

Read the article

alternatives to SGE

- by bmargulies

Once upon at time there was the free SGE from Sun. tricky to install and configure, but functional and free. Now we've got: open source packages on Ubuntu that don't quite work out of the box (details on request). the actual source behind them, with a build process that depends on the c-shell and other obsolescences, available from two competing locations. a commercial packaging from Oracle a commercial package from Univa What I am really wishing for is something with the basic capabilities of this that is simple to install and maintain. Heck, I'd take a front-end to hadoop that just queues and distributes simple shell-script-defined jobs.

Read the article

low performance on HPC cluster (sge) when running multiple jobs

- by Yotam

O know this is a long-shot but I'm clueless here. I'm running several computer simulations on High Performance Computation cluster (HPC) of oracale grid engine (sge). A single job runs at a certain speed (roughly 80 steps per second) when I add jobs to the machine, at a certain treshhold, the speed is recuded by two. On one machine (I don't know the cpu kind) the treshold is 11 jobs for 16 cpu's. On another one with the same number and kind of cpu's , the treshold is 8. I thought at first that this is a memory issue but each job takes about 60MB - 100MB and I have 16GB of ram on each of those machine. Did any of you encountered such a problem? is there any way to analyz this? Thanks.

Read the article

Sun Grid Engine (SGE) Jobs Not Visible After Adding virtual_free

- by Gary Richardson

I'm trying to to use virtual_free to limit the number of large memory jobs running each grid node in my cluster. This seems to be working as expected. After I modified my code to submit jobs with the memory instances, qstat -f -q $queueName no longer shows a list of jobs waiting for a slot. The jobs are submitted with a specific queue (-q $queueName). I'm guessing this is happening due to the magic of SGE queue selection. Is there a way to make my jobs show up as before? Thanks! UPDATE I'm using: qstat -f -u * -q $queueName to view the queue. If I drop the queue argument, I can see the jobs. If I examine a specific job, I can see that it has the correct hard_queue_list value set. I'm also using Sun Grid Engine 6.1u4

Read the article

using qsub (sge) with multi-threaded applications

- by dan12345

i wanted to submit a multi-threaded job to the cluster network i'm working with - but the man page about qsub is not clear how this is done - By default i guess it just sends it as a normal job regardless of the multi-threading - but this might cause problems, i.e. sending many multi-threaded jobs to the same computer, slowing things down. Does anyone know how to accomplish this? thanks. The batch server system is sge.

Read the article

Sun Grid Engine : jobs are not well balanced

- by GlinesMome

I use Open Grid Scheduler (a fork/copy of Sun Grid Engine). I have tried this configuration from master: # qconf -mattr exechost complex_values slots=8 slave2 # qconf -mq all.q | grep slots slots 100,[slave1=1],[slave2=8] slave1 is down, then I run 10 qsub with a sleep example (so no CPU consumption) but only 4 jobs are run at the same time on slave2 instead of I have put 8 slots. What does I missed ? PS: my goal is to provide infinite slots to force SGE to schedule only via consummable ressources.

Read the article

Is there a way to tell SGE to run specific jobs as root on the execution node?

- by Rick Reynolds

The title kinda says it all... We're using SGE/OGE to submit jobs to a set of worker nodes that then do things with specific pieces of equipment. The programs and scripts that have been created that manipulate this equipment rely on running as root. I'd like SGE to handle allocation of resources in a way that is mindful of users, groups, projects, etc., but I also need the actual jobs to run with root permissions. I've read up on How can one run a prologue script as root in gridengine? to see if anything there was pertinent, but it seems that SGE is providing the "user@" kind of spec specifically for prolog and epilog kinds of actions. Is there any similar functionality for the job itself? I'm aware of su/sudo approaches, but that won't really work in this environment because the sudoers file isn't globally managed (i.e. I'd have to add a whole set of users to /etc/sudoers on lots of machines). I'm currently looking into a setuid kind of solution, but that would definitely be an unnecessary kind of work-around if SGE provides me a way to declare that a specific job (or jobs in a specific queue) always needs to run with a specific user's rights.

Read the article

Parallel Environment (PE) on Sun Grid Engine (6.2u5) won't run jobs: "only offers 0 slots"

- by Peter Van Heusden

I have Sun Grid Engine set up (version 6.2u5) on a Ubuntu 10.10 server with 8 cores. In order to be able to reserve multiple slots, I have a parallel environment (PE) set up like this: pe_name serial slots 999 user_lists NONE xuser_lists NONE start_proc_args /bin/true stop_proc_args /bin/true allocation_rule $pe_slots control_slaves FALSE job_is_first_task TRUE urgency_slots min accounting_summary FALSE This is associated with the all.q on the server in question (let's call the server A). However, when I submit a job that uses 4 threads with e.g. qsub -q all.q@A -pe serial 4 mycmd.sh, it never gets scheduled, and I get the following reasoning from qstat: cannot run in PE "serial" because it only offers 0 slots Why is SGE saying "serial" only offers 0 slots, since there are 8 slots available on the server I specified (server A)? The queue in question is configured thus (server names changed): qname all.q hostlist @allhosts seq_no 0 load_thresholds np_load_avg=1.75 suspend_thresholds NONE nsuspend 1 suspend_interval 00:05:00 priority 0 min_cpu_interval 00:05:00 processors UNDEFINED qtype BATCH INTERACTIVE ckpt_list NONE pe_list make orte serial rerun FALSE slots 1,[D=32],[C=8], \ [B=30],[A=8] tmpdir /tmp shell /bin/sh prolog NONE epilog NONE shell_start_mode posix_compliant starter_method NONE suspend_method NONE resume_method NONE terminate_method NONE notify 00:00:60 owner_list NONE user_lists NONE xuser_lists NONE subordinate_list NONE complex_values NONE projects NONE xprojects NONE calendar NONE initial_state default s_rt INFINITY h_rt 08:00:00 s_cpu INFINITY h_cpu INFINITY s_fsize INFINITY h_fsize INFINITY s_data INFINITY h_data INFINITY s_stack INFINITY h_stack INFINITY s_core INFINITY h_core INFINITY s_rss INFINITY h_rss INFINITY s_vmem INFINITY h_vmem INFINITY,[A=30g], \ [B=5g]

Read the article

How to specify the equivalent of ppn (on PBS) in SGE queuing system ?

- by debugger

Hello All, Is there a way to specify the ppn ( or equivalent ) in SGE ? i don't want to use all cpus in one node so i will be able to have more memory per core. ( In PBS you would do -l nodes=16:ppn=2 for exemple) Thanks.

Read the article

How to specify the equivalent of ppn in SGE queing system ?

- by debugger

Hello All, Is there a way to specify the ppn ( or equivalent ) in SGE ? i don't want to use all cpus in one node so i will be able to have more memory per core. ( In PBS you would do -l nodes=16:ppn=2 for exemple) Thanks.

Read the article

Can't configure PAM + LDAP on Debian Lenny - Getting error=49 on server logs

- by Jorge Suárez de Lis

I've been migrating some servers and desktops using Ubuntu 10.04 from getting the users from an old OpenLDAP implementation to a newer Centos Active Directory. I haven't had any problems so far, until I reached a Debian Lenny server. I've set up the server as the others, setting /etc/ldap.conf and /etc/ldap/ldap.conf. However, when I issue "getent passwd", I get nothing from the LDAP server. Reading the pam_ldap manpage, I realized that /etc/ldap.conf was not an accepted file by pam_ldap -it worked with Ubuntu though-, so I renamed it to /etc/pam_ldap.conf. Same result. However, once I've changed the name of this file, when I login using SSH I get this on the LDAP server logs: [20/Jul/2012:11:19:40 +0200] conn=16501 fd=155 slot=155 connection from x.x.x.50 to 10.1.176.237 [20/Jul/2012:11:19:40 +0200] conn=16501 op=0 BIND dn="uid=ubuntu,ou=Applications,ou=CITIUS,dc=inv,dc=usc,dc=es" method=128 version=3 [20/Jul/2012:11:19:40 +0200] conn=16501 op=0 RESULT err=0 tag=97 nentries=0 etime=0 dn="uid=ubuntu,ou=applications,ou=citius,dc=inv,dc=usc,dc=es" [20/Jul/2012:11:19:40 +0200] conn=16501 op=1 SRCH base="ou=People,ou=CITIUS,dc=inv,dc=usc,dc=es" scope=2 filter="(uid=jorge.suarez)" attrs=ALL [20/Jul/2012:11:19:40 +0200] conn=16501 op=1 RESULT err=0 tag=101 nentries=1 etime=0 notes=U [20/Jul/2012:11:19:40 +0200] conn=16501 op=2 BIND dn="uid=jorge.suarez,ou=People,ou=CITIUS,dc=inv,dc=usc,dc=es" method=128 version=3 [20/Jul/2012:11:19:40 +0200] conn=16501 op=2 RESULT err=49 tag=97 nentries=0 etime=0 The password isn't working. I don't know that could be wrong, anything else seems to be OK. That user/password is working from another clients: [20/Jul/2012:11:29:39 +0200] conn=16528 fd=188 slot=188 connection from x.x.x.224 to 10.1.176.237 [20/Jul/2012:11:29:39 +0200] conn=16528 op=0 BIND dn="uid=ubuntu,ou=Applications,ou=CITIUS,dc=inv,dc=usc,dc=es" method=128 version=3 [20/Jul/2012:11:29:39 +0200] conn=16528 op=0 RESULT err=0 tag=97 nentries=0 etime=0 dn="uid=ubuntu,ou=applications,ou=citius,dc=inv,dc=usc,dc=es" [20/Jul/2012:11:29:39 +0200] conn=16528 op=1 SRCH base="ou=People,ou=CITIUS,dc=inv,dc=usc,dc=es" scope=2 filter="(uid=jorge.suarez)" attrs=ALL [20/Jul/2012:11:29:39 +0200] conn=16528 op=1 RESULT err=0 tag=101 nentries=1 etime=0 notes=U [20/Jul/2012:11:29:39 +0200] conn=16528 op=2 BIND dn="uid=jorge.suarez,ou=People,ou=CITIUS,dc=inv,dc=usc,dc=es" method=128 version=3 [20/Jul/2012:11:29:39 +0200] conn=16528 op=2 RESULT err=0 tag=97 nentries=0 etime=0 dn="uid=jorge.suarez,ou=people,ou=citius,dc=inv,dc=usc,dc=es" I'm using SSHA for storing passwords on the LDAP server. Maybe this is not supported by Debian Lenny? On pam_ldap.conf, I've set up this, as in all the other servers: # Do not hash the password at all; presume # the directory server will do it, if # necessary. This is the default. pam_password md5 Also tried clear, but it didn't work. Anyways, it's weird that issuing getent passwd still gets me no users. However, if I use pamtest from the package libpam-dotfile to test login, it works. # pamtest ssh jorge.suarez Trying to authenticate <jorge.suarez> for service <ssh>. Password: Authentication successful. # pamtest foo jorge.suarez Trying to authenticate <jorge.suarez> for service <foo>. Password: Authentication successful. But "su" won't work also: # su jorge.suarez Id. descoñecido: jorge.suarez Just the output from getent passwd : # getent passwd root:x:0:0:root:/root:/bin/bash daemon:x:1:1:daemon:/usr/sbin:/bin/sh bin:x:2:2:bin:/bin:/bin/sh sys:x:3:3:sys:/dev:/bin/sh sync:x:4:65534:sync:/bin:/bin/sync games:x:5:60:games:/usr/games:/bin/sh man:x:6:12:man:/var/cache/man:/bin/sh lp:x:7:7:lp:/var/spool/lpd:/bin/sh mail:x:8:8:mail:/var/mail:/bin/sh news:x:9:9:news:/var/spool/news:/bin/sh uucp:x:10:10:uucp:/var/spool/uucp:/bin/sh proxy:x:13:13:proxy:/bin:/bin/sh www-data:x:33:33:www-data:/var/www:/bin/sh backup:x:34:34:backup:/var/backups:/bin/sh list:x:38:38:Mailing List Manager:/var/list:/bin/sh irc:x:39:39:ircd:/var/run/ircd:/bin/sh gnats:x:41:41:Gnats Bug-Reporting System (admin):/var/lib/gnats:/bin/sh nobody:x:65534:65534:nobody:/nonexistent:/bin/sh libuuid:x:100:101::/var/lib/libuuid:/bin/sh Debian-exim:x:101:103::/var/spool/exim4:/bin/false statd:x:102:65534::/var/lib/nfs:/bin/false sshd:x:104:65534::/var/run/sshd:/usr/sbin/nologin luser:x:1000:1000:Usuario local de Burdeos,,,:/home/luser:/bin/bash messagebus:x:105:107::/var/run/dbus:/bin/false sge-admin:x:1001:1001:Administrador do SGE,,,:/home/cluster/sge-admin:/bin/bash ntp:x:107:110::/home/ntp:/bin/false haldaemon:x:108:111:Hardware abstraction layer,,,:/var/run/hald:/bin/false vde2-net:x:109:114::/var/run/vde2:/bin/false uml-net:x:110:115::/home/uml-net:/bin/false polkituser:x:111:116:PolicyKit,,,:/var/run/PolicyKit:/bin/false Debian-pxe:x:113:65534:Dummy user for Debian pxe package,,,:/home/Debian-pxe:/bin/false Nscd was stopped from the beginning.

Read the article

Open Grid Engine or Akka/Something more fault tolerant?

- by Mike Lyons

My use case is that I have a pipeline of independent, stand alone programs, that I want to execute in a certain order on specific pieces of data that our output from previous pipeline stages. The pipeline is entirely linear and doesn't do anything in terms of alternate paths through the pipe. I'm currently using SGE to do this and it works OK, however occasionally a job will overstep it's memory bounds, fail, and all jobs that require that output data will fail. The pipe needs to be restarted in that case, and it seems that whatever is providing the fault tolerance in akka might solve that for me?

Read the article

Sun Grid Engine: Automatically Terminating Idle Interactive Jobs

- by dmcer

We're considering using Sun Grid Engine on a small compute cluster. Right now, the current set up is pretty crude and just involves having people ssh to an open machine to run their jobs. We'd like to allow interactive jobs, since that should ease the transition from manually starting jobs to starting them using qsub. But, there is some concern that, if we do, people might accidentally leave their interactive sessions idle and block other jobs from being run on the machines. The issue isn't just theoretical, since we previously tried using OpenPBS and there was a problem with people opening up an interactive job in a screen session and essentially camping on a machine. Is there anyway to configure SGE to automatically kill idle interactive jobs? It looks like this was requested as an enhancement (Issue #:2447) way back in 2007. But, it doesn't seem like the request ever got implemented.

Read the article

How to execute a command on multiple hosts using IPv6 only?

- by math

First of all there is pdsh which is essentially a parallel distributed shell which may execute commands on a list of given hosts. However, I find myself in an IPv6 only problem setting. It seems that pdsh is not able to use IPv6, as I am getting error messages: pdsh -w ^hostnames my_command pdsh@myhost: gethostbyname("foobar") failed I also tried to use IPv6 addresses only, which also didn't work. So how do you run a single shell script for administrative purpose (no SGE stuff, or similar) on a bunch of hosts that is IPv6 reachable only?

Read the article

Is there a DRMAA Java library that works with Torque/PBS?

- by dondondon

Does anybody know a Java implementation of the DRMAA-API that is known to work with PBS/Torque cluster software? The background behind this: I would like to submit jobs to a newly set-up linux cluster from java using a DRMAA compliant API. The cluster is managed by PBS/Torque. Torque includes PBS DRMAA 1.0 library for Torque/PBS that contains a DRMA-C binding and provides in libdrmaa.so and .a binaries. I know that Sun grid engine includes a drmaa.jar providing a Java-DRMAA API. In fact I opted to use SGE but it was decided to try PBS first. The theory behind that decision was: 'DRMAA is a standard and therefore a Java API needs only a standards compliant drmaa-c binding.' However, I couldn't find such 'general DRMAA-C-java API' and now assume that this assumption is wrong and that the Java libraries are engine specific. Are we stuck here? Any comments appreciated.

Read the article

JSF index out of bounds exception when submiting a form

- by selvin

When I click submit button in a JSF form the following exception occurs. It says an Indexout of bounds exception, but I did not use any ArrayList associated with the code. Is this a bug? what should i do to get rid of this error.. Mojarra: 2.0.2 FCS with primefaces 2.2 JSF: 2.0 NetBeans IDE 6.8 Glassfish Domain V3 Form Code: <p:panel id="jobres" style="min-width: 200px" header="Reservation" widgetVar="jres" closable="true" toggleable="true" > <h:form id="arj" prependId="false" style="width:550px;max-height:400px;overflow:auto;"> <p:tooltip global="true"/> <h:panelGrid columns="2" > <p:panel style="min-width: 220px"> <h:outputLabel value="1.Job type:"/> </p:panel> <h:panelGroup> <h:messages id="aerr"/> <h:selectOneMenu title="Choose a Jobtype" value="#{arjob.jobtype}"> <f:selectItem itemLabel="Sequential" itemValue="sequential"/> <f:selectItem itemLabel="Parallel" itemValue="parallel"/> </h:selectOneMenu> </h:panelGroup> <p:panel> <h:outputLabel value="2.Executable: *"/> </p:panel> <h:panelGroup> <p:fileUpload id="aexeupload" fileUploadListener="#{arjob.chooseListener}" auto="true" update="adlist :erdialog" description="Resource Files"> </p:fileUpload> <h:panelGroup id="aexelistwrapper"> <p:dataList var="fileList" type="ordered" id="adlist" value="#{arjob.fexelist}"> <p:column> #{fileList}  <p:commandLink ajax="true" update="aexelistwrapper" actionListener="#{arjob.removeExe(fileList)}"> <p:graphicImage value="images/closebar.png"/> </p:commandLink> </p:column> </p:dataList> </h:panelGroup> </h:panelGroup> <p:panel> <h:outputLabel value="3.Argument(s):"/> </p:panel> <h:panelGroup> <p:inplace emptyLabel="Add Arguments" onEditUpdate="aarglist"> <h:inputText title="Enter the arguments" id="aiparg" value="#{arjob.args}"> <f:ajax event="valueChange"/> </h:inputText> <p:commandButton update="aarglistwrapper erdialog" value="add" actionListener="#{arjob.addArg}"/> </p:inplace> </h:panelGroup> <h:panelGroup> </h:panelGroup> <h:panelGroup> <h:panelGrid id="aarglistwrapper"> <p:dataList id="aarglist" type="ordered" var="args" value="#{arjob.arglist}"> <p:column id="col2"> #{args}  <p:commandLink ajax="true" update="arj:arglistwrapper" actionListener="#{arjob.removeArgs(args)}"> <p:graphicImage title="remove" value="images/closebar.png"/> </p:commandLink> </p:column> </p:dataList> </h:panelGrid> </h:panelGroup> <p:panel> <h:outputLabel value="4.InputFile(s): *"/> </p:panel> <h:panelGroup> <p:fileUpload id="ainpupload" fileUploadListener="#{arjob.inputChooseListener}" auto="true" update="aipfilelistwrapper :erdialog" description="Resource Files"> </p:fileUpload> <h:panelGroup id="aipfilelistwrapper"> <p:dataList var="ipfile" type="ordered" id="aipflist" value="#{arjob.finlist}"> <p:column> #{ipfile}  <p:commandLink ajax="true" update="aipfilelistwrapper" actionListener="#{arjob.removeInfile(ipfile)}"> <p:graphicImage value="images/closebar.png"/> </p:commandLink> </p:column> </p:dataList> </h:panelGroup> </h:panelGroup> <h:panelGroup> <p:panel > 5)Output File(s): </p:panel> </h:panelGroup> <h:panelGroup> <p:inplace emptyLabel="Add file name" id="aipexe"> <h:inputText title="Enter the output filenames" id="aexe" value="#{arjob.ofilename}"> <f:ajax event="valueChange"/> </h:inputText> <p:commandButton update="adoutlist :erdialog" value="add" actionListener="#{arjob.addOutfile}"/> </p:inplace> </h:panelGroup> <h:panelGroup> </h:panelGroup> <h:panelGroup> <h:panelGrid id="afilelistwrapper"> <p:dataList id="adoutlist" type="ordered" var="ofile" value="#{arjob.foutlist}"> <p:column id="acol"> #{ofile}  <p:commandLink ajax="true" update="afilelistwrapper" actionListener="#{arjob.removeOutfile(ofile)}"> <p:graphicImage value="images/closebar.png"/> </p:commandLink> </p:column> </p:dataList> </h:panelGrid> </h:panelGroup> <p:panel> <h:outputLabel value="6) Operating System"/> </p:panel> <h:panelGroup> <h:selectOneMenu title="Select an OperatingSystem" value="#{arjob.os}"> <f:selectItem itemLabel="CentOS Release 5.2" itemValue="Cent OS 5.2"/> <f:selectItem itemLabel="RHEL Server Release 5" itemValue="RHEL server 5"/> <f:selectItem itemLabel="RHEL Server Release 5.2" itemValue="RHEL server 5.2"/> </h:selectOneMenu> </h:panelGroup> <h:panelGroup> <p:panel> <h:outputLabel value="7) Physical Memory:"/> </p:panel> </h:panelGroup> <h:panelGroup> <p:spinner min="0" style="width: 100px" stepFactor="10" value="#{arjob.mem}"> </p:spinner>(MB) </h:panelGroup> <h:panelGroup> <p:panel> <h:outputLabel value="8) Disk Space:"/> </p:panel> </h:panelGroup> <h:panelGroup> <p:spinner min="0" style="width: 100px" stepFactor="10" value="#{arjob.diskspace}"> </p:spinner>(MB) </h:panelGroup> <h:panelGroup> <p:panel> <h:outputLabel value="9) CPU Mhz:"/> </p:panel> </h:panelGroup> <h:panelGroup> <p:spinner min="0" style="width: 100px" stepFactor="10" value="#{arjob.cpumhz}"> </p:spinner>(Mhz) </h:panelGroup> <p:panel> <h:outputLabel value="10) Start Time:"/> </p:panel> <h:panelGroup> <p:inputMask title="(YYYY-MM-DD HH:MM:SS)" mask="9999-99-99 99:99:99" value="#{arjob.startt}"> </p:inputMask> </h:panelGroup> <p:panel> <h:outputLabel value="11) End Time:"/> </p:panel> <h:panelGroup> <p:inputMask title="(YYYY-MM-DD HH:MM:SS)" mask="9999-99-99 99:99:99" value="#{arjob.endt}"> <p:ajax event="valueChange"/> </p:inputMask> </h:panelGroup> <p:panel> <h:outputLabel value="12) LRMS type"/> </p:panel> <h:panelGroup> <h:selectOneMenu value="#{arjob.lrms}"> <f:selectItem itemLabel="PBS" itemValue="PBS"/> <f:selectItem itemLabel="SGE" itemValue="SGE"/> </h:selectOneMenu> </h:panelGroup> <h:panelGroup> <p:panel> <h:outputLabel value="13)Number of Nodes: *"/> </p:panel> </h:panelGroup> <h:panelGroup> <h:panelGroup> <p:spinner style="width: 100px" min="1" max="100" value="#{arjob.numnodes}"> </p:spinner> </h:panelGroup> </h:panelGroup> <h:panelGroup> </h:panelGroup> <h:panelGroup> <p:commandButton ajax="false" value="Submit" action="#{arjob.jobSubmitAction}"/> </h:panelGroup> </h:panelGrid> </h:form> <p:draggable for="jobres" handle=".ui-panel-titlebar"/> </p:panel> Exception: SEVERE: javax.faces.FacesException: Unexpected error restoring state for component with id j_idt7. Cause: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0. at com.sun.faces.application.view.StateManagementStrategyImpl$2.visit(StateManagementStrategyImpl.java:239) at com.sun.faces.component.visit.FullVisitContext.invokeVisitCallback(FullVisitContext.java:147) at javax.faces.component.UIComponent.visitTree(UIComponent.java:1446) at javax.faces.component.UIComponent.visitTree(UIComponent.java:1457) at javax.faces.component.UIComponent.visitTree(UIComponent.java:1457) at javax.faces.component.UIComponent.visitTree(UIComponent.java:1457) at com.sun.faces.application.view.StateManagementStrategyImpl.restoreView(StateManagementStrategyImpl.java:223) at com.sun.faces.application.StateManagerImpl.restoreView(StateManagerImpl.java:177) at com.sun.faces.application.view.ViewHandlingStrategy.restoreView(ViewHandlingStrategy.java:131) at com.sun.faces.application.view.FaceletViewHandlingStrategy.restoreView(FaceletViewHandlingStrategy.java:430) at com.sun.faces.application.view.MultiViewHandler.restoreView(MultiViewHandler.java:143) at javax.faces.application.ViewHandlerWrapper.restoreView(ViewHandlerWrapper.java:288) at com.sun.faces.lifecycle.RestoreViewPhase.execute(RestoreViewPhase.java:199) at com.sun.faces.lifecycle.Phase.doPhase(Phase.java:101) at com.sun.faces.lifecycle.RestoreViewPhase.doPhase(RestoreViewPhase.java:110) at com.sun.faces.lifecycle.LifecycleImpl.execute(LifecycleImpl.java:118) at javax.faces.webapp.FacesServlet.service(FacesServlet.java:312) at org.apache.catalina.core.StandardWrapper.service(StandardWrapper.java:1523) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:343) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:215) at org.primefaces.webapp.filter.FileUploadFilter.doFilter(FileUploadFilter.java:79) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:256) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:215) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:277) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:188) at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:641) at com.sun.enterprise.web.WebPipeline.invoke(WebPipeline.java:97) at com.sun.enterprise.web.PESessionLockingStandardPipeline.invoke(PESessionLockingStandardPipeline.java:85) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:185) at org.apache.catalina.connector.CoyoteAdapter.doService(CoyoteAdapter.java:332) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:233) at com.sun.enterprise.v3.services.impl.ContainerMapper.service(ContainerMapper.java:165) at com.sun.grizzly.http.ProcessorTask.invokeAdapter(ProcessorTask.java:791) at com.sun.grizzly.http.ProcessorTask.doProcess(ProcessorTask.java:693) at com.sun.grizzly.http.ProcessorTask.process(ProcessorTask.java:954) at com.sun.grizzly.http.DefaultProtocolFilter.execute(DefaultProtocolFilter.java:170) at com.sun.grizzly.DefaultProtocolChain.executeProtocolFilter(DefaultProtocolChain.java:135) at com.sun.grizzly.DefaultProtocolChain.execute(DefaultProtocolChain.java:102) at com.sun.grizzly.DefaultProtocolChain.execute(DefaultProtocolChain.java:88) at com.sun.grizzly.http.HttpProtocolChain.execute(HttpProtocolChain.java:76) at com.sun.grizzly.ProtocolChainContextTask.doCall(ProtocolChainContextTask.java:53) at com.sun.grizzly.SelectionKeyContextTask.call(SelectionKeyContextTask.java:57) at com.sun.grizzly.ContextTask.run(ContextTask.java:69) at com.sun.grizzly.util.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:330) at com.sun.grizzly.util.AbstractThreadPool$Worker.run(AbstractThreadPool.java:309) at java.lang.Thread.run(Thread.java:619) Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0 at java.util.ArrayList.RangeCheck(ArrayList.java:547) at java.util.ArrayList.get(ArrayList.java:322) at javax.faces.component.AttachedObjectListHolder.restoreState(AttachedObjectListHolder.java:161) at javax.faces.component.UIComponentBase.restoreState(UIComponentBase.java:1427) at com.sun.faces.application.view.StateManagementStrategyImpl$2.visit(StateManagementStrategyImpl.java:231) ... 45 more

Read the article

LSI 9285-8e and Supermicro SC837E26-RJBOD1 duplicate enclosure ID and slot numbers

- by Andy Shinn

I am working with 2 x Supermicro SC837E26-RJBOD1 chassis connected to a single LSI 9285-8e card in a Supermicro 1U host. There are 28 drives in each chassis for a total of 56 drives in 28 RAID1 mirrors. The problem I am running in to is that there are duplicate slots for the 2 chassis (the slots list twice and only go from 0 to 27). All the drives also show the same enclosure ID (ID 36). However, MegaCLI -encinfo lists the 2 enclosures correctly (ID 36 and ID 65). My question is, why would this happen? Is there an option I am missing to use 2 enclosures effectively? This is blocking me rebuilding a drive that failed in slot 11 since I can only specify enclosure and slot as parameters to replace a drive. When I do this, it picks the wrong slot 11 (device ID 46 instead of device ID 19). Adapter #1 is the LSI 9285-8e, adapter #0 (which I removed due to space limitations) is the onboard LSI. Adapter information: Adapter #1 ============================================================================== Versions ================ Product Name : LSI MegaRAID SAS 9285-8e Serial No : SV12704804 FW Package Build: 23.1.1-0004 Mfg. Data ================ Mfg. Date : 06/30/11 Rework Date : 00/00/00 Revision No : 00A Battery FRU : N/A Image Versions in Flash: ================ BIOS Version : 5.25.00_4.11.05.00_0x05040000 WebBIOS Version : 6.1-20-e_20-Rel Preboot CLI Version: 05.01-04:#%00001 FW Version : 3.140.15-1320 NVDATA Version : 2.1106.03-0051 Boot Block Version : 2.04.00.00-0001 BOOT Version : 06.253.57.219 Pending Images in Flash ================ None PCI Info ================ Vendor Id : 1000 Device Id : 005b SubVendorId : 1000 SubDeviceId : 9285 Host Interface : PCIE ChipRevision : B0 Number of Frontend Port: 0 Device Interface : PCIE Number of Backend Port: 8 Port : Address 0 5003048000ee8e7f 1 5003048000ee8a7f 2 0000000000000000 3 0000000000000000 4 0000000000000000 5 0000000000000000 6 0000000000000000 7 0000000000000000 HW Configuration ================ SAS Address : 500605b0038f9210 BBU : Present Alarm : Present NVRAM : Present Serial Debugger : Present Memory : Present Flash : Present Memory Size : 1024MB TPM : Absent On board Expander: Absent Upgrade Key : Absent Temperature sensor for ROC : Present Temperature sensor for controller : Absent ROC temperature : 70 degree Celcius Settings ================ Current Time : 18:24:36 3/13, 2012 Predictive Fail Poll Interval : 300sec Interrupt Throttle Active Count : 16 Interrupt Throttle Completion : 50us Rebuild Rate : 30% PR Rate : 30% BGI Rate : 30% Check Consistency Rate : 30% Reconstruction Rate : 30% Cache Flush Interval : 4s Max Drives to Spinup at One Time : 2 Delay Among Spinup Groups : 12s Physical Drive Coercion Mode : Disabled Cluster Mode : Disabled Alarm : Enabled Auto Rebuild : Enabled Battery Warning : Enabled Ecc Bucket Size : 15 Ecc Bucket Leak Rate : 1440 Minutes Restore HotSpare on Insertion : Disabled Expose Enclosure Devices : Enabled Maintain PD Fail History : Enabled Host Request Reordering : Enabled Auto Detect BackPlane Enabled : SGPIO/i2c SEP Load Balance Mode : Auto Use FDE Only : No Security Key Assigned : No Security Key Failed : No Security Key Not Backedup : No Default LD PowerSave Policy : Controller Defined Maximum number of direct attached drives to spin up in 1 min : 10 Any Offline VD Cache Preserved : No Allow Boot with Preserved Cache : No Disable Online Controller Reset : No PFK in NVRAM : No Use disk activity for locate : No Capabilities ================ RAID Level Supported : RAID0, RAID1, RAID5, RAID6, RAID00, RAID10, RAID50, RAID60, PRL 11, PRL 11 with spanning, SRL 3 supported, PRL11-RLQ0 DDF layout with no span, PRL11-RLQ0 DDF layout with span Supported Drives : SAS, SATA Allowed Mixing: Mix in Enclosure Allowed Mix of SAS/SATA of HDD type in VD Allowed Status ================ ECC Bucket Count : 0 Limitations ================ Max Arms Per VD : 32 Max Spans Per VD : 8 Max Arrays : 128 Max Number of VDs : 64 Max Parallel Commands : 1008 Max SGE Count : 60 Max Data Transfer Size : 8192 sectors Max Strips PerIO : 42 Max LD per array : 16 Min Strip Size : 8 KB Max Strip Size : 1.0 MB Max Configurable CacheCade Size: 0 GB Current Size of CacheCade : 0 GB Current Size of FW Cache : 887 MB Device Present ================ Virtual Drives : 28 Degraded : 0 Offline : 0 Physical Devices : 59 Disks : 56 Critical Disks : 0 Failed Disks : 0 Supported Adapter Operations ================ Rebuild Rate : Yes CC Rate : Yes BGI Rate : Yes Reconstruct Rate : Yes Patrol Read Rate : Yes Alarm Control : Yes Cluster Support : No BBU : No Spanning : Yes Dedicated Hot Spare : Yes Revertible Hot Spares : Yes Foreign Config Import : Yes Self Diagnostic : Yes Allow Mixed Redundancy on Array : No Global Hot Spares : Yes Deny SCSI Passthrough : No Deny SMP Passthrough : No Deny STP Passthrough : No Support Security : No Snapshot Enabled : No Support the OCE without adding drives : Yes Support PFK : Yes Support PI : No Support Boot Time PFK Change : Yes Disable Online PFK Change : No PFK TrailTime Remaining : 0 days 0 hours Support Shield State : Yes Block SSD Write Disk Cache Change: Yes Supported VD Operations ================ Read Policy : Yes Write Policy : Yes IO Policy : Yes Access Policy : Yes Disk Cache Policy : Yes Reconstruction : Yes Deny Locate : No Deny CC : No Allow Ctrl Encryption: No Enable LDBBM : No Support Breakmirror : No Power Savings : Yes Supported PD Operations ================ Force Online : Yes Force Offline : Yes Force Rebuild : Yes Deny Force Failed : No Deny Force Good/Bad : No Deny Missing Replace : No Deny Clear : No Deny Locate : No Support Temperature : Yes Disable Copyback : No Enable JBOD : No Enable Copyback on SMART : No Enable Copyback to SSD on SMART Error : Yes Enable SSD Patrol Read : No PR Correct Unconfigured Areas : Yes Enable Spin Down of UnConfigured Drives : Yes Disable Spin Down of hot spares : No Spin Down time : 30 T10 Power State : Yes Error Counters ================ Memory Correctable Errors : 0 Memory Uncorrectable Errors : 0 Cluster Information ================ Cluster Permitted : No Cluster Active : No Default Settings ================ Phy Polarity : 0 Phy PolaritySplit : 0 Background Rate : 30 Strip Size : 64kB Flush Time : 4 seconds Write Policy : WB Read Policy : Adaptive Cache When BBU Bad : Disabled Cached IO : No SMART Mode : Mode 6 Alarm Disable : Yes Coercion Mode : None ZCR Config : Unknown Dirty LED Shows Drive Activity : No BIOS Continue on Error : No Spin Down Mode : None Allowed Device Type : SAS/SATA Mix Allow Mix in Enclosure : Yes Allow HDD SAS/SATA Mix in VD : Yes Allow SSD SAS/SATA Mix in VD : No Allow HDD/SSD Mix in VD : No Allow SATA in Cluster : No Max Chained Enclosures : 16 Disable Ctrl-R : Yes Enable Web BIOS : Yes Direct PD Mapping : No BIOS Enumerate VDs : Yes Restore Hot Spare on Insertion : No Expose Enclosure Devices : Yes Maintain PD Fail History : Yes Disable Puncturing : No Zero Based Enclosure Enumeration : No PreBoot CLI Enabled : Yes LED Show Drive Activity : Yes Cluster Disable : Yes SAS Disable : No Auto Detect BackPlane Enable : SGPIO/i2c SEP Use FDE Only : No Enable Led Header : No Delay during POST : 0 EnableCrashDump : No Disable Online Controller Reset : No EnableLDBBM : No Un-Certified Hard Disk Drives : Allow Treat Single span R1E as R10 : No Max LD per array : 16 Power Saving option : Don't Auto spin down Configured Drives Max power savings option is not allowed for LDs. Only T10 power conditions are to be used. Default spin down time in minutes: 30 Enable JBOD : No TTY Log In Flash : No Auto Enhanced Import : No BreakMirror RAID Support : No Disable Join Mirror : No Enable Shield State : Yes Time taken to detect CME : 60s Exit Code: 0x00 Enclosure information: # /opt/MegaRAID/MegaCli/MegaCli64 -encinfo -a1 Number of enclosures on adapter 1 -- 3 Enclosure 0: Device ID : 36 Number of Slots : 28 Number of Power Supplies : 2 Number of Fans : 3 Number of Temperature Sensors : 1 Number of Alarms : 1 Number of SIM Modules : 0 Number of Physical Drives : 28 Status : Normal Position : 1 Connector Name : Port B Enclosure type : SES VendorId is LSI CORP and Product Id is SAS2X36 VendorID and Product ID didnt match FRU Part Number : N/A Enclosure Serial Number : N/A ESM Serial Number : N/A Enclosure Zoning Mode : N/A Partner Device Id : 65 Inquiry data : Vendor Identification : LSI CORP Product Identification : SAS2X36 Product Revision Level : 0718 Vendor Specific : x36-55.7.24.1 Number of Voltage Sensors :2 Voltage Sensor :0 Voltage Sensor Status :OK Voltage Value :5020 milli volts Voltage Sensor :1 Voltage Sensor Status :OK Voltage Value :11820 milli volts Number of Power Supplies : 2 Power Supply : 0 Power Supply Status : OK Power Supply : 1 Power Supply Status : OK Number of Fans : 3 Fan : 0 Fan Speed :Low Speed Fan Status : OK Fan : 1 Fan Speed :Low Speed Fan Status : OK Fan : 2 Fan Speed :Low Speed Fan Status : OK Number of Temperature Sensors : 1 Temp Sensor : 0 Temperature : 48 Temperature Sensor Status : OK Number of Chassis : 1 Chassis : 0 Chassis Status : OK Enclosure 1: Device ID : 65 Number of Slots : 28 Number of Power Supplies : 2 Number of Fans : 3 Number of Temperature Sensors : 1 Number of Alarms : 1 Number of SIM Modules : 0 Number of Physical Drives : 28 Status : Normal Position : 1 Connector Name : Port A Enclosure type : SES VendorId is LSI CORP and Product Id is SAS2X36 VendorID and Product ID didnt match FRU Part Number : N/A Enclosure Serial Number : N/A ESM Serial Number : N/A Enclosure Zoning Mode : N/A Partner Device Id : 36 Inquiry data : Vendor Identification : LSI CORP Product Identification : SAS2X36 Product Revision Level : 0718 Vendor Specific : x36-55.7.24.1 Number of Voltage Sensors :2 Voltage Sensor :0 Voltage Sensor Status :OK Voltage Value :5020 milli volts Voltage Sensor :1 Voltage Sensor Status :OK Voltage Value :11760 milli volts Number of Power Supplies : 2 Power Supply : 0 Power Supply Status : OK Power Supply : 1 Power Supply Status : OK Number of Fans : 3 Fan : 0 Fan Speed :Low Speed Fan Status : OK Fan : 1 Fan Speed :Low Speed Fan Status : OK Fan : 2 Fan Speed :Low Speed Fan Status : OK Number of Temperature Sensors : 1 Temp Sensor : 0 Temperature : 47 Temperature Sensor Status : OK Number of Chassis : 1 Chassis : 0 Chassis Status : OK Enclosure 2: Device ID : 252 Number of Slots : 8 Number of Power Supplies : 0 Number of Fans : 0 Number of Temperature Sensors : 0 Number of Alarms : 0 Number of SIM Modules : 1 Number of Physical Drives : 0 Status : Normal Position : 1 Connector Name : Unavailable Enclosure type : SGPIO Failed in first Inquiry commnad FRU Part Number : N/A Enclosure Serial Number : N/A ESM Serial Number : N/A Enclosure Zoning Mode : N/A Partner Device Id : Unavailable Inquiry data : Vendor Identification : LSI Product Identification : SGPIO Product Revision Level : N/A Vendor Specific : Exit Code: 0x00 Now, notice that each slot 11 device shows an enclosure ID of 36, I think this is where the discrepancy happens. One should be 36. But the other should be on enclosure 65. Drives in slot 11: Enclosure Device ID: 36 Slot Number: 11 Drive's postion: DiskGroup: 5, Span: 0, Arm: 1 Enclosure position: 0 Device Id: 48 WWN: Sequence Number: 11 Media Error Count: 0 Other Error Count: 0 Predictive Failure Count: 0 Last Predictive Failure Event Seq Number: 0 PD Type: SATA Raw Size: 2.728 TB [0x15d50a3b0 Sectors] Non Coerced Size: 2.728 TB [0x15d40a3b0 Sectors] Coerced Size: 2.728 TB [0x15d400000 Sectors] Firmware state: Online, Spun Up Is Commissioned Spare : YES Device Firmware Level: A5C0 Shield Counter: 0 Successful diagnostics completion on : N/A SAS Address(0): 0x5003048000ee8a53 Connected Port Number: 1(path0) Inquiry Data: MJ1311YNG6YYXAHitachi HDS5C3030ALA630 MEAOA5C0 FDE Enable: Disable Secured: Unsecured Locked: Unlocked Needs EKM Attention: No Foreign State: None Device Speed: 6.0Gb/s Link Speed: 6.0Gb/s Media Type: Hard Disk Device Drive Temperature :30C (86.00 F) PI Eligibility: No Drive is formatted for PI information: No PI: No PI Drive's write cache : Disabled Drive's NCQ setting : Enabled Port-0 : Port status: Active Port's Linkspeed: 6.0Gb/s Drive has flagged a S.M.A.R.T alert : No Enclosure Device ID: 36 Slot Number: 11 Drive's postion: DiskGroup: 19, Span: 0, Arm: 1 Enclosure position: 0 Device Id: 19 WWN: Sequence Number: 4 Media Error Count: 0 Other Error Count: 0 Predictive Failure Count: 0 Last Predictive Failure Event Seq Number: 0 PD Type: SATA Raw Size: 2.728 TB [0x15d50a3b0 Sectors] Non Coerced Size: 2.728 TB [0x15d40a3b0 Sectors] Coerced Size: 2.728 TB [0x15d400000 Sectors] Firmware state: Online, Spun Up Is Commissioned Spare : NO Device Firmware Level: A580 Shield Counter: 0 Successful diagnostics completion on : N/A SAS Address(0): 0x5003048000ee8e53 Connected Port Number: 0(path0) Inquiry Data: MJ1313YNG1VA5CHitachi HDS5C3030ALA630 MEAOA580 FDE Enable: Disable Secured: Unsecured Locked: Unlocked Needs EKM Attention: No Foreign State: None Device Speed: 6.0Gb/s Link Speed: 6.0Gb/s Media Type: Hard Disk Device Drive Temperature :30C (86.00 F) PI Eligibility: No Drive is formatted for PI information: No PI: No PI Drive's write cache : Disabled Drive's NCQ setting : Enabled Port-0 : Port status: Active Port's Linkspeed: 6.0Gb/s Drive has flagged a S.M.A.R.T alert : No Update 06/28/12: I finally have some new information about (what we think) the root cause of this problem so I thought I would share. After getting in contact with a very knowledgeable Supermicro tech, they provided us with a tool called Xflash (doesn't appear to be readily available on their FTP). When we gathered some information using this utility, my colleague found something very strange: root@mogile2 test]# ./xflash.dat -i get avail Initializing Interface. Expander: SAS2X36 (SAS2x36) 1) SAS2X36 (SAS2x36) (50030480:00EE917F) (0.0.0.0) 2) SAS2X36 (SAS2x36) (50030480:00E9D67F) (0.0.0.0) 3) SAS2X36 (SAS2x36) (50030480:0112D97F) (0.0.0.0) This lists the connected enclosures. You see the 3 connected (we have since added a 3rd and a 4th which is not yet showing up) with their respective SAS address / WWN (50030480:00EE917F). Now we can use this address to get information on the individual enclosures: [root@mogile2 test]# ./xflash.dat -i 5003048000EE917F get exp Initializing Interface. Expander: SAS2X36 (SAS2x36) Reading the expander information.......... Expander: SAS2X36 (SAS2x36) B3 SAS Address: 50030480:00EE917F Enclosure Logical Id: 50030480:0000007F IP Address: 0.0.0.0 Component Identifier: 0x0223 Component Revision: 0x05 [root@mogile2 test]# ./xflash.dat -i 5003048000E9D67F get exp Initializing Interface. Expander: SAS2X36 (SAS2x36) Reading the expander information.......... Expander: SAS2X36 (SAS2x36) B3 SAS Address: 50030480:00E9D67F Enclosure Logical Id: 50030480:0000007F IP Address: 0.0.0.0 Component Identifier: 0x0223 Component Revision: 0x05 [root@mogile2 test]# ./xflash.dat -i 500304800112D97F get exp Initializing Interface. Expander: SAS2X36 (SAS2x36) Reading the expander information.......... Expander: SAS2X36 (SAS2x36) B3 SAS Address: 50030480:0112D97F Enclosure Logical Id: 50030480:0112D97F IP Address: 0.0.0.0 Component Identifier: 0x0223 Component Revision: 0x05 Did you catch it? The first 2 enclosures logical ID is partially masked out where the 3rd one (which has a correct unique enclosure ID) is not. We pointed this out to Supermicro and were able to confirm that this address is supposed to be set during manufacturing and there was a problem with a certain batch of these enclosures where the logical ID was not set. We believe that the RAID controller is determining the ID based on the logical ID and since our first 2 enclosures have the same logical ID, they get the same enclosure ID. We also confirmed that 0000007F is the default which comes from LSI as an ID. The next pointer that helps confirm this could be a manufacturing problem with a run of JBODs is the fact that all 6 of the enclosures that have this problem begin with 00E. I believe that between 00E8 and 00EE Supermicro forgot to program the logical IDs correctly and neglected to recall or fix the problem post production. Fortunately for us, there is a tool to manage the WWN and logical ID of the devices from Supermicro: ftp://ftp.supermicro.com/utility/ExpanderXtools_Lite/. Our next step is to schedule a shutdown of these JBODs (after data migration) and reprogram the logical ID and see if it solves the problem. Update 06/28/12 #2: I just discovered this FAQ at Supermicro while Google searching for "lsi 0000007f": http://www.supermicro.com/support/faqs/faq.cfm?faq=11805. I still don't understand why, in the last several times we contacted Supermicro, they would have never directed us to this article :\

Search Results

Search found 24 results on 1 pages for 'sge'.

Page 1/1 | 1

- by Pierre

- by wfaulk

- by user39212

- by mousee

- by vy32

- by debugger

- by debugger

- by pufferfish

- by bmargulies

- by Yotam

- by Gary Richardson

- by dan12345

- by GlinesMome

- by Rick Reynolds

- by Peter Van Heusden

- by debugger

- by debugger

- by Jorge Suárez de Lis

- by Mike Lyons

- by dmcer

- by math

- by dondondon

- by selvin

- by Andy Shinn