Background
We have a pool of approximately 20 Linux blades. Some are running SuSE, some are running Red Hat. All share NAS space, which contains the following 3 folders:
/NAS/app/java - a symlink that points to an installation of a Java JDK. Currently version 1.5.0_10
/NAS/app/lib - a symlink that points to a version of our application.
/NAS/data - directory where our output is written
All our machines have 2 processors (hyperthreaded) with 4 GB of physical memory and 4 GB of swap space. We limit the number of 'jobs' each machine can process at a given time to 6 (this number likely needs to change, but that does not enter into the current problem, so please ignore it for the time being).
Some of our jobs set a max heap size of 512 MB; others set a max heap size of 2048 MB. Again, we realize we could exceed our available memory if 6 jobs started on the same machine with the heap size set to 2048 MB, but to our knowledge this has not yet occurred.
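To put a number on that worst case, here is a quick back-of-the-envelope sketch in shell arithmetic. The variable names are only for illustration; the figures are the ones quoted above. It counts the Java heap only, so the real footprint of each JVM would be somewhat larger.

```shell
#!/bin/sh
# Worst case: every job slot on one machine is taken by a 2048 MB job.
max_jobs=6
heap_mb=2048
ram_mb=4096    # 4 GB physical memory
swap_mb=4096   # 4 GB swap

worst_case_mb=$(( max_jobs * heap_mb ))
capacity_mb=$(( ram_mb + swap_mb ))

echo "worst case: ${worst_case_mb} MB, RAM + swap: ${capacity_mb} MB"
# prints: worst case: 12288 MB, RAM + swap: 8192 MB
```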
The Problem
Once in a while a job will fail immediately with the following message:
Error occurred during initialization of VM
Could not reserve enough space for object heap
Could not create the Java virtual machine.
We used to chalk this up to too many jobs running at the same time on the same machine. The problem happened infrequently enough (MAYBE once a month) that we'd just restart it and everything would be fine.
The problem has recently gotten much worse. All of our jobs which request a max heap size of 2048m fail immediately almost every time and need to get restarted several times before completing.
We've gone out to individual machines and tried executing them manually with the same result.
Debugging
It turns out that the problem only exists on our SuSE boxes. The reason it has been happening more frequently is because we've been adding more machines, and the new ones are SuSE.
'cat /proc/version' on the SuSE boxes gives us:
Linux version 2.6.5-7.244-bigsmp (geeko@buildhost) (gcc version 3.3.3 (SuSE Linux)) #1 SMP Mon Dec 12 18:32:25 UTC 2005
'cat /proc/version' on the Red Hat boxes gives us:
Linux version 2.4.21-32.0.1.ELsmp ([email protected]) (gcc version 3.2.3 20030502 (Red Hat Linux 3.2.3-52)) #1 SMP Tue May 17 17:52:23 EDT 2005
'uname -a' gives us the following on BOTH types of machines:
UTC 2005 i686 i686 i386 GNU/Linux
No jobs are running on the machine, and no other processes are using much memory. All of the processes currently running might be using 100 MB in total.
'top' currently shows the following:
Mem:  4146528k total, 3536360k used,  610168k free,  132136k buffers
Swap: 4194288k total,       0k used, 4194288k free, 3283908k cached
'vmstat' currently shows the following:
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff   cache   si   so    bi    bo   in    cs us sy  id wa
 0  0      0 610292 132136 3283908    0    0     0     2   26    15  0  0 100  0
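As a rough sanity check on those numbers, free memory plus buffers plus cache can be added up. The sketch below does that with the 'top' figures and also defines a small helper that reads the same three fields from /proc/meminfo. The function name reclaimable_kb is ours, and treating all of the cache as reclaimable is an approximation.

```shell
#!/bin/sh
# Sum MemFree + Buffers + Cached (in kB) from a /proc/meminfo-style file.
# reclaimable_kb is a hypothetical helper name; the file defaults to /proc/meminfo.
reclaimable_kb() {
    awk '/^(MemFree|Buffers|Cached):/ { sum += $2 } END { print sum + 0 }' "${1:-/proc/meminfo}"
}

# The same sum using the figures reported by 'top' (free + buffers + cached, in kB).
echo $(( 610168 + 132136 + 3283908 ))
# prints: 4026212
```

That is roughly 3.8 GB out of 4 GB that is either free or held by buffers and cache. Running reclaimable_kb with no argument on one of the boxes would give the live figure.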
If we kick off a job with the following command line (max heap of 1850 MB), it starts fine:
java/bin/java -Xmx1850M -cp helloworld.jar HelloWorld
Hello World
If we bump the max heap size up to 1875 MB, it fails:
java/bin/java -Xmx1875M -cp helloworld.jar HelloWorld
Error occurred during initialization of VM
Could not reserve enough space for object heap
Could not create the Java virtual machine.
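To pin down exactly where the line falls between 1850 MB and 1875 MB, a small bisection like the following could be used. This is only a sketch: the function names and the JAVA variable are ours, and it assumes that 'java -Xmx<n>M -version' fails the same way HelloWorld does when the heap cannot be reserved.

```shell
#!/bin/sh
# Probe: succeeds if the JVM starts with a max heap of $1 MB.
# JAVA is an assumed override for the java binary (e.g. JAVA=java/bin/java).
jvm_starts() {
    "${JAVA:-java}" "-Xmx${1}M" -version >/dev/null 2>&1
}

# Bisect between a heap size known to work (lo) and one known to fail (hi).
# $1 = probe command, $2 = lo in MB, $3 = hi in MB.
# Prints the largest size in MB for which the probe succeeded.
find_max_heap() {
    probe=$1
    lo=$2
    hi=$3
    while [ $(( hi - lo )) -gt 1 ]; do
        mid=$(( (lo + hi) / 2 ))
        if "$probe" "$mid"; then
            lo=$mid     # still starts: the limit is at or above mid
        else
            hi=$mid     # fails: the limit is below mid
        fi
    done
    echo "$lo"
}
```

Run from the directory used above, 'JAVA=java/bin/java find_max_heap jvm_starts 1850 1875' would print the largest -Xmx value in MB that still starts. Comparing that number between a SuSE box and a Red Hat box should show whether the limit itself differs.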
It's quite clear that most of the memory reported as 'used' is going to buffers and cache, which is why so little shows up as 'free'. What isn't clear is why there is a magic line somewhere between 1850 MB and 1875 MB above which Java can't start.
Any explanations would be greatly appreciated.