WebSphere problems related to new default nproc limit in RHEL 6

We recently had an incident on one of our production systems running under Red Hat Enterprise Linux where, under certain load conditions, WebSphere Application Server would fail with an OutOfMemoryError carrying the following message:

Failed to create a thread: retVal -1073741830, errno 11

Error number 11 corresponds to EAGAIN and indicates that the C library call creating the thread failed because of insufficient resources. Often this points to native memory starvation, but in our case it turned out that the nproc limit had been reached. That limit puts an upper bound on the number of processes a given user can create. It affects WebSphere because on Linux, each thread counts against nproc as if it were a distinct process.
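You can confirm the mapping between the error number and EAGAIN with a short Python snippet (errno values are platform dependent; 11 is the Linux value):

```python
import errno

# On Linux, error number 11 is EAGAIN ("Resource temporarily unavailable"),
# which pthread_create() returns when a resource limit such as nproc is hit.
print(errno.errorcode[errno.EAGAIN])  # prints "EAGAIN"
```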

Starting with RHEL 6, the soft nproc limit is set to 1024 by default, while in previous releases this was not the case. The corresponding configuration can be found in /etc/security/limits.d/90-nproc.conf. Generally a WebSphere instance only uses a few hundred threads, so the problem may go unnoticed for some time before being triggered by an unusual load condition. You should also take into account that the limit applies to the sum of all threads created by all processes running as the same user as the WebSphere instance. In particular it is not unusual to have IBM HTTP Server running as the same user on the same host. Since the WebSphere plug-in uses a multithreaded processing model (and not an asynchronous one), the nproc limit may be reached if the number of concurrent requests increases too much.
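For reference, the stock 90-nproc.conf shipped with RHEL 6 looks roughly like this (the exact comment text may vary between minor releases):

```
# Default limit for number of user's processes to prevent
# accidental fork bombs.
# See rhbz #432903 for reasoning.

*          soft    nproc     1024
```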

One solution is to remove or edit the 90-nproc.conf file to increase the nproc limit for all users. However, since the purpose of the new default value in RHEL 6 is to prevent accidental fork bombs, it may be better to define new hard and soft nproc limits only for the user running the WebSphere instance. While this is easy to configure, there is one other problem that needs to be taken into account.

For some unknown reason, sudo (in contrast to su) is unable to set the soft limit of the new process to a value larger than the hard limit of the parent process. When that occurs, instead of failing, sudo creates the new process with the same soft limit as the parent process. This means that if the hard nproc limit for normal users is lower than the soft nproc limit of the WebSphere user, and an administrator uses sudo to start a WebSphere instance, then that instance will not have the expected soft nproc limit. To avoid this problem, you should do the following:

  • Increase the soft nproc limit for the user running WebSphere.
  • Increase the hard nproc limit for all users to the same (or a higher) value, while keeping their soft limit unchanged (to preserve the protection against accidental fork bombs).
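Concretely, this could be done with a drop-in file such as the following sketch (the file name, the wasadmin user name, and the value 8192 are assumptions; adjust them to your environment). Note that with pam_limits, a later matching entry overrides an earlier one, so a file sorting after 90-nproc.conf takes precedence:

```
# /etc/security/limits.d/91-websphere.conf (hypothetical file name)
# wasadmin is the assumed user running the WebSphere instance.
wasadmin   soft    nproc     8192
*          hard    nproc     8192
```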

Note that you can verify that the limits are set correctly for a running WebSphere instance by determining the PID of the instance and checking the /proc/<pid>/limits file.
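A minimal sketch of that check in Python, parsing the "Max processes" row of the limits file (Linux only; the function name is ours):

```python
import re

def nproc_limits(pid="self"):
    """Return the (soft, hard) nproc limits of a process as strings,
    read from /proc/<pid>/limits (Linux only).

    For a running WebSphere instance, pass its numeric PID instead of "self".
    """
    with open(f"/proc/{pid}/limits") as f:
        for line in f:
            if line.startswith("Max processes"):
                # Columns: name, soft limit, hard limit, units;
                # a limit is either a number or the word "unlimited".
                soft, hard = re.findall(r"\d+|unlimited", line)[:2]
                return soft, hard
    raise RuntimeError("Max processes row not found")

print(nproc_limits())
```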