AIX Checklist for stable obiee deployment

Posted by user554629 on Oracle Blogs
Published on Mon, 27 Aug 2012 20:20:51 +0000 Indexed on 2012/08/27 21:49 UTC

Common AIX configuration issues     ( last updated 27 Aug 2012 )

OBIEE is a complicated system with many moving parts and connection points.
The purpose of this article is to provide a checklist to discuss OBIEE deployment with your systems administrators.

The information in this article is time-sensitive and is updated as I discover new issues or details.

What makes OBIEE different?

When Tech Support suggests AIX component upgrades to a stable, locked-down production AIX environment, it is common to get "push back".  "Why is this necessary?  Why aren't we seeing issues with other software?"

It's a fair question that I have often struggled to answer; here are the talking points:

  • OBIEE is memory intensive.  It is the entire purpose of the software to trade memory for repetitive, more expensive database requests across a network.
  • OBIEE is implemented in C++ and is very dependent on the C++ runtime to behave correctly.
  • OBIEE is aggressively thread efficient;  if atomic operations on a particular architecture do not work correctly, the software crashes.
  • OBIEE dynamically loads third-party database client libraries directly into the nqsserver process.  If the library is not thread-safe, or corrupts process memory the OBIEE crash happens in an unrelated part of the code.  These are extremely difficult bugs to find.
  • OBIEE software uses 99% common source across multiple platforms:  Windows, Linux, AIX, Solaris and HPUX.  If a crash happens on only one platform, we begin to suspect other factors:  load intensity, system differences, configuration choices, hardware failures.

It is rare to have a single product require so many diverse technical skills.   My role in support is to understand system configurations, performance issues, and crashes.   An analyst trained in Business Analytics can't be expected to know AIX internals in the depth required to make configuration choices.  Here are some guidelines.

  1. AIX C++ Runtime must be at  version 11.1.0.4
    $ lslpp -L | grep xlC.aix
    obiee software will crash if xlC.aix.rte is downlevel;  this is not a "try it" suggestion.
    Nov 2011 11.1.0.4 version  is appropriate for all AIX versions ( 5, 6, 7 )
    Download from here:
    https://www-304.ibm.com/support/docview.wss?uid=swg24031426
    No reboot is necessary to install, it can even be installed while applications are using the current version.
    Restart the apps, and they will pick up the latest version.
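
A minimal sketch of how the level check in item 1 might be scripted.  The helper name version_ge is hypothetical, and the parsing assumes that "lslpp -L" prints the fileset name in column 1 and its level in column 2; verify both against your own system before relying on it.

```shell
#!/bin/sh
# version_ge A B: succeed when dotted version A is >= B, compared field by field.
# (hypothetical helper, not an AIX command)
version_ge() {
    vg_a=$1
    vg_b=$2
    while [ -n "$vg_a" ] || [ -n "$vg_b" ]; do
        vg_af=${vg_a%%.*}
        vg_bf=${vg_b%%.*}
        if [ "${vg_af:-0}" -gt "${vg_bf:-0}" ]; then return 0; fi
        if [ "${vg_af:-0}" -lt "${vg_bf:-0}" ]; then return 1; fi
        # drop the field just compared
        case $vg_a in *.*) vg_a=${vg_a#*.} ;; *) vg_a= ;; esac
        case $vg_b in *.*) vg_b=${vg_b#*.} ;; *) vg_b= ;; esac
    done
    return 0
}

REQUIRED=11.1.0.4

# Only query the fileset database where lslpp exists (that is, on AIX).
if command -v lslpp >/dev/null 2>&1; then
    # Assumption: column 1 is the fileset name, column 2 is its level.
    lslpp -L 2>/dev/null | awk '$1 ~ /^xlC\.aix.*\.rte$/ { print $1, $2 }' |
    while read -r fileset level; do
        if version_ge "$level" "$REQUIRED"; then
            echo "$fileset $level: ok"
        else
            echo "$fileset $level: downlevel, need at least $REQUIRED"
        fi
    done
fi
```

Comparing field by field avoids the trap of a plain string compare, where 11.1.0.10 would sort below 11.1.0.4.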


  2. AIX 5.3 Technology Level 12 is required when running on Power5,6,7 processors
    AIX 6.1 was introduced with the newer Power chips, and we have seen no issues with 6.1 or 7.1 versions.
    Customers with an unstable deployment, dozens of unexplained crashes, became stable after the upgrade.
    If your AIX system is 5.3, the minimum TL level should be at or higher than this:
    $ oslevel -s
      5300-12-03-1107
    IBM typically supports only the two latest versions of AIX ( 6.1 and 7.1, for example).  AIX 5.3 is still supported and popular running in an LPAR.
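
The Technology Level test in item 2 can be sketched as a small function.  The name tl_ok is hypothetical; the only assumption about the input is the release-TL-SP-build layout of the "oslevel -s" string shown above.

```shell
#!/bin/sh
# tl_ok LEVEL: succeed unless LEVEL is an AIX 5.3 level below Technology Level 12.
# LEVEL is an "oslevel -s" string such as 5300-12-03-1107 (release-TL-SP-build).
# (hypothetical helper, not an AIX command)
tl_ok() {
    tl_release=${1%%-*}        # e.g. 5300
    tl_rest=${1#*-}            # e.g. 12-03-1107
    tl_level=${tl_rest%%-*}    # e.g. 12
    if [ "$tl_release" = "5300" ] && [ "$tl_level" -lt 12 ]; then
        return 1
    fi
    return 0
}

# Only run the live check where oslevel exists (that is, on AIX).
if command -v oslevel >/dev/null 2>&1; then
    level=$(oslevel -s)
    if tl_ok "$level"; then
        echo "$level: ok"
    else
        echo "$level: AIX 5.3 below TL 12, upgrade advised"
    fi
fi
```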

  3. obiee userid limits
    $ ulimit -Ha  ( hard limits )
    $ ulimit -a   ( default limits )
    core file size (blocks)     unlimited
    data seg size (kbytes)      unlimited
    file size (blocks)          unlimited
    max memory size (kbytes)    unlimited
    open files                  10240
    cpu time (seconds)          unlimited
    virtual memory (kbytes)     unlimited

    It is best to establish the values in /etc/security/limits
    root user is needed to observe and modify this file.
    If you lower a limit, you will need to log in again before you can raise it.  For example,
    $ ulimit -c 0
    $ ulimit -c 2097151
    cannot modify limit: Operation not permitted
    $ ulimit -c unlimited
    $ ulimit -c
    0

    There are only two meaningful values for ulimit -c ; zero or unlimited.
    Anything else is likely to produce a truncated core file that cannot be analyzed.
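
A minimal sketch of a check for the "zero or unlimited" rule, run as the obiee userid.  The helper name core_limit_ok is hypothetical.

```shell
#!/bin/sh
# core_limit_ok VALUE: succeed when VALUE is one of the two meaningful settings.
# (hypothetical helper)
core_limit_ok() {
    case $1 in
        0|unlimited) return 0 ;;
        *)           return 1 ;;
    esac
}

soft=$(ulimit -c)    # current limit for this shell
if core_limit_ok "$soft"; then
    echo "core limit $soft: ok"
else
    echo "core limit $soft: a core would likely be truncated; use 0 or unlimited"
fi
```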

  4. Deploy 32-bit or 64-bit ?
    Early versions of OBIEE offered 32-bit or 64-bit choice to AIX customers.
    The 32-bit choice was needed if a database vendor did not supply a 64-bit client library.
    That's no longer an issue and beginning with OBIEE 11, 32-bit code is no longer shipped.

    A common error that leads to "out of memory" conditions is to accept the 32-bit memory configuration choices on 64-bit deployments.  The significant configuration choices are:
    • Maximum process data (heap) size is in an AIX environment variable
      LDR_CNTRL=IGNOREUNLOAD@LOADPUBLIC@PREREAD_SHLIB@MAXDATA=0x...
    • Two thread stack sizes are set in obiee NQSConfig.INI
      [ SERVER ]
      SERVER_THREAD_STACK_SIZE = 0;
      DB_GATEWAY_THREAD_STACK_SIZE = 0;
    • Sort memory in NQSConfig.INI
      [ GENERAL ]
      SORT_MEMORY_SIZE = 4 MB ;
      SORT_BUFFER_INCREMENT_SIZE = 256 KB ;


    Choosing a value for MAXDATA:
    0x080000000  2GB Default maximum 32-bit heap size ( 8 with 7 zeros )
    0x100000000  4GB 64-bit breaking even with 32-bit ( 1 with 8 zeros )
    0x200000000  8GB 64-bit double 32-bit max
    0x400000000 16GB 64-bit safety
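
The hex values are easy to mistype, so it can help to let the shell do the arithmetic.  This is a sketch that assumes a shell with 64-bit integer arithmetic ( bash or ksh93, for example );  the helper name maxdata_gb is hypothetical.

```shell
#!/bin/sh
# maxdata_gb HEX: print the heap size in GB for a MAXDATA value given in hex.
# (hypothetical helper; needs 64-bit shell arithmetic)
maxdata_gb() {
    echo $(( $1 / 1024 / 1024 / 1024 ))
}

for value in 0x080000000 0x100000000 0x200000000 0x400000000; do
    echo "MAXDATA=$value -> $(maxdata_gb "$value") GB"
done
```

Counting zeros is the usual failure:  one zero too few turns an intended 8GB heap into 512MB.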


    Using 2GB heap size for a 64-bit process will almost certainly lead to an out-of-memory situation.
    Registers are twice as big, so they consume twice as much memory in the heap.
    Upgrading to a 4GB heap for a 64-bit process is just "breaking even" with 32-bit.

    A 32-bit process is constrained by the 32-bit virtual addressing limits.  Heap memory is used for dynamic requirements of obiee software, thread stacks for each of the configured threads, and sometimes for shared libraries.

    64-bit processes are not constrained in this way;  extra heap space can be configured for safety against a query that might create a sudden requirement for excessive storage.  If the storage is not available, this query might crash the whole server and disrupt existing users.

    There is no performance penalty on AIX for configuring more memory than required;  extra memory can be configured for safety.  If there are no other considerations, start with 8GB.


    Choosing a value for Thread Stack size:
    Zero is the value documented to select an appropriate default for thread stack size.  My preference is to change this to an absolute value, even if you intend to use the documented default;  it provides better documentation and removes the "surprise" factor.

    There are two thread types that can be configured.
    • GATEWAY is used by a thread pool to call a database client library to establish a DB connection.
      The default size is 256KB;  many customers raise this to 512KB ( no performance penalty for over-configuring ).
      This value must be set to 1 MB if Teradata connections are used.
    • SERVER threads are used to run queries.  OBIEE uses recursive algorithms during the analysis of query structures which can consume significant thread stack storage.  It's difficult to provide guidance on a value that depends on data and complexity.  The general notion is to provide more space than you think you need,  "double down" and increase the value if you run out, otherwise inspect the query to understand why it is too complex for the thread stack.  There are protections built into the software to abort a single user query that is too complex, but the algorithms don't cover all situations.
      256 KB  The default 32-bit stack size.  Many customers increased this to 512KB on 32-bit.  A 64-bit server is very likely to crash with this value;  the stack contains mostly register values, which are twice as big.
      512 KB  The documented 64-bit default.  Some early releases of obiee didn't set this correctly, resulting in 256KB stacks.
      1 MB  The recommended 64-bit setting.  If your system only ever uses 512KB of stack space, there is no performance penalty for using 1MB stack size.
      2 MB  Many large customers use this value for safety.  No performance penalty.

      nqscheduler does not use the NQSConfig.INI file to set thread stack size.
      If this process crashes because the thread stack is too small, use this to set 2MB:
      export OBI_BACKGROUND_STACK_SIZE=2048

  5. Shared libraries are not (shared)
    1. When application libraries are loaded at run-time, AIX makes a decision on whether to load the libraries in a "public" memory segment.  If the filesystem library permissions do not have the "Read-Other" permission bit, AIX loads the library into private process memory with two significant side-effects:
      * The libraries reduce the heap storage available.  
          Might be significant in 32-bit processes;  irrelevant in 64-bit processes.
      * Library code is loaded into multiple real pages for execution;  one copy for each process.
      Multiple execution images is a significant issue for both 32- and 64-bit processes.

      The "real memory pages" saved by using public memory segments is a minor concern.  Today's machines typically have plenty of real memory.
      The real problem with private copies of libraries is that they consume processor cache blocks, which are limited.   The same library instructions executing in different real pages will cause memory delays as the i-cache ( instruction cache, 128KB blocks ) is refreshed from real memory.   Performance loss because instructions are delayed is difficult to measure without access to low-level cache fault data.   The machine just appears to be running slowly for no observable reason.

      This is an easy problem to detect, and an easy problem to correct.

      Detection:  "genld -l" AIX command produces a list of the libraries used by each process and the AIX memory address where they are loaded.
      32-bit public segment is 13 ( "dxxxxxxx" ).   private segments are 2-a.
      64-bit public segment is 9 ( "9xxxxxxxxxxxxxxx") ; private segment is 8.

      genld -l | grep -Ev ' d| 9' | sort -k 3

      provides a list of privately loaded libraries. 

      Repair: chmod o+r <libname>
      AIX shared libraries will have a suffix of ".so" or ".a".
      Another technique is to change all libraries in a selected directory to repair those that might not be currently loaded.   The usual directories that need repair are obiee code, httpd code and plugins, database client libraries and java.
      chmod o+r /shr/dir/*.a /shr/dir/*.so
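
Before changing permissions on a whole directory, it may be useful to list what would be affected.  This is a sketch only:  the function name libs_missing_read_other is hypothetical, and /shr/dir is the same placeholder directory used above.

```shell
#!/bin/sh
# libs_missing_read_other DIR: list .a and .so files under DIR that lack the
# "Read-Other" permission bit (octal 004).
# (hypothetical helper)
libs_missing_read_other() {
    find "$1" -type f \( -name '*.a' -o -name '*.so' \) ! -perm -004 -print
}

# /shr/dir is a placeholder; substitute the obiee, httpd, database client
# or java library directory you want to inspect.
if [ -d /shr/dir ]; then
    libs_missing_read_other /shr/dir
fi
```

Each file it prints is a candidate for "chmod o+r".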

  6. Configure your system for diagnostics
    Production systems shouldn't crash, and yet bad things happen to good software.
    If obiee software crashes and produces a core, you should configure your system for reliable transfer of the failing conditions to Oracle Tech Support.  Here's what we need to be able to diagnose a core file from your system.
    * fullcore enabled. chdev -lsys0 -a fullcore=true
    * core naming enabled. chcore -n on -d
    * ulimit must not truncate core. see item 3.
    * pstack.sh is used to capture core documentation.
    * obidoc is used to capture current AIX configuration.
    * snapcore  AIX utility captures core and libraries. Use the proper syntax.
     $ snapcore -r corename executable-fullpath
       /tmp/snapcore will contain the .pax.Z output file.  It is compressed.
    * If cores are directed to a common directory, ensure obiee userid can write to the directory.  ( chcore -p /cores -d ; chmod 777 /cores )
    The filesystem must have sufficient space to hold a crashing obiee application.
    Use:  df -k
      Check the "Free" column ( not "% Used" )
      8388608 is 8GB.
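
The free-space check can be scripted so that nobody has to read the wrong column.  This sketch uses "df -Pk", the POSIX portable layout, where the fourth column of the second line is the available space in KB.  The function names and the CORE_DIR variable are hypothetical;  8388608 is the 8GB figure quoted above.

```shell
#!/bin/sh
# free_kb DIR: print the free space, in KB, of the filesystem holding DIR.
# (hypothetical helper; relies on the POSIX "df -P" column layout)
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# has_room FREE_KB NEEDED_KB: succeed when FREE_KB is at least NEEDED_KB.
has_room() {
    [ "$1" -ge "$2" ]
}

CORE_DIR=${CORE_DIR:-/tmp}   # hypothetical variable: where cores are written
NEEDED_KB=8388608            # 8GB

free=$(free_kb "$CORE_DIR")
if has_room "${free:-0}" "$NEEDED_KB"; then
    echo "$CORE_DIR: $free KB free, ok"
else
    echo "$CORE_DIR: only ${free:-0} KB free, need $NEEDED_KB KB"
fi
```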

  7. Disable Oracle Client Library signal handling
    The Oracle DB Client Library is frequently distributed with the sqlplus development kit.
    By default, the library enables a signal handler, which will document a call stack if the application crashes.   The signal handler is not needed, and is definitely disruptive to obiee diagnostics.   It needs to be disabled.   sqlnet.ora is typically located at:
       $ORACLE_HOME/network/admin/sqlnet.ora
    Add this line at the top of the file:
       DIAG_SIGHANDLER_ENABLED=FALSE
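
A minimal sketch for confirming the setting is in place.  The function name sighandler_disabled is hypothetical, and the match is deliberately loose about case and about spaces around the equals sign, which is an assumption about how the file may have been edited.

```shell
#!/bin/sh
# sighandler_disabled FILE: succeed when FILE contains a line setting
# DIAG_SIGHANDLER_ENABLED to FALSE (any case, optional spaces around "=").
# (hypothetical helper)
sighandler_disabled() {
    grep -Eiq '^[[:space:]]*DIAG_SIGHANDLER_ENABLED[[:space:]]*=[[:space:]]*FALSE[[:space:]]*$' "$1"
}

SQLNET_ORA=${ORACLE_HOME:-}/network/admin/sqlnet.ora
if [ -f "$SQLNET_ORA" ]; then
    if sighandler_disabled "$SQLNET_ORA"; then
        echo "$SQLNET_ORA: signal handler disabled"
    else
        echo "$SQLNET_ORA: add DIAG_SIGHANDLER_ENABLED=FALSE"
    fi
fi
```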

  8. Disable async query in the RPD connection pool.
    This might be an obiee 10.1.3.4 issue only ( still checking  ).
    "async query" must be disabled in the connection pools.
    It was designed to enable query cancellation to a database, and turned out to have too many edge conditions in normal communication that produced random corruption of data and crashes.  Please ensure it is turned off in the RPD.

  9. Check AIX error report (errpt).
    Errors external to obiee applications can trigger crashes.
     $ /bin/errpt -a
    Hardware errors ( firmware, adapters, disks ) should be reported to IBM support.
    All application core files are recorded by AIX;  the most recent ones are listed first.

  10. Reserved for something important to say.

© Oracle Blogs or respective owner
