A little over a year ago, we received a question from someone who
was trying to build software on Solaris. He was getting errors
from the ar command when creating an archive. At that time, the ar
command on Solaris was a 32-bit command. There was more than
2GB of data, and the ar command was hitting the file size limit
for a 32-bit process that doesn't use the largefile APIs.
Even in 2011, 2GB is a very large amount of code, so we had not heard
this one before. Most of our toolchain was extended
to handle 64-bit sized data back in the 1990s, but archives were not changed,
presumably because there was no perceived need for it. Since then, of course,
programs have continued to get larger, and in 2010, the time had finally come
to investigate the issue and find a way to provide for larger archives.
As part of that process, I had to do a deep dive into the archive
format, and also do some Unix archeology. I'm going to record what I
learned here, to document what Solaris does,
and in the hope that it might help someone else trying to solve the same
problem for their platform.
Archive Format Details
Archives are hardly cutting-edge technology. They are still used, of
course, but their basic form hasn't changed in decades. Other than to
fix a bug, which is rare, we don't tend to touch that code
much. The archive file format is described in /usr/include/ar.h,
and I won't repeat the
details here. Instead, here is a rough overview of the archive file format,
as implemented by System V Release 4 (SVR4) Unix systems such as Solaris:
Every archive starts with a "magic number". This is a sequence
of 8 characters: "!<arch>\n".
The magic number is followed by 1 or more members. A
member starts with a fixed header, defined by the
ar_hdr structure in /usr/include/ar.h.
Immediately following the header comes the
data for the member. Members must be padded at the end with newline
characters so that they have even length.
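To make the layout concrete, here is a minimal C sketch that walks the members of an in-memory archive using only the rules above: the 8-byte magic number, a fixed header per member, and data padded to an even length. The struct mirrors the conventional 60-byte ar_hdr layout from <ar.h>, but it is declared locally (as ar_hdr_sketch) so the example does not depend on any particular system header. count_members is a hypothetical helper, not part of any library.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Local mirror of the conventional 60-byte struct ar_hdr layout.
 * Every field is ASCII text, padded with spaces, not NUL-terminated. */
struct ar_hdr_sketch {
    char ar_name[16];   /* member name */
    char ar_date[12];   /* modification time, decimal */
    char ar_uid[6];     /* user id, decimal */
    char ar_gid[6];     /* group id, decimal */
    char ar_mode[8];    /* file mode, octal */
    char ar_size[10];   /* size of member data in bytes, decimal */
    char ar_fmag[2];    /* header trailer, always "`\n" */
};

#define AR_MAGIC     "!<arch>\n"
#define AR_MAGIC_LEN 8
#define AR_HDR_LEN   60

/* Hypothetical helper: count the members in an in-memory archive.
 * Returns -1 if the image is malformed. */
long count_members(const unsigned char *buf, size_t len)
{
    size_t off = AR_MAGIC_LEN;
    long n = 0;

    if (len < AR_MAGIC_LEN || memcmp(buf, AR_MAGIC, AR_MAGIC_LEN) != 0)
        return -1;                          /* no magic number */

    while (off < len) {
        const struct ar_hdr_sketch *h;
        char sizebuf[11];                   /* 10-byte field plus NUL */
        unsigned long size;

        if (len - off < AR_HDR_LEN)
            return -1;                      /* truncated header */
        h = (const struct ar_hdr_sketch *)(buf + off);
        if (memcmp(h->ar_fmag, "`\n", 2) != 0)
            return -1;                      /* not a member header */

        /* NUL-terminate a copy of the size field; strtoul stops at
         * the trailing space padding. */
        memcpy(sizebuf, h->ar_size, 10);
        sizebuf[10] = '\0';
        size = strtoul(sizebuf, NULL, 10);

        off += AR_HDR_LEN;
        if (size > len - off)
            return -1;                      /* data runs past the end */
        off += size;
        if (size & 1)
            off++;                          /* skip the '\n' pad byte */
        n++;
    }
    return n;
}
```

The walker does not interpret member names at all; it simply counts every header it finds, special members included.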
The requirement to pad members to an even length is a dead giveaway
as to the age of the archive format. It tells you that this format
dates from the 1970s, and more specifically from the era of 16-bit
systems such as the PDP-11 that Unix was originally developed on.
A format designed on a 32-bit system would have required 4-byte padding,
and one designed for the 64-bit systems we use today would probably have required 8 bytes.
2-byte alignment is a poor choice for ELF object archive members.
32-bit objects require 4-byte alignment, and 64-bit objects require
8-byte alignment. The link-editor uses mmap() to process
archives, and if the members have the wrong alignment, we have to
slide (copy) them to the correct alignment
before we can access the ELF data structures inside. The archive
format requires 2-byte padding, but it doesn't prohibit more.
The Solaris ar command takes advantage of this, and pads ELF
object members to 8-byte boundaries. Anything else is padded to 2 bytes,
as required by the format.
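The padding arithmetic itself is simple. The hypothetical helpers below round a file offset up to a power-of-two boundary, 2 for ordinary members or 8 for ELF objects, and report how many pad bytes that takes. This is only a sketch of the arithmetic; it does not describe where the Solaris ar command actually places its extra padding.

```c
#include <stdint.h>

/* Hypothetical helper: round 'off' up to a multiple of 'align',
 * which must be a power of two (2 for ordinary members, 8 for
 * ELF objects in Solaris archives). */
uint64_t round_up(uint64_t off, uint64_t align)
{
    return (off + align - 1) & ~(align - 1);
}

/* Number of newline pad bytes needed to reach that boundary. */
uint64_t pad_bytes(uint64_t off, uint64_t align)
{
    return round_up(off, align) - off;
}
```

For example, 13 bytes of text need one pad byte to reach 14, while an object whose data would start at offset 61 needs 3 bytes to reach 64.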
The archive header (ar_hdr) represents all numeric
values using an ASCII text representation rather than as binary
integers. This
means that an archive that contains only text members can be viewed
using tools such as cat, more, or a text
editor. The original designers of this format clearly thought
that archives would be used for many file types, and not just for
objects. Things didn't turn out that way, of course:
nearly all
archives contain relocatable objects for a single operating system
and machine, and are used primarily as input to the link-editor (ld).
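Because the numbers are stored as text, a reader has to parse them. Here is a minimal sketch of a parser for one fixed-width header field, assuming (as is conventional) that values are left-justified and padded with spaces rather than NUL-terminated. parse_ar_field is a hypothetical name; ar_mode is conventionally written in octal, while the other numeric fields are decimal.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: parse a fixed-width ASCII header field such as
 * ar_size (base 10) or ar_mode (base 8). The field is not
 * NUL-terminated; digits come first, followed only by spaces.
 * Returns 0 on success, -1 for an empty or malformed field. */
int parse_ar_field(const char *field, size_t width, unsigned base,
                   uint64_t *out)
{
    uint64_t v = 0;
    size_t i = 0;

    while (i < width && field[i] != ' ') {
        if (field[i] < '0' || (unsigned)(field[i] - '0') >= base)
            return -1;                      /* not a digit in this base */
        v = v * base + (unsigned)(field[i] - '0');
        i++;
    }
    if (i == 0)
        return -1;                          /* field is all padding */
    for (; i < width; i++)
        if (field[i] != ' ')
            return -1;                      /* junk after the padding */
    *out = v;
    return 0;
}
```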
Archives can have special members that are created by
the ar command rather than being
supplied by the user. These special members
are all distinguished by having a name that starts with the slash (/)
character. This is an unambiguous marker that says that the user could
not have supplied it. The reason for this is that regular archive members
are given the plain name of the file that was inserted to create them,
and any path components are stripped off. Slash is the delimiter
character used by Unix to separate path components, and as such
cannot occur within a plain file name.
The ar command hides
the special members from you when you list the contents of an archive,
so most users don't know that they exist.
Originally, there were only two special members: a symbol table that maps
ELF symbols to the object
archive members that provide them, and a string table used to hold
member names that exceed 15 characters. The '/' convention for tagging
special members provides room for adding more such members should the
need arise. As I will discuss below, we took advantage of this fact
to add an alternate 64-bit symbol table special member which is used in
archives that are larger than 4GB.
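The naming convention makes special members easy to recognize. The sketch below classifies a member by its ar_name field; the enum and function names are hypothetical. It assumes the usual SVR4 names, "/" for the symbol table and "//" for the string table, with "/" followed by digits denoting a long-name reference (described below). Any other name beginning with a slash, such as the alternate 64-bit symbol table mentioned above, falls into a catch-all case, since the sketch does not assume that member's name.

```c
#include <stddef.h>

/* Hypothetical classification of archive members by name. */
enum member_kind {
    MEMBER_REGULAR,     /* ordinary, user-supplied member */
    MEMBER_SYMTAB,      /* "/"    : symbol table */
    MEMBER_STRTAB,      /* "//"   : long-name string table */
    MEMBER_LONGNAME,    /* "/ddd" : name is at offset ddd in "//" */
    MEMBER_SPECIAL      /* any other name starting with '/' */
};

/* 'name' points at the ar_name field, 'width' is its size (16 in a
 * real header). Trailing space padding is ignored. */
enum member_kind classify_member(const char *name, size_t width)
{
    size_t len = width;

    while (len > 0 && name[len - 1] == ' ')
        len--;

    if (len == 0 || name[0] != '/')
        return MEMBER_REGULAR;              /* users can't supply a '/' */
    if (len == 1)
        return MEMBER_SYMTAB;
    if (len == 2 && name[1] == '/')
        return MEMBER_STRTAB;
    for (size_t i = 1; i < len; i++)
        if (name[i] < '0' || name[i] > '9')
            return MEMBER_SPECIAL;
    return MEMBER_LONGNAME;
}
```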
When an archive contains ELF object members, the ar command builds
a special archive member known as the symbol table that
maps all ELF symbols in the objects to the archive members that provide
them. The link-editor uses this symbol table to determine which symbols
are provided by the objects in that archive.
If an archive has a symbol table, it will always be the first
member in the archive, immediately following the magic number. Unlike
member headers, symbol tables do use binary integers to represent
offsets. These integers are always stored in big-endian format,
even on a little-endian host such as x86.
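Reading the symbol table therefore means decoding big-endian integers explicitly rather than casting pointers. The following sketch assumes the usual SVR4 layout: a 4-byte symbol count N, then N 4-byte member header offsets, then N NUL-terminated symbol names in the same order. As noted later, the link-editor scans the whole table rather than searching it; a single-symbol lookup (the hypothetical symtab32_lookup) is used here only because it makes a compact example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Decode a 32-bit big-endian integer, regardless of host byte order. */
uint32_t get_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Hypothetical helper: given the data of a 32-bit symbol table member,
 * return the archive offset of the member header that defines 'sym'.
 * Returns 0 if the symbol is absent or the table is malformed; 0 is
 * safe as a sentinel because the magic number occupies offset 0. */
uint32_t symtab32_lookup(const unsigned char *data, size_t len,
                         const char *sym)
{
    const unsigned char *names, *end = data + len;
    uint32_t n;

    if (len < 4)
        return 0;
    n = get_be32(data);                     /* number of symbols */
    if (n > (len - 4) / 4)
        return 0;                           /* offsets overrun member */
    names = data + 4 + (size_t)n * 4;       /* names follow offsets */

    for (uint32_t i = 0; i < n; i++) {
        const unsigned char *nul = memchr(names, '\0', (size_t)(end - names));
        if (nul == NULL)
            return 0;                       /* unterminated name */
        if (strcmp((const char *)names, sym) == 0)
            return get_be32(data + 4 + (size_t)i * 4);
        names = nul + 1;
    }
    return 0;
}
```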
The archive header (ar_hdr) provides 15 characters for representing
the member name. If any member has a name that is longer than this,
then the real name is written into a special archive member called
the string table, and the member's name field instead
contains a slash (/) character followed by a decimal representation of
the offset of the real name within the string table.
The string table is required to precede all normal archive members,
so it will be the second member if the archive contains a symbol
table, and the first member otherwise.
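Resolving a member name therefore takes one indirection for long names. The sketch below assumes the conventions GNU ar uses for SVR4-style archives: short names are terminated by '/' within ar_name, and each string table entry ends with "/\n". resolve_name is a hypothetical helper, and it expects the caller to have already located the string table member.

```c
#include <stddef.h>

/* Hypothetical helper: resolve a member's real name.
 * 'name' points at the 16-byte ar_name field; 'strtab' and 'strtab_len'
 * describe the data of the "//" string table member (may be NULL/0).
 * Assumes short names end with '/' inside ar_name and string table
 * entries end with "/\n", the conventions GNU ar uses.
 * Writes a NUL-terminated name to 'out'; returns 0, or -1 on error. */
int resolve_name(const char *name, const char *strtab, size_t strtab_len,
                 char *out, size_t outsz)
{
    const char *src = name;
    size_t maxlen = 16, n = 0;

    if (outsz == 0)
        return -1;
    if (name[0] == '/' && name[1] >= '0' && name[1] <= '9') {
        size_t off = 0;                     /* "/ddd": decimal offset */
        for (size_t i = 1; i < 16 && name[i] >= '0' && name[i] <= '9'; i++)
            off = off * 10 + (size_t)(name[i] - '0');
        if (strtab == NULL || off >= strtab_len)
            return -1;                      /* bad string table offset */
        src = strtab + off;
        maxlen = strtab_len - off;
    }

    /* Copy up to the '/' terminator (or a newline, defensively). */
    while (n < maxlen && src[n] != '/' && src[n] != '\n') {
        if (n + 1 >= outsz)
            return -1;                      /* output buffer too small */
        out[n] = src[n];
        n++;
    }
    out[n] = '\0';
    return 0;
}
```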
The archive format is not designed to make finding a given
member easy. Such operations move through the archive from front
to back examining each member in turn, and run in O(n) time. This would be
bad if archives were commonly used in that manner, but in general,
they are not.
Typically, the ar command is used to build a new archive from
scratch, inserting all the objects in one operation, and then the
link-editor accesses the members in the archive in constant time
by using the offsets provided by the symbol table. Both
of these operations are reasonably efficient. However, listing the contents
of a large archive with the ar
command can be rather slow.
Factors That Limit Solaris Archive Size
As is often the case, there was more than one limiting factor
preventing Solaris archives from growing beyond the 32-bit limits
of 2GB (32-bit signed) and 4GB (32-bit unsigned). These limits are listed
in the order they are hit as archive size grows, so the earlier ones
mask those that follow.
The original Solaris archive file format can handle sizes up to 4GB
without issue. However, the ar command was delivered as a
32-bit executable that did
not use the largefile APIs. As such, the ar command itself could
not create a file larger than 2GB. One can solve this by
building ar with the largefile APIs, which would allow it to
reach 4GB, but a simpler and better answer is to deliver a 64-bit
ar, which has the ability to scale well past 4GB.
Symbol table offsets are stored as 32-bit big-endian
binary integers, which limits the maximum archive size
to 4GB. To get around
this limit requires a different symbol table format, or
an extension mechanism to the current one, similar in nature to
the way member names longer than 15 characters are handled in
member headers.
The size field in the archive member header (ar_hdr)
is an ASCII string capable of representing a 32-bit unsigned
value. This places a 4GB limit on the size of any individual
member in an archive.
In considering format extensions to get past these limits, it is
important to remember that very few archives will require
the ability to scale past 4GB for many years. The old format, while
no beauty, continues to be sufficient for its purpose. This argues for
a backward compatible fix that allows newer versions of Solaris to
produce archives that are compatible with older versions of the system
unless the size of the archive exceeds 4GB.
Archive Format Differences Among Unix Variants
While considering how to extend Solaris archives to scale to 64 bits,
I wanted to know how similar archives from other Unix systems
are to those produced by Solaris, and whether they had already
solved the 64-bit issue. I've moved archives between
different Unix systems before with
good results, so I knew that there was some commonality. If it
turned out that there was already a viable de facto standard for 64-bit archives,
it would obviously be better to adopt that rather than invent something new.
The archive file format is not formally standardized. However, the
ar command and archive format were part of the original Unix
from Bell Labs. Other systems started with that format, extending it in various,
often incompatible, ways, but usually retaining the same shared core. Most of
these systems use the same magic number to identify their archives, despite
the fact that their archives are not always fully compatible with each other.
It is often true that archives can be copied between different
Unix variants, and if the member names are short enough, the ar command from
one system can often read archives produced on another.
In practice,
it is rare to find an archive containing anything other than objects for
a single operating system and machine type. Such an archive is only of
use on the type of system that created it, and is only used on that system.
This is probably why cross platform compatibility of archives between Unix
variants has never been an issue. Otherwise, the use of the same
magic number in archives with incompatible formats would be a problem.
I was able to find information for a number of
Unix variants, described below. These can be divided roughly into three
tribes: SVR4 Unix, BSD Unix, and IBM AIX. Solaris is an SVR4 Unix, and its
archives are completely compatible with those from the other members
of that group (GNU/Linux, HP-UX, and SGI IRIX).
AIX
AIX is an exception to the rule that Unix archive formats are all
based on the original Bell Labs Unix format. It appears that AIX
supports 2 formats (small and big), both of which differ in
fundamental ways from other Unix systems:
These formats use a different magic number than the standard
one used by Solaris and other Unix variants.
They include support for removing archive members from
a file without reallocating the file, marking dead areas as
unused, and reusing them when new archive items are inserted.
They have a special table of contents member (File Member Header)
which lets you find out everything that's in the archive
without having to actually traverse the entire file. Their symbol
table members are quite similar to those from other systems, though.
Their member headers are doubly linked, containing offsets to
both the previous and next members.
Of the Unix systems described here, AIX has the only format I saw
that will have reasonable insert/delete performance for really large
archives. Every other format has O(n) performance, and will be
slow to use with large archives.
BSD
BSD has gone through 4 versions of archive format, which are
described in their
manpage. They use the same member header as SVR4, but their symbol table
format is different, and their scheme for long member
names puts the name directly after the member header rather than into
a string table.
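For comparison, here is a sketch of the long-name convention used by 4.4BSD-derived systems: the name field holds "#1/" followed by the length of the real name, the name itself occupies the first bytes of the member data, and ar_size counts those bytes too. bsd_long_name_len is a hypothetical helper; the BSD ar manpage remains the authoritative description.

```c
#include <string.h>

/* Hypothetical helper for the 4.4BSD long-name convention. If the
 * ar_name field holds "#1/len", the real name occupies the first 'len'
 * bytes of the member data (and ar_size includes them). Returns 'len',
 * 0 for an ordinary name, or -1 if the field is malformed. */
long long bsd_long_name_len(const char *name)
{
    long long len = 0;

    if (strncmp(name, "#1/", 3) != 0)
        return 0;                           /* ordinary in-place name */
    if (name[3] < '0' || name[3] > '9')
        return -1;
    for (int i = 3; i < 16 && name[i] >= '0' && name[i] <= '9'; i++)
        len = len * 10 + (name[i] - '0');
    return len;
}
```

Keeping the name inside the member data avoids a separate string table, at the cost of every reader having to subtract the name length from ar_size to find the real data.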
GNU/Linux
The GNU toolchain uses the SVR4 format, and is compatible with
Solaris.
HP-UX
HP-UX seems to follow the SVR4 model, and is compatible with
Solaris.
IRIX
IRIX has 32-bit and 64-bit archives. The 32-bit format is the
standard SVR4 format, and is compatible with Solaris. The 64-bit format
is the same, except that the symbol table uses
64-bit integers.
IRIX assumes that an archive
contains objects of a single ELFCLASS/MACHINE, and any archive containing
ELFCLASS64 objects receives a 64-bit symbol table. Although they only use
it for 64-bit objects, nothing in the archive format limits it
to ELFCLASS64. It would be perfectly valid to produce a 64-bit
symbol table in an archive containing 32-bit objects, text files, or
anything else.
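Since the only difference is the integer width, one reader can handle both forms. This sketch assumes the same layout as the classic SVR4 table (count, offsets, names) with every integer widened to 8 big-endian bytes, which is how the IRIX 64-bit format is described above; get_be and symtab_offsets are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Read an unsigned big-endian integer of 'width' bytes (4 or 8). */
uint64_t get_be(const unsigned char *p, size_t width)
{
    uint64_t v = 0;
    for (size_t i = 0; i < width; i++)
        v = (v << 8) | p[i];
    return v;
}

/* Hypothetical helper: extract member offsets from a symbol table whose
 * integers are 'width' bytes wide (4 classic, 8 widened). Stores up to
 * 'max' offsets in 'out' and returns the symbol count, or -1 if the
 * table is truncated. */
long symtab_offsets(const unsigned char *data, size_t len, size_t width,
                    uint64_t *out, size_t max)
{
    uint64_t n;

    if (len < width)
        return -1;
    n = get_be(data, width);                /* symbol count comes first */
    if (n > (len - width) / width)
        return -1;                          /* offsets run past the member */
    for (uint64_t i = 0; i < n && i < max; i++)
        out[i] = get_be(data + width * (i + 1), width);
    return (long)n;
}
```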
Tru64 Unix (Digital/Compaq/HP)
Tru64 Unix uses a format much like ours, but their symbol table is
a hash table, making specific symbol lookup much faster. The Solaris
link-editor uses archives by examining the entire
symbol table looking for unsatisfied symbols for the link, and not
by looking up individual symbols, so there would be no benefit to
Solaris from such a hash table. The
Tru64 ld presumably uses a different approach, one in which the hash
table pays off.
Widening the existing SVR4 archive symbol tables rather than inventing something
new is the simplest path forward. There is ample precedent for this
approach in the ELF world. When ELF was extended to support 64-bit objects,
the approach was largely to take the existing data structures, and define
64-bit versions of them. We called the old set ELF32, and the new set ELF64.
My guess is that there was no need to widen the archive format at that time,
but had there been, it seems obvious that this is
how it would have been done.
The Implementation of 64-bit Solaris Archives
As mentioned earlier, there was no desire to improve the fundamental
nature of archives. They have always had O(n) insert/delete behavior, and
for the most part it hasn't mattered. AIX made efforts to improve this,
but those efforts did not find widespread
adoption. For the purposes of link-editing, which is essentially the
only thing that archives
are used for, the existing format is adequate, and issues of backward
compatibility trump the desire to do something technically better.
Widening the existing symbol table format to 64 bits is therefore the
obvious way to proceed. For Solaris 11, I implemented that,
and I also updated the ar command so that a 64-bit
version is run by default. This eliminates the 2 most significant limits
to archive size, leaving only the limit on an individual archive member.
We only generate
a 64-bit symbol table if the archive exceeds 4GB, or when the new -S
option
to the ar command is used. This maximizes backward compatibility, as an
archive produced by Solaris 11 is highly likely to be less than 4GB in
size, and will therefore employ the same format understood by
older versions of the
system. The main reason for the existence of the -S option is to allow
us to test the 64-bit format without having to construct huge archives to
do so. I don't believe it will find much use outside of that.
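The selection rule described above amounts to a one-line test. In this sketch, the force_sym64 flag stands in for the -S option, and need_sym64 and sym_word_size are hypothetical helpers rather than code from the Solaris ar command.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: should ar emit the 64-bit symbol table?
 * 'force_sym64' models the -S option. Otherwise the classic 32-bit
 * table is kept unless the archive is too large for 32-bit offsets,
 * so most archives stay readable by older systems. */
bool need_sym64(uint64_t archive_size, bool force_sym64)
{
    return force_sym64 || archive_size > UINT32_MAX;
}

/* Width in bytes of each integer in the chosen symbol table. */
size_t sym_word_size(uint64_t archive_size, bool force_sym64)
{
    return need_sym64(archive_size, force_sym64) ? 8 : 4;
}
```

One wrinkle this sketch ignores: the archive size itself depends on which table is chosen, since the wider table takes more space, so a real implementation has to account for that when deciding.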
Other than the new ability to create and use extremely large archives,
this change is largely invisible to the end user. When reading an
archive, the ar command will transparently accept either form of symbol table.
Similarly, the ELF library (libelf) has been updated to understand
either format. Users of libelf (such as the link-editor ld) do not need
to be modified to use the new format, because these changes are encapsulated
behind the existing functions provided by libelf.
As mentioned above, this work did not lift the limit on the
maximum size of an individual archive member. That limit remains fixed
at 4GB for now. This is not because we think objects will never get that
large, for the history of computing says otherwise. Rather, this is based
on an estimation that single relocatable objects of that size will not
appear for a decade or two. A lot can change in that time, and it is better not
to overengineer things by writing code that will sit and rot for years
without being used.
It is not too soon, however, to have a plan for that eventuality. When
the time comes to lift this limit, I believe that there
is a simple solution that is consistent with the existing format.
The archive member header size field is an ASCII string, like the name,
and as such, the overflow scheme used for long names can also be used to
handle the size. The size string would be placed into the archive string
table, and its offset in the string table would then be written into the
archive header size field using the same format "/ddd" used for overflowed
names.
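To illustrate the idea, here is a hypothetical sketch of that scheme. No such format exists today: encode_ar_size and decode_ar_size are invented names, and the choices to terminate the size string with "/\n" (like a long name) and to overflow only sizes above UINT32_MAX are assumptions made for the example.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* HYPOTHETICAL format: encode 'size' into the 10-byte ar_size field.
 * Sizes that fit in 32 bits are written in place as today. Larger ones
 * are appended to the string table as "ddd/\n" (terminated like a long
 * name) and the field holds "/off", the entry's offset. 'strtab' has
 * room for 'cap' bytes, of which '*strtab_len' are in use.
 * Returns 0 on success, -1 if something does not fit. */
int encode_ar_size(uint64_t size, char *field,
                   char *strtab, size_t cap, size_t *strtab_len)
{
    char tmp[32];
    int n;

    if (size <= UINT32_MAX) {
        n = snprintf(tmp, sizeof tmp, "%" PRIu64, size);
        if (n < 0 || n > 10)
            return -1;
    } else {
        char entry[32];
        int elen = snprintf(entry, sizeof entry, "%" PRIu64 "/\n", size);

        if (elen < 0 || (size_t)elen > cap - *strtab_len)
            return -1;                      /* string table is full */
        n = snprintf(tmp, sizeof tmp, "/%zu", *strtab_len);
        if (n < 0 || n > 10)
            return -1;                      /* offset too long for field */
        memcpy(strtab + *strtab_len, entry, (size_t)elen);
        *strtab_len += (size_t)elen;
    }
    memset(field, ' ', 10);                 /* header fields are space padded */
    memcpy(field, tmp, (size_t)n);
    return 0;
}

/* HYPOTHETICAL inverse: read a size written by encode_ar_size. */
int decode_ar_size(const char *field, const char *strtab, size_t strtab_len,
                   uint64_t *out)
{
    const char *p = field, *end = field + 10;
    uint64_t v = 0;

    if (*p == '/') {                        /* overflowed: follow offset */
        size_t off = 0;
        for (p++; p < end && *p >= '0' && *p <= '9'; p++)
            off = off * 10 + (size_t)(*p - '0');
        if (off >= strtab_len)
            return -1;
        p = strtab + off;
        end = strtab + strtab_len;
    }
    if (p == end || *p < '0' || *p > '9')
        return -1;
    while (p < end && *p >= '0' && *p <= '9')
        v = v * 10 + (uint64_t)(*p++ - '0');
    *out = v;
    return 0;
}
```

A nice property of this approach is that archives without huge members are byte-for-byte unchanged, in keeping with the backward compatibility goal discussed earlier.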