One of the great pleasures of programming is to invent something
for a narrow purpose, and then to realize that it is a general solution
to a broader problem. In hindsight, these things seem perfectly natural
and obvious. The stub proto area used to build the core Solaris
consolidation has turned out to be one of those things.
As discussed in an earlier article, the stub proto area
was invented as part of the effort
to use stub objects to build the core ON consolidation, and was meant
only as a place to hold stub objects. However, we keep finding
other uses for it. It turns out that the stub proto
is more properly thought of as an auxiliary place to put things
that we would like to put into the proto to help us build the product,
but which we do not wish to package or deliver to the end user.
Stub objects are one example, but private lint libraries,
header files, archives, and relocatable objects are all examples
of things that might profitably go into the stub proto.
Without a stub proto, these items were handled in a variety of
ad hoc ways:
If one part of the workspace needed private header files,
libraries, or other such items, it might modify its Makefile
to reach up and over to the place in the workspace where those
things live and use them from there. There are several problems
with this:
Each component invents its own approach, meaning that
programmers maintaining the system have to invest extra
effort to understand how each one works. In the past, this
has created makefile ghettos in which only the person
who wrote the makefiles feels confident to modify them,
while everyone else ignores them. This causes many
difficulties and benefits no one.
These interdependencies are not obvious to the
make utility, and can lead to races.
They are not obvious to the human reader, who may therefore
not realize that they exist, and break them.
Our policy in ON is not to deliver files into the proto unless those
files are intended to be packaged and delivered to the end user.
However, sometimes non-shipping files were copied into the proto anyway,
causing a different set of problems:
This requires a long list of exceptions to silence our
normal error checking for unused proto items.
In the past, we have accidentally shipped files
that we did not intend to deliver to the end user.
Mixing cruft with valuable items makes it hard
to discern which is which.
The stub proto area offers a convenient and robust solution. Files
needed to build the workspace that are not delivered to the end
user can instead be installed into the stub proto. No special exceptions
or custom make rules are needed, and the intent is always clear.
We are already accessing some private lint libraries and compilation
symlinks in this manner.
Ultimately, I'd like to see all of the files in the proto that have
a packaging exception delivered to the stub proto instead, and
all existing special case makefile rules eliminated. This would
include shared objects, header files, and lint libraries. I don't expect
this to happen overnight; it will be a long-term, case-by-case
project, but the overall trend is clear.
The Stub Proto, -z assert-deflib, And The End Of Accidental System Object Linking
We recently used the stub proto to solve an annoying build
issue that goes back to the earliest days of Solaris: How to ensure
that we're linking to the OS bits we're building instead of to those
from the running system.
The Solaris product is made up of objects and files from
a number of different consolidations, each of which
is built separately from the others from an independent code base
called a gate. The core Solaris OS consolidation is ON, which
stands for "Operating System and Networking". You will frequently also
see ON called the OSnet. There are consolidations for X11 graphics,
the desktop environment, open source utilities, compilers and development
tools, and many others. The collection of consolidations
that make up Solaris is known as the "Wad Of Stuff", usually referred to
simply as the WOS. None of these consolidations is self-contained.
Even the core ON consolidation has some dependencies on
libraries that come from other consolidations.
The build server used to build the OSnet must be
running a relatively recent version of Solaris, which means that
its objects will be
very similar to the new ones being built. However, the build
system objects will inevitably lag a little behind, and incompatible
differences may exist.
The objects built by the OSnet link to other objects. Some of these
dependencies come from the OSnet, while others come from
other consolidations. The objects from
other consolidations are provided by the standard library directories on the
build system (/lib, /usr/lib). The objects from the OSnet
itself are supposed
to come from the proto areas in the workspace, and not from the build
server. In order to achieve this, we make use of the
-L command line option to the link-editor.
The link-editor finds dependencies by looking in the directories
specified by the caller using the -L command line option. If the desired
dependency is not found in one of these locations, ld will then fall back
to looking at the default locations (/lib, /usr/lib). In order to
use OSnet objects from the workspace instead of the system, while still
accessing non-OSnet objects from the system,
our Makefiles set -L link-editor options
that point at the workspace proto areas. In general, this works well
and dependencies are found in the right places.
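The search order just described can be modeled in a few lines of Python. This is a simplified, hypothetical sketch rather than the link-editor's actual logic: it only looks for lib<name>.so files (ignoring archives, LD_LIBRARY_PATH, and other factors), and the helper names find_library and touch, along with the libfoo and libbar libraries, are invented for illustration. The last lookup in the demo shows how an empty -L list quietly sends every lookup to the default directories.

```python
import os
import tempfile

# Simplified, hypothetical model of how the link-editor resolves -l<name>.
# Real ld also considers archives (.a), LD_LIBRARY_PATH, and other options.


def find_library(name, l_dirs, default_dirs):
    """Return (path, via_default) for -l<name>, or None if it is not found."""
    filename = "lib" + name + ".so"
    # First, the directories given with -L, in command-line order.
    for d in l_dirs:
        path = os.path.join(d, filename)
        if os.path.exists(path):
            return path, False
    # Then the default locations (/lib and /usr/lib on a real system).
    # This fallback is silent: nothing tells the caller it happened.
    for d in default_dirs:
        path = os.path.join(d, filename)
        if os.path.exists(path):
            return path, True
    return None


def touch(path):
    """Create an empty file, making parent directories as needed."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w"):
        pass


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    proto = os.path.join(root, "proto", "usr", "lib")  # workspace proto area
    system = os.path.join(root, "usr", "lib")  # stands in for the build server
    touch(os.path.join(proto, "libfoo.so"))   # hypothetical OSnet library
    touch(os.path.join(system, "libfoo.so"))  # older copy on the build server
    touch(os.path.join(system, "libbar.so"))  # hypothetical non-OSnet library
    print(find_library("foo", [proto], [system]))  # proto copy, via -L
    print(find_library("bar", [proto], [system]))  # build server copy, via default
    print(find_library("foo", [], [system]))  # -L lost: build server copy
```

Returning the via_default flag alongside the path is what makes the silent fallback observable at all; the real link-editor offers no such signal unless asked for one.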
However, there have always been failures:
Building objects in the wrong order might mean that
an OSnet dependency hasn't been built before an object that
needs it. If so, the dependency will not be seen in the proto,
and the link-editor will silently fall back to the one on the build
server.
Errors in the makefiles can wipe out the
-L options that our top level makefiles establish to cause ld to
look at the workspace proto first. In this case, all objects will
be found on the build server.
These failures were rarely if ever caught. As I mentioned
earlier, the objects on the build server are generally quite close to
the objects built in the workspace. If they offer compatible linking
interfaces, then the objects that link to them will behave properly,
and no issue will ever be seen. However, if they do not offer compatible
linking interfaces, the failure modes can be puzzling and hard to pin
down. Either way, the build produces no warning or error.
The advent of the stub proto eliminated the first type of
failure. With stub objects, there is no dependency ordering,
and the necessary stub object dependency will always be in place
for any OSnet object that needs it. However, makefile errors
still occur, so the second type of failure remained possible.
While working on the stub object project, we realized that the stub
proto was also the key to solving the second form of failure caused
by makefile errors:
Due to the way we set the -L options to point at our
workspace proto areas, any valid object from the OSnet
should be found via a path specified by -L, and not from the
default locations (/lib, /usr/lib). Any OSnet object found
via the default locations means that we've linked to
the build server, which is an error we'd like to catch.
Non-OSnet objects don't exist in the proto areas, and so
are found via the default paths. However, if we
were to create a symlink in the stub proto pointing at each non-OSnet
dependency that
we require, then the non-OSnet objects would also be found via
the paths specified by -L, and not from the link-editor defaults.
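The symlink idea can be sketched the same way. In this hypothetical Python model, link_external_deps populates a stub proto directory with symlinks to non-OSnet libraries on the build system, after which a lookup restricted to the -L directories succeeds for them too. The function names and the libbar.so library are assumptions for illustration; this is a sketch of the idea, not the ON build's own mechanism.

```python
import os
import tempfile

# Hypothetical sketch: make non-OSnet dependencies visible through -L by
# placing symlinks to them in the stub proto.


def link_external_deps(stub_dir, system_dir, libs):
    """Create stub_dir/<lib> -> system_dir/<lib> for each listed library."""
    os.makedirs(stub_dir, exist_ok=True)
    created = []
    for lib in libs:
        link = os.path.join(stub_dir, lib)
        if not os.path.lexists(link):  # leave existing entries alone
            os.symlink(os.path.join(system_dir, lib), link)
            created.append(link)
    return created


def resolve_via_l(filename, l_dirs):
    """Return the first path under l_dirs holding filename, or None (no fallback)."""
    for d in l_dirs:
        path = os.path.join(d, filename)
        if os.path.exists(path):  # follows symlinks to the real file
            return path
    return None


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    system = os.path.join(root, "usr", "lib")  # stands in for /usr/lib
    os.makedirs(system)
    with open(os.path.join(system, "libbar.so"), "w"):
        pass  # hypothetical library from another consolidation
    stub = os.path.join(root, "stubproto", "usr", "lib")
    print(resolve_via_l("libbar.so", [stub]))  # None: only the default path has it
    link_external_deps(stub, system, ["libbar.so"])
    print(resolve_via_l("libbar.so", [stub]))  # now found through -L
```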
Given the above, we should not find any dependency objects from the
link-editor defaults. Any dependency found via the link-editor
defaults means that we have a Makefile error, and that we
are linking to the build server inappropriately. All we need
to make use of this fact is a linker option
to produce a warning when it happens.
Although warnings are nice, we in the OSnet have a zero-tolerance
policy for build noise. The
-z fatal-warnings option that was recently introduced with
-z guidance can be used to turn the warnings into
fatal build errors, forcing the programmer to fix them.
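Taken together, the check amounts to flagging any dependency whose resolved path lies in a default directory, and failing the link when warnings are fatal. The Python below is a hypothetical model of that behavior, not the link-editor's implementation; the names check_link and FatalLinkWarning, the dictionary of resolved paths, and the warning text are all invented for illustration and are not ld's actual output.

```python
import os

# Hypothetical model of a default-path check combined with fatal warnings.
# The warning text is illustrative and is not the link-editor's wording.

DEFAULT_DIRS = ("/lib", "/usr/lib")


class FatalLinkWarning(Exception):
    """Raised when default-path warnings are promoted to errors."""


def check_link(resolved, default_dirs=DEFAULT_DIRS, fatal_warnings=False):
    """Check a map of -l name -> resolved path and return any warnings.

    With fatal_warnings set, any warning aborts the "link" instead.
    """
    warnings = []
    for name, path in sorted(resolved.items()):
        if os.path.dirname(path) in default_dirs:
            warnings.append("-l%s found via default path: %s" % (name, path))
    if warnings and fatal_warnings:
        raise FatalLinkWarning("\n".join(warnings))
    return warnings


if __name__ == "__main__":
    good = {"foo": "/ws/proto/usr/lib/libfoo.so",
            "bar": "/ws/stubproto/usr/lib/libbar.so"}
    bad = {"foo": "/usr/lib/libfoo.so"}  # -L options lost by a makefile error
    print(check_link(good, fatal_warnings=True))  # []
    print(check_link(bad))  # one warning
    try:
        check_link(bad, fatal_warnings=True)
    except FatalLinkWarning as err:
        print("link failed:", err)
```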
This was too easy to resist. I integrated
7021198 ld option to warn when link accesses a library via default path
PSARC/2011/068 ld -z assert-deflib option
into snv_161 (February 2011), shortly after the stub proto was introduced
into ON. This putback introduced the -z assert-deflib option to
the link-editor:
-z assert-deflib=[libname]
Enables warning messages for libraries specified with
the -l command line option that are found by examining
the default search paths provided by the link-editor. If
a libname value is provided, the default library warning
feature is enabled, and the specified library is added
to a list of libraries for which no warnings will be
issued. Multiple -z assert-deflib options can be specified
in order to specify multiple libraries for which
warnings should not be issued.
The libname value should be the name of the library
file, as found by the link-editor, without any path components.
For example, the following enables default
library warnings, and excludes the standard C library.
ld ... -z assert-deflib=libc.so ...
-z assert-deflib is a specialized option, primarily of
interest in build environments where multiple objects
with the same name exist and tight control over the
library used is required. It is not intended for general
use.
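The exception rules in the manual page can be modeled as follows: every -z assert-deflib option enables the check, each =libname adds a bare file name to an exception set, and a library found via a default path draws a warning unless its file name is in that set. This is a hypothetical Python sketch; parse_assert_deflib and warns are invented names, and the option parsing is deliberately minimal rather than a faithful copy of ld's.

```python
import os

# Hypothetical sketch of how -z assert-deflib[=libname] options combine.


def parse_assert_deflib(args):
    """Return (enabled, exceptions) from a list of ld arguments.

    Accepts both '-z assert-deflib[=lib]' and the attached
    '-zassert-deflib[=lib]' spelling.
    """
    enabled = False
    exceptions = set()
    i = 0
    while i < len(args):
        arg = args[i]
        if arg == "-z" and i + 1 < len(args):
            opt, i = args[i + 1], i + 2
        elif arg.startswith("-z") and len(arg) > 2:
            opt, i = arg[2:], i + 1
        else:
            i += 1
            continue
        if opt == "assert-deflib":
            enabled = True
        elif opt.startswith("assert-deflib="):
            enabled = True  # a libname also enables the feature
            libname = opt[len("assert-deflib="):]
            if libname:
                exceptions.add(libname)  # bare file name, no path
    return enabled, exceptions


def warns(path, enabled, exceptions, default_dirs=("/lib", "/usr/lib")):
    """True if a library resolved to path would draw a default-path warning."""
    return (enabled
            and os.path.dirname(path) in default_dirs
            and os.path.basename(path) not in exceptions)


if __name__ == "__main__":
    args = ["-o", "prog", "-z", "assert-deflib=libc.so", "-lc", "-lfoo"]
    enabled, exc = parse_assert_deflib(args)
    print(enabled, sorted(exc))  # True ['libc.so']
    print(warns("/usr/lib/libc.so", enabled, exc))  # False: excepted
    print(warns("/usr/lib/libfoo.so", enabled, exc))  # True
```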
Note that the definition of -z assert-deflib allows for exceptions
to be specified as arguments to the option. In general, the idea of
using a symlink from the stub proto is superior because it does not
clutter up the link command with a long list of objects. When building
the OSnet, we usually use the plain form of -z assert-deflib, and
make symlinks for the non-OSnet dependencies. The exceptions to this
are dependencies supplied by the compiler itself, which are usually
found wherever the compiler happens to be installed.
To handle these special cases, the command line exception form works better.
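A build following this convention might generate its option list along these lines: the plain form when every non-OSnet dependency is reachable through the stub proto, and one exception per compiler-supplied library otherwise, reduced to its bare file name as the manual page requires. This Python helper, its name assert_deflib_flags, and the compiler library path in the demo are hypothetical; ON's makefiles do not necessarily work this way.

```python
import os

# Hypothetical helper that builds -z assert-deflib arguments for ld.


def assert_deflib_flags(compiler_libs=()):
    """Return ld arguments enabling default-library warnings.

    compiler_libs lists libraries supplied by the compiler (possibly as
    full paths); each becomes an exception named by its bare file name.
    """
    names = []
    for lib in compiler_libs:
        name = os.path.basename(lib)  # libname must not contain a path
        if name not in names:
            names.append(name)
    if not names:
        # Plain form: non-OSnet dependencies are expected to be found
        # through stub proto symlinks, so no exceptions are needed.
        return ["-z", "assert-deflib"]
    flags = []
    for name in names:
        flags += ["-z", "assert-deflib=" + name]
    return flags


if __name__ == "__main__":
    print(assert_deflib_flags())
    print(assert_deflib_flags(["/opt/hypothetical-compiler/lib/libcompiler_rt.so"]))
```

Keeping the exception list confined to compiler-supplied libraries preserves the advantage of the symlink approach: the link command stays short, and everything else is still checked.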
Following the integration of the link-editor change, I made
use of -z assert-deflib in OSnet builds with
7021896 Prevent OSnet from accidentally linking to build system
which integrated into snv_162 (March 2011). Turning on
-z assert-deflib exposed between 10 and 20 existing
errors in our Makefiles, which were all fixed in the same putback.
The errors we found in our Makefiles underscore how difficult they
can be to prevent without an automatic system in place to catch them.
Conclusions
The stub proto is proving to be a generally
useful construct for ON builds. Although it was invented to hold
stub objects, it has already allowed us to simplify a number of
previously difficult situations in our makefiles and builds.
I expect that we'll find uses for it beyond those described here
as we go forward.