Revision History | |
---|---|
Revision 1.0-DRAFT-DO-NOT-USE-YET | 24 February 2017 |
Initial version. |
Table of Contents
This specification describes the environment variable
BUILD_PATH_PREFIX_MAP
for build tools to exchange
information about the build-time filesystem layout, to generate
reproducible output where all embedded paths are independent of that
layout.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
The canonical URI for this document is: https://reproducible-builds.org/specs/build-path-prefix-map/.
Software packages are often unreproducible because they embed build-time paths into generated files. At the time of writing, this affects an estimated 15-20% of 25000 Debian packages being unreproducible when the build path is varied.
These paths do not exist at runtime, nor do they exist in the source code. An improvement is to use relative paths that depend only on the source layout. These paths can also be easily used as runtime paths, for example to access source-code packages installed into standard paths, for IDEs and debuggers.
This requires knowledge of where the top-level source directories are, and where (if any) the top-level out-of-tree-build directories are. However, some toolchains were not designed to make this information available to the lower-level tools that generate those build outputs with embedded paths.
This specification therefore defines a distribution-agnostic standard for lower-level tools to consume this information from higher-level tools.
For more discussion on the mechanism of environment variables, see
our page on
Standard
Environment Variables. Other mechanisms were also explored,
but they were not completely effective and only fixed some cases of
the issue whilst generating new cases. For example, automatically
setting GCC"s -fdebug-prefix-map
option does not
fix other sources of irreproducibility such as
__FILE__
, and in some packages the option itself
is saved into other parts of the build output by another tool.
For more information on Reproducible Builds in general, include its motivation and its uses, see https://reproducible-builds.org/.
The environment variable BUILD_PATH_PREFIX_MAP
represents an ordered map that associates reproducible path
prefixes that may appear in the build output, with path prefixes
that exist on the build-time filesystem that must not appear in
the build output. We use the following terms:
A producer is a program that knows how to determine appropriate values for the map, and can pass this information to lower-level build tools. For example, a distribution"s top-level package builder, or a high-level buildsystem.
A consumer is a program that generates output that contains paths, but does not by itself know enough information about the filesystem layout to be able to appropriately strip the build-time specific parts of those paths. Instead, it relies on another tool to pass this information in. For example, a C compiler.
The value of the variable MUST NOT be saved as-is into any output meant to form part of a reproducible binary artefact.
This section describes a data structure encoding, from a list-of-pairs where each pair holds two strings, into a single string.
We use the phrases left- and right end of the list, to refer to the parts of the list that respectively correspond to the left (start) and right (end) ends of the string that it was parsed from, and vice versa.
On POSIX systems these strings are a sequence of 8-bit bytes. On
Windows systems these strings are a sequence of 16-bit
wchar_t
words. On both platforms, these string
types are the types of both filesystem paths and environment
variables on that platform.
When implementing this data structure encoding, either (a) you MUST directly operate on the system string types described above without also decoding or encoding them using a character encoding such as UTF-8 or UTF-16; or (b) if you must use a character encoding e.g. because your language"s standard libraries force you to, then either it is total and injective over the system string type [1], or you MUST raise a parse error for inputs where it is undefined or not injective. (See [2] for further details and guidance on how to do this.)
The encoding is as follows:
The :
character separates encoded list
items from each other.
Empty subsequences between :
characters, or
between a :
character and either the left
or right end of the value, are valid and are ignored.
[3]
Each encoded list item contains exactly one
=
character, that separates encoded pair
elements.
If there are zero or more than one =
characters, this is a parse error. [4]
The encoded pair elements may be empty; this does not need special-casing if the rest of the document is implemented correctly.
Each encoded pair element is encoded with the following mapping:
%
→ %#
=
→ %+
:
→ %.
When decoding, %
characters at the end of a
string are a parse error, as are %X
substrings where X
is any character not in
#+.
.
This encoding allows paths containing %
,
=
, :
to be mapped; since
users may want to run their builds under such paths. However
as a producer, if this is not possible for your consumers, for
example because you directly restrict the possible build
paths, then you may omit this encoding logic.
Our choice of characters means there is flexibility in the
order in which these mappings can be applied. The only
restriction is that the %
→
%#
mapping for encoding must not be applied
on already-encoded %-substrings; and that the
%+
→ =
,
%.
→ :
mappings for
decoding must not be applied on already-decoded %-substrings.
This is meant to ease implementation in a variety of
programming languages.
Our recommended approach for a high-level language with string replace:
Decoding:
check that elem
does not match the
regex /%[^#+.]|%$/g
, then
elem.replace("%.", ":").replace("%+", "=").replace("%#", "%")
Encoding:
elem.replace("%", "%#").replace("=", "%+").replace(":", "%.")
Our recommended approach for a low-level language without string replace:
Decoding:
one single left-to-right pass with lookahead (e.g. see our C example), or
one single left-to-right pass with lookbehind (e.g. see our Rust example)
Encoding:
We don"t anticipate this to be a major use-case
In the event of parse errors, the whole value of the variable should be treated as invalid rather than silently using only the "good" parts. The program should exit with an error code appropriate for the context, or if this is not possible then the parser must communicate the error in some way to the caller.
Producers SHOULD NOT overwrite existing values; instead they should append their new mappings onto the right of any existing value.
Producers that expect reproducible output MUST append at least one distinct mapping for each top-level directory that the build is expected to take place under. Each mapping has the reproducible target path prefix on the left, and the unreproducible source path prefix on the right.
Producers such as top-level distribution package builders, that expect arbitrary consumers to read this variable, MUST NOT expect any additional constraints or special contracts on the output emitted by these consumers ― only that their output be reproducible when the build path changes and the value of this variable is changed to match the new paths.
On the other hand, producers that only support a limited set of consumers, may expect that these consumers apply these mappings in particular ways.
See also the requirements for consumers in the next part for guidance.
Consumers MUST ensure that, at minimum: for all (target, source) prefix pairs in the parsed list, with rightmost pairs taking priority: strings in the final build output, that represent build-time paths derived from source, instead appear to represent potential run-time paths derived from target.
As a consequence, consumers MUST apply mappings as above, regardless of whether the source prefix ends with a directory separator or not.
We do not define "derived from" more specifically, since this may be different for different consumers (languages, buildsystems, etc), and a more specific definition might conflict with their idea of what that means. Generally, consumers SHOULD implement one of the following algorithms:
For each (target, source) prefix pair in the list-of-pairs, going from right to left: if the subject path starts with the source prefix, then replace this occurence with the target prefix, and return this new path, ignoring any pairs further left in the list.
As in (1) but with "starts with" replaced by "starts with, restricted to whole-path components". So for example,
/path/to/a/b/c
"starts with"
/path/to/a
/path/to/aa/b/c
does not "start
with" /path/to/a
This has more robust semantics but is slightly more complex to implement.
Consumers MAY for historical reasons internally store the map with the prefix pairs flipped as in (source, target), instead of (target, source) as described above. New code should prefer the latter representation.
See the appendix for implementation notes on case-insensitive filesystems.
[1]
In practice, this means any two byte sequences that are
invalid UTF-8, or wchar_t
sequences that
are invalid UTF-16, must be decoded into distinct
application-level character string values. This is not
satisfied by most standard Unicode decoding strategies, which
is to replace all invalid input sequences with
U+FFFD REPLACEMENT CHARACTER
. For these
cases, the easiest way to be compliant is to check the decode
output for this character and raise an "unsupported"
exception if it is found.
[2] Detailed implementation notes and advice are available on our wiki page. Example source code is also available there, as well as in runnable form in our git repository. The test vectors from this document"s appendix are also available there.
[3]
This is to make it easier for producers to append values,
e.g. as in
old_value += ":" + encoded_pair
which would be valid even if the value is originally
empty.
[4]
This is to "fail early" in case a naive producer
does not encode characters like =
but
the build path or target path does actually contain them.
Definitions (no HTTPS) -- see "Pathname", "String" and "Byte".
Environment
Variables (no HTTPS) for the type of
environ
.
limits.h
- implementation-defined constants (no HTTPS) for the
definition of CHAR_BIT
.
Windows strings are commonly advertised as "UTF-16", however for environment variable values and filesystem paths the system APIs do not enforce validity of the 16-bit strings passed to it. In other words, it is UCS-2, but this term is deprecated (no HTTPS).
So in practice, user code should not assume that these system strings are valid UTF-16, and should be able to deal with invalid UTF-16 strings. The easiest way to do this, is to treat them as opaque 16-bit sequences with no encoding.
File Management > About File Management > Creating, Deleting, and Maintaining Files
... > C Language Reference > ... > C Identifiers > Multibyte and Wide Characters Note that what Microsoft calls "Wide Characters" and "Unicode" is actually valid-or-invalid UTF-16 as described above, not decoded Unicode code points.
Unicode and Character Sets > About Unicode and Character Sets > Character Sets This often-cited page is not actually relevant to filesystem paths or environment variable values, and rather instead refers to how Windows applications deal with userland, not system, character data.
Copyright © 2017 See Contributors List
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
As per the section Applying the decoded structure above, programs are free to choose their own interpretation of "derived from" to deal with these.
In practice, the following may be an effective approach:
After decoding into a list-of-pairs, canonicalise all source paths in this list.
When applying the mapping, first canonicalise the path to be mapped (the "subject path"), then apply it on the canonicalised map from step (1).
The exact method for canonicalisation depends on the filesystem. It may involve (e.g.) converting to upper case for non-existing files and path components, and converting to the preserved case for existing files and path components, for filesystems that store this information. Hopefully your system already has a utility function that does this.
Here are test vectors for implementations to check their correctness. They are also available as part of an executable test suite in our git repository. They are intended for guidance and not as a substitute to the above specification. In particular, it may be possible to match the behaviour described below exactly but still violate the specification.
In the prescriptions below, statements of the form "E maps I
to O" mean that when E is set as the value of
BUILD_PATH_PREFIX_MAP
, then a compliant
consumer implementing either algorithm 1 or 2 from
Applying the decoded structure, maps
line-separated input paths I to line-separated output paths O.
Where E is wrapped by b"
and
"
, it is to be interpreted as a sequence of
bytes, represented in the same way as Python byte literals.
Case "0.basic", decode
valid:
ERROR=/a/zzz:lol=/a:ERROR=/b/1234:foo=/b:bar=/a/yyy
maps:
/a/d /b/1234 /a/yyy/xxx
to:
lol/d foo/1234 bar/xxx
Case "0.ordering", decode
valid:
ERROR=/a/b%+yyy:lol=/a:ERROR=/b/1234:foo=/b:libbar-3-bison++_41%.10.5-3~rc1pre3+dfsg1.1-3nmu1+b4=/a/b%+yyy
maps:
/a/d /b/1234 /a/b=yyy/xxx
to:
lol/d foo/1234 libbar-3-bison++_41:10.5-3~rc1pre3+dfsg1.1-3nmu1+b4/xxx
Case "pecsplit.0.allbytes-ok",
decode valid:
b"=\x01\x02\x03\x04\x05\x06\x07\x08 \x0b\x0c \x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$&\"()*+,-./0123456789;<>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe"
maps:
(same as the input for Case "0.ordering")
to:
(same as the input for Case "0.ordering")
Case "pecsplit.0.empty-ok",
decode valid:
:ERROR=/a/b%+yyy:lol=/a::foo=/b:libbar-3-bison++_41%.10.5-3~rc1pre3+dfsg1.1-3nmu1+b4=/a/b%+yyy:
maps:
(same as the input for Case "0.ordering")
to:
(same as the output for Case "0.ordering")
Case "pecsplit.0.non-utf8",
decode valid:
b"result\xf1=/a/b%+yyy:lol%#%#=/a:foo%#%#=/b%#:result\xf1=/a/b%+yyy:sec%.reteh=/a/b%+yyy\xf1"
maps:
/a/d /b/1234 /b%/1234 b"/a/b=yyy\xf1/xxx" /a/b=yyy/xxx
to:
lol%%/d /b/1234 foo%%/1234 sec:reteh/xxx b"result\xf1/xxx"
Case "pecsplit.1.long-pc-1",
decode invalid:
%%#ERROR=/a/zzz:lol=/a:ERROR=/b/1234:foo=/b:bar=/a/yyy
Case "pecsplit.1.long-pc-2",
decode invalid:
ERROR%%#=/a/zzz:lol=/a:ERROR=/b/1234:foo=/b:bar=/a/yyy
Case "pecsplit.1.long-pc-3",
decode invalid:
ERROR=/a/zzz:lol=%%#/a:ERROR=/b/1234:foo=/b:bar=/a/yyy
Case "pecsplit.1.long-pc-4",
decode invalid:
ERROR=/a/zzz:lol=/a:ERROR=/b/1234:foo=/b%%#:bar=/a/yyy
Case "pecsplit.1.long-pc-5",
decode invalid:
ERROR=/a/zzz:lol=/a:ERROR=/b/1234:foo=/b:bar=/a/yyy%%#
Case "pecsplit.1.many-=-not-ok",
decode invalid:
ERROR=/a/b=yyy:lol=/a:foo=/b:bar=/a/byyy
Case "pecsplit.1.plain-pc-1",
decode invalid:
%sERROR=/a/zzz:lol=/a:ERROR=/b/1234:foo=/b:bar=/a/yyy
Case "pecsplit.1.plain-pc-2",
decode invalid:
ERROR%s=/a/zzz:lol=/a:ERROR=/b/1234:foo=/b:bar=/a/yyy
Case "pecsplit.1.plain-pc-3",
decode invalid:
ERROR=/a/zzz:lol=%s/a:ERROR=/b/1234:foo=/b:bar=/a/yyy
Case "pecsplit.1.plain-pc-4",
decode invalid:
ERROR=/a/zzz:lol=/a:ERROR=/b/1234:foo=/b%s:bar=/a/yyy
Case "pecsplit.1.plain-pc-5",
decode invalid:
ERROR=/a/zzz:lol=/a:ERROR=/b/1234:foo=/b:bar=/a/yyy%s
Case "pecsplit.1.short-pc-2",
decode invalid:
ERROR%=/a/zzz:lol=/a:ERROR=/b/1234:foo=/b:bar=/a/yyy
Case "pecsplit.1.short-pc-4",
decode invalid:
ERROR=/a/zzz:lol=/a:ERROR=/b/1234:foo=/b%:bar=/a/yyy
Case "pecsplit.1.short-pc-5",
decode invalid:
ERROR=/a/zzz:lol=/a:ERROR=/b/1234:foo=/b:bar=/a/yyy%
Case "pecsplit.1.zero-=-not-ok",
decode invalid:
/a/byyyERROR:lol=/a:foo=/b:bar=/a/byyy