Reproducible builds can have some unexpected benefits. For example, a package build machine could track compiled files that have changed since it compiled an older version of the software, and store them at the start of the archive when zipping the installable files into a package. Then when the user’s package manager updates to the newer package, it only has to download enough of it to extract all the files that have changed between versions, thus saving time and bandwidth. Software that builds reproducibly will have fewer gratuitous changes (dates, etc.), thus making this process work better.
Mm. Reproducible builds have a really wide variety of technical advantages, including implicitly removing non-deterministic or unsafe behaviour (such as downloading third-party code from the internet), detecting corrupted build environments, reducing time-to-detection of a build host compromise, validating cross-built packages, the potential space/time savings you mention, as well as numerous other debugging and testing advantages.
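To make the idea concrete, here is a rough sketch of the ordering step. It assumes the build machine kept a SHA-256 digest for every file of the previous version; the function name and the dict-based inputs are made up for illustration and are not how any particular package manager does it.

```python
import hashlib


def archive_order(old_digests, new_files):
    """Split the new version's files into (changed, unchanged) name lists.

    old_digests: {name: sha256 hex digest} recorded for the older version
                 (an assumed bookkeeping format).
    new_files:   {name: bytes} produced by the current build.
    Changed or newly added files come first so a client can stop
    downloading once it has read past them.
    """
    changed, unchanged = [], []
    for name in sorted(new_files):
        digest = hashlib.sha256(new_files[name]).hexdigest()
        if old_digests.get(name) == digest:
            unchanged.append(name)
        else:
            changed.append(name)
    return changed, unchanged
```

With a reproducible build, a file whose sources did not change hashes the same as before and lands in the second list; an embedded build date would push it into the first list for no good reason.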
In other words, even if you are an attacker, you want reproducible builds. :)
SOURCE_DATE_EPOCH is useful in those cases where for one reason or another you can't remove timestamps.
Initially we wanted to remove everything, but we also noticed it's easier to convince people/upstream to accept support for SOURCE_DATE_EPOCH than to remove the timestamp entirely.
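For anyone wondering what that support looks like in practice, here is a minimal sketch. It assumes the tool embeds a "built on" string; the function names and the banner format are hypothetical. SOURCE_DATE_EPOCH itself is just an integer number of seconds since the Unix epoch, passed in through the environment.

```python
import os
import time
from datetime import datetime, timezone


def build_timestamp(environ=os.environ):
    """Return the UTC time to embed in build output.

    Prefers SOURCE_DATE_EPOCH when it is set and falls back to the
    current time otherwise, so ordinary builds behave as before.
    """
    epoch = environ.get("SOURCE_DATE_EPOCH")
    seconds = int(epoch) if epoch is not None else int(time.time())
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def build_banner(environ=os.environ):
    # Hypothetical version banner; the format is only an example.
    stamp = build_timestamp(environ).strftime("%Y-%m-%d %H:%M:%S UTC")
    return "built on " + stamp
```

The patch to an upstream project is usually about this small, which is probably why it is an easier sell than ripping the timestamp out.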
What about the opposite direction: create builds from the same source that use different memory layouts to make the life of attackers harder? I think diversity could create resilience, too. Obviously it makes verification harder.
There are various research projects out there that, at the compiler level, create different output binaries out of the same source code to accomplish this. I'm not really following it; I just remember a professor at my undergrad was recruiting people for his lab on it, and I'm pretty sure his project was not the only one. https://ssllab.org/trac/wiki/MultiCompilerPublic
edit: see compiler/compile-time randomization; you can find other similar projects under these key terms.
You would probably need to pass in an "-frandom-seed" option and make sure your environment is the same, but I would imagine you can get that to give you a deterministic output fairly easily.
Most of the difficulty comes in the packaging, where you need to make sure that files enter the archive with the same timestamps/permissions, and in the same order.
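A minimal sketch of the packaging side, assuming the files are already in memory as a name-to-bytes mapping (the function name and the fixed mode/owner values are illustrative choices, not any distro's actual policy):

```python
import io
import tarfile


def deterministic_tar(files, mtime=0):
    """Build an uncompressed tar archive from {name: bytes}.

    Entries are written in sorted order with a fixed mtime, mode and
    owner, so the same input always produces byte-identical output.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name in sorted(files):
            data = files[name]
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            info.mtime = mtime          # e.g. the value of SOURCE_DATE_EPOCH
            info.mode = 0o644           # fixed permissions
            info.uid = info.gid = 0     # fixed numeric owner
            info.uname = info.gname = ""
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()
```

The archive is left uncompressed on purpose: gzip headers can carry their own timestamp, which would be one more thing to pin down.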
I'm not really sure how to take this recent and emerging narrative that software source code compilation is actually an unreliable, unpredictable activity.
I find it really strange that there's a sudden, intense focus on the premise that, given the same source code, when providing it to different people in the wild and asking them to try and compile it, and then having them share their results, doing so reveals variations and disparities that are difficult to explain or account for.
I know why people are suddenly paying attention to this detail: paranoia.
I remember when I noticed this concept had first started to gain some traction. Once people started trying to compile certain open-source crypto applications, and found that their binaries weren't byte-for-byte replicas, it begged the question: Am I actually encrypting my confidential data with a surreptitiously hobbled program? Is there a covert plot to infiltrate and sabotage these sorts of software projects? How would I know if the encryption was weakened in a subtle way, when I don't have the resources to try and crack it at levels that demand well-funded hardware?
But, I wonder, is this something to worry about everywhere?
Is this why so many projects are now taking on this additional activity?
Is there a panic and an anxiety driving this trend, or are we just dotting our i's and crossing our t's with yet another set of best practices?
We live in an era of viruses, rootkits, botnets, ransomware, adware, spyware, and industrial sabotage. This is at the hands of script kiddies, political activists, criminal networks, corporations, and every nation state with the budget for it - or some close approximation thereof, it seems. Sourceforge has been breached. Debian builds have been breached.
Have you managed to dodge all of the above? I haven't. I've had accounts on at least 4 services which were breached at some point. I've seen viruses, adware, and spyware. I've seen script kiddies in my communities, distributing backdoored executables, gathering passwords, and getting banned. And this is only what has failed to escape my notice.
It's not paranoia if they really are out to get you.
> But, I wonder, is this something to worry about everywhere?
Would it be useful for you to be able to detect when your build machine has been compromised, and starts producing malware-laden executables? Reproducible builds are a tool that can potentially do that for you - just diff the binaries.
Whether or not it's worthwhile to set up depends on how much of a pain in the ass making reproducible binaries is. Ongoing research into the subject is interesting, if only because it directly correlates to making it less of a pain in the ass when other people figure it out for you.
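As a rough sketch of what "diff the binaries" could look like when automated, assuming you have the artifacts from your build host and from an independent rebuild as name-to-bytes mappings (the function names are made up for this example):

```python
import hashlib


def sha256_hex(data):
    return hashlib.sha256(data).hexdigest()


def mismatched_artifacts(local, reference):
    """Return sorted names that differ between two builds.

    local, reference: {artifact name: bytes} from two independent builds
    of the same source. A name is reported if its contents differ or if
    it exists in only one of the two builds.
    """
    names = set(local) | set(reference)
    return sorted(
        name
        for name in names
        if name not in local
        or name not in reference
        or sha256_hex(local[name]) != sha256_hex(reference[name])
    )
```

An empty result only means the two builds agree with each other; it says nothing if both machines are compromised in the same way, which is why independent rebuilders matter.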
> Is there a panic and an anxiety driving this trend, or are we just dotting our i's and crossing our t's with yet another set of best practices?
I wouldn't describe such interest as panic or anxiety - but I wouldn't put it down to merely getting in line with "best practices" either. It's an interesting subject of research - and a potentially useful one at that.
In fact, OpenBSD’s and PC-BSD’s package managers do this. It’s briefly touched on in the last slide of this talk: http://www.openbsd.org/papers/eurobsdcon2015-packages.pdf