So how does the GC performance of Go compare to something like Java/the Hotspot ...

_ph_ · on July 6, 2016

The approaches to GC are difficult to compare, and Java offers a selection of garbage collectors. Overall, the Java collectors are very sophisticated and tuned over years, so in principle are excellent. The downside is, that the Java language itself puts a lot of stress on the GC. The biggest problem is, that Java offers no "value" types beyond the builtin int, double,... So everything else has to be allocated as a separate object and pointed to via references. The GC then has to trace all these references, which takes time. While a collection of the youngest generation in Java is extremely fast, a global GC can take quite some time.

Go on the other side has structs as values, so the memory layout is much easier for the GC. Go always performs full GCs, but mostly running in parallel with the application, a GC cycle only requires a stop-the-world phase of a few milliseconds (for multi gigabyte heaps).

All these numbers of course depend a lot on what your application is doing, but overall Go seems to be doing very well with its newest iterations of the GC.

_0w8t · on July 6, 2016

Another problem with Java is inability to return multiple values. For that one often creates a wrapper object holding the results. JVM can recognize this pattern and stack allocate those wrapper objects, but it does not happen always increasing GC pressure.

pron · on July 6, 2016

The lack of custom value types has ramifications not only for GC, but for cache behavior. Which is why there's serious work on custom value types for Java; it's the major feature planned for Java 10.

Of course, most of the old-gen GC work in G1 is also done in parallel with the application, too.

needusername · on July 7, 2016

> Of course, most of the old-gen GC work in G1 is also done in parallel with the application, too.

Did you want to write concurrently? If so that would be wrong because evacuation can't be done concurrently with the application in G1, only initial marking.

pron · on July 7, 2016

I didn't say that all work is done concurrently with the app. How much work needs to be done in the STW phase is application-dependent. It is likely that if the application exhibits a transactional behavior, namely that objects are created in the beginning of a transaction and are all reclaimed at the end, there's very little need compaction required, as entire regions are likely to be completely free.

Cyph0n · on July 6, 2016

Another strong point for the JVM is the availability of alternate implementations from several vendors. Is this possible for Go say in the future?

_ph_ · on July 6, 2016

It is certainly possible. There are already two Go implementations, the official one, and a gcc based one. And due to the fact that the whole Go implementation is available under BSD license, allows anyone without any license worries to fork a custom Go implementation.

lloeki · on July 6, 2016

You forgot GopherJS.

arcticbull · on July 6, 2016

There's also llgo.

sagichmal · on July 6, 2016

Many view this as an overall negative point, particularly for those who are tasked with running complex JVM applications without deep operational knowledge...

jerf · on July 6, 2016

This observation has fed into the Go team's design philosophy; they're doing their best to minimize the "knobs" the GC has, because tuning them is inevitably a black art. As far as I know, there's still just one right now, GOGC, documented in the third paragraph of https://golang.org/pkg/runtime/ .

sievebrain · on July 6, 2016

Yes, but HotSpot G1 is meant to be usable with only a single knob too (target pause time). Other knobs do exist, but only for unusual cases where you want to precisely control the GC's operation to work around some bad app/gc interaction, for instance. And Go lacking such knobs is probably not really a feature: it's not like the Java guys set out from the start saying "we will build a complicated and hard to configure GC". It's just that as you work through more and more customer cases, these knobs become valuable for the hard ones.

jerf · on July 6, 2016

The point is not that lacking knobs is a feature. The point is that the designers are well aware of the issues and they are explicitly making it a goal that knobs should be unnecessary. (Especially since it has had some knobs off and on in the various versions, as mentioned in the article.)

This is in contrast to something we've probably all done at one point or another, which is just to add a checkbox to avoid having an argument about what the behavior should be. They're committing to having the argument out instead of "just adding knobs".

They also have a track record of, for better or worse, just refusing to add knobs and telling you to either do without or use a different language. If you've got an intensely GC-based workload, I'd consider using something other than Go. (However, bear in mind what may be an intensely GC-based workload in Java may not be in Go, since Go has value types already.)

pcwalton · on July 6, 2016

HotSpot cares a lot about proper defaults too. I don't think that there's a significant philosophy difference between HotSpot and Go there. The philosophy difference is, as you say, that Go is opposed to adding configuration options, while HotSpot does have those options (per customers' requests).

tmd83 · on July 6, 2016

I have seen how G1 suppose to have this one flag but I often get a bad feeling about G1 (without really using it much). It seems it reduces average GC pauses but performs really bad in the really lower (CMS) range. One of thing that looked bad to me is that originally they believed that it can completely ignore the generational hypothesis and then had to brought that back when finding the performance bad. There are also other issues like cross region links that it doesn't handle well. It seems to me that they thought their regional idea was a silver bullet and now tweaking it all over the place. It is a nice GC probably but I don't think its really the GC to end all other GC like Oracle wants it to be.

nickpsecurity · on July 6, 2016

I get what they're doing but it's a false dichotomy and therefore wrong. The dichotomy being it has to be one, ultra-simple GC or what Java was doing.

https://stas-blogspot.blogspot.com.au/2011/07/most-complete-...

It's well-known after over a decade of research and deployments in GC's that certain styles match certain workloads better. So, multiple ones should be available. This can be a small number that are largely pre-built with sane settings. What's left to tune can likewise be small: pause time, max memory, or whatever. There can also be a default as in current Go that covers 95% of apps well. The result is that specific apps or libraries if they went that far can have GC well-suited to their requirements with about one page of HTML describing what those GC's do and how to choose them.

That's what they should do. It will be easy for them and developers. Nothing like JVM mess. Still avoids one-size-fits-all: longest-running, failed concept in IT. Meanwhile, I can't wait to see someone make a HW version of their GC like I've seen in LISP and RT-Java research. IT would be badass given the current metrics. Allow whole OS to be done memory managed like A2 Bluebottle Oberon without performance penalty.

indolering · on July 7, 2016

Can hardware accelerated GC be generalized enough to make it useful? Isn't that what killed previous efforts?

nickpsecurity · on July 8, 2016

Previous efforts got killed because the off-brand hardware, especially the CPU's, were never as fast and/or cheap as Intel/AMD. They also required new tooling and such most of the time. This happened to LISP machines and apparently Azul's Vega's as they're pushing SW solution these days. So, that's my guess.

Most general I saw was in a Scheme CPU where the designer put the GC in the memory subsystem. The Scheme CPU would just allocate and deallocate memory. The GC tracked what was still in use on its own in concurrent fashion. Like reference counting I think. Eventually, it would delete what wasn't needed. Pretty cool stuff.

bad_user · on July 6, 2016

I don't see how it can be a negative. The availability of multiple vendors has given us commercial solutions tuned for particular needs.

For example Azul's C4 garbage collector which they claim is pauseless: https://www.azul.com/resources/azul-technology/azul-c4-garba... ; a pauseless GC is great if you want to tackle real-time systems. For real-time systems actually most garbage collected platforms are unsuitable.

But even more problematic is that stop-the-world latency is directly proportional to the size of the heap memory and today's mainstream garbage collectors cannot cope with more than 4 GB of heap memory without introducing serious latency that's measured in seconds. Think about that for a second - with most GC implementations you cannot have a process that can use 20 GB of RAM, which is pretty cheap these days btw. So keeping a lot of data in memory, like databases are doing, is not feasible with a garbage collector.

sedachv · on July 6, 2016

> For example Azul's C4 garbage collector which they claim is pauseless: https://www.azul.com/resources/azul-technology/azul-c4-garba.... ; a pauseless GC is great if you want to tackle real-time systems. For real-time systems actually most garbage collected platforms are unsuitable.

As far as I can tell Azul's collector claims to be pauseless because they use the new x86 nested page tables (https://en.wikipedia.org/wiki/Second_Level_Address_Translati...) to implement a read barrier (interesting aside: this means it should be possible to implement a read barrier on CPUs without nested page tables by moving the GC into the kernel). Here is an interesting discussion: http://stackoverflow.com/questions/4491260/explanation-of-az...

That still does not mean that C4 is necessarily real-time. You have to take a fundamentally different approach to GC to guarantee real-time bounds (see these papers on the Metronome collector: http://researcher.watson.ibm.com/researcher/files/us-bacon/B... https://www.cs.purdue.edu/homes/hosking/690M/ft_gateway.cfm....) and that comes with a restriction that ties your program's allocation rate to the scheduling of the GC. I am still skeptical about this - it is easy to imagine coming up with an adversarial allocation pattern that breaks time bound guarantees because of some detail of the GC implementation, so both the algorithm and every implementation will need proofs.

> So keeping a lot of data in memory, like databases are doing, is not feasible with a garbage collector.

It is very feasible if you do not make garbage. Either mmap some memory that the GC won't touch or pre-allocate large arrays of primitive types.

Cyph0n · on July 6, 2016

You also have Excelsior[0], which provides full Java AOT (ahead-of-time) compilation to native machine code.

[0]: https://www.excelsiorjet.com/

rogerdpack · on July 6, 2016

Is the output code faster than HotSpot after its been warmed up?

sievebrain · on July 6, 2016

Eh, HotSpot can handle heaps of hundreds of gigabytes with pause times in the 100msec range. It takes a bit of tuning but can be done with the basic open source code.

bad_user · on July 6, 2016

What HotSpot are you talking about? I assume you aren't talking about the Serial, or the Parallel GC or about CMS, which are the older generation, but about G1, right?

Well, I have extensive experience with tuning G1. G1 is a good GC, capable of low latency incremental pauses.

The problem is that with a stressed process, at some point G1 still falls back to a full freeze-the-world mark-and-sweep. For 50 GB I've seen the pause last for over 2 minutes !!!

mioelnir · on July 7, 2016

2 minutes is cute. If you stress a CMS setup hard enough that the young generation is completely full, it will allocate directly in the old generation. This of course screws the full gc heuristic totally, up to the point where the GC is started too late and you fully run out of memory. At which point the JVM drops down to a single threaded oldschool serial GC as last line of defense. On a 96GiB heap, that thing can take hours; all stuck 100% on a single cpu with even signal handling suspended. Fun times.

That said, for heaps above 32ish GiB, we still go with our tuned CMS settings and overcommit one or two additional memory modules. It's a lot cheaper than the time it takes trying to tune in G1 on a large heap with a lot of gc pressure.

rbjorklin · on July 6, 2016

Got any sources for that? I would be very interested in these tuning parameters and some explanation of what they do! :)

coredog64 · on July 6, 2016

Cassandra works around this by pushing some of that responsibility onto the OS's disk-caching mechanism.

Thaxll · on July 6, 2016

No it won't, I asked and got an answer from Brad Fitzpatrick:

https://www.reddit.com/r/golang/comments/4pmlv9/transaction_...

_ph_ · on July 6, 2016

The link you posted was about switchable GCs in the official Go runtime, which won't be there, but the question was whether there are multiple Go implementations.

needusername · on July 6, 2016

Is there any data from production systems available that confirms that is an issue in most/many real world applications (the lack of value types)? From the allocation profiles I have seen in the applications I have seen most allocations in Java programs seem to be from strings, often in logging, or byte array buffers. Value types would no help here but compressed strings would.

nvarsj · on July 6, 2016

A significant drawback of the hotspot JVM is the amount of memory required for even simple apps. At least 64Mi for the most simple, and typically much higher. A typical web app with a 1000 request threads will use something like 1.5Gi of memory (512Mi heap, 1Gi for thread stacks, classes, etc).

Golang apps tend to happily run with less than 100Mi, so are well suited as daemon processes that don't get in the way.

However if you need to support a large amount of dynamic state (> 1Gi), the hotspot GC is very difficult to beat.

sievebrain · on July 6, 2016

Memory usage in Java can be misleading. Some versions of the JVM will happily take ALL your free RAM if it thinks it's sitting there unused because there's a RAM/CPU tradeoff in garbage collected systems: the less frequently you GC the less CPU time you burn and the faster the app runs.

If your machine actually does have gobs of free RAM, it therefore makes sense for Java to use all of it.

If your machine has gobs of free RAM you were planning on using for something else after your Java app started, well, that's something the JVM couldn't know. Some versions (on Windows?) monitor free memory and adjust down its own usage if you seem to be consuming the headroom, but on other platforms, you just have to tell Java it's got a limit and can't go beyond it.

102030485868 · on July 7, 2016

Unless you specifically tell the JVM otherwise using the -Xmx flag.

See: https://docs.oracle.com/cd/E13150_01/jrockit_jvm/jrockit/jrd...

geodel · on July 6, 2016

Technically hotspot GC might do more work in same amount of time but Go's GC makes some performance guarantees like <10ms STW phase which hotspot do not claim or offers for large heaps.

pcwalton · on July 6, 2016

HotSpot does offer that. It's basic functionality that all incremental or concurrent garbage collectors offer. You can adjust the max pause time with -XX:MaxGCPauseMillis.

pas · on July 8, 2016

That's a target. (GC ergonomics)

It won't guarantee it, just tires to size things (eden space, survivor spaces) and time things to meet its target.

But it's a fickle beast. And usually it requires a lot of tinkering with the code for it to be able to meet it. And then it's easier to disable ergonomics, set fixed sizes, and just enjoy how blazingly fast CMS is, restart the app every few weeks (CMS heap fragmentation), and try G1 with every new point release, maybe finally it beats CMS.

sievebrain · on July 6, 2016

Yes, but that's because the Go GC doesn't compact, and nobody quotes throughput numbers. Building a slow GC that leaves the heap fragmented but has short pauses is not that difficult indeed, the tricky part starts when you wish to combine all these things together.

geodel · on July 6, 2016

Of course Java has all technical bullet points checked and may be superior GC from strictly that point of view. But Go has 2 things from users' perspective upfront which Java lacks.

1. It uses about an order of magnitude less memory than Java.

2. It openly proclaims <10ms STW pauses for GC.

mdasen · on July 6, 2016

Go definitely does not use "an order of magnitude less memory than Java". That would mean that a Go program that uses 1GB of memory would need 10GB in Java.

Google wrote a paper comparing C++, Java, Scala, and Go and definitely did not find that (https://days2011.scala-lang.org/sites/days2011/files/ws3-1-H...).

I like Go and it has many wonderful qualities. Still, it's important to be realistic.

richard_todd · on July 6, 2016

I think for "small-data" programs it does work out to about an order of magnitude of overhead in Java. I have ported several small Java programs to Go and I see it (like 100MB Java vs 8MB Go). One encryption program I coded multiple versions of ran 350k C vs. 1.3MB Go vs. 16MB Java.

Programs holding GBs of data in arrays would look much closer, though, I imagine, as the overhead would be dwarfed by the data itself.

geodel · on July 6, 2016

Possible. But typically idiomatic Java usage patterns with collection types have huge overheads. So unless Java code is written in specially memory efficient way that memory usage gap should remain

https://www.cs.virginia.edu/kim/publicity/pldi09tutorials/me...

lossolo · on July 6, 2016

In my experience it does use less memory than Java. I had application written in Java which i migrated to Go and i see less memory usage.

geodel · on July 6, 2016

Well depending on program. Here is what I see.

https://benchmarksgame.alioth.debian.org/u64q/go.html

igouy · on July 6, 2016

"When there's a big disparity between those Java and [Go] programs, it's mostly the default JVM memory allocation versus whatever [Go] needs."

http://programmers.stackexchange.com/questions/189542/is-mem...

geodel · on July 6, 2016

I looked at this:

http://stackoverflow.com/questions/5673388/what-is-the-small...

For me using Java 1.8 the number is around 1.5MB. Something similar in Go is 60KB

kasey_junk · on July 6, 2016

The lack of compaction is not just a bullet point. Especially on large memory data-sets Go's GC will start to suffer and not just on the collection side of things but on the allocation side as well.

gribbly · on July 6, 2016

On the other hand, compaction is itself an extremely expensive operation as it means moving the blocks of memory (allocations) around in the heap.

Do you know of any real-world examples where the lack of compaction is impacting usage of Go ?

pcwalton · on July 6, 2016

> On the other hand, compaction is itself an extremely expensive operation as it means moving the blocks of memory (allocations) around in the heap.

That's why you don't do it often.

> Do you know of any real-world examples where the lack of compaction is impacting usage of Go ?

The biggest problem with all nongenerational GCs, including Go's, is lack of bump allocation in the nursery. You really want a two-space copying collector (or a single-space one) so that allocation in the nursery can be reduced to 3 or 4 machine instructions. By allowing fragmentation in the nursery, Go pays a heavy cost in allocation performance relative to HotSpot.

gribbly · on July 6, 2016

>That's why you don't do it often.

You need to do it when you become too fragmented (or suffer the same potentially poor allocation performance as Go), how often that happens largely depends on what the application is doing.

>including Go's, is lack of bump allocation in the nursery.

Yes, but as I recall this is in the future roadmap for consideration/attempt.

And again, as with everything it's not a silver bullet, as you sacrifice the high cost of promotion (again expensive moving of memory) in order to have very fast allocations while the nursery isn't fragmented or full.

>Go pays a heavy cost in allocation performance relative to HotSpot.

But not the cost of compaction/promoting, which are also heavy when they need to be performed.

That said, I personally believe a bump allocator with generational copying will be a 'net win' if implemented in Go's GC, but all things considered I'd rather see some cold hard numbers confirming it.

pcwalton · on July 6, 2016

> Yes, but as I recall this is in the future roadmap for consideration/attempt.

Not according to the transactional collector proposal. By not unconditionally tenuring young objects, it sacrifices one of the main benefits of generational GC: bump allocation in the nursery.

> And again, as with everything it's not a silver bullet, as you sacrifice the high cost of promotion (again expensive moving of memory) in order to have very fast allocations while the nursery isn't fragmented or full.

You're questioning the generational hypothesis. Generational GC was invented in 1984. Whether generational GC works is something we've had over 30 years to figure out, and the answer has consistently been a resounding "yes, it works, and works well".

> all things considered I'd rather see some cold hard numbers confirming it.

Again, we have over 30 years of experience. Generational GC is not some new research idea that we have to try to see if it works. The odds that things will be different in Go than in the myriad of other languages that preceded it are incredibly slim.

gribbly · on July 6, 2016

>Not according to the transactional collector proposal.

That's hardly the end all of Go GC development, also as I understand it's not even certain it will be used in a Go release as it depends on it actually showing the benefits aren't just theorethical.

>Whether generational GC works is something we've had over 30 years to figure out, and the answer has consistently been a resounding "yes, it works, and works well".

This was not about generational GC's 'working' or not, it was if it is the best solution for the typical workloads of Go applications.

tptacek · on July 7, 2016

Speaking as someone who works in Go full-time and really likes it: kind of the whole point of Go is that it's workloads are very much like every other language.

gribbly · on July 7, 2016

I figured the wide use of goroutines would be the cause for different choices in Go's GC compared to other GC's for languagues not sharing that characteristic.

From what I understand, the upcoming transactional collector is written directly with Go's goroutines in mind.

pcwalton · on July 8, 2016

> From what I understand, the upcoming transactional collector is written directly with Go's goroutines in mind.

I'm pretty skeptical that it will produce wins over a traditional generational GC. At best the "transactional hypothesis" will roughly approximate the generational hypothesis, without the primary benefit that truly generational GC gives you, namely bump allocation in the nursery. Time will tell.

weberc2 · on July 6, 2016

How do you mean? The 1.5 benchmarks showed Go cleaning up a 250GB heap in 5ms.

kasey_junk · on July 6, 2016

Your allocation times will go up as the heap gets fragmented and the it becomes harder to find places to put new items. This is especially true of large interconnected data sets.

Further, depending on your data access patterns you can see data access start to degrade over time as well because the memory locality is worse.

GC benchmarks are great at showing how well 1 part of memory management is behaving (ie the deallocation step) but it doesn't do much for talking about the other 2 parts, allocation and access.

That said, I use Go lang every day and the GC improvements to date have been great, especially given the kinds of memory patterns lots of the services I write have (small, short lived items that aren't really connected to each other). But there are definitely memory patterns where Hotspot will smoke the golang memory system and that doesn't begin to describe something like Zing.

geodel · on July 6, 2016

Zing is not some magic. To start with it needs heavily over provisioned servers with like 64GB+ RAM recommended. And to have the pauseless GC it needs additional contingency and pause prevention memory pools on top of -Xmx memory settings.

And still I have heard the one of the best way to control GC in many trading systems where Zing might be popular is to just provision 100s of GBs of memory heap and simply restart server once trading day is over.

kasey_junk · on July 7, 2016

When I was writing trading systems on JVMs we were much more worried about allocation costs and memory access patterns than we were about GC. The former issues impact the normal latency while the latter impacted the worst case. Now you needed to think about and deal with the worst case, but as you say, making a system that doesn't GC often is pretty straight forward.

Now that I'm writing high throughput systems in go I use many of the same techniques that I did writing low latency systems on the JVM (arena allocation, memory locality, etc). This is because the other 2 parts of memory management, allocation and access, continue to be major drivers of performance even though the deallocation step is fundamentally different.

That is to say, GC times are not the only thing that matters when it comes to memory management and it is a relatively straightforward tradeoff between deallocation and allocation that the current golang GC is making.

uluyol · on July 6, 2016

It would be nice to quantify what the impact of this is. Go is no worse in this regard than C++ (it even uses a fork of tcmalloc for allocation) and has support for value types so there is a lot less pointer chasing than in Java.

Not sure how it could be done but having some numbers on this would be great.

weberc2 · on July 6, 2016

Cool, thanks for the information. :)

geodel · on July 6, 2016

I do not disagree with that but unless some real life example is shown of Go vs Java GC I am inclined to believe advantage is mostly theoretical.

pcwalton · on July 6, 2016

binary-trees is Hans Boehm's benchmark of garbage collection performance (mostly throughput). Bump allocation in the nursery ends up being king here, and it's reflected in the numbers: https://benchmarksgame.alioth.debian.org/u64q/performance.ph...

igouy · on July 6, 2016

From Hans Boehm's java program --

"The results are only really meaningful together with a specification of how much memory was used. It is possible to trade memory for better time performance. This benchmark should be run in a 32 MB heap, though we don't currently know how to enforce that uniformly."

For some years, the benchmarks game did show an alternative task where memory was limited -- but only for half-a-dozen language implementations.

Figuring out an appropriate max heap size for each program was too hands-on trial and error.

geodel · on July 6, 2016

I am not sure in the apps where Go/Java languages are mostly targeted how many times people are implementing binary trees as their core/dominating business logic.

igouy · on July 6, 2016

Header comment from Hans Boehm's original test program --

"This is no substitute for real applications. No actual application is likely to behave in exactly this way. However, this benchmark was designed to be more representative of real applications than other Java GC benchmarks of which we are aware."

pcwalton · on July 6, 2016

Most allocations in real-world apps are nursery allocations (that's the generational hypothesis after all), so the speed of nursery allocations, which is what results in the throughput differential here, very much matters in practice.

the8472 · on July 6, 2016

> 2. It openly proclaims <10ms STW pauses for GC.

Java's GCs make no concrete claims because they scale from tiny to very large heaps with vastly different object populations and root set sizes.

Some java applications run with 100GB heaps or on 128-core NUMA machines with lots of threads.

10ms pause times are achievable with "modest" heap sizes (~single-digit GBs) if you have some cores to spare for a concurrent collector to do its work, well, concurrently.

If you don't have enough spare CPU time or have a larger heap or a workload without enough breathing room then it would be silly to make such guarantees.

Of course they could easily write "<10ms STW pauses. sometimes. read the fine print"