Discussion:
[ruby-core:27135] better GC?
Roger Pack
2009-12-10 21:55:53 UTC
Could I put in a small plea for a better GC?

With this code:

(1..7).each{|n| puts n; a = ['a']*(10**n); a.inspect;}

when n gets high enough, almost 50% of the time is spent in GC.
Somewhat frustrating.

Much thanks.
-r
Yukihiro Matsumoto
2009-12-11 00:07:16 UTC
Hi,

In message "Re: [ruby-core:27135] better GC?"
on Fri, 11 Dec 2009 06:55:53 +0900, Roger Pack <***@gmail.com> writes:

|Could I put in a small plea for a better GC?

Yes, but unfortunately it's not small at all. GC has a lot of
trade-offs and difficult technical dependencies. It's not that easy to
solve your frustration. I am happy to be proven wrong (that means
patches are welcome).

matz.
Paul Brannan
2010-01-07 19:53:34 UTC
Post by Yukihiro Matsumoto
Yes, but unfortunately it's not small at all. GC has a lot of
trade-offs and difficult technical dependency. It's not that easy to
solve your frustration. I am happy to be proven wrong (that means
patches are welcome).
As I mentioned to you at the conference, GC is the primary reason I can
no longer use ruby for production code where I work. We use perl5 and
python instead, both of which use reference counting. (We disable the
cycle-finding GC in python and don't write code that produces cycles).

Adding reference counting to ruby would not work with extensions (they
don't know to increment the reference count) and would probably not be
what we (the ruby community) want anyway. The simplicity of writing
extensions (due to ruby's conservative GC) is part of what makes ruby
attractive to many people.
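
To make the point concrete, here is a minimal sketch of a typical
extension function as written today. The VALUE lives only in a C local;
MRI's conservative GC finds it by scanning the C stack, so the extension
makes no bookkeeping calls at all. A reference-counting runtime would need
explicit retain/release calls (hypothetical names like rb_retain and
rb_release) around every such local -- calls that no existing extension
makes:

#include "ruby.h"

/* Safe under conservative GC: `ary` is kept alive because the
 * collector scans the C stack and finds it there. Under refcounting,
 * every assignment and return path would need hypothetical
 * rb_retain()/rb_release() bookkeeping. */
static VALUE make_pair(VALUE self)
{
    VALUE ary = rb_ary_new();
    rb_ary_push(ary, INT2FIX(1));
    rb_ary_push(ary, INT2FIX(2));
    return ary;
}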

OTOH the dynamic language space is becoming more competitive every year,
and it would be a shame to see ruby fall behind because it has to remain
backward compatible.

Hypothetically, if ruby could give up backward compatibility with
existing extensions, would the problem become any easier?

Also, where are the bottlenecks in the GC today? Is it the number of
objects that have to be marked? Could gc_mark itself be optimized so
the cost of marking any given object is reduced?

Just brainstorming a bit...

Paul
Eero Saynatkari
2010-01-07 20:55:01 UTC
Post by Paul Brannan
OTOH the dynamic language space is becoming more competitive every year,
and it would be a shame to see ruby fall behind because it has to remain
backward compatible.
Hypothetically, if ruby could give up backward compatibility with
existing extensions, would the problem become any easier?
Rubinius is also compatible with existing extensions*
and, as mentioned previously, uses a more advanced GC.

I propose that it is better to spend effort on Rubinius,
whether by developing it or by testing it for your purposes,
than to try to bolt various things onto MRI. This goes
for many recent proposals here, but since this is the
topic at hand...


* That is, a C API identical to MRI's is provided. Only
recompilation is needed, except in cases where some
function has not yet been defined because it has not
been needed. If you have C exts, try them and
report any missing things so they can be added.


Eero
--
Magic is insufficiently advanced technology.
Kurt Stephens
2010-01-07 22:09:14 UTC
Post by Eero Saynatkari
Post by Paul Brannan
OTOH the dynamic language space is becoming more competitive every year,
and it would be a shame to see ruby fall behind because it has to remain
backward compatible.
Hypothetically, if ruby could give up backward compatibility with
existing extensions, would the problem become any easier?
Rubinius is also compatible with existing extensions*
and, as mentioned previously, uses a more advanced GC.
I propose that it is better to spend effort on Rubinius
whether developing or testing it for your purposes than
it is to try to bolt various things onto MRI. This goes
for many recent proposals here, but since this is the
topic at hand...
* That is, a C API identical to MRI's is provided. Only
recompilation needed, except for the cases where some
function/whatever has not yet been defined because it
has not been needed. If you have C exts, try them and
report any missing things so they can be added.
FFIs complicate GC. There are numerous papers asserting that a simple
semispace copying collector outperforms generational collectors or
other more complex collectors. For example, efforts to parallelize
collectors require read or write barriers or complex locking. Copying
collectors usually require additional pointer redirection for C code
that has "references" to objects allocated from the GC'd heap.

This is why conservative collectors (like MRI's) are popular with GC'd
languages that leverage C code - the FFI is "dumb", the allocator
needs to assume the code behind the FFI is "dumb", and the language does
not distinguish object "references" on opposite sides of the FFI.

Maybe we need to start abstracting Ruby implementation internals from C
in a way that doesn't limit the GC implementation -- as Sun did with JNI
when it realized how poorly conservative collectors perform for
long-running/large Java processes. All Java implementations support JNI,
and not all Java implementations use the same GC technology.
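
As a sketch of what such an abstraction might look like -- all names
below are hypothetical, in the spirit of JNI's opaque jobject handles:

/* C code only ever holds opaque handles, never raw object pointers,
 * so the collector is free to move the underlying objects. */
typedef struct rb_handle *rb_handle_t;

rb_handle_t rbh_str_new(const char *cstr);            /* create via API */
long        rbh_ary_len(rb_handle_t ary);             /* query via API  */
rb_handle_t rbh_ary_entry(rb_handle_t ary, long idx); /* access via API */
void        rbh_release(rb_handle_t h); /* tell the GC the handle is dead */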
Post by Eero Saynatkari
Eero
--
Magic is insufficiently advanced technology.
THAKUR PRASHANT SINGH
2010-01-08 04:36:36 UTC
Hi,
I am getting an error while updating gems on my system.
It's pointing to a shared drive which doesn't exist anymore.
How can I freshly install Ruby and gems to get rid of this error?
I have reinstalled, but it doesn't work.
=> gem update --system
Updating RubyGems
ERROR: While executing gem ... (Errno::ENOENT)
No such file or directory - U:

My Ruby and RubyGems versions are:

Ruby:
ruby 1.8.6 (2008-08-11 patchlevel 287) [i386-mswin32]

RubyGems:
1.3.1

Regards,
Prashant
Luis Lavena
2010-01-08 13:43:27 UTC
On Fri, Jan 8, 2010 at 1:36 AM, THAKUR PRASHANT SINGH
Post by THAKUR PRASHANT SINGH
Hi ,
I am getting an error while updating gems in my system.
Its pointing to a shared drive which doesn't exist now.
How can I freshly install ruby and gems to get rid of this error.
I have done reinstallation but it doesn't work
=>gem update --system Updating RubyGems
ERROR:  While executing gem ... (Errno::ENOENT)
My ruby and gem versions are
ruby 1.8.6 (2008-08-11 patchlevel 287) [i386-mswin32]
Gem
1.3.1
Normal operational issues with Ruby and RubyGems are better suited to
ruby-talk than ruby-core.

If gems are trying to install to drive U:, maybe there is something in
your environment forcing that: an environment variable like GEM_HOME
and/or GEM_PATH, your own home directory configuration, or a gemrc
configuration file.

Execute "gem env" and see if any of the installation directories, gem
location configuration, or other paths reference the U: drive.
--
Luis Lavena
AREA 17
-
Perfection in design is achieved not when there is nothing more to add,
but rather when there is nothing more to take away.
Antoine de Saint-Exupéry
Charles Oliver Nutter
2010-01-11 21:20:26 UTC
Maybe we need to start abstracting Ruby implementation internals from C in a
way that doesn't limit the GC implementation -- like JNI when Sun realized
how poorly conservative collectors perform for long-running/large Java
processes.  All Java implementations support JNI and not all Java
implementations use the same GC technology.
To be honest, I think this is the *only* path forward for MRI.
Slavishly supporting the current extension API doesn't really help
anyone that much (the good extensions everyone uses are being maintained
and could be updated to a new API), and it makes it almost impossible to
optimize certain aspects of MRI.

Lots of people hate JNI, but it's a far less invasive way to write
extensions...and a lot of the cumbersome parts are by design. You
should not be able to get direct pointer access to object internals,
nor should you ever expose the structure of an object's memory layout.
Macros like RARRAY and RSTRING should be abolished in favor of
API-driven access (or opt-in "give me a copy of this array and then
put it back" functions). It should not be necessary to scavenge the C
stack to find live references. All these things make it harder to
improve MRI and almost impossible to implement the C API on other
runtimes (or at least, impossible without incurring large performance
penalties).
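
To illustrate the difference with real MRI API (RARRAY_PTR and
RARRAY_LEN exist today; the contrast is in what each style assumes;
both sketches sum an array of Fixnums):

#include "ruby.h"

/* The style being argued against: a raw pointer into the array's
 * storage. The object can never move, and its memory layout is baked
 * into every compiled extension. */
static long sum_unsafe(VALUE ary)
{
    long i, sum = 0;
    for (i = 0; i < RARRAY_LEN(ary); i++)
        sum += FIX2LONG(RARRAY_PTR(ary)[i]);
    return sum;
}

/* API-driven access: only function calls cross the boundary, so the
 * runtime may move or repack the array behind the scenes. (In a fully
 * opaque API, even the length query would be a function call.) */
static long sum_api(VALUE ary)
{
    long i, sum = 0;
    for (i = 0; i < RARRAY_LEN(ary); i++)
        sum += FIX2LONG(rb_ary_entry(ary, i));
    return sum;
}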

We are interested in supporting a C API in JRuby, but we will not
support any API that gives you direct access to pointers or object
structures. We'll work with Rubinius and MRI folks to come up with
replacement APIs, but any existing extensions that need unsafe access
will not work. If it's worthwhile to have C extensions that work with
many different runtimes/GCs/threading models, these API changes are
not really negotiable.

- Charlie
Rick DeNatale
2010-01-11 22:53:43 UTC
On Mon, Jan 11, 2010 at 4:20 PM, Charles Oliver Nutter
Post by Charles Oliver Nutter
Maybe we need to start abstracting Ruby implementation internals from C in a
way that doesn't limit the GC implementation -- like JNI when Sun realized
how poorly conservative collectors perform for long-running/large Java
processes.  All Java implementations support JNI and not all Java
implementations use the same GC technology.
To be honest, I think this is the *only* path forward for MRI.
Slavishly supporting the current extension API doesn't really help
anyone that much (the good extensions everyone uses are being maintained
and could be updated to a new API), and makes it almost impossible to
optimize certain aspects of MRI.
I think that you are correct, sir!
Post by Charles Oliver Nutter
Lots of people hate JNI, but it's a far less invasive way to write
extensions...and a lot of the cumbersome parts are by design. You
should not be able to get direct pointer access to object internals,
nor should you ever expose the structure of an object's memory layout.
Macros like RARRAY and RSTRING should be abolished in favor of
API-driven access (or opt-in "give me a copy of this array and then
put it back" functions). It should not be necessary to scavenge the C
stack to find live references. All these things make it harder to
improve MRI and almost impossible to implement the C API on other
runtimes (or at least, impossible without incurring large performance
penalties).
We are interested in supporting a C API in JRuby, but we will not
support any API that gives you direct access to pointers or object
structures. We'll work with Rubinius and MRI folks to come up with
replacement APIs, but any existing extensions that need unsafe access
will not work. If it's worthwhile to have C extensions that work with
many different runtimes/GCs/threading models, these API changes are
not really negotiable.
This would be good.

I'm not sure whether the API ultimately needs to be binary compatible,
or whether a macro-based API would be as good.

If I'm not mistaken, the MRI extension API macro implementations
changed between 1.8 and 1.9; certainly some of the 'VM' data
structures did, meaning that extensions needed to be recompiled.

Back when we were flogging 'enterprise' versions of Smalltalk and Java
at IBM, maintaining binary compatibility of interfaces was important
because we didn't necessarily want to ship all the source, and folks
wanted to write (and ship) proprietary extensions. Given the
open-source nature of Ruby and much of the world today, I'm not sure
that this is such a firm requirement.
--
Rick DeNatale

Blog: http://talklikeaduck.denhaven2.com/
Twitter: http://twitter.com/RickDeNatale
WWR: http://www.workingwithrails.com/person/9021-rick-denatale
LinkedIn: http://www.linkedin.com/in/rickdenatale
Charles Oliver Nutter
2010-01-16 09:29:01 UTC
Post by Rick DeNatale
Post by Charles Oliver Nutter
We are interested in supporting a C API in JRuby, but we will not
support any API that gives you direct access to pointers or object
structures. We'll work with Rubinius and MRI folks to come up with
replacement APIs, but any existing extensions that need unsafe access
will not work. If it's worthwhile to have C extensions that work with
many different runtimes/GCs/threading models, these API changes are
not really negotiable.
This would be good.
I'm not sure whether the API ultimately needs to be binary compatible,
or that a macro-based API would be as good.
If I'm not mistaken the MRI extension API macro implementations
changed between 1.8 and 1.9, certainly some of the 'VM' data
structures did, meaning that extensions needed to be recompiled.
For what it's worth, we have nowhere near the resources or expertise
to do a lot of work on the JRuby C API, so we're hoping folks
interested in C extensions for JRuby would like to help out. The
project is here:

http://github.com/wmeissner/jruby-cext

And you can poke either me or Wayne Meissner for instructions on how
to get it working. I'll personally ship out JRuby or EY t-shirts to
anyone who can help us with it :)

- Charlie
Roger Pack
2010-01-08 16:41:18 UTC
Post by Eero Saynatkari
Rubinius is also compatible with existing extensions*
and, as mentioned previously, uses a more advanced GC.
I wonder if its GC could be merged into MRI :)

Also (@Paul) one option with a better GC is JRuby [or so I hear -- I
haven't actually had any luck getting JRuby to run faster than 1.9 on
Windows, but maybe it's just me].

-r
Eero Saynatkari
2010-01-08 18:44:06 UTC
Post by Roger Pack
Post by Eero Saynatkari
Rubinius is also compatible with existing extensions*
and, as mentioned previously, uses a more advanced GC.
I wonder if its GC could be merged into MRI :)
Nope.


--
Magic is insufficiently advanced technology.
Roger Pack
2010-01-11 22:58:08 UTC
Post by Roger Pack
I wonder if its GC could be merged into MRI :)
Nope.
Explanation? (Licensing problem?)

-r
Gonçalo Silva
2010-01-12 00:56:40 UTC
Among other things (like the whole architecture of MRI), it would break most
C extensions.

---
Gonçalo S. Silva
http://goncalossilva.com

im: ***@gmail.com
skype: goncalosantaremsilva
twitter: http://twitter.com/goncalossilva
Post by Roger Pack
Post by Roger Pack
I wonder if its GC could be merged into MRI :)
Nope.
Explanation? (Licensing problem?)
-r
Evan Phoenix
2010-01-12 20:45:24 UTC
Post by Roger Pack
Post by Roger Pack
I wonder if its GC could be merged into MRI :)
Nope.
Explanation? (Licensing problem?)
The Rubinius GCs (there are actually 3 of them) require write barriers. To port them to MRI, you'd have to rewrite most of the internals to use write barriers, and then no extensions would work, because the GCs are accurate-only (no conservative stack scanning) and they require the ability to move objects.
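
For reference, a write barrier is a small hook run at every store of one
object reference into another; a minimal sketch, with hypothetical helper
names, of what every internal store site would have to call:

#include "ruby.h"

extern int  rb_gc_is_old(VALUE obj);    /* hypothetical heap queries   */
extern int  rb_gc_is_young(VALUE obj);
extern void rb_gc_remember(VALUE obj);  /* hypothetical remembered set */

/* Called before storing `target` into `holder`, so a generational GC
 * can find old->young references without scanning the whole heap.
 * Porting this to MRI means wrapping every store site in the
 * interpreter -- the rewrite described above. */
static inline void rb_gc_wb(VALUE holder, VALUE target)
{
    if (rb_gc_is_old(holder) && rb_gc_is_young(target))
        rb_gc_remember(holder);
}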

- Evan
Post by Roger Pack
-r
Joe Damato
2010-01-14 05:03:09 UTC
Post by Evan Phoenix
Post by Roger Pack
Post by Roger Pack
I wonder if its GC could be merged into MRI :)
Nope.
Explanation? (Licensing problem?)
The Rubinius GC (there are actually 3 of them) require write barriers. To port to MRI, you'd have to rewrite most of the internals to use write barriers, and then no extensions would work because the GCs are accurate only (no conservative stack scanning), and they require the ability to move objects.
You could do it without rewriting the MRI internals AND while retaining
compatibility with extensions...

BUT

it would be painfully slow, probably unusable.

I'm joining this conversation a bit late; perhaps I can still catch up.

joe
Joe Damato
2010-01-14 05:14:53 UTC
Post by Joe Damato
Post by Evan Phoenix
Post by Roger Pack
Post by Roger Pack
I wonder if its GC could be merged into MRI :)
Nope.
Explanation? (Licensing problem?)
The Rubinius GC (there are actually 3 of them) require write barriers. To port to MRI, you'd have to rewrite most of the internals to use write barriers, and then no extensions would work because the GCs are accurate only (no conservative stack scanning), and they require the ability to move objects.
You could do it without rewriting the MRI internals AND retaining
compatibility with extensions..
BUT
it would be painfully slow, probably unusable.
OK, so the way I see it, the choice is:

retain backward compatibility VS implement write barriers toward a
more intelligent GC.

How about we implement the primitives and just punish people who use
the "old" way of touching objects?

How about:

- Expose an API for accessing object internals so that write barriers
can be implemented in software.

and

- At the same time continue to allow extensions to reach in and touch
MRI internals, for the time being. These objects could have their
barriers implemented in hardware. The performance penalty will be HIGH
but perhaps this is good motivation for people to fix their ruby gems?

Perhaps this is too mean?

joe
Gonçalo Silva
2010-01-14 17:50:54 UTC
So we're getting the worst of both options: GC will still be slow, and
people will need to fix their ruby gems. I don't see how this is good for us
or for Ruby itself.
Post by Evan Phoenix
Post by Joe Damato
Post by Evan Phoenix
Post by Roger Pack
Post by Roger Pack
I wonder if its GC could be merged into MRI :)
Nope.
Explanation? (Licensing problem?)
The Rubinius GC (there are actually 3 of them) require write barriers.
To port to MRI, you'd have to rewrite most of the internals to use write
barriers, and then no extensions would work because the GCs are accurate
only (no conservative stack scanning), and they require the ability to move
objects.
Post by Joe Damato
You could do it without rewriting the MRI internals AND retaining
compatibility with extensions..
BUT
it would be painfully slow, probably unusable.
retain backward compatibility VS implement write barriers toward a
more intelligent GC.
How about we implement the primitives and just punish people who use
the "old" way of touching objects?
- Expose an API for accessing object internals so that write barriers
can be implemented in software.
and
- At the same time continue to allow extensions to reach in and touch
MRI internals, for the time being. These objects could have their
barriers implemented in hardware. The performance penalty will be HIGH
but perhaps this is good motivation for people to fix their ruby gems?
Perhaps this is too mean?
joe
--
---
Gonçalo S. Silva
http://goncalossilva.com

im: ***@gmail.com
skype: goncalosantaremsilva
twitter: http://twitter.com/goncalossilva
Brian Mitchell
2010-01-14 18:05:42 UTC
Post by Gonçalo Silva
So, we're getting the worse of the two options. GC will still be slow and
people will need to fix their ruby gems. I don't see how this is good for us
or Ruby itself.
The key is that there would be a correct way to write things that
could avoid most of the penalty, if I read that right. You'd pay a tax
for extensions that aren't using the new API. So they are slow -- what's
new? The option here is to give ourselves a future and the ability to
improve the software we do have control over.

Assuming we don't jump to the ideal GC approach in one big bang...
What we need to decide is what the most appropriate first steps are. I
think there is room to employ new APIs. Penalties are fair. If this
weren't the case, then someone might be complaining that their 286
running in real mode should be able to run a modern operating system
right now. We had to take a hit in software compatibility and even
performance as protected modes were added.

Let's not freeze progress because something someone did might need
rework. For those who don't want the penalty, well, they can keep
using the current implementation, as that's probably what they would be
stuck with anyway without these proposed efforts.

Brian.
Roger Pack
2010-01-14 20:09:41 UTC
retain backward compatibility  VS   implement write barriers toward a
more intelligent GC.
How about we implement the primitives and just punish people who use
the "old" way of touching objects?
- Expose an API for accessing object internals so that write barriers
can be implemented in software.
I'd be down with that, though I'll admit I don't really know what that
means or how to write it :)
-rp
Charles Oliver Nutter
2010-01-09 00:55:38 UTC
Post by Roger Pack
haven't actually had any luck getting jruby to run faster than 1.9, on
windows, but maybe it's just me].
I'd like to hear about what else isn't running faster for you in
JRuby. It probably could be fixed. At least for your example, JRuby
seems to do pretty well:

~/projects/jruby ➔ time jruby -e "(1..7).each{|n| puts n; a =
['a']*(10**n); a.inspect;}"
1
2
3
4
5
6
7

real 0m5.874s
user 0m6.059s
sys 0m0.354s

~/projects/jruby ➔ time ruby1.9 -e "(1..7).each{|n| puts n; a =
['a']*(10**n); a.inspect;}"
1
2
3
4
5
6
7

real 0m46.667s
user 0m45.472s
sys 0m0.503s

As far as GC goes, if a more realtime (or perhaps less-intrusive) GC
is required, JRuby + concurrent GC from any of the mainstream JVMs
would probably fit the bill. On Hotspot, you'd use jruby
-J-XX:+UseConcMarkSweepGC. I assume JRockit (Oracle) and J9 (IBM) have
something similar.

I believe the above example is using the concurrent GC by default
(Snow Leopard now defaults to Java 6 server with concurrent GC).
Here's GC timing output:

http://gist.github.com/272625

- Charlie
Kornelius Kalnbach
2010-01-07 22:14:21 UTC
We disable the cycle-finding GC in python and don't write code that
produces cycles.
How do you do that? I'm curious.

[murphy]
Paul Brannan
2010-01-11 17:12:38 UTC
Post by Kornelius Kalnbach
We disable the cycle-finding GC in python and don't write code that
produces cycles.
How do you do that? I'm curious.
How do we do which part? Disable the GC or write code that doesn't
produce cycles? The former is easy:

import gc
gc.disable()

The latter requires good up-front design. Good code generally avoids
cycles anyway. If we create a cycle intentionally, we use a weak
reference. Our C++ and Perl programmers are already used to writing
code like this, because both languages use reference counting.

(for the nitpickers, C++ doesn't use refcounting, but almost all the
libraries we use do).

(and the python language doesn't enforce refcounting either, but we are
using CPython, which does use refcounting in its current
implementation).

There are also tools available for python to detect cycles, which we try
to run before deployment.

Paul
Kurt Stephens
2010-01-07 22:37:40 UTC
Post by Paul Brannan
Adding reference counting to ruby would not work with extensions (they
don't know to increment the reference count) and would probably not be
what we (the ruby community) want anyway. The simplicity of writing
extensions (due to ruby's conservative GC) is part of what makes ruby
attractive to many people.
Adding reference counting to MRI would make existing Ruby code that is
not written to avoid cycles unusable, unless a periodic GC were also run.

Writing code that does not create cycles is "unnatural" to me -- not
worrying about cycles is one of the reasons I left Perl for Ruby after
12 years. That said, a reference counting option might still be
suitable for some applications (real-time Ruby on a small device).
Post by Paul Brannan
Also, where are the bottlenecks in the GC today? Is it the number of
objects that have to be marked? Could gc_mark itself be optimized so
the cost of marking any given object is reduced?
I'm not convinced that the GC is the issue, but I haven't really been
measuring it in production environments. I think the real issue is
common code, and Ruby semantics, that create avoidable garbage; that
would be an issue regardless of GC technology, including reference
counting.

KAS
Paul Brannan
2010-01-11 14:32:22 UTC
Post by Kurt Stephens
I'm not convinced that the GC is the issue, but I haven't really been
measuring it in production environments. I think common code or Ruby
semantics that create avoidable garbage is the issue and would be an issue
regardless of GC technology, including reference counting.
Avoiding garbage doesn't solve the problem; if there is a large number
of reachable objects, the mark phase can still take a long time. This
is why a number of people want a generational collector: it can
reduce the amount of time spent marking objects.

IMO it's clear that there is no one-size-fits-all option. I wonder how
difficult it would be to make the GC pluggable, so alternate GCs could
be provided as gems?

(obviously there would still be limitations on these GCs; an
incremental collector would probably be out of the question).
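
A pluggable design might amount to routing the collector's entry points
through a table of function pointers chosen at startup; a purely
hypothetical sketch:

#include <stddef.h>
#include "ruby.h"

typedef struct rb_gc_impl {
    void *(*alloc)(size_t size);            /* allocate an object     */
    void  (*mark)(VALUE obj);               /* mark one object        */
    void  (*collect)(void);                 /* run a collection cycle */
    void  (*write_barrier)(VALUE holder, VALUE target);
} rb_gc_impl;

/* A GC gem would register its implementation before any objects exist. */
void rb_gc_set_impl(const rb_gc_impl *impl);

As matz points out later in the thread, the write_barrier slot is the
catch: it turns what should be a few inline instructions into an indirect
call on every store.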

Paul
Gonçalo Silva
2010-01-11 14:37:07 UTC
That's an interesting idea, Paul. I love it, but I think it would imply a
lot of effort from a lot of people, especially because some GC
implementations will break some C extensions.

Anyway, what would be the best GC implementation for a Rails application?
Since Rails is one of the most important Ruby projects, growing and being
used worldwide, I really think that some effort should be put into
optimizing Ruby for Rails, and that includes its GC.

---
Gonçalo S. Silva
http://goncalossilva.com

im: ***@gmail.com
skype: goncalosantaremsilva
twitter: http://twitter.com/goncalossilva
Post by Paul Brannan
Post by Kurt Stephens
I'm not convinced that the GC is the issue, but I haven't really been
measuring it in production environments. I think common code or Ruby
semantics that create avoidable garbage is the issue and would be an
issue
Post by Kurt Stephens
regardless of GC technology, including reference counting.
Avoiding garbage doesn't solve the problem; if there is a large number
of reachable objects, the mark phase can still take a long time. This
is why a number of people want a generational collector, because it can
reduce the amount of time spent marking objects.
IMO it's clear that there is no one-size-fits all option. I wonder how
difficult it would be to make the GC pluggable, so alternate GC's could
be provided as gems?
(obviously there would still be limitations on these GC's; an
incremental collector would probably be out of the question).
Paul
Magnus Holm
2010-01-11 16:18:17 UTC
Well, Rails itself is single-threaded, and so are most of the servers.
It's also fairly common for servers to have multiple cores these days
(maybe even for desktops too?). The obvious choice then is a concurrent
garbage collector, which could run in its own native thread (only
locking the GIL for a small amount of time). Because of the GIL, only
one thread can run at a time, so there's no point in an on-the-fly
collector, which simplifies the case quite a lot.

For instance, "Age-Oriented Concurrent Garbage Collection" by Paz,
Petrank, and Blackburn separates objects into two generations, but always
collects both heaps. However, the older generation uses reference
counting, which appears to be quite a bit faster. Their implementation
showed a max 2ms pause time, but I assume it will be even shorter when
there's only one thread to stop.

Petrank has also worked on the Stopless collector, which is both
on-the-fly and real-time. It was implemented with "virtually no pause
times, good mutator utilization, and acceptable overheads", and is
next on my reading list. Frampton, Bacon, Cheng, and Grove have been
working on a real-time collector which has a 1ms max pause time. Also
worth checking out, I assume.

// Magnus Holm
Post by Gonçalo Silva
That's an interesting idea, Paul. I love it but I think it would imply a lot
of effort from a lot of people, specially because some GC implementations
will cause breakages on some C extensions.
Anyway, what would be the best GC implementation for a Rails application?
Since Rails is one of the most important Ruby projects, growing and being
used worldwide, I really think that some effort should be put into
optimizing Ruby for Rails and that includes it's GC.
---
Gonçalo S. Silva
http://goncalossilva.com
skype: goncalosantaremsilva
twitter: http://twitter.com/goncalossilva
Post by Paul Brannan
Post by Kurt Stephens
I'm not convinced that the GC is the issue, but I haven't really been
measuring it in production environments.   I think common code or Ruby
semantics that create avoidable garbage is the issue and would be an issue
regardless of GC technology, including reference counting.
Avoiding garbage doesn't solve the problem; if there is a large number
of reachable objects, the mark phase can still take a long time.  This
is why a number of people want a generational collector, because it can
reduce the amount of time spent marking objects.
IMO it's clear that there is no one-size-fits all option.  I wonder how
difficult it would be to make the GC pluggable, so alternate GC's could
be provided as gems?
(obviously there would still be limitations on these GC's; an
incremental collector would probably be out of the question).
Paul
Erik Scheirer
2010-01-11 14:42:29 UTC
I think a pluggable/loadable GC scheme, as long as it's really simple to
use, is perfect.

There would be some overhead created by making it pluggable, but in the
scheme of things it would be well worth the small number of CPU cycles
lost.

Just my two-cents-worth ...

e
Post by Paul Brannan
Post by Kurt Stephens
I'm not convinced that the GC is the issue, but I haven't really been
measuring it in production environments. I think common code or Ruby
semantics that create avoidable garbage is the issue and would be an issue
regardless of GC technology, including reference counting.
Avoiding garbage doesn't solve the problem; if there is a large number
of reachable objects, the mark phase can still take a long time. This
is why a number of people want a generational collector, because it can
reduce the amount of time spent marking objects.
IMO it's clear that there is no one-size-fits all option. I wonder how
difficult it would be to make the GC pluggable, so alternate GC's could
be provided as gems?
(obviously there would still be limitations on these GC's; an
incremental collector would probably be out of the question).
Paul
Gonçalo Silva
2010-01-11 14:52:05 UTC
I think there's no need for it to be hot-pluggable, so there wouldn't be any
overhead at all (only at startup, of course).

---
Gonçalo S. Silva
http://goncalossilva.com

im: ***@gmail.com
skype: goncalosantaremsilva
twitter: http://twitter.com/goncalossilva
Post by Erik Scheirer
I think a pluggable/loadable GC scheme, as long as its really simple to use, is perfect.
There would be some overhead created by making it pluggable, though, but in
the scheme of things it would be well worth the small amount of cpu cycles
lost.
Just my two-cents-worth ...
e
Post by Paul Brannan
Post by Kurt Stephens
I'm not convinced that the GC is the issue, but I haven't really been
measuring it in production environments. I think common code or Ruby
semantics that create avoidable garbage is the issue and would be an
issue
Post by Paul Brannan
Post by Kurt Stephens
regardless of GC technology, including reference counting.
Avoiding garbage doesn't solve the problem; if there is a large number
of reachable objects, the mark phase can still take a long time. This
is why a number of people want a generational collector, because it can
reduce the amount of time spent marking objects.
IMO it's clear that there is no one-size-fits all option. I wonder how
difficult it would be to make the GC pluggable, so alternate GC's could
be provided as gems?
(obviously there would still be limitations on these GC's; an
incremental collector would probably be out of the question).
Paul
Rick DeNatale
2010-01-11 15:22:17 UTC
Post by Erik Scheirer
I think a pluggable/loadable GC scheme, as long as its really simple to use, is perfect.
So would be a lasting world peace!
Post by Erik Scheirer
There would be some overhead created by making it pluggable, though, but in the scheme of things it would be well worth the small amount of cpu cycles lost.
I just can't see the feasibility of this. Any good GC involves
careful interaction between the parts of the system which mutate
memory and those which manage it. Getting a highly performant GC
almost always involves careful coordinated design of things like:

The low-level layout of objects.
The division of memory into various collections of objects (e.g. in a
generational scheme the old objects and the new objects live in different
spaces; sometimes the new space moves each time a minor GC happens).
Efficient detection of whether an object is old or new.
For a GC requiring a 'write-barrier', efficient implementation of that
write barrier.
...

And to really get the most out of a GC, some of the low level
decisions can be platform and processor specific.

There are cascading design decisions to be made. For example, let's
say we're making a generation-scavenging GC. We need to capture
enough information as the mutator (the program) runs so that we can
find any new objects which are referenced by an old object. This is
the reason for the write barrier. So there are several issues:

How do we detect a store of a reference to a new object into an
old object with the lowest overhead?
How do we remember a store into an old object with the lowest overhead?
...

There are several strategies for detecting old vs. new objects, each
with its own tradeoffs, for example:
A flag bit in the object header
Address range checking to see which space it's in, or not in.
On some platforms and processors, one might make use of the virtual
memory hardware and access privileges to detect such stores, but this
is highly non-portable and may or may not outperform other approaches.

Flag bits need to be maintained properly, and are expensive (see below).
Address range checking is more common, and goes back to the
interactions with the overall design of the "VM".

And what about remembering the old objects which need to be
considered during a new-object GC?

We could perhaps make a linked list or set of "remembered" objects, but
this is expensive both in terms of space and speed.

Most GCs use some form of "card marking", where old space is broken up
into 'cards' containing a range of memory. Cards are similar to
pages in a virtual memory system, and may or may not be the same in a
particular GC implementation. In such a scheme, when a new object
reference is stored in an old object, the fact that this has happened is
recorded as a change to the card in which the old object resides. The
most obvious way to do this is to have a data structure somewhere
which has a bit for each card.

But on most processors setting individual bits is expensive, involving
fetching, masking, and re-storing a larger datatype.

The Self guys recognized this and found that, for the processors they
were working on, using a byte rather than a bit for the mark was much
better overall, despite requiring eight times the space for the marks.

http://www.cs.ucsb.edu/~urs/oocsb/papers/write-barrier.pdf
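
The byte-mark barrier they describe compiles down to a handful of
instructions; a sketch, with the card size, table, and heap base left as
illustrative globals:

#include <stdint.h>

#define CARD_SHIFT 9                 /* 512-byte cards; size illustrative */
extern uint8_t  *card_table;         /* one mark byte per card */
extern uintptr_t heap_base;

/* Run at every store into `holder`: a subtract, a shift, a byte store.
 * No fetch/mask/re-store cycle, which is why a byte beats a bit. */
static inline void card_mark(void *holder)
{
    card_table[((uintptr_t)holder - heap_base) >> CARD_SHIFT] = 1;
}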

And these are just some of the variations once one has chosen a
particular GC algorithm or perhaps one of a family of GC algorithms.

Now I know that LLVM attempts to do something like this,

http://llvm.org/docs/GarbageCollection.html

but it apparently hasn't been all that successful:

http://lhc-compiler.blogspot.com/2009/01/why-llvm-probably-wont-replace-c.html


The problem is that LLVM defines the interface between the mutator and
the GC "framework" in terms of C functions and function callbacks,
e.g. for the write barrier, whereas a really efficient GC implements
the write barrier (and other GC bookkeeping tasks) in a few machine
instructions.


I fear that a pluggable GC would only let you play around with pretty
poorly performing GC alternatives.
--
Rick DeNatale

Blog: http://talklikeaduck.denhaven2.com/
Twitter: http://twitter.com/RickDeNatale
WWR: http://www.workingwithrails.com/person/9021-rick-denatale
LinkedIn: http://www.linkedin.com/in/rickdenatale
Paul Brannan
2010-01-11 17:00:49 UTC
Post by Rick DeNatale
I fear that a pluggable GC would only let you play around with pretty
poorly performing GC alternatives.
I think I agree with your logic, though not necessarily your wording.
While a pluggable GC might not support some of the more interesting
alternatives, it would allow GCs that offer moderate improvements over
the existing implementation. Moreover, it might be enough to whet some
programmers' appetites to dive further into Ruby's internals than the
pluggable system supports.

The first hit is free, as they say, no? :)

Paul
Erik Scheirer
2010-01-11 17:10:48 UTC
Free beer wouldn't be bad, either ;-)

I agree, based on the points you raised, that the overhead would be
unacceptable; for true production-level code, the performance hit would be
too great.

However, maybe another thing to consider is to work up a simple method for
GC experimentation, so that anyone who wanted to try out a new GC approach
could do so without having to get into the guts of everything; if such a GC
ended up being useful, it could then be incorporated into the 'native'
build that did not use the 'experimentation hooks'.

In other words, what I am suggesting is a 'pluggable' GC architecture for
the purpose of accelerating development of proposed changes/additions to
the language in a more 'agile' manner. Then, if a particular GC were
accepted, it would be mind-melded into the production-level code without
the hooks, so performance is as good as it can get.

This kind of approach might serve well for other aspects of language
development, but the GC issue is by far the biggest in terms of performance
considerations. GC issues killed Java's credibility for years, for example,
so the faster the growing pains are gotten out of the way, the better, I
would offer.

e
Post by Rick DeNatale
Post by Erik Scheirer
I think a pluggable/loadable GC scheme, as long as its really simple to use, is perfect.
So would be a lasting world peace!
Post by Erik Scheirer
There would be some overhead created by making it pluggable, though, but in the scheme of things it would be well worth the small amount of cpu cycles lost.
I just can't see the feasibility of this. Any good GC involves
careful interaction between the parts of the system which mutate
memory and those which manage it. Getting a highly performant GC
The low-level layout of objects.
The division of memory into various collections of objects (e.g. in a
GC scheme the old objects and the new object live in different spaces,
sometimes the new space moves each time a minor GC happens.
Efficient detection of whether an object is old or new.
For a GC requiring a 'write-barrier', efficient implementation of that
write barrier.
...
And to really get the most out of a GC, some of the low level
decisions can be platform and processor specific.
There are cascading design decisions to be be made. For example, lets
say we're making a generation scavenging GC. We need to capture
enough information as the mutator (the program) runs so that we can
find any new objects which are referenced by an old object. This is
How do we detect a store of a reference to a new object into an
old object with the lowest overhead.
How do we remember a store into an old object with the lowest overhead.
...
There are several strategies for detecting old vs new objects, each
A flag bit in the object header
Address range checking to see which space it's in, or not in.
On some platforms and processors, one might make use of the virtual
memory hardware and access privileges to detect such stores, but this
is highly non-portable and may or may not outperform other approaches.
Flag bits need to be maintained properly, and are expensive, see below.
Address range checking is more common, and goes back to the
interactions with the overall design of the "VM".
And what about how to remember the old objects which need to be
considered during a new object GC.
We could perhaps make a linked or set of "remembered" objects, but
this is expensive both in terms of space and speed.
Most GCs use some form of "card marking" where old space is broken up
in to 'cards' containing a range of memory. Cards are similar to
pages in a virtual memory system, and may or may not be the same in a
particular GC implementation. In such a scheme when a new object
reference is stored in an old object, the fact that has happened is
stored as a change to the card in which the old object resides. The
most obvious way to do this is to have a data structure somewhere
which has a bit for each card.
But on most processors setting individual bits is expensive involving
fetching, masking, and re-storing a larger datatype.
The Self guys recognized this and found that for the processors they
were working on using a byte rather than a bit for the mark, was much
better overall despite requiring eight times the space for the marks.
http://www.cs.ucsb.edu/~urs/oocsb/papers/write-barrier.pdf
And these are just some of the variations once one has chosen a
particular GC algorithm or perhaps one of a family of GC algorithms.
Now I know that LLVM attempts to do something like this,
http://llvm.org/docs/GarbageCollection.html
http://lhc-compiler.blogspot.com/2009/01/why-llvm-probably-wont-replace-c.html
The problem is that LLVM defines the interface between the mutator and
the GC "framework" in terms of C functions and function callbacks,
e.g. for the write-barrier, whereas a really efficient GC implements
the write barrier(and other GC bookkeeping tasks) in a few machine
instructions.
I fear that a pluggable GC would only let you play around with pretty
poorly performing GC alternatives.
--
Rick DeNatale
Blog: http://talklikeaduck.denhaven2.com/
Twitter: http://twitter.com/RickDeNatale
WWR: http://www.workingwithrails.com/person/9021-rick-denatale
LinkedIn: http://www.linkedin.com/in/rickdenatale
Yukihiro Matsumoto
2010-01-11 14:58:43 UTC
Hi,

In message "Re: [ruby-core:27531] Re: better GC?"
on Mon, 11 Jan 2010 23:32:22 +0900, Paul Brannan <***@atdesk.com> writes:

|IMO it's clear that there is no one-size-fits all option. I wonder how
|difficult it would be to make the GC pluggable, so alternate GC's could
|be provided as gems?

It's not easy. Most performance-oriented garbage collectors require a
write barrier, and unfortunately a pluggable write barrier tends to be
slower than it should be.
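
Concretely, reusing the hypothetical helpers sketched earlier in the
thread: a fixed collector can inline its barrier down to a test and a
store, while a pluggable one pays an indirect call at every store site:

#include "ruby.h"

extern int  rb_gc_is_old(VALUE obj);     /* hypothetical, as before */
extern int  rb_gc_is_young(VALUE obj);
extern void rb_gc_remember(VALUE obj);

/* Fixed collector: the compiler inlines this at every store site. */
static inline void wb_inline(VALUE holder, VALUE target)
{
    if (rb_gc_is_old(holder) && rb_gc_is_young(target))
        rb_gc_remember(holder);
}

/* Pluggable collector: the same check now hides behind a function
 * pointer the compiler cannot inline or eliminate. */
extern void (*rb_gc_wb_hook)(VALUE holder, VALUE target);
#define WB_PLUGGABLE(h, t) rb_gc_wb_hook((h), (t))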

matz.
Brent Roman
2010-01-08 18:13:22 UTC
Paul,

If compatibility with 'C' extensions is not a major concern for you, I'm
curious to learn why you haven't tried using JRuby? Or have you?

For real-time applications, what's needed is any sort of incremental GC to
replace the current "stop-the-world while I mark and sweep" collection MRI
does now. Generational GC is just one means to that end. MRI needs something
that incrementally collects garbage without copying objects. In my opinion,
MRI cannot simply "give up" compatibility with 'C' extensions without
becoming a new implementation entirely.

The run time of the current collector is roughly proportional to Ruby's
process size. An optimization that saves memory, even if it adds CPU
cycles, tends to make MRI faster on the whole when dealing with large
numbers of objects. For this reason alone, you might want to give the MBARI
patches a whirl.

- brent
Post by Paul Brannan
Post by Yukihiro Matsumoto
Yes, but unfortunately it's not small at all. GC has a lot of
trade-offs and difficult technical dependency. It's not that easy to
solve your frustration. I am happy to be proven wrong (that means
patches are welcome).
As I mentioned to you at the conference, GC is the primary reason I can
no longer use ruby for production code where I work. We use perl5 and
python instead, both of which use reference counting. (We disable the
cycle-finding GC in python and don't write code that produces cycles).
Adding reference counting to ruby would not work with extensions (they
don't know to increment the reference count) and would probably not be
what we (the ruby community) want anyway. The simplicity of writing
extensions (due to ruby's conservative GC) is part of what makes ruby
attractive to many people.
OTOH the dynamic language space is becoming more competitive every year,
and it would be a shame to see ruby fall behind because it has to remain
backward compatible.
Hypothetically, if ruby could give up backward compatibility with
existing extensions, would the problem become any easier?
Also, where are the bottlenecks in the GC today? Is it the number of
objects that have to be marked? Could gc_mark itself be optimized so
the cost of marking any given object is reduced?
Just brainstorming a bit...
Paul
Paul Brannan
2010-01-11 14:27:30 UTC
Post by Brent Roman
Paul,
If compatibility with 'C' extensions is not a major concern for you, I'm
curious to learn why you haven't tried using JRUBY? Or, have you?
For one application where we deployed both JRuby and Ruby 1.8, JRuby was
able to significantly outperform Ruby 1.8 in terms of total CPU
utilization and latency. However, JRuby with the default GC settings
still periodically became unresponsive for about 1-2 seconds throughout
the running time of the application. Switching to an incremental GC
made these gaps disappear, but by that point it was just an experiment
for my own curiosity; the amount of effort to setup JRuby and tune the
GC exceeded the time allocated to the project. In the end I went on a
2-month vacation, and when I came back, the entire application had been
rewritten in C++.

(in Ruby's defense here, the next 1-2 months of my time when I got back
were spent fixing the C++ code and adding missing features. So the
perceived cost of C++ was lower, but in the end the actual cost ended up
being much higher than the team expected).

Paul
Kornelius Kalnbach
2010-01-11 14:35:36 UTC
... In the end I went on a
2-month vacation, and when I came back, the entire application had been
rewritten in C++.
(in Ruby's defense here, the next 1-2 months of my time when I got back
were spent fixing the C++ code and adding missing features. So the
perceived cost of C++ was lower, but in the end the actual cost ended up
being much higher than the team expected).
And is it faster/more responsive now?

[murphy]
Paul Brannan
2010-01-11 14:53:01 UTC
Post by Kornelius Kalnbach
... In the end I went on a
2-month vacation, and when I came back, the entire application had been
rewritten in C++.
(in Ruby's defense here, the next 1-2 months of my time when I got back
were spent fixing the C++ code and adding missing features. So the
perceived cost of C++ was lower, but in the end the actual cost ended up
being much higher than the team expected).
and is it faster/more responsive now?
Yes, by two orders of magnitude.

(which significantly exceeds the original requirements)

Paul
Charles Oliver Nutter
2010-01-11 22:30:48 UTC
Post by Paul Brannan
For one application where we deployed both JRuby and Ruby 1.8, JRuby was
able to significantly outperform Ruby 1.8 in terms of total cpu
utilization and latency.  However, JRuby with the default GC settings
still periodically became unresponsive for about 1-2 seconds throughout
the running time of of the application.  Switching to an incremental GC
made these gaps disappear, but by that point it was just an experiment
for my own curiosity; the amount of effort to setup JRuby and tune the
GC exceeded the time allocated to the project.  In the end I went on a
2-month vacation, and when I came back, the entire application had been
rewritten in C++.
Some clarification of the pluggable GC stuff in the JVM (at least what
I know from Hotspot/OpenJDK...the other two major JVMs have their own
unique collectors).

* None of the major JVMs ever give heap space back to the system; if
they grow to 200MB, they'll never use less than 200MB. The
justification is that if you've run a process up to a certain size,
you're likely to need that size. It is possible to limit the growth
using a few simple flags that set minimum and maximum heap size (see
the example just after this list).
* The GCs are not *live* swappable; you pick them at startup and
that's what you use for the lifetime of the process.
* There are many tunable settings, but most of those are also not
live; you set them at startup.
* Many of the settings can be left at defaults since Hotspot will try
to pick an appropriate GC and GC settings for your hardware (and will
adjust some GC behavior at runtime depending on how your application
behaves).
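
For example, to pin the heap between 64MB and 256MB (the values and the
script name are illustrative; -J passes a flag straight through to the
JVM):

jruby -J-Xms64m -J-Xmx256m myscript.rb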

There are obviously way more tunables than most people ever need to
set, so generally running with the default "GC ergonomics" will allow
Hotspot to pick the best settings. It's not always perfect, of course,
so having the tunables is nice.

I thought it would be interesting to get some timings and GC output
for the three main JVM GCs, given the OP's benchmark:

The serial GC is the basic single-threaded collector.

~/projects/jruby ➔ time jruby -J-XX:+UseSerialGC -e "(1..7).each{|n| a
= ['a']*(10**n); a.inspect;}"
real 0m4.538s
user 0m4.359s
sys 0m0.324s

The Parallel GC uses multiple threads to reduce pauses. My system has
2 cores, so I believe it defaults to 2 GC threads...both this and the
Concurrent Mark/Sweep collector would be more interesting on a system
with more cores.

~/projects/jruby ➔ time jruby -J-XX:+UseParallelGC -e "(1..7).each{|n|
a = ['a']*(10**n); a.inspect;}"

real 0m5.489s
user 0m5.383s
sys 0m0.485s

~/projects/jruby ➔ time jruby -J-XX:+UseParallelGC
-J-XX:ParallelGCThreads=1 -e "(1..7).each{|n| a = ['a']*(10**n);
a.inspect;}"

real 0m8.984s
user 0m5.793s
sys 0m0.605s

~/projects/jruby ➔ time jruby -J-XX:+UseParallelGC
-J-XX:ParallelGCThreads=2 -e "(1..7).each{|n| a = ['a']*(10**n);
a.inspect;}"

real 0m5.251s
user 0m5.132s
sys 0m0.510s

~/projects/jruby ➔ time jruby -J-XX:+UseParallelGC
-J-XX:ParallelGCThreads=3 -e "(1..7).each{|n| a = ['a']*(10**n);
a.inspect;}"

real 0m5.685s
user 0m5.146s
sys 0m0.658s

The Concurrent Mark-Sweep collector does the mark and sweep phases of
GC concurrent with program execution, but still stops the world for
compacting:

~/projects/jruby ➔ time jruby -J-XX:+UseConcMarkSweepGC -e
"(1..7).each{|n| a = ['a']*(10**n); a.inspect;}"

real 0m4.860s
user 0m5.077s
sys 0m0.349s

The G1 "Garbage First" collector is a semispace collector that does
not have generations. It is intended to eventually replace the CMS GC
and provide more predictable pauses (CMS can occasionally cause really
long full GC pauses under high load):

~/projects/jruby ➔ time jruby -J-XX:+UnlockExperimentalVMOptions
-J-XX:+UseG1GC -e "(1..7).each{|n| a = ['a']*(10**n); a.inspect;}"

real 0m9.098s
user 0m9.709s
sys 0m1.076s

Here's a nice article on tuning GC for Java 5 (which is EOL, but the
article applies well to Java 6+):

http://java.sun.com/docs/hotspot/gc5.0/gc_tuning_5.html

It might help guide potential solutions for MRI's GC.

- Charlie
mathew
2010-01-17 21:39:08 UTC
Post by Brent Roman
If compatibility with 'C' extensions is not a major concern for you, I'm
curious to learn why you haven't tried using JRUBY?
Does JRuby run under Java RTS? Regular Java runtimes aren't realtime any
more than MRI is.


mathew
--
<URL:http://www.pobox.com/~meta/>
Charles Oliver Nutter
2010-01-17 21:52:24 UTC
Post by mathew
Post by Brent Roman
If compatibility with 'C' extensions is not a major concern for you, I'm
curious to learn why you haven't  tried using JRUBY?
Does JRuby run under Java RTS? Regular Java runtimes aren't realtime any
more than MRI is.
The current Java RTS from Sun is an implementation of Java SE 5.0,
which is a runtime supported by JRuby. So yeah, it probably does. I
don't know enough about RTS to know if we're doing anything that would
be problematic. I'm grabbing a 90-day trial to try it myself.

- Charlie
Brent Roman
2010-01-12 10:23:26 UTC
This paper and excellent slide set seem very promising:

http://matsu-www.is.titech.ac.jp/~endo/papers/endo-ismm2002-gc.pdf
http://matsu-www.is.titech.ac.jp/~endo/papers/endo-ismm2002-gc.ppt

Toshio Endo and Kenjiro Taura adapted the Boehm conservative GC to
*dramatically* reduce pauses with incremental collection. Their GC adds
about 10% to execution times on a single-core CPU, but improves execution
times over basic mark-and-sweep on multi-core CPUs by allowing marking to
run in parallel on multiple cores.

This GC seems like a good fit to me. Since most folks running Ruby already
use multi-core processors, it should be an all-around performance "win".

Has anyone looked into grafting this GC into MRI?

Does anyone see any conceptual problems in doing so?

- brent
Post by Yukihiro Matsumoto
Hi,
In message "Re: [ruby-core:27135] better GC?"
|Could I put in a small plea for a better GC?
Yes, but unfortunately it's not small at all. GC has a lot of
trade-offs and difficult technical dependency. It's not that easy to
solve your frustration. I am happy to be proven wrong (that means
patches are welcome).
matz.
Rick DeNatale
2010-01-12 15:31:40 UTC
Post by Brent Roman
http://matsu-www.is.titech.ac.jp/~endo/papers/endo-ismm2002-gc.pdf
http://matsu-www.is.titech.ac.jp/~endo/papers/endo-ismm2002-gc.ppt
Toshio Endo and Kenjiro Taura adapted the Boehm conservative GC
to *dramatically* reduce pauses with incremental collection.  Their GC adds
about 10% to execution times on a single core CPU, but improves
execution times over basic mark-and-sweep
on multi-core CPU -- by allowing marking to run
in parallel on multiple cores.
This GC seems like a good fit to me.  Since most folks running Ruby already
use multi-core processors, it should be an all around performance "win".
Has anyone looked into grafting this GC into MRI?
Does anyone see any conceptual problems in doing so?
Well, I've just scanned the paper, but I have a few comments.

1) The algorithm they present relies on hooking into the virtual
memory implementation in order to implement a write barrier:
write-protecting pages, catching the exception caused when a store is
done into a protected page, marking the page as dirty, and then
un-write-protecting the page. I'm not sure how portable this is across
various combinations of operating system and processor.
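
For the record, the mechanism itself is plain POSIX; a compressed sketch
of the idea, where mark_page_dirty and the 4KB page size are illustrative:

#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

extern void mark_page_dirty(void *page);  /* hypothetical GC bookkeeping */

/* Fault handler: record the dirty page, unprotect it, and let the
 * faulting store retry. This *is* the write barrier. */
static void wb_handler(int sig, siginfo_t *info, void *ctx)
{
    void *page = (void *)((uintptr_t)info->si_addr & ~(uintptr_t)4095);
    mark_page_dirty(page);
    mprotect(page, 4096, PROT_READ | PROT_WRITE);
}

/* At the start of a mark phase, protect the heap so that the first
 * write to each page traps exactly once. */
static void arm_write_barrier(void *heap, size_t len)
{
    struct sigaction sa;
    sa.sa_sigaction = wb_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);
    mprotect(heap, len, PROT_READ);
}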

2) IMHO, conservative GCs are compromises which attempt to provide GC
to languages without a strong 'object model', and without help from
the compiler. By object model here, I mean more than just how objects
are laid out or how methods are found: a design which is factored
so that, at the least, references to objects vs. non-objects can be
distinguished, and reference changes can be efficiently tracked,
usually with some support from the compiler and/or the implementation
of the 'byte-codes'.

There's no reason why Ruby couldn't be implemented with such an object
model; in fact, I don't think it's far from that.

There are decades of experience in implementing VMs for such languages
which provide efficient, accurate garbage collection, while also
providing at least one, and usually two, interfaces for invoking code not
generated by the language's compiler:

1) An interface for invoking procedural library code which is totally
unaware of which language invoked it. Such code takes a list of
parameters and returns result value(s). This code need have no
interaction with the GC; the interface ensures that object references
don't pass across it: all the parameters get converted to, and all
result values get converted from, basic data types (ints, strings,
structs ...). This is what the FFI does.

2) An interface for using a low-level language for writing
'primitives' which are aware that they need to interact with the VM
and the garbage collector. This requires that the interface provides
macros and/or function calls/callbacks and a set of rules to be
followed in writing the primitives.

The biggest problem with MRI is that it started out using one API for
both of these. I suspect that a large percentage of Ruby extensions
are really cases of use case 1.

If Ruby were to take the path of deprecating use of the extension API for
use case 1 in favor of FFI, and then evolving the extension API to
support having objects move during GC (which the similar APIs for
Smalltalk, Java, and VM-based languages have done), it could actually
get a competitive VM/GC.
--
Rick DeNatale

Blog: http://talklikeaduck.denhaven2.com/
Twitter: http://twitter.com/RickDeNatale
WWR: http://www.workingwithrails.com/person/9021-rick-denatale
LinkedIn: http://www.linkedin.com/in/rickdenatale
Brent Roman
2010-01-13 00:57:29 UTC
Rick,

1) Unix, Solaris, and Windows all have the necessary user-space virtual
memory management hooks. Nothing this GC is doing is processor-dependent.
What specific platforms concern you here?

2) MRI cannot use a copying GC for the reasons Evan Phoenix explained
in a previous message. As a practical matter, there's a huge code base,
both in extensions and in the core itself, that would have to be carefully
reengineered to cope with moving objects. And we're not even considering
extensions written for corporate applications that are most likely not
open source.

However, having said that:

http://www.engineyard.com/blog/2009/5-things-youll-love-about-rubinius/

The above explains that Rubinius supports the MRI C-API by passing around
opaque object handles. This appears to work for many 'C' extensions, but,
having looked at the MRI core's 'C' code, I doubt handles would work for
much there. Also, I'm reading hints that those handles represent
significant extra overhead the Rubinius VM must maintain for those objects
to remain accessible from 'C'.

In my mind, it comes down to the fact that, once you start moving objects
around, MRI will necessarily start to resemble Rubinius internally. So why
not work to improve Rubinius if that's the direction you'd like to go?

- brent
Post by Rick DeNatale
Well, I've just scanned the paper, but I have a few comments.
...
Kirk Haines
2010-01-12 16:34:22 UTC
Permalink
Post by Brent Roman
Toshio Endo and Kenjiro Taura adapted the Boehm conservative GC to
*dramatically* reduce pauses with incremental collection. Their GC adds
about 10% to execution times on a single-core CPU, but improves execution
times over basic mark-and-sweep on multi-core CPUs -- by allowing marking
to run in parallel on multiple cores.
This GC seems like a good fit to me. Since most folks running Ruby already
use multi-core processors, it should be an all-around performance "win".
Not unless people are running their code on multi-core processors
where some of the cores aren't being utilized.

If superior execution time is only achieved by offloading extra work
to an idle core, then that really isn't a gain.


Kirk Haines
Kurt Stephens
2010-01-12 20:11:26 UTC
Permalink
Post by Kirk Haines
Not unless people are running their code on multi-core processors
where some of the cores aren't being utilized.
If superior execution time is only achieved by offloading extra work
to an idle core, then that really isn't a gain.
If that's the case, then no one would use multi-core processors at
all. How the threads are mapped to a processor (i.e. another core or
not) is an OS issue.

The issue here is that the majority of the work can be done in a
separate thread which allows the mutator (in this case Ruby) to continue
while the GC's thread is doing its job.

Even if the threads are not mapped to different processors, the
perception is that the memory manager (GC) does not stop the world to do
its work.

KAS
Kirk Haines
2010-01-12 21:09:29 UTC
Permalink
 If that's the case, then no one would use multi-core processors at all.
 How the threads are mapped to a processor (i.e. another core or not) is an
OS issue.
Maybe you missed my point? See below.
 The issue here is that the majority of the work can be done in a separate
thread which allows the mutator (in this case Ruby) to continue
 while the GC's thread is doing its job.
If I have one core, under that particular GC scheme, my stuff runs 10% slower.

If I have two cores, my stuff runs X% faster, because a bunch of the
GC work is offloaded to that second core.

So, if I have my code already running on two cores, there aren't any
more idle cores available to actually offload GC work to. So my code
on both cores is now running 10% slower.

I.e., overall, we're doing more work to get a given execution time or
throughput using the Boehm conservative GC mentioned. That GC can spread
the work out among more cores, so if the machine isn't already running at
max CPU capacity, the job can appear to run faster -- but that breaks down
once all of the cores are being utilized. For example, let's say I have a
4-core machine, running code that, when pushed to its limits, utilizes all
four cores at 100%.

If I'm running my code under the Boehm conservative GC, then so long
as only two or three cores are being pushed to their limits, or so
long as none of the four cores are being pushed too hard, there is
excess CPU capacity to handle the GC tasks, thus making my code run
with greater throughput. However, as soon as all of the cores get
busy enough, the situation on the multicore machine is no different
than it would be on a single-core machine -- my code has
execution times that are 10% longer. So, looking at the machine
as a whole, its overall processing capacity, at least for CPU-bound
tasks, gets reduced versus a GC that doesn't exact this tax. It is
only a benefit when the cores are not being used to capacity.


Kirk Haines
Magnus Holm
2010-01-12 21:13:08 UTC
Permalink
Which leads us to why we need a pluggable GC: there simply isn't one GC
that fits all uses.

// Magnus Holm
 If that's the case, then no one would use multi-core processors at all.
 How the threads are mapped to a processor (i.e. another core or not) is an
OS issue.
Maybe you missed my point? See below.
If I have one core, under that particular GC scheme, my stuff runs 10% slower.
...
Brent Roman
2010-01-13 01:54:25 UTC
Permalink
The main point of this particular GC algorithm was to reduce the maximum
latency it imposed. It seems to manage this very well, in a way that's
compatible with the existing MRI structure. But the cost is that long-term
average throughput may be reduced by 10% or so. Most incremental
algorithms are not as fast as those that don't allow for interruption.

So, as you point out, if you've got all your cores and hyperthreads fully
utilized, this GC will hurt average throughput.

However, I'd submit that:

1) most systems running Ruby today have multiple cores and/or
hyperthreading;

2) desktop systems rarely keep all their cores busy;

3) the more cores you have, the more likely it is that one or more is not
being fully utilized at any given instant. It's really hard to keep an
8-core server CPU-bound for extended periods, especially when it's
serving web pages.

So, most Ruby installations would see a net increase in throughput with
this GC. But, again, that's not its main design goal. Reducing latency is.
A 10ms max latency would eliminate the GC-induced freezes that distract
users and allow Ruby to be used more in financial, robotics, audio and
video processing.

- brent
Post by Kirk Haines
If I have one core, under that particular GC scheme, my stuff runs 10% slower.
...
Roger Pack
2010-01-14 20:06:14 UTC
Permalink
Post by Brent Roman
...
This GC seems like a good fit to me. Since most folks running Ruby already
use multi-core processors, it should be an all-around performance "win".
Has anyone looked into grafting this GC into MRI?
I looked into it once and...didn't get too far [I wasn't quite sure
how to mark the roots, and once I did get something that somewhat
worked, it seemed slower than the default GC].

If I did it again I would probably do something like:

1) disable the GC (GC_disable)
2) keep the periodic calls to garbage_collect (like it does now)
3) within garbage_collect, mark the roots and then call collect (GC_gcollect)

That might be a more sane way of trying it out.
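
In sketch form (GC_INIT, GC_disable, GC_enable and GC_gcollect are real
libgc calls; the root-marking helper is hypothetical, standing in for
whatever gc.c actually does):

#include <gc/gc.h>

/* hypothetical: register MRI's roots (VM stacks, global tables, ...)
 * with Boehm, e.g. via GC_add_roots() on the relevant segments */
extern void register_vm_roots_with_boehm(void);

void experimental_gc_init(void)
{
    GC_INIT();
    GC_disable();    /* never let libgc collect behind the VM's back */
}

void garbage_collect(void)   /* called periodically, as today */
{
    register_vm_roots_with_boehm();
    GC_enable();
    GC_gcollect();   /* one full collection at a known-safe point */
    GC_disable();
}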
That being said, typically GC only uses 10% of a Ruby program's time (I
think), so it's not a super high performance hit currently.

I've also been working on some "native type" wrappers [1] whose goal
is basically to reduce the amount of memory Ruby has to traverse in
order to do its mark and sweep.

ex:

a = GoogleHashSparseIntToInt.new
a[3] = 4 # it saves these away as native C ints, so ruby's GC basically ignores all members
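
Under the hood this kind of wrapper boils down to Data_Wrap_Struct with no
mark callback, since the container holds no VALUEs for the GC to traverse.
A rough sketch (not the actual google_hash source; the fixed-size map is
just for illustration):

#include <ruby.h>
#include <stdlib.h>

/* toy native container: the GC never looks inside it */
typedef struct { long keys[64]; long vals[64]; int n; } int_map;

static void int_map_free(void *p) { free(p); }

static VALUE int_map_alloc(VALUE klass)
{
    int_map *m = calloc(1, sizeof(int_map));
    /* NULL mark function: nothing inside is a VALUE */
    return Data_Wrap_Struct(klass, NULL, int_map_free, m);
}

static VALUE int_map_set(VALUE self, VALUE k, VALUE v)
{
    int_map *m;
    Data_Get_Struct(self, int_map, m);
    m->keys[m->n] = NUM2LONG(k);   /* stored as native C longs */
    m->vals[m->n] = NUM2LONG(v);
    m->n++;
    return v;
}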

I'd be happy to add more functionality [like native {saved away}
strings] or a wrapper to sets/priority queues/std::vector if anybody
would find it useful just let me know.

-r
[1] http://github.com/rdp/google_hash
Brian Mitchell
2010-01-14 20:35:00 UTC
Permalink
Post by Roger Pack
that might be a more sane way of trying it out.
That being said, typically GC only uses 10% of a Ruby program's time (I
think), so it's not a super high performance hit currently.
I'm not sure what kinds of app code you've been running but I see
somewhere between 20% and 60% spent in memory management. This
includes well tuned code which avoids garbage when possible. Keep in
mind that you can't simply run a benchmark with no real heap. You'll
have to artificially add 60MB to 200MB of objects depending on what
your target runtime footprint will be.
Post by Roger Pack
I've also been working on some "native type" wrappers [1] whose goal
is basically to reduce the amount of memory Ruby has to traverse in
order to do its mark and sweep.
a = GoogleHashSparseIntToInt.new
a[3] = 4 # it saves these away as native C ints, so ruby's GC
basically ignores all members.
I'd be happy to add more functionality [like native {saved away}
strings] or a wrapper to sets/priority queues/std::vector if anybody
would find it useful just let me know.
-r
[1] http://github.com/rdp/google_hash
These sorts of libraries would be excellent to start publicizing. I'd
also be interested in doing some smarter implementations for hashes
along the lines of Lua's tables. Lua's ropes are also worth looking at
since there is a lot of string concat heavy code out there.

Io (iolanguage.com), for example, makes heavy use of unboxed type
collections to allow fast vector operations.

Brian.
Lourens Naudé
2010-01-16 03:43:21 UTC
Permalink
Hi,

I've had this idea to use threaded code
(http://en.wikipedia.org/wiki/Threaded_code) for the GC mark phase for
some time, as it's very much like a VM core, "dispatching" AST nodes and
objects respectively. A minimum of branch misprediction overhead is very
important for a fast marking phase.

Here's a link to a patch against REE ( http://github.com/FooBarWidget/rubyenterpriseedition187-248 ) with some more context :

http://code.google.com/p/rubyenterpriseedition/issues/detail?id=28&colspec=ID%20Type%20Status%20Priority%20Milestone%20Summary

With the existing macros ko1 extracted to vm_exec.h on 1.9 (perhaps with
some minor refactorings for use outside the opcode loop as well), this
could be a minor patch on that MRI version.
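
As a strawman, the inner dispatch of a threaded mark loop could look
something like this (a toy assuming GCC's labels-as-values extension; the
object layout and work queue are placeholders, not gc.c's real
structures):

#include <stddef.h>

typedef enum { M_ARRAY, M_HASH, M_LEAF } mark_kind;

typedef struct mobj {
    mark_kind kind;
    struct mobj **refs;    /* child references */
    size_t nrefs;
    int marked;
} mobj;

/* 'list' is a pre-sized work queue seeded with the roots. */
static void mark_threaded(mobj **list, size_t head, size_t tail)
{
    static const void *const dispatch[] =
        { &&do_children, &&do_children, &&do_leaf };
    mobj *o;

/* Fetch the next unmarked object and jump straight to its handler:
 * one indirect jump per object instead of a loop around a switch,
 * which is where the branch-misprediction savings come from. */
#define DISPATCH()                        \
    do {                                  \
        do {                              \
            if (head == tail) return;     \
            o = list[head++];             \
        } while (o->marked);              \
        o->marked = 1;                    \
        goto *dispatch[o->kind];          \
    } while (0)

    DISPATCH();

do_children:                    /* arrays, hashes: enqueue children */
    for (size_t k = 0; k < o->nrefs; k++)
        list[tail++] = o->refs[k];  /* sketch: assumes spare capacity */
    DISPATCH();

do_leaf:                        /* nothing to traverse */
    DISPATCH();
#undef DISPATCH
}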

Thoughts?

- Lourens
Post by Brian Mitchell
I'm not sure what kinds of app code you've been running but I see
somewhere between 20% and 60% spent in memory management.
...
Roger Pack
2010-01-16 05:03:47 UTC
Permalink
Post by Lourens Naudé
I've had this idea to use threaded code (http://en.wikipedia.org/wiki/Threaded_code) for the GC mark phase
Benchmarks?
-r
Roger Pack
2010-01-19 02:49:41 UTC
Permalink
Post by Brian Mitchell
Post by Roger Pack
[1] http://github.com/rdp/google_hash
These sorts of libraries would be excellent to start publicizing. I'd
also be interested in doing some smarter implementations for hashes
along the lines of Lua's tables. Lua's ropes are also worth looking at
since there is a lot of string concat heavy code out there.
Interesting. I suppose a cord is pretty much functionally equivalent
to a linked list with native types (in this case, a string). I could
come up with something like that.

I did also learn recently of the NArray library, which apparently
stores arrays of native types succinctly, so there are tools out there.

Thanks.
-r
Kurt Stephens
2010-01-19 06:37:08 UTC
Permalink
Post by Roger Pack
Interesting. I suppose a cord is pretty much functionally equivalent
to a linked list with native types (in this case, a string). I could
come up with something like that.
...
The Boehm (BDW) GC has a GC-aware implementation of string cords in C.
I think there is a C++ binding also.
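
For reference, cords make concat-heavy code cheap because they share
structure instead of recopying on every append. A tiny example (CORD_cat,
CORD_len and CORD_to_char_star are the real cord.h calls; the include
path may vary by install):

#include <stdio.h>
#include <gc/cord.h>

int main(void)
{
    CORD s = CORD_EMPTY;
    int i;

    for (i = 0; i < 100000; i++)
        s = CORD_cat(s, "chunk ");              /* no O(n) recopying */

    printf("%lu bytes\n", (unsigned long)CORD_len(s));
    printf("%.12s...\n", CORD_to_char_star(s)); /* flatten only at the end */
    return 0;
}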

KAS
Roger Pack
2010-01-28 02:51:39 UTC
Permalink
Post by Brian Mitchell
These sorts of libraries would be excellent to start publicizing. I'd
also be interested in doing some smarter implementations for hashes
along the lines of Lua's tables. Lua's ropes are also worth looking at
since there is a lot of string concat heavy code out there.
I'd be happy to hack something up if it would be useful--contact me offline.
-r

Paul Brannan
2010-01-11 14:52:07 UTC
Permalink
Post by Roger Pack
Could I put in a small plea for a better GC?
(1..7).each{|n| puts n; a = ['a']*(10**n); a.inspect;}
when n gets high enough almost 50% of the time is spent in GC.
Somewhat frustrating.
I modified this program slightly (imo cases n=1..6 aren't useful for a
profile):

***@random:~/tmp> cat test.rb
100.times do |n|
  puts n
  a = ['a']*(10**6)
  a.inspect
end

The results aren't at all surprising. Callgrind reports 52% of the
application's run time is spent in gc_mark. 21% of the time is spent in
gc_mark_children, which means the majority of the time the objects
are immediate objects or have already been marked (I'm guessing the
latter in this program).
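
(To reproduce, something like "valgrind --tool=callgrind ruby test.rb"
followed by callgrind_annotate on the resulting callgrind.out.<pid> file
should work; exact flags may vary by valgrind version.)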

This probably isn't very representative of a real-world application due
to the small number of objects, but the large size of the array.

It makes me think of two things, though:

1. Is gc_mark getting inlined? If it were, marking the array could be
done without so many function calls.

2. If there were many objects, what would be the cost of resetting the
FL_MARK flag on all the objects in gc_sweep? Could this cost be
eliminated by using a 2-bit counter instead of a single flag?

RBASIC(v)->flags = 0;                     /* when the object is created */
counter = ((counter + 1) % 3) + 1;        /* before each mark phase; cycles 1,3,2 -- never 0 */
if (p->as.basic.mark_counter == counter)  /* to check if marked */
p->as.basic.mark_counter = counter;       /* to mark */
;                                         /* no need to reset all the flags at the end now */
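
To make that concrete, here's a self-contained toy of the idea (the
field and function names are illustrative, not MRI's):

#include <stdio.h>

typedef struct {
    unsigned mark_counter;   /* 0 = freshly allocated, 1..3 = epochs */
} basic;

static unsigned epoch = 1;

/* cycle through 1,3,2,... -- never 0, so new objects are never
 * mistaken for marked ones */
static void begin_mark_phase(void)    { epoch = ((epoch + 1) % 3) + 1; }
static int  is_marked(const basic *p) { return p->mark_counter == epoch; }
static void mark(basic *p)            { p->mark_counter = epoch; }

int main(void)
{
    basic a = {0}, b = {0};

    begin_mark_phase();
    mark(&a);                              /* a survives, b doesn't */
    printf("a: %d, b: %d\n", is_marked(&a), is_marked(&b)); /* 1, 0 */

    begin_mark_phase();                    /* no per-object reset needed */
    printf("a: %d\n", is_marked(&a));      /* stale epoch => 0 */
    return 0;
}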

Hoping to spark some thought/discussion,

Paul