Add jemalloc, lock Rust version #1623

Merged: 1 commit merged into rust-lang:master on Feb 13, 2019

Conversation

@sgrif (Contributor) commented on Feb 12, 2019

Short version:

We have a memory issue after switching to 1.32 which came from the
allocator switch. 1.31 used jemalloc 4. jemallocator uses jemalloc 5,
which behaves differently. With jemalloc 5 as configured in the PR, we
still have an issue which needs to be addressed eventually, but the
behavior is improved to the point where I'm not concerned about it
causing a service disruption. My current theory points to the problem
coming from civet, which we're removing soon anyway. Looking into this
further will require a lot of time and effort, so I'd like to see if we
have a major improvement from removing civet before going further. I
believe that the behavior that we're seeing with this config is the same
as a rare behavior we saw on Rust 1.31, so I do not think this is any
riskier than locking back down to 1.31. This is also what has been in
production for the last several days as we've been debugging.

The reason for the `let _` when enabling background threads is that
they're not supported on all platforms (notably macOS). The function is
safe to call on unsupported platforms. Folks who are more familiar with
jemalloc than I am recommended this over `#[cfg(not(macos))]`.
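
For reference, a minimal sketch of what this ends up looking like, assuming the module-style `jemalloc_ctl` API (`background_thread::write`); the exact call site and crate versions used by this PR may differ:

```rust
use jemallocator::Jemalloc;

// Route every allocation in this binary through jemalloc 5.
#[global_allocator]
static ALLOC: Jemalloc = Jemalloc;

fn main() {
    // Background threads aren't supported on every platform (notably macOS),
    // but the call itself is safe everywhere, so the Result is ignored
    // instead of gating the call behind a #[cfg(...)] attribute.
    let _ = jemalloc_ctl::background_thread::write(true);

    // ... start the web server as usual ...
}
```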

This PR also locks down the Rust version so we don't unexpectedly have
issues from a version upgrade mixed with code changes again, or an
unexpected full build with no cache during an incident. I expect that we
will just bump this immediately when a new stable version comes out, but
we should make sure we always deploy right after bumping it.
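
Purely as an illustration (this PR may pin the version in CI or buildpack config instead), a `rust-toolchain.toml` is one common way to lock the compiler so a deploy can't silently pick up a new stable release:

```toml
# Illustrative only: pins rustup to one specific release, so bumping the
# compiler is always an explicit, reviewable change followed by a deploy.
[toolchain]
channel = "1.32.0"
```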

Everything after this is more detail on what I've said above, and/or
documenting what I've found so it doesn't get lost.

Long version:

After upgrading to Rust 1.32, we saw what appeared to be a large memory
leak. I've been spending a lot of time over the last week attempting to
narrow this down, as well as come up with a permanent fix.

The fact that the 1.32 upgrade caused this makes sense, as it changed
the default allocator from jemalloc 4 to the system allocator. What
doesn't make sense is why the system allocator would cause us to have a
major memory leak, but jemalloc 4 causes us to have completely stable
usage, and jemalloc 5 does something completely different (more on that
later).

From looking into this we've learned a few interesting things:

- We can run out of memory and it's completely fine.
  - If we run out of memory, we will start swapping to disk which in
    theory should cause major perf issues, but in practice it barely
caused a noticeable effect on response times
- We are more affected by our choice of allocator than we thought.
- Our memory usage is in fact coming from our server, not nginx.
  - I've long wondered why exactly we were using so much memory, but
    haven't had time to look into it in detail until now. I've always
    assumed it was nginx's buffers, but this makes it clear that most
    or all of our memory usage is in fact coming from the server.
- Whatever is causing this issue seems to require production levels of
  traffic, as we haven't been able to reproduce locally or in staging.
- The underlying problem almost certainly comes from one of our C
  dependencies, which I'll elaborate on below.
- We are reading and writing pages to disk (aka swapping) even when
  we're nowhere near max memory usage. My understanding is this should
  not be happening.
- Our allocations don't appear to correlate to crate uploads, which is
  the only time I'd expect us to be allocating a ton of memory. The
  spikes *sometimes* come from uploads, but not always. And the amount
  that gets allocated is way larger than the size of the crate being
  uploaded (0.18MB crate correlated with a 113MB increase in RSS)

Here's the behavior we've seen with various allocators:

- 1.31 default allocator (jemalloc 4 unprefixed)
  - Memory usage very rapidly grows to *some* number (usually between
    200MB and 350MB) and remains extremely stable at that number until
    the server restarts. The number it grows to is different every
    restart, but whatever number it reaches is the same between both
    servers.
  - On rare occasions, there is a spike in memory usage around 1-2 hours
    before restart. I've suspected in the past this is related to the
    connection pool exhaustion issue, since IIRC the memory usage spiked
    the last time we had that issue occur in production. The spike only
    affects a single server when it happens.
- 1.32 default allocator (system allocator)
  - Memory usage grows rapidly and unbounded, consuming all available
    RAM within 3 hours and continuing to grow into swap beyond that. I
    do not know if it eventually reaches an upper bound as I've never
    let it run for long after it begins swapping
- jemallocator (jemalloc 5, prefixed, no background threads)
  - Similar to system allocator, but much slower growth
- jemallocator unprefixed (jemalloc 5, unprefixed, no background
  threads)
  - Memory usage grows to ~150MB, and is either stable or very slowly
    grows towards ~200MB. Memory usage is similar between servers, but
    not exactly the same. "stable" in this context means not obviously
    leaking or growing in unreasonable ways, but it moves around much
    more than jemalloc 4 did (25MB or so vs <5MB)
    - It's almost always only one server that grows, but I've seen both
      grow at least once
  - After 16 hours, memory usage grows extremely rapidly on one or
    occasionally both servers to some number (usually between 400MB and
    520MB) and stays there until server restart.
    - The number it grows to appears to be roughly 2.25x the usage before
      the growth (e.g. ~200MB before the spike landing at roughly 450MB).
      By *roughly* I mean within 1-2% of that number, and it exhibits this
      behavior extremely consistently.
  - Because unprefixing jemalloc's symbols caused the change here, it's
    pretty clear that it's a leak in memory allocated by one of our C
    dependencies. We have quite a few of these, but the ones that would
    make the most sense to be the source are either civet, miniz (coming
    from flate2), libgit2, or libpq (coming from Diesel)
    - Diesel has been run through valgrind in the past, and I did a
      brief code audit as well, so I don't think it's libpq
    - It wouldn't make a lot of sense for libgit2 to be the cause, but
      once we've moved index updates off the web server there's very
      little reason we can't just subshell out to git
    - civet makes the most sense to be the cause, and is something we're
      intending to remove
    - flate2 would also make sense and is much harder to remove
  - The 16 hour time period before the spike was very consistent, and
    not correlated with a time of day, so I don't think it was triggered
    by specific traffic patterns
  - Once the spike happens memory usage tends to remain pretty stable
    there, with one notable exception of the first day this was deployed
    where it grew to consume all available memory, then randomly decided
    to allocate 116MB of swap, and shortly afterwards it freed that
    amount of RSS. At the time I thought the fact that it stopped at
    almost exactly 512MB of RAM was too specific to be a coincidence, but
    I now believe that it was a coincidence.
- jemallocator unprefixed, `jemalloc_ctl` (jemalloc 5, unprefixed,
  background threads enabled)
  - This is the config in this PR
  - Similar behavior to background threads turned off, but it takes
    22-23 hours for the spike to occur
- The behavior of the jemallocator crate is the same on both 1.32 and
  1.31
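
For context on the prefixed/unprefixed distinction above: by default the jemallocator crate builds jemalloc with prefixed symbols, so only Rust allocations go through it and the C dependencies keep using the system `malloc`. Its `unprefixed_malloc_on_supported_platforms` feature exports the standard `malloc`/`free` symbols so the C dependencies (civet, miniz, libgit2, libpq) allocate through jemalloc as well. A hedged sketch of the dependency declarations (versions shown are illustrative):

```toml
[dependencies]
# Unprefixed symbols are what let the C dependencies allocate through
# jemalloc; without this feature we get the "prefixed" configuration above.
jemallocator = { version = "0.1", features = ["unprefixed_malloc_on_supported_platforms"] }
# Used at startup to enable background threads (the `let _` described earlier).
jemalloc-ctl = "0.2"
```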

My current working theory is that we do actually have a memory leak, but
what is being leaked is something extremely small. The reason this
causes such a huge difference in behavior on the system allocator is
that these small leaked allocations are causing fragmentation issues.
For this to be the case, it would require that we are slowly requesting
larger and larger allocations over time, which is what leads me to
believe that civet is the cause.

I suspect the growth is from civet growing its request/response buffers
(which should be the largest single allocations we have), but the leak
is something very small that is just causing the allocator to be unable
to re-use previously freed blocks. It's also possible that these
re-allocations are just causing this much fragmentation naturally
without the leak, but I have a hard time believing the system allocator
is actually performing *that* much worse here. The slow consistent
memory growth before the huge spike also implies the small leak to me.

This would also explain some of the behavior of jemalloc, and why we're
seeing that spike. jemalloc puts smaller allocations into the same
arena, so would prevent the fragmentation caused by leaking smaller
allocations, and potentially would be able to hide it completely for a
while. The fact that the spike is consistently 2.25x the previous max
RSS is too specific not to be the growth factor on something. I think it
is jemalloc growing one of its larger arenas, but it could also be the
size of some internal buffer used by civet or something else.

I believe that the behavior we're seeing with jemalloc 5 isn't actually
different overall from what we saw with jemalloc 4, other than the spike
happening sooner. I think the spike we very rarely saw in jemalloc 4
would consistently have occurred if the server didn't restart every 24
hours.

I suspect switching from civet to hyper will remove this issue entirely.
If it doesn't, I intend to dig in further. The next steps would be to
log more from jemalloc on a regular interval, get some stack traces from
jemalloc for large allocations, and possibly even running valgrind in
production for brief periods. Most of these things will require a lot of
work and have a very low signal to noise ratio, which is why I've chosen
to stop here at a point where I'm confident we don't have an impending
outage, and to rule a few more things out before digging deeper.
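
As a sketch of the periodic jemalloc logging mentioned above, assuming the module-style `jemalloc_ctl` API (`epoch`, `stats`); in practice this would presumably run as a background task inside the server rather than a bare loop:

```rust
use std::{thread, time::Duration};

fn log_jemalloc_stats_forever() {
    loop {
        // Most jemalloc statistics are cached; advancing the epoch refreshes them.
        if jemalloc_ctl::epoch::advance().is_ok() {
            let allocated = jemalloc_ctl::stats::allocated::read().unwrap_or(0);
            let resident = jemalloc_ctl::stats::resident::read().unwrap_or(0);
            println!(
                "jemalloc: {} bytes allocated, {} bytes resident",
                allocated, resident
            );
        }
        thread::sleep(Duration::from_secs(60));
    }
}
```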

Other things that I have done which are not in this PR or reflected in
the code base are:

- Enabling more verbose logging on Heroku's side
  - This gives us more granularity than the metrics graph
    (every 20 seconds, and also separated by server)
  - Additional stats beyond the metrics graphs (disk cache, pages
    written to disk)
- Reached out to support to see if we can disable the daily restart
  - I don't intend to do this right now even if it's possible, but it's a
    good option to have to aid us in debugging this and the connection
    pooling issue.
  - It's not something they offer out of the box, but the platform team
    is looking into making it possible for us

@carols10cents (Member)

Is there anything remaining preventing us from upgrading to 1.32 then, or is it just that you'd like to deploy that separately?

@sgrif (Contributor, Author) commented on Feb 13, 2019

Nope, nothing prevents us from upgrading to 1.32, which is why I included it here (1.31 is in prod right now, but that was just to verify that behavior was identical between 1.31 and 1.32 with this config).

@carols10cents (Member)

Ah, I see. I was comparing this to what was on production, and that's where I saw the Rust version at 1.31. My mistake!

Given that this is what's on production, this should get merged imo, so:

@bors r+

@bors commented on Feb 13, 2019

📌 Commit 648e431 has been approved by carols10cents

@bors commented on Feb 13, 2019

⌛ Testing commit 648e431 with merge 1959fa4...

bors added a commit that referenced this pull request Feb 13, 2019
Add jemalloc, lock Rust version

@bors commented on Feb 13, 2019

☀️ Test successful - checks-travis
Approved by: carols10cents
Pushing 1959fa4 to master...

@bors bors merged commit 648e431 into rust-lang:master Feb 13, 2019

@sgrif (Contributor, Author) commented on Feb 13, 2019 via email

@sgrif sgrif deleted the sg-jemallocator branch March 9, 2019 01:33