Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Fix memory leak in pods #474

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

arichiardi
Copy link
Contributor

@arichiardi arichiardi commented Jun 23, 2016

This was particularly evident when doing boot watch test and has been fixed thanks to Micha's
intuition of adding a clojure.core/shutdown-agents in boot.pod/destroy-pod.

Fixes adzerk-oss/boot-test#26

@arichiardi
Copy link
Contributor Author

This still feels not right, please don't merge it.

I tracked down the call to the clojure agent pool to be here.

More to come 😄

@stephenbrady
Copy link
Contributor

The call to shutdown-agents is actually already done in the .close method on the pod, which is a call to this.

Having spent a lot of time profiling for memory leaks with pods (see #268), my own experience (once those fixed were made) has been that either your own code or libraries are creating resources that need to be cleared. core.async, for example, creates a global executor service that must be freed otherwise you'll see a class leak at the very least since Clojure itself cannot be gc'd.

We use this as our custom destroy-pod function:

(defn destroy-pod
  [^ClojureRuntimeShim pod]
  (when pod
    (boot.pod/with-eval-in pod
      (when (find-ns 'clojure.core.async.impl.exec.threadpool)
        (let [e (resolve 'clojure.core.async.impl.exec.threadpool/the-executor)]
          (.shutdownNow ^java.util.concurrent.ExecutorService (var-get e))))

      (when (find-ns 'clojure.core.async.impl.timers)
        ; This will cause the thread to go into a terminated state (and print a ThreadDeath error)
        (let [t (resolve 'clojure.core.async.impl.timers/timeout-daemon)]
          (try (.interrupt ^Thread (var-get t))
            (catch Throwable t))))

      (when (find-ns 'datomic.api)
        (let [f (var-get (resolve 'datomic.api/shutdown))]
          (f false))))

    (.close pod)

    (.close ^java.net.URLClassLoader (.getClassLoader pod))))

@arichiardi
Copy link
Contributor Author

Oh, I am using core.async in my app, let me try that!

@arichiardi
Copy link
Contributor Author

Same applies to thread pools I guess...which is maybe why you have the datomic shutdown...

@stephenbrady
Copy link
Contributor

Yep, any resource at all will do it. It's unfortunate that a decent number of libraries create resources for you by virtue of requiring them and don't properly expose those resources to be managed.

@arichiardi
Copy link
Contributor Author

Yes unfortunate, I avoided starting the db and added your destroy fn but still the memory grows and grows...I am using the Hikary pool, which now I need to try to shutdown properly I guess, btw thanks for the hint.

@stephenbrady
Copy link
Contributor

Sure thing. Good luck. I spent way too many hours chasing this; I feel the pain.

@arichiardi
Copy link
Contributor Author

It is still very weird that the biggest memory consumption happen in clojure-agent-send-off-pool-1...

@arichiardi
Copy link
Contributor Author

Ah well, not at all weird as we are running inside a future...https://github.com/arichiardi/boot/blob/fix-pod-leak/boot/core/src/boot/core.clj#L945

@stephenbrady
Copy link
Contributor

Yep, the symptom, not the cause.

@arichiardi
Copy link
Contributor Author

So basically every defstate I have is not an atom so it contains a reference to whatever it is right? So every defstate should actually be an nillable atom...

@stephenbrady
Copy link
Contributor

Seems like you're using mount. I'm only somewhat familiar with mount, but I thought it offered an api to terminate the resource held by the defstate var. Otherwise, you can just alter-var-root and set the var to nil (or explicitly close the resource at the var, if it has such a method)...or do the equivalent with an atom.

@arichiardi
Copy link
Contributor Author

arichiardi commented Jun 26, 2016

Yes it does provide that, on stop I will nil the atom(s) that I am going to use instead of having pure references... of course this is only a leak during recurring testing, potentially all apps with state leak in this way...

@arichiardi
Copy link
Contributor Author

arichiardi commented Jun 26, 2016

This also means that each boot task which wants to be used in a watch should provide a way to hook up a destroy function..

@arichiardi
Copy link
Contributor Author

About this, maybe we should enable custom code in destroy-pod, kind of like cleanup but for a specific pod?

@micha
Copy link
Contributor

micha commented Jun 29, 2016

Perhaps JMX can be used to find leaks automatically?

The pod-pool function already accepts a :destroy option that can be used to finalize things in the pod, and if you're calling destroy-pod manually you can evaluate any code you like in the pod before you do that, no?

@micha
Copy link
Contributor

micha commented Jun 29, 2016

Ah, maybe something like a shutdown hook that can be installed in the pod to perform cleanup...destroy-pod would call those before finishing?

@arichiardi
Copy link
Contributor Author

Yes you are right, there are already many hooks, but the problem arises when a task (boot-test) is executing in a pod and I don't have a way to execute custom custom at destroy time..

@arichiardi
Copy link
Contributor Author

Yeah exactly

@micha
Copy link
Contributor

micha commented Jun 29, 2016

makes sense to me

@arichiardi
Copy link
Contributor Author

Maybe instead of an oblect here https://github.com/arichiardi/boot/blob/fix-pod-leak/boot/base/src/main/java/boot/App.java#L326 we pass an (thread-safe) container of functions? I was checking where/when it is better to add this

@micha
Copy link
Contributor

micha commented Jun 29, 2016

Or an atom in boot.pod that you can conj thunks onto. Then destroy-pod could do something like

(with-eval-in p (doseq [x @boot.pod/destroy-hooks] (x)))

@arichiardi
Copy link
Contributor Author

Yes that would work too

@arichiardi
Copy link
Contributor Author

arichiardi commented Jun 29, 2016

Any objection on having a

(defn destroy-pod
  "Closes open resources held by the pod, making the pod eligible for GC."
  [pod]
  (when pod
    (*destroy-pod-hook*) ;; dynamic
    (.close pod)
    (.. pod getClassLoader close)))

instead?

The caller would need to do (simplified):

(boot
  (binding boot.pod/*destroy-pod-hook* my-cleaning-fn
    (boot-test/test)))

@arichiardi
Copy link
Contributor Author

The latter also means that all the code inside the binding will "see" the destroy function, aka, all the call to pods created inside the binding will call that function.
This is not a surprising behavior, the contrary, giving a super fine grained control (i.e.: one custom destroy hook per pod), is not really a common use case imho.

@micha
Copy link
Contributor

micha commented Jun 29, 2016

Possibly... although it seems like there is a more elegant solution somewhere, maybe?

Another option to consider is reaping defunct pods out-of-band. Another thread could periodically iterate over the pods in boot.pod/pods and if they've been destroyed it could perform actions on them to force them to die.

I would really prefer some way to use reflection or JMX debugging APIs or something like that to find threads that need to be shut down, automatically, without the need for hooks.

I would also like to find out which things can prevent a pod from being eligible for GC.

@micha
Copy link
Contributor

micha commented Jun 29, 2016

Regarding the *hooks* vs. pod/hooks, it might be a "bothism"; we might want both. The *hooks* thing is good when you need to inject finalization procedures into code that you don't control (like your boot-test/test example). The pod/hooks approach, on the other hand, allows the code that creates the pod to also install the finalization code, which isn't really feasible with the dynamic vars way.

@arichiardi
Copy link
Contributor Author

arichiardi commented Jun 29, 2016

Good points in both your comments, I was already working on this, but then stopped.
In the particular problem I am facing, I cannot know the pod boot.test is creating and unless boot.test/test itself provides a hook in the params, I cannot execute any cleaning code.

I could create a task that does what you said above though, I am in a watch, and I could scan the pods, see the ones that have been destroyed (how to identify them?) and just remove their reference/step heavily on them so that they really die 😄

@arichiardi
Copy link
Contributor Author

Also an atom probably would not solve my problem because again, how can I know the pod boot.test/test used?

@micha
Copy link
Contributor

micha commented Jun 29, 2016

Yes, that's why maybe we need both. I mean if I make a task that creates a pod and my task is loading core.async in the pod, then my task should also ensure it gets cleaned up. How would that work with the dynamic var?

@arichiardi
Copy link
Contributor Author

arichiardi commented Jun 29, 2016

Can you do a binding inside a binding? Well theoretically maybe yes but I agree it is not really intuitive.

@micha
Copy link
Contributor

micha commented Jun 29, 2016

You can use set! on the binding, but it just seems like a last resort. If we can solve this by reflection or debugger APIs and without any hooks that would be the ideal solution.

@arichiardi
Copy link
Contributor Author

Yeah..there are ways of solving this with brute force without changing boot code, for instance I could use robert.hooke in order to wrap destroy-pod. If noone else is going to have this issue of course we can leave it there...

In order to fix memory leaks in task-local pods, we need a way to execute custom cleaning at pod
destroy time. This was particularly evident when doing `boot watch test` in a resource heavy app.

Fixes adzerk-oss/boot-test#26
@micha
Copy link
Contributor

micha commented Jun 29, 2016

Or (alter-var-root #'boot.pod/destroy-pod #(comp % my-destroy-function))...

@arichiardi
Copy link
Contributor Author

Uhm this is odd, I have to the say that now I am a bit lost http://pastebin.com/ZHVZ4rk8

@arichiardi
Copy link
Contributor Author

arichiardi commented Jun 30, 2016

Being a pod pool in boot.test I am not surprised we reuse pods/classloaders, but the above shows UrlClassLoader are not reused (one leaks). The Agent CL part is actually:

(run! #(util/warn "Agent CL %s\n" %)
                  (->> (Thread/getAllStackTraces)
                       (keys)
                       (map #(.getContextClassLoader %))
                       (into #{})))

@micha
Copy link
Contributor

micha commented Jun 30, 2016

The pod pool creates new pods, it doesn't reuse them...

@arichiardi
Copy link
Contributor Author

arichiardi commented Jun 30, 2016

For completeness, the above is generated with:

(with-pass-thru _
            (run! #(util/warn "POD CL %s\n" %)
                  (->> boot.pod/pods
                       keys
                       (map #(.getClassLoader %)))
                       (into #{}))
            (run! #(util/warn "Agent CL %s\n" %)
                  (->> (Thread/getAllStackTraces)
                       (keys)
                       ;; (filter #(re-find #"clojure-agent" (.getName %)))
                       (map #(.getContextClassLoader %))
                       (into #{})))

@arichiardi
Copy link
Contributor Author

Some simple set mathematics and I have the classloader that leaks:

([#object[java.net.URLClassLoader 0x5ce1d55 "java.net.URLClassLoader@5ce1d55"]
  [#object[org.apache.logging.log4j.core.util.Log4jThread 0x5fb48c2 "Thread[Log4j2-Log4j2Scheduled-1,5,main]"]]])

@arichiardi
Copy link
Contributor Author

Oh man, really? https://logging.apache.org/log4j/2.x/manual/migration.html

Remove calls to org.apache.log4j.LogManager.shutdown(), they are not needed in version 2 because the Log4j Core now automatically adds a JVM shutdown hook on start up to perform any Core clean ups.

Starting in Log4j 2.1, you can specify a custom ShutdownCallbackRegistry to override the default JVM shutdown hook strategy.
Starting in Log4j 2.6, you can now use org.apache.logging.log4j.LogManager.shutdown() to initiate shutdown manually.

@arichiardi
Copy link
Contributor Author

Oddly enough, alter-var-root does not seem to work from outside the pod probably because I need to alter the destroy-pod inside it?

@arichiardi
Copy link
Contributor Author

@micha I went through a debugging session again and this is the result:

http://pastebin.com/NN0h9uA1

Maybe I asked you this question already, but shouldn't destroy-pod remove the reference in pod/pods when done?

I excluded the primary cause of my leak (log4j2), but by debugging I noticed this as well and I think it should be fixed, but maybe again I don't see other factors in play now.

As you can see in the bin as well, the pod-pool correctly makes and destroys (when :refresh is called) pods.

@micha
Copy link
Contributor

micha commented Jun 30, 2016

The pod/pods is a WeakHashMap, shouldn't interfere with GC I don't think.

@arichiardi
Copy link
Contributor Author

Ah right I remember now 😂 it is my printing keeping it there then? Because I see it growing and growing, even if I trigger the GC in VisualVM they are not GC-ed..someone is keeping them there

@micha
Copy link
Contributor

micha commented Jun 30, 2016

Seems like that's the symptom, not the cause. Those keys will remain in the weak hash map until the pods are collected by GC, not the other way around.

@arichiardi
Copy link
Contributor Author

arichiardi commented Jun 30, 2016

You are right, sorry to bother with these weird questions, but I am more and more confused. If I trigger the GC manually then the keys in pods correctly disappear. If I never do it, I eventually end up having java.lang.OutOfMemoryError: Metaspace.
It is my first time dealing with this kind of issue and I really would like to understand the dynamics, even if I might seem super newbie (well, I am ah ah).
What is going on here? 😄

@arichiardi
Copy link
Contributor Author

I can close this one, nobody has complained and I have not found a solution anyways 😄

@arichiardi arichiardi closed this Jul 11, 2017
@martinklepsch
Copy link
Member

@arichiardi but don't we all agree that it still exists? If so I think we should keep it open just for reference.

@arichiardi
Copy link
Contributor Author

To be honest I don't know if it still exists. I can leave it open yes, maybe it is better after all.

@arichiardi arichiardi reopened this Jul 11, 2017
@martinklepsch
Copy link
Member

Just a note to myself, potential steps to reproduce:

  • create lots of pods/pools
  • obtain references to all of them via boot.pod/pods
  • shutdown these pods
  • check with YourKit or JVisualVM for stale classes etc

# for free to join this conversation on GitHub. Already have an account? # to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Frequent java.lang.OutOfMemoryError: Metaspace
4 participants