This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Jenkins failing on Scala GPU #13080

Closed
zachgk opened this issue Nov 1, 2018 · 10 comments


@zachgk
Contributor

zachgk commented Nov 1, 2018

The Scala GPU task on Jenkins is failing. The failure occurs while running "make scalapkg" to build the core module.

A sample of the end of the error output:

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.22.0:test (default-test) on project mxnet-core_2.11: There are test failures.

[ERROR] 

[ERROR] Please refer to /work/mxnet/scala-package/core/target/surefire-reports for the individual test results.

[ERROR] Please refer to dump files (if any exist) [date]-jvmRun[N].dump, [date].dumpstream and [date]-jvmRun[N].dumpstream.

[ERROR] The forked VM terminated without properly saying goodbye. VM crash or System.exit called?

[ERROR] Command was /bin/sh -c cd /work/mxnet/scala-package/core && /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -jar /work/mxnet/scala-package/core/target/surefire/surefirebooter1191780993849458203.jar /work/mxnet/scala-package/core/target/surefire 2018-11-01T15-57-55_076-jvmRun1 surefire2207349951215643963tmp surefire_05921002131138985800tmp

[ERROR] Error occurred in starting fork, check output in log

[ERROR] Process Exit Code: 1

[ERROR] org.apache.maven.surefire.booter.SurefireBooterForkException: The forked VM terminated without properly saying goodbye. VM crash or System.exit called?

[ERROR] Command was /bin/sh -c cd /work/mxnet/scala-package/core && /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -jar /work/mxnet/scala-package/core/target/surefire/surefirebooter1191780993849458203.jar /work/mxnet/scala-package/core/target/surefire 2018-11-01T15-57-55_076-jvmRun1 surefire2207349951215643963tmp surefire_05921002131138985800tmp

[ERROR] Error occurred in starting fork, check output in log

[ERROR] Process Exit Code: 1

[ERROR] at org.apache.maven.plugin.surefire.booterclient.ForkStarter.fork(ForkStarter.java:671)

[ERROR] at org.apache.maven.plugin.surefire.booterclient.ForkStarter.fork(ForkStarter.java:533)

[ERROR] at org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:278)

[ERROR] at org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:244)

[ERROR] at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider(AbstractSurefireMojo.java:1194)

[ERROR] at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:1022)

[ERROR] at org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute(AbstractSurefireMojo.java:868)

[ERROR] at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)

[ERROR] at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:207)

[ERROR] at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)

[ERROR] at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)

[ERROR] at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)

[ERROR] at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)

[ERROR] at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)

[ERROR] at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)

[ERROR] at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:307)

[ERROR] at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:193)

[ERROR] at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:106)

[ERROR] at org.apache.maven.cli.MavenCli.execute(MavenCli.java:863)

[ERROR] at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:288)

[ERROR] at org.apache.maven.cli.MavenCli.main(MavenCli.java:199)

[ERROR] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

[ERROR] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

[ERROR] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

[ERROR] at java.lang.reflect.Method.invoke(Method.java:498)

[ERROR] at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)

[ERROR] at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)

[ERROR] at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)

[ERROR] at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)

[ERROR] -> [Help 1]

[ERROR] 

[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.

[ERROR] Re-run Maven using the -X switch to enable full debug logging.

[ERROR] 

[ERROR] For more information about the errors and possible solutions, please read the following articles:

[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException

[ERROR] 

[ERROR] After correcting the problems, you can resume the build with the command

[ERROR]   mvn <goals> -rf :mxnet-core_2.11

make: *** [scalapkg] Error 1

Makefile:606: recipe for target 'scalapkg' failed

build.py: 2018-11-01 15:57:55,920 Waiting for status of container a9ca51005111 for 600 s.

build.py: 2018-11-01 15:57:56,101 Container exit status: {'Error': None, 'StatusCode': 2}

build.py: 2018-11-01 15:57:56,101 Stopping container: a9ca51005111

build.py: 2018-11-01 15:57:56,103 Removing container: a9ca51005111

build.py: 2018-11-01 15:57:56,230 Execution of ['/work/runtime_functions.sh', 'integrationtest_ubuntu_gpu_scala'] failed with status: 2

script returned exit code 2

We have identified several Jenkins runs that produced this failure:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-13077/1/pipeline
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-13071/1/pipeline
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-13052/2/pipeline

The problem has also been observed on the dev Jenkins:
http://jenkins.mxnet-ci-dev.amazon-ml.com/blue/organizations/jenkins/restricted-publish-artifacts/detail/automate-maven/61/pipeline
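
The Surefire output above points to dump files (`*.dump`, `*.dumpstream`) in the surefire-reports directory. When a forked VM dies like this, those files are usually the first place to look. Below is a minimal sketch for collecting and printing them. The class name `SurefireDumps` and the helper `findDumps` are hypothetical names for illustration, and the default path is taken from the log, relative to the repository root.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: lists and prints Surefire dump files after a fork crash.
public class SurefireDumps {

    // Returns the *.dump and *.dumpstream files directly inside reportsDir, sorted by path.
    static List<Path> findDumps(Path reportsDir) throws IOException {
        if (!Files.isDirectory(reportsDir)) {
            return Collections.emptyList();
        }
        try (Stream<Path> files = Files.list(reportsDir)) {
            return files
                .filter(p -> {
                    String name = p.getFileName().toString();
                    return name.endsWith(".dump") || name.endsWith(".dumpstream");
                })
                .sorted()
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed default: the reports path named in the log, relative to the repo root.
        Path dir = Paths.get(args.length > 0
            ? args[0]
            : "scala-package/core/target/surefire-reports");
        for (Path dump : findDumps(dir)) {
            System.out.println("== " + dump);
            // Decode leniently; dump streams may contain non-text bytes.
            System.out.println(new String(Files.readAllBytes(dump), StandardCharsets.UTF_8));
        }
    }
}
```

Run it inside the build container after a failed run, passing the reports directory as the first argument if it differs from the default.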

@lanking520
Member

@mxnet-label-bot [flaky, test, GPU, Scala]

@lanking520
Member

@marcoabreu @lebeg @larroy Zach and I are investigating the issue to see whether we can reproduce it. Please take a look as well. Have there been any recent changes to the GPU configuration?

@lanking520
Member

Resolution: This issue typically appears on GPU; the CPU and Clojure builds do not crash.

@lanking520
Member

The issue appeared with a recent OpenJDK upgrade. Here is the workaround we will use:

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-surefire-plugin</artifactId>
    <configuration>
        <useSystemClassLoader>false</useSystemClassLoader>
    </configuration>
</plugin>

But I still think this problem reproduces only on GPU, not on CPU. There might be a difference in JDK version between the two VMs.
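
One way to check the suspected JDK difference is to print the JVM's identity in both the CPU and the GPU container and compare the output. This is a minimal sketch; the class name `JvmInfo` is hypothetical, and it reads only standard system properties.

```java
// Hypothetical helper: prints the JVM details needed to compare two build images.
public class JvmInfo {

    // Builds a multi-line description from standard JVM system properties.
    static String describe() {
        return String.join("\n",
            "java.version=" + System.getProperty("java.version"),
            "java.vendor=" + System.getProperty("java.vendor"),
            "java.vm.name=" + System.getProperty("java.vm.name"),
            "java.home=" + System.getProperty("java.home"));
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

If the two containers report different `java.version` values, that would support the theory that only one image picked up the OpenJDK upgrade.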

@marcoabreu
Contributor

marcoabreu commented Nov 1, 2018

No, there has not been an update recently.

@lebeg you rolled back, right?

@Chancebair fyi

@ChaiBapchya
Contributor

@zachgk since the PR is merged, should we close this?

@ddavydenko
Contributor

@zachgk, please close the issue, as it seems the PR has been merged, unless there is something else to do here to address it.

@zachgk zachgk closed this as completed Nov 4, 2018