
Enable straight-to-jar compilation #597

Merged (29 commits into sbt:develop on Oct 8, 2018)
Conversation

@lukaszwawrzyk (Contributor) commented Sep 17, 2018

This PR represents my work on implementing straight-to-jar compilation in zinc (#305).

To enable the feature, one specifies the output as a .jar file, which causes scalac to write class files directly to that jar. Writing to a jar already worked before this PR, but several things had to be adjusted for actual incremental compilation to work.

The most important things that had to be handled:

  • pruning - deleting files from the jar instead of from a folder
  • merging - if we have a previous output jar, it has to be merged with the one produced by the next compilation
  • javac output - javac is not able to compile to a jar, so I zip its output and merge it with the scalac output
  • representing jared products - I used syntax like: /develop/zinc/target/output.jar!sbt/internal/inc/Compile.class
  • code that relies on classes being plain files - had to be branched to also work with jars

Most of the details related to the feature are in the sbt.internal.inc.STJ object, with something similar in xsbt.STJ, as it is difficult to share code between the two.
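
For illustration only, here is a minimal sketch of building and splitting such a jared-class string, assuming the jar!relative/path convention above (the names are placeholders, not the exact helpers added in this PR):

    import java.io.File

    // Illustrative sketch of the `output.jar!relative/path.class` convention described
    // above; names are placeholders, not the actual API introduced by this PR.
    object JaredClassSketch {
      private val Separator = "!"

      def make(jar: File, relClass: String): String =
        s"${jar.getPath}$Separator${relClass.replace('\\', '/')}"

      def split(jaredClass: String): (File, String) = {
        val idx = jaredClass.indexOf(Separator)
        (new File(jaredClass.substring(0, idx)), jaredClass.substring(idx + 1))
      }
    }

    // e.g. split("/develop/zinc/target/output.jar!sbt/internal/inc/Compile.class") yields
    // (new File("/develop/zinc/target/output.jar"), "sbt/internal/inc/Compile.class")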

Optimizations

ZipFileSystem was not good enough on its own, as it rewrites jars. I took that code and kept only the parts that manipulate the index: reading and writing the index from ZipFileSystem was the most efficient implementation I could find. On top of that I implemented the required operations in a performant manner (a rough sketch of the merge follows this list).

  • merging: The jars are concatenated, except for the index. The indices are then read from both files, merged, and the new index is written at the end.
  • deleting: Files are only removed from the index.
  • transactional class file manager: The index is stored at the beginning of compilation. In case of a failure, the old index is written back to the jar.
  • reading stamps: The whole index is read once and timestamps are cached.
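
To make the merging bullet above concrete, here is a rough sketch of the idea, using a hypothetical index API as a stand-in for the modified ZipFileSystem code (names and signatures are illustrative only):

    import java.io.File

    // Hypothetical stand-ins for the index manipulation done by the modified
    // ZipFileSystem code; only the merge flow mirrors the description above.
    trait ZipIndex {
      def shiftOffsets(by: Long): ZipIndex // rebase entry offsets after concatenation
      def ++(other: ZipIndex): ZipIndex    // combine entries; later ones win on name clashes
    }

    trait IndexBasedMergeSketch {
      def readIndex(jar: File): ZipIndex
      def dataSectionSize(jar: File): Long // bytes before the central directory
      def truncateIndex(jar: File): Unit   // drop the central directory
      def appendDataSection(target: File, source: File): Unit
      def writeIndex(jar: File, index: ZipIndex): Unit

      // Merge `source` into `target` without recompressing or rewriting entry data:
      // concatenate the data sections, shift the source offsets, write one merged index.
      def mergeInto(target: File, source: File): Unit = {
        val targetIndex = readIndex(target)
        val sourceIndex = readIndex(source).shiftOffsets(dataSectionSize(target))
        truncateIndex(target)
        appendDataSection(target, source)
        writeIndex(target, targetIndex ++ sourceIndex)
      }
    }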

Other changes

The API of ReadStamps was altered: I added a reset() method to easily allow resetting the cache of product timestamps between compilations. This is because I made the stamper stateful to avoid reopening the jar for each product (which would kill performance).
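
Roughly the shape of the idea, as a sketch only (the real ReadStamps interface in zinc has more methods and different types):

    // Sketch only: illustrates the stateful product stamper with a reset hook,
    // not the actual zinc ReadStamps interface.
    trait CachedProductStamps {
      // served from a cache filled by reading the jar index once on first access
      def productLastModified(relClass: String): Long
      // clears the cache so the next compilation re-reads the (possibly changed) jar
      def reset(): Unit
    }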

Scripted tests were updated to run for both STJ and regular compilation. Running scripted with STJ requires changing a hardcoded flag in IncHandler; I was just using it for development. I am open to discussion on how to do this properly and whether we should, e.g., run all scripted tests twice for each build, and how to implement that.

A slight addition is a flag to disable compression when exporting analysis (for performance).

Another small addition is to ignore changes of the -d option for javac, as it is overridden anyway.

This PR also contains multiple changes related to the Windows file system, mostly about closing things properly to avoid locks (mostly on jars). This includes fixing scripted tests on Windows (a large percentage of them were failing because of the inability to clear the temp dir between tests). A couple of them are still flaky, but I didn't investigate further.


THIS PROGRAM IS SUBJECT TO THE TERMS OF THE BSD 3-CLAUSE LICENSE.

THE FOLLOWING DISCLAIMER APPLIES TO ALL SOFTWARE CODE AND OTHER MATERIALS CONTRIBUTED IN CONNECTION WITH THIS SOFTWARE:
THIS SOFTWARE IS LICENSED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE AND ANY WARRANTY OF NON-INFRINGEMENT, ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA,
OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. THIS SOFTWARE MAY BE REDISTRIBUTED TO OTHERS ONLY BY EFFECTIVELY USING THIS OR
ANOTHER EQUIVALENT DISCLAIMER IN ADDITION TO ANY OTHER REQUIRED LICENSE TERMS.
ONLY THE SOFTWARE CODE AND OTHER MATERIALS CONTRIBUTED IN CONNECTION WITH THIS SOFTWARE, IF ANY, THAT ARE ATTACHED TO (OR OTHERWISE ACCOMPANY) THIS SUBMISSION (AND ORDINARY
COURSE CONTRIBUTIONS OF FUTURES PATCHES THERETO) ARE TO BE CONSIDERED A CONTRIBUTION. NO OTHER SOFTWARE CODE OR MATERIALS ARE A CONTRIBUTION.

@typesafe-tools

A validation involving this pull request is in progress...

@typesafe-tools

The validator has checked the following projects, tested using dbuild, projects built on top of each other.

Project Reference Commit
sbt develop sbt/sbt@3f1ae8b
zinc pull/597/head 805e953
io develop sbt/io@36abd94
librarymanagement develop sbt/librarymanagement@bb2c73e
util develop sbt/util@965de89

❌ The result is: FAILED

@typesafe-tools

A validation involving this pull request is in progress...

@typesafe-tools

The validator has checked the following projects, tested using dbuild, projects built on top of each other.

Project Reference Commit
sbt develop sbt/sbt@3f1ae8b
zinc pull/597/head a635675
io develop sbt/io@36abd94
librarymanagement develop sbt/librarymanagement@bb2c73e
util develop sbt/util@965de89

✅ The result is: SUCCESS

@eed3si9n (Member)

@lukaszwawrzyk Thanks for the contribution.

@eed3si9n (Member) left a comment

Some comments.

import java.nio.file.Paths
import java.util.zip.ZipFile

class STJ(outputDirs: Iterable[File]) {
Member

Could you add Scaladoc explaining what STJ stands for, and what this class does, before we forget?

Member

Should this class be marked final?

type JaredClass = String
type RelClass = String

def init(jar: File, cls: RelClass): JaredClass = {
Member

Could you add Scaladoc here, and perhaps consider renaming this method?
I normally use "init" to mean initializing the STJ class, but it doesn't look like that's what you're doing here.
Maybe a better name would be "jaredClassString"?

Contributor Author

Yes, that name is unfortunate. It was the first method I created when I started working on this, and it has stayed like that the whole time.
I think that, ideally, for the sake of readability I could have a value class JaredClass wrapping the string, with methods that extract its parts and a companion object with factories. Though since this is actually mostly kept as a File, I suspect there would be plenty of conversions. I will try to experiment with that and see if I can improve readability.

Contributor Author

As for Scaladoc, I will add it in the places you suggested. But aside from that, is there a general rule to put it on, e.g., every public method I added, or just on less obvious code?

Member

I think it would be good to generally document public methods for the Zinc project, since many people are involved.

@@ -240,6 +240,22 @@ class IncrementalCompilerImpl extends IncrementalCompiler {
case Some(previous) => previous
case None => Analysis.empty
}

val compileStraightToJar = STJ.isEnabled(output)
Member

Are there any unit tests for straight to jar behaviors?

Contributor Author

There are no unit tests; I was relying on scripted tests. I will take a look at what should potentially be tested.

Member

scripted tests are fine too.

@jvican (Member) commented Sep 18, 2018

Thanks for the contribution. I'll have a look at this at some point in the week. Two first thoughts:

  1. I'm not comfortable merging BSD-3 licensed code. What is the policy regarding this now that we're Apache 2?
  2. Can we add more tests? The tests we have here I think are insufficient for such a big PR.

@lukaszwawrzyk (Contributor Author)

@jvican

  1. I am not an expert on licensing. I found this: https://softwareengineering.stackexchange.com/questions/40561/is-bsd-license-compatible-with-apache As I understand it, it is no problem at all to include BSD-3 licensed code in software licensed under Apache 2.
  2. I found the scripted tests to be pretty exhaustive. They actually drove the development and revealed a lot of places where I had to change something to get things to work. I would say that if scripted tests are sufficient for current zinc, they are still sufficient for this PR. Zinc should work just as before in terms of invalidations, products, etc.; the only difference is that the files are put in jars.
    I adjusted assertions in scripted tests to e.g. look for products also in the jar. It is enough to toggle one flag to verify that the scripted tests pass while compiling to a jar. As I mentioned in the PR description, this is a manual change I was doing for development, and I would like to discuss how we want to go about testing it automatically. Other than that, I will look at the code changes and see what would be nice to test. I am also open to suggestions on exactly what should be tested.

@jvican (Member) commented Sep 18, 2018

As I understand it, it is no problem at all to include BSD-3 licensed code in software licensed under Apache 2.

Yes, they are in theory compatible. What I don't understand is why we should accept code that comes with another license, and what the implications for the future of the project are if we do so. We've already accepted several PRs with different licenses before and I'm not happy with it. I'd like to know what @eed3si9n thinks of this.

They actually drove the development and revealed a lot of places where I had to change something to get things to work. I would say that if scripted tests are sufficient for current zinc, they are still sufficient for this PR.

I'd like to see more tests, but not necessarily unit tests. I'll go through the code at some point and suggest places where we could test better. This is a big PR and we'll need to maintain it if something changes in the future, so I'm just trying to be more conservative than I usually am. That being said, good job 👍

@eed3si9n (Member)

As for the license, what I care about is whether the Lightbend CLA (https://www.lightbend.com/contribute/cla) has been agreed to or not.

@lukaszwawrzyk (Contributor Author)

I refactored JaredClass to be a value class with accessors and factory methods. It should make more sense now.

@jvican (Member) commented Sep 21, 2018

I've had the first pass through the PR but, before discussing concrete technical points, I feel it's important that we discuss this whole approach from a global perspective. You're doing lots of things here and it's quite a scary PR to merge, especially because you touch lots of things that can have unexpected side effects.

  1. Why are you doing most of the logic inside the compiler bridge sources? I think the changes in the bridge should be as small as possible, and the rest should be done in other zinc modules (note that by bridge sources I also mean all the dependencies of the bridge). Any required synchronization between the two should be formally defined in the analysis callback. Also note a very important point: this feature should work for Scala 3 too, so it's important that the logic here is as compiler-agnostic as possible.
  2. The zip utilities you've copy-pasted contain the following comment, and the fact that this code is not "production-ready" scares me quite a bit as a maintainer of this repository (especially having no tests whatsoever):
/*
 * This source code is provided to illustrate the usage of a given feature
 * or technique and has been deliberately simplified. Additional steps
 * required for a production-quality application, such as security checks,
 * input validation and proper error handling, might not be present in
 * this sample code.
 */
  3. Let's make a clear separation in the commits between the functional changes and the performance changes (with comments where it's due) and let's move any Windows-specific optimization to independent commits.
  4. Does your approach work with forked java compilation? Looks like it doesn't. Some high-level explanation of the runtime reflection you're using in LocalJava would be useful too.
  5. The fact that you're making the stamps reader stateful does worry me a lot. Zinc should not be holding pointers to open jars (this is an optimization that should take place in either the compiler or the build tool, but not in the incremental compiler). I propose you investigate ways to remove this and look for a better solution.

@lukaszwawrzyk (Contributor Author)

I will address what I can quickly and go back to this after the weekend.

  1. This is simply ZipFileSystem; it was introduced in JDK 7. It works really well for this use case. I don't know if the comment is still relevant; I can't say anything more about it.
  2. I added comments regarding that, along with many others I committed just now. The approach does work with forked compilation: the flag passed to a regular javac works fine; it only causes problems when we invoke javac programmatically.
  3. Right, this is quite shaky, especially given how the API looks. If it were a class with methods, created differently than the default function, it would be less surprising, but this was the fastest way to do it. I will try to do something about it. Note, however, that I am not holding any pointer to an open jar there. On first access to the stateful stamper the jar is opened, all stamps are read, and the jar is closed; we only hold the stamps. Opening, modifying and closing jars is inherent to this feature; we cannot avoid it. Note also that we only collect stamps from the actual output jar we produced in the past, not from other things on the classpath. Zinc has to read those stamps from class files in the case of regular compilation before starting another one, so with STJ it analogously has to read the stamps from the actual jar before starting the compilation. I initially had it stateless: it just checked whether the file was a jar, then opened it, read the stamp, and closed it again. But called in a loop this was obviously terribly slow, hence the cache.

@jvican (Member) commented Sep 21, 2018

This is simply ZipFileSystem; it was introduced in JDK 7. It works really well for this use case. I don't know if the comment is still relevant; I can't say anything more about it.

Could you look into a way we can reuse the JDK's ZipFileSystem (the one we have on the classpath) instead of copying your own? Also, note that adding JDK code to this repository is not license-compatible: JDK code is GPL, and that is not compatible with Apache 2 (so let's check really well that these files are really not GPL licensed but BSD, as the license header seems to suggest).

I added comments regarding that, along with many others I committed just now. The approach does work with forked compilation: the flag passed to a regular javac works fine; it only causes problems when we invoke javac programmatically.

Great, I'll have a look through them on Monday.

Regarding my first point, I would be much more comfortable with all this logic if it's removed completely from the bridge and added in a new interface JarManager akin to ClassFileManager. Such a data structure would immediately be compiler-agnostic and would work for both Scala 2 and Scala 3.

I initially had it stateless: it just checked whether the file was a jar, then opened it, read the stamp, and closed it again. But called in a loop this was obviously terribly slow, hence the cache.

I would propose another way to achieve this then: let's have a stamp index that is populated after every incremental compiler cycle (we don't need to read the timestamps from the JAR because we know no other process can write to it, let's just have a stateless stamper that takes a pre-computed list of class file names and jars). This approach would need to work for both Scala and Java generated classes. It would take more book-keeping but I think it's a cleaner approach and we would avoid the performance problem of opening the jar, reading the timestamp from the central zip index (I think it's called central directory?) and then closing it.

@stuhood left a comment

Thanks a lot for posting this. I'm strongly in favor of including support for jar'd outputs in zinc.

I had previously attempted this, so some of the approach here looks familiar. I think that if this lands, it might be good to follow up immediately to make the larger (but more mechanical and less error-prone) change to move from File to an enum of the two cases.

* This is enough as stamps are only read from the output jar.
*/
class CachedStampReader {
private var cachedNameToTimestamp: Map[String, Long] = _

This will not be threadsafe unless it is volatile, probably. Alternatively, should it just be a lazy val?

Regardless: I agree with Jorge that it would be preferable to refactor this such that either 1) stamps are read eagerly and are immutable, or 2) the stamps API has external caching.
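
For illustration, a minimal sketch of the eager/immutable direction using plain java.util.zip (not the PR's code): read every entry timestamp from the output jar once and keep only an immutable map.

    import java.io.File
    import java.util.zip.ZipFile
    import scala.collection.JavaConverters._

    // Sketch of an eager, immutable alternative: one pass over the jar's entries,
    // after which lookups are served from the resulting immutable map.
    object EagerJarStampsSketch {
      def readAll(jar: File): Map[String, Long] =
        if (!jar.exists()) Map.empty
        else {
          val zip = new ZipFile(jar)
          try zip.entries().asScala.map(e => e.getName -> e.getTime).toMap
          finally zip.close()
        }
    }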

* @param jar a jar file where the class is located in
* @return the class inside a jar represented as `JaredClass`
*/
def fromURL(url: URL, jar: File): JaredClass = {

This is an odd method... it feels like if the caller has already split the URL to find the file, re-splitting it here isn't necessary.

Contributor Author

I agree that it is odd. Take a look at sbt.internal.inc.classfile.Analyze#urlAsFile. Previously it was simply IO.urlAsFile and that was enough: when compiling to plain class files, the classloader would return either a plain file or a URL to a class in a jar, and IO.urlAsFile would just ignore the class part and take the jar. But now we can get a URL to a file in a jar, and we want to keep the class part as long as the jar is the output jar we are compiling to. So I reuse IO.urlAsFile to convert the URL to a file properly, and then, if the extracted file is our output jar, I know I need to create a JaredClass. To do this, I extract the relative path to the class and combine it with the already extracted file, to avoid doing the extraction twice. I just did not want to copy code from sbt.io.IO, especially since it handles some odd corner cases; I'd feel more comfortable if that stayed in one place. What do you think about this?
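
To illustrate the URL handling being discussed, a sketch assuming standard jar: URLs (the PR itself delegates the file extraction to sbt.io.IO.urlAsFile; the helper below is hypothetical):

    import java.io.File
    import java.net.{ URI, URL }

    // Illustrative only: splitting a classloader URL such as
    // "jar:file:/develop/zinc/target/output.jar!/sbt/internal/inc/Compile.class"
    // into the jar file and the relative class path inside it.
    object JarUrlSketch {
      def splitJarUrl(url: URL): Option[(File, String)] =
        if (url.getProtocol != "jar") None
        else {
          val spec = url.getFile // "file:/.../output.jar!/sbt/internal/inc/Compile.class"
          val sep = spec.indexOf("!/")
          if (sep < 0) None
          else Some((new File(new URI(spec.substring(0, sep))), spec.substring(sep + 2)))
        }
    }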

@@ -118,7 +119,11 @@ final class API(val global: CallbackGlobal) extends Compat with GlobalHelpers wi
if (!symbol.isLocalClass) {
val classFileName = s"${names.binaryName}.class"
val outputDir = global.settings.outputDirs.outputDirFor(sourceFile).file
val classFile = new java.io.File(outputDir, classFileName)
val classFile = if (STJ.enabled) {
new java.io.File(STJ.JaredClass(outputDir, classFileName))

It feels like it would significantly improve type safety to use an enum to represent the two cases. But I could also imagine that that would be an even larger refactoring than this already is... so perhaps better as a followup.

@stuhood commented Sep 22, 2018

Also, it would be good to confirm that you see some performance benefit for your use case (if you're able to share that data). I believe that at least part of the perf benefit we've seen when we've used jars is due to a combination of #547 and the fact that less analysis was actually stored without the analysis fixes included here. Of course, we'd absolutely expect to see some additional IO benefit.

If possible, I'll try to run this build through our perf suite in the next few days as well.

@romanowski (Contributor) commented Sep 22, 2018

Also, it would be good to confirm that you see some performance benefit for your use case

On Windows (since this is where most of our incremental compilation happens) we observed speedups ranging from 10 to 20% in compilation time.
The speedup is even better when we include our mechanism for reusing compilation artifacts (similar to Hoarder): caches are then applied even 6 times faster.

@lukaszwawrzyk (Contributor Author)

As for licensing, I guess I know everything now. I was looking at zipfs from OpenJDK 8. In that version it was located here http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/demo/nio/zipfs/src/com/sun/nio/zipfs as a demo licensed under BSD-3. This is the code that I used, hence the disclaimer about being just an example. I do have it as a jar on the classpath, available from /usr/lib/jvm/java-1.8.0-openjdk/jre/lib/ext/zipfs.jar, which is a bit odd for demo code. Nevertheless it is BSD-3.
Since Java 9 it is included properly in the main code base http://hg.openjdk.java.net/jdk9/dev/jdk/file/65464a307408/src/jdk.zipfs/share/classes/jdk/nio/zipfs/ZipFileSystem.java under GPL.
I looked roughly at the parts of the code that are relevant for us and didn't see anything crucial, but it was just a quick look.

Could you look into a way we can reuse JDK's ZipFileSystem (the one we have in the classpath) instead of copying your own?

ZipFS as it is could be used to provide all the operations we need for STJ, though, as I mentioned, not as efficiently. E.g. merging involves at a minimum creating a temp file, copying the content of both files there, appending a new index, and moving the temp file over the original one. I think it won't reapply compression, but I am not 100% sure. Removing files from a jar will work the same way: it will create a copy without the given entries.
I was also experimenting with Zip4j, though it was terribly inefficient as it was; I also had to edit it to make it reasonable, but it was still slower than the modified ZipFS.
Here is a repo with the benchmarks I was doing when optimizing the feature, though unfortunately I didn't store the results. The modified ZipFS was simply always the fastest, so I just took it.

@jvican (Member) commented Sep 24, 2018

ZipFS as it is could be used to provide all the operations we need for STJ, though, as I mentioned, not as efficiently. E.g. merging involves at a minimum creating a temp file, copying the content of both files there, appending a new index, and moving the temp file over the original one. I think it won't reapply compression, but I am not 100% sure. Removing files from a jar will work the same way: it will create a copy without the given entries.

I agree with your sentiment of tweaking performance, but are we sure that we're not trading correctness for performance? On the one hand, I'm skeptical about including zip file system code in this repository (without any unit tests) that was meant to be a demo. On the other hand, we cannot include the zip file system implementation in JDK 9, because that would be an outright license violation.

I think it would be good to look into reusing the implementation that Scalac uses for writing jars (which uses the good old ZipFile under the hood, IIRC) -- but I'm not sure if we can adapt that code to modify jar files. If we're really adamant on using NIO APIs, here's a half-working patch by Jason that migrated it to ZipFileSystem.

Aside from this observation, I see two other possibilities to move forward:

  1. You can mmap the intermediate jars created by the same zinc compiler run via NIO channels. This could help remove any IO overhead as all of the operations would be in-memory.
  2. Provide an interface to read/write jar files with a default implementation based on the default JDK utils. Outside this repo, you could plug in your own efficient implementation.

I'm not sure what the benchmark results for this custom zip file system look like, so some numbers could perhaps help me make up my mind and convince me that adding this code to the repo is a risk worth taking. When we're talking about more efficient, how much are we talking about? Only on Windows? Or also on Linux?

@lukaszwawrzyk (Contributor Author)

Why are you doing most of the logic inside the compiler bridge sources?
I don't feel like it is "most of the work". Actually, before I moved from 1.1 to the current develop branch, I only had changes in Analyzer, and STJ was simply a private object in there. But because of pipelining I had to make changes in API as well, and because of the workaround for a bug in the compiler (#559) I had to change CallbackGlobal too.

Fundamentally, I don't think I added conceptually new logic to the bridge that was not there before (except for the obvious handling of jars instead of plain files). For example, the Analyzer was already converting a symbol to a path to a real file in the file system and checking whether it exists; I just did the analogous thing for classes in a jar. Similarly, API constructs a path before registering a generated class; I just constructed this path as a JaredClass. findAssociatedFile, similarly to Analyzer, looks for an actual file in the file system that was created by a previous compilation; I have to look for it in the previous jar to maintain that functionality. So as I saw it, I was just keeping the logic where it was.
As for being compiler-agnostic, how do my changes make it less agnostic? What could potentially change in Scala 3 that would break it? I am just trying to understand.

Looking at it now, I could e.g. change the API of generatedNonLocalClass to take a relative path to a class plus the output dir (as it can be calculated with global.settings.outputDirs.outputDirFor(sourceFile)), and similarly for Analyzer.
Is this something you were thinking about?
On a side note, do we actually need to check whether the .class file actually exists? It is checked in the Analyzer, as we run after the jvm phase and can check it, but we do not check it in API (as we can't, before the jvm phase). Maybe it is not necessary in Analyzer then? That would notably simplify the code as well.
As for the changes in CallbackGlobal, this seems difficult to move. findAssociatedFile looks for an actual file in the file system, and if it doesn't find one, things will not work. It creates an AbstractFile and queries the classpath. I have no idea how to extract this properly.

Regarding my first point, I would be much more comfortable with all this logic if it's removed completely from the bridge and added in a new interface JarManager akin to ClassFileManager. Such a data structure would immediately be compiler-agnostic and would work for both Scala 2 and Scala 3.

Which logic exactly would you like to move? How do you see the interface of JarManager? Should the bridge code still have code like if (STJ.enabled) jarManager.locateClassInJar else locatePlainClassFile? And where should the JarManager even be used? In the bridge? By analogy to ClassFileManager, it is not used in the bridge.

Let's make a clear separation in the commits between the functional changes and the performance changes (with comments where it's due) and let's move any Windows-specific optimization to independent commits.

I am afraid that I won't have enough time to do that.

I would propose another way to achieve this then: let's have a stamp index that is populated after every incremental compiler cycle (we don't need to read the timestamps from the JAR because we know no other process can write to it, let's just have a stateless stamper that takes a pre-computed list of class file names and jars). This approach would need to work for both Scala and Java generated classes. It would take more book-keeping but I think it's a cleaner approach and we would avoid the performance problem of opening the jar, reading the timestamp from the central zip index (I think it's called central directory?) and then closing it.

Looking through the usages of Stamper.forLastModified, the stamper is used in ExportableCache. We should read the stamps there efficiently, though we can easily use a different interface there. This is needed so that the imported cache stamps match the analysis.
The second usage is where it is wrapped into the ReadStamps interface. That splits into two usages:
The first one is detectInitialChanges. This is where we look at the stamps of all classes in the jar. I guess to be on the safe side these need to be actually read, to guard against external changes to the jar between zinc runs; in particular, the jar could have been deleted.
The other one is the AnalysisCallback. After a compilation iteration (scalac + javac run) it picks up stamps from the jar. The stamps it is going to find are generated while scalac is running its jvm phase. I am not sure we have a way to get them without actually reading them. Even entries written in the same compiler run can have different timestamps, as scalac uses System.currentTimeMillis() for each entry. It feels like the only way to have matching stamps in the analysis and the jar is to either actually read them or replace them in the jar (which involves reading them first).

@lukaszwawrzyk (Contributor Author)

I agree with your sentiment of tweaking performance, but are we sure that we're not trading correctness for performance? On the one hand, I'm skeptical about including zip file system code in this repository (without any unit tests) that was meant to be a demo. On the other hand, we cannot include the zip file system implementation in JDK 9, because that would be an outright license violation.

Well, I am never 100% sure that what I write is correct; even with tests I could miss something. But after running multiple builds and tests compiled with the modified zinc, I am pretty confident that the zinc part of it is correct. Also, the scripted tests actually test it indirectly, in my opinion rather extensively.
The code was initially meant to be a demo, but that doesn't mean it was written carelessly. After all, with a few changes it made it into JDK 9. Before that, people were using it as well, probably without even knowing what the comments in the source code say.

I think it would be good to look into reusing the implementation that Scalac uses for writing jars (which uses the good old ZipFile under the hood, IIRC) -- but I'm not sure if we can adapt that code to modify jar files. If we're really adamant on using NIO APIs, here's a half-working patch by Jason that migrated it to ZipFileSystem.

I feel like a link for the patch is missing?

Anyway, if you look at IndexBasedZipOps and see what operations it requires, you will realize that we need a fairly low-level API that is not available directly as public fields in any library I looked at, except zip4j. With ZipFile I can read the index, or at least the list of entries, but not the file offsets; I cannot change them, and I cannot write a new index without rewriting files. This makes the following impossible: stashing and unstashing the index in the transactional class file manager, cheap merge, cheap remove. We can read stamps fairly efficiently with it, although still a bit more slowly.
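
For illustration, a sketch of the stash/unstash idea under a hypothetical index API (the concrete operations live in the PR's IndexBasedZipOps; names and signatures here are placeholders):

    import java.io.File

    // Hypothetical sketch: snapshot the jar's central directory before a compilation
    // run and write it back on failure, so entries appended by the failed run become
    // unreachable without rewriting any entry data.
    trait TransactionalIndexSketch {
      type Index
      def readIndex(jar: File): Index
      def writeIndex(jar: File, index: Index): Unit

      def withIndexRollback[A](jar: File)(compile: => A): A = {
        val stashed = readIndex(jar)
        try compile
        catch {
          case e: Throwable =>
            writeIndex(jar, stashed) // restore the pre-compilation index
            throw e
        }
      }
    }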

Aside from this observation, I see two other possibilities to move forward:

  1. You can mmap the intermediate jars created by the same zinc compiler run via NIO channels. This could help remove any IO overhead as all of the operations would be in-memory.
  2. Provide an interface to read/write jar files with a default implementation based on the default JDK utils. Outside this repo, you could plug in your own efficient implementation.

As for mmap, I think it is too big of a change; it would require too much work.
A pluggable ZipOps is totally feasible, though without a proper implementation plugged in, all performance benefits might be lost(?), making the feature useless out of the box. I don't have solid numbers on that, though.

I'm not sure what the benchmark results for this custom zip file system look like, so some numbers could perhaps help me make up my mind and convince me that adding this code to the repo is a risk worth taking. When we're talking about more efficient, how much are we talking about? Only on Windows? Or also on Linux?

I was testing it on both Linux and Windows. The difference on Windows was, AFAIR, usually bigger than on Linux. I ran for you the usual case of merging jars, i.e. merging a fairly small jar (9 classes) into a big one (the Scala library).

[info] Benchmark                      Mode  Cnt   Score   Error  Units
[info] MergeToBigBench.myZipfs        avgt   15   2,209 ± 0,068  ms/op
[info] MergeToBigBench.zipfs          avgt   15  13,792 ± 1,683  ms/op
[info] MergeToBigBench.zip4jOptimized avgt   15  13,364 ± 0,126  ms/op
[info] MergeToBigBench.zip4jOriginal  avgt   15  34,592 ± 0,303  ms/op

Here are the results of deleting about 20 classes from the scala-library jar (also on Linux):

[info] Benchmark                            Mode  Cnt   Score   Error  Units
[info] DeleteFromBigJarBench.myZipfs        avgt   15   2,192 ± 0,084  ms/op
[info] DeleteFromBigJarBench.zipfs          avgt   15  11,968 ± 1,635  ms/op
[info] DeleteFromBigJarBench.zip4jOptimized avgt   15  12,683 ± 0,213  ms/op
[info] DeleteFromBigJarBench.zip4jOriginal  avgt   15  33,618 ± 0,103  ms/op

@lukaszwawrzyk (Contributor Author) commented Sep 24, 2018

For a cleaner(?) approach to collecting stamps, I can simply add, say, an STJ.createCachedStamper(jar) or STJ.collectStamps(jar) method, then call it in addProductsAndDeps (conditionally, depending on the output) and do the same in detectInitialChanges. Though it would be nice to have a stamper that can handle both cases, like the current one I created, as it would be useful to have in e.g. ExportableCache.

With such an approach it wouldn't need to be lazy. Now it has to be, as the stamper is created before the compilation but can only collect valid data after the compilation (on first call).

What do you think about this idea? What should the API ideally be?

@lukaszwawrzyk (Contributor Author) commented Sep 24, 2018

The same benchmark for merging jars on Windows:

[info] Benchmark                       Mode  Cnt    Score    Error  Units
[info] MergeToBigBench.myZipfs         avgt    8   40.498 ±  7.770  ms/op
[info] MergeToBigBench.zip4jOptimized  avgt    8   59.918 ± 12.544  ms/op
[info] MergeToBigBench.zipfs           avgt    8  167.028 ± 19.323  ms/op

Deleting:

[info] Benchmark                             Mode  Cnt    Score    Error  Units
[info] DeleteFromBigJarBench.myZipfs         avgt    8   22.225 ±  3.037  ms/op
[info] DeleteFromBigJarBench.zip4jOptimized  avgt    8   43.073 ±  9.503  ms/op
[info] DeleteFromBigJarBench.zipfs           avgt    8  141.186 ± 31.852  ms/op

Note that absolute numbers are notably bigger.

@lukaszwawrzyk (Contributor Author)

I altered the stamping part as I mentioned. Stamps from jars are now read explicitly in the places they are needed, without altering the existing stamper APIs.

@lukaszwawrzyk (Contributor Author)

@stuhood @eed3si9n @jvican I think I addressed all the comments, and as I explained in them, I need some more clarification on how to proceed.

@jvican (Member) commented Oct 5, 2018

Looks like a spurious error, I restarted the CI: https://ci.scala-lang.org/sbt/zinc/131

@jvican (Member) commented Oct 5, 2018

Let's see if waiting a little bit more makes the CI pass. I don't see how your code could make dependency resolution of the bridges fail. Locally, coursier fetch works.

@lukaszwawrzyk (Contributor Author)

@jvican The build passed today (without changes). I also used the matrix to run the scripted tests with --to-jar.

@jvican dismissed eed3si9n’s stale review October 8, 2018 09:00

Feedback is addressed.

@jvican merged commit c8e1f53 into sbt:develop Oct 8, 2018
@stuhood commented Jan 7, 2019

@jvican : I see that this went to the develop branch... what does that mean in terms of it actually making it into a release? Will it be bound for 1.3.x? If so, when will 1.3.x be cut? Thanks!

@lihaoyi-databricks

@jvican I am also interested in this being published :) :) :)

@eed3si9n (Member)

I am planning to get sbt and Zinc 1.3.x RC-1 out within a month or so.
I was hoping to get the JDK 11 tests to pass by then. If people want a binary out, I can get an M1 out sooner.

@lihaoyi-databricks

I would love a M1 release with what we have now, including the straight-to-jar compilation, if that's at all possible.

@eed3si9n (Member)

1.3.0-M1 is out - https://repo1.maven.org/maven2/org/scala-sbt/zinc_2.12/1.3.0-M1/

@lihaoyi-databricks

Thanks Eugene, will take a look.

@lihaoyi (Contributor) commented Jan 11, 2019

I just tried this in our build. Unfortunately, I'm seeing a significant increase in downstream compilations (touch a file in an upstream module, incrementally re-compile downstream module) when I enable this compile-to-jar functionality. Using 1.3.0-M1 without compile-to-jar is fine (no regression)

The Zinc integration in question is here https://github.com/lihaoyi/mill/blob/master/scalalib/worker/src/ZincWorkerImpl.scala#L256-L338, in case anyone wants to review it and tell me if I'm doing anything wrong @jvican @lukaszwawrzyk

@stuhood commented Jan 11, 2019

@lihaoyi : A thing to watch out for would be your downstream modules being treated as monolithic jars.

Having integrated this into pants, and dumping the analysis in text mode, I see downstream deps for this example being stamped as:

binary stamps:
1 items
.pants.d/compile/zinc/b02abb352ecd/tmp7rxyVH.org.pantsbuild.dep.dep/current/z.jar -> lastModified(1547172843563)

...which indicates that rather than stamping the classfiles in the jar, it's currently stamping the whole jar.

@lihaoyi-databricks

I wonder if it's the way I'm looking up analysis files based on os.Path:

    val analysisMap0: Map[os.Path, os.Path] = upstreamCompileOutput.map(_.swap).toMap

    def analysisMap(f: File): Optional[CompileAnalysis] = {
      analysisMap0.get(os.Path(f)) match{
        case Some(zincPath) => FileAnalysisStore.binary(zincPath.toIO).get().map[CompileAnalysis](_.getAnalysis)
        case None => Optional.empty[CompileAnalysis]
      }
    }

Do I need to write some logic to map between classfile-in-jars and actual filesystem paths?

@stuhood commented Jan 14, 2019

@lihaoyi: I doubt that this is an issue with your particular integration; rather, it's some more work that needs to be done as a follow-up to this patch.

@lukaszwawrzyk (Contributor Author)

I'd gladly take any suggestions on how to improve performance, or on what seems to impact performance the most. While testing overall performance I was focusing on Windows, and we are very happy with the performance there. The compilation itself is slightly faster, and the whole workflow that this feature enables is much faster.

@lihaoyi-databricks

@lukaszwawrzyk just to be clear, we're not talking about a performance issue, but a feature regression: multi-module incremental compilation simply does not work with compile-to-jar enabled

@lukaszwawrzyk (Contributor Author)

Ah, sorry, I misunderstood the issue. I don't see anything obviously wrong in the implementation provided. The relevant timestamps are in the jar; the stamps of the jar itself should not matter. I don't remember doing anything special in the integrations we had with respect to analysis and things like that. It would be best to just debug the issue.

@stuhood commented Mar 5, 2019

More info here. With jars, I see module dependencies that were previously treated as member reference external dependencies being treated as library dependencies instead. This has the effect of disabling incremental compilation when the dependencies change.

@eed3si9n added this to the 1.3.0 milestone Apr 28, 2019
@eed3si9n (Member)

Could someone create a GitHub issue summarizing the regression/feature interactions involved in compile-to-JAR please? It sounds like the feature is not ready to be used without some more tweaking.
