Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Place source and build products in OUT_DIR #31

Merged

Conversation

jturner314
Copy link
Contributor

This has two benefits:

This has two benefits:

* It follows the directions specified in the Cargo docs. ("Build
  scripts may save any output files in the directory specified in the
  `OUT_DIR` environment variable. Scripts should not modify any files
  outside of that directory.")

* It makes `cargo clean` work properly. (When build products are
  placed outside of `OUT_DIR`, `cargo clean` cannot remove them, which
  can make issues very difficult to debug.)
@IvanUkhov
Copy link
Member

It was like this in the beginning. It was moved to the location of the extracted package to avoid recompiling OpenBLAS again and again. I’m not sure about this one. It would be nice to gather more feedback from those who use this package.

@jturner314
Copy link
Contributor Author

The Windows builds of CI are failing. Is that normal?

It was moved to the location of the extracted package to avoid recompiling OpenBLAS again and again.

Yeah, that behavior is nice in most cases.

The primary issue is that cargo clean doesn't work, and there are no warnings that it isn't working. I spent a long time debugging an issue that would have been much easier to diagnose (and detect in the first place) if cargo clean had removed openblas-src's build products.

Keeping the existing behavior would be fine with me if we implemented at least one of the following alternatives:

  • Warn the user somehow that cargo clean doesn't work for openblas-src, not just in the docs of openblas-src, which is usually a transitive dependency, but with something more visible.
  • Currently, the build script copies the source files into a directory that is target-specific. (It includes the target name in the directory name.) We should also make the directory specific to the C/Fortran compiler version. (That would have solved my particular debuggability problem.)

What do you think of these ideas?

@IvanUkhov
Copy link
Member

IvanUkhov commented Feb 19, 2020

Yes, it has always been failing on Windows. We never got it to work, as far as I remember.

There is this page, which is referenced to from all related crates:

https://github.com/blas-lapack-rs/blas-lapack-rs.github.io/wiki

It could be mentioned there. Not sure how many people read it, though.

Yes, making the shadow folder even more specific should be fine.

@jturner314
Copy link
Contributor Author

Okay, I've created #32 to make the build directory more specific.

@IvanUkhov
Copy link
Member

IvanUkhov commented Feb 21, 2020

@jturner314, I was thinking about this and #32, would it not help to introduce a new feature called inplace (to build locally in the build directory), and then enabling it in your application?

@jturner314
Copy link
Contributor Author

jturner314 commented Feb 21, 2020

@jturner314, I was thinking about this and #32, would it not help to introduce a new feature called inplace (to build locally in the build directory), and then enabling it in your application?

I've figured out the cause of the issue my application had, and I have a workaround for it. My motivation for either #31 or #32 is not for my application, but rather to help prevent other people from having similar problems with debugging that I did. In order to achieve that, the default behavior needs to change somehow. An inplace feature would be useful, but to achieve the debuggability goal, it would need to be enabled by default. (I'd suggest instead making the default be to place the binaries in OUT_DIR, and use a build-once feature to disable that behavior.)

For context, let me describe the issue I faced in more detail. I know now that the root cause is a bug in either OpenBLAS or gcc/gfortran that only appears when compiling OpenBLAS with gcc/gfortran 9.x, and only affects the computational results; it doesn't produce any warnings, build errors, or panics. I haven't had a chance to narrow down the cause further. I am working on a project that depends on openblas-src via ndarray-linalg.

  • In Nov. 2018, the current version of openblas-src was 0.6.1, and gcc/gfortran was 8.x. So, I ended up with OpenBLAS built in ~/.cargo/.../openblas-src-0.6.1/source_x86_64-unknown-linux-gnu with gcc/gfortran 8.x.

  • Around May 2019, gcc/gfortran updated to 9.x, but since OpenBLAS had already been built, it did not get rebuilt with gcc/gfortran 9.x even after cargo clean, so I didn't know that a clean build of OpenBLAS would cause problems. (This is the "detectability" problem.)

  • In Oct. 2019, I updated to the latest version of ndarray-linalg, which depended on openblas-src 0.7.0. I had never used openblas-src 0.7.0 before that point, so OpenBLAS was built using gcc/gfortran 9.x.

  • Around a month later, I discovered that I was getting incorrect computational results in some cases that weren't covered by unit testing. I wrote a program to sanity-check the erroneous computational results and used git bisect to identify the commit in my project where the issue first occurred (running cargo clean for each commit to perform a "fresh" build). That commit was when I updated ndarray-linalg (and a few other dependencies). So, I believed that there was a bug in one of the dependencies that I updated. I was faced with figuring out which of the dependencies caused the issue, and then isolating the problem within the dependency.

  • Luckily, I started using a new computer before getting a chance to investigate the problem. I checked that the erroneous behavior was still occurring with the new dependency versions (it was), and then I checked to make sure the previous commit was operating correctly (it wasn't). So, the commit prior to updating the dependencies had the issue too, but only on my new computer with a new-microarchitecture processor. As far as I knew, since I was building the same code after running cargo clean on both computers, the only difference was the hardware. The openblas-src dependency was the most hardware-dependent (since it uses different kernels for different microarchitectures), so I suspected that there was an bug in both versions of OpenBLAS with the new microarchitecture and in the newer version of OpenBLAS with the old microarchitecture. This seemed plausible if a (buggy) change to the new microarchitecture's kernel was then later applied to the old microarchitecture's kernel.

    While investigating this possibility, I noticed that openblas-src was placing build products in ~/.cargo/.../openblas-src-0.x.y that were not getting removed by cargo clean. I realized that in order to actually do a clean build, I needed to remove this directory too. So, I called cargo clean and removed the ~/.cargo/.../openblas-src-0.x.y on the old computer, and the issue occurred even with the old dependencies. So, the issue was not due to the new hardware or new dependencies, but something else. At this point, I knew that there was something different between the openblas-src-0.6.1 I had newly built and the openblas-src-0.6.1 binaries I had previously built (back in Nov. 2018 based on file modification timestamps). I discovered that the version of gcc/gfortran had changed from 8.x to 9.x in that time, and I performed a fresh build with gcc/gfortran 8.x (which was fortunately still available in the repositories) to confirm that the issue was present only with gcc/gfortran 9.x.

Either of the following would have allowed me to detect the issue much earlier (around the time gcc/gfortran updated to 9.x) and debug it more easily:

  1. if cargo clean removed the OpenBLAS binaries (i.e. Place source and build products in OUT_DIR #31), since I call cargo clean at least every six weeks when a new version of Rust is released
  2. if the OpenBLAS binaries were rebuilt when gcc/gfortran updated (i.e. Include compiler versions in build directory name #32)

On further reflection, I quite like #32. It would have allowed me to detect the issue the fastest, and if I detected the issue soon enough, I may have been able to attribute it directly to the gcc/gfortran update. Note that the Rust compiler does the same thing as what #32 would accomplish with respect to gcc/gfortran -- when you update to a new version of the Rust compiler, it ignores build results from the old compiler and builds fresh binaries. We could make cargo clean work too, but I think we should at least do #32 to rebuild OpenBLAS when gcc/gfortran update.

@IvanUkhov
Copy link
Member

Yes, I tend to agree with you that this behavior should be enabled by default via either a default enabling feature or an optional disabling feature. I lean toward the latter. This would mean that the behavior will change for those who depend on this crate; one will have to explicitly opt out (/cc @termoshtt).

I don’t necessarily like build in build-once, as most of the features are about building:

  • [build-]cblas,
  • [build-]lapacke,
  • [build-]static, and
  • [build-not-but-use-]system.

Regarding your last point, I understand that you have spent time building the solution in #32, and I appreciate this. I think it might be too early for such measures. Perhaps the new default behavior and functional cleaning will eliminate most of the problems. What do you think?

@jturner314
Copy link
Contributor Author

I don’t necessarily like build in build-once, as most of the features are about building

Maybe a better name for the feature would be common-build-cache, shared-build-cache, or common-object-cache?

I understand that you have spent time building the solution in #32, and I appreciate this. I think it might be too early for such measures.

Don't worry about the work I've put into #32; do what you think is best for the project. Is the primary issue with #32 the additional complexity?

Perhaps the new default behavior and functional cleaning will eliminate most of the problems. What do you think?

Yes, changing the default behavior to place build products in OUT_DIR will eliminate most of the problems. It would still be possible for the user to get confused in a scenario like the following (which would be fixed by #32 but not by simply placing build products in OUT_DIR):

  1. The user builds a crate that depends on openblas-src.
  2. Then, they decide that they'd like to build OpenBLAS with a different compiler or different compiler options, so they specify the OPENBLAS_CC/OPENBLAS_FC/OPENBLAS_HOSTCC environment variables.
  3. They get confused when their specified environment variables are ignored.

However, they'll probably figure out fairly easily that running cargo clean will fix the issue.

The default behavior is to place build products in `OUT_DIR`. This
feature makes it possible to place the build products in the shared
`~/.cargo` directory instead.
@IvanUkhov
Copy link
Member

Thank you for taking the time to respond. Would you be comfortable with just cache? Surely, it’s not as self-explanatory as shared-build-cache, but it is still adequate. build is probably redundant, as all features have something to do with building, and shared arguably doesn’t add much to caching.

@IvanUkhov IvanUkhov merged commit 434c337 into blas-lapack-rs:master Mar 8, 2020
termoshtt added a commit that referenced this pull request Apr 3, 2021
# for free to join this conversation on GitHub. Already have an account? # to comment
Development

Successfully merging this pull request may close these issues.

2 participants