Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Refactor handling of embedded magic.mgc. #4989

Merged
merged 23 commits into from
Jun 17, 2024
Merged

Refactor handling of embedded magic.mgc. #4989

merged 23 commits into from
Jun 17, 2024

Conversation

teo-tsirpanis
Copy link
Member

SC-25167
SC-47655
SC-47656
SC-47657
SC-47658

This PR overhauls the facilities to embed and load the magic.mgc file that is needed by libmagic:

  • The most important change is the removal of magic_mgc_gzipped.bin.tar.bz2. This file contained a copy of magic.mgc that was compressed, converted to escaped C characters, packed and compressed again to take less space, and stored in source control, so that at build time to get unpacked and #included in mgc_dict.cc. Because this file was being prepared ahead of time by a manually invoked C++ program, this approach had the disadvantage that it tied the Core to a specific version of libmagic. This was made evident in Update libmagic to version 5.45 #4673, where just updating libmagic was not enough; we also had to update magic_mgc_gzipped.bin.tar.bz2.

    What we do now is rely on CMake to find magic.mgc and perform its entire preparation at build time. The C++ program was rewritten to be a CMake script, which makes it much simpler and enables it to run on cross-compilation scenarios. The script accepts the uncompressed magic.mgc file, compresses it and produces a header file of the following format:

    static const unsigned char magic_mgc_compressed_bytes[] = {
    0x28, 0xb5, 0x2f, 0xfd, …
    };
    constexpr size_t magic_mgc_compressed_size = sizeof(magic_mgc_compressed_bytes);
    // Editorial note: we used to prepend the decompressed size at the start of the
    // binary blob, but this was non-standard and could not be easily done by CMake.
    constexpr size_t magic_mgc_decompressed_size = 7041352;
  • The algorithm to compress magic.mgc was changed from gzip to zstd, resulting in a 17.9% reduction of the compressed size (from 333067 το 273500 bytes).

  • Tests for mgc_dict were also updated to use Catch2, and were wired to run along with the other standalone unit tests.

    • This necessitated to make an object library for mgc_dict, which was done as well.

Validated by successfully running unit_mgc_dict locally.


TYPE: BUILD
DESC: Improve embedding of magic.mgc and allow compiling with any libmagic version.

It was rewritten from C++ to CMake, and compression is now being done with CMake's commands.
This allows us to pack the upstream `magic.mgc` file and stop keeping a pre-compressed and pre-escaped one in source control.
The size of the uncompressed file used to be kept at the start of the binary file. We no longer have the capability to easily modify binary files with CMake, so the script generates a complete header, alongside a constexpr variable with the uncompressed size.
It was also simplified a bit and `gzip_wrappers.cc` is now unused and got removed.
Compressed size dropped from 333067 to 270578 bytes.
Changes to the gzip compressor were reverted. The script was also renamed and slightly updated.
Higher levels require CMake 3.26+.
@teo-tsirpanis teo-tsirpanis force-pushed the teo/mgc-dict-refactor branch 2 times, most recently from d649ecc to fdf5e45 Compare June 6, 2024 23:39
Copy link
Contributor

@eric-hughes-tiledb eric-hughes-tiledb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs updated copyright dates throughout.

There's a whole lot of old C-style code here. Since the code is being changed, it needs to be updated as well. This is particularly true for test code.

Conversion of tests to Catch2 seems incomplete. There is still too much residue of the old integrated C-style test code. A bunch of the special return values and uses of errcnt need to be broken out somehow.

The conversion script needs to ensure that it's temporary files are only the build tree.

Comment on lines 112 to 113
ConstBuffer* input_buffer,
PreallocatedBuffer* output_buffer) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a new function. It shouldn't use pointers.

There are just two uses of this. The one above has the same check that's copypasted below, and the new one below takes addresses.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. I also removed the duplicate check above.

Comment on lines 112 to 113
ConstBuffer* input_buffer,
PreallocatedBuffer* output_buffer) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ConstBuffer* input_buffer,
PreallocatedBuffer* output_buffer) {
ConstBuffer& input_buffer,
PreallocatedBuffer& output_buffer) {

@@ -38,8 +38,8 @@ using tiledb::common::Status;

class magic_dict {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class should be deleted. The only constructor is getting deleted and there's only static functions left. In order to keep the code in tiledb_filestore.cc compiling, it can become a namespace.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


/** holds the expanded data until application exits. */
static shared_ptr<tiledb::sm::ByteVecValue> expanded_buffer_;
static const tiledb::sm::ByteVecValue& expanded_buffer();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be a free function, not a class member.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 59 to 61
FILE* infile = nullptr;
infile = fopen(TILEDB_PATH_TO_MAGIC_MGC, "rb");
if (!infile) {
fprintf(stderr, "ERROR: Unable to open %s\n", TILEDB_PATH_TO_MAGIC_MGC);
return 1;
}
REQUIRE(infile);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would really prefer that these tests be rewritten with C++ I/O rather than C I/O, but I also admit that it might be a small scope expansion. There's not much I/O, though, so it's not a significant expansion. There's already a need in this PR to do more to update these tests, in addition, so this can go along with that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rewritten.

fprintf(stderr, "ERROR: Unable to open %s\n", TILEDB_PATH_TO_MAGIC_MGC);
return 1;
}
REQUIRE(infile);

fseek(infile, 0L, SEEK_END);
uint64_t magic_mgc_len = ftell(infile);
fseek(infile, 0L, SEEK_SET);

char* magic_mgc_data = tdb_new_array(char, magic_mgc_len);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is never a need to allocate a local variable when the size is known in advance. Since the lengths are all constexpr, we can use std::array here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The uncompressed file is 6.7MB big. Won't an std::array here overflow the stack?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the lengths are all constexpr

I would prefer to keep the generated header an implementation detail and include it only in magic_mgc.cc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will make it a vector; at present it leaks memory.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to vector.

fprintf(stderr, "NO errors encountered in mgc_dict validation\n");
return 0;
}
REQUIRE(errcnt == 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We won't need this if all the tests to which it might apply are executed with CHECK or REQUIRE. That will take some more rewrites.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. We now make a proper CHECK instead of keeping the count of errors.

Comment on lines 456 to 459
REQUIRE(proc_list(file_data_sizes1, true) == 0);
REQUIRE(proc_list(file_data_sizes2, true) == 0);
REQUIRE(proc_list(file_data_sizes1, false) == 0);
REQUIRE(proc_list(file_data_sizes2, false) == 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests look like they're independent and should be in separate SECTION at least, if not separate TEST_CASE.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm actually thinking of removing all but the first section. The two file data lists have the same content but in different order, and testing both with and without reusing the magic_t is not very worthwhile IMO.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After offline feedback I removed all but the first case, and further simplified the test.

fprintf(stderr, "ERROR reading data from %s\n", TILEDB_PATH_TO_MAGIC_MGC);
return 4;
}
REQUIRE(fread(magic_mgc_data, 1, magic_mgc_len, infile) == magic_mgc_len);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a check that the file size is equal to the length we expect it to be.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This happens a bit below, in line 72.

@teo-tsirpanis
Copy link
Member Author

All feedback has been addressed, this is ready for review.

Copy link
Contributor

@eric-hughes-tiledb eric-hughes-tiledb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only thing to resolve is an issue with expanded_buffer. I think it'll be easy to make prepare_data not copy, but if it's not we need documentation to that effect so someone else can fix it later.

The rewritten tests are much better. It's clear that we need an RAII class to open and close magic, but that's not necessary for this PR.

reinterpret_cast<const uint8_t*>(&magic_mgc_compressed_bytes[0]));

uncompressed_magic_dict_ = expanded_buffer_.get()->data();
return expanded_buffer;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't RVO apply?

This situation is called NRVO, "named return value optimiziation". It's non-mandatory. The compiler can either optimize by eliding or alternatively call the copy constructor. Thus we have to assume that the copy constructor will be called in some circumstances.

RVO is mandatory only when the return value is a constructor expression.

reinterpret_cast<const uint8_t*>(&magic_mgc_compressed_bytes[0]));

uncompressed_magic_dict_ = expanded_buffer_.get()->data();
return expanded_buffer;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stop using ByteVecValue

The possibility of the copy constructor is just as present with vector as it is with ByteVecValue, which under the hood is just a vector.

Looking at the code in more detail, you could fix this by having the caller allocate the storage. The caller could allocate a vector of size magic_mgc_decompressed_size and prepare could take a span argument.

static const char magic_mgc_compressed_bytes[] = {
#include "magic_mgc_gzipped.bin"
};
std::vector<uint8_t> prepare_data() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See below for commentary.

Suggested change
std::vector<uint8_t> prepare_data() {
void prepare_data(std::span<uint8_t> buffer) {

Copy link
Contributor

@eric-hughes-tiledb eric-hughes-tiledb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment on lines 58 to 61
static std::vector<uint8_t> expanded_buffer(magic_mgc_decompressed_size);
static std::once_flag once_flag;
// Thread-safe initialization of the expanded data.
static auto expanded_buffer = prepare_data();
std::call_once(once_flag, [&]() { prepare_data(expanded_buffer); });
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I see the code, I realize this could be done another way. If prepare_data were to return its input argument, it could be used to initialize a second static variable, which would then be the return value.

No need to change it.

@KiterLuc KiterLuc merged commit 7b4f403 into dev Jun 17, 2024
61 checks passed
@KiterLuc KiterLuc deleted the teo/mgc-dict-refactor branch June 17, 2024 14:52
KiterLuc pushed a commit that referenced this pull request Jul 17, 2024
[SC-36912](https://app.shortcut.com/tiledb-inc/story/36912/remove-cmake-variable-tiledb-vcpkg-from-the-build)

[SC-36913](https://app.shortcut.com/tiledb-inc/story/36913/remove-superbuild)

Historically, the Core's build system has been using CMake external
projects to download and build external dependencies, and a "superbuild"
architecture to ensure a build order. With the advent of vcpkg, we have
stopped building the dependencies ourselves and instead rely on vcpkg
(or the "system" in general) to provide them for us. The superbuild has
thus became a relic of the past, consisting of only the inner `tiledb`
project when the new system is enabled (formerly by specifying
`-DTILEDB_VCPKG=ON`, now always).

This PR removes the superbuild. TileDB became a regular CMake project,
whose targets can be built directly without first building the outer
project, and then building the `build/tiledb` subdirectory. The CI
scripts were updated accordingly to not use the subdirectory.

This is inevitably a breaking change in the build system. For starters,
local development environment will certainly need to make a clean build
after this change. Furthermore, there will need to be changes in build
scripts to not build again on the `tiledb` subdirectory.

For examples of downstream migrations, TileDB-Inc/TileDB-Go#316 uses a
`make` invocation that has a similar effect both with and without the
superbuild, and conda-forge/tiledb-feedstock#290 uses a semi-documented
CMake option to disable the superbuild (which will have no effect after
the superbuild gets removed).

The majority of first-party downstreams (VCF, SOMA, MariaDB, Vector
Search, the Python and Java APIs) use the `install-tiledb` target, which
currently is defiend on the superbuild and builds the `install` target
in the inner `tiledb` project. With this PR the `install-tiledb` target
will be kept for compatibility, but alias to `install`.

~~I tried to build VCF with a TileDB external project from this branch,
but it fails with an error that will be fixed with #4989. I will try
again once that PR gets merged.~~ Never mind, VCF builds with the latest
changes.

---
TYPE: BUILD
DESC: The superbuild architecture of the build system has been removed
and TileDB is a regular CMake project. Build commands of the form `make
&& make -C tiledb <targets>` will have to be replaced by `make
<targets>`.
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants