Refactor handling of embedded `magic.mgc`. #4989

teo-tsirpanis · 2024-05-18T20:51:01Z

SC-25167
SC-47655
SC-47656
SC-47657
SC-47658

This PR overhauls the facilities to embed and load the magic.mgc file that is needed by libmagic:

The most important change is the removal of magic_mgc_gzipped.bin.tar.bz2. This file contained a copy of magic.mgc that was compressed, converted to escaped C characters, packed and compressed again to take less space, and stored in source control, so that it could be unpacked at build time and #included in mgc_dict.cc. Because the file was prepared ahead of time by a manually invoked C++ program, this approach tied the Core to a specific version of libmagic. This became evident in Update libmagic to version 5.45 #4673, where updating libmagic alone was not enough; we also had to update magic_mgc_gzipped.bin.tar.bz2.

What we do now is rely on CMake to find magic.mgc and perform its entire preparation at build time. The C++ program was rewritten as a CMake script, which makes it much simpler and enables it to run in cross-compilation scenarios. The script accepts the uncompressed magic.mgc file, compresses it, and produces a header file of the following format:
```
static const unsigned char magic_mgc_compressed_bytes[] = {
0x28, 0xb5, 0x2f, 0xfd, …
};
constexpr size_t magic_mgc_compressed_size = sizeof(magic_mgc_compressed_bytes);
// Editorial note: we used to prepend the decompressed size at the start of the
// binary blob, but this was non-standard and could not be easily done by CMake.
constexpr size_t magic_mgc_decompressed_size = 7041352;
```
The algorithm to compress magic.mgc was changed from gzip to zstd, resulting in a 17.9% reduction of the compressed size (from 333067 to 273500 bytes).
Tests for mgc_dict were also updated to use Catch2, and were wired to run along with the other standalone unit tests.
- This required creating an object library for mgc_dict, which was done as well.

Validated by successfully running unit_mgc_dict locally.

TYPE: BUILD
DESC: Improve embedding of magic.mgc and allow compiling with any libmagic version.

tiledb/sm/misc/CMakeLists.txt

The packing program was rewritten from C++ to CMake, and compression is now done with CMake's commands. This allows us to pack the upstream `magic.mgc` file and stop keeping a pre-compressed and pre-escaped one in source control. The size of the uncompressed file used to be kept at the start of the binary file. Because binary files cannot be easily modified with CMake, the script now generates a complete header, alongside a constexpr variable with the uncompressed size.


eric-hughes-tiledb

Needs updated copyright dates throughout.

There's a whole lot of old C-style code here. Since the code is being changed, it needs to be updated as well. This is particularly true for test code.

Conversion of tests to Catch2 seems incomplete. There is still too much residue of the old integrated C-style test code. A bunch of the special return values and uses of errcnt need to be broken out somehow.

The conversion script needs to ensure that its temporary files are only in the build tree.

eric-hughes-tiledb · 2024-06-11T12:14:13Z

tiledb/sm/compressors/zstd_compressor.cc

+    ConstBuffer* input_buffer,
+    PreallocatedBuffer* output_buffer) {


This is a new function. It shouldn't use pointers.

There are just two uses of this. The one above has the same check that's copypasted below, and the new one below takes addresses.

Updated. I also removed the duplicate check above.

eric-hughes-tiledb · 2024-06-11T12:14:29Z

tiledb/sm/compressors/zstd_compressor.cc

+    ConstBuffer* input_buffer,
+    PreallocatedBuffer* output_buffer) {


Suggested change

-    ConstBuffer* input_buffer,
-    PreallocatedBuffer* output_buffer) {
+    ConstBuffer& input_buffer,
+    PreallocatedBuffer& output_buffer) {

tiledb/sm/misc/generate_embedded_data_header.script.cmake

eric-hughes-tiledb · 2024-06-11T12:46:21Z

tiledb/sm/misc/mgc_dict.h

@@ -38,8 +38,8 @@ using tiledb::common::Status;

 class magic_dict {


This class should be deleted. The only constructor is getting deleted and there's only static functions left. In order to keep the code in tiledb_filestore.cc compiling, it can become a namespace.
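
The point about call sites can be illustrated with a minimal sketch. The function `dictionary_size` and its return value are hypothetical and chosen only for this example; they are not taken from mgc_dict.h. The sketch assumes that callers only ever use qualified names of the form `magic_dict::name()`, which resolve the same way for a class with static members and for a namespace.

```cpp
#include <cassert>
#include <cstddef>

// Before (hypothetical shape): a class that only groups static functions.
//
//   class magic_dict {
//    public:
//     static std::size_t dictionary_size();
//   };

// After: a namespace with the same name.
namespace magic_dict {

// Hypothetical function used only for this illustration.
inline std::size_t dictionary_size() {
  return 42;  // placeholder value, not the real dictionary size
}

}  // namespace magic_dict

// A caller written against the class form compiles unchanged, because
// magic_dict::dictionary_size() is valid for both a class and a namespace.
inline std::size_t caller() {
  return magic_dict::dictionary_size();
}

// Exercises the caller.
inline void example_usage() {
  assert(caller() == 42);
}
```

Code that instantiates the class or names it as a type would not survive such a conversion, so the approach only fits when no such uses remain.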

eric-hughes-tiledb · 2024-06-11T12:46:39Z

tiledb/sm/misc/mgc_dict.h

-
-  /** holds the expanded data until application exits. */
-  static shared_ptr<tiledb::sm::ByteVecValue> expanded_buffer_;
+  static const tiledb::sm::ByteVecValue& expanded_buffer();


This should be a free function, not a class member.

eric-hughes-tiledb · 2024-06-11T13:34:40Z

tiledb/sm/misc/test/unit_mgc_dict.cc

  FILE* infile = nullptr;
  infile = fopen(TILEDB_PATH_TO_MAGIC_MGC, "rb");
-  if (!infile) {
-    fprintf(stderr, "ERROR: Unable to open %s\n", TILEDB_PATH_TO_MAGIC_MGC);
-    return 1;
-  }
+  REQUIRE(infile);


I would really prefer that these tests be rewritten with C++ I/O rather than C I/O, though I admit it is a small scope expansion. There is not much I/O, so the expansion is not significant. This PR already needs to do more to update these tests, so this change can go along with that.

eric-hughes-tiledb · 2024-06-11T13:42:23Z

tiledb/sm/misc/test/unit_mgc_dict.cc

-    fprintf(stderr, "ERROR: Unable to open %s\n", TILEDB_PATH_TO_MAGIC_MGC);
-    return 1;
-  }
+  REQUIRE(infile);

  fseek(infile, 0L, SEEK_END);
  uint64_t magic_mgc_len = ftell(infile);
  fseek(infile, 0L, SEEK_SET);

  char* magic_mgc_data = tdb_new_array(char, magic_mgc_len);


There is never a need to allocate a local variable when the size is known in advance. Since the lengths are all constexpr, we can use std::array here.

The uncompressed file is 6.7 MB. Won't a std::array here overflow the stack?

> the lengths are all constexpr

I would prefer to keep the generated header an implementation detail and include it only in magic_mgc.cc.

I will make it a vector; at present it leaks memory.

Changed to vector.
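
As an illustration of where this thread ended up, here is a minimal sketch of reading a whole file into a `std::vector` with C++ streams. The helper name `read_whole_file` and the file name used in the example are assumptions made for this sketch; the actual test code in the PR may differ. The vector keeps the multi-megabyte buffer on the heap and releases it automatically, which addresses both the stack-size concern and the leak.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: reads an entire file into heap-allocated storage.
std::vector<char> read_whole_file(const std::string& path) {
  // Open at the end so that tellg() reports the file size.
  std::ifstream in(path, std::ios::binary | std::ios::ate);
  if (!in) {
    throw std::runtime_error("unable to open " + path);
  }
  const std::streamoff length = in.tellg();
  if (length < 0) {
    throw std::runtime_error("unable to determine the size of " + path);
  }
  std::vector<char> data(static_cast<std::size_t>(length));
  in.seekg(0, std::ios::beg);
  in.read(data.data(), static_cast<std::streamsize>(data.size()));
  if (!in) {
    throw std::runtime_error("short read from " + path);
  }
  return data;
}

// Writes a small file, reads it back, and removes it again.
void example_usage() {
  const std::string path = "read_whole_file_example.bin";
  {
    std::ofstream out(path, std::ios::binary);
    out << "abc";
  }
  const std::vector<char> data = read_whole_file(path);
  assert(data.size() == 3 && data[0] == 'a' && data[2] == 'c');
  std::remove(path.c_str());
}
```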

eric-hughes-tiledb · 2024-06-11T13:46:02Z

tiledb/sm/misc/test/unit_mgc_dict.cc

-    fprintf(stderr, "NO errors encountered in mgc_dict validation\n");
-    return 0;
-  }
+  REQUIRE(errcnt == 0);


We won't need this if all the tests to which it might apply are executed with CHECK or REQUIRE. That will take some more rewrites.

Updated. We now make a proper CHECK instead of keeping the count of errors.

eric-hughes-tiledb · 2024-06-11T13:49:11Z

tiledb/sm/misc/test/unit_mgc_dict.cc

+  REQUIRE(proc_list(file_data_sizes1, true) == 0);
+  REQUIRE(proc_list(file_data_sizes2, true) == 0);
+  REQUIRE(proc_list(file_data_sizes1, false) == 0);
+  REQUIRE(proc_list(file_data_sizes2, false) == 0);


These tests look like they're independent and should be in separate SECTION at least, if not separate TEST_CASE.

I'm actually thinking of removing all but the first section. The two file data lists have the same content but in different order, and testing both with and without reusing the magic_t is not very worthwhile IMO.

After offline feedback I removed all but the first case, and further simplified the test.

eric-hughes-tiledb · 2024-06-11T13:50:21Z

tiledb/sm/misc/test/unit_mgc_dict.cc

-    fprintf(stderr, "ERROR reading data from %s\n", TILEDB_PATH_TO_MAGIC_MGC);
-    return 4;
-  }
+  REQUIRE(fread(magic_mgc_data, 1, magic_mgc_len, infile) == magic_mgc_len);


We need a check that the file size is equal to the length we expect it to be.

This happens a bit below, in line 72.


teo-tsirpanis · 2024-06-12T19:16:19Z

All feedback has been addressed, this is ready for review.

eric-hughes-tiledb

The only thing to resolve is an issue with expanded_buffer. I think it'll be easy to make prepare_data not copy, but if it's not we need documentation to that effect so someone else can fix it later.

The rewritten tests are much better. It's clear that we need an RAII class to open and close magic, but that's not necessary for this PR.

eric-hughes-tiledb · 2024-06-14T01:40:19Z

tiledb/sm/misc/mgc_dict.cc

-      reinterpret_cast<const uint8_t*>(&magic_mgc_compressed_bytes[0]));
-
-  uncompressed_magic_dict_ = expanded_buffer_.get()->data();
+  return expanded_buffer;


doesn't RVO apply?

This situation is called NRVO, "named return value optimization". It's non-mandatory: the compiler can either elide the copy or call the copy constructor. Thus we have to assume that the copy constructor will be called in some circumstances.

RVO is mandatory only when the return value is a constructor expression.
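
The distinction can be observed with a small sketch, assuming C++17 or later. `Tracked`, `make_prvalue`, and `make_named` are hypothetical names invented for this example. One detail worth noting: when NRVO is not applied to a type that has a move constructor, the returned local is moved rather than copied, so the sketch asserts only what the language guarantees.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical type that counts its copy and move constructions.
struct Tracked {
  static int copies;
  static int moves;
  std::vector<unsigned char> bytes;

  explicit Tracked(std::size_t n) : bytes(n) {}
  Tracked(const Tracked& other) : bytes(other.bytes) { ++copies; }
  Tracked(Tracked&& other) noexcept : bytes(std::move(other.bytes)) { ++moves; }
};
int Tracked::copies = 0;
int Tracked::moves = 0;

// Returns a prvalue. Since C++17 the object is constructed directly in the
// caller's storage, so neither the copy nor the move constructor runs.
Tracked make_prvalue() {
  return Tracked(16);
}

// Returns a named local. The compiler may elide the construction (NRVO) but
// is not required to; without elision the local is moved into the result.
Tracked make_named() {
  Tracked t(16);
  t.bytes[0] = 1;
  return t;
}

void example_usage() {
  Tracked a = make_prvalue();
  assert(Tracked::copies == 0 && Tracked::moves == 0);  // guaranteed elision

  Tracked b = make_named();
  assert(Tracked::copies == 0);  // elided or moved, never copied here
  assert(a.bytes.size() == 16 && b.bytes[0] == 1);
}
```

Whether `Tracked::moves` ends up at 0 or 1 after `make_named` depends on the compiler and its flags, which is the reason the review treats NRVO as something not to rely on.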

eric-hughes-tiledb · 2024-06-14T01:47:23Z

tiledb/sm/misc/mgc_dict.cc

-      reinterpret_cast<const uint8_t*>(&magic_mgc_compressed_bytes[0]));
-
-  uncompressed_magic_dict_ = expanded_buffer_.get()->data();
+  return expanded_buffer;


stop using ByteVecValue

The possibility of the copy constructor is just as present with vector as it is with ByteVecValue, which under the hood is just a vector.

Looking at the code in more detail, you could fix this by having the caller allocate the storage. The caller could allocate a vector of size magic_mgc_decompressed_size and prepare could take a span argument.

eric-hughes-tiledb · 2024-06-14T01:48:52Z

tiledb/sm/misc/mgc_dict.cc

-static const char magic_mgc_compressed_bytes[] = {
-#include "magic_mgc_gzipped.bin"
-};
+std::vector<uint8_t> prepare_data() {


See below for commentary.

Suggested change

-std::vector<uint8_t> prepare_data() {
+void prepare_data(std::span<uint8_t> buffer) {

eric-hughes-tiledb

LGTM

eric-hughes-tiledb · 2024-06-17T13:50:46Z

tiledb/sm/misc/mgc_dict.cc

+  static std::vector<uint8_t> expanded_buffer(magic_mgc_decompressed_size);
+  static std::once_flag once_flag;
  // Thread-safe initialization of the expanded data.
-  static auto expanded_buffer = prepare_data();
+  std::call_once(once_flag, [&]() { prepare_data(expanded_buffer); });


Now that I see the code, I realize this could be done another way. If prepare_data were to return its input argument, it could be used to initialize a second static variable, which would then be the return value.

No need to change it.
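
For reference, the alternative described in this comment could look roughly like the sketch below. It was not adopted in the PR. The names mirror the ones in the diff, but everything here is a stand-in: `demo_decompressed_size` replaces the constant from the generated header, `prepare_data` writes a simple pattern instead of decompressing, and it takes a vector reference rather than a `std::span` so that the sketch compiles as C++11. `fill_count` exists only to show that the fill runs once.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for the size constant from the generated header (arbitrary value).
constexpr std::size_t demo_decompressed_size = 8;

// Counts how many times the buffer is filled (illustration only).
static int fill_count = 0;

// Stand-in for prepare_data: fills caller-owned storage and returns it.
// The real function would decompress the embedded blob into the buffer.
std::vector<std::uint8_t>& prepare_data(std::vector<std::uint8_t>& buffer) {
  for (std::size_t i = 0; i < buffer.size(); ++i) {
    buffer[i] = static_cast<std::uint8_t>(i);
  }
  ++fill_count;
  return buffer;
}

const std::vector<std::uint8_t>& expanded_buffer() {
  // Function-local statics are initialized exactly once, and that
  // initialization is thread-safe since C++11.
  static std::vector<std::uint8_t> storage(demo_decompressed_size);
  // Initializing a second static from the return value runs the fill once,
  // without std::call_once and without copying the vector.
  static const std::vector<std::uint8_t>& ready = prepare_data(storage);
  return ready;
}

void example_usage() {
  const std::vector<std::uint8_t>& first = expanded_buffer();
  const std::vector<std::uint8_t>& second = expanded_buffer();
  assert(&first == &second);  // same storage on every call
  assert(fill_count == 1);    // filled only once
  assert(first.size() == demo_decompressed_size && first[7] == 7);
}
```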

[SC-36912](https://app.shortcut.com/tiledb-inc/story/36912/remove-cmake-variable-tiledb-vcpkg-from-the-build) [SC-36913](https://app.shortcut.com/tiledb-inc/story/36913/remove-superbuild)

Historically, the Core's build system has used CMake external projects to download and build external dependencies, and a "superbuild" architecture to ensure a build order. With the advent of vcpkg, we have stopped building the dependencies ourselves and instead rely on vcpkg (or the "system" in general) to provide them for us. The superbuild has thus become a relic of the past, consisting of only the inner `tiledb` project when the new system is enabled (formerly by specifying `-DTILEDB_VCPKG=ON`, now always).

This PR removes the superbuild. TileDB becomes a regular CMake project, whose targets can be built directly, without first building the outer project and then building the `build/tiledb` subdirectory. The CI scripts were updated accordingly to not use the subdirectory.

This is inevitably a breaking change in the build system. For starters, local development environments will certainly need a clean build after this change. Furthermore, build scripts will need to change so that they no longer build again in the `tiledb` subdirectory. For examples of downstream migrations, TileDB-Inc/TileDB-Go#316 uses a `make` invocation that has a similar effect both with and without the superbuild, and conda-forge/tiledb-feedstock#290 uses a semi-documented CMake option to disable the superbuild (which will have no effect after the superbuild gets removed).

The majority of first-party downstreams (VCF, SOMA, MariaDB, Vector Search, the Python and Java APIs) use the `install-tiledb` target, which is currently defined on the superbuild and builds the `install` target in the inner `tiledb` project. With this PR the `install-tiledb` target will be kept for compatibility, but will alias `install`.

~~I tried to build VCF with a TileDB external project from this branch, but it fails with an error that will be fixed with #4989. I will try again once that PR gets merged.~~ Never mind, VCF builds with the latest changes.

---
TYPE: BUILD
DESC: The superbuild architecture of the build system has been removed and TileDB is a regular CMake project. Build commands of the form `make && make -C tiledb <targets>` will have to be replaced by `make <targets>`.

teo-tsirpanis commented May 27, 2024


teo-tsirpanis marked this pull request as ready for review May 28, 2024 14:49

teo-tsirpanis requested a review from eric-hughes-tiledb May 28, 2024 14:49

teo-tsirpanis mentioned this pull request May 29, 2024

Remove the superbuild and the external projects. #5021

Merged

teo-tsirpanis force-pushed the teo/mgc-dict-refactor branch from c83c7ea to 7ba9432 Compare June 6, 2024 15:53

teo-tsirpanis added 11 commits June 7, 2024 02:38

Fix missing libmagic dependency.

c3a9727

Support specifying the window bits in the GZip decompressor.

fe6e602

Adapt the mgc_dict class to the new way of getting magic.mgc.

0ff6d69

It was also simplified a bit and `gzip_wrappers.cc` is now unused and got removed.

Add mgc_dict object library and move packing magic.mgc to `tiledb/sm/misc`.

35df955

Change unit_mgc_dict to use Catch2.

5ff8787

Compress magic.mgc with zstd.

6d9614d

Compressed size dropped from 333067 to 270578 bytes. Changes to the gzip compressor were reverted. The script was also renamed and slightly updated.

Lower compression level to 9.

9b5913f

Higher levels require CMake 3.26+.

Remove a leftover line.

64ccaab

Add overload to ZStd::decompress that directly accepts a context without needing a pool.

fd5651b

Move the CMake script to tiledb/sm/misc and rename it to better indicate that this is a CMake script.

fdf5e45

teo-tsirpanis force-pushed the teo/mgc-dict-refactor branch 2 times, most recently from d649ecc to fdf5e45 Compare June 6, 2024 23:39

eric-hughes-tiledb suggested changes Jun 11, 2024


teo-tsirpanis added 10 commits June 11, 2024 18:22

Use a vector to hold the decompressed buffer and return it in a span.

45c7382

Convert the magic_dict class to a namespace.

ff5b049

Do checks instead of incrementing an error counter.

48f7f3f

Use C++ I/O to read magic.mgc.

a31fdc7

Use references in the new ZStd::decompress overload.

cc56474

Remove duplicate check.

7cc9bab

Use a vector to store the file data.

12331d0

Use only one test data set.

5019bd7

Remove the lambdas in the test and always reuse the magic_t handles.

26d50c0

Testing the case when they are opened and closed each time has little value.

Update copyright dates.

22151ee

teo-tsirpanis requested a review from eric-hughes-tiledb June 12, 2024 19:16

eric-hughes-tiledb suggested changes Jun 14, 2024


teo-tsirpanis added 2 commits June 15, 2024 02:16

Do not rely on NRVO when decompressing magic.mgc.

96295c5

Update documentation.

8ce06d1

teo-tsirpanis requested a review from eric-hughes-tiledb June 14, 2024 23:42

eric-hughes-tiledb approved these changes Jun 17, 2024


KiterLuc merged commit 7b4f403 into dev Jun 17, 2024
61 checks passed

KiterLuc deleted the teo/mgc-dict-refactor branch June 17, 2024 14:52
