-
Notifications
You must be signed in to change notification settings - Fork 185
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Refactor handling of embedded magic.mgc
.
#4989
Conversation
c83c7ea
to
7ba9432
Compare
It was rewritten from C++ to CMake, and compression is now being done with CMake's commands. This allows us to pack the upstream `magic.mgc` file and stop keeping a pre-compressed and pre-escaped one in source control. The size of the uncompressed file used to be kept at the start of the binary file. We no longer have the capability to easily modify binary files with CMake, so the script generates a complete header, alongside a constexpr variable with the uncompressed size.
It was also simplified a bit and `gzip_wrappers.cc` is now unused and got removed.
Compressed size dropped from 333067 to 270578 bytes. Changes to the gzip compressor were reverted. The script was also renamed and slightly updated.
Higher levels require CMake 3.26+.
…thout needing a pool.
…icate that this is a CMake script.
d649ecc
to
fdf5e45
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs updated copyright dates throughout.
There's a whole lot of old C-style code here. Since the code is being changed, it needs to be updated as well. This is particularly true for test code.
Conversion of tests to Catch2 seems incomplete. There is still too much residue of the old integrated C-style test code. A bunch of the special return values and uses of errcnt
need to be broken out somehow.
The conversion script needs to ensure that it's temporary files are only the build tree.
ConstBuffer* input_buffer, | ||
PreallocatedBuffer* output_buffer) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a new function. It shouldn't use pointers.
There are just two uses of this. The one above has the same check that's copypasted below, and the new one below takes addresses.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. I also removed the duplicate check above.
ConstBuffer* input_buffer, | ||
PreallocatedBuffer* output_buffer) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ConstBuffer* input_buffer, | |
PreallocatedBuffer* output_buffer) { | |
ConstBuffer& input_buffer, | |
PreallocatedBuffer& output_buffer) { |
tiledb/sm/misc/mgc_dict.h
Outdated
@@ -38,8 +38,8 @@ using tiledb::common::Status; | |||
|
|||
class magic_dict { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class should be deleted. The only constructor is getting deleted and there's only static functions left. In order to keep the code in tiledb_filestore.cc
compiling, it can become a namespace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
tiledb/sm/misc/mgc_dict.h
Outdated
|
||
/** holds the expanded data until application exits. */ | ||
static shared_ptr<tiledb::sm::ByteVecValue> expanded_buffer_; | ||
static const tiledb::sm::ByteVecValue& expanded_buffer(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be a free function, not a class member.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
tiledb/sm/misc/test/unit_mgc_dict.cc
Outdated
FILE* infile = nullptr; | ||
infile = fopen(TILEDB_PATH_TO_MAGIC_MGC, "rb"); | ||
if (!infile) { | ||
fprintf(stderr, "ERROR: Unable to open %s\n", TILEDB_PATH_TO_MAGIC_MGC); | ||
return 1; | ||
} | ||
REQUIRE(infile); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would really prefer that these tests be rewritten with C++ I/O rather than C I/O, but I also admit that it might be a small scope expansion. There's not much I/O, though, so it's not a significant expansion. There's already a need in this PR to do more to update these tests, in addition, so this can go along with that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rewritten.
tiledb/sm/misc/test/unit_mgc_dict.cc
Outdated
fprintf(stderr, "ERROR: Unable to open %s\n", TILEDB_PATH_TO_MAGIC_MGC); | ||
return 1; | ||
} | ||
REQUIRE(infile); | ||
|
||
fseek(infile, 0L, SEEK_END); | ||
uint64_t magic_mgc_len = ftell(infile); | ||
fseek(infile, 0L, SEEK_SET); | ||
|
||
char* magic_mgc_data = tdb_new_array(char, magic_mgc_len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is never a need to allocate a local variable when the size is known in advance. Since the lengths are all constexpr
, we can use std::array
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The uncompressed file is 6.7MB big. Won't an std::array
here overflow the stack?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the lengths are all
constexpr
I would prefer to keep the generated header an implementation detail and include it only in magic_mgc.cc
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will make it a vector; at present it leaks memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to vector.
tiledb/sm/misc/test/unit_mgc_dict.cc
Outdated
fprintf(stderr, "NO errors encountered in mgc_dict validation\n"); | ||
return 0; | ||
} | ||
REQUIRE(errcnt == 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We won't need this if all the tests to which it might apply are executed with CHECK
or REQUIRE
. That will take some more rewrites.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. We now make a proper CHECK
instead of keeping the count of errors.
tiledb/sm/misc/test/unit_mgc_dict.cc
Outdated
REQUIRE(proc_list(file_data_sizes1, true) == 0); | ||
REQUIRE(proc_list(file_data_sizes2, true) == 0); | ||
REQUIRE(proc_list(file_data_sizes1, false) == 0); | ||
REQUIRE(proc_list(file_data_sizes2, false) == 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These tests look like they're independent and should be in separate SECTION
at least, if not separate TEST_CASE
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm actually thinking of removing all but the first section. The two file data lists have the same content but in different order, and testing both with and without reusing the magic_t
is not very worthwhile IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After offline feedback I removed all but the first case, and further simplified the test.
tiledb/sm/misc/test/unit_mgc_dict.cc
Outdated
fprintf(stderr, "ERROR reading data from %s\n", TILEDB_PATH_TO_MAGIC_MGC); | ||
return 4; | ||
} | ||
REQUIRE(fread(magic_mgc_data, 1, magic_mgc_len, infile) == magic_mgc_len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need a check that the file size is equal to the length we expect it to be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This happens a bit below, in line 72.
Testing the case when they are opened and closed each time has little value.
All feedback has been addressed, this is ready for review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only thing to resolve is an issue with expanded_buffer
. I think it'll be easy to make prepare_data
not copy, but if it's not we need documentation to that effect so someone else can fix it later.
The rewritten tests are much better. It's clear that we need an RAII class to open and close magic, but that's not necessary for this PR.
tiledb/sm/misc/mgc_dict.cc
Outdated
reinterpret_cast<const uint8_t*>(&magic_mgc_compressed_bytes[0])); | ||
|
||
uncompressed_magic_dict_ = expanded_buffer_.get()->data(); | ||
return expanded_buffer; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesn't RVO apply?
This situation is called NRVO, "named return value optimiziation". It's non-mandatory. The compiler can either optimize by eliding or alternatively call the copy constructor. Thus we have to assume that the copy constructor will be called in some circumstances.
RVO is mandatory only when the return value is a constructor expression.
tiledb/sm/misc/mgc_dict.cc
Outdated
reinterpret_cast<const uint8_t*>(&magic_mgc_compressed_bytes[0])); | ||
|
||
uncompressed_magic_dict_ = expanded_buffer_.get()->data(); | ||
return expanded_buffer; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stop using ByteVecValue
The possibility of the copy constructor is just as present with vector
as it is with ByteVecValue
, which under the hood is just a vector
.
Looking at the code in more detail, you could fix this by having the caller allocate the storage. The caller could allocate a vector of size magic_mgc_decompressed_size
and prepare
could take a span
argument.
tiledb/sm/misc/mgc_dict.cc
Outdated
static const char magic_mgc_compressed_bytes[] = { | ||
#include "magic_mgc_gzipped.bin" | ||
}; | ||
std::vector<uint8_t> prepare_data() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See below for commentary.
std::vector<uint8_t> prepare_data() { | |
void prepare_data(std::span<uint8_t> buffer) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
static std::vector<uint8_t> expanded_buffer(magic_mgc_decompressed_size); | ||
static std::once_flag once_flag; | ||
// Thread-safe initialization of the expanded data. | ||
static auto expanded_buffer = prepare_data(); | ||
std::call_once(once_flag, [&]() { prepare_data(expanded_buffer); }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that I see the code, I realize this could be done another way. If prepare_data
were to return its input argument, it could be used to initialize a second static variable, which would then be the return value.
No need to change it.
[SC-36912](https://app.shortcut.com/tiledb-inc/story/36912/remove-cmake-variable-tiledb-vcpkg-from-the-build) [SC-36913](https://app.shortcut.com/tiledb-inc/story/36913/remove-superbuild) Historically, the Core's build system has been using CMake external projects to download and build external dependencies, and a "superbuild" architecture to ensure a build order. With the advent of vcpkg, we have stopped building the dependencies ourselves and instead rely on vcpkg (or the "system" in general) to provide them for us. The superbuild has thus became a relic of the past, consisting of only the inner `tiledb` project when the new system is enabled (formerly by specifying `-DTILEDB_VCPKG=ON`, now always). This PR removes the superbuild. TileDB became a regular CMake project, whose targets can be built directly without first building the outer project, and then building the `build/tiledb` subdirectory. The CI scripts were updated accordingly to not use the subdirectory. This is inevitably a breaking change in the build system. For starters, local development environment will certainly need to make a clean build after this change. Furthermore, there will need to be changes in build scripts to not build again on the `tiledb` subdirectory. For examples of downstream migrations, TileDB-Inc/TileDB-Go#316 uses a `make` invocation that has a similar effect both with and without the superbuild, and conda-forge/tiledb-feedstock#290 uses a semi-documented CMake option to disable the superbuild (which will have no effect after the superbuild gets removed). The majority of first-party downstreams (VCF, SOMA, MariaDB, Vector Search, the Python and Java APIs) use the `install-tiledb` target, which currently is defiend on the superbuild and builds the `install` target in the inner `tiledb` project. With this PR the `install-tiledb` target will be kept for compatibility, but alias to `install`. ~~I tried to build VCF with a TileDB external project from this branch, but it fails with an error that will be fixed with #4989. I will try again once that PR gets merged.~~ Never mind, VCF builds with the latest changes. --- TYPE: BUILD DESC: The superbuild architecture of the build system has been removed and TileDB is a regular CMake project. Build commands of the form `make && make -C tiledb <targets>` will have to be replaced by `make <targets>`.
SC-25167
SC-47655
SC-47656
SC-47657
SC-47658
This PR overhauls the facilities to embed and load the
magic.mgc
file that is needed by libmagic:The most important change is the removal of
magic_mgc_gzipped.bin.tar.bz2
. This file contained a copy ofmagic.mgc
that was compressed, converted to escaped C characters, packed and compressed again to take less space, and stored in source control, so that at build time to get unpacked and#include
d inmgc_dict.cc
. Because this file was being prepared ahead of time by a manually invoked C++ program, this approach had the disadvantage that it tied the Core to a specific version of libmagic. This was made evident in Update libmagic to version 5.45 #4673, where just updating libmagic was not enough; we also had to updatemagic_mgc_gzipped.bin.tar.bz2
.What we do now is rely on CMake to find
magic.mgc
and perform its entire preparation at build time. The C++ program was rewritten to be a CMake script, which makes it much simpler and enables it to run on cross-compilation scenarios. The script accepts the uncompressedmagic.mgc
file, compresses it and produces a header file of the following format:The algorithm to compress
magic.mgc
was changed from gzip to zstd, resulting in a 17.9% reduction of the compressed size (from 333067 το 273500 bytes).Tests for
mgc_dict
were also updated to use Catch2, and were wired to run along with the other standalone unit tests.mgc_dict
, which was done as well.Validated by successfully running
unit_mgc_dict
locally.TYPE: BUILD
DESC: Improve embedding of
magic.mgc
and allow compiling with any libmagic version.