-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
ARROW-14222: [C++] implement GCSFileSystem skeleton #11331
Conversation
cpp/src/arrow/filesystem/gcsfs.h
Outdated
static GCSOptions Anonymous(); | ||
}; | ||
|
||
class ARROW_EXPORT GCSFileSystem : public FileSystem { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small nit: GcsFileSystem? Also add docstring?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering what was the style in this project. Fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally, google style, There are some cases where we've used the all caps spelling though (e.g. CSV)
Ping? |
cpp/src/arrow/filesystem/gcsfs.cc
Outdated
|
||
namespace gcs = google::cloud::storage; | ||
|
||
google::cloud::Options AsGoogleCloudOptions(GcsOptions const& o) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const should be in front of GcsOptions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I did not mean to restart the "east const" vs. "const west" holy way, but muscle memory is what it is.
namespace gcs = google::cloud::storage; | ||
|
||
google::cloud::Options AsGoogleCloudOptions(GcsOptions const& o) { | ||
auto options = google::cloud::Options{}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to prefer this over google::cloud::Options options
; ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Almost Always Auto", another holy war. I can change it if you think it is confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the almost always use auto (my undertanding this is from effective C++) directly contradicts the style guide in use (which is only use auto if type is obvious). In this case type is obvious, it just seems like there is more text in this case.
cpp/src/arrow/filesystem/gcsfs.cc
Outdated
google::cloud::Options AsGoogleCloudOptions(GcsOptions const& o) { | ||
auto options = google::cloud::Options{}; | ||
if (!o.endpoint_override.empty()) { | ||
auto scheme = o.scheme; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please spell out the type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
namespace internal { | ||
// TODO(ARROW-1231) - during development only tests should create a GcsFileSystem. | ||
// Remove, and provide a public API, before declaring the feature complete. | ||
std::shared_ptr<GcsFileSystem> MakeGcsFileSystemForTest(const GcsOptions& options); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The TODO covers make a static method Make on GcsFileSystem?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed to make the TODO more explicit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope that is what you had in mind.
namespace internal { | ||
|
||
std::shared_ptr<GcsFileSystem> MakeGcsFileSystemForTest(const GcsOptions& options) { | ||
return std::shared_ptr<GcsFileSystem>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe comment on why std::make_shared isn't used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
EXPECT_THAT(b.get(), NotNull()); | ||
EXPECT_EQ(b, b); | ||
|
||
EXPECT_NE(a, b); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Equals here is mean logically equal, shouldn't this be equal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not know what "logically equals" means. At a guess something like "modulo transient errors, both produce the same results"? If so, I am not sure there is a reliable way to do that: consider storage.googleapis.com
vs. private.googleapis.com
as endpoint overrides, different strings, but depending on your environment, exactly the same behavior (or not!).
Anyway, implemented something closer to the current implementation for the S3 FileSystem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I meant configuration parameters are equal. Before this looked like it was using solely memory address equality.
EXPECT_EQ(a, a); | ||
|
||
auto b = internal::MakeGcsFileSystemForTest(GcsOptions{}); | ||
EXPECT_THAT(b.get(), NotNull()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think .get() is necessary here and above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doh, fixed.
TEST(GCSFileSystem, Compare) { | ||
auto a = internal::MakeGcsFileSystemForTest(GcsOptions{}); | ||
EXPECT_THAT(a.get(), NotNull()); | ||
EXPECT_EQ(a, a); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be comparing the objects EXPECT_EQ(*a, *a)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually we want a->Equals()
fixed.
C++ looks good to me. I'll wait and see if there are any comments on CMake (I'll merge later today if nothing comes up) |
@@ -801,7 +801,11 @@ endif() | |||
set(ARROW_SHARED_PRIVATE_LINK_LIBS ${ARROW_STATIC_LINK_LIBS}) | |||
|
|||
# boost::filesystem is needed for S3 and Flight tests as a boost::process dependency. | |||
if(((ARROW_FLIGHT OR ARROW_S3) AND (ARROW_BUILD_TESTS OR ARROW_BUILD_INTEGRATION))) | |||
if(((ARROW_FLIGHT | |||
OR ARROW_S3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if the mac build is a transient error or somehow due to this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think is this change, as the build has -DARROW_GCS=OFF
:
https://github.com/apache/arrow/pull/11331/checks?check_run_id=3830637295#step:9:62
and the change is (modulo reformatting) changing (ARROW_FLIGHT OR ARROW_S3)
to (ARROW_FLIGHT OR ARROW_S3 OR ARROW_GCS)
. In addition, this change has been there since the first commit in the branch.
At a guess this is either a transient or the point in master
where I based the branch has a problem, I can rebase if you think that would help.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, lets try a rebase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the builds at master
, I do not think a rebase would help, and it seems even less likely that these changes caused the build breaks:
https://github.com/apache/arrow/actions/runs/1317260919
https://github.com/apache/arrow/actions?query=event%3Apush+branch%3Amaster
Rebased anyway. At least that makes it easier to understand what build problems are coming from where. |
include_directories(SYSTEM ${GOOGLE_CLOUD_CPP_INCLUDE_DIR}) | ||
get_target_property(absl_base_INCLUDE_DIR absl::base INTERFACE_INCLUDE_DIRECTORIES) | ||
include_directories(SYSTEM ${absl_base_INCLUDE_DIR}) | ||
message(STATUS "Found google-cloud-cpp::storage static library: ${GOOGLE_CLOUD_CPP_STATIC_LIBRARY_STORAGE}" | ||
) | ||
message(STATUS "Found google-cloud-cpp::storage headers: ${GOOGLE_CLOUD_CPP_INCLUDE_DIR}" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you extract include dir and static library path information from google-cloud-cpp::storage
?
If we use system Google Cloud Storage (find_package(google_cloud_cpp_storage)
is succeeded case), GOOGLE_CLOUD_CPP_*
aren't defined.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just removed the static library, seems to be unused, and extracted the include dir from the target properties.
This is really not my problem, I will say it anyway: there is something wrong, or at least very unusual, about Arrow's usage of CMake. With modern CMake adding google-cloud-cpp::storage
to the target_link_libraries()
should define all the necessary command-line flags. If one must set include_directories()
then some target is missing a dependency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. You're right.
We have some problems around it but we didn't dig into them yet...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack. In case it helps, it might be this:
arrow/cpp/cmake_modules/BuildUtils.cmake
Lines 285 to 287 in 00c2741
# Generate a single "objlib" from all C++ modules and link | |
# that "objlib" into each library kind, to avoid compiling twice | |
add_library(${LIB_NAME}_objlib OBJECT ${ARG_SOURCES}) |
the arrow_objlib
target does not seem to link any dependencies, so when its object files are compiled, none of the required flags are set. You may want to add the dependencies to arrow_objlib
and then things should "Just Work":tm:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be right. We should try it later. :-)
CI failures on macOS will be fixed by #11374. |
@kou any more concerns, otherwise I think this can be merged. |
Nothing. We can merge this. |
Benchmark runs are scheduled for baseline = 00c2741 and contender = 1b0ea15. 1b0ea15 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
No description provided.