-
Notifications
You must be signed in to change notification settings - Fork 0
feat: reading from bigtable #2
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 7 files reviewed, 21 unresolved discussions (waiting on @dopiera and @kboroszko)
WORKSPACE, line 115 at r1 (raw file):
name = "com_google_googleapis", build_file = "@com_github_googleapis_google_cloud_cpp//bazel:googleapis.BUILD", sha256 = "a53e15405f81d5a32594d7f6486e649131fadda5431cf28377dff4ae54d45d16",
Why this change?
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 27 at r1 (raw file):
void MakeDataset(OpKernelContext* ctx, DatasetBase** output) override { // Parse and validate any input tensors that define the dataset using
I don't understand this comment - what input tensors do we have here?
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 33 at r1 (raw file):
// Create the dataset object, passing any (already-validated) arguments from // attrs or input tensors.
I'm also guessing this comment is left over from a hello world.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 56 at r1 (raw file):
std::unique_ptr<IteratorBase> MakeIteratorInternal( const std::string& prefix) const { VLOG(1) << "MakeIteratorInternal. instance=" << project_id_ << ":"
I think it should be table=
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 58 at r1 (raw file):
VLOG(1) << "MakeIteratorInternal. instance=" << project_id_ << ":" << instance_id_ << ":" << table_id_; return std::unique_ptr<IteratorBase>(
absl::make_unique
maybe?
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 63 at r1 (raw file):
} // Record structure: Each record is represented by a scalar string tensor.
I'm guessing that you left all the comments from a tutorial/example - please remove them if they do no
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 69 at r1 (raw file):
DataTypeVector
Shouldn't this vector be of the same size as columns_
?
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 75 at r1 (raw file):
const std::vector<PartialTensorShape>& output_shapes() const override { static std::vector<PartialTensorShape>* shapes = new std::vector<PartialTensorShape>({{}});
That's a weird way to write it. Why not simply static std::vector<PartialTensorShape> shapes({{}});
?
BTW, the shape should somehow reflect the number of columns specified by the user, so I think a it should be a member rather than a global variable.
Anyway, google's C++ style guide says "Objects with static storage duration are forbidden unless they are trivially destructible."
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 84 at r1 (raw file):
protected: // Optional: Implementation of `GraphDef` serialization for this dataset.
It seems you can safely remove this member function?
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 105 at r1 (raw file):
std::vector<std::string> columns_; class Iterator : public DatasetIterator<Dataset> {
Please make the member function implementation non-inline. It's pretty hard to read with the extra indentation.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 112 at r1 (raw file):
std::vector<std::string> columns) : DatasetIterator<Dataset>(params), data_client_(CreateDataClient(project_id, instance_id)),
I think we don't have the same problem here as we had in pyTorch and we can keep the client in the Dataset
. That's what Bigquery does and I think we should follow it.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 143 at r1 (raw file):
if (it_ == reader_.end()) { VLOG(1) << "End of sequence"; *end_of_sequence = true;
I think the code would be easier to comprehend if it followed this pattern:
if (it_ == reader_.end()) {
...
*end_of_sequence = true;
return Status::Ok();
}
*end_of_sequence = false;
// handle the non-trivial scenario
That way:
end_of_sequence
is handled in one place of the function- there is less indentation
- the reader doesn't have to scroll all the way down to learn that nothing else happens in case of reaching the end
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 147 at r1 (raw file):
record_tensor
I'm guessing the name record
came from Bigquery implementation, but Bigtable doesn't have "records" in its terminology. How about res
?
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 148 at r1 (raw file):
record_v
It's pretty hard to decrypt what it means. How about res_data
?
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 153 at r1 (raw file):
auto const& row = *it_; for (const auto& cell : row.value().cells()) { std::pair<std::string, std::string> key(cell.family_name(),
auto const key = std::make_pair(cell.family_name(), cell.column_qualifier())
is a bit shorter.
Also, this is piece of code is performance critical, so I'd rather not lookup in the map twice. Use find()
instead:
auto const column_it = column_map_.find({});
// somehow handle column_it == column_map_.end()
VLOG(1) << "getting column:" << *column_it;
record_v(*column_it) = cell.value();
Also, we should avoid unnecessary string copies here, so I'd write it as:
auto cells = std::move(row).cells();
for (auto & cell : cells) {
auto const key = std::make_pair(cell.family_name(), cell.column_qualifier());
...
record_v(*column_it) = std::move(cell).value();
}
There's even more room for optimization. Currently, we're copying the strings representing the column qualifier and family name. We could avoid it if the key in our column_map
was std::pair<std::string const&, std::string const&>
. You would have to manually ensure that the strings outlive the map, but that's not hard. Given that this code is executed for every cell, this might be worth it. WDYT?
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 204 at r1 (raw file):
pair
That's not a very descriptive variable name. How about column
?
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 205 at r1 (raw file):
pair.first, pair.second
std::move(pair.first), std::move(pair.second)
?
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 242 at r1 (raw file):
column_map_
In this project we can use absl::Hash
, which has a hash function for pairs.
BTW, would column_to_idx_
be more self-descriptive?
tensorflow_io/core/ops/bigtable_ops.cc, line 8 at r1 (raw file):
That's no longer needed, right?
tensorflow_io/core/ops/bigtable_ops.cc, line 17 at r1 (raw file):
MyReaderDataset
Please update the comment :)
tensorflow_io/python/ops/bigtable_dataset_ops.py, line 11 at r1 (raw file):
MyDataset
I think we should rename it :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 7 files reviewed, 21 unresolved discussions (waiting on @dopiera and @kboroszko)
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 27 at r1 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
I don't understand this comment - what input tensors do we have here?
Whoops. Deleted.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 33 at r1 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
I'm also guessing this comment is left over from a hello world.
Done.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 56 at r1 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
I think it should be
table=
Done.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 58 at r1 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
absl::make_unique
maybe?
Done.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 63 at r1 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
I'm guessing that you left all the comments from a tutorial/example - please remove them if they do no
Done.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 69 at r1 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
DataTypeVector
Shouldn't this vector be of the same size as
columns_
?
For the sake of this PR we assume that our output is a tensor of type DT_STRING, so no need for that. Will be changed in the next one.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 75 at r1 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
That's a weird way to write it. Why not simply
static std::vector<PartialTensorShape> shapes({{}});
?BTW, the shape should somehow reflect the number of columns specified by the user, so I think a it should be a member rather than a global variable.
Anyway, google's C++ style guide says "Objects with static storage duration are forbidden unless they are trivially destructible."
Done.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 84 at r1 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
It seems you can safely remove this member function?
Then the compiler whines that the abstract function is not implemented.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 105 at r1 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
Please make the member function implementation non-inline. It's pretty hard to read with the extra indentation.
Done.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 112 at r1 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
I think we don't have the same problem here as we had in pyTorch and we can keep the client in the
Dataset
. That's what Bigquery does and I think we should follow it.
Definately. I plan to implement it in next PRs. This one focuses on reading data.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 143 at r1 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
I think the code would be easier to comprehend if it followed this pattern:
if (it_ == reader_.end()) { ... *end_of_sequence = true; return Status::Ok(); } *end_of_sequence = false; // handle the non-trivial scenario
That way:
end_of_sequence
is handled in one place of the function- there is less indentation
- the reader doesn't have to scroll all the way down to learn that nothing else happens in case of reaching the end
Done.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 147 at r1 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
record_tensor
I'm guessing the name
record
came from Bigquery implementation, but Bigtable doesn't have "records" in its terminology. How aboutres
?
Done.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 148 at r1 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
record_v
It's pretty hard to decrypt what it means. How about
res_data
?
Done.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 242 at r1 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
column_map_
In this project we can use
absl::Hash
, which has a hash function for pairs.BTW, would
column_to_idx_
be more self-descriptive?
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 7 files reviewed, 21 unresolved discussions (waiting on @dopiera and @kboroszko)
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 204 at r1 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
pair
That's not a very descriptive variable name. How about
column
?
Done.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 205 at r1 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
pair.first, pair.second
std::move(pair.first), std::move(pair.second)
?
Done.
tensorflow_io/core/ops/bigtable_ops.cc, line 8 at r1 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
That's no longer needed, right?
It will be needed later. The 'BigtableDataset' Op is a special kind of Op representing a dataset. When we create a separate Op for creating and storing the client connection we'll need to use one just like BigtableTest, that's why I would leave it for now. It will serve as a template for creating a new Op in the next PR.
tensorflow_io/core/ops/bigtable_ops.cc, line 17 at r1 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
MyReaderDataset
Please update the comment :)
Done.
tensorflow_io/python/ops/bigtable_dataset_ops.py, line 11 at r1 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
MyDataset
I think we should rename it :)
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 7 files reviewed, 21 unresolved discussions (waiting on @dopiera and @kboroszko)
WORKSPACE, line 115 at r1 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
Why this change?
It's a newer version.
ERROR: /home/kajetan/.cache/bazel/_bazel_kajetan/0cce586c91f45fa0eaddae47f82490ab/external/com_github_googleapis_google_cloud_cpp/google/cloud/bigtable/BUILD:24:11: no such target '@com_google_googleapis//google/bigtable/admin/v2:admin_cc_grpc': target 'admin_cc_grpc' not declared in package 'google/bigtable/admin/v2' defined by /home/kajetan/.cache/bazel/_bazel_kajetan/0cce586c91f45fa0eaddae47f82490ab/external/com_google_googleapis/google/bigtable/admin/v2/BUILD.bazel and referenced by '@com_github_googleapis_google_cloud_cpp//google/cloud/bigtable:bigtable_client'
The error says that com_github_googleapis_google_cloud_cpp
references some admin_cc_grpc
but no such target exists in com_google_googleapis
.
That's because the bigtable code from google_cloud_cpp
is expecting a newer version of googleapis
. As long as bigtable was not used, this discrepancy was not causing any errors because bazel didn't try to build it at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 7 files reviewed, 21 unresolved discussions (waiting on @dopiera)
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 153 at r1 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
auto const key = std::make_pair(cell.family_name(), cell.column_qualifier())
is a bit shorter.Also, this is piece of code is performance critical, so I'd rather not lookup in the map twice. Use
find()
instead:auto const column_it = column_map_.find({}); // somehow handle column_it == column_map_.end() VLOG(1) << "getting column:" << *column_it; record_v(*column_it) = cell.value();
Also, we should avoid unnecessary string copies here, so I'd write it as:
auto cells = std::move(row).cells(); for (auto & cell : cells) { auto const key = std::make_pair(cell.family_name(), cell.column_qualifier()); ... record_v(*column_it) = std::move(cell).value(); }
There's even more room for optimization. Currently, we're copying the strings representing the column qualifier and family name. We could avoid it if the key in our
column_map
wasstd::pair<std::string const&, std::string const&>
. You would have to manually ensure that the strings outlive the map, but that's not hard. Given that this code is executed for every cell, this might be worth it. WDYT?
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 8 files reviewed, 25 unresolved discussions (waiting on @dopiera and @kboroszko)
tensorflow_io/bigtable.py, line 1 at r9 (raw file):
from tensorflow_io.python.ops.bigtable_dataset_ops import BigtableDataset
Please remove this file for now
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 219 at r8 (raw file):
delimiter_pos + 1
This might be past a past-the-end position. I think you want to extend your if with (delimiter_pos > 0 && delimiter_pos + 1 < col_name_full.size())
Or better, use absl::StrSplit
and verify the resulting vector length.
BTW, let's stick to the official naming: column == family:qualifier.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 256 at r8 (raw file):
cbt::RowReader reader_ GUARDED_BY(mu_); cbt::v1::internal::RowReaderIterator it_ GUARDED_BY(mu_); absl::flat_hash_map<std::pair<std::string const&, std::string const&>, size_t>
Please add a comment explaining why we're using such an eccentric type. Also, can we make it const
? If we can, we don't need GUARDED_BY
, I think.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 1 at r9 (raw file):
A general comment, applying to all of the review is to not define the member functions inline. Google style guide says:
Do not put large method definitions inline in the class definition. Usually, only trivial or performance-critical, and very short, methods may be defined inline
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 35 at r9 (raw file):
class BigtableClientResource : public ResourceBase { public: explicit BigtableClientResource(std::string const& project_id,
Please make this ctor take the parameters by value and std::move
them to data client creation.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 42 at r9 (raw file):
CreateDefaultDataClient
CreateDefaultDataClient()
is deprecated, please use MakeDataClient
.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 43 at r9 (raw file):
std::move(project_id), std::move(instance_id)
Please either make this member function take arguments by value (preferred) or get rid of the std::move()
s.
BTW, do we even need this member function? It seem unnecessary, especially in the public API.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 46 at r9 (raw file):
} std::unique_ptr<cbt::Table> CreateTable(std::string const& table_id) {
cbt::Table
is supposed to be lightweight - do we need the std::unique_ptr
?
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 55 at r9 (raw file):
mutex mu_; std::shared_ptr<cbt::DataClient> data_client_ TF_GUARDED_BY(mu_);
DataClient
is thread safe and so is std::shared_ptr
, so you don't need the mutex
. You're not using it anyway ;)
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 59 at r9 (raw file):
Google style guide says:
don't put more than one or two blank lines between functions
Please remove at least one :)
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 75 at r9 (raw file):
VLOG(1) << "BigtableClientOp dtor";
I think this log is misplaced.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 120 at r9 (raw file):
explicit Iterator(const typename DatasetIterator<Dataset>::Params& params, std::string const& table_id, std::vector<std::string> columns)
We don't need the columns
by value - they are only read by CreateColumnPairs()
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 141 at r9 (raw file):
VLOG(1) << "alocating tensor"; long n_cols = column_to_idx_.size();
std::size_t const kNumCols
?
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 154 at r9 (raw file):
VLOG(1) << "getting column:" << column_idx->second; res_data(column_idx->second) = std::move(cell.value()); } else {
This case deserves a serious warning - it means we got a column which we did not expect, so something is wrong with our filter generation or the Bigtable client/service.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 161 at r9 (raw file):
VLOG(1) << "returning value"; out_tensors->emplace_back(std::move(res)); *end_of_sequence = false;
I think I'd move this line to right after the initial if
, so that both cases are handled near each other.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 209 at r9 (raw file):
} static std::pair<std::string, std::string> ColumnNameToPair(
ColumnToFamilyAndQualifier
is only 10 characters longer but drastically more descriptive. If you want to save on screen real estate, Col2FamAndQual
should work, too :)
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 224 at r9 (raw file):
} static std::vector<std::pair<std::string, std::string>> CreateColumnPairs(
ColumnPair
suggests a pair of columns. How about ColumnsToFamiliesAndQualifiers
?
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 236 at r9 (raw file):
static absl::flat_hash_map<std::pair<std::string const&, std::string const&>, size_t> CreateColumnMap(
How about CreateColumn2TensorIdxMap
?
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 242 at r9 (raw file):
size_t> column_map; size_t index = 0;
std::size_t
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 253 at r9 (raw file):
mutex mu_; std::shared_ptr<cbt::DataClient> data_client_ GUARDED_BY(mu_); std::vector<std::pair<std::string, std::string>> columns_;
Can we make this member const
? This would allow us to not use GUARDED_BY
, indeed.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 275 at r9 (raw file):
~Dataset(){ client_resource_->Unref();
Why not ScopedUnref()
member instead having a non-trivial dtor?
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 278 at r9 (raw file):
MakeIteratorInternal
This member function should be marked @override
, no?
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 281 at r9 (raw file):
Iterator<Dataset>
Wouldn't return absl::make_unique<Iterator<Dataset>>({this, strings::StrCat(prefix, "::BigtableDataset")}, table_id_, columns_)
suffice?
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 296 at r9 (raw file):
return client_resource_; }
Something is weirdly formatted here.
tensorflow_io/python/ops/bigtable_dataset_ops.py, line 10 at r7 (raw file):
_BigTableDataset
I think you meant BigtableDataset
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 8 files reviewed, 25 unresolved discussions (waiting on @dopiera, @kboroszko, and @OverRide)
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 219 at r8 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
delimiter_pos + 1
This might be past a past-the-end position. I think you want to extend your if with
(delimiter_pos > 0 && delimiter_pos + 1 < col_name_full.size())
Or better, use
absl::StrSplit
and verify the resulting vector length.BTW, let's stick to the official naming: column == family:qualifier.
😒 StrSplit returns a vector. I then have to check the size of that vector and verify non-zero length of both elements of that vector. I'd rather extend the if.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 256 at r8 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
Please add a comment explaining why we're using such an eccentric type. Also, can we make it
const
? If we can, we don't needGUARDED_BY
, I think.
Done.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 35 at r9 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
Please make this ctor take the parameters by value and
std::move
them to data client creation.
Done.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 42 at r9 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
CreateDefaultDataClient
CreateDefaultDataClient()
is deprecated, please useMakeDataClient
.
Done.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 43 at r9 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
std::move(project_id), std::move(instance_id)
Please either make this member function take arguments by value (preferred) or get rid of the
std::move()
s.BTW, do we even need this member function? It seem unnecessary, especially in the public API.
Done.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 46 at r9 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
cbt::Table
is supposed to be lightweight - do we need thestd::unique_ptr
?
Done.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 55 at r9 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
mutex mu_; std::shared_ptr<cbt::DataClient> data_client_ TF_GUARDED_BY(mu_);
DataClient
is thread safe and so isstd::shared_ptr
, so you don't need themutex
. You're not using it anyway ;)
Done.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 59 at r9 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
Google style guide says:
don't put more than one or two blank lines between functions
Please remove at least one :)
Done.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 75 at r9 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
VLOG(1) << "BigtableClientOp dtor";
I think this log is misplaced.
Done.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 120 at r9 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
We don't need the
columns
by value - they are only read byCreateColumnPairs()
Done.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 141 at r9 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
std::size_t const kNumCols
?
Done.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 209 at r9 (raw file):
ColumnToFamilyAndQualifier
Eh, I have a wide screen.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 224 at r9 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
ColumnPair
suggests a pair of columns. How aboutColumnsToFamiliesAndQualifiers
?
Done.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 242 at r9 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
std::size_t
Done.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 253 at r9 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
Can we make this member
const
? This would allow us to not useGUARDED_BY
, indeed.
Done.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 278 at r9 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
MakeIteratorInternal
This member function should be marked
@override
, no?
It's not java so no @OverRide 😆, but I added override in the appropriate place.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 281 at r9 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
Iterator<Dataset>
Wouldn't
return absl::make_unique<Iterator<Dataset>>({this, strings::StrCat(prefix, "::BigtableDataset")}, table_id_, columns_)
suffice?
Done.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 296 at r9 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
return client_resource_; }
Something is weirdly formatted here.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 8 files reviewed, 25 unresolved discussions (waiting on @dopiera, @kboroszko, and @OverRide)
tensorflow_io/bigtable.py, line 1 at r9 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
Please remove this file for now
Done.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 1 at r9 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
A general comment, applying to all of the review is to not define the member functions inline. Google style guide says:
Do not put large method definitions inline in the class definition. Usually, only trivial or performance-critical, and very short, methods may be defined inline
Have I defined anything inline?
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 154 at r9 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
This case deserves a serious warning - it means we got a column which we did not expect, so something is wrong with our filter generation or the Bigtable client/service.
We are very serious people with very serious warnings.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 161 at r9 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
I think I'd move this line to right after the initial
if
, so that both cases are handled near each other.
Done.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 236 at r9 (raw file):
CreateColumn2TensorIdxMap
Maybe ColumnToIdxMap?
tensorflow_io/python/ops/bigtable_dataset_ops.py, line 10 at r7 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
_BigTableDataset
I think you meant
BigtableDataset
?
I wanted to indicate that it is not to be created on its own. It should only be made as a result of BigtableTable.read_rows. Bigquery connector follows a similar pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 8 files reviewed, 13 unresolved discussions (waiting on @dopiera and @kboroszko)
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 219 at r8 (raw file):
Previously, kboroszko (Kajetan Boroszko) wrote…
😒 StrSplit returns a vector. I then have to check the size of that vector and verify non-zero length of both elements of that vector. I'd rather extend the if.
But you won't have to write a test for the corner cases :). Seriously speaking - the code is not trivial, as a matter of fact you made it get a SIGSEGV in a valid scenario (the column qualifier may be empty) - I think it's a pretty good indication that using something ready and tested is a better idea unless you have a really good reason not to. StrSplit
has yet another advantage - it's immediately clear what's going on.
Also, please rename the variables to reflect column == family:qualifier
.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 256 at r8 (raw file):
Previously, kboroszko (Kajetan Boroszko) wrote…
Done.
Sorry, I should have been more specific. The absl::flat_hash_map
is self-explanatory, I think, but holding the references isn't. Could you please describe that we're keeping const references to strings, so that when searching for an index we do not need to copy strings when constructing the search key?
BTW, std::map
doesn't use hash functions - std::unordered_map
does and you can always give it an explicit hash function (e.g. from absl
) as a template argument. The difference between std::unordered_map
and absl::flat_hash_map
, I think, is that std
does open hashing and absl
- closed. For something that is read mostly, the absl
is likely better.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 1 at r9 (raw file):
Previously, kboroszko (Kajetan Boroszko) wrote…
Have I defined anything inline?
Everything, actually :)
When you do this:
class Foo {
void bar() {
std::cout << "Hello world" << std::endl;
}
};
then Foo::bar
is implicitly marked inline by the compiler.
This is equivalent to
class Foo {
inline void bar();
};
inline void Foo::bar() {
std::cout << "Hello world" << std::endl;
}
Only in the second version (when you drop the inline
keyword) is the member function not inline.
The style guide says, the trivial or very short member functions can be inlined.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 141 at r9 (raw file):
Previously, kboroszko (Kajetan Boroszko) wrote…
Done.
You left it as long
still.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 154 at r9 (raw file):
Previously, kboroszko (Kajetan Boroszko) wrote…
We are very serious people with very serious warnings.
We are, but unfortunately, I think this will not emit this warning - it only checks for what a status is. TBH, I'm surprised it compiles.
I think you want LOG(WARNING) <<
. Please also add which column was found unexpected.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 275 at r9 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
Why not
ScopedUnref()
member instead having a non-trivial dtor?
I think you missed this comment.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 281 at r9 (raw file):
Previously, kboroszko (Kajetan Boroszko) wrote…
Done.
Why store the returned value in a local variable rather than return directly?
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 1 at r11 (raw file):
/* Copyright 2021 The TensorFlow Authors. All Rights Reserved.
I noticed only now - this project uses west const, please adjust the placement of const
s in your PR.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 180 at r11 (raw file):
private: std::unique_ptr<cbt::Table> CreateTable(
I think this member function is unused.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 187 at r11 (raw file):
} std::shared_ptr<cbt::DataClient> CreateDataClient(
I think this member function is unused.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 220 at r11 (raw file):
col_name_full.substr(delimiter_pos + 1, col_name_full.length()); std::pair<std::string, std::string> pair(col_family, col_name); return pair;
Any reason to not write return std::make_pair(std::move(family), std::move(qualifier))
instead of declaring the local variable?
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 300 at r11 (raw file):
If this member function is required because it's abstract then please add the override
keyword to it.
The style guide says:
Explicitly annotate overrides of virtual functions or virtual destructors with exactly one of an override or (less frequently) final specifier.
BTW, please try to run the linter - it should catch such things.
tensorflow_io/python/ops/bigtable_dataset_ops.py, line 10 at r7 (raw file):
Previously, kboroszko (Kajetan Boroszko) wrote…
I wanted to indicate that it is not to be created on its own. It should only be made as a result of BigtableTable.read_rows. Bigquery connector follows a similar pattern.
Oh, sorry - I didn't notice the underscore. The case doesn't match, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 8 files reviewed, 13 unresolved discussions (waiting on @dopiera)
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 219 at r8 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
But you won't have to write a test for the corner cases :). Seriously speaking - the code is not trivial, as a matter of fact you made it get a SIGSEGV in a valid scenario (the column qualifier may be empty) - I think it's a pretty good indication that using something ready and tested is a better idea unless you have a really good reason not to.
StrSplit
has yet another advantage - it's immediately clear what's going on.Also, please rename the variables to reflect
column == family:qualifier
.
Used StrSplit.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 256 at r8 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
Sorry, I should have been more specific. The
absl::flat_hash_map
is self-explanatory, I think, but holding the references isn't. Could you please describe that we're keeping const references to strings, so that when searching for an index we do not need to copy strings when constructing the search key?BTW,
std::map
doesn't use hash functions -std::unordered_map
does and you can always give it an explicit hash function (e.g. fromabsl
) as a template argument. The difference betweenstd::unordered_map
andabsl::flat_hash_map
, I think, is thatstd
does open hashing andabsl
- closed. For something that is read mostly, theabsl
is likely better.
alrighty.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 1 at r9 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
Everything, actually :)
When you do this:
class Foo { void bar() { std::cout << "Hello world" << std::endl; } };
then
Foo::bar
is implicitly marked inline by the compiler.This is equivalent to
class Foo { inline void bar(); }; inline void Foo::bar() { std::cout << "Hello world" << std::endl; }
Only in the second version (when you drop the
inline
keyword) is the member function not inline.The style guide says, the trivial or very short member functions can be inlined.
Done.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 141 at r9 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
You left it as
long
still.
Done.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 154 at r9 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
We are, but unfortunately, I think this will not emit this warning - it only checks for what a status is. TBH, I'm surprised it compiles.
I think you want
LOG(WARNING) <<
. Please also add which column was found unexpected.
Done.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 275 at r9 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
I think you missed this comment.
I did. sorry. I guess it should work either way, no?
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 281 at r9 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
Why store the returned value in a local variable rather than return directly?
I restructured some of this code when debugging and then missed that when cleaning up. Sorry about that.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 1 at r11 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
I noticed only now - this project uses west const, please adjust the placement of
const
s in your PR.
Done.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 180 at r11 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
I think this member function is unused.
Done.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 187 at r11 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
I think this member function is unused.
Done.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 220 at r11 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
Any reason to not write
return std::make_pair(std::move(family), std::move(qualifier))
instead of declaring the local variable?
Done.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 300 at r11 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
If this member function is required because it's abstract then please add the
override
keyword to it.The style guide says:
Explicitly annotate overrides of virtual functions or virtual destructors with exactly one of an override or (less frequently) final specifier.
BTW, please try to run the linter - it should catch such things.
Done.
tensorflow_io/python/ops/bigtable_dataset_ops.py, line 10 at r7 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
Oh, sorry - I didn't notice the underscore. The case doesn't match, though.
Ah, in the comment! Pff, I completely missed that! sorry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're getting there, just a couple more nitpicks
Reviewable status: 0 of 8 files reviewed, 7 unresolved discussions (waiting on @dopiera and @kboroszko)
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 154 at r9 (raw file):
Previously, kboroszko (Kajetan Boroszko) wrote…
Done.
But VLOG is only for verbose logging. We want this error to be in all log configs, no?
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 39 at r12 (raw file):
CreateDataClient
Does this need to be public?
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 232 at r12 (raw file):
const std::shared_ptr<cbt::DataClient> data_client_; std::vector<std::pair<std::string, std::string>> columns_;
These two members can be const
if I'm not mistaken
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 273 at r12 (raw file):
} BigtableClientResource* client_resource() const { return client_resource_; }
Sorry, I only noticed now. Please return a reference, since the style guide says:
Avoid returning a pointer unless it can be null
And given that it can't be null, why not hold a a reference as well?
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 285 at r12 (raw file):
private: BigtableClientResource* client_resource_;
Sorry, I noticed it only now - I think all members can be const.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 286 at r12 (raw file):
private: BigtableClientResource* client_resource_; core::ScopedUnref scoped_unref;
You're missing the trailing underscore.
BTW, please avoid calling variables by their type - instead try to reflect what's their purpose, e.g. client_resource_unref
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 8 files reviewed, 7 unresolved discussions (waiting on @dopiera)
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 154 at r9 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
But VLOG is only for verbose logging. We want this error to be in all log configs, no?
Done.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 39 at r12 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
CreateDataClient
Does this need to be public?
Done.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 232 at r12 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
const std::shared_ptr<cbt::DataClient> data_client_; std::vector<std::pair<std::string, std::string>> columns_;
These two members can be
const
if I'm not mistaken
Done.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 273 at r12 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
Sorry, I only noticed now. Please return a reference, since the style guide says:
Avoid returning a pointer unless it can be null
And given that it can't be null, why not hold a a reference as well?
Done.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 285 at r12 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
Sorry, I noticed it only now - I think all members can be const.
Done.
tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 286 at r12 (raw file):
Previously, dopiera (Marek Dopiera) wrote…
You're missing the trailing underscore.
BTW, please avoid calling variables by their type - instead try to reflect what's their purpose, e.g.
client_resource_unref
.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 8 files reviewed, all discussions resolved (waiting on @dopiera)
Implements reading from bigtable in a synchronous manner.
Implements reading from bigtable in a synchronous manner.
Implements reading from bigtable in a synchronous manner.
Implements reading from bigtable in a synchronous manner.
Implements reading from bigtable in a synchronous manner.
Implements reading from bigtable in a synchronous manner.
This PR contains the basic implementation for reading from bigtable database.
For this purpose, it creates a custom dataset op. This is the first step towards implementing the read functionality. It reads the data straight as bytes and does minimal argument validation.
This change is