Skip to content

feat: reading from bigtable #2

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Merged
merged 36 commits into from
Oct 27, 2021
Merged

feat: reading from bigtable #2

merged 36 commits into from
Oct 27, 2021

Conversation

kboroszko
Copy link
Collaborator

@kboroszko kboroszko commented Oct 12, 2021

This PR contains the basic implementation for reading from bigtable database.

For this purpose, it creates a custom dataset op. This is the first step towards implementing the read functionality. It reads the data straight as bytes and does minimal argument validation.


This change is Reviewable

Copy link

@dopiera dopiera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 7 files reviewed, 21 unresolved discussions (waiting on @dopiera and @kboroszko)


WORKSPACE, line 115 at r1 (raw file):

    name = "com_google_googleapis",
    build_file = "@com_github_googleapis_google_cloud_cpp//bazel:googleapis.BUILD",
    sha256 = "a53e15405f81d5a32594d7f6486e649131fadda5431cf28377dff4ae54d45d16",

Why this change?


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 27 at r1 (raw file):

  void MakeDataset(OpKernelContext* ctx, DatasetBase** output) override {
    // Parse and validate any input tensors that define the dataset using

I don't understand this comment - what input tensors do we have here?


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 33 at r1 (raw file):

    // Create the dataset object, passing any (already-validated) arguments from
    // attrs or input tensors.

I'm also guessing this comment is left over from a hello world.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 56 at r1 (raw file):

    std::unique_ptr<IteratorBase> MakeIteratorInternal(
        const std::string& prefix) const {
      VLOG(1) << "MakeIteratorInternal. instance=" << project_id_ << ":"

I think it should be table=


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 58 at r1 (raw file):

      VLOG(1) << "MakeIteratorInternal. instance=" << project_id_ << ":"
              << instance_id_ << ":" << table_id_;
      return std::unique_ptr<IteratorBase>(

absl::make_unique maybe?


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 63 at r1 (raw file):

    }

    // Record structure: Each record is represented by a scalar string tensor.

I'm guessing that you left all the comments from a tutorial/example - please remove them if they do no


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 69 at r1 (raw file):

DataTypeVector

Shouldn't this vector be of the same size as columns_?


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 75 at r1 (raw file):

    const std::vector<PartialTensorShape>& output_shapes() const override {
      static std::vector<PartialTensorShape>* shapes =
          new std::vector<PartialTensorShape>({{}});

That's a weird way to write it. Why not simply static std::vector<PartialTensorShape> shapes({{}});?

BTW, the shape should somehow reflect the number of columns specified by the user, so I think a it should be a member rather than a global variable.

Anyway, google's C++ style guide says "Objects with static storage duration are forbidden unless they are trivially destructible."


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 84 at r1 (raw file):

   protected:
    // Optional: Implementation of `GraphDef` serialization for this dataset.

It seems you can safely remove this member function?


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 105 at r1 (raw file):

    std::vector<std::string> columns_;

    class Iterator : public DatasetIterator<Dataset> {

Please make the member function implementation non-inline. It's pretty hard to read with the extra indentation.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 112 at r1 (raw file):

                        std::vector<std::string> columns)
          : DatasetIterator<Dataset>(params),
            data_client_(CreateDataClient(project_id, instance_id)),

I think we don't have the same problem here as we had in pyTorch and we can keep the client in the Dataset. That's what Bigquery does and I think we should follow it.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 143 at r1 (raw file):

        if (it_ == reader_.end()) {
          VLOG(1) << "End of sequence";
          *end_of_sequence = true;

I think the code would be easier to comprehend if it followed this pattern:

if (it_ == reader_.end()) {
   ...
   *end_of_sequence = true;
   return Status::Ok();
}
*end_of_sequence = false;
// handle the non-trivial scenario

That way:

  • end_of_sequence is handled in one place of the function
  • there is less indentation
  • the reader doesn't have to scroll all the way down to learn that nothing else happens in case of reaching the end

tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 147 at r1 (raw file):

record_tensor

I'm guessing the name record came from Bigquery implementation, but Bigtable doesn't have "records" in its terminology. How about res?


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 148 at r1 (raw file):

record_v

It's pretty hard to decrypt what it means. How about res_data?


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 153 at r1 (raw file):

          auto const& row = *it_;
          for (const auto& cell : row.value().cells()) {
            std::pair<std::string, std::string> key(cell.family_name(),

auto const key = std::make_pair(cell.family_name(), cell.column_qualifier()) is a bit shorter.

Also, this is piece of code is performance critical, so I'd rather not lookup in the map twice. Use find() instead:

auto const column_it = column_map_.find({});
// somehow handle column_it == column_map_.end()
VLOG(1) << "getting column:" << *column_it;
record_v(*column_it) = cell.value();

Also, we should avoid unnecessary string copies here, so I'd write it as:

auto cells = std::move(row).cells();
for (auto & cell : cells) {
  auto const key = std::make_pair(cell.family_name(), cell.column_qualifier());
  ...
  record_v(*column_it) = std::move(cell).value();
}

There's even more room for optimization. Currently, we're copying the strings representing the column qualifier and family name. We could avoid it if the key in our column_map was std::pair<std::string const&, std::string const&>. You would have to manually ensure that the strings outlive the map, but that's not hard. Given that this code is executed for every cell, this might be worth it. WDYT?


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 204 at r1 (raw file):

pair

That's not a very descriptive variable name. How about column?


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 205 at r1 (raw file):

pair.first, pair.second

std::move(pair.first), std::move(pair.second)?


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 242 at r1 (raw file):

column_map_

In this project we can use absl::Hash, which has a hash function for pairs.

BTW, would column_to_idx_ be more self-descriptive?


tensorflow_io/core/ops/bigtable_ops.cc, line 8 at r1 (raw file):

    

That's no longer needed, right?


tensorflow_io/core/ops/bigtable_ops.cc, line 17 at r1 (raw file):

MyReaderDataset

Please update the comment :)


tensorflow_io/python/ops/bigtable_dataset_ops.py, line 11 at r1 (raw file):

MyDataset

I think we should rename it :)

Copy link
Collaborator Author

@kboroszko kboroszko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 7 files reviewed, 21 unresolved discussions (waiting on @dopiera and @kboroszko)


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 27 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

I don't understand this comment - what input tensors do we have here?

Whoops. Deleted.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 33 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

I'm also guessing this comment is left over from a hello world.

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 56 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

I think it should be table=

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 58 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

absl::make_unique maybe?

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 63 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

I'm guessing that you left all the comments from a tutorial/example - please remove them if they do no

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 69 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…
DataTypeVector

Shouldn't this vector be of the same size as columns_?

For the sake of this PR we assume that our output is a tensor of type DT_STRING, so no need for that. Will be changed in the next one.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 75 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

That's a weird way to write it. Why not simply static std::vector<PartialTensorShape> shapes({{}});?

BTW, the shape should somehow reflect the number of columns specified by the user, so I think a it should be a member rather than a global variable.

Anyway, google's C++ style guide says "Objects with static storage duration are forbidden unless they are trivially destructible."

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 84 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

It seems you can safely remove this member function?

Then the compiler whines that the abstract function is not implemented.

very funny gif. I am hillarious.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 105 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

Please make the member function implementation non-inline. It's pretty hard to read with the extra indentation.

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 112 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

I think we don't have the same problem here as we had in pyTorch and we can keep the client in the Dataset. That's what Bigquery does and I think we should follow it.

Definately. I plan to implement it in next PRs. This one focuses on reading data.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 143 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

I think the code would be easier to comprehend if it followed this pattern:

if (it_ == reader_.end()) {
   ...
   *end_of_sequence = true;
   return Status::Ok();
}
*end_of_sequence = false;
// handle the non-trivial scenario

That way:

  • end_of_sequence is handled in one place of the function
  • there is less indentation
  • the reader doesn't have to scroll all the way down to learn that nothing else happens in case of reaching the end

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 147 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…
record_tensor

I'm guessing the name record came from Bigquery implementation, but Bigtable doesn't have "records" in its terminology. How about res?

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 148 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

record_v

It's pretty hard to decrypt what it means. How about res_data?

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 242 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…
column_map_

In this project we can use absl::Hash, which has a hash function for pairs.

BTW, would column_to_idx_ be more self-descriptive?

Done.

Copy link
Collaborator Author

@kboroszko kboroszko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 7 files reviewed, 21 unresolved discussions (waiting on @dopiera and @kboroszko)


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 204 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…
pair

That's not a very descriptive variable name. How about column?

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 205 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…
pair.first, pair.second

std::move(pair.first), std::move(pair.second)?

Done.


tensorflow_io/core/ops/bigtable_ops.cc, line 8 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…
    

That's no longer needed, right?

It will be needed later. The 'BigtableDataset' Op is a special kind of Op representing a dataset. When we create a separate Op for creating and storing the client connection we'll need to use one just like BigtableTest, that's why I would leave it for now. It will serve as a template for creating a new Op in the next PR.


tensorflow_io/core/ops/bigtable_ops.cc, line 17 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…
MyReaderDataset

Please update the comment :)

Done.


tensorflow_io/python/ops/bigtable_dataset_ops.py, line 11 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…
MyDataset

I think we should rename it :)

Done.

Copy link
Collaborator Author

@kboroszko kboroszko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 7 files reviewed, 21 unresolved discussions (waiting on @dopiera and @kboroszko)


WORKSPACE, line 115 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

Why this change?

It's a newer version.

ERROR: /home/kajetan/.cache/bazel/_bazel_kajetan/0cce586c91f45fa0eaddae47f82490ab/external/com_github_googleapis_google_cloud_cpp/google/cloud/bigtable/BUILD:24:11: no such target '@com_google_googleapis//google/bigtable/admin/v2:admin_cc_grpc': target 'admin_cc_grpc' not declared in package 'google/bigtable/admin/v2' defined by /home/kajetan/.cache/bazel/_bazel_kajetan/0cce586c91f45fa0eaddae47f82490ab/external/com_google_googleapis/google/bigtable/admin/v2/BUILD.bazel and referenced by '@com_github_googleapis_google_cloud_cpp//google/cloud/bigtable:bigtable_client'

The error says that com_github_googleapis_google_cloud_cpp references some admin_cc_grpc but no such target exists in com_google_googleapis.

That's because the bigtable code from google_cloud_cpp is expecting a newer version of googleapis. As long as bigtable was not used, this discrepancy was not causing any errors because bazel didn't try to build it at all.

Copy link
Collaborator Author

@kboroszko kboroszko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 7 files reviewed, 21 unresolved discussions (waiting on @dopiera)


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 153 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

auto const key = std::make_pair(cell.family_name(), cell.column_qualifier()) is a bit shorter.

Also, this is piece of code is performance critical, so I'd rather not lookup in the map twice. Use find() instead:

auto const column_it = column_map_.find({});
// somehow handle column_it == column_map_.end()
VLOG(1) << "getting column:" << *column_it;
record_v(*column_it) = cell.value();

Also, we should avoid unnecessary string copies here, so I'd write it as:

auto cells = std::move(row).cells();
for (auto & cell : cells) {
  auto const key = std::make_pair(cell.family_name(), cell.column_qualifier());
  ...
  record_v(*column_it) = std::move(cell).value();
}

There's even more room for optimization. Currently, we're copying the strings representing the column qualifier and family name. We could avoid it if the key in our column_map was std::pair<std::string const&, std::string const&>. You would have to manually ensure that the strings outlive the map, but that's not hard. Given that this code is executed for every cell, this might be worth it. WDYT?

Done

Copy link

@dopiera dopiera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 8 files reviewed, 25 unresolved discussions (waiting on @dopiera and @kboroszko)


tensorflow_io/bigtable.py, line 1 at r9 (raw file):

from tensorflow_io.python.ops.bigtable_dataset_ops import BigtableDataset

Please remove this file for now


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 219 at r8 (raw file):

delimiter_pos + 1

This might be past a past-the-end position. I think you want to extend your if with (delimiter_pos > 0 && delimiter_pos + 1 < col_name_full.size())

Or better, use absl::StrSplit and verify the resulting vector length.

BTW, let's stick to the official naming: column == family:qualifier.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 256 at r8 (raw file):

  cbt::RowReader reader_ GUARDED_BY(mu_);
  cbt::v1::internal::RowReaderIterator it_ GUARDED_BY(mu_);
  absl::flat_hash_map<std::pair<std::string const&, std::string const&>, size_t>

Please add a comment explaining why we're using such an eccentric type. Also, can we make it const? If we can, we don't need GUARDED_BY, I think.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 1 at r9 (raw file):
A general comment, applying to all of the review is to not define the member functions inline. Google style guide says:

Do not put large method definitions inline in the class definition. Usually, only trivial or performance-critical, and very short, methods may be defined inline


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 35 at r9 (raw file):

class BigtableClientResource : public ResourceBase {
 public:
  explicit BigtableClientResource(std::string const& project_id,

Please make this ctor take the parameters by value and std::move them to data client creation.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 42 at r9 (raw file):

CreateDefaultDataClient

CreateDefaultDataClient() is deprecated, please use MakeDataClient.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 43 at r9 (raw file):

 std::move(project_id), std::move(instance_id)

Please either make this member function take arguments by value (preferred) or get rid of the std::move()s.

BTW, do we even need this member function? It seem unnecessary, especially in the public API.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 46 at r9 (raw file):

  }

  std::unique_ptr<cbt::Table> CreateTable(std::string const& table_id) {

cbt::Table is supposed to be lightweight - do we need the std::unique_ptr?


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 55 at r9 (raw file):

  mutex mu_;
  std::shared_ptr<cbt::DataClient> data_client_ TF_GUARDED_BY(mu_);

DataClient is thread safe and so is std::shared_ptr, so you don't need the mutex. You're not using it anyway ;)


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 59 at r9 (raw file):
Google style guide says:

don't put more than one or two blank lines between functions

Please remove at least one :)


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 75 at r9 (raw file):

VLOG(1) << "BigtableClientOp dtor";

I think this log is misplaced.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 120 at r9 (raw file):

  explicit Iterator(const typename DatasetIterator<Dataset>::Params& params,
                    std::string const& table_id,
                    std::vector<std::string> columns)

We don't need the columns by value - they are only read by CreateColumnPairs()


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 141 at r9 (raw file):

    VLOG(1) << "alocating tensor";
    long n_cols = column_to_idx_.size();

std::size_t const kNumCols?


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 154 at r9 (raw file):

        VLOG(1) << "getting column:" << column_idx->second;
        res_data(column_idx->second) = std::move(cell.value());
      } else {

This case deserves a serious warning - it means we got a column which we did not expect, so something is wrong with our filter generation or the Bigtable client/service.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 161 at r9 (raw file):

    VLOG(1) << "returning value";
    out_tensors->emplace_back(std::move(res));
    *end_of_sequence = false;

I think I'd move this line to right after the initial if, so that both cases are handled near each other.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 209 at r9 (raw file):

  }

  static std::pair<std::string, std::string> ColumnNameToPair(

ColumnToFamilyAndQualifier is only 10 characters longer but drastically more descriptive. If you want to save on screen real estate, Col2FamAndQual should work, too :)


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 224 at r9 (raw file):

  }

  static std::vector<std::pair<std::string, std::string>> CreateColumnPairs(

ColumnPair suggests a pair of columns. How about ColumnsToFamiliesAndQualifiers?


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 236 at r9 (raw file):

  static absl::flat_hash_map<std::pair<std::string const&, std::string const&>,
                             size_t>
  CreateColumnMap(

How about CreateColumn2TensorIdxMap?


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 242 at r9 (raw file):

                        size_t>
        column_map;
    size_t index = 0;

std::size_t


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 253 at r9 (raw file):

  mutex mu_;
  std::shared_ptr<cbt::DataClient> data_client_ GUARDED_BY(mu_);
  std::vector<std::pair<std::string, std::string>> columns_;

Can we make this member const? This would allow us to not use GUARDED_BY, indeed.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 275 at r9 (raw file):

  ~Dataset(){
      client_resource_->Unref();

Why not ScopedUnref() member instead having a non-trivial dtor?


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 278 at r9 (raw file):

MakeIteratorInternal

This member function should be marked @override, no?


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 281 at r9 (raw file):

 Iterator<Dataset>

Wouldn't return absl::make_unique<Iterator<Dataset>>({this, strings::StrCat(prefix, "::BigtableDataset")}, table_id_, columns_) suffice?


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 296 at r9 (raw file):

      return client_resource_;
    }

Something is weirdly formatted here.


tensorflow_io/python/ops/bigtable_dataset_ops.py, line 10 at r7 (raw file):

_BigTableDataset

I think you meant BigtableDataset?

Copy link
Collaborator Author

@kboroszko kboroszko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 8 files reviewed, 25 unresolved discussions (waiting on @dopiera, @kboroszko, and @OverRide)


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 219 at r8 (raw file):

Previously, dopiera (Marek Dopiera) wrote…
delimiter_pos + 1

This might be past a past-the-end position. I think you want to extend your if with (delimiter_pos > 0 && delimiter_pos + 1 < col_name_full.size())

Or better, use absl::StrSplit and verify the resulting vector length.

BTW, let's stick to the official naming: column == family:qualifier.

😒 StrSplit returns a vector. I then have to check the size of that vector and verify non-zero length of both elements of that vector. I'd rather extend the if.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 256 at r8 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

Please add a comment explaining why we're using such an eccentric type. Also, can we make it const? If we can, we don't need GUARDED_BY, I think.

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 35 at r9 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

Please make this ctor take the parameters by value and std::move them to data client creation.

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 42 at r9 (raw file):

Previously, dopiera (Marek Dopiera) wrote…
CreateDefaultDataClient

CreateDefaultDataClient() is deprecated, please use MakeDataClient.

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 43 at r9 (raw file):

Previously, dopiera (Marek Dopiera) wrote…
 std::move(project_id), std::move(instance_id)

Please either make this member function take arguments by value (preferred) or get rid of the std::move()s.

BTW, do we even need this member function? It seem unnecessary, especially in the public API.

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 46 at r9 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

cbt::Table is supposed to be lightweight - do we need the std::unique_ptr?

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 55 at r9 (raw file):

Previously, dopiera (Marek Dopiera) wrote…
  mutex mu_;
  std::shared_ptr<cbt::DataClient> data_client_ TF_GUARDED_BY(mu_);

DataClient is thread safe and so is std::shared_ptr, so you don't need the mutex. You're not using it anyway ;)

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 59 at r9 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

Google style guide says:

don't put more than one or two blank lines between functions

Please remove at least one :)

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 75 at r9 (raw file):

Previously, dopiera (Marek Dopiera) wrote…
VLOG(1) << "BigtableClientOp dtor";

I think this log is misplaced.

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 120 at r9 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

We don't need the columns by value - they are only read by CreateColumnPairs()

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 141 at r9 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

std::size_t const kNumCols?

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 209 at r9 (raw file):

ColumnToFamilyAndQualifier
Eh, I have a wide screen.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 224 at r9 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

ColumnPair suggests a pair of columns. How about ColumnsToFamiliesAndQualifiers?

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 242 at r9 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

std::size_t

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 253 at r9 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

Can we make this member const? This would allow us to not use GUARDED_BY, indeed.

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 278 at r9 (raw file):

Previously, dopiera (Marek Dopiera) wrote…
MakeIteratorInternal

This member function should be marked @override, no?

It's not java so no @OverRide 😆, but I added override in the appropriate place.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 281 at r9 (raw file):

Previously, dopiera (Marek Dopiera) wrote…
 Iterator<Dataset>

Wouldn't return absl::make_unique<Iterator<Dataset>>({this, strings::StrCat(prefix, "::BigtableDataset")}, table_id_, columns_) suffice?

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 296 at r9 (raw file):

Previously, dopiera (Marek Dopiera) wrote…
      return client_resource_;
    }

Something is weirdly formatted here.

Done.

Copy link
Collaborator Author

@kboroszko kboroszko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 8 files reviewed, 25 unresolved discussions (waiting on @dopiera, @kboroszko, and @OverRide)


tensorflow_io/bigtable.py, line 1 at r9 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

Please remove this file for now

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 1 at r9 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

A general comment, applying to all of the review is to not define the member functions inline. Google style guide says:

Do not put large method definitions inline in the class definition. Usually, only trivial or performance-critical, and very short, methods may be defined inline

Have I defined anything inline?


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 154 at r9 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

This case deserves a serious warning - it means we got a column which we did not expect, so something is wrong with our filter generation or the Bigtable client/service.

We are very serious people with very serious warnings.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 161 at r9 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

I think I'd move this line to right after the initial if, so that both cases are handled near each other.

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 236 at r9 (raw file):

CreateColumn2TensorIdxMap
Maybe ColumnToIdxMap?


tensorflow_io/python/ops/bigtable_dataset_ops.py, line 10 at r7 (raw file):

Previously, dopiera (Marek Dopiera) wrote…
_BigTableDataset

I think you meant BigtableDataset?

I wanted to indicate that it is not to be created on its own. It should only be made as a result of BigtableTable.read_rows. Bigquery connector follows a similar pattern.

Copy link

@dopiera dopiera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 8 files reviewed, 13 unresolved discussions (waiting on @dopiera and @kboroszko)


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 219 at r8 (raw file):

Previously, kboroszko (Kajetan Boroszko) wrote…

😒 StrSplit returns a vector. I then have to check the size of that vector and verify non-zero length of both elements of that vector. I'd rather extend the if.

But you won't have to write a test for the corner cases :). Seriously speaking - the code is not trivial, as a matter of fact you made it get a SIGSEGV in a valid scenario (the column qualifier may be empty) - I think it's a pretty good indication that using something ready and tested is a better idea unless you have a really good reason not to. StrSplit has yet another advantage - it's immediately clear what's going on.

Also, please rename the variables to reflect column == family:qualifier.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 256 at r8 (raw file):

Previously, kboroszko (Kajetan Boroszko) wrote…

Done.

Sorry, I should have been more specific. The absl::flat_hash_map is self-explanatory, I think, but holding the references isn't. Could you please describe that we're keeping const references to strings, so that when searching for an index we do not need to copy strings when constructing the search key?

BTW, std::map doesn't use hash functions - std::unordered_map does and you can always give it an explicit hash function (e.g. from absl) as a template argument. The difference between std::unordered_map and absl::flat_hash_map, I think, is that std does open hashing and absl - closed. For something that is read mostly, the absl is likely better.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 1 at r9 (raw file):

Previously, kboroszko (Kajetan Boroszko) wrote…

Have I defined anything inline?

Everything, actually :)

When you do this:

class Foo {
  void bar() {
    std::cout << "Hello world" << std::endl;
  }
};

then Foo::bar is implicitly marked inline by the compiler.

This is equivalent to

class Foo {
  inline void bar();
};

inline void Foo::bar() {
  std::cout << "Hello world" << std::endl;
}

Only in the second version (when you drop the inline keyword) is the member function not inline.

The style guide says, the trivial or very short member functions can be inlined.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 141 at r9 (raw file):

Previously, kboroszko (Kajetan Boroszko) wrote…

Done.

You left it as long still.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 154 at r9 (raw file):

Previously, kboroszko (Kajetan Boroszko) wrote…

We are very serious people with very serious warnings.

We are, but unfortunately, I think this will not emit this warning - it only checks for what a status is. TBH, I'm surprised it compiles.

I think you want LOG(WARNING) << . Please also add which column was found unexpected.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 275 at r9 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

Why not ScopedUnref() member instead having a non-trivial dtor?

I think you missed this comment.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 281 at r9 (raw file):

Previously, kboroszko (Kajetan Boroszko) wrote…

Done.

Why store the returned value in a local variable rather than return directly?


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 1 at r11 (raw file):

/* Copyright 2021 The TensorFlow Authors. All Rights Reserved.

I noticed only now - this project uses west const, please adjust the placement of consts in your PR.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 180 at r11 (raw file):

 private:
  std::unique_ptr<cbt::Table> CreateTable(

I think this member function is unused.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 187 at r11 (raw file):

  }

  std::shared_ptr<cbt::DataClient> CreateDataClient(

I think this member function is unused.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 220 at r11 (raw file):

        col_name_full.substr(delimiter_pos + 1, col_name_full.length());
    std::pair<std::string, std::string> pair(col_family, col_name);
    return pair;

Any reason to not write return std::make_pair(std::move(family), std::move(qualifier)) instead of declaring the local variable?


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 300 at r11 (raw file):
If this member function is required because it's abstract then please add the override keyword to it.

The style guide says:

Explicitly annotate overrides of virtual functions or virtual destructors with exactly one of an override or (less frequently) final specifier.

BTW, please try to run the linter - it should catch such things.


tensorflow_io/python/ops/bigtable_dataset_ops.py, line 10 at r7 (raw file):

Previously, kboroszko (Kajetan Boroszko) wrote…

I wanted to indicate that it is not to be created on its own. It should only be made as a result of BigtableTable.read_rows. Bigquery connector follows a similar pattern.

Oh, sorry - I didn't notice the underscore. The case doesn't match, though.

Copy link
Collaborator Author

@kboroszko kboroszko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 8 files reviewed, 13 unresolved discussions (waiting on @dopiera)


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 219 at r8 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

But you won't have to write a test for the corner cases :). Seriously speaking - the code is not trivial, as a matter of fact you made it get a SIGSEGV in a valid scenario (the column qualifier may be empty) - I think it's a pretty good indication that using something ready and tested is a better idea unless you have a really good reason not to. StrSplit has yet another advantage - it's immediately clear what's going on.

Also, please rename the variables to reflect column == family:qualifier.

Used StrSplit.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 256 at r8 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

Sorry, I should have been more specific. The absl::flat_hash_map is self-explanatory, I think, but holding the references isn't. Could you please describe that we're keeping const references to strings, so that when searching for an index we do not need to copy strings when constructing the search key?

BTW, std::map doesn't use hash functions - std::unordered_map does and you can always give it an explicit hash function (e.g. from absl) as a template argument. The difference between std::unordered_map and absl::flat_hash_map, I think, is that std does open hashing and absl - closed. For something that is read mostly, the absl is likely better.

alrighty.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 1 at r9 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

Everything, actually :)

When you do this:

class Foo {
  void bar() {
    std::cout << "Hello world" << std::endl;
  }
};

then Foo::bar is implicitly marked inline by the compiler.

This is equivalent to

class Foo {
  inline void bar();
};

inline void Foo::bar() {
  std::cout << "Hello world" << std::endl;
}

Only in the second version (when you drop the inline keyword) is the member function not inline.

The style guide says, the trivial or very short member functions can be inlined.

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 141 at r9 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

You left it as long still.

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 154 at r9 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

We are, but unfortunately, I think this will not emit this warning - it only checks for what a status is. TBH, I'm surprised it compiles.

I think you want LOG(WARNING) << . Please also add which column was found unexpected.

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 275 at r9 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

I think you missed this comment.

I did. sorry. I guess it should work either way, no?


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 281 at r9 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

Why store the returned value in a local variable rather than return directly?

I restructured some of this code when debugging and then missed that when cleaning up. Sorry about that.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 1 at r11 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

I noticed only now - this project uses west const, please adjust the placement of consts in your PR.

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 180 at r11 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

I think this member function is unused.

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 187 at r11 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

I think this member function is unused.

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 220 at r11 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

Any reason to not write return std::make_pair(std::move(family), std::move(qualifier)) instead of declaring the local variable?

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 300 at r11 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

If this member function is required because it's abstract then please add the override keyword to it.

The style guide says:

Explicitly annotate overrides of virtual functions or virtual destructors with exactly one of an override or (less frequently) final specifier.

BTW, please try to run the linter - it should catch such things.

Done.


tensorflow_io/python/ops/bigtable_dataset_ops.py, line 10 at r7 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

Oh, sorry - I didn't notice the underscore. The case doesn't match, though.

Ah, in the comment! Pff, I completely missed that! sorry.

Copy link

@dopiera dopiera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're getting there, just a couple more nitpicks

Reviewable status: 0 of 8 files reviewed, 7 unresolved discussions (waiting on @dopiera and @kboroszko)


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 154 at r9 (raw file):

Previously, kboroszko (Kajetan Boroszko) wrote…

Done.

But VLOG is only for verbose logging. We want this error to be in all log configs, no?


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 39 at r12 (raw file):

CreateDataClient

Does this need to be public?


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 232 at r12 (raw file):

  const std::shared_ptr<cbt::DataClient> data_client_;
  std::vector<std::pair<std::string, std::string>> columns_;

These two members can be const if I'm not mistaken


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 273 at r12 (raw file):

  }

  BigtableClientResource* client_resource() const { return client_resource_; }

Sorry, I only noticed now. Please return a reference, since the style guide says:
Avoid returning a pointer unless it can be null

And given that it can't be null, why not hold a a reference as well?


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 285 at r12 (raw file):

 private:
  BigtableClientResource* client_resource_;

Sorry, I noticed it only now - I think all members can be const.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 286 at r12 (raw file):

 private:
  BigtableClientResource* client_resource_;
  core::ScopedUnref scoped_unref;

You're missing the trailing underscore.

BTW, please avoid calling variables by their type - instead try to reflect what's their purpose, e.g. client_resource_unref.

Copy link
Collaborator Author

@kboroszko kboroszko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 8 files reviewed, 7 unresolved discussions (waiting on @dopiera)


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 154 at r9 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

But VLOG is only for verbose logging. We want this error to be in all log configs, no?

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 39 at r12 (raw file):

Previously, dopiera (Marek Dopiera) wrote…
CreateDataClient

Does this need to be public?

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 232 at r12 (raw file):

Previously, dopiera (Marek Dopiera) wrote…
  const std::shared_ptr<cbt::DataClient> data_client_;
  std::vector<std::pair<std::string, std::string>> columns_;

These two members can be const if I'm not mistaken

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 273 at r12 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

Sorry, I only noticed now. Please return a reference, since the style guide says:
Avoid returning a pointer unless it can be null

And given that it can't be null, why not hold a a reference as well?

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 285 at r12 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

Sorry, I noticed it only now - I think all members can be const.

Done.


tensorflow_io/core/kernels/bigtable/bigtable_dataset_kernel.cc, line 286 at r12 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

You're missing the trailing underscore.

BTW, please avoid calling variables by their type - instead try to reflect what's their purpose, e.g. client_resource_unref.

Done.

Copy link

@dopiera dopiera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 8 files reviewed, all discussions resolved (waiting on @dopiera)

@kboroszko kboroszko merged commit 66644c0 into master Oct 27, 2021
@kboroszko kboroszko deleted the kb/dataset branch October 27, 2021 08:38
kboroszko added a commit that referenced this pull request Nov 23, 2021
Implements reading from bigtable in a synchronous manner.
dopiera pushed a commit that referenced this pull request Nov 24, 2021
Implements reading from bigtable in a synchronous manner.
dopiera pushed a commit that referenced this pull request Dec 3, 2021
Implements reading from bigtable in a synchronous manner.
dopiera pushed a commit that referenced this pull request Dec 3, 2021
Implements reading from bigtable in a synchronous manner.
kboroszko added a commit that referenced this pull request Dec 13, 2021
Implements reading from bigtable in a synchronous manner.
kboroszko added a commit that referenced this pull request Dec 20, 2021
Implements reading from bigtable in a synchronous manner.
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants