feat: support Bigtable dataset #1578
@dopiera Thanks for the PR! I think |
@yongtang I can be the point of contact for the sake of this PR. Thank you for your comments. We have placed the |
Thanks @kboroszko, bigquery was kept at the top level mostly for backward-compatibility reasons (so as not to break existing users). This actually caused issues in docs generation, and special handling had to be used (see https://github.com/tensorflow/io/blob/master/tensorflow_io/python/api/__init__.py#L28-L32). We do want to eventually group bigquery with the other Google Cloud related APIs, and have it placed in the same way as other standard APIs (through https://github.com/tensorflow/io/blob/master/tensorflow_io/python/api/__init__.py#L22-L26). For bigtable, since it is also part of Google Cloud, I think (We can alias APIs of In order to expose
That will make sure the docs generation follows the same pattern and shows up correctly on tensorflow.org. |
@yongtang I would like to avoid putting everything in the gcloud module in the same namespace, as it can quickly become very confusing. For instance, both bigtable and bigquery have tables, rows, etc., so we could end up with several "Table" classes in the gcloud module. To avoid the confusion I could place everything in Also, since
It's pretty painful to write out the whole path explicitly every time. Another argument for keeping them separate is that they don't share code and are independent technologies, so apart from both being created by Google I don't see a reason to keep them together. How about just naming it |
Thanks for the explanation @kboroszko. This is a fair point. The reason we are gradually phasing out top-level namespaces is that in the past the number of top-level namespaces exploded and we had to scale back. Assuming we are not going to see many more services like bigquery and bigtable in the future, it should be fine to place the API as part of In that case, we still want
Can you take a look and see if the above will resolve the issues? |
@kboroszko I think the issue is likely caused by the Windows artifact being too large to upload to GitHub Actions' storage. I have created PR #1582 to remove unneeded files from the artifact before upload. |
@yongtang That's great, thank you for the heads up. As soon as it's merged I will update this PR so that it includes the fix. |
@kboroszko PR #1582 has been merged. Can you update this PR? I believe the build will pass after that. |
Implements reading from bigtable in a synchronous manner.
In this PR we make the read methods accept a row_set, so that only the rows specified by the user are read. We also add a parallel read that leverages the sample_row_keys method to split work among workers.
This PR adds support for Bigtable version filters.
moved bigtable to tensorflow_io.python.api
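The parallel read described in the commits above splits work among workers using sampled row keys. The following is a minimal sketch of that idea only, not the code from this PR: `split_by_samples` and `assign_round_robin` are hypothetical helpers, the sample keys are made up, and an empty string is assumed to mean an open-ended bound.

```python
from typing import List, Tuple

KeyRange = Tuple[str, str]


def split_by_samples(start: str, end: str, samples: List[str]) -> List[KeyRange]:
    """Split the half-open key range [start, end) at the sampled row keys.

    Hypothetical helper: "" is treated as an unbounded start or end.
    """
    # Keep only samples that fall strictly inside the requested range.
    inside = sorted(
        k for k in set(samples) if k > start and (end == "" or k < end)
    )
    bounds = [start] + inside + [end]
    # Consecutive boundaries form contiguous, non-overlapping sub-ranges.
    return list(zip(bounds[:-1], bounds[1:]))


def assign_round_robin(ranges: List[KeyRange], num_workers: int) -> List[List[KeyRange]]:
    """Distribute sub-ranges across workers in round-robin order."""
    shards: List[List[KeyRange]] = [[] for _ in range(num_workers)]
    for i, key_range in enumerate(ranges):
        shards[i % num_workers].append(key_range)
    return shards


# Made-up sample keys standing in for the result of a sample_row_keys call.
ranges = split_by_samples("", "", ["row200", "row100", "row300"])
shards = assign_round_robin(ranges, num_workers=2)
```

Each worker then reads only its own sub-ranges, which is why rows from different workers arrive in no particular global order.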
@yongtang It's done. Fingers crossed! 🤞 |
It looks like all the tests hang. I don't know if this is a GitHub CI issue or a code issue. I will retrigger the tests to give it another run. |
@kboroszko Looks like the tests just hang (and time out after 6 hours). I think this might be related to the code change. Can you take a look? |
Hi @yongtang, all current builds are failing because the freetype repository is unavailable and the mirrors are not working. Apart from that, I have added pytest-timeout and it looks like it's the Do you have any idea what might be the cause? |
@dopiera If this is the only one we can disable the test for now. Can you add |
* changed path to bigtable emulator and cbt in tests
* moved arguments' initializations to the body of the function in bigtable_ops.py
* fixed interleaveFromRange of column filters when using only one column
@yongtang It turned out to be a problem with xdist using fork that was causing our tests to hang. It was quite hard to find, but it's fixed now. Python evaluates default arguments once, when the function is defined (at import time), so those objects already exist when xdist forks the process to run the tests in parallel, and the whole thing hangs. We fixed it by initializing the arguments to their default values in the body of the function, so they are created after the fork. It passes the tests in the CI now. |
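To illustrate the default-argument behavior described in the comment above, here is a small sketch. It is not the code from bigtable_ops.py: `make_filter`, `read_eager`, and `read_lazy` are hypothetical names, and a plain `object()` stands in for whatever fork-sensitive object the real defaults created.

```python
import os

creation_log = []  # records the PID of every process that builds a default


def make_filter():
    """Hypothetical stand-in for an object that must not be created before a fork."""
    creation_log.append(os.getpid())
    return object()


# Problematic pattern: the default is evaluated once, when `def` executes,
# i.e. at import time and before any test runner has forked.
def read_eager(row_filter=make_filter()):
    return row_filter


# The default was already built even though read_eager was never called.
created_at_definition = len(creation_log)


# Fix: use None as a sentinel and build the default inside the body,
# so it is created in whichever process actually calls the function.
def read_lazy(row_filter=None):
    if row_filter is None:
        row_filter = make_filter()
    return row_filter
```

With this pattern, `read_eager()` returns the same pre-built object on every call, while `read_lazy()` builds a fresh one per call in the calling process.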
@kboroszko Looks like the Linux tests pass, though there might be some issues with macOS. Can you disable the macOS tests with |
* disable tests on macos
@yongtang I have disabled the tests on macOS. |
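The thread does not show how the macOS tests were disabled. One common way to skip a test class on a single platform, shown here purely as an assumption using the standard library's `unittest.skipIf`, looks like this. The class name, test body, and skip reason are all hypothetical.

```python
import sys
import unittest

IS_MACOS = sys.platform == "darwin"


# Assumption: a platform check on sys.platform; the PR may have used a
# different mechanism that is not visible in this conversation.
@unittest.skipIf(IS_MACOS, "hypothetical reason: Bigtable tests disabled on macOS")
class BigtableReadTest(unittest.TestCase):
    def test_placeholder(self):
        # Placeholder assertion standing in for a real Bigtable read test.
        self.assertEqual(1 + 1, 2)


# Run the class programmatically so the skip behavior can be inspected.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(BigtableReadTest)
result = unittest.TestResult()
suite.run(result)
```

pytest also honors `unittest` skip decorators on `TestCase` classes, so a suite run under pytest would report such a class as skipped on macOS.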
LGTM, thanks for the great effort to make it work!
@yongtang thank you too! I would merge it but I get a msg:
So I guess you'll have to do it. |
@pierreoberholzer Thanks for the reminder! The issue has been closed now. |
This PR adds the ability to read tensors from Bigtable.
The design is heavily inspired by BigQuery's implementation and shares some code with https://github.com/Unoperate/pytorch-cbt
It supports both sequential (sorted) reads and parallel reads (no particular row order).
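The difference between the two read modes can be sketched with an in-memory stand-in. This is only an illustration of the ordering guarantee, not the dataset implementation from this PR: `TABLE`, `RANGES`, and all three functions are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical in-memory "table": row key -> value.
TABLE = {f"row{i:03d}": i for i in range(12)}
# Hypothetical contiguous half-open key ranges covering the whole table.
RANGES = [("row000", "row004"), ("row004", "row008"), ("row008", "row012")]


def read_range(start, end):
    """Return the rows with start <= key < end, sorted by key."""
    return [(k, TABLE[k]) for k in sorted(TABLE) if start <= k < end]


def sequential_read(ranges):
    """Read ranges one after another; the output is globally sorted by key."""
    rows = []
    for start, end in ranges:
        rows.extend(read_range(start, end))
    return rows


def parallel_read(ranges, num_workers=3):
    """Read ranges concurrently; chunks arrive in completion order."""
    rows = []
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(read_range, s, e) for s, e in ranges]
        for future in as_completed(futures):
            rows.extend(future.result())
    return rows
```

Both functions return the same set of rows; only the sequential variant guarantees that they come back sorted by row key.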