RFC: Modular Filesystems C API #101
Thanks for sending the RFC!
Overall I think two things are confusing:
- it's not clear what parts of the existing system need to be understood by users of the APIs proposed here, and you go into a lot of detail about the file APIs in TF. I think you should fork that documentation off into a separate doc.
- the key detail of how to handle versioning and upgrading is buried, and you show more than one implementation of the core structs. I think it'd be simpler if you only showed the intended final implementation of each struct (assuming you go with making the structs public, which I advise you to avoid; instead, expose them only through functions to allocate/free and set/get the individual fields).
The RFC files got moved to a new directory, but the text was still using the old path.
```cpp
// Operations on a TF_RandomAccessFile
typedef struct TF_RandomAccessFileOps {
  size_t (*const Read)(const TF_RandomAccessFile*, uint64, size_t, char*, TF_Status*);
```
@mihaimaruseac One very useful method for a random access file could be `Size` (or `GetSize`), which returns the size of the file. At the moment, the file size is not visible from `RandomAccessFile`, which is really inconvenient in many situations.

The reason is that many file formats actually have their metadata at the end of the file (not at the beginning). One good (and important) example is the zip file: a zip file's metadata is stored at the end of the file (the central directory), so any parser opening a zip file has to find the file size first before processing. With a `GetSize` (or `Size`) API, `RandomAccessFile` could be greatly optimized.

Note that technically it is possible to find the file size with `Read` by probing until EOF is reached, but this is really not efficient.

Of course, some file systems may not be able to provide a size beforehand (e.g., streaming ones). I think by default `GetSize` could simply return `NotImplemented`, so any file-size-based optimization would only be "best effort".

For the optimization: if a `GetSize`/`Size` API is exposed, code reading files could try `GetSize` first. If `NotImplemented` is returned, the normal approach of probing through to the end is used; otherwise the size of the file can be used directly.

Note: ZIP is the format mentioned, but a lot of files are actually zip files with a `[Content_Types].xml` or `META-INF` file inside. For example, Windows Office files (pptx, docx, etc.) are all zip files, Java jar files are zip files, and Python whl files are zip files as well. You could just use `unzip` to decompress a pptx, jar, or whl file and see the individual content files.
I agree with this request. I only included the API calls that are currently present in the existing classes, but I guess we can expand these. I'm going to add a new section on proposed additions to the op tables.
Actually, one of the parameters of `Read` is the offset from where you want to read, and another is the number of bytes to read. So, if reading a zip file, you can go directly to the end of the file and read the size section from there.

Sure, we can add it as a convenience function, but note that we identify the filesystem plugin based only on the scheme part of the URI, not the extension (`hdfs://path/to/file.zip` and `/path/to/file.zip` will use different implementations, but `/path/to/file.zip` and `/path/to/another/file.txt` will use the same ops).
@mihaimaruseac While it is possible to read the content at an offset, the `Read` API does not give you the option to specify `offset = -10` to read from 10 bytes before the end of the file (EOF). In order to achieve that, we need to find out the length of the file (say, 100) and subtract 10 (100 - 10 = 90) to read from offset `90` (vs. `-10`).

The filesystem plugin will be consumed by a client calling it (a function inside TensorFlow, or another plugin/custom op outside the TensorFlow repo). For example, inside TensorFlow one function might want to parse a file (not necessarily a zip file) with file name `hdfs://path/to/special/file`. Outside of TensorFlow, another plugin might want to access a TensorFlow filesystem scheme URI like `gcs://path/to/another/file`.

Eventually the function (inside or outside TensorFlow) will need to access `Read` (or the proposed `GetSize`/`GetFileSize`) through the plugin. If `GetSize`/`GetFileSize` is not exposed, then the size is unknown to this function, and it will have to probe through the file to find the EOF.
I wasn't clear enough, sorry. Let me try with an example.

Suppose we live in a world where there are only 3 types of files: `*.start`, which have the size metadata at the beginning; `*.end`, which have it at the end; and `*.none`, which have no size metadata.

Suppose we also have only two filesystems: A (`a:///path/to/file`) and B (`b:///path/to/file`). Let's assume that A allows rewinds (so you can read the file and then come back to its start) while B is streaming only (once you read some part of the file, you can't go back unless you open the file again).

Suppose we have a piece of user code which takes as input a filepath and returns the size of the file; let's call it `ExternalFileSize(const string& path)`.

There are 4 possibilities:
1. We keep the API as it is. In this case, `ExternalFileSize(path)` will call `Env::Default()->NewRandomAccessFile(...)` to get a new `RandomAccessFile` object and will have to call `Read` on it until the end, increasing a size variable and then returning it. This, I agree with you, is not that good.

   ```cpp
   // This code is outside TensorFlow
   size_t ExternalFileSize(const string& path) {
     RandomAccessFile file;
     Env::Default()->NewRandomAccessFile(path, &file);
     return ReadFromStartToEndAndGetSize(&file);
   }

   // Extracting this as a separate function as I refer to it below.
   // This is still outside TensorFlow.
   size_t ReadFromStartToEndAndGetSize(std::unique_ptr<RandomAccessFile> file) {
     size_t size = 0;
     // ignore casts from size_t to uint64 and from Status to bool
     while (file->Read(size, amount_to_read, output, result)) {
       // assume no partial reads for now as it doesn't matter here
       size += amount_to_read;
     }
     return size;
   }
   ```
2. We add a `GetSize` method in the plugin API and in the `RandomAccessFile` class. To implement `ExternalFileSize(path)` we now call `NewRandomAccessFile` and `GetSize` on the returned object. However, `NewRandomAccessFile` dispatches on the scheme, so both plugins need to handle `*.start`, `*.none` and `*.end` files inside the `GetSize` equivalent, resulting in code duplication as well as behavior that is not common across plugins. This is what I was mentioning above, in the second paragraph.

   ```cpp
   // This code is outside TensorFlow
   size_t ExternalFileSize(const string& path) {
     RandomAccessFile file;
     Env::Default()->NewRandomAccessFile(path, &file);
     return file->GetSize();
   }

   // This code is in the plugin (both A and B)
   size_t GetFileSize(const TF_RandomAccessFile *file) {
     const char *filename = file->filename;
     // SizeAtStart, SizeAtEnd, ReadFirstBlock, ReadLastBlock,
     // ReadSizeFromMetadata, ReadFromStartToEndAndGetSize are all
     // implemented in each plugin: code duplication.
     if (SizeAtStart(filename)) {
       char* metadata = ReadFirstBlock(file);
       return ReadSizeFromMetadata(metadata);
     } else if (SizeAtEnd(filename)) {
       char* metadata = ReadLastBlock(file);
       return ReadSizeFromMetadata(metadata);
     } else {
       // This calls the plugin function, not the external one!
       return ReadFromStartToEndAndGetSize(file);
     }
   }
   ```
3. We allow `Read` to receive a negative argument for the offset. In this case, `ExternalFileSize(path)` first looks at the path extension. If it is `none`, it calls `Read` and increments a size variable until end of file, returning this size. If it is `end`, it calls `Read` with the proper negative offset, reads until EOF and returns the size from the metadata. If it is `start`, it reads only the metadata from the start and returns the size field. Since `Read` is now in core TensorFlow, each plugin only needs to implement it with support for negative offsets. If negative offsets are not supported (i.e., streaming filesystems), then `Read` with a negative offset fails and the code falls back to the same case as the `none` extension above.

   ```cpp
   // This code is outside TensorFlow
   size_t ExternalFileSize(const string& path) {
     RandomAccessFile file;
     Env::Default()->NewRandomAccessFile(path, &file);
     // SizeAtStart, SizeAtEnd, ReadSizeFromMetadata, ReadFromStartToEndAndGetSize
     // are implemented only once, in the external code.
     if (SizeAtStart(filename)) {
       file->Read(0, amount_to_read, output, metadata);
       return ReadSizeFromMetadata(metadata);
     } else if (SizeAtEnd(filename)) {
       if (file->Read(-amount_to_read, amount_to_read, output, metadata)) {
         return ReadSizeFromMetadata(metadata);
       } else {
         // The filesystem doesn't support negative offsets (filesystem B),
         // so fall back to the default.
         return ReadFromStartToEndAndGetSize(file);
       }
     } else {
       return ReadFromStartToEndAndGetSize(file);
     }
   }

   // This code is in plugin A
   size_t Read(const TF_RandomAccessFile *file, uint64 offset, ...) {
     if (offset < 0) {
       // position from the end and read ...
     } else {
       // read directly ...
     }
   }

   // This code is in plugin B
   size_t Read(const TF_RandomAccessFile *file, uint64 offset, ...) {
     if (offset < 0) {
       // error, as seek is not supported
     } else {
       // read directly ...
     }
   }
   ```
4. We change `Read` to only receive the number of bytes to read and add a `Seek` method to specify where the read should happen from. `Seek` will have an argument for the relative position (negative or positive offset) and an enum argument to specify whether the position change is relative to the current position, the start of the file or the end of the file.

   ```cpp
   // This code is outside TensorFlow
   size_t ExternalFileSize(const string& path) {
     RandomAccessFile file;
     Env::Default()->NewRandomAccessFile(path, &file);
     // SizeAtStart, SizeAtEnd, ReadSizeFromMetadata, ReadFromStartToEndAndGetSize
     // are implemented only once, in the external code.
     if (SizeAtStart(filename)) {
       file->Read(amount_to_read, output, metadata);
       return ReadSizeFromMetadata(metadata);
     } else if (SizeAtEnd(filename) && file->Seek(-amount_to_read, FILE_END)) {
       file->Read(amount_to_read, output, metadata);
       return ReadSizeFromMetadata(metadata);
     } else {
       return ReadFromStartToEndAndGetSize(file);
     }
   }
   ```
This is similar to 3. above, but has a clearer API.
I would go with 4 or 3 instead of 2 as you propose.
@mihaimaruseac Thanks. I think 3 or 4 makes sense 👍. For 3, maybe the offset should be `int64` instead of `uint64` (to allow negative numbers)?
It has to be `int64`, indeed.
Option (4) would prevent concurrent reads from the `RandomAccessFile` (and require additional locking); while this functionality is currently underused, I expect this to change.

I would really like to see a `GetSize()` method (option 2) for `RandomAccessFile`; currently `GetFileSize()` is used as a substitute either before or after `NewRandomAccessFile()`. `GetSize()` could ensure, on most file systems, that both operations actually refer to the same file object in the presence of rename/delete file operations.
So that leaves option 3 to read the size after decompression.

Indeed, `GetFileSize()` returns the size of the file on disk. Probably we can also implement `GetSize()`, which does the same thing, returning the size of the file at the moment it was read (and filesystems can decide not to implement it, in which case we'll default to calling `GetFileSize()` from the filesystem object itself).
Thank you @alextp for all the review. I incorporated all of your changes, either by inline editing, extracting to an appendix, or deferring to an open questions section (including the way I'm leaning) at the end of the document.
### Optional additional changes to TensorFlow C/C++ core

Something we might elect to do is to force all future filesystem implementations
to use the plugin design suggested above. To do that, we will have to convert
+1 let's definitely aim to have only one way of specifying filesystems in the future
Review notes:
or the core TensorFlow. As such, here are a few requirements that we think are
needed:

1. Once a plugin is registered, the function pointer to plugin provided
Wording: either "to a plugin provided implementation" or "to plugin-provided implementations"?
1. Furthermore, it is ideal to have compilation failures when trying to compile
   plugins against an old API interface which will surely fail the version
   metadata tests. For example, if a method's signature changes this will break
Generally, if you want ABI compatibility, you cannot change function signatures, right? You can only ever add new functions. This is what we see with a lot of ABI compat systems that create foo(), foo2(), foo3() etc.
Yes, that's exactly the case.
integration tests but will also be caught during compilation of the plugin.
However, something that might not be caught at compile time is when an
optional method becomes required or a new method is added. We can maintain
_source compatibility_ by always increasing version numbers when a method
A different solution is having newer implementations always be optional in the sense that if foo3() is not implemented, you'll get an easy to detect error that you can propagate to the caller.
## Existing implementation

The implementation of a filesystem support in TensorFlow is hairy, due to the
Wording: s/a //
1. Although this design is mostly about C++ implementation, we should ensure
   Python code and other language bindings will be able to continue working with
   the modular design.
Have you thought about debugging, and specifically performance debugging? Given that input pipelines are often a core part of the ML bottlenecks, how can we ensure that folks will understand how time is spent within a plugin?
```cpp
class PluginFilesystemIOTest : public ::testing::Test {
 protected:
  void SetUp() override {
    filesystem_ = new TF_Filesystem;
```
In C++ code, use a unique_ptr instead?
```cpp
} TF_FilesystemOps;
```

We also provide some convenience calls that plugins can use to fill in these
What part of the code is "we" here? A common library folks will depend on, or is this going to be calling back into the main TF code? If so, the complexity of creating a .lib file for Windows to make this work is going to play into the decision.
"we" here is more like "we, the authors of the plugin interface".
I'll have to reconsider the Windows issue, thanks for pointing it out
```cpp
      : filename_(filename), file_(file), ops_(ops) {}
  virtual ~ModularRandomAccessFile() {
    ops_->Cleanup(const_cast<TF_RandomAccessFile*>(file_));
    delete file_;
```
Why does this take ownership of file? Generally that seems like an anti-pattern; if you have a reason, I'd call that out in the comments.
```cpp
};

Status ModularRandomAccessFile::Read(uint64 offset, size_t n, StringPiece* result, char* scratch) const {
  TF_Status* plugin_status = TF_NewStatus();
```
If TF_Status is just a struct, why not put it on the stack?
I guess it can also go on the stack. Will have to see how it is used elsewhere and also will have to learn modern C++ :)
Actually, it cannot go on the stack as it needs to be private to plugins for ABI compatibility reasons.
1. **Should we pass plugin structs by value during registration?** Doing so
   makes ownership clear and prevents memory allocation bugs. Passing them by
   value results in a copy but we need a copy for security reasons anyway. We're
   leaning towards **Yes**.
Is this about the structs that just wrap the void*? In that case, compilers will handle exactly like they would a void* (so no extra copy).
This is about the structs containing the function pointers.
Thanks @r4nt for the extra review and suggestions. Will consider these too while implementing.

Is there a place to track the implementation status of this RFC?

https://github.com/tensorflow/tensorflow/search?q=modular+filesystem&type=Commits

For now all the implementation will reside in the experimental directory. POSIX support should be completed; I'm working on Windows now. Then I will update this and send an email with instructions on how to test the experimental filesystems instead of the non-modular ones. At that time, we will start working on converting all the other filesystems, and then we will switch TF to use the modular ones (by moving the code to the proper locations).
Feedback period will be open until 2019-05-27
Modular Filesystems C API
Summary
Define the API for the plugin-based implementation of filesystems under modular TensorFlow (as proposed in #77)
A major module of modular TensorFlow involves all operations regarding files and the filesystem. Saving models, checkpointing, reading files as input to models (i.e., images, audio, etc.), and many other file manipulations are all tasks that TensorFlow has to support.

While users can use the underlying operating system's API (such as calling `fopen` and friends in POSIX environments, for example), it is definitely better to have a cross-platform, cross-environment API that can be used everywhere. TensorFlow provides such an API at the moment, but it results in code for all known filesystems being compiled into the final binary. We propose a new design where all filesystems are implemented as plugins and only the required ones are added, at runtime. To do so, we will first need to design a C API for the filesystems and then integrate this with the rest of the modularization effort.