Reduce copies when reading files in pyio, match behavior of _io #129005
`os.read` allocated and filled a buffer by calling `read(2)`, then that data was copied into the user-provided buffer. Read directly into the caller's buffer instead by using `os.readv`. `self.read()` was doing the closed and readable checks, so move those into `readinto`.
I took a tangent and looked at the code complexity of adding `os.readinto`.
Happy to work on either course, slight preference for adding `os.readinto`. I know moving to a full PR for `os.readinto` would need added tests and news; here I'm just focusing on "adding `os.readinto`" vs. "using `os.readv`". cc: @tomasr8, @vstinner, @gpshead (reviewers where `os.read` vs. `_Py_read` / readinto has come up).
Can you please open a separate issue for `os.readinto`?
`os.read()` allocated and filled a buffer by calling `read(2)`, then that data was copied into the user-provided buffer. Read directly into the caller's buffer instead by using `os.readinto()`. `os.readinto()` uses `PyObject_GetBuffer()` to make sure the passed-in buffer is writable and bytes-like, so drop the manual check.
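A minimal sketch of that shape, written as the `FileIO.readinto` method body and reusing `_pyio`'s existing `_checkClosed()`/`_checkReadable()` helpers (an illustration of the approach described, not the exact patch; the earlier `os.readv` variant is the same except the call becomes `os.readv(self._fd, [buffer])`):

```python
import os

def readinto(self, buffer):
    """Read bytes directly into a caller-provided, writable buffer."""
    self._checkClosed()      # checks previously done inside read()
    self._checkReadable()
    try:
        # os.readinto() fills `buffer` in place, so no intermediate bytes
        # object is allocated; PyObject_GetBuffer() inside os.readinto()
        # already rejects read-only or non-bytes-like buffers.
        return os.readinto(self._fd, buffer)
    except BlockingIOError:
        return None
```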
Both now use a pre-allocated buffer of length `bufsize`, fill it using a readinto(), and have matching "expand buffer" logic. On my machine this takes `./python -m test -M8g -uall test_largefile -m test_large_read -v` from ~3.7 seconds to ~3.3 seconds.
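A rough sketch of that loop shape (the starting size and the doubling growth policy here are assumptions for illustration, not the exact patch):

```python
import os

BUFSIZE = 8 * 1024  # assumed starting buffer size for this sketch

def readall(fd):
    """Read fd to EOF into one pre-allocated, growing buffer."""
    buf = bytearray(BUFSIZE)
    pos = 0
    while True:
        if pos >= len(buf):
            # "expand buffer": double the buffer once it is full.
            buf.extend(bytearray(len(buf)))
        n = os.readv(fd, [memoryview(buf)[pos:]])  # readinto-style fill
        if n == 0:  # EOF
            break
        pos += n
    del buf[pos:]      # trim the unused tail
    return bytes(buf)  # final copy into an immutable bytes object
```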
Slicing `buf` and appending `chunk` would always result in a copy. Commonly in a readall() there is no already-read data in `buf`, and the amount of data read may be large, so the copy is expensive.
This aligns the memory usage between _pyio and _io. Both use the same amount of memory.
The full set of changes, on my Linux machine debug build, reduces
Suggesting two new features to help implement this more effectively and make it easier to write zero-copy, zero-extra-allocation I/O loops:
Internally, the latter (2) is infeasible: PyBytesObject ends with the immutable array of bytes, while PyByteArrayObject uses indirection via pointers. A CPython
Is there any reason to optimize for `_pyio`?
@gpshead the theory for 2 (zero-copy `bytearray` → `bytes`): this is underlying work as I try to get to being able to:
Currently, patches I've written touching BufferedIO run into issues where the C and Python implementations don't quite match but people want them to. Bringing them into line helps me understand the code better, meets reviewer expectations, and demonstrates that I/O in Python can be as efficient as C (both in memory and CPU). The ideal/dream would be that, with interpreter improvements (e.g. the JIT and free threading), the performance delta gets small enough that we can have one Python implementation.
Ah, yes, if bytearray internally allocates a phantom bytes object for its backing memory, then I could see how that could be made real and returned in refcount=1 bytes(bytearray) scenarios. A little extra memory overhead, but given that bytearray overhead shouldn't matter much as the best uses are large anyways... nice.
Hello, I am coming across this PR as I also noticed the horrible performance of readall(). Basically, the readall() function is stupid: it's copying data 3 times by appending to a buffer. EDIT: forgot to say, this is the function:
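The function referenced above isn't shown here; as a simplified illustration of the append-style pattern being described (not the exact `_pyio` code), the three copies look roughly like this:

```python
import os

def readall_appending(fd, bufsize=64 * 1024):
    """Simplified append-style readall: the data is copied three times."""
    result = bytearray()
    while True:
        chunk = os.read(fd, bufsize)   # copy 1: kernel data into a new bytes object
        if not chunk:                  # b"" means EOF
            break
        result += chunk                # copy 2: appended onto the growing bytearray
    return bytes(result)               # copy 3: bytearray converted back to bytes
```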
Quick test with the code above, running the large-file read test. Oddly enough, I've added a print to debug how many bytes are read on the call:
EDIT: raising to 10 GB for debugging. The partial reads are relatively consistent around the 2 GB mark.
EDIT EDIT: that 10 GB run is going from ~25 seconds on main to ~18 seconds with the code above.
…hon#129454)" (python#129500) This reverts commit e1c4ba9.
…9458)" (python#129572) This reverts commit f927204.
python#129702) Co-authored-by: Victor Stinner <vstinner@python.org>
Utilize `bytearray.resize()` and `os.readinto()` to reduce copies and match behavior of `_io.FileIO.readall()`. There is still an extra copy, which means twice the memory required compared to `_io.FileIO`, because there isn't a zero-copy path from `bytearray` -> `bytes` currently.
On my system, reading a 2 GB file with `./python -m test -M8g -uall test_largefile -m test.test_largefile.PyLargeFileTest.test_large_read -v` goes from ~2.7 seconds to ~2.2 seconds.
Co-authored-by: Victor Stinner <vstinner@python.org>
1. Add `tp_new` to guarantee `ob_bytes_head` is always set, often to the empty bytes singleton.
2. `ob_alloc` information is now redundant; added an assertion to validate that. Would it make sense to deprecate?
3. There's a lot of `bytearray` code very similar to `bytes` code; more could likely be just proxied to the `bytes` now. Here, just focusing on the swap as that enables optimizations.
4. If `bytearray` is passed a single-reference bytes, it could potentially take "ownership" of it without copying the bytes; for now, not implemented.

This enables adding `bytearray._detach()`, which I plan to do in a separate PR.

```bash
# `_io` large file read test
./python -m test -M8g -uall test_largefile -m test.test_largefile.CLargeFileTest.test_large_read
# `_pyio` large file read test
./python -m test -M8g -uall test_largefile -m test.test_largefile.PyLargeFileTest.test_large_read
```

On my machine (AMD 64-bit Linux, optimized build):
`_io` takes: ~0.791s and uses ~2GB of RAM
`_pyio` current: ~1.073s and uses ~4GB of RAM
`_pyio` w/ bytearray._detach: ~0.887s and uses ~2GB of RAM

Perf-checking for no major swings in an optimized build: `./python -E -bb -Wd -m test -uall -M32G test_bytes test_capi.test_bytearray -vvv`
before: ~1.4s
after: ~1.5s

Previous discussion: https://discuss.python.org/t/add-zero-copy-conversion-of-bytearray-to-bytes-by-providing-bytes/79164
This removes the memory overhead of `_pyio.FileIO.readall` relative to `_io.FileIO.readall`, and significantly improves performance:

```bash
# _io.FileIO.readall of a large file
./python -m test -M8g -uall test_largefile -m test.test_largefile.CLargeFileTest.test_large_read
# _pyio.FileIO.readall of a large file
./python -m test -M8g -uall test_largefile -m test.test_largefile.PyLargeFileTest.test_large_read
```

`_io` takes: ~0.791s and uses ~2GB of RAM
`_pyio` current: ~1.073s and uses ~4GB of RAM
`_pyio` w/ bytearray._detach: ~0.887s and uses ~2GB of RAM
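To make the `_detach()` idea concrete: `bytearray._detach()` is the proposed, hypothetical API discussed in this thread, not something CPython provides today. The intended difference at the end of a readall would be roughly:

```python
buf = bytearray()
# ... fill `buf` with readinto()-style calls ...

data = bytes(buf)      # today: copies the whole buffer, doubling peak memory
data = buf._detach()   # proposed: hand the backing storage back as bytes with
                       # no copy, presumably leaving `buf` empty afterwards
```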
Feature or enhancement
Proposal:
Currently `_pyio` uses ~2x as much memory to read all data from a file compared to `_io`. This is because it makes more than one copy of the data.

Details from test_fileio run
Plan:
- Use `os.readv()` or a new `os.readinto()` to do readinto like the C `_Py_read` used by `_io` does; `os.read()` can't take a buffer to use. This aligns behavior between `_io.FileIO.readall` and `_pyio.FileIO.readall`. `os.readv` works well today and takes a caller-allocated buffer rather than needing to add a new `os` API. `readv(2)` mirrors the behavior and errors of `read(2)`, so this should keep the same end behavior.
- `_pyio.BufferedIO`: don't force a copy of the buffer for readall when its internal buffer is empty. Currently it always slices its internal buffer then adds the result of `_pyio.FileIO.readall` to it.

For iterating, I'm using a small tracemalloc script to find where copies are:
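The script itself isn't included above; a minimal sketch of that kind of tracemalloc measurement (the file name and snapshot depth here are arbitrary choices, not from the original) might look like:

```python
import tracemalloc
import _pyio

tracemalloc.start(25)  # keep deep tracebacks so allocations can be attributed
with _pyio.FileIO("big.bin", "r") as f:   # any large test file
    data = f.readall()
snapshot = tracemalloc.take_snapshot()

# The biggest allocation sites point at where the extra copies happen.
for stat in snapshot.statistics("traceback")[:5]:
    print(stat)
    for line in stat.traceback.format():
        print(line)
```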
Loose Ends
- `os.readv` seems to be well supported but is currently guarded by a configure check. I'd like to just make pyio require `readv`, but can do conditional code if needed. If making `readv` non-optional generally is feasible, happy to work on that.
- `os.readv` is not supported on WASI, so need to add conditional code.

Has this already been discussed elsewhere?
This is a minor feature, which does not need previous discussion elsewhere
Links to previous discussion of this feature:
No response
Linked PRs
`_pyio.FileIO.readall()` #129496