8358533: Improve performance of java.io.Reader.readAllLines #25863

Open: bplb wants to merge 3 commits into master

Conversation

bplb
Member

@bplb bplb commented Jun 18, 2025

Replaces the implementation readAllCharsAsString().lines().toList() with reading into a temporary char array which is then processed to detect line terminators and copy non-terminating characters into strings which are added to the list.
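
For comparison, the replaced approach can be approximated outside the JDK as follows. This is only a sketch: `readAllCharsAsString()` is private to `Reader`, so the hypothetical class `BaselineLinesSketch` substitutes `transferTo` into a `StringWriter`, and it collects with `Collectors.toList()` rather than `Stream.toList()`. The point it illustrates is that the whole input is materialized as one `String` before any line is split off.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringWriter;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for the replaced implementation; not the JDK source.
final class BaselineLinesSketch {
    static List<String> readAllLines(Reader in) throws IOException {
        // Assumption: transferTo + StringWriter stands in for the private
        // Reader.readAllCharsAsString() helper.
        StringWriter sw = new StringWriter();
        in.transferTo(sw);
        // The entire stream now lives in a single intermediate String.
        return sw.toString().lines().collect(Collectors.toList());
    }
}
```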


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8358533: Improve performance of java.io.Reader.readAllLines (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/25863/head:pull/25863
$ git checkout pull/25863

Update a local copy of the PR:
$ git checkout pull/25863
$ git pull https://git.openjdk.org/jdk.git pull/25863/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 25863

View PR using the GUI difftool:
$ git pr show -t 25863

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/25863.diff

Using Webrev

Link to Webrev Comment

@bplb
Member Author

bplb commented Jun 18, 2025

The throughput of the implementation as measured by the included benchmark appears to hover around 13% greater than that of the existing method. The updated method should also have a smaller memory footprint for streams of non-trivial length as it does not first create a single intermediate String containing all lines in the stream. Instead it uses a char array of size 8192 and a StringBuilder whose maximum length will be the length of the longest line in the input.
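
A minimal sketch of the approach described above, written as a standalone helper rather than as the actual patch. The class name `ChunkedLineReaderSketch`, the `bufferSize` parameter, and the `skipLF` flag are assumptions made for illustration; the PR itself uses a fixed 8192-char array. Lines that fall entirely within one read are created directly from the array, and the `StringBuilder` is only touched when a line spans reads.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

// Illustrative only; names and structure are assumptions, not the PR's code.
final class ChunkedLineReaderSketch {
    static List<String> readAllLines(Reader in, int bufferSize) throws IOException {
        if (bufferSize < 1) throw new IllegalArgumentException("bufferSize < 1");
        List<String> lines = new ArrayList<>();
        char[] cb = new char[bufferSize];
        StringBuilder sb = new StringBuilder(0); // holds a line that spans reads
        boolean skipLF = false;                  // previous char was '\r'
        int n;
        while ((n = in.read(cb, 0, cb.length)) != -1) {
            int start = 0;                       // start of the current line in cb
            for (int i = 0; i < n; i++) {
                char c = cb[i];
                if (skipLF) {
                    skipLF = false;
                    if (c == '\n') {             // second half of "\r\n"
                        start = i + 1;
                        continue;
                    }
                }
                if (c == '\n' || c == '\r') {
                    if (sb.length() == 0) {
                        // Whole line is in the array: no StringBuilder copy.
                        lines.add(new String(cb, start, i - start));
                    } else {
                        sb.append(cb, start, i - start);
                        lines.add(sb.toString());
                        sb.setLength(0);
                    }
                    start = i + 1;
                    skipLF = (c == '\r');
                }
            }
            // Keep any unterminated tail for the next read.
            sb.append(cb, start, n - start);
        }
        if (sb.length() > 0) lines.add(sb.toString());
        return lines;
    }
}
```

With this structure the builder never grows beyond the longest line that crosses a read boundary, which is the memory argument made in the comment.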

@bridgekeeper

bridgekeeper bot commented Jun 18, 2025

👋 Welcome back bpb! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Jun 18, 2025

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

@openjdk openjdk bot added the rfr Pull request is ready for review label Jun 18, 2025
@openjdk

openjdk bot commented Jun 18, 2025

@bplb The following label will be automatically applied to this pull request:

  • core-libs

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the core-libs core-libs-dev@openjdk.org label Jun 18, 2025
@mlbridge

mlbridge bot commented Jun 18, 2025

Webrevs

int pos = 0;
List<String> lines = new ArrayList<String>();

StringBuilder sb = new StringBuilder(82);
Member

Is there a reason for this pre-allocation? If the whole content is smaller than 8192 in size, this allocation would be redundant because we are going through the string constructor path.

Member Author

> Is there a reason for this pre-allocation?

What would you suggest? Start with a smaller allocation and increase it if needed? There is no possibility of knowing the length of the stream.

Contributor

As this PR explicitly targets performance and as the aim of this method is to keep all content in-memory anyways, I wonder if it would be acceptable and even faster to pre-allocate new StringBuilder(TRANSFER_BUFFER_SIZE)? In the end, this allocation is just temporary.

Member

My suggestion is to call new StringBuilder(0) as it is possible this is completely unused because we always hit the eol && sb.length() == 0 path below.

Contributor

The change is motivated by performance, but there will be many inputs that are less than the transfer buffer size and those will not use the StringBuilder, so creating it before it is needed could be avoided.
When a partial line is left in the transfer buffer, copy it to the beginning of the buffer and read more characters for the remaining size of the buffer. It would save some copying into and out of the SB.
You might still need a fallback for really long lines (> transfer buffer size), but that might be more easily handled by reallocating the transfer buffer to make it larger.
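
The suggestion above could look roughly like the following. This is a sketch under assumptions, not code from the PR: the class name `CompactingLineReaderSketch`, the `initialSize` parameter, and the doubling policy are all hypothetical. A leftover partial line is moved to the front of the array before the next read, and the array is doubled only when a single line fills it completely, so no `StringBuilder` is needed.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only; names and the growth policy are assumptions.
final class CompactingLineReaderSketch {
    static List<String> readAllLines(Reader in, int initialSize) throws IOException {
        if (initialSize < 1) throw new IllegalArgumentException("initialSize < 1");
        List<String> lines = new ArrayList<>();
        char[] cb = new char[initialSize];
        int len = 0;             // chars of an unterminated line held at cb[0..len)
        boolean skipLF = false;  // previous char was '\r'
        while (true) {
            if (len == cb.length) {
                // The pending line fills the buffer: fall back to a larger array.
                cb = Arrays.copyOf(cb, cb.length * 2);
            }
            int n = in.read(cb, len, cb.length - len);
            if (n == -1) break;
            int end = len + n;
            int start = 0;       // start of the current line
            for (int i = len; i < end; i++) {   // scan only the new chars
                char c = cb[i];
                if (skipLF) {
                    skipLF = false;
                    if (c == '\n') {            // second half of "\r\n"
                        start = i + 1;
                        continue;
                    }
                }
                if (c == '\n' || c == '\r') {
                    lines.add(new String(cb, start, i - start));
                    start = i + 1;
                    skipLF = (c == '\r');
                }
            }
            // Move the partial line, if any, to the beginning of the buffer.
            len = end - start;
            if (start > 0 && len > 0) {
                System.arraycopy(cb, start, cb, 0, len);
            }
        }
        if (len > 0) lines.add(new String(cb, 0, len));
        return lines;
    }
}
```

The trade-off raised in the thread still applies: a very long line forces repeated reallocation of the transfer buffer, whereas short lines never pay for an intermediate builder.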

@xuemingshen-oracle xuemingshen-oracle Jun 20, 2025

Resizing (newCapacity) is always expensive and tricky, StringBuilder included, so maybe we should decide whether long lines (> transfer buffer size) are a goal of this PR. If not, it might be reasonable to simply go with a String plus the built-in string concatenation, since we would not care about the scenario where most of the lines exceed the buffer size. I do agree we probably want to avoid paying the cost of copying in and out of the StringBuilder, but tweaking the transfer buffer resizing might also be tricky and potentially out of scope as well. Yes, it's always a trade-off.

Member Author

> My suggestion is to call new StringBuilder(0)

So changed in 8ccfd54.

Member Author

> When a partial line is left in the transfer buffer, copy it to the beginning of the buffer and read more characters for the remaining size of the buffer.

The "transfer buffer" is the character array cb, right?

> You might still need a fallback for really long lines (> transfer buffer size), but that might be more easily handled by reallocating the transfer buffer to make it larger.

The size of the array cb is not intended to be commensurate with line length, but rather to allow reading multiple lines at once, thereby cutting down the number of reads.

Contributor

Right, but some lines might be longer than the transfer buffer size, so some mechanism is needed to assemble a longer line, whether it's a separate buffer (like the SB) or just a bigger transfer buffer. One possibility is to copy the remaining fragment of a line to a new transfer buffer (maybe twice the size).

Member Author

> One possibility is to copy the remaining fragment of a line to a new transfer buffer (maybe twice the size).

It's probably worth testing this.

@wenshao
Contributor

wenshao commented Jun 18, 2025

If we want better performance, we should go a step further and overload the readAllLines method in the Reader implementation class.

For example, in the most commonly used InputStreamReader, overload readAllLines through StreamDecoder and make special optimizations for UTF8/ISO_8859_1 encoding.

In StringReader, special overload methods can also be used for optimization.

@bplb
Member Author

bplb commented Jun 18, 2025

> If we want better performance, we should go a step further and overload the readAllLines method in the Reader implementation class.

Perhaps, but not in this request. A separate issue should be filed and addressed subsequently.

@bplb
Member Author

bplb commented Jun 24, 2025

The commit d5abfa4 does not address most comments provided to date. The algorithm was wrong and I preferred to correct it first.

@@ -457,7 +449,52 @@ private String readAllCharsAsString() throws IOException {
      * @since 25
      */
     public List<String> readAllLines() throws IOException {
-        return readAllCharsAsString().lines().toList();
+        List<String> lines = new ArrayList<>();

Just an idea: BufferedReader will of course override this method and return the list of lines using its readLine method to collect all lines. Why not simply implement Reader.readAllLines() by return new BufferedReader(this).readAllLines() (it introduces an ugly dependency from Reader to BufferedReader, but spares this implementation which looks very similar to BufferedReader logic).
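
The delegation idea can be sketched as a static helper. This assumes nothing about whether `BufferedReader` overrides `readAllLines()`; the hypothetical class `BufferedLineReaderSketch` simply loops over the long-standing `readLine()` method, which recognizes the same `\n`, `\r`, and `\r\n` terminators.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

// Illustrative only; a helper, not a proposed change to java.io.Reader.
final class BufferedLineReaderSketch {
    static List<String> readAllLines(Reader in) throws IOException {
        // Avoid double-wrapping when the caller already has a BufferedReader.
        BufferedReader br = (in instanceof BufferedReader)
                ? (BufferedReader) in
                : new BufferedReader(in);
        List<String> lines = new ArrayList<>();
        String line;
        while ((line = br.readLine()) != null) {
            lines.add(line);
        }
        return lines;
    }
}
```

This trades the hand-written scanning loop for an extra layer of buffering and the dependency from `Reader` on `BufferedReader` that the comment mentions.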

@bplb
Member Author

bplb commented Jun 24, 2025

> The throughput of the implementation [...].

The performance comments made here still apply to the most recent commit.

Labels
core-libs core-libs-dev@openjdk.org rfr Pull request is ready for review
9 participants