Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Experiment to replace the large array of Excel function definitions with a FlyWeight Pattern collection of Value Objects #2714

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

MarkBaker
Copy link
Member

@MarkBaker MarkBaker commented Mar 26, 2022

This is:

- [ ] a bugfix
- [ ] a new feature
- [X] refactoring

Checklist:

Why this change is needed?

Experiment to replace the large "unstructured" array of Excel function signature definitions with a structured collection of Value Objects to see what affect it has on memory usage and speed.

This isn't a perfect implementation yet: but it's enough to take a look at the memory usage and speed of the new solution.

Hopefully, this approach will also make Excel function definitions/implementations easier to maintain in future, and potentially allow new benefits such as defining the Excel version where they became available, and also if Microsoft have deprecated them.

@MarkBaker MarkBaker force-pushed the Experiment-Maintaining-Excel-Function-List-as-an-ArrayObject branch 4 times, most recently from 3dff948 to e6bbb7a Compare March 27, 2022 11:45
@MarkBaker
Copy link
Member Author

@oleibman - Thought you might like to take a look at this, replacing the large array of Excel function signatures in Calculation with a collection of Value Objects that's lazy-loaded as functions are referenced. Entries in the collection; and properties in the VOs can be accessed either as array elements or as object properties.

It's still not 100% yet, there's still some work I need to do on it; but it does seem to show both memory and speed improvements on the test suite (despite the one test that loads every single VO in the collection)... at least there's lower memory usage for PHP 7.4 and PHP 8.0; but all versions runs in the CI tests seem faster. How that will translate into real-world end-user performance, I don't know; but it should still be better.

…n signature definitions with a structured collection of Value Objects to see what affect it has on memory usage and speed.

This isn't a perfect implementation yet: but it's enough to take a look at the memory usage and speed.
@MarkBaker MarkBaker force-pushed the Experiment-Maintaining-Excel-Function-List-as-an-ArrayObject branch from e6bbb7a to 3878e32 Compare March 27, 2022 19:02
@oleibman
Copy link
Collaborator

@MarkBaker, I will certainly take a look. However, I will be offline for personal reasons for about a week, so will probably not be able to get back to you before late next week.

@oleibman
Copy link
Collaborator

oleibman commented Apr 7, 2022

@MarkBaker, as usual I am learning so much from reviewing this. It will take a while for me to put it all together. For now, I do have one question. You have moved the initialization of the large array phpSpreadsheetFunctions out of Calculation/Calculation and into Calculation/Engine/ExcelFunctions. But, even in the latter, you still have a large array, simpler than the original but still large. Might it be possible to just initialize that to an empty array (or possibly just the false entries if any) and populate it entirely on demand?

@MarkBaker
Copy link
Member Author

Might it be possible to just initialize that to an empty array (or possibly just the false entries if any) and populate it entirely on demand?

Also possible: The new array is a lot smaller than the original array; but it does still use a not insignificant amount of memory.... it took me a while to get this working "as is", but a version of the ExcelFunctions that tests for in_array(), and if not then tests if there is a signature definition using class_exists(), loading it into the array if there is... I had thought about it at the time, but ran out of weekend to experiment, and haven't had the time to revisit it since.
The other variant there is that class_exists() is faster when called without loading the class; but as we'd need to load and instantiate it anyway if class_exists() returned a true, I'm not sure that would make any noticeable difference.

I'll have a go at that on a separate branch next week, so that we can compare memory/speed results for both variations.

Given that only a small number of Excel functions are typically used in any given spreadsheet; but that those few functions are often used repeatedly, it did seem a good target for memory optimisation; though I was a bit concerned about the performance cost of the additional "load on demand"... that pattern of real world demand should see both memory and performance benefits, even if our test suite doesn't fully reflect that. Though I was pleased that there didn't seem any real performance overhead with the change, and the memory saving is visible in the unit test runs, even though there are tests there that would load every function definition.

@MarkBaker
Copy link
Member Author

MarkBaker commented Apr 7, 2022

Just as a warning: the biggest issue that I had with this was the large number of additional classes slowing PHPStorm's re-indexing, and making it slower overall... I doubt that would be a problem once I switch to my new laptop, with more memory and a faster processor, but it is a maintenance overhead.
That might be a result of 500 classes in a single directory; and I did try breaking them into subdirectories; but that created extra run-time overhead, and seemed to have no benefit in PHPStorm at all; so I left them all in the one folder.

@MarkBaker
Copy link
Member Author

MarkBaker commented Apr 7, 2022

And if I do start from an initial empty array, I'll also have to play around a bit with the logic that builds the list of implemented functions when we do a release; because that uses the array keys as its base data; but I'm sure I can work out an alternative reading through the Engine/Functions folder to handle that. Slower release isn't real problem if we're improving run-time performance.

- TEXTBEFORE – Returns text that’s before delimiting characters
- TEXTAFTER – Returns text that’s after delimiting character
- TEXTSPLIT – Splits text into rows or columns using delimiters
- VSTACK – Stacks arrays vertically
- HSTACK – Stacks arrays horizontally
- TOROW – Returns the array as one row
- TOCOL – Returns the array as one column
- WRAPROWS – Wraps a row array into a 2D array
- WRAPCOLS – Wraps a column array into a 2D array
- TAKE – Returns rows or columns from array start or end
- DROP – Drops rows or columns from array start or end
- CHOOSEROWS – Returns the specified rows from an array
- CHOOSECOLS – Returns the specified columns from an array
- EXPAND – Expands an array to the specified dimensions
# for free to join this conversation on GitHub. Already have an account? # to comment
Development

Successfully merging this pull request may close these issues.

2 participants