Enables org.deidentifier.arx.framework.data.DataMatrix to support (2^31-1)^2 cells #299

Merged
merged 20 commits into from
Jul 7, 2020

Conversation

ramongonze

As suggested in #243 (comment), this is a simple implementation of a 2D int matrix.

It allows matrices with up to (2^31-1)^2 cells to be created.

It keeps the original single-array implementation, which is used when there are fewer than 2^31-1 cells.
If the number of cells exceeds 2^31-1, it creates a 2D int matrix instead.

Basically, it changes all methods so that they perform their calculations and return values according to the nature of the DataMatrix (multidimensional or not).
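To illustrate the dual-mode idea described above, here is a minimal sketch. It is not the code from this PR: the class name `DualModeMatrix`, its methods, and the `maxFlatCells` parameter are hypothetical, and the exact threshold at which the layout switches is an assumption.

```java
/** Hypothetical sketch of a dual-mode int matrix; names are illustrative, not ARX's. */
final class DualModeMatrix {
    private final int columns;
    private final int[] flat;     // single array, used while the cell count fits
    private final int[][] nested; // one int[] per row, used for larger matrices

    DualModeMatrix(int rows, int columns) {
        this(rows, columns, Integer.MAX_VALUE);
    }

    /**
     * maxFlatCells (assumed to be at most Integer.MAX_VALUE) is a parameter
     * only so that the 2D path can be tried on small inputs.
     */
    DualModeMatrix(int rows, int columns, long maxFlatCells) {
        this.columns = columns;
        long cells = (long) rows * columns; // long arithmetic cannot overflow here
        if (cells <= maxFlatCells) {
            this.flat = new int[(int) cells];
            this.nested = null;
        } else {
            this.flat = null;
            this.nested = new int[rows][columns];
        }
    }

    boolean isMultidimensional() {
        return nested != null;
    }

    int get(int row, int column) {
        return nested != null ? nested[row][column] : flat[row * columns + column];
    }

    void set(int row, int column, int value) {
        if (nested != null) {
            nested[row][column] = value;
        } else {
            flat[row * columns + column] = value;
        }
    }
}
```

Every accessor has to branch on the layout in this design, which is the cost of keeping both variants in one class.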

prasser and others added 4 commits March 5, 2018 22:12
…ad of throwing an IllegalArgumentException, a multidimensional array is created. Now the upper bound of cells for a DataMatrix is (2^31)^2.
….multiplyExact in DataMatrix constructor. Now the exception is handled.
@ramongonze
Author

ramongonze commented Feb 5, 2020

In the last 2 commits I have just brought back the try-catch block that handles the ArithmeticException: integer overflow.

I have also run tests using a dataset with 56 million rows and 120 columns, and they worked perfectly with multidimensional matrices.
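As a small sketch of the overflow handling mentioned above: `Math.multiplyExact(int, int)` throws an `ArithmeticException` when the product does not fit in an `int`, which can serve as the signal to fall back to the 2D layout. The class and method names below are hypothetical and only illustrate the check, not the actual DataMatrix constructor.

```java
/** Hypothetical helper showing overflow detection with Math.multiplyExact. */
final class OverflowCheck {
    /** Returns true if rows * columns fits in an int, false if it overflows. */
    static boolean fitsInSingleArray(int rows, int columns) {
        try {
            Math.multiplyExact(rows, columns); // throws on int overflow
            return true;
        } catch (ArithmeticException overflow) {
            return false;
        }
    }
}
```

For the dataset mentioned above, 56,000,000 rows times 120 columns is 6,720,000,000 cells, which exceeds 2^31-1 = 2,147,483,647, so the check reports an overflow.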

@prasser
Collaborator

prasser commented Feb 6, 2020

Hi ramongonze,

thank you for your interest in ARX and this great contribution!

This is a huge change at the core of the software. We will therefore need some time to ensure that it works, doesn't break anything and can be merged. A few initial questions:

(1) Have you tested this, by executing the various functionalities of the software and made sure that it doesn't run into exceptions in other places? There might well be other parts of the code that need to be changed in order to handle such large datasets.
(2) Have you considered turning DataMatrix into an interface (or something similar) and putting the new variant into a separate class? This would probably be a bit challenging due to issues with serialization (DataMatrix is serializable), but it could be worthwhile to try it anyway, as this would be much less invasive.

Thanks again!

…aMatrix class to new classes: SingleArrayMatrix and MultidimensionalArrayMatrix; Created a new abstract class 'Matrix' to keep all information for a single array or a multidimensional one; Ran JUnit tests using single array and multidimensional array, and both have passed in all tests.
@ramongonze
Author

> (2) Have you considered turning DataMatrix into an interface (or something similar) and putting the new variant into a separate class? This would probably be a bit challenging due to issues with serialization (DataMatrix is serializable), but it could be worthwhile to try it anyway, as this would be much less invasive.

I changed the array variable in DataMatrix to a Matrix object.

I created 3 new classes in org.deidentifier.arx.framework.data:

  • Matrix (abstract)
  • SingleArrayMatrix (extends Matrix)
  • MultidimensionalArrayMatrix (extends Matrix)

The Matrix class contains all methods from DataMatrix that depended on the array variable.
Both SingleArrayMatrix and MultidimensionalArrayMatrix implement the same set of methods, but SingleArrayMatrix uses an int[] array and MultidimensionalArrayMatrix uses an int[][] array.

Why didn't I turn DataMatrix into an abstract class? Because I wanted to avoid changing classes in other packages, so the changes are confined to the DataMatrix class. That is why I created the Matrix class.

Summarizing, we have:

  • DataMatrix changed;
  • 3 new classes at org.deidentifier.arx.framework.data: Matrix, SingleArrayMatrix and MultidimensionalArrayMatrix.
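The class split summarized above could look roughly like the following sketch. Only the three class names come from this comment; the method names (`get`, `set`, `create`) and the factory method are assumptions made for illustration and may differ from the actual code in the PR.

```java
/** Sketch of the hierarchy; method names are hypothetical. */
abstract class Matrix {
    protected final int rows;
    protected final int columns;

    Matrix(int rows, int columns) {
        this.rows = rows;
        this.columns = columns;
    }

    abstract int get(int row, int column);

    abstract void set(int row, int column, int value);

    /** Picks the single-array variant unless rows * columns overflows an int. */
    static Matrix create(int rows, int columns) {
        try {
            Math.multiplyExact(rows, columns);
            return new SingleArrayMatrix(rows, columns);
        } catch (ArithmeticException overflow) {
            return new MultidimensionalArrayMatrix(rows, columns);
        }
    }
}

final class SingleArrayMatrix extends Matrix {
    private final int[] array;

    SingleArrayMatrix(int rows, int columns) {
        super(rows, columns);
        this.array = new int[rows * columns];
    }

    @Override
    int get(int row, int column) {
        return array[row * columns + column];
    }

    @Override
    void set(int row, int column, int value) {
        array[row * columns + column] = value;
    }
}

final class MultidimensionalArrayMatrix extends Matrix {
    private final int[][] array;

    MultidimensionalArrayMatrix(int rows, int columns) {
        super(rows, columns);
        this.array = new int[rows][columns];
    }

    @Override
    int get(int row, int column) {
        return array[row][column];
    }

    @Override
    void set(int row, int column, int value) {
        array[row][column] = value;
    }
}
```

With this shape, a class such as DataMatrix only holds a `Matrix` reference and delegates to it, so callers in other packages do not need to know which layout is in use.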

> (1) Have you tested this, by executing the various functionalities of the software and made sure that it doesn't run into exceptions in other places? There might well be other parts of the code that need to be changed in order to handle such large datasets.

I have run all JUnit tests using both SingleArrayMatrix and MultidimensionalArrayMatrix operations, and all tests passed (except those that depend on datasets that are not available in this repo).

What other parts of the code should I change to handle large datasets?
Or could you tell me what functionalities you think would need changes? I could create some tests for them.

@prasser
Collaborator

prasser commented Feb 24, 2020

Thanks a lot! Another test that you should always perform is to load, save, and then reload the example projects with the GUI. Does it work? Thanks, Fabian

@ramongonze
Author

Before testing the GUI and saving/loading projects, I would like to fix the problem from issue #302, because it might influence the test. Just an update.

@prasser
Collaborator

prasser commented Jun 21, 2020

We will update all GUI-related dependencies as part of fixing issue #315. This will likely fix this as well.

@ramongonze
Author

ramongonze commented Jul 1, 2020

I have done a test to check the new matrix implementation (int[][]) using the GUI:

Created a dataset with 2,200,000 rows and 1000 columns (the matrix has more than 2^31-1 cells).
All the cells have the integer value 1, except the header which has values from 0 to 999.
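A dataset of the shape described above could be generated with a sketch like the following. The class name `TestCsv` and its methods are hypothetical, and the delimiter is left as a parameter because the separator used in the original test file is not stated.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

/** Hypothetical generator for a CSV with header 0..columns-1 and all cells set to 1. */
final class TestCsv {
    static void write(Writer out, long rows, int columns, char delimiter) throws IOException {
        // Header line: 0, 1, ..., columns - 1
        for (int c = 0; c < columns; c++) {
            if (c > 0) out.write(delimiter);
            out.write(Integer.toString(c));
        }
        out.write('\n');

        // Every data line is identical, so build it once and reuse it
        StringBuilder line = new StringBuilder();
        for (int c = 0; c < columns; c++) {
            if (c > 0) line.append(delimiter);
            line.append('1');
        }
        line.append('\n');
        String row = line.toString();
        for (long r = 0; r < rows; r++) {
            out.write(row);
        }
    }

    /** Convenience wrapper for small inputs; large files should go to a buffered file writer. */
    static String asString(long rows, int columns, char delimiter) {
        StringWriter out = new StringWriter();
        try {
            write(out, rows, columns, delimiter);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

For the full-size test, `write` would be called with 2,200,000 rows and 1000 columns on a buffered file writer (for example one obtained from `java.nio.file.Files.newBufferedWriter`).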

Test Steps

  1. Open arx (java -jar arx-3.9.0-gtk-64.jar -Xmx60G)
  2. File > New project > Ok
  3. File > Import Data > CSV, Next > Next > Finish
  4. Set attributes 0,1 as Identifying
  5. Set attributes 2,3,4 as QID
  6. Set attributes 5,6 as Sensitive
  7. Add k-anonymity privacy model with k=2
  8. File > Save project > Save
  9. Close arx
  10. Open arx (java -jar arx-3.9.0-gtk-64.jar -Xmx60G)
  11. File > Open project
  12. Set attribute 7 as QID
  13. File > Save project > Save
  14. Close arx
  15. Open arx (java -jar arx-3.9.0-gtk-64.jar -Xmx60G)
  16. File > Open project

It ran without errors or problems.
Note: The flag -Xmx60G is used to set the JVM memory limit to 60 GB.

@prasser is there a specific test at src/test/org/deidentifier/arx/test that I should reproduce using the GUI?

@prasser
Collaborator

prasser commented Jul 2, 2020

Sounds great, thanks!

Please perform the following additional test:
(1) Synchronize your feature branch and your master branch with ARX's master.
(2) Create a project (including data) with the current master branch of ARX (or the synchronized master in your fork, if you haven't committed changes to master) and save the project.
(3) Switch to the branch including your changes and try to load the project.

Does this work?

@ramongonze
Author

> Does this work?

@prasser
I've done this test and it worked. I used the same dataset as in the previous test, but with only 1000 rows and 1000 columns.

Test Steps

  1. Open ARX from branch ‘master’ on arx-deidentifier/arx (java -jar arx-3.9.0-gtk-64.jar -Xmx60G)
  2. File > New project > Ok
  3. File > Import Data > CSV, Next > Next > Finish
  4. Set attributes 0,1 as Identifying
  5. Set attributes 2,3,4 as QID
  6. Set attributes 5,6 as Sensitive
  7. Add k-anonymity privacy model with k=2
  8. File > Save project > Save
  9. Close arx
  10. Open arx from branch ‘master’ on ramongonze/arx (java -jar arx-3.9.0-gtk-64.jar -Xmx60G)
  11. File > Open project

When opening the project in ARX (step 11), all of the previously set configuration was intact.

Notes:

  • ARX branch 'master' on arx-deidentifier was on commit 2808c70.
  • ARX branch 'master' on ramongonze/arx was on commit c049aee.

@prasser
Collaborator

prasser commented Jul 6, 2020

Ok, great. I created a branch "morecells" and changed the base branch of this PR, so that I can take a look. Thanks!

@prasser prasser changed the base branch from master to morecells July 6, 2020 17:49
@prasser prasser merged commit ddb6d27 into arx-deidentifier:morecells Jul 7, 2020
@prasser
Collaborator

prasser commented Jul 7, 2020

I will now check the branch and report any issues here.

@prasser
Collaborator

prasser commented Jul 7, 2020

@ramongonze Unfortunately, your branch doesn't work. Your experiments worked, because you did not generate an anonymized output dataset within your project. Try the following steps:

(1) With ARX's master: open the example project file or generate a new project in which an anonymization has been performed resulting in an output dataset.
(2) Store this project with ARX's master.
(3) Open this project with your branch -> No output data will be loaded, leading to several problems.

The reason is that DataMatrix is a serializable class in ARX and you have changed it, so that already serialized instances are not loaded fully. I see at least two options:

(1) You just extend DataMatrix with the new functionality, but don't remove any of the existing fields. You will then need to implement some logic to decide whether or not an old or new instance of the class has been loaded.
(2) You implement a custom deserializer that correctly deserializes old instances and maps the data to the new structures.

If you have any questions, please don't hesitate to ask.
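The two options above can be combined in a sketch like the one below, which keeps the legacy field, adds a new one, and uses a `readObject` hook to check which layout was loaded. This is only an illustration under stated assumptions: the class `CompatMatrix` and its field names are hypothetical and do not reflect DataMatrix's real fields, and it assumes the `serialVersionUID` matches the one of the previously serialized class version. In Java serialization, a field that is missing from an old stream is left at its default value (`null` here).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InvalidObjectException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Hypothetical sketch; field names do not reflect ARX's DataMatrix. */
final class CompatMatrix implements Serializable {
    // Assumed to be identical to the UID of the class version that wrote old projects
    private static final long serialVersionUID = 1L;

    private final int columns;
    private int[] array;  // legacy field, kept so that old streams still populate it
    private int[][] rows; // new field; absent from old streams, therefore null after loading

    private CompatMatrix(int columns, int[] array, int[][] rows) {
        this.columns = columns;
        this.array = array;
        this.rows = rows;
    }

    static CompatMatrix legacy(int numRows, int columns) {
        return new CompatMatrix(columns, new int[numRows * columns], null);
    }

    static CompatMatrix large(int numRows, int columns) {
        return new CompatMatrix(columns, null, new int[numRows][columns]);
    }

    boolean isLegacyLayout() {
        return array != null;
    }

    int get(int row, int column) {
        return array != null ? array[row * columns + column] : rows[row][column];
    }

    void set(int row, int column, int value) {
        if (array != null) {
            array[row * columns + column] = value;
        } else {
            rows[row][column] = value;
        }
    }

    /** Deserialization hook: read all fields, then verify that one layout is present. */
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        if (array == null && rows == null) {
            throw new InvalidObjectException("matrix stream contains no cell data");
        }
    }

    /** Serializes and deserializes in memory, e.g. to try out the hook. */
    static CompatMatrix roundTrip(CompatMatrix matrix) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(matrix);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (CompatMatrix) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

An in-memory round trip only simulates an old instance by leaving the new field empty; a real compatibility check would still need a project file written by ARX's master, as described in the steps above.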
