Enables org.deidentifier.arx.framework.data.DataMatrix to support (2^31-1)^2 cells #299

ramongonze · 2020-02-01T02:56:17Z

As suggested in #243 (comment), this is a simple implementation of a 2D int matrix.

It allows matrices with up to (2^31-1)^2 cells to be created.

The original single-array implementation is kept and used when there are fewer than 2^31-1 cells.
If the number of cells is higher than 2^31-1, a 2D int matrix is created instead.

Basically, all methods now perform their calculations and return values according to the nature of the DataMatrix (multidimensional or not).
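
For readers who want a picture of the idea, here is a minimal sketch of such a hybrid matrix. It is not the code from this PR: the class name `HybridIntMatrix`, its methods, and the `maxFlatCells` parameter are hypothetical and exist only for illustration.

```java
/** Hypothetical sketch: an int matrix backed by int[] when small and by int[][] when large. */
final class HybridIntMatrix {
    private final int columns;
    private final int[] flat;      // used when the cell count fits into a single array
    private final int[][] nested;  // used otherwise, one int[] per row

    HybridIntMatrix(int rows, int columns) {
        this(rows, columns, Integer.MAX_VALUE);
    }

    /** maxFlatCells is an illustrative knob so the fallback can be exercised with tiny matrices. */
    HybridIntMatrix(int rows, int columns, long maxFlatCells) {
        this.columns = columns;
        long cells = (long) rows * columns; // computed as long, so it cannot overflow
        if (cells <= maxFlatCells) {
            this.flat = new int[(int) cells];
            this.nested = null;
        } else {
            this.flat = null;
            this.nested = new int[rows][columns];
        }
    }

    boolean isMultidimensional() {
        return nested != null;
    }

    int get(int row, int column) {
        return nested != null ? nested[row][column] : flat[row * columns + column];
    }

    void set(int row, int column, int value) {
        if (nested != null) {
            nested[row][column] = value;
        } else {
            flat[row * columns + column] = value;
        }
    }
}
```

Every accessor branches on the storage layout, which matches the description above that all methods behave according to whether the matrix is multidimensional or not.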

ramongonze · 2020-02-05T22:45:01Z

In the last 2 commits I just brought back the try-catch block to handle ArithmeticException: integer overflow.

I have also run tests using a dataset with 56 million rows and 120 columns, and they worked perfectly with multidimensional matrices.
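
To make the arithmetic concrete, here is a small sketch of the overflow check, taking "56 million" as exactly 56,000,000 for illustration. `Math.multiplyExact` is the standard JDK method mentioned in the commit message; the class and method names `CellCount` and `fitsSingleArray` are hypothetical.

```java
/** Hypothetical helper showing how Math.multiplyExact signals that a single int[] is too small. */
final class CellCount {

    /** Returns true if rows * columns fits into an int, false if the product overflows. */
    static boolean fitsSingleArray(int rows, int columns) {
        try {
            Math.multiplyExact(rows, columns); // throws ArithmeticException on int overflow
            return true;
        } catch (ArithmeticException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // 6,720,000,000 cells, well above Integer.MAX_VALUE (2,147,483,647)
        System.out.println((long) 56_000_000 * 120);
        System.out.println(fitsSingleArray(56_000_000, 120));
    }
}
```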

prasser · 2020-02-06T11:23:39Z

Hi ramongonze,

thank you for your interest in ARX and this great contribution!

This is a huge change at the core of the software. We will therefore need some time to ensure that it works, doesn't break anything and can be merged. A few initial questions:

(1) Have you tested this by executing the various functionalities of the software, and made sure that it doesn't run into exceptions in other places? There might well be other parts of the code that need to be changed in order to handle such large datasets.
(2) Have you considered turning DataMatrix into an interface (or something similar) and putting the new variant into a separate class? This would probably be a bit challenging due to issues with serialization (DataMatrix is serializable), but it could be worthwhile to try it anyway, as this would be much less invasive.

Thanks again!

…aMatrix class to new classes: SingleArrayMatrix and MultidimensionalArrayMatrix; Created a new abstract class 'Matrix' to keep all information for a single array or a multidimensional one; Ran JUnit tests using single array and multidimensional array, and both passed all tests.

ramongonze · 2020-02-18T18:11:13Z

(2) Have you considered turning DataMatrix into an interface (or something similar) and putting the new variant into a separate class? This would probably be a bit challenging due to issues with serialization (DataMatrix is serializable), but it could be worthwhile to try it anyway, as this would be much less invasive.

I changed the array variable in DataMatrix into a Matrix object.

I created 3 new classes in org.deidentifier.arx.framework.data:

Matrix (abstract)
SingleArrayMatrix (extends Matrix)
MultidimensionalArrayMatrix (extends Matrix)

The Matrix class has all the methods from DataMatrix that depended on the array variable.
Both SingleArrayMatrix and MultidimensionalArrayMatrix implement the same set of methods, but SingleArrayMatrix uses an int[] array and MultidimensionalArrayMatrix uses an int[][] array.

Why didn't I turn DataMatrix into an abstract class? Because I wanted to avoid changing classes in other packages, so the changes are confined to the DataMatrix class. That's the reason I created the Matrix class.

Summarizing, we have:

DataMatrix changed;
3 new classes in org.deidentifier.arx.framework.data: Matrix, SingleArrayMatrix and MultidimensionalArrayMatrix.
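
The class layout described in this comment could look roughly like the sketch below. Only the three class names come from the comment; the `get`/`set` methods, the `create` factory, and all bodies are assumptions made for illustration and are not the code in this PR.

```java
/** Sketch of the described hierarchy; method names and bodies are hypothetical. */
abstract class Matrix {
    protected final int rows;
    protected final int columns;

    protected Matrix(int rows, int columns) {
        this.rows = rows;
        this.columns = columns;
    }

    abstract int get(int row, int column);

    abstract void set(int row, int column, int value);

    /** Hypothetical factory: picks the single-array variant unless the cell count overflows an int. */
    static Matrix create(int rows, int columns) {
        try {
            Math.multiplyExact(rows, columns); // throws ArithmeticException on overflow
            return new SingleArrayMatrix(rows, columns);
        } catch (ArithmeticException e) {
            return new MultidimensionalArrayMatrix(rows, columns);
        }
    }
}

/** Stores all cells in one int[], addressed as row * columns + column. */
final class SingleArrayMatrix extends Matrix {
    private final int[] array;

    SingleArrayMatrix(int rows, int columns) {
        super(rows, columns);
        this.array = new int[rows * columns];
    }

    @Override
    int get(int row, int column) {
        return array[row * columns + column];
    }

    @Override
    void set(int row, int column, int value) {
        array[row * columns + column] = value;
    }
}

/** Stores one int[] per row, so each dimension is limited separately. */
final class MultidimensionalArrayMatrix extends Matrix {
    private final int[][] array;

    MultidimensionalArrayMatrix(int rows, int columns) {
        super(rows, columns);
        this.array = new int[rows][columns];
    }

    @Override
    int get(int row, int column) {
        return array[row][column];
    }

    @Override
    void set(int row, int column, int value) {
        array[row][column] = value;
    }
}
```

With this split, a class such as DataMatrix only holds a `Matrix` reference and never needs to know which layout is in use.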

(1) Have you tested this, by executing the various functionalities of the software and made sure that it doesn't run into exceptions in other places? There might well be other parts of the code that need to be changed in order to handle such large datasets.

I have run all JUnit tests using both SingleArrayMatrix and MultidimensionalArrayMatrix operations, and all tests passed (except those that depend on datasets that are not available in this repo).

What other parts of the code should I change to handle large datasets?
Or could you tell me which functionalities you think would need changes? I could create some tests for them.

prasser · 2020-02-24T21:17:44Z

Thanks a lot! Another test that you should always perform is to load and save and load again the example projects with the GUI. Does it work? Thanks, Fabian

ramongonze · 2020-03-17T17:53:32Z

Just an update: before testing the GUI and saving/loading projects, I would like to fix the problem from issue #302, because it might influence the test.

prasser · 2020-06-21T19:47:33Z

We will update all GUI-related dependencies as part of fixing issue #315. This will likely fix this as well.

ramongonze · 2020-07-01T18:04:06Z

I have done a test to check the new matrix implementation (int[][]) using the GUI:

Created a dataset with 2,200,000 rows and 1000 columns (the matrix has more than 2^31-1 cells).
All the cells have the integer value 1, except the header which has values from 0 to 999.
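
A dataset of this shape can be produced with a short generator like the sketch below. The class name, the output file name `large.csv`, and the semicolon delimiter are assumptions for illustration; the comment does not say how the file was actually created.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical generator: header 0..columns-1, every data cell equal to 1. */
final class ConstantCsvGenerator {

    static void write(Path target, int rows, int columns, char delimiter) throws IOException {
        StringBuilder header = new StringBuilder();
        StringBuilder ones = new StringBuilder();
        for (int c = 0; c < columns; c++) {
            if (c > 0) {
                header.append(delimiter);
                ones.append(delimiter);
            }
            header.append(c);
            ones.append('1');
        }
        String line = ones.toString(); // every data row is identical, so build it once
        try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            out.write(header.toString());
            out.newLine();
            for (int r = 0; r < rows; r++) {
                out.write(line);
                out.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // 2,200,000 x 1000 = 2,200,000,000 cells, more than 2^31-1 = 2,147,483,647
        write(Paths.get("large.csv"), 2_200_000, 1000, ';');
    }
}
```

At these dimensions the resulting file is several gigabytes, so it is worth checking free disk space before running it.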

Test Steps

(1) Open arx (java -jar arx-3.9.0-gtk-64.jar -Xmx60G)
(2) File > New project > Ok
(3) File > Import Data > CSV, Next > Next > Finish
(4) Set attributes 0,1 as Identifying
(5) Set attributes 2,3,4 as QID
(6) Set attributes 5,6 as Sensitive
(7) Add k-anonymity privacy model with k=2
(8) File > Save project > Save
(9) Close arx
(10) Open arx (java -jar arx-3.9.0-gtk-64.jar -Xmx60G)
(11) File > Open project
(12) Set attribute 7 as QID
(13) File > Save project > Save
(14) Close arx
(15) Open arx (java -jar arx-3.9.0-gtk-64.jar -Xmx60G)
(16) File > Open project

It ran without errors or problems.
Note: The flag -Xmx60G is used to set the JVM memory limit to 60 GB.

@prasser is there a specific test at src/test/org/deidentifier/arx/test that I should reproduce using the GUI?

prasser · 2020-07-02T21:33:12Z

Sounds great, thanks!

Please perform the following additional test:
(1) Synchronize your feature branch and your master branch with ARX's master.
(2) Create a project (including data) with the current master branch of ARX (or the synchronized master in your fork, if you haven't committed changes to master) and save the project.
(3) Switch to the branch including your changes and try to load the project.

Does this work?

ramongonze · 2020-07-06T17:16:34Z

Does this work?

@prasser
I've done this test and it worked. I used the same dataset as in the previous test, but with only 1000 rows and 1000 columns.

Test Steps

(1) Open ARX from branch ‘master’ on arx-deidentifier/arx (java -jar arx-3.9.0-gtk-64.jar -Xmx60G)
(2) File > New project > Ok
(3) File > Import Data > CSV, Next > Next > Finish
(4) Set attributes 0,1 as Identifying
(5) Set attributes 2,3,4 as QID
(6) Set attributes 5,6 as Sensitive
(7) Add k-anonymity privacy model with k=2
(8) File > Save project > Save
(9) Close arx
(10) Open arx from branch ‘master’ on ramongonze/arx (java -jar arx-3.9.0-gtk-64.jar -Xmx60G)
(11) File > Open project

When opening the project in ARX (step 11), all the configuration set before was intact.

Notes:

ARX branch 'master' on arx-deidentifier was on commit 2808c70.
ARX branch 'master' on ramongonze/arx was on commit c049aee.

prasser · 2020-07-06T17:21:03Z

Ok, great. I created a branch "morecells" and changed the base branch of this PR, so that I can take a look. Thanks!

prasser · 2020-07-07T09:14:25Z

I will now check the branch and report any issues here.

prasser · 2020-07-07T09:27:29Z

@ramongonze Unfortunately, your branch doesn't work. Your experiments worked, because you did not generate an anonymized output dataset within your project. Try the following steps:

(1) With ARX's master: open the example project file or generate a new project in which an anonymization has been performed resulting in an output dataset.
(2) Store this project with ARX's master.
(3) Open this project with your branch -> No output data will be loaded, leading to several problems.

The reason is that DataMatrix is a serializable class in ARX and you have changed it, so that already serialized instances are not loaded fully. I see at least two options:

(1) You just extend DataMatrix with the new functionality, but don't remove any of the existing fields. You will then need to implement some logic to decide whether or not an old or new instance of the class has been loaded.
(2) You implement a custom deserializer that correctly deserializes old instances and maps the data to the new structures.

If you have any questions, please don't hesitate to ask.
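
As a rough illustration of option (1), the sketch below keeps the legacy field next to the new one and decides after deserialization which layout was loaded. It relies only on standard Java serialization behaviour (fields missing from an old stream are left at their default value when the serialVersionUID matches). The class `CompatibleMatrix`, its field names, and the serialVersionUID value are hypothetical and do not reflect the actual DataMatrix.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InvalidObjectException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Hypothetical sketch: old field kept, new field added, layout detected on load. */
final class CompatibleMatrix implements Serializable {
    // Assumption: in a real class this must equal the value of the already released version
    private static final long serialVersionUID = 1L;

    private final int columns;
    private final int[] array;     // legacy field, still filled by streams written by old versions
    private final int[][] nested;  // new field, null when an old stream is read

    private CompatibleMatrix(int columns, int[] array, int[][] nested) {
        this.columns = columns;
        this.array = array;
        this.nested = nested;
    }

    static CompatibleMatrix legacy(int rows, int columns) {
        return new CompatibleMatrix(columns, new int[rows * columns], null);
    }

    static CompatibleMatrix large(int rows, int columns) {
        return new CompatibleMatrix(columns, null, new int[rows][columns]);
    }

    boolean isLegacyLayout() {
        return nested == null;
    }

    int get(int row, int column) {
        return nested != null ? nested[row][column] : array[row * columns + column];
    }

    void set(int row, int column, int value) {
        if (nested != null) {
            nested[row][column] = value;
        } else {
            array[row * columns + column] = value;
        }
    }

    /** Runs after default deserialization and rejects streams that carry neither layout. */
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        if (array == null && nested == null) {
            throw new InvalidObjectException("Stream contains no cell data");
        }
    }

    /** Serializes and deserializes in memory; stands in for saving and loading a project. */
    static CompatibleMatrix roundTrip(CompatibleMatrix m) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(m);
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (CompatibleMatrix) in.readObject();
        }
    }
}
```

Because the legacy field is never removed, an instance written before the change still arrives with its data, and the accessors fall back to the single-array layout for it.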

prasser and others added 4 commits March 5, 2018 22:12

Update SWT to 4.7.2

c3ad83e

Modified class DataMatrix to support more than 2^31-1 elements. Inste…

c97eb06

…ad of throwing an IllegalArgumentException, an multidimensional array is created. Now the upper bound of cells for a DataMatrix is (2^31)^2.

Fixed 'ArithmeticException: integer overflow' caused by function Math…

c650446

….multiplyExact in DataMatrix constructor. Now the exception is handled.

Solved a mistake in if clause for this.isMultidimensional.

d3d6d87

ramongonze force-pushed the master branch from f09c065 to d3d6d87 Compare February 5, 2020 22:17

prasser added 8 commits June 8, 2020 16:17

Merge branch 'master' into swt-4-7-2

0fe5c61

Conflicts: build.xml

Merge remote-tracking branch 'origin/master' into swt-4-7-2

b9ee42e

Update build processes

2bcda27

Update GUI dependencies to fix problems on MacOS

3b38631

Replace jar with version compiled with older compiler

3568c76

Fix various update issues on MacOS

30e8414

Fix update issues on MacOS

7c0fbe0

Fix test and improve exception message

a89fc04

prasser and others added 6 commits June 23, 2020 18:16

Fix bug with pattern offset on MacOS

28ff1e0

Fixes for various table update issues on MacOS

d35bd4c

Formatting

076899d

Merge branch 'master' of https://github.com/arx-deidentifier/arx

f15e423

Merge with branch swt-4-7-2.

ab5bc68

Merge upstream and origin

e65792f

Merge remote-tracking branch 'master' of 'https://github.com/arx-deid…

c049aee

…entifier/arx'

prasser changed the base branch from master to morecells July 6, 2020 17:49

prasser merged commit ddb6d27 into arx-deidentifier:morecells Jul 7, 2020

prasser assigned prasser and ramongonze Jul 7, 2020

prasser added the needs-confirmation label Jul 7, 2020

prasser added bug enhancement and removed needs-confirmation labels Jul 7, 2020
