java.lang.NegativeArraySizeException when checking re-identification risk for large file #243
Thanks for reporting this issue! Can you please post the complete stacktrace? Thanks, Fabian
Thanks for the quick reply. Below is the stack trace from when it failed:
I have applied a partial fix to master for this issue by enhancing the error message. The dataset that you are trying to load is too large for the current implementation of matrices in ARX. You have two options: (1) split your dataset into smaller subsets and run the analysis on those, or (2) enhance org.deidentifier.arx.framework.data.DataMatrix to switch to a data structure that can hold more than 2^31-1 entries when the number of cells becomes too large. It will not be sufficient to switch to long arrays (they also cannot hold more than 2^31-1 entries). You could use multi-dimensional arrays (e.g. a special implementation of an ArrayList that is backed by a list of arrays, see http://fastutil.di.unimi.it/docs/it/unimi/dsi/fastutil/BigArrays.html) or switch to off-heap memory (https://github.com/xerial/larray is a well-known implementation; however, LArray does not seem to be serializable, which is required by ARX). Best
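To make option (2) more concrete, here is the general idea of a matrix backed by several arrays. This is only a rough sketch under made-up naming (BigIntMatrix and its chunking scheme are hypothetical, not ARX's DataMatrix API); a real change would also have to preserve serializability and the existing DataMatrix interface:

```java
// Minimal sketch only (BigIntMatrix is a made-up name, not part of ARX):
// a row-major int matrix backed by several arrays, so the total number of
// cells is limited by heap size rather than by the 2^31-1 length limit of
// a single Java array.
public final class BigIntMatrix {

    private static final int CHUNK_BITS = 27;               // 2^27 entries per backing array
    private static final int CHUNK_SIZE = 1 << CHUNK_BITS;
    private static final long CHUNK_MASK = CHUNK_SIZE - 1;

    private final int[][] chunks;
    private final long columns;

    public BigIntMatrix(long rows, long columns) {
        this.columns = columns;
        long cells = rows * columns;                         // computed in long, no int overflow
        int numChunks = (int) ((cells + CHUNK_SIZE - 1) >>> CHUNK_BITS);
        this.chunks = new int[numChunks][];
        for (int i = 0; i < numChunks; i++) {
            long remaining = cells - ((long) i << CHUNK_BITS);
            chunks[i] = new int[(int) Math.min(CHUNK_SIZE, remaining)];
        }
    }

    public int get(long row, long column) {
        long index = row * columns + column;
        return chunks[(int) (index >>> CHUNK_BITS)][(int) (index & CHUNK_MASK)];
    }

    public void set(long row, long column, int value) {
        long index = row * columns + column;
        chunks[(int) (index >>> CHUNK_BITS)][(int) (index & CHUNK_MASK)] = value;
    }
}
```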
Thanks, Fabian, for the suggestions. What is the maximum number of columns I can use and still get meaningful estimates? For now I will choose option 1: split the file and run the re-identification risk analysis on the subsets. Thanks
You can calculate this yourself - the following condition must hold: rows * columns <= 2^31-1. So, if you have 5M rows, this gives you: max_columns = floor((2^31-1) / 5M) = 429
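Just to illustrate that calculation (assuming one int cell per value in a single backing array, as in the current DataMatrix):

```java
long rows = 5_000_000L;
long maxCells = Integer.MAX_VALUE;     // 2^31 - 1 = 2,147,483,647
long maxColumns = maxCells / rows;     // integer division acts as floor here
System.out.println(maxColumns);        // prints 429
```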
I will close this issue now. Thanks again for reporting the exception. Should you want to extend ARX as described above in the future, we can reopen the issue. Thanks for using ARX! Fabian
Hi,
We are using ARX risk analysis and our data files are very large. One of the files has 5 million rows and 800 columns.
When Data.getHandle() is called, it fails because it creates an integer array of size rows*columns, which exceeds the integer range. Could we perhaps use a long array instead? Is that a fair assumption, and can I submit a pull request?
Thanks
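For context, the failure mode described above can be reproduced in isolation (illustrative only; the actual allocation happens inside ARX's data handling code, and the variable names here are made up):

```java
int rows = 5_000_000;
int columns = 800;
int cells = rows * columns;        // overflows 32-bit int: wraps to -294,967,296
int[] data = new int[cells];       // throws java.lang.NegativeArraySizeException
```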