Workaround when creating a FlatDataList with nulled values. #23

jluis2k10 · 2017-08-27T19:46:22Z

Sometimes, when a FlatDataList is created, null values are added to the list; consequently, the library throws an exception when it tries to retrieve those values. For example, the method variance() in core.statistics.descriptivestatistics.Descriptive fails at line 293 (double v = it.next()) when flatDataCollection has some of its values nulled.
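For readers unfamiliar with the failure mode, here is a minimal sketch of why a line like double v = it.next() breaks on a null element: Java auto-unboxes the Double to a primitive double, and unboxing null throws a NullPointerException. The class NullUnboxingDemo and its methods are hypothetical and only mimic the pattern; they are not the library's code.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical demo class; it only mimics the failing pattern, not the library itself.
final class NullUnboxingDemo {

    // Sums the values by reading each element into a primitive double,
    // the same way the failing line does.
    static double sum(List<Double> values) {
        double total = 0.0;
        Iterator<Double> it = values.iterator();
        while (it.hasNext()) {
            double v = it.next(); // auto-unboxing: throws NullPointerException on null
            total += v;
        }
        return total;
    }

    // Returns true when summing the list fails because of a null element.
    static boolean failsOnNull(List<Double> values) {
        try {
            sum(values);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(failsOnNull(Arrays.asList(1.0, 2.0, 6.0)));
        System.out.println(failsOnNull(Arrays.asList(1.0, null, 6.0)));
    }
}
```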

datumbox

As you noted, the Dataframe.getXColumn() method does not return only numeric/double values. I don't think this workaround should be placed here.

jluis2k10 · 2017-08-27T21:02:28Z

Noted. Any tips on how I should try to "fix" this? I'm not implying that it is broken, but in some cases, when I try to create and train a Text Classifier, it fails because of this.

For example, when using a StandardScaler as the NumericalScaler in the TrainingParameters of the Classifier: if I try to train the Classifier (fit() method) with what I believe is a well-constructed Dataframe, it fails at that point, because Dataframe.getXColumn() returns a FlatDataList with null values in it. Also, if I simply ignore those values and don't add them to the FlatDataList, some methods such as Descriptives.variance() fail because the size of the FlatDataList is 1 or less (and, as a side note, my understanding of variance tells me this isn't a good idea either way).

Edit: maybe in TypeInference.toDouble(Object v)? Changing this:

    if (v == null) {
        return null;
    }

to this:

    if (v == null) {
        return new Double(0.0);
    }

datumbox · 2017-08-27T21:32:04Z

Hey @jluis2k10, thanks for reporting this. :)

How did you end up having null values in the Dataframe? Do you encode missing values as null, or is it the result of some other transformation applied to the original data?

IF this needs patching, one potential solution is to change the Descriptives class to ignore null values. This should be a simple fix, but there are a few gotchas. For example, let's say you want to estimate the average of the following list: [1,2,null,6]. If you interpret nulls as missing values, then the average is (1+2+6)/3=3, not (1+2+6)/4=2.25. Moreover, please note that this is a policy change, so we need to be sure it is necessary.
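To make the two interpretations concrete, here is a small sketch that computes the average of [1,2,null,6] both ways. The class NullPolicyDemo and its method names are hypothetical, chosen for illustration; this is not how the Descriptives class is implemented.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class illustrating the two null-handling policies.
final class NullPolicyDemo {

    // Policy A: nulls are missing values, so they are skipped
    // and do not count toward the number of observations.
    static double meanIgnoringNulls(List<Double> values) {
        double sum = 0.0;
        int n = 0;
        for (Double v : values) {
            if (v != null) {
                sum += v;
                n++;
            }
        }
        if (n == 0) {
            throw new IllegalArgumentException("No non-null values to average.");
        }
        return sum / n;
    }

    // Policy B: nulls are replaced by 0.0 and still count as observations.
    static double meanNullsAsZero(List<Double> values) {
        if (values.isEmpty()) {
            throw new IllegalArgumentException("No values to average.");
        }
        double sum = 0.0;
        for (Double v : values) {
            sum += (v == null) ? 0.0 : v;
        }
        return sum / values.size();
    }

    public static void main(String[] args) {
        List<Double> data = Arrays.asList(1.0, 2.0, null, 6.0);
        System.out.println(meanIgnoringNulls(data)); // (1+2+6)/3
        System.out.println(meanNullsAsZero(data));   // (1+2+0+6)/4
    }
}
```

The two policies give different answers for the same input, which is why choosing one silently inside the framework is a policy decision rather than a bug fix.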

Please note that I'm not questioning whether the Descriptives class throws an exception if the list contains nulls. That is obvious. My question is how/why you ended up having nulls in your Dataframe in the first place. Perhaps the bug lives somewhere else. Could you open a ticket and provide a snippet that reproduces the problem? That could help us investigate the problem and discuss next steps. :)

jluis2k10 · 2017-08-27T21:50:57Z

Yes, I'm aware of the drawbacks of simply ignoring the nulls; that's why I tried to convert them to 0.0.

And I don't have nulls in the Dataframe. This happens in Dataframe.getXColumn() when the column parameter is not present in some of the entries of the Dataframe. For example, if it calls getXColumn("some text to check") and only one of the entries in the Dataframe contains that piece of text, then I get as many null entries in the FlatDataList as the Dataframe size minus 1.
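A rough sketch of the situation described above, assuming each record behaves like a sparse map from column name to value. SparseColumnDemo, column() and sampleRecords() are hypothetical stand-ins; the real Dataframe and getXColumn() may work differently internally.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a sparse dataset; not the library's Dataframe.
final class SparseColumnDemo {

    // Collects one value per record for the given key.
    // Map.get returns null when the key is absent, so missing columns become nulls.
    static List<Object> column(List<Map<String, Object>> records, String key) {
        List<Object> out = new ArrayList<>();
        for (Map<String, Object> record : records) {
            out.add(record.get(key));
        }
        return out;
    }

    // Counts the null entries in an extracted column.
    static int countNulls(List<Object> column) {
        int nulls = 0;
        for (Object value : column) {
            if (value == null) {
                nulls++;
            }
        }
        return nulls;
    }

    // Three records; only the first one contains the column.
    static List<Map<String, Object>> sampleRecords() {
        List<Map<String, Object>> records = new ArrayList<>();
        Map<String, Object> first = new HashMap<>();
        first.put("some text to check", 1.0);
        records.add(first);
        records.add(new HashMap<>());
        records.add(new HashMap<>());
        return records;
    }

    public static void main(String[] args) {
        List<Object> col = column(sampleRecords(), "some text to check");
        System.out.println(col.size() + " entries, " + countNulls(col) + " nulls");
    }
}
```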

I don't have a neat and concise piece of code to show, but I will try to write some ASAP. And I'm sorry, but what about the edit to my previous comment? Is it not a good idea to just return 0.0 if the value is null when converting to double?

datumbox · 2017-08-27T22:29:59Z

This is not about convenience/inconvenience but rather about ensuring that the math/algorithms will not break. We can't convert nulls to zeros. The semantics would not be correct; it's a huge change that affects the entire framework and could potentially mask or cause other problems.

Some classes of the framework are supposed to throw an exception if you have nulls or incorrect values stored in the Dataframe. In Machine Learning, data needs to be preprocessed before being fed into algorithms. For example, if your column is indeed non-empty in only 1 record, trying to perform standardization/scaling with an algorithm that estimates the variance (which requires more than 1 observation) should fail.
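As an illustration of why a single observation cannot work, here is a minimal sketch of the unbiased sample variance, which divides by n - 1 and is therefore undefined for n < 2. VarianceDemo is a hypothetical class written for this example, assuming the n - 1 estimator; it is not the framework's Descriptives.variance().

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; assumes the unbiased (n - 1) sample variance.
final class VarianceDemo {

    // Computes the sample variance and rejects inputs with fewer than 2 observations,
    // because the n - 1 denominator would be zero or negative.
    static double sampleVariance(List<Double> values) {
        int n = values.size();
        if (n < 2) {
            throw new IllegalArgumentException("Variance requires at least 2 observations.");
        }
        double mean = 0.0;
        for (double v : values) {
            mean += v;
        }
        mean /= n;

        double sumOfSquares = 0.0;
        for (double v : values) {
            double diff = v - mean;
            sumOfSquares += diff * diff;
        }
        return sumOfSquares / (n - 1);
    }

    public static void main(String[] args) {
        System.out.println(sampleVariance(Arrays.asList(1.0, 2.0, 6.0)));
        try {
            sampleVariance(Arrays.asList(1.0));
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```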

Try writing a concise/simple piece of code that reproduces the problem and opening a ticket so that we can investigate further. :)

datumbox self-requested a review August 27, 2017 20:33

datumbox requested changes Aug 27, 2017

jluis2k10 closed this Aug 27, 2017

jluis2k10 reopened this Aug 27, 2017

datumbox closed this Aug 27, 2017

jluis2k10 mentioned this pull request Aug 28, 2017

FlatDataList with null values gets an exception when trying to calculate the variance #24

Closed
