Workaround when creating a FlatDataList with nulled values. #23

Closed

jluis2k10

Sometimes when creating a FlatDataList, null values are added to the list and, consequently, the library fails with an exception when it tries to retrieve those values. For example, the method variance() in core.statistics.descriptivestatistics.Descriptives fails at line 293 (double v = it.next()) when flatDataCollection contains null values.
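
The failure described above can be reproduced without the library. The following is a minimal sketch, assuming the values are held as boxed `Double` objects; the class and method names are hypothetical and are not part of Datumbox. It shows that assigning a null element to a primitive `double` triggers auto-unboxing, which throws a `NullPointerException`.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical demo class, not part of Datumbox.
final class UnboxingNullDemo {

    // Sums the list using the same pattern as the failing line:
    // each boxed element is auto-unboxed into a primitive double.
    static double sum(List<Double> values) {
        double total = 0.0;
        Iterator<Double> it = values.iterator();
        while (it.hasNext()) {
            double v = it.next(); // NullPointerException if the element is null
            total += v;
        }
        return total;
    }

    // Returns true if summing the list fails because a null was unboxed.
    static boolean failsOnNull(List<Double> values) {
        try {
            sum(values);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(failsOnNull(Arrays.asList(1.0, 2.0, 6.0)));  // no nulls
        System.out.println(failsOnNull(Arrays.asList(1.0, null, 6.0))); // one null
    }
}
```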

@datumbox datumbox self-requested a review August 27, 2017 20:33
Owner

@datumbox datumbox left a comment


As you noted, the Dataframe.getXColumn() method does not return only numeric/double values. I don't think this workaround should be placed here.

@jluis2k10
Author

jluis2k10 commented Aug 27, 2017

Noted... any tips on how I should try to "fix" this? I'm not implying that it is broken, but in some cases, when I try to create and train a Text Classifier, it fails because of this.

For example, when using a StandardScaler as the NumericalScaler in the TrainingParameters of the Classifier: if I try to train the Classifier (fit() method) with what I suppose is a well-constructed Dataframe, it fails at that point, because the method Dataframe.getXColumn() returns a FlatDataList with null values in it. Also, if I simply ignore those values and don't add them to the FlatDataList at all, some methods like Descriptives.variance() fail because the size of the FlatDataList is 1 or less (and, as a side note, my understanding of the variance tells me this isn't a good idea either way).

Edit: maybe the change belongs in TypeInference.toDouble(Object v)? Replace this:

    if (v == null) {
        return null;
    }

with this:

    if (v == null) {
        return new Double(0.0);
    }

@jluis2k10 jluis2k10 closed this Aug 27, 2017
@jluis2k10 jluis2k10 reopened this Aug 27, 2017
@datumbox
Owner

Hey @jluis2k10, thanks for reporting this. :)

How did you end up having null values in the Dataframe? Do you encode missing values as null, or are they the result of some other transformation done on the original data?

If this needs patching, one potential solution is to change the Descriptives class to ignore null values. This should be a simple fix, but there are a few gotchas. For example, let's say you want to estimate the average of the following list: [1,2,null,6]. If you interpret nulls as missing values, then the average is (1+2+6)/3=3, not (1+2+6)/4. Moreover, please note that this is a policy change, so we need to be sure it is necessary.
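
To make the gotcha concrete, here is a minimal sketch of the two interpretations applied to [1,2,null,6]. The class and method names are hypothetical and do not reflect how the Descriptives class is written; the point is only that the choice of denominator changes the result.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Datumbox.
final class NullAwareMean {

    // Nulls are treated as missing: excluded from both the sum and the count.
    static double meanSkippingNulls(List<Double> values) {
        double sum = 0.0;
        int n = 0;
        for (Double v : values) {
            if (v != null) {
                sum += v;
                n++;
            }
        }
        if (n == 0) {
            throw new IllegalArgumentException("no non-null values");
        }
        return sum / n;
    }

    // Nulls are treated as zeros: they add nothing to the sum but still count.
    static double meanNullsAsZero(List<Double> values) {
        if (values.isEmpty()) {
            throw new IllegalArgumentException("empty list");
        }
        double sum = 0.0;
        for (Double v : values) {
            if (v != null) {
                sum += v;
            }
        }
        return sum / values.size();
    }

    public static void main(String[] args) {
        List<Double> data = Arrays.asList(1.0, 2.0, null, 6.0);
        System.out.println(meanSkippingNulls(data)); // (1+2+6)/3
        System.out.println(meanNullsAsZero(data));   // (1+2+6)/4
    }
}
```

For this list the first method gives 3.0 and the second gives 2.25, so the two policies are not interchangeable.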

Please note that I'm not questioning whether the Descriptives class throws an exception if the list contains nulls. That is obvious. My question is how/why you ended up having nulls in your Dataframe in the first place. Perhaps the bug lives somewhere else. Could you open a ticket and provide a snippet that reproduces the problem? That could help us investigate the problem and discuss next steps. :)

@jluis2k10
Author

Yes, I'm aware of the drawbacks of simply ignoring the nulls; that's why I tried to convert them to 0.0.

And I don't have nulls in the Dataframe. This happens in Dataframe.getXColumn() when the column parameter is not present in one of the entries of the Dataframe. For example, if it calls getXColumn("some text to check") and only one of the entries in the Dataframe contains that piece of text, then I get as many null entries in the FlatDataList as the Dataframe size minus 1.

I don't have a neat and concise piece of code to show, but I will try to write some ASAP. And I'm sorry, but what about the edit in my previous comment? Is it not a good idea to just return 0.0 if the value is null when converting to double?
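
The behaviour described here can be illustrated with plain collections. This is only a sketch under the assumption that each record stores its features as a sparse key/value map and that the column lookup does a plain `get` per record; `getColumn`, `sampleRecords` and `countNulls` are hypothetical names and this is not Datumbox's implementation of getXColumn().

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a sparse dataset; not Datumbox code.
final class SparseColumnDemo {

    // Three records; only the first one contains the feature "some text to check".
    static List<Map<String, Object>> sampleRecords() {
        List<Map<String, Object>> records = new ArrayList<>();
        Map<String, Object> first = new HashMap<>();
        first.put("some text to check", 1.0);
        first.put("other", 2.0);
        records.add(first);
        Map<String, Object> second = new HashMap<>();
        second.put("other", 3.0);
        records.add(second);
        Map<String, Object> third = new HashMap<>();
        third.put("other", 4.0);
        records.add(third);
        return records;
    }

    // Collects one value per record; Map.get returns null when the key is absent.
    static List<Object> getColumn(List<Map<String, Object>> records, String column) {
        List<Object> out = new ArrayList<>();
        for (Map<String, Object> record : records) {
            out.add(record.get(column));
        }
        return out;
    }

    // Counts how many entries of the extracted column are null.
    static long countNulls(List<Object> column) {
        long n = 0;
        for (Object v : column) {
            if (v == null) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        List<Object> column = getColumn(sampleRecords(), "some text to check");
        System.out.println(column.size() + " entries, " + countNulls(column) + " nulls");
    }
}
```

With three records and one match, the extracted column has three entries and two nulls, which matches the "Dataframe size minus 1" count reported above.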

@datumbox
Owner

This is not about convenience/inconvenience but rather about ensuring that the math/algorithms will not break. We can't convert nulls to zeros. The semantics would not be correct; it's a huge change that affects the entire framework and could potentially mask or cause other problems.

Some classes of the framework are supposed to throw an exception if you have nulls or incorrect values stored in the Dataframe. In Machine Learning, data need to be preprocessed before being fed into algorithms. For example, if your column is indeed non-empty in only 1 record, trying to perform standardization/scaling with an algorithm that estimates the variance (which requires more than 1 observation) should fail.
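
As an illustration of why a single observation cannot work, here is a minimal sketch of an unbiased sample variance, which divides by n - 1. It assumes the sample (n - 1) estimator; whether Descriptives.variance() uses exactly this formula is not confirmed in this thread, and the class name below is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Datumbox.
final class SampleVariance {

    // Unbiased sample variance: sum of squared deviations divided by (n - 1).
    // With n < 2 the denominator is zero or negative, so the input is rejected.
    static double variance(List<Double> values) {
        int n = values.size();
        if (n < 2) {
            throw new IllegalArgumentException(
                "sample variance needs at least 2 observations, got " + n);
        }
        double mean = 0.0;
        for (double v : values) {
            mean += v;
        }
        mean /= n;
        double sumSq = 0.0;
        for (double v : values) {
            sumSq += (v - mean) * (v - mean);
        }
        return sumSq / (n - 1);
    }

    public static void main(String[] args) {
        System.out.println(variance(Arrays.asList(1.0, 2.0, 6.0)));
        try {
            variance(Arrays.asList(6.0)); // a column that is non-empty in one record
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Failing fast here is a deliberate choice in the sketch: returning 0.0 or NaN for a single observation would let a scaler divide by a meaningless standard deviation later on.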

Try writing a concise/simple piece of code that reproduces the problem and opening a ticket so that we can investigate further. :)
