Input data with different lengths / filled with NAs #58
Seems like a relatively straightforward update might be to add |
Thanks for the observation and sorry for the late answer! I have just reached this task in my backlog :) This case hasn't come up before. I don't see any reason to not allow NaNs, so we can just set |
Cool, thanks a lot! |
I have fixed the problem and released a new version that includes the fix. Let me know if there is a problem! |
Hi, it seems the issue is still present in ranked batch-mode sampling.
Reprex (mostly from the Ranked batch-mode sampling documentation):
import numpy as np
import xgboost as xgb
from functools import partial
from sklearn.datasets import load_iris
from modAL.batch import uncertainty_batch_sampling
from modAL.models import ActiveLearner
iris = load_iris()
X_raw = iris['data']
y_raw = iris['target']
# Isolate our examples for our labeled dataset.
n_labeled_examples = X_raw.shape[0]
# randint's upper bound is exclusive, so this already covers every valid row index.
training_indices = np.random.randint(low=0, high=n_labeled_examples, size=3)
X_train = X_raw[training_indices]
y_train = y_raw[training_indices]
# Isolate the non-training examples we'll be querying.
X_pool = np.delete(X_raw, training_indices, axis=0)
y_pool = np.delete(y_raw, training_indices, axis=0)
# Set one entry of the pool to np.nan.
X_pool[0][0] = np.nan
# Pre-set our batch sampling to retrieve 3 samples at a time.
BATCH_SIZE = 3
preset_batch = partial(uncertainty_batch_sampling, n_instances=BATCH_SIZE)
# Specify our active learning model.
learner = ActiveLearner(
estimator=xgb.XGBClassifier(),
X_training=X_train,
y_training=y_train,
query_strategy=preset_batch,
force_all_finite=False
)
query_index, query_instance = learner.query(X_pool)
Error message:
ValueError Traceback (most recent call last)
in
40 )
41
---> 42 query_index, query_instance = learner.query(X_pool)
~/.pyenv/versions/python-3.7.4/lib/python3.7/site-packages/modAL/models/base.py in query(self, *query_args, **query_kwargs)
201 labelled upon query synthesis.
--> 203 query_result = self.query_strategy(self, *query_args, **query_kwargs)
204 return query_result
205
~/.pyenv/versions/python-3.7.4/lib/python3.7/site-packages/modAL/batch.py in uncertainty_batch_sampling(classifier, X, n_instances, metric, n_jobs, **uncertainty_measure_kwargs)
208 uncertainty = classifier_uncertainty(classifier, X, **uncertainty_measure_kwargs)
209 query_indices = ranked_batch(classifier, unlabeled=X, uncertainty_scores=uncertainty,
--> 210 n_instances=n_instances, metric=metric, n_jobs=n_jobs)
211 return query_indices, X[query_indices]
~/.pyenv/versions/python-3.7.4/lib/python3.7/site-packages/modAL/batch.py in ranked_batch(classifier, unlabeled, uncertainty_scores, n_instances, metric, n_jobs)
161 instance_index, instance, mask = select_instance(X_training=labeled, X_pool=unlabeled,
162 X_uncertainty=uncertainty_scores, mask=mask,
--> 163 metric=metric, n_jobs=n_jobs)
164
165 # Add our instance we've considered for labeling to our labeled set. Although we don't
~/.pyenv/versions/python-3.7.4/lib/python3.7/site-packages/modAL/batch.py in select_instance(X_training, X_pool, X_uncertainty, mask, metric, n_jobs)
97 _, distance_scores = pairwise_distances_argmin_min(X_pool_masked.reshape(n_unlabeled, -1),
98 X_training.reshape(n_labeled_records, -1),
---> 99 metric=metric)
100 else:
101 distance_scores = pairwise_distances(X_pool_masked.reshape(n_unlabeled, -1),
~/.pyenv/versions/python-3.7.4/lib/python3.7/site-packages/sklearn/metrics/pairwise.py in pairwise_distances_argmin_min(X, Y, axis, metric, metric_kwargs)
573 sklearn.metrics.pairwise_distances_argmin
--> 575 X, Y = check_pairwise_arrays(X, Y)
576
577 if metric_kwargs is None:
~/.pyenv/versions/python-3.7.4/lib/python3.7/site-packages/sklearn/metrics/pairwise.py in check_pairwise_arrays(X, Y, precomputed, dtype, accept_sparse, force_all_finite, copy)
139 X = check_array(X, accept_sparse=accept_sparse, dtype=dtype,
140 copy=copy, force_all_finite=force_all_finite,
--> 141 estimator=estimator)
142 Y = check_array(Y, accept_sparse=accept_sparse, dtype=dtype,
143 copy=copy, force_all_finite=force_all_finite,
~/.pyenv/versions/python-3.7.4/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
576 if force_all_finite:
577 _assert_all_finite(array,
--> 578 allow_nan=force_all_finite == 'allow-nan')
579
580 if ensure_min_samples > 0:
~/.pyenv/versions/python-3.7.4/lib/python3.7/site-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
58 msg_err.format
59 (type_err,
---> 60 msg_dtype if msg_dtype is not None else X.dtype)
61 )
62 # for object dtype data, we only check for NaNs (GH-13254)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64'). |
Hi! This seems like a scikit-learn issue :( The function [...] Similarly, if you set [...] Do you have any suggestions on how to solve this? At the moment, I don't see a proper solution, but this doesn't mean that there isn't one. (I don't want to internally remove NaNs before passing the data to the external functions, because this would remain hidden from the user, possibly causing unintended consequences.) |
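The traceback above ends in scikit-learn's pairwise distance check, which rejects NaN inputs. As a rough illustration of one possible direction (not what modAL does), here is a minimal NumPy sketch of a NaN-aware Euclidean distance. It skips coordinates missing in either row and rescales by the share of coordinates that are present; as far as I know, this is the same definition used by scikit-learn's `nan_euclidean_distances`. The helper name `nan_aware_euclidean` and the toy arrays are assumptions made for this example.

```python
import numpy as np


def nan_aware_euclidean(X, Y):
    """Pairwise Euclidean distances that skip coordinates missing in either row.

    Squared differences are summed over the coordinates present in both rows
    and rescaled by n_features / n_present, so rows with gaps are not
    artificially close. Pairs with no shared coordinate yield NaN.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # diff has shape (n_x, n_y, n_features); it is NaN where either side is missing.
    diff = X[:, None, :] - Y[None, :, :]
    present = ~np.isnan(diff)
    sq_sum = np.where(present, diff ** 2, 0.0).sum(axis=2)
    n_present = present.sum(axis=2)
    n_features = X.shape[1]
    with np.errstate(divide="ignore", invalid="ignore"):
        scaled = sq_sum * n_features / n_present
    return np.sqrt(scaled)


# Toy data (hypothetical): the first pool row has a missing first feature.
X_pool = np.array([[np.nan, 3.0, 4.0], [1.0, 2.0, 3.0]])
X_train = np.array([[0.0, 0.0, 0.0]])

# A plain Euclidean distance propagates the NaN to the whole row.
plain = np.sqrt(((X_pool[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2))
robust = nan_aware_euclidean(X_pool, X_train)
```

Whether such a metric is appropriate depends on the data: it treats missing coordinates as uninformative, which may or may not hold for padded time series.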
Hi! As you correctly mentioned, this should only work for models that can handle missing values, such as newer boosting methods (e.g. xgboost). Alternatively, |
That is a good idea! I am going to take a shot at this. I can't promise to do it ASAP since I am extremely busy with other work, but I'll try to do it this month. |
I'm trying to use modAL in combination with tslearn to classify timeseries of different lengths.
tslearn supports variable-length time series by filling the shorter time series up with NAs, but modAL calls
without setting force_all_finite = 'allow-nan'.
Is there a reason for not allowing NAs, or did this use case just not come up before?
Thanks a lot!
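To make the use case concrete, here is a small sketch of what NaN-padded, variable-length series look like and why a strict finiteness check rejects them. It uses plain NumPy rather than tslearn's own utilities, and `pad_with_nan` and `passes_finite_check` are hypothetical helpers written for this illustration. The `allow_nan` branch mirrors the documented meaning of `force_all_finite='allow-nan'` in scikit-learn: NaN is accepted, infinities are not.

```python
import numpy as np


def pad_with_nan(series):
    """Right-pad 1-D series of different lengths with NaN into one 2-D array."""
    max_len = max(len(s) for s in series)
    padded = np.full((len(series), max_len), np.nan)
    for row, s in enumerate(series):
        padded[row, : len(s)] = s
    return padded


def passes_finite_check(X, allow_nan=False):
    """Hypothetical stand-in for a finiteness check with an allow-NaN mode."""
    X = np.asarray(X, dtype=float)
    if allow_nan:
        # NaN is tolerated, infinities are still rejected.
        return not np.isinf(X).any()
    return bool(np.isfinite(X).all())


# Two series of length 3 and 2; the shorter one is padded with a single NaN.
X = pad_with_nan([[1.0, 2.0, 3.0], [4.0, 5.0]])
strict_ok = passes_finite_check(X)                    # strict mode rejects the padding
lenient_ok = passes_finite_check(X, allow_nan=True)   # allow-NaN mode accepts it
```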