FEA: engine accepts dpnp.ndarray and dpt.usm_ndarray objects as input data. #62
Conversation
…o pass dpnp.ndarray and dpt.usm_ndarray objects as input data.
sklearn_numba_dpex/common/kernels.py (outdated)

    @lru_cache
    def make_sum_reduction_2d_axis1_kernel(
        size0, size1, work_group_size, device, dtype, fused_unary_func=None
The changes to this kernel enable the addition of the new argument fused_unary_func, which allows fusing a unary op on the elements of the input before summing. It's used in this PR to compute the variance for scaling the tolerance.
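For illustration, a minimal sketch of how such a fused reduction could be requested through the factory signature quoted above (`_square` and the size/device variables are hypothetical placeholders, not the PR's actual code):

    def _square(x):
        # unary op applied to each element before it enters the sum
        return x * x

    # builds a kernel computing sum(x ** 2) along axis 1 in a single pass,
    # avoiding a separate elementwise kernel and an intermediate buffer
    sum_of_squares_kernel = make_sum_reduction_2d_axis1_kernel(
        size0=n_rows,
        size1=n_cols,
        work_group_size=work_group_size,
        device=device,
        dtype=dtype,
        fused_unary_func=_square,
    )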
@@ -1,14 +1,29 @@
    import warnings
With those changes (which amount to almost rewriting the whole file, but that was expected: the sequence of previous refactors was building toward it), sklearn's behavior regarding input validation is implemented almost identically. The old behavior is removed (in particular, printing a warning when avoidable copies are implicitly triggered).
@@ -364,7 +366,7 @@ def _get_score_with_centers(centers):
     [
         -1827.22702,
         -1027.674243,
-        -865.257397,
+        -865.257501,
The difference in those values comes from numerical instability after including the X_mean removal step.
    ), override_attr_context(
        sklearn_validation,
        get_namespace=_get_namespace,
        _asarray_with_order=_asarray_with_order,
Using monkey-patching here enables an efficient workflow as a whole. I wonder if, rather than having those two functions limited to the Array API specs, we should also consider pluginifying them.
Out of WIP
Thank you, @fcharras. Here is a first pass.
    try:
        attrs_before = dict()
        for attr_name, attr_value in attrs.items():
            # raise AttributeError if obj does not have the attribute attr_name
            attrs_before[attr_name] = getattr(obj, attr_name)
            setattr(obj, attr_name, attr_value)

        yield

    finally:
        for attr_name, attr_value in attrs_before.items():
            setattr(obj, attr_name, attr_value)
How about this?
Suggested change (replacing the try/finally block above):

    attrs_before = dict()
    for attr_name, attr_value in attrs.items():
        attribute = getattr(obj, attr_name, None)
        if attribute is not None:
            # Only replace the value if the attribute `attr_name` does exist.
            attrs_before[attr_name] = attribute
            setattr(obj, attr_name, attr_value)
    yield
    for attr_name, attr_value in attrs_before.items():
        setattr(obj, attr_name, attr_value)
The try/finally is not only about the AttributeError that was documented, but about any error that can be raised after yield and before exiting the context. I initially thought that the code after yield is always executed before exiting, in the same way it's true for __exit__ when writing explicit context managers, but in the case of such errors it turns out not to be true, and it caused failing tests (in the sklearn pipeline). Adding the try/finally fixes that.

Regarding only replacing the value if attr_name does exist, I'd prefer not to: raising an AttributeError is a safeguard against typos, or unseen changes in input objects. The context manager is only intended to be used for replacing attributes that exist, and I like it to fail otherwise.
    engine_kmeans_plusplus_centers = engine.init_centroids(X_prepared)
    engine_kmeans_plusplus_centers = dpt.asnumpy(engine_kmeans_plusplus_centers.T)
Do we need to _t-suffix one of them?
    centers, indices = engine._kmeans_plusplus(X_prepared)
    centers = dpt.asnumpy(centers.T)
Similarly, do we need to _t-suffix one of them?
@@ -172,12 +171,16 @@ def make_sum_reduction_2d_axis1_kernel(size0, size1, work_group_size, device, dt
    minus_one_idx = np.int64(-1)
    two_as_a_long = np.int64(2)

    is_1d = size1 is None
    if fused_unary_func is None:
Can you document fused_unary_func, please?
It's documented in the public function that follows. For readability it would, I think, be better to swap the definitions of those two, but that would have increased the diff and masked the true diff.
sklearn_numba_dpex/kmeans/drivers.py (outdated)

    if (X_mean == 0).astype(int).sum() == len(X_mean):
        X_mean = None
Can you document why None is conventionally used in this case?
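For illustration, a hedged guess at the kind of comment being requested (that None acts as a "skip the centering kernels" sentinel is an assumption inferred from context, not confirmed by the PR):

    if (X_mean == 0).astype(int).sum() == len(X_mean):
        # All feature means are zero, i.e. X is already centered. Use None as
        # a sentinel so downstream steps can skip the kernels that subtract
        # X_mean before Lloyd and add it back afterwards.
        X_mean = None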
sklearn_numba_dpex/kmeans/drivers.py (outdated)

    n_features * n_samples,
    None,
    max_work_group_size,
Suggested change:

    size0=n_features * n_samples,
    size1=None,
    work_group_size=max_work_group_size,
sklearn_numba_dpex/kmeans/drivers.py (outdated)

    def _minus(x, y):
        return x - y


    def _plus(x, y):
        return x + y
Can you group and document those helper functions for ops?
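For instance, a possible grouping (a sketch of the requested documentation, not the PR's actual code):

    # Elementwise binary ops passed as `ops` to the kernel factories. Grouped
    # together so the set of supported ops is easy to audit and extend.
    def _minus(x, y):
        """Return x - y."""
        return x - y


    def _plus(x, y):
        """Return x + y."""
        return x + y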
sklearn_numba_dpex/common/kernels.py (outdated)

    # NB: inplace. Optimized for C-contiguous arrays and for
    # size1 >> preferred_work_group_size_multiple
    @dpex.kernel
    def broadcast_ops(left_operand_array, right_operand_vector):
I think it's worth indicating that the left operand is modified in place and that the right one isn't modified at all.
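For example, something like this (a documentation sketch, wording hypothetical):

    @dpex.kernel
    def broadcast_ops(left_operand_array, right_operand_vector):
        """Apply the binary op between the 2D array and the 1D vector,
        broadcasting the vector along axis 1.

        `left_operand_array` is modified in place; `right_operand_vector` is
        only read and never modified.
        """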
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Pushed your suggestions, and also amended some changes to enable compute follows data (the preferred device for compute is the device that stores the data, if the data is already on-device).
LGTM up to a few suggestions and questions.
Thank you, @fcharras!
sklearn_numba_dpex/kmeans/engine.py
Outdated
# future instances. It is only used for testing purposes, using | ||
# `sklearn_numba_dpex.testing.config.override_attr_context` context, for instance | ||
# in the benchmark script. | ||
# For normal usage, the compute will follow the __compute_follow_data__ principle. |
Is "__compute_follow_data__"
an emphasis, here?
Do you have a reference for this concept?
    # NB: numba_dpex kernels currently only support working with C memory layout
    # (see https://github.com/IntelPython/numba-dpex/issues/767) but our KMeans
    # implementation is hypothesized to be more efficient with the F memory
    # layout. As a workaround, the kernels work with the transpose of X, X_t,
    # where X_t is created with a C layout, which results in memory access
    # patterns equivalent to those of an F layout for X.
    # TODO: when numba_dpex supports inputs with F layout:
    # - use X rather than X_t and adapt the codebase (better for readability
    #   and more consistent with sklearn notations)
    # - test the performance with both layouts and use the best performing one.
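A standalone numpy sketch of the layout equivalence described above (illustrative only, not part of the PR):

    import numpy as np

    n_samples, n_features = 4, 3
    X = np.arange(n_samples * n_features, dtype=np.float64).reshape(
        n_samples, n_features
    )

    # C-layout transpose: rows of X_t are features, columns are samples.
    X_t = np.ascontiguousarray(X.T)
    X_f = np.asfortranarray(X)

    # Stepping along the samples axis moves by one element in memory in both
    # cases, so a C-layout X_t yields the same access pattern as an F-layout X.
    assert X_t.strides[1] == X.itemsize
    assert X_f.strides[0] == X.itemsize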
Thanks!
sklearn_numba_dpex/common/kernels.py (outdated)

    )
    # subsequent kernel calls only sum the data.
    nofunc_kernel = _make_partial_sum_reduction_2d_axis1_kernel(
        n_rows, work_group_size, None, dtype
Suggested change:

    n_rows, work_group_size, fused_unary_func=None, dtype=dtype
sklearn_numba_dpex/kmeans/drivers.py (outdated)

    if (X_mean == 0).astype(int).sum() == len(X_mean):
Suggested change:

    X_mean_is_zeroed = (X_mean == 0).astype(int).sum() == len(X_mean)
    if X_mean_is_zeroed:
@@ -401,6 +404,80 @@ def _relocate_empty_clusters(
    )


    def prepare_data_for_lloyd(X_t, init, tol, copy_x):
Can you document prepare_data_for_lloyd, indicating that this is centering X for numerical stability?
✔️ mostly ported sklearn's docstring
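A sketch of what that docstring might look like (hypothetical wording, loosely adapted from sklearn's KMeans documentation):

    def prepare_data_for_lloyd(X_t, init, tol, copy_x):
        """Center X (subtract its per-feature mean) before running Lloyd.

        Centering the data improves the numerical stability of the distance
        computations; the mean is added back to the centers afterwards, so
        the results are unchanged. If copy_x is True, X_t is copied first so
        that the caller's data is never modified in place.
        """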
    # NB: sampling without replacement must be executed sequentially so
    # it's better done on CPU
👍
sklearn_numba_dpex/kmeans/engine.py (outdated)

    # Poor man's fancy indexing
    # TODO: write a kernel ? or replace with better equivalent when available ?
    centers_t = dpt.concat(
        [dpt.expand_dims(X[center_idx], axes=1) for center_idx in centers_idx],
        axis=1,
    )
Should we open an issue on dpctl? What do you think?
Actually, take has been added recently to the Array API, dpctl issue.
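If/when dpctl exposes it, the workaround above could presumably shrink to something like this (hypothetical, assuming an Array API conforming take):

    # hypothetical: select the rows of X listed in centers_idx in one call,
    # then transpose to the (n_features, n_centers) shape used here (an extra
    # copy may still be needed if a C-contiguous result is required)
    centers_t = dpt.take(X, centers_idx, axis=0).T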
sklearn_numba_dpex/kmeans/engine.py (outdated)

    use_uniform_weights = (sample_weight == sample_weight[0]).astype(
        int
    ).sum() == len(sample_weight)
Suggested change:

    use_uniform_weights = (
        (sample_weight == sample_weight[0]).astype(int).sum()
        == len(sample_weight)
    )
black doesn't like this, but it does accept:

    use_uniform_weights = (
        (sample_weight == sample_weight[0]).astype(int).sum()
    ) == len(sample_weight)

I'll go for it.
    def get_labels(self, X, sample_weight):
        labels, _ = self._get_labels_inertia(X, with_inertia=False)
        # TODO: sample_weight actually not used for get_labels. Fix in sklearn ?
Is there an issue already open for it?
Done in scikit-learn/scikit-learn#25066
sklearn_numba_dpex/kmeans/engine.py (outdated)

        sample_weight = dpt.ones(n_samples, dtype=dtype, device=device)
    elif isinstance(sample_weight, numbers.Number):
        sample_weight = dpt.full(n_samples, 1, dtype=dtype, device=device)
What are the reasons not to have only one branch here?
A mistake! The latter line should read:

Suggested change:

    sample_weight = dpt.full(n_samples, sample_weight, dtype=dtype, device=device)
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Thanks for the review @jjerphan, I've pushed your suggestions and answered the other comments. I'll go ahead and merge when the pipeline is green.
… into accept_dpnp_dpt_inputs
I think I'm happy with the result. The code is a little messy, but not more so than sklearn's, and I think the PR basically succeeds in mimicking almost all aspects of sklearn's UX regarding what inputs are accepted and how they are cast under the hood, and also its performance regarding cautiousness with memory copies, while additionally adding support for dpnp.ndarray and dpt.usm_ndarray inputs and recycling as much of the code that already exists in sklearn as possible. The PR still needs some polishing, and a few tests with dpnp / dpt tensors.

I think one (minor) difference is that, while sklearn tries to convert numpy arrays with object dtype to float64, our engine will error out in this case because dpt.asarray refuses object dtype as input. These kinds of edge cases are complicated to unify because different array libraries can make different choices, and I think it's fine to let it fail in this case.