
Pandas support & support for applying transformations configured in sklearn.pipeline #105

Merged
merged 6 commits into from
Nov 1, 2020

Conversation

BoyanH

@BoyanH BoyanH commented Sep 24, 2020

Most notable changes

  • query strategies now return only the indices of the selected instances; the query method then adds the instances themselves
    • the old interface is still supported, but using it results in a deprecation warning
  • added an on_transformed parameter to learners; when True and the estimator uses sklearn.pipeline, the transformations configured in that pipeline are applied before calculating metrics on the data set
    • Committees also support this functionality, but as they have no X_training (it could be different for each of their learners), the training data cannot yet be transformed
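Under the new interface a custom query strategy only has to compute and return indices; retrieving the instances themselves is left to the query method. A minimal numpy-only sketch of the idea (margin_query is a hypothetical name, not modAL's actual implementation):

```python
import numpy as np

def margin_query(probabilities: np.ndarray, n_instances: int = 1) -> np.ndarray:
    """Pick the instances whose top-two class probabilities are closest.

    Returns only the *indices* of the selected instances; under the new
    interface the caller retrieves the rows itself (e.g. X[idx] or X.iloc[idx]).
    """
    # sort each row in descending order and take the gap between the two best classes
    sorted_probs = -np.sort(-probabilities, axis=1)
    margin = sorted_probs[:, 0] - sorted_probs[:, 1]
    # smallest margin = most uncertain
    return np.argsort(margin)[:n_instances]

probs = np.array([[0.90, 0.10],
                  [0.55, 0.45],
                  [0.60, 0.40]])
idx = margin_query(probs, n_instances=1)  # row 1 has the smallest margin
```

The deprecated behavior (returning both indices and instances) keeps working, but only the indices are needed now.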

Note

@cosmic-cortex , after playing around with your code, I must say you have created a great library! I am open to discussion to get this functionality merged, but please don't feel any pressure to do so if you are not satisfied with the implementation. I just needed to resolve #104 for my project and my fork is now sufficient for my needs.

Note2

Not sure where this functionality should be addressed in the docs.

@BoyanH BoyanH marked this pull request as ready for review September 28, 2020 14:09
Boyan Hristov added 2 commits September 29, 2020 15:39
…on pipeline to prevent weird handling of last transformation pipe, which is usually expected to be an estimator
@codecov-io

codecov-io commented Oct 8, 2020

Codecov Report

Merging #105 into dev will decrease coverage by 0.51%.
The diff coverage is 92.68%.


@@            Coverage Diff             @@
##              dev     #105      +/-   ##
==========================================
- Coverage   97.20%   96.68%   -0.52%     
==========================================
  Files          31       31              
  Lines        1644     1780     +136     
==========================================
+ Hits         1598     1721     +123     
- Misses         46       59      +13     
Impacted Files Coverage Δ
modAL/utils/data.py 81.69% <81.53%> (-12.06%) ⬇️
modAL/expected_error.py 94.28% <87.50%> (-0.16%) ⬇️
modAL/models/base.py 92.42% <90.00%> (-1.58%) ⬇️
modAL/multilabel.py 98.50% <92.85%> (-0.15%) ⬇️
modAL/acquisition.py 100.00% <100.00%> (ø)
modAL/batch.py 97.87% <100.00%> (+0.04%) ⬆️
modAL/disagreement.py 100.00% <100.00%> (ø)
modAL/models/learners.py 94.94% <100.00%> (+2.02%) ⬆️
modAL/uncertainty.py 100.00% <100.00%> (ø)
modAL/utils/combination.py 100.00% <100.00%> (ø)
... and 3 more

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ff7a52f...143067c. Read the comment docs.

@cosmic-cortex
Member

Hi! So sorry for being very late with the review, I am very busy with other projects :(

I am starting the review today and will hopefully finish it soon.

@nawabhussain

@cosmic-cortex Any update on the PR?

@cosmic-cortex
Member

Didn't finish the review yet, but I am not sure about the query functions returning only the indices and not the instances. The behavior itself would be fine, but this is a code-breaking change. I'll need to give it a bit more consideration.

@BoyanH
Author

BoyanH commented Oct 15, 2020

The query() method of learners still returns the instances, and old query strategies returning both indices and instances are still supported. This also simplifies the code somewhat, as query strategies no longer have to include X[query_idx] every time. I did my best not to break any projects depending on the package.

However, I don't feel familiar enough with the codebase to know whether my approach is optimal. Please suggest any enhancements / alternative approaches and I could work on these.

Member

@cosmic-cortex cosmic-cortex left a comment


I have finally reviewed the PR, great work! Pandas support is great, and I am very happy that it is solved finally.
I only have two small issues, can you check them?

I am going to take care of the documentation (there are only a few modifications we need), we can merge the PR during the weekend and I'll release a new version for modAL in PyPI!

Comment on lines 57 to 78
def retrieve_rows(X: modALinput,
                  I: Union[int, List[int], np.ndarray]) -> Union[sp.csc_matrix, np.ndarray, pd.DataFrame]:
    """
    Returns the rows I from the data set X
    """
    if isinstance(X, pd.DataFrame):
        return X.iloc[I]

    return X[I]


def drop_rows(X: modALinput,
              I: Union[int, List[int], np.ndarray]) -> Union[sp.csc_matrix, np.ndarray, pd.DataFrame]:
    if isinstance(X, pd.DataFrame):
        return X.drop(I, axis=0)

    return np.delete(X, I, axis=0)


def enumerate_data(X: modALinput):
    if isinstance(X, pd.DataFrame):
        return X.iterrows()

    return enumerate(X)
Member


These functions don't work with sparse matrices from scipy.sparse. Can you add support for these?
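One way the sparse case could be handled is to branch on sp.issparse and index a CSR view, since CSR supports efficient row indexing. This is only a sketch of the reviewer's request, not necessarily the code that was eventually merged:

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp

def retrieve_rows(X, I):
    """Return the rows I from X, including scipy.sparse matrices."""
    if sp.issparse(X):
        # CSR supports fancy row indexing; convert other sparse formats first
        return X.tocsr()[I]
    if isinstance(X, pd.DataFrame):
        return X.iloc[I]
    return X[I]

def drop_rows(X, I):
    """Return X without the rows I, including scipy.sparse matrices."""
    if sp.issparse(X):
        # keep every row except those in I via a boolean mask
        mask = np.ones(X.shape[0], dtype=bool)
        mask[I] = False
        return X.tocsr()[mask]
    if isinstance(X, pd.DataFrame):
        return X.drop(I, axis=0)
    return np.delete(X, I, axis=0)
```

The same issparse check would also cover enumerate_data, e.g. by iterating over the CSR rows.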

# estimate the expected error
for y_idx, y in enumerate(possible_labels):
    X_new = data_vstack((learner.X_training, np.expand_dims(x, axis=0)))  # before
    X_new = data_vstack((learner.X_training, [x]))                        # after (this PR)
Member


It seems that [x] doesn't work instead of np.expand_dims(x, axis=0), since [x] just returns a list.
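The difference is easy to check with plain numpy; [x] produces no array at all, so code that dispatches on the input's type (as data_vstack presumably does) has nothing sensible to stack:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])

row = np.expand_dims(x, axis=0)  # ndarray of shape (1, 3): a proper 2-D row
as_list = [x]                    # a plain Python list containing one array

print(row.shape)      # (1, 3)
print(type(as_list))  # <class 'list'>
```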

@BoyanH
Author

BoyanH commented Oct 16, 2020

Thanks for the review, I'll fix the issues today.

@BoyanH
Author

BoyanH commented Oct 16, 2020

@cosmic-cortex I refactored the methods in utils.data to consistently check all supported types in the same sequence (sp.csr_matrix, pd.DataFrame, np.ndarray, list). It turned out expected_error_reduction didn't fully support sparse matrices before, as len(X) is not defined for these. We need to make sure there are no similar errors in the other strategies, but that can perhaps come as a further merge request. I am also not totally happy with the functionality of utils.data.enumerate_data, as its output is data type dependent (check docstring).

I also wondered why test cases exhaustively test over similar inputs, e.g.

for n_pool, n_features, n_classes in product(range(5, 10), range(1, 5), range(2, 5)):
    ...

On my machine, running the whole test suite takes over 2 minutes. The time would perhaps be better invested in testing various input data types, active learner parameters, etc.

@BoyanH BoyanH requested a review from cosmic-cortex October 20, 2020 14:48
@cosmic-cortex
Member

I have reviewed your modifications, thanks! You have done an amazing job, I am really glad about these two features!

It took me a while to make some time for the review, sorry for that. Once I have merged it in, I'll release the next version of modAL and upload it to PyPI.

@cosmic-cortex cosmic-cortex merged commit a2b7c83 into modAL-python:dev Nov 1, 2020
@nawabhussain

@BoyanH @cosmic-cortex Are there any examples or documentation on how the new feature can be used? It would be really helpful if either of these could be provided.

@BoyanH
Author

BoyanH commented Nov 4, 2020

@nawabhussain pandas.DataFrame is now supported as dataset input, as long as the provided classifier supports the format. Query strategies which compute metrics on the data, however, cannot work directly on data frames, since they can contain unsupported data types. For example, how would one compute the distance between two sentences? For that, the on_transformed flag was added to learners, so that such strategies first transform the data with the same pipeline the estimator uses and then compute metrics on that data.

Here is an example with a heterogeneous dataset, containing both textual and numeric features. We create an sklearn pipeline which transforms the textual features to word frequency vectors and normalizes the numerical features.

import modAL
from modAL.batch import uncertainty_batch_sampling
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

n_samples = 10
n_features = 5
query_strategy = uncertainty_batch_sampling

# Create a random dataset of numerical features
X_pool = np.random.rand(n_samples, n_features)

# Store dataset as pandas dataframe and add a feature column with textual data
X_pool = pd.DataFrame(X_pool)
X_pool['text'] = pd.Series(['This is a sentence.' for _ in range(10)])

y_pool = np.random.randint(0, 2, size=(n_samples,))
train_idx = np.random.choice(range(n_samples), size=2, replace=False)


learner = modAL.models.learners.ActiveLearner(
    estimator=make_pipeline(
        ColumnTransformer(transformers=[
            # Texts are transformed into word frequency vectors
            ('text_transform', CountVectorizer(), 'text'),

            # Numerical data is normalized
            ('numerical_transform', Normalizer(), [c for c in X_pool.columns if c != 'text'])
        ]),
        RandomForestClassifier(n_estimators=3)
    ),
    query_strategy=query_strategy,
    X_training=X_pool.iloc[train_idx],
    y_training=y_pool[train_idx],

    # IMPORTANT! This tells modAL transformations are to be applied before computing metrics on data
    # The sklearn transformations are going to be automatically extracted from the pipeline.
    on_transformed=True
)
query_idx, query_inst = learner.query(X_pool)
learner.teach(X_pool.iloc[query_idx], y_pool[query_idx])

TL;DR

For many query strategies, pandas input just works now. If it doesn't in your case, you might need to create an sklearn.pipeline.Pipeline which transforms your data to numeric features and use on_transformed=True.

@nawabhussain

nawabhussain commented Nov 4, 2020

@BoyanH Thank you very much for the quick reply. I noticed something while I was experimenting with the new feature. I confirmed the same behaviour with the sample code that you provided. I am not sure whether this is a bug or not.
Try the code below.

import modAL
from modAL.batch import uncertainty_batch_sampling
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

n_samples = 10
n_features = 5
query_strategy = uncertainty_batch_sampling

X_pool = np.random.rand(n_samples, n_features)
X_pool = pd.DataFrame()
X_pool["text"] = pd.Series(sentences)
y_pool = np.random.randint(0, 2, size=(n_samples,))
train_idx = np.random.choice(range(n_samples), size=2, replace=False)

learner = modAL.models.learners.ActiveLearner(
    estimator=make_pipeline(
        ColumnTransformer(transformers=[
            ("text_transform", TfidfVectorizer(ngram_range=(1, 3)), "text"),
            ("numerical_transform", Normalizer(), [c for c in X_pool.columns if c != "text"])
        ]),
        RandomForestClassifier(n_estimators=3)
    ),
    query_strategy=query_strategy,
    X_training=X_pool.iloc[train_idx],
    y_training=y_pool[train_idx],
    on_transformed=True
)
query_idx, query_inst = learner.query(X_pool)
learner.teach(X_pool.iloc[query_idx[:5]], y_pool[query_idx[:5]])
learner.teach(X_pool.iloc[query_idx[5:]], y_pool[query_idx[5:]])

The second call to the function teach will give an error:

ValueError: the dimensions of the new training data and label must agree with the training data and labels provided so far

When you try the same code with max_features specified for TfidfVectorizer, it works.

@BoyanH
Author

BoyanH commented Nov 4, 2020

@nawabhussain Can you provide your exact code? The one in your last response wasn't complete (missing arguments in X_pool = pd.DataFrame(), missing imports, etc.). After fixing these issues, I don't get any errors. Also, it might be better to open another issue at this point, since this one wasn't intended for any special use-case and is already closed.

@BoyanH
Author

BoyanH commented Nov 4, 2020

UPDATE: To speed things up when handling large amounts of data, I store the transformed training data on each teach() call. This sounded like a good idea at the time, but now I realize that when the learned transformations change over time (e.g. another feature column is added for a new word by the TfidfVectorizer), the newly transformed examples might not be of the same shape as the previous ones. Furthermore, one would generally want the newly learned representations to be used for the instance selection.

This is a mistake I made and I will fix it soon. However, you could use the fit() method instead of teach(), providing it learner.X_training, learner.y_training, and your new data to overcome the issue.
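The fit() workaround could be sketched roughly as follows. combine_for_refit is a hypothetical helper, assuming pandas inputs and that learner.X_training / learner.y_training hold the untransformed data provided so far:

```python
import numpy as np
import pandas as pd

def combine_for_refit(X_training, y_training, X_new, y_new):
    """Stack the stored training data with a newly labeled batch.

    Passing the result to learner.fit() re-fits the whole pipeline, so
    learned transformations (e.g. a TfidfVectorizer vocabulary) are rebuilt
    and the shape mismatch from incremental teach() calls is avoided.
    """
    X_all = pd.concat([X_training, X_new], ignore_index=True)
    y_all = np.concatenate([y_training, y_new])
    return X_all, y_all

# usage sketch:
# X_all, y_all = combine_for_refit(learner.X_training, learner.y_training, X_batch, y_batch)
# learner.fit(X_all, y_all)
```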

@nawabhussain

@BoyanH Do you still need the code to reproduce the error?

@BoyanH
Author

BoyanH commented Nov 5, 2020

No thank you, I found my error.
