
Pandas support & support for applying transformations configured in sklearn.pipeline #105

Merged
merged 6 commits into from
Nov 1, 2020

Conversation

BoyanH

@BoyanH BoyanH commented Sep 24, 2020

Most notable changes

  • query strategies now return only the indices of the selected instances; the query method then adds the instances themselves
    • the old interface is still supported, but using it results in a deprecation warning
  • added an on_transformed parameter to learners; when True and the estimator uses sklearn.pipeline, the transformations configured in that pipeline are applied before calculating metrics on the data set
    • Committees also support this functionality, but as they have no X_training (it could be different for each of their learners), the training data cannot yet be transformed
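Under the new interface a custom query strategy only has to compute and return indices; retrieving the instances themselves is left to the query method. A minimal numpy-only sketch of the idea (margin_query is a hypothetical name, not modAL's actual implementation):

```python
import numpy as np

def margin_query(probabilities: np.ndarray, n_instances: int = 1) -> np.ndarray:
    """Pick the instances whose top-two class probabilities are closest.

    Returns only the *indices* of the selected instances; under the new
    interface the caller retrieves the rows itself (e.g. X[idx] or X.iloc[idx]).
    """
    # sort each row in descending order and take the gap between the two best classes
    sorted_probs = -np.sort(-probabilities, axis=1)
    margin = sorted_probs[:, 0] - sorted_probs[:, 1]
    # smallest margin = most uncertain
    return np.argsort(margin)[:n_instances]

probs = np.array([[0.90, 0.10],
                  [0.55, 0.45],
                  [0.60, 0.40]])
idx = margin_query(probs, n_instances=1)  # row 1 has the smallest margin
```

The deprecated behavior (returning both indices and instances) keeps working, but only the indices are needed now.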

Note

@cosmic-cortex , after playing around with your code, I must say you have created a great library! I am open to discussion to get this functionality merged, but please don't feel any pressure to do so if you are not satisfied with the implementation. I just needed to resolve #104 for my project and my fork is now sufficient for my needs.

Note2

Not sure where this functionality should be addressed in the docs.

@BoyanH BoyanH marked this pull request as ready for review September 28, 2020 14:09
Boyan Hristov added 2 commits September 29, 2020 15:39
…on pipeline to prevent weird handling of last transformation pipe, which is usually expected to be an estimator
@codecov-io

codecov-io commented Oct 8, 2020

Codecov Report

Merging #105 into dev will decrease coverage by 0.51%.
The diff coverage is 92.68%.


@@            Coverage Diff             @@
##              dev     #105      +/-   ##
==========================================
- Coverage   97.20%   96.68%   -0.52%     
==========================================
  Files          31       31              
  Lines        1644     1780     +136     
==========================================
+ Hits         1598     1721     +123     
- Misses         46       59      +13     
Impacted Files Coverage Δ
modAL/utils/data.py 81.69% <81.53%> (-12.06%) ⬇️
modAL/expected_error.py 94.28% <87.50%> (-0.16%) ⬇️
modAL/models/base.py 92.42% <90.00%> (-1.58%) ⬇️
modAL/multilabel.py 98.50% <92.85%> (-0.15%) ⬇️
modAL/acquisition.py 100.00% <100.00%> (ø)
modAL/batch.py 97.87% <100.00%> (+0.04%) ⬆️
modAL/disagreement.py 100.00% <100.00%> (ø)
modAL/models/learners.py 94.94% <100.00%> (+2.02%) ⬆️
modAL/uncertainty.py 100.00% <100.00%> (ø)
modAL/utils/combination.py 100.00% <100.00%> (ø)
... and 3 more

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ff7a52f...143067c. Read the comment docs.

@cosmic-cortex
Member

Hi! So sorry for being very late with the review, I am very busy with other projects :(

I am starting the review today and will hopefully finish it soon.

@nawabhussain

@cosmic-cortex Any update on the PR?

@cosmic-cortex
Member

Didn't finish the review yet, but I am not sure about the query functions returning only the indices and not the instances. The behavior itself would be fine, but this is a code-breaking change. I'll need to give it a bit more consideration.

@BoyanH
Author

BoyanH commented Oct 15, 2020

The query() method of learners still returns the instances, and old query strategies returning both indices and instances are still supported. This also simplifies the code somewhat, as query strategies no longer have to include X[query_idx] every time. I did my best not to break any projects depending on the package.

However, I don't feel familiar enough with the codebase to know whether my approach is optimal. Please suggest any enhancements / alternative approaches and I could work on these.

Member

@cosmic-cortex cosmic-cortex left a comment


I have finally reviewed the PR, great work! Pandas support is great, and I am very happy that it is solved finally.
I only have two small issues, can you check them?

I am going to take care of the documentation (there are only a few modifications we need), we can merge the PR during the weekend and I'll release a new version for modAL in PyPI!

Comment on lines 57 to 78
def retrieve_rows(X: modALinput,
                  I: Union[int, List[int], np.ndarray]) -> Union[sp.csc_matrix, np.ndarray, pd.DataFrame]:
    """
    Returns the rows I from the data set X
    """
    if isinstance(X, pd.DataFrame):
        return X.iloc[I]

    return X[I]


def drop_rows(X: modALinput,
              I: Union[int, List[int], np.ndarray]) -> Union[sp.csc_matrix, np.ndarray, pd.DataFrame]:
    if isinstance(X, pd.DataFrame):
        return X.drop(I, axis=0)

    return np.delete(X, I, axis=0)


def enumerate_data(X: modALinput):
    if isinstance(X, pd.DataFrame):
        return X.iterrows()

    return enumerate(X)
Member


These functions don't work with sparse matrices from scipy.sparse. Can you add support for these?
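One way the sparse case could be handled is to branch on sp.issparse and index a CSR view, since CSR supports efficient row indexing. This is only a sketch of the reviewer's request, not necessarily the code that was eventually merged:

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp

def retrieve_rows(X, I):
    """Return the rows I from X, including scipy.sparse matrices."""
    if sp.issparse(X):
        # CSR supports fancy row indexing; convert other sparse formats first
        return X.tocsr()[I]
    if isinstance(X, pd.DataFrame):
        return X.iloc[I]
    return X[I]

def drop_rows(X, I):
    """Return X without the rows I, including scipy.sparse matrices."""
    if sp.issparse(X):
        # keep every row except those in I via a boolean mask
        mask = np.ones(X.shape[0], dtype=bool)
        mask[I] = False
        return X.tocsr()[mask]
    if isinstance(X, pd.DataFrame):
        return X.drop(I, axis=0)
    return np.delete(X, I, axis=0)
```

The same issparse check would also cover enumerate_data, e.g. by iterating over the CSR rows.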

# estimate the expected error
for y_idx, y in enumerate(possible_labels):
    X_new = data_vstack((learner.X_training, np.expand_dims(x, axis=0)))  # before
    X_new = data_vstack((learner.X_training, [x]))                        # after (this PR)
Member


It seems that [x] doesn't work instead of np.expand_dims(x, axis=0), since [x] just returns a list.
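The difference is easy to check with plain numpy; [x] produces no array at all, so code that dispatches on the input's type (as data_vstack presumably does) has nothing sensible to stack:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])

row = np.expand_dims(x, axis=0)  # ndarray of shape (1, 3): a proper 2-D row
as_list = [x]                    # a plain Python list containing one array

print(row.shape)      # (1, 3)
print(type(as_list))  # <class 'list'>
```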

@BoyanH
Author

BoyanH commented Oct 16, 2020

Thanks for the review, I'll fix the issues today.

@BoyanH
Author

BoyanH commented Oct 16, 2020

@cosmic-cortex I refactored the methods in utils.data to consistently check all supported types in the same sequence (sp.csr_matrix, pd.DataFrame, np.ndarray, list). It turned out expected_error_reduction didn't fully support sparse matrices before, as len(X) is not defined for these. We need to make sure there are no similar errors in the other strategies, but that can perhaps come as a further merge request. I am also not totally happy with the functionality of utils.data.enumerate_data, as its output is data type dependent (check docstring).

I also wondered why test cases exhaustively test over similar inputs, e.g.

for n_pool, n_features, n_classes in product(range(5, 10), range(1, 5), range(2, 5)):
    ...

On my machine, running the whole test suite takes over 2 minutes. The time would perhaps be better invested in testing various input data types, active learner parameters, etc.

@BoyanH BoyanH requested a review from cosmic-cortex October 20, 2020 14:48
@cosmic-cortex
Member

I have reviewed your modifications, thanks! You have done an amazing job, I am really glad about these two features!

It took me a while to make some time for the review, sorry for that. Once I have merged it in, I'll release the next version of modAL and upload it to PyPI.

@cosmic-cortex cosmic-cortex merged commit a2b7c83 into modAL-python:dev Nov 1, 2020
@nawabhussain

@BoyanH @cosmic-cortex Are there any examples or documentation on how the new feature can be used? It would be really helpful if either of these could be provided.

@BoyanH
Author

BoyanH commented Nov 4, 2020

@nawabhussain pandas.DataFrame is now supported as dataset input, as long as the provided classifier supports the format. Query strategies which compute metrics on the data, however, cannot work directly on data frames, since they can contain unsupported data types. For example, how would one compute the distance between two sentences? For that, the on_transformed flag was added to learners, so that such strategies first transform the data with the same pipeline the estimator uses and then compute metrics on that data.

Here is an example with a heterogeneous dataset, containing both textual and numeric features. We create an sklearn pipeline which transforms the textual features to word frequency vectors and normalizes the numerical features.

import modAL
from modAL.batch import uncertainty_batch_sampling
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

n_samples = 10
n_features = 5
query_strategy = uncertainty_batch_sampling

# Create a random dataset of numerical features
X_pool = np.random.rand(n_samples, n_features)

# Store dataset as pandas dataframe and add a feature column with textual data
X_pool = pd.DataFrame(X_pool)
X_pool['text'] = pd.Series(['This is a sentence.' for _ in range(10)])

y_pool = np.random.randint(0, 2, size=(n_samples,))
train_idx = np.random.choice(range(n_samples), size=2, replace=False)


learner = modAL.models.learners.ActiveLearner(
    estimator=make_pipeline(
        ColumnTransformer(transformers=[
            # Texts are transformed into word frequency vectors
            ('text_transform', CountVectorizer(), 'text'),

            # Numerical data is normalized
            ('numerical_transform', Normalizer(), [c for c in X_pool.columns if c != 'text'])
        ]),
        RandomForestClassifier(n_estimators=3)
    ),
    query_strategy=query_strategy,
    X_training=X_pool.iloc[train_idx],
    y_training=y_pool[train_idx],

    # IMPORTANT! This tells modAL transformations are to be applied before computing metrics on data
    # The sklearn transformations are going to be automatically extracted from the pipeline.
    on_transformed=True
)
query_idx, query_inst = learner.query(X_pool)
learner.teach(X_pool.iloc[query_idx], y_pool[query_idx])

TL;DR

For many query strategies, pandas input just works now. If it doesn't in your case, you might need to create an sklearn.pipeline.Pipeline which transforms your data to numeric features and use on_transformed=True.

@nawabhussain

nawabhussain commented Nov 4, 2020

@BoyanH Thank you very much for the quick reply. I noticed something while I was experimenting with the new feature. I confirmed the same behaviour with the sample code that you provided. I am not sure whether this is a bug or not.
Try the code below.

import modAL
from modAL.batch import uncertainty_batch_sampling
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

n_samples = 10
n_features = 5
query_strategy = uncertainty_batch_sampling

X_pool = np.random.rand(n_samples, n_features)
X_pool = pd.DataFrame()
X_pool["text"] = pd.Series(sentences)
y_pool = np.random.randint(0, 2, size=(n_samples,))
train_idx = np.random.choice(range(n_samples), size=2, replace=False)

learner = modAL.models.learners.ActiveLearner(
    estimator=make_pipeline(
        ColumnTransformer(transformers=[
            ("text_transform", TfidfVectorizer(ngram_range=(1, 3)), "text"),
            ("numerical_transform", Normalizer(), [c for c in X_pool.columns if c != "text"])
        ]),
        RandomForestClassifier(n_estimators=3)
    ),
    query_strategy=query_strategy,
    X_training=X_pool.iloc[train_idx],
    y_training=y_pool[train_idx],
    on_transformed=True
)
query_idx, query_inst = learner.query(X_pool)
learner.teach(X_pool.iloc[query_idx[:5]], y_pool[query_idx[:5]])
learner.teach(X_pool.iloc[query_idx[5:]], y_pool[query_idx[5:]])

The second call to the function teach will give an error:

ValueError: the dimensions of the new training data and label must agree with the training data and labels provided so far

When you try the same code with max_features specified for TfidfVectorizer, it works.

@BoyanH
Author

BoyanH commented Nov 4, 2020

@nawabhussain Can you provide your exact code? The one in your last response wasn't complete (missing arguments in X_pool = pd.DataFrame(), missing imports, etc.). After fixing these issues, I don't get any errors. Also, it might be better to open another issue at this point, since this one wasn't intended for any special use-case and is already closed.

@BoyanH
Author

BoyanH commented Nov 4, 2020

UPDATE: To speed things up when handling large amounts of data, I store the transformed training data on each teach() call. This sounded like a good idea at the time, but now I realize that when the learned transformations change over time (e.g. another feature column is added for a new word by the TfidfVectorizer), the newly transformed examples might not be of the same shape as the previous ones. Furthermore, one would generally want the newly learned representations to be used for the instance selection.

This is a mistake I made and I will fix it soon. However, you could use the fit() method instead of teach(), providing it learner.X_training, learner.y_training, and your new data to overcome the issue.
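The fit() workaround could be sketched roughly as follows. combine_for_refit is a hypothetical helper, assuming pandas inputs and that learner.X_training / learner.y_training hold the untransformed data provided so far:

```python
import numpy as np
import pandas as pd

def combine_for_refit(X_training, y_training, X_new, y_new):
    """Stack the stored training data with a newly labeled batch.

    Passing the result to learner.fit() re-fits the whole pipeline, so
    learned transformations (e.g. a TfidfVectorizer vocabulary) are rebuilt
    and the shape mismatch from incremental teach() calls is avoided.
    """
    X_all = pd.concat([X_training, X_new], ignore_index=True)
    y_all = np.concatenate([y_training, y_new])
    return X_all, y_all

# usage sketch:
# X_all, y_all = combine_for_refit(learner.X_training, learner.y_training, X_batch, y_batch)
# learner.fit(X_all, y_all)
```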

@nawabhussain

@BoyanH Do you still need the code to reproduce the error?

@BoyanH
Author

BoyanH commented Nov 5, 2020

No thank you, I found my error.
