Pandas support & support for applying transformations configured in sklearn.pipeline #105
Conversation
… option for transforming data in learner
…nctionality and pandas support; small fixes
…on pipeline to prevent weird handling of last transformation pipe, which is usually expected to be an estimator
Codecov Report
```
@@            Coverage Diff             @@
##              dev     #105      +/-   ##
==========================================
- Coverage   97.20%   96.68%    -0.52%
==========================================
  Files          31       31
  Lines        1644     1780      +136
==========================================
+ Hits         1598     1721      +123
- Misses         46       59       +13
==========================================
```
Continue to review full report at Codecov.
Hi! So sorry for being very late with the review, I am very busy with other projects :( I am starting the review today and will hopefully finish it soon.

@cosmic-cortex Any update on the PR?

Didn't finish the review yet, but I am not sure about the query functions returning only the indices and not the instances. The behavior itself would be fine, but this is a code-breaking change. I'll need to give it a bit more consideration.
However, I don't feel familiar enough with the codebase to know whether my approach is the most optimal. Please suggest any enhancements / alternative approaches and I can work on them.
I have finally reviewed the PR, great work! Pandas support is great, and I am very happy that it is solved finally.
I only have two small issues, can you check them?
I am going to take care of the documentation (there are only a few modifications we need), we can merge the PR during the weekend and I'll release a new version for modAL in PyPI!
modAL/utils/data.py
Outdated
```python
def retrieve_rows(X: modALinput,
                  I: Union[int, List[int], np.ndarray]) -> Union[sp.csc_matrix, np.ndarray, pd.DataFrame]:
    """
    Returns the rows I from the data set X
    """
    if isinstance(X, pd.DataFrame):
        return X.iloc[I]

    return X[I]


def drop_rows(X: modALinput,
              I: Union[int, List[int], np.ndarray]) -> Union[sp.csc_matrix, np.ndarray, pd.DataFrame]:
    if isinstance(X, pd.DataFrame):
        return X.drop(I, axis=0)

    return np.delete(X, I, axis=0)


def enumerate_data(X: modALinput):
    if isinstance(X, pd.DataFrame):
        return X.iterrows()

    return enumerate(X)
```
These functions don't work with sparse matrices from `scipy.sparse`. Can you add support for these?
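One possible way to add the requested `scipy.sparse` support is row indexing with a boolean mask, which CSR matrices handle efficiently. This is only a sketch of the idea; the standalone function names `retrieve_rows_sparse` and `drop_rows_sparse` are illustrative, not part of modAL's API:

```python
import numpy as np
import scipy.sparse as sp

def retrieve_rows_sparse(X, I):
    """Return the rows I from a scipy.sparse matrix (CSR supports row fancy-indexing)."""
    return X[I]

def drop_rows_sparse(X, I):
    """Return X without the rows in I, using a boolean row mask."""
    mask = np.ones(X.shape[0], dtype=bool)
    mask[I] = False
    return X[mask]

X = sp.csr_matrix(np.arange(12).reshape(4, 3))
print(retrieve_rows_sparse(X, [1, 3]).shape)  # (2, 3)
print(drop_rows_sparse(X, [0, 2]).shape)      # (2, 3)
```

For `csc_matrix` input, converting to CSR first (`X.tocsr()`) keeps the row operations fast.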
modAL/expected_error.py
Outdated
```diff
 # estimate the expected error
 for y_idx, y in enumerate(possible_labels):
-    X_new = data_vstack((learner.X_training, np.expand_dims(x, axis=0)))
+    X_new = data_vstack((learner.X_training, [x]))
```
It seems that `[x]` doesn't work instead of `np.expand_dims(x, axis=0)`, since `[x]` just returns a list.
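The reviewer's point is easy to demonstrate with plain NumPy: `np.expand_dims` produces a real 2-D array, while wrapping in a list gives a plain Python object with no array interface, so any helper that dispatches on type (ndarray vs. DataFrame vs. sparse matrix) will mishandle it.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])

# np.expand_dims yields a proper 2-D row vector of shape (1, 3) ...
row = np.expand_dims(x, axis=0)
print(row.shape)  # (1, 3)

# ... while [x] is just a Python list: no .shape, no array interface.
wrapped = [x]
print(isinstance(wrapped, list), hasattr(wrapped, 'shape'))  # True False

# np.vstack happens to coerce both, but isinstance-based dispatch does not.
print(np.vstack((row, row)).shape)  # (2, 3)
```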
Thanks for the review, I'll fix the issues today.
… issues from code review
@cosmic-cortex I refactored the methods as suggested in the code review. I also wondered why the test cases exhaustively test over similar inputs, e.g.

```python
for n_pool, n_features, n_classes in product(range(5, 10), range(1, 5), range(2, 5)):
    ...
```

On my machine, running the whole test suite takes over 2 minutes. The time would perhaps be better invested in testing various input data types, active learner parameters, etc.
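One alternative to the exhaustive grid (a suggestion, not part of the PR) is to keep the boundary combinations and add a small random sample of the interior, which preserves edge-case coverage at a fraction of the cost:

```python
import random
from itertools import product

# The exhaustive grid from the test suite: 5 * 4 * 3 = 60 combinations.
grid = list(product(range(5, 10), range(1, 5), range(2, 5)))
print(len(grid))  # 60

# Cheaper: the two extreme corners plus a seeded random sample of 5.
random.seed(0)
cases = [grid[0], grid[-1]] + random.sample(grid, k=5)
print(len(cases))  # 7
```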
I have reviewed your modifications, thanks! You have done an amazing job, I am really glad about these two features! It took me a while to make some time for the review, sorry for that. Once I have merged it in, I'll release the next version of modAL and upload it to PyPI.
@BoyanH @cosmic-cortex Are there any examples or documentation on how the new feature can be used? It would be really helpful if either of these could be provided.
@nawabhussain `pandas.DataFrame` is now supported as dataset input, as long as the provided classifier supports the format. Query strategies which compute metrics on the data, however, cannot work directly on data frames, since these can contain unsupported data types. For example, how would one compute the distance between two sentences? For that, …

Here is an example of a heterogeneous dataset containing both textual and numeric features. We create an sklearn pipeline which transforms the textual features into word frequency vectors and normalizes the numerical features.

```python
import modAL
from modAL.batch import uncertainty_batch_sampling
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

n_samples = 10
n_features = 5
query_strategy = uncertainty_batch_sampling

# Create a random dataset of numerical features
X_pool = np.random.rand(n_samples, n_features)

# Store dataset as pandas dataframe and add a feature column with textual data
X_pool = pd.DataFrame(X_pool)
X_pool['text'] = pd.Series(['This is a sentence.' for _ in range(n_samples)])
y_pool = np.random.randint(0, 2, size=(n_samples,))

train_idx = np.random.choice(range(n_samples), size=2, replace=False)

learner = modAL.models.learners.ActiveLearner(
    estimator=make_pipeline(
        ColumnTransformer(transformers=[
            # Texts are transformed into word frequency vectors
            ('text_transform', CountVectorizer(), 'text'),
            # Numerical data is normalized
            ('numerical_transform', Normalizer(), [c for c in X_pool.columns if c != 'text'])
        ]),
        RandomForestClassifier(n_estimators=3)
    ),
    query_strategy=query_strategy,
    X_training=X_pool.iloc[train_idx],
    y_training=y_pool[train_idx],
    # IMPORTANT! This tells modAL transformations are to be applied before computing metrics on data.
    # The sklearn transformations are going to be automatically extracted from the pipeline.
    on_transformed=True
)

query_idx, query_inst = learner.query(X_pool)
learner.teach(X_pool.iloc[query_idx], y_pool[query_idx])
```

TL;DR

For many query strategies, pandas input just works now. If it doesn't in your case, you might need to create an …
@BoyanH Thank you very much for the quick reply. I noticed something while I was experimenting with the new feature, and I confirmed the same behaviour with the sample code that you provided. I am not sure whether this is a bug or not.

The second call to the function `teach` will give an error.

When you try the same code with `max_features` specified for `TfidfVectorizer`, it works.
@nawabhussain Can you provide your exact code? The one in your last response wasn't complete (missing arguments in …).
UPDATE: To speed things up when handling large amounts of data, I store the transformed training data on each `teach()` call. This sounded like a good idea at the time, but now I realize that when the learned transformations change over time (e.g. another feature column is added for a new word by the `TfidfVectorizer`), the newly transformed examples might not have the same shape as the previous ones. Furthermore, one would generally want the newly learned representations to be used for instance selection. This is a mistake I made and will fix it soon. However, you could use the …
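The shape mismatch described above can be illustrated with plain NumPy; the dimensions here are hypothetical stand-ins for the vectorizer's old and new vocabulary sizes:

```python
import numpy as np

# Cached transformed training data, produced with the old vocabulary: 3 columns.
X_cached = np.zeros((4, 3))

# After the vectorizer learns a new word, freshly transformed samples have 4 columns.
X_new = np.zeros((2, 4))

# Stacking the cached block with the new block fails with a shape mismatch.
try:
    np.vstack((X_cached, X_new))
    raised = False
except ValueError:
    raised = True
print(raised)  # True
```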
@BoyanH Do you still need the code to reproduce the error?

No thank you, I found my error.
Most notable changes

- Query strategies now return only the indices of the selected instances; the `query` method then includes the instances themselves
- New `on_transformed` parameter for learners; when True and the estimator uses `sklearn.pipeline`, the transformations configured in that pipeline are applied before calculating metrics on the data set

Note
@cosmic-cortex , after playing around with your code, I must say you have created a great library! I am open to discussion to get this functionality merged, but please don't feel any pressure to do so if you are not satisfied with the implementation. I just needed to resolve #104 for my project and my fork is now sufficient for my needs.
Note 2
Not sure where this functionality should be addressed in the docs.