Batch transform sparse matrix on Scikit-learn model #2359
Replies: 1 comment
-
Reference: 0414058987 Hi @ivankeller, Thank you for the clear and concise description!
Unfortunately it looks like there isn't support for 'application/x-recordio-protobuf' within the predefined scikit-learn container. There are a few issues that reference this as well.
I'll reach out to the corresponding team to give guidance.
split_type='RecordIO' is the recommended split_type for 'application/x-recordio-protobuf' (see https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-inference.html#cm-batch). Unfortunately, it seems that container doesn't support that input yet :(.
I recommend providing an input_fn in your script if the default_input_fn doesn't fit your use case.
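For instance, here is a minimal sketch of such an input_fn, assuming the entry-point script can import sagemaker.amazon.common and that each record carries one sparse row in the SageMaker sparse-tensor protobuf layout (both are assumptions on my part, not a confirmed recipe):

```python
import io

import scipy.sparse as sp
from sagemaker.amazon.common import read_records


def input_fn(input_data, content_type):
    """Hypothetical deserializer: RecordIO-protobuf payload -> scipy CSR matrix."""
    if content_type == "application/x-recordio-protobuf":
        records = read_records(io.BytesIO(input_data))
        rows, cols, data = [], [], []
        num_cols = 0
        for i, record in enumerate(records):
            # Assumed layout: keys = column indices, values = cell values,
            # shape[0] = number of columns of the original matrix.
            tensor = record.features["values"].float32_tensor
            if tensor.shape:
                num_cols = int(tensor.shape[0])
            rows.extend([i] * len(tensor.keys))
            cols.extend(tensor.keys)
            data.extend(tensor.values)
        return sp.csr_matrix((data, (rows, cols)), shape=(len(records), num_cols))
    raise ValueError("Unsupported content type: {}".format(content_type))
```

Note that the container still has to accept the content type before input_fn is even invoked, which is the blocking issue mentioned above.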
I am not a scikit-learn expert, so I am not too sure. I'll reach out to the corresponding team for guidance. Based on the details provided, it looks like this isn't an issue with the Python SDK or the batch transform service itself.
-
Reference: 0414058987
I reproduce here a question I submitted on Stack Overflow (https://stackoverflow.com/questions/58410583/batch-transform-sparse-matrix-with-aws-sagemaker-python-sdk):
I have successfully trained a Scikit-Learn LSVC model with AWS SageMaker.
I want to make batch predictions (aka batch transform) on a relatively big dataset, which is a scipy sparse matrix of shape 252772 x 185128. (The number of features is high because of one-hot encoding of bag-of-words and n-gram features.)
I struggle because of:
- the size of the data
- the format of the data
I did several experiments to check what was going on:
1. Predict locally on sample sparse matrix data
It works
Deserialize the model artifact locally on a SageMaker notebook and predict on a sample of the sparse matrix.
This was just to check that the model can predict on this kind of data.
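A minimal sketch of that local check (assuming the model was saved with joblib and the sample matrix with scipy; file names are placeholders):

```python
import tarfile

import joblib
import scipy.sparse as sp

# Unpack the model artifact downloaded from S3 (placeholder file name).
with tarfile.open("model.tar.gz") as tar:
    tar.extractall(path="model_dir")

# Load the trained LSVC model and predict on a small sparse sample.
model = joblib.load("model_dir/model.joblib")
X_sample = sp.load_npz("sample_sparse.npz")  # a few rows of the 252772 x 185128 matrix
predictions = model.predict(X_sample)
```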
2. Batch Transform on sample CSV data
It works
Launch a Batch Transform job on SageMaker to transform a small sample in dense CSV format: it works but obviously does not scale.
The code is:
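A sketch of that setup with the SageMaker Python SDK (model location, entry-point name, framework version and instance settings are placeholder values, not my exact configuration):

```python
from sagemaker import get_execution_role
from sagemaker.sklearn.model import SKLearnModel

role = get_execution_role()

# Placeholder model artifact and entry-point script (the script defines model_fn).
sklearn_model = SKLearnModel(
    model_data="s3://my-bucket/path/to/model.tar.gz",
    role=role,
    entry_point="script.py",
    framework_version="0.20.0",
)

transformer = sklearn_model.transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# batch_data is the S3 path of the CSV file (placeholder).
batch_data = "s3://my-bucket/batch-input/sample.csv"
transformer.transform(batch_data, content_type="text/csv", split_type="Line")
transformer.wait()
```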
where:
- model_fn is defined in the entry-point script to deserialize the model artifact,
- batch_data is the S3 path of the CSV file.

3. Batch Transform of a sample dense numpy dataset
It works
I prepared a sample of the data and saved it to S3 in Numpy .npy format. According to this documentation, the SageMaker Scikit-learn model server can deserialize NPY-formatted data (along with JSON and CSV data). The only difference from the previous experiment (2) is the argument content_type='application/x-npy' in transformer.transform(...).
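A sketch of that call, reusing a transformer like the one in experiment 2 (the S3 path is a placeholder):

```python
# Same transformer as in experiment 2; only the content type changes.
transformer.transform(
    "s3://my-bucket/batch-input/sample.npy",  # placeholder path to the .npy file
    content_type="application/x-npy",
)
transformer.wait()
```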
This solution does not scale and we would like to pass a Scipy sparse matrix:
4. Batch Transform of a big sparse matrix.
Here is the problem
SageMaker Python SDK does not support sparse matrix format out of the box.
Following this, I used write_spmatrix_to_sparse_tensor to write the data to protobuf format on S3. The function I used is shown below, followed by the code used for launching the batch transform job.
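A sketch of both steps (bucket, key, the X_sparse variable and instance settings are placeholders; write_spmatrix_to_sparse_tensor is the helper from sagemaker.amazon.common):

```python
import io

import boto3
import sagemaker.amazon.common as smac


def write_sparse_to_s3(spmatrix, bucket, key):
    """Serialize a scipy sparse matrix to RecordIO-protobuf and upload it to S3."""
    buf = io.BytesIO()
    smac.write_spmatrix_to_sparse_tensor(buf, spmatrix)
    buf.seek(0)
    boto3.resource("s3").Bucket(bucket).Object(key).upload_fileobj(buf)


# X_sparse stands for the full 252772 x 185128 scipy sparse matrix.
write_sparse_to_s3(X_sparse, "my-bucket", "batch-input/data.pbr")
```

```python
# Launch the batch transform job on the protobuf input (settings are placeholders).
transformer = sklearn_model.transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
)
transformer.transform(
    "s3://my-bucket/batch-input/data.pbr",
    content_type="application/x-recordio-protobuf",
    split_type="RecordIO",
)
transformer.wait()
```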
I then get an error from the batch transform job saying that this content type is not allowed.
Questions:
(Reference doc for Transformer: https://sagemaker.readthedocs.io/en/stable/transformer.html)
- If content_type='application/x-recordio-protobuf' is not allowed, what should I use?
- Is split_type='RecordIO' the proper setting in this context?
- Should I provide an input_fn function in my script to deserialize the data?