Batch transform sparse matrix on Scikit-learn model #2359
Replies: 1 comment
-
Reference: 0414058987 Hi @ivankeller, Thank you for the clear and concise description!
Unfortunately it looks like there isn't support for 'application/x-recordio-protobuf' within the predefined scikit-learn container. There are a few issues that reference this as well.
I'll reach out to the corresponding team to give guidance.
split_type='RecordIO' is the recommended split_type for 'application/x-recordio-protobuf' (see https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-inference.html#cm-batch). Unfortunately, it seems that container doesn't support that input yet :(.
I recommend providing an input_fn in your script if the default_input_fn doesn't fit your use case.
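For instance, here is a minimal sketch of such an input_fn, assuming the entry-point script can import sagemaker.amazon.common and that each record carries one sparse row in the SageMaker sparse-tensor protobuf layout (both are assumptions on my part, not a confirmed recipe):

```python
import io

import scipy.sparse as sp
from sagemaker.amazon.common import read_records


def input_fn(input_data, content_type):
    """Hypothetical deserializer: RecordIO-protobuf payload -> scipy CSR matrix."""
    if content_type == "application/x-recordio-protobuf":
        records = read_records(io.BytesIO(input_data))
        rows, cols, data = [], [], []
        num_cols = 0
        for i, record in enumerate(records):
            # Assumed layout: keys = column indices, values = cell values,
            # shape[0] = number of columns of the original matrix.
            tensor = record.features["values"].float32_tensor
            if tensor.shape:
                num_cols = int(tensor.shape[0])
            rows.extend([i] * len(tensor.keys))
            cols.extend(tensor.keys)
            data.extend(tensor.values)
        return sp.csr_matrix((data, (rows, cols)), shape=(len(records), num_cols))
    raise ValueError("Unsupported content type: {}".format(content_type))
```

Note that the container still has to accept the content type before input_fn is even invoked, which is the blocking issue mentioned above.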
I am not a scikit-learn expert, so I am not too sure. I'll reach out to the corresponding team for guidance. Based on the details provided, it looks like this isn't an issue with the Python SDK or the batch transform service itself.
-
Reference: 0414058987
I reproduce here a question I submitted on Stack Overflow (https://stackoverflow.com/questions/58410583/batch-transform-sparse-matrix-with-aws-sagemaker-python-sdk):
I have successfully trained a Scikit-Learn LSVC model with AWS SageMaker.
I want to make batch predictions (aka batch transform) on a relatively big dataset, which is a scipy sparse matrix of shape 252772 x 185128. (The number of features is high because of one-hot encoding of bag-of-words and n-gram features.)
I struggle because of:
- the size of the data
- the format of the data
I did several experiments to check what was going on:
1. Predict locally on sample sparse matrix data
It works
Deserialize the model artifact locally on a SageMaker notebook and predict on a sample of the sparse matrix.
This was just to check that the model can predict on this kind of data.
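A minimal sketch of that local check (assuming the model was saved with joblib and the sample matrix with scipy; file names are placeholders):

```python
import tarfile

import joblib
import scipy.sparse as sp

# Unpack the model artifact downloaded from S3 (placeholder file name).
with tarfile.open("model.tar.gz") as tar:
    tar.extractall(path="model_dir")

# Load the trained LSVC model and predict on a small sparse sample.
model = joblib.load("model_dir/model.joblib")
X_sample = sp.load_npz("sample_sparse.npz")  # a few rows of the 252772 x 185128 matrix
predictions = model.predict(X_sample)
```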
2. Batch Transform on sample CSV data
It works
Launch a Batch Transform job on SageMaker to transform a small sample in dense CSV format: it works but obviously does not scale.
The code is:
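A sketch of that setup with the SageMaker Python SDK (model location, entry-point name, framework version and instance settings are placeholder values, not my exact configuration):

```python
from sagemaker import get_execution_role
from sagemaker.sklearn.model import SKLearnModel

role = get_execution_role()

# Placeholder model artifact and entry-point script (the script defines model_fn).
sklearn_model = SKLearnModel(
    model_data="s3://my-bucket/path/to/model.tar.gz",
    role=role,
    entry_point="script.py",
    framework_version="0.20.0",
)

transformer = sklearn_model.transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# batch_data is the S3 path of the CSV file (placeholder).
batch_data = "s3://my-bucket/batch-input/sample.csv"
transformer.transform(batch_data, content_type="text/csv", split_type="Line")
transformer.wait()
```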
where:
- model_fn is defined in the entry-point script to deserialize the model artifact,
- batch_data is the S3 path of the CSV file.

3. Batch Transform of a sample dense numpy dataset
It works
I prepared a sample of the data and saved it to S3 in Numpy .npy format. According to this documentation, the SageMaker Scikit-learn model server can deserialize NPY-formatted data (along with JSON and CSV data). The only difference from the previous experiment (2) is the argument content_type='application/x-npy' in transformer.transform(...).
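A sketch of that call, reusing a transformer like the one in experiment 2 (the S3 path is a placeholder):

```python
# Same transformer as in experiment 2; only the content type changes.
transformer.transform(
    "s3://my-bucket/batch-input/sample.npy",  # placeholder path to the .npy file
    content_type="application/x-npy",
)
transformer.wait()
```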
This solution does not scale and we would like to pass a Scipy sparse matrix:
4. Batch Transform of a big sparse matrix.
Here is the problem
SageMaker Python SDK does not support sparse matrix format out of the box.
Following this, I used write_spmatrix_to_sparse_tensor to write the data to protobuf format on S3. The function I used is shown below, followed by the code used for launching the batch transform job.
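A sketch of both steps (bucket, key, the X_sparse variable and instance settings are placeholders; write_spmatrix_to_sparse_tensor is the helper from sagemaker.amazon.common):

```python
import io

import boto3
import sagemaker.amazon.common as smac


def write_sparse_to_s3(spmatrix, bucket, key):
    """Serialize a scipy sparse matrix to RecordIO-protobuf and upload it to S3."""
    buf = io.BytesIO()
    smac.write_spmatrix_to_sparse_tensor(buf, spmatrix)
    buf.seek(0)
    boto3.resource("s3").Bucket(bucket).Object(key).upload_fileobj(buf)


# X_sparse stands for the full 252772 x 185128 scipy sparse matrix.
write_sparse_to_s3(X_sparse, "my-bucket", "batch-input/data.pbr")
```

```python
# Launch the batch transform job on the protobuf input (settings are placeholders).
transformer = sklearn_model.transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
)
transformer.transform(
    "s3://my-bucket/batch-input/data.pbr",
    content_type="application/x-recordio-protobuf",
    split_type="RecordIO",
)
transformer.wait()
```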
I then get an error from the batch transform job saying that this content type is not allowed.
Questions:
(Reference doc for Transformer: https://sagemaker.readthedocs.io/en/stable/transformer.html)
- If content_type='application/x-recordio-protobuf' is not allowed, what should I use?
- Is split_type='RecordIO' the proper setting in this context?
- Should I provide an input_fn function in my script to deserialize the data?