This repository has been archived by the owner on Dec 4, 2019. It is now read-only.

Long Time to Collect Results of Distributed Spark-Sklearn Training #114

Open
wjohnson opened this issue Sep 18, 2019 · 1 comment

@wjohnson

I'm running 15 combinations of a logistic regression model with spark-sklearn. I can see that all tasks have completed, but it then takes a very long time to collect the results. My guess is that it's the number of coefficients being brought back to the driver; I've noticed this several times when working with wide datasets or deep random forests. Is this just expected network traffic?

Data set size: 31,358 rows × 10,000 columns

from sklearn.linear_model import LogisticRegression
from spark_sklearn import GridSearchCV

# 9 + 6 = 15 parameter combinations in total
param_grid = [
    dict(
        penalty=['l2'],
        C=[1.0, 0.5, 0.1],
        solver=['newton-cg', 'lbfgs', 'sag'],
    ),
    dict(
        penalty=['l1', 'elasticnet'],
        C=[1.0, 0.5, 0.1],
        solver=['saga'],
    ),
]

# spark-sklearn's GridSearchCV takes the SparkContext as its first argument
grid = GridSearchCV(sc, estimator=LogisticRegression(max_iter=500),
                    param_grid=param_grid, n_jobs=-1, cv=5)
grid_result = grid.fit(X_train, y_train)
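
To sanity-check the suspicion about coefficient size, here's a minimal sketch (not part of the original run) that pickles an estimator fitted on a hypothetical dummy matrix of the same width, to approximate how many bytes one fitted model would ship back to the driver:

import pickle
import numpy as np
from sklearn.linear_model import LogisticRegression

# Fit on a small dummy matrix with the same number of columns (10,000);
# X_dummy / y_dummy are stand-ins, not the real data set.
X_dummy = np.random.rand(200, 10000)
y_dummy = np.random.randint(0, 2, size=200)
fitted = LogisticRegression(max_iter=10).fit(X_dummy, y_dummy)
print(f"~{len(pickle.dumps(fitted)) / 1024:.0f} KiB per fitted model")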

Environment:

  • Azure Databricks ML 5.5 Runtime
  • 9 worker nodes with 56 GB of RAM and 8 cores each
@srowen
Collaborator

srowen commented Sep 18, 2019

It could be; how big is the data set / model?
Is it definitely waiting on collecting the data, not fitting?
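
One way to check that (a hedged sketch reusing grid, X_train, and y_train from the snippet above) is to compare the wall-clock time of fit() against the per-task durations in the Spark UI:

import time

start = time.time()
grid_result = grid.fit(X_train, y_train)
print(f"total fit() wall time: {time.time() - start:.1f} s")
# If the Spark UI's Stages tab shows all tasks finishing well before
# fit() returns, the remaining time is spent collecting and assembling
# results on the driver rather than fitting.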
