Hey there, thanks for putting this together.

As another data point, I've written a SparkSQL implementation of this challenge - see https://github.com/SamWheating/1trc. The repo includes all of the steps for building and running the job locally as well as submitting it to EMR.

I've also included a script for submitting the job to EMR (running Spark 3.4.1), and was able to verify the results like so:
```
> python scripts/emr_submit.py
(wait until the job completes)
> aws s3 cp s3://samwheating-1trc/output/part-00000-3cbfb625-d602-4663-8cbe-2007d8f5458a-c000.csv - | head -n 10
Abha,83.4,-43.8,18.000157677827257
Abidjan,87.8,-34.4,25.99981470248089
Abéché,92.1,-33.6,29.400132310152458
Accra,86.7,-34.1,26.400022720245452
Addis Ababa,79.6,-58.0,16.000132105603647
Adelaide,76.8,-45.1,17.29980559679163
Aden,93.2,-33.1,29.100014447157122
Ahvaz,86.6,-37.5,25.3998722446478
Albuquerque,77.3,-54.7,13.99982215316457
Alexandra,72.6,-47.6,11.000355765170788
```
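For context on what the job is actually doing: the whole thing is essentially one group-by aggregation. Here's a rough PySpark/SparkSQL sketch of that step; it's not the actual code from the repo, and the S3 path and the station/measure column names are assumptions on my part.

```python
from pyspark.sql import SparkSession

# Rough sketch only, not the actual code from SamWheating/1trc.
# The S3 path and the "station" / "measure" column names are assumptions.
spark = SparkSession.builder.appName("1trc-sketch").getOrCreate()

spark.read.parquet("s3://my-bucket/1trc/").createOrReplaceTempView("measurements")

# Column order (max, min, mean) loosely mirrors the sample output above,
# though the real schema in the repo may differ.
results = spark.sql("""
    SELECT
        station,
        MAX(measure) AS max_measure,
        MIN(measure) AS min_measure,
        AVG(measure) AS mean_measure
    FROM measurements
    GROUP BY station
    ORDER BY station
""")

# Each output partition becomes a part-xxxxx CSV like the one inspected above.
results.write.csv("s3://my-bucket/output/", mode="overwrite")
```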
Overall, the results are pretty comparable to the Dask results in your blog post: running on 32 m6i.xlarge instances, this job completed in 32 minutes (incl. provisioning) for a total cost of ~$2.27 on spot instances.
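In case it saves anyone digging through the repo: an EMR submission script like this typically boils down to a single boto3 `run_job_flow` call, roughly like the sketch below. This is not the actual `scripts/emr_submit.py`; the release label, IAM roles, instance counts, and S3 paths are placeholders.

```python
import boto3

# Hypothetical sketch of an EMR submission, not the actual scripts/emr_submit.py.
# Release label, roles, instance counts, and S3 paths are placeholders.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="1trc-sparksql",
    ReleaseLabel="emr-6.13.0",  # any 6.x release that ships Spark 3.4.x
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"Name": "primary", "InstanceRole": "MASTER",
             "InstanceType": "m6i.xlarge", "InstanceCount": 1},
            # 1 primary + 31 spot workers is roughly the 32-node cluster above;
            # the exact split is illustrative.
            {"Name": "workers", "InstanceRole": "CORE", "Market": "SPOT",
             "InstanceType": "m6i.xlarge", "InstanceCount": 31},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[{
        "Name": "run-1trc",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     "s3://my-bucket/jobs/one_trillion_rows.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Started cluster:", response["JobFlowId"])
```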
With larger/more machines this would probably be proportionally faster, as this job is almost entirely parallelizable.
I haven't really spent much time optimizing / profiling this job, but figured this was an interesting starting point.
With some more time, I think it would be interesting to try tuning `spark.sql.files.maxPartitionBytes` in order to reduce scheduling / task overhead.
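For illustration, that's just a session-level config change, something like the sketch below. The specific byte values are guesses that would need benchmarking, not settings taken from the repo.

```python
from pyspark.sql import SparkSession

# Sketch of bumping the input split size so each task reads more data
# (and the scheduler has fewer tasks to manage). Values are guesses.
spark = (
    SparkSession.builder
    .appName("1trc-tuning-sketch")
    .config("spark.sql.files.maxPartitionBytes", str(256 * 1024 * 1024))
    .getOrCreate()
)

# The same setting can also be adjusted on an existing session:
spark.conf.set("spark.sql.files.maxPartitionBytes", str(512 * 1024 * 1024))
```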
Anyways, let me know what you think, or if you've got other suggestions for improving this.
My sense is that a good next step on this effort is to collect a few of these results together and do a larger comparison across projects. Having SparkSQL in that comparison seems pretty critical. Thanks for your work here!