
SparkSQL on Amazon Elastic MapReduce (EMR) #5

Open
SamWheating opened this issue Feb 25, 2024 · 1 comment
Comments


SamWheating commented Feb 25, 2024

Hey there, thanks for putting this together.

As another data point, I've written a SparkSQL implementation of this challenge - see https://github.com/SamWheating/1trc. I've included all of the steps for building and running the job locally as well as submitting it to EMR.
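At its core the job is just a single groupBy aggregation over the Parquet files. Roughly, a minimal PySpark sketch of that shape looks like the following (the input path, column names and output location here are illustrative placeholders rather than the exact code in the repo):

```python
# Minimal sketch of the aggregation: per-station max/min/mean over the
# measurements. Paths and column names are illustrative placeholders.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("1trc-sparksql").getOrCreate()

df = spark.read.parquet("s3://<input-bucket>/1trc/")  # the 1TRC Parquet files

result = (
    df.groupBy("station")
      .agg(
          F.max("measure").alias("max"),
          F.min("measure").alias("min"),
          F.mean("measure").alias("mean"),
      )
      .orderBy("station")
)

result.write.csv("s3://<output-bucket>/output/", mode="overwrite")
```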

I've included a script for submitting the job to EMR (running Spark 3.4.1), and was able to verify the results like so:

> python scripts/emr_submit.py 
(wait until the job completes)
> aws s3 cp s3://samwheating-1trc/output/part-00000-3cbfb625-d602-4663-8cbe-2007d8f5458a-c000.csv - | head -n 10
Abha,83.4,-43.8,18.000157677827257
Abidjan,87.8,-34.4,25.99981470248089
Abéché,92.1,-33.6,29.400132310152458
Accra,86.7,-34.1,26.400022720245452
Addis Ababa,79.6,-58.0,16.000132105603647
Adelaide,76.8,-45.1,17.29980559679163
Aden,93.2,-33.1,29.100014447157122
Ahvaz,86.6,-37.5,25.3998722446478
Albuquerque,77.3,-54.7,13.99982215316457
Alexandra,72.6,-47.6,11.000355765170788
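In case it's useful, here's roughly what that kind of submission looks like with boto3. This is a sketch of the general shape, not the literal contents of scripts/emr_submit.py; the instance counts, IAM roles, release label and paths are illustrative:

```python
# Rough sketch: spin up a transient EMR cluster with a single spark-submit
# step and let it terminate when the step finishes. Names, roles, sizes and
# paths are illustrative placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="1trc-sparksql",
    ReleaseLabel="emr-6.14.0",            # an EMR 6.x release shipping Spark 3.4
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m6i.xlarge",
             "InstanceCount": 1, "Market": "SPOT"},
            {"InstanceRole": "CORE", "InstanceType": "m6i.xlarge",
             "InstanceCount": 31, "Market": "SPOT"},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,   # tear down after the step
    },
    Steps=[{
        "Name": "run-1trc",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     "s3://<your-bucket>/jobs/aggregate.py"],
        },
    }],
    ServiceRole="EMR_DefaultRole",
    JobFlowRole="EMR_EC2_DefaultRole",
)
print("Cluster:", response["JobFlowId"])
```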

Overall, the results are pretty comparable to the Dask results in your blog post: running on 32 m6i.xlarge instances, this job completed in 32 minutes (including provisioning) for a total cost of ~$2.27 on spot instances:

32 machines * 32 minutes * ($0.085 spot rate + $0.048 EMR premium)/machine-hour * 1 hr/60 min = $2.2698
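Spelled out as a quick back-of-the-envelope check:

```python
# Back-of-the-envelope cost check: 32 spot m6i.xlarge nodes for 32 minutes,
# billed at the spot rate plus the per-instance EMR premium.
machines = 32
runtime_hours = 32 / 60
cost_per_machine_hour = 0.085 + 0.048  # spot rate + EMR premium, $/machine-hour
print(machines * runtime_hours * cost_per_machine_hour)  # ~$2.27
```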

With more or larger machines this would probably be proportionally faster, as the job is almost entirely parallelizable.

I haven't really spent much time optimizing / profiling this job, but figured this was an interesting starting point.

With some more time, I think it would be interesting to try:

  • pre-loading the dataset to cluster-local HDFS (to reduce the time spent downloading data)
  • re-running with an increased value of spark.sql.files.maxPartitionBytes in order to reduce scheduling / task overhead (see the sketch below this list)
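For that second item, it's just a config change at session build time; something like this (512 MB is an arbitrary starting point to tune, the default being 128 MB):

```python
# Increase the max bytes Spark packs into a single file-scan partition, so
# each task reads more data and there are fewer tasks to schedule.
# 512 MB is an arbitrary starting point; the default is 128 MB.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("1trc-sparksql")
    .config("spark.sql.files.maxPartitionBytes", str(512 * 1024 * 1024))
    .getOrCreate()
)
```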

Anyways, let me know what you think, or if you've got other suggestions for improving this.


mrocklin commented Mar 1, 2024

Very cool. Thanks @SamWheating for sharing!

My sense is that a good next step on this effort is to collect a few of these results together and do a larger comparison across projects. Having SparkSQL in that comparison seems pretty critical. Thanks for your work here!
