
SparkSQL on Amazon Elastic MapReduce (EMR) #5

Open
SamWheating opened this issue Feb 25, 2024 · 1 comment
Comments


SamWheating commented Feb 25, 2024

Hey there, thanks for putting this together.

As another data point, I've written a SparkSQL implementation of this challenge - see https://github.com/SamWheating/1trc. I've included all of the steps for building and running the job locally as well as submitting it to EMR.
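At its core the job is just a single groupBy aggregation over the Parquet files. Roughly, a minimal PySpark sketch of that shape looks like the following (the input path, column names and output location here are illustrative placeholders rather than the exact code in the repo):

```python
# Minimal sketch of the aggregation: per-station max/min/mean over the
# measurements. Paths and column names are illustrative placeholders.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("1trc-sparksql").getOrCreate()

df = spark.read.parquet("s3://<input-bucket>/1trc/")  # the 1TRC Parquet files

result = (
    df.groupBy("station")
      .agg(
          F.max("measure").alias("max"),
          F.min("measure").alias("min"),
          F.mean("measure").alias("mean"),
      )
      .orderBy("station")
)

result.write.csv("s3://<output-bucket>/output/", mode="overwrite")
```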

I've included a script for submitting the job to EMR (running Spark 3.4.1), and was able to verify the results like so:

> python scripts/emr_submit.py 
(wait until the job completes)
> aws s3 cp s3://samwheating-1trc/output/part-00000-3cbfb625-d602-4663-8cbe-2007d8f5458a-c000.csv - | head -n 10
Abha,83.4,-43.8,18.000157677827257
Abidjan,87.8,-34.4,25.99981470248089
Abéché,92.1,-33.6,29.400132310152458
Accra,86.7,-34.1,26.400022720245452
Addis Ababa,79.6,-58.0,16.000132105603647
Adelaide,76.8,-45.1,17.29980559679163
Aden,93.2,-33.1,29.100014447157122
Ahvaz,86.6,-37.5,25.3998722446478
Albuquerque,77.3,-54.7,13.99982215316457
Alexandra,72.6,-47.6,11.000355765170788
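In case it's useful, here's roughly what that kind of submission looks like with boto3. This is a sketch of the general shape, not the literal contents of scripts/emr_submit.py; the instance counts, IAM roles, release label and paths are illustrative:

```python
# Rough sketch: spin up a transient EMR cluster with a single spark-submit
# step and let it terminate when the step finishes. Names, roles, sizes and
# paths are illustrative placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="1trc-sparksql",
    ReleaseLabel="emr-6.14.0",            # an EMR 6.x release shipping Spark 3.4
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m6i.xlarge",
             "InstanceCount": 1, "Market": "SPOT"},
            {"InstanceRole": "CORE", "InstanceType": "m6i.xlarge",
             "InstanceCount": 31, "Market": "SPOT"},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,   # tear down after the step
    },
    Steps=[{
        "Name": "run-1trc",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     "s3://<your-bucket>/jobs/aggregate.py"],
        },
    }],
    ServiceRole="EMR_DefaultRole",
    JobFlowRole="EMR_EC2_DefaultRole",
)
print("Cluster:", response["JobFlowId"])
```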

Overall, the results are pretty comparable to the Dask results in your blog post: running on 32 m6i.xlarge instances, this job completed in 32 minutes (including provisioning) for a total cost of ~$2.27 on spot instances:

32 machines * 32 minutes * ($0.085 spot rate + $0.048 EMR premium)/machine-hour * 1 hr/60 min = $2.2698
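Spelled out as a quick back-of-the-envelope check:

```python
# Back-of-the-envelope cost check: 32 spot m6i.xlarge nodes for 32 minutes,
# billed at the spot rate plus the per-instance EMR premium.
machines = 32
runtime_hours = 32 / 60
cost_per_machine_hour = 0.085 + 0.048  # spot rate + EMR premium, $/machine-hour
print(machines * runtime_hours * cost_per_machine_hour)  # ~$2.27
```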

With more or larger machines this would probably be proportionally faster, as the job is almost entirely parallelizable.

I haven't really spent much time optimizing / profiling this job, but figured this was an interesting starting point.

With some more time, I think it would be interesting to try:

  • pre-loading the dataset to cluster-local HDFS (to reduce the time spent downloading data)
  • re-running with an increased value of spark.sql.files.maxPartitionBytes in order to reduce scheduling / task overhead (see the sketch below this list)
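For that second item, it's just a config change at session build time; something like this (512 MB is an arbitrary starting point to tune, the default being 128 MB):

```python
# Increase the max bytes Spark packs into a single file-scan partition, so
# each task reads more data and there are fewer tasks to schedule.
# 512 MB is an arbitrary starting point; the default is 128 MB.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("1trc-sparksql")
    .config("spark.sql.files.maxPartitionBytes", str(512 * 1024 * 1024))
    .getOrCreate()
)
```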

Anyways, let me know what you think, or if you've got other suggestions for improving this.


mrocklin commented Mar 1, 2024

Very cool. Thanks @SamWheating for sharing!

My sense is that a good next step on this effort is to collect a few of these results together and do a larger comparison across projects. Having SparkSQL in that comparison seems pretty critical. Thanks for your work here!
