Sparkling is a Clojure API for Apache Spark.
(do
(require '[sparkling.conf :as conf])
(require '[sparkling.core :as spark])
(spark/with-context sc (-> (conf/spark-conf) ; this creates a spark context from the given context
(conf/app-name "sparkling-test")
(conf/master "local"))
(let [lines-rdd (spark/into-rdd sc ["This is a firest line" ;; here we provide data from a clojure collection.
"Testing spark" ;; You could also read from a text file, or avro file.
"and sparkling" ;; You could even approach a JDBC datasource
"Happy hacking!"])]
(spark/collect ;; get every element from the filtered RDD
(spark/filter ;; filter elements in the given RDD (lines-rdd)
#(.contains % "spark") ;; a pure clojure function as filter predicate
lines-rdd)))))
Check out our site for information about Gorillalabs Sparkling and a getting started guide.
Just clone our getting-started repo and get going right now.
But note: There's one thing you need to be aware of: Certain namespaces need to be AOT-compiled, e.g. because the classes are referenced in the startup process by name. I'm doing this in my project.clj using the :aot
directive like this
:aot [#".*" sparkling.serialization sparkling.destructuring]
Sparkling is available from Clojars. To use with Leiningen, add
- added @/deref support for broadcasts Making it easier to work with broadcasts by using Clojure mechanisms. This is especially true for unit tests, as you could test without actual broadcasts, but with anything deref-able.
- added RDD autonaming from fn metadata, eases navigation in SparkUI
- added lookup functionality. Make sure the key to your Tuples is Serializable (Java serialization), as it will be serialized as part of your task definition, not only as part of your data. These are handled differently in Spark.
(thanks to Jase Bell)
- feature: added accumulators (Thanks to Oleg Smirnov for that)
- change: overhaul of Kryo Registration: Deprecated defregistrator macro, added Registrator type (see sparkling.serialization), with basic support of required types. This introduced a breaking change (sorry!): You need to aot-compile (or require) sparkling.serialization to run stuff in the REPL.
- feature: added support for your own avro readers, making it possible to read types/records instead of maps. Major improvement on memory consumption.
- No more spilling of unwanted stuff in your application. You only need to refer to sparkling to get a proper environment with Spark 1.2.1. In order to deploy to a cluster with Spark pre-installed, you need to set Spark dependency to provided in your project, though.
- Use sparkling.core instead of sparkling.api for parameter orders similar to Clojure. Easier currying using partial.
- Made it possible to use Keywords as Functions by serializing IFn instead of AFunction.
- Tested with Spark 1.1.0 and Spark 1.2.1.
- It's about twice as fast by getting rid of a reflection call (thanks to David Jacot for his take on this).
- Get rid of mapping/remapping inside the api functions, which
- bloated the execution plan (mine shrinked to a third) and
- (more importantly) allowed me to keep partitioner information.
- adding more -values functions (e.g. map-values), againt to keep partitioner information.
- Additional Sources for RDDs:
- JdbcRDD: Reading Data from your JDBC source.
- Hadoop-Avro-Reader: Reading AVRO Files from HDFS
Feel free to fork the Sparkling repository, improve stuff and open up a pull request against our "develop" branch. However, we'll only add features with tests, so make sure everything is green ;)
Thanks to The Climate Corporation and their open source clj-spark project, and to Yieldbot for yieldbot/flambo which served as the starting point for this project.
Copyright (C) 2014-2015 Dr. Christian Betz, and the Gorillalabs team.
Distributed under the Eclipse Public License either version 1.0 or (at your option) any later version.