
Getting Started


What is TiSpark

TiSpark is a third-party jar package for Spark that provides the ability to read from and write to TiKV.

How to choose TiSpark version

The latest version of TiSpark is 2.5.0. You can get the TiSpark jar from Maven Central.

TiSpark version     TiDB/TiKV/PD version   Spark version    Scala version
2.4.x-scala_2.11    5.x, 4.x               2.3.x, 2.4.x     2.11
2.4.x-scala_2.12    5.x, 4.x               2.4.x            2.12
2.5.x               5.x, 4.x               3.0.x, 3.1.x     2.12
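If you manage dependencies with a build tool instead of passing the jar by hand, an sbt entry along the following lines can pull TiSpark in. This is only a sketch: the group and artifact IDs and the Spark version shown here are assumptions, so verify the exact coordinates on Maven Central for your Spark and Scala versions.

// build.sbt sketch; coordinates are assumptions, check Maven Central
// Spark itself is "provided" because spark-shell / spark-submit supplies it at runtime
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "3.1.2" % "provided",   // assumed Spark version
  "com.pingcap.tispark" % "tispark-assembly" % "2.5.0"        // assumed artifact ID
)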

Use TiSpark >= 2.5

Take spark-shell as an example.

Start spark-shell

To use TiSpark in spark-shell:

  1. Add the following configuration in spark-defaults.conf
spark.sql.extensions  org.apache.spark.sql.TiExtensions
spark.tispark.pd.addresses  ${your_pd_address}
spark.sql.catalog.tidb_catalog  org.apache.spark.sql.catalyst.catalog.TiCatalog
spark.sql.catalog.tidb_catalog.pd.addresses  ${your_pd_address}
  2. Start spark-shell with the --jars option
spark-shell --jars tispark-assembly-{version}.jar
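
If you prefer not to edit spark-defaults.conf, the same settings can be passed on the command line with --conf. This is a sketch; the jar name and PD address are placeholders to fill in for your deployment.

spark-shell --jars tispark-assembly-{version}.jar \
  --conf spark.sql.extensions=org.apache.spark.sql.TiExtensions \
  --conf spark.tispark.pd.addresses=${your_pd_address} \
  --conf spark.sql.catalog.tidb_catalog=org.apache.spark.sql.catalyst.catalog.TiCatalog \
  --conf spark.sql.catalog.tidb_catalog.pd.addresses=${your_pd_address}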

Read with TiSpark

You can use Spark SQL to read from TiKV.

spark.sql("use tidb_catalog")
spark.sql("select count(*) from ${database}.${table}").show

Write with TiSpark

You can use the Spark DataSource API to write to TiKV with ACID guarantees (the INSERT statement is not supported yet).

val tidbOptions: Map[String, String] = Map(
  "tidb.addr" -> "127.0.0.1",
  "tidb.password" -> "",
  "tidb.port" -> "4000",
  "tidb.user" -> "root",
  "spark.tispark.pd.addresses" -> "127.0.0.1:2379"
)

val customerDF = spark.sql("select * from customer limit 100000")

customerDF.write
  .format("tidb")
  .option("database", "tpch_test")
  .option("table", "cust_test_select")
  .options(tidbOptions)
  .mode("append")
  .save()
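
As a quick sanity check after the write, you can count the rows in the target table through the catalog (this follows the tpch_test / cust_test_select names used in the example above):

spark.sql("use tidb_catalog")
spark.sql("select count(*) from tpch_test.cust_test_select").show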

See here for more details.

Delete with TiSpark

You can use Spark SQL to delete from TiKV (supported on the TiSpark master branch).

spark.sql("use tidb_catalog")
spark.sql("delete from ${database}.${table} where xxx").show

See here for more details.

Use TiSpark 2.4.x

Take spark-shell as an example.

Start spark-shell

To use TiSpark in spark-shell:

  1. Add the following configuration in spark-defaults.conf
spark.sql.extensions  org.apache.spark.sql.TiExtensions
spark.tispark.pd.addresses  ${your_pd_address}
  2. Start spark-shell with the --jars option
spark-shell --jars tispark-assembly-{version}.jar

Read with TiSpark

You can use Spark SQL to read from TiKV.

spark.sql("select count(*) from ${database}.${table}").show

Write with TiSpark

You can use the Spark DataSource API to write to TiKV with ACID guarantees (the INSERT statement is not supported yet).

val tidbOptions: Map[String, String] = Map(
  "tidb.addr" -> "127.0.0.1",
  "tidb.password" -> "",
  "tidb.port" -> "4000",
  "tidb.user" -> "root",
  "spark.tispark.pd.addresses" -> "127.0.0.1:2379"
)

val customerDF = spark.sql("select * from customer limit 100000")

customerDF.write
  .format("tidb")
  .option("database", "tpch_test")
  .option("table", "cust_test_select")
  .options(tidbOptions)
  .mode("append")
  .save()

See here for more details.
