Getting Started

shiyuhang0 edited this page May 7, 2022 · 16 revisions

What is TiSpark

TiSpark is a third-party jar package for Spark that lets Spark read from and write to TiKV.

How to choose a TiSpark version

The latest version of TiSpark is 2.5.0. Pick the release line that matches your Spark and Scala versions:

| TiSpark version | TiDB, TiKV, PD version | Spark version | Scala version |
| --- | --- | --- | --- |
| 2.4.x-scala_2.11 | 5.x, 4.x | 2.3.x, 2.4.x | 2.11 |
| 2.4.x-scala_2.12 | 5.x, 4.x | 2.4.x | 2.12 |
| 2.5.x | 5.x, 4.x | 3.0.x, 3.1.x | 2.12 |
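
The compatibility table can also be expressed as a small lookup. The following is a minimal sketch in plain Scala; `tiSparkLineFor` is a hypothetical helper written for this page, not part of TiSpark, and it only encodes the rows listed above.

```scala
// Hypothetical helper: maps a Spark version and a Scala binary version
// to the TiSpark release line from the compatibility table.
def tiSparkLineFor(sparkVersion: String, scalaBinary: String): Option[String] = {
  // Keep only "major.minor", e.g. "3.1.2" becomes "3.1".
  val sparkMinor = sparkVersion.split('.').take(2).mkString(".")
  (sparkMinor, scalaBinary) match {
    case ("2.3" | "2.4", "2.11") => Some("2.4.x-scala_2.11")
    case ("2.4", "2.12")         => Some("2.4.x-scala_2.12")
    case ("3.0" | "3.1", "2.12") => Some("2.5.x")
    case _                       => None // combination not listed in the table
  }
}

println(tiSparkLineFor("3.1.2", "2.12")) // prints Some(2.5.x)
println(tiSparkLineFor("2.3.4", "2.12")) // prints None
```

In spark-shell, `spark.version` returns the Spark version string that can be passed as the first argument.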

Use TiSpark >= 2.5

The following steps use spark-shell as an example.

Setup

Add the following configuration to spark-defaults.conf, replacing the placeholder with your PD address:

spark.sql.extensions  org.apache.spark.sql.TiExtensions
spark.tispark.pd.addresses  ${your_pd_address}
spark.sql.catalog.tidb_catalog  org.apache.spark.sql.catalyst.catalog.TiCatalog
spark.sql.catalog.tidb_catalog.pd.addresses  ${your_pd_address}
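
For a standalone application, the same four keys can be set in code before the session is created. The sketch below assumes the TiSpark assembly jar is on the classpath and that PD listens on 127.0.0.1:2379 (the address used in the write example on this page). The object name and app name are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

object TiSparkExample {
  // Assumption: a local PD endpoint; replace with your own address.
  val pdAddresses = "127.0.0.1:2379"

  // The same four keys as in spark-defaults.conf.
  val tiSparkSettings: Map[String, String] = Map(
    "spark.sql.extensions" -> "org.apache.spark.sql.TiExtensions",
    "spark.tispark.pd.addresses" -> pdAddresses,
    "spark.sql.catalog.tidb_catalog" -> "org.apache.spark.sql.catalyst.catalog.TiCatalog",
    "spark.sql.catalog.tidb_catalog.pd.addresses" -> pdAddresses
  )

  def main(args: Array[String]): Unit = {
    // Apply every setting to the builder before the session is created.
    // The master URL is expected to come from spark-submit.
    val spark = tiSparkSettings
      .foldLeft(SparkSession.builder().appName("tispark-example")) {
        case (builder, (key, value)) => builder.config(key, value)
      }
      .getOrCreate()

    spark.sql("use tidb_catalog")
    spark.stop()
  }
}
```

The settings are applied through the builder because `spark.sql.extensions` is read when the session is constructed.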

Start spark-shell

Start spark-shell and load the TiSpark jar with the --jars option:

spark-shell --jars tispark-assembly-{version}.jar

Read with TiSpark

You can use Spark SQL to read from TiKV. Replace ${database} and ${table} with your own names:

spark.sql("use tidb_catalog")
spark.sql("select count(*) from ${database}.${table}").show
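
Note that `${database}` and `${table}` in the query are placeholders. In a plain Scala string they are not substituted; only a string with the `s` prefix interpolates values. A minimal sketch for spark-shell, where the database and table names are illustrative and should be replaced with your own:

```scala
// Illustrative names; replace with a database and table that exist in your cluster.
val database = "tpch_test"
val table = "customer"

// The `s` prefix makes Scala substitute $database and $table into the string.
val countQuery = s"select count(*) from $database.$table"

// In spark-shell, `spark` is the predefined SparkSession.
spark.sql("use tidb_catalog")
spark.sql(countQuery).show()
```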

Write with TiSpark

You can use the Spark DataSource API to write to TiKV with ACID guarantees (the INSERT statement is not supported yet).

val tidbOptions: Map[String, String] = Map(
  "tidb.addr" -> "127.0.0.1",
  "tidb.password" -> "",
  "tidb.port" -> "4000",
  "tidb.user" -> "root",
  "spark.tispark.pd.addresses" -> "127.0.0.1:2379"
)

val customerDF = spark.sql("select * from customer limit 100000")

customerDF.write
  .format("tidb")
  .option("database", "tpch_test")
  .option("table", "cust_test_select")
  .options(tidbOptions)
  .mode("append")
  .save()

See here for more details.
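
The option map in the write example hardcodes the connection details. One way to avoid keeping the password in source is to build the map from a function. This is a sketch only: `tidbWriteOptions` is a hypothetical helper, and `TIDB_PASSWORD` is an assumed environment variable name, neither of which is defined by TiSpark. The option keys are the ones shown in the example.

```scala
// Hypothetical helper: builds the option map passed to the "tidb" data source.
// `env` defaults to the process environment and can be overridden in tests.
def tidbWriteOptions(
    host: String,
    port: Int,
    user: String,
    pdAddresses: String,
    env: Map[String, String] = sys.env
): Map[String, String] = Map(
  "tidb.addr" -> host,
  "tidb.port" -> port.toString,
  "tidb.user" -> user,
  // Assumed variable name; falls back to an empty password as in the example.
  "tidb.password" -> env.getOrElse("TIDB_PASSWORD", ""),
  "spark.tispark.pd.addresses" -> pdAddresses
)

val tidbOptions = tidbWriteOptions("127.0.0.1", 4000, "root", "127.0.0.1:2379")
```

The resulting `tidbOptions` value can be passed to `.options(tidbOptions)` exactly as in the write example.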

Delete with TiSpark

You can use Spark SQL to delete from TiKV (supported on the TiSpark master branch). Replace xxx with your filter condition:

spark.sql("use tidb_catalog")
spark.sql("delete from ${database}.${table} where xxx").show

See here for more details.

Use TiSpark 2.4.x

The following steps use spark-shell as an example.

Setup

Add the following configuration to spark-defaults.conf, replacing the placeholder with your PD address:

spark.sql.extensions  org.apache.spark.sql.TiExtensions
spark.tispark.pd.addresses  ${your_pd_address}

Start spark-shell

Start spark-shell and load the TiSpark jar with the --jars option:

spark-shell --jars tispark-assembly-{version}.jar

Read with TiSpark

You can use Spark SQL to read from TiKV. Replace ${database} and ${table} with your own names:

spark.sql("select count(*) from ${database}.${table}").show

Write with TiSpark

You can use the Spark DataSource API to write to TiKV with ACID guarantees (the INSERT statement is not supported yet).

val tidbOptions: Map[String, String] = Map(
  "tidb.addr" -> "127.0.0.1",
  "tidb.password" -> "",
  "tidb.port" -> "4000",
  "tidb.user" -> "root",
  "spark.tispark.pd.addresses" -> "127.0.0.1:2379"
)

val customerDF = spark.sql("select * from customer limit 100000")

customerDF.write
  .format("tidb")
  .option("database", "tpch_test")
  .option("table", "cust_test_select")
  .options(tidbOptions)
  .mode("append")
  .save()

See here for more details.