In the current implementation of Spark, a type-conversion error occurs when working with array data returned from ClickHouse through the clickhouse-java driver. When the driver returns an array of primitive types (e.g., byte[] or int[]), Spark attempts to cast it to an array of objects. This is not possible in Java and Scala, because arrays of primitives are not subtypes of Object[], and the cast fails with a ClassCastException.
Steps to reproduce
CREATE TABLE statements for tables involved:
CREATE TABLE example_table (
id UInt32,
data Array(Int32)
) ENGINE = MergeTree()
ORDER BY id;
INSERT INTO example_table VALUES (1, [1, 2, 3]), (2, [4, 5, 6]);
First, we implement a custom ClickHouse dialect that handles Array(T), because Array(T) is an unsupported type in native Spark.
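The reporter's dialect code itself is not included in the issue, but the core of such a dialect is mapping ClickHouse's Array(T) type names onto element types Spark understands (in Spark, via a JdbcDialect whose getCatalystType override returns an ArrayType). A minimal, hypothetical sketch of that name-parsing step (helper name and structure are our own, not from the original dialect):

```java
// Hypothetical helper for a custom ClickHouse dialect: extracts the element
// type from a ClickHouse Array(T) type name so it can be mapped to Spark's
// ArrayType of the corresponding element type.
public class ClickHouseTypeNames {
    public static String elementTypeOf(String clickHouseType) {
        if (clickHouseType.startsWith("Array(") && clickHouseType.endsWith(")")) {
            // "Array(Int32)" -> "Int32"
            return clickHouseType.substring("Array(".length(), clickHouseType.length() - 1);
        }
        return null; // not an array type; fall back to the default mapping
    }

    public static void main(String[] args) {
        System.out.println(elementTypeOf("Array(Int32)")); // prints "Int32"
        System.out.println(elementTypeOf("UInt32"));        // prints "null"
    }
}
```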
Expected behaviour
The driver should return an array of objects instead of an array of primitives to avoid a ClassCastException in Spark.
Code example
Example code snippet in Spark that demonstrates the issue:
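The original snippet was not captured in this extract. A minimal, self-contained sketch of the underlying failure (our own illustration, not the reporter's code): a primitive array cannot be cast to Object[], which is exactly what happens when Spark receives an int[] from the driver for an Array(Int32) column.

```java
import java.util.Arrays;

// Demonstrates why int[] from the driver breaks Spark's Object[] cast,
// and why returning boxed elements (the expected behaviour above) works.
public class PrimitiveArrayCast {
    public static void main(String[] args) {
        int[] fromDriver = {1, 2, 3}; // e.g., the `data` column for id = 1
        Object asObject = fromDriver; // any array is an Object...
        try {
            Object[] asObjects = (Object[]) asObject; // ...but int[] is not an Object[]
        } catch (ClassCastException e) {
            System.out.println("ClassCastException, as seen in Spark");
        }
        // What the issue asks for: boxed elements, which cast cleanly.
        Integer[] boxed = Arrays.stream(fromDriver).boxed().toArray(Integer[]::new);
        Object[] ok = boxed; // Integer[] IS an Object[]
        System.out.println(Arrays.toString(ok)); // prints "[1, 2, 3]"
    }
}
```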
Error log
Configuration
Environment
ClickHouse server
clickhouse-server:latest-alpine