[Audit][FEA][SPARK-36831] Support ANSI interval types from CSV source #4146
@revans2 Please help review the solution:

Interval types are described in https://spark.apache.org/docs/latest/sql-ref-datatypes.html. The plugin currently does not support writing CSV, so this only covers reading interval types from a CSV source.

Spark's code for reading intervals has a legacy form and a normal form, switched by SQLConf.LEGACY_FROM_DAYTIME_STRING:
- legacy form: controlled by SQLConf.LEGACY_FROM_DAYTIME_STRING (see https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L3042), implemented by parseDayTimeLegacy
- normal form: the default; the proposal below targets this form
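For reference, a minimal sketch of toggling the legacy behavior. The config key name `spark.sql.legacy.fromDayTimeString.enabled` is an assumption here and should be confirmed against the SQLConf line linked above.

```scala
// Sketch: switch Spark to the legacy day-time string parsing.
// The exact key should be confirmed against the SQLConf link above.
spark.conf.set("spark.sql.legacy.fromDayTimeString.enabled", "true")
```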
Invalid values become null when reading CSV.

Proposed solution for the normal form: use cuDF ColumnView.extractRe to extract the day, hour, minute, and second fields by specifying capture groups in the regexp, then compute the microseconds. The GPU code looks roughly like the sketch below.
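The original snippet was not captured here. This is a minimal sketch, assuming the cuDF Java API (ColumnView.extractRe returns a Table with one column per capture group) and a simplified pattern; the real pattern must also handle signs, fractional seconds, and the full `INTERVAL '...' DAY TO SECOND` literal form. Resource cleanup is omitted for readability, as in the original example.

```scala
import ai.rapids.cudf.{ColumnVector, DType, Scalar, Table}

object GpuDayTimeIntervalCast {
  // Simplified: "d hh:mm:ss" -> one capture group per field.
  private val pattern = "^([0-9]+) ([0-9]{1,2}):([0-9]{1,2}):([0-9]{1,2})$"

  def castStringToDTIntervalMicros(input: ColumnVector): ColumnVector = {
    // One column per capture group; rows that do not match become null.
    val groups: Table = input.extractRe(pattern)
    val days    = groups.getColumn(0).castTo(DType.INT64)
    val hours   = groups.getColumn(1).castTo(DType.INT64)
    val minutes = groups.getColumn(2).castTo(DType.INT64)
    val seconds = groups.getColumn(3).castTo(DType.INT64)

    val microsPerSecond = Scalar.fromLong(1000000L)
    val microsPerMinute = Scalar.fromLong(60L * 1000000L)
    val microsPerHour   = Scalar.fromLong(3600L * 1000000L)
    val microsPerDay    = Scalar.fromLong(86400L * 1000000L)

    // micros = d * 86400e6 + h * 3600e6 + m * 60e6 + s * 1e6
    days.mul(microsPerDay)
      .add(hours.mul(microsPerHour))
      .add(minutes.mul(microsPerMinute))
      .add(seconds.mul(microsPerSecond))
  }
}
```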
The CPU code looks roughly like the sketch below.
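The original CPU snippet was not captured either. For comparison, a rough plain-Scala sketch of the same parse; Spark's real implementation is IntervalUtils.castStringToDTInterval, which additionally handles signs, field bounds, and ANSI error modes.

```scala
// Rough CPU-side sketch of the same parse; not Spark's actual implementation.
object CpuDayTimeIntervalCast {
  private val DayTime = """^([0-9]+) ([0-9]{1,2}):([0-9]{1,2}):([0-9]{1,2})$""".r

  // Returns microseconds, or None for invalid input
  // (Spark's CSV reader produces null in that case).
  def castStringToDTIntervalMicros(s: String): Option[Long] = s.trim match {
    case DayTime(d, h, m, sec) =>
      Some(d.toLong * 86400L * 1000000L +
           h.toLong * 3600L * 1000000L +
           m.toLong * 60L * 1000000L +
           sec.toLong * 1000000L)
    case _ => None
  }
}
```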
row count: 10,000,000
I really dislike using regular expressions in casts, but it is a good first step. It would be nice to file a follow-on issue to write a custom kernel to do this for us. Also, I assume you know that your code for the conversion is leaking a lot of column views; I assume you did that just for readability of the code. Second, have you tested this with CSV? The patch that added support for writing/reading intervals in CSV (https://issues.apache.org/jira/browse/SPARK-36831) did not add in anything that calls
Filed an issue: rapidsai/cudf#10356

CSV is a text format; the day-time interval is stored in string form, e.g.:
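The original example values are not preserved here; ANSI day-time interval values in Spark are typically rendered as SQL literals of roughly this shape (illustrative sample rows, not from this issue):

```
1,INTERVAL '1 02:03:04' DAY TO SECOND
2,INTERVAL '-3 10:30:00' DAY TO SECOND
```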
Spark uses a similar method to parse an interval string into a day-time interval: IntervalUtils.castStringToDTInterval. I know about the leaking in the example code, thanks for the kind reminder.
Spark Accelerator already supported reading
The cuDF issue is closed but without a fix.
Make sure the plugin can read ANSI interval types from CSV source
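For reference, a minimal sketch of the target behavior with a Spark version that includes SPARK-36831; the file path and schema are illustrative, not from this issue.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().getOrCreate()

// Illustrative schema: one integer column and one DAY TO SECOND interval column.
val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("dt", DayTimeIntervalType())
))

// With the RAPIDS plugin enabled, this read should run on the GPU once
// interval parsing from CSV is supported.
val df = spark.read.schema(schema).csv("/path/to/intervals.csv")
df.show(false)
```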