
feat: Add datafusion-spark crate #15168


Merged · 29 commits · May 1, 2025

Conversation

@shehabgamin (Contributor) commented Mar 11, 2025:

Which issue does this PR close?

Rationale for this change

See discussion in #5600

TL;DR: Many projects want Spark-compatible expressions for use with DataFusion. Some exist in Comet and some in the Sail project.

What changes are included in this PR?

This PR adds the datafusion-spark crate.

Are these changes tested?

Are there any user-facing changes?

@github-actions bot added the functions (Changes to functions implementation) label on Mar 11, 2025
@shehabgamin marked this pull request as ready for review on March 13, 2025 02:15
@github-actions bot added the core (Core DataFusion crate) and sqllogictest (SQL Logic Tests (.slt)) labels on Mar 13, 2025

Contributor Author:

Testing instructions here

# datafusion-spark: Spark-compatible Expressions

This crate provides Apache Spark-compatible expressions for use with DataFusion.

Contributor Author:

Testing instructions here


Contributor Author:

Example test here

Member:

The expm1 probably works identically across all Spark versions and is not affected by different configuration settings, but many expressions are affected by settings such as ANSI mode and different date/time formats and timezones.

How do you think we should handle these different cases with this test approach?

Member:

A related question - There will be some functions that we can support with 100% compatibility and some that we cannot. It would be good to think about how we express that.

Member:

Sorry for all the questions, but I am really excited about this 😄 ... what would be involved in being able to run these same sqllogic test files in Spark (either in CI or manually locally) to confirm same/similar results

Contributor Author:

These are great questions and really good things to discuss. I'm about to go into a meeting but I have a bunch of thoughts that I'll share afterwards in a couple of hours.

Contributor Author:

The expm1 probably works identically across all Spark versions and is not affected by different configuration settings, but many expressions are affected by settings such as ANSI mode and different date/time formats and timezones.

How do you think we should handle these different cases with this test approach?

In the Sail code base, auxiliary information is passed into new() and stored within the struct. For example:
https://github.com/lakehq/sail/blob/be54362ce8bd79bfb3e55214c5a1b80c2c3d2492/crates/sail-plan/src/extension/function/datetime/timestamp_now.rs#L9-L32

#[derive(Debug)]
pub struct TimestampNow {
    signature: Signature,
    timezone: Arc<str>,
    time_unit: TimeUnit,
}

impl TimestampNow {
    pub fn new(timezone: Arc<str>, time_unit: TimeUnit) -> Self {
        Self {
            signature: Signature::nullary(Volatility::Stable),
            timezone,
            time_unit,
        }
    }

    pub fn timezone(&self) -> &str {
        &self.timezone
    }

    pub fn time_unit(&self) -> &TimeUnit {
        &self.time_unit
    }
}

And then in our PhysicalExtensionCodec we can do the following:
https://github.com/lakehq/sail/blob/be54362ce8bd79bfb3e55214c5a1b80c2c3d2492/crates/sail-execution/src/codec.rs#L946-L953

if let Some(func) = node.inner().as_any().downcast_ref::<TimestampNow>() {
    let timezone = func.timezone().to_string();
    let time_unit: gen_datafusion_common::TimeUnit = func.time_unit().into();
    let time_unit = time_unit.as_str_name().to_string();
    UdfKind::TimestampNow(gen::TimestampNowUdf {
        timezone,
        time_unit,
    })

If we decide not to use sqllogictest (per #15168 (comment)), then we will have no problem testing UDFs with auxiliary information. There are already tests in DataFusion core for this type of pattern as well:

async fn test_parameterized_scalar_udf() -> Result<()> {
    let batch = RecordBatch::try_from_iter([(
        "text",
        Arc::new(StringArray::from(vec!["foo", "bar", "foobar", "barfoo"])) as ArrayRef,
    )])?;

    let ctx = SessionContext::new();
    ctx.register_batch("t", batch)?;
    let t = ctx.table("t").await?;
    let foo_udf = ScalarUDF::from(MyRegexUdf::new("fo{2}"));
    let bar_udf = ScalarUDF::from(MyRegexUdf::new("[Bb]ar"));

    let plan = LogicalPlanBuilder::from(t.into_optimized_plan()?)
        .filter(
            foo_udf
                .call(vec![col("text")])
                .and(bar_udf.call(vec![col("text")])),
        )?
        .filter(col("text").is_not_null())?
        .build()?;

    assert_eq!(
        format!("{plan}"),
        "Filter: t.text IS NOT NULL\n  Filter: regex_udf(t.text) AND regex_udf(t.text)\n    TableScan: t projection=[text]"
    );

    let actual = DataFrame::new(ctx.state(), plan).collect().await?;
    let expected = [
        "+--------+",
        "| text   |",
        "+--------+",
        "| foobar |",
        "| barfoo |",
        "+--------+",
    ];
    assert_batches_eq!(expected, &actual);

    ctx.deregister_table("t")?;
    Ok(())
}

Contributor Author:

A related question - There will be some functions that we can support with 100% compatibility and some that we cannot. It would be good to think about how we express that.

Throw errors when possible, and provide documentation. Depending on how shaky the compatibility for the function is, we may want to avoid implementing it altogether.

Contributor Author:

... what would be involved in being able to run these same sqllogic test files in Spark (either in CI or manually locally) to confirm same/similar results

One option:

  1. A Python script to automatically generate Spark SQL function test cases and their results using PySpark.
  2. A README for developers explaining how to run the script and commit the test cases.
  3. An optional CI workflow to verify the correctness of the test cases' ground truth on demand.

Separate topic... Do you have ideas about fuzzy testing and its suitability in DataFusion?

Member:

Here is my suggestion.

  1. Write a Python script that generates interesting test data (probably in Parquet format) with edge cases using PySpark
  2. Create files containing SQL queries that operate on these test files
  3. Write a Python script to run those queries via PySpark and write results out to file
  4. Write a Rust script to run those queries using datafusion-spark and write results out to file
  5. Write a script that can compare the Spark and datafusion-spark output and report on any differences
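
As a rough illustration of step 4 only (this is a sketch, not code from this PR; the directory layout and the commented-out datafusion_spark registration helper are assumptions), a small Rust driver could run each query file against a SessionContext and write the rendered batches out next to the inputs so a separate script can diff them against the Spark output:

use std::fs;

use datafusion::arrow::util::pretty::pretty_format_batches;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    // Hypothetical: register every datafusion-spark UDF with the context.
    // datafusion_spark::register_all(&ctx)?;

    for entry in fs::read_dir("spark_compat/queries")? {
        let path = entry?.path();
        let sql = fs::read_to_string(&path)?;

        // Run the query and render the results the same way for both engines,
        // so the comparison step only has to diff plain text files.
        let batches = ctx.sql(&sql).await?.collect().await?;
        let rendered = pretty_format_batches(&batches)?.to_string();
        fs::write(path.with_extension("out"), rendered)?;
    }
    Ok(())
}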

@alamb (Contributor) commented Mar 16, 2025:

The expm1 probably works identically across all Spark versions and is not affected by different configuration settings, but many expressions are affected by settings such as ANSI mode and different date/time formats and timezones.

I believe @Omega359 has a proposal here of how to thread the config options into the arguments of the functions (for the same reason)


Contributor Author:

Example test here

};
}

#[test]
@shehabgamin (Contributor Author) commented Mar 13, 2025:

Example invoke test here. Direct invocation tests should only be used to verify that the function is correctly implemented. Further tests are required in the sqllogictest crate (examples for ascii can be found in this PR).

Contributor:

My personal preference is to test them all from .slt rather than have any rust based tests unless there is something that can not be tested from .slt

For the different string types, we could perhaps cover the different string types using the same pattern as normal string tests -- see https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/string/README.md

However, I don't think this is required

};
}

#[test]
Contributor Author:

Example invoke test here. Direct invocation tests should only be used to verify that the function is correctly implemented. Further tests are required in the sqllogictest crate (examples for expm1 can be found in this PR).

Contributor:

Because sqllogictests are so much faster to write and update, I suggest we point people towards using sqllogictests to test the functions unless there is somehting that can not be tested using .slt files

Contributor Author:

@andygrove and I found some correctness issues with sqllogictests. Specifically, we found issues with testing the correctness of floating point results.

The idea was to do something like this: #15168 (comment)

@alamb What are your thoughts on this? Should we perhaps use sqllogictest as long as we're not testing floating point results and as long as the function being tested is not configurable?

Contributor:

I think the underlying sqllogictest library has the notion of "engines".

The one we use on main is here:
https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest/src/engines/datafusion_engine

Part of that is how to normalize the results into the strings printed in .slt tests:

pub fn cell_to_string(col: &ArrayRef, row: usize) -> Result<String> {
    if !col.is_valid(row) {
        // represent any null value with the string "NULL"
        Ok(NULL_STR.to_string())
    } else {
        match col.data_type() {
            DataType::Null => Ok(NULL_STR.to_string()),
            DataType::Boolean => {
                Ok(bool_to_str(get_row_value!(array::BooleanArray, col, row)))
            }
            DataType::Float16 => {
                Ok(f16_to_str(get_row_value!(array::Float16Array, col, row)))
            }
            DataType::Float32 => {
                Ok(f32_to_str(get_row_value!(array::Float32Array, col, row)))
            }
            DataType::Float64 => {
                Ok(f64_to_str(get_row_value!(array::Float64Array, col, row)))
            }
            DataType::Decimal128(_, scale) => {
                let value = get_row_value!(array::Decimal128Array, col, row);
                Ok(decimal_128_to_str(value, *scale))
            }
            DataType::Decimal256(_, scale) => {
                let value = get_row_value!(array::Decimal256Array, col, row);
                Ok(decimal_256_to_str(value, *scale))
            }
            DataType::LargeUtf8 => Ok(varchar_to_str(get_row_value!(
                array::LargeStringArray,
                col,
                row
            ))),
            DataType::Utf8 => {
                Ok(varchar_to_str(get_row_value!(array::StringArray, col, row)))
            }
            DataType::Utf8View => Ok(varchar_to_str(get_row_value!(
                array::StringViewArray,
                col,
                row
            ))),
            DataType::Dictionary(_, _) => {
                let dict = col.as_any_dictionary();
                let key = dict.normalized_keys()[row];
                Ok(cell_to_string(dict.values(), key)?)
            }
            _ => {
                let f =
                    ArrayFormatter::try_new(col.as_ref(), &DEFAULT_CLI_FORMAT_OPTIONS);
                Ok(f.unwrap().value(row).to_string())
            }
        }
        .map_err(DFSqlLogicTestError::Arrow)
    }
}

If there is some different way to convert floating point values for Spark, maybe we could make a Spark-functions-specific driver.

The idea was to do something like this: #15168 (comment)

Ideally we could use one of the many existing tools in datafusion rather than write new scripts

Another possibility might be to use insta.rs (which was added to the repo recently), which automates result updates 🤔

Contributor Author:

@alamb Thanks for the pointers, this seems reasonable to me.

@andygrove Does this sound good to you as well?

If we're all aligned, I think we've gathered enough input for me to push up some more code 🤠

Contributor:

FYI I think @andygrove might be out for a week so he may not respond quickly

Contributor Author:

@alamb Thanks for the pointers, this seems reasonable to me.

@andygrove Does this sound good to you as well?

If we're all aligned, I think we've gathered enough input for me to push up some more code 🤠

@andygrove @alamb Just checking in here

Member:

Sorry for the late response here. I had a vacation and have been busy with Comet priorities since getting back. I would like to help with the review here. I do wonder if we could start with a smaller scope PR to get the initial crate in place.

I would also like to contribute some Spark-compatible shuffle implementation from Comet so that we can re-use it in Ballista.

}

fn name(&self) -> &str {
"spark_ascii"
@shehabgamin (Contributor Author) commented Mar 13, 2025:

Prefix with spark_ because sqllogictest evaluates both implementations of ascii.

Contributor:

see above

}

fn name(&self) -> &str {
"spark_expm1"
Contributor Author:

Prefix with spark_ because sqllogictest may evaluate more than one implementation of expm1.

Contributor:

What I would recommend is

  1. keep the original expm1 name (that seems the most useful to people who are trying to get spark compatible behavior)
  2. Change to use a function to register spark compatible functions (see above)
  3. Change our sqllogictest driver so it registers spark functions for any test file that matches spark_*.slt (similar to pg_...)

That way most of the sqllogictest setup stays the same, and we can write spark tests in spark/spark_math.slt, spark/spark_string.slt, etc.

Here is the code that customizes the context for the individual test files

match file_name {
    "information_schema_table_types.slt" => {
        info!("Registering local temporary table");
        register_temp_table(test_ctx.session_ctx()).await;
    }
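
A hypothetical extension of that match (the file-name pattern and the register_spark_functions helper below are assumptions, not code in this PR) could register the Spark-compatible functions for any spark test file:

    // Sketch only: any test file under a `spark/` directory gets a context with
    // the datafusion-spark functions registered.
    name if name.starts_with("spark/") => {
        info!("Registering Spark-compatible functions");
        // `register_spark_functions` would loop over the datafusion-spark UDFs
        // and call `SessionContext::register_udf` for each one.
        register_spark_functions(test_ctx.session_ctx());
    }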

Comment on lines 24 to 26
SELECT spark_expm1(1::INT);
----
1.718281828459
Member:

I tried running this in Spark 3.5.3 and did not get the same answer.

scala> spark.sql("select expm1(1)").show()
+-----------------+
|         EXPM1(1)|
+-----------------+
|1.718281828459045|
+-----------------+

Contributor Author:

@andygrove I believe sqllogictest is truncating the answer.

I initially had a test for the value 1.7182818284590453 (slightly more precise than your result), but removed it because cargo test (amd64) was giving the value 1.718281828459045 (https://github.com/apache/datafusion/actions/runs/13825914914/job/38680868216) while the rest of the cargo tests on other architectures were passing.

Member:

Does sqllogictest have a way to test floating point results within some tolerance?

Contributor Author:

Doesn't look like it:

Remember, the purpose of sqllogictest is to validate the logic behind the evaluation of SQL statements, not the ability to handle extreme values. So keep content in a reasonable range: small integers, short strings, and floating point numbers that use only the most significant bits of a 32-bit IEEE float.

https://www.sqlite.org/sqllogictest/doc/trunk/about.wiki

Contributor Author:

We don't have to stick with sqllogictest.

We can create test helpers similar to:
https://github.com/lakehq/datafusion/blob/d78877a55c5e835a07a7ebf23a7bd515faf7d827/datafusion/optimizer/src/analyzer/type_coercion.rs#L2137-L2208

The above link is from an old PR that didn't end up getting merged in, but the general idea seems useful here.

Contributor Author:

lol

query R
SELECT spark_expm1(1::INT);
----
1.718281828459

query T
SELECT spark_expm1(1::INT)::STRING;
----
1.7182818284590453

1.718281828459

query R
SELECT spark_expm1(a) FROM (VALUES (0::INT), (1::INT)) AS t(a);
Member:

I'd suggest adding tests for a wider range of values and edge cases, such as negative numbers, large positive and negative numbers, NaN, null, and so on.

Contributor Author:

Good call, will do!

50

query I
SELECT spark_ascii(a) FROM (VALUES ('Spark'), ('PySpark'), ('Pandas API')) AS t(a);
@andygrove (Member) commented Mar 14, 2025:

Could you add tests for edge cases?

Some ideas from ChatGPT (the results are from me actually running them in Spark):

scala> spark.sql("select ascii('😀')").show()
+---------+                                                                     
|ascii(😀)|
+---------+
|   128512|
+---------+

scala> spark.sql("select ascii('\n')").show()
+---------+
|ascii(\n)|
+---------+
|       10|
+---------+


scala> spark.sql("select ascii('\t')").show()
+---------+
|ascii(\t)|
+---------+
|        9|
+---------+

Contributor Author:

Good call, will do!

@github-actions bot removed the core (Core DataFusion crate) label on Apr 25, 2025
@shehabgamin requested a review from alamb on April 26, 2025 08:58
@@ -193,7 +192,7 @@ macro_rules! get_row_value {
///
/// Floating numbers are rounded to have a consistent representation with the Postgres runner.
///
-pub fn cell_to_string(col: &ArrayRef, row: usize) -> Result<String> {
+pub fn cell_to_string(col: &ArrayRef, row: usize, is_spark_path: bool) -> Result<String> {
Contributor Author:

@alamb While digging into your suggestion (#15168 (comment)), I realized that we don't need to write an entire engine for Spark. All we care about is the logic in cell_to_string. For now, I haven't created a Spark-specific spark_cell_to_string, since the issues we originally encountered with sqllogictest were related only to Float64 precision. We can always create a Spark-specific spark_cell_to_string later if we find that other changes are needed.
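
A minimal sketch of the idea (the exact code merged in this PR may differ): only the Float64 arm of cell_to_string needs to branch on the new is_spark_path flag, falling back to the existing Postgres-style rounding otherwise.

            DataType::Float64 => {
                let value = get_row_value!(array::Float64Array, col, row);
                if is_spark_path {
                    // Keep more precision so results can be compared against Spark output.
                    Ok(spark_f64_to_str(value))
                } else {
                    // Existing behaviour: round for consistency with the Postgres runner.
                    Ok(f64_to_str(value))
                }
            }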

Contributor:

nice -- this makes sense to me -- I agree what you have here looks good

@alamb (Contributor) left a comment:

Thank you @shehabgamin -- I think this looks great. I left a few comments, but nothing that would block merging this PR, and I think we can do them as follow-on items.

Things that I think we should do next:

  1. Add an example somewhere (perhaps in the examples directory) showing how to configure and use the spark functions in a SessionContext (see the sketch below). I can help with this
  2. Automatically generate documentation for these functions, the way we do for other functions -- https://datafusion.apache.org/user-guide/sql/scalar_functions.html
  3. Test integrating this code into comet (with a draft PR or something) to make sure it works.

After this PR is merged, I suggest we implement one or two more small functions to give some example patterns to follow, and then I think we'll be ready to write a bunch of tickets to port all the functions.
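
For item 1, a minimal sketch of what such an example might look like, assuming datafusion-spark ends up exposing a helper that returns all of its ScalarUDFs (the all_default_scalar_functions name below is an assumption, not a settled API):

use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Hypothetical helper returning every Spark-compatible ScalarUDF in the crate.
    for udf in datafusion_spark::all_default_scalar_functions() {
        ctx.register_udf(udf.as_ref().clone());
    }

    // With the functions registered, Spark-flavored SQL can be run directly.
    ctx.sql("SELECT expm1(1)").await?.show().await?;
    Ok(())
}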


Cargo.lock Outdated
Comment on lines 2572 to 2577
"datafusion-functions-aggregate",
"datafusion-functions-aggregate-common",
"datafusion-functions-nested",
"datafusion-functions-table",
"datafusion-functions-window",
"datafusion-functions-window-common",
Contributor:

I don't think these dependencies are used

Suggested change
"datafusion-functions-aggregate",
"datafusion-functions-aggregate-common",
"datafusion-functions-nested",
"datafusion-functions-table",
"datafusion-functions-window",
"datafusion-functions-window-common",

Contributor Author:

Nice catch, done!

use std::sync::Arc;

/// Fluent-style API for creating `Expr`s
#[allow(unused)]
Contributor:

why does it need to be "allow unused"? I don't think this should be necessary for pub APIs

Contributor Author:

Linter yells at me otherwise

/// Fluent-style API for creating `Expr`s
#[allow(unused)]
pub mod expr_fn {
pub use super::function::aggregate::expr_fn::*;
Contributor:

Is this list of modules the ones that Spark offers? I am not familiar with Spark, so I don't know off the top of my head.

Contributor Author:

Yes, exactly!

@@ -1,5 +1,5 @@
Apache DataFusion
-Copyright 2019-2024 The Apache Software Foundation
+Copyright 2019-2025 The Apache Software Foundation
Contributor:

👍

use std::sync::Arc;

#[user_doc(
doc_section(label = "Spark Math Functions"),
Contributor:

Nice

Contributor Author:

Removed per your suggestion here:
#15168 (comment)


}
}

pub(crate) fn spark_f64_to_str(value: f64) -> String {
Contributor:

this looks like a copy/paste of f64_to_str -- maybe we could just thread the spark flag down and avoid some duplication. Not necessary, just a suggestion

Contributor Author:

f64_to_str is used in more than 1 place, so I figured it made sense to create a new function.
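
For context, a rough sketch of the shape such a formatter could take (illustrative only, not the exact body merged in this PR): the Spark path skips the Postgres-style rounding and keeps the shortest round-trip representation, while handling the special values Spark prints as NaN/Infinity.

pub(crate) fn spark_f64_to_str(value: f64) -> String {
    if value.is_nan() {
        // Match Spark's textual output for special values.
        "NaN".to_string()
    } else if value == f64::INFINITY {
        "Infinity".to_string()
    } else if value == f64::NEG_INFINITY {
        "-Infinity".to_string()
    } else {
        // Shortest representation that round-trips, instead of rounding to a
        // fixed number of decimals like the Postgres-compatible f64_to_str.
        value.to_string()
    }
}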


@alamb (Contributor) commented Apr 30, 2025:

This looks great to me -- I plan to merge it tomorrow and start collecting next steps in a new EPIC ticket unless someone beats me to it

@alamb (Contributor) commented May 1, 2025:

I have filed an epic to track filling out the datafusion-spark crate:

I will file some subtickets for follow-on work as well (e.g. what is in #15168 (review))

@alamb (Contributor) commented May 1, 2025:

Onward!

@xudong963 (Member):

Fyi, the main CI has failed since the PR

@blaginin mentioned this pull request on May 2, 2025
@alamb (Contributor) commented May 2, 2025:

Fyi, the main CI has failed since the PR

@blaginin has fixed it -- it appears to have been a logical conflict

@linhr deleted the add-spark-crate branch on May 14, 2025
Labels: functions (Changes to functions implementation), sqllogictest (SQL Logic Tests (.slt))

Successfully merging this pull request may close these issues:

[DISCUSSION] Add separate crate to cover spark builtin functions

4 participants