Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

[FEA] Support last in windowing context for Integer type. #5061

Closed
viadea opened this issue Mar 25, 2022 · 7 comments
Closed

[FEA] Support last in windowing context for Integer type. #5061

viadea opened this issue Mar 25, 2022 · 7 comments
Assignees
Labels
cudf_dependency An issue or PR with this label depends on a new feature in cudf feature request New feature or request

Comments

@viadea
Copy link
Collaborator

viadea commented Mar 25, 2022

I wish we can support function last in windowing context for IntegerType/DoubleType.

For example:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

val data = Seq(
    Row(Row("Adam ","","Green"),"1","M",1000, "2019-01-01",List("Java","Scala")),
    Row(Row("Bob ","Middle","Green"),"2","M",2000, "2019-01-02",List("Java","Python")),
    Row(Row("Cathy ","","Green"),"3","F",3000, "2019-01-03",List())
)

val schema = (new StructType()
  .add("name",new StructType()
    .add("firstname",StringType)
    .add("middlename",StringType)
    .add("lastname",StringType)) 
  .add("id",StringType)
  .add("gender",StringType)
  .add("salary",IntegerType)
  .add("birthdayStr",StringType)
  .add("language",ArrayType(StringType))
             )

val df = spark.createDataFrame(spark.sparkContext.parallelize(data),schema)
df.withColumn("birthday", to_date(col("birthdayStr"))).write.format("parquet").mode("overwrite").save("/tmp/testparquet")
val df2 = spark.read.parquet("/tmp/testparquet")
df2.createOrReplaceTempView("df2")
df2.printSchema

sql("""SELECT gender, last(salary,true) OVER (PARTITION BY gender ORDER BY salary) FROM df2""").collect
@viadea viadea added feature request New feature or request ? - Needs Triage Need team to review and classify labels Mar 25, 2022
@revans2
Copy link
Collaborator

revans2 commented Mar 25, 2022

This is essentially the same as #4005

@sameerz
Copy link
Collaborator

sameerz commented Mar 26, 2022

cc: @mythrocks

@sameerz sameerz added cudf_dependency An issue or PR with this label depends on a new feature in cudf and removed ? - Needs Triage Need team to review and classify labels Mar 29, 2022
@res-life res-life self-assigned this Jul 7, 2022
@mythrocks mythrocks removed their assignment Jul 7, 2022
@mythrocks
Copy link
Collaborator

I have taken myself off the assignee list, leaving this issue in the capable hands of @res-life.

@mythrocks
Copy link
Collaborator

@res-life, here is a reference implementation for supporting FIRST() and LAST() in spark-rapids. I haven't included support for NTH_VALUE(), but it should be along these lines.

@res-life
Copy link
Collaborator

Note:
Cudf does not support range queries on floats or doubles yet.
Still depends on cuDF: #6000 (comment)

@res-life res-life changed the title [FEA] Support last in windowing context for IntegerType/DoubleType. [FEA] Support last in windowing context for Intege type. Aug 2, 2022
@res-life
Copy link
Collaborator

res-life commented Aug 2, 2022

Split this issue into 2, one for Integer, one for Double #6193

@res-life
Copy link
Collaborator

res-life commented Aug 2, 2022

#6000 already supported Integer type.

@res-life res-life closed this as completed Aug 2, 2022
@res-life res-life changed the title [FEA] Support last in windowing context for Intege type. [FEA] Support last in windowing context for Integer type. Aug 2, 2022
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
cudf_dependency An issue or PR with this label depends on a new feature in cudf feature request New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants