Is your feature request related to a problem? Please describe.
TPC-DS is a popular and well-known database benchmark suite. According to its specification (Section 5; there is no official online link, so you have to download the TPC-DS tool kit to read the documentation), the "Data Maintenance" benchmark requires SQL statements that use INSERT and DELETE.
An example from the specification:
5.3.8 Method 2: Sales and Returns Fact Table Delete
Delete rows from R with corresponding rows in S
where d_date between Date1 and Date2
Delete rows from S
where d_date between Date1 and Date2
...
...
...
5.3.11.10 DF_CS:
S=catalog_sales
R=catalog_returns
Date1 as generated by dsdgen
Date2 as generated by dsdgen
From this, we can derive a SQL string using DELETE:
delete from catalog_sales
where cs_sold_date_sk in
(select d_date_sk
from date_dim
where d_date between DATE1 and DATE2);
Spark throws this error:
`pyspark.sql.utils.AnalysisException: Delete by condition with subquery is not supported: Some(cs_sold_date_sk#777 IN (list#858 []))`
To work around this, I use the following code:
# Collect the matching date keys first, then inline them into the DELETE,
# so that no subquery is needed.
dates = spark.sql(""" select d_date_sk from date_dim where d_date between '1999-09-18' and '1999-09-19'; """).collect()
date_lst = [x['d_date_sk'] for x in dates]
spark.sql(f""" delete from catalog_sales where cs_sold_date_sk in ({str(date_lst)[1:-1]}); """)
This in turn throws the final exception:
`pyspark.sql.utils.AnalysisException: Table does not support deletes: parquet file /.../.../catalog_sales`
Describe the solution you'd like
Once Iceberg is supported, the Parquet data source could be made updatable (insert, delete).
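For illustration, here is a rough sketch of what the DF_CS delete could look like if the generated TPC-DS data were loaded into Iceberg tables. The catalog name `local`, the `tpcds` namespace, and the source paths are just placeholders I made up, and it assumes the Iceberg Spark runtime and its SQL extensions are configured:

```python
# Sketch only. Assumes a Spark session started with the Iceberg runtime jar and:
#   spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
#   spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog
#   spark.sql.catalog.local.type=hadoop
#   spark.sql.catalog.local.warehouse=/path/to/warehouse   (hypothetical)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tpcds-data-maintenance").getOrCreate()

# Load the generated TPC-DS data into Iceberg tables instead of plain Parquet files.
# Source paths are hypothetical.
spark.read.parquet("/path/to/catalog_sales") \
    .writeTo("local.tpcds.catalog_sales").using("iceberg").create()
spark.read.parquet("/path/to/date_dim") \
    .writeTo("local.tpcds.date_dim").using("iceberg").create()

# The DF_CS delete can then be issued as written in the specification,
# subquery included, rather than against plain Parquet files.
spark.sql("""
    DELETE FROM local.tpcds.catalog_sales
    WHERE cs_sold_date_sk IN (
        SELECT d_date_sk
        FROM local.tpcds.date_dim
        WHERE d_date BETWEEN '1999-09-18' AND '1999-09-19'
    )
""")
```

The point of the sketch is only that the delete targets an Iceberg-managed table, so Spark should no longer reject it as a table that does not support deletes.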
Describe alternatives you've considered
I don't know if Delta Lake is also an option.
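If Delta Lake were an option, a similar sketch might look like the following. The table name and paths are hypothetical, and I have not verified whether Delta accepts a subquery in the DELETE condition, so the collected-date-keys workaround from above is kept here:

```python
# Sketch only. Assumes the Delta Lake runtime jar plus:
#   spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension
#   spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
# Table name and source path are hypothetical.
spark.read.parquet("/path/to/catalog_sales") \
    .write.format("delta").saveAsTable("catalog_sales_delta")

# Collect the matching date keys, then inline them into the DELETE.
date_keys = [row["d_date_sk"] for row in spark.sql(
    "select d_date_sk from date_dim where d_date between '1999-09-18' and '1999-09-19'"
).collect()]

spark.sql(f"""
    DELETE FROM catalog_sales_delta
    WHERE cs_sold_date_sk IN ({', '.join(str(k) for k in date_keys)})
""")
```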