
[FEA] Support Iceberg for data INSERT, DELETE operations #5510

Closed
wjxiz1992 opened this issue May 17, 2022 · 0 comments · Fixed by #5941
Labels
feature request New feature or request

Comments

@wjxiz1992
Collaborator

wjxiz1992 commented May 17, 2022

Is your feature request related to a problem? Please describe.
TPC-DS is a popular and well-known database benchmark suite. According to its specification (Section 5; there is no official online link for this, so users have to download the TPC-DS tools to see the documentation), the "Data Maintenance" benchmark requires SQL that uses INSERT and DELETE.

An example from the specification:

5.3.8  Method 2: Sales and Returns Fact Table Delete
Delete rows from R with corresponding rows in S 
 where d_date between Date1 and Date2
Delete rows from S 
 where d_date between Date1 and Date2
...
...
...
5.3.11.10 DF_CS:
S=catalog_sales
R=catalog_returns
Date1 as generated by dsdgen
Date2 as generated by dsdgen

From this we can derive a SQL statement using DELETE:

delete from catalog_sales 
           where cs_sold_date_sk in 
                      (select d_date_sk 
                                 from date_dim 
                                 where d_date between DATE1 and DATE2);

Spark throws an error:

`pyspark.sql.utils.AnalysisException: Delete by condition with subquery is not supported: Some(cs_sold_date_sk#777 IN (list#858 []))`

To work around this, I use the following code:

# Materialize the subquery result on the driver first.
dates = spark.sql("""
          select d_date_sk from date_dim 
            where d_date between '1999-09-18' and '1999-09-19'
          """).collect()

date_lst = [x['d_date_sk'] for x in dates]

# Inline the collected keys as a literal IN list; str(date_lst)[1:-1]
# strips the surrounding brackets from the Python list representation.
spark.sql(f"""
          delete from catalog_sales 
              where cs_sold_date_sk in 
                  ({str(date_lst)[1:-1]})
          """)

This then throws the final exception:

`pyspark.sql.utils.AnalysisException: Table does not support deletes: parquet file /.../.../catalog_sales`

Describe the solution you'd like
Once Iceberg is supported, the Parquet data source can be updated (INSERT, DELETE).
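
For illustration, here is a minimal sketch of what the DELETE could look like once catalog_sales is stored as an Iceberg table. The catalog name local, the warehouse path, and the tpcds namespace are hypothetical placeholders, and this assumes the Iceberg Spark runtime jar is on the classpath:

from pyspark.sql import SparkSession

# Hypothetical session config; the "local" catalog name and warehouse
# path are placeholders, not values from this issue.
spark = (
    SparkSession.builder
    # Iceberg's SQL extensions enable row-level DELETE/UPDATE planning.
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# With the fact table stored as Iceberg, the original TPC-DS
# Data Maintenance DELETE (subquery included) should run unchanged.
spark.sql("""
    delete from local.tpcds.catalog_sales
        where cs_sold_date_sk in
            (select d_date_sk
                 from local.tpcds.date_dim
                 where d_date between '1999-09-18' and '1999-09-19')
""")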

Describe alternatives you've considered
I don't know if Delta Lake is also an option.
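
If Delta Lake were used instead, a rough sketch of the same workaround (assuming the delta-spark package is installed and catalog_sales has already been converted to Delta format; the table path is a hypothetical placeholder, and this is untested here) might look like:

from delta.tables import DeltaTable

# Collect the date surrogate keys on the driver, as in the Parquet
# workaround above.
dates = spark.sql("""
          select d_date_sk from date_dim 
            where d_date between '1999-09-18' and '1999-09-19'
          """).collect()
date_lst = [x['d_date_sk'] for x in dates]

dt = DeltaTable.forPath(spark, "/data/tpcds/catalog_sales")
# DeltaTable.delete accepts a SQL-expression condition string.
dt.delete(f"cs_sold_date_sk in ({str(date_lst)[1:-1]})")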

@wjxiz1992 wjxiz1992 added feature request New feature or request ? - Needs Triage Need team to review and classify labels May 17, 2022
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label May 24, 2022