
[FEA] Support Iceberg for data INSERT, DELETE operations #5510

Closed
wjxiz1992 opened this issue May 17, 2022 · 0 comments · Fixed by #5941
Labels
feature request New feature or request

Comments

@wjxiz1992
Collaborator

wjxiz1992 commented May 17, 2022

Is your feature request related to a problem? Please describe.
TPC-DS is a popular and well-known database benchmark suite. According to its specification (Section 5; there is no official online link for this, so users have to download the TPC-DS tools to see the documentation), the "Data Maintenance" benchmark requires SQL that uses INSERT and DELETE.

An example from the specification:

5.3.8  Method 2: Sales and Returns Fact Table Delete
Delete rows from R with corresponding rows in S 
 where d_date between Date1 and Date2
Delete rows from S 
 where d_date between Date1 and Date2
...
...
...
5.3.11.10 DF_CS:
S=catalog_sales
R=catalog_returns
Date1 as generated by dsdgen
Date2 as generated by dsdgen

From this we can derive a SQL statement using DELETE:

delete from catalog_sales 
           where cs_sold_date_sk in 
                      (select d_date_sk 
                                 from date_dim 
                                 where d_date between DATE1 and DATE2);

Spark throws an error:

`pyspark.sql.utils.AnalysisException: Delete by condition with subquery is not supported: Some(cs_sold_date_sk#777 IN (list#858 []))`

To work around this, I use the following code:

# Materialize the subquery result on the driver first.
dates = spark.sql("""
          select d_date_sk from date_dim 
            where d_date between '1999-09-18' and '1999-09-19'
          """).collect()

date_lst = [x['d_date_sk'] for x in dates]

# Inline the collected keys as a literal IN list; str(date_lst)[1:-1]
# strips the surrounding brackets from the Python list representation.
spark.sql(f"""
          delete from catalog_sales 
              where cs_sold_date_sk in 
                  ({str(date_lst)[1:-1]})
          """)

This then throws the final exception:

`pyspark.sql.utils.AnalysisException: Table does not support deletes: parquet file /.../.../catalog_sales`

Describe the solution you'd like
Once Iceberg is supported, the Parquet data source can be updated (INSERT, DELETE).
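
For illustration, here is a minimal sketch of what the DELETE could look like once catalog_sales is stored as an Iceberg table. The catalog name local, the warehouse path, and the tpcds namespace are hypothetical placeholders, and this assumes the Iceberg Spark runtime jar is on the classpath:

from pyspark.sql import SparkSession

# Hypothetical session config; the "local" catalog name and warehouse
# path are placeholders, not values from this issue.
spark = (
    SparkSession.builder
    # Iceberg's SQL extensions enable row-level DELETE/UPDATE planning.
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# With the fact table stored as Iceberg, the original TPC-DS
# Data Maintenance DELETE (subquery included) should run unchanged.
spark.sql("""
    delete from local.tpcds.catalog_sales
        where cs_sold_date_sk in
            (select d_date_sk
                 from local.tpcds.date_dim
                 where d_date between '1999-09-18' and '1999-09-19')
""")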

Describe alternatives you've considered
I don't know if Delta Lake is also an option.
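
If Delta Lake were used instead, a rough sketch of the same workaround (assuming the delta-spark package is installed and catalog_sales has already been converted to Delta format; the table path is a hypothetical placeholder, and this is untested here) might look like:

from delta.tables import DeltaTable

# Collect the date surrogate keys on the driver, as in the Parquet
# workaround above.
dates = spark.sql("""
          select d_date_sk from date_dim 
            where d_date between '1999-09-18' and '1999-09-19'
          """).collect()
date_lst = [x['d_date_sk'] for x in dates]

dt = DeltaTable.forPath(spark, "/data/tpcds/catalog_sales")
# DeltaTable.delete accepts a SQL-expression condition string.
dt.delete(f"cs_sold_date_sk in ({str(date_lst)[1:-1]})")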

@wjxiz1992 wjxiz1992 added feature request New feature or request ? - Needs Triage Need team to review and classify labels May 17, 2022
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label May 24, 2022