Improve documentation for format OPTIONS clause #15708
…e for both reading and writing
Thank you so much @marvelshan -- this is a nice improvement
I left a few comments -- let me know what you think. It would be great to put the NULL
option into the csv options and fix the examples so they work in this PR
# Format Options

DataFusion supports customizing how data is read from or written to disk as a result of a `COPY`, `INSERT INTO`, or `CREATE EXTERNAL TABLE` query. There are a few special options, file format (e.g., CSV or Parquet) specific options, and Parquet column-specific options. Options can also in some cases be specified in multiple ways with a set order of precedence.
Suggested change:
DataFusion supports customizing how data is read from or written to disk as a result of a `COPY`, `INSERT INTO`, or `CREATE EXTERNAL TABLE` statement. There are a few special options, file format (e.g., CSV or Parquet) specific options, and Parquet column-specific options. In some cases, options can be specified in multiple ways with a set order of precedence.
Format-related options can be specified in the following ways:

- Session-level config defaults
- `CREATE EXTERNAL TABLE` options
- `COPY` option tuples
I think it would be helpful to explicitly specify the order of precedence here. Something like:
Format-related options can be specified in three ways, in decreasing order of precedence:
- `CREATE EXTERNAL TABLE` syntax
- `COPY` option tuples
- Session-level config defaults
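
The ordering proposed in this comment can be pictured as a layered lookup in which the first layer that defines a key wins. The following is a minimal Python sketch of that idea, not DataFusion code: the function `resolve_option` and the sample keys and values are hypothetical, and the layer order simply follows the list in this suggestion.

```python
from collections import ChainMap


def resolve_option(key, create_table_opts, copy_opts, session_defaults):
    """Return the value from the highest-precedence layer that defines `key`.

    ChainMap searches its maps from first to last, so the argument order
    encodes the precedence: table options, then COPY options, then defaults.
    Returns None when no layer defines the key.
    """
    layers = ChainMap(create_table_opts, copy_opts, session_defaults)
    return layers.get(key)


# Hypothetical sample layers, for illustration only.
session_defaults = {"compression": "zstd(3)", "has_header": "false"}
copy_opts = {"compression": "snappy"}
create_table_opts = {"has_header": "true"}

compression = resolve_option("compression", create_table_opts, copy_opts, session_defaults)
has_header = resolve_option("has_header", create_table_opts, copy_opts, session_defaults)
delimiter = resolve_option("delimiter", create_table_opts, copy_opts, session_defaults)
```

Here `compression` resolves from the `COPY` layer and `has_header` from the table layer, while `delimiter` is defined nowhere and comes back as `None`.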

| Option     | Description                                                   | Default Value    |
| ---------- | ------------------------------------------------------------- | ---------------- |
| NULL_VALUE | Sets the string which should be used to indicate null values. | arrow-rs default |
I think this is a CSV specific option (not a generic option)
For example
> create external table my_table(a int) stored as JSON location '/tmp/foo' options('NULL_VALUE' 'NULL');
Invalid or Unsupported Configuration: Config value "null_value" not found on JsonOptions
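
The error above suggests that option keys are checked against the set defined for the chosen file format. Below is a minimal Python sketch of that kind of per-format validation, under stated assumptions: `FORMAT_OPTION_KEYS` and `validate_options` are hypothetical names, the key sets are deliberately incomplete, and the error text only loosely mirrors the message quoted in this comment.

```python
# Hypothetical, deliberately incomplete key sets; not DataFusion's real option lists.
FORMAT_OPTION_KEYS = {
    "csv": {"null_value", "delimiter", "has_header"},
    "json": {"compression"},
}


def validate_options(file_format, options):
    """Return options with lower-cased keys, or raise ValueError on an unknown key."""
    allowed = FORMAT_OPTION_KEYS[file_format.lower()]
    normalized = {}
    for key, value in options.items():
        lowered = key.lower()
        if lowered not in allowed:
            raise ValueError(
                f'Config value "{lowered}" not found on {file_format} options'
            )
        normalized[lowered] = value
    return normalized


# Accepted: NULL_VALUE is in the (assumed) CSV key set.
csv_options = validate_options("csv", {"NULL_VALUE": "NULL"})
```

Calling `validate_options("json", {"NULL_VALUE": "NULL"})` raises `ValueError` in this sketch, which is the shape of failure that makes `NULL_VALUE` a CSV-specific option and not a generic one.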
Thank you for the suggestions!
I also wanted to check with you: since there are quite a few available options on file config.rs, should we just list a few key ones in the documentation, or aim to include all of them? I’m wondering if listing everything might make the docs overly complex.
I think we should include all of the options in the documentation as a reference. While I agree there are a lot of options, I think it would be best for the documentation to reflect that (complex) reality.
| DICTIONARY_PAGE_SIZE_LIMIT | No | Sets best effort maximum dictionary page size in bytes |
| CREATED_BY | No | Sets the "created by" property in the parquet file |
| COLUMN_INDEX_TRUNCATE_LENGTH | No | Sets the max length of min/max value fields in the column index. |
| DATA_PAGE_ROW_COUNT_LIMIT | No | Sets best effort maximum number of rows in a data page. |
We seem to have lost some of these options in the new doc
Thank you so much for working on this @marvelshan -- I think it is looking really nice now
```sql
OPTIONS('DELIMITER' '|', 'HAS_HEADER' 'true', 'NEWLINES_IN_VALUES' 'true');
```

## Parquet Format Options
👍 this looks great

The following options are available when executing a `COPY` query.

| Option Key | Description | Default Value |
This largely replicates what is already in this page: https://datafusion.apache.org/user-guide/configs.html
I think the idea is that the format_options page will describe options specific to formats -- I will push a small commit to fix
Thank you very much for your review! I truly appreciate you taking the time to make the final adjustments. I’ve learned a lot of valuable details from this process. If there’s anything else that needs further revision, please don’t hesitate to let me know—I’d be more than happy to make the necessary changes.
Thank you for helping push this along @marvelshan -- it takes non trivial effort to make these docs
* doc:Add documentation for OPTIONS clause syntax
* doc:rename write_options.md to format_options.md and clarify its scope for both reading and writing
* doc: change dml.md, cuz still have wrong write_options filename
* doc: update doctest reference to renamed format_options.md
* docs: update and correct format options documentation
* doc: add more information of options content
* remove execution settings, move note about insert
* wordsmith example

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Which issue does this PR close?

- `CREATE EXTERNAL TABLE ... OPTIONS` #10451

Rationale for this change

This PR adds documentation for the `OPTIONS` clause, including generic options and format-specific options, to ensure users have clear guidance on available settings.

What changes are included in this PR?

- Renamed `write_options.md` to `format_options.md` to reflect its scope for both reading and writing.
- Added documentation for the `OPTIONS` clause.

Are these changes tested?

These changes do not require automated tests.

Are there any user-facing changes?

Yes, the documentation now includes detailed examples and descriptions of the `OPTIONS` clause for `CREATE EXTERNAL TABLE` and `COPY` queries.