Improve documentation for format OPTIONS clause #15708

Merged: 9 commits, Apr 19, 2025

marvelshan
Contributor

Which issue does this PR close?

Rationale for this change

This PR adds documentation for the OPTIONS clause, including generic options and format-specific options, to ensure users have clear guidance on available settings.

What changes are included in this PR?

  • Renamed write_options.md to format_options.md to reflect its scope for both reading and writing.
  • Added examples for each format (JSON, CSV, Parquet) in the OPTIONS clause.
  • Documented generic options and format-specific options in a structured format.
  • Updated heading levels to ensure all sections appear in the table of contents.

Are these changes tested?

No. Documentation changes do not require automated tests.

Are there any user-facing changes?

Yes, the documentation now includes detailed examples and descriptions of the OPTIONS clause for CREATE EXTERNAL TABLE and COPY queries.

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Apr 14, 2025
@github-actions github-actions bot added the core Core DataFusion crate label Apr 14, 2025
alamb (Contributor) commented Apr 14, 2025

Run extended tests

alamb (Contributor) left a comment

Thank you so much @marvelshan -- this is a nice improvement

I left a few comments -- let me know what you think. It would be great to put the NULL option into the csv options and fix the examples so they work in this PR


# Format Options

DataFusion supports customizing how data is read from or written to disk as a result of a `COPY`, `INSERT INTO`, or `CREATE EXTERNAL TABLE` query. There are a few special options, file format (e.g., CSV or Parquet) specific options, and Parquet column-specific options. Options can also in some cases be specified in multiple ways with a set order of precedence.

Contributor

Suggested change

Original:
DataFusion supports customizing how data is read from or written to disk as a result of a `COPY`, `INSERT INTO`, or `CREATE EXTERNAL TABLE` query. There are a few special options, file format (e.g., CSV or Parquet) specific options, and Parquet column-specific options. Options can also in some cases be specified in multiple ways with a set order of precedence.

Proposed:
DataFusion supports customizing how data is read from or written to disk as a result of a `COPY`, `INSERT INTO`, or `CREATE EXTERNAL TABLE` statement. There are a few special options, file format (e.g., CSV or Parquet) specific options, and Parquet column-specific options. In some cases, options can be specified in multiple ways with a set order of precedence.

Comment on lines 26 to 31
Format-related options can be specified in the following ways:

- Session-level config defaults
- `CREATE EXTERNAL TABLE` options
- `COPY` option tuples

Contributor

I think it would be helpful to explicitly specify the order of precedence here. Something like

Suggested change

Original:
Format-related options can be specified in the following ways:
- Session-level config defaults
- `CREATE EXTERNAL TABLE` options
- `COPY` option tuples

Proposed:
Format-related options can be specified in three ways, in decreasing order of precedence:
- `CREATE EXTERNAL TABLE` syntax
- `COPY` option tuples
- Session-level config defaults
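
To illustrate how a statement-level option takes precedence over a session-level default, here is a minimal sketch. It assumes a hypothetical table `source_table` and a hypothetical output path, and it assumes the session setting `datafusion.execution.parquet.compression` is the default that the `COPY` option tuple overrides:

```sql
-- Session-level default: used by Parquet writes that do not specify otherwise.
SET datafusion.execution.parquet.compression = 'snappy';

-- Statement-level option: expected to override the session default
-- for this COPY only. `source_table` and the path are hypothetical.
COPY source_table
  TO '/tmp/table_with_options/'
  STORED AS PARQUET
  OPTIONS ('COMPRESSION' 'zstd(5)');
```

Under the precedence described above, the files written by this `COPY` should use zstd rather than snappy, while later writes without an explicit option would fall back to the session default.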


| Option | Description | Default Value |
| ---------- | ------------------------------------------------------------- | ---------------- |
| NULL_VALUE | Sets the string which should be used to indicate null values. | arrow-rs default |

Contributor

I think this is a CSV specific option (not a generic option)

For example

> create external table my_table(a int) stored as JSON location '/tmp/foo' options('NULL_VALUE' 'NULL');
Invalid or Unsupported Configuration: Config value "null_value" not found on JsonOptions
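
By contrast, the same option should be accepted when the table is stored as CSV. A minimal sketch, reusing the hypothetical table name and path from the failing statement above and assuming `NULL_VALUE` is recognized as a CSV format option:

```sql
-- Same statement as the failing one, but STORED AS CSV,
-- where NULL_VALUE is expected to be a valid format option.
CREATE EXTERNAL TABLE my_table(a int)
STORED AS CSV
LOCATION '/tmp/foo/'
OPTIONS ('NULL_VALUE' 'NULL');
```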

@marvelshan
Contributor Author

Thank you for the suggestions!
I realized there were several places I hadn’t tested thoroughly. After some testing, I’ve made updates and pushed a new version.
I also wanted to check with you: since there are quite a few options available in `config.rs`, should we list only a few key ones in the documentation, or aim to include all of them? I’m wondering if listing everything might make the docs overly complex.

alamb changed the title from "doc/document options clause" to "Improve documentation for format OPTIONS clause" on Apr 16, 2025
alamb (Contributor) left a comment

> I also wanted to check with you: since there are quite a few available options on file config.rs, should we just list a few key ones in the documentation, or aim to include all of them? I’m wondering if listing everything might make the docs overly complex.

I think we should include all of the options in the documentation as a reference. While I agree there are a lot of options, I think it would be best for the documentation to reflect that (complex) reality.

| DICTIONARY_PAGE_SIZE_LIMIT | No | Sets best effort maximum dictionary page size in bytes |
| CREATED_BY | No | Sets the "created by" property in the parquet file |
| COLUMN_INDEX_TRUNCATE_LENGTH | No | Sets the max length of min/max value fields in the column index. |
| DATA_PAGE_ROW_COUNT_LIMIT | No | Sets best effort maximum number of rows in a data page. |

Contributor

We seem to have lost some of these options in the new doc

alamb (Contributor) left a comment

Thank you so much for working on this @marvelshan -- I think it is looking really nice now

```sql
OPTIONS('DELIMITER' '|', 'HAS_HEADER' 'true', 'NEWLINES_IN_VALUES' 'true');
```

## Parquet Format Options

Contributor

👍 this looks great


The following options are available when executing a `COPY` query.

| Option Key | Description | Default Value |

Contributor

This largely replicates what is already in this page: https://datafusion.apache.org/user-guide/configs.html

I think the idea is that the format_options page will describe options specific to formats -- I will push a small commit to fix

Contributor Author

Thank you very much for your review! I truly appreciate you taking the time to make the final adjustments. I’ve learned a lot of valuable details from this process. If there’s anything else that needs further revision, please don’t hesitate to let me know—I’d be more than happy to make the necessary changes.

Contributor

Thank you for helping push this along @marvelshan -- it takes non trivial effort to make these docs

alamb merged commit f07fb10 into apache:main on Apr 19, 2025
29 checks passed
nirnayroy pushed a commit to nirnayroy/datafusion that referenced this pull request May 2, 2025
* doc:Add documentation for OPTIONS clause syntax

* doc:rename write_options.md to format_options.md and clarify its scope for both reading and writing

* doc: change dml.md, cuz still have wrong write_options filename

* doc: update doctest reference to renamed format_options.md

* docs: update and correct format options documentation

* doc: add more information of options content

* remove execution settings, move note about insert

* wordsmith example

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>