ENH adding metadata argument to DataFrame.to_parquet #20521
cc @cpcloud What's the purpose here? Would this be in addition to, or in place of, the usual pandas metadata?
The user-given dictionary updates the file's current key-value metadata. If the user passes a "pandas" key, it overwrites the pandas metadata, but a warning is issued via warnings.warn. There are several situations where user metadata is needed. For me it is a very important feature and one of the main reasons I want to switch to parquet.
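To make the proposal concrete, here is a rough sketch of the call this PR would enable (the metadata keyword and the example keys come from the PR's proposal and are hypothetical, not arguments in released pandas):

import pandas

df = pandas.DataFrame({"a": [1]})
# hypothetical keyword proposed in this PR, not part of released pandas
df.to_parquet("data.parquet", metadata={"origin": "sensor-42"})
# passing a "pandas" key would overwrite the pandas metadata and emit a warning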
That all sounds reasonable.
Slight cosmetic suggestion: the code could be a bit more Pythonic.
Added a whatsnew entry and rebased onto current master.
Note for readers: the PR was closed, but it mentions a workaround that can be used for now if you need this: #20534 (comment)
I have been thinking about this and am wondering what the general thoughts are on using DataFrame.attrs and Series.attrs for reading and writing metadata to/from parquet. For example, here is how the metadata would be written:

pdf = pandas.DataFrame({"a": [1]})
pdf.attrs = {"name": "my custom dataset"}
pdf.a.attrs = {"long_name": "Description about data", "nodata": -1, "units": "metre"}
pdf.to_parquet("file.parquet")

Then, when loading in the data:

pdf = pandas.read_parquet("file.parquet")
pdf.attrs
pdf.a.attrs

Is this something that would need to be done in pandas or in pyarrow/fastparquet? EDIT: Added an issue to pyarrow here.
Here is a hack to get the attrs to work with pyarrow:

import json

import pandas
import pyarrow
import pyarrow.parquet


def _write_attrs(table, pdf):
    # Merge DataFrame.attrs and per-column Series.attrs into the
    # "pandas" entry of the Arrow schema metadata.
    schema_metadata = table.schema.metadata or {}
    pandas_metadata = json.loads(schema_metadata.get(b"pandas", "{}"))
    column_attrs = {}
    for col in pdf.columns:
        attrs = pdf[col].attrs
        if not attrs or not isinstance(col, str):
            continue
        column_attrs[col] = attrs
    pandas_metadata.update(
        attrs=pdf.attrs,
        column_attrs=column_attrs,
    )
    schema_metadata[b"pandas"] = json.dumps(pandas_metadata)
    return table.replace_schema_metadata(schema_metadata)


def _read_attrs(table, pdf):
    # Restore DataFrame.attrs and per-column Series.attrs from the
    # "pandas" entry of the Arrow schema metadata.
    schema_metadata = table.schema.metadata or {}
    pandas_metadata = json.loads(schema_metadata.get(b"pandas", "{}"))
    pdf.attrs = pandas_metadata.get("attrs", {})
    col_attrs = pandas_metadata.get("column_attrs", {})
    for col in pdf.columns:
        # attach the stored attrs to each column
        pdf[col].attrs = col_attrs.get(col, {})


def to_parquet(pdf, filename):
    # Write a parquet file with the attrs embedded in the file metadata.
    table = pyarrow.Table.from_pandas(pdf)
    table = _write_attrs(table, pdf)
    pyarrow.parquet.write_table(table, filename)


def read_parquet(filename):
    # Read a parquet file and restore the embedded attrs.
    table = pyarrow.parquet.read_pandas(filename)
    pdf = table.to_pandas()
    _read_attrs(table, pdf)
    return pdf

Example. Writing:

pdf = pandas.DataFrame({"a": [1]})
pdf.attrs = {"name": "my custom dataset"}
pdf.a.attrs = {"long_name": "Description about data", "nodata": -1, "units": "metre"}
to_parquet(pdf, "a.parquet")

Reading:

pdf = read_parquet("a.parquet")
pdf.attrs
pdf.a.attrs
I have a PR that seems to do the trick: #41545
Ideally, I think this would actually be done in pyarrow/fastparquet, as those are the libraries where the "pandas" metadata item currently gets constructed.
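For context, a minimal sketch (assuming pyarrow is installed) of where that "pandas" metadata item lives, namely under the b"pandas" key of the Arrow schema metadata:

import json

import pandas
import pyarrow

table = pyarrow.Table.from_pandas(pandas.DataFrame({"a": [1]}))
# pyarrow attaches its pandas-specific metadata under the b"pandas" key
print(json.loads(table.schema.metadata[b"pandas"]).keys())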
Use a workaround until this ENH is implemented: pandas-dev/pandas#20521
So... can we have something simple that works with df.attrs? The goal is to replace the many pseudo-CSV formats that add #-prefixed comments at the beginning of a file with something systematic. I believe everyone would agree that this is 1) a common use case, 2) supportable by parquet, and 3) something that should work without hassle for the reader (I'm OK with hassle for the writer).
Yes, and a contribution to add this functionality is welcome, I think. And a PR to add generic parquet file-level metadata through a dedicated keyword would also be welcome.
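For reference, a minimal sketch of attaching generic file-level metadata with existing pyarrow APIs today, without any pandas changes (the b"owner" key is just an illustrative example):

import pandas
import pyarrow
import pyarrow.parquet

df = pandas.DataFrame({"a": [1]})
table = pyarrow.Table.from_pandas(df)
# merge a custom key into the schema metadata, preserving the b"pandas"
# entry pyarrow already wrote; it is stored as parquet key-value metadata
merged = {**(table.schema.metadata or {}), b"owner": b"data-team"}
pyarrow.parquet.write_table(table.replace_schema_metadata(merged), "file.parquet")
print(pyarrow.parquet.read_schema("file.parquet").metadata[b"owner"])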
Edit: this ⬇️ is no longer needed.

import fastparquet
import pandas as pd

df = pd.DataFrame({"a": [1]})
path = "file.parquet"

# write
df.to_parquet(path)
meta = {"foo": "bar"}
fastparquet.update_file_custom_metadata(path, meta)

# read
pf = fastparquet.ParquetFile(path)
df_ = pf.to_pandas()
meta_ = pf.key_value_metadata
This is done.
Code Sample, a copy-pastable example if possible

Please consider merging

master...JacekPliszka:master

Problem description

Currently pandas cannot add custom metadata to a parquet file. This patch adds a metadata argument to DataFrame.to_parquet that allows for that. A warning is issued when a "pandas" key is present in the dictionary passed.