[BUG] %%pretty with Chinese character error #767
I experience the same issue.

I experience the same issue too. When I want to display Chinese characters, it returns an error. Thanks for any help; it really confuses me TAT

This issue should be fixed by this PR, which I just released as part of the 0.20.4 release. I'm marking this as resolved for now, but please let me know if this is not the case after you upgrade.

Thanks for your reply! I found it is correct in the Livy server, but it returns an error in the notebook display. Maybe you can use this dataframe to reproduce the problem:

```python
df = spark.createDataFrame([("a", "你好"), ("b", "你好")], ("key", "value"))
df.show(5)
```

and these are my versions.

Thanks for the code snippet @baixinzxl. I will investigate once I have bandwidth in the coming weeks. Contributions are welcome if you want to dive into the code!

Thanks, take your time~

Sorry for disturbing, but I wonder if there are any findings about the problem?

Hey @baixinzxl, I haven't forgotten about this. I've been stretched for time and have tried to tackle this twice without success. The relevant code is in this file if you want to take a stab at it!

I am using sparkmagic 0.21.0. Same issue when showing a dataframe with a Japanese column.

@dosu how would you update https://github.com/jupyter-incubator/sparkmagic/blob/master/sparkmagic/sparkmagic/utils/dataframe_parser.py to handle characters with varying widths like Chinese and Japanese?
Hey @devstein! 👋 I'm a bot designed to assist with bugs, questions, and contributions while you wait for a human maintainer. To update `dataframe_parser.py` to handle characters with varying display widths, such as Chinese and Japanese, you can use the `wcwidth` library. Here is how you could modify the file:

```python
import re
from collections import OrderedDict
from functools import partial

from wcwidth import wcswidth


def extractors(header_top, header_content):
    """Creates functions to pull column values out of Spark DF rows.

    Based on the top of a Dataframe header, identifies start and end index
    for each column value.

        012345678901
        +---+------+
        | id|animal|
        +---+------+
        |  1|   cat|
        |  2| mouse|
        |  3| horse|
        +---+------+

    For example, the `id` column is row[0:4] and `animal` is row[4:11].

    :param header_top: The header border top, comprising `+` and `-`
        characters demarcating the different columns, e.g. `+---+------+`
    :param header_content: The string following the header_top, containing
        the actual column names, e.g. `| id|animal|`
    :return: A dict of column: function that can be applied to a string-row
        representation of a Dataframe, e.g. `|  1|   cat|`.
        In our example:
        {'id': lambda row: row[0:4], 'animal': lambda row: row[4:11]}
    """
    header_pluses = list(re.finditer(r"\+", header_top))
    # Measure each column by display width (wcswidth) rather than by
    # character count, so wide (CJK) characters are accounted for.
    column_widths = [
        wcswidth(header_content[start.end():end.start()])
        for start, end in zip(header_pluses, header_pluses[1:])
    ]

    def _extract(l, r, row, offset=0):
        return row[offset + l : offset + r].strip()

    def _extractor_iter():
        start = 0
        for width in column_widths:
            end = start + width
            yield partial(_extract, start, end)
            start = end

    return OrderedDict((x(header_content), x) for x in _extractor_iter())


class DataframeHtmlParser:
    """Parses a Spark Dataframe and presents it as an HTML table."""

    # header_top_pattern and header_content_pattern are defined elsewhere
    # in dataframe_parser.py.
    header_top_r = re.compile(header_top_pattern)
    header_content_r = re.compile(header_content_pattern)

    def __init__(self, cell, start=0, end=None):
        """Creates a Dataframe parser for a single dataframe.

        :param cell: The evaluated output of a cell.
            A cell can contain more than one dataframe, but a single
            DataframeHtmlParser can only parse table headers/rows for a
            single dataframe in the substring cell[start:end].
        """
        self.cell_contents = cell
        end = end or len(self.cell_contents)
        header_spans = DataframeHtmlParser.header_top_r.finditer(
            self.cell_contents, start, end
        )
        parts = {
            "header_top": next(header_spans).span(),
            "header_content": DataframeHtmlParser.header_content_r.search(
                self.cell_contents, start, end
            ).span(),
            "header_bottom": next(header_spans).span(),
            "footer": next(header_spans).span(),
        }
        self.header_content_span = parts["header_content"]
        header_content = self._cell_span(self.header_content_span)
        # Compare rows by display width, not character count.
        self.expected_width = wcswidth(header_content.strip())
        header_top = self._cell_span(parts["header_top"])
        self.extractors = extractors(header_top.strip(), header_content.strip())
        # The content is between the header-bottom and the footer
        self.content_span = (parts["header_bottom"][1], parts["footer"][0])

    def _cell_span(self, span):
        s, e = span
        return self.cell_contents[s:e]

    def _rowspan_iter(self):
        """Extracts each row from the contents of a Dataframe."""
        row_delimiters = re.compile(r"\n").finditer(
            self.cell_contents, self.content_span[0], self.content_span[1]
        )
        start = self.content_span[0]
        for row_delimiter in row_delimiters:
            end, next_start = row_delimiter.span()[0], row_delimiter.span()[1]
            yield (start, end)
            start = next_start

    def row_iter(self, transform=None):
        """Extracts and transforms each row from a Dataframe.

        Defaults to converting a row to a dict {colName: value}.
        """
        _transform = transform or (
            lambda r: {col: x(r) for col, x in self.extractors.items()}
        )
        for rowspan in self._rowspan_iter():
            row = self._cell_span(rowspan).strip()
            if wcswidth(row) != self.expected_width:
                raise ValueError(
                    """Expected DF rows to be uniform width (%d)
                    but found %s (%d)"""
                    % (self.expected_width, row, wcswidth(row))
                )
            yield _transform(row)

    def to_table(self):
        """Converts the contents of a notebook cell to an HTML table."""
        header_content = self._cell_span(self.header_content_span)
        table_header_html = self._to_tr(header_content.strip(), is_header=True)
        table_row_iter = self.row_iter(transform=self._to_tr)
        table_body = "".join([r for r in table_row_iter])
        return "<table>%s%s</table>" % (table_header_html, table_body)

    def _to_tr(self, row, is_header=False):
        """Converts a Spark dataframe row to an HTML row."""
        tag = "th" if is_header else "td"
        row_content = [x(row) for x in self.extractors.values()]
        row_html = "".join(
            [
                '<%s><pre style="word-break: unset; background-color: unset;">%s</pre></%s>'
                % (tag, rc, tag)
                for rc in row_content
            ]
        )
        return "<tr>%s</tr>" % row_html
```

These changes help the parser handle characters with varying display widths, such as Chinese and Japanese, by using the `wcswidth` function to measure display width instead of the raw character count.
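For context on why a width-aware measure matters here: `len()` counts code points, while terminals (and Spark's ASCII tables) render CJK characters two columns wide. A minimal standard-library sketch of the idea (`display_width` is a hypothetical stand-in that roughly approximates `wcwidth.wcswidth`):

```python
import unicodedata

def display_width(s):
    # Count East Asian "W" (wide) and "F" (fullwidth) characters as two
    # display columns, everything else as one -- a rough approximation
    # of what wcwidth.wcswidth() computes.
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in s
    )

print(len("cat"), display_width("cat"))  # 3 3
print(len("你好"), display_width("你好"))  # 2 4
```

The mismatch in the second line (2 code points vs. 4 display columns) is exactly why fixed-index column slicing misaligns on Chinese and Japanese cells.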
**Describe the bug**

When using %%pretty, if there is a Chinese character in the table, the table cannot be shown properly and this error is returned:

```
An internal error was encountered.
Please file an issue at https://github.com/jupyter-incubator/sparkmagic
Error:
Expected DF rows to be uniform width (581)
```

**To Reproduce**

```
%%pretty
df.show()
```

**Versions:**