fix: Handle the "Null characters are not allowed" error raised when a PDF contains \0 characters #1640

Merged: 1 commit, Nov 18, 2024
10 changes: 10 additions & 0 deletions apps/common/handle/impl/pdf_split_handle.py
@@ -104,6 +104,9 @@ def handle_pdf_content(file, pdf_document):

content += page_content

# Null characters are not allowed.
content = content.replace('\0', '')

elapsed_time = time.time() - start_time
max_kb.debug(
f"File: {file.name}, Page: {page_num + 1}, Time : {elapsed_time: .3f}s, content-length: {len(page_content)}")
@@ -156,6 +159,10 @@ def handle_toc(doc, limit):
text = text[:idx]

chapter_text += text  # extract the text

# Null characters are not allowed.
chapter_text = chapter_text.replace('\0', '')

# Limit the chapter content length
if 0 < limit < len(chapter_text):
split_text = PdfSplitHandle.split_text(chapter_text, limit)
@@ -228,6 +235,9 @@ def handle_links(doc, pattern_list, with_filter, limit):
text = text[:idx]
chapter_text += text

# Null characters are not allowed.
chapter_text = chapter_text.replace('\0', '')

# Limit the chapter content length
if 0 < limit < len(chapter_text):
split_text = PdfSplitHandle.split_text(chapter_text, limit)
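
All three hunks apply the same one-line fix: remove every `\0` from the accumulated text before it is used further. A minimal standalone sketch of that step is below. The helper name `strip_null_chars` and the sample string are hypothetical and are not part of this PR, which inlines the `replace` call in each function.

```python
def strip_null_chars(text: str) -> str:
    """Return `text` with every NUL (U+0000) character removed.

    Hypothetical helper for illustration; the PR itself calls
    str.replace('\0', '') inline in each of the three functions.
    """
    return text.replace('\0', '')


# Made-up sample imitating extracted PDF text with embedded NUL characters.
sample = "Chapter 1\x00: Intro\x00duction"
cleaned = strip_null_chars(sample)

print(repr(cleaned))  # 'Chapter 1: Introduction'
```

Factoring the call into one helper would keep the three call sites consistent, though whether that is worth it for a single `replace` is a judgment call for the maintainers.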