Md parser #238

Ceceliachenen · 2024-09-29T06:28:45Z

No description provided.

src/pai_rag/integrations/readers/pai_pdf_reader.py

github-actions · 2024-09-29T10:04:00Z

☂️ Python Coverage

current status: ✅

Overall Coverage

Lines	Covered	Coverage	Threshold	Status
7150	4046	57%	40%	🟢

New Files

No new covered files...

Modified Files

File	Coverage	Status
src/pai_rag/integrations/index/pai/multimodal/multimodal_retriever.py	56%	🟢
src/pai_rag/integrations/nodeparsers/base.py	74%	🟢
src/pai_rag/integrations/nodeparsers/pai/pai_node_parser.py	86%	🟢
src/pai_rag/integrations/readers/pai/pai_data_reader.py	61%	🟢
src/pai_rag/integrations/readers/pai_image_reader.py	54%	🟢
src/pai_rag/integrations/readers/pai_pdf_reader.py	82%	🟢
TOTAL	69%	🟢

updated for commit: 09dd7fe by action🐍

src/pai_rag/integrations/nodeparsers/base.py

moria97 · 2024-09-30T00:46:12Z

src/pai_rag/integrations/nodeparsers/base.py

-                    start = pos + 1 - self.chunk_overlap_size
+                start = pos + 1
+
+    def _build_nodes_from_split(


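For the split here, I suggest adding a few test cases, for example with our test PDF or Markdown files, to verify that the number and content of the resulting chunks match expectations. You could use the pai-rec Markdown and some other PDFs.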

Added tests for pai_document.
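A minimal sketch of the kind of check the review asks for: assert the chunk count and confirm that the chunks reassemble into the original text. `split_by_size` is a hypothetical stand-in splitter with no overlap, not the PR's node parser, and `check_split` is an illustrative helper name.

```python
def split_by_size(text: str, chunk_size: int) -> list[str]:
    """Hypothetical stand-in splitter: fixed-size chunks with no overlap."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def check_split(text: str, chunk_size: int, expected_count: int) -> list[str]:
    """Verify chunk count, chunk size limit, and that no content is lost or duplicated."""
    chunks = split_by_size(text, chunk_size)
    assert len(chunks) == expected_count
    assert all(len(chunk) <= chunk_size for chunk in chunks)
    # Without overlap, joining the chunks must reproduce the input exactly.
    assert "".join(chunks) == text
    return chunks


chunks = check_split("abcdefghij", chunk_size=4, expected_count=3)
```

With a real parser the same three assertions would run against a fixture document (such as a known Markdown or PDF file) and an expected chunk count recorded for it.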

src/pai_rag/integrations/readers/pai/pai_data_reader.py

moria97 · 2024-09-30T00:48:53Z

src/pai_rag/integrations/readers/pai_pdf_reader.py

+                headers={
+                    "x-oss-object-acl": "public-read"
+                },  # set public read to make image accessible
+                path_prefix=f"pairag/pdf_images/{pdf_name.strip()}/",


What value does pdf_name take here?

Could it contain spaces or similar characters? I'm not sure whether that would cause problems.

Spaces get replaced.
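The thread does not show how the spaces are replaced, so the following is only a sketch of one way to do it. The helper names `safe_path_segment` and `image_path_prefix` and the choice of underscore as the replacement character are assumptions, not taken from the PR.

```python
import re


def safe_path_segment(name: str) -> str:
    """Trim the name and collapse each run of whitespace into one underscore (assumed policy)."""
    return re.sub(r"\s+", "_", name.strip())


def image_path_prefix(pdf_name: str) -> str:
    """Build the image path prefix in the shape shown in the diff above."""
    return f"pairag/pdf_images/{safe_path_segment(pdf_name)}/"


prefix = image_path_prefix("  my report 2024 ")
```

Normalizing the name in one helper keeps the key layout predictable regardless of how the source file was named.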

moria97 · 2024-09-30T11:02:12Z

src/pai_rag/integrations/nodeparsers/base.py

+            if re.match(ALT_REGEX_PATTERN, alt_text):
+                image_urls_positions.append(img_info)
+
+                raw_section_without_image = raw_section_without_image.replace(


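The replace here is fairly heavy. It would be better to splice based on the matched image: take the content before the image plus the content after it, which is equivalent to replacing that image. That way there won't be bugs.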
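The splicing approach suggested here could look roughly like the sketch below. `IMAGE_PATTERN` is a simplified, hypothetical Markdown image regex and `strip_images` is an illustrative name; the PR's own patterns (such as `ALT_REGEX_PATTERN`) are not shown in this thread, so the alt-text filter is left out.

```python
import re

# Hypothetical pattern for Markdown images: ![alt](url)
IMAGE_PATTERN = re.compile(r"!\[(?P<alt>[^\]]*)\]\((?P<url>[^)\s]+)\)")


def strip_images(section: str) -> tuple[str, list[dict]]:
    """Remove each matched image by joining the text before and after its span."""
    pieces: list[str] = []
    images: list[dict] = []
    cursor = 0
    for match in IMAGE_PATTERN.finditer(section):
        # Keep the text between the previous image and this one.
        pieces.append(section[cursor:match.start()])
        images.append(
            {"alt": match.group("alt"), "url": match.group("url"), "start": match.start()}
        )
        cursor = match.end()
    # Keep whatever follows the last image.
    pieces.append(section[cursor:])
    return "".join(pieces), images


text, found = strip_images("intro ![fig](a.png) middle ![fig](a.png) end")
```

Because each image is cut out by its own match span, two images with identical Markdown are handled independently, and the section is scanned once instead of once per `str.replace` call.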


Ceceliachenen added 3 commits September 29, 2024 14:23

pdf_reader & md_parser

9ca743d

pdf_reader & md_parser

8f065d8

pdf_reader & md_parser

44db327