Enable automatic URL linking #19110
How does this perform, especially in documents that contain a lot of text?
Also, we probably want a new option/preference to be able to disable this functionality.
web/pdf_page_view.js
Outdated
  }

  #processLinks() {
    return this.pdfPage.getTextContent().then(content => {
We absolutely cannot fetch the textContent twice for each rendered page, since that'll be really inefficient in general.
Besides, it isn't necessary since the textContent is already available once the textLayer has rendered; see
pdf.js/web/text_layer_builder.js
Line 99 in 079eb24
    this.highlighter?.setTextMapping(textDivs, textContentItemsStr);
pdf.js/web/text_highlighter.js
Lines 47 to 59 in 079eb24
  /**
   * Store two arrays that will map DOM nodes to text they should contain.
   * The arrays should be of equal length and the array element at each index
   * should correspond to the other. e.g.
   * `items[0] = "<span>Item 0</span>" and texts[0] = "Item 0";
   *
   * @param {Array<Node>} divs
   * @param {Array<string>} texts
   */
  setTextMapping(divs, texts) {
    this.textDivs = divs;
    this.textContentItemsStr = texts;
  }
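A minimal sketch of the reuse suggested above, assuming the link detection only needs the array of strings that the text layer already passes to `setTextMapping`. The function name `findUrlsInItems`, the simplified pattern, and the naive concatenation of items (no inserted spaces or normalization) are all hypothetical and not taken from the PR:

```javascript
// Hypothetical helper: scan already-available text-layer strings for URLs
// instead of calling getTextContent() a second time.
function findUrlsInItems(textContentItemsStr) {
  // Record where each item starts in the concatenated text, so a match
  // offset can be mapped back to the item (and thus the DOM node) it begins in.
  const starts = [];
  let text = "";
  for (const str of textContentItemsStr) {
    starts.push(text.length);
    text += str;
  }

  const results = [];
  // Simplified stand-in pattern; the PR's actual expression is more elaborate.
  for (const match of text.matchAll(/\bhttps?:\/\/[^\s<>]+/g)) {
    // Find the last item whose start offset is <= the match offset.
    let itemIndex = 0;
    while (
      itemIndex + 1 < starts.length &&
      starts[itemIndex + 1] <= match.index
    ) {
      itemIndex++;
    }
    results.push({ url: match[0], itemIndex });
  }
  return results;
}
```

Because everything here is synchronous, this kind of processing could run right after the text layer has rendered, with `itemIndex` pointing at the entry in `textDivs` where the URL begins.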
web/pdf_page_view.js
Outdated
  #processLinks() {
    return this.pdfPage.getTextContent().then(content => {
      const [text, diffs] = normalizedTextContent(content);
      const urlRegex = /\b(?:https?:\/\/|mailto:|www.)(?:[[\S--\[]--\p{P}]|\/|[\p{P}--\[]+[[\S--\[]--\p{P}])+/gmv;
Also, the regular expression should probably be created just once (and then cached) to avoid re-creating it for every page.
This might no longer be relevant since it's now a static field on the class.
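For illustration, a sketch of what creating the expression once as a static class field could look like. The class name `LinkFinder`, the method `findUrls`, and the simplified pattern are hypothetical stand-ins, not the PR's code:

```javascript
class LinkFinder {
  // Created once when the class is evaluated, then shared by every page.
  // Simplified stand-in pattern; the PR's actual expression is more elaborate.
  static #urlRegex = /\bhttps?:\/\/[^\s<>]+/g;

  static findUrls(text) {
    // String.prototype.matchAll iterates over an internal copy of the regex,
    // so the shared instance's lastIndex is not mutated between calls.
    return Array.from(text.matchAll(LinkFinder.#urlRegex), match => ({
      url: match[0],
      index: match.index,
    }));
  }
}
```

One thing to watch with a shared global (`g`) regex is its `lastIndex` state: calling `exec` or `test` on it directly would carry state from one page to the next, which is why the sketch goes through `matchAll`.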
web/pdf_page_view.js
Outdated
      annotationType: 2,
      annotationFlags: 4,
These should use actual constants, rather than hard-coded numbers.
Removed annotationFlags, and the only remaining hard-coded number is now annotationType. Ideally I would not assign any of these manually at all and would use a constructor instead.
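A rough sketch of the named-constants suggestion, assuming that the hard-coded 2 and 4 stand for the link annotation type and the print flag. The two frozen objects below are local stand-ins so that the example is self-contained; in the real code they would presumably be imported from wherever pdf.js defines `AnnotationType` and `AnnotationFlag`. The factory `createInferredLink` is hypothetical:

```javascript
// Local stand-ins (assumed values) for the shared pdf.js constants.
const AnnotationType = Object.freeze({ TEXT: 1, LINK: 2 });
const AnnotationFlag = Object.freeze({ INVISIBLE: 0x01, HIDDEN: 0x02, PRINT: 0x04 });

// Hypothetical factory: one place that builds the annotation-like object,
// so callers never assign raw numbers themselves.
function createInferredLink(url, rect) {
  return {
    annotationType: AnnotationType.LINK,
    annotationFlags: AnnotationFlag.PRINT,
    subtype: "Link",
    url,
    rect,
  };
}
```

Routing construction through a single function is one way to approach the "use a constructor" idea mentioned above without committing to a particular class hierarchy.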
a7d0024 to f3737b5
cad781f to 93e5417
I was a bit unsure whether I understood exactly what you were suggesting, but how does this commit look? It "fetches" the textContent from the previous render of the textLayer and makes the processing step sync.
@Snuffleupagus nvm my last comment, I figured it out after looking at
Automatically detect links in the text content of a file and generate link annotations at the appropriate locations, so that the detected links become hyperlinks.
517c74e to 7849bc0
      if (
        annotation.subtype === "Link" &&
        annotation.url === link.url &&
        Util.intersect(annotation.rect, link.rect) !== null
To avoid corner-case bugs, a better approach might be to compute the ratio area(intersection) / area(annotation.rect) and, if it is greater than some threshold, consider the two links very likely the same.
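The ratio check described here could be sketched as follows, assuming rectangles are `[x1, y1, x2, y2]` arrays. The helpers `intersectRects`, `rectArea`, `overlapRatio`, and `isLikelySameLink`, as well as the default threshold of 0.5, are hypothetical and only meant to show the idea:

```javascript
// Intersection of two [x1, y1, x2, y2] rectangles, or null when they
// do not overlap with positive area. Coordinates are normalized first.
function intersectRects(a, b) {
  const x1 = Math.max(Math.min(a[0], a[2]), Math.min(b[0], b[2]));
  const y1 = Math.max(Math.min(a[1], a[3]), Math.min(b[1], b[3]));
  const x2 = Math.min(Math.max(a[0], a[2]), Math.max(b[0], b[2]));
  const y2 = Math.min(Math.max(a[1], a[3]), Math.max(b[1], b[3]));
  return x1 < x2 && y1 < y2 ? [x1, y1, x2, y2] : null;
}

function rectArea(rect) {
  return Math.abs(rect[2] - rect[0]) * Math.abs(rect[3] - rect[1]);
}

// area(intersection) / area(annotationRect), in the range [0, 1].
function overlapRatio(annotationRect, linkRect) {
  const intersection = intersectRects(annotationRect, linkRect);
  const annotationArea = rectArea(annotationRect);
  if (intersection === null || annotationArea === 0) {
    return 0;
  }
  return rectArea(intersection) / annotationArea;
}

// The threshold value is an arbitrary placeholder that would need tuning.
function isLikelySameLink(annotation, link, threshold = 0.5) {
  return (
    annotation.subtype === "Link" &&
    annotation.url === link.url &&
    overlapRatio(annotation.rect, link.rect) > threshold
  );
}
```

Compared with a bare `!== null` intersection test, this keeps two links with the same URL apart when their rectangles merely touch or overlap by a sliver, for example on adjacent lines.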
Automatically detect links in the text content of a file and generate link annotations at the appropriate locations, so that the detected links become hyperlinks.
References:
Please note that this is a WIP PR for soliciting your feedback while I work on polishing things and hopefully optimizing further.