Commit graph

327 commits

Author SHA1 Message Date
Jonas Jenwald
d24a61c648 Allow /XYZ destinations without zoom parameter (issue 18408)
According to the PDF specification these destinations should have a zoom parameter, which may however be `null`, but it shouldn't be omitted; please see https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf#G11.2095870

Hence we try to work-around bad PDF generators by making the zoom parameter optional when validating explicit destinations in both the worker and the viewer.
2024-07-18 13:29:32 +02:00
Jonas Jenwald
403d023617 Allow e.g. /FitH destinations without additional parameter (bug 1907000)
According to the PDF specification these destinations should have a coordinate parameter, which may however be `null`, but it shouldn't be omitted; please see https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf#G11.2095870

Hence we try to work-around bad PDF generators by making the coordinate parameter optional when validating explicit destinations in both the worker and the viewer.
2024-07-11 10:36:44 +02:00
Jonas Jenwald
5ee61690f3
Merge pull request #18390 from alexcat3/fix-issue-18099
Handle toUnicode cMaps that omit leading zeros in hex encoded UTF-16 (issue 18099)
2024-07-06 18:57:07 +02:00
alexcat3
1c364422a6 Handle toUnicode cmaps that omit leading zeros in hex encoded UTF-16 (issue 18099)
Add unit test to check compatability with such cmaps

In the PDF in issue 18099. the toUnicode cmap had a line to map the glyph char codes from 00 to 7F to the corresponding code points. The syntax to map a range of char codes to a range of unicode code points is
<start_char_code> <end_char_code> <start_unicode_codepoint>
As the unicode code points are supposed to be given in UTF-16 BE, the PDF's line SHOULD have probably read
<00> <7F> <0000>
Instead it omitted two leading zeros from the UTF-16 like this
<00> <7F> <00>
This confused PDF.js into mapping these character codes to the UTF-16 characters with the corresponding HIGH bytes (01 became \u0100, 02 became \u0200, et cetera), which ended up turning latin text in the PDF into chinese when it was copied
I'm not sure if the PDF spec actually allows PDFs to do this, but since there's at least one PDF in the wild that does and other PDF readers read it correctly, PDF.js should probably support this
2024-07-06 11:29:21 -04:00
Tim van der Meij
2a44203d96
Fix the "caches image resources at the document/page level as expected (issue 11878)" unit test
This unit test fails occasionally (albeit much less than before thanks
to PR #17663), so we change the parsing time check's divisor to prevent
it from happening again. If the last page's rendering time is less than
or equal to 50% of the first page's rendering time that should be enough
proof that no worker thread re-parsing occurred while also providing a
wide enough range to avoid intermittents.

Note that the assertion is now equal to the one we already have in the
"caches image resources at the document/page level, with main-thread
copying of complex images (issue 11518)" unit test which seems to work
reliably so far.
2024-07-06 16:30:07 +02:00
Jonas Jenwald
06334c97ef Improve the loadingParams functionality in the API
- Move the definition of the `loadingParams` Object, to simplify the code.

 - Add a unit-test, since none existed and the viewer depends on this functionality.
2024-05-24 09:26:40 +02:00
Jonas Jenwald
c5f92437f7 Avoid re-parsing global images that failed decoding (issue 18042, PR 17428 follow-up)
For images that failed to decode once we want to avoid a pointless round-trip to the main-thread, which could otherwise happen for globally cached images.
2024-05-14 13:58:36 +02:00
Jonas Jenwald
6d523c316c [api-minor] Include the document /Lang attribute in the textContent-data
- These changes will allow a simpler way of implementing PR 17770.

 - The /Lang attribute is fetched lazily, with the first `getTextContent` invocation. Given the existing worker-thread caching, this will thus only need to be done *once* per PDF document (and most PDFs don't included this data).

 - This makes the /Lang attribute *directly available* in the `textLayer`, which has the following advantages:
    - We don't need to block, and thus delay, overall viewer initialization on fetching it (nor pass it around throughout the viewer).

    - Third-party users of the `textLayer` will automatically benefit from this, once we start actually using the /Lang attribute in PR 17770.
      *Please note:* This also, importantly, means that the `text` reference-tests will then cover this code (which wouldn't otherwise have been the case).
2024-05-14 12:44:41 +02:00
Jonas Jenwald
2b69fb76ac [api-minor] Improve the FileSpec implementation
- Check that the `filename` is actually a string, before parsing it further.
 - Use proper "shadowing" in the `filename` getter.
 - Add a bit more validation of the data in `pickPlatformItem`.
 - Last, but not least, return both the original `filename` and the (path stripped) variant needed in the display-layer and viewer.
2024-05-01 18:02:05 +02:00
Jonas Jenwald
bf4e36d1b5 [api-minor] Expose the /Desc-attribute of file attachments in the viewer (issue 18030)
In the viewer this will be displayed in the `title` of the hyperlink, which is probably the best we can do here given how the viewer is implemented.
2024-05-01 09:02:11 +02:00
Calixte Denizet
45fa867577 Allow to insert several annotations under the same parent in the structure tree
While testing stamp insertion with the added pdf, I noticed that the tags using a MCID
weren't considered when trying to attach an annotation to it.
2024-04-24 16:23:05 +02:00
Jonas Jenwald
91898e5923 Extend the globally cached image main-thread copying to "complex" images as well (PR 17428 follow-up)
In PR 17428 this functionality was limited to "larger" images, to not affect performance negatively. However it turns out that it's also beneficial to consider more "complex" images, regardless of their size, that contain /SMask or /Mask data; see issue 11518.
2024-04-20 11:10:09 +02:00
Tim van der Meij
2e5282928f
Merge pull request #17854 from Snuffleupagus/rm-PromiseCapability
[api-minor] Replace the `PromiseCapability` with  `Promise.withResolvers()`
2024-04-02 15:21:43 +02:00
Jonas Jenwald
e4d0e84802 [api-minor] Replace the PromiseCapability with Promise.withResolvers()
This replaces our custom `PromiseCapability`-class with the new native `Promise.withResolvers()` functionality, which does *almost* the same thing[1]; please see https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Promise/withResolvers

The only difference is that `PromiseCapability` also had a `settled`-getter, which was however not widely used and the call-sites can either be removed or re-factored to avoid it. In particular:
 - In `src/display/api.js` we can tweak the `PDFObjects`-class to use a "special" initial data-value and just compare against that, in order to replace the `settled`-state.
 - In `web/app.js` we change the only case to manually track the `settled`-state, which should hopefully be OK given how this is being used.
 - In `web/pdf_outline_viewer.js` we can remove the `settled`-checks, since the code should work just fine without it. The only thing that could potentially happen is that we try to `resolve` a Promise multiple times, which is however *not* a problem since the value of a Promise cannot be changed once fulfilled or rejected.
 - In `web/pdf_viewer.js` we can remove the `settled`-checks, since the code should work fine without them:
     - For the `_onePageRenderedCapability` case the `settled`-check is used in a `EventBus`-listener which is *removed* on its first (valid) invocation.
     - For the `_pagesCapability` case the `settled`-check is used in a print-related helper that works just fine with "only" the other checks.
 - In `test/unit/api_spec.js` we can change the few relevant cases to manually track the `settled`-state, since this is both simple and *test-only* code.

---
[1] In browsers/environments that lack native support, note [the compatibility data](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Promise/withResolvers#browser_compatibility), it'll be polyfilled via the `core-js` library (but only in `legacy` builds).
2024-04-01 11:42:37 +02:00
Calixte Denizet
136c1faa7f Display outlines even if one has no title
Fixes #17856.
2024-03-29 21:30:24 +01:00
Jonas Jenwald
0d039937f9 Add better support for /Launch actions with /FileSpec dictionaries (issue 17846) 2024-03-26 20:15:48 +01:00
Jonas Jenwald
eded037d06 [api-minor] Use the Fetch API, when supported, to load PDF documents in Node.js environments
Given that modern Node.js versions now implement support for a fair number of "browser" APIs, we can utilize the standard Fetch API to load PDF documents that are specified via http/https URLs.

Please find compatibility information at:
 - https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API#browser_compatibility
 - https://nodejs.org/dist/latest-v18.x/docs/api/globals.html#fetch
 - https://developer.mozilla.org/en-US/docs/Web/API/Response#browser_compatibility
 - https://nodejs.org/dist/latest-v18.x/docs/api/globals.html#response
2024-02-21 22:38:42 +01:00
Jonas Jenwald
37e98e39f6 Skip any whitespace after the first object in linearized PDFs (issue 17665)
This way the code is now consistent with the non-linearized branch in the `PDFDocument.startXRef` getter.
2024-02-12 22:05:36 +01:00
Jonas Jenwald
19ef3e367b Tweak the issue 11878 unit-test parsing time check (PR 17428 follow-up)
This unit-test has been failing occasionally in Chrome and Node.js, hence we tweak the parsing time check to reduce the likelihood of that happening.
2024-02-12 12:31:55 +01:00
Calixte Denizet
7f2428a77e Reduce memory use and improve perfs when computing the bounding box of a bezier curve (bug 1875547)
It isn't really a fix for the mentioned bug but it slightly improve things.
In reducing the memory use, the time spent in the GC is reduced either.
The algorithm to compute the bounding box is the same as before but it has just
been rewritten to be more efficient.
2024-01-24 23:41:14 +01:00
Jonas Jenwald
f9a384d711 Enable the arrow-body-style ESLint rule
This manually ignores some cases where the resulting auto-formatting would not, as far as I'm concerned, constitute a readability improvement or where we'd just end up with more overall indentation.

Please see https://eslint.org/docs/latest/rules/arrow-body-style
2024-01-21 16:20:55 +01:00
Jonas Jenwald
9dfe9c552c Use shorter arrow functions where possible
For arrow functions that are both simple and short, we can avoid using explicit `return` to shorten them even further without hurting readability.

For the `gulp mozcentral` build-target this reduces the overall size of the output by just under 1 kilo-byte (which isn't a lot but still can't hurt).
2024-01-21 10:13:12 +01:00
Calixte Denizet
405f573d70 Take into account empty lines when extracting text content from the appearance
Fixes #17492.
2024-01-14 20:23:29 +01:00
Jonas Jenwald
9f02cc36d4 Attempt to further reduce re-parsing for globally cached images (PR 11912, 16108 follow-up)
In PR 11912 we started caching images that occur on multiple pages globally, which improved performance a lot in many PDF documents.
However, one slightly annoying limitation of the implementation is the need to re-parse the image once the global-caching threshold has been reached. Previously this was difficult to avoid, since large image-resources will cause cleanup to run on the main-thread after rendering has finished. In PR 16108 we started delaying this cleanup a little bit, to improve performance if a user e.g. zooms and/or rotates the document immediately after rendering completes.

Taking those two PRs together, we now have a situation where it's much more likely that the main-thread has "globally used" images cached at the page-level. Hence we can instead attempt to *copy* a locally cached image into the global object-cache on the main-thread and thus reduce unnecessary re-parsing of large/complex global images, which significantly reduces the rendering time in many cases.

For the PDF document in issue 11878, the rendering time of *the second page* changes as follows (on my computer):
 - With the `master`-branch it takes >600 ms to render.
 - With this patch that goes down to ~50 ms, which is one order of magnitude faster.

(Note that all other pages are, as expected, completely unaffected by these changes.)

This new main-thread copying is limited to "large" global images, since:
 - Re-parsing of small images, on the worker-thread, is usually fast enough to not be an issue.
 - With the delayed cleanup after rendering, it's still not guaranteed that an image is available in a page-level cache on the main-thread.
 - This forces the worker-thread to wait for the main-thread, which is a pattern that you always want to avoid unless absolutely necessary.
2023-12-21 21:26:21 +01:00
Jonas Jenwald
674052d3fc Re-factor the blob-URL caching in DownloadManager.openOrDownloadData
Cache blob-URLs on the actual data, rather than DOM elements, to reduce potential duplicates (note the updated unit-test).
2023-10-17 10:18:34 +02:00
Jonas Jenwald
927e50f5d4 [api-major] Output JavaScript modules in the builds (issue 10317)
At this point in time all browsers, and also Node.js, support standard `import`/`export` statements and we can now finally consider outputting modern JavaScript modules in the builds.[1]

In order for this to work we can *only* use proper `import`/`export` statements throughout the main code-base, and (as expected) our Node.js support made this much more complicated since both the official builds and the GitHub Actions-based tests must keep working.[2]
One remaining issue is that the `pdf.scripting.js` file cannot be built as a JavaScript module, since doing so breaks PDF scripting.

Note that my initial goal was to try and split these changes into a couple of commits, however that unfortunately didn't really work since it turned out to be difficult for smaller patches to work correctly and pass (all) tests that way.[3]
This is a classic case of every change requiring a couple of other changes, with each of those changes requiring further changes in turn and the size/scope quickly increasing as a result.

One possible "issue" with these changes is that we'll now only output JavaScript modules in the builds, which could perhaps be a problem with older tools. However it unfortunately seems far too complicated/time-consuming for us to attempt to support both the old and modern module formats, hence the alternative would be to do "nothing" here and just keep our "old" builds.[4]

---
[1] The final blocker was module support in workers in Firefox, which was implemented in Firefox 114; please see https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/import#browser_compatibility

[2] It's probably possible to further improve/simplify especially the Node.js-specific code, but it does appear to work as-is.

[3] Having partially "broken" patches, that fail tests, as part of the commit history is *really not* a good idea in general.

[4] Outputting JavaScript modules was first requested almost five years ago, see issue 10317, and nowadays there *should* be much better support for JavaScript modules in various tools.
2023-10-07 09:31:08 +02:00
Jonas Jenwald
bf9c33e60f Add support for "GoToE" actions with destinations (issue 17056)
This shouldn't be very common in practice, since "GoToE" actions themselves seem quite uncommon; see PR 15537.
2023-10-04 11:14:23 +02:00
Calixte Denizet
f2196f7803 StructParents entry isn't required on pages with no tagged contents (bug 1855641) 2023-09-28 14:23:10 +02:00
Calixte Denizet
3ee5268a23 [Editor] Don't try to add data to the struct tree when there is no accessibilityData (bug 1855157) 2023-09-26 11:02:14 +02:00
Calixte Denizet
a8573d4e1b [Editor] Add the ability to create/update the structure tree when saving a pdf containing newly added annotations (bug 1845087)
When there is no tree, the tags for the new annotions are just put under the root element.
When there is a tree, we insert the new tags at the right place in using the value
of structTreeParentId (added in PR #16916).
2023-09-16 18:34:58 +02:00
Tim van der Meij
66507ccae8
Enable unit test "creates pdf doc from non-existent URL"
The unit test is re-enabled because it no longer seems to fail after 10
runs on Linux where this used to fail often. Code inspection also shows
that the code is correct and should raise the previous exception
(anymore). Finally, a lot has changed since this test was disabled such
as new Jasmine versions, new Linux bot OS version and new browser
versions.
2023-09-10 15:47:04 +02:00
Calixte Denizet
a8a50c567a Construct the correct field name and strip out classes when searching
The classes were stripped out during when creating the field name but
it led to a wrong name.
Since class components in a path are irrelevant, they're just ignored
when searching for a node in the datasets.
2023-09-07 15:56:47 +02:00
Calixte Denizet
ee3ac35e05 Revert fix for bug 1838855 (bug 1849876)
The issue described in the mentioned bug is reall because
Acrobat is rendering the XFA instead of the Acroform.
The original patch just tried to workaround the issue but it
induces some regressions.
2023-08-23 12:34:41 -04:00
Tim van der Meij
5828ac0ee3
Merge pull request #16834 from Snuffleupagus/globalWorkerPort-parallel-test
Add a unit-test for the "correct" way of using the global `workerPort` in parallel (PR 16830 follow-up)
2023-08-19 13:38:16 +02:00
Jonas Jenwald
29b2050ac2 Improve the "write a new annotation, save the pdf and check that the text content is correct" unit-test (PR 16559 follow-up)
Currently this unit-test will pass just fine if compression is disabled, e.g. by commenting out the relevant code in the `src/core/writer.js` file.
While we don't have a simple way of *directly* checking that the Annotation text-content is compressed, we can however use the resulting file-size as a fairly good proxy. (Note that if compression is disabled the file-size is more than doubled.)
2023-08-15 15:12:17 +02:00
Jonas Jenwald
2422492ee3 Add a unit-test for the "correct" way of using the global workerPort in parallel (PR 16830 follow-up)
Please note that for performance reasons it's not really advised to use the same worker-thread *in parallel* for parsing multiple PDF documents, since they will then unnecessarily compete for resources.
However, given that it's still possible to do that e.g. when using the global `workerPort` it probably won't hurt to add a unit-test for this particular situation.
2023-08-15 12:45:54 +02:00
Jonas Jenwald
66437917db Avoid using the global workerPort when destruction has started, but not yet finished (issue 16777)
Given that the `PDFDocumentLoadingTask.destroy()`-method is documented as being asynchronous, you thus need to await its completion before attempting to load a new PDF document when using the global `workerPort`.
If you don't await destruction as intended then a new `getDocument`-call can remain pending indefinitely, without any kind of indication of the problem, as shown in the issue.

In order to improve the current situation, without unnecessarily complicating the API-implementation, we'll now throw during the `getDocument`-call if the global `workerPort` is in the process of being destroyed.
This part of the code-base has apparently never been covered by any tests, hence the patch adds unit-tests for both the *correct* usage (awaiting destruction) as well as the specific case outlined in the issue.
2023-08-12 21:21:50 +02:00
Jonas Jenwald
389a26c115 Fallback to check all pages when getting the pageIndex of FieldObjects
Given that the FieldObjects are parsed in parallel, in combination with the existing caching in the `getPage`-method and `annotations`-getter, adding additional caches for this fallback code-path doesn't seem entirely necessary.
2023-08-10 17:10:04 +02:00
Jonas Jenwald
64e8557fb5 [api-minor] Deprecate the PDFDocumentProxy.getJavaScript method
This method is very old, however with the exception of the auto-print hack (when scripting is disabled) in the viewer it's never actually been used.

Most likely the idea with `PDFDocumentProxy.getJavaScript` was that it'd be useful if scripting support was added, however it turned out that it was a bit too simplistic and instead a number of new methods were added for the scripting use-cases.
2023-08-01 09:02:05 +02:00
Jonas Jenwald
3a886e7264 Move the isNodeJS-helper into the src/shared/util.js file
With the changes in the previous patch the `isNodeJS`-helper no longer needs to live in its own file, which helps get rid of a closure in the *built* files.
2023-07-17 16:42:25 +02:00
Jonas Jenwald
f84657d837 Address formatting changes from Prettier version 3 2023-07-15 10:44:39 +02:00
Jonas Jenwald
39113baa33 Move the transfers computation into the AnnotationStorage class
Rather than having to *manually* determine the potential `transfers` at various spots in the API, we can let the `AnnotationStorage.serializable` getter include this.
To further simplify things, we can also let the `serializable` getter compute and include the `hash`-string as well.
2023-06-29 19:51:57 +02:00
Calixte Denizet
599b9498f2 [Editor] Add support for printing/saving newly added Stamp annotations
In order to minimize the size the of a saved pdf, we generate only one
image and use a reference in each annotation using it.
When printing, it's slightly different since we have to render each page
independantly but we use the same image within a page.
2023-06-26 15:47:05 +02:00
Calixte Denizet
5c0054d58d Guess that a checkbox belongs to a group in using its T value (bug 1838855) 2023-06-16 18:45:09 +02:00
Jonas Jenwald
2cb113b545 Improve handling of /Filter-entries in writeStream
Fix handling of /Filter-entries, since the current implementation could potentially corrupt the data if there's multiple filters present.
Please note that filters are applied *sequentially* during decoding, starting from the first one in the Array, hence the first Array-entry needs to be /FlateDecode in order for things to actually work correctly.

To prevent a future bug, if we want to save more "complex" data such as images, also ensure that we include any existing /DecodeParms-entries when updating the /Filter-entry.
2023-06-16 10:27:23 +02:00
Calixte Denizet
85b38fc247 Add a test to check that the compression is ok when saving an annotation 2023-06-16 10:05:42 +02:00
Calixte Denizet
71479fdd21 [Editor] Avoid to have duplicated entries in the Annot array when saving an existing and modified annotation 2023-06-15 22:02:10 +02:00
Jonas Jenwald
877884029d
Merge pull request #16551 from Snuffleupagus/page-destroyed-complete
Ensure that `cleanup` during rendering is actually ignored, to prevent a blank canvas
2023-06-15 12:26:57 +02:00
Jonas Jenwald
0650be4641
Merge pull request #16550 from Snuffleupagus/rm-RenderingCancelledException-type
[api-minor] Remove the `type` from `RenderingCancelledException` (PR 16226 follow-up)
2023-06-15 12:26:27 +02:00
Jonas Jenwald
a591c3de84 Ensure that cleanup during rendering is actually ignored, to prevent a blank canvas
The existing unit-test doesn't work as intended, since the page never actually renders. Note how `cleanup` is *not* allowed to run when parsing and/or rendering is ongoing, however an (old) incorrect condition could prevent rendering from ever starting.

This is very old code, which has been slightly re-factored a couple of times (many years ago), however this doesn't appear to affect e.g. the default viewer since the incorrect behaviour seem highly dependent on "unlucky" timing.
Note also how at the start of the `PDFPageProxy.prototype.render`-method we purposely cancel any pending `cleanup`-call, to prevent unnecessary re-parsing for multiple sequential `render`-calls.

Finally, avoid running `cleanup` when document/page destruction has already started since it's pointless in that case.
2023-06-15 11:39:26 +02:00