Browsers have an accessibility option that allows users to enforce
a minimum font size for all text rendered in the page, regardless
of what the font-size CSS property says. For example, it can be
found in Firefox under `font.minimum-size.x-western`.
When rendering the `<span>`s in the text layer, this causes the
text layer to no longer be aligned with the underlying canvas.
While normally accessibility features should not be worked around,
in this case it is *not* improving accessibility:
- the text is transparent, so making it bigger doesn't make it more
readable
- the selection UX for users with that accessibility option enabled
is worse than for other users (it's basically unusable).
While there is technically no way to ignore that minimum font size,
this commit works around it by multiplying all the `font-size`s in the text
layer by minFontSize, and then scaling all the `<span>`s down by
1/minFontSize.
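A minimal sketch of the arithmetic described above, not the actual patch. The helper name `compensateMinFontSize` and its return shape are hypothetical; only the multiply-then-scale-down idea comes from the commit text. As long as the intended size is at least 1px, the multiplied size is at least `minFontSize`, so the browser has nothing left to clamp.

```javascript
// Hypothetical helper: `fontSize` is the size the text layer wants (in px),
// `minFontSize` is the browser-enforced minimum (in px).
function compensateMinFontSize(fontSize, minFontSize) {
  if (!(minFontSize > 1)) {
    // No effective minimum, so nothing needs compensating.
    return { cssFontSize: fontSize, scale: 1 };
  }
  return {
    // Value for the `font-size` property, inflated past the minimum.
    cssFontSize: fontSize * minFontSize,
    // Factor to apply via a CSS transform, shrinking the span back down.
    scale: 1 / minFontSize,
  };
}

const { cssFontSize, scale } = compensateMinFontSize(8, 12);
// cssFontSize * scale is 8 again, up to floating-point rounding.
```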
After the re-factoring in PR 18104 there's now a *theoretical* risk that a pending `TextLayer` is never removed, which we can avoid by not registering it until `render` is invoked.
Note that this doesn't affect the viewer or tests, but if a third-party user calls `new TextLayer(...)` without a following call of either the `render`- or `cancel`-method we'd block global clean-up without this patch.
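To illustrate the registration change described above, here is a simplified sketch. `TextLayerSketch` and all of its members are hypothetical stand-ins, not the real `TextLayer` API; in the real code a finished render would presumably also unregister the instance, which is left out here.

```javascript
// Hypothetical stand-in for the real class; all names are illustrative.
class TextLayerSketch {
  static #pendingTextLayers = new Set();

  render() {
    // Register only once rendering actually starts, not in the constructor.
    TextLayerSketch.#pendingTextLayers.add(this);
  }

  cancel() {
    TextLayerSketch.#pendingTextLayers.delete(this);
  }

  static get pendingCount() {
    return TextLayerSketch.#pendingTextLayers.size;
  }

  static cleanup() {
    // Global clean-up is only allowed when nothing is pending.
    return TextLayerSketch.#pendingTextLayers.size === 0;
  }
}

const abandoned = new TextLayerSketch(); // never rendered nor cancelled
const active = new TextLayerSketch();
active.render();
const blockedWhileRendering = TextLayerSketch.cleanup();
active.cancel();
// The abandoned instance was never registered, so it cannot block clean-up.
const allowedAfterwards = TextLayerSketch.cleanup();
```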
Fixes issue #16843.
In certain cases, the text layer was misaligned
due to a difference between the `lang` attribute
of the viewer and the canvas. This commit addresses
the problem by adding the `lang` attribute to the canvas.
The issue was caused because PDF.js uses serif/sans-serif
fonts to generate the text layer and relies on system fonts.
The difference in the `lang` attribute led to different fonts
being picked, causing the misalignment.
This is very old code, and predates e.g. the introduction of JavaScript classes, which creates unnecessarily unwieldy code in the viewer.
By introducing a new `TextLayer` class in the API, similar to how e.g. the `AnnotationLayer` looks, we're able to keep most parameters on the class-instance itself. This removes the need to manually track them in the viewer, and simplifies the call-sites.
This also removes the `numTextDivs` parameter from the "textlayerrendered" event, since that's only added to support default-viewer functionality that no longer exists.
Finally we try, as far as possible, to polyfill the old `renderTextLayer` and `updateTextLayer` functions since they are exposed in the library API.
For *simple* invocations of `renderTextLayer` the behaviour should thus be the same, with only a warning printed in the console.
*Please note:* This doesn't really affect the viewer, but may affect the library API if multiple PDF documents are opened in parallel.
Since we clean-up "global" textLayer-data when destroying a PDF document, this means that other active PDFs could potentially break by invoking `cleanupTextLayer` unconditionally. Note that textLayer rendering is an asynchronous task, and we thus need to ensure those are all finished before running clean-up.
I broke this accidentally in PR 18089, sorry about that!
Note that since `#processItems` is private we can no longer just "replace" the method as was done in PR 18052.
- Change all possible semi-private methods into properly private ones. Note that this code is old enough to predate standard classes.
- Move the `appendText` helper function into `TextLayerRenderTask`, as a private method, to avoid having to manually pass in the scope.
- Simplify `#layoutText` by directly passing in all necessary data. This is possible after the changes in PR 18052.
- These changes will allow a simpler way of implementing PR 17770.
- The /Lang attribute is fetched lazily, with the first `getTextContent` invocation. Given the existing worker-thread caching, this will thus only need to be done *once* per PDF document (and most PDFs don't include this data).
- This makes the /Lang attribute *directly available* in the `textLayer`, which has the following advantages:
- We don't need to block, and thus delay, overall viewer initialization on fetching it (nor pass it around throughout the viewer).
- Third-party users of the `textLayer` will automatically benefit from this, once we start actually using the /Lang attribute in PR 17770.
*Please note:* This also, importantly, means that the `text` reference-tests will then cover this code (which wouldn't otherwise have been the case).
This limit is currently completely non-functional, since the check happens *after* the entire textLayer has been parsed and appended to the DOM. It seems that this has been *accidentally* broken ever since the introduction of `ReadableStream` support.
The reason that this hasn't caused noticeable textLayer-related performance issues in practice is probably because we nowadays manage to coalesce the textLayer into fewer overall DOM elements, whereas years ago many PDF documents ended up with one DOM element *per* glyph.
By moving this check, and thus restoring the functionality, we're also able to remove the `render` helper function and simplify the code.
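The difference between checking the limit before and after appending can be sketched as follows. This is a simplified model using a plain array in place of the DOM; `layoutItems` and the limit value are hypothetical and chosen only for illustration.

```javascript
// Illustrative limit; the real constant and its value are not shown here.
const MAX_ITEMS_TO_RENDER = 3;

// Appends items to `container`, checking the limit *before* each append so
// that the limit actually bounds how much work is done.
function layoutItems(items, container, limit = MAX_ITEMS_TO_RENDER) {
  for (const item of items) {
    if (container.length >= limit) {
      return false; // Limit reached: stop without appending the rest.
    }
    container.push(item);
  }
  return true;
}

const rendered = [];
const completed = layoutItems(["a", "b", "c", "d", "e"], rendered);
```

Had the check only run after the loop, all five items would already have been appended by the time the limit was noticed.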
The only reason that this code still accepts `TextContent` is for backward-compatibility purposes, so we can simplify the implementation by always using a `ReadableStream` internally.
This replaces our custom `PromiseCapability`-class with the new native `Promise.withResolvers()` functionality, which does *almost* the same thing[1]; please see https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Promise/withResolvers
The only difference is that `PromiseCapability` also had a `settled`-getter, which was however not widely used and the call-sites can either be removed or re-factored to avoid it. In particular:
- In `src/display/api.js` we can tweak the `PDFObjects`-class to use a "special" initial data-value and just compare against that, in order to replace the `settled`-state.
- In `web/app.js` we change the only case to manually track the `settled`-state, which should hopefully be OK given how this is being used.
- In `web/pdf_outline_viewer.js` we can remove the `settled`-checks, since the code should work just fine without it. The only thing that could potentially happen is that we try to `resolve` a Promise multiple times, which is however *not* a problem since the value of a Promise cannot be changed once fulfilled or rejected.
- In `web/pdf_viewer.js` we can remove the `settled`-checks, since the code should work fine without them:
- For the `_onePageRenderedCapability` case the `settled`-check is used in a `EventBus`-listener which is *removed* on its first (valid) invocation.
- For the `_pagesCapability` case the `settled`-check is used in a print-related helper that works just fine with "only" the other checks.
- In `test/unit/api_spec.js` we can change the few relevant cases to manually track the `settled`-state, since this is both simple and *test-only* code.
---
[1] In browsers/environments that lack native support, note [the compatibility data](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Promise/withResolvers#browser_compatibility), it'll be polyfilled via the `core-js` library (but only in `legacy` builds).
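For the call-sites that track the `settled`-state manually, the pattern could look roughly like the sketch below. `createTrackedCapability` and the local `withResolvers` fallback are hypothetical helpers written for this example (the fallback only exists so that the snippet also runs where `Promise.withResolvers` is unavailable); they are not part of the patch.

```javascript
// Use the native method when present, otherwise an equivalent fallback.
function withResolvers() {
  if (typeof Promise.withResolvers === "function") {
    return Promise.withResolvers();
  }
  let resolve, reject;
  const promise = new Promise((res, rej) => {
    resolve = res;
    reject = rej;
  });
  return { promise, resolve, reject };
}

// Hypothetical wrapper that re-adds a `settled` flag on top of the result.
function createTrackedCapability() {
  const capability = withResolvers();
  let settled = false;
  return {
    promise: capability.promise,
    get settled() {
      return settled;
    },
    resolve(value) {
      settled = true;
      capability.resolve(value);
    },
    reject(reason) {
      settled = true;
      capability.reject(reason);
    },
  };
}

const cap = createTrackedCapability();
const settledBefore = cap.settled;
cap.resolve("first");
cap.resolve("second"); // No-op: the Promise keeps its first value.
const settledAfter = cap.settled;
```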
The system locale (used in OffscreenCanvas) can differ from the one guessed by Fluent.
Consequently, in order to avoid any mismatch, we just use an attached canvas element.
The original issue can easily be reproduced locally by adding `lang="ja"` in viewer.html
(or another language, for Japanese users).
When pdfBug is true, the substitution font is used in the text layer, so that
the font that is really being used can be identified with the devtools.
And to make sure that the fonts are loaded, the font cache isn't cleaned up when
the debugger is active.
This is something that I completely overlooked in PR 16162, which in some cases causes the default viewer to incorrectly print warnings.
This can be reproduced with the PAGE scrolling-mode, and/or the PresentationMode, and this patch simply works around it by checking the visibility as well (since the warning is a best-effort solution anyway).
Unfortunately I don't believe that we can simply add a default `--scale-factor` CSS-variable to the `container`-element, since that might not be entirely appropriate/correct in all cases.[1]
However, we can at least print a console-error to hopefully make this situation more apparent to users. (This is purposely not using the `warn` helper-function, since those messages can be disabled.)
---
[1] One example is in our reference-tests, where we don't need to add it to the `container`-element itself.
Currently some `getCtx` calls will have `isOffscreenCanvasSupported === undefined` set, meaning that `OffscreenCanvas` isn't being used as intended, since no `TextLayerRenderTask._isOffscreenCanvasSupported` property exists.
*Please note:* This patch is written using the GitHub UI, since I'm currently without a dev machine, so hopefully it works correctly.
While reviewing recent patches, I couldn't help but notice that we now have a lot of call-sites that manually access the `PageViewport.viewBox`-property.
Rather than repeating that verbatim all over the code-base, this patch adds a lazily computed and cached getter for this data instead.
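A lazily computed and cached getter can be written in plain JavaScript as sketched below. The class name `ViewportSketch`, the getter name `dims`, and the shape of the cached object are all hypothetical; the actual getter added by the patch may differ in name and contents.

```javascript
// Hypothetical stand-in for a viewport with a `[x1, y1, x2, y2]` viewBox.
class ViewportSketch {
  constructor(viewBox) {
    this.viewBox = viewBox;
  }

  get dims() {
    const [x1, y1, x2, y2] = this.viewBox;
    const dims = {
      pageWidth: x2 - x1,
      pageHeight: y2 - y1,
      pageX: x1,
      pageY: y1,
    };
    // Shadow the prototype getter with an own data property, so that the
    // computation above only runs on first access.
    Object.defineProperty(this, "dims", {
      value: dims,
      enumerable: true,
      configurable: true,
      writable: false,
    });
    return dims;
  }
}

const viewport = new ViewportSketch([0, 0, 612, 792]);
const first = viewport.dims;
const second = viewport.dims; // Same cached object, not recomputed.
```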
Rather than handling these parameters separately, which is a left-over from back when streaming of textContent was originally added, we can simply pass either data directly to the `TextLayer` and let it handle things accordingly.
Also, improves a few JSDoc comments and `typedef`-imports.
The idea is just to reuse what we got on the first draw.
Now, we only update the scaleX of the different spans and the other values
are dependent on --scale-factor.
Move some properties into the CSS in order to avoid any updates in JS.
The deprecation is included in the current release, i.e. version `3.1.81`, and given the edge-case nature of this option I really don't think that we need to keep it deprecated for multiple releases.
This has never really been used anywhere within the PDF.js library[1], and when streaming of textContent was introduced this parameter was effectively made redundant.
Note that when streaming of textContent is used, all text-layout has already happened by the time that this `timeout`-functionality is actually invoked (thus making it pointless).
While the `timeout`-functionality may still "work" when the textContent is provided upfront, although it's never been used/tested, streaming will generally perform better (in e.g. a viewer setting).
*Please note:* While unrelated here, also removes a now unused property that I forgot in PR 15259.
---
[1] At least not since the code was moved into its current file, which happened in PR 6619 and landed seven years ago.
There are obviously cases where using `concat` makes perfect sense, since that method doesn't change any of the existing Arrays; see https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/concat
However, in a few cases throughout the code-base that's not an issue and using `concat` only leads to unnecessary intermediate allocations. With modern JavaScript we can thus replace those with a combination of `push` and spread-syntax, which wasn't originally possible when the code was written.
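As a small illustration of the difference (the variable names are made up for this example): `concat` leaves the original Array alone and allocates a new one, whereas `push` with spread-syntax appends in place.

```javascript
const items = [1, 2];
const extra = [3, 4];

// `concat` allocates and returns a new Array; `items` is unchanged.
const viaConcat = items.concat(extra);
const lengthAfterConcat = items.length;

// `push` with spread-syntax mutates `items` directly, with no new Array.
items.push(...extra);
```

One caveat worth keeping in mind: spreading a very large Array into a function call can exceed the engine's argument limit, so this replacement suits small or moderately sized Arrays.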
While `TextLayerRenderTask` apparently makes sense in TypeScript environments, given that it's being returned by the `renderTextLayer`-function in the API, we really don't want to extend the *public* API by simply exporting the class directly in `src/pdf.js` since it should never be called/initialized manually.
Hence we follow the same pattern as in PR 14013, and add some very basic unit-tests to ensure that `renderTextLayer` always returns a `TextLayerRenderTask`-instance as expected.
In PR #14717, the type was changed from an HTMLElement to a DocumentFragment.
This broke TypeScript projects that use an HTMLElement container.
To remedy this, we extend the type of container to also include HTMLElement.
Given that the textLayer-code has been using a `DocumentFragment` ever since PR 3356 (back in 2013), simply updating the type of the `container` property should be fine.
This patch also tries to, ever so slightly, improve the grammar of a couple of other properties in the typedef.
- PR #13257 fixed a lot of issues, but not all of them; this patch aims to fix almost all of the remaining ones.
- the idea in this new patch is to compare the position of a new glyph with the last position where a glyph was drawn;
- no spaces are "drawn": a space just moves the cursor, and it isn't added to the chunk;
- this way a space followed by a cursor move can be treated as only one space: it helps to merge all spaces into one.
- to distinguish real spaces from tracking ones, we used a factor of the space width (from the font)
- it was a pretty good idea in general, but it fails with some fonts where the space is too big:
- in Poppler, they're using a factor of the font size: this is an excellent idea (<= 0.1 * fontSize implies tracking space).
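The font-size based heuristic from the last bullet can be sketched as a tiny predicate. The function name and the `gap` parameter (a non-negative distance between the last drawn position and the new glyph, in the same units as the font size) are assumptions made for this example; only the `0.1 * fontSize` threshold comes from the text above.

```javascript
// Threshold factor taken from the description above.
const TRACKING_SPACE_FACTOR = 0.1;

// Returns true when the gap is small enough to be treated as tracking
// (letter spacing) rather than as a real space between words.
function isTrackingSpace(gap, fontSize) {
  return gap <= TRACKING_SPACE_FACTOR * fontSize;
}

const smallGap = isTrackingSpace(1, 12); // 1 <= 1.2
const largeGap = isTrackingSpace(3, 12); // 3 > 1.2
```

Unlike a factor of the font's space width, this threshold doesn't depend on how wide a particular font happens to make its space glyph.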
While these changes will obviously not have a significant effect on overall memory usage, it cannot hurt as far as I'm concerned. This patch makes the following changes:
- Clear out `_textDivProperties` once rendering is done, since those properties are only necessary to keep alive when *enhanced* text-selection is being used.
- Reduce the size of the `_textDivProperties`-entries by default, since a majority of the properties are only relevant when *enhanced* text-selection is being used.
While fixing issue 13794, I noticed that cancelling the `ReadableStream` returned by the `PDFPageProxy.streamTextContent`-method could lead to "Uncaught promise" messages in the console.[1]
Generally speaking, we don't really care about errors when *cancelling* a `ReadableStream` and it thus seems reasonable to simply suppress any output in those cases.
---
[1] Although, after that issue was fixed you'd now need to set the API-option `stopAtErrors = true` to actually trigger this.
With modern JavaScript modules, where you explicitly list the properties that should be exported, it's no longer necessary to wrap all of the code in a closure.[1]
This patch also tries to clean-up/improve a couple of the existing JSDoc-comments.
---
[1] This reduces the size, even of the *built* `pdf.js` file, since there's now a lot less unnecessary whitespace.
- Improve chunking in order to fix some bugs where spaces are missing:
  * track the last position where a glyph has been drawn;
  * when a new glyph (the first glyph in a chunk) is added, compare its position with the last saved one and add a space or a break:
    - there are multiple ways to move the glyphs, and to avoid having to deal with all the different possibilities it's much easier to just compare positions;
    - and so there is now one function (i.e. "compareWithLastPosition") where all of the work is done.
- Add some breaks in order to get lines;
- Remove the multiple white spaces:
  * some spaces were filled with several white spaces, which makes it harder to find some sequences of words using the search tool;
  * other PDF readers replace spaces with one white space.
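The effect of the last point on searching can be shown with a simple string example. Note that this is only an illustration of the outcome: the patch handles spaces during chunking in the evaluator, not with a regular expression afterwards, and `collapseSpaces` is a hypothetical helper.

```javascript
// Replace every run of two or more spaces with a single space.
function collapseSpaces(text) {
  return text.replace(/ {2,}/g, " ");
}

const raw = "Hello    world  again";
const normalized = collapseSpaces(raw);

// A search for the phrase fails on the raw text but works once normalized.
const foundInRaw = raw.includes("Hello world");
const foundInNormalized = normalized.includes("Hello world");
```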
Update src/core/evaluator.js
Co-authored-by: Jonas Jenwald <jonas.jenwald@gmail.com>
Using `for...of` is a modern and generally much nicer pattern, since it gets rid of unnecessary callback-functions. (In a couple of spots, a "regular" `for` loop had to be used.)
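A brief comparison of the two patterns, using made-up data. Besides dropping the callback, `for...of` also allows `continue`, `break` and `return` to work as usual, which a `forEach` callback cannot offer.

```javascript
const texts = ["Hello", "", "world"];

// Callback-based iteration.
const viaForEach = [];
texts.forEach(function (text) {
  if (text) {
    viaForEach.push(text.toUpperCase());
  }
});

// Equivalent `for...of` loop, without the callback-function.
const viaForOf = [];
for (const text of texts) {
  if (!text) {
    continue;
  }
  viaForOf.push(text.toUpperCase());
}
```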