From 16b102b31332e408537938762fe89e8fdf5da2d4 Mon Sep 17 00:00:00 2001
From: Jan Lukas Gernert
Date: Sat, 29 Apr 2023 18:20:28 +0200
Subject: [PATCH] replace multiple <br>s with single <p>
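
For reviewers, the transformation performed by `Util::replace_brs` can be sketched on a flat token list. This is only an illustration under stated assumptions, not the code in this patch: the `Token` enum and the free functions below are hypothetical, whereas the real implementation walks a libxml node tree and additionally stops collecting children at non-phrasing content.

```rust
// Hypothetical flat model of sibling nodes; the patch itself operates on libxml nodes.
#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    Br,
    Text(String),
    Paragraph(Vec<Token>),
}

// Whitespace-only text is ignored when looking for <br> chains.
fn is_whitespace(token: &Token) -> bool {
    matches!(token, Token::Text(t) if t.trim().is_empty())
}

// True if `tokens[at]` is a Br and the next non-whitespace token is also a Br.
fn starts_chain(tokens: &[Token], at: usize) -> bool {
    if tokens[at] != Token::Br {
        return false;
    }
    tokens[at + 1..]
        .iter()
        .find(|t| !is_whitespace(t))
        .map_or(false, |t| *t == Token::Br)
}

// Replace every chain of two or more Br tokens with a Paragraph that wraps
// the tokens following the chain, up to the start of the next chain.
pub fn replace_brs(tokens: &[Token]) -> Vec<Token> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < tokens.len() {
        if !starts_chain(tokens, i) {
            // Ordinary token or a lone Br: keep it unchanged.
            out.push(tokens[i].clone());
            i += 1;
            continue;
        }
        // Skip the whole chain; `end` is the index just past its last Br.
        let mut end = i;
        let mut j = i;
        while j < tokens.len() && (tokens[j] == Token::Br || is_whitespace(&tokens[j])) {
            if tokens[j] == Token::Br {
                end = j + 1;
            }
            j += 1;
        }
        // Collect the paragraph's children until another chain begins.
        let mut children = Vec::new();
        let mut k = end;
        while k < tokens.len() && !starts_chain(tokens, k) {
            children.push(tokens[k].clone());
            k += 1;
        }
        // Drop trailing whitespace and skip paragraphs that end up empty.
        while children.last().map_or(false, is_whitespace) {
            children.pop();
        }
        if !children.is_empty() {
            out.push(Token::Paragraph(children));
        }
        i = k;
    }
    out
}
```

In this model the sequence foo, Br, bar, Br, whitespace, Br, Br, abc keeps the first lone Br and wraps abc in a paragraph, which mirrors the example in the doc comment added to util.rs.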

---
 .../tests/ftr/phoronix/expected.html          |  35 ++--
 .../tests/readability/blogger/expected.html   |  43 +---
 .../readability/hukumusume/expected.html      | 195 +++++++-----------
 .../tests/readability/tumblr/expected.html    |   2 +-
 .../tests/readability/v8-blog/expected.html   |   6 +-
 .../tests/readability/yahoo-4/expected.html   |  58 +-----
 article_scraper/src/full_text_parser/mod.rs   |   4 +
 article_scraper/src/util.rs                   | 157 ++++++++++++++
 8 files changed, 264 insertions(+), 236 deletions(-)

diff --git a/article_scraper/resources/tests/ftr/phoronix/expected.html b/article_scraper/resources/tests/ftr/phoronix/expected.html
index df5432c..83afd8e 100644
--- a/article_scraper/resources/tests/ftr/phoronix/expected.html
+++ b/article_scraper/resources/tests/ftr/phoronix/expected.html
@@ -1,27 +1,16 @@

GNOME
It's been one month already since the debut of GNOME 44 and out today is the first point release. -
-
GNOME 44.1 brings many fixes to this updated Linux desktop, including many crash fixes and addressing newly uncovered memory leaks. Some of the GNOME 44.1 highlights include: -
-
- Many fixes to GNOME Shell, including crash fixes, memory leak fixes, and other items addressed. -
-
- GNOME's Mutter has also seen numerous fixes, including improved screencast support, fixing support for resizing windows via the keyboard, enabling modifiers by default for non-native backends, and various other fixes. -
-
- The GNOME Settings Daemon will now connect to light sensors asynchronously. -
-
- Crash fixes for GNOME Software and the Nautilus file manager. -
-
- Nautilus now allows extraction of .tar.zst and .zstd archives. -
-
- GNOME Control Center's display area now allows configuring all monitors and applying those settings at once. -
-
- GNOME Calls will no longer crash on empty/null call ID. -
-
- GNOME Web (Epiphany) has seen some crash fixes. -
-
- GNOME Boxes for virtualization has a fix to always enable the boot menu option and fixing 3D acceleration not sticking at startup. -
-
- GNOME Calendar has stability and performance improvements to its search. -

Fedora 38 with GNOME 44

+

GNOME 44.1 brings many fixes to this updated Linux desktop, including many crash fixes and addressing newly uncovered memory leaks. Some of the GNOME 44.1 highlights include: +

- Many fixes to GNOME Shell, including crash fixes, memory leak fixes, and other items addressed. +

- GNOME's Mutter has also seen numerous fixes, including improved screencast support, fixing support for resizing windows via the keyboard, enabling modifiers by default for non-native backends, and various other fixes. +

- The GNOME Settings Daemon will now connect to light sensors asynchronously. +

- Crash fixes for GNOME Software and the Nautilus file manager. +

- Nautilus now allows extraction of .tar.zst and .zstd archives. +

- GNOME Control Center's display area now allows configuring all monitors and applying those settings at once. +

- GNOME Calls will no longer crash on empty/null call ID. +

- GNOME Web (Epiphany) has seen some crash fixes. +

- GNOME Boxes for virtualization has a fix to always enable the boot menu option and fixing 3D acceleration not sticking at startup. +

- GNOME Calendar has stability and performance improvements to its search. +

Fedora 38 with GNOME 44


More details on the GNOME 44.1 changes via the release announcement.
\ No newline at end of file
diff --git a/article_scraper/resources/tests/readability/blogger/expected.html b/article_scraper/resources/tests/readability/blogger/expected.html
index 0149e1e..c3c26f3 100644
--- a/article_scraper/resources/tests/readability/blogger/expected.html
+++ b/article_scraper/resources/tests/readability/blogger/expected.html
@@ -1,10 +1,7 @@

I've written a couple of posts in the past few months but they were all for the blog at work so I figured I'm long overdue for one on Silicon Exposed.

So what's a GreenPak?

-

Silego Technology is a fabless semiconductor company located in the SF Bay area, which makes (among other things) a line of programmable logic devices known as GreenPak. Their 5th generation parts were just announced, but I started this project before that happened so I'm still targeting the 4th generation.
-
GreenPak devices are kind of like itty bitty PSoCs - they have a mixed signal fabric with an ADC, DACs, comparators, voltage references, plus a digital LUT/FF fabric and some typical digital MCU peripherals like counters and oscillators (but no CPU).
-
It's actually an interesting architecture - FPGAs (including some devices marketed as CPLDs) are a 2D array of LUTs connected via wires to adjacent cells, and true (product term) CPLDs are a star topology of AND-OR arrays connected by a crossbar. GreenPak, on the other hand, is a star topology of LUTs, flipflops, and analog/digital hard IP connected to a crossbar.
-
Without further ado, here's a block diagram showing all the cool stuff you get in the SLG46620V:

+

Silego Technology is a fabless semiconductor company located in the SF Bay area, which makes (among other things) a line of programmable logic devices known as GreenPak. Their 5th generation parts were just announced, but I started this project before that happened so I'm still targeting the 4th generation.

GreenPak devices are kind of like itty bitty PSoCs - they have a mixed signal fabric with an ADC, DACs, comparators, voltage references, plus a digital LUT/FF fabric and some typical digital MCU peripherals like counters and oscillators (but no CPU).

It's actually an interesting architecture - FPGAs (including some devices marketed as CPLDs) are a 2D array of LUTs connected via wires to adjacent cells, and true (product term) CPLDs are a star topology of AND-OR arrays connected by a crossbar. GreenPak, on the other hand, is a star topology of LUTs, flipflops, and analog/digital hard IP connected to a crossbar.

Without further ado, here's a block diagram showing all the cool stuff you get in the SLG46620V:

@@ -16,11 +13,7 @@

- They're also tiny (the SLG46620V is a 20-pin 0.4mm pitch STQFN measuring 2x3 mm, and the lower gate count SLG46140V is a mere 1.6x2 mm) and probably the cheapest programmable logic device on the market - $0.50 in low volume and less than $0.40 in larger quantities.
-
The Vdd range of GreenPak4 is huge, more like what you'd expect from an MCU than an FPGA! It can run on anything from 1.8 to 5V, although performance is only specified at 1.8, 3.3, and 5V nominal voltages. There's also a dual-rail version that trades one of the GPIO pins for a second power supply pin, allowing you to interface to logic at two different voltage levels.
-
To support low-cost/space-constrained applications, they even have the configuration memory on die. It's one-time programmable and needs external Vpp to program (presumably Silego didn't want to waste die area on charge pumps that would only be used once) but has a SRAM programming mode for prototyping.
-
The best part is that the development software (GreenPak Designer) is free of charge and provided for all major operating systems including Linux! Unfortunately, the only supported design entry method is schematic entry and there's no way to write your design in a HDL.
-
While schematics may be fine for quick tinkering on really simple designs, they quickly get unwieldy. The nightmare of a circuit shown below is just a bunch of counters hooked up to LEDs that blink at various rates.

+ They're also tiny (the SLG46620V is a 20-pin 0.4mm pitch STQFN measuring 2x3 mm, and the lower gate count SLG46140V is a mere 1.6x2 mm) and probably the cheapest programmable logic device on the market - $0.50 in low volume and less than $0.40 in larger quantities.

The Vdd range of GreenPak4 is huge, more like what you'd expect from an MCU than an FPGA! It can run on anything from 1.8 to 5V, although performance is only specified at 1.8, 3.3, and 5V nominal voltages. There's also a dual-rail version that trades one of the GPIO pins for a second power supply pin, allowing you to interface to logic at two different voltage levels.

To support low-cost/space-constrained applications, they even have the configuration memory on die. It's one-time programmable and needs external Vpp to program (presumably Silego didn't want to waste die area on charge pumps that would only be used once) but has a SRAM programming mode for prototyping.

The best part is that the development software (GreenPak Designer) is free of charge and provided for all major operating systems including Linux! Unfortunately, the only supported design entry method is schematic entry and there's no way to write your design in a HDL.

While schematics may be fine for quick tinkering on really simple designs, they quickly get unwieldy. The nightmare of a circuit shown below is just a bunch of counters hooked up to LEDs that blink at various rates.

@@ -32,14 +25,9 @@

- As if this wasn't enough of a problem, the largest GreenPak4 device (the SLG46620V) is split into two halves with limited routing between them, and the GUI doesn't help the user manage this complexity at all - you have to draw your schematic in two halves and add "cross connections" between them.
-
The icing on the cake is that schematics are a pain to diff and collaborate on. Although GreenPak schematics are XML based, which is a touch better than binary, who wants to read a giant XML diff and try to figure out what's going on in the circuit?
-
This isn't going to be a post on the quirks of Silego's software, though - that would be boring. As it turns out, there's one more exciting feature of these chips that I didn't mention earlier: the configuration bitstream is 100% documented in the device datasheet! This is unheard of in the programmable logic world. As Nick of Arachnid Labs says, the chip is "just dying for someone to write a VHDL or Verilog compiler for it". As you can probably guess by from the title of this post, I've been busy doing exactly that.

+ As if this wasn't enough of a problem, the largest GreenPak4 device (the SLG46620V) is split into two halves with limited routing between them, and the GUI doesn't help the user manage this complexity at all - you have to draw your schematic in two halves and add "cross connections" between them.

The icing on the cake is that schematics are a pain to diff and collaborate on. Although GreenPak schematics are XML based, which is a touch better than binary, who wants to read a giant XML diff and try to figure out what's going on in the circuit?

This isn't going to be a post on the quirks of Silego's software, though - that would be boring. As it turns out, there's one more exciting feature of these chips that I didn't mention earlier: the configuration bitstream is 100% documented in the device datasheet! This is unheard of in the programmable logic world. As Nick of Arachnid Labs says, the chip is "just dying for someone to write a VHDL or Verilog compiler for it". As you can probably guess by from the title of this post, I've been busy doing exactly that.

Great! How does it work?

-

Rather than wasting time writing a synthesizer, I decided to write a GreenPak technology library for Clifford Wolf's excellent open source synthesis tool, Yosys, and then make a place-and-route tool to turn that into a final netlist. The post-PAR netlist can then be loaded into GreenPak Designer in order to program the device.
-
The first step of the process is to run the "synth_greenpak4" Yosys flow on the Verilog source. This runs a generic RTL synthesis pass, then some coarse-grained extraction passes to infer shift register and counter cells from behavioral logic, and finally maps the remaining logic to LUT/FF cells and outputs a JSON-formatted netlist.
-
Once the design has been synthesized, my tool (named, surprisingly, gp4par) is then launched on the netlist. It begins by parsing the JSON and constructing a directed graph of cell objects in memory. A second graph, containing all of the primitives in the device and the legal connections between them, is then created based on the device specified on the command line. (As of now only the SLG46620V is supported; the SLG46621V can be added fairly easily but the SLG46140V has a slightly different microarchitecture which will require a bit more work to support.)
-
After the graphs are generated, each node in the netlist graph is assigned a numeric label identifying the type of cell and each node in the device graph is assigned a list of legal labels: for example, an I/O buffer site is legal for an input buffer, output buffer, or bidirectional buffer.

+

Rather than wasting time writing a synthesizer, I decided to write a GreenPak technology library for Clifford Wolf's excellent open source synthesis tool, Yosys, and then make a place-and-route tool to turn that into a final netlist. The post-PAR netlist can then be loaded into GreenPak Designer in order to program the device.

The first step of the process is to run the "synth_greenpak4" Yosys flow on the Verilog source. This runs a generic RTL synthesis pass, then some coarse-grained extraction passes to infer shift register and counter cells from behavioral logic, and finally maps the remaining logic to LUT/FF cells and outputs a JSON-formatted netlist.

Once the design has been synthesized, my tool (named, surprisingly, gp4par) is then launched on the netlist. It begins by parsing the JSON and constructing a directed graph of cell objects in memory. A second graph, containing all of the primitives in the device and the legal connections between them, is then created based on the device specified on the command line. (As of now only the SLG46620V is supported; the SLG46621V can be added fairly easily but the SLG46140V has a slightly different microarchitecture which will require a bit more work to support.)

After the graphs are generated, each node in the netlist graph is assigned a numeric label identifying the type of cell and each node in the device graph is assigned a list of legal labels: for example, an I/O buffer site is legal for an input buffer, output buffer, or bidirectional buffer.

diff --git a/article_scraper/resources/tests/readability/tumblr/expected.html b/article_scraper/resources/tests/readability/tumblr/expected.html
index c3598cf..cedfe20 100644
--- a/article_scraper/resources/tests/readability/tumblr/expected.html
+++ b/article_scraper/resources/tests/readability/tumblr/expected.html
@@ -1,4 +1,4 @@

Minecraft 1.8 - The Bountiful Update

-

+ Added Granite, Andesite, and Diorite stone blocks, with smooth versions
+ Added Slime Block
+ Added Iron Trapdoor
+ Added Prismarine and Sea Lantern blocks
+ Added the Ocean Monument
+ Added Red Sandstone
+ Added Banners
+ Added Armor Stands
+ Added Coarse Dirt (dirt where grass won’t grow)
+ Added Guardian mobs, with item drops
+ Added Endermite mob
+ Added Rabbits, with item drops
+ Added Mutton and Cooked Mutton
+ Villagers will harvest crops and plant new ones
+ Mossy Cobblestone and Mossy Stone Bricks are now craftable
+ Chiseled Stone Bricks are now craftable
+ Doors and fences now come in all wood type variants
+ Sponge block has regained its water-absorbing ability and becomes wet
+ Added a spectator game mode (game mode 3)
+ Added one new achievement
+ Added “Customized” world type
+ Added hidden “Debug Mode” world type
+ Worlds can now have a world barrier
+ Added @e target selector for Command Blocks
+ Added /blockdata command
+ Added /clone command
+ Added /execute command
+ Added /fill command
+ Added /particle command
+ Added /testforblocks command
+ Added /title command
+ Added /trigger command
+ Added /worldborder command
+ Added /stats command
+ Containers can be locked in custom maps by using the “Lock” data tag
+ Added logAdminCommands, showDeathMessages, reducedDebugInfo, sendCommandFeedback, and randomTickSpeed game rules
+ Added three new statistics
+ Player skins can now have double layers across the whole model, and left/right arms/legs can be edited independently
+ Added a new player model with smaller arms, and a new player skin called Alex?
+ Added options for configuring what pieces of the skin that are visible
+ Blocks can now have custom visual variations in the resource packs
+ Minecraft Realms now has an activity chart, so you can see who has been online
+ Minecraft Realms now lets you upload your maps
* Difficulty setting is saved per world, and can be locked if wanted
* Enchanting has been redone, now costs lapis lazuli in addition to enchantment levels
* Villager trading has been rebalanced
* Anvil repairing has been rebalanced
* Considerable faster client-side performance
* Max render distance has been increased to 32 chunks (512 blocks)
* Adventure mode now prevents you from destroying blocks, unless your items have the CanDestroy data tag
* Resource packs can now also define the shape of blocks and items, and not just their textures
* Scoreboards have been given a lot of new features
* Tweaked the F3 debug screen
* Block ID numbers (such as 1 for stone), are being replaced by ID names (such as minecraft:stone)
* Server list has been improved
* A few minor changes to village and temple generation
* Mob heads for players now show both skin layers
* Buttons can now be placed on the ceiling
* Lots and lots of other changes
* LOTS AND LOTS of other changes
- Removed Herobrine

+

+ Added Granite, Andesite, and Diorite stone blocks, with smooth versions
+ Added Slime Block
+ Added Iron Trapdoor
+ Added Prismarine and Sea Lantern blocks
+ Added the Ocean Monument
+ Added Red Sandstone
+ Added Banners
+ Added Armor Stands
+ Added Coarse Dirt (dirt where grass won’t grow)
+ Added Guardian mobs, with item drops
+ Added Endermite mob
+ Added Rabbits, with item drops
+ Added Mutton and Cooked Mutton
+ Villagers will harvest crops and plant new ones
+ Mossy Cobblestone and Mossy Stone Bricks are now craftable
+ Chiseled Stone Bricks are now craftable
+ Doors and fences now come in all wood type variants
+ Sponge block has regained its water-absorbing ability and becomes wet
+ Added a spectator game mode (game mode 3)
+ Added one new achievement
+ Added “Customized” world type
+ Added hidden “Debug Mode” world type
+ Worlds can now have a world barrier
+ Added @e target selector for Command Blocks
+ Added /blockdata command
+ Added /clone command
+ Added /execute command
+ Added /fill command
+ Added /particle command
+ Added /testforblocks command
+ Added /title command
+ Added /trigger command
+ Added /worldborder command
+ Added /stats command
+ Containers can be locked in custom maps by using the “Lock” data tag
+ Added logAdminCommands, showDeathMessages, reducedDebugInfo, sendCommandFeedback, and randomTickSpeed game rules
+ Added three new statistics
+ Player skins can now have double layers across the whole model, and left/right arms/legs can be edited independently
+ Added a new player model with smaller arms, and a new player skin called Alex?
+ Added options for configuring what pieces of the skin that are visible
+ Blocks can now have custom visual variations in the resource packs
+ Minecraft Realms now has an activity chart, so you can see who has been online
+ Minecraft Realms now lets you upload your maps
* Difficulty setting is saved per world, and can be locked if wanted
* Enchanting has been redone, now costs lapis lazuli in addition to enchantment levels
* Villager trading has been rebalanced
* Anvil repairing has been rebalanced
* Considerable faster client-side performance
* Max render distance has been increased to 32 chunks (512 blocks)
* Adventure mode now prevents you from destroying blocks, unless your items have the CanDestroy data tag
* Resource packs can now also define the shape of blocks and items, and not just their textures
* Scoreboards have been given a lot of new features
* Tweaked the F3 debug screen
* Block ID numbers (such as 1 for stone), are being replaced by ID names (such as minecraft:stone)
* Server list has been improved
* A few minor changes to village and temple generation
* Mob heads for players now show both skin layers
* Buttons can now be placed on the ceiling
* Lots and lots of other changes
* LOTS AND LOTS of other changes
- Removed Herobrine

\ No newline at end of file
diff --git a/article_scraper/resources/tests/readability/v8-blog/expected.html b/article_scraper/resources/tests/readability/v8-blog/expected.html
index ecbb8c8..99cf99b 100644
--- a/article_scraper/resources/tests/readability/v8-blog/expected.html
+++ b/article_scraper/resources/tests/readability/v8-blog/expected.html
@@ -8,7 +8,7 @@

First, let's see what you can do with this new feature! Similar to this post let's start with a "hello world" type program that exports a single function that adds two numbers:

-
// add.c
#include <emscripten.h>

EMSCRIPTEN_KEEPALIVE
int add(int x, int y) {
return x + y;
}
+
// add.c
#include <emscripten.h>

EMSCRIPTEN_KEEPALIVE
int add(int x, int y) {
return x + y;
}

We'd normally build this with something like emcc -O3 add.c -o add.js which would emit add.js and add.wasm. Instead, let's ask emcc to only emit Wasm:

@@ -34,7 +34,7 @@

One nice thing about a standalone Wasm file like this is that you can write custom JavaScript to load and run it, which can be very minimal depending on your use case. For example, we can do this in Node.js:

-
// load-add.js
const binary = require('fs').readFileSync('add.wasm');

WebAssembly.instantiate(binary).then(({ instance }) => {
console.log(instance.exports.add(40, 2));
});
+
// load-add.js
const binary = require('fs').readFileSync('add.wasm');

WebAssembly.instantiate(binary).then(({ instance }) => {
console.log(instance.exports.add(40, 2));
});

Just 4 lines! Running that prints 42 as expected. Note that while this example is very simplistic, there are cases where you simply don't need much JavaScript, and may be able to do better than Emscripten's default JavaScript runtime (which supports a bunch of environments and options). A real-world example of that is in zeux's meshoptimizer - just 57 lines, including memory management, growth, etc.!

@@ -44,7 +44,7 @@

Another nice thing about standalone Wasm files is that you can run them in Wasm runtimes like wasmer, wasmtime, or WAVM. For example, consider this hello world:

-
// hello.cpp
#include <stdio.h>

int main() {
printf("hello, world!\n");
return 0;
}
+
// hello.cpp
#include <stdio.h>

int main() {
printf("hello, world!\n");
return 0;
}

We can build and run that in any of those runtimes:

diff --git a/article_scraper/resources/tests/readability/yahoo-4/expected.html b/article_scraper/resources/tests/readability/yahoo-4/expected.html
index 7e3a3fc..e26ced8 100644
--- a/article_scraper/resources/tests/readability/yahoo-4/expected.html
+++ b/article_scraper/resources/tests/readability/yahoo-4/expected.html
@@ -1,51 +1,7 @@
-
- - - -
-

- トレンドマイクロは3月9日、Wi-Fi利用時の通信を暗号化し保護するスマホ・タブレット向けのセキュリティアプリ「フリーWi-Fiプロテクション」(iOS/Android)の発売を開始すると発表した。1年版ライセンスは2900円(税込)で、2年版ライセンスは5000円(税込)。
-
 フリーWi-Fiプロテクションは、App Storeおよび、Google Playにて販売され、既に提供しているスマホ・タブレット向け総合セキュリティ対策アプリ「ウイルスバスター モバイル」と併用することで、不正アプリや危険なウェブサイトからの保護に加え、通信の盗み見を防ぐことができる。
-
 2020年の東京オリンピック・パラリンピックの開催などを見据え、フリーWi-Fi(公衆無線LAN)の設置が促進され、フリーWi-Fiの利用者も増加している。 -
-
 一方で、脆弱な設定のフリーWi-Fiや攻撃者が設置した偽のフリーWi-Fiへの接続などによる情報漏えい、通信の盗み見などのセキュリティリスクが危惧されているという。 -
-
 正規事業者が提供する安全性の高いフリーWi-Fiのほかにも、通信を暗号化していない安全性の低いフリーWi-Fi、さらにはサイバー犯罪者が設置したフリーWi-Fiなどさまざまなものが混在している。また、利用者は、接続する前にひとつひとつ安全性を確認するのは難しい状況だとしている。 -
-
 トレンドマイクロがスマートフォン保持者でフリーWi-Fiの利用経験がある人に実施した調査では、回答者の約85%が安全なフリーWi-Fiと危険なフリーWi-Fiは「見分けられない」と回答。さらに、約65%がフリーWi-Fiの利用に不安を感じていると回答している。 -
-
 こうした環境の変化やユーザの状況を鑑み、フリーWi-Fiプロテクションの提供を開始する。同アプリをインストールすることで利用者は、万が一安全性の低いフリーWi-Fiのアクセスポイントに接続してしまった場合でも、その通信を暗号化でき、通信の盗み見やそれによる情報漏えいのリスクを低減できるようになる。 -
-
 具体的には、フリーWi-Fi利用時に、スマートフォンがフリーWi-Fiプロテクションインフラに接続することにより、フリーWi-Fiのアクセスポイントを介した通信がVPN(Virtual Private Network)で暗号化される。これにより利用者は、第三者から通信を傍受されることやデータの情報漏えいを防ぐことが可能。さらに、かんたん自動接続の機能により、通信を暗号化していない安全性が低いフリーWi-Fi接続時や利用者が指定したWi-Fiへ接続する際に、自動的に通信を暗号化し、利用者の通信を保護する。
-
 また、フリーWi-Fiプロテクションインフラと、莫大なセキュリティ情報のビッグデータを保有するクラウド型セキュリティ技術基盤「Trend Micro Smart Protection Network」(SPN)が連携することで、フリーWi-Fiプロテクションインフラを経由してインターネットを利用する際に、利用者がフィッシング詐欺サイトや偽サイトなどへの不正サイトへアクセスすることをブロックできるという。

- - - - - - -
-
-

最終更新:3/9(木) 18:45

-

- CNET Japan -

-
- - - - -
\ No newline at end of file +

+ トレンドマイクロは3月9日、Wi-Fi利用時の通信を暗号化し保護するスマホ・タブレット向けのセキュリティアプリ「フリーWi-Fiプロテクション」(iOS/Android)の発売を開始すると発表した。1年版ライセンスは2900円(税込)で、2年版ライセンスは5000円(税込)。

 フリーWi-Fiプロテクションは、App Storeおよび、Google Playにて販売され、既に提供しているスマホ・タブレット向け総合セキュリティ対策アプリ「ウイルスバスター モバイル」と併用することで、不正アプリや危険なウェブサイトからの保護に加え、通信の盗み見を防ぐことができる。

 2020年の東京オリンピック・パラリンピックの開催などを見据え、フリーWi-Fi(公衆無線LAN)の設置が促進され、フリーWi-Fiの利用者も増加している。 +

 一方で、脆弱な設定のフリーWi-Fiや攻撃者が設置した偽のフリーWi-Fiへの接続などによる情報漏えい、通信の盗み見などのセキュリティリスクが危惧されているという。 +

 正規事業者が提供する安全性の高いフリーWi-Fiのほかにも、通信を暗号化していない安全性の低いフリーWi-Fi、さらにはサイバー犯罪者が設置したフリーWi-Fiなどさまざまなものが混在している。また、利用者は、接続する前にひとつひとつ安全性を確認するのは難しい状況だとしている。 +

 トレンドマイクロがスマートフォン保持者でフリーWi-Fiの利用経験がある人に実施した調査では、回答者の約85%が安全なフリーWi-Fiと危険なフリーWi-Fiは「見分けられない」と回答。さらに、約65%がフリーWi-Fiの利用に不安を感じていると回答している。 +

 こうした環境の変化やユーザの状況を鑑み、フリーWi-Fiプロテクションの提供を開始する。同アプリをインストールすることで利用者は、万が一安全性の低いフリーWi-Fiのアクセスポイントに接続してしまった場合でも、その通信を暗号化でき、通信の盗み見やそれによる情報漏えいのリスクを低減できるようになる。 +

 具体的には、フリーWi-Fi利用時に、スマートフォンがフリーWi-Fiプロテクションインフラに接続することにより、フリーWi-Fiのアクセスポイントを介した通信がVPN(Virtual Private Network)で暗号化される。これにより利用者は、第三者から通信を傍受されることやデータの情報漏えいを防ぐことが可能。さらに、かんたん自動接続の機能により、通信を暗号化していない安全性が低いフリーWi-Fi接続時や利用者が指定したWi-Fiへ接続する際に、自動的に通信を暗号化し、利用者の通信を保護する。

 また、フリーWi-Fiプロテクションインフラと、莫大なセキュリティ情報のビッグデータを保有するクラウド型セキュリティ技術基盤「Trend Micro Smart Protection Network」(SPN)が連携することで、フリーWi-Fiプロテクションインフラを経由してインターネットを利用する際に、利用者がフィッシング詐欺サイトや偽サイトなどへの不正サイトへアクセスすることをブロックできるという。

\ No newline at end of file
diff --git a/article_scraper/src/full_text_parser/mod.rs b/article_scraper/src/full_text_parser/mod.rs
index 46f3c5e..048d906 100644
--- a/article_scraper/src/full_text_parser/mod.rs
+++ b/article_scraper/src/full_text_parser/mod.rs
@@ -879,6 +879,10 @@ impl FullTextParser {
         _ = Util::strip_node(context, "//link");
         _ = Util::strip_node(context, "//aside");
 
+        if let Some(root) = document.get_root_element() {
+            Util::replace_brs(&root, document);
+        }
+
         Self::fix_urls(context, url, document);
     }
 
diff --git a/article_scraper/src/util.rs b/article_scraper/src/util.rs
index fb8cd6e..76f162a 100644
--- a/article_scraper/src/util.rs
+++ b/article_scraper/src/util.rs
@@ -921,4 +921,161 @@ impl Util {
         });
         std::fs::write(filename, html).unwrap();
     }
+
+    // Replaces 2 or more successive <br> elements with a single <p>.
+    // Whitespace between <br> elements are ignored.
+    // For example:
+    //   <div>foo<br>bar<br> <br><br>abc</div>
+    // will become:
+    //   <div>foo<br>bar<p>abc</p></div>
+    pub fn replace_brs(node: &Node, document: &Document) {
+        let br_nodes = Self::get_elements_by_tag_name(node, "br");
+
+        for br_node in br_nodes {
+            let mut next = br_node.get_next_sibling();
+
+            // Whether 2 or more <br> elements have been found and replaced with a
+            // <p> block.
+            let mut replaced = false;
+
+            // If we find a <br> chain, remove the <br>s until we hit another node
+            // or non-whitespace. This leaves behind the first <br> in the chain
+            // (which will be replaced with a <p> later).
+            while let Some(mut n) = next {
+                let is_text_whitespace = n
+                    .get_type()
+                    .map(|t| t == NodeType::TextNode)
+                    .unwrap_or(false)
+                    && n.get_content().trim().is_empty();
+                let is_br_node = n.get_name().to_uppercase() == "BR";
+                let next_is_br_node = n
+                    .get_next_sibling()
+                    .map(|n| n.get_name().to_uppercase() == "BR")
+                    .unwrap_or(false);
+
+                if !is_text_whitespace && !is_br_node {
+                    break;
+                }
+
+                next = n.get_next_sibling();
+
+                if is_br_node || (is_text_whitespace && next_is_br_node) {
+                    replaced = true;
+                    n.unlink();
+                }
+            }
+
+            if !replaced {
+                continue;
+            }
+
+            // If we removed a <br> chain, replace the remaining <br> with a <p>. Add
+            // all sibling nodes as children of the <p> until we hit another <br>
+            // chain.
+            let mut parent = match br_node.get_parent() {
+                Some(parent) => parent,
+                None => continue,
+            };
+            let mut p = Node::new("p", None, document).unwrap();
+            _ = parent.replace_child_node(p.clone(), br_node).unwrap();
+
+            next = p.get_next_sibling();
+
+            while let Some(mut next_node) = next {
+                // If we've hit another <br><br>, we're done adding children to this <p>.
+                if next_node.get_name().to_uppercase() == "BR" {
+                    if let Some(next_elem) = next_node.get_next_element_sibling() {
+                        if next_elem.get_name().to_uppercase() == "BR" {
+                            break;
+                        }
+                    }
+                }
+
+                if !Self::is_phrasing_content(&next_node) {
+                    break;
+                }
+
+                // Otherwise, make this node a child of the new <p>.
+                let sibling = next_node.get_next_sibling();
+                next_node.unlink();
+                _ = p.add_child(&mut next_node);
+
+                next = sibling;
+            }
+
+            if p.get_child_elements().is_empty() && p.get_content().trim().is_empty() {
+                p.unlink();
+                continue;
+            }
+
+            while let Some(mut last_child) = p.get_last_child() {
+                let is_text_node = last_child
+                    .get_type()
+                    .map(|t| t == NodeType::TextNode)
+                    .unwrap_or(false);
+                let is_empty = last_child.get_content().trim().is_empty();
+
+                if is_text_node && is_empty {
+                    last_child.unlink();
+                } else {
+                    break;
+                }
+            }
+
+            if let Some(mut parent) = p.get_parent() {
+                if parent.get_name().to_uppercase() == "P" {
+                    _ = parent.set_name("DIV");
+                }
+            }
+        }
+    }
+}
+
+#[cfg(test)]
+mod tests {
+    use libxml::parser::Parser;
+
+    use super::Util;
+
+    fn replace_brs(source: &str, expected: &str) {
+        libxml::tree::node::set_node_rc_guard(10);
+
+        let parser = Parser::default_html();
+        let document = parser.parse_string(source).unwrap();
+        let root = document.get_root_element().unwrap();
+        let body = root.get_first_child().unwrap();
+        let div = body.get_first_child().unwrap();
+
+        Util::replace_brs(&root, &document);
+
+        let result = document.node_to_string(&div);
+
+        assert_eq!(expected, result);
+    }
+
+    #[test]
+    fn replace_brs_1() {
+        replace_brs(
+            "<div>foo<br>bar<br> <br><br>abc</div>",
+            "<div>foo<br>bar<p>abc</p></div>",
+        )
+    }
+
+    #[test]
+    fn replace_brs_2() {
+        let source = r#"
+
+

+ It might have been curiosity or it might have been the nagging sensation that chewed at his brain for the three weeks that he researched the subject of the conversation. All For One was a cryptid. Mystical in more ways than one, he was only a rumour on a network that was two-hundred years old. There were whispers of a shadowy figure who once ruled Japan, intermingled with a string of conspiracies and fragmented events. +

+

+ Izuku had even braved the dark web, poking and prodding at some of the seedier elements of the world wide web. The internet had rumours, but the dark web had stories.
+

+

+ An implied yakuza wrote about his grandfather who lost a fire manipulation Quirk and his sanity without any reason. His grandfather had been institutionalised, crying and repeating “he took it, he took it” until his dying days. No one could console him. +

+
+ "#; + replace_brs(source, source.trim()) + } }
@@ -71,8 +59,7 @@
  • Re-compute the score for the design. If it's better, accept this change and start the next iteration.
  • If the score is worse, accept it with a random probability which decreases as the iteration number goes up. If the change is not accepted, restore the previous placement.
  • - After optimization, the design is checked for routability. If any edges in the netlist graph don't correspond to edges in the device graph, the user probably asked for something impossible (for example, trying to hook a flipflop's output to a comparator's reference voltage input) so fail with an error.
    -
    The design is then routed. This is quite simple due to the crossbar structure of the device. For each edge in the netlist:

      + After optimization, the design is checked for routability. If any edges in the netlist graph don't correspond to edges in the device graph, the user probably asked for something impossible (for example, trying to hook a flipflop's output to a comparator's reference voltage input) so fail with an error.

      The design is then routed. This is quite simple due to the crossbar structure of the device. For each edge in the netlist:

      1. If dedicated (non-fabric) routing is used for this path, configure the destination's input mux appropriately and stop.
      2. If the source and destination are in the same half of the device, configure the destination's input mux appropriately and stop.
      3. A cross-connection must be used. Check if we already used one to bring the source signal to the other half of the device. If found, configure the destination to route from that cross-connection and stop.
      4. @@ -84,26 +71,14 @@
      5. If an I/O buffer is connected to analog hard IP, fail with an error if it's not configured in analog mode.
      6. Some signals (such as comparator inputs and oscillator power-down controls) are generated by a shared mux and fed to many loads. If different loads require conflicting settings for the shared mux, fail with an error.
      7. - If DRC passes with no errors, configure all of the individual cells in the netlist based on the HDL parameters. Fail with an error if an invalid configuration was requested.
        -
        Finally, generate the bitstream from all of the per-cell configuration and write it to a file.

        + If DRC passes with no errors, configure all of the individual cells in the netlist based on the HDL parameters. Fail with an error if an invalid configuration was requested.

        Finally, generate the bitstream from all of the per-cell configuration and write it to a file.

        Great, let's get started!

        - If you don't already have one, you'll need to buy a GreenPak4 development kit. The kit includes samples of the SLG46620V (among other devices) and a programmer/emulation board. While you're waiting for it to arrive, install GreenPak Designer.
        -
        Download and install Yosys. Although Clifford is pretty good at merging my pull requests, only my fork on Github is guaranteed to have the most up-to-date support for GreenPak devices so don't be surprised if you can't use a bleeding-edge feature with mainline Yosys.
        -
        Download and install gp4par. You can get it from the Github repository.
        -
        Write your HDL, compile with Yosys, P&R with gp4par, and import the bitstream into GreenPak Designer to program the target device. The most current gp4par manual is included in LaTeX source form in the source tree and is automatically built as part of the compile process. If you're just browsing, there's a relatively recent PDF version on my web server.
        -
        If you'd like to see the Verilog that produced the nightmare of a schematic I showed above, here it is.
        -
        Be advised that this project is still very much a work in progress and there are still a number of SLG46620V features I don't support (see the manual for exact details).

        + If you don't already have one, you'll need to buy a GreenPak4 development kit. The kit includes samples of the SLG46620V (among other devices) and a programmer/emulation board. While you're waiting for it to arrive, install GreenPak Designer.

        Download and install Yosys. Although Clifford is pretty good at merging my pull requests, only my fork on Github is guaranteed to have the most up-to-date support for GreenPak devices so don't be surprised if you can't use a bleeding-edge feature with mainline Yosys.

        Download and install gp4par. You can get it from the Github repository.

        Write your HDL, compile with Yosys, P&R with gp4par, and import the bitstream into GreenPak Designer to program the target device. The most current gp4par manual is included in LaTeX source form in the source tree and is automatically built as part of the compile process. If you're just browsing, there's a relatively recent PDF version on my web server.

        If you'd like to see the Verilog that produced the nightmare of a schematic I showed above, here it is.

        Be advised that this project is still very much a work in progress and there are still a number of SLG46620V features I don't support (see the manual for exact details).

        I love it / it segfaulted / there's a problem in the manual!

        Hop in our IRC channel (##openfpga on Freenode) and let me know. Feedback is great, pull requests are even better,

        You're competing with Silego's IDE. Have they found out and sued you yet?

        - Nope. They're fully aware of what I'm doing and are rolling out the red carpet for me. They love the idea of a HDL flow as an alternative to schematic entry and are pretty amazed at how fast it's coming together.
        -
        After I reported a few bugs in their datasheets they decided to skip the middleman and give me direct access to the engineer who writes their documentation so that I can get faster responses. The last time I found a problem (two different parts of the datasheet contradicted each other) an updated datasheet was in my inbox and on their website by the next day. I only wish Xilinx gave me that kind of treatment!
        -
        They've even offered me free hardware to help me add support for their latest product family, although I plan to get GreenPak4 support to a more stable state before taking them up on the offer.

        + Nope. They're fully aware of what I'm doing and are rolling out the red carpet for me. They love the idea of a HDL flow as an alternative to schematic entry and are pretty amazed at how fast it's coming together.

        After I reported a few bugs in their datasheets they decided to skip the middleman and give me direct access to the engineer who writes their documentation so that I can get faster responses. The last time I found a problem (two different parts of the datasheet contradicted each other) an updated datasheet was in my inbox and on their website by the next day. I only wish Xilinx gave me that kind of treatment!

        They've even offered me free hardware to help me add support for their latest product family, although I plan to get GreenPak4 support to a more stable state before taking them up on the offer.

        So what's next?

        -

        Better testing, for starters. I have to verify functionality by hand with a DMM and oscilloscope, which is time consuming.
        -
        My contact at Silego says they're going to be giving me documentation on the SRAM emulation interface soon, so I'm going to make a hardware-in-loop test platform that connects to my desktop and the Silego ZIF socket, and lets me load new bitstreams via a scriptable interface. It'll have FPGA-based digital I/O as well as an ADC and DAC on every device pin, plus an adjustable voltage regulator for power, so I can feed in arbitrary mixed-signal test waveforms and write PC-based unit tests to verify correct behavior.
        -
        Other than that, I want to finish support for the SLG46620V in the next month or two. The SLG46621V will be an easy addition since only one pin and the relevant configuration bits have changed from the 46620 (I suspect they're the same die, just bonded out differently).
        -
        Once that's done I'll have to do some more extensive work to add the SLG46140V since the architecture is a bit different (a lot of the combinatorial logic is merged into multi-function blocks). Luckily, the 46140 has a lot in common architecturally with the GreenPak5 family, so once that's done GreenPak5 will probably be a lot easier to add support for.
        -
        My thanks go out to Clifford Wolf, whitequark, the IRC users in ##openfpga, and everyone at Silego I've worked with to help make this possible. I hope that one day this project will become mature enough that Silego will ship it as an officially supported extension to GreenPak Designer, making history by becoming the first modern programmable logic vendor to ship a fully open source synthesis and P&R suite. +

        Better testing, for starters. I have to verify functionality by hand with a DMM and oscilloscope, which is time consuming.

        My contact at Silego says they're going to be giving me documentation on the SRAM emulation interface soon, so I'm going to make a hardware-in-loop test platform that connects to my desktop and the Silego ZIF socket, and lets me load new bitstreams via a scriptable interface. It'll have FPGA-based digital I/O as well as an ADC and DAC on every device pin, plus an adjustable voltage regulator for power, so I can feed in arbitrary mixed-signal test waveforms and write PC-based unit tests to verify correct behavior.

        Other than that, I want to finish support for the SLG46620V in the next month or two. The SLG46621V will be an easy addition since only one pin and the relevant configuration bits have changed from the 46620 (I suspect they're the same die, just bonded out differently).

        Once that's done I'll have to do some more extensive work to add the SLG46140V since the architecture is a bit different (a lot of the combinatorial logic is merged into multi-function blocks). Luckily, the 46140 has a lot in common architecturally with the GreenPak5 family, so once that's done GreenPak5 will probably be a lot easier to add support for.

        My thanks go out to Clifford Wolf, whitequark, the IRC users in ##openfpga, and everyone at Silego I've worked with to help make this possible. I hope that one day this project will become mature enough that Silego will ship it as an officially supported extension to GreenPak Designer, making history by becoming the first modern programmable logic vendor to ship a fully open source synthesis and P&R suite.

        \ No newline at end of file diff --git a/article_scraper/resources/tests/readability/hukumusume/expected.html b/article_scraper/resources/tests/readability/hukumusume/expected.html index 90da1ee..5ecf93d 100644 --- a/article_scraper/resources/tests/readability/hukumusume/expected.html +++ b/article_scraper/resources/tests/readability/hukumusume/expected.html @@ -33,21 +33,11 @@

        福娘童話集 > きょうのイソップ童話 > 1月のイソップ童話 > 欲張りなイヌ

        -

        - 元旦のイソップ童話
        -
        -
        -
        - よくばりなイヌ
        -
        -
        -
        - 欲張りなイヌ
        -
        -
        -
        - ひらがな ←→ 日本語・英語 ←→ English -

        +
        +

        元旦のイソップ童話

        + よくばりなイヌ

        + 欲張りなイヌ

        + ひらがな ←→ 日本語・英語 ←→ English

        @@ -107,13 +97,7 @@ おしまい

        - 前のページへ戻る
        -
        -
        -
        - - -

        + 前のページへ戻る

        + + + + + + + + + + + + + + + + + + + + + + + + + きょうの日本昔話

        + ネコがネズミを追いかける訳

        + きょうの世界昔話

        + モンゴルの十二支話

        + きょうの日本民話

        + 仕事の取替えっこ

        + きょうのイソップ童話

        + 欲張りなイヌ

        + きょうの江戸小話

        + ぞうきんとお年玉

        - - - - - - - - - - - - - - - - - - - - - - - - + きょうの百物語

        + 百物語の幽霊

        @@ -136,112 +120,80 @@
        -      1月 1日の豆知識
        -
        -
        -
        - 366日への旅
        +      1月 1日の豆知識

        +

        + 366日への旅

        + きょうの記念日

        + 元旦

        + きょうの誕生花

        + 松(まつ)

        + きょうの誕生日・出来事

        + 1949年 Mr.マリック(マジシャン)

        + 恋の誕生日占い

        + 自分の考えをしっかりと持った女の子。

        + なぞなぞ小学校

        + ○(丸)を取ったらお母さんになってしまう男の人は?

        + あこがれの職業紹介

        + 歌手

        + 恋の魔法とおまじない 001

        + 両思いになれる おまじない

        +   1月 1日の童話・昔話

        + 福娘童話集

        - きょうの記念日
        -
        - 元旦 -
        - きょうの誕生花
        -
        - 松(まつ) -
        - きょうの誕生日・出来事
        -
        - 1949年 Mr.マリック(マジシャン) -
        - 恋の誕生日占い
        -
        - 自分の考えをしっかりと持った女の子。 -
        - なぞなぞ小学校
        -
        - ○(丸)を取ったらお母さんになってしまう男の人は? -
        - あこがれの職業紹介
        -
        - 歌手 -
        - 恋の魔法とおまじない 001
        -
        - 両思いになれる おまじない -
        -   1月 1日の童話・昔話
        -
        -
        -
        - 福娘童話集
        -
        - きょうの日本昔話
        -
        - ネコがネズミを追いかける訳 -
        - きょうの世界昔話
        -
        - モンゴルの十二支話 -
        - きょうの日本民話
        -
        - 仕事の取替えっこ -
        - きょうのイソップ童話
        -
        - 欲張りなイヌ -
        - きょうの江戸小話
        -
        - ぞうきんとお年玉 -
        - きょうの百物語
        -
        - 百物語の幽霊 -
        @@ -254,37 +206,32 @@
    - 366日への旅
    -
    - 毎日の記念日・誕生花 ・有名人の誕生日と性格判断
    + 366日への旅

    + 毎日の記念日・誕生花 ・有名人の誕生日と性格判断

    - 福娘童話集
    -
    - 世界と日本の童話と昔話
    + 福娘童話集

    + 世界と日本の童話と昔話

    - 女の子応援サイト -さくら-
    -
    - 誕生日占い、お仕事紹介、おまじない、など
    + 女の子応援サイト -さくら-

    + 誕生日占い、お仕事紹介、おまじない、など

    - 子どもの病気相談所
    -
    - 病気検索と対応方法、症状から検索するWEB問診
    + 子どもの病気相談所

    + 病気検索と対応方法、症状から検索するWEB問診

    - 世界60秒巡り
    -
    - 国旗国歌や世界遺産など、世界の国々の豆知識
    + 世界60秒巡り

    + 国旗国歌や世界遺産など、世界の国々の豆知識