Jan Lukas Gernert
|
fcec0d83ee
|
don't move content nodes to <article> root node
could fix potential crash?
|
2023-06-29 19:47:49 +02:00 |
|
Jan Lukas Gernert
|
fdb8d9a97e
|
small fixes
|
2023-06-27 19:21:26 +02:00 |
|
Jan Lukas Gernert
|
4fd41d98cc
|
add fn to parse thumbnail from html
|
2023-06-26 23:22:08 +02:00 |
|
Jan Lukas Gernert
|
e32015c1d0
|
add mercury leading image heuristics
|
2023-06-26 22:25:57 +02:00 |
|
Jan Lukas Gernert
|
e99a4b4f23
|
ignore test resources
|
2023-06-23 21:22:37 +02:00 |
|
Jan Lukas Gernert
|
a7983e873d
|
(cargo-release) version 2.0.0
|
2023-06-23 21:17:19 +02:00 |
|
Jan Lukas Gernert
|
a036d03510
|
use ftr-site-config fork with heise patch
|
2023-06-23 21:15:36 +02:00 |
|
Jan Lukas Gernert
|
a31956531a
|
fix download loop
|
2023-06-22 00:15:57 +02:00 |
|
Jan Lukas Gernert
|
582834cdf1
|
fixes
|
2023-06-21 23:48:09 +02:00 |
|
Jan Lukas Gernert
|
e0ccd7e0b3
|
split download & parsing
|
2023-06-21 23:04:21 +02:00 |
|
Jan Lukas Gernert
|
99c5f6220e
|
fix golem test
|
2023-06-21 23:04:08 +02:00 |
|
Jan Lukas Gernert
|
d8ceee1403
|
remove <h1/2> duplicating the title
|
2023-04-30 09:24:00 +02:00 |
|
Jan Lukas Gernert
|
eb4b3603f5
|
remove artifact
|
2023-04-29 18:21:21 +02:00 |
|
Jan Lukas Gernert
|
16b102b313
|
replace multiple <br>s with single <p>
|
2023-04-29 18:20:58 +02:00 |
|
Jan Lukas Gernert
|
c4f8bd2bc2
|
fix heise crash: simpler way of checking for ancestor
|
2023-04-28 15:56:29 +02:00 |
|
Jan Lukas Gernert
|
44d01ad1c6
|
Merge branch 'hardwareluxx' into 'master'
Hardwareluxx
See merge request news-flash/article_scraper!8
|
2023-04-28 05:57:37 +00:00 |
|
Jan Lukas Gernert
|
871b441776
|
parse image objects
|
2023-04-28 07:46:28 +02:00 |
|
Jan Lukas Gernert
|
572fada104
|
parse video objects
|
2023-04-27 19:03:07 +02:00 |
|
Jan Lukas Gernert
|
34a737c89c
|
overhaul non-readability tests
|
2023-04-27 07:40:28 +02:00 |
|
Jan Lukas Gernert
|
f737ab27fd
|
update readability test results
|
2023-04-26 21:04:35 +02:00 |
|
Jan Lukas Gernert
|
2a4f17d458
|
ignore image download test
|
2023-04-26 20:58:25 +02:00 |
|
Jan Lukas Gernert
|
62c0968619
|
remove empty nodes
|
2023-04-26 19:54:34 +02:00 |
|
Jan Lukas Gernert
|
5621a0ea54
|
fmt
|
2023-04-26 09:12:55 +02:00 |
|
Jan Lukas Gernert
|
fbb6585596
|
replace first occurrence only
|
2023-04-26 09:09:06 +02:00 |
|
Jan Lukas Gernert
|
afbc384b38
|
update ftr config
|
2023-04-26 07:45:40 +02:00 |
|
Jan Lukas Gernert
|
dd958fe30f
|
fix encoding
|
2023-04-26 07:44:32 +02:00 |
|
Jan Lukas Gernert
|
bd413a795c
|
fmt
|
2023-04-25 19:12:15 +02:00 |
|
Jan Lukas Gernert
|
a0161e92d4
|
next page fixes
|
2023-04-25 18:57:24 +02:00 |
|
Jan Lukas Gernert
|
37d317ad86
|
simplify iterating over dir
|
2023-04-25 08:58:15 +02:00 |
|
Jan Lukas Gernert
|
309a60c5d0
|
update regex
|
2023-04-23 20:45:45 +02:00 |
|
Jan Lukas Gernert
|
c51f0fd731
|
cargo.toml metadata
|
2023-04-23 16:47:02 +02:00 |
|
Jan Lukas Gernert
|
1695e33f9e
|
fmt
|
2023-04-23 16:37:06 +02:00 |
|
Jan Lukas Gernert
|
57df2e6832
|
write some docs
|
2023-04-23 16:35:00 +02:00 |
|
Jan Lukas Gernert
|
bfb31dc188
|
fmt
|
2023-04-21 08:53:12 +02:00 |
|
Jan Lukas Gernert
|
baf2a8a15d
|
rename test
|
2023-04-21 08:47:25 +02:00 |
|
Jan Lukas Gernert
|
b4b5d802c9
|
only serialize root node
|
2023-04-21 08:46:10 +02:00 |
|
Jan Lukas Gernert
|
3f58a39fcf
|
dump node
|
2023-04-20 08:53:06 +02:00 |
|
Jan Lukas Gernert
|
cd3d3468a3
|
clean html
|
2023-04-20 08:41:10 +02:00 |
|
Jan Lukas Gernert
|
3096f28aae
|
empty clean html fn
|
2023-04-16 22:00:00 +02:00 |
|
Jan Lukas Gernert
|
f427b7c36f
|
cli: progress bar for image download
|
2023-04-16 21:31:11 +02:00 |
|
Jan Lukas Gernert
|
3dd7c7d57a
|
tmp: calc download size & print progress
|
2023-04-16 18:10:43 +02:00 |
|
Jan Lukas Gernert
|
ccc8223db0
|
cleanup & fixes
|
2023-04-14 17:50:39 +02:00 |
|
Jan Lukas Gernert
|
57f74c635b
|
fix clippy
|
2023-04-14 10:32:05 +02:00 |
|
Jan Lukas Gernert
|
3a465f2619
|
somehow made things much slower
|
2023-04-14 08:49:49 +02:00 |
|
Jan Lukas Gernert
|
4fd4dd39db
|
download images concurrently
|
2023-04-13 07:54:31 +02:00 |
|
Jan Lukas Gernert
|
35a14b0a5f
|
start improving image download
|
2023-04-12 08:27:22 +02:00 |
|
Jan Lukas Gernert
|
c198225012
|
eliminate additional head request
|
2023-04-11 07:49:01 +02:00 |
|
Jan Lukas Gernert
|
fa41633e11
|
cli to parse single page with ftr
|
2023-04-10 13:47:45 +02:00 |
|
Jan Lukas Gernert
|
d978059709
|
command to use readability extractor
|
2023-04-07 11:51:14 +02:00 |
|
Jan Lukas Gernert
|
063996d62f
|
readability cli
|
2023-04-06 08:53:19 +02:00 |
|