mirror of https://gitlab.com/news-flash/article_scraper.git synced 2025-07-07 16:15:32 +02:00
Commit graph

69 commits

Author SHA1 Message Date
Jan Lukas Gernert
acb7d1d000 port libxml workaround from hurl 2023-08-10 02:09:07 +02:00
Jan Lukas Gernert
6116ba38ae no need for head 2023-08-10 02:06:52 +02:00
Jan Lukas Gernert
8c7cdacd26 Revert "generate full html document"
This reverts commit 0133b20f06.
2023-08-10 02:06:08 +02:00
Jan Lukas Gernert
0133b20f06 generate full html document 2023-08-10 00:01:31 +02:00
Jan Lukas Gernert
1584649eb4 fix tests 2023-08-10 00:01:10 +02:00
Jan Lukas Gernert
2c76a89f9d add spiegel test 2023-08-09 23:57:25 +02:00
Jan Lukas Gernert
9aa6478e3c update heise test 2023-08-09 23:25:07 +02:00
Jan Lukas Gernert
b91014c685 clean html fragments 2023-08-03 10:40:44 +02:00
Leonardo Fedalto
3211b91bad Make Article public 2023-08-01 21:39:48 +02:00
Jan Lukas Gernert
7a4f5c500d 400 2023-08-01 19:35:22 +02:00
Jan Lukas Gernert
a7e8661a09 update tests & defined youtube iframe height 2023-08-01 18:37:55 +02:00
Jan Lukas Gernert
eb1bfdbca0 print url 2023-07-28 07:09:50 +02:00
Jan Lukas Gernert
40f065d9cd allow downloads without content type smaller than 5mb 2023-07-28 07:03:50 +02:00
Jan Lukas Gernert
db007f752c dont clean video tags 2023-07-27 23:18:17 +02:00
Jan Lukas Gernert
bf7a89fef7 don't fail because of lacking content length 2023-07-23 15:39:24 +02:00
Jan Lukas Gernert
345518253a even if img has src 2023-07-22 20:03:32 +02:00
Jan Lukas Gernert
42eb9daf65 remove lazy loading attributes 2023-07-22 19:57:38 +02:00
Jan Lukas Gernert
d562d41b81 download single image 2023-07-16 21:40:10 +02:00
Jan Lukas Gernert
be40383b1a impl from reqwest error 2023-07-16 15:17:01 +02:00
Jan Lukas Gernert
d62aa8c31a clippy fixes 2023-06-29 19:59:38 +02:00
Jan Lukas Gernert
fcec0d83ee don't move content nodes to <article> root node
could fix potential crash?
2023-06-29 19:47:49 +02:00
Jan Lukas Gernert
fdb8d9a97e small fixes 2023-06-27 19:21:26 +02:00
Jan Lukas Gernert
4fd41d98cc add fn to parse thumbnail from html 2023-06-26 23:22:08 +02:00
Jan Lukas Gernert
e32015c1d0 add mercury leading image heuristics 2023-06-26 22:25:57 +02:00
Jan Lukas Gernert
e99a4b4f23 ignore test resources 2023-06-23 21:22:37 +02:00
Jan Lukas Gernert
a7983e873d (cargo-release) version 2.0.0 2023-06-23 21:17:19 +02:00
Jan Lukas Gernert
a036d03510 use ftr-site-config fork with heise patch 2023-06-23 21:15:36 +02:00
Jan Lukas Gernert
a31956531a fix download loop 2023-06-22 00:15:57 +02:00
Jan Lukas Gernert
582834cdf1 fixes 2023-06-21 23:48:09 +02:00
Jan Lukas Gernert
e0ccd7e0b3 split download & parsing 2023-06-21 23:04:21 +02:00
Jan Lukas Gernert
99c5f6220e fix golem test 2023-06-21 23:04:08 +02:00
Jan Lukas Gernert
d8ceee1403 remove <h1/2> duplicating the title 2023-04-30 09:24:00 +02:00
Jan Lukas Gernert
16b102b313 replace multiple <br>s with single <p> 2023-04-29 18:20:58 +02:00
Jan Lukas Gernert
c4f8bd2bc2 fix heise crash: simpler way of checking for ancestor 2023-04-28 15:56:29 +02:00
Jan Lukas Gernert
871b441776 parse image objects 2023-04-28 07:46:28 +02:00
Jan Lukas Gernert
572fada104 parse video objects 2023-04-27 19:03:07 +02:00
Jan Lukas Gernert
34a737c89c overhaul non-readability tests 2023-04-27 07:40:28 +02:00
Jan Lukas Gernert
f737ab27fd update readability test results 2023-04-26 21:04:35 +02:00
Jan Lukas Gernert
2a4f17d458 ignore image download test 2023-04-26 20:58:25 +02:00
Jan Lukas Gernert
62c0968619 remove empty nodes 2023-04-26 19:54:34 +02:00
Jan Lukas Gernert
5621a0ea54 fmt 2023-04-26 09:12:55 +02:00
Jan Lukas Gernert
fbb6585596 replace first occurence only 2023-04-26 09:09:06 +02:00
Jan Lukas Gernert
afbc384b38 update ftr config 2023-04-26 07:45:40 +02:00
Jan Lukas Gernert
dd958fe30f fix encoding 2023-04-26 07:44:32 +02:00
Jan Lukas Gernert
bd413a795c fmt 2023-04-25 19:12:15 +02:00
Jan Lukas Gernert
a0161e92d4 next page fixes 2023-04-25 18:57:24 +02:00
Jan Lukas Gernert
37d317ad86 simplify iterating over dir 2023-04-25 08:58:15 +02:00
Jan Lukas Gernert
309a60c5d0 update regex 2023-04-23 20:45:45 +02:00
Jan Lukas Gernert
c51f0fd731 cargo.toml metadata 2023-04-23 16:47:02 +02:00
Jan Lukas Gernert
1695e33f9e fmt 2023-04-23 16:37:06 +02:00