mirror of https://gitlab.com/news-flash/article_scraper.git synced 2025-07-07 08:05:31 +02:00
Commit graph

261 commits

Author SHA1 Message Date
Jan Lukas Gernert
e99a4b4f23 ignore test resources 2023-06-23 21:22:37 +02:00
Jan Lukas Gernert
a7983e873d (cargo-release) version 2.0.0 2023-06-23 21:17:19 +02:00
Jan Lukas Gernert
a036d03510 use ftr-site-config fork with heise patch 2023-06-23 21:15:36 +02:00
Jan Lukas Gernert
a31956531a fix download loop 2023-06-22 00:15:57 +02:00
Jan Lukas Gernert
582834cdf1 fixes 2023-06-21 23:48:09 +02:00
Jan Lukas Gernert
e0ccd7e0b3 split download & parsing 2023-06-21 23:04:21 +02:00
Jan Lukas Gernert
99c5f6220e fix golem test 2023-06-21 23:04:08 +02:00
Jan Lukas Gernert
d8ceee1403 remove <h1/2> duplicating the title 2023-04-30 09:24:00 +02:00
Jan Lukas Gernert
eb4b3603f5 remove artifact 2023-04-29 18:21:21 +02:00
Jan Lukas Gernert
16b102b313 replace multiple <br>s with single <p> 2023-04-29 18:20:58 +02:00
Jan Lukas Gernert
c4f8bd2bc2 fix heise crash: simpler way of checking for ancestor 2023-04-28 15:56:29 +02:00
Jan Lukas Gernert
44d01ad1c6 Merge branch 'hardwareluxx' into 'master' (Hardwareluxx; see merge request news-flash/article_scraper!8) 2023-04-28 05:57:37 +00:00
Jan Lukas Gernert
871b441776 parse image objects 2023-04-28 07:46:28 +02:00
Jan Lukas Gernert
572fada104 parse video objects 2023-04-27 19:03:07 +02:00
Jan Lukas Gernert
34a737c89c overhaul non-readability tests 2023-04-27 07:40:28 +02:00
Jan Lukas Gernert
f737ab27fd update readability test results 2023-04-26 21:04:35 +02:00
Jan Lukas Gernert
2a4f17d458 ignore image download test 2023-04-26 20:58:25 +02:00
Jan Lukas Gernert
62c0968619 remove empty nodes 2023-04-26 19:54:34 +02:00
Jan Lukas Gernert
5621a0ea54 fmt 2023-04-26 09:12:55 +02:00
Jan Lukas Gernert
fbb6585596 replace first occurence only 2023-04-26 09:09:06 +02:00
Jan Lukas Gernert
afbc384b38 update ftr config 2023-04-26 07:45:40 +02:00
Jan Lukas Gernert
dd958fe30f fix encoding 2023-04-26 07:44:32 +02:00
Jan Lukas Gernert
bd413a795c fmt 2023-04-25 19:12:15 +02:00
Jan Lukas Gernert
a0161e92d4 next page fixes 2023-04-25 18:57:24 +02:00
Jan Lukas Gernert
37d317ad86 simplify iterating over dir 2023-04-25 08:58:15 +02:00
Jan Lukas Gernert
309a60c5d0 update regex 2023-04-23 20:45:45 +02:00
Jan Lukas Gernert
c51f0fd731 cargo.toml metadata 2023-04-23 16:47:02 +02:00
Jan Lukas Gernert
1695e33f9e fmt 2023-04-23 16:37:06 +02:00
Jan Lukas Gernert
57df2e6832 write some docs 2023-04-23 16:35:00 +02:00
Jan Lukas Gernert
bfb31dc188 fmt 2023-04-21 08:53:12 +02:00
Jan Lukas Gernert
baf2a8a15d rename test 2023-04-21 08:47:25 +02:00
Jan Lukas Gernert
b4b5d802c9 only serialize root node 2023-04-21 08:46:10 +02:00
Jan Lukas Gernert
3f58a39fcf dump node 2023-04-20 08:53:06 +02:00
Jan Lukas Gernert
cd3d3468a3 clean html 2023-04-20 08:41:10 +02:00
Jan Lukas Gernert
3096f28aae empty clean html fn 2023-04-16 22:00:00 +02:00
Jan Lukas Gernert
f427b7c36f cli: progress bar for image download 2023-04-16 21:31:11 +02:00
Jan Lukas Gernert
3dd7c7d57a tmp: calc download size & print progress 2023-04-16 18:10:43 +02:00
Jan Lukas Gernert
ccc8223db0 cleanup & fixes 2023-04-14 17:50:39 +02:00
Jan Lukas Gernert
57f74c635b fix clippy 2023-04-14 10:32:05 +02:00
Jan Lukas Gernert
3a465f2619 somehow made things much slower 2023-04-14 08:49:49 +02:00
Jan Lukas Gernert
4fd4dd39db download images concurrently 2023-04-13 07:54:31 +02:00
Jan Lukas Gernert
35a14b0a5f start improving image download 2023-04-12 08:27:22 +02:00
Jan Lukas Gernert
c198225012 eliminate additional head request 2023-04-11 07:49:01 +02:00
Jan Lukas Gernert
fa41633e11 cli to parse single page with ftr 2023-04-10 13:47:45 +02:00
Jan Lukas Gernert
d978059709 command to use readability extractor 2023-04-07 11:51:14 +02:00
Jan Lukas Gernert
063996d62f readability cli 2023-04-06 08:53:19 +02:00
Jan Lukas Gernert
a2719c8c7e first few cli args 2023-04-05 08:43:00 +02:00
Jan Lukas Gernert
4a7349a5fa add cli crate 2023-04-04 08:42:04 +02:00
Jan Lukas Gernert
9832fa2c77 clippy fixes 2023-04-02 13:23:07 +02:00
Jan Lukas Gernert
acc2fe781a port final tests from readability for now 2023-04-02 13:22:16 +02:00