mirror of https://gitlab.com/news-flash/article_scraper.git synced 2025-07-08 08:30:00 +02:00
Commit graph

39 commits

Author SHA1 Message Date
Jan Lukas Gernert
2189f527d7 fix strip unlikely table-child & add 2 new tests 2023-03-26 11:54:13 +02:00
Jan Lukas Gernert
873e081c33 clean js-links & add new test 2023-03-26 11:31:59 +02:00
Jan Lukas Gernert
b541cd73f8 whitespace fixes 2023-03-24 08:02:08 +01:00
Jan Lukas Gernert
f7fa696921 fmt & clippy 2023-03-19 23:37:42 +01:00
Jan Lukas Gernert
280c516cbe make cleaning more obvious 2023-03-19 23:09:06 +01:00
Jan Lukas Gernert
11e08ae505 move conditional cleaning right after parsing & port attribute cleaning from readability 2023-03-19 22:43:26 +01:00
Jan Lukas Gernert
7737311a92 small fix 2023-03-19 13:31:10 +01:00
Jan Lukas Gernert
848291e4f3 small fixes 2023-03-12 23:13:28 +01:00
Jan Lukas Gernert
4ca4b73823 fmt 2023-03-12 19:36:34 +01:00
Jan Lukas Gernert
603b373e0d lots of fixes 2023-03-12 19:36:10 +01:00
Jan Lukas Gernert
779afd6245 fix cleaning of empty p/div-tags 2023-03-12 12:20:50 +01:00
Jan Lukas Gernert
1e71aa2bfb remove duplicate code 2023-03-10 22:17:53 +01:00
Jan Lukas Gernert
3ece2522bb add clean links test 2023-03-09 21:24:29 +01:00
Jan Lukas Gernert
c5c6b788c8 add citilab test & fix noscript unwrapping 2023-03-09 20:10:03 +01:00
Jan Lukas Gernert
f5b7ff198a fix post processing 2023-03-04 23:40:01 +01:00
Jan Lukas Gernert
7c9e527827 strip iframes but keep videos 2023-03-01 01:37:37 +01:00
Jan Lukas Gernert
3a92585f4d use url.join() instead of custom code 2023-03-01 00:42:03 +01:00
Jan Lukas Gernert
aea57d0cf3 fix has_single_tag_inside_element & update tests 2023-02-28 03:59:48 +01:00
Jan Lukas Gernert
31a8033844 fixes, more sanitation & 1 more failing test 2023-02-28 01:50:13 +01:00
Jan Lukas Gernert
56c08c501a fmt 2023-02-27 01:01:16 +01:00
Jan Lukas Gernert
df999cd9fc more cleanups & more tests 2023-02-27 01:00:56 +01:00
Jan Lukas Gernert
0834c4d72a fixes 2023-02-26 02:22:53 +01:00
Jan Lukas Gernert
63035ca028 fmt 2023-02-25 00:43:42 +01:00
Jan Lukas Gernert
e3246af28b refactor & more testing 2023-02-25 00:42:26 +01:00
Jan Lukas Gernert
7ae98904d4 unwrap noscript images 2023-02-23 01:53:42 +01:00
Jan Lukas Gernert
98c06e11f4 improve title extraction 2023-02-20 02:32:58 +01:00
Jan Lukas Gernert
cce912c354 first content extraction kinda working 2023-02-20 00:29:44 +01:00
Jan Lukas Gernert
71a8816747 somewhat complete readability algorithm 2023-02-17 14:16:01 +01:00
Jan Lukas Gernert
979358fd35 more 2023-01-01 21:35:46 +01:00
Jan Lukas Gernert
2750ad648d start implementing readability 2023-01-01 14:51:34 +01:00
Jan Lukas Gernert
c08f5afa5d move stuff around 2022-12-13 08:54:57 +01:00
Jan Lukas Gernert
90383545e0 extract & parse charsets other than utf8 2022-12-11 17:38:42 +01:00
Jan Lukas Gernert
88bb88a38f clippy 2022-12-11 16:23:02 +01:00
Jan Lukas Gernert
dc1bf2ef0c fmt 2022-12-11 16:19:49 +01:00
Jan Lukas Gernert
22e98fdab7 extract thumbnail url 2022-12-11 16:18:03 +01:00
Jan Lukas Gernert
0c8aba4f4a refactor: a bit less nested code 2022-12-01 10:14:47 +01:00
Jan Lukas Gernert
27be5a3204 port failure -> thiserror 2022-12-01 09:22:08 +01:00
Jan Lukas Gernert
d906f6b7fe readability stub 2022-10-08 23:10:26 +02:00
Jan Lukas Gernert
273ddd832c start refactor & fingerprints 2022-10-08 23:09:00 +02:00