mirror of https://gitlab.com/news-flash/article_scraper.git synced 2025-07-07 16:15:32 +02:00
Commit graph

149 commits

Author SHA1 Message Date
Jan Lukas Gernert
6a58e45c7a add cnet test 2023-03-10 07:05:10 +01:00
Jan Lukas Gernert
a915d8fe67 update some older tests 2023-03-10 06:36:21 +01:00
Jan Lukas Gernert
7b6d22ebc8 add cnet-svg-classes test 2023-03-10 06:33:24 +01:00
Jan Lukas Gernert
3ece2522bb add clean links test 2023-03-09 21:24:29 +01:00
Jan Lukas Gernert
c5c6b788c8 add citilab test & fix noscript unwrapping 2023-03-09 20:10:03 +01:00
Jan Lukas Gernert
69b7b1fdc2 fix clippy 2023-03-06 01:51:26 +01:00
Jan Lukas Gernert
612f022879 add buzzfeed test 2023-03-06 01:36:37 +01:00
Jan Lukas Gernert
881c2b90ac fix alternate candidates 2023-03-06 01:36:21 +01:00
Jan Lukas Gernert
45b4141049 add new test 2023-03-06 00:04:23 +01:00
Jan Lukas Gernert
7060e30911 fix conditional clean of nested tags 2023-03-06 00:03:59 +01:00
Jan Lukas Gernert
9c5ffda5de add breitbart test 2023-03-04 23:40:23 +01:00
Jan Lukas Gernert
f5b7ff198a fix post processing 2023-03-04 23:40:01 +01:00
Jan Lukas Gernert
2528aa3e18 fmt 2023-03-04 17:55:17 +01:00
Jan Lukas Gernert
e2b804d00a add blogger test 2023-03-04 17:41:22 +01:00
Jan Lukas Gernert
daa5543c4e fix turning div's into p's 2023-03-04 17:41:14 +01:00
Jan Lukas Gernert
d93f5c9677 fmt 2023-03-02 01:09:48 +01:00
Jan Lukas Gernert
6964724102 add bbc test 2023-03-02 01:09:44 +01:00
Jan Lukas Gernert
df41e690ae fix conditional cleaning class weight 2023-03-02 01:08:52 +01:00
Jan Lukas Gernert
02e043f6de fix negative regex 2023-03-02 01:08:28 +01:00
Jan Lukas Gernert
aaff97c184 cleanup 2023-03-01 01:55:26 +01:00
Jan Lukas Gernert
4031750956 tag cleaning test 2023-03-01 01:37:44 +01:00
Jan Lukas Gernert
7c9e527827 strip iframes but keep vidoes 2023-03-01 01:37:37 +01:00
Jan Lukas Gernert
cea23f1638 always use fakehost url for tests 2023-03-01 00:46:35 +01:00
Jan Lukas Gernert
80de6d177c url completion test 2023-03-01 00:42:44 +01:00
Jan Lukas Gernert
3a92585f4d use url.join() instead of custom code 2023-03-01 00:42:03 +01:00
Jan Lukas Gernert
13d147d270 fmt 2023-02-28 18:30:23 +01:00
Jan Lukas Gernert
451dd61547 add two new tests 2023-02-28 18:28:55 +01:00
Jan Lukas Gernert
a1c07d436f fix alternative top candidate calcs 2023-02-28 18:28:01 +01:00
Jan Lukas Gernert
f4ccd22837 fix node ancestor depth 2023-02-28 18:27:46 +01:00
Jan Lukas Gernert
58721efa35 fix positive/negative class weight regex 2023-02-28 18:27:36 +01:00
Jan Lukas Gernert
aea57d0cf3 fix has_single_tag_inside_element & update tests 2023-02-28 03:59:48 +01:00
Jan Lukas Gernert
31a8033844 fixes, more sanitation & 1 more failing test 2023-02-28 01:50:13 +01:00
Jan Lukas Gernert
56c08c501a fmt 2023-02-27 01:01:16 +01:00
Jan Lukas Gernert
df999cd9fc more cleanups & more tests 2023-02-27 01:00:56 +01:00
Jan Lukas Gernert
0834c4d72a fixes 2023-02-26 02:22:53 +01:00
Jan Lukas Gernert
d8e3a75b01 update configs 2023-02-25 01:40:07 +01:00
Jan Lukas Gernert
2460745547 cleanup 2023-02-25 00:44:18 +01:00
Jan Lukas Gernert
63035ca028 fmt 2023-02-25 00:43:42 +01:00
Jan Lukas Gernert
e3246af28b refactor & more testing 2023-02-25 00:42:26 +01:00
Jan Lukas Gernert
7ae98904d4 unwrap noscript images 2023-02-23 01:53:42 +01:00
Jan Lukas Gernert
98c06e11f4 improve title extraction 2023-02-20 02:32:58 +01:00
Jan Lukas Gernert
cce912c354 first content extraction kinda working 2023-02-20 00:29:44 +01:00
Jan Lukas Gernert
2c76a869e7 fmt 2023-02-17 14:35:35 +01:00
Jan Lukas Gernert
71a8816747 somewhat complete readability algorithm 2023-02-17 14:16:01 +01:00
Jan Lukas Gernert
979358fd35 more 2023-01-01 21:35:46 +01:00
Jan Lukas Gernert
2750ad648d start implementing readability 2023-01-01 14:51:34 +01:00
Jan Lukas Gernert
c08f5afa5d move stuff around 2022-12-13 08:54:57 +01:00
Jan Lukas Gernert
90383545e0 extract & parse charsets other than utf8 2022-12-11 17:38:42 +01:00
Jan Lukas Gernert
97b194c9e8 clippy regex escape 2022-12-11 16:31:01 +01:00
Jan Lukas Gernert
88bb88a38f clippy 2022-12-11 16:23:02 +01:00