mirror of https://gitlab.com/news-flash/article_scraper.git synced 2025-07-08 00:19:59 +02:00
Commit graph

28 commits

Author SHA1 Message Date
Jan Lukas Gernert
873e081c33 clean js-links & add new test 2023-03-26 11:31:59 +02:00
Jan Lukas Gernert
b541cd73f8 whitespace fixes 2023-03-24 08:02:08 +01:00
Jan Lukas Gernert
280c516cbe make cleaning more obvious 2023-03-19 23:09:06 +01:00
Jan Lukas Gernert
11e08ae505 move conditional cleaning right after parsing & port attribute cleaning from readability 2023-03-19 22:43:26 +01:00
Jan Lukas Gernert
3a56439ae8 fix scoring p tags twice 2023-03-19 13:31:27 +01:00
Jan Lukas Gernert
b5d8f43ef8 stabilize buzzfeed test 2023-03-12 23:13:52 +01:00
Jan Lukas Gernert
603b373e0d lots of fixes 2023-03-12 19:36:10 +01:00
Jan Lukas Gernert
11d9657bdd fix using parent if top candidate is only child 2023-03-12 14:20:19 +01:00
Jan Lukas Gernert
58a799b096 fix negative regex & fmt 2023-03-12 11:42:37 +01:00
Jan Lukas Gernert
a356ced646 fix potential infinite loop 2023-03-10 22:17:31 +01:00
Jan Lukas Gernert
69b7b1fdc2 fix clippy 2023-03-06 01:51:26 +01:00
Jan Lukas Gernert
881c2b90ac fix alternate candidates 2023-03-06 01:36:21 +01:00
Jan Lukas Gernert
2528aa3e18 fmt 2023-03-04 17:55:17 +01:00
Jan Lukas Gernert
daa5543c4e fix turning div's into p's 2023-03-04 17:41:14 +01:00
Jan Lukas Gernert
13d147d270 fmt 2023-02-28 18:30:23 +01:00
Jan Lukas Gernert
a1c07d436f fix alternative top candidate calcs 2023-02-28 18:28:01 +01:00
Jan Lukas Gernert
aea57d0cf3 fix has_single_tag_inside_element & update tests 2023-02-28 03:59:48 +01:00
Jan Lukas Gernert
31a8033844 fixes, more sanitation & 1 more failing test 2023-02-28 01:50:13 +01:00
Jan Lukas Gernert
0834c4d72a fixes 2023-02-26 02:22:53 +01:00
Jan Lukas Gernert
63035ca028 fmt 2023-02-25 00:43:42 +01:00
Jan Lukas Gernert
e3246af28b refactor & more testing 2023-02-25 00:42:26 +01:00
Jan Lukas Gernert
7ae98904d4 unwrap noscript images 2023-02-23 01:53:42 +01:00
Jan Lukas Gernert
98c06e11f4 improve title extraction 2023-02-20 02:32:58 +01:00
Jan Lukas Gernert
cce912c354 first content extraction kinda working 2023-02-20 00:29:44 +01:00
Jan Lukas Gernert
2c76a869e7 fmt 2023-02-17 14:35:35 +01:00
Jan Lukas Gernert
71a8816747 somewhat complete readability algorithm 2023-02-17 14:16:01 +01:00
Jan Lukas Gernert
979358fd35 more 2023-01-01 21:35:46 +01:00
Jan Lukas Gernert
2750ad648d start implementing readability 2023-01-01 14:51:34 +01:00