mirror of https://gitlab.com/news-flash/article_scraper.git synced 2025-07-07 08:05:31 +02:00

Commit graph

  • 9f349f8c6f need reqwest streams master article_scraper-v2.1.2 Jan Lukas Gernert 2025-05-04 18:00:59 +02:00
  • 498008f630 bump version Jan Lukas Gernert 2025-05-04 17:51:30 +02:00
  • ee53f58aeb Merge branch 'empty-body' into 'master' Jan Lukas Gernert 2025-05-04 15:50:59 +00:00
  • 7535c76e43 Merge branch 'empty-body' into 'master' Jan Lukas Gernert 2025-05-04 15:38:52 +00:00
  • 06990acbc0 fix libxml CI build empty-body Jan Lukas Gernert 2025-05-04 17:38:46 +02:00
  • f361392c04 check for empty http response and parsed documents without root element Jan Lukas Gernert 2025-05-04 17:34:33 +02:00
  • 9b374a28c7 update ftr-site-config Jan Lukas Gernert 2025-04-05 15:47:08 +02:00
  • b92500fca2 better error messages Jan Lukas Gernert 2025-04-05 15:45:41 +02:00
  • 0978335d3b [f] ignore url harvest error Jan Lukas Gernert 2025-03-28 17:18:03 +01:00
  • 9f56ed03b8 article_scraper: don't specify reqwest features Jan Lukas Gernert 2025-03-10 13:42:31 +01:00
  • 8cfcd6d9f3 clippy Jan Lukas Gernert 2025-01-17 03:05:55 +01:00
  • ca1cc47af1 update CI image Jan Lukas Gernert 2025-01-17 03:02:40 +01:00
  • 7c658a4ba8 resolver 2 article_scraper-v2.1.1 Jan Lukas Gernert 2025-01-17 02:58:41 +01:00
  • 89eb87fa85 update thiserror, ftr-site-config submodule and bump version Jan Lukas Gernert 2025-01-17 02:55:59 +01:00
  • 7fcb781c68 remove useless format! Jan Lukas Gernert 2024-11-02 11:34:47 +01:00
  • 11ee29feda thumbnail: check for attribute with name property as well (fixes #4) Jan Lukas Gernert 2024-11-02 11:30:29 +01:00
  • b3ce28632d update submodule Jan Lukas Gernert 2024-07-10 11:59:21 +02:00
  • 6932902b7b update CI image Jan Lukas Gernert 2024-07-06 23:43:23 +02:00
  • c16e11fdda init parser according to (https://gitlab.gnome.org/GNOME/libxml2/-/wikis/Thread-safety) Jan Lukas Gernert 2024-07-06 23:38:43 +02:00
  • f4e4e64b9e absolute default size for embedded youtube videos Jan Lukas Gernert 2024-06-10 22:27:10 +02:00
  • df8ebcbb35 treat iframes as valid empty tags Jan Lukas Gernert 2024-06-10 22:06:48 +02:00
  • e01c8e9d34 negative score for thumbnails with emoji alt Jan Lukas Gernert 2024-06-10 20:40:19 +02:00
  • 06018d98d4 replace emoji images Jan Lukas Gernert 2024-06-08 23:18:00 +02:00
  • 11e9261bf2 fmt Jan Lukas Gernert 2024-06-08 01:03:00 +02:00
  • 3e5654e197 fix tests Jan Lukas Gernert 2024-06-08 01:02:52 +02:00
  • 65b26370a2 update ftr config article_scraper-v2.1.0 Jan Lukas Gernert 2024-03-24 22:11:49 +01:00
  • a80b8a8274 bump versions Jan Lukas Gernert 2024-03-24 22:01:34 +01:00
  • eee7ffee05 update ftr config Jan Lukas Gernert 2024-03-24 22:00:44 +01:00
  • e4140ff093 Merge branch 'reqwest-0.12' into 'master' Jan Lukas Gernert 2024-03-24 20:54:27 +00:00
  • 689a72e6cd reqwest 0.12 reqwest-0.12 Jan Lukas Gernert 2024-03-24 17:54:30 +01:00
  • 0dcebe8b49 fmt Jan Lukas Gernert 2024-02-13 19:36:58 +01:00
  • a1ee3b22f9 clippy Jan Lukas Gernert 2024-02-13 19:35:29 +01:00
  • b13673ce3b do some null checks before unlinking nodes Jan Lukas Gernert 2024-02-13 19:06:05 +01:00
  • ed8a83708b update deps & fix some flaky tests Jan Lukas Gernert 2024-02-13 17:00:45 +01:00
  • f9812b556c update ftr config Jan Lukas Gernert 2023-08-13 16:43:38 +02:00
  • acb7d1d000 port libxml workaround from hurl Jan Lukas Gernert 2023-08-10 02:09:07 +02:00
  • 6116ba38ae no need for head Jan Lukas Gernert 2023-08-10 02:06:52 +02:00
  • 8c7cdacd26 Revert "generate full html document" Jan Lukas Gernert 2023-08-10 02:06:08 +02:00
  • 0133b20f06 generate full html document Jan Lukas Gernert 2023-08-10 00:01:31 +02:00
  • 1584649eb4 fix tests Jan Lukas Gernert 2023-08-10 00:01:10 +02:00
  • 2c76a89f9d add spiegel test Jan Lukas Gernert 2023-08-09 23:57:25 +02:00
  • 9aa6478e3c update heise test Jan Lukas Gernert 2023-08-09 23:25:07 +02:00
  • b91014c685 clean html fragments Jan Lukas Gernert 2023-08-03 10:40:29 +02:00
  • 9c857a1481 Merge branch 'make-article-public' into 'master' Jan Lukas Gernert 2023-08-02 09:04:30 +00:00
  • 3211b91bad Make Article public Leonardo Fedalto 2023-08-01 21:39:48 +02:00
  • 7a4f5c500d 400 Jan Lukas Gernert 2023-08-01 19:35:22 +02:00
  • a7e8661a09 update tests & defined youtube iframe height Jan Lukas Gernert 2023-08-01 18:37:55 +02:00
  • eb1bfdbca0 print url Jan Lukas Gernert 2023-07-28 07:09:50 +02:00
  • 40f065d9cd allow downloads without content type smaller than 5mb Jan Lukas Gernert 2023-07-28 07:03:50 +02:00
  • db007f752c dont clean video tags Jan Lukas Gernert 2023-07-27 23:18:17 +02:00
  • bf7a89fef7 don't fail because of lacking content length Jan Lukas Gernert 2023-07-23 15:39:24 +02:00
  • 345518253a even if img has src Jan Lukas Gernert 2023-07-22 20:03:32 +02:00
  • 42eb9daf65 remove lazy loading attributes Jan Lukas Gernert 2023-07-22 19:57:38 +02:00
  • d562d41b81 download single image Jan Lukas Gernert 2023-07-16 21:40:10 +02:00
  • be40383b1a impl from reqwest error Jan Lukas Gernert 2023-07-16 15:17:01 +02:00
  • d62aa8c31a clippy fixes Jan Lukas Gernert 2023-06-29 19:59:38 +02:00
  • fcec0d83ee don't move content nodes to <article> root node Jan Lukas Gernert 2023-06-29 19:47:49 +02:00
  • fdb8d9a97e small fixes Jan Lukas Gernert 2023-06-27 19:21:26 +02:00
  • 4fd41d98cc add fn to parse thumbnail from html Jan Lukas Gernert 2023-06-26 23:22:08 +02:00
  • e32015c1d0 add mercury leading image heuristics Jan Lukas Gernert 2023-06-26 22:25:57 +02:00
  • e99a4b4f23 ignore test resources article_scraper-v2.0.0 Jan Lukas Gernert 2023-06-23 21:22:37 +02:00
  • a7983e873d (cargo-release) version 2.0.0 Jan Lukas Gernert 2023-06-23 21:17:19 +02:00
  • a036d03510 use ftr-site-config fork with heise patch Jan Lukas Gernert 2023-06-23 21:15:36 +02:00
  • a31956531a fix download loop send Jan Lukas Gernert 2023-06-22 00:15:57 +02:00
  • 582834cdf1 fixes Jan Lukas Gernert 2023-06-21 23:48:09 +02:00
  • e0ccd7e0b3 split download & parsing Jan Lukas Gernert 2023-06-21 23:02:31 +02:00
  • 99c5f6220e fix golem test Jan Lukas Gernert 2023-06-21 23:04:08 +02:00
  • d8ceee1403 remove <h1/2> duplicating the title Jan Lukas Gernert 2023-04-30 09:24:00 +02:00
  • eb4b3603f5 remove artifact Jan Lukas Gernert 2023-04-29 18:21:21 +02:00
  • 16b102b313 replace multiple <br>s with single <p> Jan Lukas Gernert 2023-04-29 18:20:28 +02:00
  • c4f8bd2bc2 fix heise crash: simpler way of checking for ancestor Jan Lukas Gernert 2023-04-28 15:56:29 +02:00
  • 44d01ad1c6 Merge branch 'hardwareluxx' into 'master' Jan Lukas Gernert 2023-04-28 05:57:37 +00:00
  • 871b441776 parse image objects hardwareluxx Jan Lukas Gernert 2023-04-28 07:46:28 +02:00
  • 572fada104 parse video objects Jan Lukas Gernert 2023-04-27 19:03:07 +02:00
  • 34a737c89c overhaul non-readability tests Jan Lukas Gernert 2023-04-27 07:40:28 +02:00
  • f737ab27fd update readability test results Jan Lukas Gernert 2023-04-26 21:04:35 +02:00
  • 2a4f17d458 ignore image download test Jan Lukas Gernert 2023-04-26 20:58:25 +02:00
  • 62c0968619 remove empty nodes Jan Lukas Gernert 2023-04-26 19:54:34 +02:00
  • 5621a0ea54 fmt Jan Lukas Gernert 2023-04-26 09:12:55 +02:00
  • fbb6585596 replace first occurrence only Jan Lukas Gernert 2023-04-26 09:09:06 +02:00
  • afbc384b38 update ftr config Jan Lukas Gernert 2023-04-26 07:45:40 +02:00
  • dd958fe30f fix encoding Jan Lukas Gernert 2023-04-26 07:44:10 +02:00
  • bd413a795c fmt Jan Lukas Gernert 2023-04-25 19:12:15 +02:00
  • a0161e92d4 next page fixes Jan Lukas Gernert 2023-04-25 18:57:24 +02:00
  • 37d317ad86 simplify iterating over dir Jan Lukas Gernert 2023-04-25 08:58:15 +02:00
  • 309a60c5d0 update regex Jan Lukas Gernert 2023-04-23 20:45:45 +02:00
  • c51f0fd731 cargo.toml metadata Jan Lukas Gernert 2023-04-23 16:47:02 +02:00
  • 1695e33f9e fmt Jan Lukas Gernert 2023-04-23 16:37:06 +02:00
  • 57df2e6832 write some docs Jan Lukas Gernert 2023-04-23 16:35:00 +02:00
  • bfb31dc188 fmt Jan Lukas Gernert 2023-04-21 08:53:12 +02:00
  • baf2a8a15d rename test Jan Lukas Gernert 2023-04-21 08:47:25 +02:00
  • b4b5d802c9 only serialize root node Jan Lukas Gernert 2023-04-21 08:46:10 +02:00
  • 3f58a39fcf dump node Jan Lukas Gernert 2023-04-20 08:53:06 +02:00
  • cd3d3468a3 clean html Jan Lukas Gernert 2023-04-20 08:41:10 +02:00
  • 3096f28aae empty clean html fn Jan Lukas Gernert 2023-04-16 22:00:00 +02:00
  • f427b7c36f cli: progress bar for image download Jan Lukas Gernert 2023-04-16 21:31:11 +02:00
  • 3dd7c7d57a tmp: calc download size & print progress Jan Lukas Gernert 2023-04-16 18:10:43 +02:00
  • ccc8223db0 cleanup & fixes Jan Lukas Gernert 2023-04-14 17:50:39 +02:00
  • 57f74c635b fix clippy Jan Lukas Gernert 2023-04-14 10:32:05 +02:00
  • 3a465f2619 somehow made things much slower Jan Lukas Gernert 2023-04-14 08:49:49 +02:00