Mirror of https://gitlab.com/news-flash/article_scraper.git (synced 2025-07-07 16:15:32 +02:00)
fixes, more sanitation & 1 more failing test
parent 56c08c501a
commit 31a8033844
8 changed files with 1993 additions and 162 deletions
107
expected.html
|
@ -1,107 +0,0 @@
|
||||||
<article><DIV id="readability-page-1" class="page"><div>
|
|
||||||
<p>
|
|
||||||
I don't use Facebook. I'm not technophobic — I'm a geek. I've been using email since the early 1990s, I have accounts on hundreds of services around the net, and I do software development and internet protocol design both for work and for fun. I believe that a globe-spanning communications network like the internet can be a positive social force, and I publish much of my own work on the open web.
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
But Facebook and other massive web companies represent a strong push toward unaccountable centralized social control, which I think makes our society more unequal and more unjust. The Cambridge Analytica scandal is one instance of this long-running problem with what I call the "surveillance economy." I don't want to submit to these power structures, and I don’t want my presence on such platforms to serve as bait that lures other people into the digital panopticon.
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
But while I've never "opted in" to Facebook or any of the other big social networks, Facebook still has a detailed profile that can be used to target me. I've never consented to having Facebook collect my data, which can be used to draw very detailed inferences about my life, my habits, and my relationships. As we aim to take Facebook to task for its breach of user trust, we need to think about what its capabilities imply for society overall. After all, if you do #deleteFacebook, you'll find yourself in my shoes: non-consenting, but still subject to Facebook’s globe-spanning surveillance and targeting network.
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
There are at least two major categories of information available to Facebook about non-participants like me: information from other Facebook users, and information from sites on the open web.
|
|
||||||
</p>
|
|
||||||
<h3><strong>Information from other Facebook users</strong></h3>
|
|
||||||
<p>
|
|
||||||
When you sign up for Facebook, it encourages you to upload your list of contacts so that the site can "find your friends." Facebook uses this contact information to learn about people, even if those people don't agree to participate. It also links people together based on who they know, even if the shared contact hasn't agreed to this use.
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
For example, I received an email from Facebook that lists the people who have all invited me to join Facebook: my aunt, an old co-worker, a friend from elementary school, etc. This email includes names and email addresses — including my own name — and at least one <a href="https://en.wikipedia.org/wiki/Web_bug" target="_blank">web bug</a> designed to identify me to Facebook’s web servers when I open the email. Facebook records this group of people as my contacts, even though I've never agreed to this kind of data collection.
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
Similarly, I'm sure that I'm in some photographs that someone has uploaded to Facebook — and I'm probably tagged in some of them. I've never agreed to this, but Facebook could still be keeping track.
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
So even if you decide you need to join Facebook, remember that you might be giving the company information about someone else who didn't agree to be part of its surveillance platform.
|
|
||||||
</p>
|
|
||||||
<h3><strong>Information from sites on the open Web</strong></h3>
|
|
||||||
<p>
|
|
||||||
Nearly every website that you visit that has a "Like" button is actually encouraging your browser to tell Facebook about your browsing habits. Even if you don't click on the "Like" button, displaying it requires your browser to send a request to Facebook's servers for the "Like" button itself. That request includes <a href="https://en.wikipedia.org/wiki/HTTP_referer" target="_blank">information</a> mentioning the name of the page you are visiting and any Facebook-specific <a href="https://en.wikipedia.org/wiki/HTTP_cookie" target="_blank">cookies</a> your browser might have collected. (See <a href="https://www.facebook.com/help/186325668085084" target="_blank">Facebook's own description of this process</a>.) This is called a "third-party request."
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
This makes it possible for Facebook to create a detailed picture of your browsing history — even if you've never even visited Facebook directly, let alone signed up for a Facebook account.
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
Think about most of the web pages you've visited — how many of them <em>don't</em> have a "Like" button? If you administer a website and you include a "Like" button on every page, you're helping Facebook to build profiles of your visitors, even those who have opted out of the social network. Facebook’s <a href="https://developers.facebook.com/docs/plugins/" target="_blank">“Share” buttons</a> on other sites — along with <a href="https://www.facebook.com/business/learn/facebook-ads-pixel" target="_blank">other tools</a> — work a bit differently from the “Like” button, but do effectively the same thing.
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
The profiles that Facebook builds on non-users don't necessarily include so-called "personally identifiable information" (PII) like names or email addresses. But they do include fairly unique patterns. Using <a href="https://dev.chromium.org/for-testers/providing-network-details" target="_blank">Chromium's NetLog dumping</a>, I performed a simple five-minute browsing test last week that included visits to various sites — but not Facebook. In that test, the PII-free data that was sent to Facebook included information about which news articles I was reading, my dietary preferences, and my hobbies.
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
Given the precision of this kind of mapping and targeting, "PII" isn’t necessary to reveal my identity. How many vegans examine specifications for computer hardware from the ACLU's offices while reading about Cambridge Analytica? Anyway, if Facebook combined that information with the "web bug" from the email mentioned above — which <em>is</em> clearly linked to my name and e-mail address — no guesswork would be required.
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
I'd be shocked if Facebook were not connecting those dots given the goals <a href="https://www.facebook.com/about/privacy/cookies" target="_blank">they claim for data collection</a>:
|
|
||||||
</p>
|
|
||||||
<blockquote><p>
|
|
||||||
We use the information we have to improve our advertising and measurement systems so we can show you relevant ads on and off our Services and measure the effectiveness and reach of ads and services.
|
|
||||||
</p></blockquote>
|
|
||||||
<p>
|
|
||||||
This is, in essence, exactly what Cambridge Analytica did.
|
|
||||||
</p>
|
|
||||||
<h3><strong>Consent</strong></h3>
|
|
||||||
<p>
|
|
||||||
Facebook and other tech companies often deflect accusations against excessive data collection by arguing "consent" — that they harvest and use data with the consent of the users involved.
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
But even if we accept that clicking through a "Terms of Service" that <a href="https://tosdr.org/" target="_blank">no one reads</a> can actually constitute true consent, even if we ignore the fact that these terms are overwhelmingly one-sided and non-negotiable, and even if we accept that it's meaningful for people to give consent when sharing data about other people who may have also opted in — what is the recourse for someone who has not opted into these systems at all?
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
Are those of us who have explicitly avoided agreeing to the Facebook terms of service simply fair game for an industry-wide surveillance and targeting network?
|
|
||||||
</p>
|
|
||||||
<h3><strong>Privilege</strong></h3>
|
|
||||||
<p>
|
|
||||||
I don’t mean to critique people who have created a Facebook profile or suggest they deserve whatever they get.
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
My ability to avoid Facebook comes from privilege — I have existing social contacts with whom I know how to stay in touch without using Facebook's network. My job does not require that I use Facebook. I can afford the time and expense to communicate with my electoral representatives and political allies via other channels.
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
Many people do not have these privileges and are compelled to "opt in" on Facebook's non-negotiable terms.
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
Many journalists, organizers, schools, politicians, and others who have good reasons to oppose Facebook's centralized social control feel compelled by Facebook's reach and scale to participate in their practices, even those we know to be harmful. That includes the ACLU.
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
Privacy should not be a luxury good, and while I'm happy to encourage people to opt out of these subtle and socially fraught arrangements, I do not argue that anyone who has signed up has somehow relinquished concerns about their privacy. We need to evaluate privacy concerns in their full social contexts. These are not problems that can be resolved on an individual level, because of the interpersonal nature of much of this data and the complexities of the tradeoffs involved.
|
|
||||||
</p>
|
|
||||||
<h3><strong>Technical countermeasures</strong></h3>
|
|
||||||
<p>
|
|
||||||
While they may not solve the problem, there are some technical steps people can take to limit the scope of these surveillance practices. For example, some web browsers do not send "third-party cookies" by default, or <a href="https://wiki.mozilla.org/Thirdparty" target="_blank">they scope cookies</a> so that centralized surveillance doesn't get a single view of one user. The most privacy-preserving modern browser is <a href="https://www.torproject.org/" target="_blank">the Tor Browser</a>, which everyone should have installed and available, even if it's not the browser they choose to use every day. It limits the surveillance ability of systems that you have not signed up for to track you as you move around the web.
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
You can also modify some browsers — for example, with plug-ins for <a href="https://requestpolicycontinued.github.io/" target="_blank">Firefox</a> and <a href="https://chrome.google.com/webstore/detail/umatrix/ogfcmafjalglgifnmanfmnieipoejdcf" target="_blank">Chrome</a> — so that they <a href="https://addons.mozilla.org/en-US/firefox/addon/umatrix/" target="_blank">do not send third-party</a><a href="https://requestpolicycontinued.github.io/" target="_blank">requests at all</a>. Firefox is also exploring even more <a href="https://addons.mozilla.org/en-US/firefox/addon/multi-account-containers/" target="_blank">privacy-preserving techniques</a><a href="https://addons.mozilla.org/en-US/firefox/addon/multi-account-containers/" target="_blank">.</a></p>
|
|
||||||
<p>
|
|
||||||
It can’t be denied, though, that these tools are harder to use than the web browsers most people are accustomed to, and they create barriers to some online activities. (For example, logging in to <a href="https://offcampushousing.uconn.edu/login" target="_blank">some sites</a> and accessing some <a href="https://filestore.community.support.microsoft.com/api/images/0253d8fb-b050-401a-834d-9d80a99c0b12" target="_blank">web applications</a> is impossible without third-party cookies.)
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
Some website operators take their visitors' privacy more seriously than others, by reducing the amount of third-party requests. For example, it's possible to display "share on Facebook" or "Like" buttons without sending user requests to Facebook in the first place. The ACLU's own website does this because we believe that the right to read with privacy is a fundamental protection for civic discourse.
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
If you are responsible for running a website, try browsing it with a third-party-blocking extension turned on. Think about how much information you're requiring your users to send to third parties as a condition for using your site. If you care about being a good steward of your visitors' data, you can re-design your website to reduce this kind of leakage.
|
|
||||||
</p>
|
|
||||||
<h3><strong>Opting out?</strong></h3>
|
|
||||||
<p>
|
|
||||||
Some advertisers claim that you can "opt out" of their targeted advertising, and even offer <a href="http://optout.aboutads.info/" target="_blank">a centralized place meant to help you do so</a>. However, my experience with these tools isn't a positive one. They don't appear to work all of the time. (In a recent experiment I conducted, two advertisers’ opt-out mechanisms failed to take effect.) And while advertisers claim to allow the user to opt out of "interest-based ads," it's not clear that the opt-outs govern data collection itself, rather than just the use of the collected data for displaying ads. Moreover, opting out on their terms requires the use of third-party cookies, thereby enabling another mechanism that other advertisers can then exploit.
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
It's also not clear how they function over time: How frequently do I need to take these steps? Do they expire? How often should I check back to make sure I’m still opted out? I'd much prefer an approach requiring me to opt <em>in</em> to surveillance and targeting.
|
|
||||||
</p>
|
|
||||||
<h3><strong>Fix the surveillance economy, not just Facebook</strong></h3>
|
|
||||||
<p>
|
|
||||||
These are just a few of the mechanisms that enable online tracking. Facebook is just one culprit in this online "surveillance economy," albeit a massive one — the company owns <a href="https://www.instagram.com/" target="_blank">Instagram</a>, <a href="https://atlassolutions.com/" target="_blank">Atlas</a>, <a href="https://www.whatsapp.com/" target="_blank">WhatsApp</a>, and dozens of other internet and technology companies and services. But it’s not the only player in this space. Google’s business model also relies on this kind of surveillance, and there are dozens of smaller players as well.
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
As we work to address the fallout from the current storm around Facebook and Cambridge Analytica, we can't afford to lose sight of these larger mechanisms at play. Cambridge Analytica's failures and mistakes are inherent to Facebook's business model. We need to seriously challenge the social structures that encourage people to opt in to this kind of surveillance. At the same time, we also need to protect those of us who manage to opt out.
|
|
||||||
</p>
|
|
||||||
</div></DIV></article>
|
|
1
resources/tests/readability/aktualne/expected.html
Normal file
|
@ -0,0 +1 @@
|
||||||
|
FIXME
|
1661
resources/tests/readability/aktualne/source.html
Normal file
File diff suppressed because one or more lines are too long
|
@ -40,8 +40,11 @@ pub static TITLE_CUT_END: Lazy<Regex> =
|
||||||
pub static WORD_COUNT: Lazy<Regex> = Lazy::new(|| Regex::new(r#"\s+"#).expect("WORD_COUNT regex"));
|
pub static WORD_COUNT: Lazy<Regex> = Lazy::new(|| Regex::new(r#"\s+"#).expect("WORD_COUNT regex"));
|
||||||
pub static TITLE_CUT_FRONT: Lazy<Regex> =
|
pub static TITLE_CUT_FRONT: Lazy<Regex> =
|
||||||
Lazy::new(|| Regex::new(r#"/[^-|\\/>»]*[-|\\/>»](.*)/gi"#).expect("TITLE_CUT_FRONT regex"));
|
Lazy::new(|| Regex::new(r#"/[^-|\\/>»]*[-|\\/>»](.*)/gi"#).expect("TITLE_CUT_FRONT regex"));
|
||||||
|
pub static VIDEOS: Lazy<Regex> = Lazy::new(|| {
|
||||||
|
Regex::new(r#"///(www\.)?((dailymotion|youtube|youtube-nocookie|player\.vimeo|v\.qq)\.com|(archive|upload\.wikimedia)\.org|player\.twitch\.tv)/i"#).expect("VIDEOS regex")
|
||||||
|
});
|
||||||
pub const SCORE_ATTR: &str = "content_score";
|
pub const SCORE_ATTR: &str = "content_score";
|
||||||
|
pub const DATA_TABLE_ATTR: &str = "is_data_table";
|
||||||
pub const MINIMUM_TOPCANDIDATES: usize = 3;
|
pub const MINIMUM_TOPCANDIDATES: usize = 3;
|
||||||
pub const UNLIKELY_ROLES: &[&str] = &[
|
pub const UNLIKELY_ROLES: &[&str] = &[
|
||||||
"menu",
|
"menu",
|
||||||
|
|
|
@ -594,7 +594,6 @@ impl FullTextParser {
|
||||||
|
|
||||||
let _ = Self::fix_lazy_images(context, "lazyload", "data-src");
|
let _ = Self::fix_lazy_images(context, "lazyload", "data-src");
|
||||||
let _ = Self::fix_iframe_size(context, "youtube.com");
|
let _ = Self::fix_iframe_size(context, "youtube.com");
|
||||||
let _ = Self::remove_attribute(context, None, "style");
|
|
||||||
let _ = Self::remove_attribute(context, Some("a"), "onclick");
|
let _ = Self::remove_attribute(context, Some("a"), "onclick");
|
||||||
let _ = Self::remove_attribute(context, Some("img"), "srcset");
|
let _ = Self::remove_attribute(context, Some("img"), "srcset");
|
||||||
let _ = Self::remove_attribute(context, Some("img"), "sizes");
|
let _ = Self::remove_attribute(context, Some("img"), "sizes");
|
||||||
|
@ -610,6 +609,8 @@ impl FullTextParser {
|
||||||
|
|
||||||
// strip elements that contain style="display: none;"
|
// strip elements that contain style="display: none;"
|
||||||
let _ = Util::strip_node(context, "//*[contains(@style,'display:none')]");
|
let _ = Util::strip_node(context, "//*[contains(@style,'display:none')]");
|
||||||
|
let _ = Util::strip_node(context, "//*[contains(@style,'display: none')]");
|
||||||
|
let _ = Self::remove_attribute(context, None, "style");
|
||||||
|
|
||||||
// strip all comments
|
// strip all comments
|
||||||
let _ = Util::strip_node(context, "//input");
|
let _ = Util::strip_node(context, "//input");
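Review note on the hunk above: the two XPath expressions match only the exact spellings `display:none` and `display: none`, and they have to run before the `style` attribute is removed, which is presumably why the `remove_attribute(context, None, "style")` call moved down. A minimal std-only sketch of a whitespace- and case-insensitive check follows. `hides_element` is a hypothetical helper for illustration, not a function in this crate, and it works on a plain attribute string rather than an XPath context.

```rust
/// Hypothetical helper (not part of article_scraper): returns true when an
/// inline `style` value contains a `display: none` declaration, ignoring
/// whitespace and letter case.
fn hides_element(style: &str) -> bool {
    // Drop all whitespace and lowercase, so "DISPLAY : none" and
    // "display:none" normalize to the same string.
    let normalized: String = style
        .chars()
        .filter(|c| !c.is_whitespace())
        .collect::<String>()
        .to_lowercase();

    // Compare whole declarations instead of substrings, so a property such
    // as "nodisplay:none" does not count as a match.
    normalized
        .split(';')
        .any(|decl| decl == "display:none" || decl == "display:none!important")
}
```

Called as `hides_element("color: red; DISPLAY : none;")`, the sketch reports the element as hidden, a spelling that neither XPath expression in the hunk would catch.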
|
||||||
|
@ -849,11 +850,6 @@ impl FullTextParser {
|
||||||
}
|
}
|
||||||
|
|
||||||
pub(crate) fn post_process_content(document: &Document) -> Result<(), FullTextParserError> {
|
pub(crate) fn post_process_content(document: &Document) -> Result<(), FullTextParserError> {
|
||||||
if let Some(mut root) = document.get_root_element() {
|
|
||||||
Self::clean_classes(&mut root)?;
|
|
||||||
Self::simplify_nested_elements(&mut root)?;
|
|
||||||
}
|
|
||||||
|
|
||||||
let context = Context::new(document).map_err(|()| {
|
let context = Context::new(document).map_err(|()| {
|
||||||
error!("Creating xpath context failed for article HTML");
|
error!("Creating xpath context failed for article HTML");
|
||||||
FullTextParserError::Xml
|
FullTextParserError::Xml
|
||||||
|
@ -884,6 +880,19 @@ impl FullTextParser {
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
Util::mark_data_tables(&context)?;
|
||||||
|
|
||||||
|
if let Some(mut root) = document.get_root_element() {
|
||||||
|
Util::clean_conditionally(&mut root, "form")?;
|
||||||
|
Util::clean_conditionally(&mut root, "fieldset")?;
|
||||||
|
Util::clean_conditionally(&mut root, "table")?;
|
||||||
|
Util::clean_conditionally(&mut root, "ul")?;
|
||||||
|
Util::clean_conditionally(&mut root, "div")?;
|
||||||
|
|
||||||
|
Self::clean_classes(&mut root)?;
|
||||||
|
Self::simplify_nested_elements(&mut root)?;
|
||||||
|
}
|
||||||
|
|
||||||
Ok(())
|
Ok(())
|
||||||
}
|
}
|
||||||
|
|
||||||
|
@ -904,11 +913,17 @@ impl FullTextParser {
|
||||||
})?;
|
})?;
|
||||||
}
|
}
|
||||||
|
|
||||||
node.remove_attribute("content_score").map_err(|e| {
|
node.remove_attribute(constants::SCORE_ATTR).map_err(|e| {
|
||||||
log::error!("{e}");
|
log::error!("{e}");
|
||||||
FullTextParserError::Xml
|
FullTextParserError::Xml
|
||||||
})?;
|
})?;
|
||||||
|
|
||||||
|
node.remove_attribute(constants::DATA_TABLE_ATTR)
|
||||||
|
.map_err(|e| {
|
||||||
|
log::error!("{e}");
|
||||||
|
FullTextParserError::Xml
|
||||||
|
})?;
|
||||||
|
|
||||||
node_iter = Util::next_node(&node, false);
|
node_iter = Util::next_node(&node, false);
|
||||||
}
|
}
|
||||||
Ok(())
|
Ok(())
|
||||||
|
|
|
@ -69,8 +69,18 @@ impl Readability {
|
||||||
if state.strip_unlikely {
|
if state.strip_unlikely {
|
||||||
if constants::UNLIELY_CANDIDATES.is_match(&match_string)
|
if constants::UNLIELY_CANDIDATES.is_match(&match_string)
|
||||||
&& !constants::OKAY_MAYBE_ITS_A_CANDIDATE.is_match(&match_string)
|
&& !constants::OKAY_MAYBE_ITS_A_CANDIDATE.is_match(&match_string)
|
||||||
&& !Util::has_ancestor_tag(node_ref, "table", None)
|
&& !Util::has_ancestor_tag(
|
||||||
&& !Util::has_ancestor_tag(node_ref, "code", None)
|
node_ref,
|
||||||
|
"table",
|
||||||
|
None,
|
||||||
|
None::<fn(&Node) -> bool>,
|
||||||
|
)
|
||||||
|
&& !Util::has_ancestor_tag(
|
||||||
|
node_ref,
|
||||||
|
"code",
|
||||||
|
None,
|
||||||
|
None::<fn(&Node) -> bool>,
|
||||||
|
)
|
||||||
&& tag_name != "BODY"
|
&& tag_name != "BODY"
|
||||||
&& tag_name != "A"
|
&& tag_name != "A"
|
||||||
{
|
{
|
||||||
|
@ -123,6 +133,10 @@ impl Readability {
|
||||||
log::error!("{error}");
|
log::error!("{error}");
|
||||||
FullTextParserError::Readability
|
FullTextParserError::Readability
|
||||||
})?;
|
})?;
|
||||||
|
node_ref.add_child(&mut new_node).map_err(|error| {
|
||||||
|
log::error!("{error}");
|
||||||
|
FullTextParserError::Readability
|
||||||
|
})?;
|
||||||
p.replace(new_node);
|
p.replace(new_node);
|
||||||
}
|
}
|
||||||
} else if let Some(p) = p.as_mut() {
|
} else if let Some(p) = p.as_mut() {
|
||||||
|
@ -638,40 +652,13 @@ impl Readability {
|
||||||
"H1" | "H2" | "H3" | "H4" | "H5" | "H6" | "TH" => -5,
|
"H1" | "H2" | "H3" | "H4" | "H5" | "H6" | "TH" => -5,
|
||||||
_ => 0,
|
_ => 0,
|
||||||
};
|
};
|
||||||
let score = score + Self::get_class_weight(node, state);
|
let class_weight = if state.weigh_classes {
|
||||||
|
Util::get_class_weight(node)
|
||||||
|
} else {
|
||||||
|
0
|
||||||
|
};
|
||||||
|
let score = score + class_weight;
|
||||||
Self::set_content_score(node, score as f64)?;
|
Self::set_content_score(node, score as f64)?;
|
||||||
Ok(())
|
Ok(())
|
||||||
}
|
}
|
||||||
|
|
||||||
fn get_class_weight(node: &Node, state: &State) -> i64 {
|
|
||||||
if !state.weigh_classes {
|
|
||||||
return 0;
|
|
||||||
}
|
|
||||||
|
|
||||||
let mut weight = 0;
|
|
||||||
|
|
||||||
// Look for a special classname
|
|
||||||
if let Some(class_names) = node.get_property("class") {
|
|
||||||
if constants::NEGATIVE.is_match(&class_names) {
|
|
||||||
weight -= 25;
|
|
||||||
}
|
|
||||||
|
|
||||||
if constants::POSITIVE.is_match(&class_names) {
|
|
||||||
weight += 25;
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
// Look for a special ID
|
|
||||||
if let Some(class_names) = node.get_property("id") {
|
|
||||||
if constants::NEGATIVE.is_match(&class_names) {
|
|
||||||
weight -= 25;
|
|
||||||
}
|
|
||||||
|
|
||||||
if constants::POSITIVE.is_match(&class_names) {
|
|
||||||
weight += 25;
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
weight
|
|
||||||
}
|
|
||||||
}
|
}
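Review note on the hunk above: the class-weight logic moves to `Util::get_class_weight`, and the `weigh_classes` flag is now checked at the call site. The sketch below shows that split under simplifying assumptions: the hint lists are hypothetical substring stand-ins for the crate's `POSITIVE` and `NEGATIVE` regexes, and the attributes are passed in as plain strings instead of being read from a node.

```rust
// Hypothetical stand-ins for constants::POSITIVE / constants::NEGATIVE.
const POSITIVE_HINTS: &[&str] = &["article", "content", "post"];
const NEGATIVE_HINTS: &[&str] = &["comment", "footer", "sidebar"];

/// Weight contributed by a single attribute value (`class` or `id`).
fn attr_weight(value: Option<&str>) -> i64 {
    let Some(value) = value else { return 0 };
    let value = value.to_lowercase();
    let mut weight = 0;
    if NEGATIVE_HINTS.iter().any(|hint| value.contains(*hint)) {
        weight -= 25;
    }
    if POSITIVE_HINTS.iter().any(|hint| value.contains(*hint)) {
        weight += 25;
    }
    weight
}

/// Flag-free weight, in the spirit of the new `Util::get_class_weight`.
fn class_weight(class: Option<&str>, id: Option<&str>) -> i64 {
    attr_weight(class) + attr_weight(id)
}

/// Caller-side gate, mirroring the `if state.weigh_classes { .. } else { 0 }`
/// expression added in this hunk.
fn gated_class_weight(weigh_classes: bool, class: Option<&str>, id: Option<&str>) -> i64 {
    if weigh_classes {
        class_weight(class, id)
    } else {
        0
    }
}
```

Keeping the flag out of the helper lets `clean_conditionally` reuse the same weight without needing access to the Readability `State`.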
|
||||||
|
|
|
@ -70,6 +70,11 @@ async fn aclu() {
|
||||||
run_test("aclu").await
|
run_test("aclu").await
|
||||||
}
|
}
|
||||||
|
|
||||||
|
#[tokio::test]
|
||||||
|
async fn aktualne() {
|
||||||
|
run_test("aktualne").await
|
||||||
|
}
|
||||||
|
|
||||||
#[tokio::test]
|
#[tokio::test]
|
||||||
async fn webmd_1() {
|
async fn webmd_1() {
|
||||||
run_test("webmd-1").await
|
run_test("webmd-1").await
|
||||||
|
|
290
src/util.rs
|
@ -228,10 +228,6 @@ impl Util {
|
||||||
}
|
}
|
||||||
|
|
||||||
pub fn is_probably_visible(node: &Node) -> bool {
|
pub fn is_probably_visible(node: &Node) -> bool {
|
||||||
let display_none = node
|
|
||||||
.get_attribute("display")
|
|
||||||
.map(|display| display == "none")
|
|
||||||
.unwrap_or(false);
|
|
||||||
let is_hidden = node.has_attribute("hidden");
|
let is_hidden = node.has_attribute("hidden");
|
||||||
let aria_hidden = node
|
let aria_hidden = node
|
||||||
.get_attribute("aria-hidden")
|
.get_attribute("aria-hidden")
|
||||||
|
@ -239,7 +235,7 @@ impl Util {
|
||||||
.unwrap_or(false);
|
.unwrap_or(false);
|
||||||
let has_fallback_image = node.get_class_names().contains("fallback-image");
|
let has_fallback_image = node.get_class_names().contains("fallback-image");
|
||||||
|
|
||||||
!display_none && !is_hidden && !aria_hidden || has_fallback_image
|
!is_hidden && !aria_hidden || has_fallback_image
|
||||||
}
|
}
|
||||||
|
|
||||||
pub fn is_whitespace(node: &Node) -> bool {
|
pub fn is_whitespace(node: &Node) -> bool {
|
||||||
|
@ -333,7 +329,15 @@ impl Util {
|
||||||
1.0 - distance_b
|
1.0 - distance_b
|
||||||
}
|
}
|
||||||
|
|
||||||
pub fn has_ancestor_tag(node: &Node, tag_name: &str, max_depth: Option<u64>) -> bool {
|
pub fn has_ancestor_tag<F>(
|
||||||
|
node: &Node,
|
||||||
|
tag_name: &str,
|
||||||
|
max_depth: Option<u64>,
|
||||||
|
filter: Option<F>,
|
||||||
|
) -> bool
|
||||||
|
where
|
||||||
|
F: Fn(&Node) -> bool,
|
||||||
|
{
|
||||||
let max_depth = max_depth.unwrap_or(3);
|
let max_depth = max_depth.unwrap_or(3);
|
||||||
let tag_name = tag_name.to_uppercase();
|
let tag_name = tag_name.to_uppercase();
|
||||||
let mut depth = 0;
|
let mut depth = 0;
|
||||||
|
@ -349,7 +353,12 @@ impl Util {
|
||||||
None => return false,
|
None => return false,
|
||||||
};
|
};
|
||||||
|
|
||||||
if tmp_node.get_name() == tag_name {
|
if tmp_node.get_name() == tag_name
|
||||||
|
&& filter
|
||||||
|
.as_ref()
|
||||||
|
.map(|filter| filter(&tmp_node))
|
||||||
|
.unwrap_or(true)
|
||||||
|
{
|
||||||
return true;
|
return true;
|
||||||
}
|
}
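Review note on the hunk above: `has_ancestor_tag` gains a generic `filter: Option<F>`. Callers that pass `None` have to spell out a type, as in `None::<fn(&Node) -> bool>`, because a bare `None` gives the compiler nothing from which to infer `F`. The sketch below reproduces the pattern on a tiny index-based tree. `Elem` and `sample_tree` are hypothetical stand-ins for libxml's `Node`, and the depth handling is an assumption, since the loop body is not fully visible in this diff.

```rust
/// Minimal stand-in for a DOM element; names are stored upper-case.
struct Elem {
    name: &'static str,
    parent: Option<usize>,
    data_table: bool,
}

fn has_ancestor_tag<F>(
    tree: &[Elem],
    start: usize,
    tag_name: &str,
    max_depth: Option<u64>,
    filter: Option<F>,
) -> bool
where
    F: Fn(&Elem) -> bool,
{
    let max_depth = max_depth.unwrap_or(3);
    let tag_name = tag_name.to_uppercase();
    let mut depth = 0;
    let mut current = tree[start].parent;

    while let Some(index) = current {
        if depth > max_depth {
            return false;
        }
        let ancestor = &tree[index];
        // With no filter, a matching tag name alone is sufficient.
        if ancestor.name == tag_name && filter.as_ref().map_or(true, |f| f(ancestor)) {
            return true;
        }
        current = ancestor.parent;
        depth += 1;
    }
    false
}

/// BODY > TABLE (data table) > TR > TD > DIV
fn sample_tree() -> Vec<Elem> {
    vec![
        Elem { name: "BODY", parent: None, data_table: false },
        Elem { name: "TABLE", parent: Some(0), data_table: true },
        Elem { name: "TR", parent: Some(1), data_table: false },
        Elem { name: "TD", parent: Some(2), data_table: false },
        Elem { name: "DIV", parent: Some(3), data_table: false },
    ]
}
```

In this sketch, `has_ancestor_tag(&tree, 4, "table", None, None::<fn(&Elem) -> bool>)` checks the tag only, while passing `Some(|e: &Elem| e.data_table)` also requires the matching ancestor to be a data table.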
|
||||||
|
|
||||||
|
@ -383,15 +392,15 @@ impl Util {
|
||||||
if let Some(node_type) = node.get_type() {
|
if let Some(node_type) = node.get_type() {
|
||||||
let len = node.get_child_nodes().len();
|
let len = node.get_child_nodes().len();
|
||||||
|
|
||||||
return node_type == NodeType::ElementNode
|
node_type == NodeType::ElementNode
|
||||||
&& node.get_content().trim().is_empty()
|
|
||||||
&& (len == 0
|
&& (len == 0
|
||||||
|| len
|
|| len
|
||||||
== Self::get_elements_by_tag_name(node, "br").len()
|
== Self::get_elements_by_tag_name(node, "br").len()
|
||||||
+ Self::get_elements_by_tag_name(node, "hr").len());
|
+ Self::get_elements_by_tag_name(node, "hr").len())
|
||||||
|
&& node.get_content().trim().is_empty()
|
||||||
|
} else {
|
||||||
|
false
|
||||||
}
|
}
|
||||||
|
|
||||||
false
|
|
||||||
}
|
}
|
||||||
|
|
||||||
pub fn get_elements_by_tag_name(node: &Node, tag: &str) -> Vec<Node> {
|
pub fn get_elements_by_tag_name(node: &Node, tag: &str) -> Vec<Node> {
|
||||||
|
@ -480,4 +489,261 @@ impl Util {
|
||||||
false
|
false
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// Clean an element of all tags of type "tag" if they look fishy.
|
||||||
|
// "Fishy" is an algorithm based on content length, classnames, link density, number of images & embeds, etc.
|
||||||
|
pub fn clean_conditionally(root: &mut Node, tag: &str) -> Result<(), FullTextParserError> {
|
||||||
|
// Gather counts for other typical elements embedded within.
|
||||||
|
// Traverse backwards so we can remove nodes at the same time
|
||||||
|
// without affecting the traversal.
|
||||||
|
//
|
||||||
|
// TODO: Consider taking into account original contentScore here.
|
||||||
|
let nodes = Util::get_elements_by_tag_name(root, tag);
|
||||||
|
let nodes_to_remove = nodes
|
||||||
|
.into_iter()
|
||||||
|
.filter(|node| Self::should_remove(node, tag))
|
||||||
|
.collect::<Vec<_>>();
|
||||||
|
|
||||||
|
for mut node in nodes_to_remove {
|
||||||
|
node.unlink();
|
||||||
|
}
|
||||||
|
|
||||||
|
Ok(())
|
||||||
|
}
|
||||||
|
|
||||||
|
fn should_remove(node: &Node, tag: &str) -> bool {
|
||||||
|
// First check if this node IS data table, in which case don't remove it.
|
||||||
|
let mut is_list = tag == "ul" || tag == "ol";
|
||||||
|
if !is_list {
|
||||||
|
let mut list_length = 0.0;
|
||||||
|
let ul_nodes = Self::get_elements_by_tag_name(node, "ul");
|
||||||
|
let ol_nodes = Self::get_elements_by_tag_name(node, "ol");
|
||||||
|
for list_node in ul_nodes {
|
||||||
|
list_length += Util::get_inner_text(&list_node, false).len() as f64;
|
||||||
|
}
|
||||||
|
for list_node in ol_nodes {
|
||||||
|
list_length += Util::get_inner_text(&list_node, false).len() as f64;
|
||||||
|
}
|
||||||
|
is_list = (list_length / Util::get_inner_text(node, false).len() as f64) > 0.9;
|
||||||
|
}
|
||||||
|
|
||||||
|
if tag == "table" && Self::is_data_table(node) {
|
||||||
|
return false;
|
||||||
|
}
|
||||||
|
|
||||||
|
// Next check if we're inside a data table, in which case don't remove it as well.
|
||||||
|
if Self::has_ancestor_tag(node, "table", Some(u64::MAX), Some(Self::is_data_table)) {
|
||||||
|
return false;
|
||||||
|
}
|
||||||
|
|
||||||
|
if Self::has_ancestor_tag(node, "code", None, None::<fn(&Node) -> bool>) {
|
||||||
|
return false;
|
||||||
|
}
|
||||||
|
|
||||||
|
let weight = Self::get_class_weight(node);
|
||||||
|
if weight < 0 {
|
||||||
|
return false;
|
||||||
|
}
|
||||||
|
|
||||||
|
if Self::get_char_count(node, ',') < 10 {
|
||||||
|
// If there are not very many commas, and the number of
|
||||||
|
// non-paragraph elements is more than paragraphs or other
|
||||||
|
// ominous signs, remove the element.
|
||||||
|
let p = Self::get_elements_by_tag_name(node, "p").len();
|
||||||
|
let img = Self::get_elements_by_tag_name(node, "img").len();
|
||||||
|
let li = Self::get_elements_by_tag_name(node, "li").len() as i64 - 100;
|
||||||
|
let input = Self::get_elements_by_tag_name(node, "input").len();
|
||||||
|
let heading_density =
|
||||||
|
Self::get_text_density(node, &["h1", "h2", "h3", "h4", "h5", "h6"]);
|
||||||
|
|
||||||
|
let mut embed_count = 0;
|
||||||
|
let embed_tags = ["object", "embed", "iframe"];
|
||||||
|
|
||||||
|
for embed_tag in embed_tags {
|
||||||
|
for embed_node in Self::get_elements_by_tag_name(node, embed_tag) {
|
||||||
|
// If this embed has attribute that matches video regex, don't delete it.
|
||||||
|
for (_name, value) in embed_node.get_attributes() {
|
||||||
|
if constants::VIDEOS.is_match(&value) {
|
||||||
|
return false;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// For embed with <object> tag, check inner HTML as well.
|
||||||
|
// if embed_node.get_name().to_lowercase() == "object" && constants::VIDEOS.is_match(embed_node.innerHTML) {
|
||||||
|
// return false;
|
||||||
|
// }
|
||||||
|
|
||||||
|
embed_count += 1;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
let link_density = Self::get_link_density(node);
|
||||||
|
let content_length = Self::get_inner_text(node, false).len();
|
||||||
|
|
||||||
|
(img > 1
|
||||||
|
&& (p as f64 / img as f64) < 0.5
|
||||||
|
&& !Self::has_ancestor_tag(node, "figure", None, None::<fn(&Node) -> bool>))
|
||||||
|
|| (!is_list && li > p as i64)
|
||||||
|
|| (input as f64 > f64::floor(p as f64 / 3.0))
|
||||||
|
|| (!is_list
|
||||||
|
&& heading_density < 0.9
|
||||||
|
&& content_length < 25
|
||||||
|
&& (img == 0 || img > 2)
|
||||||
|
&& !Self::has_ancestor_tag(node, "figure", None, None::<fn(&Node) -> bool>))
|
||||||
|
|| (!is_list && weight < 25 && link_density > 0.2)
|
||||||
|
|| (weight >= 25 && link_density > 0.5)
|
||||||
|
|| ((embed_count == 1 && content_length < 75) || embed_count > 1)
|
||||||
|
} else {
|
||||||
|
false
|
||||||
|
}
|
||||||
|
}
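Review note on `should_remove`: the final boolean expression is easier to test once the DOM lookups are separated from the decision. The sketch below copies the thresholds from the hunk into a function over pre-computed counts. The `Stats` struct, its `sample` constructor, and `looks_fishy` are hypothetical names used for illustration. The sketch covers only the low-comma branch and leaves out the earlier data-table, `code`, weight, and video-embed early returns.

```rust
/// Pre-computed measurements for one candidate element (hypothetical type).
#[derive(Clone)]
struct Stats {
    p: usize,
    img: usize,
    li: usize,
    input: usize,
    embed_count: usize,
    heading_density: f64,
    link_density: f64,
    content_length: usize,
    weight: i64,
    is_list: bool,
    in_figure: bool,
}

impl Stats {
    /// A text-heavy block that none of the rules should flag.
    fn sample() -> Self {
        Stats {
            p: 5,
            img: 0,
            li: 0,
            input: 0,
            embed_count: 0,
            heading_density: 0.0,
            link_density: 0.05,
            content_length: 600,
            weight: 0,
            is_list: false,
            in_figure: false,
        }
    }
}

/// Same thresholds as the expression at the end of `should_remove`.
fn looks_fishy(s: &Stats) -> bool {
    // The diff subtracts 100 from the <li> count before comparing it to <p>.
    let li = s.li as i64 - 100;

    (s.img > 1 && (s.p as f64 / s.img as f64) < 0.5 && !s.in_figure)
        || (!s.is_list && li > s.p as i64)
        || (s.input as f64 > (s.p as f64 / 3.0).floor())
        || (!s.is_list
            && s.heading_density < 0.9
            && s.content_length < 25
            && (s.img == 0 || s.img > 2)
            && !s.in_figure)
        || (!s.is_list && s.weight < 25 && s.link_density > 0.2)
        || (s.weight >= 25 && s.link_density > 0.5)
        || ((s.embed_count == 1 && s.content_length < 75) || s.embed_count > 1)
}
```

With this split, a case such as `Stats { link_density: 0.6, ..Stats::sample() }` can be checked without building a document.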
|
||||||
|
|
||||||
|
    pub fn get_class_weight(node: &Node) -> i64 {
        let mut weight = 0;

        // Look for a special classname
        if let Some(class_names) = node.get_property("class") {
            if constants::NEGATIVE.is_match(&class_names) {
                weight -= 25;
            }

            if constants::POSITIVE.is_match(&class_names) {
                weight += 25;
            }
        }

        // Look for a special ID
        if let Some(id) = node.get_property("id") {
            if constants::NEGATIVE.is_match(&id) {
                weight -= 25;
            }

            if constants::POSITIVE.is_match(&id) {
                weight += 25;
            }
        }

        weight
    }

    fn get_char_count(node: &Node, char: char) -> usize {
        Util::get_inner_text(node, false).split(char).count() - 1
    }

    fn get_text_density(node: &Node, tags: &[&str]) -> f64 {
        let text_length = Util::get_inner_text(node, false).len();
        if text_length == 0 {
            return 0.0;
        }

        let mut children_length = 0;
        for tag in tags {
            for child in Self::get_elements_by_tag_name(node, tag) {
                children_length += Util::get_inner_text(&child, false).len()
            }
        }
        children_length as f64 / text_length as f64
    }

    fn is_data_table(node: &Node) -> bool {
        node.get_attribute(constants::DATA_TABLE_ATTR)
            .and_then(|is_data_table| is_data_table.parse::<bool>().ok())
            .unwrap_or(false)
    }

    pub fn mark_data_tables(context: &Context) -> Result<(), FullTextParserError> {
        let nodes = Util::evaluate_xpath(context, "//table", false)?;
        for mut node in nodes {
            if node
                .get_attribute("role")
                .map(|role| role == "presentation")
                .unwrap_or(false)
            {
                let _ = node.set_attribute(constants::DATA_TABLE_ATTR, "false");
                continue;
            }

            if node
                .get_attribute("datatable")
                .map(|value| value == "0")
                .unwrap_or(false)
            {
                let _ = node.set_attribute(constants::DATA_TABLE_ATTR, "false");
                continue;
            }

            if node.get_attribute("summary").is_some() {
                let _ = node.set_attribute(constants::DATA_TABLE_ATTR, "true");
                continue;
            }

            if let Some(first_caption) = Self::get_elements_by_tag_name(&node, "caption").first() {
                if !first_caption.get_child_nodes().is_empty() {
                    let _ = node.set_attribute(constants::DATA_TABLE_ATTR, "true");
                    continue;
                }
            }

            // If the table has a descendant with any of these tags, consider it a data table.
            // The check is done with `any` so that `continue` skips to the next table;
            // a `continue` inside an inner `for` loop would only advance that inner loop
            // and let the checks below overwrite the attribute.
            let data_table_descendants = ["col", "colgroup", "tfoot", "thead", "th"];
            let has_data_table_descendant = data_table_descendants
                .iter()
                .any(|descendant| !Self::get_elements_by_tag_name(&node, descendant).is_empty());
            if has_data_table_descendant {
                let _ = node.set_attribute(constants::DATA_TABLE_ATTR, "true");
                continue;
            }

            // Nested tables indicate a layout table:
            if !Self::get_elements_by_tag_name(&node, "table").is_empty() {
                let _ = node.set_attribute(constants::DATA_TABLE_ATTR, "false");
                continue;
            }

            let (rows, columns) = Self::get_row_and_column_count(&node);
            if rows >= 10 || columns > 4 {
                let _ = node.set_attribute(constants::DATA_TABLE_ATTR, "true");
                continue;
            }

            // Now just go by size entirely:
            let _ = node.set_attribute(
                constants::DATA_TABLE_ATTR,
                if rows * columns > 10 { "true" } else { "false" },
            );
        }

        Ok(())
    }

    pub fn get_row_and_column_count(node: &Node) -> (usize, usize) {
        if node.get_name().to_uppercase() != "TABLE" {
            return (0, 0);
        }

        let mut rows = 0;
        let mut columns = 0;

        let trs = Self::get_elements_by_tag_name(node, "tr");
        for tr in trs {
            let row_span = tr
                .get_attribute("rowspan")
                .and_then(|span| span.parse::<usize>().ok())
                .unwrap_or(1);
            rows += row_span;

            // Now look for column-related info
            let mut columns_in_this_row = 0;
            let cells = Self::get_elements_by_tag_name(&tr, "td");
            for cell in cells {
                let colspan = cell
                    .get_attribute("colspan")
                    .and_then(|span| span.parse::<usize>().ok())
                    .unwrap_or(1);
                columns_in_this_row += colspan;
            }
            columns = usize::max(columns, columns_in_this_row);
        }

        (rows, columns)
    }
}
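
The size-based fallback at the end of `mark_data_tables` can be read in isolation from the DOM handling. The following is a minimal sketch, not part of the commit: `is_data_table_by_size` is a hypothetical helper name, and it assumes `rows` and `columns` were obtained the way `get_row_and_column_count` computes them.

```rust
/// Hypothetical helper (not in the crate): the size-only part of the
/// data-table heuristic, with the thresholds copied from `mark_data_tables`.
fn is_data_table_by_size(rows: usize, columns: usize) -> bool {
    // Tall or wide tables are treated as data tables outright.
    if rows >= 10 || columns > 4 {
        return true;
    }
    // Otherwise fall back to the total cell count.
    rows * columns > 10
}
```

Under these thresholds a 3x4 table counts as a data table (12 cells), while a 2x4 table does not (8 cells).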