diff --git a/resources/tests/readability/nytimes-3/expected.html b/resources/tests/readability/nytimes-3/expected.html
new file mode 100644
index 0000000..4c2b5bc
--- /dev/null
+++ b/resources/tests/readability/nytimes-3/expected.html
@@ -0,0 +1,267 @@
+
+ +
+ + +

+ New York’s aging below-street infrastructure is tough to maintain, and the corrosive rock salt and “freeze-thaw” cycles of winter make it even worse. +

+
+
+

Image +

+
+ A Con Edison worker repairing underground cables this month in Flushing, Queens. The likely source of the problem was water and rock salt that had seeped underground.CreditCreditChang W. Lee/The New York Times +
+
+
+
+
+

Corey Kilgannon +

+ +
+
    +
  • + +
  • +
  • + +
  • +
+
+
+
+
+

+ [What you need to know to start the day: Get New York Today in your inbox.] +

+

+ A series of recent manhole fires in the heart of Manhattan forced the evacuation of several theaters and was a stark reminder that the subway is not the only creaky infrastructure beneath the streets of New York City. +

+

+ Underground lies a chaotic assemblage of utilities that, much like the subway, are lifelines for the city: a sprawling tangle of water mains, power cables, gas and steam lines, telecom wires and sewers. +

+

+ The city has one of the oldest and largest networks of subterranean infrastructure in the world, with some portions dating more than a century and prone to leaks and cracks. +

+

+ And winter — from the corrosive rock salt used on streets and sidewalks to “freeze-thaw” cycles that weaken pipes — makes infrastructure problems even worse. +

+
+
+

+ In the late 1800s, many of the city’s overhead utilities were buried to lessen the exposure to winter weather. “People think it’s all protected and safe, but it’s really not,” said Patrick McHugh, vice president of electrical engineering and planning for Con Edison, which maintains about 90,000 miles of underground cable in the city. +

+

+ “You have water, sewage, electricity and gas down there, and people don’t appreciate the effort that goes into keeping all that working,” he added. +

+
+
+
+

Image +

+
+ In the late 1800s, overhead utilities were buried to lessen the exposure to winter weather.CreditKirsten Luce for The New York Times +
+
+
+
+ +

+ When rock salt melts ice, and the water seeps down manholes and into electrical units, it can set off fires and explosions strong enough to pop a 300-pound manhole cover five stories into the air. +

+

+ For days after a storm, Con Edison officials say, they often deal with scores of electrical fires caused by the rock salt eating away at electrical cable insulation. The wet salt can create sparking that burns the insulation, producing both fire and gases that can combust and pop the manhole lids. +

+

+ To alleviate the threat, the officials said, the utility switched most of its manhole covers to vented ones that allow gases to escape, “so they cannot form a combustible amount,” Mr. McHugh said. +

+
+
+

+ “It also lets smoke escape, which can tip off the public to notify the authorities,” he added. +

+

+ Winter can also bring an increase in gas-line breakages. Con Edison, which maintains 4,300 miles of gas mains in and around New York City, records about 500 leaks — most of them nonemergencies — in a typical month, but many more in winter. +

+

+ Even this past January, which was unseasonably mild, there were 750 leaks, Con Edison officials said. +

+
+
+
+
+

Image

+
+
+ There are typically between 400 and 600 water main breaks each year in New York City, an official said.CreditMichael Appleton for The New York Times +
+
+
+
+ +

+ The extreme temperature swings that many researchers link to climate change are adding to the challenges of winter. +

+

+ Officials monitor weather forecasts closely for freeze-thaw cycles, when they put extra repair crews on call. +

+

+ During a polar vortex in late January, for instance, single-digit temperatures in the city quickly ballooned into the 50s. The thaw, much welcomed by many New Yorkers, worried Tasos Georgelis, deputy commissioner for water and sewer operations at the Department of Environmental Protection, which operates the city’s water system. +

+

+ “When you get a freeze and a thaw, the ground around the water mains expands and contracts, and puts external pressure on the pipes,” Mr. Georgelis said. +

+

+ Along the city’s roughly 6,500 miles of water mains, there are typically between 400 and 600 breaks a year, he added. The majority occur in winter, when the cold can make older cast-iron mains brittle. +

+
+
+

+ Environmental Protection officials said the department repaired 75 water-main breaks in January, including one in Lower Manhattan that disrupted rush-hour subway service and another on the West Side that snarled traffic and left nearby buildings without water for hours. +

+

+ The city’s 7,500 miles of sewer lines are less affected by cold weather because they are generally buried deeper than other utilities, below the frost line, agency officials said. +

+
+
+
+
+

Image

+
+
+ In 1978, a water main break caused severe flooding in Bushwick, Brooklyn.CreditFred R. Conrad/The New York Times +
+
+
+
+ +

+ Upgrading the city’s below-street utilities is a slow, painstaking process, “because you have such a fixed-in-place system,” said Rae Zimmerman, a research professor of planning and public administration at New York University. +

+

+ But there is progress. Con Edison officials said they had begun replacing the city’s nearly 1,600 miles of natural gas lines — which were made of either cast iron or unprotected steel — with plastic piping. The plastic is less susceptible to corrosion, cracks and leaks, said the officials, who added that they were swapping about 100 miles of line each year. +

+

+ The city is also replacing older, leak-prone water and sewer mains. +

+

+ Some pipes that are more than a century old hold up because they were built with a thicker grade of cast iron, according to Environmental Protection Department officials. For less healthy ones, the agency has invested more than $1 billion in the past five years — with an additional $1.4 billion budgeted over the next five years — for upgrades and replacements. New pipes will be made of a more durable, graphite-rich cast iron known as ductile iron. +

+
+
+
+
+

Image

+
+
+ Matt Cruz snowboarded through Manhattan’s Lower East Side after a snowstorm in 2016 left the streets coated in slush and rock salt.CreditHiroko Masuike/The New York Times +
+
+
+
+ +

+ Of course, winter also poses problems aboveground. Most nonemergency repair and construction work involving concrete is halted because concrete and some types of dirt, used to fill in trenches, freeze in colder temperatures, said Ian Michaels, a spokesman for the city’s Department of Design and Construction. +

+

+ Digging by hand is also a challenge in frozen ground, so many excavations that are close to pipes and other utilities are put off, Mr. Michaels said. +

+
+
+

+ And asphalt is harder to obtain because it must be kept and transported at high temperatures, he added. +

+

+ In the extreme cold, city officials will not risk shutting down water mains for construction because spillage into the street could freeze, Mr. Michaels said. He added that stopping the water flow could freeze the private water-service connections that branch off the mains, he said. +

+

+ Even the basic task of locating utilities under the street can be complicated because infrastructure has been added piecemeal over the decades. +

+
+
+
+
+

Image

+
+
+ A water main break in Manhattan in 2014. “When you get a freeze and a thaw, the ground around the water mains expands and contracts, and puts external pressure on the pipes,” said Tasos Georgelis of the city's Department of Environmental Protection.CreditÁngel Franco/The New York Times +
+
+
+
+ +

+ Street surfaces are affected by winter weather, too: Last year, the city filled 255,904 potholes. +

+

+ And should anyone forget that filling potholes, like snow removal, is a sacred staple of constituent services, transportation officials have compiled the number of potholes the city has filled — more than 1,786,300 — since Mayor Bill de Blasio took office in 2014. +

+

+ Potholes form when water and salt seep into cracks, freeze and expand, creating a larger crevice, said Joe Carbone, who works for the Transportation Department, where he is known as the pothole chief. +

+

+ Simply put, more freeze-thaw cycles result in more potholes, he said. Currently, the department has 25 crews repairing potholes. During peak pothole-repair season in early March, that number can expand to more than 60. +

+
+
+

+ Still, the department is continually resurfacing the city’s more than 6,000 miles of streets and 19,000 lane miles. Each year, agency officials said, it uses more than one million tons of asphalt to repave more than 1,300 lane-miles of street. +

+ +
+
+
+
+

Image

+
+
+ Workers learning how to fix water main breaks at a training center in Queens.CreditChang W. Lee/The New York Times +
+
+
+
+ +

+ Of the 400 city laborers who work on water mains, many learn the finer points of leak repair at a training center in Queens, where underground pipes are made to spring leaks for repair drills. +

+

+ Workers from the Department of Environmental Protection recently gathered around a muddy hole as a co-worker, Nehemiah Dejesus, scrambled to apply a stainless-steel repair clamp around a cracked segment that was spewing water. +

+

+ “Don’t get nervous,” instructed Milton Velez, the agency’s district supervisor for Queens. +

+

+ “I’m not,” Mr. Dejesus said as he secured the clamp and stopped the leak. “It’s ‘Showtime at the Apollo.’” +

+
+ + +
+
+ +
+

+ Corey Kilgannon is a Metro reporter covering news and human interest stories. His writes the Character Study column in the Sunday Metropolitan section. He was also part of the team that won the 2009 Pulitzer Prize for Breaking News. @coreykilgannon Facebook +

+
+

+ A version of this article appears in print on , on Page A22 of the New York edition with the headline: Under the City’s Streets, A Battle Against Winter. Order Reprints | Today’s Paper | Subscribe +

+ +
+ +
diff --git a/resources/tests/readability/nytimes-4/expected.html b/resources/tests/readability/nytimes-4/expected.html
new file mode 100644
index 0000000..07b1268
--- /dev/null
+++ b/resources/tests/readability/nytimes-4/expected.html
@@ -0,0 +1,224 @@
+
+ +
+ + +

+ Tax cuts, spending increases and higher interest rates could make it harder to respond to future recessions and deal with other needs. +

+
+
+

Image +

+
+ Interest payments on the federal debt could surpass the Defense Department budget in 2023.CreditCreditJeon Heon-Kyun/EPA, via Shutterstock +
+
+
+
+ +
    +
  • + +
  • +
  • + +
  • +
+
+
+
+
+

+ The federal government could soon pay more in interest on its debt than it spends on the military, Medicaid or children’s programs. +

+

+ The run-up in borrowing costs is a one-two punch brought on by the need to finance a fast-growing budget deficit, worsened by tax cuts and steadily rising interest rates that will make the debt more expensive. +

+

+ With less money coming in and more going toward interest, political leaders will find it harder to address pressing needs like fixing crumbling roads and bridges or to make emergency moves like pulling the economy out of future recessions. +

+

+ Within a decade, more than $900 billion in interest payments will be due annually, easily outpacing spending on myriad other programs. Already the fastest-growing major government expense, the cost of interest is on track to hit $390 billion next year, nearly 50 percent more than in 2017, according to the Congressional Budget Office. +

+
+
+

+ “It’s very much something to worry about,” said C. Eugene Steuerle, a fellow at the Urban Institute and a co-founder of the Urban-Brookings Tax Policy Center in Washington. “Everything else is getting squeezed.” +

+

+ Gradually rising interest rates would have made borrowing more expensive even without additional debt. But the tax cuts passed late last year have created a deeper hole, with the deficit increasing faster than expected. A budget bill approved in February that raised spending by $300 billion over two years will add to the financial pressure. +

+

+ The deficit is expected to total nearly $1 trillion next year — the first time it has been that big since 2012, when the economy was still struggling to recover from the financial crisis and interest rates were near zero. +

+
+ +
+

+ Deficit hawks have gone silent, even proposing changes that would exacerbate the deficit. House Republicans introduced legislation this month that would make the tax cuts permanent. +

+

+ “The issue has just disappeared,” said Senator Mark Warner, a Virginia Democrat. “There’s collective amnesia.” +

+
+
+

+ The combination, say economists, marks a journey into mostly uncharted financial territory. +

+

+ In the past, government borrowing expanded during recessions and waned in recoveries. That countercyclical policy has been a part of the standard Keynesian toolbox to combat downturns since the Great Depression. +

+

+ The deficit is soaring now as the economy booms, meaning the stimulus is pro-cyclical. The risk is that the government would have less room to maneuver if the economy slows. +

+
+ +
+

+ Aside from wartime or a deep downturn like the 1930s or 2008-9, “this sort of aggressive fiscal stimulus is unprecedented in U.S. history,” said Jeffrey Frankel, an economist at Harvard. +

+

+ Pouring gasoline on an already hot economy has resulted in faster growth — the economy expanded at an annualized rate of 4.2 percent in the second quarter. But Mr. Frankel warns that when the economy weakens, the government will find it more difficult to cut taxes or increase spending. +

+

+ Lawmakers might, in fact, feel compelled to cut spending as tax revenue falls, further depressing the economy. “There will eventually be another recession, and this increases the chances we will have to slam on the brakes when the car is already going too slowly,” Mr. Frankel said. +

+ +
+ +
+

+ Finding the money to pay investors who hold government debt will crimp other parts of the budget. In a decade, interest on the debt will eat up 13 percent of government spending, up from 6.6 percent in 2017. +

+

+ “By 2020, we will spend more on interest than we do on kids, including education, food stamps and aid to families,” said Marc Goldwein, senior policy director at the Committee for a Responsible Federal Budget, a research and advocacy organization. +

+
+
+

+ Interest costs already dwarf spending on many popular programs. For example, grants to students from low-income families for college total roughly $30 billion — about one-tenth of what the government will pay in interest this year. Interest payments will overtake Medicaid in 2020 and the Department of Defense budget in 2023. +

+

+ What’s more, the heavy burden of interest payments could make it harder for the government to repair aging infrastructure or take on other big new projects. +

+

+ Mr. Trump has called for spending $1 trillion on infrastructure, but Congress has not taken up that idea. +

+
+
+

+ More about the federal debt and the economy +

+ +
+
+ +

+ Until recently, ultralow interest rates, set by the Federal Reserve to support the economy, allowed lawmakers to borrow without fretting too much about the cost of that debt. +

+

+ But as the economy has strengthened, the Fed has gradually raised rates, starting in December 2015. The central bank is expected to push rates up again on Wednesday, and more increases are in store. +

+

+ “When rates went down to record lows, it allowed the government to take on more debt without paying more interest,” Mr. Goldwein said. “That party is ending.” +

+
+ +
+

+ Since the beginning of the year, the yield on the 10-year Treasury note has risen by more than half a percentage point, to 3.1 percent. The Congressional Budget Office estimates that the yield will climb to 4.2 percent in 2021. Given that the total public debt of the United States stands at nearly $16 trillion, even a small uptick in rates can cost the government billions. +

+ +
+
+

+ There’s no guarantee that these forecasts will prove accurate. If the economy weakens, rates might fall or rise only slightly, reducing interest payments. But rates could also overshoot the budget office forecast. +

+

+ Some members of Congress want to set the stage for even more red ink. Republicans in the House want to make last year’s tax cuts permanent, instead of letting some of them expire at the end of 2025. That would reduce federal revenue by an additional $631 billion over 10 years, according to the Tax Policy Center. +

+ +
+ +
+

+ Deficit hawks have warned for years that a day of reckoning is coming, exposing the United States to the kind of economic crisis that overtook profligate borrowers in the past like Greece or Argentina. +

+

+ But most experts say that isn’t likely because the dollar is the world’s reserve currency. As a result, the United States still has plenty of borrowing capacity left because the Fed can print money with fewer consequences than other central banks. +

+

+ And interest rates plunged over the last decade, even as the government turned to the market for trillions each year after the recession. That’s because Treasury bonds are still the favored port of international investors in any economic storm. +

+

+ “We exported a financial crisis a decade ago, and the world responded by sending us money,” said William G. Gale, a senior fellow at the Brookings Institution. +

+

+ But that privileged position has allowed politicians in both parties to avoid politically painful steps like cutting spending or raising taxes. +

+
+
+

+ That doesn’t mean rapidly rising interest costs and a bigger deficit won’t eventually catch up with us. +

+

+ Charles Schultze, chairman of the Council of Economic Advisers in the Carter administration, once summed up the danger of deficits with a metaphor. “It’s not so much a question of the wolf at the door, but termites in the woodwork.” +

+ +

+ Rather than simply splitting along party lines, lawmakers’ attitudes toward the deficit also depend on which party is in power. Republicans pilloried the Obama administration for proposing a large stimulus in the depths of the recession in 2009 and complained about the deficit for years. +

+

+ In 2013, Senator Mitch McConnell of Kentucky called the debt and deficit “the transcendent issue of our era.” By 2017, as Senate majority leader, he quickly shepherded the tax cut through Congress. +

+

+ Senator James Lankford, an Oklahoma Republican who warned of the deficit’s dangers in the past, nevertheless played down that threat on the Senate floor as the tax billed neared passage. +

+

+ “I understand it’s a risk, but I think it’s an appropriate risk to be able to say let’s allow Americans to keep more of their own money to invest in this economy,” he said. +

+

+ He also claimed the tax cuts would pay for themselves even as the Congressional Budget Office estimated that they would add $250 billion to the deficit on average from 2019 to 2024. +

+
+
+

+ In an interview, Mr. Lankford insisted that the jury was still out on whether the tax cuts would generate additional revenue, citing the strong economic growth recently. +

+

+ While the Republican about-face has been much more striking, Democrats have adjusted their position, too. +

+

+ Mr. Warner, the Virginia Democrat, called last year’s tax bill “the worst piece of legislation we have passed since I arrived in the Senate.” In 2009, however, when Congress passed an $800 billion stimulus bill backed by the Obama administration, he called it “a responsible mix of tax cuts and investments that will create jobs.” +

+

+ The difference, Mr. Warner said, was that the economy was near the precipice then. +

+

+ “There was virtual unanimity among economists that we needed a stimulus,” he said. “But a $2 trillion tax cut at the end of a business cycle with borrowed money won’t end well.” +

+
+
+
+ +
+

+ Nelson D. Schwartz has covered economics since 2012. Previously, he wrote about Wall Street and banking, and also served as European economic correspondent in Paris. He joined The Times in 2007 as a feature writer for the Sunday Business section. @NelsonSchwartz +

+
+

+ A version of this article appears in print on , on Page A1 of the New York edition with the headline: What May Soon Exceed Cost of U.S. Military? Interest on U.S. Debt . Order Reprints | Today’s Paper | Subscribe +

+ +
+ +
diff --git a/resources/tests/readability/nytimes-5/expected.html b/resources/tests/readability/nytimes-5/expected.html
new file mode 100644
index 0000000..e9b4cd9
--- /dev/null
+++ b/resources/tests/readability/nytimes-5/expected.html
@@ -0,0 +1,406 @@
+
+
+ +
+ +
+
+

+ Highlights +

+
    +
  1. +
    +
    + PhotoXi Jinping, el líder de China, arriesgó su prestigio personal cuando su país se postuló para organizar los Juegos de Invierno 2022; hasta ahora el país ha cumplido sus promesas. +
    + CreditKevin Frayer/Getty Images +
    +
    + +
    +
  2. +
  3. +
    +
    + Photo +
    + CreditEllen Surrey +
    +
    +
    + +

    + Ensayo invitado +

    +

    + El día que renuncié a los Beatles +

    +

    + Estaba obsesionado con el cuarteto de Liverpool: celebraba sus cumpleaños, leía todo sobre la banda y memoricé todas sus canciones. Cuando me desintoxiqué de ellos descubrí un nuevo mundo musical. +

    +

    + Por Josh Max +

    +
    +
    +
  4. +
  5. +
      +
    1. +
      +
      + Photo +
      + CreditErik Carter +
      +
      +
      +

      + On Tech +

      +

      + ¿Qué tienen que ver los videojuegos con el metaverso? +

      + +

      + Las compañías tecnológicas creen que los videojuegos son el camino para avanzar más rápido hacia un internet inmersivo. La adquisición de Activision Blizzard por Microsoft es una muestra de esta tendencia. +

      +

      + Por Shira Ovide +

      +
      +
      +
    2. +
    3. +
      +
      + Photo +
      + CreditTed + Chelsea Cavanaugh para The New York Times +
      +
      +
      +

      + Skin Deep +

      +

      + Bebidas con beneficios: ¿de verdad funcionan? +

      + +

      + Hay un nuevo mercado de bebidas que prometen beneficios como la salud intestinal, una mente relajada y piel más brillante. El problema es que ninguno de esos efectos ha sido respaldado científicamente. +

      +

      + Por Rachel Strugatz +

      +
      +
      +
    4. +
    +
  6. +
+
+ +
+
+

+ Opinión + +

+Más en Opinión › +
+
    +
  1. +
    +
    + Photo +
    + CreditDanielle Chenette +
    +
    + +
    +
  2. +
  3. +
    +
    + Photo +
    + CreditCari Vander Yacht +
    +
    + +
    +
  4. +
  5. +
    +
    + Photo   +
    + CreditBianca Bagnarelli +
    +
    + +
    +
  6. +
  7. +
      +
    1. +
      +
      + Photo +
      + CreditKim Raff for The New York Times +
      +
      + +
      +
    2. +
    +
  8. +
+
+
+
+

+ Especial + +

+Más en Especial › +
+
    +
  1. +
    +
    + Photo  +
    + CreditPhoto Illustration by Andrew B. Myers for The New York Times +
    +
    +
    +

    + El desafío Come bien +

    +

    + Actualiza tus hábitos alimenticios este año, sin necesidad de hacer dieta. +

    +

    + By Tara Parker-Pope +

    +
    +
    +
  2. +
  3. +
    +
    + Photo +
    + CreditPhoto Illustration by Andrew B. Myers for The New York Times +
    +
    + +
    +
  4. +
  5. +
    +
    + Photo +
    + CreditFotoilustraciones de Andrew B. Myers para The New York Times +
    +
    + +
    +
  6. +
  7. +
    +
    + Photo +
    + CreditFotoilustración de Andrew B. Myers para The New York Times +
    +
    + +
    +
  8. +
+
+ +
+
+

+ El brote de Coronavirus + +

+Más en El brote de Coronavirus › +
+
    +
  1. +
    +
    + Photo +
    + Credit +
    +
    + +
    +
  2. +
  3. +
    +
    + PhotoLargas filas para hacerse pruebas de coronavirus en Jonesboro, Georgia, este mes. La variante ómicron se identificó a finales de noviembre, por lo que es demasiado pronto para decir cuánto tiempo pueden persistir los síntomas. +
    + CreditDustin Chambers para The New York Times +
    +
    +
    +

    + ¿Ómicron puede causar covid prolongada? +

    +

    + Los científicos dicen que aún es muy pronto para saber si quienes se infectan con la nueva variante tendrán síntomas persistentes. Una infección leve no necesariamente es señal de que hay menos riesgo. +

    +

    + By Pam Belluck +

    +
    +
    +
  4. +
  5. +
    +
    + PhotoA 3-D plaster model of a coronavirus spike protein in the office of Dr. Barney Graham of the Vaccine Research Center of the National Institutes of Health. +
    + CreditJohnathon Kelso for The New York Times +
    +
    + +
    +
  6. +
  7. +
    +
    + PhotoUn centro de pruebas de COVID-19 realizadas con saliva en la Universidad de Minnesota, en Mineápolis +
    + CreditJenn Ackerman para The New York Times +
    +
    + +
    +
  8. +
  9. +
    +
    + Photo +
    + CreditCharlie Rubin para The New York Times +
    +
    + +
    +
  10. +
+
+
+
+

+ Estados Unidos + +

+Más en Estados Unidos › +
+
    +
  1. +
    +
    + PhotoUna presentación sobre la Operación Estrella Solitaria, en Weslaco, Texas, el año pasado +
    + CreditChristopher Lee para The New York Times +
    +
    + +
    +
  2. +
  3. +
    +
    + PhotoUna multitud se reunió en el National Mall el 6 de enero de 2021, cuando el expresidente Donald Trump cuestionó los resultados de las elecciones de 2020. +
    + CreditPete Marovich para The New York Times +
    +
    + +
    +
  4. +
  5. +
    +
    + PhotoNinguna de las más de 729 personas acusadas en relación con los disturbios del Capitolio tiene hasta ahora ninguna conexión con los antifa, según una base de datos de NPR sobre registros de detenciones. +
    + CreditJason Andrew para The New York Times +
    +
    + +
    +
  6. +
  7. +
    +
    + PhotoEl expresidente Donald Trump el año pasado. Liz Cheney, representante republicana por Wyoming, ha calificado su lenta respuesta al atentado del 6 de enero como una negligencia en el cumplimiento del deber. +
    + CreditCooper Neill para The New York Times +
    +
    + +
    +
  8. +
  9. +
    +
    + PhotoDurante una década, Holmes engañó a inversionistas inteligentes, a cientos de empleados inteligentes, a un comité de figuras ilustres y a los medios de comunicación que estaban ansiosos por ungir a una nueva estrella. +
    + CreditJenny Hueston +
    +
    +
    +

    + El auge y la caída de Elizabeth Holmes +

    +

    + El caso de la fundadora de Theranos podría cambiar el estatus de culto que tienen algunos emprendedores tecnológicos a los que no se les cuestionan sus ambiciosos proyectos ni se les exige que rindan cuentas. +

    +

    + By David Streitfeld +

    +
    +
    +
  10. +
+
+ +
+
+ +
+ + +
+
+ +
+
+ +
diff --git a/src/full_text_parser/mod.rs b/src/full_text_parser/mod.rs
index 0e9f6c5..6e7e386 100644
--- a/src/full_text_parser/mod.rs
+++ b/src/full_text_parser/mod.rs
@@ -21,6 +21,7 @@ use libxml::tree::{Document, Node, NodeType};
 use libxml::xpath::Context;
 use reqwest::header::HeaderMap;
 use reqwest::{Client, Url};
+use std::collections::HashSet;
 use std::path::Path;
 use std::str::from_utf8;

@@ -493,8 +494,7 @@ impl FullTextParser {
                 if tag_name == "IMG" || tag_name == "PICTURE" {
                     _ = node.set_attribute(copy_to, &val);
                 } else if tag_name == "FIGURE"
-                    && !Util::has_decendent_tag(&node, "img")
-                    && !Util::has_decendent_tag(&node, "picture")
+                    && !Util::has_any_descendent_tag(&node, &HashSet::from(["IMG", "PICTURE"]))
                 {
                     //if the item is a <figure> that does not contain an image or picture, create one and place it inside the figure
                     //see the nytimes-3 testcase for an example
diff --git a/src/full_text_parser/readability/tests.rs b/src/full_text_parser/readability/tests.rs
index 4f9b2a5..160184b 100644
--- a/src/full_text_parser/readability/tests.rs
+++ b/src/full_text_parser/readability/tests.rs
@@ -382,10 +382,20 @@ async fn nytimes_2() {
     run_test("nytimes-2").await
 }

-// #[tokio::test]
-// async fn nytimes_3() {
-//     run_test("nytimes-3").await
-// }
+#[tokio::test]
+async fn nytimes_3() {
+    run_test("nytimes-3").await
+}
+
+#[tokio::test]
+async fn nytimes_4() {
+    run_test("nytimes-4").await
+}
+
+#[tokio::test]
+async fn nytimes_5() {
+    run_test("nytimes-5").await
+}

 #[tokio::test]
 async fn webmd_1() {
diff --git a/src/util.rs b/src/util.rs
index b58e099..e9a16d0 100644
--- a/src/util.rs
+++ b/src/util.rs
@@ -1,3 +1,5 @@
+use std::collections::HashSet;
+
 use libxml::{
     tree::{Document, Node, NodeType},
     xpath::Context,
@@ -340,14 +342,21 @@ impl Util {
         1.0 - distance_b
     }

-    pub fn has_decendent_tag(node: &Node, tag_name: &str) -> bool {
-        let mut node_iter = Self::next_node(node, false);
-        while let Some(node) = node_iter {
-            if Self::has_tag_name(Some(&node), tag_name) {
+    pub fn has_any_descendent_tag(node: &Node, tag_names: &HashSet<&str>) -> bool {
+        let children = node.get_child_elements();
+        let is_direct_child = children
+            .iter()
+            .map(|node| node.get_name().to_uppercase())
+            .any(|name| tag_names.contains(name.as_str()));
+
+        if is_direct_child {
+            return true;
+        }
+
+        for child in children {
+            if Util::has_any_descendent_tag(&child, tag_names) {
                 return true;
             }
-
-            node_iter = Util::next_node(&node, false);
         }

         false
@@ -521,7 +530,11 @@ impl Util {

         for mut node in nodes.into_iter().rev() {
             if Util::get_class_weight(&node) < 0 {
-                log::debug!("Removing header with low class weight: {} {}", node.get_name(), node.get_attribute("class").unwrap_or_default());
+                log::debug!(
+                    "Removing header with low class weight: {} {}",
+                    node.get_name(),
+                    node.get_attribute("class").unwrap_or_default()
+                );
                 node.unlink();
             }
         }
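
The recursive check that `has_any_descendent_tag` performs in `src/util.rs` can be sketched in isolation. This is only an illustration under stated assumptions: `Element` is a hypothetical stand-in for libxml's `Node`, and the function name here is likewise made up for the example. As in the patch, child names are upper-cased before lookup, so the set is expected to hold upper-case names such as `"IMG"` and `"PICTURE"`.

```rust
use std::collections::HashSet;

/// Hypothetical minimal element tree; NOT the libxml `Node` type used in the patch.
struct Element {
    name: String,
    children: Vec<Element>,
}

impl Element {
    fn new(name: &str, children: Vec<Element>) -> Self {
        Element {
            name: name.to_string(),
            children,
        }
    }
}

/// Returns true if any descendant of `node` (not `node` itself) has a tag
/// name contained in `tag_names`. The set must hold upper-case names,
/// because each child name is upper-cased before the lookup.
fn any_descendant_has_tag(node: &Element, tag_names: &HashSet<&str>) -> bool {
    // First pass: check the direct children.
    let is_direct_child = node
        .children
        .iter()
        .map(|child| child.name.to_uppercase())
        .any(|name| tag_names.contains(name.as_str()));

    if is_direct_child {
        return true;
    }

    // Second pass: recurse into each child's subtree.
    node.children
        .iter()
        .any(|child| any_descendant_has_tag(child, tag_names))
}
```

With a set built as `HashSet::from(["IMG", "PICTURE"])`, a `figure` whose image sits inside a nested `div` is reported as containing an image, while a `figure` holding only a `figcaption` is not. The latter is the case in which the parser creates an image element and places it inside the figure.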