"You might have journeyed to one of the capybara hot spots around the world, including Japan, where capybaras can be found in cafés and onsen, or hot-spring baths, which, unlike cafés, are a perfect place for the creature to enjoy immersion in water."
Gary Shteyngart in the New Yorker:
https://www.newyorker.com/magazine/2025/02/03/how-the-capybara-won-my-heart-and-almost-everyone-elses
=> More informations about this toot | View the thread
Coming soon to a country near you:
"In addition to vetting its portfolio of 40,000 individual grants, NSF has taken the preemptive step of ending an unknown number of programs supporting work that might be seen as violating the new directives."
https://www.science.org/content/article/exclusive-nsf-starts-vetting-all-grants-comply-trump-s-orders
=> More informations about this toot | View the thread
"Les managers d’établissements reçus en délégation au ministère n’ont pas compris ce que voulait leur faire entendre M. Hetzel lorsqu’il a eu la franchise de leur dire que pour Bercy, ils étaient « des punks à chien » (sic)."
Par ailleurs un bon résumé de la situation :
https://blogs.mediapart.fr/pascal-maillard/blog/061224/l-universite-est-sur-la-paille
=> More informations about this toot | View the thread
Trafilatura v2 is out! 🚀
The Python and command-line tool gathers text on the Web and turns HTML into structured data. Among the significant changes to improve performance and stability:
💻 Type hinting is now used throughout the code
🛠️ More robust code and improved error handling
📄 Enhanced HTML and HTML-to-Markdown output
⚠️ Python 3.6 and 3.7 are now deprecated
For more info about breaking changes and improvements, see the changelog:
https://github.com/adbar/trafilatura/blob/master/HISTORY.md
#opensource #OpenScience
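For readers who want to try the HTML-to-Markdown output mentioned above, here is a minimal sketch. It assumes Trafilatura v2 is installed (`pip install trafilatura`); the helper name `page_to_markdown` is made up for illustration and is not part of the library.

```python
from typing import Optional


def page_to_markdown(html: str) -> Optional[str]:
    """Extract the main text of an HTML document as Markdown.

    Returns None when nothing usable could be extracted.
    """
    # Imported lazily so this module still loads where trafilatura is missing.
    import trafilatura

    # extract() takes the raw HTML string; output_format selects the serialization.
    return trafilatura.extract(html, output_format="markdown")
```

To run it on a live page, download the HTML first with `trafilatura.fetch_url(url)`, which returns `None` on failure, and pass the result to the helper.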
=> More informations about this toot | View the thread
Do you know what is behind the better performance of RefinedWeb and FineWeb datasets in LLM training?
By reducing noise and increasing data efficiency, Trafilatura provides a cleaner text base for training and fine-tuning models.
The most recent study by Penedo et al. @huggingface shows that custom text extraction with Trafilatura outperforms the default WET data for training on web archives.
· Paper 🔖 : https://arxiv.org/html/2406.17557v2
· Software 💻 : https://github.com/huggingface/datatrove
#LLM #OpenScience #NLProc
=> More informations about this toot | View the thread
Tip of the day: I really recommend getting a local library card, not least because the digital portals are now full of hidden gems 📚
Here is a nice example: a browser extension that removes paywalls on German news sites using your library account:
https://stefanw.github.io/bibbot/
#VÖBB #Voebb #Bibliothek
=> More informations about this toot | View the thread
Interesting development in the field of topic modeling, fresh off the press at the EMNLP conference:
"We introduce a novel topic modeling approach, Contextual-Top2Vec, which uses document contextual token embeddings, it creates hierarchical topics, finds topic spans within documents and labels topics with phrases rather than just words."
https://aclanthology.org/2024.findings-emnlp.790/
#NLProc #DigitalHumanities
=> More informations about this toot | View the thread
Awesome Digital Humanities:
Comprehensive list of software for humanities scholars using quantitative or computational methods
https://github.com/dh-tech/awesome-digital-humanities
#OpenScience #FOSS #DigitalHumanities
=> More informations about this toot | View the thread
=> Go to adbar@fediscience.org account