Toots for adbar@fediscience.org account

Written by Adrien Barbaresi on 2025-02-03 at 11:03

"You might have journeyed to one of the capybara hot spots around the world, including Japan, where capybaras can be found in cafés and onsen, or hot-spring baths, which, unlike cafés, are a perfect place for the creature to enjoy immersion in water."

Gary Shteyngart in the New Yorker:

https://www.newyorker.com/magazine/2025/02/03/how-the-capybara-won-my-heart-and-almost-everyone-elses

Written by Adrien Barbaresi on 2025-01-31 at 14:10

Coming soon to a country near you:

"In addition to vetting its portfolio of 40,000 individual grants, NSF has taken the preemptive step of ending an unknown number of programs supporting work that might be seen as violating the new directives."

https://www.science.org/content/article/exclusive-nsf-starts-vetting-all-grants-comply-trump-s-orders

Shared by Adrien Barbaresi on 2025-01-26 at 11:09 (original by Kurt Schwarz' Berlin-Fotos)

Shared by Adrien Barbaresi on 2025-01-25 at 10:50 (original by Design Thinking! Comic)

Shared by Adrien Barbaresi on 2025-01-12 at 14:51 (original by Elizabeth Ayer)

Shared by Adrien Barbaresi on 2025-01-05 at 19:38 (original by Georg Fischer 🇪🇺🇺🇦)

Shared by Adrien Barbaresi on 2024-12-28 at 12:24 (original by Denny Vrandečić)

Shared by Adrien Barbaresi on 2024-12-15 at 14:54 (original by Jess Rose)

Shared by Adrien Barbaresi on 2024-12-13 at 19:33 (original by Pr. Logos)

Written by Adrien Barbaresi on 2024-12-08 at 10:15

"The university managers received as a delegation at the ministry did not understand what Mr. Hetzel was trying to tell them when he was frank enough to say that, for Bercy, they were 'punks with dogs' (sic)."

Also a good summary of the situation:

https://blogs.mediapart.fr/pascal-maillard/blog/061224/l-universite-est-sur-la-paille

Written by Adrien Barbaresi on 2024-12-03 at 15:38

Trafilatura v2 is out! 🚀

The Python package and command-line tool gathers text on the Web and turns HTML into structured data. Significant changes to improve performance and stability include:

💻 Type hinting is now used throughout the code

🛠️ More robust code and improved error handling

📄 Enhanced HTML and HTML-to-Markdown output

⚠️ Python 3.6 and 3.7 are now deprecated

For more information on breaking changes and improvements, see the changelog:

https://github.com/adbar/trafilatura/blob/master/HISTORY.md

#opensource #OpenScience

Shared by Adrien Barbaresi on 2024-12-01 at 20:23 (original by Benjamin Paassen)

Shared by Adrien Barbaresi on 2024-11-24 at 20:06 (original by Arthur Charpentier ⏚)

Written by Adrien Barbaresi on 2024-11-19 at 13:01

Do you know what is behind the better performance of the RefinedWeb and FineWeb datasets in LLM training?

By reducing noise and increasing data efficiency, Trafilatura provides a cleaner text base for training and fine-tuning models.

The most recent study by Penedo et al. @huggingface shows that custom text extraction with Trafilatura outperforms the default WET data for training on web archives.

· Paper 🔖 : https://arxiv.org/html/2406.17557v2

· Software 💻 : https://github.com/huggingface/datatrove

#LLM #OpenScience #NLProc

Written by Adrien Barbaresi on 2024-11-15 at 14:28

Tip of the day: I really recommend getting a local library card, not least because the digital portals are now full of hidden gems 📚

Here is a nice example: a browser extension that removes paywalls on German news sites using your library account:

https://stefanw.github.io/bibbot/

#VÖBB #Voebb #Bibliothek

Written by Adrien Barbaresi on 2024-11-14 at 11:26

Interesting development in the field of topic modeling, fresh off the press at the EMNLP conference:

"We introduce a novel topic modeling approach, Contextual-Top2Vec, which uses document contextual token embeddings, it creates hierarchical topics, finds topic spans within documents and labels topics with phrases rather than just words."

https://aclanthology.org/2024.findings-emnlp.790/

#NLProc #DigitalHumanities

Shared by Adrien Barbaresi on 2024-11-13 at 15:34 (original by SWIB)

Shared by Adrien Barbaresi on 2024-11-12 at 09:33 (original by theHigherGeometer)

Shared by Adrien Barbaresi on 2024-11-01 at 16:45 (original by Eric Holscher)

Written by Adrien Barbaresi on 2024-10-29 at 15:50

Awesome Digital Humanities:

Comprehensive list of software for humanities scholars using quantitative or computational methods

https://github.com/dh-tech/awesome-digital-humanities

#OpenScience #FOSS #DigitalHumanities
