Toots for adbar@fediscience.org account

Written by Adrien Barbaresi on 2025-02-03 at 11:03

"You might have journeyed to one of the capybara hot spots around the world, including Japan, where capybaras can be found in cafés and onsen, or hot-spring baths, which, unlike cafés, are a perfect place for the creature to enjoy immersion in water."

Gary Shteyngart in the New Yorker:

https://www.newyorker.com/magazine/2025/02/03/how-the-capybara-won-my-heart-and-almost-everyone-elses


Written by Adrien Barbaresi on 2025-01-31 at 14:10

Coming soon to a country near you:

"In addition to vetting its portfolio of 40,000 individual grants, NSF has taken the preemptive step of ending an unknown number of programs supporting work that might be seen as violating the new directives."

https://www.science.org/content/article/exclusive-nsf-starts-vetting-all-grants-comply-trump-s-orders


Written by Adrien Barbaresi on 2024-12-08 at 10:15

"The institution managers received as a delegation at the ministry did not understand what Mr. Hetzel meant when he was frank enough to tell them that, for Bercy, they were 'punks à chien' [crusty punks with dogs] (sic)."

Incidentally, a good summary of the situation:

https://blogs.mediapart.fr/pascal-maillard/blog/061224/l-universite-est-sur-la-paille


Written by Adrien Barbaresi on 2024-12-03 at 15:38

Trafilatura v2 is out! 🚀

The Python and command-line tool gathers text on the Web and turns HTML into structured data. Among the significant changes to improve performance and stability:

💻 Type hinting is now used throughout the code

🛠️ More robust code and improved error handling

📄 Enhanced HTML and HTML-to-Markdown output

⚠️ Python 3.6 and 3.7 are now deprecated

For more info about breaking changes and improvements, see the changelog:

https://github.com/adbar/trafilatura/blob/master/HISTORY.md

#opensource #OpenScience


Written by Adrien Barbaresi on 2024-11-19 at 13:01

Do you know what is behind the better performance of RefinedWeb and FineWeb datasets in LLM training?

By reducing noise and increasing data efficiency, Trafilatura provides a cleaner text base for training and fine-tuning models.

The most recent study by Penedo et al. @huggingface shows that custom text extraction with Trafilatura outperforms the default WET data for training on web archives.

· Paper 🔖 : https://arxiv.org/html/2406.17557v2

· Software 💻 : https://github.com/huggingface/datatrove

#LLM #OpenScience #NLProc


Written by Adrien Barbaresi on 2024-11-15 at 14:28

Tip of the day: I can really recommend getting a local library card, not least because the digital portals are now full of hidden gems 📚

Here is a nice example: a browser extension that removes paywalls on German news sites using your library account:

https://stefanw.github.io/bibbot/

#VÖBB #Voebb #Bibliothek


Written by Adrien Barbaresi on 2024-11-14 at 11:26

Interesting development in the field of topic modeling, fresh off the press at the EMNLP conference:

"We introduce a novel topic modeling approach, Contextual-Top2Vec, which uses document contextual token embeddings, it creates hierarchical topics, finds topic spans within documents and labels topics with phrases rather than just words."

https://aclanthology.org/2024.findings-emnlp.790/

#NLProc #DigitalHumanities


Written by Adrien Barbaresi on 2024-10-29 at 15:50

Awesome Digital Humanities:

Comprehensive list of software for humanities scholars using quantitative or computational methods

https://github.com/dh-tech/awesome-digital-humanities

#OpenScience #FOSS #DigitalHumanities


Written by Adrien Barbaresi on 2024-10-23 at 18:05

@HydrePrever et al. your 2¢ maybe? Interesting series related to primes?


Written by Adrien Barbaresi on 2024-10-23 at 18:02

I'm fiddling with Rust, using prime numbers as a playground.

A Fermi–Dirac prime is a prime power whose exponent is a power of two. Interestingly, this concept already confuses chatbots quite a bit.

I came up with the following solution, which should be readable enough. Can somebody tell me if there is something wrong with it?

https://oeis.org/A050376
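
The attached solution is not reproduced in this text version. As a minimal sketch of the definition above, and not the original code, the function below checks whether n can be written as p^(2^k) for a prime p and some k >= 0. The name `is_fermi_dirac_prime` is hypothetical, and the sketch assumes plain trial division is fast enough for small inputs.

```rust
/// Returns true if n = p^(2^k) for some prime p and some k >= 0.
/// Hypothetical helper for illustration; uses trial division, so it
/// is only practical for small n.
fn is_fermi_dirac_prime(n: u64) -> bool {
    if n < 2 {
        return false;
    }
    // Find the smallest divisor p > 1 of n; it is necessarily prime.
    // `p <= n / p` is used instead of `p * p <= n` to avoid overflow.
    let mut p: u64 = 2;
    while p <= n / p && n % p != 0 {
        p += 1;
    }
    if n % p != 0 {
        // No divisor up to sqrt(n): n itself is prime (exponent 1 = 2^0).
        return true;
    }
    // Divide out p completely and count the exponent.
    let mut rest = n;
    let mut exponent: u32 = 0;
    while rest % p == 0 {
        rest /= p;
        exponent += 1;
    }
    // n is a prime power only if nothing else remains, and it is a
    // Fermi-Dirac prime only if that exponent is a power of two.
    rest == 1 && exponent.is_power_of_two()
}
```

Filtering `2..=30` with this function should reproduce the first terms listed at the OEIS entry: 2, 3, 4, 5, 7, 9, 11, 13, 16, 17, 19, 23, 25, 29.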

=> View attached media


Written by Adrien Barbaresi on 2024-08-26 at 17:36

There were once four scientists who decided to take a systematic approach to honing one of working life’s great skills: the art of saying no.

With workloads heading to burnout levels of busyness, they agreed that in the space of one year, they would collectively turn down 100 work-related requests and track what happened as a result.

[…]

One of their discoveries has stuck with me since: they had no regrets about saying no.

https://archive.is/6crr8 (FT article)

