Ancestors

Toot

Written by manisha on 2025-01-06 at 15:53

Does anyone know of an #OpenAccess full-text #PDF #search engine/tool using which I can search for relevant PDFs from a self-hosted #database?

Context: we have a curated database of #research articles but so far our search capability has been limited to tagged keywords or title and abstract field search only. We'd like to be able to search the entire PDF.

Side note: I know that PDFs are not a great way to store scientific information. I'd prefer not to use a proprietary #LLM if possible

#LexicalSearch #SemanticSearch #AskAcademia #academia #science #sciences #ScienceMastodon #AskFedi #OpenScience

=> More informations about this toot | More toots from manisha@neuromatch.social

Descendants

Written by Koen Hufkens, PhD on 2025-01-06 at 15:56

@manisha Ollama with RAG on Open WebUI?

https://docs.openwebui.com/features/rag/

=> More informations about this toot | More toots from koen_hufkens@mastodon.social

Written by manisha on 2025-01-06 at 16:02

@koen_hufkens I need to look into each of those terms 😅 but at first glance, this looks promising, thanks!! 🙂

=> More informations about this toot | More toots from manisha@neuromatch.social

Written by Koen Hufkens, PhD on 2025-01-06 at 16:09

@manisha Ollama is a tool for running LLMs locally. Open WebUI is a browser interface for all this, for ease of use: you can set up multiple accounts (when the service is public) or use it as a single user.

=> More informations about this toot | More toots from koen_hufkens@mastodon.social

Written by manisha on 2025-01-06 at 16:32

@koen_hufkens thank you, this was helpful! Others have suggested Zotero's full-text search -- I'm looking into that as well

=> More informations about this toot | More toots from manisha@neuromatch.social

Written by Koen Hufkens, PhD on 2025-01-06 at 16:34

@manisha Sure, this works too. Don't forget easy command line tools, too.

https://pdfgrep.org/

=> More informations about this toot | More toots from koen_hufkens@mastodon.social

Written by manisha on 2025-01-06 at 16:39

@koen_hufkens ooh cool, I knew of good ol grep but not pdfgrep!

=> More informations about this toot | More toots from manisha@neuromatch.social

Written by Koen Hufkens, PhD on 2025-01-06 at 16:42

@manisha You can do it with standard tools as well I think, given some finagling.

=> More informations about this toot | More toots from koen_hufkens@mastodon.social
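One plausible reading of "standard tools" is to extract the text with `pdftotext` (from poppler-utils) and then search it like any other text. The sketch below is an assumption, not something the thread specifies: the function names are hypothetical, `pdftotext` must be installed, and it relies on `pdftotext` separating pages with a form feed character.

```python
import subprocess
from pathlib import Path


def extract_text(pdf_path):
    # "pdftotext FILE -" writes the extracted text to stdout instead of a file.
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def matching_pages(text, needle):
    # pdftotext ends each page with a form feed, so splitting on "\f"
    # gives one chunk per page; page numbers are 1-based.
    needle = needle.lower()
    return [
        number
        for number, page in enumerate(text.split("\f"), start=1)
        if needle in page.lower()
    ]


def search_folder(root, needle):
    # Map each PDF under root (hypothetical layout) to the pages that match.
    results = {}
    for pdf in sorted(Path(root).rglob("*.pdf")):
        pages = matching_pages(extract_text(pdf), needle)
        if pages:
            results[str(pdf)] = pages
    return results
```

This re-extracts every PDF on every search, so for a larger collection it would make sense to extract once and keep the text files around.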

Written by Oliver Brendel on 2025-01-06 at 16:01

@manisha Even though I don't like to advertise it, Windows 11 search will do this: it will find any word (or consecutive words) in PDFs in a folder or subfolder. Or import your PDFs into Zotero.

=> More informations about this toot | More toots from olibrendel@scicomm.xyz

Written by manisha on 2025-01-06 at 16:26

@olibrendel we have a linux server. Going to check whether it is possible to make a Zotero library completely public or have a way to show search results on the front-end at least while Zotero could be in the backend. In which case, #Zotero may be a good option! thank you!

=> More informations about this toot | More toots from manisha@neuromatch.social

Written by El Duvelle on 2025-01-06 at 16:17

@manisha Not sure if that really fits what you want but #Zotero does full-text search of PDFs: https://www.zotero.org/support/searching

=> More informations about this toot | More toots from elduvelle@neuromatch.social

Written by manisha on 2025-01-06 at 16:30

@elduvelle thank you!! It looks like a good option. Someone else also suggested it. I've used it as a reference manager but never dived into its full-text indexing feature. Do you happen to know if Zotero libraries can be made public?

=> More informations about this toot | More toots from manisha@neuromatch.social

Written by El Duvelle on 2025-01-06 at 21:20

@manisha I haven't done it myself but it looks like it's just a matter of setting the library's visibility to "public": https://forums.zotero.org/discussion/88809/keep-public-library

=> More informations about this toot | More toots from elduvelle@neuromatch.social

Written by jonny (good kind) on 2025-01-07 at 00:22

@elduvelle

@manisha

Ya ya zotero would def be how I'd do it. Do you need it to be public so that it can be publicly full text searchable from a website or smth? Slightly different problem than being able to do it on your own zotero client but I'd still use zotero as the basis of the collection

=> More informations about this toot | More toots from jonny@neuromatch.social

Written by manisha on 2025-01-07 at 15:52

@elduvelle thank you so much for that link!

@jonny yeah, need it to be publicly searchable which I think is doable...

=> More informations about this toot | More toots from manisha@neuromatch.social

Written by farhaven 🇪🇺 on 2025-01-06 at 16:47

@manisha I use Recoll for searching my scanned-and-OCR'd correspondences: https://www.recoll.org/

Its primary use case is as a local search, but it also has a web interface for the search.

=> More informations about this toot | More toots from farhaven@mastodon.cloud

Written by Albert Cardona on 2025-01-06 at 20:59

@manisha pdfgrep works well, from the command line. Also, in Ubuntu the nautilus file viewer search reads PDFs too, but it's not fast.

=> More informations about this toot | More toots from albertcardona@mathstodon.xyz

Written by manisha on 2025-01-07 at 15:56

@albertcardona thanks, someone else also suggested pdfgrep. Our end users would be primarily students/researchers who may not know how to use the command line, and we need the database to be publicly searchable. Most likely going to try Zotero (recommended by a few others)

=> More informations about this toot | More toots from manisha@neuromatch.social

Written by Albert Cardona on 2025-01-07 at 15:58

@manisha Should be "easy enough" (famous last words) to put up a webpage, visible within the university only, that is a thin front-end to pdfgrep.

=> More informations about this toot | More toots from albertcardona@mathstodon.xyz
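A rough sketch of what such a thin front-end could look like, using only the Python standard library plus an installed `pdfgrep`. Everything here is an assumption for illustration: the folder `/srv/papers`, the function and class names, and the `file:page:text` output format that the parser expects from `pdfgrep -n -H`. Restricting access to the university network is not handled and would be left to the web server or firewall in front of it.

```python
import html
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

PDF_ROOT = "/srv/papers"  # hypothetical folder holding the curated PDFs


def build_command(query, root=PDF_ROOT):
    # -r: recurse into subfolders, -i: ignore case, -n: print page numbers,
    # -H: always print the file name, -F: treat the query as a literal string.
    # "--" ends option parsing so a query starting with "-" is not read as a flag.
    return ["pdfgrep", "-r", "-i", "-n", "-H", "-F", "--", query, root]


def parse_output(text):
    # Assumes lines shaped like "path:page:matched text"; anything else is
    # skipped. File names that contain ":" would be parsed incorrectly.
    hits = []
    for line in text.splitlines():
        parts = line.split(":", 2)
        if len(parts) != 3 or not parts[1].isdigit():
            continue
        hits.append((parts[0], int(parts[1]), parts[2].strip()))
    return hits


def search(query, root=PDF_ROOT):
    # The query is passed as a list element, never through a shell.
    # Like grep, pdfgrep exits non-zero when nothing matches, so no check=True.
    result = subprocess.run(
        build_command(query, root), capture_output=True, text=True, check=False
    )
    return parse_output(result.stdout)


def render_page(query, hits):
    # Escape everything that came from the user or from the PDFs.
    rows = "".join(
        f"<li>{html.escape(path)}, p. {page}: {html.escape(snippet)}</li>"
        for path, page, snippet in hits
    )
    form = (
        '<form><input name="q" value="' + html.escape(query) + '">'
        "<button>Search</button></form>"
    )
    return form + f"<ul>{rows}</ul>"


class SearchHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        query = parse_qs(urlparse(self.path).query).get("q", [""])[0].strip()
        hits = search(query) if query else []
        body = render_page(query, hits).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    # Binds to localhost only; call serve() to start the server.
    HTTPServer(("127.0.0.1", port), SearchHandler).serve_forever()
```

Calling `serve()` and opening `http://127.0.0.1:8000/?q=place+cells` would list matching files with page numbers. Each request rescans every PDF, which fits the "not fast" caveat for large collections.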

Written by manisha on 2025-01-07 at 16:02

@albertcardona "it won't be much work" is an inside joke at @neuromatch 😅

=> More informations about this toot | More toots from manisha@neuromatch.social

Written by Albert Cardona on 2025-01-07 at 17:16

@manisha @neuromatch Ah, Konrad, what an easy underestimate to make.

=> More informations about this toot | More toots from albertcardona@mathstodon.xyz

Proxy Information

Original URL: gemini://mastogem.picasoft.net/thread/113782215029998360