Does anyone know of an #OpenAccess full-text #PDF #search engine/tool using which I can search for relevant PDFs from a self-hosted #database?
Context: we have a curated database of #research articles but so far our search capability has been limited to tagged keywords or title and abstract field search only. We'd like to be able to search the entire PDF.
Side note: I know that PDFs are not a great way to store scientific information. I'd prefer not to use a proprietary #LLM if possible.
#LexicalSearch #SemanticSearch #AskAcademia #academia #science #sciences #ScienceMastodon #AskFedi #OpenScience
=> More informations about this toot | More toots from manisha@neuromatch.social
@manisha Ollama with RAG on Open WebUI?
https://docs.openwebui.com/features/rag/
=> More informations about this toot | More toots from koen_hufkens@mastodon.social
@koen_hufkens I need to look into each of those terms 😅 but at first glance, this looks promising, thanks!! 🙂
=> More informations about this toot | More toots from manisha@neuromatch.social
@manisha Ollama is a tool to run (local) LLMs. Open WebUI is a browser interface on top of it, for ease of use: you can set up multiple accounts (when the service is public) or use it as a single user.
=> More informations about this toot | More toots from koen_hufkens@mastodon.social
@koen_hufkens thank you, this was helpful! Others have suggested Zotero's full-text search -- I'm looking into that as well
=> More informations about this toot | More toots from manisha@neuromatch.social
@manisha Sure, this works too. Don't forget easy command line tools, too.
https://pdfgrep.org/
=> More informations about this toot | More toots from koen_hufkens@mastodon.social
@koen_hufkens ooh cool, I knew of good ol grep but not pdfgrep!
=> More informations about this toot | More toots from manisha@neuromatch.social
@manisha You can do it with standard tools as well I think, given some finagling.
=> More informations about this toot | More toots from koen_hufkens@mastodon.social
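The "standard tools" route is not spelled out in the thread. One possible reading, sketched here under that assumption, is to extract text with pdftotext (from poppler-utils) and search it like any other text. The function names are made up for illustration, and the sketch relies on pdftotext separating pages with a form feed character.

```python
import subprocess
import sys


def extract_text(pdf_path):
    """Return the text of a PDF via pdftotext (assumes poppler-utils is installed)."""
    result = subprocess.run(
        ["pdftotext", "-layout", str(pdf_path), "-"],  # "-" writes to stdout
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def find_matches(text, query):
    """Case-insensitive substring search; returns (page_number, line) pairs."""
    needle = query.casefold()
    hits = []
    # pdftotext puts a form feed between pages, so splitting on it recovers page numbers.
    for page_no, page in enumerate(text.split("\f"), start=1):
        for line in page.splitlines():
            if needle in line.casefold():
                hits.append((page_no, line.strip()))
    return hits


if __name__ == "__main__":
    # Usage: python search_pdf.py "some phrase" paper1.pdf paper2.pdf ...
    query, *paths = sys.argv[1:]
    for path in paths:
        for page_no, line in find_matches(extract_text(path), query):
            print(f"{path}:{page_no}: {line}")
```

This re-extracts every PDF on every search, so it is only reasonable for small collections; caching the extracted text would be the obvious next step.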
@manisha Even though I don't like giving it publicity, Windows 11 search will do this: it will find any word (or consecutive words) in PDFs in a folder or its subfolders. Or import your PDFs into Zotero.
=> More informations about this toot | More toots from olibrendel@scicomm.xyz
@olibrendel we have a Linux server. Going to check whether it is possible to make a Zotero library completely public, or at least to show search results on a front-end while Zotero runs in the backend. In which case, #Zotero may be a good option! Thank you!
=> More informations about this toot | More toots from manisha@neuromatch.social
@manisha Not sure if that really fits what you want but #Zotero does full-text search of PDFs: https://www.zotero.org/support/searching
=> More informations about this toot | More toots from elduvelle@neuromatch.social
@elduvelle thank you!! It looks like a good option. Someone else also suggested it. I've used it as a reference manager but never dived into its full-text indexing feature. Do you happen to know if Zotero libraries can be made public?
=> More informations about this toot | More toots from manisha@neuromatch.social
@manisha I haven't done it myself but it looks like it's just a matter of setting the library's visibility to "public": https://forums.zotero.org/discussion/88809/keep-public-library
=> More informations about this toot | More toots from elduvelle@neuromatch.social
@elduvelle
@manisha
Ya ya zotero would def be how I'd do it. Do you need it to be public so that it can be publicly full text searchable from a website or smth? Slightly different problem than being able to do it on your own zotero client but I'd still use zotero as the basis of the collection
=> More informations about this toot | More toots from jonny@neuromatch.social
@elduvelle thank you so much for that link!
@jonny yeah, need it to be publicly searchable which I think is doable...
=> More informations about this toot | More toots from manisha@neuromatch.social
@manisha I use Recoll for searching my scanned-and-OCR'd correspondence: https://www.recoll.org/
Its primary use case is as a local search, but it also has a web interface for the search.
=> More informations about this toot | More toots from farhaven@mastodon.cloud
@manisha pdfgrep works well from the command line. Also, in Ubuntu the Nautilus file manager's search reads PDFs too, but it's not fast.
=> More informations about this toot | More toots from albertcardona@mathstodon.xyz
@albertcardona thanks, someone else also suggested pdfgrep. Our end users would primarily be students/researchers who may not know how to use the command line, and we need the database to be publicly searchable. Most likely going to try Zotero (recommended by a few others)
=> More informations about this toot | More toots from manisha@neuromatch.social
@manisha Should be "easy enough" (famous last words) to put up a webpage, visible within the university only, that is a thin front-end to pdfgrep.
=> More informations about this toot | More toots from albertcardona@mathstodon.xyz
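A rough idea of what such a thin front-end could look like, as a sketch only: a small Python standard-library web server that shells out to pdfgrep and lists the hits. PDF_ROOT, the port, and all function names are assumptions for illustration. Restricting access to the university network is not handled here and would be left to a reverse proxy or firewall.

```python
import html
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

PDF_ROOT = "/srv/papers"  # hypothetical directory holding the PDFs


def build_command(query, root=PDF_ROOT):
    # -r recurse, -i ignore case, -n page numbers, -H file names,
    # -F treat the query as a literal string; "--" ends option parsing.
    return ["pdfgrep", "-r", "-i", "-n", "-H", "-F", "--", query, root]


def parse_hits(output):
    """Turn 'path:page:text' lines into tuples (breaks on paths containing ':')."""
    hits = []
    for line in output.splitlines():
        parts = line.split(":", 2)
        if len(parts) == 3 and parts[1].isdigit():
            hits.append((parts[0], int(parts[1]), parts[2].strip()))
    return hits


def render_page(query, hits):
    rows = "".join(
        f"<li>{html.escape(path)} (p. {page}): {html.escape(text)}</li>"
        for path, page, text in hits
    )
    return (
        "<!doctype html><meta charset='utf-8'><title>PDF search</title>"
        f"<form><input name='q' value='{html.escape(query, quote=True)}'>"
        "<button>Search</button></form>"
        f"<ul>{rows}</ul>"
    )


class SearchHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        query = parse_qs(urlparse(self.path).query).get("q", [""])[0].strip()
        hits = []
        if query:
            # The query is passed as a list element, never through a shell.
            result = subprocess.run(
                build_command(query), capture_output=True, text=True, timeout=60
            )
            hits = parse_hits(result.stdout)
        body = render_page(query, hits).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8000), SearchHandler).serve_forever()
```

pdfgrep rescans every file on each request, so this fits a small collection; for a larger one, an indexing tool such as Recoll or Zotero's full-text index (both mentioned in this thread) is probably the better base.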
@albertcardona "it won't be much work" is an inside joke at @neuromatch 😅
=> More informations about this toot | More toots from manisha@neuromatch.social
@manisha @neuromatch Ah, Konrad, what an easy underestimate to make.
=> More informations about this toot | More toots from albertcardona@mathstodon.xyz