It has been a while since I made any updates to the search engine. This time, two minor changes:
I saw gemi.dev's post about Kennedy adding intitle: and site: modifiers. TLGS already supported the domain: filter, and intitle is a good idea, so I added it. For example, you can now search for `intitle:gemini` to find all pages with gemini in the title.
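Under the hood a modifier like this is just a token peeled off the raw query before the full-text search runs. A rough sketch of how that extraction could look (illustrative only, not the actual TLGS code; the names are made up):

```
#include <string>
#include <sstream>
#include <vector>

struct ParsedQuery
{
    std::string text;                    // terms sent to full-text search
    std::vector<std::string> titleTerms; // terms from intitle: modifiers
};

// Split the raw query on whitespace and peel off "intitle:" tokens.
ParsedQuery parseQuery(const std::string &raw)
{
    ParsedQuery result;
    std::istringstream stream(raw);
    std::string token;
    while (stream >> token)
    {
        if (token.rfind("intitle:", 0) == 0)
            result.titleTerms.push_back(token.substr(8));
        else
            result.text += (result.text.empty() ? "" : " ") + token;
    }
    return result;
}
```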
I realized that gemtext can contain HTML tags, and I wasn't escaping them when rendering pages to HTML. This is now fixed.
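The fix boils down to replacing the handful of characters HTML treats specially before a gemtext line is dropped into the page. A minimal sketch of such an escaper (assuming nothing about the real renderer):

```
#include <string>

// Replace characters that HTML treats specially so that raw gemtext
// can be embedded in a page without being interpreted as markup.
std::string escapeHtml(const std::string &in)
{
    std::string out;
    out.reserve(in.size());
    for (char c : in)
    {
        switch (c)
        {
            case '&': out += "&amp;"; break;
            case '<': out += "&lt;"; break;
            case '>': out += "&gt;"; break;
            case '"': out += "&quot;"; break;
            case '\'': out += "&#39;"; break;
            default: out += c;
        }
    }
    return out;
}
```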
TLGS wasn't caching filtered search results, since filtering is a relatively cheap post-processing step; it cached only the underlying search and re-applied the filter as requests came in. However, yesterday I saw a rogue bot hammering the search API with the same query and burning a significant amount of CPU, so I decided to cache the filtered results as well. This should reduce the impact such requests have on the server.
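Conceptually, the cache key now includes the filter along with the query, so an identical request skips both the search and the post-processing. A simplified sketch of the idea (hypothetical names, and no eviction or TTL, unlike a real cache):

```
#include <map>
#include <mutex>
#include <string>
#include <vector>

struct SearchResult { std::string url; std::string title; };

// Cache keyed on the full request (query + filter), so repeated
// identical requests skip both the search and the filtering step.
class FilteredResultCache
{
  public:
    bool get(const std::string &query, const std::string &filter,
             std::vector<SearchResult> &out)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = cache_.find(query + "\x1f" + filter);
        if (it == cache_.end())
            return false;
        out = it->second;
        return true;
    }

    void put(const std::string &query, const std::string &filter,
             std::vector<SearchResult> results)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        cache_[query + "\x1f" + filter] = std::move(results);
    }

  private:
    std::mutex mutex_;
    std::map<std::string, std::vector<SearchResult>> cache_;
};
```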
Also switched the TLS backend from OpenSSL to Botan, just to cross-check the TLS adapter. This does limit the TLS version to 1.2, which adds a round trip to each handshake and thus some latency. Please bear with me for a while; I'll switch back to OpenSSL once I'm confident that everything is stable and correct.
I'm happy to report that the TLS upgrade is stable after some patching. Enjoy the faster responses.
Sorry for disappearing for a while. I've been busy upgrading the TLS capability of TLGS. It's a massive endeavor that involved rewriting a huge chunk of the networking stack. Now it's stable enough that I'm confident opening it to the public. However, it's still experimental, so please report any issues you encounter. Some things you might notice (and more to come):
I'm reverting the CJK support for now. It's causing SQL timeouts, with no clear indication why.
In addition to twtxt, Atom, and RSS feeds, TLGS now scans for Gemsub feeds while crawling. I'm hoping to build on this capability for a large-scale feed aggregation service.
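For context, a Gemsub feed is just an ordinary gemtext page whose link lines carry an ISO date in the label, so detection can piggyback on the gemtext parsing the crawler already does. A rough sketch of such a check (illustrative only):

```
#include <regex>
#include <sstream>
#include <string>

// A gemtext page is treated as a Gemsub feed if it has link lines
// whose label starts with an ISO date, e.g.
//   => gemini://example.org/post.gmi 2022-06-01 My post title
bool looksLikeGemsubFeed(const std::string &gemtext)
{
    static const std::regex linkWithDate(
        R"(^=>\s*\S+\s+\d{4}-\d{2}-\d{2}\b)");
    std::istringstream stream(gemtext);
    std::string line;
    int dated = 0;
    while (std::getline(stream, line))
    {
        if (std::regex_search(line, linkWithDate))
            ++dated;
    }
    return dated > 0;
}
```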
Switched the FTS indexing backend from PostgreSQL's built-in full-text search to PGroonga. This enables CJK support without too much of a performance drag. I might switch to Elasticsearch or Manticore in the future, but this is it for now.
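For the curious, on the SQL side the switch mostly amounts to building a PGroonga index over the page text and querying it with PGroonga's operator instead of tsquery. A minimal sketch through libpq (the table and column names are hypothetical; the real schema and queries are more involved):

```
#include <libpq-fe.h>
#include <cstdio>

int main()
{
    PGconn *conn = PQconnectdb("dbname=tlgs");
    if (PQstatus(conn) != CONNECTION_OK)
        return 1;

    // Build a PGroonga index over the page body. PGroonga tokenizes
    // CJK text properly, unlike the default tsvector-based FTS.
    PGresult *res = PQexec(conn,
        "CREATE INDEX IF NOT EXISTS pages_body_pgroonga "
        "ON pages USING pgroonga (body)");
    PQclear(res);

    // &@~ runs a full-text query against the PGroonga index.
    res = PQexec(conn,
        "SELECT url FROM pages WHERE body &@~ 'gemini' LIMIT 10");
    if (PQresultStatus(res) == PGRES_TUPLES_OK)
    {
        for (int i = 0; i < PQntuples(res); ++i)
            std::printf("%s\n", PQgetvalue(res, i, 0));
    }
    PQclear(res);
    PQfinish(conn);
    return 0;
}
```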
Happy to report that the FD leak is solved! Also optimized how DNS clients are created, for lower CPU use by the crawler.
Prevented an edge case where the crawler tries to crawl localhost, and updated the blacklist. I seem to be closer to resolving a file descriptor issue that causes the crawler to hang.
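The check itself is trivial; the point is that links to loopback hosts do show up in the wild and should never reach the fetch queue. Something along these lines (a simplified, hypothetical version):

```
#include <string>

// Reject hosts that would make the crawler talk to itself.
// (A fuller check would also resolve the name and inspect the address.)
bool isLoopbackHost(const std::string &host)
{
    return host == "localhost" || host == "::1" ||
           host.rfind("127.", 0) == 0;
}
```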
Sorry for the few hours of downtime. Upgrading PostgreSQL went horribly wrong; I had to ditch the entire database and start with a completely clean one. Btw, search has been improved somewhat. Fixed a bug in title matching. For example, directly searching "Cosmos" will bring up the cosmos aggregator.
More updates are coming! Added emoji to links, and a new WIP API to query known security.txt files. Next I'll be working on improving search results and responsiveness.
Added a new API `/api/v1/known_feeds` to allow querying feeds known to TLGS. See the API documentation for more information.
Sorry for the ~15 hours of outage yesterday. An exception escaped and caused the entire server to go down. Measures have been taken to ensure that won't happen again.
The crawler hang is (for the most part) fixed. It was a stupid error on my part that created circular references. The automated re-indexing will be reactivated in the near future; up to this point I had to run it manually after discovering the hang.
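If the cycle was between reference-counted objects (the usual way this happens in C++), the shape of the bug is two objects holding shared_ptrs to each other, so neither ever drops to zero references and their cleanup never runs. A tiny illustration of the pattern and the usual weak_ptr fix (not the actual TLGS code):

```
#include <memory>

struct Connection;

struct Task
{
    std::shared_ptr<Connection> conn;  // Task keeps the connection alive
};

struct Connection
{
    // BUG: a shared_ptr back to the task forms a cycle, so neither
    // object is ever destroyed and the underlying resources leak.
    // std::shared_ptr<Task> owner;

    // FIX: a weak_ptr breaks the cycle; lock() it when access is needed.
    std::weak_ptr<Task> owner;
};

int main()
{
    auto task = std::make_shared<Task>();
    task->conn = std::make_shared<Connection>();
    task->conn->owner = task;
    // When `task` goes out of scope both objects are freed,
    // because the back reference is weak.
}
```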
Added LEO ring endpoints and search engine metadata pages to the crawler blacklist. This should improve search speed and quality.
Updated robots.txt parsing to be more robust. Now there should be fewer issues caused by non-standard robots.txt files. Namely, keys are now matched case-insensitively and leading whitespace is handled.
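Concretely, that means lower-casing the field name and trimming whitespace before comparing, so slightly malformed files still parse. A simplified sketch of the key/value handling (a hypothetical helper, not the exact parser):

```
#include <algorithm>
#include <cctype>
#include <string>
#include <utility>

// Parse one robots.txt line into a lower-cased key and a value,
// tolerating leading whitespace and mixed-case field names.
std::pair<std::string, std::string> parseRobotsLine(std::string line)
{
    auto trim = [](std::string &s) {
        s.erase(0, s.find_first_not_of(" \t"));
        s.erase(s.find_last_not_of(" \t") + 1);
    };

    trim(line);
    auto colon = line.find(':');
    if (colon == std::string::npos)
        return {"", ""};

    std::string key = line.substr(0, colon);
    std::string value = line.substr(colon + 1);
    // Keys like "User-Agent", "user-agent", "USER-AGENT" all compare equal.
    std::transform(key.begin(), key.end(), key.begin(),
                   [](unsigned char c) { return std::tolower(c); });
    trim(key);
    trim(value);
    return {key, value};
}
```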
The server has been upgraded to Ubuntu 22.04 with Linux 5.15! With that, we are able to use the Landlock kernel feature to prevent attackers from executing commands or accessing arbitrary files through any server exploit. Drogon is also updated to 1.7.5.
Just a heads-up: the server will be upgraded to Ubuntu 22.04 sometime next week or the week after. Expect a few hours of downtime; things should be back to normal after that. Also, I'm planning to migrate to OpenBSD for better security once they adopt clang-14.
Updated dremini and security measures, improving the speed of the HTTP version of this capsule. And, as you might have noticed, search results are now highlighted with [] so it's easier to see what you're looking for.
At last, the crawler can finish without performance degradation or needing my attention. From now on, crawling will happen automatically every Wednesday and Sunday at 00:00 UTC.
Also a security update, prompted by some weird firewall logs, which I suspect are simply bots poking around, but anyway: the TLGS server now runs with GrapheneOS's hardened_malloc and with much less privilege. I think my code is secure, but better safe than sorry. (Man, I'd love to have unveil and pledge on Linux.)
I switched the ranking algorithm from HITS to SALSA after some optimization and bug fixes. It is faster than HITS and surfaces more relevant pages. Also very proud that gemini.circumlunar.space is finally the first result when I search for "gemini", followed by other very popular capsules like medusae.space!
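For those curious about the difference: SALSA works with hubs and authorities like HITS does, but each step spreads a node's score evenly over its links instead of summing raw scores, which is cheaper to run and less dominated by tightly knit cliques. A bare-bones sketch of the authority iteration (illustrative only; the real implementation works on the link graph stored in the database):

```
#include <cstddef>
#include <utility>
#include <vector>

// One round of SALSA authority propagation on a directed link graph.
// edges[k] = {from, to}. Scores spread evenly: authority -> in-linking
// hub (divided by in-degree), then hub -> linked authority (divided by
// out-degree). Iterating this converges to the SALSA authority scores.
std::vector<double> salsaAuthorityStep(const std::vector<std::pair<int, int>> &edges,
                                       const std::vector<double> &auth,
                                       const std::vector<int> &outDeg,
                                       const std::vector<int> &inDeg)
{
    std::vector<double> hub(auth.size(), 0.0);
    for (const auto &[from, to] : edges)
        hub[from] += auth[to] / inDeg[to];

    std::vector<double> next(auth.size(), 0.0);
    for (const auto &[from, to] : edges)
        next[to] += hub[from] / outDeg[from];
    return next;
}
```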
Happy new year! The search engine can now deduplicate search results. Hopefully this improves the user experience. Let me know if it hides important results for you. It shouldn't, but who knows, maybe a bug is hiding.
The crawler has been updated to handle common wildcard patterns, and server stability is improved. Hopefully this reduces the trouble people have with TLGS.
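If the wildcards in question are the robots.txt kind (* matching any run of characters, a trailing $ anchoring the end of a path), a matcher for them might look like this (an illustrative sketch, not necessarily what the crawler does):

```
#include <string>

// Match a URL path against a robots.txt-style pattern: '*' matches any
// run of characters, a trailing '$' anchors the match to the end of the
// path, and anything else is a prefix match. Sketch only.
bool robotsPatternMatch(std::string pat, const std::string &path)
{
    bool anchored = !pat.empty() && pat.back() == '$';
    if (anchored)
        pat.pop_back();
    else
        pat += '*';  // unanchored rules match any path with this prefix

    // Classic iterative glob matching with backtracking on '*'.
    size_t p = 0, s = 0, starP = std::string::npos, starS = 0;
    while (s < path.size())
    {
        if (p < pat.size() && pat[p] == path[s]) { ++p; ++s; }
        else if (p < pat.size() && pat[p] == '*') { starP = p++; starS = s; }
        else if (starP != std::string::npos) { p = starP + 1; s = ++starS; }
        else return false;
    }
    while (p < pat.size() && pat[p] == '*')
        ++p;
    return p == pat.size();
}
```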
TLGS now has a common-web interface! You can search Gemini content from the common web. It does not proxy content from outside of TLGS's own site, so you still need a Gemini browser to view the results.
Running well. However, it cannot finish crawling unattended; likely some bug in the crawler. If you've seen random downtime of TLGS recently, that's likely me fixing stuff.
Going public! Hope this goes well.
Tried a full crawl of Geminispace. Currently it only crawls when I tell it to. If this goes well, I'll move it to a VPS and open it up for public use.
Test deployment in my homelab. Running with 40K pages in the index. Response time is acceptable even on a bloody slow CPU.
This page is rendered from Gemini Gemtext to HTML. We recommend getting a proper Gemini client for the best experience.