Are you using #Codeberg to host your favorite AI-assisted and otherwise vibecoded project because your desire for dopamine has utterly destroyed your willingness to learn new things?

n0toose will be at FOSDEM

And also, I find it rather peculiar that people join Codeberg because it's more "sovereign" but then go ahead and base their projects on "rugpull" subscription-based services and create things that won't be maintainable in the long run.

It's just plain annoying.

https://youtube.com/watch?v=UymN5scMpZM

Felicitas Pojtinger is in 🇨🇭

@n0toose I don't know, I really wouldn't write off LLMs as a whole. Those new open-weight models - Apertus, Qwen, DeepSeek, Kimi - run fully on local infrastructure, with Vulkan for the acceleration layer. For my own work it's been very helpful to use those kinds of things for meeting transcription, RAG search, automating a "fuzzy" web search, autocorrect, translations, fuzzy search-and-replace, data extraction from logs and so on ...

Felicitas Pojtinger is in 🇨🇭

@n0toose Claude and stuff should obv. be fought, but the open versions of that tech have tons of potential.

n0toose will be at FOSDEM

@pojntfx Still 99.9% trained on top of stolen data, an unimaginable waste of resources by treating public infrastructure as "free real estate" (without respect for robots.txt, often visiting the same links ten thousand times or using sketchy residential proxies to hamper services like Codeberg's and circumvent explicit "no"s), and vibes towards the law (which really doesn't apply to the commoners in many jurisdictions).

n0toose will be at FOSDEM

@pojntfx The need for a "fuzzy web search" is the result of normal web search having been completely enshittified, and it serves another party's "want" of you not seeking out information by yourself but receiving it from "summaries" instead.

Felicitas Pojtinger is in 🇨🇭

@n0toose That's not the case for a lot of those open models anymore. Apertus is a great example of that: it was built only from data sources that explicitly allow crawling https://www.swiss-ai.org/apertus

"Stolen data" is ofc also very debatable. Tightening copyright law, patents and so on even further to "stop" LLMs from being trained will probably only have negative consequences on those writing OSS. I really don't want to live in a world where you're legally prohibited from learning.

Felicitas Pojtinger is in 🇨🇭

@n0toose I'm not sure I agree with that. There has never been a (legal) search engine that could "just give me the paper on CRIU where the author mentions Cricket" before. That stuff is genuinely new, and while stuff like Google has enshittified (I don't disagree with you there) there is also a whole new type of query that you just couldn't do before, all locally on your own device.

n0toose will be at FOSDEM

@pojntfx Oh, Apertus, that one by the Swiss universities. I mean yeah, that is a positive example with a positive approach, but not one I expect to be relevant in any shape or form 5 years from now. The exception doesn't make the rule, nor are the bad effects I stated above driven by the exceptions.

n0toose will be at FOSDEM

@pojntfx Personally I think that if you DoS random people's computers for data mining that also falls under the "law" but...

Felicitas Pojtinger is in 🇨🇭

@n0toose I honestly think the same approach that the EU has been taking with open search indexes should exist for LLM training data too eventually. There are clearly issues with how you get access to say my OSS projects if you want to train on them without hammering my forge, I know that. At the same time though, without access to open training data you're ceding the entire field - which pretty much everyone uses in one way or another - to private IP deals with publishers.

Felicitas Pojtinger is in 🇨🇭

@n0toose There were lots of proposals around criminalizing "unauthorized access" to services in the past few decades, about trying to make it so that only a "human" can access them, enforcing ToS legally ... I've really only seen them used against end users in practice (Reddit's anti-scraping policy/API shutdown, third-party clients for Signal, any reverse engineering project ever etc.)

A lot of these kinds of laws will have effects far, far worse than DDoSing public infrastructure IMHO.

n0toose will be at FOSDEM

@pojntfx There has, they used to call them "Google dorks" or "Advanced Search". Now if that didn't work 100% of the time, well, that applies to both cases (readjusting a query in a different form). It's beside the point anyway, because people are not looking for something that works for minor edge cases, they are looking for a way to look for information, and you can't even look up an omelette recipe anymore without SEO garbage taking up the first two pages.

n0toose will be at FOSDEM

@pojntfx I also think that the notion of local LLMs letting you find niche papers exaggerates their abilities, and that the ability to use them depends on hardware that is not accessible anymore due to data center costs and IMO due to the overall war against general purpose computing.

Byte

@n0toose codeberg ought to ban that slop

Felicitas Pojtinger is in 🇨🇭

@n0toose Yeah, the SEO slop is obviously horrendous. I'm ngl though, being able to use an LLM to search through say IndieWeb instead has been the first time in a long time that I've actually been able to find (non-code) answers to questions again. Used it for booking flights with niche airlines in a country I've never been to for example. That kind of stuff was locked behind proprietary APIs for so long, and now you can actually get at it without those APIs for the first time in forever.

n0toose will be at FOSDEM

@pojntfx I think one can be for scraping and making data available e.g. for researchers but against the specific manners in which startups break things—it's just hard to explain that to someone who doesn't operate infrastructure for people at scale.

Anyway, we just have to spend two or three times the price on SSDs I guess (see: greater societal impact), so if that's fine...

Felicitas Pojtinger is in 🇨🇭

@n0toose There are production constraints around all of this atm, yes. But much like how you can't fix the housing crisis without making it cheaper to build houses and actually building them, I don't think we can fix something like this without actually building out the fabs and getting supply up there w/ demand.

And local LLMs are very much "real" now. Try out Newelle or Alpaca on GNOME on your regular laptop - even mine can run them w/o issues now via Vulkan, and I don't have a lot of VRAM.

n0toose will be at FOSDEM

@pojntfx Yeah and you're treating this as if it's something to be taken for granted forever; the counterexamples are the equivalents of "searxng" or alternative search engines to me tbh.

Felicitas Pojtinger is in 🇨🇭

@n0toose I mean yes, optimally a law like you mention would try and fix this, but I have 0 trust in any jurisdiction actually making a law like that. I'm pretty certain we'll instead end up in a world where only massive companies that can pay for IP licensing agreements can train models.

n0toose will be at FOSDEM

@pojntfx and you're using that to find obscure papers?