the trash keeps piling up and rotting, promising younger colleagues end up leaving for greener pastures, and i'm stuck here pulling levers to help maintain the illusion that this bloated corpse of a company is still worth a damn.
apparently the metrics used to evaluate llm-based systems don't come from anything grounded in reality. they just pass the prompt and response pairs to an llm and ask it to evaluate them. usually the llm doing the evaluation is the same one being evaluated.
so much of this feels entirely unscientific. engineers are treating llms as these magical infallible black boxes without understanding their specific strengths and limitations. it's the ultimate hammer and now literally every problem looks like a nail.
it's also very comical that the examples they are using in the lab produce incredibly generic and horribly useless responses but the llm is scoring them very high in all metrics.