LLM Benchmarks not accurate?

Going to artificialanalysis, we see that for HumanEval the top 7 models are within 7% of each other, with o1-mini at 97% and Mistral Large 2 at 90%. That doesn't seem like too big of a difference, yet in real use there's a huge gap, from models creating beautiful websites to models not even following prompts. Gemini 2.0 Flash (taken as an example) is often unable to set up the website or follow prompts to use templates. What could I be doing wrong? Thanks a lot for the help!


Hi,
I used Gemini 2.0 Flash a lot and did not have big problems with it. I think it's good once the initial setup of the webapp is present.
So maybe use starter templates, and also try the "optimized prompt" instead of the default, if you haven't already.

Other than that, I've heard/read that the experimental model should perform better for programming.

Also try the models mentioned in the FAQ:
https://stackblitz-labs.github.io/bolt.diy/FAQ/

I’ll try that. Thanks!

A problem I’m currently facing is knowledge cutoff. I can’t attach llms.txt files to help the model, case in point being Svelte 5. Any advice?
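What I'd like to do is roughly this (just a sketch, assuming the framework publishes its docs as an llms.txt at a URL like https://svelte.dev/llms.txt, which I haven't confirmed, and assuming I could prepend it to the system prompt):

```ts
// Sketch: fetch a framework's llms.txt and prepend it to the system prompt
// so the model has docs newer than its knowledge cutoff (e.g. Svelte 5).
// The URL and the 50k-character cap are assumptions, not anything Bolt.diy does today.
const LLMS_TXT_URL = "https://svelte.dev/llms.txt";

async function buildSystemPrompt(basePrompt: string): Promise<string> {
  const res = await fetch(LLMS_TXT_URL);
  if (!res.ok) throw new Error(`Failed to fetch llms.txt: ${res.status}`);
  const docs = await res.text();
  // Trim so the docs don't eat the whole context window.
  const trimmed = docs.slice(0, 50_000);
  return `${basePrompt}\n\n<framework-docs>\n${trimmed}\n</framework-docs>`;
}
```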

Not sure what you are trying to attach, what is llms.txt?

As for evals: it's sadly a pretty weird thing. I like the Aider leaderboard in that sense.

It's a leaderboard of how well LLMs perform at coding with Aider, and you can see that the models are tested with different editing formats: some on full file rewrites, some on diff editing. And they perform differently.
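To make the two formats concrete, here is a rough illustration in Aider's SEARCH/REPLACE style (the file name and the code are made up, and the exact markers may differ from what Bolt.diy itself sends):

```
utils.ts
<<<<<<< SEARCH
export function add(a: number, b: number): number {
  return a - b;
}
=======
export function add(a: number, b: number): number {
  return a + b;
}
>>>>>>> REPLACE
```

With the "whole" format the model would instead re-emit the entire utils.ts for the same one-line fix, which is slower and more expensive, but apparently easier for some models to get right.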

You can see that Sonnet 3.5 is in 3rd place at the moment, solving 45% correctly with 100% consistency in using the diff editing format correctly.

In 4th place there is Gemini exp-1206, which is about 7 points lower.
That is almost to say that for every 10 problems you give them, Sonnet will solve roughly one more than Gemini 1206.

Flash is below with 22.2%, meaning it's about 2x worse.
It also uses full rewrites, and so does Gemini 1206.

They also have a larger leaderboard just for code editing.

There, Sonnet with the diff format does 84%, and Gemini 1206 with the whole-file format does 80%.

Gemini Flash with the diff format does 69%, while Gemini 1206 with the diff format also does 69%.

So it's interesting to see that Gemini 1206 is a whole 11 points worse at diff editing than at full file rewrites, which are slower and more expensive.

Basically, there is a lot of nuance in how these models perform based on what inputs they are given and how function calling is set up.

In that sense Bolt.diy needs its own benchmarks eventually.


I would take benchmarks with a grain of salt. They do not represent multi-part or complex tasks; they generally focus on one-shot completion of a given prompt for a pretty cookie-cutter output. Test that on real-world use cases and you will most certainly be disappointed every time. They really only show us which models are reasonably better, and to go further I think some sort of RAG or workflow is required.

Have you tried DeepSeek-V3 yet?

I currently haven't used the DeepSeek API. I mostly rotate between Bedrock, Gemini, and sometimes Mistral. I do use Sonnet, but quite rarely, to avoid cost accumulation.

Templates aren’t working, I’ll look into it a bit more and open another thread for that.

Gemini has begun to work better, so I'm wondering if it was a context issue.

Lastly, a bolt.new query: are WebContainers getting slower? I used bolt.new recently and WebContainers, which used to be blazing fast, are now taking minutes to compile single pages. Can someone clarify whether this is true, or is it just an issue on my end? Thanks!

Bolt.diy and bolt.new WebContainers are running slower than usual.
I'm using the current stable branch, and the token output limits are really hitting. Any workaround?

EDIT: It’s not a bolt.diy problem, I’m sorry for posting this here. I had forgotten that it was the Gemini API’s limit. My bad
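For anyone else hitting this: the cap lives on the Gemini side, and the most you can do is request the model's maximum output via generationConfig. A rough sketch with the @google/generative-ai SDK; the model id and the 8192 ceiling are assumptions, so check the limits for your model and tier:

```ts
import { GoogleGenerativeAI } from "@google/generative-ai";

// Sketch only: maxOutputTokens can be raised up to the model's hard cap,
// but not past it; longer outputs have to be split across multiple turns.
const genAI = new GoogleGenerativeAI(process.env.GOOGLE_API_KEY ?? "");

const model = genAI.getGenerativeModel({
  model: "gemini-2.0-flash-exp", // assumed model id
  generationConfig: { maxOutputTokens: 8192 }, // assumed ceiling for this tier
});

async function main() {
  const result = await model.generateContent("Generate the full index page.");
  console.log(result.response.text());
}

main();
```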