Comparing providers and models with a slightly harder initial prompt

Hi Bolt-ers…

Been playing around with finding the sweet spot when trialing a slightly harder prompt. I found it worked first time in Bolt.new, but I’ve had very little success in Bolt.diy depending on the provider/language-model combination. So I thought I’d put this into a table and display the results - this will be a WIP, so please be gentle on me. If you have some good input, I’d appreciate that too.

My standard prompt is as follows…

Create a clean login page with user registration and use the React framework.
After logging-in the user should be taken to a welcome page displaying the login time and welcome the user using their name.
Include a drawer from the left side of the screen which includes links for home, profile, settings, sign-out.

Screenshots of outcome where applicable…

Google - Gemini 2.0 Flash

Some output from Bolt for context also.

Okay, I will create a login page with user registration using React, along with a welcome page that displays the login time and user’s name. Additionally, I’ll include a drawer from the left side with navigation links.

Here’s the plan:

1. Project Setup: Initialize a new React project using Vite.
2. Authentication Logic: Implement basic authentication logic using local storage to store user data.
3. Login and Registration Components: Create separate components for login and registration forms.
4. Welcome Page: Create a welcome page component that displays the user’s name and login time.
5. Drawer Component: Create a reusable drawer component for navigation.
6. Routing: Set up routing to navigate between login, registration, and welcome pages.
7. Styling: Add basic styling for a clean look.

Now, let’s start with the implementation.

(screenshots)

Not a bad outcome :slight_smile:
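
For anyone wondering what step 2 of that plan boils down to in practice, here’s a minimal sketch of the localStorage approach - my own illustration with made-up names, not what Bolt actually generated:

```ts
// auth.ts - throwaway localStorage auth, demo quality only.

interface User {
  name: string;
  email: string;
  password: string; // plain text only because this is a localStorage demo - never ship this
}

export function register(user: User): void {
  const users: User[] = JSON.parse(localStorage.getItem("users") ?? "[]");
  users.push(user);
  localStorage.setItem("users", JSON.stringify(users));
}

export function login(email: string, password: string): boolean {
  const users: User[] = JSON.parse(localStorage.getItem("users") ?? "[]");
  const user = users.find((u) => u.email === email && u.password === password);
  if (!user) return false;
  // Save a session so the welcome page can greet the user and show the login time.
  localStorage.setItem(
    "session",
    JSON.stringify({ name: user.name, loginTime: new Date().toISOString() })
  );
  return true;
}

// The welcome page reads this back for the greeting and login time.
export function currentSession(): { name: string; loginTime: string } | null {
  const raw = localStorage.getItem("session");
  return raw ? JSON.parse(raw) : null;
}
```

Obviously you’d never store plain-text passwords for real; it’s just how these quick demos tend to come out.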

Anthropic - Claude 3.5 Sonnet (new)

Anthropic - Claude 3.5 Sonnet (old)

Got the same for Claude 3 Haiku (new) and the other 3.

Groq - Llama 3.1 8B

Writes the code but won’t attempt running the commands to set up the server and display the app.

Groq - Llama 3.2 90B

Similar, but it failed only at executing npm and setting up the environment - presumably an easy fix (just run `npm install` and `npm run dev` yourself) and worth further experimentation, although this option is heavy on tokens and therefore expensive, considering what you’re getting is not magic.

Groq - Llama 3.2 3B

Similar - gives the files but nothing else…

Groq - Llama 3.2 1B

Yeah - don’t waste your time…

Groq - Llama 3.3 70B

Much the same, but at least you also get some advice on running a local server. LOL.

OpenRouter - Anthropic Claude 3 Sonnet - in $3.00 out $15.00

Don’t bother…

OpenRouter - Auto Router (best for prompt) in $10000000.00 out $100000000.00

I don’t know what all the $$$$$ is about, because when I checked the actual cost on my OpenRouter account it was minimal and essentially the same as the one before…but it almost delivered.

OpenRouter - EVA Qwen2.5 32B - in $2.60 out $3.40 context 16k

Let’s see if someone can serve a Qwen 2.5 32B LLM and actually produce the goods…nope.

OpenRouter - Google Gemini 2.0 Flash Thinking Experimental (free)

Nearly, but not quite. Operationally it doesn’t work, but with a little effort it’d be there.

OpenRouter - Meta: Llama 3.2 90B Vision Instruct - in $0.90 out $0.90

Most of the way there but failed.

Interesting to see some of the errors as they come out…

It has worked out the error but failed on the initial build component - almost there though.

OpenRouter - OpenAI: GPT-4o - in $2.50 out $10.00 context 128k

Got most of the way there but missing operational components.

OpenRouter - Qwen2.5 Coder 32B Instruct - in $0.07 out $0.16 context 33k

Nearly there on the first try…underwhelming, though.

OpenRouter - Anthropic: Claude 3.5 Sonnet - in $3.00 out $15.00

Not quite.

OpenRouter - DeepSeek-Coder V2 236B

Yeah - I expected more. :frowning:

OpenRouter - Qwen 110B Chat

Nothing but tumble-weeds…

And a quick look at my OpenRouter traffic and token usage.

Yeah, it seems like a lot of stuffing around - but I hope this helps a few people out there. So many of these combinations provide very little in return. I’d love to get some feedback from anyone who’s had better results. It would have been more enjoyable to hit my head against the wall for 4 hrs. But that would have just woken everyone up. Haha.

Actually, I may have had some errors due to running out of Anthropic credit - bugger. I didn’t go back to check.

And I’d been trying to keep a spreadsheet on outcomes. It’s not as accurate as I would have hoped but the detail isn’t as important as the final outcome.

Anyway. I really hope this helps a few out there. Post your best results below please.

I think for my best results right now it’s actually Google’s Gemini 2.0 Flash as the clear winner…

All the best everyone and hope you all get something positive out of using Bolt.diy - It’s going to be amazing if used correctly. :innocent: :hugs:


Hey,
thanks for this detailed post. I also work just with Gemini 2.0 Flash at the moment, but it does have problems when projects get bigger.

It would be important to know which branch you’re on and which features you had activated when you tested, because I think it makes a big difference whether you use the default prompt or the optimized (experimental) one.

The newest main also has some other features that improve it:

The starter templates make a big difference, because if you start with a very good base, it’s much easier for the AI to follow up.

If there’s anything I’ve learned over the last few weeks of working with Bolt.new and now this, it’s that starting simple is important. That’s why my prompt is only a little more complex: I learned you can take it that little bit further than a one-liner. In fact, it’s about the sixth iteration of that prompt, shaped so it can be consumed logically, give a good level of success, and get me further along after the first request.

Git status as follows - on main currently…

(screenshot)

And only Context Optimization is not switched on - I’ll enable that after this and let you know if things improve.


These were done using your exact prompt with DeepSeek V3 Coder and the optimized-prompt feature. My Bolt.diy version is the following:
Current Version Tag: v0.0.3
Current Commit Version: 55cfd5d
Not bad for the first run. Of course, there is no authentication logic. The login, register, and sign-out links here work.




@sfason65 - Thanks Stephen

How’s your general experience with DeepSeek V3, please?

Responsiveness, cost, and accuracy of results?

IMO DeepSeek-V3 is way better than Llama 3.3 70B Instruct, GPT-4o, and Qwen 72B Instruct. Google Gemini 2.0 Flash (exp-1210) seems pretty much on par. And I haven’t compared it with ChatGPT o1. I can only imagine how well it would do with a RAG setup.


Cheers. I’ll put that in my to-do list and get back to you. :slight_smile:


Another thing I should have mentioned is that it’s also super cheap.

The only thing that beats it is maybe Llama 11B Vision for certain use cases.

Or free models, of course, but they have rate limits and whatnot (though they could still definitely be useful in a MoE RAG-type system).
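
For example, something like this - a rough sketch where the free-tier model IDs are just examples (OpenRouter’s free tier changes), hitting OpenRouter’s standard OpenAI-compatible chat completions endpoint and falling back when rate-limited:

```ts
// fallback.ts - sketch of leaning on free models despite their rate limits.
// Requires Node 18+ (global fetch) and an OPENROUTER_API_KEY env var.

const FREE_MODELS = [
  "google/gemini-2.0-flash-exp:free",
  "meta-llama/llama-3.2-11b-vision-instruct:free",
];

export async function completeWithFallback(prompt: string): Promise<string> {
  for (const model of FREE_MODELS) {
    const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model,
        messages: [{ role: "user", content: prompt }],
      }),
    });
    if (res.status === 429) continue; // rate-limited: try the next free model
    if (res.ok) {
      const data = await res.json();
      return data.choices[0].message.content;
    }
  }
  throw new Error("All free models are rate-limited right now.");
}
```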


@aliasfox covered it pretty well.


Two quick questions…

Where are you getting this output - it looks like there’s some logic behind it, with different prompts served by different provider combinations?

(screenshot)

How do you differentiate between different use cases and decide which provider to use?

You can change providers throughout development in Bolt.diy, but specifically I meant you’d want this handled by a workflow exposed as an API endpoint. I’m playing with one right now to bring multi-model and RAG support into Bolt.diy (or any other IDE) without having to modify code. Hope that kind of answers your question.
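
Roughly the kind of thing I mean - a minimal sketch where the use cases, keywords, and model IDs are all my own illustrative assumptions, not anything Bolt.diy actually ships:

```ts
// modelRouter.ts - sketch of a workflow endpoint picking a model per request.

type UseCase = "code" | "vision" | "general";

const MODEL_FOR: Record<UseCase, string> = {
  code: "deepseek/deepseek-chat",                     // cheap and strong at code
  vision: "meta-llama/llama-3.2-11b-vision-instruct", // for screenshots etc.
  general: "google/gemini-2.0-flash-exp",
};

// Crude keyword classification; a real workflow might use embeddings,
// a small cheap LLM, or RAG over past conversations to decide.
function classify(prompt: string, hasImage: boolean): UseCase {
  if (hasImage) return "vision";
  if (/\b(code|component|function|npm|react|bug)\b/i.test(prompt)) return "code";
  return "general";
}

// The endpoint receives the prompt, picks a model, and forwards the request
// to the provider (e.g. OpenRouter's OpenAI-compatible API) unchanged.
export function pickModel(prompt: string, hasImage = false): string {
  return MODEL_FOR[classify(prompt, hasImage)];
}
```

The point is the IDE never needs to know: it talks to one endpoint, and the workflow decides which provider actually serves the request.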

Thanks. I try not to change providers during a development stage. I find I get errors, but I haven’t worked out whether it’s the switching and a possible loss of continuity between providers - I don’t know exactly how the prompt system works, or whether and how it keeps a memory - but it seems to fall over if I change providers to try and solve a problem the current one can’t fix. I’ll continue to experiment with this, though.

The multi-modal support sounds interesting. I’m assuming that’s in response to a prompt, or when trying to diagnose something - the response is presented in multiple formats depending on the issue…?