Here’s what the expert says about itself
tl;dr: No consumer hardware will be able to host a model that comes close to what the commercial models offer. Qwen32B sounds tempting, but you’d need approx. 100 GB of RAM.
How do you determine the minimum memory required to run an LLM (based on parameter count and quantization precision)?
ASSISTANT
Determining the minimum memory required to run a Large Language Model (LLM) involves several factors, including the model architecture, the number of parameters, quantization precision, and the specific hardware capabilities. Here’s a step-by-step guide to help you calculate or estimate the memory requirements:
1. Understand the Model Architecture
- Number of Parameters: The number of trainable parameters in the LLM is the key factor. For example, a small model may have a few hundred million parameters, while larger models like GPT-3 or GPT-4 have on the order of a hundred billion or more.
- Layer Structure: Different architectures (e.g., transformer-based models) may have varying memory requirements due to their layer designs.
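If you already have a model on disk, you can get the exact parameter count by summing the element counts of its tensors. A minimal sketch, assuming PyTorch is installed (the small `nn.Sequential` below is just a stand-in for whatever model you are sizing up):

```python
import torch.nn as nn

# Hypothetical stand-in; replace with your own nn.Module
# (e.g. one loaded via the transformers library).
model = nn.Sequential(nn.Linear(4096, 4096), nn.Linear(4096, 4096))

# Total number of trainable parameters N.
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"N = {n_params:,} parameters")
```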
2. Quantization Precision
- Quantization reduces the precision of the model weights and activations to save memory and computational resources.
- Common quantization precisions include:
  - FP32 (Full Precision): 4 bytes per value; requires the most memory but offers the highest accuracy.
  - FP16: 2 bytes per value; halves memory usage compared to FP32.
  - INT8: 1 byte per value; reduces memory further but can lead to some loss of precision.
- If you are using quantization, multiply the number of parameters (weights) by the number of bytes per value for that precision.
For example:
- 1 billion parameters in FP16 would require (1 \times 10^9 \times 2 = 2 \text{ billion bytes}).
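As a sketch of that multiplication, assuming the usual byte sizes per value (4 for FP32, 2 for FP16, 1 for INT8) and counting weight memory only:

```python
# Approximate memory needed just for the model weights,
# based on parameter count and quantization precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_memory_gb(n_params: int, precision: str = "fp16") -> float:
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9

# 1 billion parameters in FP16 -> roughly 2 GB, matching the example above.
print(weight_memory_gb(1_000_000_000, "fp16"))
```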
3. Memory for Activations
- Besides the model parameters, you need memory to store activations during inference.
- For a typical transformer-based model, a rough rule of thumb is that the number of activation values is comparable to the number of parameters (in practice this depends heavily on batch size and sequence length).
- Therefore, if your model has (N) parameters, it might require roughly another (N) values for activations, i.e. (N \times 2) bytes in FP16.
4. Memory Overhead
- There will be additional memory required for overheads such as:
  - Temporary buffers used during inference.
  - Optimizer states (if training).
  - CUDA/ROCm device memory (for GPU computation).
5. Device-Specific Memory Constraints
- Different hardware devices have different memory capacities; a consumer GPU, for example, typically has far less dedicated memory (VRAM) than the system RAM available to the CPU.
- Ensure that the total required memory does not exceed the available memory on your device.
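One way to do that check is to compare the estimate against the device's total memory. A minimal sketch, assuming an NVIDIA GPU and PyTorch built with CUDA support:

```python
import torch

def fits_on_gpu(required_gb: float, device_index: int = 0) -> bool:
    """Rough check that an estimate fits into the GPU's total memory."""
    if not torch.cuda.is_available():
        return False
    total_bytes = torch.cuda.get_device_properties(device_index).total_memory
    return required_gb * 1e9 <= total_bytes

print(fits_on_gpu(4.125))  # the estimate from the example calculation below
```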
Example Calculation
Let’s consider a hypothetical LLM with 1 billion parameters ((N = 1 \times 10^9)) stored in FP16 (2 bytes per value).
Memory Requirements:
- Model Parameters: (N \times 2 = 1 \times 10^9 \times 2 = 2 \text{ billion bytes} = 2 \text{ GB})
- Activations: (N \times 2 = 1 \times 10^9 \times 2 = 2 \text{ billion bytes} = 2 \text{ GB})
- Overhead: Assume an overhead of about (N \div 8) (a rough estimate, as this can vary): (N \div 8 = 1 \times 10^9 \div 8 = 125 \text{ million bytes} = 125 \text{ MB})
Total Memory Required:
(2 \text{ GB (parameters)} + 2 \text{ GB (activations)} + 125 \text{ MB (overhead)} = 4.125 \text{ GB})
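The same arithmetic as a small sketch, combining the rule-of-thumb terms from the steps above (byte-per-value assumptions as before):

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def estimate_total_gb(n_params: int, precision: str = "fp16",
                      include_activations: bool = True) -> float:
    """Rule-of-thumb total: weights + (optional) activations + N/8 bytes overhead."""
    bytes_per_value = BYTES_PER_PARAM[precision]
    weights = n_params * bytes_per_value
    activations = n_params * bytes_per_value if include_activations else 0
    overhead = n_params / 8
    return (weights + activations + overhead) / 1e9

# 1 billion parameters in FP16 -> 4.125 GB, matching the example above.
print(estimate_total_gb(1_000_000_000, "fp16"))
```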
Practical Considerations
- Parallelism: If you are using parallel computing techniques like model parallelism or data parallelism, the memory requirement per device will be lower, but the overall system complexity increases.
- Inference vs. Training: The memory requirements for inference and training can differ significantly. Training requires additional memory for gradients and optimizer states.
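To make that difference concrete, a commonly cited rule of thumb for mixed-precision training with Adam is roughly 16 bytes of model state per parameter (FP16 weights and gradients plus FP32 master weights and two optimizer states). A sketch under that assumption, before any activation memory:

```python
def training_memory_gb(n_params: int, bytes_per_param: int = 16) -> float:
    """Rough model-state memory for mixed-precision Adam training
    (2B weights + 2B gradients + 4B master weights + 8B optimizer states)."""
    return n_params * bytes_per_param / 1e9

# 1 billion parameters -> roughly 16 GB of model state.
print(training_memory_gb(1_000_000_000))
```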
Conclusion
To determine the minimum memory required to run an LLM, consider the number of parameters, quantization precision, architecture, and overheads. Use this information to estimate the total memory needed based on your specific hardware constraints.
So how much would I need for 32B parameters with FP16? Sorry, I’m too lazy to calculate it myself.
ASSISTANT
Edit: The answer is included as a picture, as there’s no MathML support in here.
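The picture isn’t reproduced here, but applying the same rule of thumb from the answer above to 32 billion parameters in FP16 gives roughly:
(32 \times 10^9 \times 2 = 64 \text{ GB (parameters)})
(32 \times 10^9 \times 2 = 64 \text{ GB (activations)})
(32 \times 10^9 \div 8 = 4 \text{ GB (overhead)})
(64 + 64 + 4 = 132 \text{ GB total, or about 68 GB if activations are ignored})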
On a local machine for a single developer, activations are probably negligible. Still, quite a lot of metal is needed.
I decided to use a shared service with transparent pricing (Open Router). More powerful at a lower TCO.