Some thoughts on where to borrow ideas and code for developer features. VS Code is open source.

I was exploring how we can improve context size/tokens.

The general principle behind this is usually some form of RAG, or retrieval-augmented generation.

It's a fancy term, but if you simplify the idea, it's just search. The user asks a question, and you have a system that chunks, indexes, and makes your database searchable.

Then, when the user asks something, the system first sends search requests and adds part of what it found to the context.

In our case it could find chunks of files or chat messages.
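To make the "RAG is just search" idea concrete, here is a minimal sketch of that loop: chunk the sources, score chunks against the query, and build a context block from the top hits. All names and the naive keyword scoring are made up for illustration; a real system would use a proper index or embeddings.

```typescript
// A chunk can come from a file or a past chat message.
type Chunk = { source: string; text: string };

// Naive chunker: split a file or chat message into fixed-size pieces.
function chunk(source: string, text: string, size = 200): Chunk[] {
  const out: Chunk[] = [];
  for (let i = 0; i < text.length; i += size) {
    out.push({ source, text: text.slice(i, i + size) });
  }
  return out;
}

// Naive scoring: count how many query terms appear in the chunk.
function score(query: string, c: Chunk): number {
  const terms = query.toLowerCase().split(/\s+/);
  return terms.filter((t) => c.text.toLowerCase().includes(t)).length;
}

// Retrieve the top-k chunks and format them as a context block for the prompt.
function buildContext(query: string, chunks: Chunk[], k = 3): string {
  return chunks
    .map((c) => ({ c, s: score(query, c) }))
    .filter((x) => x.s > 0)
    .sort((a, b) => b.s - a.s)
    .slice(0, k)
    .map((x) => `[${x.c.source}]\n${x.c.text}`)
    .join("\n\n");
}
```

The point is that nothing here is model-specific: retrieval is an ordinary search problem, and the LLM only sees the assembled context.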

BTW, that is how Cursor and Windsurf work.

The user asks, and those IDEs search their codebase.

I really want to add such a “search tool” to Bolt.New to make it possible to work on a larger codebase without adding all of it to the context.

But how? We need code-aware indexes.

Well, VS Code does that already.

And it uses things like:
GitHub - microsoft/vscode-ripgrep: For consuming the ripgrep binary from microsoft/ripgrep-prebuilt in a Node project for fast text search

And GitHub - microsoft/vscode-languageserver-node: Language Server Protocol implementation for VS Code. This allows implementing language services in JS/TS running on Node.js, for code understanding.
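As a sketch of how a Node project might use ripgrep for the search side: the @vscode/ripgrep package (as I understand it) exposes `rgPath`, the path to a prebuilt `rg` binary, and you shell out to it. The flags below are standard ripgrep flags; the argument builder is kept separate so it can be tested without running the binary.

```typescript
import { execFile } from "node:child_process";

// Build the argument list separately so it is easy to test without running rg.
function buildRgArgs(pattern: string, dir: string): string[] {
  // --json gives structured match output; --max-columns keeps huge lines out of context.
  return ["--json", "--line-number", "--max-columns", "200", pattern, dir];
}

// rgPath would come from e.g. `import { rgPath } from "@vscode/ripgrep"` (assumption).
function searchCodebase(rgPath: string, pattern: string, dir: string): Promise<string> {
  return new Promise((resolve, reject) => {
    execFile(rgPath, buildRgArgs(pattern, dir), (err, stdout) => {
      // rg exits with code 1 on "no matches", which execFile reports as an error;
      // treat that as an empty result rather than a failure.
      if (err && stdout === "") reject(err);
      else resolve(stdout);
    });
  });
}
```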

My view is that we need to look into those when thinking about IDE-related features like code and text search, codebase understanding, and chunked editing.

Reuse what open source has already built.


Love this, thanks @wonderwhy.er for sharing! This is going to be super important for working with any larger projects within oTToDev in the long term!


I made my comment into a new thread as it’s not totally on topic here and I don’t want to hijack :wink:


I do think that there are at least two different use cases:

Exploring a codebase

Like @wonderwhy.er described, the general context for this is the whole project. RAG mechanisms may be appropriate, but so are custom agents that help with exploration.
Here’s an example: an agent which, prompted with certain descriptive capabilities, creates a holistic, textual description of the whole repo by scanning it file-by-file / folder-by-folder.
Output (illustrative):

{
  "file": "src/workbench.txt",
  "dependencies": ["src/lib/files", ...],
  "responsibility": "Establishes a container which has controls..."
}
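A rough sketch of such a summarizing agent, with the actual LLM call stubbed out (`describe` stands in for a real prompt like "describe this file's responsibility in one line"; the dependency extraction is a deliberately crude regex):

```typescript
type FileSummary = { file: string; dependencies: string[]; responsibility: string };

// Stand-in for the per-file LLM call; in a real agent this would hit the model.
type Describe = (path: string, content: string) => string;

// Very rough import extraction, good enough for the sketch.
function extractDeps(content: string): string[] {
  const deps: string[] = [];
  const re = /from\s+["']([^"']+)["']/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(content)) !== null) deps.push(m[1]);
  return deps;
}

// Walk the repo (here: an in-memory map of path -> content) and build one
// summary entry per file, in the shape shown above.
function summarizeRepo(files: Map<string, string>, describe: Describe): FileSummary[] {
  const out: FileSummary[] = [];
  for (const [file, content] of files) {
    out.push({
      file,
      dependencies: extractDeps(content),
      responsibility: describe(file, content),
    });
  }
  return out;
}
```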

Then, this could be used to determine where to make certain enhancements:

“As a developer proficient in [detected language], where would you do the following enhancement?”
[prompt]
“Here’s how the project currently looks:”
[Context: the above summary of all files]

Making enhancements

If an entry point to the intended change is known (usually, this is the opened file), the context could be determined by the abstract syntax tree (AST). IDEs usually employ a language server in order to interact with the language. This should “easily” output a graph of dependencies. All branches leading to the entry point would then define the context.
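The "all branches leading to the entry point define the context" step can be sketched as plain graph traversal: given a dependency graph (as a language server could produce), collect every file from which the entry point is reachable. Nothing below is LSP-specific; the graph shape is an assumption.

```typescript
type DepGraph = Map<string, string[]>; // file -> files it depends on

function contextFor(entry: string, graph: DepGraph): Set<string> {
  // Invert the edges: for each file, who depends on it?
  const dependents = new Map<string, string[]>();
  for (const [file, deps] of graph) {
    for (const d of deps) {
      if (!dependents.has(d)) dependents.set(d, []);
      dependents.get(d)!.push(file);
    }
  }
  // BFS from the entry point over the inverted edges.
  const seen = new Set<string>([entry]);
  const queue = [entry];
  while (queue.length > 0) {
    const f = queue.shift()!;
    for (const dep of dependents.get(f) ?? []) {
      if (!seen.has(dep)) {
        seen.add(dep);
        queue.push(dep);
      }
    }
  }
  return seen;
}
```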

Edit: You linked the VS Code implementation. If you want to go down this route, you may check Zed’s documentation on language servers.

Both techniques could, of course, be combined to be more creative when making entry-point-based enhancements.

I am quite sure that most, if not all, commercial products utilize the language server for the context. Some ask/allow the user to define it (a working set of files which defines the context).

I have been using Windsurf for a couple of weeks.
They use a relatively simple approach.

They use grep search and read files in chunks of 200 lines to select the correct ones for the task.

AST-based chunking could work better than plain lines, but it's also a complication we can add later.
As usual, my suggestion is to do it in small steps.
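The Windsurf-style approach described above (fixed windows of 200 lines) is almost trivially small to implement, which is part of why it is a good first step. A sketch, with names and the chunk size being illustrative:

```typescript
type FileChunk = { path: string; startLine: number; text: string };

// Split a file into fixed windows of lines; the model is then asked which
// windows are relevant. Tracking startLine lets the model cite locations.
function chunkByLines(path: string, content: string, linesPerChunk = 200): FileChunk[] {
  const lines = content.split("\n");
  const chunks: FileChunk[] = [];
  for (let i = 0; i < lines.length; i += linesPerChunk) {
    chunks.push({
      path,
      startLine: i + 1, // 1-based line numbers
      text: lines.slice(i, i + linesPerChunk).join("\n"),
    });
  }
  return chunks;
}
```

AST-based chunking would replace the fixed window with syntax-aware boundaries (functions, classes), but the interface above could stay the same.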

At minimum it will require:

  1. Not sending the whole conversation to the AI, but filtering it through some form of search.
  2. If we do it for the chat conversation, it will allow us to include user requests, AI answers, and file contents in such “context creation”.
  3. But we will need additional support for files, as we need to include the path of the files in question when feeding those chunks to the LLM so it knows which files to modify.

Maybe as a first iteration, what I would do is add “grep search” through:

  • the chat history
  • the current state of files

Then give a list of files and relevant messages from the conversation, ranked by importance, and send that to the AI in this style:

{relevant past chat messages}
{file URLs and content of relevant files found through search}
{last user request about what to do with all of that}
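Assembling that prompt shape is straightforward once the search/ranking step has produced its pieces. A sketch (the inputs are assumed to be already ranked; only the assembly is shown):

```typescript
type FoundFile = { path: string; content: string };

// Build the prompt in the order described above: past messages, then found
// files, then the latest user request.
function assemblePrompt(pastMessages: string[], files: FoundFile[], request: string): string {
  const messageBlock = pastMessages.join("\n");
  // Include the path with each file so the model knows which file to modify.
  const fileBlock = files
    .map((f) => `// file: ${f.path}\n${f.content}`)
    .join("\n\n");
  return [messageBlock, fileBlock, request].join("\n\n");
}
```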

Another interesting and relevant thing is this

Anthropic says coding software like Replit, Codeium, and Sourcegraph have already started using MCP to build out their AI agents, which can complete tasks on behalf of users. This tool will likely make it easier for other companies and developers to connect an AI system with multiple data sources — something that could become especially helpful as the industry leans into agentic AI.


They actually have an example GitHub-related server built with that Model Context Protocol.


Sounds nice, but I have experimented with grep search once, and I believe it requires some native bindings and actual file indexing. WebContainer might not have this.

If we want to do this, we might need to implement our own version.

MCP looks interesting

As I shared above, there is a WebAssembly grep that VS Code uses.

So we do not need to invent our own, just find the best available tools. The JS ecosystem is large and rich in that sense.


I believe I have a solution. Is there any way we could talk about this further? It’s a lot to just type on here, but it’s something I’ve been working on for other reasons, and it should reduce tokens sent back and forth a lot.


And what do you envision? Share more!