Files management – architecture

Managing files is at the core of Bolt.

Since we already had a couple of PRs/improvements which deal with files and we faced alternative implementation principles, I thought it might be worth discussing how this should be handled architecturally in order to improve further down the road.

I’m not an expert in the bolt architecture, so this is my current understanding. I welcome every correction!

Status quo

I’ll try to sketch what’s currently there:

Different stages

When talking about “files”, we actually talk about four different locations / stages:

  1. Files inside the web container. These are the files which are available to the engine that runs the application under development (vite, webpack, …). I call these the “webcontainer-files”.
  2. Files within the bolt application. The component is called “workbench” in the original code, so let’s call them “workbench files”.
  3. Then, there’s the files’ content. It makes up the context that is sent to the LLM for inference. In order to do that, the content is translated into specially tagged chat messages. Let’s call them “context files”.
  4. Files that reside outside the application, e. g. in a git repository or on the local file-system. I call them “remote files”.

Existing mechanisms

  1. Webcontainer-filesystem
    The webcontainer is accessible from ~/lib/webcontainer and provides a file-system API, e. g. await wc.fs.writeFile(entryPath, content).

  2. The files store in the workbench
    In app/lib/stores/files.ts, there’s already a mechanism that syncs workbench-files with webcontainer-files: if a file is modified inside the workbench (e. g. by editing it in the right-hand-side editor), it’s tracked using #modifiedFiles. When sending the next message, these files are diffed and the diff is sent to the LLM as a context-file.

  3. Direct manipulation of messages
    Current implementations of file-upload / sync of remote files have been done by creating special chat messages (<boltArtifact>).
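The tracking/diff mechanism described in (2) can be sketched roughly like this. This is a simplified, hypothetical model with illustrative names – the real implementation in app/lib/stores/files.ts differs in detail:

```typescript
// Simplified sketch of modification tracking as described above.
// Names (FileStore, setFromContainer, …) are illustrative, not bolt.diy APIs.
type FilePath = string;

class FileStore {
  #files = new Map<FilePath, string>(); // current content per file
  #modifiedFiles = new Map<FilePath, string>(); // original content at first edit

  // Called when the webcontainer reports a file (initial load or external change).
  setFromContainer(path: FilePath, content: string) {
    this.#files.set(path, content);
  }

  // Called when the user edits a file in the workbench editor.
  editInWorkbench(path: FilePath, content: string) {
    if (!this.#modifiedFiles.has(path)) {
      // remember the original so we can diff against it later
      this.#modifiedFiles.set(path, this.#files.get(path) ?? "");
    }
    this.#files.set(path, content);
  }

  // Before the next LLM message: collect (original, current) pairs to diff,
  // then reset the tracking.
  collectModifications(): Array<{ path: FilePath; original: string; current: string }> {
    const mods = [...this.#modifiedFiles.entries()].map(([path, original]) => ({
      path,
      original,
      current: this.#files.get(path) ?? "",
    }));
    this.#modifiedFiles.clear();
    return mods;
  }
}
```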

Why we should think about this

It seems there is no mechanism in place to keep files in sync both ways, from a remote down to the webcontainer – and vice versa.
However, we’re going to see a need to make this a stable process going forward (thinking about undo/redo, page reloads and a workflow back to the developer’s local or remote repo).

Ideas

A files-façade

If there was an API in place which made sure that files are propagated throughout the whole stack, we’d be much more flexible to add processes. I’d propose to provide a façade (maybe as a hook) which allows pushing files into the workbench (and implicitly propagates them to the context and the webcontainer).
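As a rough sketch of what such a façade could look like – all names here are purely illustrative, not existing bolt.diy APIs:

```typescript
// Hypothetical files façade: one entry point that fans a write out to all
// registered stages (workbench store, webcontainer, context, …).
interface FileTarget {
  name: string;
  writeFile(path: string, content: string): Promise<void>;
}

class FilesFacade {
  constructor(private targets: FileTarget[]) {}

  // Pushing a file once propagates it to every registered stage.
  async pushFile(path: string, content: string): Promise<void> {
    await Promise.all(this.targets.map((t) => t.writeFile(path, content)));
  }
}
```

In practice, one target could wrap the webcontainer (wc.fs.writeFile) and another the workbench files store, so callers never need to know about the individual stages.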

Use of git within the full chain

Git is made for managing different storage locations, including conflict resolution. @thecodacus introduced isomorphic-git recently, so we could check whether implementing the façade by means of git is an option: a commit could be used as a transport container for changes between workbench and webcontainer as well, not only to the remotes.
If changes had e. g. been performed in the webcontainer (which should in the future also trigger a commit via the façade), a push from the workbench to the webcontainer could trigger the well-known conflict resolution mechanisms.

A git commit could also be created after each prompt (with the prompt and meta-information about the model as commit message). This would allow for proper undo and for a transparent development workflow.
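A sketch of the commit-per-prompt idea. The commit function is injected, so in practice it could be backed by isomorphic-git’s git.commit; the message format and names here are assumptions:

```typescript
// Hypothetical: create a commit after each prompt, using the prompt and
// model metadata as the commit message.
interface PromptMeta {
  prompt: string;
  model: string;
  timestamp: Date;
}

// Build a conventional commit message: prompt as subject, metadata in the body.
function promptCommitMessage({ prompt, model, timestamp }: PromptMeta): string {
  const subject = prompt.length > 72 ? prompt.slice(0, 69) + "..." : prompt;
  return `${subject}\n\nModel: ${model}\nTimestamp: ${timestamp.toISOString()}`;
}

async function commitAfterPrompt(
  meta: PromptMeta,
  commit: (message: string) => Promise<string>, // returns the new commit sha
): Promise<string> {
  return commit(promptCommitMessage(meta));
}
```

Undo would then simply mean checking out the commit belonging to the previous prompt.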

Your ideas?

I’m very sure that at least @wonderwhy.er and @thecodacus have already had ideas themselves, so I’d very much like to hear from you!

Cheers,
Oliver


A bit more info for context:
<boltArtifact> is how LLMs communicate file changes or commands to run in the terminal to Bolt.
It’s how the AI/LLM part of Bolt works.
Currently, git import and folder import were done through that, simulating an LLM message.
It’s a minimal, fully working solution.
In the future, depending on how we will be managing context, we may refactor that.

  1. Not sure I understand the “Why we should think about this” part, here are requests for clarification:
    We do have chat-level undo/redo – do we want undo/redo for hand-edited code too?
    Reloads work too – or what do you mean by reloads?

  2. As for the overall topic:
    Yes, what I am communicating about the topic is that chat history and git commit history are very similar things.
    We need storage external to Bolt, and using GitHub for that makes a lot of sense.
    Maybe we can even store chat histories in some kind of .chats folder, so that when you import a git project it imports the chats too.

We should get there in small steps.
We currently have “pull” from a git repo.
We have push into a new repo.
We need “push into the current repo” now.
My proposal was to push into the same target from which the pull happened; if that causes a conflict, then propose creating a branch to push into as a first step.
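The proposed flow could be sketched like this. The push function is injected and would be backed by a real git client (e. g. isomorphic-git) in practice; names and the fallback behaviour are assumptions:

```typescript
// Hypothetical: push to the originally pulled ref; if that fails (e. g.
// non-fast-forward / conflict), fall back to pushing a fresh branch and
// return the ref that was actually pushed to.
async function pushOrBranch(
  push: (ref: string) => Promise<void>,
  currentRef: string,
  fallbackBranch: string,
): Promise<string> {
  try {
    await push(currentRef);
    return currentRef;
  } catch {
    // conflict with the remote: propose a new branch instead
    await push(fallbackBranch);
    return fallbackBranch;
  }
}
```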

As for the proposals to refactor how files work:
I think there is a bigger issue on the horizon that should dictate that, and that is better context management.
I suspect that we will be switching to something closer to Windsurf, where on each user request the context is constructed out of the files and chat messages relevant to the user’s current request.
And the chat itself will no longer be the store of files, at least not in full.
I think we first need to move in that direction before making changes similar to what you propose about the workbench.

On the git side, I agree – I think we should go towards integrating it all very well with git.
Each chat message is a git commit; not sure about manual changes.
Sadly, there is no such thing as automatic git conflict solving.
Though maybe after we get better git integration we can try making AI solve git conflicts :smile:


I agree with @wonderwhy.er: first we need to work on context management.

My approach would be: moving the code context from the chat history to being directly injected into the system prompt will reduce the dependency on the chat as a context store for files. The LLM will still refer to the chat for where the project is headed and the user’s instructions, but not for the state of the codebase.

Once we have that, it’s a matter of selectively pushing code files to the system prompt instead of pushing all the files in the project as context.
And that can be done in lots of ways (RAG, grep search, etc.).

But I would like to try and see if we can let the LLM decide which files to load into the context, by providing it the project tree first. The LLM would know the file tree all the time and would be able to load a number of files, unload any file, or reload another file into the context.

That way we would be leveraging the AI’s intelligence to select the context, and a RAG pipeline would not be the bottleneck for the context.

Not sure how successful it will be, but it’s something I would like to see as an experiment.
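The load/unload idea above could look roughly like this. This is a hypothetical shape, not existing code – the tag format and names are assumptions:

```typescript
// Hypothetical: the LLM sees the project tree and issues load/unload
// operations; the loaded files are compiled into the system prompt.
class LlmFileContext {
  #loaded = new Map<string, string>();

  constructor(private readFile: (path: string) => Promise<string>) {}

  // LLM requested a file to be added to its context.
  async load(path: string) {
    this.#loaded.set(path, await this.readFile(path));
  }

  // LLM decided a file is no longer relevant.
  unload(path: string) {
    this.#loaded.delete(path);
  }

  // Compiled into the system prompt before each inference call.
  toSystemPrompt(): string {
    return [...this.#loaded.entries()]
      .map(([path, content]) => `<file path="${path}">\n${content}\n</file>`)
      .join("\n");
  }
}
```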

As for file importing, for now I feel like the chat message is the cleanest way, because it’s stored in IndexedDB and can automatically be reloaded once the page reloads, automatically initializing the webcontainer file system.

Anything new would be too many architectural changes with very little benefit.
As for git push, that can easily be done by storing the commit SHA of the branch at the time it was first pulled as a ref in the chat history.

And when we push the changes, it can fetch that same commit from git using the SHA, overwrite the files from the webcontainer, and then write a new commit and push to the repo.
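The described flow, as a sketch with the git operations injected (in practice they could map to isomorphic-git calls; the interface and names are assumptions):

```typescript
// Hypothetical: check out the commit recorded at pull time, overlay the
// current webcontainer files on top, then commit and push the result.
interface GitOps {
  checkout(sha: string): Promise<void>;
  writeFiles(files: Record<string, string>): Promise<void>;
  commitAndPush(message: string): Promise<string>; // returns the new sha
}

async function pushFromWebcontainer(
  git: GitOps,
  pulledSha: string, // the ref stored in the chat history at pull time
  files: Record<string, string>,
  message: string,
): Promise<string> {
  await git.checkout(pulledSha); // restore the state from the original pull
  await git.writeFiles(files); // overwrite with current webcontainer state
  return git.commitAndPush(message);
}
```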


@thecodacus

first we need to work on context management.

This is a statement about prioritization, not about the files aspect :wink:

I agree that working on context management is really important. But exactly because of that, I think we should invest some thought in “how are we actually gonna do it”, since everything is related:

In order to provide context and manage its lifecycle, we need a proper way to address files.
Currently, the chat messages have a lot of responsibility. They know everything (which makes it easy to pass on to the LLM), but it is not easy to evolve (potential regressions incoming, as one central component gets changed in order to fulfill different new requirements).

I tried to sketch it, hoping it’s easier to understand what I mean. Here’s my understanding of how things currently look (wrt files management and interaction with the LLM):

You can clearly see that the chat component has a lot of responsibility.

I propose to externalize some responsibility to dedicated components for managing files and context, e. g. like this:

You can see that

  • files management would be responsible for keeping files in sync between all the stages. Mechanisms like a git-based flow could be added here.
  • context management would have the clear responsibility of compiling the system context apart from the messages that the user and the LLM shared.
    This would also elegantly enable things like different strategies for selecting files for the current context (e. g. using the above-mentioned search, AST- or LLM-based selections) without bloating the chat responsibilities even more.

If we think about bolt.diy as a real IDE, we need to support things like undo (rolling back state and files along with it to the previous LLM interaction), local editing of files within the right-hand-side editor, or syncing with a remote git repo. The current architecture makes it tough to handle this without a high risk of regression.

I hope I could make clear why I believe we should think about componentization here.

Cheers,
Oliver

Question: how will the LLM know what’s been uploaded, what’s in the file system and the project’s codebase when we upload from a folder, let’s say? The chat does not have that information.

I’d say the files management component needs to keep track of this.
Then, the context management should collect this information when the next message is sent to the LLM.

What I am saying, @mrsimpson, is that I will be able to comment on file management when context management is in place, because that is going to dictate the requirements for file management.

I sadly do not have time to invest into exploring context management, so I can’t invest time into file management, as it comes after.

I will look at that when those things are done.
And my point is: for me it’s too early to think about it; first, context management needs to be explored.

Then I will understand the requirements for file management. Currently, they are not known.


you said it yourself… we need to first do something about context management, like @wonderwhy.er said

@thecodacus said that after his contribution vacation he wanted to pick up files management, and tagged me for ideas.

I’m currently extensively trying out ways to build a files management component that syncs across multiple targets (primarily browser and (web-)container, but maybe also a remote file system like localhost or another remote fs).

You can find my current ideas in the monorepo I started in the files-management package.

My idea is to actively watch changes and implicitly sync them, and to use git for “stashing” merge conflicts.

I’ll be trying out how implementing this actually feels, also in the mono-repo.

I’m very open for critique and comments!

@mrsimpson just to make sure: have you seen this PR? feat: auto sync Implementation V2 by Stijnus · Pull Request #1092 · stackblitz-labs/bolt.diy · GitHub

Nope, I had not seen it. Thanks!

@private.winters.bf3 you obviously put a lot of ideas in there. Would you join this conversation?

I did not check the PR fully yet, but afaics it’s about providing an implicit two-way sync with the local file system, right?
I’d love to get your ideas on version control, reverting and remote persistence.


Hi,

No, the sync is only one-way, not two-way. I did not test it yet, but there are restrictions due to the webcontainer logic. I need to dive deeper into the code for the two-way sync.

Br,
Stijnus


This gave me the idea of using RxDB to implement a file system backed by a database with a versioning system.
I have done a similar thing with DynamoDB at work, so that might just work.

RxDB has an option to sync between systems and the cloud if we set up a replication DB, but that would be optional.

RxDB has a loooot of dependencies. I think we could get along very well with dexie.js alone.

However, I’m just now experimenting with a bottom-up multi-way synchronization between the three stages.

I think we’d very much benefit if we had a file system interface with built-in synchronization and git handling…

Edit: I just added a common interface and implementation to talk to a browser-based fs and a local fs.
This is a prerequisite for adding syncing.

It has 31 dependencies, but it offers a robust decoupling feature. dexie.js is just an IndexedDB wrapper; what I am looking for is an any-DB wrapper, where IndexedDB is just the primary.


@thecodacus I agree.

@mrsimpson And once you compile the build package, it really shouldn’t matter that much anyway. There’s still some overhead to our prebuild in the way of modules, but the compiled “build” ends up being optimized JS code under ~11MB. An included static JS library will not be optimized further during the build step (other than maybe being minified).

I don’t worry about the bundle, but about the deep, partially outdated dependencies.

But let’s not focus on that implementation detail, but rather on the bigger picture.

I would like to have an abstraction in place which allows writing from any location and just syncs it to the other locations.

You can actually write from any location with the webcontainer promise exported from the lib folder, and that’s what is being used everywhere.

Thanks, I’ll have a look at it!

My concern/goal is not the raw API, but the actual mechanism for keeping multiple file systems in sync.

What we (at least I) want is a component which implicitly syncs bi-directionally across multiple file systems.
Be it an IndexedDB fs in the browser, the fs in the webcontainer, a local path on the dev machine, or even a fs inside a docker container.
I aim at having an infrastructure with multiple targets, one of which is the primary (in our case most probably the browser fs). By simply writing to the FS of the primary, the secondary targets shall be synced. When changing files in a secondary, the primary shall be updated, conflicts detected, and afterwards the other secondaries shall be synced.
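The primary/secondary propagation described above could be sketched like this – hypothetical types, with conflict detection omitted:

```typescript
// Hypothetical: writes to the primary fan out to all secondaries; changes
// detected in a secondary are applied primary-first, then re-broadcast to
// the other secondaries.
interface SyncTarget {
  id: string;
  apply(path: string, content: string): Promise<void>;
}

class SyncOrchestrator {
  constructor(private primary: SyncTarget, private secondaries: SyncTarget[]) {}

  // A write to the primary is fanned out to all secondaries.
  async writeToPrimary(path: string, content: string) {
    await this.primary.apply(path, content);
    await Promise.all(this.secondaries.map((s) => s.apply(path, content)));
  }

  // A change in one secondary goes primary-first (where conflict detection
  // would happen), then to the remaining secondaries.
  async changeFromSecondary(sourceId: string, path: string, content: string) {
    await this.primary.apply(path, content);
    await Promise.all(
      this.secondaries
        .filter((s) => s.id !== sourceId)
        .map((s) => s.apply(path, content)),
    );
  }
}
```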

I have gotten quite far, but messed up some code with Cursor which I now need to clean up. :grimacing:

Hi guys,
just a quick update on what I’ve been doing so far:
I experimented quite a bit with different mechanisms of keeping the browser, a local file system (via the browser API) and a webcontainer fs in sync – and I have to admit: it’s not as straightforward as I thought :grimacing:
All file systems have their own quirks, and there are a lot of edge cases to be considered (e. g. empty directories, order of operations for nested items, deletions, …).

I now have a setup where I can specify a primary file system that syncs to multiple secondaries. It’s two-way capable, with the primary having a potential special role in error resolution.

See how it looks in the Readme.

This would allow connecting a local directory to a project in bolt.diy, operating with the chat to do the ideation part, and then moving to a local IDE and continuing work on the same state seamlessly.

It’s still at a very early stage. There’s missing support for

  • real mass changes (not optimized yet)
  • ignore handling
  • progress indication

The architecture allows for all of it, though; it … just needs to be implemented properly :wink:

Thoughts? Do you actually consider this useful too?

EDIT: I’m mostly done with the implementation now:

  • Supports the file systems “Browser” (Lightning-FS), “Browser-Native” (File System API) and “Webcontainer”.
  • Dedicated primary/secondary roles for deterministic resolution
  • Two-way sync between all targets, with progress reporting and error resolution
  • Ignore handling
  • Reasonable test coverage