Code Quality of the fork

mrsimpson · November 21, 2024, 12:06pm

Dear fellows,

first off, as this post might sound impolite: It is not intended to
I appreciate all contributor’s efforts a lot! I do think that mutli-file-code-assistants like Bolt.new will change the way software is evolving. I love that Stackblitz published its approach with MIT license and appreciate that @ColeMedin and others have started to sort-of democratize it by opening it for other LLMs.

However, to be honest, it seems that the overall quality of the fork is overall a bit low.
You can see this as a user in many places (such as the tag-names like <boltFileModification> which are rendered into the template or by the not verbose error handling when an LLM does not respond properly).
But also looking at the code and the development flow, there are some reddish flags which indicate that the product is far from being mature:

Some samples can be seen in this single screenshot

You can see that

basic linting has not been done (whitespaces at the end of the lines)
The comments also don’t match the linting rules. This presumably because they generated by Claude / bot.new when modifying itself.

Also, the overall mechanism of loading new LLMs is very … simple and relies on a couple of files being added. Usually, I’d expect some registry-mechanism here to allow for modification free extensibility when e. g. adding new providers.

What’s even more severe when looking at the overall diff to the (now unmaintained) stackblitz/bolt.new upstream is, that it seems there’s no architecture in place which is optimized for minimal invasive modifications.

Also, when checking the commit history on Github, this leaves the impression of a not mature process

We could discuss about linear history or merge commits, but at least commits from PRs should be squashed when merged, imho.
Also, there’s not CICD in place which checks for basic things like linting (see above). This leads to huge PRs which are tricky to review.
There is no mechanism to prevent regression with automated tests (ok, the upstream doesn’t provide this either, to be fair ).
And I could go on…

After all, this leaves me (as a software developer who’s been doing this for some time for a living) a bit in limbo:
Is this fork a solid foundation for building can contributing to a multi-file LLM-workbench or is this just a playground an I’d be better off building something with a solid foundation instead?

I’d love to read from the maintainers (particularly the most active @wonderwhy.er and of course @ColeMedin ) how you look at that.

From my experience, all these things are quite easy to set up, if you do it right from the beginning. All efforts for quality slow down progress a bit, but imho it pays off in the long run. So it would be great to know if you aim at a long running project afterall and how you plan to set things up.
This 3.5k it definitely shows how huge the potential is, so investing into that right from the “beginning” might be worth it.

Looking forward to your responses!

Oliver

wonderwhy.er · November 21, 2024, 12:32pm

From my side, I am for turning on linintg/typechecks in PRs.
Adding tests was discussed elsewhere and I am looking into it too.
And I am for small PRs, single feature instead of large monsters.
We all are doing this in our free time so large PRs will stay in review for very long time.

I don’ agree with you on number of things but those are subjective, like weather to squash or not. Its up to wider contributor base to say what they prefer.

On other hand, I am for keeping things simple for people to contribute, even if they are less experienced with some of it. Can some of this stop people from contributing too?

That are my two cents on this question. We will be improving along the way.

BTW considering you are experienced, may be you do a PR on enabling github actions for linting and typecheck?

About tests, there is thread about Selenium, I am not sure and would look at Playwright as alternative. Its possilbe to run it for free for public GitHub repos. But its end to end tests.
If we want unit tests too then we need more.

wonderwhy.er · November 21, 2024, 3:17pm

Took a look at things and we already have github workflows for typecheck/lint/test

github.com

coleam00/bolt.new-any-llm/blob/main/.github/workflows/ci.yaml

name: CI/CD

on:
  push:
    branches:
      - master
  pull_request:

jobs:
  test:
    name: Test
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Setup and Build
        uses: ./.github/actions/setup-and-build

      - name: Run type check

This file has been truncated. show original

I was just not aware that first time contributors PRs require approval from maintainer to run those workflows + we need to uncomment lint part, will try to do it

wonderwhy.er · November 21, 2024, 3:22pm

Ok, we need branch protection rules to block merging without actions running. I don’t think I can do it. @ColeMedin what do you think about enabling rules to allow merging only if checks run correctly?

mahoney · November 21, 2024, 6:33pm

Eh, my initial reply was getting too verbose and sounded defensive so I’ll start over… You are providing (appreciated) objective feedback which is largely correct. CI/CD & DevEx are critical, tooling isn’t decided on/configured such as linting rules yet, you’ve hit most of the major points. Some points do sound opinionated, or maybe luxuries from doing greenfield work, but are still valid as things to address.

While I don’t agree that these are things that need to happen right now compared to some core interface work to make up for features that didn’t make it into this fork, they do need some focus alongside and immediately after those interface priorities if we’re going to get to an extremely important priority: Testing, PR merges/promotions through dev/staging/prod, and fully functional CI/CD steps/gates to ensure code quality. In my mind, the key is to quickly stabilize, then immediately harden, and begin to work on the output of the roadmap which is just recently coming to fruition. Right now it’s simply a function of contributors’ available time, plus the steady stream of support questions and incoming issues/pull requests.

To answer your most important question, this project is here to stay as far as I’m concerned. The amount of work to stand up a community around this fork on behalf of @ColeMedin and everyone here has been an incredible gift versus just deciding to fork it and leave everyone to their own devices. None of this I’m saying is prescriptive, and I by no means speak for the core team or the community at all. What I would love to see is folks willing to dig into some of these improvements that are high priority for long-term project stability, but as you can tell very much in low demand since so many community members have high expectations of functionality coming from other GenAI code solutions. I’m going to bookmark this thread with the intention of collecting it into a general issue post together focused on DevOps-ish needs. If anyone just really loves that work in the meantime and does it before then, I’d be extremely grateful not only as a team member but also a consumer dev in the community. I do love it, but we all do occasionally have to swap over to other work that pays the bills.

Welcome, and thanks for being relatively gentle with the feedback, it’s appreciated. Looking forward to working with you to help improve the dev experience in general. (Sorry that I was still too verbose, haha)

ColeMedin · November 21, 2024, 6:55pm

Thank you @mrsimpson for bringing this up, I really do appreciate it! And thank you @mahoney and @wonderwhy.er for your perspectives on balancing immediate needs with long-term sustainability and other thoughts.

I’m actually actively working with open source experts (some potential partnerships too!) on establishing exactly these kinds of best practices and infrastructure. My vision aligns completely with what’s been discussed here – we need robust CI/CD, consistent code quality standards, and a solid foundation for scaling the project and community.

A few specific thoughts:

The emphasis on keeping things simple for contributors while improving quality resonates strongly with me. We can certainly add in guardrails that help rather than hinder contribution, mostly in my mind through giving contributors confidence their code is working and fits to our standards.
I agree that testing strategy needs careful consideration. Bolt.new is using Vitest for their private commercial repo, I’m thinking we should for sure use that and then Playwright for some E2E tests, but not going too overboard there.
I agree @mahoney that we want to keep getting through core features we really need for oTToDev, at least in parallel with getting this foundation set up for the repo.

I’d love to collaborate on implementing these improvements. Once I’ve finished consulting with the OSS experts on specific approaches, I’ll share concrete proposals for:

CI/CD pipeline configuration
Testing strategy and initial implementation
Code quality tooling that supports rather than constrains contributors
Documentation around development practices

And I’m open to any suggestions! As @wonderwhy.er mentioned, if you want to contribute to help with these things @mrsimpson I am all for it!

mrsimpson · November 21, 2024, 7:31pm

Thanks a ton for all of you for the serious and realistic responses!

It’s great to hear that you are aware of the issues resulting from it and are actively working on these topics

I do have some ideas how to approach this step by step, as @wonderwhy.er said preferably in small PRs

Here’s a brief sketch with the aim to keep contribution barriers low:

Activate linting in order to minify diffs on PRs

Lint-fix the whole application. Add exceptions in the linter-config for everything that can’t be automatically fixed. Then, work one-by-one to remove the linter exceptions. WIP PR
Add a pre-commit-hook to lint the application (locally on each dev machine)
Enable the pipeline for all PRs to check-lint on-PR

Implement E2E-Testing

I’d recommend to stick to a very high-level testing as long as the development in the component is active. Else, we’ll be having a lot of rework on unit-tests.

Add a Mock-LLM which is added in NODE_ENV === 'test'
Define core-interactions and record them for playwright
enable playwright in CI

Rework code to enable better extensibility

Define mechanisms we’d like to use for extending the core (also depends on whether it’s expected that Stackblitz publishes more updates or not, do you @ColeMedin have insights into that?)
Start refactoring. Add Unit-tests for each extensibility mechanism (vitest)
Developer-documentation on-the-go

Community engagement

I’m not an expert on this end, looking forward to what @ColeMedin brings in from his contacts.

WDYT, does this make sense?

We could create separate threads to keep track of these activities

wonderwhy.er · November 22, 2024, 7:45am

Agree to everything. We also discussed before in core team about making it extendable, my feeling is that it comes later.
We need to stabilize and make core useful first.|

My wish for this to become something like Express.js is for servers.
Bolt/oTToDev as NPM package you install and configure, extend, add plugins. Add providers, turn on and off the features and so on.

A lot of work do here.
But even base oTToDev is not yet working well and misses basic features like export/import and git integration.

I first want to get there.
Look in to lint/typecheck tests along the way.

And then start exploring how to modularize and introduce plugin architecture for different parts of this thing.

mahoney · November 22, 2024, 7:48am

Thanks for taking the time to break this out @mrsimpson, huge help. It would be great to break these out as they come into focus as separate threads to pin down best practices.

mrsimpson · November 22, 2024, 8:31pm

I’ll focus on the linting first, as this comes at relatively low cost preventing a lot of work on commit-merging (just added #378 which prevents linting-regression).

It’s great to see those ideas appreciated. I’ll break them in separate threads prior to starting any work on tests or architectural ideas.

mahoney · November 22, 2024, 10:16pm

Killer, thank you. Linting error indicators are easy to turn off in VSCode, but I’ve been using Zed for its collaboration features lately and it is a total jerk about everything lint.

Can we all not be opinionated and use spaces for tabs? I don’t care if it’s 2 or 4, haha. Things I want to die forever:

Whitespace complaints before certain code lines
Complaints about extra whitespace which we can just handle in code reviews by ignoring it
That’s really it.