Add a Release Cadence/Testing Framework?

Hey y’all,

I’m considering creating an installer for this to make it easier for people to set up, but I don’t think it makes sense to update that on a per-PR basis: individual PRs could break something, since the only testing we seem to do on a new PR is directly related to its stated changes. I wanted to gauge interest - what do you think about creating a release system where each release is thoroughly tested before it goes out to ensure nothing is broken?

Also, does anyone know the best way to set up some level of automated testing on PRs for this kind of work? I imagine we could have a hook to some VM that hosts the system and runs through Selenium scripts to verify everything up to the actual interaction with LLMs, which would make this software much more robust imo. There are a number of questions that go with that, though, and I was wondering if anyone had any better ideas.
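To make that concrete, here’s a rough sketch of the kind of smoke test I have in mind, written with the Node selenium-webdriver package. The URL, selector, and prompt text are placeholders I made up for illustration, not the project’s actual markup or config:

```ts
import { Builder, By, until } from 'selenium-webdriver';

// Launches a browser, loads the app, and checks that the chat input renders
// and accepts text. Deliberately stops short of submitting anything to an LLM.
async function smokeTest(baseUrl: string): Promise<void> {
  const driver = await new Builder().forBrowser('chrome').build();
  try {
    await driver.get(baseUrl);
    // Placeholder selector - it would need to match the app's real chat input.
    const input = await driver.wait(until.elementLocated(By.css('textarea')), 10_000);
    await input.sendKeys('Build me a todo app');
  } finally {
    await driver.quit();
  }
}

smokeTest(process.env.APP_URL ?? 'http://localhost:5173').catch((err) => {
  console.error(err);
  process.exit(1);
});
```

On a PR hook, something like this could run headlessly inside the VM against a freshly built instance of the app.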

3 Likes

Yeah, we need one, but it will still be challenging in a number of ways.

Firstly, we will need to mock AI calls (a rough sketch of what that could look like is at the end of this post).

Secondly, it puts additional requirements on contributors, who will need to know how to work with whatever test system we use.

You mentioned Selenium; why not Playwright?

Also, for the installer, what did you plan to use?
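For the mocking point above, here’s a rough sketch of what intercepting the AI call could look like in a Playwright test. The /api/chat endpoint, the response shape, and the selectors are assumptions for illustration only, not the project’s actual API:

```ts
import { test, expect } from '@playwright/test';

test('chat UI renders a reply from a mocked model', async ({ page }) => {
  // Intercept the (hypothetical) LLM endpoint and return a canned reply,
  // so the test never talks to a real provider and stays deterministic.
  await page.route('**/api/chat', async (route) => {
    await route.fulfill({
      status: 200,
      contentType: 'application/json',
      body: JSON.stringify({ message: 'Hello from the mock model' }),
    });
  });

  await page.goto('http://localhost:5173');
  await page.getByRole('textbox').fill('Say hello');
  await page.keyboard.press('Enter');
  await expect(page.getByText('Hello from the mock model')).toBeVisible();
});
```

Cypress has a similar mechanism (cy.intercept); plain Selenium would need a proxy or the DevTools protocol to do the same.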

1 Like

At this point, I don’t know the best option for an installer and haven’t done the research necessary to choose Selenium over Playwright - I just know Selenium and hadn’t heard of Playwright until now.

And yeah, it might put additional requirements on contributors, but it also ensures the quality of the product. I think that tradeoff is worth it, but I’m certainly interested in arguments against that. Maybe @ColeMedin has an opinion on the matter?

2 Likes

@navyseal4000 I really appreciate your thoughts here and thanks @wonderwhy.er for chiming in as well!

I personally have experience with Selenium but not Playwright. I’ve enjoyed using Selenium a lot over the years but I’ve heard great things about Playwright too. Something I’d have to try myself before I can give a definitive “this is the one I’d go with”.

To your last point, I’m in favor of making contributors work with our test system as long as we make it very easy to do so, so there isn’t too much overhead. It’s the biggest headache for everyone when bugs are introduced through PRs, so I certainly want to avoid that at all costs (as much as possible, at least), even if it means a bit more work for each PR.

1 Like

@ColeMedin My hope is that there’s some sort of VM we can have hosting the system, running multiple Docker images for different operating systems. On PR creation, a set of unit and integration tests would run; on approval, a set of Selenium (or other framework) scripts would run on each major operating system and browser (this could be centralized so that new scripts only have to be written once and the framework runs them on every browser and OS); and before merging a PR with the new-feature tag, there must be some accompanying tests.
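To sketch the “write the scripts once, run them on every browser” part, assuming we went with Playwright, the browser matrix could live in a config like this (file name and settings are illustrative; the OS dimension would come from the CI side, e.g. one runner or container per operating system):

```ts
import { defineConfig, devices } from '@playwright/test';

// One set of test scripts under ./tests/e2e, run against each browser project
// below; the operating-system matrix would live in CI, not in this file.
export default defineConfig({
  testDir: './tests/e2e',
  retries: process.env.CI ? 2 : 0,
  use: {
    baseURL: process.env.APP_URL ?? 'http://localhost:5173',
  },
  projects: [
    { name: 'chromium', use: { ...devices['Desktop Chrome'] } },
    { name: 'firefox', use: { ...devices['Desktop Firefox'] } },
    { name: 'webkit', use: { ...devices['Desktop Safari'] } },
  ],
});
```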

That will definitely take a high degree of effort to set up, but it would nearly eliminate the risk of features breaking. I think this project has a cracked amount of value if done right, but, especially as an open source product, it can easily be broken, and it would be hugely beneficial to have someone work on this aspect.

I’m currently working on building out software for my own business and also have a full-time job, otherwise I’d be way more active on development here… Even though I probably can’t implement a lot of what I’m talking about due to time constraints, I hope that the ideas at least help make this as robust as possible 🙂

2 Likes

Soo… I want to share some broader context around my opinion about tests so that I don’t come across as negative out of nowhere.

Short version:

  1. Wrong tests can waste more time than not having them.
  2. Most teams no longer strive for high coverage.
  3. There’s a cost to not having tests, but there’s also a cost to having them.
  4. If a test often fails when there’s no actual bug, it’s a flaky test. I will aggressively remove such tests until they’re fixed, but I’m not going to waste time fixing them myself. Some things are simply impossible to test well and not worth the effort.
  5. I’m okay accepting PRs without tests if the person submitting the PR isn’t comfortable with them. Tests can always be added later by someone motivated to do so.
  6. I’ll add tests when I see something that frequently breaks and it’s straightforward to create a test that’s consistent and not flaky. Otherwise, I’ll always argue for removing flaky tests immediately.

At the end of the day, there are two formulas:

  1. Cost of not having a test:
    How often something breaks × (how much time it takes to test manually + how much time it takes to debug and fix those bugs).
  2. Cost of having a test:
    How much time it takes to write the test + how often it produces false positives × how much time is wasted dealing with those false positives.

If the second formula outweighs the first, then the test loses all its value. In those cases, I’ll argue against having such tests.
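To make that concrete with completely made-up numbers: if a feature breaks about once a month and each break costs roughly 10 minutes of manual checking plus an hour of debugging, not having a test runs to about 14 hours a year. If the corresponding test takes 2 hours to write but throws a false positive every other week at 30 minutes of investigation each, the test side also adds up to roughly 15 hours a year, and the test isn’t earning its keep.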

Tests should exist where we don’t want change because they make change harder and more costly. That’s fine for the battle-tested core of a project. But for new features or areas of code we know will change soon, tests are a waste of time. Who writes tests for code that will be thrown out next week?

Longer story:

I have a long history with testing. I didn’t understand it at all during university 15 years ago. Later, I took the Berkeley “Software as a Service” online course on Test-Driven Development (TDD) with Ruby on Rails, and I fell in love with their approach, especially behavior-driven testing with Cucumber/Gherkin.

It’s a beautiful system when you’re working with simple web apps and testing backend logic.

But since 2015, I’ve focused on full-stack JavaScript apps, spending most of my time on visual/UI elements. In one company I worked for from 2015-2018, we used Selenium for integration tests. That’s when my “idealistic dreams” of good testing began to fall apart.

At Prezi, we currently use unit tests, Cypress, Playwright, and Selenium. Do you know how long our Selenium tests run? Four hours. Do we block PRs for four hours? Of course not—it’s way too slow.

We also regularly run into issues with flaky tests blocking releases. It’s not uncommon for people to rebuild projects four times over two hours just because tests aren’t reliable. And that’s in a company with experienced, full-time QA Automation engineers.

So, am I against tests? No. I’m all for using them frugally, with one goal in mind: saving time.

If you can add a test that’s quick and easy to write, maintain, works consistently, and applies to a part of the codebase that’s stable, great—that saves time.

But if even one of those criteria fails, the test will waste as much time as not having a test at all. That’s my conclusion after over a decade of being interested in TDD.

1 Like

Sounds good, I get the rationale there. @wonderwhy.er, what are your thoughts on at least doing some thorough testing (manual or automated) only at release time, releasing stable versions of the artifact on a monthly basis, along with a “testing protocol” document that is expected to be updated whenever a new feature is added?

2 Likes

I would probably include tests for frequent bugs, i.e., if a PR is a bug fix for something that breaks often, then add a test.

As for tooling, one of my problems is that I am a bit removed from DevOps, so I’m not familiar with how to set up test workflows efficiently for the repo.
I have experience writing Selenium tests and working with Cypress.
Cypress is nice but has its own weird quirks.

At Prezi, one team migrated from Cypress to Playwright and is happy with it.
As for Selenium, it’s usually rather heavy to support; I’ve only ever seen companies with full-time test automation engineers invest in it.

So I can’t help with setting things up.

Back to the protocol.
All I can say is that if manual testing does not match automated test results, such tests should be refactored, and not necessarily by the person who encountered the mismatch (unless they want to).

I would also advise suggesting tests for PRs that show errors after manual testing, so we add tests for the things that people often miss when developing features.

That is the frugal balance for me.

At the same time, some types of tests are hard to write, usually the ones closer to user interactions.
It is also hard to keep tests fast…
We should avoid situations where tests run for a long time…

1 Like

I appreciate your thoughts here @wonderwhy.er and @navyseal4000! I certainly agree we only want to add tests for common bugs and things that will actually help us and not just hamper development.

At least require those tests to pass before merging anything, and then maybe have a larger test suite for what you are describing, @navyseal4000, testing oTToDev with different operating systems/configurations, but not making that a requirement? Lots to think about here haha, but I love the ideas.

1 Like