Archon setup - Crawl Pydantic AI Docs - 5 fail and no data in Supabase table

I cannot get past the 5 failed URLs out of 68, and no data is being piped to the Supabase table. I am not sure what is occurring; I have been through the setup several times using both Docker and Python via VS Code. Could use assistance if you are able. 4.4.25: I ran it again and got 1 failed URL.


I have built my Supabase table with the SQL instructions and have verified that the URL and key are correct. The table builds, but no data is transferred.
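For anyone hitting the same symptom (table exists but stays empty), a quick check like the sketch below should show whether the credentials are the problem or whether the crawler itself is failing. This is only a sketch: it assumes the supabase Python client and SUPABASE_URL / SUPABASE_SERVICE_KEY environment variables, and the column names come from the SQL further down. Note that the service role key bypasses row level security; the anon key cannot insert once the read-only policy is in place.

```python
# Sketch only: confirm the Supabase URL/key work and that a row can be
# inserted into site_pages at all. Env var names are assumptions.
import os
from supabase import create_client  # pip install supabase

supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_SERVICE_KEY"])

test_url = "https://example.com/connectivity-test"
supabase.table("site_pages").insert({
    "url": test_url,
    "chunk_number": 0,
    "title": "test",
    "summary": "test",
    "content": "test",
    "metadata": {},
    "embedding": [0.0] * 768,  # must match vector(768) in the schema
}).execute()

# If the insert worked, the row should come back here.
rows = supabase.table("site_pages").select("id,url").eq("url", test_url).execute()
print(rows.data)

# Clean up the test row.
supabase.table("site_pages").delete().eq("url", test_url).execute()
```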


Update: After debugging the SQL script in Supabase, I found that the following corrects the run error: "-- Enable the pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create the site_pages table if it doesn't exist
CREATE TABLE IF NOT EXISTS site_pages (
  id bigserial PRIMARY KEY,
  url varchar NOT NULL,
  chunk_number integer NOT NULL,
  title varchar NOT NULL,
  summary varchar NOT NULL,
  content text NOT NULL,
  metadata jsonb NOT NULL DEFAULT '{}'::jsonb,
  embedding vector(768),
  created_at timestamp WITH TIME ZONE DEFAULT timezone('utc'::text, now()) NOT NULL,
  UNIQUE(url, chunk_number)
);

-- Create an index for vector similarity search if it doesn't exist
CREATE INDEX IF NOT EXISTS site_pages_embedding_idx
ON site_pages USING ivfflat (embedding vector_cosine_ops);

-- Create an index on metadata for faster filtering if it doesn't exist
CREATE INDEX IF NOT EXISTS idx_site_pages_metadata
ON site_pages USING gin (metadata);

-- Create or replace the function to search for documentation chunks
CREATE OR REPLACE FUNCTION match_site_pages (
  query_embedding vector(768),
  match_count int DEFAULT 10,
  filter jsonb DEFAULT '{}'::jsonb
) RETURNS TABLE (
  id bigint,
  url varchar,
  chunk_number integer,
  title varchar,
  summary varchar,
  content text,
  metadata jsonb,
  similarity float
)
LANGUAGE plpgsql
AS $$
#variable_conflict use_column
BEGIN
  RETURN QUERY
  SELECT
    id,
    url,
    chunk_number,
    title,
    summary,
    content,
    metadata,
    1 - (site_pages.embedding <=> query_embedding) AS similarity
  FROM site_pages
  WHERE metadata @> filter
  ORDER BY site_pages.embedding <=> query_embedding
  LIMIT match_count;
END;
$$;

-- Enable RLS on the table (this is safe to run multiple times)
ALTER TABLE site_pages ENABLE ROW LEVEL SECURITY;

-- Create a policy for public read access (replaces if it exists)
CREATE POLICY "Allow public read access"
ON site_pages
FOR SELECT
TO public
USING (true);"

The change that matters is the "CREATE OR REPLACE FUNCTION match_site_pages (" line; that is where the "site_pages" error was occurring.
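Once that script runs cleanly, the function itself can be smoke-tested directly, without the crawler in the loop. Rough sketch, assuming the supabase Python client and the same environment variables as above; a zero vector is enough to confirm the function exists and the dimensions line up:

```python
# Sketch: call match_site_pages directly to confirm it was created and that
# the parameter names and vector dimension match the SQL above.
import os
from supabase import create_client

supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_SERVICE_KEY"])

response = supabase.rpc(
    "match_site_pages",
    {
        "query_embedding": [0.0] * 768,  # must be 768-dimensional to match vector(768)
        "match_count": 3,
        "filter": {},
    },
).execute()

print(response.data)  # an empty list just means no rows have been crawled yet
```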

After crawling each page and extracting content, I still had the same 5 failed URLs, and the data was not being saved to Supabase.

Update: See attached. Only 1 failed URL now, and I am able to upload doc chunks to Supabase.


Glad that is working now! The randomly failed chunks are either Crawl4AI having a small blip or an OpenAI rate limit issue. You could always scroll through the logs and see!

What did you end up doing to fix it completely?


To get to where I am (which is progress), I had to mirror the environment variables I saw someone else use. Unfortunately, using my own locally run Ollama install, I was not getting any 'chunks' into Supabase. I am still failing one URL each time I retry. I do see rate limits being a possible limitation - good suggestion. This is the log line that repeats consistently: "[23:52:36] Error processing https://ai.pydantic.dev/contributing/: Error fetching https://ai.pydantic.dev/contributing/: we should not get here!".
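One thing worth checking with a locally run Ollama: the embedding model has to return vectors with exactly the 768 dimensions the site_pages schema declares, or the Supabase inserts will fail. A rough check, assuming Ollama on its default port and the nomic-embed-text model (swap in whatever embedding model the environment variables actually point at):

```python
# Sketch: confirm the local Ollama embedding model returns 768-dimensional
# vectors, matching the vector(768) column in site_pages. Model name and
# port are assumptions.
import requests

resp = requests.post(
    "http://localhost:11434/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": "dimension check"},
    timeout=30,
)
resp.raise_for_status()
embedding = resp.json()["embedding"]

print(len(embedding))  # should print 768; anything else will break the inserts
```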

We should be able to integrate Grok. I am a Super-Groker, and it is just wisdom to attach oneself to something that is truth-seeking.

The variance in results leads me to believe it is an LLM error associated with rate limits or connection problems. Connection problems would not be an issue on my end.

Attached are the logs I have based on 1 URL failure.



It is not fixed yet :frowning:

This is my Desktop Docker log:
Archon Crawling process.json (3.2 MB)
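Since the export is 3.2 MB, it is easier to group the errors than to scroll through them. A small sketch that treats the file as plain text (the exact structure of the Docker export isn't specified, so this just counts repeated error lines):

```python
# Sketch: group and count error lines in the exported Docker log so the
# most frequent failure shows up first. Treats the file as plain text.
from collections import Counter

errors = Counter()
with open("Archon Crawling process.json", encoding="utf-8", errors="replace") as f:
    for line in f:
        if "Error" in line:
            errors[line.strip()[:120]] += 1  # truncate so repeats group together

for message, count in errors.most_common(10):
    print(count, message)
```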

I looked up the error code - you were right. I exceeded my quota: "You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors." The repeated log lines:
2025-04-06 17:54:59 Error getting embedding: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}
2025-04-06 17:54:59 Error getting embedding: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}
2025-04-06 17:54:59 Error getting embedding: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}
2025-04-06 17:55:00 Error getting embedding: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}
2025-04-06 17:55:00 Error getting embedding: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}
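For what it's worth, transient 429 rate limits can be smoothed over by retrying the embedding call with exponential backoff, along the lines of the sketch below (the model name and the dimensions parameter are illustrative). It will not help with insufficient_quota, though; that code means the OpenAI account has run out of credit, so only adding billing/credit or switching providers fixes it.

```python
# Sketch: retry embedding calls on 429 rate limits with exponential backoff.
# This does NOT fix 'insufficient_quota' errors, which are a billing issue.
import time
from openai import OpenAI, RateLimitError

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def get_embedding_with_retry(text: str, retries: int = 5) -> list[float]:
    delay = 1.0
    for attempt in range(retries):
        try:
            response = client.embeddings.create(
                model="text-embedding-3-small",  # example model
                input=text,
                dimensions=768,  # match the vector(768) column
            )
            return response.data[0].embedding
        except RateLimitError:
            if attempt == retries - 1:
                raise
            time.sleep(delay)
            delay *= 2  # back off before the next attempt
    raise RuntimeError("unreachable")
```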

Archon7.0

I think at this point, if there were a way to integrate an LLM like Grok where I don't have rate limits, it might work for me.

You can integrate any OpenAI-compatible API, so you could use Groq! Or use another provider like OpenRouter.
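Something along these lines should work with either of them - the base URLs are the documented ones, the model names are just examples, and the env var names are whatever you use locally:

```python
# Sketch: point the OpenAI client at an OpenAI-compatible provider by
# changing base_url and api_key. Model names are examples.
import os
from openai import OpenAI

groq = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],
)

openrouter = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

reply = groq.chat.completions.create(
    model="llama-3.3-70b-versatile",  # example Groq model
    messages=[{"role": "user", "content": "Hello"}],
)
print(reply.choices[0].message.content)
```

One caveat: that swaps the chat/reasoning model, but embeddings may still need a provider that actually exposes an embeddings endpoint (OpenAI, or local Ollama), so the embedding quota issue has to be solved separately.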


Update 4.15.25: Many pages failed because they could not be found - FYI.


Okay this is really good to know, thanks for pointing that out!