n8n with Crawl4AI. Possible?

Would you be able to add some additional information about the change you made? The only place I could find to add the TTL parameter was on HTTP Request1 which didn’t seem to solve the issue. The issue seems to be related to HTTP Request2 but I couldn’t find a way to add ttl to it.

You definitely seem to be onto something, though, because I checked the resources on DigitalOcean and memory pegged at 100% before the process crashed. I even bumped up to a server with 16 GB of RAM and it still crashed, so there's definitely an efficiency issue.


Literally my last week! haha…


FYI:

{
  "urls": {{ JSON.stringify($json.pages.map(p => p.loc)) }},
  "css_selector": "main[role='main'] article, main[role='main'] h1",
  "crawler_params": {
    "semaphore_count": 1,
    "page_timeout": 10000,
    "text_mode": true,
    "light_mode": true,
    "delay_before_return_html": 0.5,
    "use_persistent_context": false,
    "session_id": null,
    "enable_rate_limiting": true,
    "memory_threshold_percent": 45.0,
    "check_interval": 0.5,
    "max_session_permit": 2,
    "stream": true,
    "cache_mode": "WRITE_ONLY",
    "scraping_strategy": "LXML",
    "excluded_tags": ["script", "style", "nav", "footer"],
    "ttl": 30
  }
}
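If it helps, here's roughly what the two HTTP Request nodes are doing with that body, as a Python sketch. The endpoint paths, default port, and Bearer token header are assumptions based on the Crawl4AI Docker server setup, so adjust for your install:

import time
import requests

BASE = "http://localhost:11235"  # assumed default port of the Crawl4AI Docker container
TOKEN = "your-api-token"         # only needed if CRAWL4AI_API_TOKEN is set on the server
headers = {"Authorization": f"Bearer {TOKEN}"}

# First request: submit the crawl job (what HTTP Request1 does)
payload = {
    "urls": ["https://example.com/page-1"],
    "css_selector": "main[role='main'] article, main[role='main'] h1",
    "crawler_params": {"semaphore_count": 1, "page_timeout": 10000, "ttl": 30},
}
task = requests.post(f"{BASE}/crawl", json=payload, headers=headers).json()
task_id = task["task_id"]

# Second request: poll for the result (what HTTP Request2 does).
# A 404 "Task not found" here means this server has no record of task_id.
while True:
    result = requests.get(f"{BASE}/task/{task_id}", headers=headers).json()
    if result.get("status") in ("completed", "failed"):
        break
    time.sleep(2)

print(result.get("status"))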


So to clarify, did adding this code solve it for you and if so, where did you put it? Kind of lost on this step.

This is SO helpful, thank you so much @Mackemlad and @samir!!

Also @Mackemlad you are so welcome and I’m glad my videos have been super useful for you and others in your network. I appreciate it a lot :smiley:


No, it didn't make any difference.

@cabro This did, though… in essence I accepted it will fail, and then created a loop to have another go. As there are no state variables, I created a counter in a Google Sheet so that when the workflow restarts, it picks up from where it left off.
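Roughly, the pattern looks like this as a Python sketch, with a local file standing in for the Google Sheet counter and a hypothetical crawl_page helper for the crawl step:

import os

COUNTER_FILE = "progress.txt"  # stand-in for the Google Sheet counter cell

def load_counter() -> int:
    # Resume from where the last run left off
    if os.path.exists(COUNTER_FILE):
        return int(open(COUNTER_FILE).read().strip())
    return 0

def save_counter(i: int) -> None:
    open(COUNTER_FILE, "w").write(str(i))

def crawl_page(url: str) -> None:
    # Hypothetical: submit the URL to Crawl4AI and store the result
    pass

urls = ["https://example.com/a", "https://example.com/b"]  # full page list

start = load_counter()
for i, url in enumerate(urls[start:], start=start):
    try:
        crawl_page(url)
        save_counter(i + 1)  # persist progress after each success
    except Exception:
        break  # accept the failure; the next run resumes at index i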


Ah, I was wondering about that, because as it is now, when I run it again it will just dump duplicates into Supabase, so I would have to add a check for that.

Thanks for the info, I'll try to make similar changes.
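One way I could make the re-runs safe is to upsert keyed on the URL instead of inserting blindly, so duplicates collapse into updates. A rough sketch with the supabase-py client; the "pages" table, the "url" column, and its unique constraint are made-up names for illustration:

from supabase import create_client

# Assumes a "pages" table with a UNIQUE constraint on its "url" column
supabase = create_client("https://your-project.supabase.co", "your-service-key")

row = {"url": "https://example.com/page-1", "content": "extracted markdown here"}

# on_conflict tells Postgres to update the existing row rather than
# inserting a duplicate when the url already exists
supabase.table("pages").upsert(row, on_conflict="url").execute()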


Ok, small update. I moved the whole project to a local Docker Desktop install and it completely fixed the issue of running out of memory. The new problem is that after pulling a few hundred URLs, it errors out referencing https://seashell-app-gvc6l.ondigitalocean.app, which I'm not using anymore.

I changed HTTP Request1 and HTTP Request2 to point to my local Docker install, but there must still be something else pointing to that old Docker container that I can't find. I searched the workflow's JSON and can't find anything.

Here are the error messages.

404 - {"detail":"Task not found"}

403 - {"detail":"Not authenticated"}

getaddrinfo ENOTFOUND seashell-app-gvc6l.ondigitalocean.app

Error: Node does not have any credentials set

Any help would be greatly appreciated.

Sounds like something is maybe cached within n8n somehow? That is a SUPER weird issue!

What happens if you make a duplicate of the workflow and run it there?

Hey guys, very interesting thread! I've followed Cole's video (massive thank you) but am just starting out with n8n.
If I'm using Cole's flow, how do I add parameters so I can use either css_selector or excluded_tags, or both?

Would these be set in the HTTP Request2 node, using the send query parameters option?
I played around a little but couldn’t get it to work.


This may work:
Extract HTML Content from JSON

Connect an HTML Extract node to process the HTML within the JSON response:

  • Source Data: Select “JSON” since the HTML is embedded in a JSON field.

  • JSON Property: Specify the field in the JSON that contains the HTML (e.g., html_content).

  • Extraction Values:

    • Add a value with:

      • Key: Name the output field (e.g., extracted_data).

      • CSS Selector: Enter your css_selector here (e.g., div.content to target a specific div).

      • Return Value: Choose what to extract (e.g., “Text”, “HTML”, “Value”).

  • Options:

    • To “exclude tags” (like exclude_tags), use the Skip Selectors field under “Options”:

      • Enter a comma-separated list of CSS selectors for tags you want to exclude (e.g., script, style to skip script and style tags).
    • Other options like “Trim Values” or “Clean Up Text” can refine the output further.
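If it helps to see the same two settings outside n8n, here is a rough Python equivalent, with BeautifulSoup standing in for the HTML Extract node (the selectors are just examples):

from bs4 import BeautifulSoup

html = '<main role="main"><script>track()</script><article>Hello</article></main>'
soup = BeautifulSoup(html, "html.parser")

# "Skip Selectors": remove the excluded tags before extracting
for tag in soup.select("script, style, nav, footer"):
    tag.decompose()

# "CSS Selector": pull only the content you targeted
for node in soup.select("main[role='main'] article, main[role='main'] h1"):
    print(node.get_text(strip=True))  # Return Value: "Text"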


Thanks @getpat ! That put me on the right path and worked perfectly.

@ColeMedin thanks for your tutorial video.

I'm stuck on the "Task not found" error for the second HTTP request. Any help would be much appreciated :pray:

Here is a Loom video of the issue.

Here is my workflow's JSON.

If anyone has some wisdom or could point me in the right direction, that would be amazing.

Thanks in advance!

cc @cabro

To anyone who is stuck on the "Task not found" issue for the second HTTP request, the fix was simple for me.

I changed my server settings to run on a single instance instead of two. I assume that with two instances, the task created by the first request only exists in one instance's memory, so the status poll can land on the other instance and come back with a 404.