n8n with Crawl4AI. Possible?

Would you be able to add some additional information about the change you made? The only place I could find to add the TTL parameter was on HTTP Request1 which didn’t seem to solve the issue. The issue seems to be related to HTTP Request2 but I couldn’t find a way to add ttl to it.

You definitely seem to be onto something, though, because I checked the resources on DigitalOcean and memory pegged at 100% before the process crashed. I even bumped up to a server with 16 GB of RAM and it still crashed, so there's definitely an efficiency issue.


Literally my last week! haha…


FYI:

{
  "urls": {{ JSON.stringify($json.pages.map(p => p.loc)) }},
  "css_selector": "main[role='main'] article, main[role='main'] h1",
  "crawler_params": {
    "semaphore_count": 1,
    "page_timeout": 10000,
    "text_mode": true,
    "light_mode": true,
    "delay_before_return_html": 0.5,
    "use_persistent_context": false,
    "session_id": null,
    "enable_rate_limiting": true,
    "memory_threshold_percent": 45.0,
    "check_interval": 0.5,
    "max_session_permit": 2,
    "stream": true,
    "cache_mode": "WRITE_ONLY",
    "scraping_strategy": "LXML",
    "excluded_tags": ["script", "style", "nav", "footer"],
    "ttl": 30
  }
}
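If it helps, here's roughly what the two HTTP Request nodes are doing with that body, as a Python sketch. The endpoint paths, default port, and Bearer token header are assumptions based on the Crawl4AI Docker server setup, so adjust for your install:

import time
import requests

BASE = "http://localhost:11235"  # assumed default port of the Crawl4AI Docker container
TOKEN = "your-api-token"         # only needed if CRAWL4AI_API_TOKEN is set on the server
headers = {"Authorization": f"Bearer {TOKEN}"}

# First request: submit the crawl job (what HTTP Request1 does)
payload = {
    "urls": ["https://example.com/page-1"],
    "css_selector": "main[role='main'] article, main[role='main'] h1",
    "crawler_params": {"semaphore_count": 1, "page_timeout": 10000, "ttl": 30},
}
task = requests.post(f"{BASE}/crawl", json=payload, headers=headers).json()
task_id = task["task_id"]

# Second request: poll for the result (what HTTP Request2 does).
# A 404 "Task not found" here means this server has no record of task_id.
while True:
    result = requests.get(f"{BASE}/task/{task_id}", headers=headers).json()
    if result.get("status") in ("completed", "failed"):
        break
    time.sleep(2)

print(result.get("status"))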


So to clarify, did adding this code solve it for you and if so, where did you put it? Kind of lost on this step.

This is SO helpful, thank you so much @Mackemlad and @samir!!

Also @Mackemlad you are so welcome and I’m glad my videos have been super useful for you and others in your network. I appreciate it a lot :smiley:


No, it didn't make any difference.

@cabro This did, though… in essence I accepted it will fail, and then created a loop to have another go. As there are no state variables, I created a counter in a Google Sheet so that when the workflow restarts, it picks up from where it left off.
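Roughly, the pattern looks like this as a Python sketch, with a local file standing in for the Google Sheet counter and a hypothetical crawl_page helper for the crawl step:

import os

COUNTER_FILE = "progress.txt"  # stand-in for the Google Sheet counter cell

def load_counter() -> int:
    # Resume from where the last run left off
    if os.path.exists(COUNTER_FILE):
        return int(open(COUNTER_FILE).read().strip())
    return 0

def save_counter(i: int) -> None:
    open(COUNTER_FILE, "w").write(str(i))

def crawl_page(url: str) -> None:
    # Hypothetical: submit the URL to Crawl4AI and store the result
    pass

urls = ["https://example.com/a", "https://example.com/b"]  # full page list

start = load_counter()
for i, url in enumerate(urls[start:], start=start):
    try:
        crawl_page(url)
        save_counter(i + 1)  # persist progress after each success
    except Exception:
        break  # accept the failure; the next run resumes at index i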


Ah, I was wondering about that, because as it is now, when I run it again it will just dump duplicates into Supabase, so I would have to add a check for that.

Thanks for the info, I'll try to make similar changes.
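One way I could make the re-runs safe is to upsert keyed on the URL instead of inserting blindly, so duplicates collapse into updates. A rough sketch with the supabase-py client; the "pages" table, the "url" column, and its unique constraint are made-up names for illustration:

from supabase import create_client

# Assumes a "pages" table with a UNIQUE constraint on its "url" column
supabase = create_client("https://your-project.supabase.co", "your-service-key")

row = {"url": "https://example.com/page-1", "content": "extracted markdown here"}

# on_conflict tells Postgres to update the existing row rather than
# inserting a duplicate when the url already exists
supabase.table("pages").upsert(row, on_conflict="url").execute()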


Ok, small update. I moved the whole project to a local Docker Desktop install and it completely fixed the issue of running out of memory. The new problem is that after pulling a few hundred URLs, it errors out referencing https://seashell-app-gvc6l.ondigitalocean.app, which I'm not using anymore.

I changed HTTP Request1 and HTTP Request2 to point to my local Docker install, but there must still be something else pointing to that old Docker container that I can't find. I searched the workflow's JSON and can't find anything.

Here are the error messages.

404 - {"detail":"Task not found"}

403 - {"detail":"Not authenticated"}

getaddrinfo ENOTFOUND seashell-app-gvc6l.ondigitalocean.app

Error: Node does not have any credentials set

Any help would be greatly appreciated.

Sounds like something is maybe cached within n8n somehow? That is a SUPER weird issue!

What happens if you make a duplicate of the workflow and run it there?

Hey guys, very interesting thread! I've followed Cole's video (massive thank you) but am just starting out with n8n.
If I'm using Cole's flow, how do I add parameters so I can use either css_selector or excluded_tags, or both?

Would these be set in the HTTP Request2 node, using the send query parameters option?
I played around a little but couldn’t get it to work.


This may work:
Extract HTML Content from JSON

Connect an HTML Extract node to process the HTML within the JSON response:

  • Source Data: Select “JSON” since the HTML is embedded in a JSON field.

  • JSON Property: Specify the field in the JSON that contains the HTML (e.g., html_content).

  • Extraction Values:

    • Add a value with:

      • Key: Name the output field (e.g., extracted_data).

      • CSS Selector: Enter your css_selector here (e.g., div.content to target a specific div).

      • Return Value: Choose what to extract (e.g., “Text”, “HTML”, “Value”).

  • Options:

    • To “exclude tags” (like exclude_tags), use the Skip Selectors field under “Options”:

      • Enter a comma-separated list of CSS selectors for tags you want to exclude (e.g., script, style to skip script and style tags).
    • Other options like “Trim Values” or “Clean Up Text” can refine the output further.
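If it helps to see the same two settings outside n8n, here is a rough Python equivalent, with BeautifulSoup standing in for the HTML Extract node (the selectors are just examples):

from bs4 import BeautifulSoup

html = '<main role="main"><script>track()</script><article>Hello</article></main>'
soup = BeautifulSoup(html, "html.parser")

# "Skip Selectors": remove the excluded tags before extracting
for tag in soup.select("script, style, nav, footer"):
    tag.decompose()

# "CSS Selector": pull only the content you targeted
for node in soup.select("main[role='main'] article, main[role='main'] h1"):
    print(node.get_text(strip=True))  # Return Value: "Text"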


Thanks @getpat ! That put me on the right path and worked perfectly.

@ColeMedin thanks for your tutorial video.

I'm stuck on the "Task not found" error for the second HTTP request. Any help would be much appreciated :pray:

Here is a Loom video of the issue.

Here is my workflow's JSON.

If anyone has some wisdom or could point me in the right direction, that would be amazing.

Thanks in advance!

cc @cabro

To anyone who is stuck on the "Task not found" issue for the second HTTP request, the fix was simple for me.

I changed my server settings to run on a single instance instead of two. I assume that with two instances, the task created by the first request only exists in one instance's memory, so the status poll can land on the other instance and come back with a 404.