Can n8n and AI be used to split a large PDF file into many files based on content?

igrokit · December 11, 2024, 9:02pm

I have a question for you guys.
Let’s say we have scanned a bunch of invoices in one go, resulting in a rather large PDF file of 50 or so pages.
Each invoice is one, two or more pages long. Most invoices are from different companies, but some are from the same.
Let’s say I have 30 invoices in this 50 page file.

Would it be possible for n8n and any kind of AI agent to split that 50-page file into 30 files, each containing just a single invoice?

I would be very interested in something like that!

Dan

aliasfox · December 12, 2024, 1:46am

I think it could work but you’d need a big context window and likely some sort of RAG. But depending on the format, a dumb script could do this using regex and pattern recognition. No AI needed.

digitalchild · December 12, 2024, 3:16am

You would need to use an external service to split the PDFs. There are a number of APIs you could use for this. ConvertAPI is one such service.

igrokit · December 12, 2024, 9:03pm

I don’t think that splitting the PDF is the main issue.
To know where to split is what is most important…

igrokit · December 12, 2024, 9:12pm

I do not think that a dumb regex/pattern recognition script would be able to do that. Just look at the many different invoices you get personally every month (I assume). I doubt that regex (simple or complex) would manage all those different kinds of invoices.

As far as I understand “need a big context window and likely some sort of RAG” (I am a novice, please bear with me), this would mean I would have to prepare the large document with OCR, then feed the text into a RAG that would recognize from the invoice text when a new invoice starts.

Yeah, that’s exactly what I thought it would have to be.

But I hoped to get some more detailed help. Like which tools and steps would be needed within n8n.

I guess it needs a model that can cope with 50 pages of (invoice-type) text and some sort of memory that goes along with it. Then some ai tools to do the work…

That’s where my imagination (or rather knowledge) stops…

Help?

aliasfox · December 12, 2024, 9:28pm

I really think a simple script with regex could do what you want. All it needs to do is determine the page “header” (maybe footer or signature, etc.) and split them by that. Could even use some AI to determine the part that repeats (but off the top of my head, seems unnecessary). But perhaps I don’t understand the assignment or am missing something.
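A minimal sketch of that regex idea, assuming the pages have already been run through OCR and are available as one text string per page. The pattern and the function name `split_by_invoice_number` are hypothetical; the pattern assumes each invoice prints something like "Invoice No: 12345", which real-world invoices from many different vendors may not do.

```python
import re

# Hypothetical pattern: assumes an invoice number is printed as
# "Invoice No: 1001", "Invoice Number 1001" or "Invoice # 1001".
INVOICE_NO = re.compile(
    r"invoice\s*(?:no\.?|number|#)\s*:?\s*(\w[\w-]*)", re.IGNORECASE
)


def split_by_invoice_number(page_texts):
    """Group 0-based page indices into invoices.

    A page starts a new invoice when it shows an invoice number that
    differs from the current one. Pages without a detectable number
    are kept with the invoice that is currently open.
    """
    groups = []
    current_no = None
    for index, text in enumerate(page_texts):
        match = INVOICE_NO.search(text)
        number = match.group(1) if match else None
        if not groups or (number is not None and number != current_no):
            groups.append([index])
            current_no = number
        else:
            groups[-1].append(index)
    return groups


pages = [
    "ACME Invoice No: 1001",
    "continued totals",
    "Globex Invoice # 77",
    "ACME Invoice No: 1002",
]
print(split_by_invoice_number(pages))  # [[0, 1], [2], [3]]
```

Each inner list is the set of pages that would go into one output file. The heuristic only works as well as the pattern matches, so a batch with very mixed layouts would probably need more patterns or a different signal.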

If it was a web tool or app, I assume it would split the PDFs and package them up into a nice little zip file for download. Python could just spit the results to an output folder.

Any constraints, need a certain language? I feel I could knock something out with Python in probably less than an hour. A UI tool or web app would take a little longer.

If the data isn’t particularly private, or you have an example, PM me and I can give it a shot.

digitalchild · December 13, 2024, 3:06am

Yes, and I provided a suggested API service that would allow them to split the PDF before sending it to an LLM for ingesting, which is the best thing to do in this instance. I have built an invoice recognition system, so I know exactly what needs to be done.

1. Split PDF into separate documents
2. Convert to images
3. Upload to LLM to extract information

igrokit · December 15, 2024, 5:11pm

I’m interested in this part only, for the time being.

Customers scan many invoices in one go, resulting in one big file (TIF or PDF), usually containing 20-40 invoices.
I need to split that big file into one file per invoice.

Then, I may go further and get information from the files.
But splitting into smaller files containing exactly one invoice… this is the issue, or need…

Dan

frasernz · December 17, 2024, 1:49pm

It sounds like you could split everything into single pages, then feed them one by one to an LLM. Get the LLM to decide if each page is part of the previous invoice or if the page is the start of a new invoice.
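A rough sketch of that page-by-page loop, assuming each page is already available as text. The `starts_new_invoice` callable is a hypothetical stand-in for the LLM call (it receives the previous page and the current page and answers yes or no); `fake_classifier` is only a placeholder so the example runs without any API.

```python
from typing import Callable, List, Optional


def group_pages(
    pages: List[str],
    starts_new_invoice: Callable[[Optional[str], str], bool],
) -> List[List[int]]:
    """Group 0-based page indices into invoices.

    The classifier is asked once per page whether that page begins a
    new invoice. The first page always opens the first group.
    """
    groups: List[List[int]] = []
    previous: Optional[str] = None
    for index, page in enumerate(pages):
        if not groups or starts_new_invoice(previous, page):
            groups.append([index])
        else:
            groups[-1].append(index)
        previous = page
    return groups


def fake_classifier(previous: Optional[str], current: str) -> bool:
    # Placeholder for the LLM decision: a real version would send both
    # pages to a model and parse a yes/no answer from the reply.
    return "page 1" in current.lower()


pages = [
    "Invoice A page 1 of 2",
    "Invoice A page 2 of 2",
    "Invoice B page 1 of 1",
]
print(group_pages(pages, fake_classifier))  # [[0, 1], [2]]
```

Passing the previous page along with the current one gives the model some context, which may help when two consecutive invoices come from the same company. The resulting page groups can then be handed to whatever tool does the actual PDF splitting.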

digitalchild · December 19, 2024, 8:00am

This is what I would do.