Find keyword values from PDF [closed]
- by JukkaA
I have a lot of PDF reports that I need to index. They're mostly "text-based PDFs", not scanned images. I know they all contain an account number in a certain format, 123456AAAAA, plus some other keyword info (addresses, customer names, etc.) that is needed for indexing these files. Basically, if the file is ab.pdf, I need to create ab.txt that contains:
ACC=123456AAAA
Customer=John Doe
Date=20120808
What would be the best software/solution to generate indexing information for these?
I know there's pdftotext, but piping its output through various grep/awk commands feels like a hack. It would be nice to be able to specify an area of the PDF in which to search for the account number, and to specify the format it is in.
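To make the requirement concrete, here is a minimal Python sketch of the "area plus format" idea. It assumes Poppler's pdftotext, whose -x, -y, -W and -H options crop extraction to a rectangle and whose -f/-l options select pages. Everything else is an illustrative assumption, not something known about the actual reports: the function names, the "Customer:" label, the YYYY-MM-DD date layout, and the account pattern of six digits followed by four or five capital letters.

```python
import re
import subprocess
from pathlib import Path

# Assumed formats; adjust these to match the real report layout.
ACCOUNT_RE = re.compile(r"\b\d{6}[A-Z]{4,5}\b")
CUSTOMER_RE = re.compile(r"^[ \t]*Customer:[ \t]*(.+?)[ \t]*$", re.MULTILINE)
DATE_RE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def region_text(pdf_path, x, y, width, height, page=1):
    """Return the text inside one rectangle of one page, using pdftotext."""
    cmd = [
        "pdftotext",
        "-f", str(page), "-l", str(page),   # restrict to a single page
        "-x", str(x), "-y", str(y),         # top-left corner of the crop area
        "-W", str(width), "-H", str(height),
        "-layout", str(pdf_path), "-",      # "-" writes the text to stdout
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


def extract_fields(text):
    """Pull the index fields out of extracted text; missing fields are skipped."""
    fields = {}
    account = ACCOUNT_RE.search(text)
    if account:
        fields["ACC"] = account.group(0)
    customer = CUSTOMER_RE.search(text)
    if customer:
        fields["Customer"] = customer.group(1)
    date = DATE_RE.search(text)
    if date:
        fields["Date"] = "".join(date.groups())  # 2012-08-08 becomes 20120808
    return fields


def format_index(fields):
    """Render the fields as KEY=value lines, one per line."""
    return "".join(f"{key}={value}\n" for key, value in fields.items())


def index_pdf(pdf_path, area):
    """Write ab.txt next to ab.pdf; area is a hypothetical (x, y, width, height)."""
    pdf_path = Path(pdf_path)
    text = region_text(pdf_path, *area)
    out_path = pdf_path.with_suffix(".txt")
    out_path.write_text(format_index(extract_fields(text)), encoding="utf-8")
    return out_path
```

With this shape, something like index_pdf("ab.pdf", (0, 0, 600, 200)) would search only the top strip of the first page, where the coordinates are placeholders that would have to be measured from a sample report. This is still pdftotext plus pattern matching underneath, just with the search area and the field formats declared in one place.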