Analyze the content of a PDF file

A friend of mine was tasked with spotting English words in the annual reports of various German companies. These reports are freely downloadable as PDF documents.

On Linux (and maybe OS X?), nothing more than two one-liners is needed: the first extracts the text and saves it as a plain text file (I found it by searching for “linux analyze the content of a pdf file”)…

pdftotext -layout filename.pdf filename.txt

…and a wee bit of Perl magic to count the occurrences of every word and save the result to another file (I found it by searching for “perl list words in file and count”)…

perl -nle "print for /(\\w['\\wÄäÖöÜüß\\n\\-\\.]*)/g" filename.txt | sort | uniq -c | sort -rn | tee filename_words.txt

…which can easily be opened as a table, for instance in a spreadsheet, using whitespace as the delimiter, to pick out the English words by hand.
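Since uniq -c pads the counts with leading spaces, a whitespace-delimited import may produce a few empty columns in front of the numbers. A small sketch of one way around this is to rewrite each line as count, tab, word before opening the file as a table. The counts_to_tsv function and the filename_words.tsv name are my own inventions here, not part of any tool.

```shell
# Turn "   count word" lines (the uniq -c layout) into "count<TAB>word".
# Words contain no whitespace, so the first two fields are the whole line.
counts_to_tsv() {
    awk '{ print $1 "\t" $2 }'
}

# On the real data (hypothetical output name):
#   counts_to_tsv < filename_words.txt > filename_words.tsv

# Quick demonstration on two hand-written lines.
printf '      3 Cashflow\n      2 und\n' | counts_to_tsv
```

The tab-separated file can then be imported with the tab character as the only delimiter.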

Published by Jeoffrey Bauvin