Using PDFToText to convert PDFs to text
PDF to Text Conversion
This is geared towards Windows users. Pdftotext is a program for converting PDF files to text. For Windows, you can get it as part of the Xpdf open source viewer:
http://www.foolabs.com/xpdf/download.html
From the README:
"What is Xpdf?
-------------
Xpdf is an open source viewer for Portable Document Format (PDF) files. (These are also sometimes also called 'Acrobat' files, from
the name of Adobe's PDF software.) The Xpdf project also includes a PDF text extractor, PDF-to-PostScript converter, and various other
utilities.
Xpdf runs under the X Window System on UNIX, VMS, and OS/2. The non-X components (pdftops, pdftotext, etc.) also run on Win32 systems and
should run on pretty much any system with a decent C++ compiler."
The approach is simple: get the filenames of all PDF files in a path given on the command line, then call pdftotext for each one with three parameters: -nopgbrk (remove page breaks), the target PDF file, and the output file (in a text subfolder). The output filename is the same as the target filename except for the extension.
Also from the README:
"Running Xpdf
------------
To run xpdf, simply type:
xpdf file.pdf
To generate a PostScript file, hit the "print" button in xpdf, or run pdftops:
pdftops file.pdf
To generate a plain text file, run pdftotext:
pdftotext file.pdf
There are four additional utilities (which are fully described in their man pages):
pdfinfo -- dumps a PDF file's Info dictionary (plus some other useful information)
pdffonts -- lists the fonts used in a PDF file along with various information for each font
pdftoppm -- converts a PDF file to a series of PPM/PGM/PBM-format bitmaps
pdfimages -- extracts the images from a PDF file
Command line options and many other details are described in the man pages (xpdf.1, etc.) and the VMS help files (xpdf.hlp, etc.)."
Place a copy of pdftotext in your Perl bin directory (or anywhere else on your PATH) so the script can find it.
Code
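#!/usr/bin/perl
use strict;
use warnings;

# Equivalent command line:
# C:\usr\bin\xpdf>pdftotext -nopgbrk filename.pdf
my $file = "basefilename";    # base name, without the .pdf extension

# -nopgbrk leaves out the page-break characters between pages
system("pdftotext", "-nopgbrk", "$file.pdf", "$file.txt") == 0
    or die "pdftotext failed for $file.pdf (status $?)\n";
exit 0;

That is pretty much it: the basic skeleton for working with pdftotext. You can specify other options as needed.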
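The batch behaviour described earlier (take a path on the command line, convert every PDF in it, write the results to a text subfolder) could be sketched roughly as follows. This is only one possible way to do it, not the original script: the sub names find_pdfs, text_path_for and convert_dir are made up for illustration, and it assumes pdftotext is on your PATH and that the subfolder is called "text".

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename qw(fileparse);
use File::Path qw(make_path);
use File::Spec;

# Return the paths of all *.pdf files directly inside $dir, sorted by name.
# (Hypothetical helper, not part of Xpdf.)
sub find_pdfs {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open $dir: $!\n";
    my @names = grep { /\.pdf$/i && -f File::Spec->catfile($dir, $_) }
                readdir($dh);
    closedir($dh);
    return map { File::Spec->catfile($dir, $_) } sort @names;
}

# Map dir/name.pdf to dir/text/name.txt: same base name, new extension.
sub text_path_for {
    my ($pdf) = @_;
    my ($base, $dir) = fileparse($pdf, qr/\.pdf$/i);
    return File::Spec->catfile($dir, 'text', "$base.txt");
}

# Run pdftotext on every PDF in $dir; returns the number of successes.
# Any @extra_opts are passed through to pdftotext unchanged.
sub convert_dir {
    my ($dir, @extra_opts) = @_;
    make_path(File::Spec->catdir($dir, 'text'));
    my $count = 0;
    for my $pdf (find_pdfs($dir)) {
        my $txt = text_path_for($pdf);
        if (system('pdftotext', '-nopgbrk', @extra_opts, $pdf, $txt) == 0) {
            $count++;
        } else {
            warn "pdftotext failed for $pdf (status $?)\n";
        }
    }
    return $count;
}

if (@ARGV) {
    my ($dir, @opts) = @ARGV;
    printf "%d file(s) converted\n", convert_dir($dir, @opts);
} else {
    warn "usage: $0 directory [pdftotext options]\n";
}
```

Saved under a name of your choosing (say pdf2txt.pl, a made-up name), it would be run as "perl pdf2txt.pl C:\docs". Anything after the directory is handed straight to pdftotext, so "perl pdf2txt.pl C:\docs -layout" should keep the original physical layout of each page. Passing the arguments to system as a list, as the skeleton above does, avoids shell quoting problems with filenames that contain spaces.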