Using PDFToText to convert PDFs to text
PDF to Text Conversion
This is geared towards Windows users. Pdftotext is a program for converting PDF files to text. For Windows, you can get it as part of the Xpdf open source viewer:
http://www.foolabs.com/xpdf/download.html
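As a rough illustration, pdftotext is run from the command line as `pdftotext [options] PDF-file [text-file]`. The Perl sketch below wraps that call. It assumes pdftotext (pdftotext.exe on Windows) is on your PATH; the helper name `pdftotext_command` and the script name in the usage note are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the argument list for a pdftotext call.
# -layout asks pdftotext to keep the original physical layout as far as it can.
sub pdftotext_command {
    my ($pdf, $txt, $keep_layout) = @_;
    my @cmd = ('pdftotext');
    push @cmd, '-layout' if $keep_layout;
    push @cmd, $pdf, $txt;
    return @cmd;
}

# Run the conversion only when both file names are given on the command line.
if (@ARGV == 2) {
    my @cmd = pdftotext_command($ARGV[0], $ARGV[1], 1);
    # The list form of system() avoids shell quoting problems with spaces in paths.
    system(@cmd) == 0 or die "pdftotext failed: $?\n";
    print "Wrote $ARGV[1]\n";
}
```

Saved as, say, pdf2txt.pl, it could be run as `perl pdf2txt.pl report.pdf report.txt`, where both file names are placeholders.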
From the README:
"What is Xpdf?
-------------
Xpdf is an open source viewer for Portable Document Format (PDF) files. (These are also sometimes also called 'Acrobat' files, from
the name of Adobe's PDF software.) The Xpdf project also includes a PDF text extractor, PDF-to-PostScript converter, and various other
Using the HTML::Strip Perl extension
Stripping HTML/XML/SGML
This example demonstrates how to use the HTML::Strip Perl extension to strip HTML markup from text.
HTML::Strip strips HTML-like markup from text in a very quick and brutal manner, so depending on the complexity of your markup it may not remove all of the HTML perfectly. You can also use the extension
to strip XML or SGML from text.
Code
#!/usr/bin/perl
use HTML::Strip;
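The listing above stops after loading the module. A minimal sketch of how the rest might look is given here, using the `new`, `parse`, and `eof` methods that HTML::Strip provides. The helper name `strip_markup` and the sample string are invented for illustration and are not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Strip;

# Return the text of $html with HTML-like markup removed.
sub strip_markup {
    my ($html) = @_;
    my $hs   = HTML::Strip->new();
    my $text = $hs->parse($html);
    $hs->eof;    # reset the parser state in case the object is reused
    return $text;
}

# Made-up sample input.
my $sample = '<p>Hello, <b>world</b>!</p>';
print strip_markup($sample), "\n";
```

Exact whitespace in the result depends on the module's options, so check the output against your own markup before relying on it.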
This is based on a script from the article 'Add RSS feeds to your Web site with Perl XML::RSS':
http://articles.techrepublic.com.com/5100-6228_11-5487340.html
The original script assumed that the RSS news feed was located on your own server. To get
around this limitation, use LWP to fetch the contents of the remote file, save it to a file on your server,
and then parse that file.
#!/usr/bin/perl -w
#use strict;
use XML::RSS;
use LWP::Simple;
#use Data::Dumper;
my $r = new XML::RSS;
$r->parse( get 'http://onaje.com/rss.xml' );
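The fragment above hands the fetched content straight to the parser. One way the save-then-parse variant described in the text might look is sketched here, using `getstore` from LWP::Simple and `parsefile` from XML::RSS. The helper names `fetch_feed` and `feed_titles`, and the choice to print only item titles, are assumptions for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;    # exports getstore() and is_success()
use XML::RSS;

# Download $url into $file; die if the HTTP request did not succeed.
sub fetch_feed {
    my ($url, $file) = @_;
    my $status = getstore($url, $file);
    die "Could not fetch $url (HTTP status $status)\n"
        unless is_success($status);
    return $file;
}

# Parse a saved feed and return the titles of its items.
sub feed_titles {
    my ($file) = @_;
    my $rss = XML::RSS->new;
    $rss->parsefile($file);
    return map { $_->{title} } @{ $rss->{items} };
}

# Usage: perl feed.pl URL LOCAL_FILE
if (@ARGV == 2) {
    my $file = fetch_feed(@ARGV);
    print "$_\n" for feed_titles($file);
}
```

Keeping a local copy means the page can still be built from the last saved feed if the remote server is unreachable, at the cost of managing that file yourself.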