Document converter – Convert PDF files to wiki?


After some experimenting, I’ve come up with a multi-software solution on the Linux shell. It preserved formatting very well in my attempts, so I’m pleased with the results of the HTML conversion. The MediaWiki output may still need some cleanup on occasion, but overall the result is very good.

We’ll be using the following command-line tools:

  • Poppler (pdftohtml and pdftotext)
  • Pandoc
  • ftfy

These can be installed using the following commands. (Ubuntu Linux 14.04 is assumed; adjust the directions for your version of Linux. Some of these tools might also work on Windows, but I’m not providing install or usage instructions for them.)

For Poppler:

sudo apt-get install poppler-utils

For Pandoc: Installation Guide

Pandoc specifically recommends downloading the .deb and installing from it; however, if you don’t mind an older version and are willing to risk any bugs associated with it, sudo apt-get install pandoc might work just fine.

Poppler includes a number of command-line tools to extract things like images from PDFs, and they are better detailed here.

Steps:

  1. Navigate to the directory holding your PDF(s) for conversion.

  2. Make a subdirectory for the output files: mkdir dirname (no sudo is needed in a directory you own).

  3. Run the following command:

    pdftohtml -s -p -fmt png -nodrm "file.pdf" "file/file.html"
    

This command will create a lot of files, which is why we contain the results in their own directory. It will extract any images in the file, and all of those will be saved there. It also creates two HTML files: one is an outline, and the other contains all of the text with formatting very close to the original.

You can type pdftohtml -h to gain a better understanding of available parameters.

I’ve explained the parameters used here for the sake of understanding the command:

  • -s contains all of the output within one HTML document (excluding the outline).
  • -p attempts to replace PDF internal links with HTML links.
  • -fmt controls the output format of images; png and jpg are valid options.
  • -nodrm ignores digital rights management restrictions on the PDF.
  • -i ignores images. I didn’t use this, but it felt prudent to mention, as in some cases it may massively speed up your conversion.
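For convenience, the command above can be wrapped in a small shell loop so that every PDF in the directory gets its own output folder. This is just a sketch; convert_pdf_dir is my own helper name, not part of Poppler:

```shell
# Batch wrapper around the pdftohtml invocation above. One output
# directory is created per PDF, so the many generated files stay grouped.
convert_pdf_dir() {
  local f base
  for f in *.pdf; do
    [ -e "$f" ] || continue              # glob matched nothing: no PDFs here
    base="${f%.pdf}"                     # strip the .pdf extension
    mkdir -p "$base"
    pdftohtml -s -p -fmt png -nodrm "$f" "$base/$base.html"
  done
}

# Only run when pdftohtml is actually installed:
if command -v pdftohtml >/dev/null; then
  convert_pdf_dir
fi
```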

Poppler also has a pdftotext command. This was the only tool I’ve found so far that handled extraction well when a PDF had two columns of text: while other tools printed straight across from left to right, or alternated lines of text from the two columns, Poppler put the text together in the right order.

Run the following command:

pdftotext -htmlmeta "file.pdf" "file.html"

Replace “file” with the name of the PDF you want to parse and with the name of the HTML file you want your text output written to.

The -htmlmeta option creates an HTML version of the text in your PDF. (This is much less fancy than the previous command and only puts the text in pre tags.) You should see an HTML file in your directory, which you can open to check the results. Depending on the formatting of your source PDF, you may find that Poppler is variable in its effectiveness. You can try running pdftotext -h for information on other command options that may improve (or worsen) your results.
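As a sketch, the extraction plus a quick sanity check can be scripted. extract_text is my own helper name; the grep simply confirms that -htmlmeta wrapped the extracted text in pre tags as described above:

```shell
# Extract text as minimal HTML and sanity-check the result.
extract_text() {
  local pdf="$1" out="${1%.pdf}.html"
  pdftotext -htmlmeta "$pdf" "$out"
  # -htmlmeta puts the extracted text inside <pre> tags; make sure it's there
  grep -q '<pre>' "$out" && echo "extracted: $out"
}

# Only run when the tool and an input file are actually present:
if command -v pdftotext >/dev/null && [ -e file.pdf ]; then
  extract_text file.pdf
fi
```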

Pandoc is a very useful command-line program that converts an input file in just about any format to just about any other output format (MediaWiki included!). Staying inside the same directory, simply run the following command:

 pandoc file.html -f html -t mediawiki -s -o file.txt

This command simply takes the HTML file and writes it in equivalent MediaWiki format to a txt file. I’ve provided some breakdown of the parameters for basic usage in case you need to convert to another format.

  • -f The input format of the file.
  • -t The format of the output file.
  • -s Standalone adds a header and footer to the document, rather than producing a document fragment.
  • -o The name of the output file.
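Since only -t (and the output file name) changes per target, a tiny wrapper makes it easy to try other formats. to_format is my own name, not part of Pandoc; mediawiki, markdown, and rst are all standard Pandoc writer names:

```shell
# Generic HTML-to-anything wrapper around pandoc.
to_format() {
  # $1 = input HTML, $2 = pandoc writer name, $3 = output file
  pandoc "$1" -f html -t "$2" -s -o "$3"
}

# Only run when pandoc and an input file are actually present:
if command -v pandoc >/dev/null && [ -e file.html ]; then
  to_format file.html mediawiki file.txt
  to_format file.html markdown  file.md
  to_format file.html rst       file.rst
fi
```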

For more info on Pandoc, read the user guide.

It is possible you may run into an error with Pandoc, presumably caused by your file being too large. I ran into this error and some fixes can be found here.

Depending on your PDF encoding, you may find strange Unicode characters in your HTML output. This step is intended to clean up that output as accurately as possible. ftfy (short for “fixes text for you”) is a Python library with a command-line interface, and we’ll be using the command line to clean our files. Note that this step is performed before using Pandoc.

To install ftfy:

pip install --user ftfy

# or
git clone https://github.com/rspeer/python-ftfy.git
cd python-ftfy
pip install --user -e .

The directory where pip install --user places executables is typically on your search path by default, but you may need to seek extra guidance to make sure it is. Recent versions of ftfy require Python 3; I used Python 2.x with ftfy 4.1.1 for this answer.
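On Linux, pip install --user typically drops executables into ~/.local/bin. A quick way to check (the PATH line is only needed if your shell can't already find ftfy):

```shell
# Make sure pip's per-user bin directory is on PATH, then check for ftfy.
export PATH="$HOME/.local/bin:$PATH"
if command -v ftfy >/dev/null; then
  echo "ftfy is on PATH"
else
  echo "ftfy not found; re-check the pip install output"
fi
```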

Using the same directory, type the following command:

 ftfy -o file_clean.html --preserve-entities file.html

Optionally, you may include the --guess option to have ftfy guess your encoding, or --encoding if you know your encoding. This may produce better results.
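Putting the steps together (extract, then clean with ftfy, then convert with Pandoc, in that order), the whole pipeline can be sketched as one helper. pdf_to_wiki and mydoc.pdf are my own names; I used the pdftotext route here because it writes a predictable output filename:

```shell
# End-to-end pipeline: extract text, fix mojibake with ftfy, then convert
# the cleaned HTML to MediaWiki markup with pandoc.
pdf_to_wiki() {
  local pdf="$1" base="${1%.pdf}"
  pdftotext -htmlmeta "$pdf" "$base.html"                             # extract
  ftfy -o "${base}_clean.html" --preserve-entities "$base.html"       # clean
  pandoc "${base}_clean.html" -f html -t mediawiki -s -o "$base.txt"  # convert
}

# Only run when all three tools and an input file are actually present:
if command -v pdftotext >/dev/null && command -v ftfy >/dev/null \
   && command -v pandoc >/dev/null && [ -e mydoc.pdf ]; then
  pdf_to_wiki mydoc.pdf
fi
```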


