Cookbook Chapter 7: Media Processing

From CollectiveAccess Documentation

Media Processing

I can't upload PDFs

Problem:
Uploads of PDF files fail with the error message "Unknown file type not accepted by media".
Solution:
The error is caused by an inability to identify your uploaded file as a PDF. CollectiveAccess' PDF plugin employs one of three possible identification methods:
  1. GraphicsMagick - if GraphicsMagick is installed and configured, it will be used to parse and identify PDF files.
  2. ImageMagick - if GraphicsMagick is not installed, and ImageMagick is available, it will be used to parse and identify PDF files.
  3. Zend_PDF - if neither GraphicsMagick nor ImageMagick is available, the Zend_PDF software library will be used to parse and identify PDF files.

Zend_PDF has the advantage of not requiring installation of additional software, so it can be used regardless of server setup. Unfortunately, Zend_PDF often fails to identify PDFs in newer formats (later than PDF version 1.6) and tends to use a relatively large amount of memory to do the job.

ImageMagick and GraphicsMagick are, in general, more memory efficient and support a wider range of PDFs. Unfortunately, they require additional software installation, are often not available on shared servers, and occasionally crash on specific PDFs.

Because no single PDF processing system is ideal, CollectiveAccess defaults to an order that works well in most cases while providing a means to override the default behavior.

By default CollectiveAccess will try to use GraphicsMagick, then ImageMagick, and finally Zend_PDF. You can disable GraphicsMagick and/or ImageMagick for identification of PDFs using the dont_use_graphicsmagick_to_identify_pdfs and dont_use_imagemagick_to_identify_pdfs directives in app.conf. If you find that valid PDFs are being rejected by CollectiveAccess, try setting one or both of these directives so that a different identification method is used.
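The fallback order described above can be modeled in a few lines. This is only an illustrative Python sketch of the documented behavior, not CollectiveAccess' own code; the function name, the backend labels and the idea of passing the configuration as a dictionary are assumptions made for the example. Only the two directive names come from app.conf.

```python
# Illustrative model of the documented fallback order (not CollectiveAccess code).
# The directive names are from app.conf; everything else is hypothetical.
DISABLE_DIRECTIVES = {
    "graphicsmagick": "dont_use_graphicsmagick_to_identify_pdfs",
    "imagemagick": "dont_use_imagemagick_to_identify_pdfs",
}


def pick_pdf_identifier(installed, config=None):
    """Return the backend that would identify PDFs.

    installed: set of backend labels available on the server.
    config: dict of directive name -> truthy value when the directive is set.
    """
    config = config or {}
    # GraphicsMagick is tried first, then ImageMagick.
    for backend in ("graphicsmagick", "imagemagick"):
        if backend in installed and not config.get(DISABLE_DIRECTIVES[backend]):
            return backend
    # Zend_PDF needs no extra software, so it is always the last resort.
    return "zend_pdf"


both = {"graphicsmagick", "imagemagick"}
print(pick_pdf_identifier(both))
print(pick_pdf_identifier(both, {"dont_use_graphicsmagick_to_identify_pdfs": 1}))
```

With both packages installed, setting only the GraphicsMagick directive moves identification to ImageMagick; setting both directives leaves Zend_PDF.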

No PDF Previews

Problem:
Uploads of PDF files are successful, but instead of preview icons a generic document icon is displayed.
Solution:
There are several possible causes:
  1. You don't have Ghostscript installed on your server.
  2. Ghostscript is installed, but the path to it in external_applications.conf is incorrect.
  3. You don't have a graphics processing plugin installed that can support TIFF images. TIFFs are used as an in-process format for previews. GraphicsMagick and ImageMagick can, and usually do, support TIFFs.
  4. You have GraphicsMagick or ImageMagick installed, but they lack TIFF format support.

Most Linux servers already have Ghostscript installed, and it is easy to install on Mac OS X using Homebrew. Windows installers are also available.

You can check whether the path to Ghostscript is set correctly using the "Configuration Check" screen, available under the "Manage" > "Administrate" menu. On Unix-like operating systems such as Linux and Mac OS X the path to Ghostscript is most often either /usr/bin/gs or /usr/local/bin/gs.
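If you would rather confirm the path from a shell before editing external_applications.conf, a small script can look in the two usual locations and then fall back to the PATH. This is a rough sketch, independent of CollectiveAccess; the function name and the candidate list are assumptions for the example.

```python
import os
import shutil

# The two locations mentioned above; extend the tuple for other setups.
COMMON_GS_PATHS = ("/usr/bin/gs", "/usr/local/bin/gs")


def find_ghostscript(candidates=COMMON_GS_PATHS):
    """Return a path to an executable 'gs', or None if none is found."""
    for path in candidates:
        # Must be a regular file that the current user may execute.
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    # Otherwise use whatever 'gs' resolves to on the PATH (may be None).
    return shutil.which("gs")


print(find_ghostscript() or "Ghostscript not found")
```

Run it as the web server user if possible, since that is the account that will actually invoke Ghostscript.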

If lack of TIFF support is the issue, install either GraphicsMagick or ImageMagick with libtiff support. When installing GraphicsMagick on Mac OS X using Homebrew, be sure to use the --with-libtiff option, otherwise the resulting installation will not support TIFFs.
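Both packages can print the formats they were built with (for example with gm convert -list format or identify -list format), and the listing can be scanned for a TIFF entry. The sketch below is a hedged helper that works on the listing text only. It assumes each format sits on its own line with the format name as the first token, sometimes followed by an asterisk; the real layout varies between versions, and the function name is hypothetical.

```python
def lists_tiff_support(format_listing):
    """Return True if a format listing contains a TIFF entry.

    Assumes one format per line, name first, with an optional trailing '*'.
    """
    for line in format_listing.splitlines():
        tokens = line.split()
        if tokens and tokens[0].rstrip("*").upper() == "TIFF":
            return True
    return False


# Made-up sample in the assumed layout, not captured from a real install.
sample = """
   PNG* PNG   rw-  Portable Network Graphics
  TIFF* TIFF  rw+  Tagged Image File Format
"""
print(lists_tiff_support(sample))
```

If the function returns False for your real listing, compare the listing by eye before concluding that TIFF support is missing, since the layout assumption may not hold.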

I want to archive and view Microsoft Word/Excel/PowerPoint documents

Problem:
You want to include Microsoft Office documents (Word, Excel and/or PowerPoint) in your CollectiveAccess archive.
Solution:
CollectiveAccess can automatically detect the Office XML formats available since 2007 (docx, xlsx, pptx) and store the uploaded files for later download. It cannot index the contents of those files for search or generate page previews for viewing within a web browser.

If you want previews and indexing of content, or you want to upload older non-XML Office files, you will need to install LibreOffice 4.2 or later on your server, and set the libreoffice_app directive in external_applications.conf to the path to the LibreOffice executable (typically soffice). Getting LibreOffice running on Linux can be a bit of an adventure, but once it's running it will do a very good job rendering Office documents. Installation of LibreOffice under Mac OS X is generally hassle-free.

To install LibreOffice on Linux, try to grab the latest packages for your distribution, taking care to install the LibreOffice "headless" package, if it exists, as well as the LibreOffice core. On Ubuntu it is critical to install the "libreoffice-writer" package as well, otherwise conversions will fail silently, and your day will slip away as you continuously bang your head against a nearby wall.

On many (all?) Linux distributions, LibreOffice requires write access to the home directory of the user under which it is run. For a web application like CollectiveAccess, this means making sure that the home directory of the web server user is writable by the web server user. Omitting this detail can result in frustrating silent failures. You have been warned :-)
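A quick way to test the home directory requirement is to run a short check as the web server user. The following is a minimal sketch, not part of CollectiveAccess. The function name is hypothetical, and it only reports on the user that runs the script, so it must be executed under the web server account (whose name varies by distribution).

```python
import os


def home_is_writable(home_dir):
    """Return True if the current user can create files inside home_dir."""
    # Creating files needs both write and search (execute) permission
    # on the directory itself.
    return os.path.isdir(home_dir) and os.access(home_dir, os.W_OK | os.X_OK)


# Checks the home directory of whichever user runs this script.
print(home_is_writable(os.path.expanduser("~")))
```

A result of False, or a home directory that does not exist at all, is a likely explanation for LibreOffice conversions that fail without any error message.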

On Mac OS X, simply follow the LibreOffice on Mac installation instructions, and then set the libreoffice_app in external_applications.conf to /Applications/LibreOffice.app/Contents/MacOS/soffice. See this page for additional details if you are so inclined.
