Apache Tika - a content analysis toolkit
The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.
Apache Tika is a content detection and analysis framework, written in Java and stewarded by the Apache Software Foundation. [2] It detects and extracts metadata and text from over a thousand different file types. In addition to a Java library, it provides server and command-line editions suitable for use from other programming languages.
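To illustrate what using the server edition from another language can look like, here is a minimal Python sketch that sends a document to a running tika-server over HTTP. It assumes a server is already listening on `http://localhost:9998` (the default port) and uses the `/tika` and `/meta` endpoints. The helper names `build_request`, `extract_text`, and `extract_metadata` are illustrative and not part of Tika.

```python
import json
import urllib.request

# Assumption: a tika-server instance is running locally on its default port.
TIKA_SERVER = "http://localhost:9998"


def build_request(data: bytes, endpoint: str, accept: str,
                  base_url: str = TIKA_SERVER) -> urllib.request.Request:
    """Build a PUT request that uploads raw document bytes to tika-server."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{endpoint.lstrip('/')}",
        data=data,
        method="PUT",
        headers={
            # Generic content type, so the server detects the real type itself.
            "Content-Type": "application/octet-stream",
            "Accept": accept,
        },
    )


def extract_text(data: bytes, base_url: str = TIKA_SERVER) -> str:
    """Return the plain text extracted from the document bytes."""
    req = build_request(data, "tika", "text/plain", base_url)
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")


def extract_metadata(data: bytes, base_url: str = TIKA_SERVER) -> dict:
    """Return the document metadata as a dictionary parsed from JSON."""
    req = build_request(data, "meta", "application/json", base_url)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With a server running, a call such as `extract_text(open("report.pdf", "rb").read())` would return the extracted text, where `report.pdf` stands in for any local file. Because the same two functions are used regardless of file type, the sketch mirrors the single-interface idea described above.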
Export control
Apache Tika includes cryptographic software. The country in which you currently reside may have restrictions on the import, possession, use, and/or re-export to another country, of encryption software. BEFORE using any encryption software, please check your country's laws, regulations and policies concerning the import, possession, or use, and re-export of encryption software ...
Getting Started with Apache Tika
This document describes how to build Apache Tika from source and how to start using Tika in an application.
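One low-effort way to start using Tika from an application is to call the command-line edition as a subprocess. The sketch below assumes Java is on the `PATH` and that the runnable tika-app jar is available at a hypothetical path, `tika-app.jar`. The helper names `tika_command` and `run_tika` are illustrative only.

```python
import subprocess
from pathlib import Path

# Hypothetical location of the tika-app runnable jar; adjust to your setup.
TIKA_APP_JAR = Path("tika-app.jar")

# Output modes of the tika-app command line used in this sketch.
ALLOWED_MODES = {"--text", "--metadata", "--json", "--detect"}


def tika_command(document, mode: str = "--text", jar=TIKA_APP_JAR) -> list:
    """Return the argument list for running tika-app on a single document."""
    if mode not in ALLOWED_MODES:
        raise ValueError(f"unsupported mode: {mode}")
    return ["java", "-jar", str(jar), mode, str(document)]


def run_tika(document, mode: str = "--text", jar=TIKA_APP_JAR) -> str:
    """Run tika-app and return whatever it writes to standard output."""
    result = subprocess.run(
        tika_command(document, mode, jar),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the JVM exits with an error
    )
    return result.stdout
```

For example, `run_tika("report.pdf", "--detect")` would ask Tika for the detected media type of a placeholder file named `report.pdf`. Starting a JVM per document is slow, so for larger batches the server edition or the Java library is likely a better fit.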
Apache Tika 3.2.3
The most notable changes in Tika 3.2.3 over the previous release are: Allow backwards compatibility with versions of commons-compress before 1.28.0 (TIKA-4469). Fix XFA parsing within PDFs when woodstox is on the classpath, as in tika-server (TIKA-4482). The following people have contributed to Tika 3.2.3 by submitting or commenting on the issues resolved in this release ...