tika package#

Submodules#

tika.config module#

async tika.config.get_parsers()[source]#

Retrieves the list of available parsers from the Tika server.

Fetches detailed information about all parsers supported by the Tika server, including their supported MIME types and parser properties.

Return type:

str | bytes | BinaryIO

Returns:

A response containing the parser configuration, typically as JSON. The return type matches the server response format, which may be str, bytes, or BinaryIO.

Raises:

TikaError – If the server request fails or returns an error status.

Example

>>> parsers = await get_parsers()
>>> print(parsers)  # Prints JSON of available parsers and their capabilities

async tika.config.get_mime_types()[source]#

Retrieves the list of supported MIME types from the Tika server.

Fetches the complete list of MIME types that the Tika server can handle, including file extensions and type hierarchies.

Return type:

str | bytes | BinaryIO

Returns:

A response containing the MIME type configuration, typically as JSON. The return type matches the server response format, which may be str, bytes, or BinaryIO.

Raises:

TikaError – If the server request fails or returns an error status.

Example

>>> mime_types = await get_mime_types()
>>> print(mime_types)  # Prints JSON of supported MIME types

async tika.config.get_detectors()[source]#

Retrieves the list of available content detectors from the Tika server.

Fetches information about all content type detectors supported by the Tika server, including their detection capabilities and priorities.

Return type:

str | bytes | BinaryIO

Returns:

A response containing the detector configuration, typically as JSON. The return type matches the server response format, which may be str, bytes, or BinaryIO.

Raises:

TikaError – If the server request fails or returns an error status.

Example

>>> detectors = await get_detectors()
>>> print(detectors)  # Prints JSON of available detectors

tika.core module#

The Tika Python module provides a Python API client for Apache Tika Server.

Example usage:

import asyncio
from tika import parser

async def main():
    parsed = await parser.from_file('/path/to/file')
    print(parsed["metadata"])
    print(parsed["content"])

asyncio.run(main())

Visit https://github.com/chrismattmann/tika-python to learn more about it.

Detect IANA MIME Type:

import asyncio
from tika import detector
print(asyncio.run(detector.from_file('/path/to/file')))

Detect Language:

import asyncio
from tika import language
print(asyncio.run(language.from_file('/path/to/file')))

Tika-Python Configuration: You can use custom configuration files. See https://tika.apache.org/1.18/configuring.html for details on writing configuration files. The configuration is set the first time the server is started. To use a configuration file with a parser or detector:

parsed = await parser.from_file('/path/to/file', config_path='/path/to/configfile')

or:

detected = await detector.from_file('/path/to/file', config_path='/path/to/configfile')

or:

detected = await detector.from_buffer('some buffered content', config_path='/path/to/configfile')

class tika.core.TikaResponse[source]#

Bases: TypedDict

Tika response object.

status: int#

HTTP status code

metadata: dict[str, str | list[str]] | None#

Metadata extracted from the document(s)

content: str | bytes | BinaryIO | None#

Text content extracted from the document(s)

attachments: dict[str, Any] | None#

Attachments extracted from the document(s)

tika.core.make_content_disposition_header(fn)[source]#
Return type:

str

tika.core.get_bundled_jar_path()[source]#

Get path to bundled Tika server JAR file

Return type:

Path

exception tika.core.TikaError[source]#

Bases: Exception

Custom exception for Tika errors

tika.core.echo2(*s)[source]#
Return type:

None

tika.core.warn(*s)[source]#
Return type:

None

tika.core.die(*s)[source]#
Return type:

NoReturn

async tika.core.run_command(cmd, option, url_or_paths, port, out_dir=None, server_host='localhost', tika_server_jar=PosixPath('/home/runner/work/tika-python/tika-python/src/tika/jars/tika-server-standard-3.0.0.jar'), verbose=0, encode=0)[source]#

Execute a Tika command by calling the Tika server and return results.

Parameters:
  • cmd (str) – The command to execute. Must be one of: ‘parse’, ‘detect’, ‘language’, ‘translate’, or ‘config’.

  • option (str) – Command-specific option that modifies the behavior (e.g., ‘all’ for parse command).

  • url_or_paths (Iterable[str | Path | BinaryIO]) – One or more files to process, specified as URLs, file paths, or file-like objects.

  • port (str) – The port number where Tika server is running.

  • out_dir (Path | None) – Optional directory path where output files should be saved. Defaults to None.

  • server_host (str) – The hostname where Tika server is running. Defaults to SERVER_HOST.

  • tika_server_jar (Path) – Path to the Tika server JAR file. Defaults to TIKA_SERVER_JAR.

  • verbose (int) – Logging verbosity level. Defaults to VERBOSE.

  • encode (int) – Whether to encode response in UTF-8. Defaults to ENCODE_UTF8.

Returns:

Depending on the command:

  • For ‘parse’ with out_dir: List of Path objects for created metadata files

  • For ‘parse’ without out_dir: List of tuples containing (status_code, response)

  • For other commands: String, bytes or file-like object containing the response

Raises:

TikaError – If no URLs/paths are specified for parse/detect commands or if command is unknown.

tika.core.get_paths(url_or_paths)[source]#

Convert URLs, file paths, or file-like objects into a list of Path objects.

Handles single paths, lists of paths, and directories. For directories, recursively finds all files within them.

Parameters:

url_or_paths (Iterable[str | Path | BinaryIO]) – Input paths as URLs, file paths, directories, or file-like objects. Can be a single item or an iterable of items.

Returns:

List of Path objects for all files found.

Return type:

list[Path]

Note

When a directory is provided, all files within it (including in subdirectories) are included in the returned list.
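The directory handling described in the note can be sketched with pathlib alone. The helper collect_files is a hypothetical name and only approximates the documented behavior for local paths. It does not handle URLs or file-like objects.

```python
import tempfile
from pathlib import Path


def collect_files(paths: list[Path]) -> list[Path]:
    """Expand directories recursively; keep plain file paths as they are."""
    found: list[Path] = []
    for path in paths:
        if path.is_dir():
            # rglob("*") walks subdirectories too; keep only regular files.
            found.extend(sorted(p for p in path.rglob("*") if p.is_file()))
        else:
            found.append(path)
    return found


# Build a small throwaway tree to show that nested files are included.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "a.txt").write_text("a")
    (root / "sub" / "b.txt").write_text("b")
    names = [p.relative_to(root).as_posix() for p in collect_files([root])]
```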

async tika.core.parse_and_save(option, url_or_paths, *, out_dir=None, server_endpoint='http://localhost:9998', verbose=0, tika_server_jar=PosixPath('/home/runner/work/tika-python/tika-python/src/tika/jars/tika-server-standard-3.0.0.jar'), response_mime_type='application/json', meta_extension='_meta.json', services=None)[source]#

Parse files and save extracted metadata/text as JSON files.

For each input file, creates a corresponding metadata file with the specified extension. The metadata files contain the extracted information in JSON format.

Parameters:
  • option (str) – Parsing option (‘meta’, ‘text’, or ‘all’).

  • url_or_paths (Iterable[str | Path | BinaryIO]) – Files to parse as URLs, paths, or file-like objects.

  • out_dir (Path | None) – Directory where metadata files should be saved. If None, saves alongside input files. Defaults to None.

  • server_endpoint (str) – Tika server URL. Defaults to SERVER_ENDPOINT.

  • verbose (int) – Logging verbosity level. Defaults to VERBOSE.

  • tika_server_jar (Path) – Path to Tika server JAR. Defaults to TIKA_SERVER_JAR.

  • response_mime_type (str) – Expected response format. Defaults to “application/json”.

  • meta_extension (str) – Extension to append to metadata filenames. Defaults to “_meta.json”.

  • services (dict[str, str] | None) – Dict mapping options to service endpoints. Defaults to {‘meta’: ‘/meta’, ‘text’: ‘/tika’, ‘all’: ‘/rmeta’}.

Returns:

List of paths to the created metadata files.

Return type:

list[Path]

Note

For each input file ‘example.pdf’, creates ‘example.pdf_meta.json’ (or similar based on meta_extension) containing the extracted information.

async tika.core.parse(option, url_or_paths, *, server_endpoint='http://localhost:9998', verbose=0, tika_server_jar=PosixPath('/home/runner/work/tika-python/tika-python/src/tika/jars/tika-server-standard-3.0.0.jar'), response_mime_type='application/json', services=None, raw_response=False)[source]#

Parse files and extract metadata and/or text content using Tika.

Parameters:
  • option (str) – Parsing option (‘meta’, ‘text’, or ‘all’).

  • url_or_paths (Iterable[str | Path | BinaryIO]) – Files to parse as URLs, paths, or file-like objects.

  • server_endpoint (str) – Tika server URL. Defaults to SERVER_ENDPOINT.

  • verbose (int) – Logging verbosity level. Defaults to VERBOSE.

  • tika_server_jar (Path) – Path to Tika server JAR. Defaults to TIKA_SERVER_JAR.

  • response_mime_type (str) – Expected response format. Defaults to “application/json”.

  • services (dict[str, str] | None) – Dict mapping options to service endpoints. Defaults to {‘meta’: ‘/meta’, ‘text’: ‘/tika’, ‘all’: ‘/rmeta’}.

  • raw_response (bool) – If True, return raw response content. Defaults to False.

Returns:

List of tuples containing (HTTP status code, parsed content) for each processed file.

Return type:

list[tuple[int, str | bytes | BinaryIO]]

async tika.core.parse_1(option, url_or_path, *, server_endpoint='http://localhost:9998', verbose=0, tika_server_jar=PosixPath('/home/runner/work/tika-python/tika-python/src/tika/jars/tika-server-standard-3.0.0.jar'), response_mime_type='application/json', services=None, raw_response=False, headers=None, config_path=None, request_options=None)[source]#

Parse a single file and extract metadata and/or text content using Tika.

Parameters:
  • option (str) – Parsing option (‘meta’, ‘text’, or ‘all’).

  • url_or_path (str | Path | BinaryIO) – File to parse as URL, path, or file-like object.

  • server_endpoint (str) – Tika server URL. Defaults to SERVER_ENDPOINT.

  • verbose (int) – Logging verbosity level. Defaults to VERBOSE.

  • tika_server_jar (Path) – Path to Tika server JAR. Defaults to TIKA_SERVER_JAR.

  • response_mime_type (str) – Expected response format. Defaults to “application/json”.

  • services (dict[str, str] | None) – Dict mapping options to service endpoints. Defaults to {‘meta’: ‘/meta’, ‘text’: ‘/tika’, ‘all’: ‘/rmeta’}.

  • raw_response (bool) – If True, return raw response content. Defaults to False.

  • headers (dict[str, Any] | None) – Additional HTTP headers for request. Defaults to None.

  • config_path (str | None) – Path to Tika config file. Defaults to None.

  • request_options (dict[str, Any] | None) – Additional request options. Defaults to None.

Returns:

Tuple containing HTTP status code and parsed content.

Return type:

tuple[int, str | bytes | BinaryIO]

async tika.core.detect_lang(option, url_or_paths, server_endpoint='http://localhost:9998', verbose=0, tika_server_jar=PosixPath('/home/runner/work/tika-python/tika-python/src/tika/jars/tika-server-standard-3.0.0.jar'), response_mime_type='text/plain', services=None)[source]#

Detect the language of files using Tika.

Parameters:
  • option (str) – Detection option (usually ‘file’).

  • url_or_paths (Iterable[str | Path | BinaryIO]) – Files to analyze as URLs, paths, or file-like objects.

  • server_endpoint (str) – Tika server URL. Defaults to SERVER_ENDPOINT.

  • verbose (int) – Logging verbosity level. Defaults to VERBOSE.

  • tika_server_jar (Path) – Path to Tika server JAR. Defaults to TIKA_SERVER_JAR.

  • response_mime_type (str) – Expected response format. Defaults to “text/plain”.

  • services (dict[str, str] | None) – Dict mapping options to service endpoints. Defaults to {‘file’: ‘/language/stream’}.

Returns:

List of tuples containing (HTTP status code, detected language code) for each file.

Return type:

list[tuple[int, str | bytes | BinaryIO]]

Note

Language codes are returned as ISO 639-1 two-letter codes (e.g., ‘en’ for English).

async tika.core.detect_lang_1(option, url_or_path, server_endpoint='http://localhost:9998', verbose=0, tika_server_jar=PosixPath('/home/runner/work/tika-python/tika-python/src/tika/jars/tika-server-standard-3.0.0.jar'), response_mime_type='text/plain', services=None, request_options=None)[source]#

Detect the language of a single file using Tika.

Parameters:
  • option (str) – Detection option (usually ‘file’).

  • url_or_path (str | Path | BinaryIO) – File to analyze as URL, path, or file-like object.

  • server_endpoint (str) – Tika server URL. Defaults to SERVER_ENDPOINT.

  • verbose (int) – Logging verbosity level. Defaults to VERBOSE.

  • tika_server_jar (Path) – Path to Tika server JAR. Defaults to TIKA_SERVER_JAR.

  • response_mime_type (str) – Expected response format. Defaults to “text/plain”.

  • services (dict[str, str] | None) – Dict mapping options to service endpoints. Defaults to {‘file’: ‘/language/stream’}.

  • request_options (dict[str, Any] | None) – Additional request options. Defaults to None.

Returns:

Tuple containing HTTP status code and detected language code.

Return type:

tuple[int, str | bytes | BinaryIO]

Raises:

TikaError – If the specified option is not valid.

async tika.core.detect_type(option, url_or_paths, server_endpoint='http://localhost:9998', verbose=0, tika_server_jar=PosixPath('/home/runner/work/tika-python/tika-python/src/tika/jars/tika-server-standard-3.0.0.jar'), response_mime_type='text/plain', services=None)[source]#

Detect MIME types of files using Tika.

Parameters:
  • option (str) – Detection option (usually ‘type’).

  • url_or_paths (Iterable[str | Path | BinaryIO]) – Files to analyze as URLs, paths, or file-like objects.

  • server_endpoint (str) – Tika server URL. Defaults to SERVER_ENDPOINT.

  • verbose (int) – Logging verbosity level. Defaults to VERBOSE.

  • tika_server_jar (Path) – Path to Tika server JAR. Defaults to TIKA_SERVER_JAR.

  • response_mime_type (str) – Expected response format. Defaults to “text/plain”.

  • services (dict[str, str] | None) – Dict mapping options to service endpoints. Defaults to {‘type’: ‘/detect/stream’}.

Returns:

List of tuples containing (HTTP status code, detected MIME type) for each file.

Return type:

list[tuple[int, str | bytes | BinaryIO]]

async tika.core.detect_type_1(option, url_or_path, server_endpoint='http://localhost:9998', verbose=0, tika_server_jar=PosixPath('/home/runner/work/tika-python/tika-python/src/tika/jars/tika-server-standard-3.0.0.jar'), response_mime_type='text/plain', services=None, config_path=None, request_options=None)[source]#

Detect MIME type of a single file using Tika.

Parameters:
  • option (str) – Detection option (usually ‘type’).

  • url_or_path (str | Path | BinaryIO) – File to analyze as URL, path, or file-like object.

  • server_endpoint (str) – Tika server URL. Defaults to SERVER_ENDPOINT.

  • verbose (int) – Logging verbosity level. Defaults to VERBOSE.

  • tika_server_jar (Path) – Path to Tika server JAR. Defaults to TIKA_SERVER_JAR.

  • response_mime_type (str) – Expected response format. Defaults to “text/plain”.

  • services (dict[str, Any] | None) – Dict mapping options to service endpoints. Defaults to {‘type’: ‘/detect/stream’}.

  • config_path (str | None) – Path to Tika config file. Defaults to None.

  • request_options (dict[str, Any] | None) – Additional request options. Defaults to None.

Returns:

Tuple containing HTTP status code and detected MIME type.

Return type:

tuple[int, str | bytes | BinaryIO]

Raises:

TikaError – If the specified option is not valid.

async tika.core.get_config(option, server_endpoint='http://localhost:9998', verbose=0, tika_server_jar=PosixPath('/home/runner/work/tika-python/tika-python/src/tika/jars/tika-server-standard-3.0.0.jar'), response_mime_type='application/json', services=None, request_options=None)[source]#

Retrieves configuration information from the Tika server.

Makes a GET request to the Tika server to fetch configuration details about various server capabilities including parsers, detectors, and MIME types.

Parameters:
  • option (str) – The configuration to retrieve. Must be one of:
      - “mime-types”: List of supported MIME types
      - “detectors”: Available content type detectors
      - “parsers”: Available document parsers

  • server_endpoint (str) – URL of the Tika server. Defaults to SERVER_ENDPOINT.

  • verbose (int) – Level of logging verbosity. Defaults to VERBOSE.

  • tika_server_jar (Path) – Path to the Tika server JAR file. Defaults to TIKA_SERVER_JAR.

  • response_mime_type (str) – Expected MIME type of the response. Defaults to “application/json”.

  • services (dict[str, str] | None) – Optional dictionary mapping config options to their service endpoints. Defaults to {‘mime-types’: ‘/mime-types’, ‘detectors’: ‘/detectors’, ‘parsers’: ‘/parsers/details’}.

  • request_options (dict[str, Any] | None) – Optional dictionary of additional request options.

Returns:

A tuple containing:

  • HTTP status code (int)

  • Server response (str, bytes, or BinaryIO) containing the requested configuration

Return type:

tuple[int, str | bytes | BinaryIO]

Raises:

Example

>>> status, config = await get_config("parsers")
>>> if status == 200:
...     print(config)  # Print parser configuration

tika.core.get_async_client()[source]#

Returns a memoized httpx.AsyncClient instance. Creates a new one if it doesn’t exist.

Return type:

AsyncClient

async tika.core.close_async_client()[source]#

Closes the memoized client if it exists. Should be called when shutting down the application.

Return type:

None

async tika.core.call_server(verb, server_endpoint, service, data, *, headers, verbose=0, tika_server_jar=PosixPath('/home/runner/work/tika-python/tika-python/src/tika/jars/tika-server-standard-3.0.0.jar'), classpath=None, raw_response=False, config_path=None, request_options=None)[source]#

Make an HTTP request to the Tika Server.

Return type:

tuple[int, str | bytes | BinaryIO]

tika.core.check_tika_server(scheme='http', server_host='localhost', port='9998', tika_server_jar=PosixPath('/home/runner/work/tika-python/tika-python/src/tika/jars/tika-server-standard-3.0.0.jar'), classpath=None, config_path=None)[source]#

Check if Tika server is running and start it if necessary.

Parameters:
  • scheme (Literal['http', 'https']) – Protocol to use (‘http’ or ‘https’). Defaults to “http”.

  • server_host (str) – Host where server should run. Defaults to SERVER_HOST.

  • port (str) – Port for the server. Defaults to PORT.

  • tika_server_jar (Path) – Path to Tika server JAR. Defaults to TIKA_SERVER_JAR.

  • classpath (str | None) – Additional classpath entries. Defaults to None.

  • config_path (str | None) – Path to Tika configuration file. Defaults to None.

Returns:

Server endpoint URL (e.g., “http://localhost:9998”)

Return type:

str

Raises:

RuntimeError – If server JAR signature doesn’t match or server fails to start.

Note

Only attempts to start server for localhost or 127.0.0.1 addresses. For remote servers, just returns the endpoint URL.
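The local-versus-remote decision in the note can be approximated as below. Both helpers are hypothetical names, and the set of hosts treated as local is taken directly from the note.

```python
from urllib.parse import urlsplit

LOCAL_HOSTS = {"localhost", "127.0.0.1"}


def build_endpoint(scheme: str = "http", server_host: str = "localhost", port: str = "9998") -> str:
    """Assemble the endpoint URL from its documented parts."""
    return f"{scheme}://{server_host}:{port}"


def may_start_locally(server_endpoint: str) -> bool:
    """True only for hosts where launching a local server makes sense."""
    return urlsplit(server_endpoint).hostname in LOCAL_HOSTS


local_endpoint = build_endpoint()
# tika.example.org is a placeholder host used only for illustration.
remote_endpoint = build_endpoint("https", "tika.example.org", "9998")
```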

tika.core.check_jar_signature(tika_server_jar, jar_path)[source]#

Checks the signature of the JAR file.

Parameters:
  • tika_server_jar (Path)

  • jar_path

Returns:

True if the signature of the JAR matches.

Return type:

bool

tika.core.start_server(tika_server_jar, java_path='java', java_args='', server_host='localhost', port='9998', classpath=None, config_path=None)[source]#

Start the Tika Server as a subprocess.

Parameters:
  • tika_server_jar (Path) – Path to the Tika server JAR file.

  • java_path (str) – Path to Java executable. Defaults to TIKA_JAVA.

  • java_args (str) – Additional Java arguments. Defaults to TIKA_JAVA_ARGS.

  • server_host (str) – Host interface address for binding. Defaults to SERVER_HOST.

  • port (str) – Port number for the server. Defaults to PORT.

  • classpath (str | None) – Additional classpath entries. Defaults to None.

  • config_path (str | None) – Path to Tika configuration file. Defaults to None.

Returns:

True if server started successfully, False otherwise.

Return type:

bool

Note

  • Creates a log file at TIKA_SERVER_LOG_FILE_PATH/tika-server.log

  • On Windows, forces server_host to “0.0.0.0”

  • Attempts to start server multiple times based on TIKA_STARTUP_MAX_RETRY

  • Sets global TIKA_SERVER_PROCESS variable for later cleanup
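The retry behavior mentioned in the note can be sketched as a bounded loop. The function start_with_retries and its start_once callback are hypothetical. The real function launches a Java subprocess, which is replaced here by a callable so the example stays runnable.

```python
from typing import Callable


def start_with_retries(start_once: Callable[[], bool], max_retry: int = 3) -> bool:
    """Call start_once until it succeeds or max_retry attempts are used up."""
    for _attempt in range(max_retry):
        if start_once():
            return True
    return False


attempts: list[int] = []


def flaky_start() -> bool:
    """Simulated launcher that only succeeds on its third call."""
    attempts.append(1)
    return len(attempts) >= 3


started = start_with_retries(flaky_start, max_retry=3)
```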

tika.core.kill_server(tika_server_process=None, *, is_windows=False)[source]#

Kill the running Tika server process.

Parameters:
  • tika_server_process (Popen | None) – The subprocess.Popen instance of the Tika server. If None, logs an error. Defaults to None.

  • is_windows (bool) – Boolean flag indicating if running on Windows platform. Defaults to False.

Return type:

None

Note

  • On Windows, uses SIGTERM signal directly

  • On Unix-like systems, kills the process group

  • Waits 1 second after sending kill signal

  • Logs errors if process cannot be killed
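A minimal sketch of the two termination paths in the note follows. It assumes the child was started in its own session so that it has its own process group. The terminate helper is a hypothetical name, and the child here is a short Python sleeper rather than a Tika server.

```python
import os
import signal
import subprocess
import sys
import time


def terminate(process: subprocess.Popen, *, is_windows: bool = False) -> None:
    """Send SIGTERM to the process (Windows) or to its whole group (Unix-like)."""
    if is_windows:
        process.send_signal(signal.SIGTERM)
    else:
        # Signals every process in the child's group, including its own children.
        os.killpg(os.getpgid(process.pid), signal.SIGTERM)
    time.sleep(1)  # give the process a moment to exit


# start_new_session=True places the child in a new process group on POSIX.
child = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(60)"],
    start_new_session=True,
)
terminate(child, is_windows=(os.name == "nt"))
exit_code = child.wait(timeout=10)
```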

tika.core.to_filename(url)[source]#

Returns the filename for a given URL.

Return type:

str

tika.core.get_file_handle(url_or_path)[source]#

Opens a remote file and returns a file-like object.

Parameters:

url_or_path (str | Path | BinaryIO) – resource locator, generally URL or path, or file object

Return type:

BinaryIO

Returns:

file-like object

tika.core.get_remote_file(url_or_path, dest_path)[source]#

Fetch a remote file or handle a local file/binary stream.

Parameters:
  • url_or_path (str | Path | BinaryIO) – Resource to fetch - can be a URL, local path, or file-like object.

  • dest_path (str | Path) – Local path where to save the file if it needs to be downloaded.

Returns:

Tuple containing:
  • Path object pointing to the local file

  • String indicating the source type: “local” for local files, “remote” for downloaded files, “binary” for binary streams

Return type:

tuple[Path, Literal[“local”, “remote”, “binary”]]

Raises:
  • TikaError – If a local file does not exist.

  • OSError – If there are issues downloading a remote file.

Note

For binary stream inputs, a temporary file is created with a timestamp-based name.

tika.core.check_port_is_open(remote_server_host='localhost', port='9998')[source]#

Check if a specific port is open on the given host.

Parameters:
  • remote_server_host (str) – Hostname to check. Defaults to SERVER_HOST.

  • port (str) – Port number to check. Defaults to PORT.

Returns:

True if the port is open and accepting connections, False otherwise.

Return type:

bool

Note

This function will exit the program if:
  • There is a keyboard interrupt

  • The hostname cannot be resolved

  • There are connection issues with the server
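A port probe of this kind is commonly written with socket.connect_ex. The sketch below is an assumption about the approach, not the package's code. Unlike the documented function, it returns False on failure instead of exiting the program, and port_is_open is a hypothetical name.

```python
import socket


def port_is_open(host: str = "localhost", port: str = "9998", timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, int(port))) == 0


# Probe a throwaway listener on an OS-assigned port, then probe again after closing it.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    probe_port = str(listener.getsockname()[1])
    open_while_listening = port_is_open("127.0.0.1", probe_port)
closed_after_shutdown = port_is_open("127.0.0.1", probe_port)
```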

tika.core.main(argv=None)[source]#

Run Tika from command line according to USAGE.

Return type:

list[Path] | list[tuple[int, str | bytes | BinaryIO]] | str | bytes | BinaryIO

tika.detector module#

async tika.detector.from_file(file_obj, *, config_path=None, request_options=None)[source]#

Detects the MIME type of a file using Apache Tika server.

Analyzes the file content to determine its MIME type (media type) using Tika’s detection capabilities. This is more reliable than extension-based detection.

Parameters:
  • file_obj (str | Path | BinaryIO) – The file to analyze. Can be:
      - str: A file path or URL
      - Path: A pathlib.Path object pointing to the file
      - BinaryIO: A file-like object in binary read mode

  • config_path (str | None) – Optional path to a custom Tika configuration file.

  • request_options (dict[str, Any] | None) – Optional dictionary of request options to pass to the server. Can include parameters like timeout, headers, etc.

Return type:

str | bytes | BinaryIO

Returns:

The detected MIME type (e.g., ‘application/pdf’, ‘image/jpeg’). Return type matches the server response, which may be str, bytes, or BinaryIO.

Raises:
  • TikaError – If the server returns an unsuccessful status code or if type detection fails.

  • FileNotFoundError – If the specified file does not exist.

Example

>>> from pathlib import Path
>>> mime_type = await from_file(Path("document.pdf"))
>>> print(mime_type)  # Prints 'application/pdf'
>>> mime_type = await from_file("image.jpg")
>>> print(mime_type)  # Prints 'image/jpeg'

async tika.detector.from_buffer(buf, *, config_path=None, request_options=None)[source]#

Detects the MIME type of content provided in a buffer using Apache Tika server.

Analyzes the buffered content to determine its MIME type (media type) using Tika’s detection capabilities. Useful for content that hasn’t been saved to a file or for streaming data.

Parameters:
  • buf (str | bytes | BinaryIO) – The content to analyze. Can be:
      - str: Text content
      - bytes: Binary content
      - BinaryIO: File-like object containing binary content

  • config_path (str | None) – Optional path to a custom Tika configuration file.

  • request_options (dict[str, Any] | None) – Optional dictionary of request options to pass to the server. Can include parameters like timeout, headers, etc.

Return type:

str | bytes | BinaryIO

Returns:

The detected MIME type (e.g., ‘application/pdf’, ‘text/plain’). Return type matches the server response, which may be str, bytes, or BinaryIO.

Raises:
  • TikaError – If the server returns an unsuccessful status code or if type detection fails.

  • TypeError – If the input buffer is not of the correct type.

Example

>>> with open("document.pdf", "rb") as f:
...     mime_type = await from_buffer(f.read())
>>> print(mime_type)  # Prints 'application/pdf'
>>> text_content = "Hello, world!"
>>> mime_type = await from_buffer(text_content)
>>> print(mime_type)  # Prints 'text/plain'
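Since the detector functions are coroutines, several detections can be awaited together from synchronous code. The sketch below shows one way to do that with asyncio.run and asyncio.gather. detect_stub is a hypothetical stand-in that guesses from the file suffix, so the example runs without a Tika server.

```python
import asyncio

_SUFFIX_TYPES = {"pdf": "application/pdf", "jpg": "image/jpeg"}


async def detect_stub(name: str) -> str:
    """Hypothetical stand-in for an async detector call; guesses from the suffix."""
    await asyncio.sleep(0)  # yield to the event loop as a network call would
    return _SUFFIX_TYPES.get(name.rsplit(".", 1)[-1], "application/octet-stream")


async def detect_many(names: list[str]) -> list[str]:
    # gather() runs the coroutines concurrently and preserves input order.
    return await asyncio.gather(*(detect_stub(n) for n in names))


# asyncio.run() is the entry point from ordinary synchronous code.
results = asyncio.run(detect_many(["document.pdf", "image.jpg"]))
```

In real use, detect_stub would be replaced by the awaited library call. Whether the shared HTTP client tolerates several separate asyncio.run invocations is not stated in the source, so a single top-level coroutine is the safer shape.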

tika.language module#

async tika.language.from_file(file_obj, request_options=None)[source]#

Detects the language of a file using Apache Tika server.

Uses Tika’s language detection capabilities to identify the primary language of text content within a file.

Parameters:
  • file_obj (str | Path | BinaryIO) – The file to analyze. Can be:
      - str: A string path to the file
      - Path: A pathlib.Path object pointing to the file
      - BinaryIO: A file-like object in binary read mode

  • request_options (dict[str, Any] | None) – Optional dictionary of request options to pass to the server. Can include parameters like timeout, headers, etc.

Return type:

str | bytes | BinaryIO

Returns:

The detected language code (e.g., ‘en’ for English, ‘fr’ for French). Return type matches the server response, which may be str, bytes, or BinaryIO.

Raises:
  • TikaError – If the server returns an unsuccessful status code or if language detection fails.

  • FileNotFoundError – If the specified file does not exist.

Example

>>> from pathlib import Path
>>> language = await from_file(Path("document.txt"))
>>> print(language)  # Prints 'en' for English text

async tika.language.from_buffer(buf, request_options=None)[source]#

Detects the language of content provided in a buffer using Apache Tika server.

Sends the buffered content directly to Tika’s language detection service to identify the primary language of the text.

Parameters:
  • buf (str | bytes | BinaryIO) – The content to analyze. Can be:
      - str: Text content as a string
      - bytes: Binary content
      - BinaryIO: File-like object containing content

  • request_options (dict[str, Any] | None) – Optional dictionary of request options to pass to the server. Can include parameters like timeout, headers, etc.

Return type:

str | bytes | BinaryIO

Returns:

The detected language code (e.g., ‘en’ for English, ‘fr’ for French). Return type matches the server response, which may be str, bytes, or BinaryIO.

Raises:
  • TikaError – If the server returns an unsuccessful status code or if language detection fails.

  • TypeError – If the input buffer is not of the correct type.

Example

>>> text = "Bonjour le monde!"
>>> language = await from_buffer(text)
>>> print(language)  # Prints 'fr' for French text

tika.parser module#

async tika.parser.from_file(obj, *, server_endpoint='http://localhost:9998', service='all', xml_content=False, headers=None, config_path=None, request_options=None)[source]#

Parses a file using Apache Tika server and returns structured content and metadata.

This function sends a file to the Tika server for parsing using the specified service and configuration options. It can handle local files, URLs, or binary streams.

Parameters:
  • obj (str | Path | BinaryIO) – The file to be parsed. Can be:
      - str: A file path or URL
      - Path: A pathlib.Path object pointing to a file
      - BinaryIO: A file-like object in binary read mode

  • server_endpoint (str) – The URL of the Tika server. Defaults to SERVER_ENDPOINT.

  • service (str) – The Tika service to use. Must be one of:
      - “all”: Both content and metadata (default)
      - “meta”: Only metadata
      - “text”: Only text content

  • xml_content (bool) – If True, requests XML output instead of plain text. This affects how the content is structured in the response.

  • headers (dict[str, Any] | None) – Additional HTTP headers to include in the request.

  • config_path (str | None) – Path to a custom Tika configuration file.

  • request_options (dict[str, Any] | None) – Additional options for the HTTP request (e.g., timeout).

Returns:

A dictionary-like object containing:
  • content: Extracted text or XML content (str or None)

  • metadata: Dictionary of document metadata (dict or None)

  • status: HTTP status code (int)

  • attachments: Any embedded files (dict or None)

Return type:

TikaResponse

Raises:

Example

>>> response = await from_file("document.pdf", service="all")
>>> print(response["content"])  # Print extracted text
>>> print(response["metadata"].get("Content-Type"))  # Get document type

async tika.parser.from_buffer(buf, *, server_endpoint='http://localhost:9998', xml_content=False, headers=None, config_path=None, request_options=None)[source]#

Parses content directly from a buffer using Apache Tika server.

This function sends buffered content to the Tika server for parsing and returns structured content and metadata. It automatically uses the /rmeta endpoint for either text or XML output.

Parameters:
  • buf (str | bytes | BinaryIO) – The content to parse. Can be:
      - str: Text content
      - bytes: Binary content
      - BinaryIO: File-like object with binary content

  • server_endpoint (str) – The URL of the Tika server. Defaults to SERVER_ENDPOINT.

  • xml_content (bool) – If True, requests XML output instead of plain text. Affects the structure of the returned content.

  • headers (dict[str, Any] | None) – Additional HTTP headers to include in the request. ‘Accept: application/json’ is automatically added.

  • config_path (str | None) – Path to a custom Tika configuration file.

  • request_options (dict[str, Any] | None) – Additional options for the HTTP request (e.g., timeout).

Returns:

A dictionary-like object containing:
  • content: Extracted text or XML content (str or None)

  • metadata: Dictionary of document metadata (dict or None)

  • status: HTTP status code (int)

  • attachments: Any embedded files (dict or None)

Return type:

TikaResponse

Raises:
  • TikaError – If the server returns a non-200 status code or parsing fails

  • TypeError – If the buffer is not of a supported type

Example

>>> with open("document.pdf", "rb") as f:
...     response = await from_buffer(f.read())
>>> print(response["metadata"])  # Print all metadata

tika.unpack module#

async tika.unpack.from_file(file_obj, *, server_endpoint='http://localhost:9998', request_options=None)[source]#

Parses a file using Apache Tika server’s unpack endpoint.

This function sends the provided file to a Tika server for parsing and returns the extracted content, metadata, and attachments.

Parameters:
  • file_obj (Path) – A Path object pointing to the file to be parsed.

  • server_endpoint (str) – The URL of the Tika server. Defaults to SERVER_ENDPOINT.

  • request_options (dict[str, Any] | None) – Optional dictionary of request options to pass to the server. Can include parameters like timeout, headers, etc.

Returns:

A dictionary-like object containing:
  • content: The extracted text content (str)

  • metadata: Dictionary of metadata key-value pairs

  • attachments: Dictionary of embedded files

  • status: HTTP status code of the response

Return type:

TikaResponse

Raises:
  • TikaError – If the server returns an unsuccessful status code or if parsing fails.

  • FileNotFoundError – If the specified file does not exist.

Example

>>> from pathlib import Path
>>> response = await from_file(Path("document.pdf"))
>>> print(response["content"])  # Print extracted text
>>> print(response["metadata"])  # Print document metadata

async tika.unpack.from_buffer(buf, *, server_endpoint='http://localhost:9998', headers=None, request_options=None)[source]#

Parses content directly from a buffer using Apache Tika server’s unpack endpoint.

This function sends buffered content (string, bytes, or file-like object) to a Tika server for parsing and returns the extracted content, metadata, and attachments.

Parameters:
  • buf (str | bytes | BinaryIO) – The content to be parsed. Can be:
      - str: Text content
      - bytes: Binary content
      - BinaryIO: File-like object containing binary content

  • server_endpoint (str) – The URL of the Tika server. Defaults to SERVER_ENDPOINT.

  • headers (dict[str, Any] | None) – Optional dictionary of additional HTTP headers to send with the request. The ‘Accept: application/x-tar’ header will be added automatically.

  • request_options (dict[str, Any] | None) – Optional dictionary of request options to pass to the server. Can include parameters like timeout, headers, etc.

Returns:

A dictionary-like object containing:
  • content: The extracted text content (str)

  • metadata: Dictionary of metadata key-value pairs

  • attachments: Dictionary of embedded files

  • status: HTTP status code of the response

Return type:

TikaResponse

Raises:
  • TikaError – If the server returns an unsuccessful status code or if parsing fails.

  • TypeError – If the input buffer is not of the correct type.

Example

>>> with open("document.pdf", "rb") as f:
...     response = await from_buffer(f.read())
>>> print(response["metadata"].get("Content-Type"))

Module contents#

exception tika.TikaError[source]#

Bases: Exception

Custom exception for Tika errors

class tika.TikaResponse[source]#

Bases: TypedDict

Tika response object.

status: int#

HTTP status code

metadata: dict[str, str | list[str]] | None#

Metadata extracted from the document(s)

content: str | bytes | BinaryIO | None#

Text content extracted from the document(s)

attachments: dict[str, Any] | None#

Attachments extracted from the document(s)

tika.start_server(tika_server_jar, java_path='java', java_args='', server_host='localhost', port='9998', classpath=None, config_path=None)[source]#

Start the Tika Server as a subprocess.

Parameters:
  • tika_server_jar (Path) – Path to the Tika server JAR file.

  • java_path (str) – Path to Java executable. Defaults to TIKA_JAVA.

  • java_args (str) – Additional Java arguments. Defaults to TIKA_JAVA_ARGS.

  • server_host (str) – Host interface address for binding. Defaults to SERVER_HOST.

  • port (str) – Port number for the server. Defaults to PORT.

  • classpath (str | None) – Additional classpath entries. Defaults to None.

  • config_path (str | None) – Path to Tika configuration file. Defaults to None.

Returns:

True if server started successfully, False otherwise.

Return type:

bool

Note

  • Creates a log file at TIKA_SERVER_LOG_FILE_PATH/tika-server.log

  • On Windows, forces server_host to “0.0.0.0”

  • Attempts to start server multiple times based on TIKA_STARTUP_MAX_RETRY

  • Sets global TIKA_SERVER_PROCESS variable for later cleanup

tika.kill_server(tika_server_process=None, *, is_windows=False)[source]#

Kill the running Tika server process.

Parameters:
  • tika_server_process (Popen | None) – The subprocess.Popen instance of the Tika server. If None, logs an error. Defaults to None.

  • is_windows (bool) – Boolean flag indicating if running on Windows platform. Defaults to False.

Return type:

None

Note

  • On Windows, uses SIGTERM signal directly

  • On Unix-like systems, kills the process group

  • Waits 1 second after sending kill signal

  • Logs errors if process cannot be killed